On 11 Jun 2020, at 12:09, Gilles Peskine via mbed-tls <mbed-tls@lists.trustedfirmware.org> wrote:
On 11/06/2020 11:24, Martin Man via mbed-tls wrote:
I think this is a bug, and dn_gets should simply leave UTF-8 multibyte sequences untouched when parsing them out of a field tagged with ASN.1 tag 12 (UTF8String).
We are not going to do Unicode normalization in Mbed TLS: that would be far too complex for a library that runs on systems with ~1e5 bytes available for code. So Unicode strings would only be processed correctly if the application passes normalized strings and CAs only generate certificates with normalized strings. But that would be an improvement on converting non-ASCII characters to '?'.
Definitely agree that normalization is not needed. I think this problem could be split into two parts:
1) When a const char* is passed into mbedtls_x509write_crt_set_subject_name, Mbed TLS currently encodes it as ASN.1 tag 12 (UTF8). I'm not sure what validation is done, but it could perhaps do at least a basic check of the C string passed in, to avoid generating a certificate with a malformed DN. Alternatively, you can simply trust the developer to pass in correct UTF-8 and document this. This is an API design decision about what input the function accepts and what validation it performs.
2) When mbedtls_x509_dn_gets extracts a C string from a field tagged as ASN.1 tag 12, it could validate that the content is indeed valid UTF-8, or just pass it through as is. Again, this is about what we expect the library to do.
I’m not an expert on whether this can in any way be used to trick Mbed TLS into doing bad things by sending in a malformed certificate, say one whose DN is tagged as UTF-8 but contains illegal UTF-8 in the payload.
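To make "basic validation" in both points concrete, here is a minimal sketch of a well-formedness check. It is only an illustration: utf8_is_valid is a hypothetical helper, not an existing Mbed TLS function, and it assumes the caller already has the raw bytes and their length. It does no normalization; it only rejects bad lead bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Mbed TLS): returns 1 if buf[0..len)
 * is well-formed UTF-8, 0 otherwise. No normalization is attempted. */
static int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t n;      /* number of continuation bytes expected */
        uint32_t cp;   /* code point being decoded */
        uint32_t min;  /* smallest code point legal for this length */

        if (c < 0x80) {              /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                /* stray continuation or invalid lead byte */
        }

        if (len - i <= n) {
            return 0;                /* sequence runs past the end of the buffer */
        }

        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) {
                return 0;            /* expected a 10xxxxxx continuation byte */
            }
            cp = (cp << 6) | (uint32_t) (buf[i + k] & 0x3F);
        }

        if (cp < min) {
            return 0;                /* overlong encoding */
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                /* UTF-16 surrogates are not valid in UTF-8 */
        }
        if (cp > 0x10FFFF) {
            return 0;                /* beyond the Unicode range */
        }

        i += n + 1;
    }

    return 1;
}
```

A check like this is a single pass over the bytes with no tables, so it should cost little code size. Whether it belongs on the write side, the parse side, or both is exactly the design question above.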
Thanks for listening,
Martin