On 11/06/2020 11:24, Martin Man via mbed-tls wrote:
The code in mbedtls_x509_dn_gets fails to properly handle the UTF-8 multibyte sequence 0xe2 0x80 0x99 and turns it into 0xe2 0x80 0x3f.
There is a fix floating around development branch mentioned here https://github.com/ARMmbed/mbedtls/pull/3326/files which essentially replaces all control chars with question marks.
I think this is a bug, and mbedtls_x509_dn_gets should simply leave UTF-8 multibyte sequences untouched when parsing them out of a field tagged with ASN.1 tag 12 (UTF8String).
That code is from an earlier era (mid 2000s, I think) when most systems used an 8-bit encoding, but non-8-bit-clean systems were still common. A '\x80' in text might be transformed to '\x00' with disastrous consequences.
But over a decade later, I don't think non-8-bit-clean systems are a concern anymore. Leaving all non-ASCII characters alone sounds reasonable to me.
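A minimal sketch of the difference, in C. Neither function is Mbed TLS code, and both names are hypothetical. `sanitize_legacy` is an assumed filter shape that would reproduce the reported 0xe2 0x80 0x3f output. `sanitize_8bit_clean` replaces only ASCII control bytes and DEL, and leaves every byte >= 0x80 alone.

```c
#include <stddef.h>

/* Hypothetical legacy-style filter (an assumption, not the library's code):
 * replaces ASCII control bytes, DEL, and bytes 0x81..0x9f with '?'.
 * 'out' must have room for len + 1 bytes. */
static void sanitize_legacy(const unsigned char *in, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 32 || c == 127 || (c > 128 && c < 160))
            out[i] = '?';
        else
            out[i] = (char) c;
    }
    out[len] = '\0';
}

/* Hypothetical 8-bit-clean variant: only ASCII control bytes and DEL are
 * replaced, so UTF-8 multibyte sequences pass through unchanged.
 * 'out' must have room for len + 1 bytes. */
static void sanitize_8bit_clean(const unsigned char *in, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 32 || c == 127)
            out[i] = '?';
        else
            out[i] = (char) c;
    }
    out[len] = '\0';
}
```

With the input 0xe2 0x80 0x99, the first filter keeps 0xe2 and 0x80 but turns 0x99 into 0x3f, while the second returns all three bytes unchanged. The second filter does not validate that the bytes form well-formed UTF-8; it only stops mangling them.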
We are not going to do Unicode normalization in Mbed TLS: that would be far too complex for a library that runs on systems with ~1e5 bytes available for code. So Unicode strings would only be processed correctly if the application passes normalized strings and CAs only generate certificates with normalized strings. But that would be an improvement on converting non-ASCII characters to '?'.
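To illustrate the normalization caveat, here is a small sketch in C, using a hypothetical helper `bytes_equal` rather than any Mbed TLS function. The same visible character "é" has two valid UTF-8 encodings, precomposed (NFC) and decomposed (NFD), and a byte-wise comparison treats them as different strings.

```c
#include <stddef.h>
#include <string.h>

/* "é" as the single precomposed code point U+00E9 (NFC form). */
static const unsigned char e_acute_nfc[] = { 0xc3, 0xa9 };

/* "é" as 'e' followed by the combining acute accent U+0301 (NFD form). */
static const unsigned char e_acute_nfd[] = { 0x65, 0xcc, 0x81 };

/* Hypothetical helper: byte-wise equality, which is the only comparison
 * available when no Unicode normalization is performed. */
static int bytes_equal(const unsigned char *a, size_t a_len,
                       const unsigned char *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

Comparing `e_acute_nfc` with `e_acute_nfd` through `bytes_equal` reports a mismatch even though both render as the same character, which is why the application and the CA would both need to use the same normalized form.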