Hi all,
this will be a long mail. Sorry for that.
In the past weeks I've been using mbedTLS 2.16.5 to implement crypto on an ARM Cortex-M4 (STM32F479). This was my first experience with mbedTLS, but I have some (almost 20 years of) experience with applied and high-assurance crypto. So maybe the following thoughts fit into the discussion of plans for version 3.0 of Mbed TLS.
In the end, I achieved everything my project required with mbedTLS, but some things surprised me and others took a while to figure out.
I'll number the following points for easier reference. None of this is meant to embarrass anyone; these are just my personal thoughts.
1. I really missed an Initialize, Update, Finalize (IUF) interface for CCM.
For GCM, we have mbedtls_gcm_init(), mbedtls_gcm_setkey(), mbedtls_gcm_starts(), repeated calls to mbedtls_gcm_update(), mbedtls_gcm_finish() and mbedtls_gcm_free(), or the convenience functions mbedtls_gcm_crypt_and_tag() and mbedtls_gcm_auth_decrypt(). For CCM, there are only mbedtls_ccm_init(), mbedtls_ccm_setkey(), mbedtls_ccm_encrypt_and_tag() or mbedtls_ccm_auth_decrypt(), and mbedtls_ccm_free(). With this one-shot interface I could encrypt and tag at most 128 kByte on my target system, while with GCM I could encrypt much larger files.
See GitHub issue #662 and my comment there.
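One detail any IUF interface for CCM has to deal with: unlike GCM, CCM encodes the total payload length in its very first CBC-MAC block B0, so the lengths must be known before the first update call. Below is a minimal sketch of that block, following the formatting function in NIST SP 800-38C, Appendix A. The function name ccm_format_b0 and its parameters are made up for illustration and are not part of Mbed TLS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not an Mbed TLS function): build the first CBC-MAC
 * block B0 of CCM as described in NIST SP 800-38C, Appendix A.
 *
 *   b0           16-byte output block
 *   nonce        nonce N, nonce_len bytes (7..13)
 *   payload_len  total plaintext length in bytes, needed up front
 *   has_aad      nonzero if additional data is present
 *   tag_len      tag length t in bytes (4, 6, ..., 16)
 *
 * Returns 0 on success, -1 if a parameter is out of range.
 */
static int ccm_format_b0(uint8_t b0[16],
                         const uint8_t *nonce, size_t nonce_len,
                         uint64_t payload_len, int has_aad, size_t tag_len)
{
    assert(b0 != NULL && nonce != NULL);

    if (nonce_len < 7 || nonce_len > 13)
        return -1;
    if (tag_len < 4 || tag_len > 16 || (tag_len % 2) != 0)
        return -1;

    size_t q = 15 - nonce_len;  /* bytes used to encode the payload length */

    /* The payload length must fit into q bytes (q == 8 always fits). */
    if (q < 8 && (payload_len >> (8 * q)) != 0)
        return -1;

    /* Flags byte: bit 6 = Adata, bits 5..3 = (t-2)/2, bits 2..0 = q-1. */
    b0[0] = (uint8_t)((has_aad ? 0x40 : 0x00) |
                      (((tag_len - 2) / 2) << 3) |
                      (q - 1));
    memcpy(b0 + 1, nonce, nonce_len);

    /* Payload length, big-endian, in the last q bytes. */
    for (size_t i = 0; i < q; i++) {
        b0[15 - i] = (uint8_t)(payload_len & 0xff);
        payload_len >>= 8;
    }
    return 0;
}
```

So a CCM "starts" function would presumably need the payload length, the additional-data length and the tag length as parameters, in contrast to mbedtls_gcm_starts().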
2. The next step, of course, is to integrate this into the higher mbedtls_cipher layer.
Regarding higher, abstract layers: I often didn't understand which interface I was supposed to use. In general, I like to use the lowest available interface, for example #include "mbedtls/sha512.h" when I want to use SHA-512. However, if I need HMAC-SHA-512 or HKDF-HMAC-SHA-512, then I have to use the interface in md.h. For hash functions this is fine. Almost all hash functions are supported via md.h. (I missed SHA-512/256, which is sometimes preferable to SHA-256 on 64-bit systems.)
But with cipher.h, I can only access Chacha20Poly1305 and AES-GCM, not AES-CCM.
3. For certification and evaluation purposes I need some test vectors for each crypto function on target. While I know about the comprehensive self-test program, I'm now talking about built-in functions like mbedtls_sha512_self_test(), etc., which are enabled with #define MBEDTLS_SELF_TEST.
These self-tests differ a lot in coverage. For SHA-384 and SHA-512 they are fine. For HMAC-SHA-384 and HMAC-SHA-512 I couldn't find any, and the same goes for HKDF-HMAC-SHA-256 (test vectors in RFC 5869) and HKDF-HMAC-SHA-384/512 (official test vectors are difficult to find). AES-CTR and AES-XTS are only tested with a 128-bit key, not with 256 bit. AES-CCM is not tested with 256 bit, and even for 128 bit the test vector with long additional data from the standard NIST SP 800-38C is not used. The built-in self-test for GCM is the best I've seen in mbedTLS: all three key lengths are tested, as well as both the IUF interface and the convenience function. Bravo!
4. That I couldn't configure AES-256 only, i.e. without AES-128 and AES-192, was to be expected (and the code overhead is not that much). But in modern modes of operation nobody needs AES decryption, only the forward direction. Some recent publications, such as Schwabe/Stoffelen "All the AES you need on Cortex-M3 and M4", provide only the forward direction.
So it would be nice if one could configure AES (ECB) encryption only, without decryption.
Of course, this is only possible if we don't use CBC mode, etc. It would save not only the AES decryption code but also the rather large T-tables for decryption.
5. Regarding AES, or rather the AES context-type definition:

    typedef struct mbedtls_aes_context
    {
        int nr;             /*!< The number of rounds. */
        uint32_t *rk;       /*!< AES round keys. */
        uint32_t buf[68];   /*!< Unaligned data buffer. This buffer can
                             *   hold 32 extra Bytes, which can be used for
                             *   one of the following purposes:
                             *   <ul><li>Alignment if VIA padlock is
                             *   used.</li>
                             *   <li>Simplifying key expansion in the 256-bit
                             *   case by generating an extra round key.
                             *   </li></ul> */
    }
    mbedtls_aes_context;

I really don't understand why we need an additional 2176 bits in EVERY AES context. I would understand 128 bits (one block size) or even 512 bits (for example for a CTR optimization, which is not used!). But 2176 bits in every AES context? The VIA padlock is not very common, I suppose. But even if it were, this wouldn't justify such memory overhead.
How wasteful this is can be seen in the next type definition:

    /**
     * \brief The AES XTS context-type definition.
     */
    typedef struct mbedtls_aes_xts_context
    {
        mbedtls_aes_context crypt; /*!< The AES context to use for AES block
                                    *   encryption or decryption. */
        mbedtls_aes_context tweak; /*!< The AES context used for tweak
                                    *   computation. */
    } mbedtls_aes_xts_context;

The tweak context is for the encryption of exactly 128 bits, not more.
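To make the numbers concrete, here is a small sketch of the arithmetic behind buf[68]. It assumes that rk points at the start of buf, so that the round keys themselves live in this buffer (the "32 extra Bytes" wording in the comment suggests this). The helper names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    AES_BUF_WORDS   = 68, /* size of buf[] in mbedtls_aes_context */
    AES_BLOCK_WORDS = 4   /* one 128-bit round key = four 32-bit words */
};

/* Words of round-key material for nr rounds: nr + 1 round keys. */
static size_t aes_round_key_words(int nr)
{
    return (size_t)AES_BLOCK_WORDS * (size_t)(nr + 1);
}

/* Bits of buf[] not occupied by round keys.
 * Assumption: rk == buf, i.e. the round keys start at buf[0]. */
static size_t aes_buf_spare_bits(int nr)
{
    return ((size_t)AES_BUF_WORDS - aes_round_key_words(nr)) * 32;
}
```

Under that assumption, AES-256 (nr = 14) uses 60 of the 68 words for round keys, which leaves 8 words = 32 bytes and matches the comment in the header. The whole buffer is 68 * 32 = 2176 bits either way, and an XTS context carries it twice.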
6. In general, the contexts of mbedTLS are rather full of implementation-specific details. The most extreme case is mbedtls_ecp_group in ecp.h. Wouldn't it be clearer to separate the standard things (domain parameters in this case) from implementation-specific details?
7. While we are at Elliptic Curve Cryptography: I assume that some of you know that projective coordinates as an outer interface to ECC are dangerous, see David Naccache, Nigel P. Smart, Jacques Stern: Projective Coordinates Leak, Eurocrypt 2004, pp. 257–267. Therefore, the usual interface in ECC standards is either affine points or compressed affine points (okay, with the modern curves Curve25519 and Curve448 it's X only).
Now with

    /**
     * \brief   The ECP point structure, in Jacobian coordinates.
     *
     * \note    All functions expect and return points satisfying
     *          the following condition: <code>Z == 0</code> or
     *          <code>Z == 1</code>. Other values of \p Z are
     *          used only by internal functions.
     *          The point is zero, or "at infinity", if <code>Z == 0</code>.
     *          Otherwise, \p X and \p Y are its standard (affine)
     *          coordinates.
     */
    typedef struct mbedtls_ecp_point
    {
        mbedtls_mpi X;  /*!< The X coordinate of the ECP point. */
        mbedtls_mpi Y;  /*!< The Y coordinate of the ECP point. */
        mbedtls_mpi Z;  /*!< The Z coordinate of the ECP point. */
    }
    mbedtls_ecp_point;

you have Jacobian coordinates, i.e. projective coordinates, as the outer interface. The comment notes that only the affine part is used, but can this be assured? In all circumstances?
8. In my personal opinion the definition

    /**
     * \brief   The ECP key-pair structure.
     *
     * A generic key-pair that may be used for ECDSA and fixed ECDH, for example.
     *
     * \note    Members are deliberately in the same order as in the
     *          ::mbedtls_ecdsa_context structure.
     */
    typedef struct mbedtls_ecp_keypair
    {
        mbedtls_ecp_group grp;  /*!< Elliptic curve and base point */
        mbedtls_mpi d;          /*!< our secret value */
        mbedtls_ecp_point Q;    /*!< our public value */
    }
    mbedtls_ecp_keypair;

is dangerous. Why not differentiate between private key, public key and domain parameters? How often does it happen by accident with this structure that you pass the private key (unneeded and dangerous) together with the public key to ECDSA signature verification? Obviously the authors of programs\ecdsa.c were aware of this (and perhaps it had happened to them), as the following comment shows:

    /*
     * Transfer public information to verifying context
     *
     * We could use the same context for verification and signatures, but we
     * chose to use a new one in order to make it clear that the verifying
     * context only needs the public key (Q), and not the private key (d).
     */

What is sometimes useful is to have the public key at hand when you have performed a private key operation (as a countermeasure against fault attacks: verify after signing). But for ECC the verification procedure is often too expensive (in contrast to the cheap RSA verify).
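A rough sketch of the kind of separation I have in mind. All type and function names below are hypothetical and not Mbed TLS definitions, and the fixed 32-byte coordinates assume a 256-bit curve purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EC_COORD_LEN 32 /* assumption: 256-bit curve, fixed-size coordinates */

/* Public key only: this is all a verifier should ever receive. */
typedef struct {
    uint8_t x[EC_COORD_LEN];
    uint8_t y[EC_COORD_LEN];
} ec_public_key;

/* Private scalar, kept in its own type. */
typedef struct {
    uint8_t d[EC_COORD_LEN];
} ec_private_key;

/* What a signer holds; the public part is kept for verify-after-sign. */
typedef struct {
    ec_private_key priv;
    ec_public_key  pub;
} ec_signing_key;

/* Copy only the public part out of a signing key. */
static void ec_export_public(const ec_signing_key *sk, ec_public_key *out)
{
    memcpy(out, &sk->pub, sizeof(*out));
}

/* Overwrite the secret scalar; volatile keeps the stores from being elided. */
static void ec_wipe_private(ec_signing_key *sk)
{
    volatile uint8_t *p = sk->priv.d;
    for (size_t i = 0; i < sizeof(sk->priv.d); i++)
        p[i] = 0;
}
```

With distinct types, a verification function that takes a const ec_public_key * cannot be handed the private scalar by accident; the compiler rejects an ec_signing_key.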
9. Regarding ECC examples: I found it very difficult that there isn't a single example with known test vectors as in the relevant crypto standards, i.e. FIPS 186-4 and ANSI X9.62-2005, with raw public keys. What I mean are (defined) curves, a public key value Q=(Qx,Qy) and known signature values r and s. In the example ecdsa.c you generate your own key pair and read/write the signature in serialized form. In the examples programs/pkey/pk_sign.c and pk_verify.c you use the higher-level interface pk.h and keys in PEM format.
So it took me a while to write a program that verifies (all) known-answer tests in the standards (old standards such as ANSI X9.62-1998 have more detailed known-answer tests). One needs this interface with raw public keys for CAVP tests, for example; see The FIPS 186-4 Elliptic Curve Digital Signature Algorithm Validation System (ECDSA2VS).
10. While debugging mbedtls_ecdsa_verify() in my example program, I found out that the ECDSA, ECC and MPI operations are very, let's say, nested. So IMHO there is a lot of function call overhead and special-casing. It would be interesting to see how a clean, straightforward mbedtls_ecdsa_verify without restartable code, etc. performs compared to the current one.
11. At the moment, there is not a single known-answer test for ECDSA (which could be activated with #define MBEDTLS_SELF_TEST). I wouldn't say that you need an example for every curve and hash combination, as is done in ECDSA2VS CAVP, but one example for one of the NIST curves, one for Curve25519 and, if I may make a wish, one for Brainpool would be nice. And this would solve #9 above.
12. Just a minor issue: I only needed ECDSA signature verification, therefore I only enabled MBEDTLS_ASN1_PARSE_C. But it is not possible to compile without MBEDTLS_ASN1_WRITE_C, which is needed for ECDSA signature generation.
13. Feature request: Since it was irrelevant for my task (only verification, no generation), I didn't have a detailed look at your ECC side-channel countermeasures. But obviously you use the same protected code for scalar multiplication in verify and sign, right? Wouldn't it be possible to use Shamir's trick in verification, with fast unprotected multi-scalar multiplication? At the moment, mbedtls_ecdsa_verify is a factor of 4-5 slower than mbedtls_ecdsa_sign, while OpenSSL's verify is faster than sign.
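For readers who don't know Shamir's trick: ECDSA verification needs u1*G + u2*Q, and instead of two separate scalar multiplications one can walk both scalars bit by bit with a single shared doubling chain. The sketch below shows the idea in a toy group, the integers modulo a small p under multiplication, so "double" becomes "square" and "add" becomes "multiply". It is not constant time, which is acceptable only because verification handles public values. All names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod p; requires a, b < p < 2^32 so the product fits in 64 bits. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (a * b) % p;
}

/* Plain square-and-multiply: g^e mod p, for comparison. */
static uint64_t powmod(uint64_t g, uint64_t e, uint64_t p)
{
    uint64_t r = 1 % p;
    g %= p;
    while (e != 0) {
        if (e & 1)
            r = mulmod(r, g, p);
        g = mulmod(g, g, p);
        e >>= 1;
    }
    return r;
}

/* Shamir's trick: g^a * h^b mod p with one shared squaring chain.
 * Variable time, so only suitable for public inputs. */
static uint64_t shamir_powmod2(uint64_t g, uint64_t a,
                               uint64_t h, uint64_t b, uint64_t p)
{
    assert(p > 1 && p <= 0xffffffffu);

    g %= p;
    h %= p;
    const uint64_t gh = mulmod(g, h, p); /* precomputed g*h */
    uint64_t r = 1;

    for (int i = 63; i >= 0; i--) {
        r = mulmod(r, r, p);             /* one squaring per bit position */
        unsigned bits = (unsigned)(((a >> i) & 1) | (((b >> i) & 1) << 1));
        if (bits == 1)
            r = mulmod(r, g, p);
        else if (bits == 2)
            r = mulmod(r, h, p);
        else if (bits == 3)
            r = mulmod(r, gh, p);
    }
    return r;
}
```

The shared chain does one squaring per bit instead of two, plus at most one multiplication per bit, which is where the saving over two independent exponentiations comes from.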
14. Design question: At the moment, both GCM and CCM use their own implementation of CTR encryption, which is very simple. But then we have mbedtls_aes_crypt_ctr() in aes.h, which is very simple, too. Let's assume that one day we have a performance-optimized CTR encryption (for example from Schwabe & Stoffelen) with all the fancy stuff like counter-mode caching etc. Then the code would have to be replaced in at least three places. Why isn't the code more modular at this point? Is this a deliberate design decision? Why do I find in so many places
for( i = 0; i < 16; i++ ) y[i] ^= b[i];
instead of a fast 128-bit XOR macro operating on 32-bit-aligned data?
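What I mean is something along these lines. This is only a sketch with made-up function names; whether the word-wise variants are actually faster depends on the compiler and the target.

```c
#include <stdint.h>
#include <string.h>

/* Byte-wise reference, equivalent to the loop quoted above. */
static void xor_block_bytes(uint8_t y[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        y[i] ^= b[i];
}

/* Word-wise variant: four 32-bit XORs instead of sixteen 8-bit ones.
 * The uint32_t parameter types guarantee 32-bit alignment. */
static void xor_block_words(uint32_t y[4], const uint32_t b[4])
{
    y[0] ^= b[0];
    y[1] ^= b[1];
    y[2] ^= b[2];
    y[3] ^= b[3];
}

/* Alignment-agnostic variant for byte buffers: memcpy into local words
 * avoids unaligned accesses and aliasing problems. */
static void xor_block_any(uint8_t *y, const uint8_t *b)
{
    uint32_t yw[4], bw[4];

    memcpy(yw, y, sizeof(yw));
    memcpy(bw, b, sizeof(bw));
    for (int i = 0; i < 4; i++)
        yw[i] ^= bw[i];
    memcpy(y, yw, sizeof(yw));
}
```

Since XOR acts on each bit independently, all three variants give the same result regardless of endianness.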
So, that's it for the moment. I hope I could give some hints for the further development of mbedTLS. Feel free to discuss any of the above points. It's clear to me that we cannot have both: code that is clear and simple to understand, and performance records.
Ciao,
Torsten