Hello Gilles,
Thank you.
I think I solved it. It was due to TLS being an unusual task in this project in that it moves a load of data in one go.
TLS by definition requires a 16k rx buffer and a 16k tx buffer, but we are an HTTPS *Client* and thus control how much we send *out*, so we have a 16k rx buffer and a 4k tx buffer. The output message is (experimentally) typically 1k for the negotiation, and 4k would be reached only with actual data, but we control that because TLS splits up larger tx packets. We have a 4k buffer defined as the maximum unencrypted payload.
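For reference, the asymmetric buffers described above can be expressed in the Mbed TLS configuration roughly as follows. This is only a sketch: it assumes an Mbed TLS version that provides the separate MBEDTLS_SSL_IN_CONTENT_LEN and MBEDTLS_SSL_OUT_CONTENT_LEN options, and the values are simply the 16k and 4k figures quoted above.

```c
/* Mbed TLS config fragment (sketch; values taken from the description above) */
#define MBEDTLS_SSL_IN_CONTENT_LEN   16384  /* rx: the server may send full-size 16k records */
#define MBEDTLS_SSL_OUT_CONTENT_LEN  4096   /* tx: as a client we control how much we send out */
```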
The problem was in LWIP. When an application calls the LWIP API, initially the API code runs at the application's priority (obviously). Then it gets message-passed to the LWIP core, which runs at whatever priority has been configured for that - in our case 40. LWIP then grabs some buffers (PBUFs) out of a pool of buffers and shoves this tx data into those, in the hope of sending them out via the PHY interface down the wire.
The problem was that I had just 4 PBUFs configured. Together with other applications running, this wasn't enough for whatever TLS needed. I found that, for my particular project, 5 were needed, so I went to 6. Each of these is (in my project) 1 MTU size (1.5k) so you don't want to just chuck in 20 of them :)
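As an illustration, the change might look like the lwipopts.h fragment below. This is a sketch under an assumption: it takes the "PBUFs" counted above to be the entries of lwIP's pbuf pool (PBUF_POOL_SIZE). The thread does not state which option was actually changed, so the setting on a given project may be a different one.

```c
/* lwipopts.h fragment (sketch).
 * Assumption: the "PBUFs" above are PBUF_POOL entries; the thread does not
 * say which lwIP option was actually edited. */
#define PBUF_POOL_SIZE  6   /* was 4; 5 was the observed minimum, 6 leaves one spare */
```

With each buffer at roughly one MTU (1.5k), six of them cost about 9k of RAM, which is why simply raising the count to 20 is unattractive on a small MCU.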
What LWIP was waiting for, for the 3 seconds, I don't know. It was just kept busy by TLS while TLS was doing the session setup - all of ~3 secs. I do know from tests that with 4 PBUF buffers in LWIP, the maximum that LWIP accepts for tx (via its netconn API and almost certainly also via its sockets API) in one chunk is 4k. And TLS must have been getting close to that, and one other application, sending out about 1.5k, pushed it over the edge.
So it remains unsolved other than "more buffers fixed it". There is a lot of stuff like this. For example, I found that LWIP falls over if you have just 2 RX buffers at the PHY ETH layer and one of them gets a broadcast packet while the other gets a data packet, concurrently. LWIP then hangs for about 1 second. This was solved by a) having 4 PHY buffers and b) filtering out multicast packets on the way from PHY to LWIP (except ARPs).
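A filter of the kind described in b) could be sketched as below. The function name and constants are hypothetical, and the actual filter used in this project is not shown in the thread. The sketch assumes untagged Ethernet II frames and treats every frame whose destination address has the group bit set (multicast and broadcast alike) as droppable unless it is ARP.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN    14u      /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define ETHERTYPE_ARP  0x0806u

/* Hypothetical helper: returns true if a received frame should be handed
 * on to the IP stack, false if it should be dropped in the driver. */
bool frame_should_reach_stack(const uint8_t *frame, size_t len)
{
    if (frame == NULL || len < ETH_HDR_LEN) {
        return false;                        /* runt or missing frame */
    }

    /* EtherType is big-endian at offset 12 (no VLAN tag assumed). */
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    if (ethertype == ETHERTYPE_ARP) {
        return true;                         /* always let ARP through */
    }

    /* Least significant bit of the first destination byte is the group
     * bit: set for multicast and broadcast, clear for unicast. */
    bool group_addr = (frame[0] & 0x01u) != 0;
    return !group_addr;
}
```

A real filter may need to let other group-addressed traffic through as well, depending on which protocols the device relies on.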
The problem is that a lot of this stuff gets set up with loads of buffers everywhere and if it works, it works, and nobody knows the limits.
Regards,
Peter
Hi Peter,
This question is probably better suited for an LWIP or RTOS forum, since Mbed TLS doesn't have any knowledge of thread priorities, and the question would be pretty much the same for any implementation of a network protocol that uses cryptography.
As someone with only very limited experience of doing synchronization through thread priority, my impression is that Mbed TLS isn't integrated correctly. I think that only the BIO functions (send and recv callbacks passed to mbedtls_ssl_set_bio) should run at LWIP priority. Preparing the data to send before it's been copied to the networking stack by f_send, and processing the received data after it's been copied from the networking stack by f_recv, should not happen at high priority. These are CPU-intensive tasks with no deadline (unless the TLS processing itself has a deadline), so they should run at low priority. I'd expect the callback functions to do the necessary runtime priority adjustment if such adjustments are necessary.
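To make the idea concrete, here is a rough sketch of a send callback that adjusts priority only around the call into the network stack. The callback has the shape Mbed TLS expects for f_send, but everything else is hypothetical: the bio_ctx_t type, its hook pointers and the demo_* stand-ins are invented for illustration so that the example compiles on a host. On a real target the hooks might wrap, for example, the RTOS priority get/set calls and the lwIP socket send.

```c
#include <stddef.h>

/* Hypothetical context passed as p_bio; the hooks hide the RTOS and
 * network stack so the sketch stays self-contained. */
typedef struct {
    int  (*get_priority)(void);
    void (*set_priority)(int prio);
    int  (*net_send)(int fd, const unsigned char *buf, size_t len);
    int  fd;            /* socket handle */
    int  net_priority;  /* priority to use while inside the stack call */
} bio_ctx_t;

/* Same signature as the f_send callback given to mbedtls_ssl_set_bio. */
int bio_send(void *ctx, const unsigned char *buf, size_t len)
{
    bio_ctx_t *c = (bio_ctx_t *)ctx;
    int saved = c->get_priority();

    c->set_priority(c->net_priority);   /* raise only for the copy into the stack */
    int ret = c->net_send(c->fd, buf, len);
    c->set_priority(saved);             /* crypto continues at the caller's priority */

    return ret;
}

/* Host-side stand-ins, for demonstration only. */
int demo_current_priority = 1;
int demo_priority_seen_in_send = -1;

int  demo_get_priority(void)     { return demo_current_priority; }
void demo_set_priority(int prio) { demo_current_priority = prio; }

int demo_net_send(int fd, const unsigned char *buf, size_t len)
{
    (void)fd;
    (void)buf;
    demo_priority_seen_in_send = demo_current_priority;  /* record for inspection */
    return (int)len;                                     /* pretend everything was sent */
}
```

A context like this would be registered with mbedtls_ssl_set_bio(&ssl, &ctx, bio_send, ..., NULL), with a receive callback built the same way. Whether raising the priority inside the callback is needed at all depends on how the application tasks and the LWIP thread are prioritised.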
Best regards,
-- Gilles Peskine Mbed TLS developer
On 12/02/2023 14:53, Peter via mbed-tls wrote:
Hi All,
I have a couple of tasks which use LWIP and which get suspended for a few seconds during TLS RSA/EC crypto. One (a primitive http server) uses Netconn and the other (a serial to TCP data copy process) uses sockets.
I also have a number of tasks which don't do any networking and which run as they should, throughout. Working out by experiment what priority these need is difficult, but it looks like it needs to be at or above that of the tasks which invoke TLS. If their priority is 0 then TLS hangs them up as well.
After much experimentation with RTOS priorities, this is what I found, and I wonder if it is right:
TCP/IP applications (whether using the LWIP Netconn API or the LWIP socket API) should run at a priority equal to or lower than that of LWIP itself which [in this project] is osPriorityHigh (40). TCP/IP applications can be run with a priority all the way down to tskIDLE_PRIORITY (0).
The exception is if TLS is in use. TLS does not yield to the RTOS; you get a solid CPU time lump of ~3 secs (STM32F417, hardware AES only). TLS starts at the priority of the task which invokes it, but subsequent TLS-driven TCP/IP operations run at the priority of LWIP. So when TLS is doing the session setup crypto, tasks of a priority lower than LWIP get suspended. If this gap is an issue, the priority of the relevant tasks should be equal to LWIP's. Furthermore, due to the structure of LWIP, the priority of a task using it should not be higher than LWIP's (40), since it might cause blocking on socket (or netconn) writes.
Does this make sense?
It looks like LWIP blocks all netconn and socket ops when TLS is using it. Is that possible?
I am running with
#define LWIP_TCPIP_CORE_LOCKING 0
#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1
#define SYS_LIGHTWEIGHT_PROT 1
as this was found to have much better granularity for task switching. With LWIP_TCPIP_CORE_LOCKING=1 you end up with a crude mutex across the entire API call, which is fine in a single RTOS task.
Thank you in advance for any pointers.
Peter