Hi,
Today I measured the call overhead on function entry to TF-M, using the MDK debugger on STM32L5. The overhead is significant and will cause side effects for time-deterministic MCU applications.
Compiler: AC6.14 with -Oz (optimized for image size)
TF-M configuration: TFM_LVL=1, library mode, TFM_NS_CLIENT_IDENTIFICATION=OFF
--- Execution time measurement:
Function call from the NS psa_open_key to the corresponding secure function:
  NS: dispatch      -> S: tfm_crypto_open_key   2135 cycles
  NS: dispatch      -> S: psa_open_key          2536 cycles
  NS: psa_open_key  -> S: psa_open_key          2825 cycles (includes RTOS mutex overhead)
  tfm_core_sfn_request(const struct tfm_sfn_req_s *desc_ptr)
  {
      __ASM volatile(
          "PUSH {r4-r12, lr} \n"
          "SVC %[SVC_REQ] \n"    /* <--- effectively disables interrupts for 1970 cycles */
          "MOV r4, #0 \n"
          ...
On Musca (~48MHz) the overhead is 45us for a TF-M call.
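As a cross-check, converting the cycle counts above into time is simple arithmetic. The sketch below assumes the ~48 MHz Musca clock mentioned above; the helper name cycles_to_us is hypothetical. At 48 MHz, 2135 cycles is about 44.5 us, which matches the ~45 us figure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a cycle count (e.g. read in the debugger)
 * into microseconds for a given core clock in Hz. */
static double cycles_to_us(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u); /* a zero clock would divide by zero */
    return (double)cycles * 1e6 / (double)cpu_hz;
}
```

For example, at 48 MHz cycles_to_us(2825u, 48000000u) gives about 58.9 us for the full NS psa_open_key path including the RTOS mutex, and the 1970-cycle SVC window corresponds to roughly 41 us with NS interrupts blocked.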
--- Code size overhead:
Each TF-M function has the following call flow:
  tfm_ns_interface_dispatch (a central function)
    #33  result = fn(arg0, arg1, arg2, arg3);  -> calls each TF-M function through its own individual veneer
  tfm_core_partition_request (again a central function)
Because function inlining is used, each veneer requires 180 bytes. My system uses 4 ITS and 46 Crypto functions, so the veneer entries alone take roughly 10 KB of code.
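The veneer estimate can be reproduced with a one-line calculation: 4 ITS + 46 Crypto functions at 180 bytes each. The helper name is hypothetical, and the 180-byte figure is the measured veneer size quoted above, so it will differ with other compilers or options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: flash consumed by per-function veneers when every
 * veneer is inlined to the same size. */
static size_t veneer_bytes(size_t n_functions, size_t bytes_per_veneer)
{
    /* guard against size_t overflow for unrealistic inputs */
    assert(bytes_per_veneer == 0u || n_functions <= SIZE_MAX / bytes_per_veneer);
    return n_functions * bytes_per_veneer;
}
```

veneer_bytes(4u + 46u, 180u) returns 9000 bytes (about 8.8 KiB), the same order as the ~10 KB quoted. Every additional exported function adds another full veneer, which is the cost a single central entry point would avoid.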
Here are some suggestions:
* Using a central entry point to TF-M could save ~10 KB. I suggest a table-driven approach (the table could be generated from the "manifest" information).
* With LVL1 isolation, why is it necessary to switch from NS thread mode -> S handler mode -> S thread mode? Is it not possible to call directly from NS thread mode to S thread mode?
* Disabling NS interrupts for 1970 cycles will be problematic for many time-critical applications that are ISR driven. Part of this time is caused by parameter checking:
  * Current sequence: first check, then copy (which requires interrupts to be disabled).
  * Better: first copy, then check, which could avoid blocking ISRs.
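The first and third suggestions can be combined into one sketch: a single secure entry point that looks up a function ID in a table (which a build step could generate from manifest information) and copies the NS parameter block into secure memory before checking it. Everything here is hypothetical: the sketch_* names, the parameter layout and the max_arg0 rule are illustrative only, and the CMSE attributes and NS pointer checks a real TrustZone build needs are left out so the code compiles and runs on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical return codes for this sketch. */
#define SKETCH_ERR_BAD_ID     (-1)
#define SKETCH_ERR_BAD_PARAMS (-2)

/* Number of exported functions; in a real build this could be generated
 * from manifest information together with the function IDs. */
#define SKETCH_NUM_FUNCS 2u

/* Hypothetical parameter block handed over by the NS caller. */
struct sketch_args {
    uint32_t arg[4];
};

typedef int32_t (*sketch_sfn_t)(const struct sketch_args *args);

/* Stand-ins for secure functions such as crypto or ITS handlers. */
static int32_t sfn_add(const struct sketch_args *a)
{
    return (int32_t)(a->arg[0] + a->arg[1]);
}

static int32_t sfn_first(const struct sketch_args *a)
{
    return (int32_t)a->arg[0];
}

/* One table entry per exported function: handler plus a per-function
 * parameter rule (here just an upper bound on arg[0]). */
struct sketch_entry {
    sketch_sfn_t fn;
    uint32_t max_arg0;
};

static const struct sketch_entry sketch_table[] = {
    { sfn_add,   1000u },
    { sfn_first, 10u   },
};

/* Keep the generated ID count and the table in sync at compile time. */
static_assert(sizeof(sketch_table) / sizeof(sketch_table[0]) == SKETCH_NUM_FUNCS,
              "dispatch table does not match the number of function IDs");

/* Single central entry point instead of one inlined veneer per function.
 * A real TrustZone entry would also need __attribute__((cmse_nonsecure_entry))
 * and an address-range check on ns_args before reading it. */
static int32_t sketch_secure_entry(uint32_t fn_id, const struct sketch_args *ns_args)
{
    struct sketch_args local;

    if (fn_id >= SKETCH_NUM_FUNCS) {
        return SKETCH_ERR_BAD_ID;
    }
    if (ns_args == NULL) {
        return SKETCH_ERR_BAD_PARAMS;
    }

    /* 1. Copy first: snapshot the NS parameter block into secure memory. */
    memcpy(&local, ns_args, sizeof(local));

    /* 2. Then check: validate the private copy, not the NS original. */
    if (local.arg[0] > sketch_table[fn_id].max_arg0) {
        return SKETCH_ERR_BAD_PARAMS;
    }

    /* 3. Dispatch through the table. */
    return sketch_table[fn_id].fn(&local);
}
```

Because the checks operate on the private copy, an NS interrupt that modifies the original parameter block after the memcpy cannot change what was validated. Whether that alone would shorten the interrupt-blocked window in TF-M depends on the rest of the SVC path, which this sketch does not model.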
I hope this helps to improve TF-M.
Reinhard
Hi Reinhard, All,
Thanks to your research, we now have exact measurements before making any improvements, which is good to know. I believe I heard numbers of the same order at one of the Tech Forums or during the workshop in Lyon. I think such issues should be addressed through a design proposal and community review, ending with a code change. Personally, I like your suggestions. Let's discuss them at one of the Tech Forums.
Best Regards, Anton Komlev
From: Reinhard Keil Reinhard.Keil@arm.com Sent: 04 March 2020 13:56 To: tf-m@lists.trustedfirmware.org Cc: nd nd@arm.com; Anton Komlev Anton.Komlev@arm.com Subject: Entry to TF-M