optee_benchmark found optee_os function vm_set_ctx() spent much time

23 Jun 2023


 Hi Jens, Olivier,
 I'm a colleague of Yuye, and we are working on OP-TEE benchmarking and performance optimization.
 We are using optee_os, optee_client, optee_examples and optee_benchmark, all at version 3.20.0.
 We consider the latency of the host API TEEC_InvokeCommand() critical; it corresponds to what TEE frameworks such as OpenEnclave call an ECALL.
 So we removed all prints from TA_InvokeCommandEntryPoint() in hello_world_ta.c and had the host call TEEC_InvokeCommand() 10k times to calculate the average time per call.
 We tested on an Armv9 data-center server and measured an average of about 20 us per call.
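 The measurement described above can be sketched roughly as follows. This is a minimal, hedged example and not the code used for the numbers in this mail: `invoke_fn`, `average_latency_ns()` and `fake_invoke()` are hypothetical names, and `fake_invoke()` is only a stand-in for a wrapper around a real TEEC_InvokeCommand() call on an open session.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type: one call is expected to wrap one
 * TEEC_InvokeCommand() and return 0 on success. */
typedef int (*invoke_fn)(void *arg);

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Call fn(arg) `iterations` times back to back and return the mean
 * latency per call in nanoseconds, or -1.0 if any call fails.
 */
static double average_latency_ns(invoke_fn fn, void *arg,
				 unsigned int iterations)
{
	uint64_t start = 0;
	uint64_t end = 0;
	unsigned int i = 0;

	assert(fn && iterations > 0);

	start = now_ns();
	for (i = 0; i < iterations; i++) {
		if (fn(arg) != 0)
			return -1.0;
	}
	end = now_ns();

	return (double)(end - start) / (double)iterations;
}

/* Stand-in for the real invoke wrapper: only counts how often it ran. */
static int fake_invoke(void *arg)
{
	unsigned int *calls = arg;

	(*calls)++;
	return 0;
}
```

 With a real client, `fake_invoke()` would be replaced by a function that calls TEEC_InvokeCommand() with the hello_world command ID, and the result for 10000 iterations would be divided by 1000 to get microseconds.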
 Using the optee_benchmark tool, we found that the optee_os function user_ta_enter() takes most of the time.
 Specifically, user_ta_enter() calls ts_push_current_session() and ts_pop_current_session(), which leads to two calls to vm_set_ctx(). These take about 7 us and 2 us respectively, a large part of the 20 us latency.
 I added logs to print some pointers; the following is the log from update_current_ctx().
I found that many InvokeCommand calls repeat the following pattern:
E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0
I noticed that many InvokeCommand calls use the same ctx pointer, such as 0x8940b5d28.
My questions are:
 Can consecutive InvokeCommand calls avoid setting tsd->ctx to a non-NULL ctx and then back to NULL each time, to remove the time cost of vm_set_ctx()?
 Is there any optimization, such as having many InvokeCommand calls in the same session reuse the same non-NULL ctx, that would reduce the average latency over 10k InvokeCommand calls?
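 To make the question concrete, here is a toy model of the two behaviours. It is not optee_os code: every `toy_*` name and both `*_invoke()` functions are hypothetical, `toy_set_ctx()` only counts how often a context switch in the style of vm_set_ctx() would happen, and whether leaving a TA context installed after the session is popped is safe or even possible in optee_os is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the optee_os types. */
struct toy_ctx {
	int id;
};

struct toy_thread {
	struct toy_ctx *active;	/* context currently installed */
	unsigned long switches;	/* simulated vm_set_ctx() calls */
};

/* Simulates the expensive context switch by counting it. */
static void toy_set_ctx(struct toy_thread *t, struct toy_ctx *ctx)
{
	t->active = ctx;
	t->switches++;
}

/* Eager model, matching the log: install on push, clear on pop. */
static void eager_invoke(struct toy_thread *t, struct toy_ctx *ctx)
{
	assert(t && ctx);
	toy_set_ctx(t, ctx);	/* push: tsd->ctx 0x0 -> ctx */
	/* the TA entry point would run here */
	toy_set_ctx(t, NULL);	/* pop: tsd->ctx ctx -> 0x0 */
}

/* Lazy model: switch only if a different context is installed,
 * and leave it installed when the session is popped. */
static void lazy_invoke(struct toy_thread *t, struct toy_ctx *ctx)
{
	assert(t && ctx);
	if (t->active != ctx)
		toy_set_ctx(t, ctx);
	/* the TA entry point would run here; nothing is cleared on pop */
}

/* Run n invocations against one context and return the switch count. */
static unsigned long count_switches(void (*invoke)(struct toy_thread *,
						   struct toy_ctx *),
				    unsigned int n)
{
	struct toy_ctx ctx = { 1 };
	struct toy_thread t = { NULL, 0 };
	unsigned int i = 0;

	for (i = 0; i < n; i++)
		invoke(&t, &ctx);

	return t.switches;
}
```

 In this model, 10000 invocations cost 20000 simulated switches with `eager_invoke()` and a single one with `lazy_invoke()`. The model ignores everything that makes the real case hard, such as several threads, several TAs, and the security cost of keeping user mappings active while not running the TA.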
Thanks. Best wishes.
------------------------------------------------------------------
From: Olivier Deprez Olivier.Deprez@arm.com
Date: 30 May 2023 21:07:20
To: Jens Wiklander jens.wiklander@linaro.org; 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明 hm281385@alibaba-inc.com
Subject: Re: optee_benchmark pmccfiltr_el0
 Hi Yuye,
 In general the consensus is that PMU cycle and event counting in EL3 and the secure world has to be disabled. I gather this is to prevent probing of crypto algorithm timings and exploitation of cache-based timing side channels (e.g. Spectre).
 See Arm ARM D11.5.3 Prohibiting event and cycle counting
 Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/arc...
 See also https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardenin...
 Note there are various knobs depending on implemented architecture extensions FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2
 You could try to permit cycle counting in the secure world for the sake of a one shot experiment,
 but note that this has never been tried, and this should probably not be productized as things stand.
 Regards,
 Olivier.
From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Sent: 30 May 2023 10:38
To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez Olivier.Deprez@arm.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明(连一) hm281385@alibaba-inc.com
Subject: optee_benchmark pmccfiltr_el0
Hi,
It is confirmed that the problem is related to the pmu register configuration and that pmccntr_el0 can be read at any exception level.
Regards,
Yuye.
------------------------------------------------------------------
From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Sent: Friday, 26 May 2023 16:59
To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez olivier.deprez@arm.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明(连一) hm281385@alibaba-inc.com
Subject: optee_benchmark pmccfiltr_el0
Hi, Jens, Olivier,
With OP-TEE running at S-EL1 and Hafnium running at S-EL2, we want to run the benchmark by executing the following command in the optee_benchmark directory:
./out/benchmark ../optee_examples/out/ca/optee_example_hello_world
After entering the benchmark PTA, the bm_timestamp function attempts to read the pmccfiltr_el0 register.
On cold boot, the following code is executed during Hafnium initialization:
vm->arch.trapped_features |= HF_FEATURE_PERFMON;
This prevents the secondary VM from accessing the performance counter registers.
After we removed this code, the bm_timestamp function could read pmccfiltr_el0 without trapping into Hafnium.
But the value of pmccfiltr_el0 remains unchanged; it does not count.
We also tried reading the register in Hafnium and found that it did not change there either.
In contrast, in the normal world, pmccfiltr_el0 counts normally.
Is this related to the PMU register configuration, or does S-EL1 not currently support counting with pmccfiltr_el0?
Thanks for the support.
Regards,
Yuye.
