Hi,
Comments below.
On Fri, Jun 23, 2023 at 3:10 AM 高海源(码源) haiyuan.ghy@alibaba-inc.com wrote:
Hi, Jens, Olivier:
I'm a colleague of Yuye, and we are working on OP-TEE benchmarking and
performance optimization. We are using optee_os/optee_client/optee_examples/optee_benchmark version 3.20.0.
We think the host API TEEC_InvokeCommand() is latency-sensitive;
it corresponds to what TEE frameworks such as OpenEnclave call an ECALL. So we removed all prints from TA_InvokeCommandEntryPoint() in hello_world_ta.c and had the host call TEEC_InvokeCommand() 10k times to calculate the average time cost. We tested on an Armv9 data-center server and measured an average of about 20 us per call.
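A minimal sketch of the kind of averaging loop described above, assuming a POSIX host with clock_gettime(). The names bench_fn, now_ns, bench_avg_ns and count_call are hypothetical and are not part of optee_client or optee_examples; the stand-in workload only increments a counter so the sketch can run without a TEE.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type: one invocation of the operation under test. */
typedef void (*bench_fn)(void *arg);

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call fn(arg) iters times and return the mean cost per call in ns. */
static double bench_avg_ns(bench_fn fn, void *arg, unsigned int iters)
{
	uint64_t start;
	uint64_t end;
	unsigned int i;

	assert(fn && iters > 0);

	start = now_ns();
	for (i = 0; i < iters; i++)
		fn(arg);
	end = now_ns();

	return (double)(end - start) / iters;
}

/*
 * Stand-in workload (assumption): a real test would wrap the
 * TEEC_InvokeCommand() call on an already opened session here.
 */
static void count_call(void *arg)
{
	(*(unsigned int *)arg)++;
}
```

With an open session, the callback would issue the TEEC_InvokeCommand() call instead of count_call, and bench_avg_ns(cb, &state, 10000) would give the per-call average. The loop overhead and the clock reads are included in the result, which is negligible at the 20 us scale but worth keeping in mind for much shorter calls.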
Using the optee_benchmark tool, we found that the optee_os
function user_ta_enter() takes most of the time. Specifically, user_ta_enter() calls ts_push_current_session() and ts_pop_current_session(), which results in vm_set_ctx() being called twice; the two calls take about 7 us and 2 us respectively, a large part of the 20 us delay. I added logs to print some pointers; the following is the log from update_current_ctx().
[JW] That's interesting.
I found that many InvokeCommand calls repeat the following pattern:
E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0
I noticed that many InvokeCommand calls use the same ctx pointer, such as 0x8940b5d28.
[JW] I guess that's because you're calling the same TA repeatedly.
My question is: can repeated InvokeCommand calls avoid setting tsd->ctx to a non-NULL ctx and then back to a NULL ctx, so as to remove the time cost of vm_set_ctx()? Is there any optimization, such as having multiple InvokeCommand calls in the same session reuse the same non-NULL ctx, that would reduce the average delay of ECALLs?
[JW] Does it make any difference if you compile OP-TEE with CFG_CORE_PREALLOC_EL0_TBLS=y?
The next step (with CFG_CORE_PREALLOC_EL0_TBLS=y) is to see which part of vm_set_ctx() we should try to optimize or if we should try to find a way of delaying the call to vm_set_ctx().
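To make the idea of delaying the context switch concrete, here is a toy model only; it is not OP-TEE code and does not reflect how vm_set_ctx() or update_current_ctx() are implemented. All names (toy_ctx, toy_thread, toy_set_ctx, toy_enter, toy_leave) are hypothetical. The sketch keeps the last mapping in place on exit and only pays for a switch when a different context is entered.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a TA context; not the OP-TEE struct ts_ctx. */
struct toy_ctx {
	int id;
};

/* Toy per-thread state: which context the simulated mapping belongs to. */
struct toy_thread {
	struct toy_ctx *active;   /* context the mapping is set up for */
	unsigned int switches;    /* number of expensive switches performed */
};

/* Simulated expensive operation, standing in for the cost of a switch. */
static void toy_set_ctx(struct toy_thread *t, struct toy_ctx *ctx)
{
	t->active = ctx;
	t->switches++;
}

/* Enter a context: switch only if the mapping is for a different one. */
static void toy_enter(struct toy_thread *t, struct toy_ctx *ctx)
{
	assert(t && ctx);
	if (t->active != ctx)
		toy_set_ctx(t, ctx);
}

/* Leave a context: keep the mapping in place instead of switching to NULL. */
static void toy_leave(struct toy_thread *t)
{
	(void)t;
}
```

In this model, a thousand enter/leave pairs on the same context cost a single switch. The toy ignores everything a real kernel would have to handle, for example a context being destroyed while it is still marked active, or the security implications of leaving a user mapping in place after the TA returns.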
Cheers, Jens
Thanks. Best wishes.
*From:* Olivier Deprez Olivier.Deprez@arm.com *Date:* 2023-05-30 21:07:20 *To:* Jens Wiklander jens.wiklander@linaro.org; 梅建强(禹夜) < meijianqiang.mjq@alibaba-inc.com> *Cc:* op-tee op-tee@lists.trustedfirmware.org; hafnium < hafnium@lists.trustedfirmware.org>; 黄明 hm281385@alibaba-inc.com *Subject:* Re: optee_benchmark pmccfiltr_el0
Hi Yuye,
In general the consensus is that PMU cycle and event counting in EL3 and the secure world has to be disabled. I gather this is to avoid probing the timing of crypto algorithms, or leveraging cache-based timing side channels (e.g. Spectre).
See Arm ARM D11.5.3 Prohibiting event and cycle counting
Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/arc...
See also https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardenin...
Note there are various knobs depending on implemented architecture extensions FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2
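As a rough illustration of how some of these knobs interact, the sketch below models a simplified subset of the rules: MDCR_EL3.SPME (bit 17), MDCR_EL3.SCCD (bit 23, FEAT_PMUv3p5) and PMCR_EL0.DP (bit 5). It is an assumption-laden model, not the architectural pseudocode: it ignores external debug authentication, EL3 and EL2 specific controls, PMCCFILTR_EL0 filtering, and whether the counter is enabled at all. The function name secure_cycle_counting_allowed is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Secure event counting enable. */
#define MDCR_EL3_SPME  (UINT64_C(1) << 17)
/* Secure cycle counter disable (FEAT_PMUv3p5). */
#define MDCR_EL3_SCCD  (UINT64_C(1) << 23)
/* Disable the cycle counter when event counting is prohibited. */
#define PMCR_EL0_DP    (UINT64_C(1) << 5)

/*
 * Simplified model (assumption): may PMCCNTR_EL0 count in Secure state
 * at EL0/EL1, given these register values?
 */
static bool secure_cycle_counting_allowed(uint64_t mdcr_el3,
					  uint64_t pmcr_el0,
					  bool has_pmuv3p5)
{
	/* Without SPME, event counting is prohibited in Secure state. */
	bool events_prohibited = !(mdcr_el3 & MDCR_EL3_SPME);

	/* SCCD stops the cycle counter in Secure state outright. */
	if (has_pmuv3p5 && (mdcr_el3 & MDCR_EL3_SCCD))
		return false;

	/* DP ties the cycle counter to the event-counting prohibition. */
	if (events_prohibited && (pmcr_el0 & PMCR_EL0_DP))
		return false;

	return true;
}
```

Under this model, a one-shot experiment would need SCCD clear and either SPME set or DP clear; the actual register values should be checked against the Arm ARM section cited above and the firmware sources for the platform.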
You could try to permit cycle counting in the secure world for the sake of a one shot experiment, but note that this has never been tried, and this should probably not be productized as things stand.
Regards, Olivier.
*From:* 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com *Sent:* 30 May 2023 10:38 *To:* Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez < Olivier.Deprez@arm.com> *Cc:* op-tee op-tee@lists.trustedfirmware.org; hafnium < hafnium@lists.trustedfirmware.org>; 黄明(连一) hm281385@alibaba-inc.com *Subject:* optee_benchmark pmccfiltr_el0
Hi,
It is confirmed that the problem is related to the pmu register configuration and that pmccntr_el0 can be read at any exception level.
Regards, Yuye.
*From:* 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com *Sent:* Friday, 26 May 2023 16:59 *To:* Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez < olivier.deprez@arm.com> *Cc:* op-tee op-tee@lists.trustedfirmware.org; hafnium < hafnium@lists.trustedfirmware.org>; 黄明(连一) hm281385@alibaba-inc.com *Subject:* optee_benchmark pmccfiltr_el0
Hi, Jens, Olivier,
With OP-TEE running at S-EL1 and Hafnium running at S-EL2, we want to run the benchmark by executing the following command in the optee_benchmark directory:

    ./out/benchmark ../optee_examples/out/ca/optee_example_hello_world

After entering the benchmark PTA, the bm_timestamp function attempts to read the pmccfiltr_el0 register. During cold boot, Hafnium initialization executes the following code:

    vm->arch.trapped_features |= HF_FEATURE_PERFMON;

This prevents the secondary VM from accessing the performance counter registers. After we removed this code, the bm_timestamp function could read pmccfiltr_el0 without trapping into Hafnium, but the value of pmccfiltr_el0 remains unchanged and does not count. We also tried reading the register in Hafnium and found that it did not change either. In contrast, in the normal world pmccfiltr_el0 counts normally. Is this related to the PMU register configuration, or does S-EL1 not currently support counting with pmccfiltr_el0?
Thanks for the support.
Regards, Yuye.