Re: optee_benchmark found optee_os function vm_set_ctx() spent much time

26 Jun 2023

Hi,
Comments below.
On Fri, Jun 23, 2023 at 3:10 AM 高海源(码源) haiyuan.ghy@alibaba-inc.com wrote:
...
Hi Jens, Olivier,
I'm a colleague of Yuye, and we are working on OP-TEE benchmarking and
performance optimization.
We are using optee_os/optee_client/optee_examples/optee_benchmark
version 3.20.0.
We think the host API TEEC_InvokeCommand() is latency-sensitive; it
corresponds to what TEE frameworks such as OpenEnclave call an ECALL.
So we removed all prints from TA_InvokeCommandEntryPoint() in
hello_world_ta.c and had the host call TEEC_InvokeCommand() 10k times
to calculate the average time cost.
We tested on an Armv9 data-center server, and the average time is
about 20 us.
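A minimal sketch of the kind of measurement described above (illustrative
only, not the code used in this thread). The names call_fn, average_call_ns
and count_call are hypothetical; in the real test the callback would wrap
TEEC_InvokeCommand() on an already-open session, which is assumed here and
not shown.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#endif

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the call under test. In the setup described
 * above it would wrap TEEC_InvokeCommand() on an open session. */
typedef int (*call_fn)(void *arg);

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call fn(arg) `iterations` times and store the mean latency in *avg_ns.
 * Returns 0 on success, -1 as soon as one call reports an error. */
static int average_call_ns(call_fn fn, void *arg, uint32_t iterations,
                           uint64_t *avg_ns)
{
    assert(fn && avg_ns && iterations > 0);

    uint64_t start = now_ns();

    for (uint32_t i = 0; i < iterations; i++) {
        if (fn(arg) != 0)
            return -1;
    }
    *avg_ns = (now_ns() - start) / iterations;
    return 0;
}

/* Trivial callee used only to exercise the harness. */
static int count_call(void *arg)
{
    (*(uint32_t *)arg)++;
    return 0;
}
```

Calling average_call_ns(count_call, &counter, 10000, &avg) runs the loop
10k times. Timing the whole loop rather than each call keeps the clock
overhead out of the per-call figure, at the cost of hiding outliers.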
After using the optee_benchmark tool, we found that the optee_os
function user_ta_enter() takes most of the time.
Specifically, user_ta_enter() calls ts_push_current_session() and
ts_pop_current_session(), each of which calls vm_set_ctx(). The two
vm_set_ctx() calls take 7 us and 2 us respectively, a large part of the
20 us delay.
I added logs to print some pointers; the following is the log from
update_current_ctx().
[JW] That's interesting.
...
I found that many InvokeCommand calls repeat the following pattern:
E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0
I noticed that many InvokeCommand calls use the same ctx pointer, such
as 0x8940b5d28.
[JW] I guess that's because you're calling the same TA repeatedly.
...
My questions are:
Can repeated InvokeCommand calls avoid setting tsd->ctx to a non-NULL
ctx and then back to a NULL ctx, to remove the time cost of vm_set_ctx()?
Is there any optimization, such as letting many InvokeCommand calls in
the same session reuse the same non-NULL ctx, to reduce the average delay
of ECALLs?
[JW] Does it make any difference if you compile OP-TEE with
CFG_CORE_PREALLOC_EL0_TBLS=y?
The next step (with CFG_CORE_PREALLOC_EL0_TBLS=y) is to see which part of
vm_set_ctx() we should try to optimize or if we should try to find a way of
delaying the call to vm_set_ctx().
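To make the idea of delaying the call concrete, here is a toy model
(illustrative only, not optee_os code). All types and names below
(toy_ctx, toy_core, eager_set_ctx, lazy_set_ctx, simulate) are
hypothetical and do not mirror the real structures. The model only counts
how often an expensive remap would run, and it ignores whether leaving a
TA mapping installed after the session is popped is acceptable in the
real core.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical types, not the optee_os structures. */
struct toy_ctx {
    int id;
};

struct toy_core {
    const struct toy_ctx *current; /* plays the role of tsd->ctx */
    const struct toy_ctx *mapped;  /* mapping currently installed */
    unsigned int remaps;           /* number of expensive remaps */
};

typedef void (*set_ctx_fn)(struct toy_core *, const struct toy_ctx *);

/* Stand-in for the costly part of vm_set_ctx(). */
static void toy_remap(struct toy_core *core, const struct toy_ctx *ctx)
{
    core->mapped = ctx;
    core->remaps++;
}

/* Eager policy: every push and every pop remaps, as in the log above. */
static void eager_set_ctx(struct toy_core *core, const struct toy_ctx *ctx)
{
    core->current = ctx;
    toy_remap(core, ctx);
}

/* Lazy policy: a pop (ctx == NULL) leaves the old mapping installed;
 * a remap happens only when a different non-NULL ctx is pushed. */
static void lazy_set_ctx(struct toy_core *core, const struct toy_ctx *ctx)
{
    core->current = ctx;
    if (ctx && ctx != core->mapped)
        toy_remap(core, ctx);
}

/* Simulate `invokes` commands on one TA and return the remap count. */
static unsigned int simulate(set_ctx_fn set_ctx, unsigned int invokes)
{
    static const struct toy_ctx ta = { 1 };
    struct toy_core core = { NULL, NULL, 0 };

    assert(set_ctx != NULL);
    for (unsigned int i = 0; i < invokes; i++) {
        set_ctx(&core, &ta);  /* like ts_push_current_session() */
        set_ctx(&core, NULL); /* like ts_pop_current_session() */
    }
    return core.remaps;
}
```

With 10k invocations of the same TA, the eager policy remaps twice per
invocation, while the lazy policy remaps only on the first push.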
Cheers,
Jens
...
Thanks. Best wishes.

From: Olivier Deprez Olivier.Deprez@arm.com
Date: 2023-05-30 21:07:20
To: Jens Wiklander jens.wiklander@linaro.org; 梅建强(禹夜) <
meijianqiang.mjq@alibaba-inc.com>
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium <
hafnium@lists.trustedfirmware.org>; 黄明 hm281385@alibaba-inc.com
Subject: Re: optee_benchmark pmccfiltr_el0
Hi Yuye,
In general the consensus is that PMU cycle and event counting in EL3 and
the secure world has to be disabled. I gather this is to avoid probing the
timing of crypto algorithms, or leveraging cache-based timing side channels
(e.g. Spectre).
See Arm ARM D11.5.3 Prohibiting event and cycle counting
Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/arc...
See also
https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardenin...
Note there are various knobs depending on the implemented architecture
extensions: FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2.
You could try to permit cycle counting in the secure world for the sake of
a one shot experiment,
but note that this has never been tried, and this should probably not be
productized as things stand.
Regards,
Olivier.

*From:* 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
*Sent:* 30 May 2023 10:38
*To:* Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez <
Olivier.Deprez@arm.com>
*Cc:* op-tee op-tee@lists.trustedfirmware.org; hafnium <
hafnium@lists.trustedfirmware.org>; 黄明(连一) hm281385@alibaba-inc.com
*Subject:* optee_benchmark pmccfiltr_el0
Hi,
It is confirmed that the problem is related to the pmu register
configuration and that pmccntr_el0 can be read at any exception level.
Regards,
Yuye.

From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Sent: Friday, 26 May 2023 16:59
To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez <
olivier.deprez@arm.com>
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium <
hafnium@lists.trustedfirmware.org>; 黄明(连一) hm281385@alibaba-inc.com
Subject: optee_benchmark pmccfiltr_el0
Hi, Jens, Olivier,
In a setup where OP-TEE runs at S-EL1 and Hafnium runs at S-EL2, we want
to run the benchmark by executing the following command in the
optee_benchmark directory:
./out/benchmark ../optee_examples/out/ca/optee_example_hello_world
After entering the benchmark PTA, the bm_timestamp function attempts
to read the pmccfiltr_el0 register.
During cold boot, the following code is executed as part of Hafnium
initialization:
vm->arch.trapped_features |= HF_FEATURE_PERFMON;
This prevents the secondary VM from accessing the performance counter
registers.
After we removed this code, the bm_timestamp function can read
pmccfiltr_el0 without trapping into Hafnium.
But the value of pmccfiltr_el0 remains unchanged; the counter does not
advance.
We tried reading the register in Hafnium and found that it did not
change there either.
In contrast, in the normal world, pmccfiltr_el0 counts normally.
Is this related to the PMU register configuration, or does S-EL1 not
support counting with pmccfiltr_el0 at present?
Thanks for the support.
Regards,
Yuye.
