Hi Jens, Olivier,

I'm a colleague of Yuye, and we are working on OP-TEE benchmarking and performance optimization. We are using optee_os, optee_client, optee_examples and optee_benchmark version 3.20.0.

We think the host API TEEC_InvokeCommand() is latency-sensitive; it corresponds to what TEE frameworks such as OpenEnclave call an ECALL. So we removed all prints from TA_InvokeCommandEntryPoint() in hello_world_ta.c and had the host call TEEC_InvokeCommand() 10k times to calculate the average time cost. We tested on an Armv9 data-center server and measured an average of about 20 us per call.

Using the optee_benchmark tool, we found that the optee_os function user_ta_enter() takes most of the time. Specifically, user_ta_enter() calls ts_push_current_session() and ts_pop_current_session(), and vm_set_ctx() ends up being called twice, taking 7 us and 2 us respectively, which accounts for much of the 20 us delay.

I added logs in update_current_ctx() to print some pointers. Many InvokeCommand calls repeat the following pattern:

  E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
  E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
  E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
  E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0

I noticed that many InvokeCommand calls use the same ctx pointer, such as 0x8940b5d28.

My questions are:

Can repeated InvokeCommand calls avoid setting tsd->ctx to the non-NULL ctx and then back to the NULL ctx, so that the time cost of vm_set_ctx() is removed?

Is there any optimization, such as letting InvokeCommand calls in the same session reuse the same non-NULL ctx, that would reduce the average ECALL latency?

Thanks.
Best wishes.
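
For reference, the host-side measurement described above (call the API 10k times, then divide the elapsed time by the call count) could be sketched roughly as follows. This is only an illustration under assumptions, not the code that was actually run: `bench_fn`, `now_ns()`, `average_latency_ns()` and `dummy_call()` are hypothetical names, and `dummy_call()` stands in for the real TEEC_InvokeCommand() so that the sketch builds without libteec.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical callback type: wraps one call under test, for example a
 * single TEEC_InvokeCommand() on an already opened session.
 * Returns 0 on success, non-zero on failure.
 */
typedef int (*bench_fn)(void *arg);

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Call fn(arg) `iterations` times back to back and return the mean
 * latency per call in nanoseconds, or -1.0 if any call fails.
 */
static double average_latency_ns(bench_fn fn, void *arg,
				 unsigned int iterations)
{
	uint64_t start, end;
	unsigned int i;

	assert(fn && iterations > 0);

	start = now_ns();
	for (i = 0; i < iterations; i++) {
		if (fn(arg) != 0)
			return -1.0;
	}
	end = now_ns();

	return (double)(end - start) / (double)iterations;
}

/* Stand-in for the real call: only counts how often it was invoked. */
static int dummy_call(void *arg)
{
	unsigned int *counter = (unsigned int *)arg;

	(*counter)++;
	return 0;
}
```

In a real host application the callback would wrap TEEC_InvokeCommand(&session, command_id, &operation, &err_origin) and return non-zero when the result is not TEEC_SUCCESS. The session is opened once outside the timed loop, so only the invoke path is measured.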
------------------------------------------------------------------
From: Olivier Deprez Olivier.Deprez@arm.com
Date: 2023-05-30 21:07:20
To: Jens Wiklander jens.wiklander@linaro.org; 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明 hm281385@alibaba-inc.com
Subject: Re: optee_benchmark pmccfiltr_el0

Hi Yuye,

In general, the consensus is that PMU cycle and event counting in EL3 and the secure world has to be disabled. I gather this is to prevent probing the timing of crypto algorithms or leveraging cache-based timing side channels (e.g. Spectre).

See Arm ARM D11.5.3, "Prohibiting event and cycle counting".

Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/arc...

See also:
https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardenin...

Note that there are various knobs depending on the implemented architecture extensions: FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2.

You could try to permit cycle counting in the secure world for the sake of a one-shot experiment, but note that this has never been tried, and it should probably not be productized as things stand.

Regards,
Olivier.

From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Sent: 30 May 2023 10:38
To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez Olivier.Deprez@arm.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明(连一) hm281385@alibaba-inc.com
Subject: optee_benchmark pmccfiltr_el0

Hi,

We have confirmed that the problem is related to the PMU register configuration, and that pmccntr_el0 can be read at any exception level.

Regards,
Yuye.
------------------------------------------------------------------
From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Sent: Friday, 26 May 2023 16:59
To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez olivier.deprez@arm.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明(连一) hm281385@alibaba-inc.com
Subject: optee_benchmark pmccfiltr_el0

Hi Jens, Olivier,

With OP-TEE running at S-EL1 and Hafnium running at S-EL2, we want to run the benchmark by executing the following command in the optee_benchmark directory:

  ./out/benchmark ../optee_examples/out/ca/optee_example_hello_world

After entering the benchmark PTA, the bm_timestamp function attempts to read the pmccfiltr_el0 register. On cold boot, the following code is executed during Hafnium initialization:

  vm->arch.trapped_features |= HF_FEATURE_PERFMON;

This prevents the secondary VM from accessing the performance counter registers. After we removed this code, the bm_timestamp function could read pmccfiltr_el0 without trapping into Hafnium, but the value of pmccfiltr_el0 stayed unchanged, i.e. it did not count. We also tried reading the register in Hafnium and found that it did not change there either. In contrast, pmccfiltr_el0 counts normally in the normal world.

Is this related to the PMU register configuration, or does S-EL1 not support pmccfiltr_el0 counting at present?

Thanks for the support.

Regards,
Yuye.