Hi Jens, Olivier,

I'm a colleague of Yuye, and we are working on OP-TEE benchmarking and performance optimization. We are using optee_os/optee_client/optee_examples/optee_benchmark version 3.20.0.

We consider the host API TEEC_InvokeCommand() to be latency-sensitive; it corresponds to what TEE frameworks such as OpenEnclave call an ECALL. So we removed all prints from TA_InvokeCommandEntryPoint() in hello_world_ta.c and had the host call TEEC_InvokeCommand() 10k times to calculate the average cost. On an Armv9 data-center server the average is about 20 us.

Using the optee_benchmark tool, we found that the optee_os function user_ta_enter() takes most of that time. Specifically, user_ta_enter() calls ts_push_current_session() and ts_pop_current_session(), which end up calling vm_set_ctx() twice; those two calls take about 7 us and 2 us respectively, a large share of the 20 us.

I added logging to print some pointers in update_current_ctx() and found that many InvokeCommand calls repeat the following pattern:

E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0

Many InvokeCommand calls use the same ctx pointer, here 0x8940b5d28.

My question: can repeated InvokeCommand calls avoid setting tsd->ctx to a non-NULL ctx and then back to NULL, so that the vm_set_ctx() cost goes away? Is there any way for InvokeCommand calls in the same session to reuse the same non-NULL ctx, to reduce the average latency over 10k calls?

Thanks. Best wishes.
------------------------------------------------------------------
From: Olivier Deprez Olivier.Deprez@arm.com
Date: 2023-05-30 21:07:20
To: Jens Wiklander jens.wiklander@linaro.org; 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明 hm281385@alibaba-inc.com
Subject: Re: optee_benchmark pmccfiltr_el0

Hi Yuye,

In general the consensus is that PMU cycle and event counting in EL3 & secure world has to be disabled. I gather this is to avoid probing crypto algorithms timings, or leverage cache-based timing side channels (e.g. spectre).

See Arm ARM D11.5.3 Prohibiting event and cycle counting

Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/arc...

See also https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardenin...

Note there are various knobs depending on implemented architecture extensions FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2

You could try to permit cycle counting in the secure world for the sake of a one shot experiment, but note that this has never been tried, and this should probably not be productized as things stand.

Regards, Olivier.

From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Sent: 30 May 2023 10:38
To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez Olivier.Deprez@arm.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明(连一) hm281385@alibaba-inc.com
Subject: optee_benchmark pmccfiltr_el0

Hi,

It is confirmed that the problem is related to the pmu register configuration and that pmccntr_el0 can be read at any exception level.

Regards, Yuye.

------------------------------------------------------------------
From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com
Sent: Friday, 26 May 2023 16:59
To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez olivier.deprez@arm.com
Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明(连一) hm281385@alibaba-inc.com
Subject: optee_benchmark pmccfiltr_el0

Hi, Jens, Olivier,

With optee running at sel1 and hafnium running at sel2, we want to run the benchmark by executing the following command in the optee_benchmark path:

./out/benchmark ../optee_examples/out/ca/optee_example_hello_world

After entering the benchmark pta, the bm_timestamp function attempts to read the pmccfiltr_el0 register. On cold boot, the following code is executed during hafnium initialization:

vm->arch.trapped_features |= HF_FEATURE_PERFMON;

This prevents the secondary vm from accessing the performance counter registers. After we removed that code, the bm_timestamp function can read pmccfiltr_el0 without trapping into hafnium, but the value of pmccfiltr_el0 stays unchanged, so nothing is counted. We tried reading the register in hafnium and found that it did not change there either. In contrast, in the normal world, pmccfiltr_el0 counts normally.

Is this related to the pmu register configuration, or does sel1 not support pmccfiltr_el0 counting at present?

Thanks for the support.

Regards, Yuye.
Hi, experts:

I am working on OP-TEE benchmarking and performance optimization. I use optee_os/optee_client/optee_examples/optee_benchmark version 3.20.0.

I consider the host API TEEC_InvokeCommand() to be latency-sensitive; it corresponds to what TEE frameworks such as OpenEnclave call an ECALL. So I removed all prints from TA_InvokeCommandEntryPoint() in hello_world_ta.c and had the host call TEEC_InvokeCommand() 10k times to calculate the average cost. On an Armv9 data-center server the average is about 20 us.

Using the optee_benchmark tool, I found that the optee_os function user_ta_enter() takes most of that time. Specifically, user_ta_enter() calls ts_push_current_session() and ts_pop_current_session(), which end up calling vm_set_ctx() twice; those two calls take about 7 us and 2 us respectively, a large share of the 20 us.

I added logging to print some pointers. The following is the 'git diff' for update_current_ctx():

diff --git a/core/kernel/ts_manager.c b/core/kernel/ts_manager.c
index b2794634..9a221138 100644
--- a/core/kernel/ts_manager.c
+++ b/core/kernel/ts_manager.c
@@ -26,6 +26,7 @@ static void update_current_ctx(struct thread_specific_data *tsd)
 		ctx = s->ctx;
 	}

+	EMSG("session %p tsd %p tsd->ctx %p ctx %p", s, tsd, tsd->ctx, ctx);
 	if (tsd->ctx != ctx)
 		vm_set_ctx(ctx);
 	/*

I found that many InvokeCommand calls repeat the following pattern:

E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0

Many InvokeCommand calls use the same ctx pointer, here 0x8940b5d28.

My question: can repeated InvokeCommand calls avoid setting tsd->ctx to a non-NULL ctx and then back to NULL, so that the vm_set_ctx() cost goes away?

Is there any way for InvokeCommand calls in the same session to reuse the same non-NULL ctx, to reduce the average latency over 10k calls? Thanks.
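For reference, the averaging described above can be sketched roughly as follows. This is only a minimal host-side sketch, not the code that was actually used: invoke_fn, now_ns(), average_invoke_ns() and fake_invoke() are hypothetical names, and fake_invoke() stands in for a real TEEC_InvokeCommand() round trip so that the sketch builds without the OP-TEE client library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type: one round trip into the TA. */
typedef int (*invoke_fn)(void *arg);

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Call fn() `iterations` times back to back and return the mean
 * latency in nanoseconds, or -1.0 if any call fails.
 */
static double average_invoke_ns(invoke_fn fn, void *arg,
				unsigned int iterations)
{
	uint64_t start = 0;
	uint64_t end = 0;
	unsigned int i = 0;

	assert(fn && iterations > 0);

	start = now_ns();
	for (i = 0; i < iterations; i++) {
		if (fn(arg) != 0)
			return -1.0;
	}
	end = now_ns();

	return (double)(end - start) / (double)iterations;
}

/* Stand-in for TEEC_InvokeCommand(); it only counts the calls. */
static int fake_invoke(void *arg)
{
	unsigned int *calls = arg;

	(*calls)++;
	return 0;
}
```

In a real client, fake_invoke() would be replaced by a wrapper that calls TEEC_InvokeCommand() on an already opened session, so that session setup is not included in the measured loop.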
Hi,
Comments below.
On Fri, Jun 23, 2023 at 3:10 AM 高海源(码源) haiyuan.ghy@alibaba-inc.com wrote:
Hi, Jens, Olivier:
I'm a colleague of Yuye, and we are working on OP-TEE benchmarking and performance optimization. We are using optee_os/optee_client/optee_examples/optee_benchmark version 3.20.0.
We consider the host API TEEC_InvokeCommand() to be latency-sensitive; it corresponds to what TEE frameworks such as OpenEnclave call an ECALL. So we removed all prints from TA_InvokeCommandEntryPoint() in hello_world_ta.c and had the host call TEEC_InvokeCommand() 10k times to calculate the average cost. On an Armv9 data-center server the average is about 20 us.
Using the optee_benchmark tool, we found that the optee_os function user_ta_enter() takes most of that time. Specifically, user_ta_enter() calls ts_push_current_session() and ts_pop_current_session(), which end up calling vm_set_ctx() twice; those two calls take about 7 us and 2 us respectively, a large share of the 20 us. I added logging to print some pointers; the following is the log from update_current_ctx().
[JW] That's interesting.
I found that many InvokeCommand calls repeat the following pattern:
E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0
I noticed that many InvokeCommand calls use the same ctx pointer, here 0x8940b5d28.
[JW] I guess that's because you're calling the same TA repeatedly.
My question: can repeated InvokeCommand calls avoid setting tsd->ctx to a non-NULL ctx and then back to NULL, so that the vm_set_ctx() cost goes away? Is there any way for InvokeCommand calls in the same session to reuse the same non-NULL ctx, to reduce the average latency of ECALLs?
[JW] Does it make any difference if you compile OP-TEE with CFG_CORE_PREALLOC_EL0_TBLS=y ?
The next step (with CFG_CORE_PREALLOC_EL0_TBLS=y) is to see which part of vm_set_ctx() we should try to optimize or if we should try to find a way of delaying the call to vm_set_ctx().
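To make the "delaying the call to vm_set_ctx()" idea concrete, here is a toy model in plain C. It is only a sketch under stated assumptions: struct ctx, struct model, model_set_ctx(), eager_update(), lazy_update() and count_switches() are hypothetical names, not optee_os code, and model_set_ctx() merely counts switches instead of touching any translation tables. The lazy policy deliberately ignores whether it would be safe to leave a TA mapped after its session is popped; it only shows how many switches are at stake.

```c
#include <assert.h>
#include <stddef.h>

struct ctx {
	int id;                   /* stand-in for struct ts_ctx */
};

struct model {
	const struct ctx *active; /* context currently mapped, like tsd->ctx */
	unsigned int switches;    /* number of simulated vm_set_ctx() calls */
};

/* Stand-in for vm_set_ctx(): only records that a switch happened. */
static void model_set_ctx(struct model *m, const struct ctx *ctx)
{
	assert(m);
	m->active = ctx;
	m->switches++;
}

/* Eager policy, as in the logs: switch whenever the wanted ctx differs. */
static void eager_update(struct model *m, const struct ctx *wanted)
{
	if (m->active != wanted)
		model_set_ctx(m, wanted);
}

/* Lazy policy: a request for NULL keeps the last user ctx mapped. */
static void lazy_update(struct model *m, const struct ctx *wanted)
{
	if (wanted && m->active != wanted)
		model_set_ctx(m, wanted);
}

typedef void (*update_fn)(struct model *m, const struct ctx *wanted);

/* Simulate `n` InvokeCommand calls on one TA and count the switches. */
static unsigned int count_switches(update_fn update, unsigned int n)
{
	struct ctx ta = { 1 };
	struct model m = { NULL, 0 };
	unsigned int i = 0;

	for (i = 0; i < n; i++) {
		update(&m, &ta);  /* models ts_push_current_session() */
		update(&m, NULL); /* models ts_pop_current_session() */
	}

	return m.switches;
}
```

In this model, 10k invocations of the same TA cost 20k switches with the eager policy and a single switch with the lazy one. Whether anything like the lazy policy is acceptable in the real core is exactly the open question here.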
Cheers, Jens
Thanks. Best wishes.
From: Olivier Deprez Olivier.Deprez@arm.com Date: 2023-05-30 21:07:20 To: Jens Wiklander jens.wiklander@linaro.org; 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明 hm281385@alibaba-inc.com Subject: Re: optee_benchmark pmccfiltr_el0
Hi Yuye,
In general the consensus is that PMU cycle and event counting in EL3 & secure world has to be disabled. I gather this is to avoid probing crypto algorithms timings, or leverage cache-based timing side channels (e.g. spectre).
See Arm ARM D11.5.3 Prohibiting event and cycle counting
Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/arc...
See also https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardenin...
Note there are various knobs depending on implemented architecture extensions FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2
You could try to permit cycle counting in the secure world for the sake of a one shot experiment, but note that this has never been tried, and this should probably not be productized as things stand.
Regards, Olivier.
*From:* 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com *Sent:* 30 May 2023 10:38 *To:* Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez < Olivier.Deprez@arm.com> *Cc:* op-tee op-tee@lists.trustedfirmware.org; hafnium < hafnium@lists.trustedfirmware.org>; 黄明(连一) hm281385@alibaba-inc.com *Subject:* optee_benchmark pmccfiltr_el0
Hi,
It is confirmed that the problem is related to the pmu register configuration and that pmccntr_el0 can be read at any exception level.
Regards, Yuye.
From: 梅建强(禹夜) meijianqiang.mjq@alibaba-inc.com Sent: Friday, 26 May 2023 16:59 To: Jens Wiklander jens.wiklander@linaro.org; Olivier Deprez olivier.deprez@arm.com Cc: op-tee op-tee@lists.trustedfirmware.org; hafnium hafnium@lists.trustedfirmware.org; 黄明(连一) hm281385@alibaba-inc.com Subject: optee_benchmark pmccfiltr_el0
Hi, Jens, Olivier,
With optee running at sel1 and hafnium running at sel2, we want to run the benchmark by executing the following command in the optee_benchmark path:
./out/benchmark ../optee_examples/out/ca/optee_example_hello_world
After entering the benchmark pta, the bm_timestamp function attempts to read the pmccfiltr_el0 register. On cold boot, the following code is executed during hafnium initialization:
vm->arch.trapped_features |= HF_FEATURE_PERFMON;
This prevents the secondary vm from accessing the performance counter registers. After we removed that code, the bm_timestamp function can read pmccfiltr_el0 without trapping into hafnium, but the value of pmccfiltr_el0 stays unchanged, so nothing is counted. We tried reading the register in hafnium and found that it did not change there either. In contrast, in the normal world, pmccfiltr_el0 counts normally.
Is this related to the pmu register configuration, or does sel1 not support pmccfiltr_el0 counting at present?
Thanks for the support.
Regards, Yuye.
Hi, Jens, I compiled optee_os with CFG_CORE_PREALLOC_EL0_TBLS=y. Using the same optee_example_hello_world CA and TA, I ran InvokeCommand 10K times. The average time was 17 us, somewhat less than the previous 20 us.
Any suggestion? Thanks for help.
Hi,
On Mon, Jun 26, 2023 at 10:39 AM haiyuan.ghy--- via OP-TEE op-tee@lists.trustedfirmware.org wrote:
Hi, Jens, I compiled optee_os with CFG_CORE_PREALLOC_EL0_TBLS=y. Using the same optee_example_hello_world CA and TA, I ran InvokeCommand 10K times. The average time was 17 us, somewhat less than the previous 20 us.
A small improvement, at least it didn't make things worse. :-)
Any suggestion? Thanks for help.
With CFG_CORE_PREALLOC_EL0_TBLS=y each TA has preallocated translation tables that don't need to be completely reinitialized each time. I guess some of the saved time above is from core_mmu_populate_user_map() taking a bit less time. It would be interesting to know how the time is spent by vm_set_ctx(), assuming that's still the main problem.
There are also the calls to tlbi_all() and icache_inv_all() in core_mmu_set_user_map() that might be a bit brutal. If you're only calling the same TA repeatedly it could be interesting to see how much time can be saved by skipping those two calls. Then we know if it's worth trying to optimize that part.
Cheers, Jens
Hi Jens,

I tried CFG_CORE_PREALLOC_EL0_TBLS=y, along with removing the tlbi_all() and icache_inv_all() calls in core_mmu_set_user_map(), inside the "#ifdef ARM64 ... #endif" block. With this optee_os build I ran the same 10K InvokeCommand test for 3 rounds.

In the 1st round, the average time over 10K InvokeCommand calls was 13 us. In the 2nd round, the average was 14 us. In the 3rd round, the test CA hung and then my server rebooted :( The following panic log appeared on the UART console:

E/TC:019 000 Core data-abort at address 0x4009c000 (write permission fault)
E/TC:019 000 esr 0x9600004f ttbr0 0x20008940db000 ttbr1 0x00000000 cidr 0x0
E/TC:019 000 cpu #19 cpsr 0x42000004
E/TC:019 000 x0 000000004009c000 x1 00000008a0018148
E/TC:019 000 x2 00000000000005cc x3 000000004009c000
E/TC:019 000 x4 00000008a0018168 x5 000000004009c5c0
E/TC:019 000 x6 0000000000000000 x7 00000000000005c0
E/TC:019 000 x8 0000000894aab9b8 x9 000000000585e451
E/TC:019 000 x10 00000000bc273022 x11 00000008900645c8
E/TC:019 000 x12 00000000c3d70b77 x13 00000000d0798a3e
E/TC:019 000 x14 0000000000000000 x15 4604d0c2c6d35ee5
E/TC:019 000 x16 000000089001629c x17 0000000000000000
E/TC:019 000 x18 0000000000000000 x19 000000089409ff10
E/TC:019 000 x20 00000000000005cc x21 00000008a0018148
E/TC:019 000 x22 000000004009c000 x23 00000000000005cc
E/TC:019 000 x24 00000000000165cc x25 0000000000016714
E/TC:019 000 x26 0000000000016000 x27 0000000000000000
E/TC:019 000 x28 0000000000000000 x29 0000000894aabb20
E/TC:019 000 x30 000000089001ae4c elr 0000000890007928
E/TC:019 000 sp_el0 0000000894aabb20
E/TC:019 000 TEE load address @ 0x890004000
E/TC:019 000 Call stack:
E/TC:019 000 0x890007928
E/TC:019 000 0x890015a0c
E/TC:019 000 0x89000e82c
E/TC:019 000 0x890009fa8
E/TC:019 000 0x890008d38
E/TC:019 000 Panic 'unhandled pageable abort' at core/arch/arm/kernel/abort.c:572 <abort_handler>
E/TC:019 000 TEE load address @ 0x890004000
E/TC:019 000 Call stack:
E/TC:019 000 0x89000c254
E/TC:019 000 0x8900165ec
E/TC:019 000 0x89000ba70
E/TC:019 000 0x890008e68

I guess removing tlbi_all() and icache_inv_all() can save about 17 - 13 = 4 us, but it causes problems.
Hi,
On Thu, Jun 29, 2023 at 8:55 AM haiyuan.ghy--- via OP-TEE op-tee@lists.trustedfirmware.org wrote:
I guess removing tlbi_all() and icache_inv_all() can save about 17 - 13 = 4 us, but it causes problems.
Yes, but it might be worth trying to do more precise TLB and i-cache invalidates since it has a noticeable impact on the time spent.
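As an illustration of what "more precise" could mean for the TLB part, the toy model below contrasts a full invalidate with one limited to a single ASID. It is a sketch only: struct tlb_model and the model_* functions are hypothetical and do not touch real hardware or optee_os, and the thread does not establish that a per-ASID invalidate alone would be sufficient in core_mmu_set_user_map().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 8

struct tlb_entry {
	bool valid;
	uint16_t asid; /* address space ID that owns the entry */
	uint64_t va;   /* virtual address cached by the entry */
};

struct tlb_model {
	struct tlb_entry e[TLB_ENTRIES];
};

/* Fill the model: even slots belong to ASID 1, odd slots to ASID 2. */
static void model_fill(struct tlb_model *t)
{
	size_t i = 0;

	assert(t);
	for (i = 0; i < TLB_ENTRIES; i++) {
		t->e[i].valid = true;
		t->e[i].asid = (i % 2) ? 2 : 1;
		t->e[i].va = 0x1000 * (uint64_t)i;
	}
}

/* Model of a full invalidate: every valid entry is dropped. */
static size_t model_tlbi_all(struct tlb_model *t)
{
	size_t dropped = 0;
	size_t i = 0;

	for (i = 0; i < TLB_ENTRIES; i++) {
		if (t->e[i].valid) {
			t->e[i].valid = false;
			dropped++;
		}
	}

	return dropped;
}

/* Model of a per-ASID invalidate: only entries of `asid` are dropped. */
static size_t model_tlbi_asid(struct tlb_model *t, uint16_t asid)
{
	size_t dropped = 0;
	size_t i = 0;

	for (i = 0; i < TLB_ENTRIES; i++) {
		if (t->e[i].valid && t->e[i].asid == asid) {
			t->e[i].valid = false;
			dropped++;
		}
	}

	return dropped;
}

/* Number of entries that are still valid. */
static size_t model_count_valid(const struct tlb_model *t)
{
	size_t n = 0;
	size_t i = 0;

	for (i = 0; i < TLB_ENTRIES; i++)
		n += t->e[i].valid ? 1 : 0;

	return n;
}
```

The point of the model is only that a targeted invalidate leaves the entries of other address spaces in place, which is where any saving would come from.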
Cheers, Jens