Hi, Jens, Olivier:
I'm a colleague of Yuye, and we are doing optee benchmark and performance optimization.
We are using optee_os/optee_client/optee_examples/optee_benchmark version 3.20.0
We think the host API TEEC_InvokeCommand() is sensitive of delay, namely TEE framework like OpenEnclave called ECALL.
So we remove all prints in hello_world_ta.c 's TA_InvokeCommandEntryPoint(), host test TEEC_InvokeCommand() 10k times to calculate average time cost.
We tested on an ArmV9 data-center server, and we got average time is about 20 us.
After using optee_benchmark tool, we found optee_os function user_ta_enter() took most of time.
Specially, user_ta_enter() calls ts_push_current_session() and ts_pop_current_session(), then calls vm_set_ctx() twice, which spend 7 us and 2 us respectively, mostly of 20 us delay.
I added logs to print some pointers, following is the log in update_current_ctx()
I found that many InvokeCommand repeat following pattern
E/TC:??? 000 user_ta_enter:155 ====== session 0x8940b5d80 func 2 cmd 4, call ts_push_current_session
E/TC:??? 000 update_current_ctx:29 session 0x8940b5d80 tsd 0x89008dd38 tsd->ctx 0x0 ctx 0x8940b5d28
E/TC:??? 000 user_ta_enter:199 ====== session 0x8940b5d80 func 2 cmd 4, call ts_pop_current_session
E/TC:??? 000 update_current_ctx:29 session 0x0 tsd 0x89008dd38 tsd->ctx 0x8940b5d28 ctx 0x0
I noticed many InvokeCommand use the same ctx pointer like 0x8940b5d28
My question is:
Can many InvokeCommand avoid setting tsd->ctx to Non-NULL ctx, then NULL ctx, to remove the vm_set_ctx() cost of time?
Is there any optimization method, likely many InvokeCommand in the same session reuse the same Non-NULL ctx, to reduce average delay of ECALLs?
Thanks. Best wishes.
------------------------------------------------------------------
发件人:Olivier Deprez<Olivier.Deprez(a)arm.com>
日 期:2023年05月30日 21:07:20
收件人:Jens Wiklander<jens.wiklander(a)linaro.org>; 梅建强(禹夜)<meijianqiang.mjq(a)alibaba-inc.com>
抄 送:op-tee<op-tee(a)lists.trustedfirmware.org>; hafnium<hafnium(a)lists.trustedfirmware.org>; 黄明<hm281385(a)alibaba-inc.com>
主 题:Re: optee_benchmark pmccfiltr_el0
Hi Yuye,
In general the consensus is that PMU cycle and event counting in EL3 & secure world has to be disabled. I gather this is to avoid probing crypto algorithms timings, or leverage cache-based timing side channels (e.g. spectre).
See Arm ARM D11.5.3 Prohibiting event and cycle counting
Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/ar… <https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/ar… >
See also https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardeni… <https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardeni… >
Note there are various knobs depending on implemented architecture extensions FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2
You could try to permit cycle counting in the secure world for the sake of a one shot experiment,
but note that this has never been tried, and this should probably not be productized as things stand.
Regards,
Olivier.
From: 梅建强(禹夜) <meijianqiang.mjq(a)alibaba-inc.com>
Sent: 30 May 2023 10:38
To: Jens Wiklander <jens.wiklander(a)linaro.org>; Olivier Deprez <Olivier.Deprez(a)arm.com>
Cc: op-tee <op-tee(a)lists.trustedfirmware.org>; hafnium <hafnium(a)lists.trustedfirmware.org>; 黄明(连一) <hm281385(a)alibaba-inc.com>
Subject: optee_benchmark pmccfiltr_el0
Hi,
It is confirmed that the problem is related to the pmu register configuration and that pmccntr_el0 can be read at any exception level.
Regards,
Yuye.
------------------------------------------------------------------
发件人:梅建强(禹夜) <meijianqiang.mjq(a)alibaba-inc.com>
发送时间:2023年5月26日(星期五) 16:59
收件人:Jens Wiklander <jens.wiklander(a)linaro.org>; Olivier Deprez <olivier.deprez(a)arm.com>
抄 送:op-tee <op-tee(a)lists.trustedfirmware.org>; hafnium <hafnium(a)lists.trustedfirmware.org>; 黄明(连一) <hm281385(a)alibaba-inc.com>
主 题:optee_benchmark pmccfiltr_el0
Hi, Jens, Olivier,
In case of that optee runs at sel1 and hafnium runs at sel2, we want to test benchmark by executing the following command at optee_benchmark path:
./out/benchmark ../optee_examples/out/ca/optee_example_hello_world
After entering into the benchmark pta, the bm_timestamp function attempts to read the pmccfiltr_el0 register.
In cold boot, the following code will be executed during hafnium initialization:
vm->arch.trapped_features |= HF_FEATURE_PERFMON;
This will prevent the secondary vm from accessing the performance counter registers.
We remove the code, the bm_timestamp function can read pmccfiltr_el0 without trapping into hafnium.
But the value of pmccfiltr_el0 remains unchanged and cannot be counted.
We tried to read the register in hafnium and found that there was no change either.
In contrast, in the normal world, pmccfiltr_el0 counts normally.
Is it related to the pmu register configuration or does sel1 not support the pmccfiltr_el0 count at present?
Thanks for the support.
Regards,
Yuye.
Hi,
I'm testing with Hafnium as SPMC at S-EL2 and OP-TEE as an SP at S-EL1 on
QEMU v7.0.0. I've run into a few problems and fixed most of them.
I believe the setup is similar to what Shiju is using in this mail thread
https://lists.trustedfirmware.org/archives/list/hafnium@lists.trustedfirmwa…
My setup can be duplicated with:
repo init -u https://github.com/jenswi-linaro/manifest.git -m qemu_v8.xml \
-b qemu_sel2
repo sync -j8
(cd hafnium && git submodule init && git submodule update)
cd build
make -j8 toolchains
make -j8 all
make run-only
With this xtest -x 1034 passes, xtest 1034 often causes
ERROR: Data abort: pc=0xe1198b8, esr=0x96000006, ec=0x25, far=0x9c
Panic: EL2 exception
Xtest runs dreadfully slow, I haven't investigated why yet, but at
least it works.
This is based on patches provided by Olivier at:
[1] https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/16412/2
[2] https://review.trustedfirmware.org/c/hafnium/hafnium/+/16323/7
I've also encountered the problem cache maintenance problem Shiju
described in the mail thread above:
NOTICE: Trapped access to system register write: op0=1, op1=0, crn=7,
crm=14, op2=2, rt=11.
It can be worked around by compiling OP-TEE with
CFG_CORE_WORKAROUND_NSITR_CACHE_PRIME=n. I'm pretty sure we do dcache
clean+inv elsewhere so I'm surprised it fails here. Is Hafnium expected
to block dcache clean+inv?
For Hafnium I've added two patches on top of [2], available at
https://github.com/jenswi-linaro/hafnium/tree/qemu_sel2:
- 79b4d2cbe06e SPMC: add missing ME initialization for secondary cores
- 659c79d5eacf feat(mm): fix FEAT_LPA workaround
For TF-A I've added a few patches on top of [1], available at
https://github.com/jenswi-linaro/arm-trusted-firmware/tree/qemu_sel2:
- a040396cae9e feat(qemu): add tos-fw-config for sel2 spmc
- 4f7d91723485 fix(qemu): change TOS_FW_CONFIG_NAME value
- fbfc9a222c7f spmd_smc_handler() add s/ns state to SMC traces
- ca65081b9cdc feat(sptool): add dependency to SP image
- b1e1b46a0680 fix(qemu): restore code to added needed psci nodes
For OP-TEE I've also added a few patches, available at
https://github.com/jenswi-linaro/optee_os/tree/qemu_sel2:
- 1057def23777 plat-vexpress: sel2 spmc: update for hafnium
- f18a54ed3524 core: ffa: use hvc instead of smc with S-EL2
- d18bbc92f7c1 core: mobj_ffa_add_pages_at() trust addresses from SPMC
There's also one patch for QEMU on top of v7.0.0, available at:
https://github.com/jenswi-linaro/qemu/tree/qemu_sel2
- 0c1e39672dcb Read PS bits from VTCR_EL2
The QEMU problem is fixed in v.7.1.0, but I can't get that version of
QEMU to work with TF-A. I guess it's because of yet another new CPU
feature since I'm running with "-cpu max".
I'll try to upstream the Hafnium and TF-A patches that make sense on
their own.
What's the plan with the interrupt controller?
How will OP-TEE be able to handle secure interrupts?
The hafnium git pulls in a few git submodules and even the source code
for a Linux kernel.
I guess this is useful in your internal CI setup, but when used
isolated as in my setup it makes no sense at all.
It would also be nice to be able to build with an external toolchain.
I hope this is a temporary situation, I don't see why Hafnium should
be pickier about toolchain than for instance TF-A.
Speaking of building, I haven't been able to figure out how to build
only for the QEMU variant I need so right now I'm building for
everything and that takes a bit longer than necessary.
I'm going to maintain the setup above as long as it's relevant to me. I may
add more patches on the branches or even rebase as needed. So if anyone is
using this, keep in mind that my branches may change without warning.
Thanks,
Jens
Hi, Olivier,
That makes it clearer to me. Thanks a lot.
Regards,
Yuye.
------------------------------------------------------------------
发件人:Olivier Deprez <Olivier.Deprez(a)arm.com>
发送时间:2023年5月30日(星期二) 21:07
收件人:Jens Wiklander <jens.wiklander(a)linaro.org>; 梅建强(禹夜) <meijianqiang.mjq(a)alibaba-inc.com>
抄 送:op-tee <op-tee(a)lists.trustedfirmware.org>; hafnium <hafnium(a)lists.trustedfirmware.org>; 黄明(连一) <hm281385(a)alibaba-inc.com>
主 题:Re: optee_benchmark pmccfiltr_el0
Hi Yuye,
In general the consensus is that PMU cycle and event counting in EL3 & secure world has to be disabled. I gather this is to avoid probing crypto algorithms timings, or leverage cache-based timing side channels (e.g. spectre).
See Arm ARM D11.5.3 Prohibiting event and cycle counting
Cycle and event counting is disabled by:
https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/ar… <https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/include/ar… >
See also https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardeni… <https://trustedfirmware-a.readthedocs.io/en/latest/process/security-hardeni… >
Note there are various knobs depending on implemented architecture extensions FEAT_PMUv3 / FEAT_PMUv3pX / FEAT_Debugv8p2
You could try to permit cycle counting in the secure world for the sake of a one shot experiment,
but note that this has never been tried, and this should probably not be productized as things stand.
Regards,
Olivier.
From: 梅建强(禹夜) <meijianqiang.mjq(a)alibaba-inc.com>
Sent: 30 May 2023 10:38
To: Jens Wiklander <jens.wiklander(a)linaro.org>; Olivier Deprez <Olivier.Deprez(a)arm.com>
Cc: op-tee <op-tee(a)lists.trustedfirmware.org>; hafnium <hafnium(a)lists.trustedfirmware.org>; 黄明(连一) <hm281385(a)alibaba-inc.com>
Subject: optee_benchmark pmccfiltr_el0
Hi,
It is confirmed that the problem is related to the pmu register configuration and that pmccntr_el0 can be read at any exception level.
Regards,
Yuye.
------------------------------------------------------------------
发件人:梅建强(禹夜) <meijianqiang.mjq(a)alibaba-inc.com>
发送时间:2023年5月26日(星期五) 16:59
收件人:Jens Wiklander <jens.wiklander(a)linaro.org>; Olivier Deprez <olivier.deprez(a)arm.com>
抄 送:op-tee <op-tee(a)lists.trustedfirmware.org>; hafnium <hafnium(a)lists.trustedfirmware.org>; 黄明(连一) <hm281385(a)alibaba-inc.com>
主 题:optee_benchmark pmccfiltr_el0
Hi, Jens, Olivier,
In case of that optee runs at sel1 and hafnium runs at sel2, we want to test benchmark by executing the following command at optee_benchmark path:
./out/benchmark ../optee_examples/out/ca/optee_example_hello_world
After entering into the benchmark pta, the bm_timestamp function attempts to read the pmccfiltr_el0 register.
In cold boot, the following code will be executed during hafnium initialization:
vm->arch.trapped_features |= HF_FEATURE_PERFMON;
This will prevent the secondary vm from accessing the performance counter registers.
We remove the code, the bm_timestamp function can read pmccfiltr_el0 without trapping into hafnium.
But the value of pmccfiltr_el0 remains unchanged and cannot be counted.
We tried to read the register in hafnium and found that there was no change either.
In contrast, in the normal world, pmccfiltr_el0 counts normally.
Is it related to the pmu register configuration or does sel1 not support the pmccfiltr_el0 count at present?
Thanks for the support.
Regards,
Yuye.