Hi Raghu,
Thanks for you explaining your point.
A fix has been prepared in: https://review.trustedfirmware.org/c/hafnium/hafnium/+/13026.
It reverts from the use of telnet communication with the model from "hftest.py" and allows for the model to be reloaded in between tests.
Would you be able to have a look and let us know if the reported issue has been fixed?
Best regards, João Alves
________________________________ From: raghu.ncstate@icloud.com raghu.ncstate@icloud.com Sent: Sunday, December 5, 2021 3:12 AM To: raghu.ncstate@icloud.com raghu.ncstate@icloud.com; Joao Alves Joao.Alves@arm.com; Federico Recanati Federico.Recanati@arm.com; Olivier Deprez Olivier.Deprez@arm.com Cc: 'Raghu Krishnamurthy via Hafnium' hafnium@lists.trustedfirmware.org Subject: RE: [Hafnium] Bug in hftest.py
BTW, I was playing around with it just a little more. The problem is worse than I originally thought. I added a static variable to hafnium in one_time_init() and incremented it and the same issue happens(expectedly so)! A static variable carried over values from the previous run. Now imagine this situation:
Src/arc/aarch64/plat/ffa/spmc.c has a whole bunch of static and global variables which have values initialized, not through functions, but through declarations. If you end up having an SPMC crash, or if you have test code that has a bug or purposely does not clean up the SPMC state or release resources, you will have hard to find bugs. An example of this could be the sri_state variable, whose state could be left in the TRIGGERED state since TFTF did not handle the interrupt or call the notification API to get notification info for whatever reason. Now, because of this sri_state_set will then malfunction because the initial state is expected to be HANDLED but is actually in the TRIGGERED state from a previous test run..
Similarly, there are various static variables throughout the hypervisor code that could retain values from the previous run, if it is not in the BSS section. Cpu_count and cpus variables in src/cpu.c are concrete examples of this and with prints, you can see that cpu_count is 1 in the first test that runs but 8 in the subsequent tests. This hasn’t caused any bugs/issues but this is a slippery slope and will likely cause unexpected and hard to debug issues down the road...
Thanks Raghu
-----Original Message----- From: Hafnium hafnium-bounces@lists.trustedfirmware.org On Behalf Of Raghu Krishnamurthy via Hafnium Sent: Saturday, December 4, 2021 4:56 PM To: 'Joao Alves' Joao.Alves@arm.com; 'Federico Recanati' Federico.Recanati@arm.com; 'Olivier Deprez' Olivier.Deprez@arm.com Cc: 'Raghu Krishnamurthy via Hafnium' hafnium@lists.trustedfirmware.org Subject: Re: [Hafnium] Bug in hftest.py
Joao, Olivier, Thanks for the response.
fix first problem which is already a good progress IMO!
[RK] yes, no doubt 😊 This had to be fixed anyway.
AFAIU this is an embarrassing limitation for this specific test configuration. But this does not affect any current checked in test, right?
[RK] Correct. But the patches that I have pushed recently to enable multi-EC S-EL0 partitions will likely be affected by it. Any test that uses static variables will run into this issue. So even though it doesn’t affect any checked in test, it will likely affect tests in the very near future.
Do you think it would be ok to go through with any of these solutions?
[RK] Joao, the solution you have proposed may work for the normal world test control partition. However, secure partitions/services are also affected by this same problem and the macros are not really used in services..We could and that would need more thought.
Moreover, any new person coming in to write tests will find it unintuitive to have to manually have setup and tear down functions to initialize every static and global variables on first entry. As the tests get more complicated and the code base grows, I don’t think manual clean up/init of these variables will work very well. The optimization is fundamentally breaking a programmers expectation on behavior of static variables…
Take https://review.trustedfirmware.org/c/hafnium/hafnium/+/12663/5/test/vmapi/el... – my most recent use of static variable. I had to spend hours debugging the issue, because when I ran a subsequent test, the images were not reloaded and memory in the FVP held data values from the previous test run, due to which the secondary entry point would never be registered and the statically initialized variable would be true, even though it is initialized as false. Running this patch set with the top of tree hftest.py and should show the issue yet again.
We are effectively saying that the test partitions should be stateless and all state should be initialized from code, as opposed to from the image… that is okay for me now that I know the issue and can deal with it, but I think that is less than friendly for new folks trying to hop onboard hafnium..
I understand the reason this optimization was done, but an optimization cannot break functionality..we need to explore other ways to improve FVP test time that does not require the burden to shift to people writing code..
That is because the NWd Primary issues an PSCI call at the end of the tests to reset the system. We can't achieve the same effect with the standalone SP, and hence >>the FVP is rebooted and the binaries reloaded.
[RK] Does this hold with the latest changes that seems to be moving all secure world tests to have a normal world test driver? Not sure I follow why this limitation still applies.
Thanks
-Raghu
From: Joao Alves Joao.Alves@arm.com Sent: Monday, November 29, 2021 2:04 AM To: Federico Recanati Federico.Recanati@arm.com; raghu.ncstate@icloud.com; Olivier Deprez Olivier.Deprez@arm.com Cc: 'Raghu Krishnamurthy via Hafnium' hafnium@lists.trustedfirmware.org Subject: Re: [Hafnium] Bug in hftest.py
Hi Raghu,
I'd like to add that resetting the model instead of rebooting and reloading the binaries was intended, for the sake of improving the time it takes to run the tests with test_spmc.sh. Loading FVP takes time, and by resetting we make the test scripts a bit more usable with FVP. You can verify on your setup that when running test_spmc.sh, the tests setup with a standalone SP takes a lot longer than those with NWd Primary + SPs. That is because the NWd Primary issues an PSCI call at the end of the tests to reset the system. We can't achieve the same effect with the standalone SP, and hence the FVP is rebooted and the binaries reloaded.
Granted this created the issue you raised with static data, that we haven't detected with the tests we currently have in place.
IMO there are two ways we can avoid this, whilst maintaining the increased performance with the NWd Primary VM + SP test setup.
If the static data to be used is defined within the scope of a specific test/test suite, one can recur to the "TEAR_DOWN" macro. It performs clean-up operations after each test. https://review.trustedfirmware.org/plugins/gitiles/hafnium/hafnium/+/refs/he...
If the static data to be used is global, as defined with your repro.patch, we can create an init function to be called after a cold boot that resets the state of these static variables.
Do you think it would be ok to go through with any of these solutions?
Cheers,
João Alves
_____
From: Hafnium <hafnium-bounces@lists.trustedfirmware.org mailto:hafnium-bounces@lists.trustedfirmware.org > on behalf of Olivier Deprez via Hafnium <hafnium@lists.trustedfirmware.org mailto:hafnium@lists.trustedfirmware.org > Sent: Monday, November 29, 2021 9:13 AM To: Federico Recanati <Federico.Recanati@arm.com mailto:Federico.Recanati@arm.com >; raghu.ncstate@icloud.com mailto:raghu.ncstate@icloud.com <raghu.ncstate@icloud.com mailto:raghu.ncstate@icloud.com > Cc: 'Raghu Krishnamurthy via Hafnium' <hafnium@lists.trustedfirmware.org mailto:hafnium@lists.trustedfirmware.org > Subject: Re: [Hafnium] Bug in hftest.py
Hi Raghu,
My mistake for not making it clearer. Below thread mentions two problems with the Hypervisor+SPMC configuration: 1/ the test run stability issue 2/ the stale data left between test runs (intent for T955)
In terms of priority order, I thought it's better focusing on problem 1/ first and start from a clean sheet to permit tackling 2/. Federico's changes effectively fix first problem which is already a good progress IMO! The second problem remains which is the reason why the ticket is still open. I'm not yet too sure how to properly fix and I understand why this problem exists. AFAIU this is an embarrassing limitation for this specific test configuration. But this does not affect any current checked in test, right? I know you provided a reproducer so the issue can certainly trigger depending on how a test is designed.
Regards, Olivier.
________________________________________ From: Hafnium <hafnium-bounces@lists.trustedfirmware.org mailto:hafnium-bounces@lists.trustedfirmware.org > on behalf of Raghu Krishnamurthy via Hafnium <hafnium@lists.trustedfirmware.org mailto:hafnium@lists.trustedfirmware.org > Sent: 26 November 2021 22:45 To: Federico Recanati Cc: 'Raghu Krishnamurthy via Hafnium' Subject: Re: [Hafnium] Bug in hftest.py
Hi Federico,
Thanks. I see you patch fixes the flakiness with tests but does not fix the original problem mentioned in the first email of this thread. Were you able to try the repro patch I provided and ensure that your change fixes the problem? I don’t believe it does, since the problem I reported does not have anything to do with the telnet port, but really with NOT reloading test-binaries but re-running the tests, which causes global variables whose initial/default values have changed, to see values from previous runs...
Thanks Raghu
-----Original Message----- From: Hafnium <hafnium-bounces@lists.trustedfirmware.org mailto:hafnium-bounces@lists.trustedfirmware.org > On Behalf Of Federico Recanati via Hafnium Sent: Wednesday, November 17, 2021 6:34 AM To: hafnium@lists.trustedfirmware.org mailto:hafnium@lists.trustedfirmware.org Subject: Re: [Hafnium] Bug in hftest.py
Hi Raghu,
a fix for hftest.py has been merged: https://review.trustedfirmware.org/c/hafnium/hafnium/+/12481
addressing the random failures of both-world tests and supporting connection to telnet ports other than 5000
Cheers, Federico
____________________________________
From: raghu.ncstate@icloud.com mailto:raghu.ncstate@icloud.com <raghu.ncstate@icloud.com mailto:raghu.ncstate@icloud.com > Sent: 04 August 2021 21:01 To: Olivier Deprez; 'Raghu Krishnamurthy via Hafnium' Subject: RE: [Hafnium] Bug in hftest.py
Thanks Olivier. I've created https://developer.trustedfirmware.org/T955 to track. Understood all of this is new. I do have local fixes to get around the issue so not a hurry to have a fix merged, but something to consider and fix since it will eventually show up.
the both worlds test scenario is not 100% stable on my machine
[RK] Likewise. I've noticed that this is caused by lingering FVP processes. Usually I ps -ax | grep for FVP instances, kill and then run tests and I never see failures after that. The issue that I faced was that the lingering FVP would take up telnet ports and the newly spawned ones use different ports(>5004) than what hftest.py expects. It appears that when tests fail, we may not be cleaning up/exiting processes properly, but I haven't checked. Or the code may be just fine and a ctrl+c leaves those processes lingering.
Thanks Raghu
-----Original Message----- From: Olivier Deprez <Olivier.Deprez@arm.com mailto:Olivier.Deprez@arm.com > Sent: Tuesday, August 3, 2021 11:51 PM To: 'Raghu Krishnamurthy via Hafnium' <hafnium@lists.trustedfirmware.org mailto:hafnium@lists.trustedfirmware.org >; raghu.ncstate@icloud.com mailto:raghu.ncstate@icloud.com Subject: Re: [Hafnium] Bug in hftest.py
Hi Raghu,
Thanks for reporting. This part of the test infrastructure (testing the SPMC) is still very fresh and requires improvement iterations so please bear with us. Also a reason it's not yet part of the automated non-regression with jenkins (as opposed to the legacy kokoro/test.sh). For the time being we still mostly rely on the TF-A CI for testing on the secure side.
IIUC this change was made to help with the test time as the FVP takes long to reload on every test. But indeed it might have the side effect you describe. So either we revert the FVP reloading on every test. Or another (somewhat hackish) possibility is to clear the mentioned variables from within the test (or make them part of BSS)?
To be fair, the both worlds test scenario is not 100% stable on my machine (for some reason the connection is not always successful between the FVP and hftest) hence limiting confidence/robustness of my testing and investigations. So I wonder is the scripting is still somewhat a bit fragile.
Regards, Olivier.
From: Hafnium <hafnium-bounces@lists.trustedfirmware.org mailto:hafnium-bounces@lists.trustedfirmware.org > on behalf of Raghu Krishnamurthy via Hafnium <hafnium@lists.trustedfirmware.org mailto:hafnium@lists.trustedfirmware.org > Sent: 03 August 2021 23:47 To: 'Raghu Krishnamurthy via Hafnium' Subject: [Hafnium] Bug in hftest.py
Hi All,
Wanted to report to you that commit 18a25f9241f86ba2d637011ff465ce3869e8651b in hafnium "appears" broken. The issue with the optimization in this patch is that the partition images are not reloaded for each test run, which means a previous test could have written data to say SRAM, and the following test would use the old values from the previous test, when the same image is executed again from SRAM for a following test. This would be a problem for pretty much anything in the data section of a partition. In my case, I have a counter in the data section of my partition, which does not get reset back to its original value.
I've attached a patch to help repro the issue. Fix is to disable the optimization or somehow reload the images for each run. This affects only "both world" tests.
Let me know if I'm missing something here.
Apply patch and run timeout --foreground 300s ./test/hftest/hftest.py --out_partitions out/reference/secure_aem_v8a_fvp_vm_clang --log out/reference/kokoro_log --spmc out/reference/secure_aem_v8a_fvp_clang/hafnium.bin --driver=fvp --hypervisor out/reference/aem_v8a_fvp_clang/hafnium.bin --partitions_json test/vmapi/ffa_secure_partitions/ffa_both_world_partitions_test.json
The command line is from kokoro/test_spmc.sh.
Thanks
Raghu
-- Hafnium mailing list Hafnium@lists.trustedfirmware.org mailto:Hafnium@lists.trustedfirmware.org https://lists.trustedfirmware.org/mailman/listinfo/hafnium
-- Hafnium mailing list Hafnium@lists.trustedfirmware.org mailto:Hafnium@lists.trustedfirmware.org https://lists.trustedfirmware.org/mailman/listinfo/hafnium
-- Hafnium mailing list Hafnium@lists.trustedfirmware.org mailto:Hafnium@lists.trustedfirmware.org https://lists.trustedfirmware.org/mailman/listinfo/hafnium -- Hafnium mailing list Hafnium@lists.trustedfirmware.org mailto:Hafnium@lists.trustedfirmware.org https://lists.trustedfirmware.org/mailman/listinfo/hafnium
-- Hafnium mailing list Hafnium@lists.trustedfirmware.org https://lists.trustedfirmware.org/mailman/listinfo/hafnium