Hi,
if we remove/compile out https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/19897, would it fix all 3 problems?
[VW] Yes, it does. That is what I am doing downstream.
But I agree we need to do something to support RAS in general in coming weeks.
[VW] @Olivier I am surprised that we rushed to merge the solution when these limitations were known earlier. Can we please ensure that future changes are tested for every combination before merging? The path to v2.9 for NVIDIA is now much harder due to this mishap. Any patch we merge will be on top of v2.9, making the situation less than ideal as we won't be upgrading the platforms to v2.9 per se.
While we figure out the correct design, would following change help platforms?
https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/21406
[VW] @Olivier this patch works for now.
Presently a Group0 interrupt traps to SEL2 and is delegated to EL3 to the same sort of SPMD platform handler. I wonder if this is a possible case in your system and leading to the same problem?
[VW] Inventing a parallel path to EHF has its downside. The main issue is that the RAS library is tightly bound to EHF today. For the RAS handling to work as expected, we need to ensure that RAS library functions also work with the SPMD interrupt handler. The best approach would be to make SPMD a client of EHF, so that the design is maintainable in the future.
Yes this is known, and comes down to the EL3/SEL2 interface stability. ABI additions over spec iterations are a challenge to support.
[VW] I suggest that we make it a top priority to keep the implementations independent and discoverable at all times. This way we do not force vendors to upgrade multiple SW entities.
Is the SPMD likely to be ahead compared to the SPMC e.g. SPMD v2.x + SPMC v2.y / x >= y?
How much of a minor version difference would that be?
[VW] Please assume that there will be a version mismatch (minor and/or major) at all times. Please provide runtime knobs that are part of the configuration data (fconf, dts, boot parameters). This way the enablement and deployment becomes easier.
It looks a similar situation to above, as the RAS framework is not hooked to the platform handler (this may just be a temporary gap?).
[VW] We have multiple G0 interrupts which work well with the priority-based design of the EHF. I'm afraid that the move to the SPMD interrupt handler would mean that we have to build an EHF-like design at the platform layer.
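To make the priority-based arrangement above concrete, here is a minimal sketch of a dispatcher with one handler slot per priority level, where a RAS handler and an SPMD G0 handler are both ordinary clients. Every name in it (demo_register_priority_handler, demo_dispatch, the slot table, the error codes) is hypothetical and chosen for illustration; it is not the EHF API and it does not model GIC priority masking or preemption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration only; not TF-A definitions. */
#define DEMO_NUM_PRI_LEVELS   4U
#define DEMO_ERR_INVALID      (-1)
#define DEMO_ERR_BUSY         (-2)
#define DEMO_ERR_NO_HANDLER   (-3)

typedef int (*demo_pri_handler_t)(uint32_t intr, void *cookie);

struct demo_pri_slot {
    demo_pri_handler_t handler;
    void *cookie;
};

/* One slot per priority level: a single top-level owner, many clients. */
static struct demo_pri_slot demo_slots[DEMO_NUM_PRI_LEVELS];

/* A client claims one priority level; a level can be claimed only once. */
static int demo_register_priority_handler(uint32_t level,
                                          demo_pri_handler_t handler,
                                          void *cookie)
{
    if (level >= DEMO_NUM_PRI_LEVELS || handler == NULL)
        return DEMO_ERR_INVALID;
    if (demo_slots[level].handler != NULL)
        return DEMO_ERR_BUSY;
    demo_slots[level].handler = handler;
    demo_slots[level].cookie = cookie;
    return 0;
}

/* Single entry point: route the interrupt to whoever owns its level. */
static int demo_dispatch(uint32_t level, uint32_t intr)
{
    if (level >= DEMO_NUM_PRI_LEVELS || demo_slots[level].handler == NULL)
        return DEMO_ERR_NO_HANDLER;
    return demo_slots[level].handler(intr, demo_slots[level].cookie);
}

/* Example clients: each one counts the interrupts it has handled. */
static int demo_ras_handler(uint32_t intr, void *cookie)
{
    (void)intr;
    *(uint32_t *)cookie += 1U;
    return 0;
}

static int demo_spmd_g0_handler(uint32_t intr, void *cookie)
{
    (void)intr;
    *(uint32_t *)cookie += 1U;
    return 0;
}
```

In this sketch the RAS handler and the SPMD handler would each register at their own level and never compete for the single top-level slot, which is the property a platform would otherwise have to rebuild by hand.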
I'm not yet clear on the design tbh, this deserves more thinking and discussion.
[VW] I'm more than happy to discuss future enhancement. I am questioning the need to implement the SPMD interrupt handler. The SMC based approach is clear, but the need for another interrupt handler is not clear. Why can't we keep EHF as the only INTR_TYPE_EL3 interrupt handler in the system?
-Varun
________________________________
From: Olivier Deprez Olivier.Deprez@arm.com Sent: Friday, June 9, 2023 7:52 AM To: Raghupathy Krishnamurthy raghupathyk@nvidia.com; Varun Wadekar vwadekar@nvidia.com; TF-A Mailing List tf-a@lists.trustedfirmware.org; Nicolas Benech nbenech@nvidia.com Subject: Re: EHF and SPMD G0 interrupt handling issues
External email: Use caution opening links or attachments
Hi Raghu,
Yes, absolutely, this is by the FF-A spec, and for now I'm fine with this design. I was alluding to Varun's comment:
The RAS library uses EHF and its functions are not accessible to the platform to call from the 'plat_spmd_handle_group0_interrupt' handler. This interrupt handling now creates an unwanted and longer chain from the interrupt handler to the actual RAS handler in the platform port. I covered this in (2) and (3) from my list. Platforms might have to recreate something similar to EHF within their platform ports if they are asked to remove support for EHF altogether.
Without EHF, a Group0 RAS interrupt ends up in https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/services/st... To avoid this situation, we permit EL3_EXCEPTION_HANDLING=1 in which case this platform handler is discarded, and the interrupt is managed by EHF/RAS framework the legacy way.
For a Group0 interrupt delegated from SEL2 to EL3, the default handler is https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/services/st... It looks a similar situation to above, as the RAS framework is not hooked to the platform handler (this may just be a temporary gap?). Moreover, a question is whether this case occurs in your platform, if we expect RAS interrupts to only happen while the normal world runs?
Longer term, are you in favour of moving away from EHF and gluing the platform handler to the RAS framework? I was thinking about this other suggestion: 'we should make SPMD a client of the EHF'. I'm not yet clear on the design tbh, this deserves more thinking and discussion.
Regards, Olivier.
________________________________
From: Raghupathy Krishnamurthy raghupathyk@nvidia.com Sent: 09 June 2023 04:49 To: Olivier Deprez Olivier.Deprez@arm.com; Varun Wadekar vwadekar@nvidia.com; TF-A Mailing List tf-a@lists.trustedfirmware.org; Nicolas Benech nbenech@nvidia.com Subject: Re: EHF and SPMD G0 interrupt handling issues
Olivier, can you elaborate on the problem with “Presently a Group0 interrupt traps to SEL2 and is delegated to EL3 to the same sort of SPMD platform handler. I wonder if this is a possible case in your system and leading to the same problem?”. Isn't that exactly how the Group0 SMC-based handling is designed?
From: Olivier Deprez Olivier.Deprez@arm.com Date: Thursday, June 8, 2023 at 10:14 AM To: Raghupathy Krishnamurthy raghupathyk@nvidia.com, Varun Wadekar vwadekar@nvidia.com, TF-A Mailing List tf-a@lists.trustedfirmware.org, Nicolas Benech nbenech@nvidia.com Subject: Re: EHF and SPMD G0 interrupt handling issues
Hi Varun and Raghu,
Thanks both for the detailed replies and investigations.
I appreciate that Group0 interrupt handling, while SPMD/SEL2 SPMC are present, is a fresh proposal that hasn't been deployed yet, so it is only now hitting real-world usage scenarios. Moreover, two-world RAS scenarios in this same configuration are neither designed nor tested in the reference software stack (I'm not aware of downstream design deployments). Those are partly the reasons why we did not consider SPD=spmd EL3_EXCEPTION_HANDLING=1 so far. The pitfall is that this doesn't trigger a build error, but a runtime misbehaviour as you hinted. But I agree we need to do something to support RAS in general in coming weeks.
While we figure out the correct design, would following change help platforms?
https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/21406
This should match your suggestion of omitting the SPMD interrupt registration when EL3_EXCEPTION_HANDLING=1.
I believe that's acceptable in the current situation and shouldn't break our test cases AFAIU.
It remains a question, what is the expected behaviour for a Group0 interrupt occurring while the secure world runs? Presently a Group0 interrupt traps to SEL2 and is delegated to EL3 to the same sort of SPMD platform handler. I wonder if this is a possible case in your system and leading to the same problem?
* Coming to the compatibility and deployment concerns. There is an inherent assumption that platforms will deploy TF-A and Hafnium v2.9 at the same time
Yes this is known, and comes down to the EL3/SEL2 interface stability. ABI additions over spec iterations are a challenge to support.
Testing of mixed-and-matched SPMD/SPMC versions hasn't been investigated very far, mostly because of differing partner deployment models. We agreed with our tech management that matching SPMD and SPMC versions is the most common and easiest model for the reference software stack to support. This also aligns with the fact that EL3 and SEL2 are part of the same TCB and 'likely' to evolve at the same time to follow bug fixes and feature additions.
In order to understand better how we could improve, can you tell a bit more about the typical scenario?
Is the SPMD likely to be ahead compared to the SPMC e.g. SPMD v2.x + SPMC v2.y / x >= y?
How much of a minor version difference would that be?
Regards,
Olivier.
________________________________
From: Raghupathy Krishnamurthy raghupathyk@nvidia.com Sent: 06 June 2023 21:26 To: Varun Wadekar vwadekar@nvidia.com; Olivier Deprez Olivier.Deprez@arm.com; TF-A Mailing List tf-a@lists.trustedfirmware.org; Nicolas Benech nbenech@nvidia.com Subject: RE: EHF and SPMD G0 interrupt handling issues
I see the issue Varun has. https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/19897 introduced a change where SPMD unconditionally registers for INTR_TYPE_EL3. If we compile both EHF and SPMD_SPM_AT_SEL2, we have an issue. This combination used to work with https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/16047 because EHF could be enabled with SPMD, but it no longer can with the latest change to support Group0 interrupts.
Fundamentally, we need to disable the use of EHF (which I think Varun is saying is problematic because we use it). I had posted a comment here: https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/19897/comment... because of the below situation so that a platform could explicitly compile it out, when used with EHF.
Varun, if we remove/compile out https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/19897, would it fix all 3 problems? I think it does.
From: Varun Wadekar vwadekar@nvidia.com Sent: Tuesday, June 6, 2023 11:24 AM To: Raghupathy Krishnamurthy raghupathyk@nvidia.com; Olivier Deprez Olivier.Deprez@arm.com; TF-A Mailing List tf-a@lists.trustedfirmware.org; Nicolas Benech nbenech@nvidia.com Subject: Re: EHF and SPMD G0 interrupt handling issues
Hi,
Thanks for the links. I agree that things look good on paper, but the ground reality does not match the plan, IMO.
For platforms that enable SPMD_SPM_AT_SEL2 and EHF, the INTR_TYPE_EL3 handler is registered twice: once in ehf.c and once in spmd_main.c. I covered this in (1) from my list.
The RAS library uses EHF and its functions are not accessible to the platform to call from the 'plat_spmd_handle_group0_interrupt' handler. This interrupt handling now creates an unwanted and longer chain from the interrupt handler to the actual RAS handler in the platform port. I covered this in (2) and (3) from my list. Platforms might have to recreate something similar to EHF within their platform ports if they are asked to remove support for EHF altogether.
Coming to the compatibility and deployment concerns. There is an inherent assumption that platforms will deploy TF-A and Hafnium v2.9 at the same time. This assumption has led to the design choices in the code where we used static macros instead of runtime mechanisms to detect the availability of the support. I am not a big fan of increasing dependencies between independent SW components as it creates unwanted work for platforms and increases TTM.
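As an illustration of a runtime mechanism in place of a static macro, the sketch below gates an optional feature on a configuration knob plus the version the peer reports. It assumes the FFA_VERSION layout as I understand it from the FF-A spec (major in bits [30:16], minor in bits [15:0]). The struct, the function names and the (1, 1) threshold are placeholders for illustration; in particular the threshold is not a statement of which spec revision introduced FFA_EL3_INTR_HANDLE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed FFA_VERSION layout: bit 31 zero, major [30:16], minor [15:0]. */
#define DEMO_FFA_MAJOR_SHIFT  16U
#define DEMO_FFA_MAJOR_MASK   0x7FFFU
#define DEMO_FFA_MINOR_MASK   0xFFFFU

/* Placeholder threshold for the optional feature; not taken from the spec. */
#define DEMO_FEAT_MAJOR       1U
#define DEMO_FEAT_MINOR       1U

static uint32_t demo_ffa_version(uint32_t major, uint32_t minor)
{
    return ((major & DEMO_FFA_MAJOR_MASK) << DEMO_FFA_MAJOR_SHIFT) |
           (minor & DEMO_FFA_MINOR_MASK);
}

static uint32_t demo_ffa_major(uint32_t version)
{
    return (version >> DEMO_FFA_MAJOR_SHIFT) & DEMO_FFA_MAJOR_MASK;
}

static uint32_t demo_ffa_minor(uint32_t version)
{
    return version & DEMO_FFA_MINOR_MASK;
}

/* Hypothetical runtime configuration, e.g. filled from fconf/dts/boot args. */
struct demo_spmd_cfg {
    bool g0_intr_knob;      /* platform opt-in for the feature */
    uint32_t spmc_version;  /* version reported by the SPMC at boot */
};

/* Peer supports the feature if majors match and its minor is new enough. */
static bool demo_peer_supports(uint32_t peer_version,
                               uint32_t feat_major, uint32_t feat_minor)
{
    return demo_ffa_major(peer_version) == feat_major &&
           demo_ffa_minor(peer_version) >= feat_minor;
}

/* Decide at runtime, not at build time, whether to enable the feature. */
static bool demo_enable_g0_forwarding(const struct demo_spmd_cfg *cfg)
{
    return cfg->g0_intr_knob &&
           demo_peer_supports(cfg->spmc_version,
                              DEMO_FEAT_MAJOR, DEMO_FEAT_MINOR);
}
```

With a check of this shape, an SPMD paired with an older SPMC would simply leave the feature off at boot instead of requiring both components to be rebuilt together.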
The long-term approach should be to ensure that SPMD and EHF work in all possible combinations. The short-term approach should be to fix this issue by either reverting the change or introducing a workaround.
-Varun
________________________________
From: Raghupathy Krishnamurthy <raghupathyk@nvidia.com> Sent: Tuesday, June 6, 2023 4:45 PM To: Olivier Deprez <Olivier.Deprez@arm.com>; TF-A Mailing List <tf-a@lists.trustedfirmware.org>; Varun Wadekar <vwadekar@nvidia.com>; Nicolas Benech <nbenech@nvidia.com> Subject: RE: EHF and SPMD G0 interrupt handling issues
Agree with Olivier. We should line up to FF-A spec recommendation.
Varun, if there are other issues caused by this, happy to sync internally. +@Nicolas Benech for vis (Nico, FYI – this is on public mailing list)
-Raghu
From: Olivier Deprez <Olivier.Deprez@arm.com> Sent: Tuesday, June 6, 2023 7:53 AM To: TF-A Mailing List <tf-a@lists.trustedfirmware.org>; Varun Wadekar <vwadekar@nvidia.com> Cc: Raghupathy Krishnamurthy <raghupathyk@nvidia.com> Subject: Re: EHF and SPMD G0 interrupt handling issues
Hi Varun,
* for platforms with SPMD_SPM_AT_SEL2=1. These platforms already use EHF for servicing RAS interrupts today.
Can you please have a look at https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/16047 ?
and https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/16047/6/docs/...
The model, by the FF-A specification, is to permit G0 interrupts to trap to EL3 when NWd runs.
A G0 interrupt is routed to a SP through the SPMD/SPMC by the use of EL3-SP direct messages:
https://review.trustedfirmware.org/q/topic:%22el3_direct_msg%22+(status:open...)
When SEL1/0 runs, G0 interrupts are first trapped to SEL2 and forwarded to EL3 by the FFA_EL3_INTR_HANDLE ABI.
I appreciate the legacy capability to let G0 interrupts trap to EL3 while SWd runs is not possible/recommended with this design.
This might indeed break earlier implementations; would it make sense aligning SW stacks to the latest of the FF-A spec recommendations?
I'll let Raghu chime in if need be.
Regards,
Olivier.
________________________________
From: Varun Wadekar via TF-A <tf-a@lists.trustedfirmware.org> Sent: 06 June 2023 13:12 To: TF-A Mailing List <tf-a@lists.trustedfirmware.org> Subject: [TF-A] EHF and SPMD G0 interrupt handling issues
Hi,
We are in the process of upgrading the downstream TF-A to v2.9 for platforms with SPMD_SPM_AT_SEL2=1. These platforms already use EHF for servicing RAS interrupts today.
I noticed that v2.9 has added G0 interrupt handling support to the SPMD. But I think the SPMD support still needs some work as it does not play nicely with EHF.
I have found the following issues with the implementation.
1. SPMD and EHF both register handlers for G0 interrupts. But the interrupt management framework only allows one handler for INTR_TYPE_EL3.
2. The RAS framework still uses EHF and the SPMD interrupt handler breaks that functionality.
3. The SPMD handler calls into the platform which does not have any means to invoke the RAS interrupt handler.
IMO, we should make SPMD a client of the EHF instead of creating yet another way for interrupt handling. For now, I register SPMD's G0 interrupt handler only if EL3_EXCEPTION_HANDLING=0, as a workaround.
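For readers who want to see issue (1) and the workaround in isolation, here is a minimal standalone model. It assumes a framework that keeps exactly one handler per interrupt type and rejects a second registration. All identifiers (demo_register_handler, demo_setup, the error codes) are hypothetical stand-ins and not the TF-A interrupt management API; in the real tree the choice would be made by the EL3_EXCEPTION_HANDLING build option rather than a function argument.

```c
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not TF-A definitions. */
#define DEMO_INTR_TYPE_EL3    2U
#define DEMO_MAX_INTR_TYPES   3U
#define DEMO_ERR_INVALID      (-1)
#define DEMO_ERR_ALREADY      (-2)

typedef uint64_t (*demo_intr_handler_t)(uint32_t intr_id);

/* One handler slot per interrupt type. */
static demo_intr_handler_t demo_handlers[DEMO_MAX_INTR_TYPES];

/* Clear every slot so the scenarios below can be replayed. */
static void demo_reset(void)
{
    for (size_t i = 0; i < DEMO_MAX_INTR_TYPES; i++)
        demo_handlers[i] = NULL;
}

/* Reject a second registration for the same type, as in issue (1). */
static int demo_register_handler(uint32_t type, demo_intr_handler_t handler)
{
    if (type >= DEMO_MAX_INTR_TYPES || handler == NULL)
        return DEMO_ERR_INVALID;
    if (demo_handlers[type] != NULL)
        return DEMO_ERR_ALREADY;
    demo_handlers[type] = handler;
    return 0;
}

/* Dummy handlers that tag their return value so the owner is visible. */
static uint64_t demo_ehf_handler(uint32_t intr_id)
{
    return 0x100U + intr_id;
}

static uint64_t demo_spmd_handler(uint32_t intr_id)
{
    return 0x200U + intr_id;
}

/*
 * Workaround from the thread: when EHF is enabled it owns the EL3 type
 * and the SPMD registration is skipped; otherwise SPMD registers.
 */
static int demo_setup(int ehf_enabled)
{
    if (ehf_enabled != 0)
        return demo_register_handler(DEMO_INTR_TYPE_EL3, demo_ehf_handler);
    return demo_register_handler(DEMO_INTR_TYPE_EL3, demo_spmd_handler);
}
```

Registering both handlers back to back in this model fails on the second call, while demo_setup leaves exactly one owner for the EL3 interrupt type in either configuration.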
Thoughts?