Hi Achin, I agree that malicious trusted code can deliberately disrupt all functionality and nothing is bulletproof. But without this G0 routing even a normal secure OS will end up masking the critical events unintentionally, e.g. when it wants to mask a non-secure interrupt.
Somehow this looks odd to me that the critical event will have to wait for non critical trusted work. However, in favour of simplicity and if all of you are aligned then I can align to this as well.
Thanks
Sandeep

On Wed, Oct 7, 2020 at 3:29 PM Achin Gupta Achin.Gupta@arm.com wrote:
Hi Sandeep,
A few comments inline from a SW architecture/FF-A perspective.
On 5 Oct 2020, at 12:28, Sandeep Tripathy via TF-A tf-a@lists.trustedfirmware.org wrote:
Hi Olivier, Appreciate the details. I have a different perception of G0 interrupts and their relevance to RAS/ critical events. Comments in line. Thanks Sandeep
On Fri, Oct 2, 2020 at 6:47 PM Olivier Deprez Olivier.Deprez@arm.com wrote:
Hi Sandeep,
Here are a few more details. The reasoning differs when considering pre-Armv8.4 platforms (1) vs Armv8.4 platforms onwards with secure virtualization enabled (2).
Case (1):
The EHF framework unifies EL3 exceptions delivered via different vectors and allows them to be handled in a common way. It also allows exception handling to be delegated to lower secure ELs. This framework, although primarily used for RAS, is also used for SDEI and platform EL3 interrupts. EL3's role in this case is about trapping and routing the event to the appropriate component (when the interrupt/exception is not handled solely at EL3).
The interoperability between EHF and a Trusted OS is not accurately defined apart from this guidance in EHF documentation: "In order for S-EL1 software to handle Non-secure interrupts while having EHF enabled, the dispatcher must adopt a model where Non-secure interrupts are received at EL3, but are then synchronously handled over to S-EL1."
Until then for the specific RAS handling scenario, this was delegated to a StandaloneMM partition running at S-EL0 (through the SPM-MM implementation) and not necessarily delegated to a TOS.
Reliability is provided by the property of a G0 interrupt that it cannot be masked by lower ELs. Whether such an interrupt is handled at EL3 or delegated to other components does not affect that property. Sure, its handling must be offloaded to other components to keep EL3 firmware light. But if it were just about handling an interrupt, then it could have been entirely handled in each state without even requiring an EL3 interrupt type.
Reliability in RAS is a different concept. RAS error interrupts do not provide reliability. They report unreliable operation.
Routing RAS interrupts to EL3 is an implementation choice called Firmware First Handling (FFH). Indeed, the interrupts could be routed to a lower EL which is called Kernel first handling (KFH).
Agree. In KFH case also consider platform events which a platform might want to handle at highest possible priority not necessarily standard RAS events.
For e.g. an implementation could decide to handle corrected errors Kernel first. Uncorrected errors could be routed to a platform controller instead of firmware or be routed to both. There is no single solution.
With FFH, the main requirement is that an uncorrected error must be handled even if the Normal world is not in a position to do so. There are non-technical requirements too but let's not go there. So I don’t think there is a requirement that "no lower EL" should be able to mask the interrupt.
EL3/S-EL1 and EL3/S-EL2 are at the same privilege level as far as access to the physical address space is concerned. G0 interrupts could be routed to EL3 but they can be disabled by S-EL1 or S-EL2 by programming the GIC distributor.
The main point being that software in all privileged exception levels in the Secure world must be trusted to handle RAS errors in the Normal world. Routing G0 interrupts to EL3 is not a silver bullet.
I think 'Priority' or ability to define something as critical is also one factor (which I associated with reliability) which is different from 'Trust' and 'Privilege'. G0 interrupts routed to EL3 and with 'agreed EL3 PMR range' can provide the tool to handle something of critical nature mimicking NMI. It is still not failsafe from the non abiding trusted software.
When support for FFH was added to TF-A, there was no use case to put software in S-EL1. This EL is owned by TF-A which deploys a simple shim layer. The EHF was developed with this assumption in mind.
If your requirement is to put a Trusted OS in S-EL1 and continue doing RAS error handling, then the requirements of the Trusted OS w.r.t the interrupt routing model must be factored in. Hence, the question about what exactly are your requirements.
- RAS error handling along with OP-TEE is one use case.
- *Forceful powering down of all cores* on a critical event like watchdog interrupt is also a must have for our platform.
- Apart from these I feel the ability to do IPI (hiprio) to any core at EL3 can provide interesting applications, e.g. debug infrastructure to collect secure context.

All these will still work with the current routing model and all those RFCs to implement (IPI/SBSA watchdog) are still valid, but with the possibility of being masked by lower priority contexts.
I can understand the desire to reuse EHF but it cannot come at the cost of not meeting the TOS requirements. It needs a SW architecture discussion first. It might be possible to preempt S-EL1 and route RAS errors to EL3 in some cases. A cooperative model (2) between S-EL1 and EL3 (as Olivier described) is what most Trusted OSs implement today. It would be good to understand why that would not work for RAS.
I am only questioning the critical events being blocked by lower priority contexts. I think it should work without any modification to S_EL2/S_EL1 or EL3. Treat it just like any other Non-secure interrupt, i.e. on a foreign interrupt, it can exit the secure world all the way up to NWd, and then if a G0 interrupt is pending it will be taken at EL3. At EL3 the G0 interrupt handlers can be directly registered to IHF. Or maybe EHF can be refactored to fall back to the routing model CSS:0 TEL3:0, CSS:1 TEL3:1?
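For readers following the routing-model notation (CSS = security state of the interrupted context, TEL3 = whether the interrupt targets EL3), here is a minimal sketch of how such a model could be encoded as one bit per security state. All type and function names below are hypothetical and chosen for illustration; they are not claimed to be the TF-A interrupt framework API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Security state of the interrupted context (CSS). Illustrative values. */
enum sec_state { CSS_SECURE = 0, CSS_NON_SECURE = 1 };

/* Hypothetical encoding: bit[CSS] holds TEL3 for that security state. */
typedef uint32_t rm_flags_t;

/* Return a copy of 'flags' with TEL3 set or cleared for the given CSS. */
static inline rm_flags_t rm_set_tel3(rm_flags_t flags, enum sec_state css,
                                     bool tel3)
{
    assert(css == CSS_SECURE || css == CSS_NON_SECURE);
    if (tel3)
        return flags | (1u << css);
    return flags & ~(1u << css);
}

/* True if an interrupt taken in state 'css' is routed to EL3. */
static inline bool rm_target_is_el3(rm_flags_t flags, enum sec_state css)
{
    return ((flags >> css) & 1u) != 0u;
}

/* The fallback suggested in the mail: CSS=0,TEL3=0 and CSS=1,TEL3=1. */
static inline rm_flags_t rm_fallback_model(void)
{
    rm_flags_t flags = 0;

    flags = rm_set_tel3(flags, CSS_SECURE, false);
    flags = rm_set_tel3(flags, CSS_NON_SECURE, true);
    return flags;
}
```

With this encoding, the fallback model leaves a G0 interrupt pending while the secure world runs and takes it at EL3 only once execution is back in the Normal world.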
In order to better help you, we would need more information on the scenario you intend to achieve, and the environment (Arm architecture version and extensions, GIC version). Or maybe your question was out of curiosity for the longer term approach (2) as described below?
As per sbsa level III spec: sbsa non secure watchdog WS1 (reset) must be targeted at EL3. The patch in review ref: https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/5495 And we would want a watchdog interrupt to preempt all execution context. I would expect the same with any RAS or SDEI critical prio events.
Thanks for pointing this out!
SBSA applies to the Server segment. It was reasonable to assume that Secure firmware almost entirely resides in EL3. Hence the guidance. We will look at rewording this in a future release. The intent is that since it is a Non-secure Watchdog, the WS1 signal must not be masked by the Normal world.
The BSA applies to all segments. It leaves routing of WS1 implementation defined as long as the Normal world cannot mask it. It could be routed to S-EL1 or S-EL2 if that fulfils the requirement.
I think it is desirable for such events (secure WS0 and why not NS WS1 as well) to have priority over all other contexts and ability to preempt them as well.
Another misc application of our platform is to be able to forcefully turn off/ halt/ just ping any core at any execution context (S/NS). These motivated me to leverage EHF. But the idea of dropping EHF in future designs makes me think now !
Our current system is pre-Armv8.4. We will stick to case (1). Case (2), i.e. SPMD, was just my curiosity. However, I felt PSA-FFA may replace TOS specific SPDs someday, making SPMD relevant in this discussion even with pre-8.4 systems, because at least the TOS will have to follow one policy.
The SPMD is indeed meant to replace TOS specific SPDs. It is meant to cater for the RAS use case as well. From an FF-A perspective, a cooperative approach is simpler. I would like to understand why this would not work for RAS error interrupts as well. Reuse of EHF is an implementation level discussion and I don’t think that is off the table even with (2).
Case (2):
As a general rule, it is preferred that EL3 reduces its footprint and minimises platform specific handling code.
Agreed. This applies to case (1) as well, with the heavy lifting delegated to lower ELs in either security state. My concern is about 'taking' the interrupt; the handling can (must) be delegated.
It would certainly be desirable to reuse the EHF. However, it is not possible to delegate the heavy lifting to preempted software in S-EL1 or S-EL2 without significantly increasing their complexity. This is not the current direction of travel of FF-A.
EHF framework would most probably not be enabled at all. The priority logic provided by the GIC PMR register to mask NS interrupts cannot really work as before because all of trusted EL3/S-EL2 and untrusted S-EL1 SPs can manipulate this register.
This is a limitation. It can be taken care of by cooperative software design, i.e. S_EL2/S_EL1 will not set PMR outside its range, and the platform defines the EL3 priority range (GIC_LOWEST_EL3_PRI).
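A minimal sketch of the cooperative PMR rule described above, assuming 8-bit GIC priorities where a numerically lower value is a higher priority and an interrupt is signalled only if its priority value is below the PMR value. The value given to GIC_LOWEST_EL3_PRI and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform choice: EL3 owns priorities 0x00..0x3F. */
#define GIC_LOWEST_EL3_PRI 0x3Fu

static_assert(GIC_LOWEST_EL3_PRI < 0xFFu,
              "EL3 range must leave room for lower priorities");

/* GIC rule: an interrupt is signalled only if its priority value < PMR. */
static inline bool gic_pri_is_signalled(uint8_t pri, uint8_t pmr)
{
    return pri < pmr;
}

/*
 * Clamp a PMR value requested by S-EL1/S-EL2 software so that it can
 * never mask a priority inside the EL3 range.
 */
static inline uint8_t sel1_clamp_pmr(uint8_t requested)
{
    const uint8_t floor = (uint8_t)(GIC_LOWEST_EL3_PRI + 1u);

    return (requested < floor) ? floor : requested;
}
```

This only documents the agreement; nothing in hardware enforces it, which is the limitation raised in the thread.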
This falls under the solution space. It would be good to understand what is it you want to run on S-EL1/S-EL2 first.
OP-TEE (No S_EL2).
Any secure/non-secure interrupt triggered while running SEL1/SEL0 is trapped first by the S-EL2 firmware (or the so-called SPMC). This translates into SCR_EL3.FIQ/IRQ=0 in the secure world. Group1NS interrupts are redirected to SPMD for routing to NWd.
A Group0 interrupt is possibly redirected to a platform driver in an S-EL1 secure partition (e.g. a RAS handling service). Hence it no longer holds true that Group0 interrupts are necessarily qualified as "EL3 interrupts". It is still possible to redirect Group0 interrupts from S-EL2 to EL3 and handle them there, but as said, this is a less preferred approach.
Either way when NWd runs (with SCR_EL3.FIQ=1/IRQ=0), a Group1S/Group0 secure interrupt is trapped at EL3 and routed to SPMD then SPMC. The SPMC can take the decision to resume the secure partition which registered the corresponding secure INTID.
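The routing described for case (2) can be summarised as a small decision function. This is a sketch of the description above, not of any SPMC/SPMD code: the enum and function names are hypothetical, and the Group1NS-in-Normal-world case is filled in from the stated SCR_EL3.IRQ=0 setting rather than from the mail.

```c
#include <assert.h>

enum world { WORLD_SECURE, WORLD_NORMAL };
enum intr_group { GROUP0, GROUP1_SECURE, GROUP1_NS };
enum first_handler { TAKEN_BY_SPMC_SEL2, TAKEN_BY_EL3, TAKEN_BY_NORMAL_WORLD };

/* Which component first takes an interrupt, per the description above. */
static inline enum first_handler first_trap(enum world w, enum intr_group g)
{
    assert(g == GROUP0 || g == GROUP1_SECURE || g == GROUP1_NS);

    /* Secure world runs with SCR_EL3.FIQ=0/IRQ=0: the SPMC sees everything. */
    if (w == WORLD_SECURE)
        return TAKEN_BY_SPMC_SEL2;

    /* Normal world runs with SCR_EL3.FIQ=1/IRQ=0. */
    if (g == GROUP1_NS)
        return TAKEN_BY_NORMAL_WORLD; /* assumption: stays an NS IRQ */

    /* Group0 and Group1S trap to EL3, then go to SPMD and SPMC. */
    return TAKEN_BY_EL3;
}
```

From the first handler, a Group1NS interrupt taken by the SPMC is forwarded through the SPMD to the Normal world, while a Group0 interrupt may be handed to the secure partition that registered the INTID.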
This design does mean that SDEI interrupt handling would need SPMC and BL31 collaboration and this is something we are working on.
I understood this scheme. But it means RAS interrupts and other critical events will always have blackout periods even with proper software design.
RAS interrupts will have blackout periods even if a SMC is handled entirely in EL3. How is routing them to S-EL1 or S-EL2 any different?
Except this "Fast SMC entirely handled in EL3" rest of the contexts can be preempted. And these are more controlled than a driver in TOS at S_EL1 (possibly from another vendor). I am not sure yet but if we go down this path and if required we can do something about this EL3 SMC context as well to make it preemptable only by EL3 interrupt. Of course with audited trusted software. If a trusted code deliberately wants to mask critical events it can. Maybe some of it can be taken care of by S_EL2 if present.
Afaiu, the RAS architecture spec does not lay down any time limits on by when an error must be reported. All RAS errors are not critical errors. Even critical errors e.g. uncontainable errors report something that has already happened. With unrecoverable errors, ESBs ensure that the problem is contained to a particular EL or Security state.
Could you elaborate on what timing requirements you have and why a cooperative model would cause problems?
I agree, most of the critical events may wait. But there is a possibility to wait forever. A trusted software will not be malicious but can fail.
Whereas with the other routing model scheme the reliability of EHF handlers can be retained with the constraint of PMR ranges. There may be something I am missing.
I don’t think “reliability” is an argument here. It is about reusing the EHF in EL3. It is not off the table but we cannot overlook other evolutions in the software and hardware architecture since the EHF was written.
The other way to reuse EHF is to fallback to the existing routing model. ie: EHF handlers (EL3 handlers) can preempt only the yielding SMC contexts just like any non-secure interrupt does.
Let me know what you think.
Cheers, Achin
Hope this helps.
Regards, Olivier.
From: TF-A tf-a-bounces@lists.trustedfirmware.org on behalf of Olivier Deprez via TF-A tf-a@lists.trustedfirmware.org Sent: 28 September 2020 14:01 To: Sandeep Tripathy; Soby Mathew Cc: tf-a@lists.trustedfirmware.org; nd Subject: Re: [TF-A] Query SPD/SPMD behavior with EHF
Hi Sandeep,
Your question is very valid and we're discussing options internally.
We will come back to you with a consolidated answer shortly.
Regards, Olivier.
From: Sandeep Tripathy Sent: Monday, September 28, 2020 05:28 To: Soby Mathew Cc: Dan Handley; tf-a@lists.trustedfirmware.org; nd; Olivier Deprez Subject: Re: [TF-A] Query SPD/SPMD behavior with EHF
Thanks Soby and Dan for confirmation on TSPD. I can see a few more gaps in the related area.
"The EL3 interrupts (G0 interrupts) should be able to pre-empt Fast SMC, i.e. any execution context for that matter." This should apply to all SPDs including SPMD. However, I learned from Olivier that the SPMD/SPMC design traps FIQs to S_EL2.
In that case a RAS interrupt can be masked by S_EL2 software (eg: Hafnium). Probably by design it will be ensured that S_EL2 will never mask the physical FIQ ?
S_EL2 FIQ handler will exit to EL3/SPMD by SMC call. And depending on the pending interrupt type either it can exit to NWd OR invoke el3 fiq vector handler synchronously ?
Are there limitations if we trap fiq to EL3 instead ?
Thanks Sandeep On Fri, Sep 18, 2020 at 6:26 PM Soby Mathew Soby.Mathew@arm.com wrote:
Hi Sandeep
Except during yielding SMC, ‘disable_intr_rm_local(INTR_TYPE_NS, SECURE);’ is in effect. The intention is to avoid an NS interrupt preempting secure execution (Fast SMC). But I think that will also disable G0 interrupts, as both NS interrupts and G0 interrupts are on FIQ. EHF already ensures this by GIC PMR adjustment. So disabling the routing model seems unnecessary in this case. This is my understanding from the code; please confirm if this is correct.
The EL3 interrupts (G0 interrupts) should be able to pre-empt Fast SMC. Hence the usage of GIC PMR to mask the NS interrupts. As Dan says, the TSP_NS_INTR_ASYNC_PREEMPT predates the EHF design and it seems there is a problem as you describe.
EHF already ensures this by GIC PMR adjustment. So disabling routing model seems unnecessary in this case. This is my understanding from the code please confirm if this is correct.
You are right. Routing model manipulation is not required when EL3 interrupts are present as GIC PMR manipulation should take care of the required behaviour for yielding vs atomic SMC. You also need to ensure it works as expected when EL3 interrupts are not enabled and when EHF is disabled.
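To illustrate why PMR manipulation alone can separate yielding from atomic (Fast) SMC behaviour, here is a small model. It assumes the common split where Non-secure priorities occupy 0x80..0xFF and that a Fast SMC runs with PMR at the top of the NS range; the constants and helper names are hypothetical and do not reproduce the EHF implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed split: secure/EL3 priorities 0x00..0x7F, NS 0x80..0xFF. */
#define HIGHEST_NS_PRI  0x80u
#define PMR_ALLOW_NS    0xFFu

/* GIC rule: an interrupt passes the mask only if priority value < PMR. */
static inline bool intr_passes_pmr(uint8_t pri, uint8_t pmr)
{
    return pri < pmr;
}

/*
 * PMR value while an SMC is being serviced: a yielding SMC lets NS
 * interrupts through, a Fast SMC masks the whole NS range but leaves
 * every secure/EL3 priority unmasked.
 */
static inline uint8_t pmr_for_smc(bool yielding)
{
    return yielding ? (uint8_t)PMR_ALLOW_NS : (uint8_t)HIGHEST_NS_PRI;
}
```

In this model an EL3 interrupt at priority 0x10 preempts both kinds of SMC, while an NS interrupt at 0xA0 preempts only the yielding one, so no routing-model change is needed to get the required behaviour.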
Best Regards Soby Mathew
-----Original Message----- From: TF-A tf-a-bounces@lists.trustedfirmware.org On Behalf Of Sandeep Tripathy via TF-A Sent: 17 September 2020 16:53 To: Dan Handley Dan.Handley@arm.com Cc: tf-a@lists.trustedfirmware.org Subject: Re: [TF-A] Query TSPD behavior with EHF
Hi Dan, I am not sure if this is mentioned anywhere in any documents but I think EHF handlers should be able to preempt all execution contexts at lower ELs and lower ELs should never be able to mask such interrupts. If the behavioral expectation is set the implementation can be fixed.
Thanks Sandeep
On Thu, Sep 17, 2020 at 7:57 PM Dan Handley via TF-A <tf-a@lists.trustedfirmware.org> wrote:
A correction...
> -----Original Message-----
> From: TF-A tf-a-bounces@lists.trustedfirmware.org On Behalf Of Dan Handley via TF-A
> Sent: 17 September 2020 15:14
>
>>> I want to handle something similar in OP-TEED along with EHF depending on what is the expected behavior.
>
> Hmm, I thought OP-TEED was more like the TSP_NS_INTR_ASYNC_PREEMPT=0 case, where NS interrupts are routed to S-EL1 while processing a yielding SMC in S-EL1? Perhaps that's a better TSPD config for you to follow?

Sorry, if EL3_EXCEPTION_HANDLING=1 then obviously NS interrupts are routed to EL3 first, but the TSPD re-enables NS interrupts before handing over to the TSP to handle yielding calls, via a call to ehf_allow_ns_preemption.
Right, that is the case for yielding SMC handling where both NS interrupts and EL3/G0 interrupts can preempt the S_EL1/S_EL2 context. But I would expect the same routing model even for 'Fast SMC' unlike what is happening in TSPD.
Dan.
-- TF-A mailing list TF-A@lists.trustedfirmware.org https://lists.trustedfirmware.org/mailman/listinfo/tf-a