Hi Ming,
This issue is analysed and following technical report provided earlier in another thread. I am reposting it here. Both kernel and TF-A teams involved in preparing this report.
To begin with, SDEI implementation in TF-A relies on GIC priority masking to implement the requirement as listed in Section 6.2 of the SDEI specification [1]: “ The PE is not already handling the same or a higher priority event. ” for an SDEI dispatch to happen. The priorities in the system are managed by the Exception Handling framework (EHF [2] ) which provides helper functions to set priorities when an Event is not tied to a physical Interrupt directly. Appendix 7 in the SDEI spec mentions that "The [SDEI] handler routine similar to an interrupt handling routine, handles the event, clears the device interrupt and completes the event." So, there is an expectation in TF-A implementation that the SDEI handler is like an interrupt handling routine, and that the Client will follow through the lifecycle of the SDEI event handle before resuming Normal Execution.
SDEI implementation brings down the priority once the client (kernel) returns SDEI_COMPLETE/COMPLETE_RESUME. However, in this case, the kernel runs into a panic() before the SDEI COMPLETE(_AND_RESUME) which means the active priority of the system remains at the SDEI Normal Event handler level.
To fix it from the firmware POV, we need to bring down the active priority before the SDEI handler dispatch and queue up any further SDEI events in EL3 while the SDEI handler is executing. This solution needs a lot of thinking and probably a design change in TF-A as we need to take care of queueing mechanism and handling secure interrupts in the SDEI context as Secure interrupts which were previously masked due to SDEI Event Normal Priority are now eligible for handling and might result in a context switch to NS thus corrupting the original NS stashed context.
The one easy workaround which may apply to your case is to not hold the priority when the SDEI handler is running. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
diff --git a/services/std_svc/sdei/sdei_intr_mgmt.c b/services/std_svc/sdei/sdei_intr_mgmt.c index 87a1fb7..9ae3be1 100644 --- a/services/std_svc/sdei/sdei_intr_mgmt.c +++ b/services/std_svc/sdei/sdei_intr_mgmt.c @@ -644,21 +644,10 @@ */ ns_ctx = restore_and_resume_ns_context();
- /* Activate the priority corresponding to the event being dispatched */ - ehf_activate_priority(sdei_event_priority(map)); - /* Dispatch event synchronously */ setup_ns_dispatch(map, se, ns_ctx, &dispatch_jmp); begin_sdei_synchronous_dispatch(&dispatch_jmp);
- /* - * We reach here when a client completes the event. - * - * Deactivate the priority level that was activated at the time of - * explicit dispatch. - */ - ehf_deactivate_priority(sdei_event_priority(map)); - return 0; ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
This patch has the following limitations:
* Another SDEI event (lower priority) can pre-empt the dispatched event handler. * A secure interrupt can pre-empt the SDEI handler and can switch to NS. * Non-secure interrupt can pre-empt the SDEI handler.
Hence, we in TF- A cannot upstream this patch, due to the above limitations. But these limitations can be fine-tuned and reworked according to your system requirements. So, you can try and apply this patch downstream.
Please revert back if you have any suggestions or comments.
[1] https://developer.arm.com/documentation/den0054/latest [2] https://trustedfirmware-a.readthedocs.io/en/latest/components/exception-hand...
________________________________ From: Ming Huang huangming@linux.alibaba.com Sent: 08 March 2023 14:29 To: Mark Dykes Mark.Dykes@arm.com; James Morse James.Morse@arm.com; tf-a@lists.trustedfirmware.org tf-a@lists.trustedfirmware.org; John Powell John.Powell@arm.com; Manish Pandey2 Manish.Pandey2@arm.com; Joanna Farley Joanna.Farley@arm.com Subject: Re: [TF-A] About kdump hang issue in SDEI dispatch
Hi Mark,
Was the patch pushed to master and what was the commit ID?
Thanks, Ming
On 12/16/21 11:53 PM, Mark Dykes wrote:
Ming, After some discussion with James and other team members we have determined that a selective masking of the interrupts from the GIC might be the best solution. I have one more team member to run this by to make sure this is the right approach. If this is the case I will talk with Doug and Joanna about how we can provide a patch probably after the holidays. Let us know your view if you have any major concerns or questions. Thanks for your patience!!...
Mark
-----Original Message----- From: Ming Huang huangming@linux.alibaba.com Sent: Tuesday, December 7, 2021 9:17 PM To: James Morse James.Morse@arm.com; tf-a@lists.trustedfirmware.org; zhangliguang@linux.alibaba.com; Mark Dykes Mark.Dykes@arm.com; John Powell John.Powell@arm.com; Manish Pandey2 Manish.Pandey2@arm.com; Joanna Farley Joanna.Farley@arm.com Subject: Re: [TF-A] About kdump hang issue in SDEI dispatch
Hi TF-A guys,
Any comments about this issue?
Thanks, Ming
On 11/13/21 7:53 AM, Ming Huang via TF-A wrote:
Hi Mark & John & Manish,
Any comment about this? In my mind, these issues was caused by cooperation between TF-A and kernel.
Thanks, Ming
On 11/11/21 12:02 AM, James Morse wrote:
Hello!
On 05/11/2021 10:06, Ming Huang wrote:
When hest acpi table configure Hardware Error Notification type as Software Delegated Exception(0x0B) for RAS event, kernel RAS interacts with TF-A by SDEI mechanism. On the firmware first system, kernel was notified by TF-A sdei call.
The calling flow like as below when fatal RAS error happens: TF-A notify kernel flow: sdei_dispatch_event() ehf_activate_priority() call sdei callback // callback registered by kerenl ehf_deactivate_priority()
You've got an IRQ taken to EL3 somewhere in here, and EL3 then uses the GIC PMR to mask interrupts for secure world, which implicitly masks them for the normal world too.
Kernel sdei callback:
....
panic()
What matters here is the kernel panic()d. It can do this for any number of reasons, it may run out of stack space, execute an illegal instruction etc. The kernel must always be able to panic().
We want to modify the panic case only in sdei_asm_handler flow, not other case like run out of stack space, execute an illagal instruction etc.
The problem is what happens next, the kernel triggers kdump, and kump inherits the state of the system that the EL3 firmware handed to the SDEI event.... where interrupts are masked by the GIC PMR.
If fatal RAS error occured, panic was called in sdei_asm_handle() without ehf_deactivate_priority executed, which lead interrupt masked.
So interrupt should be restored before panic otherwise kdump will hang.
'before kdump' is impossible for EL3 to know.
Linux makes a SDEI_PE_MASK() call from inside the handler as part of smp_send_stop(). If the Kdump kernel supports SDEI, it will eventually run SDEI_PRIVATE_RESET() and SDEI_SHARED_RESET(), but interrupts need to work well before that point.
In the process of sdei, a SDEI_EVENT_COMPLETE(or SDEI_EVENT_COMPLETE_AND_RESUME) call should be called before panic for a completed run of ehf_deactivate_priority(). The ehf_deactivate_priority() function restore pmr_el1 to original value(>0x80).
The SDEI dispatch flow was broken if SDEI_EVENT_COMPLETE was not be called.
The expected flow hasn't occurred, but nothing 'illegal' is happening. Section 6.1.2 lists SDEI_PE_MASK() and the two reset calls as always being available, regardless of the event handler state.
The surprise here is that interrupts are broken by SDEI.
This will bring about two issue: 1 Kdump will hang for firmware reporting fatal RAS event by SDEI; (as explain above) 2 For NMI scene,TF-A enable a secure timer, the PPI 29 will trigger periodically. Kernel register a callback for hard lockup. The below code will not be called when panic in kernel callback:
... because it depends on an interrupt being taken by EL3, and interrupts are still masked by PMR.
How to fix above issues? I think the root cause is that kernel broken the SDEI dispatch flow, so kernel should modify to fix these issues.
The surprise here is that interrupts are broken by SDEI. The OS always needs to be able to panic(), which might trigger kdump. Kdump needs interrupts to work before it probes SDEI.
The OS may panic() for reasons other than your GHES RAS example.
It is not possible for the OS to 'break out' of the SDEI event as the panic() call occurs deep on the stack, and completing the event allows another event to be delivered, which will overwrite the state that cause the kernel to call panic() in the first place.
The only scope for paper-ing over this problem in the OS is to also make the reset calls in smp_send_stop(). I suspect this won't solve the problem without EL3 changes too.
This needs tackling from the PMR angle.
I'm not familiar with the TFA code that uses PMR to mask all interrupts when an SDEI event is delivered. Could it stop doing that? Is it upstream, or out of tree? (git grep doesn't find any references to pmr under services/std_svc/sdei.)
We use the upstrean TF-A source code. PMR is modified in ehf_activate_priority();
As it is, this design allows the normal-world to prevent secure-world interrupts from being delivered by holding a CPU in the event handler.
Presumably the use of PMR is intended to ensure 'the' secure-world interrupt that caused the event to be delivered is not taken a second time. (EL3 can't mask secure-world interrupts using PSTATE when it enters the event handler as the event handler is a lower EL.)
Instead of masking all interrupts, could it disable 'the' secure-world interrupt that caused the event? The GIC has banks of I{S,C}ENABLER<n> registers for enabling and disabling interrupts. The interrupt can stay disabled until the event is completed, unregistered or a reset call unregisters it. This is better as secure-world interrupts are still delivered and handled by secure-world. The OS may take a long time to handle an SDEI event, starving a secure-world watchdog and causing a reset once the normal-world finishes.
Even better would be to leave the interrupt enabled, and handle the interrupt without delivering another SDEI event. You probably do want 'the' interrupt to be delivered while the event handler is running, this may be the only way you can detect an error that occurs while the error-handler is running.
Thanks,
James