[TF-A] About kdump hang issue in SDEI dispatch

5 Nov 2021


Hi James & TF-A guys,
When the HEST ACPI table configures the Hardware Error Notification type as
Software Delegated Exception (0x0B) for RAS events, the kernel RAS code interacts
with TF-A through the SDEI mechanism. On a firmware-first system, TF-A notifies
the kernel with an SDEI event.
The call flow when a fatal RAS error happens is as follows.
TF-A notifying the kernel:
  sdei_dispatch_event()
    ehf_activate_priority()
      call sdei callback  // callback registered by kernel
    ehf_deactivate_priority()
Kernel sdei callback:
  sdei_asm_handler()
    __sdei_handler()
      _sdei_handler()
        sdei_event_handler()
          ghes_sdei_critical_callback()
            ghes_in_nmi_queue_one_entry()
              /* if RAS error is fatal */
              __ghes_panic()
                panic()
If a fatal RAS error occurs, panic() is called from within sdei_asm_handler()
before ehf_deactivate_priority() has run, which leaves interrupts masked.
With interrupts masked, the system hangs in the kdump flow at this point:
arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for cmdq
arm-smmu-v3 arm-smmu-v3.3.auto: allocated 32768 entries for evtq
arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for priq
arm-smmu-v3 arm-smmu-v3.3.auto: SMMU currently enabled! Resetting...
So the interrupt mask must be restored before panic, otherwise kdump hangs.
During SDEI handling, SDEI_EVENT_COMPLETE (or SDEI_EVENT_COMPLETE_AND_RESUME)
should be called before panic so that ehf_deactivate_priority() runs to completion.
The ehf_deactivate_priority() function restores pmr_el1 to its original value (>0x80).
The SDEI dispatch flow is broken if SDEI_EVENT_COMPLETE is never called.
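To make the masking effect concrete, here is a small toy model in C. It is only a sketch, not TF-A code: the model_* functions, the handler stubs and the three priority values are assumptions chosen for illustration. The one rule it relies on is that the GIC signals an interrupt only when its priority value is numerically lower than the priority mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only. All names and values below are illustrative assumptions. */
#define PMR_IDLE    0xF0u  /* assumed original priority mask (> 0x80) */
#define PMR_SDEI    0x70u  /* assumed mask while an SDEI priority is active */
#define NS_IRQ_PRIO 0xA0u  /* assumed priority of a Non-secure interrupt */

static uint8_t pmr = PMR_IDLE;

/* Stand-ins for ehf_activate_priority() / ehf_deactivate_priority(). */
static void model_activate_priority(void)   { pmr = PMR_SDEI; }
static void model_deactivate_priority(void) { pmr = PMR_IDLE; }

/* An interrupt is signalled only if its priority value is lower than the mask. */
static bool irq_deliverable(uint8_t prio) { return prio < pmr; }

/* Returns true if the client completed the event (SDEI_EVENT_COMPLETE). */
typedef bool (*client_handler_t)(void);

static bool handler_completes(void) { return true; }

/* Stands in for a callback that calls panic() and never returns to firmware. */
static bool handler_panics(void) { return false; }

static void model_dispatch(client_handler_t handler)
{
    assert(handler != NULL);
    model_activate_priority();
    if (handler())
        model_deactivate_priority(); /* skipped when the callback panics */
}
```

In this model, after model_dispatch(handler_completes) the call irq_deliverable(NS_IRQ_PRIO) returns true, while after model_dispatch(handler_panics) the mask stays at PMR_SDEI and the same call returns false, which corresponds to the state the kdump kernel would inherit.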
This causes two issues:
1 Kdump hangs when firmware reports a fatal RAS event through SDEI
  (as explained above);
2 In the NMI scenario, TF-A enables a secure timer, so PPI 29 triggers periodically,
  and the kernel registers a callback for hard lockups. The code below is not
  reached when the kernel callback panics:
  TF-A, services/std_svc/sdei/sdei_intr_mgmt.c, sdei_intr_handler():
  /*
   * We reach here when client completes the event.
   *
   * If the cause of dispatch originally interrupted the Secure world,
   * resume Secure.
   *
   * No need to save the Non-secure context ahead of a world switch: the
   * Non-secure context was fully saved before dispatch, and has been
   * returned to its pre-dispatch state.
   */
  if (sec_state == SECURE)
    restore_and_resume_secure_context();
  /*
   * The event was dispatched after receiving SDEI interrupt. With
   * the event handling completed, EOI the corresponding
   * interrupt.
   */
  if ((map->ev_num != SDEI_EVENT_0) && !is_map_bound(map)) {
    ERROR("Invalid SDEI mapping: ev=%u\n", map->ev_num);
    panic();
  }
  plat_ic_end_of_interrupt(intr_raw);
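The effect of never reaching plat_ic_end_of_interrupt() can be sketched with another toy model. This is an assumption-laden illustration, not the TF-A or GIC implementation: struct model_intr and the model_* functions are hypothetical, and the model only captures the idea that an interrupt that has been acknowledged but not ended cannot be taken again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only. The type and function names are illustrative assumptions. */
struct model_intr {
    bool active;     /* acknowledged but not yet ended (no EOI) */
    unsigned fired;  /* number of times the interrupt was actually taken */
};

/* Take the interrupt unless a previous instance is still active. */
static bool model_take(struct model_intr *intr)
{
    assert(intr != NULL);
    if (intr->active)
        return false;
    intr->active = true;
    intr->fired++;
    return true;
}

/* Stand-in for plat_ic_end_of_interrupt(). */
static void model_eoi(struct model_intr *intr)
{
    assert(intr != NULL);
    intr->active = false;
}

/* Stand-in for the tail of the interrupt handler quoted above. */
static void model_intr_handler(struct model_intr *intr, bool client_completes)
{
    if (!model_take(intr))
        return;              /* still active: the periodic PPI is not seen */
    if (!client_completes)
        return;              /* callback panicked: EOI is never reached */
    model_eoi(intr);
}
```

With a zero-initialised struct model_intr, two calls with client_completes set to true leave fired at 2. One further call with false takes the interrupt a third time but leaves it active, and every later call is ignored, so fired stays at 3.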
How should these issues be fixed?
I think the root cause is that the kernel breaks the SDEI dispatch flow, so the
kernel should be changed to fix them.
Thanks,
Ming
