Hi,

Just for the record, I came across a similar (but not identical) problem and discussed it in the TF-A LTS context [1].

> IMO, what's happening is that an SError was caused by a lower EL and trapped to EL3 (RAS_EXTENSION means SCR_EL3.EA=1), entering the "serror_aarch64" vector entry. After SError is unmasked at EL3, it causes another SError at EL3, which ends up in the weak implementation of "plat_handle_el3_ea" (and ultimately in report_unhandled_exception).

If I understood correctly, we clear the SCR_EL3.EA bit on exit from EL3 to a lower EL, so an SError in a lower EL shouldn't trap to EL3. Did I miss something?
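
For reference, the scr_el3 values printed in the dumps quoted below can be checked by hand. Here is a minimal C sketch, assuming the Armv8-A SCR_EL3 bit positions (NS = bit 0, EA = bit 3, EEL2 = bit 18). The macro names follow TF-A's style, but the helper functions are hypothetical and written only for illustration. It only decodes the value as printed in the dump, and says nothing about what was programmed at other points in time.

```c
#include <stdint.h>
#include <stdio.h>

/* SCR_EL3 bit positions as defined by the Armv8-A architecture. */
#define SCR_NS_BIT    (UINT64_C(1) << 0)   /* lower ELs are Non-secure */
#define SCR_EA_BIT    (UINT64_C(1) << 3)   /* External aborts/SError routed to EL3 */
#define SCR_EEL2_BIT  (UINT64_C(1) << 18)  /* Secure EL2 enabled */

/* Hypothetical helpers: each returns 1 if the bit is set, 0 otherwise. */
int scr_ns(uint64_t scr)   { return (scr & SCR_NS_BIT) != 0; }
int scr_ea(uint64_t scr)   { return (scr & SCR_EA_BIT) != 0; }
int scr_eel2(uint64_t scr) { return (scr & SCR_EEL2_BIT) != 0; }

/* Print the three fields in the same spirit as the platform log. */
void print_scr(uint64_t scr)
{
    printf("SCR: %llx ns=%d ea=%d eel2=%d\n",
           (unsigned long long)scr, scr_ns(scr), scr_ea(scr), scr_eel2(scr));
}
```

Calling print_scr(0x403073d) with the value from the dumps reports ns=1, ea=1, eel2=0, so the EA bit is set in the value that was printed.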

[1] https://lists.trustedfirmware.org/archives/list/tfa-lts@lists.trustedfirmware.org/thread/GFIAJVP6PDDGJGHXFRKDTPZPLZN7PIK4/

On Fri, Mar 10, 2023 at 6:10 AM Ming Huang via TF-A <tf-a@lists.trustedfirmware.org> wrote:
Hi Manish,

Please see my comments below.

On 3/9/23 7:45 PM, Manish Pandey2 wrote:
> Hi Ming,
>
> IMO, what's happening is that an SError was caused by a lower EL and trapped to EL3 (RAS_EXTENSION means SCR_EL3.EA=1), entering the "serror_aarch64" vector entry. After SError is unmasked at EL3, it causes another SError at EL3, which ends up in the weak implementation of "plat_handle_el3_ea" (and ultimately in report_unhandled_exception).
>
>  The reason to unmask SError is to catch any SError caused by EL3 itself while it is handling the lower-EL SError. I think adding an error synchronization barrier (esb) may prevent SError from trapping multiple times.
>  Just to let you know, I am actively working on RAS refactoring in TF-A (and fixing bugs). I will add you to those patches.
>
>  To fix this particular issue, I am planning the change below. Would you please test whether it works for you? There is no need to unmask EAs if we are already handling an EA (unlike other exceptions).

The result is similar to masking EA (i.e. removing "msr     daifclr, #DAIF_ABT_BIT").
--------------log:
[   48.944653] pcieport 0000:b0:00.0: EDR: EDR event received
Detected DPC, skip AER
  core[64] mm(925) return: 2
--handler(925) end
[   48.959091] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
[   48.967129] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
[   48.972710] igb 0000:b1:00.0 ens51f0: PCIe link lost

[root@localhost.localdomain /root/hm]
#[   49.141737] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
[   49.150315] igb 0000:b1:00.0: enabling device (0000 -> 0002)
ERROR:   Excepton received on 0x81000000, spsr_el3:82401009,reason:0 esr_el3:0xbe000411
ue_cnt:0x0
  esr_el3 = be000411
    Exception Class = 2f: SError interrupt.
Print cpu register:
  |-elr_el3: ffff80001063a188
  |-far_el3: 7abce97690b57fe1
  |-scr_el3: 403073d
  |-sctlr_el3: 30cd183f
  |-LR: ffff80001063a2f0
  |-SP: ffff000807388000
  |-x0: 0000000000000000  x1: ffff800011ed500c
  |-x2: 0000000000000000  x3: ffff800011ed5004
  |-x4: ffff800011b73000  x5: 000000000000000c
  |-x6: ffff800010c1d3f8  x7: 0000000000000000
  |-x8: 00000000000bffe8  x9: ffff80001063a2f0
  |-x10: ffff800011792338  x11: 0000000000000003
  |-x12: ffff800011852378  x13: 0000000000000001
  |-x14: 0000000000000c80  x15: 0000000000aaaaaa
  |-x16: 0000000000000000  x17: 7228206465726f6e
  |-x18: 0000000000000010  x19: ffff04000401a080
  |-x20: ffff800010f522b8  x21: ffff04000401a0a0
  |-x22: ffff00080b4e2370  x23: ffff00080b4e1fe8
  |-x24: 00000000f4b1e03c  x25: ffff00080b4e10c8
  |-x26: ffff8000114b1880  x27: ffff00080737f5c0
  |-x28: ffff0008072d3400  x29: ffff8000121cbab0
INFO:    mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR:   Handle Exception on 0x81000000 from EL2, reason=0 esr_el3=0xbe000411
INFO:    spsr:82401009 elr:ffff80001063a188 far:7abce97690b57fe1
INFO:    SCR: 403073d
           |-ns: 1
           |-eel2: 0
INFO:    HCR: 488000000
           |-teg: 1
           |-amo: 0
           |-tea: 0
INFO:    Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+380.
ERROR:   Excepton received on 0x81000000, spsr_el3:600001c9,reason:2 esr_el3:0x80000011
ue_cnt:0x0
  esr_el3 = 5e000000
    Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
  |-elr_el3: ffff800010027aac
  |-far_el3: 7abce97690b57fe1
  |-scr_el3: 403073d
  |-sctlr_el3: 30cd183f
  |-LR: ffff8000108d569c
  |-SP: ffff000807388000
  |-x0: 00000000c400002b  x1: 0000000000000000
  |-x2: 0000000000000000  x3: 0000000000000000
  |-x4: 0000000000000000  x5: 0000000000000000
  |-x6: 0000000000000000  x7: 0000000000000000
  |-x8: ffff8000121cb710  x9: ffff8000108d6078
  |-x10: ffff8000121cb750  x11: 0000000000000000
  |-x12: 61646e6f63657320  x13: 0a73555043207972
  |-x14: 0000000000000000  x15: 0000000000000000
  |-x16: 0000000000000000  x17: 0000000000000000
  |-x18: 0000000000000060  x19: ffff800010f522b8
  |-x20: 00000000000f423d  x21: ffff800011996a68
  |-x22: ffff8000114a0c08  x23: ffff8000114a0000
  |-x24: ffff8000119b7000  x25: ffff00080b4e10c8
  |-x26: ffff8000114b1880  x27: ffff00080737f5c0
  |-x28: ffff000807388000  x29: ffff8000121cb6e0
INFO:    mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR:   Handle Exception on 0x81000000 from EL2, reason=2 esr_el3=0x80000011
INFO:    spsr:600001c9 elr:ffff800010027aac far:7abce97690b57fe1
INFO:    SCR: 403073d
           |-ns: 1
           |-eel2: 0
INFO:    HCR: 488000000
           |-teg: 1
           |-amo: 0
           |-tea: 0
INFO:    Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+200.
ERROR:   Excepton received on 0x81000000, spsr_el3:800003c9,reason:2 esr_el3:0x80000011
ue_cnt:0x0
  esr_el3 = 5e000000
    Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
  |-elr_el3: ffff800010027aac
  |-far_el3: 7abce97690b57fe1
  |-scr_el3: 403073d
  |-sctlr_el3: 30cd183f
  |-LR: ffff8000108d569c
  |-SP: ffff000807388000
  |-x0: 00000000c4000038  x1: 000000000000001d
  |-x2: 0000000000000000  x3: 0000000000000000
  |-x4: 0000000000000000  x5: 0000000000000000
  |-x6: 0000000000000000  x7: 0000000000000000
  |-x8: 303d294b53414d43  x9: ffff8000108d5f28
  |-x10: 3736313d454d4954  x11: 0a37333630323438
  |-x12: 78303d294b53414d  x13: 5448534152430a30
  |-x14: 4c454e52454b0a30  x15: 303d54455346464f
  |-x16: 352d50314355312d  x17: 664d427369412f42
  |-x18: 0000000000000001  x19: ffff800010f522b8
  |-x20: 0000000000000000  x21: 0000000000000000
  |-x22: ffff800011acd000  x23: ffff8000121cb2e8
  |-x24: ffff8000119b7000  x25: ffff00080b4e10c8
  |-x26: ffff8000114b1880  x27: ffff00080737f5c0
  |-x28: ffff000807388000  x29: ffff8000121cb200
INFO:    mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR:   Handle Exception on 0x81000000 from EL2, reason=2 esr_el3=0x80000011
INFO:    spsr:800003c9 elr:ffff800010027aac far:7abce97690b57fe1
INFO:    SCR: 403073d
           |-ns: 1
           |-eel2: 0
INFO:    HCR: 488000000
           |-teg: 1
           |-amo: 0
           |-tea: 0
INFO:    Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+200.
INFO:    clear eoi id:1d
[   49.473997] SError Interrupt on CPU0, code 0xbe000411 -- SError
[   49.473998] CPU: 0 PID: 15 Comm: kworker/0:1 Kdump: loaded Tainted: G            E     5.10.134-10.git.819992ee1.an8.aarch64 #1
[   49.473998] Hardware name: AisPManu AliServer-Xuanwu2.0AM-02-1UC1P-5B/AisBMf, BIOS 1.2.M1.AL.P.139.00 03/10/2023
[   49.473998] Workqueue: kacpi_notify acpi_os_execute_deferred
[   49.474000] pstate: 82401009 (Nzcv daif +PAN -UAO +TCO BTYPE=--)
[   49.474000] pc : __pci_write_msi_msg+0x108/0x210
[   49.474000] lr : default_restore_msi_irq+0x60/0x88
[   49.474001] sp : ffff8000121cbab0
[   49.474001] x29: ffff8000121cbab0 x28: ffff0008072d3400
[   49.474001] x27: ffff00080737f5c0 x26: ffff8000114b1880
[   49.474002] x25: ffff00080b4e10c8 x24: 00000000f4b1e03c
[   49.474003] x23: ffff00080b4e1fe8 x22: ffff00080b4e2370
[   49.474003] x21: ffff04000401a0a0 x20: ffff800010f522b8
[   49.474004] x19: ffff04000401a080 x18: 0000000000000010
[   49.474004] x17: 7228206465726f6e x16: 0000000000000000
[   49.474005] x15: 0000000000aaaaaa x14: 0000000000000c80
[   49.474005] x13: 0000000000000001 x12: ffff800011852378
[   49.474006] x11: 0000000000000003 x10: ffff800011792338
[   49.474006] x9 : ffff80001063a2f0 x8 : 00000000000bffe8
[   49.474007] x7 : 0000000000000000 x6 : ffff800010c1d3f8
[   49.474007] x5 : 000000000000000c x4 : ffff800011b73000
[   49.474008] x3 : ffff800011ed5004 x2 : 0000000000000000
[   49.474008] x1 : ffff800011ed500c x0 : 0000000000000000
[   49.474009] Kernel panic - not syncing: Asynchronous SError Interrupt
[   49.474010] CPU: 0 PID: 15 Comm: kworker/0:1 Kdump: loaded Tainted: G            E     5.10.134-10.git.819992ee1.an8.aarch64 #1
[   49.474010] Hardware name: AisPManu AliServer-Xuanwu2.0AM-02-1UC1P-5B/AisBMf, BIOS 1.2.M1.AL.P.139.00 03/10/2023
[   49.474010] Workqueue: kacpi_notify acpi_os_execute_deferred
[   49.474011] Call trace:
[   49.474011]  dump_backtrace+0x0/0x200
[   49.474011]  show_stack+0x1c/0x28
[   49.474011]  dump_stack+0xd4/0x128
[   49.474012]  panic+0x178/0x394
[   49.474012]  nmi_panic+0x6c/0xa0
[   49.474012]  arm64_serror_panic+0x84/0x90
[   49.474012]  do_serror+0x38/0x60
[   49.474014]  el1_error+0x7c/0xf4
[   49.474014]  __pci_write_msi_msg+0x108/0x210
[   49.474015]  default_restore_msi_irq+0x60/0x88
[   49.474015]  arch_restore_msi_irqs+0x3c/0x58
[   49.474015]  pci_restore_msi_state+0x8c/0x210
[   49.474015]  pci_restore_state.part.42+0x34c/0x4a8
[   49.474016]  pci_restore_state+0x20/0x28
[   49.474016]  igb_io_slot_reset+0x3c/0xc8 [igb]
[   49.474016]  report_slot_reset+0x4c/0x98
[   49.474016]  pci_walk_bus+0x68/0xc0
[   49.474017]  pci_walk_bridge+0x20/0x40
[   49.474017]  pcie_do_recovery+0x1f0/0x290
[   49.474017]  edr_handle_event+0x1f0/0x270
[   49.474017]  acpi_ev_notify_dispatch+0x60/0x70
[   49.474018]  acpi_os_execute_deferred+0x20/0x38
[   49.474018]  process_one_work+0x1b8/0x420
[   49.474018]  worker_thread+0x158/0x510
[   49.474018]  kthread+0x114/0x118
[   49.474019] SMP: stopping secondary CPUs
-----------------log end
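
For anyone reading the dumps above, the esr_el3 values can be decoded by hand. Here is a minimal C sketch, assuming the Armv8-A ESR_ELx layout (EC in bits [31:26], IL in bit 25) and the SError ISS encoding with the RAS extension (IDS in bit 24, AET in bits [12:10], DFSC in bits [5:0]). All function names are hypothetical and not taken from TF-A.

```c
#include <stdint.h>
#include <stdio.h>

#define EC_SMC_AARCH64  0x17u  /* SMC instruction execution in AArch64 state */
#define EC_SERROR       0x2fu  /* SError interrupt */
#define DFSC_ASYNC_SERR 0x11u  /* Asynchronous SError interrupt */

/* Generic ESR_ELx fields. */
unsigned esr_ec(uint64_t esr) { return (unsigned)((esr >> 26) & 0x3f); }
unsigned esr_il(uint64_t esr) { return (unsigned)((esr >> 25) & 0x1); }

/* SError-specific ISS fields (only meaningful when EC == EC_SERROR). */
unsigned esr_serror_ids(uint64_t esr)  { return (unsigned)((esr >> 24) & 0x1); }
unsigned esr_serror_aet(uint64_t esr)  { return (unsigned)((esr >> 10) & 0x7); }
unsigned esr_serror_dfsc(uint64_t esr) { return (unsigned)(esr & 0x3f); }

/* AET names; the field is valid only when DFSC == DFSC_ASYNC_SERR. */
const char *aet_name(unsigned aet)
{
    switch (aet) {
    case 0x0: return "Uncontainable (UC)";
    case 0x1: return "Unrecoverable state (UEU)";
    case 0x2: return "Restartable state (UEO)";
    case 0x3: return "Recoverable state (UER)";
    case 0x6: return "Corrected (CE)";
    default:  return "Reserved";
    }
}

/* Print a short decode of an SError syndrome value. */
void print_serror_esr(uint64_t esr)
{
    printf("esr=%llx ec=%x il=%u ids=%u dfsc=%x aet=%s\n",
           (unsigned long long)esr, esr_ec(esr), esr_il(esr),
           esr_serror_ids(esr), esr_serror_dfsc(esr),
           aet_name(esr_serror_aet(esr)));
}
```

With this decode, 0xbe000411 from the log above has EC 0x2f, DFSC 0x11 and AET 1 (Unrecoverable state), while 0xbe000011 from the "Unhandled Exception in EL3" dump in the original report has AET 0 (Uncontainable).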


>
> --- a/bl31/aarch64/runtime_exceptions.S
> +++ b/bl31/aarch64/runtime_exceptions.S
> @@ -402,11 +402,8 @@ end_vector_entry fiq_aarch32
>  vector_entry serror_aarch32
>         save_x30
>         apply_at_speculative_wa
> -#if RAS_EXTENSION
> +       esb
>         msr     daifclr, #DAIF_ABT_BIT
> -#else
> -       check_and_unmask_ea
> -#endif
>         b       handle_lower_el_async_ea
>
>
> Thanks
> Manish
> ________________________________
> From: Ming Huang <huangming@linux.alibaba.com>
> Sent: 08 March 2023 14:22
> To: Jeenu Viswambharan <Jeenu.Viswambharan@arm.com>
> Cc: Manish Pandey2 <Manish.Pandey2@arm.com>; tf-a@lists.trustedfirmware.org <tf-a@lists.trustedfirmware.org>
> Subject: About Unmask the SError in serror_aarch64
>
> Hi Jeenu,
>
>  vector_entry serror_aarch64
> +       msr     daifclr, #DAIF_ABT_BIT
>
> Why must SError be unmasked in serror_aarch64?
>
> We found that unmasking SError leads to a TF-A panic with the output "Unhandled Exception in EL3".
> --------------------------------------------------------------------------------------log:
> Detected DPC, skip AER
>   core[64] mm(925) return: 2
> [  109.883159] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
> [  109.891195] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
> [  109.896776] igb 0000:b1:00.0 ens51f0: PCIe link lost
>
> [root@localhost.localdomain /root/hm]
> #[  110.068410] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
> [  110.076988] igb 0000:b1:00.0: enabling device (0000 -> 0002)
> Unhandled Exception in EL3.
> x30            = 0x00000000ff013b84
> x0             = 0x0000000000000000
> x1             = 0xffff800011e7500c
> x2             = 0x0000000000000000
> x3             = 0xffff800011e75004
> x4             = 0xffff800011a82000
> x5             = 0x000000000000000c
> x6             = 0x00000000000000fb
> x7             = 0xffff800011441e80
> x8             = 0x0000000000000000
> x9             = 0xffff800010655270
> x10            = 0x00000000ffff8000
> x11            = 0xffff800011701e80
> x12            = 0x0000000000000001
> x13            = 0xffff800010c08cc0
> x14            = 0x0000000000000c80
> x15            = 0x0000000000000001
> x16            = 0x67692070552f6e77
> x17            = 0x7228206465726f6e
> x18            = 0x0000000000000030
> x19            = 0xffff04000032de80
> x20            = 0xffff04000032dea0
> x21            = 0xffff040002c2f000
> x22            = 0xffff000809280000
> x23            = 0xffff00080a217800
> x24            = 0xffff800011853f80
> x25            = 0xffff040002c2e0c8
> x26            = 0xffff800011923de8
> x27            = 0xffff000817fc8740
> x28            = 0x0000000000000000
> x29            = 0xffff80001e9ebb00
> scr_el3        = 0x000000000403073d
> sctlr_el3      = 0x0000000030cd183f
> cptr_el3       = 0x0000000000000100
> tcr_el3        = 0x0000000080843514
> daif           = 0x00000000000003c0
> mair_el3       = 0x00000000004404ff
> spsr_el3       = 0x00000000624002cd
> elr_el3        = 0x00000000ff013d84
> ttbr0_el3      = 0x00000000ff093001
> esr_el3        = 0x00000000be000011
> far_el3        = 0x7abce97e90b5fee1
> spsr_el1       = 0x0000000000000000
> elr_el1        = 0x0000000000000000
> spsr_abt       = 0x0000000000000000
> spsr_und       = 0x0000000000000000
> spsr_irq       = 0x0000000000000000
> spsr_fiq       = 0x0000000000000000
> sctlr_el1      = 0x0000000030d00800
> actlr_el1      = 0x0000000000000000
> cpacr_el1      = 0x0000000000000000
> csselr_el1     = 0x0000000000000002
> sp_el1         = 0x0000000000000000
> esr_el1        = 0x0000000000000000
> ttbr0_el1      = 0x0000000000000000
> ttbr1_el1      = 0x0000000000000000
> mair_el1       = 0x0000000000000000
> amair_el1      = 0x0000000000000000
> tcr_el1        = 0x0000000000000000
> tpidr_el1      = 0xffff800f6e6f1000
> tpidr_el0      = 0x00000000f6e1ede0
> tpidrro_el0    = 0x0000000000000000
> par_el1        = 0xff000000f4214b80
> mpidr_el1      = 0x0000000081000000
> afsr0_el1      = 0x0000000000000000
> afsr1_el1      = 0x0000000000000000
> contextidr_el1 = 0x0000000000000000
> vbar_el1       = 0x0000000000000000
> cntp_ctl_el0   = 0x0000000000000000
> cntp_cval_el0  = 0x000000010b938e04
> cntv_ctl_el0   = 0x0000000000000000
> cntv_cval_el0  = 0x0000000000000000
> cntkctl_el1    = 0x0000000000000000
> sp_el0         = 0xffff000809280000
> isr_el1        = 0x0000000000000040
> cpupwrctlr_el1 = 0x0000000000000000
> --------------------------------------------------------------------------------------
>
> With the above line removed (SError left masked), TF-A continues executing and produces some useful output.
> --------------------------------------------------------------------------------------log:
> Detected DPC, skip AER
>   core[64] mm(925) return: 2
> [  278.193642] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
> [  278.201680] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
> [  278.207262] igb 0000:b1:00.0 ens51f0: PCIe link lost
>
> [root@localhost.localdomain /root/hm]
> #[  278.378416] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
> [  278.386993] igb 0000:b1:00.0: enabling device (0000 -> 0002)
> ERROR:   Excepton received on 0x81000000, spsr_el3:82401009,reason:0 esr_el3:0xbe000411
> ue_cnt:0x0
>     Exception Class = 2f: SError interrupt.
> Print cpu register:
>   |-elr_el3: ffff8000106550f8
>   |-far_el3: 7abce97e92b57fe1
>   |-scr_el3: 403073d
>   |-sctlr_el3: 30cd183f
>   |-LR: ffff800010655270
>   |-SP: ffff0008091d1200
>   |-x0: 0000000000000000  x1: ffff800011e7500c
>   |-x2: 0000000000000000  x3: ffff800011e75004
>   |-x4: ffff800011a82000  x5: 000000000000000c
>   |-x6: 00000000000000fb  x7: ffff800011441e80
>   |-x8: 0000000000000000  x9: ffff800010655270
>   |-x10: 00000000ffff8000  x11: ffff800011701e80
>   |-x12: 0000000000000001  x13: ffff800010c08cc0
>   |-x14: 0000000000000c80  x15: 0000000000000001
>   |-x16: 67692070552f6e77  x17: 7228206465726f6e
>   |-x18: 0000000000000030  x19: ffff040005a7e100
>   |-x20: ffff040005a7e120  x21: ffff040001e9f000
>   |-x22: ffff0008091d1200  x23: ffff000809cdf800
>   |-x24: ffff800011853f80  x25: ffff040001e9e0c8
>   |-x26: ffff800011923de8  x27: ffff000807549140
>   |-x28: 0000000000000000  x29: ffff80001cc9bb00
> INFO:    mpidr:81000000, stop s-wtd.
> Return to lower EL by SDEI
> ->Core[0](0x81000000) received intr=0(exception), cnt=0x1
> RTC: 2023-03-08 10:15:38
> ERROR:   Excepton received on 0x81000000, spsr_el3:620003c5,reason:2 esr_el3:0x80000011
> ue_cnt:0x0
>     Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
> Print cpu register:
>   |-elr_el3: ff01a430
>   |-far_el3: 7abce97e92b57fe1
>   |-scr_el3: 4000e3c
>   |-sctlr_el3: 30cd183f
>   |-LR: ff208570
>   |-SP: ff631d80
>   |-x0: 00000000c4000061  x1: 0000000000000000
>   |-x2: 0000000000000000  x3: 0000000000000000
>   |-x4: 0000000000000000  x5: 0000000000000000
>   |-x6: 0000000000000000  x7: 0000000000000000
>   |-x8: 0000000000000000  x9: 0000000000000002
>   |-x10: 0000000000000002  x11: 0000000000000000
>   |-x12: 0000000000000002  x13: 0000000000000002
>   |-x14: 0000000000000001  x15: 00000000000000ff
>   |-x16: 00000000ffa97650  x17: 00000000000000f8
>   |-x18: 0000000000000000  x19: 0000000000000001
>   |-x20: 0000000000000000  x21: 00000000ff20c240
>   |-x22: 00000000ff20c25a  x23: 8000000000000009
>   |-x24: 0000000076726473  x25: 00000000ff20d1e0
>   |-x26: 00000000ffbdcc10  x27: 00000000ff631ef8
>   |-x28: 00000000ffbff328  x29: 00000000ff631d80
> INFO:    mpidr:81000000, stop s-wtd.
> Have report fatal
>     Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
> Print cpu register:
> --------------------------------------------------------------------------------------
>
> Regards,
> Ming
>