Hi Ming,

IMO, what's happening is this: an SError is caused by a lower EL and traps to EL3 (RAS_EXTENSION means SCR_EL3.EA=1), entering the "serror_aarch64" vector entry. Once SError is unmasked at EL3, another SError is taken at EL3 itself, and execution ends up in the weak implementation of "plat_handle_el3_ea", which ends up in report_unhandled_exception.
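
As a rough cross-check of this reading, the register values from the two dumps quoted below can be decoded by hand. Here is a minimal Python sketch of that decoding. The helper names are hypothetical and are not TF-A code; the bit positions are the architectural ESR_ELx, SPSR_ELx (AArch64 state) and SCR_EL3 layouts.

```python
# Hypothetical helpers for illustration only; nothing here is part of TF-A.

def esr_fields(esr):
    """Split an ESR_ELx value into EC [31:26], IL [25] and ISS [24:0]."""
    return {
        "ec": (esr >> 26) & 0x3F,
        "il": (esr >> 25) & 0x1,
        "iss": esr & 0x1FFFFFF,
    }

def spsr_source_el(spsr):
    """Exception level (0-3) the exception was taken from (AArch64 SPSR)."""
    if (spsr >> 4) & 0x1:
        raise ValueError("SPSR describes AArch32 state")
    return (spsr >> 2) & 0x3

def spsr_serror_masked(spsr):
    """True if PSTATE.A (bit 8) was set when the exception was taken."""
    return bool((spsr >> 8) & 0x1)

def scr_ea_routed_to_el3(scr):
    """True if SCR_EL3.EA (bit 3) is set."""
    return bool((scr >> 3) & 0x1)

# Values copied from the first dump ("Unhandled Exception in EL3").
first = {"esr": 0xBE000011, "spsr": 0x624002CD, "scr": 0x0403073D}
# Values copied from the first exception in the second dump.
second = {"esr": 0xBE000411, "spsr": 0x82401009}

for name, regs in (("first", first), ("second", second)):
    ec = esr_fields(regs["esr"])["ec"]
    print(f"{name}: EC={ec:#x}, taken from EL{spsr_source_el(regs['spsr'])}, "
          f"SError masked at that point: {spsr_serror_masked(regs['spsr'])}")
```

If I read the fields correctly, the first dump shows EC 0x2f (SError) taken from EL3 with PSTATE.A clear and SCR_EL3.EA set, while the SError in the second dump was taken from EL2. That seems consistent with the explanation above, but please treat it as a sanity check rather than proof.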

The reason to unmask SError is to catch any SError caused by EL3 itself while it is handling the lower EL SError. I think having an error synchronization barrier (esb) may prevent SError from trapping multiple times.
Just to let you know, I am actively working on RAS refactoring in TF-A (and fixing bugs). I will add you to those patches.

To fix this particular issue, I am planning the change below. Would you please test whether it works for you? There is no need to unmask EAs if we are already handling an EA (unlike other exceptions).

--- a/bl31/aarch64/runtime_exceptions.S
+++ b/bl31/aarch64/runtime_exceptions.S
@@ -402,11 +402,8 @@ end_vector_entry fiq_aarch32
 vector_entry serror_aarch32
        save_x30
        apply_at_speculative_wa
-#if RAS_EXTENSION
+       esb
        msr     daifclr, #DAIF_ABT_BIT
-#else
-       check_and_unmask_ea
-#endif
        b       handle_lower_el_async_ea


Thanks
Manish

From: Ming Huang <huangming@linux.alibaba.com>
Sent: 08 March 2023 14:22
To: Jeenu Viswambharan <Jeenu.Viswambharan@arm.com>
Cc: Manish Pandey2 <Manish.Pandey2@arm.com>; tf-a@lists.trustedfirmware.org <tf-a@lists.trustedfirmware.org>
Subject: About Unmask the SError in serror_aarch64
 
Hi Jeenu,

 vector_entry serror_aarch64
+       msr     daifclr, #DAIF_ABT_BIT

Why must SError be unmasked in serror_aarch64?

We found that unmasking SError leads to a TF-A panic with the output "Unhandled Exception in EL3".
--------------------------------------------------------------------------------------log:
Detected DPC, skip AER
  core[64] mm(925) return: 2
[  109.883159] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
[  109.891195] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
[  109.896776] igb 0000:b1:00.0 ens51f0: PCIe link lost

[root@localhost.localdomain /root/hm]
#[  110.068410] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
[  110.076988] igb 0000:b1:00.0: enabling device (0000 -> 0002)
Unhandled Exception in EL3.
x30            = 0x00000000ff013b84
x0             = 0x0000000000000000
x1             = 0xffff800011e7500c
x2             = 0x0000000000000000
x3             = 0xffff800011e75004
x4             = 0xffff800011a82000
x5             = 0x000000000000000c
x6             = 0x00000000000000fb
x7             = 0xffff800011441e80
x8             = 0x0000000000000000
x9             = 0xffff800010655270
x10            = 0x00000000ffff8000
x11            = 0xffff800011701e80
x12            = 0x0000000000000001
x13            = 0xffff800010c08cc0
x14            = 0x0000000000000c80
x15            = 0x0000000000000001
x16            = 0x67692070552f6e77
x17            = 0x7228206465726f6e
x18            = 0x0000000000000030
x19            = 0xffff04000032de80
x20            = 0xffff04000032dea0
x21            = 0xffff040002c2f000
x22            = 0xffff000809280000
x23            = 0xffff00080a217800
x24            = 0xffff800011853f80
x25            = 0xffff040002c2e0c8
x26            = 0xffff800011923de8
x27            = 0xffff000817fc8740
x28            = 0x0000000000000000
x29            = 0xffff80001e9ebb00
scr_el3        = 0x000000000403073d
sctlr_el3      = 0x0000000030cd183f
cptr_el3       = 0x0000000000000100
tcr_el3        = 0x0000000080843514
daif           = 0x00000000000003c0
mair_el3       = 0x00000000004404ff
spsr_el3       = 0x00000000624002cd
elr_el3        = 0x00000000ff013d84
ttbr0_el3      = 0x00000000ff093001
esr_el3        = 0x00000000be000011
far_el3        = 0x7abce97e90b5fee1
spsr_el1       = 0x0000000000000000
elr_el1        = 0x0000000000000000
spsr_abt       = 0x0000000000000000
spsr_und       = 0x0000000000000000
spsr_irq       = 0x0000000000000000
spsr_fiq       = 0x0000000000000000
sctlr_el1      = 0x0000000030d00800
actlr_el1      = 0x0000000000000000
cpacr_el1      = 0x0000000000000000
csselr_el1     = 0x0000000000000002
sp_el1         = 0x0000000000000000
esr_el1        = 0x0000000000000000
ttbr0_el1      = 0x0000000000000000
ttbr1_el1      = 0x0000000000000000
mair_el1       = 0x0000000000000000
amair_el1      = 0x0000000000000000
tcr_el1        = 0x0000000000000000
tpidr_el1      = 0xffff800f6e6f1000
tpidr_el0      = 0x00000000f6e1ede0
tpidrro_el0    = 0x0000000000000000
par_el1        = 0xff000000f4214b80
mpidr_el1      = 0x0000000081000000
afsr0_el1      = 0x0000000000000000
afsr1_el1      = 0x0000000000000000
contextidr_el1 = 0x0000000000000000
vbar_el1       = 0x0000000000000000
cntp_ctl_el0   = 0x0000000000000000
cntp_cval_el0  = 0x000000010b938e04
cntv_ctl_el0   = 0x0000000000000000
cntv_cval_el0  = 0x0000000000000000
cntkctl_el1    = 0x0000000000000000
sp_el0         = 0xffff000809280000
isr_el1        = 0x0000000000000040
cpupwrctlr_el1 = 0x0000000000000000
--------------------------------------------------------------------------------------

After removing the above line (so that SError stays masked), TF-A continues executing and prints some useful output.
--------------------------------------------------------------------------------------log:
Detected DPC, skip AER
  core[64] mm(925) return: 2
[  278.193642] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
[  278.201680] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
[  278.207262] igb 0000:b1:00.0 ens51f0: PCIe link lost

[root@localhost.localdomain /root/hm]
#[  278.378416] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
[  278.386993] igb 0000:b1:00.0: enabling device (0000 -> 0002)
ERROR:   Excepton received on 0x81000000, spsr_el3:82401009,reason:0 esr_el3:0xbe000411
ue_cnt:0x0
    Exception Class = 2f: SError interrupt.
Print cpu register:
  |-elr_el3: ffff8000106550f8
  |-far_el3: 7abce97e92b57fe1
  |-scr_el3: 403073d
  |-sctlr_el3: 30cd183f
  |-LR: ffff800010655270
  |-SP: ffff0008091d1200
  |-x0: 0000000000000000  x1: ffff800011e7500c
  |-x2: 0000000000000000  x3: ffff800011e75004
  |-x4: ffff800011a82000  x5: 000000000000000c
  |-x6: 00000000000000fb  x7: ffff800011441e80
  |-x8: 0000000000000000  x9: ffff800010655270
  |-x10: 00000000ffff8000  x11: ffff800011701e80
  |-x12: 0000000000000001  x13: ffff800010c08cc0
  |-x14: 0000000000000c80  x15: 0000000000000001
  |-x16: 67692070552f6e77  x17: 7228206465726f6e
  |-x18: 0000000000000030  x19: ffff040005a7e100
  |-x20: ffff040005a7e120  x21: ffff040001e9f000
  |-x22: ffff0008091d1200  x23: ffff000809cdf800
  |-x24: ffff800011853f80  x25: ffff040001e9e0c8
  |-x26: ffff800011923de8  x27: ffff000807549140
  |-x28: 0000000000000000  x29: ffff80001cc9bb00
INFO:    mpidr:81000000, stop s-wtd.
Return to lower EL by SDEI
->Core[0](0x81000000) received intr=0(exception), cnt=0x1
RTC: 2023-03-08 10:15:38
ERROR:   Excepton received on 0x81000000, spsr_el3:620003c5,reason:2 esr_el3:0x80000011
ue_cnt:0x0
    Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
  |-elr_el3: ff01a430
  |-far_el3: 7abce97e92b57fe1
  |-scr_el3: 4000e3c
  |-sctlr_el3: 30cd183f
  |-LR: ff208570
  |-SP: ff631d80
  |-x0: 00000000c4000061  x1: 0000000000000000
  |-x2: 0000000000000000  x3: 0000000000000000
  |-x4: 0000000000000000  x5: 0000000000000000
  |-x6: 0000000000000000  x7: 0000000000000000
  |-x8: 0000000000000000  x9: 0000000000000002
  |-x10: 0000000000000002  x11: 0000000000000000
  |-x12: 0000000000000002  x13: 0000000000000002
  |-x14: 0000000000000001  x15: 00000000000000ff
  |-x16: 00000000ffa97650  x17: 00000000000000f8
  |-x18: 0000000000000000  x19: 0000000000000001
  |-x20: 0000000000000000  x21: 00000000ff20c240
  |-x22: 00000000ff20c25a  x23: 8000000000000009
  |-x24: 0000000076726473  x25: 00000000ff20d1e0
  |-x26: 00000000ffbdcc10  x27: 00000000ff631ef8
  |-x28: 00000000ffbff328  x29: 00000000ff631d80
INFO:    mpidr:81000000, stop s-wtd.
Have report fatal
    Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
--------------------------------------------------------------------------------------

Regards,
Ming