Hi Manish,
Please see my comments below.
On 3/9/23 7:45 PM, Manish Pandey2 wrote:
> Hi Ming,
>
> IMO, what's happening is that the SError was caused by a lower EL and trapped to EL3 (RAS_EXTENSION means SCR_EL3.EA=1), so it goes to the "serror_aarch64" vector entry. After SError is unmasked at EL3, another SError is taken at EL3 and ends up in the weak implementation of "plat_handle_el3_ea" (which ends in report_unhandled_exception).
>
> The reason for unmasking SError is to catch any SError caused by EL3 while it handles the lower-EL SError. I think a synchronization barrier (esb) may prevent the SError from trapping multiple times.
> Just to let you know, I am actively working on RAS refactoring in TF-A (and fixing bugs). I will add you to those patches.
>
> To fix this particular issue, I am planning the change below; would you please test whether it works for you? There is no need to unmask EAs if we are already handling an EA (unlike other exceptions).
The result is similar to leaving EA masked (that is, removing "msr daifclr, #DAIF_ABT_BIT").
--------------log:
[ 48.944653] pcieport 0000:b0:00.0: EDR: EDR event received
Detected DPC, skip AER
core[64] mm(925) return: 2
--handler(925) end
[ 48.959091] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
[ 48.967129] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
[ 48.972710] igb 0000:b1:00.0 ens51f0: PCIe link lost
[root@localhost.localdomain /root/hm]
#[ 49.141737] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
[ 49.150315] igb 0000:b1:00.0: enabling device (0000 -> 0002)
ERROR: Excepton received on 0x81000000, spsr_el3:82401009,reason:0 esr_el3:0xbe000411
ue_cnt:0x0
esr_el3 = be000411
Exception Class = 2f: SError interrupt.
Print cpu register:
|-elr_el3: ffff80001063a188
|-far_el3: 7abce97690b57fe1
|-scr_el3: 403073d
|-sctlr_el3: 30cd183f
|-LR: ffff80001063a2f0
|-SP: ffff000807388000
|-x0: 0000000000000000 x1: ffff800011ed500c
|-x2: 0000000000000000 x3: ffff800011ed5004
|-x4: ffff800011b73000 x5: 000000000000000c
|-x6: ffff800010c1d3f8 x7: 0000000000000000
|-x8: 00000000000bffe8 x9: ffff80001063a2f0
|-x10: ffff800011792338 x11: 0000000000000003
|-x12: ffff800011852378 x13: 0000000000000001
|-x14: 0000000000000c80 x15: 0000000000aaaaaa
|-x16: 0000000000000000 x17: 7228206465726f6e
|-x18: 0000000000000010 x19: ffff04000401a080
|-x20: ffff800010f522b8 x21: ffff04000401a0a0
|-x22: ffff00080b4e2370 x23: ffff00080b4e1fe8
|-x24: 00000000f4b1e03c x25: ffff00080b4e10c8
|-x26: ffff8000114b1880 x27: ffff00080737f5c0
|-x28: ffff0008072d3400 x29: ffff8000121cbab0
INFO: mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR: Handle Exception on 0x81000000 from EL2, reason=0 esr_el3=0xbe000411
INFO: spsr:82401009 elr:ffff80001063a188 far:7abce97690b57fe1
INFO: SCR: 403073d
|-ns: 1
|-eel2: 0
INFO: HCR: 488000000
|-teg: 1
|-amo: 0
|-tea: 0
INFO: Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+380.
ERROR: Excepton received on 0x81000000, spsr_el3:600001c9,reason:2 esr_el3:0x80000011
ue_cnt:0x0
esr_el3 = 5e000000
Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
|-elr_el3: ffff800010027aac
|-far_el3: 7abce97690b57fe1
|-scr_el3: 403073d
|-sctlr_el3: 30cd183f
|-LR: ffff8000108d569c
|-SP: ffff000807388000
|-x0: 00000000c400002b x1: 0000000000000000
|-x2: 0000000000000000 x3: 0000000000000000
|-x4: 0000000000000000 x5: 0000000000000000
|-x6: 0000000000000000 x7: 0000000000000000
|-x8: ffff8000121cb710 x9: ffff8000108d6078
|-x10: ffff8000121cb750 x11: 0000000000000000
|-x12: 61646e6f63657320 x13: 0a73555043207972
|-x14: 0000000000000000 x15: 0000000000000000
|-x16: 0000000000000000 x17: 0000000000000000
|-x18: 0000000000000060 x19: ffff800010f522b8
|-x20: 00000000000f423d x21: ffff800011996a68
|-x22: ffff8000114a0c08 x23: ffff8000114a0000
|-x24: ffff8000119b7000 x25: ffff00080b4e10c8
|-x26: ffff8000114b1880 x27: ffff00080737f5c0
|-x28: ffff000807388000 x29: ffff8000121cb6e0
INFO: mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR: Handle Exception on 0x81000000 from EL2, reason=2 esr_el3=0x80000011
INFO: spsr:600001c9 elr:ffff800010027aac far:7abce97690b57fe1
INFO: SCR: 403073d
|-ns: 1
|-eel2: 0
INFO: HCR: 488000000
|-teg: 1
|-amo: 0
|-tea: 0
INFO: Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+200.
ERROR: Excepton received on 0x81000000, spsr_el3:800003c9,reason:2 esr_el3:0x80000011
ue_cnt:0x0
esr_el3 = 5e000000
Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
|-elr_el3: ffff800010027aac
|-far_el3: 7abce97690b57fe1
|-scr_el3: 403073d
|-sctlr_el3: 30cd183f
|-LR: ffff8000108d569c
|-SP: ffff000807388000
|-x0: 00000000c4000038 x1: 000000000000001d
|-x2: 0000000000000000 x3: 0000000000000000
|-x4: 0000000000000000 x5: 0000000000000000
|-x6: 0000000000000000 x7: 0000000000000000
|-x8: 303d294b53414d43 x9: ffff8000108d5f28
|-x10: 3736313d454d4954 x11: 0a37333630323438
|-x12: 78303d294b53414d x13: 5448534152430a30
|-x14: 4c454e52454b0a30 x15: 303d54455346464f
|-x16: 352d50314355312d x17: 664d427369412f42
|-x18: 0000000000000001 x19: ffff800010f522b8
|-x20: 0000000000000000 x21: 0000000000000000
|-x22: ffff800011acd000 x23: ffff8000121cb2e8
|-x24: ffff8000119b7000 x25: ffff00080b4e10c8
|-x26: ffff8000114b1880 x27: ffff00080737f5c0
|-x28: ffff000807388000 x29: ffff8000121cb200
INFO: mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR: Handle Exception on 0x81000000 from EL2, reason=2 esr_el3=0x80000011
INFO: spsr:800003c9 elr:ffff800010027aac far:7abce97690b57fe1
INFO: SCR: 403073d
|-ns: 1
|-eel2: 0
INFO: HCR: 488000000
|-teg: 1
|-amo: 0
|-tea: 0
INFO: Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+200.
INFO: clear eoi id:1d
[ 49.473997] SError Interrupt on CPU0, code 0xbe000411 -- SError
[ 49.473998] CPU: 0 PID: 15 Comm: kworker/0:1 Kdump: loaded Tainted: G E 5.10.134-10.git.819992ee1.an8.aarch64 #1
[ 49.473998] Hardware name: AisPManu AliServer-Xuanwu2.0AM-02-1UC1P-5B/AisBMf, BIOS 1.2.M1.AL.P.139.00 03/10/2023
[ 49.473998] Workqueue: kacpi_notify acpi_os_execute_deferred
[ 49.474000] pstate: 82401009 (Nzcv daif +PAN -UAO +TCO BTYPE=--)
[ 49.474000] pc : __pci_write_msi_msg+0x108/0x210
[ 49.474000] lr : default_restore_msi_irq+0x60/0x88
[ 49.474001] sp : ffff8000121cbab0
[ 49.474001] x29: ffff8000121cbab0 x28: ffff0008072d3400
[ 49.474001] x27: ffff00080737f5c0 x26: ffff8000114b1880
[ 49.474002] x25: ffff00080b4e10c8 x24: 00000000f4b1e03c
[ 49.474003] x23: ffff00080b4e1fe8 x22: ffff00080b4e2370
[ 49.474003] x21: ffff04000401a0a0 x20: ffff800010f522b8
[ 49.474004] x19: ffff04000401a080 x18: 0000000000000010
[ 49.474004] x17: 7228206465726f6e x16: 0000000000000000
[ 49.474005] x15: 0000000000aaaaaa x14: 0000000000000c80
[ 49.474005] x13: 0000000000000001 x12: ffff800011852378
[ 49.474006] x11: 0000000000000003 x10: ffff800011792338
[ 49.474006] x9 : ffff80001063a2f0 x8 : 00000000000bffe8
[ 49.474007] x7 : 0000000000000000 x6 : ffff800010c1d3f8
[ 49.474007] x5 : 000000000000000c x4 : ffff800011b73000
[ 49.474008] x3 : ffff800011ed5004 x2 : 0000000000000000
[ 49.474008] x1 : ffff800011ed500c x0 : 0000000000000000
[ 49.474009] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 49.474010] CPU: 0 PID: 15 Comm: kworker/0:1 Kdump: loaded Tainted: G E 5.10.134-10.git.819992ee1.an8.aarch64 #1
[ 49.474010] Hardware name: AisPManu AliServer-Xuanwu2.0AM-02-1UC1P-5B/AisBMf, BIOS 1.2.M1.AL.P.139.00 03/10/2023
[ 49.474010] Workqueue: kacpi_notify acpi_os_execute_deferred
[ 49.474011] Call trace:
[ 49.474011] dump_backtrace+0x0/0x200
[ 49.474011] show_stack+0x1c/0x28
[ 49.474011] dump_stack+0xd4/0x128
[ 49.474012] panic+0x178/0x394
[ 49.474012] nmi_panic+0x6c/0xa0
[ 49.474012] arm64_serror_panic+0x84/0x90
[ 49.474012] do_serror+0x38/0x60
[ 49.474014] el1_error+0x7c/0xf4
[ 49.474014] __pci_write_msi_msg+0x108/0x210
[ 49.474015] default_restore_msi_irq+0x60/0x88
[ 49.474015] arch_restore_msi_irqs+0x3c/0x58
[ 49.474015] pci_restore_msi_state+0x8c/0x210
[ 49.474015] pci_restore_state.part.42+0x34c/0x4a8
[ 49.474016] pci_restore_state+0x20/0x28
[ 49.474016] igb_io_slot_reset+0x3c/0xc8 [igb]
[ 49.474016] report_slot_reset+0x4c/0x98
[ 49.474016] pci_walk_bus+0x68/0xc0
[ 49.474017] pci_walk_bridge+0x20/0x40
[ 49.474017] pcie_do_recovery+0x1f0/0x290
[ 49.474017] edr_handle_event+0x1f0/0x270
[ 49.474017] acpi_ev_notify_dispatch+0x60/0x70
[ 49.474018] acpi_os_execute_deferred+0x20/0x38
[ 49.474018] process_one_work+0x1b8/0x420
[ 49.474018] worker_thread+0x158/0x510
[ 49.474018] kthread+0x114/0x118
[ 49.474019] SMP: stopping secondary CPUs
-----------------log end
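
For reference, the esr_el3 values in the log above can be decoded by hand. The following is a minimal Python sketch, not part of TF-A: the helper name decode_esr and the small EC_NAMES table are made up for illustration, and the field layout assumed is the architectural ESR_ELx one (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]).

```python
# Hypothetical helper for illustration only; it is not a TF-A function.
# Assumed ESR_ELx layout: EC = bits [31:26], IL = bit 25, ISS = bits [24:0].

EC_NAMES = {
    0x17: "SMC instruction execution in AArch64 state",
    0x20: "Instruction Abort from a lower Exception level",
    0x2F: "SError interrupt",
}


def decode_esr(esr):
    """Split a 32-bit ESR value into (EC, IL, ISS)."""
    ec = (esr >> 26) & 0x3F   # exception class
    il = (esr >> 25) & 0x1    # instruction length bit
    iss = esr & 0x1FFFFFF     # instruction specific syndrome
    return ec, il, iss


if __name__ == "__main__":
    # Values copied from the log in this mail.
    for esr in (0xBE000411, 0x5E000000, 0x80000011):
        ec, il, iss = decode_esr(esr)
        name = EC_NAMES.get(ec, "not in this small table")
        print(f"esr={esr:#010x} EC={ec:#04x} ({name}) IL={il} ISS={iss:#x}")
```

With these inputs, 0xbe000411 decodes to EC 0x2f and 0x5e000000 to EC 0x17, which matches the "Exception Class" lines printed by the firmware.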
>
> --- a/bl31/aarch64/runtime_exceptions.S
> +++ b/bl31/aarch64/runtime_exceptions.S
> @@ -402,11 +402,8 @@ end_vector_entry fiq_aarch32
> vector_entry serror_aarch32
> save_x30
> apply_at_speculative_wa
> -#if RAS_EXTENSION
> + esb
> msr daifclr, #DAIF_ABT_BIT
> -#else
> - check_and_unmask_ea
> -#endif
> b handle_lower_el_async_ea
>
>
> Thanks
> Manish
> ________________________________
> From: Ming Huang <huangming@linux.alibaba.com>
> Sent: 08 March 2023 14:22
> To: Jeenu Viswambharan <Jeenu.Viswambharan@arm.com>
> Cc: Manish Pandey2 <Manish.Pandey2@arm.com>; tf-a@lists.trustedfirmware.org <tf-a@lists.trustedfirmware.org>
> Subject: About Unmask the SError in serror_aarch64
>
> Hi Jeenu,
>
> vector_entry serror_aarch64
> + msr daifclr, #DAIF_ABT_BIT
>
> Why must SError be unmasked in serror_aarch64?
>
> We found that unmasking SError leads to a TF-A panic with the output "Unhandled Exception in EL3".
> --------------------------------------------------------------------------------------log:
> Detected DPC, skip AER
> core[64] mm(925) return: 2
> [ 109.883159] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
> [ 109.891195] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
> [ 109.896776] igb 0000:b1:00.0 ens51f0: PCIe link lost
>
> [root@localhost.localdomain /root/hm]
> #[ 110.068410] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
> [ 110.076988] igb 0000:b1:00.0: enabling device (0000 -> 0002)
> Unhandled Exception in EL3.
> x30 = 0x00000000ff013b84
> x0 = 0x0000000000000000
> x1 = 0xffff800011e7500c
> x2 = 0x0000000000000000
> x3 = 0xffff800011e75004
> x4 = 0xffff800011a82000
> x5 = 0x000000000000000c
> x6 = 0x00000000000000fb
> x7 = 0xffff800011441e80
> x8 = 0x0000000000000000
> x9 = 0xffff800010655270
> x10 = 0x00000000ffff8000
> x11 = 0xffff800011701e80
> x12 = 0x0000000000000001
> x13 = 0xffff800010c08cc0
> x14 = 0x0000000000000c80
> x15 = 0x0000000000000001
> x16 = 0x67692070552f6e77
> x17 = 0x7228206465726f6e
> x18 = 0x0000000000000030
> x19 = 0xffff04000032de80
> x20 = 0xffff04000032dea0
> x21 = 0xffff040002c2f000
> x22 = 0xffff000809280000
> x23 = 0xffff00080a217800
> x24 = 0xffff800011853f80
> x25 = 0xffff040002c2e0c8
> x26 = 0xffff800011923de8
> x27 = 0xffff000817fc8740
> x28 = 0x0000000000000000
> x29 = 0xffff80001e9ebb00
> scr_el3 = 0x000000000403073d
> sctlr_el3 = 0x0000000030cd183f
> cptr_el3 = 0x0000000000000100
> tcr_el3 = 0x0000000080843514
> daif = 0x00000000000003c0
> mair_el3 = 0x00000000004404ff
> spsr_el3 = 0x00000000624002cd
> elr_el3 = 0x00000000ff013d84
> ttbr0_el3 = 0x00000000ff093001
> esr_el3 = 0x00000000be000011
> far_el3 = 0x7abce97e90b5fee1
> spsr_el1 = 0x0000000000000000
> elr_el1 = 0x0000000000000000
> spsr_abt = 0x0000000000000000
> spsr_und = 0x0000000000000000
> spsr_irq = 0x0000000000000000
> spsr_fiq = 0x0000000000000000
> sctlr_el1 = 0x0000000030d00800
> actlr_el1 = 0x0000000000000000
> cpacr_el1 = 0x0000000000000000
> csselr_el1 = 0x0000000000000002
> sp_el1 = 0x0000000000000000
> esr_el1 = 0x0000000000000000
> ttbr0_el1 = 0x0000000000000000
> ttbr1_el1 = 0x0000000000000000
> mair_el1 = 0x0000000000000000
> amair_el1 = 0x0000000000000000
> tcr_el1 = 0x0000000000000000
> tpidr_el1 = 0xffff800f6e6f1000
> tpidr_el0 = 0x00000000f6e1ede0
> tpidrro_el0 = 0x0000000000000000
> par_el1 = 0xff000000f4214b80
> mpidr_el1 = 0x0000000081000000
> afsr0_el1 = 0x0000000000000000
> afsr1_el1 = 0x0000000000000000
> contextidr_el1 = 0x0000000000000000
> vbar_el1 = 0x0000000000000000
> cntp_ctl_el0 = 0x0000000000000000
> cntp_cval_el0 = 0x000000010b938e04
> cntv_ctl_el0 = 0x0000000000000000
> cntv_cval_el0 = 0x0000000000000000
> cntkctl_el1 = 0x0000000000000000
> sp_el0 = 0xffff000809280000
> isr_el1 = 0x0000000000000040
> cpupwrctlr_el1 = 0x0000000000000000
> --------------------------------------------------------------------------------------
>
> With the above line removed (SError left masked), TF-A continues executing and prints some useful output.
> --------------------------------------------------------------------------------------log:
> Detected DPC, skip AER
> core[64] mm(925) return: 2
> [ 278.193642] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
> [ 278.201680] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
> [ 278.207262] igb 0000:b1:00.0 ens51f0: PCIe link lost
>
> [root@localhost.localdomain /root/hm]
> #[ 278.378416] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
> [ 278.386993] igb 0000:b1:00.0: enabling device (0000 -> 0002)
> ERROR: Excepton received on 0x81000000, spsr_el3:82401009,reason:0 esr_el3:0xbe000411
> ue_cnt:0x0
> Exception Class = 2f: SError interrupt.
> Print cpu register:
> |-elr_el3: ffff8000106550f8
> |-far_el3: 7abce97e92b57fe1
> |-scr_el3: 403073d
> |-sctlr_el3: 30cd183f
> |-LR: ffff800010655270
> |-SP: ffff0008091d1200
> |-x0: 0000000000000000 x1: ffff800011e7500c
> |-x2: 0000000000000000 x3: ffff800011e75004
> |-x4: ffff800011a82000 x5: 000000000000000c
> |-x6: 00000000000000fb x7: ffff800011441e80
> |-x8: 0000000000000000 x9: ffff800010655270
> |-x10: 00000000ffff8000 x11: ffff800011701e80
> |-x12: 0000000000000001 x13: ffff800010c08cc0
> |-x14: 0000000000000c80 x15: 0000000000000001
> |-x16: 67692070552f6e77 x17: 7228206465726f6e
> |-x18: 0000000000000030 x19: ffff040005a7e100
> |-x20: ffff040005a7e120 x21: ffff040001e9f000
> |-x22: ffff0008091d1200 x23: ffff000809cdf800
> |-x24: ffff800011853f80 x25: ffff040001e9e0c8
> |-x26: ffff800011923de8 x27: ffff000807549140
> |-x28: 0000000000000000 x29: ffff80001cc9bb00
> INFO: mpidr:81000000, stop s-wtd.
> Return to lower EL by SDEI
> ->Core[0](0x81000000) received intr=0(exception), cnt=0x1
> RTC: 2023-03-08 10:15:38
> ERROR: Excepton received on 0x81000000, spsr_el3:620003c5,reason:2 esr_el3:0x80000011
> ue_cnt:0x0
> Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
> Print cpu register:
> |-elr_el3: ff01a430
> |-far_el3: 7abce97e92b57fe1
> |-scr_el3: 4000e3c
> |-sctlr_el3: 30cd183f
> |-LR: ff208570
> |-SP: ff631d80
> |-x0: 00000000c4000061 x1: 0000000000000000
> |-x2: 0000000000000000 x3: 0000000000000000
> |-x4: 0000000000000000 x5: 0000000000000000
> |-x6: 0000000000000000 x7: 0000000000000000
> |-x8: 0000000000000000 x9: 0000000000000002
> |-x10: 0000000000000002 x11: 0000000000000000
> |-x12: 0000000000000002 x13: 0000000000000002
> |-x14: 0000000000000001 x15: 00000000000000ff
> |-x16: 00000000ffa97650 x17: 00000000000000f8
> |-x18: 0000000000000000 x19: 0000000000000001
> |-x20: 0000000000000000 x21: 00000000ff20c240
> |-x22: 00000000ff20c25a x23: 8000000000000009
> |-x24: 0000000076726473 x25: 00000000ff20d1e0
> |-x26: 00000000ffbdcc10 x27: 00000000ff631ef8
> |-x28: 00000000ffbff328 x29: 00000000ff631d80
> INFO: mpidr:81000000, stop s-wtd.
> Have report fatal
> Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
> Print cpu register:
> --------------------------------------------------------------------------------------
>
> Regards,
> Ming
>
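
The SCR and HCR values in the logs can be cross-checked in the same way, for example to confirm that SCR_EL3.EA is set as described above. This is only a sketch: the helper name extract_fields and the two tables are hypothetical, and the bit positions are the ones I assume from the architecture (SCR_EL3: NS=0, IRQ=1, FIQ=2, EA=3, EEL2=18; HCR_EL2: AMO=5, TGE=27, TEA=37). It also assumes the "teg" field printed by the firmware corresponds to HCR_EL2.TGE.

```python
# Hypothetical helper for illustration only; it is not a TF-A function.
# Bit positions below are assumptions taken from the Arm architecture.

SCR_EL3_FIELDS = {"NS": 0, "IRQ": 1, "FIQ": 2, "EA": 3, "EEL2": 18}
HCR_EL2_FIELDS = {"AMO": 5, "TGE": 27, "TEA": 37}


def extract_fields(value, fields):
    """Return {field name: bit value} for each single-bit field."""
    return {name: (value >> bit) & 1 for name, bit in fields.items()}


if __name__ == "__main__":
    # Values copied from the logs in this thread.
    print("SCR 0x403073d:", extract_fields(0x403073D, SCR_EL3_FIELDS))
    print("HCR 0x488000000:", extract_fields(0x488000000, HCR_EL2_FIELDS))
```

For scr_el3 = 0x403073d this gives NS=1, EA=1 and EEL2=0, and for HCR = 0x488000000 it gives TGE=1, AMO=0 and TEA=0, consistent with the "ns", "eel2", "teg", "amo" and "tea" lines in the log.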
--
TF-A mailing list -- tf-a@lists.trustedfirmware.org
To unsubscribe send an email to tf-a-leave@lists.trustedfirmware.org