If I understood correctly, we clear the SCR_EL3.EA bit on exit from EL3 to a lower EL. So an SError in lower EL shouldn't trap to EL3. Did I miss something?
It's the platform's choice whether the lower EL handles the EA itself or lets EL3 handle it (SCR_EL3.EA controls this behaviour). It's controlled by the HANDLE_EA_EL3_FIRST macro in TF-A.
@Ming Huang <huangming@linux.alibaba.com>,
With your latest logs I can see that the SError is not trapping multiple times at EL3, hence you are not seeing "report_unhandled_exception".
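For anyone following along, here is a minimal sketch for checking the EA routing bit in a logged scr_el3 value. It assumes only the architectural bit positions (NS is bit 0, EA is bit 3); the helper name is made up for illustration and is not a TF-A function. The two input values are the scr_el3 values printed in the logs quoted further down this thread.

```python
# Illustrative helper (not part of TF-A): test SCR_EL3.EA in a logged value.
SCR_EL3_NS_BIT = 1 << 0  # Non-secure state bit
SCR_EL3_EA_BIT = 1 << 3  # External abort and SError routing to EL3


def scr_el3_routes_ea_to_el3(scr_el3: int) -> bool:
    """Return True when SCR_EL3.EA is set in the given register value."""
    return bool(scr_el3 & SCR_EL3_EA_BIT)


if __name__ == "__main__":
    # scr_el3 values copied from the quoted logs.
    for value in (0x403073D, 0x4000E3C):
        ns = value & SCR_EL3_NS_BIT
        ea = int(scr_el3_routes_ea_to_el3(value))
        print(f"scr_el3={value:#x} NS={ns} EA={ea}")
```

Both logged values have EA set, which is at least consistent with EAs being routed to EL3 on this platform.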
Thanks,
Manish
________________________________
From: Okash Khawaja <okash@google.com>
Sent: 21 March 2023 11:22
To: Ming Huang <huangming@linux.alibaba.com>
Cc: Manish Pandey2 <Manish.Pandey2@arm.com>; Jeenu Viswambharan <Jeenu.Viswambharan@arm.com>; tf-a@lists.trustedfirmware.org <tf-a@lists.trustedfirmware.org>; Bipin Ravi <Bipin.Ravi@arm.com>
Subject: Re: [TF-A] Re: About Unmask the SError in serror_aarch64
Hi,
Just for the record, I came across a similar (not the same) problem and discussed this in TF-A LTS context [1].
IMO, what's happening is that the SError was caused by a lower EL, traps to EL3 (RAS_EXTENSION means SCR_EL3.EA=1) and goes to the "serror_aarch64" vector entry. After SError is unmasked at EL3, it causes another SError at EL3 and ends up in the weak implementation of "plat_handle_el3_ea" (which ends in report_unhandled_exception).
If I understood correctly, we clear the SCR_EL3.EA bit on exit from EL3 to a lower EL. So an SError in lower EL shouldn't trap to EL3. Did I miss something?
[1] https://lists.trustedfirmware.org/archives/list/tfa-lts@lists.trustedfirmwar...
On Fri, Mar 10, 2023 at 6:10 AM Ming Huang via TF-A <tf-a@lists.trustedfirmware.org> wrote:
Hi Manish,
Please see the comments below.
On 3/9/23 7:45 PM, Manish Pandey2 wrote:
Hi Ming,
IMO, what's happening is that the SError was caused by a lower EL, traps to EL3 (RAS_EXTENSION means SCR_EL3.EA=1) and goes to the "serror_aarch64" vector entry. After SError is unmasked at EL3, it causes another SError at EL3 and ends up in the weak implementation of "plat_handle_el3_ea" (which ends in report_unhandled_exception).
The reason to unmask SError is to catch any SError caused by EL3 while handling the lower EL SError. I think having a synchronization barrier (esb) may prevent the SError from trapping multiple times. Just to let you know, I am actively working on RAS refactoring in TF-A (and fixing bugs). I will add you to those patches.
To fix this particular issue, I am planning the change below. Would you please test whether it works for you? There is no need to unmask EAs if we are already handling an EA (unlike other exceptions).
The result is similar to masking EA (removing "msr daifclr, #DAIF_ABT_BIT").
--------------log:
[ 48.944653] pcieport 0000:b0:00.0: EDR: EDR event received
Detected DPC, skip AER core[64] mm(925) return: 2 --handler(925) end
[ 48.959091] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
[ 48.967129] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
[ 48.972710] igb 0000:b1:00.0 ens51f0: PCIe link lost
[root@localhost.localdomain /root/hm] #[ 49.141737] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
[ 49.150315] igb 0000:b1:00.0: enabling device (0000 -> 0002)
ERROR: Excepton received on 0x81000000, spsr_el3:82401009,reason:0 esr_el3:0xbe000411 ue_cnt:0x0
esr_el3 = be000411
Exception Class = 2f: SError interrupt.
Print cpu register:
|-elr_el3: ffff80001063a188
|-far_el3: 7abce97690b57fe1
|-scr_el3: 403073d
|-sctlr_el3: 30cd183f
|-LR: ffff80001063a2f0
|-SP: ffff000807388000
|-x0: 0000000000000000 x1: ffff800011ed500c
|-x2: 0000000000000000 x3: ffff800011ed5004
|-x4: ffff800011b73000 x5: 000000000000000c
|-x6: ffff800010c1d3f8 x7: 0000000000000000
|-x8: 00000000000bffe8 x9: ffff80001063a2f0
|-x10: ffff800011792338 x11: 0000000000000003
|-x12: ffff800011852378 x13: 0000000000000001
|-x14: 0000000000000c80 x15: 0000000000aaaaaa
|-x16: 0000000000000000 x17: 7228206465726f6e
|-x18: 0000000000000010 x19: ffff04000401a080
|-x20: ffff800010f522b8 x21: ffff04000401a0a0
|-x22: ffff00080b4e2370 x23: ffff00080b4e1fe8
|-x24: 00000000f4b1e03c x25: ffff00080b4e10c8
|-x26: ffff8000114b1880 x27: ffff00080737f5c0
|-x28: ffff0008072d3400 x29: ffff8000121cbab0
INFO: mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR: Handle Exception on 0x81000000 from EL2, reason=0 esr_el3=0xbe000411
INFO: spsr:82401009 elr:ffff80001063a188 far:7abce97690b57fe1
INFO: SCR: 403073d
|-ns: 1
|-eel2: 0
INFO: HCR: 488000000
|-teg: 1
|-amo: 0
|-tea: 0
INFO: Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+380.
ERROR: Excepton received on 0x81000000, spsr_el3:600001c9,reason:2 esr_el3:0x80000011 ue_cnt:0x0
esr_el3 = 5e000000
Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
|-elr_el3: ffff800010027aac
|-far_el3: 7abce97690b57fe1
|-scr_el3: 403073d
|-sctlr_el3: 30cd183f
|-LR: ffff8000108d569c
|-SP: ffff000807388000
|-x0: 00000000c400002b x1: 0000000000000000
|-x2: 0000000000000000 x3: 0000000000000000
|-x4: 0000000000000000 x5: 0000000000000000
|-x6: 0000000000000000 x7: 0000000000000000
|-x8: ffff8000121cb710 x9: ffff8000108d6078
|-x10: ffff8000121cb750 x11: 0000000000000000
|-x12: 61646e6f63657320 x13: 0a73555043207972
|-x14: 0000000000000000 x15: 0000000000000000
|-x16: 0000000000000000 x17: 0000000000000000
|-x18: 0000000000000060 x19: ffff800010f522b8
|-x20: 00000000000f423d x21: ffff800011996a68
|-x22: ffff8000114a0c08 x23: ffff8000114a0000
|-x24: ffff8000119b7000 x25: ffff00080b4e10c8
|-x26: ffff8000114b1880 x27: ffff00080737f5c0
|-x28: ffff000807388000 x29: ffff8000121cb6e0
INFIO:N F Omp:id r: 81 00 00m00, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR: Handle Exception on 0x81000000 from EL2, reason=2 esr_el3=0x80000011
INFO: spsr:600001c9 elr:ffff800010027aac far:7abce97690b57fe1
INFO: SCR: 403073d
|-ns: 1
|-eel2: 0
INFO: HCR: 488000000
|-teg: 1
|-amo: 0
|-tea: 0
INFO: Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+200.
ERROR: Excepton received on 0x81000000, spsr_el3:800003c9,reason:2 esr_el3:0x80000011 ue_cnt:0x0
esr_el3 = 5e000000
Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
|-elr_el3: ffff800010027aac
|-far_el3: 7abce97690b57fe1
|-scr_el3: 403073d
|-sctlr_el3: 30cd183f
|-LR: ffff8000108d569c
|-SP: ffff000807388000
|-x0: 00000000c4000038 x1: 000000000000001d
|-x2: 0000000000000000 x3: 0000000000000000
|-x4: 0000000000000000 x5: 0000000000000000
|-x6: 0000000000000000 x7: 0000000000000000
|-x8: 303d294b53414d43 x9: ffff8000108d5f28
|-x10: 3736313d454d4954 x11: 0a37333630323438
|-x12: 78303d294b53414d x13: 5448534152430a30
|-x14: 4c454e52454b0a30 x15: 303d54455346464f
|-x16: 352d50314355312d x17: 664d427369412f42
|-x18: 0000000000000001 x19: ffff800010f522b8
|-x20: 0000000000000000 x21: 0000000000000000
|-x22: ffff800011acd000 x23: ffff8000121cb2e8
|-x24: ffff8000119b7000 x25: ffff00080b4e10c8
|-x26: ffff8000114b1880 x27: ffff00080737f5c0
|-x28: ffff000807388000 x29: ffff8000121cb200
INFO: mpidr:81000000, stop s-wtd.
SDEI no ready, return to lower el by trigger exception.
ERROR: Handle Exception on 0x81000000 from EL2, reason=2 esr_el3=0x80000011
INFO: spsr:800003c9 elr:ffff800010027aac far:7abce97690b57fe1
INFO: SCR: 403073d
|-ns: 1
|-eel2: 0
INFO: HCR: 488000000
|-teg: 1
|-amo: 0
|-tea: 0
INFO: Trap exception to EL2 with new spsr:3c9 elr:ffff800010011000+200.
INFO: clear eoi id:1d
[ 49.473997] SError Interrupt on CPU0, code 0xbe000411 -- SError
[ 49.473998] CPU: 0 PID: 15 Comm: kworker/0:1 Kdump: loaded Tainted: G E 5.10.134-10.git.819992ee1.an8.aarch64 #1
[ 49.473998] Hardware name: AisPManu AliServer-Xuanwu2.0AM-02-1UC1P-5B/AisBMf, BIOS 1.2.M1.AL.P.139.00 03/10/2023
[ 49.473998] Workqueue: kacpi_notify acpi_os_execute_deferred
[ 49.474000] pstate: 82401009 (Nzcv daif +PAN -UAO +TCO BTYPE=--)
[ 49.474000] pc : __pci_write_msi_msg+0x108/0x210
[ 49.474000] lr : default_restore_msi_irq+0x60/0x88
[ 49.474001] sp : ffff8000121cbab0
[ 49.474001] x29: ffff8000121cbab0 x28: ffff0008072d3400
[ 49.474001] x27: ffff00080737f5c0 x26: ffff8000114b1880
[ 49.474002] x25: ffff00080b4e10c8 x24: 00000000f4b1e03c
[ 49.474003] x23: ffff00080b4e1fe8 x22: ffff00080b4e2370
[ 49.474003] x21: ffff04000401a0a0 x20: ffff800010f522b8
[ 49.474004] x19: ffff04000401a080 x18: 0000000000000010
[ 49.474004] x17: 7228206465726f6e x16: 0000000000000000
[ 49.474005] x15: 0000000000aaaaaa x14: 0000000000000c80
[ 49.474005] x13: 0000000000000001 x12: ffff800011852378
[ 49.474006] x11: 0000000000000003 x10: ffff800011792338
[ 49.474006] x9 : ffff80001063a2f0 x8 : 00000000000bffe8
[ 49.474007] x7 : 0000000000000000 x6 : ffff800010c1d3f8
[ 49.474007] x5 : 000000000000000c x4 : ffff800011b73000
[ 49.474008] x3 : ffff800011ed5004 x2 : 0000000000000000
[ 49.474008] x1 : ffff800011ed500c x0 : 0000000000000000
[ 49.474009] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 49.474010] CPU: 0 PID: 15 Comm: kworker/0:1 Kdump: loaded Tainted: G E 5.10.134-10.git.819992ee1.an8.aarch64 #1
[ 49.474010] Hardware name: AisPManu AliServer-Xuanwu2.0AM-02-1UC1P-5B/AisBMf, BIOS 1.2.M1.AL.P.139.00 03/10/2023
[ 49.474010] Workqueue: kacpi_notify acpi_os_execute_deferred
[ 49.474011] Call trace:
[ 49.474011] dump_backtrace+0x0/0x200
[ 49.474011] show_stack+0x1c/0x28
[ 49.474011] dump_stack+0xd4/0x128
[ 49.474012] panic+0x178/0x394
[ 49.474012] nmi_panic+0x6c/0xa0
[ 49.474012] arm64_serror_panic+0x84/0x90
[ 49.474012] do_serror+0x38/0x60
[ 49.474014] el1_error+0x7c/0xf4
[ 49.474014] __pci_write_msi_msg+0x108/0x210
[ 49.474015] default_restore_msi_irq+0x60/0x88
[ 49.474015] arch_restore_msi_irqs+0x3c/0x58
[ 49.474015] pci_restore_msi_state+0x8c/0x210
[ 49.474015] pci_restore_state.part.42+0x34c/0x4a8
[ 49.474016] pci_restore_state+0x20/0x28
[ 49.474016] igb_io_slot_reset+0x3c/0xc8 [igb]
[ 49.474016] report_slot_reset+0x4c/0x98
[ 49.474016] pci_walk_bus+0x68/0xc0
[ 49.474017] pci_walk_bridge+0x20/0x40
[ 49.474017] pcie_do_recovery+0x1f0/0x290
[ 49.474017] edr_handle_event+0x1f0/0x270
[ 49.474017] acpi_ev_notify_dispatch+0x60/0x70
[ 49.474018] acpi_os_execute_deferred+0x20/0x38
[ 49.474018] process_one_work+0x1b8/0x420
[ 49.474018] worker_thread+0x158/0x510
[ 49.474018] kthread+0x114/0x118
[ 49.474019] SMP: stopping secondary CPUs
-----------------log end
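The esr_el3 values in the log above can be cross-checked by hand. The following is a rough sketch of a decoder for the fields relevant here, assuming the architectural ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]) and the SError ISS layout (IDS bit 24, AET bits [12:10], EA bit 9, DFSC bits [5:0]). The function and table names are illustrative only and do not come from TF-A or the platform firmware.

```python
# Illustrative ESR_ELx decoder for SError syndromes (names are hypothetical).
# AET is only meaningful when the RAS extension is implemented, IDS is 0
# and DFSC is 0b010001 (asynchronous SError interrupt).
AET_NAMES = {
    0b000: "Uncontainable (UC)",
    0b001: "Unrecoverable state (UEU)",
    0b010: "Restartable state (UEO)",
    0b011: "Recoverable state (UER)",
    0b110: "Corrected (CE)",
}

EC_SERROR = 0x2F  # Exception class for SError interrupt


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into EC, IL and ISS, plus SError fields if EC is 0x2f."""
    iss = esr & 0x1FFFFFF
    fields = {"ec": (esr >> 26) & 0x3F, "il": (esr >> 25) & 1, "iss": iss}
    if fields["ec"] == EC_SERROR:
        fields["ids"] = (iss >> 24) & 1
        fields["aet"] = (iss >> 10) & 0x7
        fields["ea"] = (iss >> 9) & 1
        fields["dfsc"] = iss & 0x3F
    return fields


if __name__ == "__main__":
    for value in (0xBE000411, 0x5E000000):  # values printed in the log
        decoded = decode_esr(value)
        print(f"esr={value:#x}", {k: hex(v) for k, v in decoded.items()})
        if "aet" in decoded:
            print("  AET:", AET_NAMES.get(decoded["aet"], "reserved"))
```

With these inputs the EC fields come out as 0x2f and 0x17, matching the "Exception Class" lines that the firmware printed.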
--- a/bl31/aarch64/runtime_exceptions.S
+++ b/bl31/aarch64/runtime_exceptions.S
@@ -402,11 +402,8 @@ end_vector_entry fiq_aarch32
vector_entry serror_aarch32
save_x30
apply_at_speculative_wa
-#if RAS_EXTENSION
esb
msr daifclr, #DAIF_ABT_BIT
-#else
check_and_unmask_ea
-#endif
b handle_lower_el_async_ea
Thanks,
Manish
________________________________
From: Ming Huang <huangming@linux.alibaba.com>
Sent: 08 March 2023 14:22
To: Jeenu Viswambharan <Jeenu.Viswambharan@arm.com>
Cc: Manish Pandey2 <Manish.Pandey2@arm.com>; tf-a@lists.trustedfirmware.org <tf-a@lists.trustedfirmware.org>
Subject: About Unmask the SError in serror_aarch64
Hi Jeenu,
vector_entry serror_aarch64
msr daifclr, #DAIF_ABT_BIT
Why must the SError be unmasked in serror_aarch64?
We found that unmasking SError leads to a TF-A panic with the output "Unhandled Exception in EL3".
--------------------------------------------------------------------------------------log:
Detected DPC, skip AER core[64] mm(925) return: 2
[ 109.883159] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
[ 109.891195] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
[ 109.896776] igb 0000:b1:00.0 ens51f0: PCIe link lost
[root@localhost.localdomain /root/hm] #[ 110.068410] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
[ 110.076988] igb 0000:b1:00.0: enabling device (0000 -> 0002)
Unhandled Exception in EL3.
x30 = 0x00000000ff013b84
x0 = 0x0000000000000000
x1 = 0xffff800011e7500c
x2 = 0x0000000000000000
x3 = 0xffff800011e75004
x4 = 0xffff800011a82000
x5 = 0x000000000000000c
x6 = 0x00000000000000fb
x7 = 0xffff800011441e80
x8 = 0x0000000000000000
x9 = 0xffff800010655270
x10 = 0x00000000ffff8000
x11 = 0xffff800011701e80
x12 = 0x0000000000000001
x13 = 0xffff800010c08cc0
x14 = 0x0000000000000c80
x15 = 0x0000000000000001
x16 = 0x67692070552f6e77
x17 = 0x7228206465726f6e
x18 = 0x0000000000000030
x19 = 0xffff04000032de80
x20 = 0xffff04000032dea0
x21 = 0xffff040002c2f000
x22 = 0xffff000809280000
x23 = 0xffff00080a217800
x24 = 0xffff800011853f80
x25 = 0xffff040002c2e0c8
x26 = 0xffff800011923de8
x27 = 0xffff000817fc8740
x28 = 0x0000000000000000
x29 = 0xffff80001e9ebb00
scr_el3 = 0x000000000403073d
sctlr_el3 = 0x0000000030cd183f
cptr_el3 = 0x0000000000000100
tcr_el3 = 0x0000000080843514
daif = 0x00000000000003c0
mair_el3 = 0x00000000004404ff
spsr_el3 = 0x00000000624002cd
elr_el3 = 0x00000000ff013d84
ttbr0_el3 = 0x00000000ff093001
esr_el3 = 0x00000000be000011
far_el3 = 0x7abce97e90b5fee1
spsr_el1 = 0x0000000000000000
elr_el1 = 0x0000000000000000
spsr_abt = 0x0000000000000000
spsr_und = 0x0000000000000000
spsr_irq = 0x0000000000000000
spsr_fiq = 0x0000000000000000
sctlr_el1 = 0x0000000030d00800
actlr_el1 = 0x0000000000000000
cpacr_el1 = 0x0000000000000000
csselr_el1 = 0x0000000000000002
sp_el1 = 0x0000000000000000
esr_el1 = 0x0000000000000000
ttbr0_el1 = 0x0000000000000000
ttbr1_el1 = 0x0000000000000000
mair_el1 = 0x0000000000000000
amair_el1 = 0x0000000000000000
tcr_el1 = 0x0000000000000000
tpidr_el1 = 0xffff800f6e6f1000
tpidr_el0 = 0x00000000f6e1ede0
tpidrro_el0 = 0x0000000000000000
par_el1 = 0xff000000f4214b80
mpidr_el1 = 0x0000000081000000
afsr0_el1 = 0x0000000000000000
afsr1_el1 = 0x0000000000000000
contextidr_el1 = 0x0000000000000000
vbar_el1 = 0x0000000000000000
cntp_ctl_el0 = 0x0000000000000000
cntp_cval_el0 = 0x000000010b938e04
cntv_ctl_el0 = 0x0000000000000000
cntv_cval_el0 = 0x0000000000000000
cntkctl_el1 = 0x0000000000000000
sp_el0 = 0xffff000809280000
isr_el1 = 0x0000000000000040
cpupwrctlr_el1 = 0x0000000000000000
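As a sanity check on the dump above, spsr_el3 records which exception level the exception was taken from and which DAIF bits were masked at that moment. Below is a small sketch that decodes those fields, assuming an AArch64 SPSR (bit 4 clear) with the architectural layout: M[3:0] in bits [3:0], and D, A, I, F in bits 9, 8, 7 and 6. The helper name is hypothetical.

```python
# Illustrative SPSR_ELx decoder for AArch64 state (names are hypothetical).
EL_MODES = {
    0b0000: "EL0t",
    0b0100: "EL1t",
    0b0101: "EL1h",
    0b1000: "EL2t",
    0b1001: "EL2h",
    0b1100: "EL3t",
    0b1101: "EL3h",
}

DAIF_BITS = (("D", 9), ("A", 8), ("I", 7), ("F", 6))


def decode_spsr(spsr: int):
    """Return (mode name, {flag: masked?}) for an AArch64 SPSR value."""
    mode = EL_MODES.get(spsr & 0xF, "unknown")
    masked = {name: bool((spsr >> bit) & 1) for name, bit in DAIF_BITS}
    return mode, masked


if __name__ == "__main__":
    # 0x624002cd: spsr_el3 from the EL3 crash dump.
    # 0x82401009: spsr_el3 reported for the first SError from the lower EL.
    for value in (0x624002CD, 0x82401009):
        mode, masked = decode_spsr(value)
        print(f"spsr={value:#x} from {mode}, masked={masked}")
```

Read this way, 0x624002cd decodes as EL3h with PSTATE.A clear, which appears consistent with the second SError being taken at EL3 after the unmask, while 0x82401009 decodes as EL2h.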
With the above line removed (SError masked), TF-A continues executing and produces some useful output.
--------------------------------------------------------------------------------------log:
Detected DPC, skip AER core[64] mm(925) return: 2
[ 278.193642] pcieport 0000:b0:00.0: DPC: containment event, status:0x0005 source:0xb000
[ 278.201680] pcieport 0000:b0:00.0: DPC: ERR_FATAL detected
[ 278.207262] igb 0000:b1:00.0 ens51f0: PCIe link lost
[root@localhost.localdomain /root/hm] #[ 278.378416] pcieport 0000:b0:00.0: pciehp: Slot(51): Link Down/Up ignored (recovered by DPC)
[ 278.386993] igb 0000:b1:00.0: enabling device (0000 -> 0002)
ERROR: Excepton received on 0x81000000, spsr_el3:82401009,reason:0 esr_el3:0xbe000411 ue_cnt:0x0
Exception Class = 2f: SError interrupt.
Print cpu register:
|-elr_el3: ffff8000106550f8
|-far_el3: 7abce97e92b57fe1
|-scr_el3: 403073d
|-sctlr_el3: 30cd183f
|-LR: ffff800010655270
|-SP: ffff0008091d1200
|-x0: 0000000000000000 x1: ffff800011e7500c
|-x2: 0000000000000000 x3: ffff800011e75004
|-x4: ffff800011a82000 x5: 000000000000000c
|-x6: 00000000000000fb x7: ffff800011441e80
|-x8: 0000000000000000 x9: ffff800010655270
|-x10: 00000000ffff8000 x11: ffff800011701e80
|-x12: 0000000000000001 x13: ffff800010c08cc0
|-x14: 0000000000000c80 x15: 0000000000000001
|-x16: 67692070552f6e77 x17: 7228206465726f6e
|-x18: 0000000000000030 x19: ffff040005a7e100
|-x20: ffff040005a7e120 x21: ffff040001e9f000
|-x22: ffff0008091d1200 x23: ffff000809cdf800
|-x24: ffff800011853f80 x25: ffff040001e9e0c8
|-x26: ffff800011923de8 x27: ffff000807549140
|-x28: 0000000000000000 x29: ffff80001cc9bb00
INFO: mpidr:81000000, stop s-wtd.
Return to lower EL by SDEI
->Core[0](0x81000000) received intr=0(exception), cnt=0x1
RTC: 2023-03-08 10:15:38
ERROR: Excepton received on 0x81000000, spsr_el3:620003c5,reason:2 esr_el3:0x80000011 ue_cnt:0x0
Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
|-elr_el3: ff01a430
|-far_el3: 7abce97e92b57fe1
|-scr_el3: 4000e3c
|-sctlr_el3: 30cd183f
|-LR: ff208570
|-SP: ff631d80
|-x0: 00000000c4000061 x1: 0000000000000000
|-x2: 0000000000000000 x3: 0000000000000000
|-x4: 0000000000000000 x5: 0000000000000000
|-x6: 0000000000000000 x7: 0000000000000000
|-x8: 0000000000000000 x9: 0000000000000002
|-x10: 0000000000000002 x11: 0000000000000000
|-x12: 0000000000000002 x13: 0000000000000002
|-x14: 0000000000000001 x15: 00000000000000ff
|-x16: 00000000ffa97650 x17: 00000000000000f8
|-x18: 0000000000000000 x19: 0000000000000001
|-x20: 0000000000000000 x21: 00000000ff20c240
|-x22: 00000000ff20c25a x23: 8000000000000009
|-x24: 0000000076726473 x25: 00000000ff20d1e0
|-x26: 00000000ffbdcc10 x27: 00000000ff631ef8
|-x28: 00000000ffbff328 x29: 00000000ff631d80
INFO: mpidr:81000000, stop s-wtd.
Have report fatal
Exception Class = 17: SMC instruction execution in AArch64 state, when SMC is not disabled.
Print cpu register:
Regards,
Ming