Hi James,
Apologies for the late reply. I missed your email.
Following your hint, I made an EINJ ACPI table with a bug in order to trigger an EA in the kernel, and I then got the call trace we expected. __einj_error_trigger appears below __sdei_asm_handler, indicating where the error occurred.
Thank you very much!
My test log:
----------------------------------------------
[root@rootfs einj]$echo 0x8A2000000 > param1
[root@rootfs einj]$echo 0xfffffffffffff000 > param2
[root@rootfs einj]$echo 0x1 > flags
[root@rootfs einj]$echo 0x08 > error_type
[root@rootfs einj]$echo 1 > error_inject
INFO: Core[ER1R](0OxR810:100 00) re ceiEed ras intr=115, cnt=1. on 0x181000000, spsr_el3:62400009,reason:1 esr_el3:0Einj ErrType:8 Flags:1, ApicId:0 DDDR Einj Addr:8A2000000 Range:FFFFFFFFFFFFF000 nknInvalid & clear cache(8A2000000) [64](0x181000000) received ras intr=0, cnt=0.
CPU RAS mm handler: EventId=C4000048
[ 199.563096] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
[ 199.563228] {2}[Hardware Error]: event severity: fatal
[ 199.563317] {2}[Hardware Error]: Error 0, type: fatal
[ 199.563379] {2}[Hardware Error]: section_type: general processor error
[ 199.563439] {2}[Hardware Error]: processor_type: 2, ARM
[ 199.563502] {2}[Hardware Error]: processor_isa: 4, ARM A64
[ 199.563556] {2}[Hardware Error]: error_type: 0x00
[ 199.563618] {2}[Hardware Error]: operation: 0, unknown or generic
[ 199.563679] {2}[Hardware Error]: processor_id: 0x0000000000000040
[ 199.563735] Kernel panic - not syncing: Fatal hardware error!
[ 199.563890] CPU: 2 PID: 236 Comm: bash Tainted: G E 5.10.23 #1
[ 199.563958] Hardware name: Default Default/Default, BIOS 1.2.AL.E.105.00 08/13/2021
[ 199.564000] Call trace:
[ 199.564047] dump_backtrace+0x0/0x1e0
[ 199.564091] show_stack+0x1c/0x24
[ 199.564138] dump_stack+0xcc/0x120
[ 199.564178] panic+0x154/0x360
[ 199.564226] __ghes_panic+0x7c/0x80
[ 199.564283] ghes_in_nmi_queue_one_entry+0x1fc/0x2f4
[ 199.564335] ghes_sdei_critical_callback+0x50/0xb0
[ 199.564385] sdei_event_handler+0x3c/0xb0
[ 199.564429] _sdei_handler+0x88/0x120
[ 199.564471] __sdei_handler+0x24/0x50
[ 199.564519] __sdei_asm_handler+0xbc/0x15c
[ 199.564570] __einj_error_trigger+0x7c/0x3d0 [einj]
[ 199.564621] __einj_error_inject+0x24c/0x2ac [einj]
[ 199.564668] error_inject_set+0xcc/0x134 [einj]
[ 199.564715] simple_attr_write+0xfc/0x130
[ 199.564759] debugfs_attr_write+0x50/0x90
[ 199.564802] vfs_write+0xd0/0x290
[ 199.564846] ksys_write+0x70/0x100
[ 199.564891] __arm64_sys_write+0x20/0x30
[ 199.564939] el0_svc_common.constprop.0+0x70/0x170
[ 199.564986] do_el0_svc+0x74/0x90
[ 199.565031] el0_svc+0x1c/0x30
[ 199.565079] el0_sync_handler+0xa8/0xac
[ 199.565121] el0_sync+0x140/0x180
[ 199.565172] SMP: stopping secondary CPUs
[ 199.565224] Kernel Offset: disabled
[ 199.565278] CPU features: 0x9850817,7a60aa38
[ 199.565356] Memory Limit: none
INFO: PSCI Power Domain Map:
INFO: Domain Node : Level 1, parent_node -1, State OFF (0x2)
INFO: Domain Node : Level 1, parent_node -1, State OFF (0x2)
----------------------------------------------
On 6/23/21 6:06 PM, James Morse wrote:
Hello,
On 23/06/2021 02:21, Ming Huang wrote:
On 6/2/21 10:42 PM, James Morse wrote:
The kernel version: 5.8.0 is not a kernel.org stable kernel.
Works for me:
| Hardware name: Foundation-v8A (DT)
| Call trace:
| dump_backtrace+0x0/0x1b0
| show_stack+0x18/0x24
| dump_stack+0xc0/0x11c
| ghes_in_nmi_queue_one_entry+0x138/0x2f0
| ghes_sdei_normal_callback+0x30/0x6c
| sdei_event_handler+0x60/0x1d4
| __sdei_handler+0xc4/0x220
| __sdei_asm_handler+0xbc/0x168
| arch_cpu_idle+0xc/0x20
| cpu_startup_entry+0x24/0x80
| secondary_start_kernel+0x138/0x184
| root@localhost:/sys/kernel/debug# uname -r
| 5.8.0-00007-g0db1a507bbda-dirty
How did you get the call trace printed correctly? In my test, I can't get a correct call trace when reporting through CPER + SDEI.
Linux version 5.10.0
[root@m1rootfs /]$devmem 0
ERROR: Excepton received on 0x81000000, spsr_el3:60001000,reason:0 esr_el3:0xbe000011
[ 129.772039] sdei: sdei_event_handler this handler event: 806
[ 129.772519] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
[ 129.772768] {2}[Hardware Error]: event severity: fatal
[ 129.772962] {2}[Hardware Error]: Error 0, type: fatal
[ 129.773154] {2}[Hardware Error]: section_type: general processor error
[ 129.773395] {2}[Hardware Error]: processor_type: 2, ARM
[ 129.773503] {2}[Hardware Error]: processor_isa: 4, ARM A64
[ 129.773605] {2}[Hardware Error]: error_type: 0x00
[ 129.773725] {2}[Hardware Error]: operation: 0, unknown or generic
[ 129.773833] {2}[Hardware Error]: processor_id: 0x0000000000000000
[ 129.773943] Kernel panic - not syncing: Fatal hardware error!
[ 129.774060] CPU: 0 PID: 201 Comm: devmem Not tainted 5.10.0+ #25
[ 129.774195] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.E.050.00 06/02/2021
[ 129.774273] Call trace:
[ 129.774378] dump_backtrace+0x0/0x1f0
[ 129.774459] show_stack+0x24/0x70
[ 129.774543] dump_stack+0xbc/0x114
[ 129.774621] panic+0x158/0x364
[ 129.774723] __raw_spin_lock_irqsave.constprop.0+0x0/0xa0
[ 129.774820] ghes_in_nmi_queue_one_entry+0x204/0x2fc
[ 129.774917] ghes_sdei_normal_callback+0x58/0xc0
[ 129.775005] sdei_event_handler+0x50/0xe8
[ 129.775090] _sdei_handler+0x8c/0x160
[ 129.775167] __sdei_handler+0x28/0x50
[ 129.775265] __sdei_asm_handler+0xbc/0x174
This _is_ the call stack. What else were you expecting to see?
Your SPSR_EL3 debug message shows the exception came from EL0. The kernel has not found any kernel stack frames below __sdei_asm_handler, because there aren't any! The kernel does not produce stack traces for user-space.
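For reference, here is a minimal sketch (not part of the original exchange) of how the mode field of the SPSR_EL3 values printed by the firmware can be decoded. It assumes the Armv8-A layout where bit 4 selects the execution state and bits [3:0] select the exception level and stack pointer for AArch64. The function name decode_spsr_mode is made up for illustration.

```python
# Sketch only: decode the M[4:0] field of an Armv8-A SPSR_ELx value.
# Assumption: bit 4 == 0 means the exception was taken from AArch64,
# and M[3:0] then encodes the exception level and stack pointer choice.

AARCH64_MODES = {
    0b0000: "EL0t",
    0b0100: "EL1t",
    0b0101: "EL1h",
    0b1000: "EL2t",
    0b1001: "EL2h",
    0b1100: "EL3t",
    0b1101: "EL3h",
}


def decode_spsr_mode(spsr: int) -> str:
    """Return the mode the exception was taken from (hypothetical helper)."""
    if spsr & 0x10:
        # M[4] set: exception taken from AArch32, not decoded further here.
        return "AArch32"
    return AARCH64_MODES.get(spsr & 0xF, "reserved")


if __name__ == "__main__":
    # Values copied from the firmware messages in this thread.
    for value in (0x60001000, 0x62400009):
        print(f"spsr_el3={value:#010x} -> {decode_spsr_mode(value)}")
```

With the values from this thread, 0x60001000 (the devmem test) decodes to EL0t, i.e. user space, which is why no kernel frames appear below __sdei_asm_handler. 0x62400009 (the EINJ test) decodes to EL2h, which would be consistent with the error being taken while running kernel code, assuming the kernel runs at EL2 on that system.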
[ 129.779672] SMP: stopping secondary CPUs
[ 129.779766] Kernel Offset: disabled
Thanks,
James