On 02/06/2021 07:11, Ming Huang wrote:
> On 6/1/21 6:57 PM, James Morse wrote:
>> On 28/05/2021 13:10, Ming Huang wrote
>>> I did the test report SDEI to kernel with fatal severity in APEI / CPER while EL3 received
>>> SEA(SCR_EL3.EA = 1). Kernel will panic and print calltrace, but this calltrace was not the
>>> position where error occured(another word where throw SEA), instead calltrace in ghes.c.
>>
>> This used to work in the cases where it was possible, but the stack tracing stuff has been
>> changed a little over the recent months.
>>
>> You didn't mention a kernel version.
>>
>>
>>> How can SDEI solution let kernel print calltrace at right position?
>>
>> The error was fatal, if the physical address was memory then its just chance that
>> process-A was using it not process-B.
>>
>> That said, the SDEI entry code in the kernel does try to set the records up to allow the
>> stack tracer to walk onto another stack - but this isn't always possible:
>> The stack tracer needs the frame records to be present and correct, if you took a
>> exception 'x29' needs to be the current frame pointer, but this is only guaranteed at
>> function boundaries. If you took an exception during the functions prologue or epilogue,
>> the values seen by the stack tracer will be inconsistent.
>>
>> For arm64, linux can't provide a 'reliable' stack trace over an exception, but it does
>> provide a best effort.
>>
>>
>>> For issue analysis, the right position calltrace is very useful. For ACPI firmware-first,
>>> we set SCR_EL3.EA = 1, although the solution rethrow EA back to kernel will suffer from some
>>> problems, but this solution can let kernel print calltrace at right position.
>>
>> If you see a complete stacktrace for synchronous external abort delivered directly to the
>> kernel, but not via EL3 and back into SDEI, its likely a problem the kernel has stepping
>> between stacks.
>> SDEI always has to do this, external aborts never do this.
>>
>> Which kernel version do you see this?
>> (A report to linux-arm-kernel@lists.infradead.org would help too)
>
> The kernel version: 5.8.0 is not a kernel.org stable kernel.

Works for me:
| Hardware name: Foundation-v8A (DT)
| Call trace:
| dump_backtrace+0x0/0x1b0
| show_stack+0x18/0x24
| dump_stack+0xc0/0x11c
| ghes_in_nmi_queue_one_entry+0x138/0x2f0
| ghes_sdei_normal_callback+0x30/0x6c
| sdei_event_handler+0x60/0x1d4
| __sdei_handler+0xc4/0x220
| __sdei_asm_handler+0xbc/0x168
| arch_cpu_idle+0xc/0x20
| cpu_startup_entry+0x24/0x80
| secondary_start_kernel+0x138/0x184
| root@localhost:/sys/kernel/debug# uname -r
| 5.8.0-00007-g0db1a507bbda-dirty

> In my mind, ELR_EL3 and x29 should report to kernel via APEI/CPER for kernel to print calltrace at error position.
That is not possible.
The kernel can only start a stacktrace from the current location. If your event is synchronous, the kernel should be able to chain one stack onto another, as it does above. If your event is asynchronous, the stack trace is meaningless.
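
To illustrate what chaining one stack onto another means, here is a minimal user-space sketch, not the kernel's unwinder. It assumes the AArch64 convention that x29 points at a {previous x29, saved x30} pair. The names frame_record, stack_range and walk_frames are made up for this example. The walker follows the chain from a starting frame pointer and gives up as soon as a record lies outside every stack it was told about:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AArch64 frame record: x29 points at a {previous x29, saved x30} pair. */
struct frame_record {
	const struct frame_record *prev_fp;	/* caller's record, NULL ends the chain */
	uint64_t lr;				/* return address into the caller */
};

/* Hypothetical description of one stack's address range. */
struct stack_range {
	uintptr_t low;
	uintptr_t high;				/* exclusive */
};

/* Return 1 if the whole record sits inside one of the known stacks. */
static int on_any_stack(const struct frame_record *fp,
			const struct stack_range *stacks, size_t nstacks)
{
	uintptr_t addr = (uintptr_t)fp;
	size_t i;

	for (i = 0; i < nstacks; i++) {
		if (addr >= stacks[i].low &&
		    addr + sizeof(*fp) <= stacks[i].high)
			return 1;
	}
	return 0;
}

/*
 * Walk the chain starting at 'fp', storing return addresses in 'out'.
 * Stops at the end of the chain, when 'out' is full, or when a record
 * is not on any known stack. Returns the number of entries stored.
 */
size_t walk_frames(const struct frame_record *fp,
		   const struct stack_range *stacks, size_t nstacks,
		   uint64_t *out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	while (fp != NULL && n < max && on_any_stack(fp, stacks, nstacks)) {
		out[n++] = fp->lr;
		fp = fp->prev_fp;
	}
	return n;
}
```

If the interrupted context's x29 was not a valid frame pointer at the moment of the event (prologue, epilogue, or an unrelated asynchronous event), the first link across the stack boundary is wrong and everything after it is noise.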
> But there is no suited table in APEI/CPER to report this.
See the UEFI spec (where CPER is defined): Table N-26 ARMv8 AArch64 GPRs (Type 4). But all you can hope to get for populating all this is for the kernel to print it out. The kernel isn't going to do anything with it.
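
For reference, a rough sketch of what a consumer of that record could do with it. This is illustrative user-space code, not what the kernel does. It assumes my reading of the CPER layout: an 8-byte context header (version, register context type, register array size) followed, for Type 4, by x0..x30 and SP as 64-bit little-endian values. It also assumes a little-endian host. The names arm_ctx_header, aarch64_gprs and decode_aarch64_gprs are made up for this example:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CTX_TYPE_AARCH64_GPRS	4	/* "ARMv8 AArch64 GPRs (Type 4)" */

/* Assumed layout of the ARM processor context header in CPER. */
struct arm_ctx_header {
	uint16_t version;
	uint16_t type;		/* register context type */
	uint32_t size;		/* size of the register array in bytes */
};
static_assert(sizeof(struct arm_ctx_header) == 8, "header must be 8 bytes");

/* Assumed Type 4 register array: x0..x30 followed by SP. */
struct aarch64_gprs {
	uint64_t x[31];
	uint64_t sp;
};
static_assert(sizeof(struct aarch64_gprs) == 256, "array must be 256 bytes");

/* Copy a Type 4 context out of 'buf'. Returns 0 on success, -1 otherwise. */
int decode_aarch64_gprs(const uint8_t *buf, size_t len,
			struct aarch64_gprs *out)
{
	struct arm_ctx_header hdr;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));	/* assumes a little-endian host */

	if (hdr.type != CTX_TYPE_AARCH64_GPRS)
		return -1;
	if (hdr.size < sizeof(*out) || len - sizeof(hdr) < sizeof(*out))
		return -1;

	memcpy(out, buf + sizeof(hdr), sizeof(*out));
	return 0;
}

/* All that can usefully be done with the values: print them. */
void print_aarch64_gprs(const struct aarch64_gprs *r)
{
	int i;

	for (i = 0; i < 31; i++)
		printf("x%-2d: %016llx\n", i, (unsigned long long)r->x[i]);
	printf("sp : %016llx\n", (unsigned long long)r->sp);
}
```

Having x29 and SP as numbers in a record is not the same as being able to unwind from them: by the time the record is read, the stack they pointed at may have moved on.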
Thanks,
James