Hi Mathew,
Thank you for your response.
From: Rohit Mathew <Rohit.Mathew@arm.com>
Sent: Thursday, January 8, 2026 8:30 PM
To: Kummari, Prasad <Prasad.Kummari@amd.com>; Sammit Joshi <Sammit.Joshi@arm.com>; scan-admin--- via TF-A <tf-a@lists.trustedfirmware.org>
Cc: Belsare, Akshay <akshay.belsare@amd.com>; Bollapalli, Maheedhar Sai <MaheedharSai.Bollapalli@amd.com>; Simek, Michal <michal.simek@amd.com>; Kummari, Prasad <Prasad.Kummari@amd.com>; Manish Pandey2 <Manish.Pandey2@arm.com>; Boyan Karatotev <Boyan.Karatotev@arm.com>;
Chris Kay <Chris.Kay@arm.com>
Subject: Re: ZynqMP regression with NUMA_AWARE_PER_CPU changes with ENABLED_LTO : Linux runtime hang due to EL3 re-entry
Hi Prasad,
Thanks for sharing all the build artefacts and the requested information. The register state (X2 and SP_EL0) plus the map/dump files for the DDR boot show that CPU2 took an exception, but the exact point of the crash is unfortunately not recoverable from the shared artefacts: some registers are lost due to scrambled logs, and the remaining ones don't show any offsets/addresses that can be traced back through the dump and map files. The OCM boot logs don't show an exception, so there isn't much to decode from them.
The map and dump files for the failing LTO builds don't look different from the passing ones we have on our end (we have three passing builds on three different platforms). The order in which objects are placed in the section changes between an LTO and a non-LTO build, but the per-CPU accessors are object-order-agnostic within a CPU's space in general.
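(As an illustration only, and not the TF-A sources: an index-based per-CPU accessor of the shape sketched below depends only on the per-CPU stride and the calling CPU's index, not on the link-time order of the objects inside the section. The symbol and stride names here are assumptions made for the sketch.)

#include <stdint.h>

#define PCPU_STRIDE	0x400U		/* assumed per-CPU slot size */

extern uint8_t __percpu_start[];	/* assumed start-of-section symbol */

/* base + (cpu_idx * stride) + offset: LTO may reorder the objects packed
 * into each slot between builds, but within one build a given offset
 * resolves identically for every CPU, so the accessor itself does not
 * care about object order. */
static inline void *pcpu_ptr(unsigned int cpu_idx, uintptr_t offset)
{
	return (void *)(__percpu_start +
			((uintptr_t)cpu_idx * PCPU_STRIDE) + offset);
}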
You mentioned that the cores that crashed were later seen in bl31_warm_entrypoint. Could this mean that the CPU suspend sequence towards the ZynqMP PMU had already been issued before the core took the exception, so that the core never landed at the WFI and instead went into the exception path? Is this something that could be checked relatively easily? Also, do you know the exact instruction at which such cores were hung in the warm boot entrypoint? If we can't get much more information, we might have to think about adding a debug patch to retrieve more details about the crash. Could we also check:
Observations:
The cores did not actually wake up. Instead, they were found in the No Power, Running, or Reset Catch states, with the CPU program counter pointing to bl31_warm_entrypoint. When an interrupt is manually injected using the XSDB debugger, the core transitions to the Running state. It appears that the CPU suspend sequence was issued to the ZynqMP PMU, after which an exception was triggered, as per the logs. The exact instruction at which the core gets stuck could not be determined, as the behavior is quite random; in some cases, certain CPUs enter Reset Catch with the PC pointing to bl31_warm_entrypoint.
We performed three iterations with the DDR boot. The system booted correctly once, while the issue occurred in the other two runs.
Refer to the logs below, captured with BL31 runtime logs disabled, for a proper dump:
dmesg:
[ 6.617446] sd 2:0:0:0: [sdb] Attached SCSI removable disk
INIT: Entering runlevel: 5
Configuring network interfaces... [ 7.050920] macb ff0e0000.ethernet eth0: PHY [ff0e0000.ethernet-ffffffff:0c] driver [TI DP83867] (irq=POLL)
[ 7.060769] macb ff0e0000.ethernet eth0: configuring for phy/rgmii-id link mode
[ 7.068763] macb ff0e0000.ethernet: gem-ptp-timer ptp clock registered.
udhcpc: started, v1.36.1
udhcpc: broadcasting discover
Unhandled Exception in EL3.
x30 = 0x0000000000000002
x0 = 0x0000000000000000
x1 = 0x0000000000011638
x2 = 0x00000000000125c0
x3 = 0x000000000000003f
x4 = 0x00000000000002c0
x5 = 0x0000000055540000
x6 = 0x000000000000b554
x7 = 0x00000000f9010000
x8 = 0x0000000000000008
x9 = 0x000000000000b1b0
x10 = 0x000000002000ff00
x11 = 0x000000000000b554
x12 = 0x00000000f9010400
x13 = 0x0000000000001ff0
x14 = 0x0000000000000001
x15 = 0x0000000000001100
x16 = 0x6710328481440018
x17 = 0x5000735413ee4101
x18 = 0x201f0181a5141414
x19 = 0x0040420cc905600d
x20 = 0x9320242204661cc5
x21 = 0x00da002043400012
x22 = 0x000042401021fc8f
x23 = 0x8402230493c01e4c
x24 = 0x102238309108d512
x25 = 0x908ce0c38144b4ab
x26 = 0xe22408240944ae0f
x27 = 0x0040008048920017
x28 = 0x194c481653804086
x29 = 0x0af2ec062181d02d
scr_el3 = 0x0000000000000238
sctlr_el3 = 0x0000000030cd183f
cptr_el3 = 0x0000000000000000
tcr_el3 = 0x0000000080803520
daif = 0x00000000000003c0
mair_el3 = 0x00000000004400ff
spsr_el3 = 0x00000000200002cc
elr_el3 = 0x0000000000000002
ttbr0_el3 = 0x0000000000011bc0
esr_el3 = 0x000000008a000000
far_el3 = 0x0000000000000002
mpidr_el1 = 0x0000000080000003
sp_el0 = 0x0000000000011640
isr_el1 = 0x0000000000000000
dacr32_el2 = 0x0000000000000000
ifsr32_el2 = 0x0000000000000000
cpuectlr_el1 = 0x0000000000000040
cpumerrsr_el1 = 0x000000000804020a
l2merrsr_el1 = 0x0000000010008120
cpuactlr_el1 = 0x00001000090ca000
gicc_hppir = 0x00000000000003fe
gicc_ahppir = 0x0000000000000801
gicc_ctlr = 0x00000000000001e9
gicd_ispendr regs (Offsets 0x200-0x278)
Offset Value
0x200: 0x0000000000000012
0x208: 0x0000000000000000
0x210: 0x0000000000000000
0x218: 0x0000000000000000
0x220: 0x0000000000000000
0x228: 0x0000000000000000
0x230: 0x0000000000000000
0x238: 0x0000000000000000
0x240: 0x0000000000000000
0x248: 0x0000000000000000
0x250: 0x0000000000000000
0x258: 0x0000000000000000
0x260: 0x0000000000000000
0x268: 0x0000000000000000
0x270: 0x0000000000000000
0x278: 0x0000000000000000
cci_snoop_ctrl_cluster0x100000000c0000003
cci_snoop_ctrl_cluster1x100000000c0000000
cci_snoop_ctrl_cluster1x100000000c0000000
[ 10.151108] macb ff0e0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control tx
[ 14.824964] platform ina226-u76: deferred probe pending: iio_hwmon: Failed to get channels
[ 14.833259] platform ina226-u77: deferred probe pending: iio_hwmon: Failed to get channels
[ 14.841532] platform ina226-u78: deferred probe pending: iio_hwmon: Failed to get channels
[ 14.849803] platform ina226-u87: deferred probe pending: iio_hwmon: Failed to get channels
[ 14.858074] platform ina226-u85: deferred probe pending: iio_hwmon: Failed to get channels
[ 14.866344] platform ina226-u86: deferred probe pending: iio_hwmon: Failed to get channels
…..hang
debugger logs:
xsdb% ta
1 PS TAP
2 PMU
3 MicroBlaze PMU (Sleeping. No clock)
4 PL
5 PSU
6 RPU
7 Cortex-R5 #0 (Halted)
8 Cortex-R5 #1 (Lock Step Mode)
9 APU
10* Cortex-A53 #0 (No Power)
11 Cortex-A53 #1 (No Power)
12 Cortex-A53 #2 (Running)
13 Cortex-A53 #3 (Running)
xsdb% Info: Cortex-A53 #0 (target 10) Running (No Power)
xsdb% Info: Cortex-A53 #2 (target 12) Running (No Power)
xsdb% Info: Cortex-A53 #2 (target 12) Running (No Power)
xsdb% Info: Cortex-A53 #1 (target 11) Running (No Power)
xsdb% Info: Cortex-A53 #1 (target 11) Running (No Power)
xsdb% Info: Cortex-A53 #1 (target 11) Running (APB AP transaction error, DAP status 0x30000021)
xsdb% Info: Cortex-A53 #2 (target 12) Running (No Power)
xsdb% Info: Cortex-A53 #1 (target 11) Running (No Power)
xsdb% Info: Cortex-A53 #0 (target 10) Stopped at 0x1194 (Reset Catch)
xsdb% Info: Cortex-A53 #1 (target 11) Running (No Power)
xsdb% Info: Cortex-A53 #1 (target 11) Running (No Power)
xsdb% ta 10
9 APU
10 Cortex-A53 #0 (Reset Catch, EL3(S)/A64)
11 Cortex-A53 #1 (Running)
12 Cortex-A53 #2 (Running)
13 Cortex-A53 #3 (Running)
xsdb% rrd pc
pc: 0000000000001194
bl31.dump:
0000000000001194 <bl31_warm_entrypoint>:
    1194: d2810600  mov   x0, #0x830   // #2096
    1198: f2a618a0  movk  x0, #0x30c5, lsl #16
    119c: d51e1000  msr   sctlr_el3, x0
    11a0: d5033fdf  isb
    11a4: 1004b2e0  adr   x0, a800 <sync_exception_sp_el0>
    11a8: d51ec000  msr   vbar_el3, x0
    11ac: d5033fdf  isb
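(A side note on the register dump earlier in the log: the EC/IL/ISS fields of esr_el3 can be decoded mechanically per the Arm architecture. The small standalone sketch below, which is not TF-A code, prints EC = 0x22 for the captured value 0x8a000000, i.e. the PC-alignment-fault class, which is at least consistent with elr_el3, far_el3 and x30 all holding 0x2.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t esr = 0x8a000000ULL;	/* esr_el3 from the dump above */
	unsigned int ec  = (unsigned int)((esr >> 26) & 0x3fU);	/* exception class */
	unsigned int il  = (unsigned int)((esr >> 25) & 0x1U);	/* instruction length bit */
	unsigned int iss = (unsigned int)(esr & 0x1ffffffU);	/* syndrome */

	/* For 0x8a000000 this prints EC=0x22 IL=1 ISS=0x0. */
	printf("EC=0x%x IL=%u ISS=0x%x\n", ec, il, iss);
	return 0;
}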
aarch64-linux/bin/aarch64-linux-gnu-gcc --version
ls (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
Yes, for both DEBUG=1 and DEBUG=0 builds, ENABLE_LTO=1 is enabled by default in platform.mk.
It seems that not everyone on the Arm side is receiving the mails from the mailing list. Would it be okay to move the rest of the discussion/debugging to the TF-A Discord so that everyone can participate? Let us know.
Yes, please. Could you share the debug patch? We will apply it and send the required logs back to you.
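(Purely as a sketch of the kind of breadcrumb such a debug patch might add, assuming a console print in the platform suspend path is acceptable; where exactly to call it is an assumption on our side, and the actual patch from your end is of course preferred.)

#include <arch_helpers.h>
#include <common/debug.h>

/* Hypothetical breadcrumb: log which core last entered the suspend path,
 * so the console shows it just before any subsequent crash dump.
 * NOTICE() and read_mpidr_el1() are existing TF-A helpers; the place to
 * call this (e.g. the platform's pwr_domain_suspend hook) is an
 * assumption for this sketch. */
void suspend_breadcrumb(void)
{
	NOTICE("BL31: mpidr 0x%llx entering suspend path\n",
	       (unsigned long long)read_mpidr_el1());
}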
Regards,
Rohit