On 10/10/2020 11:07, Biju Das wrote:
Hi Biju,
Subject: Re: commit 75fab6496e5fce9a11 ("libc: memset: improve performance by avoiding single byte writes") causing BL31 boot failure on Renesas RZ/G2 platforms.
On 09/10/2020 20:45, Biju Das wrote:
Hi,
I have found the issue. Please see inline.
On 09/10/2020 18:37, Biju Das wrote:
Hi Biju,
>>>> Subject: Re: commit 75fab6496e5fce9a11 ("libc: memset: improve >>>> performance by avoiding single byte writes") causing BL31 boot >>>> failure on Renesas RZ/G2 platforms. >>>> >>>> On 08/10/2020 20:05, Biju Das wrote: >>>>> Then on target, found that BL31 is failed to boot[3], which is >>>>> fixed by reverting the commit 75fab6496e5fce9a11 >>>>> ("libc: memset: improve performance by avoiding single byte >>>>> writes"), see >>>> the logs [3]. >>>> >>>> Mmmh, interesting. Can you build with "DEBUG=1" to get more >>>> output from BL31? >>> >>> Sure Will do and provide feedback. > >>>> I see some calls to memset() from code in drivers/renesas/rcar. >>>> Can you add some debug prints at the top of memset() to dump >>>> the parameters on each call? To see what breaks it? >>> >>> OK. > > I believe commenting [1] fixed the issue which corresponds to > memset parameters[2]. > Looks like calling rcar_log_init-->memset from [3] is causing the issue.
Many thanks for debugging this, that's helpful!
But without your patch it is working fine. Only thing is, you have added extra code to > improve the speed. I am not sure how it is impacting here. > Currently I don't have access to Lauterbach to debug further. Any thoughts ???
So this RCAR_BL31_LOG_BASE address is mapped as device memory,
what
kind of device is this? SRAM? Or something non-volatile, even? Even if it's SRAM, sometimes the controller managing this does not support all AMBA accesses (hence it's a device, not memory).
It is DRAM only, with Security protected attribute.
Mmh, interesting. Why is it then mapped as device?
#define MAP_ATFW_LOG
MAP_REGION_FLAT(RCAR_BL31_LOG_BASE,
\ RCAR_BL31_LOG_SIZE, \ MT_DEVICE | MT_RW | MT_SECURE)
You could try using MT_NORMAL instead. Or ...
In general letting the compiler access device memory directly is asking for trouble, every access should go through MMIO accessors. Linux has memset_io() for those kind of problems.
Can you try the attached patch? That should downgrade the stores to 32-bit ones. Not a solution, but would help to isolate the problem.
Patch didn't fix the issue.
I see, bummer. Does it work if you apply this on top?
diff --git a/lib/libc/memset.c b/lib/libc/memset.c index 5ad6ede51..7478871e3 100644 --- a/lib/libc/memset.c +++ b/lib/libc/memset.c @@ -7,6 +7,7 @@ #include <stddef.h> #include <string.h> #include <stdint.h> +#include <lib/mmio.h>
void *memset(void *dst, int val, size_t count) { @@ -34,7 +35,8 @@ void *memset(void *dst, int val, size_t count) /* Use 32-bit writes for as long as possible. */ ptr32 = (void *)ptr; for (; count >= 4; count -= 4) { -*ptr32++ = fill; +mmio_write_32((uintptr_t)ptr32, fill); +ptr32++; }
/* Handle the remaining part byte-per-byte. */
However I copied the same code emmc_memset to rcar_memset() and
with that it works fine.
Looks like byte access works OK. Any thoughts??
That's weird, I just see "str w1, [x3, x5]" in the disassembly, and since everything is nicely aligned, it should actually work. That is still with the note that letting the compiler access device memory is wrong, but the generated code is actually very similar in this special case
here.
Issue is with your original commit[1], it is entering infinite loop. I
commented that while loop and it is fine.
[1]https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/li b/libc/memset.c#n43
That's very weird. I stared at the code for a while, and can't see the issue. I also put it through my unit tests again, and tried to mimic your special case (clearing 81840 bytes at 0xyyyy0010). That worked fine for me every time. Do you see the actual problem?
Can you send me the resulting binary: build/rcar/release/libc/memset.o? Just to check what your compiler made of it?
Here it is [1] https://gitlab.com/bijudas/debug
Thanks, that is actually identical to what my compiler produced, so that means toolchain-wise we are on the same page.
Just a suggestion, Why can't we use finite for loops with pointer arithmetic instead of the while loops?
- first for loop with ptr64 and ptr64 +(remainingcount >> 3) and
- second for loop with ptr and ptr + (remainingcount & 7)
But what would that change?
Please see the code generated by compiler for working case and non working case.
Just to clarify: with "working" you mean "commenting the last while loop"? Which breaks the algorithm, as we would lose the tail bytes to be filled, in case the end of the to-be-filled region is not 64-bit aligned? So it's not an option to just skip that. It just happens to work in your case because the end *is* 64-bit aligned, so the final loop is not needed.
And besides, I still don't see the problem. I stepped through the disassembly, with (<aligned ptr>, 0, 81840) for the parameters and don't see why there would be a problem. Once it's done with the 64-bit write section, both x2 and x6 are 0, so the cmp at 0x48 sets the Z bit, and "b.eq 68" branches to the ret. So the whole final loop is skipped, as intended.
Please see the line inserted by compiler for non working case. 3c:928000e4 movx4, #0xfffffffffffffff8 // #-8
Yes, that has a particular purpose (which is not very obvious because this is -Os optimised): It is used to *subtract* the already filled bytes from the original length, to then serve as the limit for the final while loop - which is actually turned in a for loop in assembly! There is no msub in AArch64, so the compiler uses madd and makes one of the factors negative.
So, can you please double check that: memset(0x44040010, 0, 81840); is really the call that hangs your system? And that it really does because so it's being stuck in the last while loop?
And can you make this RCAR_BL31_LOG_BASE region MT_NORMAL memory and see if the problem still persists?
biju@biju-VirtualBox:~/work/test/trusted-firmware-a$ /usr/share/gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-objdump -S build/rzg/release/libc/memset.o
build/rzg/release/libc/memset.o: file format elf64-littleaarch64
Disassembly of section .text.memset:
0000000000000000 <memset>: 0:b4000342 cbzx2, 68 <memset+0x68> 4:12001c21 andw1, w1, #0xff 8:aa0003e5 movx5, x0 c:f24008a6 andsx6, x5, #0x7 10:54000261 b.ne5c <memset+0x5c> // b.any 14:d3781c24 ubfizx4, x1, #8, #8 18:92401c23 andx3, x1, #0xff 1c:aa030083 orrx3, x4, x3 20:d2800004 movx4, #0x0 // #0 24:aa034063 orrx3, x3, x3, lsl #16 28:aa038063 orrx3, x3, x3, lsl #32 2c:cb040047 subx7, x2, x4 30:f1001cff cmpx7, #0x7 34:540001c8 b.hi6c <memset+0x6c> // b.pmore 38:d343fc43 lsrx3, x2, #3 3c:928000e4 movx4, #0xfffffffffffffff8 // #-8 40:8b030ca5 addx5, x5, x3, lsl #3 44:9b040862 maddx2, x3, x4, x2 48:eb06005f cmpx2, x6 4c:540000e0 b.eq68 <memset+0x68> // b.none 50:382668a1 strbw1, [x5, x6] 54:910004c6 addx6, x6, #0x1 58:17fffffc b48 <memset+0x48> 5c:380014a1 strbw1, [x5], #1 60:f1000442 subsx2, x2, #0x1 64:54fffd41 b.nec <memset+0xc> // b.any 68:d65f03c0 ret 6c:f82468a3 strx3, [x5, x4] 70:91002084 addx4, x4, #0x8 74:17ffffee b2c <memset+0x2c>
Disassembly with commenting last while loop (Working case)
biju@biju-VirtualBox:~/work/test/trusted-firmware-a$ /usr/share/gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-objdump -S memset.o
memset.o: file format elf64-littleaarch64
Disassembly of section .text.memset:
0000000000000000 <memset>: 0:b4000262 cbzx2, 4c <memset+0x4c> 4:12001c21 andw1, w1, #0xff 8:aa0003e3 movx3, x0 c:f2400864 andsx4, x3, #0x7 10:54000181 b.ne40 <memset+0x40> // b.any 14:92401c25 andx5, x1, #0xff 18:d3781c21 ubfizx1, x1, #8, #8 1c:aa050021 orrx1, x1, x5 20:aa014021 orrx1, x1, x1, lsl #16 24:aa018021 orrx1, x1, x1, lsl #32 28:cb040045 subx5, x2, x4 2c:f1001cbf cmpx5, #0x7 30:540000e9 b.ls4c <memset+0x4c> // b.plast 34:f8246861 strx1, [x3, x4] 38:91002084 addx4, x4, #0x8 3c:17fffffb b28 <memset+0x28> 40:38001461 strbw1, [x3], #1 44:f1000442 subsx2, x2, #0x1 48:54fffe21 b.nec <memset+0xc> // b.any 4c:d65f03c0 ret
Sure, having no tail loop simplifies the algorithm, but is wrong. So what is the point here?
Cheers, Andre