|
Similar to commit a6c30873ee4a ("ARM: 8989/1: use .fpu assembler
directives instead of assembler arguments").
GCC and GNU binutils support setting the "sub arch" via -march=,
-Wa,-march, target function attribute, and .arch assembler directive.
Clang was missing support for -Wa,-march=, but this was implemented in
clang-13.
The behavior of both GCC and Clang is to
prefer -Wa,-march= over -march= for assembler and assembler-with-cpp
sources, but Clang will warn about the -march= being unused.
clang: warning: argument unused during compilation: '-march=armv6k'
[-Wunused-command-line-argument]
Since most assembler is non-conditionally assembled with one sub arch
(modulo arch/arm/delay-loop.S which conditionally is assembled as armv4
based on CONFIG_ARCH_RPC, and arch/arm/mach-at91/pm-suspend.S which is
conditionally assembled as armv7-a based on CONFIG_CPU_V7), prefer the
.arch assembler directive.
Add a few more instances found in compile testing as found by Arnd and
Nathan.
Link: https://github.com/llvm/llvm-project/commit/1d51c699b9e2ebc5bcfdbe85c74cc871426333d4
Link: https://bugs.llvm.org/show_bug.cgi?id=48894
Link: https://github.com/ClangBuiltLinux/linux/issues/1195
Link: https://github.com/ClangBuiltLinux/linux/issues/1315
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
ARMv6 and greater introduced a new instruction ("bx") which can be used
to return from function calls. Recent CPUs perform better when the
"bx lr" instruction is used rather than the "mov pc, lr" instruction,
and this sequence is strongly recommended to be used by the ARM
architecture manual (section A.4.1.1).
We provide a new macro "ret" with all its variants for the condition
code which will resolve to the appropriate instruction.
Rather than doing this piecemeal, and miss some instances, change all
the "mov pc" instances to use the new macro, with the exception of
the "movs" instruction and the kprobes code. This allows us to detect
the "mov pc, lr" case and fix it up - and also gives us the possibility
of deploying this for other registers depending on the CPU selection.
Reported-by: Will Deacon <will.deacon@arm.com>
Tested-by: Stephen Warren <swarren@nvidia.com> # Tegra Jetson TK1
Tested-by: Robert Jarzmik <robert.jarzmik@free.fr> # mioa701_bootresume.S
Tested-by: Andrew Lunn <andrew@lunn.ch> # Kirkwood
Tested-by: Shawn Guo <shawn.guo@freescale.com>
Tested-by: Tony Lindgren <tony@atomide.com> # OMAPs
Tested-by: Gregory CLEMENT <gregory.clement@free-electrons.com> # Armada XP, 375, 385
Acked-by: Sekhar Nori <nsekhar@ti.com> # DaVinci
Acked-by: Christoffer Dall <christoffer.dall@linaro.org> # kvm/hyp
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> # PXA3xx
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> # Xen
Tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> # ARMv7M
Tested-by: Simon Horman <horms+renesas@verge.net.au> # Shmobile
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
|
|
Currently mx53 (CortexA8) running at 1GHz reports:
Calibrating delay loop... 663.55 BogoMIPS (lpj=3317760)
Tom Evans verified that alignments of 0x0 and 0x8 run the two instructions of __loop_delay in one clock cycle (1 clock/loop), while alignments of 0x4 and 0xc take 3 clocks to run the loop twice. (1.5 clock/loop)
The original object code looks like this:
00000010 <__loop_const_udelay>:
10: e3e01000 mvn r1, #0
14: e51f201c ldr r2, [pc, #-28] ; 0 <__loop_udelay-0x8>
18: e5922000 ldr r2, [r2]
1c: e0800921 add r0, r0, r1, lsr #18
20: e1a00720 lsr r0, r0, #14
24: e0822b21 add r2, r2, r1, lsr #22
28: e1a02522 lsr r2, r2, #10
2c: e0000092 mul r0, r2, r0
30: e0800d21 add r0, r0, r1, lsr #26
34: e1b00320 lsrs r0, r0, #6
38: 01a0f00e moveq pc, lr
0000003c <__loop_delay>:
3c: e2500001 subs r0, r0, #1
40: 8afffffe bhi 3c <__loop_delay>
44: e1a0f00e mov pc, lr
After adding the 'align 3' directive to __loop_delay (align to 8 bytes):
00000010 <__loop_const_udelay>:
10: e3e01000 mvn r1, #0
14: e51f201c ldr r2, [pc, #-28] ; 0 <__loop_udelay-0x8>
18: e5922000 ldr r2, [r2]
1c: e0800921 add r0, r0, r1, lsr #18
20: e1a00720 lsr r0, r0, #14
24: e0822b21 add r2, r2, r1, lsr #22
28: e1a02522 lsr r2, r2, #10
2c: e0000092 mul r0, r2, r0
30: e0800d21 add r0, r0, r1, lsr #26
34: e1b00320 lsrs r0, r0, #6
38: 01a0f00e moveq pc, lr
3c: e320f000 nop {0}
00000040 <__loop_delay>:
40: e2500001 subs r0, r0, #1
44: 8afffffe bhi 40 <__loop_delay>
48: e1a0f00e mov pc, lr
4c: e320f000 nop {0}
, which now reports:
Calibrating delay loop... 996.14 BogoMIPS (lpj=4980736)
Some more test results:
On mx31 (ARM1136) running at 532 MHz, before the patch:
Calibrating delay loop... 351.43 BogoMIPS (lpj=1757184)
On mx31 (ARM1136) running at 532 MHz after the patch:
Calibrating delay loop... 528.79 BogoMIPS (lpj=2643968)
Also tested on mx6 (CortexA9) and on mx27 (ARM926), which shows the same
BogoMIPS value before and after this patch.
Reported-by: Tom Evans <tom_usenet@optusnet.com.au>
Suggested-by: Tom Evans <tom_usenet@optusnet.com.au>
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
|