path: root/arch/arm64/kernel
2024-11-14  perf/core: Correct perf sampling with guest VMs  (Colton Lewis)
Previously any PMU overflow interrupt that fired while a VCPU was loaded was recorded as a guest event whether it truly was or not. This resulted in nonsense perf recordings that did not honor perf_event_attr.exclude_guest and recorded guest IPs where it should have recorded host IPs. Rework the sampling logic to only record guest samples for events with exclude_guest = 0. This way any host-only events with exclude_guest set will never see unexpected guest samples. The behaviour of events with exclude_guest = 0 is unchanged. Note that events configured to sample both host and guest may still misattribute a PMI that arrived in the host as a guest event depending on KVM arch and vendor behavior. Signed-off-by: Colton Lewis <coltonlewis@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241113190156.2145593-6-coltonlewis@google.com
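The attribution rule described above can be pictured with a small sketch (not the actual hunk from this commit; the wrapper below is hypothetical, while perf_guest_state(), PERF_GUEST_ACTIVE and perf_event_attr.exclude_guest are existing kernel interfaces):

    /* Attribute a PMI to the guest only if the event actually counts guests. */
    static bool sample_is_guest(struct perf_event *event)
    {
            /* A vCPU is currently loaded on this CPU... */
            if (!(perf_guest_state() & PERF_GUEST_ACTIVE))
                    return false;

            /* ...and the event did not opt out of guest samples. */
            return !event->attr.exclude_guest;
    }

Host-only events (exclude_guest = 1) therefore always take the host IP, even if the interrupt happened to arrive while a vCPU was loaded.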
2024-11-14  perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()  (Colton Lewis)
For clarity, rename the arch-specific definitions of these functions to perf_arch_* to denote they are arch-specific. Define the generic-named functions in one place where they can call the arch-specific ones as needed. Signed-off-by: Colton Lewis <coltonlewis@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20241113190156.2145593-3-coltonlewis@google.com
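The resulting split can be sketched as follows (illustrative only; the exact signatures and guest handling in the generic code may differ):

    /* Generic definitions, in one place, delegating to the arch hooks. */
    unsigned long perf_instruction_pointer(struct pt_regs *regs)
    {
            return perf_arch_instruction_pointer(regs);
    }

    unsigned long perf_misc_flags(struct pt_regs *regs)
    {
            return perf_arch_misc_flags(regs);
    }

Each architecture then only provides perf_arch_instruction_pointer() and perf_arch_misc_flags(), or falls back to a default.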
2024-11-12  arm64/ptrace: Clarify documentation of VL configuration via ptrace  (Mark Brown)
When we configure SVE, SSVE or ZA via ptrace we allow the user to configure the vector length and specify any of the flags that are accepted when configuring via prctl(). This includes the S[VM]E_SET_VL_ONEXEC flag which defers the configuration of the VL until an exec(). We don't do anything to limit the provision of register data as part of configuring the _ONEXEC VL but, as a function of the VL enumeration support we do, this will be interpreted using the vector length currently configured for the process. This is all a bit surprising, and probably we should just not have allowed register data to be specified with _ONEXEC, but it's our ABI so let's add some explicit documentation in both the ABI documents and the source calling out what happens. The comments are also missing the fact that, since SME does not have a mandatory 128-bit VL, it is possible for VL enumeration to result in the configuration of a higher VL than was requested; cover that too. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241106-arm64-sve-ptrace-vl-set-v1-1-3b164e8b559c@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-12  arm64/mm: Change protval as 'pteval_t' in map_range()  (Anshuman Khandual)
pgprot_t has been defined as an encapsulated structure with pteval_t as its element. Hence it is prudent to use pteval_t as the type instead of the size-based u64. Besides, pteval_t might be a different size later on with FEAT_D128. Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20241111075249.609493-1-anshuman.khandual@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-08  Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  (Linus Torvalds)
Pull arm64 fixes from Will Deacon:
 "Here is a (hopefully) final round of arm64 fixes for 6.12 that address some user-visible floating point register corruption. Both of the Marks have been working on this for a couple of weeks and we've ended up in a position where SVE is solid but SME still has enough pending issues that the most pragmatic solution for the release and stable backports is to disable the feature. Yes, it's a shame, but the hardware is rare as hen's teeth at the moment and we're better off getting back to a known good state before fixing it all properly. We're also improving the selftests for 6.13 to help avoid merging broken code in the future. Anyway, the good news is that we're removing a lot more code than we're adding.
  Summary:
   - Fix handling of SVE traps from userspace on preemptible kernels when converting the saved floating point state into SVE state.
   - Remove broken support for the SMCCCv1.3 "SVE discard hint" optimisation.
   - Disable SME support, as the current support code suffers from numerous issues around signal delivery, ptrace access and context-switch which can lead to user-visible corruption of the register state"
 * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
   arm64: Kconfig: Make SME depend on BROKEN for now
   arm64: smccc: Remove broken support for SMCCCv1.3 SVE discard hint
   arm64/sve: Discard stale CPU state when handling SVE traps
2024-11-08  arm64/scs: Deal with 64-bit relative offsets in FDE frames  (Ard Biesheuvel)
In some cases, the compiler may decide to emit DWARF FDE frames with 64-bit signed fields for the code offset and range fields. This may happen when using the large code model, for instance, which permits an executable to be spread out over more than 4 GiB of address space. Whether this is the case can be inferred from the augmentation data in the CIE frame, so decode this data before processing the FDE frames. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Tested-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20241106185513.3096442-7-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-08  arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames  (Ard Biesheuvel)
The dynamic SCS patching code pretends to parse the DWARF augmentation data in the CIE (header) frame, and handle accordingly when processing the individual FDE frames based on this CIE frame. However, the boolean variable is defined inside the loop, and so the parsed value is ignored. The same applies to the code alignment field, which is also read from the header but then discarded. This was never spotted before because Clang is the only compiler that supports dynamic SCS patching (which is essentially an Android feature), and the unwind tables it produces are highly uniform, and match the de facto defaults. So instead of testing for the 'z' flag in the augmentation data field, require a fixed augmentation data string of 'zR', and simplify the rest of the code accordingly. Also introduce some error codes to specify why the patching failed, and log it to the kernel console on failure when this happens when loading a module. (Doing so for vmlinux is infeasible, as the patching is done extremely early in the boot.) Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Tested-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20241106185513.3096442-6-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
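A minimal sketch of the stricter check (hypothetical helper, not the exact code; the real parser also records the FDE pointer encoding that follows the 'zR' data):

    /* Accept only CIEs whose augmentation string is exactly "zR". */
    static int scs_parse_augmentation(const char *aug)
    {
            if (strcmp(aug, "zR"))
                    return -ENOTSUPP;       /* anything else: refuse to patch */
            return 0;
    }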
2024-11-08  arm64: uprobes: Optimize cache flushes for xol slot  (Liao Chang)
The profiling of the single-thread selftests bench reveals a bottleneck in caches_clean_inval_pou() on ARM64. On my local testing machine, this function takes approximately 34% of CPU cycles for trig-uprobe-nop and trig-uprobe-push. This patch adds a check to avoid an unnecessary cache flush when writing an instruction to the xol slot. If the instruction is the same as the existing instruction in the slot, there is no need to synchronize the D/I caches. Since xol slot allocation and updates occur on the hot path of uprobe handling, benchmarking the upstream kernel on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores @ 2.4GHz shows this optimization has an obvious gain for the nop and push testcases.

Before (next-20240918)
----------------------
uprobe-nop      ( 1 cpus): 0.418 ± 0.001M/s ( 0.418M/s/cpu)
uprobe-push     ( 1 cpus): 0.411 ± 0.005M/s ( 0.411M/s/cpu)
uprobe-ret      ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu)
uretprobe-nop   ( 1 cpus): 0.350 ± 0.000M/s ( 0.350M/s/cpu)
uretprobe-push  ( 1 cpus): 0.353 ± 0.000M/s ( 0.353M/s/cpu)
uretprobe-ret   ( 1 cpus): 1.074 ± 0.001M/s ( 1.074M/s/cpu)

After
-----
uprobe-nop      ( 1 cpus): 0.926 ± 0.000M/s ( 0.926M/s/cpu)
uprobe-push     ( 1 cpus): 0.910 ± 0.001M/s ( 0.910M/s/cpu)
uprobe-ret      ( 1 cpus): 2.056 ± 0.001M/s ( 2.056M/s/cpu)
uretprobe-nop   ( 1 cpus): 0.653 ± 0.001M/s ( 0.653M/s/cpu)
uretprobe-push  ( 1 cpus): 0.645 ± 0.000M/s ( 0.645M/s/cpu)
uretprobe-ret   ( 1 cpus): 1.093 ± 0.001M/s ( 1.093M/s/cpu)

Signed-off-by: Liao Chang <liaochang1@huawei.com> Link: https://lore.kernel.org/r/20240919121719.2148361-1-liaochang1@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
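The idea can be sketched as below (approximate; helper names such as sync_icache_aliases() come from the arm64 cache maintenance code, but the exact hunk may differ):

    void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
                               void *src, unsigned long len)
    {
            void *xol_page_kaddr = kmap_atomic(page);
            void *dst = xol_page_kaddr + (vaddr & ~PAGE_MASK);

            /* Initialize the slot only if it doesn't already hold the insn. */
            if (memcmp(dst, src, len)) {
                    memcpy(dst, src, len);
                    /* Keep the D-cache and I-cache in sync for the new insn. */
                    sync_icache_aliases((unsigned long)dst,
                                        (unsigned long)dst + len);
            }

            kunmap_atomic(xol_page_kaddr);
    }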
2024-11-07  asm-generic: introduce text-patching.h  (Mike Rapoport (Microsoft))
Several architectures support text patching, but they name the header files that declare patching functions differently. Make all such headers consistently named text-patching.h and add an empty header in asm-generic for architectures that do not support text patching. Link: https://lkml.kernel.org/r/20241023162711.2579610-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07  arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG  (Masahiro Yamada)
Commit be2881824ae9 ("arm64/build: Assert for unwanted sections") introduced an assertion to ensure that the .data.rel.ro section does not exist. However, this check does not work when CONFIG_LTO_CLANG is enabled, because .data.rel.ro matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro. Move the ASSERT() above the RW_DATA() line. Fixes: be2881824ae9 ("arm64/build: Assert for unwanted sections") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241106161843.189927-1-masahiroy@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-07  arm64: smccc: Remove broken support for SMCCCv1.3 SVE discard hint  (Mark Rutland)
SMCCCv1.3 added a hint bit which callers can set in an SMCCC function ID (AKA "FID") to indicate that it is acceptable for the SMCCC implementation to discard SVE and/or SME state over a specific SMCCC call. The kernel support for using this hint is broken and SMCCC calls may clobber the SVE and/or SME state of arbitrary tasks, though FPSIMD state is unaffected. The kernel support is intended to use the hint when there is no SVE or SME state to save, and to do this it checks whether TIF_FOREIGN_FPSTATE is set or TIF_SVE is clear in assembly code:

| ldr  <flags>, [<current_task>, #TSK_TI_FLAGS]
| tbnz <flags>, #TIF_FOREIGN_FPSTATE, 1f   // Any live FP state?
| tbnz <flags>, #TIF_SVE, 2f               // Does that state include SVE?
|
| 1: orr <fid>, <fid>, ARM_SMCCC_1_3_SVE_HINT
| 2:
| << SMCCC call using FID >>

This is not safe as-is:

(1) SMCCC calls can be made in a preemptible context and preemption can result in TIF_FOREIGN_FPSTATE being set or cleared at arbitrary points in time. Thus checking for TIF_FOREIGN_FPSTATE provides no guarantee.

(2) TIF_FOREIGN_FPSTATE only indicates that the live FP/SVE/SME state in the CPU does not belong to the current task, and does not indicate that clobbering this state is acceptable. When the live CPU state is clobbered it is necessary to update fpsimd_last_state.st to ensure that a subsequent context switch will reload FP/SVE/SME state from memory rather than consuming the clobbered state. This and the SMCCC call itself must happen in a critical section with preemption disabled to avoid races.

(3) Live SVE/SME state can exist with TIF_SVE clear (e.g. with only TIF_SME set), and checking TIF_SVE alone is insufficient.

Remove the broken support for the SMCCCv1.3 SVE saving hint. This is effectively a revert of commits:

* cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint")
* a7c3acca5380 ("arm64: smccc: Save lr before calling __arm_smccc_sve_check()")

... leaving behind the ARM_SMCCC_VERSION_1_3 and ARM_SMCCC_1_3_SVE_HINT definitions, since these are simply definitions from the SMCCC specification, and the latter is used in KVM via ARM_SMCCC_CALL_HINTS. If we want to bring this back in future, we'll probably want to handle this logic in C where we can use all the usual FPSIMD/SVE/SME helper functions, and that'll likely require some rework of the SMCCC code and/or its callers. Fixes: cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241106160448.2712997-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-11-06  arm64/sve: Discard stale CPU state when handling SVE traps  (Mark Brown)
The logic for handling SVE traps manipulates saved FPSIMD/SVE state incorrectly, and a race with preemption can result in a task having TIF_SVE set and TIF_FOREIGN_FPSTATE clear even though the live CPU state is stale (e.g. with SVE traps enabled). This has been observed to result in warnings from do_sve_acc() where SVE traps are not expected while TIF_SVE is set:

| if (test_and_set_thread_flag(TIF_SVE))
|         WARN_ON(1); /* SVE access shouldn't have trapped */

Warnings of this form have been reported intermittently, e.g.

https://lore.kernel.org/linux-arm-kernel/CA+G9fYtEGe_DhY2Ms7+L7NKsLYUomGsgqpdBj+QwDLeSg=JhGg@mail.gmail.com/
https://lore.kernel.org/linux-arm-kernel/000000000000511e9a060ce5a45c@google.com/

The race can occur when the SVE trap handler is preempted before and after manipulating the saved FPSIMD/SVE state, starting and ending on the same CPU, e.g.

| void do_sve_acc(unsigned long esr, struct pt_regs *regs)
| {
|         // Trap on CPU 0 with TIF_SVE clear, SVE traps enabled
|         // task->fpsimd_cpu is 0.
|         // per_cpu_ptr(&fpsimd_last_state, 0) is task.
|
|         ...
|
|         // Preempted; migrated from CPU 0 to CPU 1.
|         // TIF_FOREIGN_FPSTATE is set.
|
|         get_cpu_fpsimd_context();
|
|         if (test_and_set_thread_flag(TIF_SVE))
|                 WARN_ON(1); /* SVE access shouldn't have trapped */
|
|         sve_init_regs() {
|                 if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
|                         ...
|                 } else {
|                         fpsimd_to_sve(current);
|                         current->thread.fp_type = FP_STATE_SVE;
|                 }
|         }
|
|         put_cpu_fpsimd_context();
|
|         // Preempted; migrated from CPU 1 to CPU 0.
|         // task->fpsimd_cpu is still 0
|         // If per_cpu_ptr(&fpsimd_last_state, 0) is still task then:
|         // - Stale HW state is reused (with SVE traps enabled)
|         // - TIF_FOREIGN_FPSTATE is cleared
|         // - A return to userspace skips HW state restore
| }

Fix the case where the state is not live and TIF_FOREIGN_FPSTATE is set by calling fpsimd_flush_task_state() to detach from the saved CPU state. This ensures that a subsequent context switch will not reuse the stale CPU state, and will instead set TIF_FOREIGN_FPSTATE, forcing the new state to be reloaded from memory prior to a return to userspace. Fixes: cccb78ce89c4 ("arm64/sve: Rework SVE access trap to convert state in registers") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20241030-arm64-fpsimd-foreign-flush-v1-1-bd7bd66905a2@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
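A sketch of the fix in the non-live branch of sve_init_regs() (approximate, based on the description above; fpsimd_flush_task_state() is the existing arm64 helper):

|         } else {
|                 fpsimd_to_sve(current);
|                 current->thread.fp_type = FP_STATE_SVE;
|                 /*
|                  * Detach from any stale per-CPU FPSIMD/SVE state so that
|                  * the next context switch sets TIF_FOREIGN_FPSTATE and
|                  * reloads the registers from memory before returning to
|                  * userspace.
|                  */
|                 fpsimd_flush_task_state(current);
|         }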
2024-11-05  arm64: Add support for FEAT_HAFT  (Yicong Yang)
Armv8.9/v9.4 introduces the feature Hardware managed Access Flag for Table descriptors (FEAT_HAFT). The feature is indicated by ID_AA64MMFR1_EL1.HAFDBS == 0b0011 and can be enabled by TCR2_EL1.HAFT, so it has a dependency on FEAT_TCR2. Add the Kconfig option for FEAT_HAFT and support for detecting and enabling the feature. The feature is enabled in __cpu_setup() before MMU on, just like HA. A CPU capability is added to notify the user of the feature. Add the definition of the P{G,4,U,M}D_TABLE_AF bit and set the AF bit when creating the page table, which will save the hardware from having to update them at runtime. This will be ignored if FEAT_HAFT is not enabled. The AF bit of table descriptors cannot be managed by the software per spec, unlike the HA. So this should be used only if it's supported system wide by system_supports_haft(). Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20241102104235.62560-4-yangyicong@huawei.com Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [catalin.marinas@arm.com: added the ID check back to __cpu_setup in case of future CPU errata] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-04  arm64: signal: Remove unused macro  (Kevin Brodsky)
Commit 33f082614c34 ("arm64: signal: Allow expansion of the signal frame") introduced the BASE_SIGFRAME_SIZE macro but it has apparently never been used; just remove it. Reviewed-by: Dave Martin <Dave.Martin@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Link: https://lore.kernel.org/r/20241029144539.111155-4-kevin.brodsky@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-04  arm64: signal: Remove unnecessary check when saving POE state  (Kevin Brodsky)
The POE frame record is allocated unconditionally if POE is supported. If the allocation fails, a SIGSEGV is delivered before setup_sigframe() can be reached. As a result there is no need to consider poe_offset before saving POR_EL0; just remove that check. This is in line with other frame records (FPMR, TPIDR2). Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Link: https://lore.kernel.org/r/20241029144539.111155-3-kevin.brodsky@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-04  arm64/fpsimd: Fix a typo  (Christophe JAILLET)
s/FPSMID/FPSIMD/ M and I swapped. Fix it. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/2cbcb42615e9265bccc9b746465d7998382e605d.1730539907.git.christophe.jaillet@wanadoo.fr Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-02  arm64: vdso: Use only one single vvar mapping  (Thomas Weißschuh)
The vvar mapping is the same for all processes. Use a single mapping to simplify the logic and align it with the other architectures. In addition this will enable the move of the vvar handling into generic code. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Will Deacon <will@kernel.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-5-b64f0842d512@linutronix.de
2024-11-02  arm64: vdso: Drop LBASE_VDSO  (Thomas Weißschuh)
This constant is always "0", providing no value and making the logic harder to understand. Also prepare for a consolidation of the vdso linker script logic by aligning it with other architectures. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-4-b64f0842d512@linutronix.de
2024-11-01  Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  (Linus Torvalds)
Pull arm64 fixes from Will Deacon:
 "The important one is a change to the way in which we handle protection keys around signal delivery so that we're more closely aligned with the x86 behaviour, however there is also a revert of the previous fix to disable software tag-based KASAN with GCC, since a workaround materialised shortly afterwards. I'd love to say we're done with 6.12, but we're aware of some longstanding fpsimd register corruption issues that we're almost at the bottom of resolving.
  Summary:
   - Fix handling of POR_EL0 during signal delivery so that pushing the signal context doesn't fail based on the pkey configuration of the interrupted context and align our user-visible behaviour with that of x86.
   - Fix a bogus pointer being passed to the CPU hotplug code from the Arm SDEI driver.
   - Re-enable software tag-based KASAN with GCC by using an alternative implementation of '__no_sanitize_address'"
 * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
   arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
   firmware: arm_sdei: Fix the input parameter of cpuhp_remove_state()
   Revert "kasan: Disable Software Tag-Based KASAN with GCC"
   kasan: Fix Software Tag-Based KASAN with GCC
2024-11-01  arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers  (Marc Zyngier)
Despite KVM now being able to deal with XS-tagged TLBIs, we still don't expose these feature bits to KVM. Plumb in the feature in ID_AA64ISAR1_EL1. Fixes: 0feec7769a63 ("KVM: arm64: nv: Add handling of NXS-flavoured TLBI operations") Signed-off-by: Marc Zyngier <maz@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241031083519.364313-1-maz@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
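Plumbing in a sanitised ID register field usually amounts to one more entry in the corresponding ftr_id_* table in cpufeature.c; a sketch is below (illustrative only: the visibility/strictness flags chosen here are assumptions, not taken from this commit):

    static const struct arm64_ftr_bits ftr_id_aa64isar1[] = {
            /* ... existing fields ... */
            ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE,
                           ID_AA64ISAR1_EL1_XS_SHIFT, 4, 0),
            /* ... */
            ARM64_FTR_END,
    };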
2024-11-01  arm64: Return early when break handler is found on linked-list  (Liao Chang)
The search for breakpoint handlers iterates through the entire linked list. Given that every registered hook has a valid fn field, and no registered hooks share the same mask and imm, this commit optimizes the search slightly by returning early once a matching handler is found. Signed-off-by: Liao Chang <liaochang1@huawei.com> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241024034120.3814224-1-liaochang1@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
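A sketch of the early-return shape in the break hook dispatch (approximate; identifiers follow the existing debug-monitors code, but this is not the exact hunk):

    list_for_each_entry_rcu(hook, list, node) {
            unsigned long comment = esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;

            if ((comment & ~hook->mask) == hook->imm)
                    return hook->fn(regs, esr);     /* stop at the first match */
    }

    return DBG_HOOK_ERROR;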
2024-10-31  arm64: cpufeature: discover CPU support for MPAM  (James Morse)
ARMv8.4 adds support for 'Memory Partitioning And Monitoring' (MPAM) which describes an interface to cache and bandwidth controls wherever they appear in the system. Add support to detect MPAM. Like SVE, MPAM has an extra id register that describes some more properties, including the virtualisation support, which is optional. Detect this separately so we can detect mismatched/insane systems, but still use MPAM on the host even if the virtualisation support is missing. MPAM needs enabling at the highest implemented exception level, otherwise the register accesses trap. The 'enabled' flag is accessible to lower exception levels, but it's in a register that traps when MPAM isn't enabled. The cpufeature 'matches' hook is extended to test this on one of the CPUs, so that firmware can emulate MPAM as disabled if it is reserved for use by secure world. Secondary CPUs that appear late could trip cpufeature's 'lower safe' behaviour after the MPAM properties have been advertised to user-space. Add a verify call to ensure late secondaries match the existing CPUs. (If you have a boot failure that bisects here, it's likely your CPUs advertise MPAM in the id registers, but firmware failed to either enable MPAM or emulate the trap as if it were disabled.) Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Joey Gouly <joey.gouly@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241030160317.2528209-4-joey.gouly@arm.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-29  of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verify  (Usama Arif)
__pa() is only intended to be used for linear map addresses and using it for initial_boot_params which is in fixmap for arm64 will give an incorrect value. Hence save the physical address when it is known at boot time when calling early_init_dt_scan for arm64 and use it at kexec time instead of converting the virtual address using __pa(). Note that arm64 doesn't need the FDT region reserved in the DT as the kernel explicitly reserves the passed in FDT. Therefore, only a debug warning is fixed with this change. Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Fixes: ac10be5cdbfa ("arm64: Use common of_kexec_alloc_and_setup_fdt()") Link: https://lore.kernel.org/r/20241023171426.452688-1-usamaarif642@gmail.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2024-10-29  arm64: signal: Improve POR_EL0 handling to avoid uaccess failures  (Kevin Brodsky)
Reset POR_EL0 to "allow all" before writing the signal frame, preventing spurious uaccess failures. When POE is supported, the POR_EL0 register constrains memory accesses based on the target page's POIndex (pkey). This raises the question: what constraints should apply to a signal handler? The current answer is that POR_EL0 is reset to POR_EL0_INIT when invoking the handler, giving it full access to POIndex 0. This is in line with x86's MPK support and remains unchanged. This is only part of the story, though. POR_EL0 constrains all unprivileged memory accesses, meaning that uaccess routines such as put_user() are also impacted. As a result POR_EL0 may prevent the signal frame from being written to the signal stack (ultimately causing a SIGSEGV). This is especially concerning when an alternate signal stack is used, because userspace may want to prevent access to it outside of signal handlers. There is currently no provision for that: POR_EL0 is reset after writing to the stack, and POR_EL0_INIT only enables access to POIndex 0. This patch ensures that POR_EL0 is reset to its most permissive state before the signal stack is accessed. Once the signal frame has been fully written, POR_EL0 is still set to POR_EL0_INIT - it is up to the signal handler to enable access to additional pkeys if needed. As to sigreturn(), it expects to have access to the stack like any other syscall; we only need to ensure that POR_EL0 is restored from the signal frame after all uaccess calls. This approach is in line with the recent x86/pkeys series [1]. Resetting POR_EL0 early introduces some complications, in that we can no longer read the register directly in preserve_poe_context(). This is addressed by introducing a struct (user_access_state) and helpers to manage any such register impacting user accesses (uaccess and accesses in userspace). Things look like this on signal delivery:

1. Save original POR_EL0 into struct [save_reset_user_access_state()]
2. Set POR_EL0 to "allow all" [save_reset_user_access_state()]
3. Create signal frame
4. Write saved POR_EL0 value to the signal frame [preserve_poe_context()]
5. Finalise signal frame
6. If all operations succeeded:
   a. Set POR_EL0 to POR_EL0_INIT [set_handler_user_access_state()]
   b. Else reset POR_EL0 to its original value [restore_user_access_state()]

If any step fails when setting up the signal frame, the process will be sent a SIGSEGV, which it may be able to handle. Step 6.b ensures that the original POR_EL0 is saved in the signal frame when delivering that SIGSEGV (so that the original value is restored by sigreturn). The return path (sys_rt_sigreturn) doesn't strictly require any change since restore_poe_context() is already called last. However, to avoid uaccess calls being accidentally added after that point, we use the same approach as in the delivery path, i.e. separating uaccess from writing to the register:

1. Read saved POR_EL0 value from the signal frame [restore_poe_context()]
2. Set POR_EL0 to the saved value [restore_user_access_state()]

[1] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@oracle.com/ Fixes: 9160f7e909e1 ("arm64: add POE signal support") Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Link: https://lore.kernel.org/r/20241029144539.111155-2-kevin.brodsky@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-10-28  arm64: Use new fallback IO memcpy/memset  (Julian Vetter)
Use the new fallback memcpy_{from,to}io and memset_io functions from lib/iomem_copy.c on the arm64 processor architecture. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Yann Sionneau <ysionneau@kalrayinc.com> Signed-off-by: Julian Vetter <jvetter@kalrayinc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-10-23  arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV  (Marc Zyngier)
It appears that relatively popular hardware out there implements the CNTPOFF_EL2 variant of FEAT_ECV, advertises it via ID_AA64MMFR0_EL1, but cannot be bothered to set SCR_EL3.ECVEn to 1. You would probably think that "this is fine, EL3 will take the trap on access to CNTPOFF_EL2 and flip the ECVEn bit", as that's what a semi-decent firmware implementation would do. But no. None of that. This particular implementation takes the trap, considers its purpose in life, decides that it has none, and *RESETS* the system. Yes, x1e001de, I'm talking about you. In order to allow this machine to be promoted slightly above the level of a glorified door-stop, add a new "id_aa64mmfr0.ecv" override, allowing the kernel to pretend this option was never there. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241021181434.1052974-1-maz@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23  arm64: Enable memory encrypt for Realms  (Suzuki K Poulose)
Use the memory encryption APIs to trigger an RSI call to request a transition between protected memory and shared memory (or vice versa), and to update the kernel's linear map of modified pages to flip the top bit of the IPA. This requires that block mappings are not used in the direct map for realm guests. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Co-developed-by: Steven Price <steven.price@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20241017131434.40935-10-steven.price@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23  arm64: Enforce bounce buffers for realm DMA  (Steven Price)
Within a realm guest it's not possible for a device emulated by the VMM to access arbitrary guest memory. So force the use of bounce buffers to ensure that the memory the emulated devices are accessing is in memory which is explicitly shared with the host. This adds a call to swiotlb_update_mem_attributes() which calls set_memory_decrypted() to ensure the bounce buffer memory is shared with the host. For non-realm guests or hosts this is a no-op. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20241017131434.40935-8-steven.price@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23  efi: arm64: Map Device with Prot Shared  (Suzuki K Poulose)
Device mappings need to be emulated by the VMM so must be mapped shared with the host. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20241017131434.40935-7-steven.price@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23  arm64: rsi: Map unprotected MMIO as decrypted  (Suzuki K Poulose)
Instead of marking every MMIO as shared, check if the given region is "Protected" and apply the permissions accordingly. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20241017131434.40935-6-steven.price@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23  arm64: rsi: Add support for checking whether an MMIO is protected  (Suzuki K Poulose)
On Arm CCA, with RMM-v1.0, all MMIO regions are shared. However, in the future, an Arm CCA-v1.0 compliant guest may be run in a lesser privileged partition in the Realm World (with the Arm CCA-v1.1 Planes feature). In this case, some of the MMIO regions may be emulated by a higher privileged component in the Realm world, i.e., protected. Thus the guest must decide today whether a given MMIO region is shared vs Protected and create the stage1 mapping accordingly. On Arm CCA, this detection is based on the "IPA State" (RIPAS == RIPAS_IO). Provide a helper to run this check on a given range of MMIO. Also, provide an arm64 helper which may be hooked in by other solutions. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20241017131434.40935-5-steven.price@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23  arm64: realm: Query IPA size from the RMM  (Steven Price)
The top bit of the configured IPA size is used as an attribute to control whether the address is protected or shared. Query the configuration from the RMM to ascertain which bit this is. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20241017131434.40935-4-steven.price@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23  arm64: Detect if in a realm and set RIPAS RAM  (Suzuki K Poulose)
Detect that the VM is a realm guest by the presence of the RSI interface. This is done after PSCI has been initialised so that we can check the SMCCC conduit before making any RSI calls. If in a realm then iterate over all memory ensuring that it is marked as RIPAS RAM. The loader is required to do this for us, however if some memory is missed this will cause the guest to receive a hard to debug external abort at some random point in the future. So for a belt-and-braces approach set all memory to RIPAS RAM. Any failure here implies that the RAM regions passed to Linux are incorrect so panic() promptly to make the situation clear. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Co-developed-by: Steven Price <steven.price@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20241017131434.40935-3-steven.price@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-21  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm fixes from Paolo Bonzini:
 "ARM64:
   - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1)
   - Correctly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support
   - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state
   - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends
   - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next
   - Fix another race in vgic_init()
   - Fix a UBSAN error when faking the cache topology with MTE enabled
  RISCV:
   - RISCV: KVM: use raw_spinlock for critical section in imsic
  x86:
   - A bandaid for lack of XCR0 setup in selftests, which causes trouble if the compiler is configured to have x86-64-v3 (with AVX) as the default ISA. Proper XCR0 setup will come in the next merge window.
   - Fix an issue where KVM would not ignore low bits of the nested CR3 and potentially leak up to 31 bytes out of the guest memory's bounds
   - Fix case in which an out-of-date cached value for the segments could be returned by KVM_GET_SREGS.
   - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL
   - Override MTRR state for KVM confidential guests, making it WB by default as is already the case for Hyper-V guests.
  Generic:
   - Remove a couple of unused functions"
 * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits)
   RISCV: KVM: use raw_spinlock for critical section in imsic
   KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups
   KVM: selftests: x86: Avoid using SSE/AVX instructions
   KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
   KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()
   KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
   KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
   KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
   x86/kvm: Override default caching mode for SEV-SNP and TDX
   KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic
   KVM: Remove unused kvm_vcpu_gfn_to_pfn
   KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration
   KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS
   KVM: arm64: Fix shift-out-of-bounds bug
   KVM: arm64: Shave a few bytes from the EL2 idmap code
   KVM: arm64: Don't eagerly teardown the vgic on init error
   KVM: arm64: Expose S1PIE to guests
   KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
   KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
   KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
   ...
2024-10-17  arm64: Support AT_HWCAP3  (Mark Brown)
We have filled all 64 bits of AT_HWCAP2 so in order to support discovery of further features provide the framework to use the already defined AT_HWCAP3 for further CPU features. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20241004-arm64-elf-hwcap3-v2-2-799d1daad8b0@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
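From userspace, the new word is read like the existing hwcap words, via getauxval(); a minimal example (assuming the toolchain headers define AT_HWCAP3, otherwise the fallback value below is the one reserved in the kernel uapi):

    #include <stdio.h>
    #include <sys/auxv.h>

    #ifndef AT_HWCAP3
    #define AT_HWCAP3 29    /* assumed uapi value if the headers are older */
    #endif

    int main(void)
    {
            /* getauxval() returns 0 if the kernel doesn't provide the entry. */
            printf("AT_HWCAP3 = %#lx\n", getauxval(AT_HWCAP3));
            return 0;
    }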
2024-10-17  arm64: stacktrace: unwind exception boundaries  (Mark Rutland)
When arm64's stack unwinder encounters an exception boundary, it uses the pt_regs::stackframe created by the entry code, which has a copy of the PC and FP at the time the exception was taken. The unwinder doesn't know anything about pt_regs, and reports the PC from the stackframe, but does not report the LR. The LR is only guaranteed to contain the return address at function call boundaries, and can be used as a scratch register at other times, so the LR at an exception boundary may or may not be a legitimate return address. It would be useful to report the LR value regardless, as it can be helpful when debugging, and in future it will be helpful for reliable stacktrace support. This patch changes the way we unwind across exception boundaries, allowing both the PC and LR to be reported. The entry code creates a frame_record_meta structure embedded within pt_regs, which the unwinder uses to find the pt_regs. The unwinder can then extract pt_regs::pc and pt_regs::lr as two separate unwind steps before continuing with a regular walk of frame records. When a PC is unwound from pt_regs::lr, dump_backtrace() will log this with an "L" marker so that it can be identified easily. For example, an unwind across an exception boundary will appear as follows:

| el1h_64_irq+0x6c/0x70
| _raw_spin_unlock_irqrestore+0x10/0x60 (P)
| __aarch64_insn_write+0x6c/0x90 (L)
| aarch64_insn_patch_text_nosync+0x28/0x80

... with a (P) entry for pt_regs::pc, and an (L) entry for pt_regs::lr. Note that the LR may be stale at the point of the exception, for example, shortly after a return:

| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268

... where the LR points a few instructions before the current PC. This plays nicely with all the other unwind metadata tracking. With the ftrace_graph profiler enabled globally, and kretprobes installed on generic_handle_domain_irq() and do_interrupt_handler(), a backtrace triggered by magic-sysrq + L reports:

| Call trace:
| show_stack+0x20/0x40 (CF)
| dump_stack_lvl+0x60/0x80 (F)
| dump_stack+0x18/0x28
| nmi_cpu_backtrace+0xfc/0x140
| nmi_trigger_cpumask_backtrace+0x1c8/0x200
| arch_trigger_cpumask_backtrace+0x20/0x40
| sysrq_handle_showallcpus+0x24/0x38 (F)
| __handle_sysrq+0xa8/0x1b0 (F)
| handle_sysrq+0x38/0x50 (F)
| pl011_int+0x460/0x5a8 (F)
| __handle_irq_event_percpu+0x60/0x220 (F)
| handle_irq_event+0x54/0xc0 (F)
| handle_fasteoi_irq+0xa8/0x1d0 (F)
| generic_handle_domain_irq+0x34/0x58 (F)
| gic_handle_irq+0x54/0x140 (FK)
| call_on_irq_stack+0x24/0x58 (F)
| do_interrupt_handler+0x88/0xa0
| el1_interrupt+0x34/0x68 (FK)
| el1h_64_irq_handler+0x18/0x28
| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268
| cpu_startup_entry+0x3c/0x50 (F)
| rest_init+0xe4/0xf0
| start_kernel+0x744/0x750
| __primary_switched+0x88/0x98

Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241017092538.1859841-11-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  arm64: stacktrace: report recovered PCs  (Mark Rutland)
When analysing a stacktrace it can be useful to know whether an unwound PC has been rewritten by fgraph or kretprobes, as in some situations these may be suspect or be known to be unreliable. This patch adds flags to track when an unwind entry has recovered the PC from fgraph and/or kretprobes, and updates dump_backtrace() to log when this is the case. The flags recorded are:

"F" - the PC was recovered from fgraph
"K" - the PC was recovered from kretprobes

These flags are recorded and logged in addition to the original source of the unwound PC. For example, with the ftrace_graph profiler enabled globally, and kretprobes installed on generic_handle_domain_irq() and do_interrupt_handler(), a backtrace triggered by magic-sysrq + L reports:

| Call trace:
| show_stack+0x20/0x40 (CF)
| dump_stack_lvl+0x60/0x80 (F)
| dump_stack+0x18/0x28
| nmi_cpu_backtrace+0xfc/0x140
| nmi_trigger_cpumask_backtrace+0x1c8/0x200
| arch_trigger_cpumask_backtrace+0x20/0x40
| sysrq_handle_showallcpus+0x24/0x38 (F)
| __handle_sysrq+0xa8/0x1b0 (F)
| handle_sysrq+0x38/0x50 (F)
| pl011_int+0x460/0x5a8 (F)
| __handle_irq_event_percpu+0x60/0x220 (F)
| handle_irq_event+0x54/0xc0 (F)
| handle_fasteoi_irq+0xa8/0x1d0 (F)
| generic_handle_domain_irq+0x34/0x58 (F)
| gic_handle_irq+0x54/0x140 (FK)
| call_on_irq_stack+0x24/0x58 (F)
| do_interrupt_handler+0x88/0xa0
| el1_interrupt+0x34/0x68 (FK)
| el1h_64_irq_handler+0x18/0x28
| el1h_64_irq+0x64/0x68
| default_idle_call+0x34/0x180
| do_idle+0x204/0x268
| cpu_startup_entry+0x40/0x50 (F)
| rest_init+0xe4/0xf0
| start_kernel+0x744/0x750
| __primary_switched+0x80/0x90

Note that as these flags are reported next to the recovered PC value, they appear on the callers of instrumented functions. For example gic_handle_irq() has a "K" marker because generic_handle_domain_irq() was instrumented with kretprobes and had its return address rewritten. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241017092538.1859841-9-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  arm64: stacktrace: report source of unwind data  (Mark Rutland)
When analysing a stacktrace it can be useful to know where an unwound PC came from, as in some situations certain sources may be suspect or known to be unreliable. In future it would also be useful to track this so that certain unwind steps can be performed in a stateful manner. For example when unwinding across an exception boundary, we'd ideally unwind pt_regs::pc, then pt_regs::lr, then the next frame record. This patch adds an enumerated set of unwind sources, tracks this during the unwind, and updates dump_backtrace() to log these for interesting unwind steps. The interesting sources recorded are:

"C" - the PC came from the caller of an unwind function.
"T" - the PC came from thread_saved_pc() for a blocked task.
"P" - the PC came from a pt_regs::pc.
"U" - the PC came from an unknown source (indicates an unwinder error).

... with nothing recorded when the PC came from a frame_record::pc as this is the vastly common case and logging this would make it difficult to spot the more interesting cases. For example, when triggering a backtrace via magic-sysrq + L, the CPU handling the sysrq will have a backtrace whose first element is the caller (C) of dump_backtrace():

| Call trace:
| show_stack+0x18/0x30 (C)
| dump_stack_lvl+0x60/0x80
| dump_stack+0x18/0x24
| nmi_cpu_backtrace+0xfc/0x140
| ...

... and other CPUs will have a backtrace whose first element is their pt_regs::pc (P) at the instant the backtrace IPI was taken:

| Call trace:
| _raw_spin_unlock_irqrestore+0x8/0x50 (P)
| wake_up_process+0x18/0x24
| process_timeout+0x14/0x20
| call_timer_fn.isra.0+0x24/0x80
| ...

Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241017092538.1859841-8-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk()  (Mark Rutland)
Currently dump_backtrace() can only print the PC value at each step of the unwind, as this is all the information that arch_stack_walk() passes to the dump_backtrace_entry() callback. In future we'd like to print some additional information, such as the origin of entries (e.g. PC, LR, FP) and/or the reliability thereof. In preparation for doing so, this patch moves dump_backtrace() over to kunwind_stack_walk(), which passes the full kunwind_state to the callback. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241017092538.1859841-7-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  arm64: use a common struct frame_record  (Mark Rutland)
Currently the signal handling code has its own struct frame_record, the definition of struct pt_regs open-codes a frame record as an array, and the kernel unwinder hard-codes frame record offsets. Move to a common struct frame_record that can be used throughout the kernel. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241017092538.1859841-6-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
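For reference, an AAPCS64 frame record is just a pair of 64-bit values, so the shared structure amounts to something like the sketch below (layout per the AAPCS; the field names here are assumptions):

    struct frame_record {
            u64 fp;         /* pointer to the previous frame record */
            u64 lr;         /* return address for this frame */
    };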
2024-10-17  arm64: pt_regs: swap 'unused' and 'pmr' fields  (Mark Rutland)
In subsequent patches we'll want to add an additional u64 to struct pt_regs. To make space, this patch swaps the 'unused' and 'pmr' fields, as the 'pmr' value only requires bits[7:0] and can safely fit into a u32, which frees up a 64-bit unused field. The 'lockdep_hardirqs' and 'exit_rcu' fields should eventually be moved out of pt_regs and managed locally within entry-common.c, so I've left those as-is for the moment. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241017092538.1859841-5-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  arm64: pt_regs: rename "pmr_save" -> "pmr"  (Mark Rutland)
The pt_regs::pmr_save field is weirdly named relative to all other pt_regs fields, with a '_save' suffix that doesn't make anything clearer and only leads to more typing to access the field. Remove the '_save' suffix. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241017092538.1859841-4-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  arm64: mops: Handle MOPS exceptions from EL1  (Kristina Martsenko)
We will soon be using MOPS instructions in the kernel, so wire up the exception handler to handle exceptions from EL1 caused by the copy/set operation being stopped and resumed on a different type of CPU. Add a helper for advancing the single step state machine, similarly to what the EL0 exception handler does. Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com> Link: https://lore.kernel.org/r/20240930161051.3777828-3-kristina.martsenko@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  arm64: probes: Disable kprobes/uprobes on MOPS instructions  (Kristina Martsenko)
FEAT_MOPS instructions require that all three instructions (prologue, main and epilogue) appear consecutively in memory. Placing a kprobe/uprobe on one of them doesn't work as only a single instruction gets executed out-of-line or simulated. So don't allow placing a probe on a MOPS instruction. Fixes: b7564127ffcb ("arm64: mops: detect and enable FEAT_MOPS") Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com> Link: https://lore.kernel.org/r/20240930161051.3777828-2-kristina.martsenko@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17  KVM: arm64: Shave a few bytes from the EL2 idmap code  (Marc Zyngier)
Our idmap is becoming too big, to the point where it doesn't fit in a 4kB page anymore. There are some low-hanging fruits though, such as the el2_init_state horror that is expanded 3 times in the kernel. Let's at least limit ourselves to two copies, which makes the kernel link again. At some point, we'll have to have a better way of doing this. Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241009204903.GA3353168@thelio-3990X
2024-10-16  hugetlb: arm64: add mte support  (Yang Shi)
Enable MTE support for hugetlb. The MTE page flags will be set on the folio only. When copying hugetlb folio (for example, CoW), the tags for all subpages will be copied when copying the first subpage. When freeing hugetlb folio, the MTE flags will be cleared. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Link: https://lore.kernel.org/r/20241001225220.271178-1-yang@os.amperecomputing.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15  arm64: insn: Simulate nop instruction for better uprobe performance  (Liao Chang)
v2->v1:
1. Remove the simulation of STP and the related bits.
2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI scenario.

As Andrii pointed out, the uprobe/uretprobe selftest bench runs into a counterintuitive result: the nop and push variants are much slower than the ret variant [0]. The root cause lies in the arch_probe_analyse_insn(), which excludes 'nop' and 'stp' from the emulatable instructions list. This forces the kernel to return to userspace and execute them out-of-line, then trap back into the kernel for running the uprobe callback functions. This leads to a significant performance overhead compared to the 'ret' variant, which is already emulated. Typically a uprobe is installed on 'nop' for USDT and on a function entry, which starts with the instruction 'stp x29, x30, [sp, #imm]!' to push lr and fp onto the stack, regardless of kernel or userspace binary. In order to improve the performance of handling uprobes for common use cases, this patch supports the emulation of the Arm64 equivalent instructions of 'nop' and 'push'. The benchmark results below indicate that the performance gain of emulation is obvious. On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores @ 2.4GHz.

xol (1 cpus)
------------
uprobe-nop:      0.916 ± 0.001M/s (0.916M/prod)
uprobe-push:     0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:      1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:   0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push:  0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:   0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:      1.862 ± 0.002M/s (1.862M/prod)
uprobe-push:     1.743 ± 0.006M/s (1.743M/prod)
uprobe-ret:      1.840 ± 0.001M/s (1.840M/prod)
uretprobe-nop:   0.964 ± 0.004M/s (0.964M/prod)
uretprobe-push:  0.936 ± 0.004M/s (0.936M/prod)
uretprobe-ret:   0.940 ± 0.001M/s (0.940M/prod)

As shown above, the performance gap between the 'nop/push' and 'ret' variants has been significantly reduced. Because the emulation of the 'push' instruction needs to access userspace memory, it spends more cycles than the others. As Mark suggested [1], it is painful to emulate the correct atomicity and ordering properties of STP, especially when it interacts with MTE, POE, etc. So this patch just focuses on the simulation of 'nop'. The simulation of STP and related changes will be addressed in a separate patch.

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/

CC: Andrii Nakryiko <andrii.nakryiko@gmail.com> CC: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Liao Chang <liaochang1@huawei.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20240909071114.1150053-1-liaochang1@huawei.com [catalin.marinas@arm.com: small tweaks following MarkR's comments] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
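Simulating a hit on a 'nop' reduces to stepping over the probed instruction, since architecturally a nop only advances the PC; a sketch (approximate, using the existing arm64_skip_faulting_instruction() helper mentioned above):

    void simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
    {
            /* A nop has no side effects: just step over the 4-byte insn. */
            arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
    }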
2024-10-15  arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET  (Mark Rutland)
The PREEMPT_DISABLE_OFFSET definition was added in commit: 24534b3511828c66 ("arm64: assembler: add macros to conditionally yield the NEON under PREEMPT") ... but hasn't been used since commit: 3931261ecf46151a ("arm64: fpsimd: Bring cond_yield asm macro in line with new rules") Remove PREEMPT_DISABLE_OFFSET. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241007123921.549340-8-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15  arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE  (Mark Rutland)
The DMA_TO_DEVICE and DMA_FROM_DEVICE definitions in asm-offsets duplicate the common definitions from <linux/dma-direction.h> (which used to live in <linux/dma-mapping.h>), and haven't been used from assembly code since commit: 7eacf1858bc86fe9 ("arm64: mm: Remove assembly DMA cache maintenance wrappers") Remove them both. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241007123921.549340-7-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15  arm64: asm-offsets: remove VM_EXEC and PAGE_SZ  (Mark Rutland)
The VM_EXEC definition duplicates the common VM_EXEC definition from <linux/mm.h>. The common definition cannot safely be included by assembly code but currently we don't need to use VM_EXEC in assembly. The PAGE_SZ definition duplicates arm64's definition of PAGE_SIZE from <asm/page-def.h> which can safely be included from assembly code and should be used directly. Remove the duplicate definitions. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241007123921.549340-6-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>