path: root/arch/arm64/kernel
2025-05-08  arm64/fpsimd: Consistently preserve FPSIMD state during clone()  (Mark Rutland)
In arch_dup_task_struct() we try to ensure that the child task inherits the FPSIMD state of its parent, but this depends on the parent task's saved state being in FPSIMD format, which is not always the case. Consequently the child task may inherit stale FPSIMD state in some cases. This can happen when the parent's state has been modified by ptrace since syscall entry, as writes to the NT_ARM_SVE regset may save state in SVE format. This has been possible since commit: bc0ee4760364 ("arm64/sve: Core task context handling") More recently it has been possible for a task's FPSIMD/SVE state to be saved before lazy discarding was guaranteed to occur, in which case preemption could cause the effective FPSIMD state to be saved in SVE format non-deterministically. This has been possible since commit: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") Fix this by saving the parent task's effective FPSIMD state into FPSIMD format before copying the task_struct. As this requires modifying the parent's fpsimd_state, we must save+flush the state to avoid racing with concurrent manipulation. Similar issues exist when the parent has streaming mode state, and will be addressed by subsequent patches. Fixes: bc0ee4760364 ("arm64/sve: Core task context handling") Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-12-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
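As an editorial aside, a minimal sketch of the shape of this fix, assuming the helper names used elsewhere in this series (the exact call sequence in the merged patch may differ):

    int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
    {
            /*
             * Save+flush the parent's live state so the saved copy is
             * stable, then ensure the saved state is in FPSIMD format
             * before the task_struct is copied. (Sketch only.)
             */
            fpsimd_save_and_flush_current_state();
            fpsimd_sync_from_effective_state(src);

            *dst = *src;
            /* ... existing dup logic ... */
            return 0;
    }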
2025-05-08  arm64/fpsimd: Remove redundant task->mm check  (Mark Rutland)
For historical reasons, arch_dup_task_struct() only calls fpsimd_preserve_current_state() when current->mm is non-NULL, but this is no longer necessary. Historically TIF_FOREIGN_FPSTATE was only managed for user threads, and was never set for kernel threads. At the time, various functions attempted to avoid saving/restoring state for kernel threads by checking task_struct::mm to try to distinguish user threads from kernel threads. We added the current->mm check to arch_dup_task_struct() in commit: 6eb6c80187c5 ("arm64: kernel thread don't need to save fpsimd context.") ... where the intent was to avoid pointlessly saving state for kernel threads, which never had live state (and the saved state would never be consumed). Subsequently we began setting TIF_FOREIGN_FPSTATE for kernel threads, and removed most of the task_struct::mm checks in commit: df3fb9682045 ("arm64: fpsimd: Eliminate task->mm checks") ... but we missed the check in arch_dup_task_struct(), which is similarly redundant. Remove the redundant check from arch_dup_task_struct(). Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-11-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
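Illustratively, the change amounts to dropping the condition (before/after sketch):

    /* Before: the mm check dates from when kernel threads had no FP state */
    if (current->mm)
            fpsimd_preserve_current_state();

    /* After: TIF_FOREIGN_FPSTATE is managed for kernel threads too, so
     * the call can be made unconditionally. */
    fpsimd_preserve_current_state();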
2025-05-08  arm64/fpsimd: signal: Use SMSTOP behaviour in setup_return()  (Mark Rutland)
Historically the behaviour of setup_return() was nondeterministic, depending on whether the task's FPSIMD/SVE/SME state happened to be live. We fixed most of that in commit: 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early") ... but we didn't decide on how clearing PSTATE.SM should behave, and left a TODO comment to that effect. Use the new task_smstop_sm() helper to make this behave as if an SMSTOP instruction was used to exit streaming mode. This would have been the most common behaviour prior to the commit above. Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-10-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
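A minimal sketch of the resulting setup_return() logic, assuming the guard shown here (thread_sm_enabled() is the existing PSTATE.SM accessor):

    /* Exit streaming mode as if userspace had executed SMSTOP SM */
    if (system_supports_sme() && thread_sm_enabled(&current->thread))
            task_smstop_sm(current);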
2025-05-08  arm64/fpsimd: Add task_smstop_sm()  (Mark Rutland)
In a few places we want to transition a task from streaming mode to non-streaming mode, e.g. signal delivery where we historically tried to use an SMSTOP SM instruction. Add a new helper to manipulate a task's state in the same way as an SMSTOP SM instruction. I have not added a corresponding helper to simulate the effects of SMSTART SM. Only ptrace transitions a task into streaming mode, and ptrace has distinct semantics for such transitions. Per ARM DDI 0487 L.a, section B1.4.6: | RRSWFQ | When the Effective value of PSTATE.SM is changed by any method from 0 | to 1, an entry to Streaming SVE mode is performed, and all implemented | bits of Streaming SVE register state are set to zero. | RKFRQZ | When the Effective value of PSTATE.SM is changed by any method from 1 | to 0, an exit from Streaming SVE mode is performed, and in the | newly-entered mode, all implemented bits of the SVE scalable vector | registers, SVE predicate registers, and FFR, are set to zero. Per ARM DDI 0487 L.a, section C5.2.9: | On entry to or exit from Streaming SVE mode, FPMR is set to 0 Per ARM DDI 0487 L.a, section C5.2.10: | On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC, | IXC, IDC, QC} are set to 1 and the remaining bits are set to 0. This means bits 0, 1, 2, 3, 4, 7, and 27 respectively, i.e. 0x0800009f Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-9-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
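As a quick check of the quoted FPSR value, the named bits assemble as follows (FPSR_SMSTOP_VAL is a hypothetical macro name; BIT() is the kernel's bit helper):

    /* FPSR.{IOC,DZC,OFC,UFC,IXC,IDC,QC} = bits 0-4, 7 and 27 */
    #define FPSR_SMSTOP_VAL \
            (BIT(0) | BIT(1) | BIT(2) | BIT(3) | BIT(4) | BIT(7) | BIT(27))
    /* == 0x0800009f, matching the value quoted above */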
2025-05-08  arm64/fpsimd: Factor out {sve,sme}_state_size() helpers  (Mark Rutland)
In subsequent patches we'll need to determine the SVE/SME state size for a given SVE VL and SME VL regardless of whether a task is currently configured with those VLs. Split the sizing logic out of sve_state_size() and sme_state_size() so that we don't need to open-code this logic elsewhere. At the same time, apply minor cleanups: * Move sve_state_size() into fpsimd.h, matching the placement of sme_state_size(). * Remove the feature checks from sve_state_size(). We only call sve_state_size() when at least one of SVE and SME are supported, and when either of the two is not supported, the task's corresponding SVE/SME vector length will be zero. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-8-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
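A sketch of the resulting split, assuming these signatures and that the size is computed from the larger of the two vector lengths:

    static inline unsigned int __sve_state_size(unsigned int sve_vl,
                                                unsigned int sme_vl)
    {
            unsigned int vl = max(sve_vl, sme_vl);

            return SVE_SIG_REGS_SIZE(sve_vq_from_vl(vl));
    }

    static inline unsigned int sve_state_size(struct task_struct const *task)
    {
            return __sve_state_size(task_get_sve_vl(task),
                                    task_get_sme_vl(task));
    }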
2025-05-08  arm64/fpsimd: Clarify sve_sync_*() functions  (Mark Rutland)
The sve_sync_{to,from}_fpsimd*() functions are intended to extract/insert the currently effective FPSIMD state of a task regardless of whether the task's state is saved in FPSIMD format or SVE format. Historically they were only used by ptrace, but sve_sync_to_fpsimd() is now used more widely, and sve_sync_from_fpsimd_zeropad() may be used more widely in future. When FPSIMD/SVE state tracking was changed across commits: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") ... sve_sync_to_fpsimd() was updated to consider task->thread.fp_type rather than the task's TIF_SVE and PSTATE.SM, but (apparently due to an oversight) sve_sync_from_fpsimd_zeropad() was left as-is, leaving the two inconsistent. Due to this, sve_sync_from_fpsimd_zeropad() may copy state from task->thread.uw.fpsimd_state into task->thread.sve_state when task->thread.fp_type == FP_STATE_FPSIMD. This is redundant (but benign) as task->thread.uw.fpsimd_state is the effective state that will be restored, and task->thread.sve_state will not be consumed. For consistency, and to avoid the redundant work, it is better for sve_sync_from_fpsimd_zeropad() to consider task->thread.fp_type alone, matching sve_sync_to_fpsimd(). The naming of both functions is somewhat unfortunate, as it is unclear when and why they copy state. It would be better to describe them in terms of the effective state. Considering all of the above, clean this up: * Adjust sve_sync_from_fpsimd_zeropad() to consider task->thread.fp_type. * Update comments to clarify the intended semantics/usage. I've removed the description that task->thread.sve_state must have been allocated, as this is only necessary when task->thread.fp_type == FP_STATE_SVE, which itself implies that task->thread.sve_state must have been allocated. * Rename the functions to more clearly indicate when/why they copy state: - sve_sync_to_fpsimd() => fpsimd_sync_from_effective_state() - sve_sync_from_fpsimd_zeropad() => fpsimd_sync_to_effective_state_zeropad() Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-7-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08  arm64/fpsimd: ptrace: Consistently handle partial writes to NT_ARM_(S)SVE  (Mark Rutland)
Partial writes to the NT_ARM_SVE and NT_ARM_SSVE regsets using an SVE payload are handled inconsistently and non-deterministically. A comment within sve_set_common() indicates that we intended that a partial write would preserve any effective FPSIMD/SVE state which was not overwritten, but this has never worked consistently, and during syscalls the FPSIMD vector state may be non-deterministically preserved and may be erroneously migrated between streaming and non-streaming SVE modes. The simplest fix is to handle a partial write by consistently zeroing the remaining state. As detailed below I do not believe this will adversely affect any real usage. Neither GDB nor LLDB attempt partial writes to these regsets, and the documentation (in Documentation/arch/arm64/sve.rst) has always indicated that state preservation was not guaranteed, as it says: | The effect of writing a partial, incomplete payload is unspecified. When the logic was originally introduced in commit: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support") ... there were two potential behaviours, depending on TIF_SVE: * When TIF_SVE was clear, all SVE state would be zeroed, excluding the low 128 bits of vectors shared with FPSIMD, FPSR, and FPCR. * When TIF_SVE was set, all SVE state would be zeroed, including the low 128 bits of vectors shared with FPSIMD, but excluding FPSR and FPCR. Note that as writing to NT_ARM_SVE would set TIF_SVE, partial writes to NT_ARM_SVE would not be idempotent, and if a first write preserved the low 128 bits, a subsequent (potentially identical) partial write would discard the low 128 bits. When support for the NT_ARM_SSVE regset was added in commit: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") ... the above behaviour was retained for writes to the NT_ARM_SVE regset, though writes to the NT_ARM_SSVE regset would always zero the SVE registers and would not inherit FPSIMD register state. This happened as fpsimd_sync_to_sve() only copied the FPSIMD regs when TIF_SVE was clear and PSTATE.SM==0. Subsequently, when FPSIMD/SVE state tracking was changed across commits: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") ... there was no corresponding update to the ptrace code, nor to fpsimd_sync_to_sve(), which still considers TIF_SVE and PSTATE.SM rather than the saved fp_type. The saved state can be in the FPSIMD format regardless of whether TIF_SVE is set or clear, and the saved type can change non-deterministically during syscalls. Consequently a subsequent partial write to the NT_ARM_SVE or NT_ARM_SSVE regsets may non-deterministically preserve the FPSIMD state, and may migrate this state between streaming and non-streaming modes. Clean this up by never attempting to preserve ANY state when writing an SVE payload to the NT_ARM_SVE/NT_ARM_SSVE regsets, zeroing all relevant state including FPSR and FPCR. This simplifies the code, makes the behaviour deterministic, and avoids migrating state between streaming and non-streaming modes. As above, I do not believe this should adversely affect existing userspace applications. At the same time, remove fpsimd_sync_to_sve(). It is no longer used, doesn't do what its documentation implies, and gets in the way of other cleanups and fixes.
Fixes: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-6-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08  arm64: bpf: Add BHB mitigation to the epilogue for cBPF programs  (James Morse)
A malicious BPF program may manipulate the branch history to influence what the hardware speculates will happen next. On exit from a BPF program, emit the BHB mitigation sequence. This is only applied for 'classic' cBPF programs that are loaded by seccomp. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2025-05-08  arm64: proton-pack: Expose the branchy loop k value  (James Morse)
Add a helper to expose the k value of the branchy loop. This is needed by the BPF JIT to generate the mitigation sequence in BPF programs. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
2025-05-08  arm64: proton-pack: Expose whether the platform is mitigated by firmware  (James Morse)
is_spectre_bhb_fw_affected() allows the caller to determine if the CPU is known to need a firmware mitigation. CPUs are either on the list of CPUs we know about, or firmware has been queried and reported that the platform is affected - and mitigated by firmware. This helper is not useful to determine if the platform is mitigated by firmware. A CPU could be on the known list, but the firmware mitigation may not be implemented. It is affected but not mitigated. spectre_bhb_enable_mitigation() handles this distinction by checking the firmware state before enabling the mitigation. Add a helper to expose this state. This will be used by the BPF JIT to determine if calling firmware for a mitigation is necessary and supported. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
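Conceptually the helper is an accessor over state recorded when the mitigation is enabled; a sketch under assumed names:

    static bool spectre_bhb_fw_mitigated;

    /* Set from spectre_bhb_enable_mitigation() once the firmware
     * mitigation is actually enabled. */
    bool is_spectre_bhb_fw_mitigated(void)
    {
            return spectre_bhb_fw_mitigated;
    }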
2025-05-08  arm64/fpsimd: signal: Consistently read FPSIMD context  (Mark Rutland)
For historical reasons, restore_sve_fpsimd_context() has an open-coded copy of the logic from read_fpsimd_context(), which is used to either restore an FPSIMD-only context, or to merge FPSIMD state into an SVE state when restoring an SVE+FPSIMD context. The logic is *almost* identical. Refactor the logic to avoid duplication and make this clearer. This comes with two functional changes that I do not believe will be problematic in practice: * The user_fpsimd_state::size field will be checked in all restore paths that consume the user_fpsimd_state. The kernel always populates this field when delivering a signal, and so this should contain the expected value unless it has been corrupted. * If a read of user_fpsimd_state fails, we will return early without modifying TIF_SVE, the saved SVCR, or the saved fp_type. This will leave the task in a consistent state, without potentially resurrecting stale FPSIMD state. A read of user_fpsimd_state should never fail unless the structure has been corrupted or the stack has been unmapped. Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-5-mark.rutland@arm.com [will: Ensure read_fpsimd_context() returns negative error code or zero] Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08  arm64/fpsimd: signal: Mandate SVE payload for streaming-mode state  (Mark Rutland)
Non-streaming SVE state may be preserved without an SVE payload, in which case the SVE context only has a header with VL==0, and all state can be restored from the FPSIMD context. Streaming SVE state is always preserved with an SVE payload, where the SVE context header has VL!=0, and the SVE_SIG_FLAG_SM flag is set. The kernel never preserves an SVE context where SVE_SIG_FLAG_SM is set without an SVE payload. However, restore_sve_fpsimd_context() doesn't forbid restoring such a context, and will handle this case by clearing PSTATE.SM and restoring the FPSIMD context into non-streaming mode, which isn't consistent with the SVE_SIG_FLAG_SM flag. Forbid this case, and mandate an SVE payload when the SVE_SIG_FLAG_SM flag is set. This avoids an awkward ABI quirk and reduces the risk that later rework to this code permits configuring a task with PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. I've marked this as a fix given that we never intended to support this case, and we don't want anyone to start relying upon the old behaviour once we re-enable SME. Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-4-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08  arm64/fpsimd: signal: Clear PSTATE.SM when restoring FPSIMD frame only  (Mark Rutland)
On systems with SVE and/or SME, the kernel will always create SVE and FPSIMD signal frames when delivering a signal, but a user can manipulate signal frames such that a signal return only observes an FPSIMD signal frame. When this happens, restore_fpsimd_context() will restore state such that fp_type==FP_STATE_FPSIMD, but will leave PSTATE.SM as-is. It is possible for a user to set PSTATE.SM between syscall entry and execution of the sigreturn logic (e.g. via ptrace), and consequently the sigreturn may result in the task having PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. For various reasons it is not legitimate for a task to be in a state where PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. Portions of the user ABI are written with the requirement that streaming SVE state is always presented in SVE format rather than FPSIMD format, and as there is no mechanism to permit access to only the FPSIMD subset of streaming SVE state, streaming SVE state must always be saved and restored in SVE format. Fix restore_fpsimd_context() to clear PSTATE.SM when restoring an FPSIMD signal frame without an SVE signal frame. This matches the current behaviour when an SVE signal frame is present, but the SVE signal frame has no register payload (e.g. as is the case on SME-only systems which lack SVE). This change should have no effect for applications which do not alter signal frames (i.e. almost all applications). I do not expect non-{malicious,buggy} applications to hide the SVE signal frame, but I've chosen to clear PSTATE.SM rather than mandating the presence of an SVE signal frame in case there is some legacy (non-SME) usage that I am not currently aware of. For context, the SME handling was originally introduced in commit: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") ... and subsequently updated/fixed to handle SME-only systems in commits: 7dde62f0687c ("arm64/signal: Always accept SVE signal frames on SME only systems") f26cd7372160 ("arm64/signal: Always allocate SVE signal frames on SME only systems") Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-3-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
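In code terms, the restore path for a bare FPSIMD frame now also drops PSTATE.SM; a sketch using the existing SVCR_SM_MASK definition:

    /* Restoring FPSIMD-only state implies exiting streaming mode */
    current->thread.svcr &= ~SVCR_SM_MASK;
    current->thread.fp_type = FP_STATE_FPSIMD;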
2025-05-08  arm64/fpsimd: Do not discard modified SVE state  (Mark Rutland)
Historically SVE state was discarded deterministically early in the syscall entry path, before ptrace is notified of syscall entry. This permitted ptrace to modify SVE state before and after the "real" syscall logic was executed, with the modified state being retained. This behaviour was changed by commit: 8c845e2731041f0f ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") That commit was intended to speed up workloads that used SVE by opportunistically leaving SVE enabled when returning from a syscall. The syscall entry logic was modified to truncate the SVE state without disabling userspace access to SVE, and fpsimd_save_user_state() was modified to discard userspace SVE state whenever in_syscall(current_pt_regs()) is true, i.e. when current_pt_regs()->syscallno != NO_SYSCALL. Leaving SVE enabled opportunistically resulted in a couple of changes to userspace visible behaviour which weren't described at the time, but are logical consequences of opportunistically leaving SVE enabled: * Signal handlers can observe the type of saved state in the signal's sve_context record. When the kernel only tracks FPSIMD state, the 'vq' field is 0 and there is no space allocated for register contents. When the kernel tracks SVE state, the 'vq' field is non-zero and the register contents are saved into the record. As a result of the above commit, 'vq' (and the presence of SVE register state) is non-deterministically zero or non-zero for a period of time after a syscall. The effective register state is still deterministic. Hopefully no-one relies on this being deterministic. In general, handlers for asynchronous events cannot expect a deterministic state. * Similarly to signal handlers, ptrace requests can observe the type of saved state in the NT_ARM_SVE and NT_ARM_SSVE regsets, as this is exposed in the header flags. As a result of the above commit, this is now in a non-deterministic state after a syscall. The effective register state is still deterministic. Hopefully no-one relies on this being deterministic. In general, debuggers would have to handle this changing at arbitrary points during program flow. Discarding the SVE state within fpsimd_save_user_state() resulted in other changes to userspace visible behaviour which are not desirable: * A ptrace tracer can modify (or create) a tracee's SVE state at syscall entry or syscall exit. As a result of the above commit, the tracee's SVE state can be discarded non-deterministically after modification, rather than being retained as it previously was. Note that for co-operative tracer/tracee pairs, the tracer may (re)initialise the tracee's state arbitrarily after the tracee sends itself an initial SIGSTOP via a syscall, so this affects realistic design patterns. * The current_pt_regs()->syscallno field can be modified via ptrace, and can be altered even when the tracee is not really in a syscall, causing non-deterministic discarding to occur in situations where this was not previously possible. Further, using current_pt_regs()->syscallno in this way is unsound: * There are data races between readers and writers of the current_pt_regs()->syscallno field. The current_pt_regs()->syscallno field is written in interruptible task context using plain C accesses, and is read in irq/softirq context using plain C accesses. These accesses are subject to data races, with the usual concerns with tearing, etc. * Writes to current_pt_regs()->syscallno are subject to compiler reordering. 
As current_pt_regs()->syscallno is written with plain C accesses, the compiler is free to move those writes arbitrarily relative to anything which doesn't access the same memory location. In theory this could break signal return, where prior to restoring the SVE state, restore_sigframe() calls forget_syscall(). If the write were hoisted after restore of some SVE state, that state could be discarded unexpectedly. In practice that reordering cannot happen in the absence of LTO (as cross compilation-unit function calls prevent this reordering), and that reordering appears to be unlikely in the presence of LTO. Additionally, since commit: f130ac0ae4412dbe ("arm64: syscall: unmask DAIF earlier for SVCs") ... DAIF is unmasked before el0_svc_common() sets regs->syscallno to the real syscall number. Consequently state may be saved in SVE format prior to this point. Considering all of the above, current_pt_regs()->syscallno should not be used to infer whether the SVE state can be discarded. Luckily we can instead use cpu_fp_state::to_save to track when it is safe to discard the SVE state: * At syscall entry, after the live SVE register state is truncated, set cpu_fp_state::to_save to FP_STATE_FPSIMD to indicate that only the FPSIMD portion is live and needs to be saved. * At syscall exit, once the task's state is guaranteed to be live, set cpu_fp_state::to_save to FP_STATE_CURRENT to indicate that TIF_SVE must be considered to determine which state needs to be saved. * Whenever state is modified, it must be saved+flushed prior to manipulation. The state will be truncated if necessary when it is saved, and reloading the state will set cpu_fp_state::to_save to FP_STATE_CURRENT, preventing subsequent discarding. This permits SVE state to be discarded *only* when it is known to have been truncated (and the non-FPSIMD portions must be zero), and ensures that SVE state is retained after it is explicitly modified. For backporting, note that this fix depends on the following commits: * b2482807fbd4 ("arm64/sme: Optimise SME exit on syscall entry") * f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") * 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early") Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-2-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-07  ubsan: Remove regs from report_ubsan_failure()  (Mostafa Saleh)
report_ubsan_failure() doesn't use its regs argument, and soon it will be called from the hypervisor context where regs are not available. So, remove the unused argument. Signed-off-by: Mostafa Saleh <smostafa@google.com> Acked-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250430162713.1997569-3-smostafa@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
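The interface change is simply the prototype (before/after):

    /* Before */
    void report_ubsan_failure(struct pt_regs *regs, u32 check_type);

    /* After: regs was never used, and is unavailable in hyp context */
    void report_ubsan_failure(u32 check_type);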
2025-05-07  arm64: Introduce esr_is_ubsan_brk()  (Mostafa Saleh)
Soon, KVM is going to use this logic for hypervisor panics, so add it in a wrapper that can be used by the hypervisor exit handler to decode hyp panics. Signed-off-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250430162713.1997569-2-smostafa@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06  arm64: Add FEAT_FGT2 capability  (Marc Zyngier)
As we will eventually have to context-switch the FEAT_FGT2 registers in KVM (something that has been completely ignored so far), add a new cap that we will be able to check for. Signed-off-by: Marc Zyngier <maz@kernel.org>
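A sketch of what the corresponding cpufeature table entry might look like; the capability name and field values here are assumptions (FGT2 support is advertised via ID_AA64MMFR0_EL1.FGT):

    {
            .desc = "Fine Grained Traps 2",
            .type = ARM64_CPUCAP_SYSTEM_FEATURE,
            .capability = ARM64_HAS_FGT2,       /* assumed name */
            .matches = has_cpuid_feature,
            ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, FGT, FGT2)
    },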
2025-05-06  arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation  (Yeoreum Yun)
The PTE_MAYBE_NG macro sets the nG page table bit according to the value of "arm64_use_ng_mappings". This variable is currently placed in the .bss section. create_init_idmap() is called before the .bss section initialisation which is done in early_map_kernel(). Therefore, data/test_prot in create_init_idmap() could be set incorrectly through the PAGE_KERNEL -> PROT_DEFAULT -> PTE_MAYBE_NG macros.

    # llvm-objdump-21 --syms vmlinux-gcc | grep arm64_use_ng_mappings
    ffff800082f242a8 g O .bss 0000000000000001 arm64_use_ng_mappings

The create_init_idmap() function disassembly compiled with llvm-21:

    // create_init_idmap()
    ffff80008255c058: d10103ff sub sp, sp, #0x40
    ffff80008255c05c: a9017bfd stp x29, x30, [sp, #0x10]
    ffff80008255c060: a90257f6 stp x22, x21, [sp, #0x20]
    ffff80008255c064: a9034ff4 stp x20, x19, [sp, #0x30]
    ffff80008255c068: 910043fd add x29, sp, #0x10
    ffff80008255c06c: 90003fc8 adrp x8, 0xffff800082d54000
    ffff80008255c070: d280e06a mov x10, #0x703 // =1795
    ffff80008255c074: 91400409 add x9, x0, #0x1, lsl #12 // =0x1000
    ffff80008255c078: 394a4108 ldrb w8, [x8, #0x290] ------------- (1)
    ffff80008255c07c: f2e00d0a movk x10, #0x68, lsl #48
    ffff80008255c080: f90007e9 str x9, [sp, #0x8]
    ffff80008255c084: aa0103f3 mov x19, x1
    ffff80008255c088: aa0003f4 mov x20, x0
    ffff80008255c08c: 14000000 b 0xffff80008255c08c <__pi_create_init_idmap+0x34>
    ffff80008255c090: aa082d56 orr x22, x10, x8, lsl #11 -------- (2)

Note that (1) loads the arm64_use_ng_mappings value into w8, and (2) sets the text or data prot with the w8 value to set the PTE_NG bit. If the .bss section isn't initialized, x8 could contain a garbage value and generate an incorrect mapping. Annotate arm64_use_ng_mappings as __read_mostly so that it is placed in the .data section. Fixes: 84b04d3e6bdb ("arm64: kernel: Create initial ID map from C code") Cc: stable@vger.kernel.org # 6.9.x Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Link: https://lore.kernel.org/r/20250502180412.3774883-1-yeoreum.yun@arm.com [catalin.marinas@arm.com: use __read_mostly instead of __ro_after_init] [catalin.marinas@arm.com: slight tweaking of the code comment] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
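The fix itself is a one-line annotation (before/after sketch):

    /* Before: zero-initialised, so placed in .bss (not yet cleared) */
    bool arm64_use_ng_mappings = false;

    /* After: __read_mostly places it in .data, initialised in the image */
    bool arm64_use_ng_mappings __read_mostly = false;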
2025-05-06  KVM: arm64: Add .hyp.data section  (David Brazdil)
The hypervisor has not needed its own .data section because all globals were either .rodata or .bss. To avoid having to initialize future data-structures at run-time, let's add a .data section to the hypervisor. Signed-off-by: David Brazdil <dbrazdil@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20250416160900.3078417-2-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-03  Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  (Linus Torvalds)
Pull arm64 fix from Catalin Marinas: "Add missing sentinels to the arm64 Spectre-BHB MIDR arrays, otherwise is_midr_in_range_list() reads beyond the end of these arrays" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: errata: Add missing sentinels to Spectre-BHB MIDR arrays
2025-05-01  arm64: errata: Add missing sentinels to Spectre-BHB MIDR arrays  (Will Deacon)
Commit a5951389e58d ("arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists") added some additional CPUs to the Spectre-BHB workaround, including some new arrays for designs that require new 'k' values for the workaround to be effective. Unfortunately, the new arrays omitted the sentinel entry and so is_midr_in_range_list() will walk off the end when it doesn't find a match. With UBSAN enabled, this leads to a crash during boot when is_midr_in_range_list() is inlined (which was more common prior to c8c2647e69be ("arm64: Make _midr_in_range_list() an exported function")):

| Internal error: aarch64 BRK: 00000000f2000001 [#1] PREEMPT SMP
| pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : spectre_bhb_loop_affected+0x28/0x30
| lr : is_spectre_bhb_affected+0x170/0x190
| [...]
| Call trace:
|  spectre_bhb_loop_affected+0x28/0x30
|  update_cpu_capabilities+0xc0/0x184
|  init_cpu_features+0x188/0x1a4
|  cpuinfo_store_boot_cpu+0x4c/0x60
|  smp_prepare_boot_cpu+0x38/0x54
|  start_kernel+0x8c/0x478
|  __primary_switched+0xc8/0xd4
| Code: 6b09011f 54000061 52801080 d65f03c0 (d4200020)
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: aarch64 BRK: Fatal exception

Add the missing sentinel entries. Cc: Lee Jones <lee@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Doug Anderson <dianders@chromium.org> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Cc: <stable@vger.kernel.org> Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: a5951389e58d ("arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists") Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Lee Jones <lee@kernel.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20250501104747.28431-1-will@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
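For illustration, a MIDR range list must be terminated so the walker knows where to stop; the fix adds the empty entry (the list name and members below are illustrative, not the exact arrays from the patch):

    static const struct midr_range spectre_bhb_k132_list[] = {
            MIDR_ALL_VERSIONS(MIDR_CORTEX_X3),
            MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V2),
            {},     /* sentinel: stops is_midr_in_range_list() */
    };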
2025-04-30  arm64/fpsimd: Avoid warning when sve_to_fpsimd() is unused  (Mark Rutland)
Historically fpsimd_to_sve() and sve_to_fpsimd() were (conditionally) called by functions which were defined regardless of CONFIG_ARM64_SVE. Hence it was necessary that both fpsimd_to_sve() and sve_to_fpsimd() were always defined and not guarded by ifdeffery. As a result of the removal of fpsimd_signal_preserve_current_state() in commit: 929fa99b1215966f ("arm64/fpsimd: signal: Always save+flush state early") ... sve_to_fpsimd() has no callers when CONFIG_ARM64_SVE=n, resulting in a build-time warning that it is unused:

| arch/arm64/kernel/fpsimd.c:676:13: warning: unused function 'sve_to_fpsimd' [-Wunused-function]
|   676 | static void sve_to_fpsimd(struct task_struct *task)
|       |             ^~~~~~~~~~~~~
| 1 warning generated.

In contrast, fpsimd_to_sve() still has callers which are defined when CONFIG_ARM64_SVE=n, and it would be awkward to hide this behind ifdeffery and/or to use stub functions. For now, suppress the warning by marking both fpsimd_to_sve() and sve_to_fpsimd() as 'static inline', as we usually do for stub functions. The compiler will no longer warn if either function is unused. Aside from suppressing the warning, there should be no functional change as a result of this patch. Link: https://lore.kernel.org/linux-arm-kernel/20250429194600.GA26883@willie-the-truck/ Reported-by: Will Deacon <will@kernel.org> Fixes: 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250430173240.4023627-1-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
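The change is just the function qualifier (sketch):

    /* Before: warns when CONFIG_ARM64_SVE=n leaves this unused */
    static void sve_to_fpsimd(struct task_struct *task) { /* ... */ }

    /* After: 'static inline' behaves like a stub; no unused warning */
    static inline void sve_to_fpsimd(struct task_struct *task) { /* ... */ }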
2025-04-29  arm64: Extend pr_crit message on invalid FDT  (Bartosz Szczepanek)
Log size in addition to physical and virtual addresses. It has the potential to be helpful when the DTB exceeds the 2 MB limit. Initialize size to 0 to print out a sane value if fixmap_remap_fdt() fails without setting the size. Signed-off-by: Bartosz Szczepanek <bsz@amazon.de> Link: https://lore.kernel.org/r/20250423084851.26449-1-bsz@amazon.de Signed-off-by: Will Deacon <will@kernel.org>
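A sketch of the resulting diagnostic; the variable names and message text are illustrative:

    u64 size = 0;   /* initialised so the log shows a sane value on failure */
    /* ... fixmap_remap_fdt() is attempted, filling in size on success ... */
    pr_crit("Invalid FDT: phys 0x%llx, virt 0x%p, size %llu bytes\n",
            dt_phys, fdt, size);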
2025-04-29  arm64: Expose AIDR_EL1 via sysfs  (Oliver Upton)
The KVM PV ABI recently added a feature that allows the VM to discover the set of physical CPU implementations, identified by a tuple of {MIDR_EL1, REVIDR_EL1, AIDR_EL1}. Unlike other KVM PV features, the expectation is that the VMM implements the hypercall instead of KVM as it has the authoritative view of where the VM gets scheduled. To do this the VMM needs to know the values of these registers on any CPU in the system. While MIDR_EL1 and REVIDR_EL1 are already exposed, AIDR_EL1 is not. Provide it in sysfs along with the other identification registers. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20250403231626.3181116-1-oliver.upton@linux.dev Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29  arm64: enable PREEMPT_LAZY  (Mark Rutland)
For an architecture to enable CONFIG_ARCH_HAS_RESCHED_LAZY, two things are required: 1) Adding a TIF_NEED_RESCHED_LAZY flag definition 2) Checking for TIF_NEED_RESCHED_LAZY in the appropriate locations 2) is handled in a generic manner by CONFIG_GENERIC_ENTRY, which isn't (yet) implemented for arm64. However, outside of core scheduler code, TIF_NEED_RESCHED_LAZY only needs to be checked on a kernel exit, meaning: o return/entry to userspace. o return/entry to guest. The return/entry to a guest is all handled by xfer_to_guest_mode_handle_work() which already does the right thing, so it can be left as-is. arm64 doesn't use common entry's exit_to_user_mode_prepare(), so update its return to user path to check for TIF_NEED_RESCHED_LAZY and call into schedule() accordingly. Link: https://lore.kernel.org/linux-rt-users/20241216190451.1c61977c@mordecai.tesarici.cz/ Link: https://lore.kernel.org/all/xhsmh4j0fl0p3.mognet@vschneid-thinkpadt14sgen2i.remote.csb/ Signed-off-by: Mark Rutland <mark.rutland@arm.com> [testdrive, _TIF_WORK_MASK fixlet and changelog.] Signed-off-by: Mike Galbraith <efault@gmx.de> [Another round of testing; changelog faff] Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20250305104925.189198-2-vschneid@redhat.com Signed-off-by: Will Deacon <will@kernel.org>
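On the exit-to-user path this is essentially one extra flag in the resched check; a simplified sketch of the do_notify_resume() loop:

    do {
            if (thread_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
                    schedule();
            } else {
                    /* signals, notify-resume callbacks, etc. */
            }

            thread_flags = read_thread_flags();
    } while (thread_flags & _TIF_WORK_MASK);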
2025-04-29  arm64/cpufeature: Add missing id_aa64mmfr4 feature reg update  (Yicong Yang)
Add missing id_aa64mmfr4 feature register check and update in update_cpu_features(). Update the taint status as well. Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20250329034409.21354-2-yangyicong@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
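A sketch of the addition, assuming it mirrors how the neighbouring MMFR registers are handled in update_cpu_features():

    taint |= check_update_ftr_reg(SYS_ID_AA64MMFR4_EL1, cpu,
                                  info->reg_id_aa64mmfr4,
                                  boot->reg_id_aa64mmfr4);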
2025-04-29  arm64/mm: Remove randomization of the linear map  (Ard Biesheuvel)
Since commit 97d6786e0669 ("arm64: mm: account for hotplug memory when randomizing the linear region") the decision whether or not to randomize the placement of the system's DRAM inside the linear map is based on the capabilities of the CPU rather than how much memory is present at boot time. This change was necessary because memory hotplug may result in DRAM appearing in places that are not covered by the linear region at all (and therefore unusable) if the decision is solely based on the memory map at boot. In the Android GKI kernel, which requires support for memory hotplug, and is built with a reduced virtual address space of only 39 bits wide, randomization of the linear map never happens in practice as a result. And even on arm64 kernels built with support for 48 bit virtual addressing, the wider PArange of recent CPUs means that linear map randomization is slowly becoming a feature that only works on systems that will soon be obsolete. So let's just remove this feature. We can always bring it back in an improved form if there is a real need for it. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250318134949.3194334-2-ardb+git@google.com Signed-off-by: Will Deacon <will@kernel.org>
2025-04-29  arm64/fpsimd: Avoid unnecessary per-CPU buffers for EFI runtime calls  (Ard Biesheuvel)
The EFI specification has some elaborate rules about which runtime services may be called while another runtime service call is already in progress. In Linux, however, for simplicity, all EFI runtime service invocations are serialized via the efi_runtime_lock semaphore. This implies that calls to the helper pair arch_efi_call_virt_setup() and arch_efi_call_virt_teardown() are serialized too, and are guaranteed not to nest. Furthermore, the arm64 arch code has its own spinlock to serialize use of the EFI runtime stack, of which only a single instance exists. This all means that the FP/SIMD and SVE state preserve/restore logic in __efi_fpsimd_begin() and __efi_fpsimd_end() are also serialized, and only a single instance of the associated per-CPU variables can ever be in use at the same time. There is therefore no need at all for per-CPU variables here, and they can all be replaced with singleton instances. This saves a non-trivial amount of memory on systems with many CPUs. To be more robust against potential future changes in the core EFI code that may invalidate the reasoning above, move the invocations of __efi_fpsimd_begin() and __efi_fpsimd_end() into the critical section covered by the efi_rt_lock spinlock. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250318132421.3155799-2-ardb+git@google.com Signed-off-by: Will Deacon <will@kernel.org>
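Conceptually the change replaces per-CPU storage with singletons, e.g. (illustrative variable name):

    /* Before: one buffer per CPU, though only one can ever be in use */
    static DEFINE_PER_CPU(struct user_fpsimd_state, efi_fpsimd_state);

    /* After: a single instance; efi_rt_lock serialises all users */
    static struct user_fpsimd_state efi_fpsimd_state;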
2025-04-28  arm64/fpsimd: signal: Clear TPIDR2 when delivering signals  (Mark Rutland)
Linux is intended to be compatible with userspace written to Arm's AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension (SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's ZA tile is lazily callee-saved and caller-restored. In this scheme, TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer" pointer. This scheme has been implemented in GCC and LLVM, with necessary runtime support implemented in glibc. AAPCS64 does not specify how the ZA lazy saving scheme is expected to interact with signal handling, and the behaviour that AAPCS64 currently recommends for (sig)setjmp() and (sig)longjmp() does not always compose safely with signal handling, as explained below. When Linux delivers a signal, it creates signal frames which contain the original values of PSTATE.ZA, the ZA tile, and TPIDR2_EL0. Between saving the original state and entering the signal handler, Linux clears PSTATE.ZA, but leaves TPIDR2_EL0 unchanged. Consequently a signal handler can be entered with PSTATE.ZA=0 (meaning accesses to ZA will trap), while TPIDR2_EL0 is non-null (which may indicate that ZA needs to be lazily saved, depending on the contents of the TPIDR2 block). While in this state, libc and/or compiler runtime code, such as longjmp(), may attempt to save ZA. As PSTATE.ZA=0, these accesses will trap, causing the kernel to inject a SIGILL. Note that by virtue of lazy saving occurring in libc and/or C runtime code, this can be triggered by application/library code which is unaware of SME. To avoid the problem above, the kernel must ensure that signal handlers are entered with PSTATE.ZA and TPIDR2_EL0 configured in a manner which complies with the ZA lazy saving scheme. Practically speaking, the only choice is to enter signal handlers with PSTATE.ZA=0 and TPIDR2_EL0=NULL. This change should not impact SME code which does not follow the ZA lazy saving scheme (and hence does not use TPIDR2_EL0). An alternative approach that was considered is to have the signal handler inherit the original values of both PSTATE.ZA and TPIDR2_EL0, relying on lazy save/restore sequences being idempotent and capable of racing safely. This is not safe as signal handlers must be assumed to have a "private ZA" interface, and therefore cannot be entered with PSTATE.ZA=1 and TPIDR2_EL0=NULL, but it is legitimate for signals to be taken from this state. With the kernel fixed to clear TPIDR2_EL0, there are a couple of remaining issues (largely masked by the first issue) that must be fixed in userspace: (1) When a (sig)setjmp() + (sig)longjmp() pair cross a signal boundary, ZA state may be discarded when it needs to be preserved. Currently, the ZA lazy saving scheme recommends that setjmp() does not save ZA, and recommends that longjmp() is responsible for saving ZA. A call to longjmp() in a signal handler will not have visibility of ZA state that existed prior to entry to the signal, and when a longjmp() is used to bypass a usual signal return, unsaved ZA state will be discarded erroneously. To fix this, it is necessary for setjmp() to eagerly save ZA state, and for longjmp() to configure PSTATE.ZA=0 and TPIDR2_EL0=NULL. This works regardless of whether a signal boundary is crossed. (2) When a C++ exception is thrown and crosses a signal boundary before it is caught, ZA state may be discarded when it needs to be preserved.
AAPCS64 requires that exception handlers are entered with PSTATE.{SM,ZA}={0,0} and TPIDR2_EL0=NULL, with exception unwind code expected to perform any necessary save of ZA state. Where it is necessary to perform an exception unwind across an exception boundary, the unwind code must recover any necessary ZA state (along with TPIDR2) from signal frames. Fix the kernel as described above, with setup_return() clearing TPIDR2_EL0 when delivering a signal. Folk on CC are working on fixes for the remaining userspace issues, including updates/fixes to the AAPCS64 specification and glibc. [1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf [2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling") Fixes: 39e54499280f ("arm64/signal: Include TPIDR2 in the signal context") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Daniel Kiss <daniel.kiss@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Richard Sandiford <richard.sandiford@arm.com> Cc: Sander De Smalen <sander.desmalen@arm.com> Cc: Tamas Petz <tamas.petz@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yury Khrustalev <yury.khrustalev@arm.com> Link: https://lore.kernel.org/r/20250417190113.3778111-1-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
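The kernel-side fix is small; a sketch of the setup_return() change (the exact guard and placement are assumptions):

    if (system_supports_tpidr2()) {
            /* Enter the handler with a null lazy-save pointer */
            current->thread.tpidr2_el0 = 0;
            write_sysreg_s(0, SYS_TPIDR2_EL0);
    }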
2025-04-24  Merge tag 'kvmarm-fixes-6.15-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini)
KVM/arm64 fixes for 6.15, round #2 - Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for everyone's favorite pile of sand: Cavium ThunderX
2025-04-18  arm64: Rework checks for broken Cavium HW in the PI code  (Marc Zyngier)
Calling into the MIDR checking framework from the PI code has recently become much harder, due to the new fancy "multi-MIDR" support that relies on tables being populated at boot time, but not early enough to be available to the PI code. There are additional issues with this framework, as the code really isn't position independent *at all*. This leads to some ugly breakages, as reported by Ada. It so appears that the only reason for the PI code to call into the MIDR checking code is to cope with The Most Broken ARM64 System Ever, aka Cavium ThunderX, which cannot deal with nG attributes that result from the combination of KASLR and KPTI as a consequence of Erratum 27456. Duplicate the check for the erratum in the PI code, removing the dependency on the bulk of the MIDR checking framework. This allows dropping that same check from kaslr_requires_kpti(), as the KPTI code already relies on the ARM64_WORKAROUND_CAVIUM_27456 cap. Fixes: c8c2647e69bed ("arm64: Make _midr_in_range_list() an exported function") Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/3d97e45a-23cf-419b-9b6f-140b4d88de7b@arm.com Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20250418093129.1755739-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-09  arm64/fpsimd: signal: Simplify preserve_tpidr2_context()  (Mark Rutland)
During a context-switch, tls_thread_switch() reads and writes a task's thread_struct::tpidr2_el0 field. Other code shouldn't access this field for an active task, as such accesses would form a data-race with a concurrent context-switch. The usage in preserve_tpidr2_context() is suspicious, but benign as any race with a context switch will write the same value back to current->thread.tpidr2_el0. Make this clearer and match restore_tpidr2_context() by using a temporary variable instead, avoiding the (benign) data-race. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-14-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
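The fix reads the field once into a local (sketch, using the signal code's existing __put_user_error() helper):

    /* Snapshot once; a racing context-switch writes back the same value */
    u64 tpidr2 = current->thread.tpidr2_el0;

    __put_user_error(tpidr2, &ctx->tpidr2, err);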
2025-04-09  arm64/fpsimd: signal: Always save+flush state early  (Mark Rutland)
There are several issues with the way the native signal handling code manipulates FPSIMD/SVE/SME state, described in detail below. These issues largely result from races with preemption and inconsistent handling of live state vs saved state. Known issues with native FPSIMD/SVE/SME state management include: * On systems with FPMR, the code to save/restore the FPMR accesses the register while it is not owned by the current task. Consequently, this may corrupt the FPMR of the current task and/or may corrupt the FPMR of an unrelated task. The FPMR save/restore has been broken since it was introduced in commit: 8c46def44409fc91 ("arm64/signal: Add FPMR signal handling") * On systems with SME, setup_return() modifies both the live register state and the saved register state regardless of whether the task's state is live, and without holding the cpu fpsimd context. Consequently: - This may corrupt the state of an unrelated task which has PSTATE.SM set and/or PSTATE.ZA set. - The task may enter the signal handler in streaming mode, and/or with ZA storage enabled unexpectedly. - The task may enter the signal handler in non-streaming SVE mode with stale SVE register state, which may have been inherited from streaming SVE mode unexpectedly. Where the streaming and non-streaming vector lengths differ, this may be packed into registers arbitrarily. This logic has been broken since it was introduced in commit: 40a8e87bb32855b3 ("arm64/sme: Disable ZA and streaming mode when handling signals") Further incorrect manipulation of state was added in commits: ea64baacbc36a0d5 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode") baa8515281b30861 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") * Several restoration functions use fpsimd_flush_task_state() to discard the live FPSIMD/SVE/SME state while the in-memory copy is stale. When a subset of the FPSIMD/SVE/SME state is restored, the remainder may be non-deterministically reset to a stale snapshot from some arbitrary point in the past. This non-deterministic discarding was introduced in commit: 8cd969d28fd2848d ("arm64/sve: Signal handling support") As of that commit, when TIF_SVE was initially clear, failure to restore the SVE signal frame could reset the FPSIMD registers to a stale snapshot. The pattern of discarding unsaved state was subsequently copied into restoration functions for some new state in commits: 39782210eb7e8763 ("arm64/sme: Implement ZA signal handling") ee072cf708048c0d ("arm64/sme: Implement signal handling for ZT") * On systems with SME/SME2, the entire FPSIMD/SVE/SME state may be loaded onto the CPU redundantly. Either restore_fpsimd_context() or restore_sve_fpsimd_context() will load the entire FPSIMD/SVE/SME state via fpsimd_update_current_state() before restore_za_context() and restore_zt_context() each discard the state via fpsimd_flush_task_state(). This is purely redundant work, and not a functional bug. To fix these issues, rework the native signal handling code to always save+flush the current task's FPSIMD/SVE/SME state before manipulating that state. This avoids races with preemption and ensures that state is manipulated consistently regardless of whether it happened to be live prior to manipulation. This largely involves: * Using fpsimd_save_and_flush_current_state() to save+flush the state for both signal delivery and signal return, before the state is manipulated in any way.
* Removing fpsimd_signal_preserve_current_state() and updating preserve_fpsimd_context() to explicitly ensure that the FPSIMD state is up-to-date, as preserve_fpsimd_context() is the only consumer of the FPSIMD state during signal delivery. * Modifying fpsimd_update_current_state() to not reload the FPSIMD state onto the CPU. Ideally we'd remove fpsimd_update_current_state() entirely, but I've left that for subsequent patches as there are a number of other problems with the FPSIMD<->SVE conversion helpers that should be addressed at the same time. For now I've removed the misleading comment. For setup_return(), we need to decide (for ABI reasons) whether signal delivery should have all the side-effects of an SMSTOP. For now I've left a TODO comment, as there are other questions in this area that I'll address with subsequent patches. Fixes: 8c46def44409 ("arm64/signal: Add FPMR signal handling") Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals") Fixes: ea64baacbc36 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode") Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") Fixes: 8cd969d28fd2 ("arm64/sve: Signal handling support") Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling") Fixes: ee072cf70804 ("arm64/sme: Implement signal handling for ZT") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-13-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-04-09  arm64/fpsimd: signal32: Always save+flush state early  (Mark Rutland)
There are several issues with the way the native signal handling code manipulates FPSIMD/SVE/SME state. To fix those issues, subsequent patches will rework the native signal handling code to always save+flush the current task's FPSIMD/SVE/SME state before manipulating that state. In preparation for those changes, rework the compat signal handling code to save+flush the current task's FPSIMD state before manipulating it. Subsequent patches will remove fpsimd_signal_preserve_current_state() and fpsimd_update_current_state(). Compat tasks can only have FPSIMD state, and cannot have any SVE or SME state. Thus, the SVE state manipulation present in fpsimd_signal_preserve_current_state() and fpsimd_update_current_state() is not necessary, and it is safe to directly manipulate current->thread.uw.fpsimd_state once it has been saved+flushed. Use fpsimd_save_and_flush_current_state() to save+flush the state for both signal delivery and signal return, before the state is manipulated in any way. While it would be safe for compat_restore_vfp_context() to use fpsimd_flush_task_state(current), there are several extant issues in the native signal code resulting from incorrect use of fpsimd_flush_task_state(), and for consistency it is preferable to use fpsimd_save_and_flush_current_state(). Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-12-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-04-09arm64/fpsimd: Add fpsimd_save_and_flush_current_state()Mark Rutland
When the current task's FPSIMD/SVE/SME state may be live on *any* CPU in
the system, special care must be taken when manipulating that state, as
this manipulation can race with preemption and/or asynchronous usage of
FPSIMD/SVE/SME (e.g. kernel-mode NEON in softirq handlers).

Even when manipulation is protected with get_cpu_fpsimd_context() and
put_cpu_fpsimd_context(), the logic necessary when the state is live on the
current CPU can be wildly different from the logic necessary when the state
is not live on the current CPU. A number of historical and extant issues
result from failing to handle these cases consistently and/or correctly.

To make it easier to get such manipulation correct, add a new
fpsimd_save_and_flush_current_state() helper function, which ensures that
the current task's state has been saved to memory and any stale state on
any CPU has been "flushed" such that it is not live on any CPU in the
system. This will allow code to safely manipulate the saved state without
risk of races.

Subsequent patches will use the new function.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-11-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
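A plausible shape for the helper, inferred from the operations described
above; this is a sketch, not necessarily the exact implementation, and the
save step is elided since its internal helper is not named here.

| void fpsimd_save_and_flush_current_state(void)
| {
| 	/* Exclude preemption and kernel-mode FPSIMD/NEON usage. */
| 	get_cpu_fpsimd_context();
|
| 	/* Save any live register state to current->thread. */
| 	/* ... save step elided ... */
|
| 	/* "Flush": ensure the state is not considered live on any CPU. */
| 	fpsimd_flush_task_state(current);
|
| 	put_cpu_fpsimd_context();
| }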
2025-04-09arm64/fpsimd: Fix merging of FPSIMD state during signal returnMark Rutland
For backwards compatibility reasons, when a signal return occurs which
restores SVE state, the effective lower 128 bits of each of the SVE vector
registers are restored from the corresponding FPSIMD vector register in the
FPSIMD signal frame, overriding the values in the SVE signal frame. This is
intended to be the case regardless of streaming mode.

To make this happen, restore_sve_fpsimd_context() uses
fpsimd_update_current_state() to merge the lower 128 bits from the FPSIMD
signal frame into the SVE register state. Unfortunately,
fpsimd_update_current_state() performs this merging dependent upon TIF_SVE,
which is not always correct for streaming SVE register state:

* When restoring non-streaming SVE register state there is no observable
  problem, as the signal return code configures TIF_SVE and the saved
  fp_type to match before calling fpsimd_update_current_state(), which
  observes either:

  - TIF_SVE set AND fp_type == FP_STATE_SVE
  - TIF_SVE clear AND fp_type == FP_STATE_FPSIMD

* On systems which have SME but not SVE, TIF_SVE cannot be set. Thus the
  merging will never happen for the streaming SVE register state.

* On systems which have SVE and SME, TIF_SVE can be set and cleared
  independently of PSTATE.SM. Thus the merging may or may not happen for
  streaming SVE register state. As TIF_SVE can be cleared
  non-deterministically during syscalls (including at the start of
  sigreturn()), the merging may occur non-deterministically from the
  perspective of userspace.

This logic has been broken since its introduction in commit:

85ed24dad2904f7c ("arm64/sme: Implement streaming SVE signal handling")

... at which point both fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() only checked TIF_SVE. When PSTATE.SM==1 and
TIF_SVE was clear, signal delivery would place stale FPSIMD state into the
FPSIMD signal frame, and signal return would not merge this into the
restored register state.

Subsequently, signal delivery was fixed as part of commit:

61da7c8e2a602f66 ("arm64/signal: Don't assume that TIF_SVE means we saved SVE state")

... but signal restore was not given a corresponding fix, and when TIF_SVE
was clear, signal restore would still fail to merge the FPSIMD state into
the restored SVE register state. The 'Fixes' tag did not indicate that this
had been broken since its introduction.

Fix this by merging the FPSIMD state dependent upon the saved fp_type,
matching what we (currently) do during signal delivery.

As described above, when backporting this commit, it will also be necessary
to backport commit:

61da7c8e2a602f66 ("arm64/signal: Don't assume that TIF_SVE means we saved SVE state")

... and prior to commit:

baa8515281b30861 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")

... it will be necessary for fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() to consider both TIF_SVE and
thread_sm_enabled(&current->thread), in place of the saved fp_type.

Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-10-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
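In other words, the merge decision moves from TIF_SVE to the saved fp_type,
roughly as below. The sketch is illustrative: fp_type and FP_STATE_SVE come
from the commit text, while merge_vreg_low128() and user_fpsimd are
placeholders for the frame-merging details.

| /* Sketch: decide the merge on the saved fp_type, not TIF_SVE. */
| if (current->thread.fp_type == FP_STATE_SVE) {
| 	int i;
|
| 	/*
| 	 * Overwrite the low 128 bits of each Z register with V0..V31
| 	 * from the FPSIMD signal frame.
| 	 */
| 	for (i = 0; i < 32; i++)
| 		merge_vreg_low128(i, &user_fpsimd->vregs[i]);
| }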
2025-04-09arm64/fpsimd: Reset FPMR upon exec()Mark Rutland
An exec() is expected to reset all FPSIMD/SVE/SME state, and barring
special handling of the vector lengths, the state is expected to reset to
zero. This reset is handled in fpsimd_flush_thread(), which the core exec()
code calls via flush_thread().

When support was added for FPMR, no logic was added to
fpsimd_flush_thread() to reset the FPMR value, and thus it is erroneously
inherited across an exec().

Add the missing reset of FPMR.

Fixes: 203f2b95a882 ("arm64/fpsimd: Support FEAT_FPMR")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-9-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
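The fix amounts to adding an FPMR reset alongside the existing resets,
roughly as below. This is a sketch: the guard and the storage field follow
common arm64 conventions and may not match the exact patch.

| void fpsimd_flush_thread(void)
| {
| 	/* ... existing resets of FPSIMD/SVE/SME state ... */
|
| 	/* Reset FPMR so it is not inherited across an exec(). */
| 	if (system_supports_fpmr())
| 		current->thread.uw.fpmr = 0;
| }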
2025-04-09arm64/fpsimd: Avoid clobbering kernel FPSIMD state with SMSTOPMark Rutland
On systems with SME, a thread's kernel FPSIMD state may be erroneously
clobbered during a context switch immediately after that state is restored.
Systems without SME are unaffected.

If the CPU happens to be in streaming SVE mode before a context switch to a
thread with kernel FPSIMD state, fpsimd_thread_switch() will restore the
kernel FPSIMD state using fpsimd_load_kernel_state() while the CPU is still
in streaming SVE mode. When fpsimd_thread_switch() subsequently calls
fpsimd_flush_cpu_state(), this will execute an SMSTOP, causing an exit from
streaming SVE mode. The exit from streaming SVE mode will cause the
hardware to reset a number of FPSIMD/SVE/SME registers, clobbering the
FPSIMD state.

Fix this by calling fpsimd_flush_cpu_state() before restoring the kernel
FPSIMD state.

Fixes: e92bee9f861b ("arm64/fpsimd: Avoid erroneous elide of user state reload")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-8-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
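The ordering change can be summarised as follows (a sketch of the relevant
lines in fpsimd_thread_switch(); the `next` argument and exact signatures
are simplified).

| /*
|  * Flush first, so the SMSTOP executed by the flush cannot clobber
|  * the kernel FPSIMD state loaded afterwards.
|  */
| fpsimd_flush_cpu_state();		/* may exit streaming SVE mode */
| fpsimd_load_kernel_state(next);	/* safe: streaming mode already exited */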
2025-04-09arm64/fpsimd: Don't corrupt FPMR when streaming mode changesMark Brown
When the effective value of PSTATE.SM is changed from 0 to 1 or from 1 to 0
by any method, an entry or exit to/from streaming SVE mode is performed,
and hardware automatically resets a number of registers. As of ARM DDI 0487
L.a, this means:

* All implemented bits of the SVE vector registers are set to zero.
* All implemented bits of the SVE predicate registers are set to zero.
* All implemented bits of FFR are set to zero, if FFR is implemented in
  the new mode.
* FPSR is set to 0x0000_0000_0800_009f.
* FPMR is set to 0, if FPMR is implemented.

Currently task_fpsimd_load() restores FPMR before restoring SVCR (which is
an accessor for PSTATE.{SM,ZA}), and so the restored value of FPMR may be
clobbered if the restored value of PSTATE.SM happens to differ from the
initial value of PSTATE.SM.

Fix this by moving the restore of FPMR later.

Note: this was originally posted as [1].

Fixes: 203f2b95a882 ("arm64/fpsimd: Support FEAT_FPMR")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/linux-arm-kernel/20241204-arm64-sme-reenable-v2-2-bae87728251d@kernel.org/
[ Rutland: rewrite commit message ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250409164010.3480271-7-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
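After the fix, the restore order in task_fpsimd_load() is SVCR first, FPMR
second, roughly as below. The field names and accessors shown are
illustrative and may not match the kernel's exact code.

| /* Sketch: restore SVCR before FPMR. */
| write_sysreg_s(current->thread.svcr, SYS_SVCR);	/* may toggle PSTATE.SM */
| write_sysreg_s(current->thread.uw.fpmr, SYS_FPMR);	/* FPMR now sticks */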
2025-04-09arm64/fpsimd: Discard stale CPU state when handling SME trapsMark Brown
The logic for handling SME traps manipulates saved FPSIMD/SVE/SME state
incorrectly, and a race with preemption can result in a task having TIF_SME
set and TIF_FOREIGN_FPSTATE clear even though the live CPU state is stale
(e.g. with SME traps enabled). This can result in warnings from
do_sme_acc() where SME traps are not expected while TIF_SME is set:

| /* With TIF_SME userspace shouldn't generate any traps */
| if (test_and_set_thread_flag(TIF_SME))
| 	WARN_ON(1);

This is very similar to the SVE issue we fixed in commit:

751ecf6afd6568ad ("arm64/sve: Discard stale CPU state when handling SVE traps")

The race can occur when the SME trap handler is preempted before and after
manipulating the saved FPSIMD/SVE/SME state, starting and ending on the
same CPU, e.g.

| void do_sme_acc(unsigned long esr, struct pt_regs *regs)
| {
| 	// Trap on CPU 0 with TIF_SME clear, SME traps enabled
| 	// task->fpsimd_cpu is 0.
| 	// per_cpu_ptr(&fpsimd_last_state, 0) is task.
|
| 	...
|
| 	// Preempted; migrated from CPU 0 to CPU 1.
| 	// TIF_FOREIGN_FPSTATE is set.
|
| 	get_cpu_fpsimd_context();
|
| 	/* With TIF_SME userspace shouldn't generate any traps */
| 	if (test_and_set_thread_flag(TIF_SME))
| 		WARN_ON(1);
|
| 	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
| 		unsigned long vq_minus_one =
| 			sve_vq_from_vl(task_get_sme_vl(current)) - 1;
| 		sme_set_vq(vq_minus_one);
|
| 		fpsimd_bind_task_to_cpu();
| 	}
|
| 	put_cpu_fpsimd_context();
|
| 	// Preempted; migrated from CPU 1 to CPU 0.
| 	// task->fpsimd_cpu is still 0
| 	// If per_cpu_ptr(&fpsimd_last_state, 0) is still task then:
| 	// - Stale HW state is reused (with SME traps enabled)
| 	// - TIF_FOREIGN_FPSTATE is cleared
| 	// - A return to userspace skips HW state restore
| }

Fix the case where the state is not live and TIF_FOREIGN_FPSTATE is set by
calling fpsimd_flush_task_state() to detach from the saved CPU state. This
ensures that a subsequent context switch will not reuse the stale CPU
state, and will instead set TIF_FOREIGN_FPSTATE, forcing the new state to
be reloaded from memory prior to a return to userspace.

Note: this was originally posted as [1].

Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/linux-arm-kernel/20241204-arm64-sme-reenable-v2-1-bae87728251d@kernel.org/
[ Rutland: rewrite commit message ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
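Relative to the walkthrough above, the fix can be sketched as an else
branch in the trap handler (illustrative; all names appear in the commit
text):

| if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
| 	unsigned long vq_minus_one =
| 		sve_vq_from_vl(task_get_sme_vl(current)) - 1;
| 	sme_set_vq(vq_minus_one);
|
| 	fpsimd_bind_task_to_cpu();
| } else {
| 	/*
| 	 * Detach from the stale per-CPU copy so a later context
| 	 * switch cannot reuse it.
| 	 */
| 	fpsimd_flush_task_state(current);
| }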
2025-04-09arm64/fpsimd: Remove opportunistic freeing of SME stateMark Rutland
When a task's SVE vector length (NSVL) is changed, and the task happens to
have SVCR.{SM,ZA}=={0,0}, vec_set_vector_length() opportunistically frees
the task's sme_state and clears TIF_SME.

The opportunistic freeing was added with no rationale in commit:

d4d5be94a8787242 ("arm64/fpsimd: Ensure SME storage is allocated after SVE VL changes")

That commit fixed an unrelated problem where the task's sve_state was freed
while it could be used to store streaming mode register state, where the
fix was to re-allocate the task's sve_state.

There is no need to free and/or reallocate the task's sme_state when the
SVE vector length changes, and there is no need to clear TIF_SME. Given the
SME vector length (SVL) doesn't change, the task's sme_state remains
correctly sized.

Remove the unnecessary opportunistic freeing of the task's sme_state when
the task's SVE vector length is changed. The task's sme_state is still
freed when the SME vector length is changed.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-04-09arm64/fpsimd: Remove redundant SVE trap manipulationMark Rutland
When task_fpsimd_load() loads the saved FPSIMD/SVE/SME state, it configures
EL0 SVE traps by calling sve_user_{enable,disable}(). This is unnecessary,
and this is suspicious/confusing as task_fpsimd_load() does not configure
EL0 SME traps.

All calls to task_fpsimd_load() are followed by a call to
fpsimd_bind_task_to_cpu(), where the latter configures traps for SVE and
SME dependent upon the current values of TIF_SVE and TIF_SME, overriding
any trap configuration performed by task_fpsimd_load().

The sve_user_{enable,disable}() calls in task_fpsimd_load() have been
redundant (though benign) since they were introduced in commit:

a0136be443d51803 ("arm64/fpsimd: Load FP state based on recorded data type")

Remove the unnecessary and confusing SVE trap manipulation.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-04-09arm64/fpsimd: Remove unused fpsimd_force_sync_to_sve()Mark Rutland
There have been no users of fpsimd_force_sync_to_sve() since commit:

bbc6172eefdb276b ("arm64/fpsimd: SME no longer requires SVE register state")

Remove fpsimd_force_sync_to_sve().

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-04-09arm64/fpsimd: Avoid RES0 bits in the SME trap handlerMark Rutland
The SME trap handler consumes RES0 bits from the ESR when determining the
reason for the trap, and depends upon those bits reading as zero. This may
break in future when those RES0 bits are allocated a meaning and stop
reading as zero.

For SME traps taken with ESR_ELx.EC == 0b011101, the specific reason for
the trap is indicated by ESR_ELx.ISS.SMTC ("SME Trap Code"). This field
occupies bits [2:0] of ESR_ELx.ISS, and as of ARM DDI 0487 L.a, bits [24:3]
of ESR_ELx.ISS are RES0. ESR_ELx.ISS itself occupies bits [24:0] of
ESR_ELx.

Extract the SMTC field specifically, matching the way we handle ESR_ELx
fields elsewhere, and ensuring that the handler is future-proof.

Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
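Extracting the field rather than consuming raw ISS bits looks roughly like
this. The mask name is illustrative; the bit positions follow the commit
text (SMTC occupies bits [2:0] of ESR_ELx.ISS, i.e. bits [2:0] of the ESR).

| #define ESR_ELx_SME_ISS_SMTC_MASK	GENMASK(2, 0)	/* illustrative name */
|
| static void handle_sme_trap(unsigned long esr)
| {
| 	/* Read only the defined field; ignore the surrounding RES0 bits. */
| 	unsigned long smtc = FIELD_GET(ESR_ELx_SME_ISS_SMTC_MASK, esr);
|
| 	switch (smtc) {
| 	/* ... one case per defined SME trap code ... */
| 	}
| }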
2025-04-03Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

- Fix max_pfn calculation when hotplugging memory so that it never
  decreases

- Fix dereference of unused source register in the MOPS SET operation
  fault handling

- Fix calling a NULL pointer in do_compat_alignment_fixup() when 32-bit
  user space does an unaligned LDREX/STREX

- Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs

- Drop unused pud accessors (special/mkspecial)

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Don't call NULL in do_compat_alignment_fixup()
  arm64: Add support for HIP09 Spectre-BHB mitigation
  arm64: mm: Drop dead code for pud special bit handling
  arm64: mops: Do not dereference src reg for a set operation
  arm64: mm: Correct the update of max_pfn
2025-04-01mseal sysmap: enable arm64Jeff Xu
Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering the
vdso, vvar, and compat-mode vectors and sigpage mappings. Production
release testing passes on Android and Chrome OS.

Link: https://lkml.kernel.org/r/20250305021711.3867874-5-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Berg <benjamin@sipsolutions.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Elliot Hughes <enh@google.com>
Cc: Florian Faineli <f.fainelli@gmail.com>
Cc: Greg Ungerer <gerg@kernel.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Waleij <linus.walleij@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <mike.rapoport@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01Merge tag 'mm-stable-2025-03-30-16-52' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- The series "Enable strict percpu address space checks" from Uros
  Bizjak uses x86 named address space qualifiers to provide
  compile-time checking of percpu area accesses.

  This has caused a small amount of fallout - two or three issues were
  reported. In all cases the calling code was found to be incorrect.

- The series "Some cleanup for memcg" from Chen Ridong implements some
  relatively minor cleanups for the memcontrol code.

- The series "mm: fixes for device-exclusive entries (hmm)" from David
  Hildenbrand fixes a boatload of issues which David found when using
  device-exclusive PTE entries when THP is enabled. More work is
  needed, but this makes things better - our own HMM selftests now
  succeed.

- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
  removes the z3fold and zbud implementations. They have been deprecated
  for half a year and nobody has complained.

- The series "mm: further simplify VMA merge operation" from Lorenzo
  Stoakes implements numerous simplifications in this area. No runtime
  effects are anticipated.

- The series "mm/madvise: remove redundant mmap_lock operations from
  process_madvise()" from SeongJae Park rationalizes the locking in the
  madvise() implementation. Performance gains of 20-25% were observed
  in one MADV_DONTNEED microbenchmark.

- The series "Tiny cleanup and improvements about SWAP code" from
  Baoquan He contains a number of touchups to issues which Baoquan
  noticed when working on the swap code.

- The series "mm: kmemleak: Usability improvements" from Catalin
  Marinas implements a couple of improvements to the kmemleak
  user-visible output.

- The series "mm/damon/paddr: fix large folios access and schemes
  handling" from Usama Arif provides a couple of fixes for DAMON's
  handling of large folios.

- The series "mm/damon/core: fix wrong and/or useless damos_walk()
  behaviors" from SeongJae Park fixes a few issues with the accuracy of
  kdamond's walking of DAMON regions.

- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
  Stoakes changes the interaction between framebuffer deferred-io and
  core MM. No functional changes are anticipated - this is preparatory
  work for the future removal of page structure fields.

- The series "mm/damon: add support for hugepage_size DAMOS filter"
  from Usama Arif adds a DAMOS filter which permits the filtering by
  huge page sizes.

- The series "mm: permit guard regions for file-backed/shmem mappings"
  from Lorenzo Stoakes extends the guard region feature from its
  present "anon mappings only" state. The feature now covers shmem and
  file-backed mappings.

- The series "mm: batched unmap lazyfree large folios during
  reclamation" from Barry Song cleans up and speeds up the unmapping
  for pte-mapped large folios.

- The series "reimplement per-vma lock as a refcount" from Suren
  Baghdasaryan puts the vm_lock back into the vma. Our reasons for
  pulling it out were largely bogus and that change made the code more
  messy. This patchset provides small (0-10%) improvements on one
  microbenchmark.

- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
  improves" from SeongJae Park does some maintenance work on the DAMON
  docs.

- The series "hugetlb/CMA improvements for large systems" from Frank
  van der Linden addresses a pile of issues which have been observed
  when using CMA on large machines.

- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
  from SeongJae Park enables users of DAMON/DAMOS to filter by the
  page's mapped/unmapped status.

- The series "zsmalloc/zram: there be preemption" from Sergey
  Senozhatsky teaches zram to run its compression and decompression
  operations preemptibly.

- The series "selftests/mm: Some cleanups from trying to run them" from
  Brendan Jackman fixes a pile of unrelated issues which Brendan
  encountered while running our selftests.

- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
  Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
  determine whether a particular page is a guard page.

- The series "mm, swap: remove swap slot cache" from Kairui Song
  removes the swap slot cache from the allocation path - it simply
  wasn't being effective.

- The series "mm: cleanups for device-exclusive entries (hmm)" from
  David Hildenbrand implements a number of unrelated cleanups in this
  code.

- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
  implements a number of preparatory cleanups to the GENERIC_PTDUMP
  Kconfig logic.

- The series "mm/damon: auto-tune aggregation interval" from SeongJae
  Park implements a feedback-driven automatic tuning feature for
  DAMON's aggregation interval tuning.

- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
  powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
  preparation for implementing lazy mmu mode for arm64 to optimize
  vmalloc.

- The series "mm/page_alloc: Some clarifications for migratetype
  fallback" from Brendan Jackman reworks some commentary to make the
  code easier to follow.

- The series "page_counter cleanup and size reduction" from Shakeel
  Butt cleans up the page_counter code and fixes a size increase which
  we accidentally added late last year.

- The series "Add a command line option that enables control of how
  many threads should be used to allocate huge pages" from Thomas
  Prescher does that. It allows the careful operator to significantly
  reduce boot time by tuning the parallelization of huge page
  initialization.

- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
  from Tang Yizhou fixes the tracing output from the dirty page
  balancing code.

- The series "mm/damon: make allow filters after reject filters useful
  and intuitive" from SeongJae Park improves the handling of allow and
  reject filters. Behaviour is made more consistent and the
  documentation is updated accordingly.

- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
  updates zswap to the new object read/write APIs and thus permits the
  removal of some legacy code from zpool and zsmalloc.

- The series "Some trivial cleanups for shmem" from Baolin Wang does as
  it claims.

- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
  Alistair Popple regularizes the weird ZONE_DEVICE page refcount
  handling in DAX, permitting the removal of a number of special-case
  checks.

- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
  preparatory refactoring and cleanup of the mremap() code.

- The series "mm: MM owner tracking for large folios (!hugetlb) +
  CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
  which we determine whether a large folio is known to be mapped
  exclusively into a single MM.

- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
  on handling layers" from SeongJae Park adds a couple of new sysfs
  directories to ease the management of DAMON/DAMOS filters.

- The series "arch, mm: reduce code duplication in mem_init()" from
  Mike Rapoport consolidates many per-arch implementations of
  mem_init() into generic code, where that is practical.

- The series "mm/damon/sysfs: commit parameters online via
  damon_call()" from SeongJae Park continues the cleaning up of sysfs
  access to DAMON internal data.

- The series "mm: page_ext: Introduce new iteration API" from Luiz
  Capitulino reworks the page_ext initialization to fix a boot-time
  crash which was observed with an unusual combination of compile and
  cmdline options.

- The series "Buddy allocator like (or non-uniform) folio split" from
  Zi Yan reworks the code to split a folio into smaller folios. The
  main benefit is lessened memory consumption: fewer post-split folios
  are generated.

- The series "Minimize xa_node allocation during xarry split" from Zi
  Yan reduces the number of xarray xa_nodes which are generated during
  an xarray split.

- The series "drivers/base/memory: Two cleanups" from Gavin Shan
  performs some maintenance work on the drivers/base/memory code.

- The series "Add tracepoints for lowmem reserves, watermarks and
  totalreserve_pages" from Martin Liu adds some more tracepoints to the
  page allocator code.

- The series "mm/madvise: cleanup requests validations and
  classifications" from SeongJae Park cleans up some warts which
  SeongJae observed during his earlier madvise work.

- The series "mm/hwpoison: Fix regressions in memory failure handling"
  from Shuai Xue addresses two quite serious regressions which Shuai
  has observed in the memory-failure implementation.

- The series "mm: reliable huge page allocator" from Johannes Weiner
  makes huge page allocations cheaper and more reliable by reducing
  fragmentation.

- The series "Minor memcg cleanups & prep for memdescs" from Matthew
  Wilcox is preparatory work for the future implementation of memdescs.

- The series "track memory used by balloon drivers" from Nico Pache
  introduces a way to track memory used by our various balloon drivers.

- The series "mm/damon: introduce DAMOS filter type for active pages"
  from Nhat Pham permits users to filter for active/inactive pages,
  separately for file and anon pages.

- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
  separates the proactive reclaim statistics from the direct reclaim
  statistics.

- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
  Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
  code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01arm64: Don't call NULL in do_compat_alignment_fixup()Angelos Oikonomopoulos
do_alignment_t32_to_handler() only fixes up alignment faults for specific
instructions; it returns NULL otherwise (e.g. LDREX). When that's the case,
signal to the caller that it needs to proceed with the regular alignment
fault handling (i.e. SIGBUS). Without this patch, the kernel panics:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
  Mem abort info:
    ESR = 0x0000000086000006
    EC = 0x21: IABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    FSC = 0x06: level 2 translation fault
  user pgtable: 4k pages, 48-bit VAs, pgdp=00000800164aa000
  [0000000000000000] pgd=0800081fdbd22003, p4d=0800081fdbd22003, pud=08000815d51c6003, pmd=0000000000000000
  Internal error: Oops: 0000000086000006 [#1] SMP
  Modules linked in: cfg80211 rfkill xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat br_netfilter veth nvme_fa>
   libcrc32c crc32c_generic raid0 multipath linear dm_mod dax raid1 md_mod xhci_pci nvme xhci_hcd nvme_core t10_pi usbcore igb crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_ce crct10dif_common usb_common i2c_algo_bit i2c>
  CPU: 2 PID: 3932954 Comm: WPEWebProcess Not tainted 6.1.0-31-arm64 #1 Debian 6.1.128-1
  Hardware name: GIGABYTE MP32-AR1-00/MP32-AR1-00, BIOS F18v (SCP: 1.08.20211002) 12/01/2021
  pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : 0x0
  lr : do_compat_alignment_fixup+0xd8/0x3dc
  sp : ffff80000f973dd0
  x29: ffff80000f973dd0 x28: ffff081b42526180 x27: 0000000000000000
  x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
  x23: 0000000000000004 x22: 0000000000000000 x21: 0000000000000001
  x20: 00000000e8551f00 x19: ffff80000f973eb0 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
  x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  x11: 0000000000000000 x10: 0000000000000000 x9 : ffffaebc949bc488
  x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
  x5 : 0000000000400000 x4 : 0000fffffffffffe x3 : 0000000000000000
  x2 : ffff80000f973eb0 x1 : 00000000e8551f00 x0 : 0000000000000001
  Call trace:
   0x0
   do_alignment_fault+0x40/0x50
   do_mem_abort+0x4c/0xa0
   el0_da+0x48/0xf0
   el0t_32_sync_handler+0x110/0x140
   el0t_32_sync+0x190/0x194
  Code: bad PC value
  ---[ end trace 0000000000000000 ]---

Signed-off-by: Angelos Oikonomopoulos <angelos@igalia.com>
Fixes: 3fc24ef32d3b ("arm64: compat: Implement misalignment fixups for multiword loads")
Cc: <stable@vger.kernel.org> # 6.1.x
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250401085150.148313-1-angelos@igalia.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
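The shape of the fix is a NULL check before invoking the handler, roughly
as below. The variable names and the return convention (non-zero meaning
"not fixed up", so the caller raises SIGBUS) are simplified for
illustration.

| /* Sketch of the fix in do_compat_alignment_fixup(). */
| handler = do_alignment_t32_to_handler(&instr, regs, &offset);
| if (!handler)
| 	return 1;	/* no fixup: fall back to the SIGBUS path */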
2025-03-30Merge tag 'modules-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull modules updates from Petr Pavlu:

- Use RCU instead of RCU-sched

  The mix of rcu_read_lock(), rcu_read_lock_sched() and
  preempt_disable() in the module code and its users has been replaced
  with just rcu_read_lock()

- The rest of changes are smaller fixes and updates

* tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (32 commits)
  MAINTAINERS: Update the MODULE SUPPORT section
  module: Remove unnecessary size argument when calling strscpy()
  module: Replace deprecated strncpy() with strscpy()
  params: Annotate struct module_param_attrs with __counted_by()
  bug: Use RCU instead RCU-sched to protect module_bug_list.
  static_call: Use RCU in all users of __module_text_address().
  kprobes: Use RCU in all users of __module_text_address().
  bpf: Use RCU in all users of __module_text_address().
  jump_label: Use RCU in all users of __module_text_address().
  jump_label: Use RCU in all users of __module_address().
  x86: Use RCU in all users of __module_address().
  cfi: Use RCU while invoking __module_address().
  powerpc/ftrace: Use RCU in all users of __module_text_address().
  LoongArch: ftrace: Use RCU in all users of __module_text_address().
  LoongArch/orc: Use RCU in all users of __module_address().
  arm64: module: Use RCU in all users of __module_text_address().
  ARM: module: Use RCU in all users of __module_text_address().
  module: Use RCU in all users of __module_text_address().
  module: Use RCU in all users of __module_address().
  module: Use RCU in search_module_extables().
  ...
2025-03-28arm64: Add support for HIP09 Spectre-BHB mitigationJinqian Yang
The HIP09 processor is vulnerable to the Spectre-BHB (Branch History
Buffer) attack, which can be exploited to leak information through branch
prediction side channels. This commit adds the MIDR of HIP09 to the list
for software mitigation.

Signed-off-by: Jinqian Yang <yangjinqian1@huawei.com>
Link: https://lore.kernel.org/r/20250325141900.2057314-1-yangjinqian1@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
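In practice, such enablement boils down to adding the new MIDR to an
affected-CPU list consulted by the Spectre-BHB code, roughly as below. The
list name is illustrative; the exact list the patch extends may differ.

| /* Sketch: mark HIP09 as needing the software BHB-clearing mitigation. */
| static const struct midr_range spectre_bhb_affected_cpus[] = {
| 	MIDR_ALL_VERSIONS(MIDR_HISI_HIP09),
| 	{},
| };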