Age | Commit message (Collapse) | Author |
|
In arch_dup_task_struct() we try to ensure that the child task inherits
the FPSIMD state of its parent, but this depends on the parent task's
saved state being in FPSIMD format, which is not always the case.
Consequently the child task may inherit stale FPSIMD state in some
cases.
This can happen when the parent's state has been modified by ptrace
since syscall entry, as writes to the NT_ARM_SVE regset may save state
in SVE format. This has been possible since commit:
bc0ee4760364 ("arm64/sve: Core task context handling")
More recently it has been possible for a task's FPSIMD/SVE state to be
saved before lazy discarding was guaranteed to occur, in which case
preemption could cause the effective FPSIMD state to be saved in SVE
format non-deterministically. This has been possible since commit:
f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
Fix this by saving the parent task's effective FPSIMD state into FPSIMD
format before copying the task_struct. As this requires modifying the
parent's fpsimd_state, we must save+flush the state to avoid racing with
concurrent manipulation.
Similar issues exist when the parent has streaming mode state, and will
be addressed by subsequent patches.
Fixes: bc0ee4760364 ("arm64/sve: Core task context handling")
Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-12-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For historical reasons, arch_dup_task_struct() only calls
fpsimd_preserve_current_state() when current->mm is non-NULL, but this
is no longer necessary.
Historically TIF_FOREIGN_FPSTATE was only managed for user threads, and
was never set for kernel threads. At the time, various functions
attempted to avoid saving/restoring state for kernel threads by checking
task_struct::mm to try to distinguish user threads from kernel threads.
We added the current->mm check to arch_dup_task_struct() in commit:
6eb6c80187c5 ("arm64: kernel thread don't need to save fpsimd context.")
... where the intent was to avoid pointlessly saving state for kernel
threads, which never had live state (and the saved state would never be
consumed).
Subsequently we began setting TIF_FOREIGN_FPSTATE for kernel threads,
and removed most of the task_struct::mm checks in commit:
df3fb9682045 ("arm64: fpsimd: Eliminate task->mm checks")
... but we missed the check in arch_dup_task_struct(), which is similarly
redundant.
Remove the redundant check from arch_dup_task_struct().
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-11-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Historically the behaviour of setup_return() was nondeterministic,
depending on whether the task's FSIMD/SVE/SME state happened to be live.
We fixed most of that in commit:
929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early")
... but we didn't decide on how clearing PSTATE.SM should behave, and left a
TODO comment to that effect.
Use the new task_smstop_sm() helper to make this behave as if an SMSTOP
instruction was used to exit streaming mode. This would have been the
most common behaviour prior to the commit above.
Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-10-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In a few places we want to transition a task from streaming mode to
non-streaming mode, e.g. signal delivery where we historically tried to
use an SMSTOP SM instruction.
Add a new helper to manipulate a task's state in the same way as an
SMSTOP SM instruction. I have not added a corresponding helper to
simulate the effects of SMSTART SM. Only ptrace transitions a task into
streaming mode, and ptrace has distinct semantics for such transitions.
Per ARM DDI 0487 L.a, section B1.4.6:
| RRSWFQ
| When the Effective value of PSTATE.SM is changed by any method from 0
| to 1, an entry to Streaming SVE mode is performed, and all implemented
| bits of Streaming SVE register state are set to zero.
| RKFRQZ
| When the Effective value of PSTATE.SM is changed by any method from 1
| to 0, an exit from Streaming SVE mode is performed, and in the
| newly-entered mode, all implemented bits of the SVE scalable vector
| registers, SVE predicate registers, and FFR, are set to zero.
Per ARM DDI 0487 L.a, section C5.2.9:
| On entry to or exit from Streaming SVE mode, FPMR is set to 0
Per ARM DDI 0487 L.a, section C5.2.10:
| On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC,
| IXC, IDC, QC} are set to 1 and the remaining bits are set to 0.
This means bits 0, 1, 2, 3, 4, 7, and 27 respectively, i.e. 0x0800009f
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-9-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In subsequent patches we'll need to determine the SVE/SME state size for
a given SVE VL and SME VL regardless of whether a task is currently
configured with those VLs. Split the sizing logic out of
sve_state_size() and sme_state_size() so that we don't need to open-code
this logic elsewhere.
At the same time, apply minor cleanups:
* Move sve_state_size() into fpsimd.h, matching the placement of
sme_state_size().
* Remove the feature checks from sve_state_size(). We only call
sve_state_size() when at least one of SVE and SME are supported, and
when either of the two is not supported, the task's corresponding
SVE/SME vector length will be zero.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-8-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The sve_sync_{to,from}_fpsimd*() functions are intended to
extract/insert the currently effective FPSIMD state of a task regardless
of whether the task's state is saved in FPSIMD format or SVE format.
Historically they were only used by ptrace, but sve_sync_to_fpsimd() is
now used more widely, and sve_sync_from_fpsimd_zeropad() may be used
more widely in future.
When FPSIMD/SVE state tracking was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
a0136be443d5 (arm64/fpsimd: Load FP state based on recorded data type")
bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state")
8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
... sve_sync_to_fpsimd() was updated to consider task->thread.fp_type
rather than the task's TIF_SVE and PSTATE.SM, but (apparently due to an
oversight) sve_sync_from_fpsimd_zeropad() was left as-is, leaving the
two inconsistent.
Due to this, sve_sync_from_fpsimd_zeropad() may copy state from
task->thread.uw.fpsimd_state into task->thread.sve_state when
task->thread.fp_type == FP_STATE_FPSIMD. This is redundant (but benign)
as task->thread.uw.fpsimd_state is the effective state that will be
restored, and task->thread.sve_state will not be consumed. For
consistency, and to avoid the redundant work, it better for
sve_sync_from_fpsimd_zeropad() to consider task->thread.fp_type alone,
matching sve_sync_to_fpsimd().
The naming of both functions is somehat unfortunate, as it is unclear
when and why they copy state. It would be better to describe them in
terms of the effective state.
Considering all of the above, clean this up:
* Adjust sve_sync_from_fpsimd_zeropad() to consider
task->thread.fp_type.
* Update comments to clarify the intended semantics/usage. I've removed
the description that task->thread.sve_state must have been allocated,
as this is only necessary when task->thread.fp_type == FP_STATE_SVE,
which itself implies that task->thread.sve_state must have been
allocated.
* Rename the functions to more clearly indicate when/why they copy
state:
- sve_sync_to_fpsimd() => fpsimd_sync_from_effective_state()
- sve_sync_from_fpsimd_zeropad => fpsimd_sync_to_effective_state_zeropad()
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-7-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Partial writes to the NT_ARM_SVE and NT_ARM_SSVE regsets using an
payload are handled inconsistently and non-deterministically. A comment
within sve_set_common() indicates that we intended that a partial write
would preserve any effective FPSIMD/SVE state which was not overwritten,
but this has never worked consistently, and during syscalls the FPSIMD
vector state may be non-deterministically preserved and may be
erroneously migrated between streaming and non-streaming SVE modes.
The simplest fix is to handle a partial write by consistently zeroing
the remaining state. As detailed below I do not believe this will
adversely affect any real usage.
Neither GDB nor LLDB attempt partial writes to these regsets, and the
documentation (in Documentation/arch/arm64/sve.rst) has always indicated
that state preservation was not guaranteed, as is says:
| The effect of writing a partial, incomplete payload is unspecified.
When the logic was originally introduced in commit:
43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support")
... there were two potential behaviours, depending on TIF_SVE:
* When TIF_SVE was clear, all SVE state would be zeroed, excluding the
low 128 bits of vectors shared with FPSIMD, FPSR, and FPCR.
* When TIF_SVE was set, all SVE state would be zeroed, including the
low 128 bits of vectors shared with FPSIMD, but excluding FPSR and
FPCR.
Note that as writing to NT_ARM_SVE would set TIF_SVE, partial writes to
NT_ARM_SVE would not be idempotent, and if a first write preserved the
low 128 bits, a subsequent (potentially identical) partial write would
discard the low 128 bits.
When support for the NT_ARM_SSVE regset was added in commit:
e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
... the above behaviour was retained for writes to the NT_ARM_SVE
regset, though writes to the NT_ARM_SSVE would always zero the SVE
registers and would not inherit FPSIMD register state. This happened as
fpsimd_sync_to_sve() only copied the FPSIMD regs when TIF_SVE was clear
and PSTATE.SM==0.
Subsequently, when FPSIMD/SVE state tracking was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
a0136be443d5 (arm64/fpsimd: Load FP state based on recorded data type")
bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state")
8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
... there was no corresponding update to the ptrace code, nor to
fpsimd_sync_to_sve(), which stil considers TIF_SVE and PSTATE.SM rather
than the saved fp_type. The saved state can be in the FPSIMD format
regardless of whether TIF_SVE is set or clear, and the saved type can
change non-deterministically during syscalls. Consequently a subsequent
partial write to the NT_ARM_SVE or NT_ARM_SSVE regsets may
non-deterministically preserve the FPSIMD state, and may migrate this
state between streaming and non-streaming modes.
Clean this up by never attempting to preserve ANY state when writing an
SVE payload to the NT_ARM_SVE/NT_ARM_SSVE regsets, zeroing all relevant
state including FPSR and FPCR. This simplifies the code, makes the
behaviour deterministic, and avoids migrating state between streaming
and non-streaming modes. As above, I do not believe this should
adversely affect existing userspace applications.
At the same time, remove fpsimd_sync_to_sve(). It is no longer used,
doesn't do what its documentation implies, and gets in the way of other
cleanups and fixes.
Fixes: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Spickett <david.spickett@arm.com>
Cc: Luis Machado <luis.machado@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-6-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
A malicious BPF program may manipulate the branch history to influence
what the hardware speculates will happen next.
On exit from a BPF program, emit the BHB mititgation sequence.
This is only applied for 'classic' cBPF programs that are loaded by
seccomp.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Add a helper to expose the k value of the branchy loop. This is needed
by the BPF JIT to generate the mitigation sequence in BPF programs.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
is_spectre_bhb_fw_affected() allows the caller to determine if the CPU
is known to need a firmware mitigation. CPUs are either on the list
of CPUs we know about, or firmware has been queried and reported that
the platform is affected - and mitigated by firmware.
This helper is not useful to determine if the platform is mitigated
by firmware. A CPU could be on the know list, but the firmware may
not be implemented. Its affected but not mitigated.
spectre_bhb_enable_mitigation() handles this distinction by checking
the firmware state before enabling the mitigation.
Add a helper to expose this state. This will be used by the BPF JIT
to determine if calling firmware for a mitigation is necessary and
supported.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
For historical reasons, restore_sve_fpsimd_context() has an open-coded
copy of the logic from read_fpsimd_context(), which is used to either
restore an FPSIMD-only context, or to merge FPSIMD state into an
SVE state when restoring an SVE+FPSIMD context. The logic is *almost*
identical.
Refactor the logic to avoid duplication and make this clearer.
This comes with two functional changes that I do not believe will be
problematic in practice:
* The user_fpsimd_state::size field will be checked in all restore paths
that consume it user_fpsimd_state. The kernel always populates this
field when delivering a signal, and so this should contain the
expected value unless it has been corrupted.
* If a read of user_fpsimd_state fails, we will return early without
modifying TIF_SVE, the saved SVCR, or the save fp_type. This will
leave the task in a consistent state, without potentially resurrecting
stale FPSIMD state. A read of user_fpsimd_state should never fail
unless the structure has been corrupted or the stack has been
unmapped.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-5-mark.rutland@arm.com
[will: Ensure read_fpsimd_context() returns negative error code or zero]
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Non-streaming SVE state may be preserved without an SVE payload, in
which case the SVE context only has a header with VL==0, and all state
can be restored from the FPSIMD context. Streaming SVE state is always
preserved with an SVE payload, where the SVE context header has VL!=0,
and the SVE_SIG_FLAG_SM flag is set.
The kernel never preserves an SVE context where SVE_SIG_FLAG_SM is set
without an SVE payload. However, restore_sve_fpsimd_context() doesn't
forbid restoring such a context, and will handle this case by clearing
PSTATE.SM and restoring the FPSIMD context into non-streaming mode,
which isn't consistent with the SVE_SIG_FLAG_SM flag.
Forbid this case, and mandate an SVE payload when the SVE_SIG_FLAG_SM
flag is set. This avoids an awkward ABI quirk and reduces the risk that
later rework to this code permits configuring a task with PSTATE.SM==1
and fp_type==FP_STATE_FPSIMD.
I've marked this as a fix given that we never intended to support this
case, and we don't want anyone to start relying upon the old behaviour
once we re-enable SME.
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-4-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
On systems with SVE and/or SME, the kernel will always create SVE and
FPSIMD signal frames when delivering a signal, but a user can manipulate
signal frames such that a signal return only observes an FPSIMD signal
frame. When this happens, restore_fpsimd_context() will restore state
such that fp_type==FP_STATE_FPSIMD, but will leave PSTATE.SM as-is.
It is possible for a user to set PSTATE.SM between syscall entry and
execution of the sigreturn logic (e.g. via ptrace), and consequently the
sigreturn may result in the task having PSTATE.SM==1 and
fp_type==FP_STATE_FPSIMD.
For various reasons it is not legitimate for a task to be in a state
where PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. Portions of the user
ABI are written with the requirement that streaming SVE state is always
presented in SVE format rather than FPSIMD format, and as there is no
mechanism to permit access to only the FPSIMD subset of streaming SVE
state, streaming SVE state must always be saved and restored in SVE
format.
Fix restore_fpsimd_context() to clear PSTATE.SM when restoring an FPSIMD
signal frame without an SVE signal frame. This matches the current
behaviour when an SVE signal frame is present, but the SVE signal frame
has no register payload (e.g. as is the case on SME-only systems which
lack SVE).
This change should have no effect for applications which do not alter
signal frames (i.e. almost all applications). I do not expect
non-{malicious,buggy} applications to hide the SVE signal frame, but
I've chosen to clear PSTATE.SM rather than mandating the presence of an
SVE signal frame in case there is some legacy (non-SME) usage that I am
not currently aware of.
For context, the SME handling was originally introduced in commit:
85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
... and subsequently updated/fixed to handle SME-only systems in commits:
7dde62f0687c ("arm64/signal: Always accept SVE signal frames on SME only systems")
f26cd7372160 ("arm64/signal: Always allocate SVE signal frames on SME only systems")
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-3-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Historically SVE state was discarded deterministically early in the
syscall entry path, before ptrace is notified of syscall entry. This
permitted ptrace to modify SVE state before and after the "real" syscall
logic was executed, with the modified state being retained.
This behaviour was changed by commit:
8c845e2731041f0f ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
That commit was intended to speed up workloads that used SVE by
opportunistically leaving SVE enabled when returning from a syscall.
The syscall entry logic was modified to truncate the SVE state without
disabling userspace access to SVE, and fpsimd_save_user_state() was
modified to discard userspace SVE state whenever
in_syscall(current_pt_regs()) is true, i.e. when
current_pt_regs()->syscallno != NO_SYSCALL.
Leaving SVE enabled opportunistically resulted in a couple of changes to
userspace visible behaviour which weren't described at the time, but are
logical consequences of opportunistically leaving SVE enabled:
* Signal handlers can observe the type of saved state in the signal's
sve_context record. When the kernel only tracks FPSIMD state, the 'vq'
field is 0 and there is no space allocated for register contents. When
the kernel tracks SVE state, the 'vq' field is non-zero and the
register contents are saved into the record.
As a result of the above commit, 'vq' (and the presence of SVE
register state) is non-deterministically zero or non-zero for a period
of time after a syscall. The effective register state is still
deterministic.
Hopefully no-one relies on this being deterministic. In general,
handlers for asynchronous events cannot expect a deterministic state.
* Similarly to signal handlers, ptrace requests can observe the type of
saved state in the NT_ARM_SVE and NT_ARM_SSVE regsets, as this is
exposed in the header flags. As a result of the above commit, this is
now in a non-deterministic state after a syscall. The effective
register state is still deterministic.
Hopefully no-one relies on this being deterministic. In general,
debuggers would have to handle this changing at arbitrary points
during program flow.
Discarding the SVE state within fpsimd_save_user_state() resulted in
other changes to userspace visible behaviour which are not desirable:
* A ptrace tracer can modify (or create) a tracee's SVE state at syscall
entry or syscall exit. As a result of the above commit, the tracee's
SVE state can be discarded non-deterministically after modification,
rather than being retained as it previously was.
Note that for co-operative tracer/tracee pairs, the tracer may
(re)initialise the tracee's state arbitrarily after the tracee sends
itself an initial SIGSTOP via a syscall, so this affects realistic
design patterns.
* The current_pt_regs()->syscallno field can be modified via ptrace, and
can be altered even when the tracee is not really in a syscall,
causing non-deterministic discarding to occur in situations where this
was not previously possible.
Further, using current_pt_regs()->syscallno in this way is unsound:
* There are data races between readers and writers of the
current_pt_regs()->syscallno field.
The current_pt_regs()->syscallno field is written in interruptible
task context using plain C accesses, and is read in irq/softirq
context using plain C accesses. These accesses are subject to data
races, with the usual concerns with tearing, etc.
* Writes to current_pt_regs()->syscallno are subject to compiler
reordering.
As current_pt_regs()->syscallno is written with plain C accesses,
the compiler is free to move those writes arbitrarily relative to
anything which doesn't access the same memory location.
In theory this could break signal return, where prior to restoring the
SVE state, restore_sigframe() calls forget_syscall(). If the write
were hoisted after restore of some SVE state, that state could be
discarded unexpectedly.
In practice that reordering cannot happen in the absence of LTO (as
cross compilation-unit function calls happen prevent this reordering),
and that reordering appears to be unlikely in the presence of LTO.
Additionally, since commit:
f130ac0ae4412dbe ("arm64: syscall: unmask DAIF earlier for SVCs")
... DAIF is unmasked before el0_svc_common() sets regs->syscallno to the
real syscall number. Consequently state may be saved in SVE format prior
to this point.
Considering all of the above, current_pt_regs()->syscallno should not be
used to infer whether the SVE state can be discarded. Luckily we can
instead use cpu_fp_state::to_save to track when it is safe to discard
the SVE state:
* At syscall entry, after the live SVE register state is truncated, set
cpu_fp_state::to_save to FP_STATE_FPSIMD to indicate that only the
FPSIMD portion is live and needs to be saved.
* At syscall exit, once the task's state is guaranteed to be live, set
cpu_fp_state::to_save to FP_STATE_CURRENT to indicate that TIF_SVE
must be considered to determine which state needs to be saved.
* Whenever state is modified, it must be saved+flushed prior to
manipulation. The state will be truncated if necessary when it is
saved, and reloading the state will set fp_state::to_save to
FP_STATE_CURRENT, preventing subsequent discarding.
This permits SVE state to be discarded *only* when it is known to have
been truncated (and the non-FPSIMD portions must be zero), and ensures
that SVE state is retained after it is explicitly modified.
For backporting, note that this fix depends on the following commits:
* b2482807fbd4 ("arm64/sme: Optimise SME exit on syscall entry")
* f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
* 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early")
Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
report_ubsan_failure() doesn't use argument regs, and soon it will
be called from the hypervisor context were regs are not available.
So, remove the unused argument.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250430162713.1997569-3-smostafa@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Soon, KVM is going to use this logic for hypervisor panics,
so add it in a wrapper that can be used by the hypervisor exit
handler to decode hyp panics.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250430162713.1997569-2-smostafa@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
As we will eventually have to context-switch the FEAT_FGT2 registers
in KVM (something that has been completely ignored so far), add
a new cap that we will be able to check for.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
prevent wrong idmap generation
The PTE_MAYBE_NG macro sets the nG page table bit according to the value
of "arm64_use_ng_mappings". This variable is currently placed in the
.bss section. create_init_idmap() is called before the .bss section
initialisation which is done in early_map_kernel(). Therefore,
data/test_prot in create_init_idmap() could be set incorrectly through
the PAGE_KERNEL -> PROT_DEFAULT -> PTE_MAYBE_NG macros.
# llvm-objdump-21 --syms vmlinux-gcc | grep arm64_use_ng_mappings
ffff800082f242a8 g O .bss 0000000000000001 arm64_use_ng_mappings
The create_init_idmap() function disassembly compiled with llvm-21:
// create_init_idmap()
ffff80008255c058: d10103ff sub sp, sp, #0x40
ffff80008255c05c: a9017bfd stp x29, x30, [sp, #0x10]
ffff80008255c060: a90257f6 stp x22, x21, [sp, #0x20]
ffff80008255c064: a9034ff4 stp x20, x19, [sp, #0x30]
ffff80008255c068: 910043fd add x29, sp, #0x10
ffff80008255c06c: 90003fc8 adrp x8, 0xffff800082d54000
ffff80008255c070: d280e06a mov x10, #0x703 // =1795
ffff80008255c074: 91400409 add x9, x0, #0x1, lsl #12 // =0x1000
ffff80008255c078: 394a4108 ldrb w8, [x8, #0x290] ------------- (1)
ffff80008255c07c: f2e00d0a movk x10, #0x68, lsl #48
ffff80008255c080: f90007e9 str x9, [sp, #0x8]
ffff80008255c084: aa0103f3 mov x19, x1
ffff80008255c088: aa0003f4 mov x20, x0
ffff80008255c08c: 14000000 b 0xffff80008255c08c <__pi_create_init_idmap+0x34>
ffff80008255c090: aa082d56 orr x22, x10, x8, lsl #11 -------- (2)
Note (1) is loading the arm64_use_ng_mappings value in w8 and (2) is set
the text or data prot with the w8 value to set PTE_NG bit. If the .bss
section isn't initialized, x8 could include a garbage value and generate
an incorrect mapping.
Annotate arm64_use_ng_mappings as __read_mostly so that it is placed in
the .data section.
Fixes: 84b04d3e6bdb ("arm64: kernel: Create initial ID map from C code")
Cc: stable@vger.kernel.org # 6.9.x
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Link: https://lore.kernel.org/r/20250502180412.3774883-1-yeoreum.yun@arm.com
[catalin.marinas@arm.com: use __read_mostly instead of __ro_after_init]
[catalin.marinas@arm.com: slight tweaking of the code comment]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The hypervisor has not needed its own .data section because all globals
were either .rodata or .bss. To avoid having to initialize future
data-structures at run-time, let's introduce add a .data section to the
hypervisor.
Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Add missing sentinels to the arm64 Spectre-BHB MIDR arrays, otherwise
is_midr_in_range_list() reads beyond the end of these arrays"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: errata: Add missing sentinels to Spectre-BHB MIDR arrays
|
|
Commit a5951389e58d ("arm64: errata: Add newer ARM cores to the
spectre_bhb_loop_affected() lists") added some additional CPUs to the
Spectre-BHB workaround, including some new arrays for designs that
require new 'k' values for the workaround to be effective.
Unfortunately, the new arrays omitted the sentinel entry and so
is_midr_in_range_list() will walk off the end when it doesn't find a
match. With UBSAN enabled, this leads to a crash during boot when
is_midr_in_range_list() is inlined (which was more common prior to
c8c2647e69be ("arm64: Make _midr_in_range_list() an exported
function")):
| Internal error: aarch64 BRK: 00000000f2000001 [#1] PREEMPT SMP
| pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : spectre_bhb_loop_affected+0x28/0x30
| lr : is_spectre_bhb_affected+0x170/0x190
| [...]
| Call trace:
| spectre_bhb_loop_affected+0x28/0x30
| update_cpu_capabilities+0xc0/0x184
| init_cpu_features+0x188/0x1a4
| cpuinfo_store_boot_cpu+0x4c/0x60
| smp_prepare_boot_cpu+0x38/0x54
| start_kernel+0x8c/0x478
| __primary_switched+0xc8/0xd4
| Code: 6b09011f 54000061 52801080 d65f03c0 (d4200020)
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: aarch64 BRK: Fatal exception
Add the missing sentinel entries.
Cc: Lee Jones <lee@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: <stable@vger.kernel.org>
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: a5951389e58d ("arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Lee Jones <lee@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20250501104747.28431-1-will@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Historically fpsimd_to_sve() and sve_to_fpsimd() were (conditionally)
called by functions which were defined regardless of CONFIG_ARM64_SVE.
Hence it was necessary that both fpsimd_to_sve() and sve_to_fpsimd()
were always defined and not guarded by ifdeffery.
As a result of the removal of fpsimd_signal_preserve_current_state() in
commit:
929fa99b1215966f ("arm64/fpsimd: signal: Always save+flush state early")
... sve_to_fpsimd() has no callers when CONFIG_ARM64_SVE=n, resulting in
a build-time warnign that it is unused:
| arch/arm64/kernel/fpsimd.c:676:13: warning: unused function 'sve_to_fpsimd' [-Wunused-function]
| 676 | static void sve_to_fpsimd(struct task_struct *task)
| | ^~~~~~~~~~~~~
| 1 warning generated.
In contrast, fpsimd_to_sve() still has callers which are defined when
CONFIG_ARM64_SVE=n, and it would be awkward to hide this behind
ifdeffery and/or to use stub functions.
For now, suppress the warning by marking both fpsimd_to_sve() and
sve_to_fpsimd() as 'static inline', as we usually do for stub functions.
The compiler will no longer warn if either function is unused.
Aside from suppressing the warning, there should be no functional change
as a result of this patch.
Link: https://lore.kernel.org/linux-arm-kernel/20250429194600.GA26883@willie-the-truck/
Reported-by: Will Deacon <will@kernel.org>
Fixes: 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250430173240.4023627-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Log size in addition to physical and virtual addresses. It has potential
to be helpful when DTB exceeds the 2 MB limit.
Initialize size to 0 to print out sane value if fixmap_remap_fdt fails
without setting the size.
Signed-off-by: Bartosz Szczepanek <bsz@amazon.de>
Link: https://lore.kernel.org/r/20250423084851.26449-1-bsz@amazon.de
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The KVM PV ABI recently added a feature that allows the VM to discover
the set of physical CPU implementations, identified by a tuple of
{MIDR_EL1, REVIDR_EL1, AIDR_EL1}. Unlike other KVM PV features, the
expectation is that the VMM implements the hypercall instead of KVM as
it has the authoritative view of where the VM gets scheduled.
To do this the VMM needs to know the values of these registers on any
CPU in the system. While MIDR_EL1 and REVIDR_EL1 are already exposed,
AIDR_EL1 is not. Provide it in sysfs along with the other identification
registers.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250403231626.3181116-1-oliver.upton@linux.dev
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For an architecture to enable CONFIG_ARCH_HAS_RESCHED_LAZY, two things are
required:
1) Adding a TIF_NEED_RESCHED_LAZY flag definition
2) Checking for TIF_NEED_RESCHED_LAZY in the appropriate locations
2) is handled in a generic manner by CONFIG_GENERIC_ENTRY, which isn't
(yet) implemented for arm64. However, outside of core scheduler code,
TIF_NEED_RESCHED_LAZY only needs to be checked on a kernel exit, meaning:
o return/entry to userspace.
o return/entry to guest.
The return/entry to a guest is all handled by xfer_to_guest_mode_handle_work()
which already does the right thing, so it can be left as-is.
arm64 doesn't use common entry's exit_to_user_mode_prepare(), so update its
return to user path to check for TIF_NEED_RESCHED_LAZY and call into
schedule() accordingly.
Link: https://lore.kernel.org/linux-rt-users/20241216190451.1c61977c@mordecai.tesarici.cz/
Link: https://lore.kernel.org/all/xhsmh4j0fl0p3.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[testdrive, _TIF_WORK_MASK fixlet and changelog.]
Signed-off-by: Mike Galbraith <efault@gmx.de>
[Another round of testing; changelog faff]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250305104925.189198-2-vschneid@redhat.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Add missing id_aa64mmfr4 feature register check and update in
update_cpu_features(). Update the taint status as well.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20250329034409.21354-2-yangyicong@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Since commit
97d6786e0669 ("arm64: mm: account for hotplug memory when randomizing the linear region")
the decision whether or not to randomize the placement of the system's
DRAM inside the linear map is based on the capabilities of the CPU
rather than how much memory is present at boot time. This change was
necessary because memory hotplug may result in DRAM appearing in places
that are not covered by the linear region at all (and therefore
unusable) if the decision is solely based on the memory map at boot.
In the Android GKI kernel, which requires support for memory hotplug,
and is built with a reduced virtual address space of only 39 bits wide,
randomization of the linear map never happens in practice as a result.
And even on arm64 kernels built with support for 48 bit virtual
addressing, the wider PArange of recent CPUs means that linear map
randomization is slowly becoming a feature that only works on systems
that will soon be obsolete.
So let's just remove this feature. We can always bring it back in an
improved form if there is a real need for it.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250318134949.3194334-2-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The EFI specification has some elaborate rules about which runtime
services may be called while another runtime service call is already in
progress. In Linux, however, for simplicity, all EFI runtime service
invocations are serialized via the efi_runtime_lock semaphore.
This implies that calls to the helper pair arch_efi_call_virt_setup()
and arch_efi_call_virt_teardown() are serialized too, and are guaranteed
not to nest. Furthermore, the arm64 arch code has its own spinlock to
serialize use of the EFI runtime stack, of which only a single instance
exists.
This all means that the FP/SIMD and SVE state preserve/restore logic in
__efi_fpsimd_begin() and __efi_fpsimd_end() are also serialized, and
only a single instance of the associated per-CPU variables can ever be
in use at the same time. There is therefore no need at all for per-CPU
variables here, and they can all be replaced with singleton instances.
This saves a non-trivial amount of memory on systems with many CPUs.
To be more robust against potential future changes in the core EFI code
that may invalidate the reasoning above, move the invocations of
__efi_fpsimd_begin() and __efi_fpsimd_end() into the critical section
covered by the efi_rt_lock spinlock.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250318132421.3155799-2-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with signal handling, and the behaviour that AAPCS64 currently
recommends for (sig)setjmp() and (sig)longjmp() does not always compose
safely with signal handling, as explained below.
When Linux delivers a signal, it creates signal frames which contain the
original values of PSTATE.ZA, the ZA tile, and TPIDR_EL2. Between saving
the original state and entering the signal handler, Linux clears
PSTATE.ZA, but leaves TPIDR2_EL0 unchanged. Consequently a signal
handler can be entered with PSTATE.ZA=0 (meaning accesses to ZA will
trap), while TPIDR_EL0 is non-null (which may indicate that ZA needs to
be lazily saved, depending on the contents of the TPIDR2 block). While
in this state, libc and/or compiler runtime code, such as longjmp(), may
attempt to save ZA. As PSTATE.ZA=0, these accesses will trap, causing
the kernel to inject a SIGILL. Note that by virtue of lazy saving
occurring in libc and/or C runtime code, this can be triggered by
application/library code which is unaware of SME.
To avoid the problem above, the kernel must ensure that signal handlers
are entered with PSTATE.ZA and TPIDR2_EL0 configured in a manner which
complies with the ZA lazy saving scheme. Practically speaking, the only
choice is to enter signal handlers with PSTATE.ZA=0 and TPIDR2_EL0=NULL.
This change should not impact SME code which does not follow the ZA lazy
saving scheme (and hence does not use TPIDR2_EL0).
An alternative approach that was considered is to have the signal
handler inherit the original values of both PSTATE.ZA and TPIDR2_EL0,
relying on lazy save/restore sequences being idempotent and capable of
racing safely. This is not safe as signal handlers must be assumed to
have a "private ZA" interface, and therefore cannot be entered with
PSTATE.ZA=1 and TPIDR2_EL0=NULL, but it is legitimate for signals to be
taken from this state.
With the kernel fixed to clear TPIDR2_EL0, there are a couple of
remaining issues (largely masked by the first issue) that must be fixed
in userspace:
(1) When a (sig)setjmp() + (sig)longjmp() pair cross a signal boundary,
ZA state may be discarded when it needs to be preserved.
Currently, the ZA lazy saving scheme recommends that setjmp() does
not save ZA, and recommends that longjmp() is responsible for saving
ZA. A call to longjmp() in a signal handler will not have visibility
of ZA state that existed prior to entry to the signal, and when a
longjmp() is used to bypass a usual signal return, unsaved ZA state
will be discarded erroneously.
To fix this, it is necessary for setjmp() to eagerly save ZA state,
and for longjmp() to configure PSTATE.ZA=0 and TPIDR2_EL0=NULL. This
works regardless of whether a signal boundary is crossed.
(2) When a C++ exception is thrown and crosses a signal boundary before
it is caught, ZA state may be discarded when it needs to be
preserved.
AAPCS64 requires that exception handlers are entered with
PSTATE.{SM,ZA}={0,0} and TPIDR2_EL0=NULL, with exception unwind code
expected to perform any necessary save of ZA state.
Where it is necessary to perform an exception unwind across an
exception boundary, the unwind code must recover any necessary ZA
state (along with TPIDR2) from signal frames.
Fix the kernel as described above, with setup_return() clearing
TPIDR2_EL0 when delivering a signal. Folk on CC are working on fixes for
the remaining userspace issues, including updates/fixes to the AAPCS64
specification and glibc.
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling")
Fixes: 39e54499280f ("arm64/signal: Include TPIDR2 in the signal context")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250417190113.3778111-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.15, round #2
- Single fix for broken usage of 'multi-MIDR' infrastructure in PI
code, adding an open-coded erratum check for everyone's favorite pile
of sand: Cavium ThunderX
|
|
Calling into the MIDR checking framework from the PI code has recently
become much harder, due to the new fancy "multi-MIDR" support that
relies on tables being populated at boot time, but not that early that
they are available to the PI code. There are additional issues with
this framework, as the code really isn't position independend *at all*.
This leads to some ugly breakages, as reported by Ada.
It so appears that the only reason for the PI code to call into the
MIDR checking code is to cope with The Most Broken ARM64 System Ever,
aka Cavium ThunderX, which cannot deal with nG attributes that result
of the combination of KASLR and KPTI as a consequence of Erratum 27456.
Duplicate the check for the erratum in the PI code, removing the
dependency on the bulk of the MIDR checking framework. This allows
dropping that same check from kaslr_requires_kpti(), as the KPTI code
already relies on the ARM64_WORKAROUND_CAVIUM_27456 cap.
Fixes: c8c2647e69bed ("arm64: Make _midr_in_range_list() an exported function")
Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/3d97e45a-23cf-419b-9b6f-140b4d88de7b@arm.com
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250418093129.1755739-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
During a context-switch, tls_thread_switch() reads and writes a task's
thread_struct::tpidr2_el0 field. Other code shouldn't access this field
for an active task, as such accesses would form a data-race with a
concurrent context-switch.
The usage in preserve_tpidr2_context() is suspicious, but benign as any
race with a context switch will write the same value back to
current->thread.tpidr2_el0.
Make this clearer and match restore_tpidr2_context() by using a
temporary variable instead, avoiding the (benign) data-race.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-14-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
There are several issues with the way the native signal handling code
manipulates FPSIMD/SVE/SME state, described in detail below. These
issues largely result from races with preemption and inconsistent
handling of live state vs saved state.
Known issues with native FPSIMD/SVE/SME state management include:
* On systems with FPMR, the code to save/restore the FPMR accesses the
register while it is not owned by the current task. Consequently, this
may corrupt the FPMR of the current task and/or may corrupt the FPMR
of an unrelated task. The FPMR save/restore has been broken since it
was introduced in commit:
8c46def44409fc91 ("arm64/signal: Add FPMR signal handling")
* On systems with SME, setup_return() modifies both the live register
state and the saved state register state regardless of whether the
task's state is live, and without holding the cpu fpsimd context.
Consequently:
- This may corrupt the state an unrelated task which has PSTATE.SM set
and/or PSTATE.ZA set.
- The task may enter the signal handler in streaming mode, and or with
ZA storage enabled unexpectedly.
- The task may enter the signal handler in non-streaming SVE mode with
stale SVE register state, which may have been inherited from
streaming SVE mode unexpectedly. Where the streaming and
non-streaming vector lengths differ, this may be packed into
registers arbitrarily.
This logic has been broken since it was introduced in commit:
40a8e87bb32855b3 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Further incorrect manipulation of state was added in commits:
ea64baacbc36a0d5 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode")
baa8515281b30861 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
* Several restoration functions use fpsimd_flush_task_state() to discard
the live FPSIMD/SVE/SME while the in-memory copy is stale.
When a subset of the FPSIMD/SVE/SME state is restored, the remainder
may be non-deterministically reset to a stale snapshot from some
arbitrary point in the past.
This non-deterministic discarding was introduced in commit:
8cd969d28fd2848d ("arm64/sve: Signal handling support")
As of that commit, when TIF_SVE was initially clear, failure to
restore the SVE signal frame could reset the FPSIMD registers to a
stale snapshot.
The pattern of discarding unsaved state was subsequently copied into
restoration functions for some new state in commits:
39782210eb7e8763 ("arm64/sme: Implement ZA signal handling")
ee072cf708048c0d ("arm64/sme: Implement signal handling for ZT")
* On systems with SME/SME2, the entire FPSIMD/SVE/SME state may be
loaded onto the CPU redundantly. Either restore_fpsimd_context() or
restore_sve_fpsimd_context() will load the entire FPSIMD/SVE/SME state
via fpsimd_update_current_state() before restore_za_context() and
restore_zt_context() each discard the state via
fpsimd_flush_task_state().
This is purely redundant work, and not a functional bug.
To fix these issues, rework the native signal handling code to always
save+flush the current task's FPSIMD/SVE/SME state before manipulating
that state. This avoids races with preemption and ensures that state is
manipulated consistently regardless of whether it happened to be live
prior to manipulation. This largely involes:
* Using fpsimd_save_and_flush_current_state() to save+flush the state
for both signal delivery and signal return, before the state is
manipulated in any way.
* Removing fpsimd_signal_preserve_current_state() and updating
preserve_fpsimd_context() to explicitly ensure that the FPSIMD state
is up-to-date, as preserve_fpsimd_context() is the only consumer of
the FPSIMD state during signal delivery.
* Modifying fpsimd_update_current_state() to not reload the FPSIMD state
onto the CPU. Ideally we'd remove fpsimd_update_current_state()
entirely, but I've left that for subsequent patches as there are a
number of of other problems with the FPSIMD<->SVE conversion helpers
that should be addressed at the same time. For now I've removed the
misleading comment.
For setup_return(), we need to decide (for ABI reasons) whether signal
delivery should have all the side-effects of an SMSTOP. For now I've
left a TODO comment, as there are other questions in this area that I'll
address with subsequent patches.
Fixes: 8c46def44409 ("arm64/signal: Add FPMR signal handling")
Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Fixes: ea64baacbc36 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode")
Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
Fixes: 8cd969d28fd2 ("arm64/sve: Signal handling support")
Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling")
Fixes: ee072cf70804 ("arm64/sme: Implement signal handling for ZT")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-13-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
There are several issues with the way the native signal handling code
manipulates FPSIMD/SVE/SME state. To fix those issues, subsequent
patches will rework the native signal handling code to always save+flush
the current task's FPSIMD/SVE/SME state before manipulating that state.
In preparation for those changes, rework the compat signal handling code
to save+flush the current task's FPSIMD state before manipulating it.
Subsequent patches will remove fpsimd_signal_preserve_current_state()
and fpsimd_update_current_state(). Compat tasks can only have FPSIMD
state, and cannot have any SVE or SME state. Thus, the SVE state
manipulation present in fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() is not necessary, and it is safe to
directly manipulate current->thread.uw.fpsimd_state once it has been
saved+flushed.
Use fpsimd_save_and_flush_current_state() to save+flush the state for
both signal delivery and signal return, before the state is manipulated
in any way. While it would be safe for compat_restore_vfp_context() to
use fpsimd_flush_task_state(current), there are several extant issues in
the native signal code resulting from incorrect use of
fpsimd_flush_task_state(), and for consistency it is preferable to use
fpsimd_save_and_flush_current_state().
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-12-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When the current task's FPSIMD/SVE/SME state may be live on *any* CPU in
the system, special care must be taken when manipulating that state, as
this manipulation can race with preemption and/or asynchronous usage of
FPSIMD/SVE/SME (e.g. kernel-mode NEON in softirq handlers).
Even when manipulation is is protected with get_cpu_fpsimd_context() and
get_cpu_fpsimd_context(), the logic necessary when the state is live on
the current CPU can be wildly different from the logic necessary when
the state is not live on the current CPU. A number of historical and
extant issues result from failing to handle these cases consistetntly
and/or correctly.
To make it easier to get such manipulation correct, add a new
fpsimd_save_and_flush_current_state() helper function, which ensures
that the current task's state has been saved to memory and any stale
state on any CPU has been "flushed" such that is not live on any CPU in
the system. This will allow code to safely manipulate the saved state
without risk of races.
Subsequent patches will use the new function.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-11-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
For backwards compatibility reasons, when a signal return occurs which
restores SVE state, the effective lower 128 bits of each of the SVE
vector registers are restored from the corresponding FPSIMD vector
register in the FPSIMD signal frame, overriding the values in the SVE
signal frame. This is intended to be the case regardless of streaming
mode.
To make this happen, restore_sve_fpsimd_context() uses
fpsimd_update_current_state() to merge the lower 128 bits from the
FPSIMD signal frame into the SVE register state. Unfortunately,
fpsimd_update_current_state() performs this merging dependent upon
TIF_SVE, which is not always correct for streaming SVE register state:
* When restoring non-streaming SVE register state there is no observable
problem, as the signal return code configures TIF_SVE and the saved
fp_type to match before calling fpsimd_update_current_state(), which
observes either:
- TIF_SVE set AND fp_type == FP_STATE_SVE
- TIF_SVE clear AND fp_type == FP_STATE_FPSIMD
* On systems which have SME but not SVE, TIF_SVE cannot be set. Thus the
merging will never happen for the streaming SVE register state.
* On systems which have SVE and SME, TIF_SVE can be set and cleared
independently of PSTATE.SM. Thus the merging may or may not happen for
streaming SVE register state.
As TIF_SVE can be cleared non-deterministically during syscalls
(including at the start of sigreturn()), the merging may occur
non-deterministically from the perspective of userspace.
This logic has been broken since its introduction in commit:
85ed24dad2904f7c ("arm64/sme: Implement streaming SVE signal handling")
... at which point both fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() only checked TIF SVE. When PSTATE.SM==1
and TIF_SVE was clear, signal delivery would place stale FPSIMD state
into the FPSIMD signal frame, and signal return would not merge this
into the restored register state.
Subsequently, signal delivery was fixed as part of commit:
61da7c8e2a602f66 ("arm64/signal: Don't assume that TIF_SVE means we saved SVE state")
... but signal restore was not given a corresponding fix, and when
TIF_SVE was clear, signal restore would still fail to merge the FPSIMD
state into the restored SVE register state. The 'Fixes' tag did not
indicate that this had been broken since its introduction.
Fix this by merging the FPSIMD state dependent upon the saved fp_type,
matching what we (currently) do during signal delivery.
As described above, when backporting this commit, it will also be
necessary to backport commit:
61da7c8e2a602f66 ("arm64/signal: Don't assume that TIF_SVE means we saved SVE state")
... and prior to commit:
baa8515281b30861 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
... it will be necessary for fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() to consider both TIF_SVE and
thread_sm_enabled(¤t->thread), in place of the saved fp_type.
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-10-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
An exec() is expected to reset all FPSIMD/SVE/SME state, and barring
special handling of the vector lengths, the state is expected to reset
to zero. This reset is handled in fpsimd_flush_thread(), which the core
exec() code calls via flush_thread().
When support was added for FPMR, no logic was added to
fpsimd_flush_thread() to reset the FPMR value, and thus it is
erroneously inherited across an exec().
Add the missing reset of FPMR.
Fixes: 203f2b95a882 ("arm64/fpsimd: Support FEAT_FPMR")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-9-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
On system with SME, a thread's kernel FPSIMD state may be erroneously
clobbered during a context switch immediately after that state is
restored. Systems without SME are unaffected.
If the CPU happens to be in streaming SVE mode before a context switch
to a thread with kernel FPSIMD state, fpsimd_thread_switch() will
restore the kernel FPSIMD state using fpsimd_load_kernel_state() while
the CPU is still in streaming SVE mode. When fpsimd_thread_switch()
subsequently calls fpsimd_flush_cpu_state(), this will execute an
SMSTOP, causing an exit from streaming SVE mode. The exit from
streaming SVE mode will cause the hardware to reset a number of
FPSIMD/SVE/SME registers, clobbering the FPSIMD state.
Fix this by calling fpsimd_flush_cpu_state() before restoring the kernel
FPSIMD state.
Fixes: e92bee9f861b ("arm64/fpsimd: Avoid erroneous elide of user state reload")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-8-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When the effective value of PSTATE.SM is changed from 0 to 1 or from 1
to 0 by any method, an entry or exit to/from streaming SVE mode is
performed, and hardware automatically resets a number of registers. As
of ARM DDI 0487 L.a, this means:
* All implemented bits of the SVE vector registers are set to zero.
* All implemented bits of the SVE predicate registers are set to zero.
* All implemented bits of FFR are set to zero, if FFR is implemented in
the new mode.
* FPSR is set to 0x0000_0000_0800_009f.
* FPMR is set to 0, if FPMR is implemented.
Currently task_fpsimd_load() restores FPMR before restoring SVCR (which
is an accessor for PSTATE.{SM,ZA}), and so the restored value of FPMR
may be clobbered if the restored value of PSTATE.SM happens to differ
from the initial value of PSTATE.SM.
Fix this by moving the restore of FPMR later.
Note: this was originally posted as [1].
Fixes: 203f2b95a882 ("arm64/fpsimd: Support FEAT_FPMR")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/linux-arm-kernel/20241204-arm64-sme-reenable-v2-2-bae87728251d@kernel.org/
[ Rutland: rewrite commit message ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250409164010.3480271-7-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The logic for handling SME traps manipulates saved FPSIMD/SVE/SME state
incorrectly, and a race with preemption can result in a task having
TIF_SME set and TIF_FOREIGN_FPSTATE clear even though the live CPU state
is stale (e.g. with SME traps enabled). This can result in warnings from
do_sme_acc() where SME traps are not expected while TIF_SME is set:
| /* With TIF_SME userspace shouldn't generate any traps */
| if (test_and_set_thread_flag(TIF_SME))
| WARN_ON(1);
This is very similar to the SVE issue we fixed in commit:
751ecf6afd6568ad ("arm64/sve: Discard stale CPU state when handling SVE traps")
The race can occur when the SME trap handler is preempted before and
after manipulating the saved FPSIMD/SVE/SME state, starting and ending on
the same CPU, e.g.
| void do_sme_acc(unsigned long esr, struct pt_regs *regs)
| {
| // Trap on CPU 0 with TIF_SME clear, SME traps enabled
| // task->fpsimd_cpu is 0.
| // per_cpu_ptr(&fpsimd_last_state, 0) is task.
|
| ...
|
| // Preempted; migrated from CPU 0 to CPU 1.
| // TIF_FOREIGN_FPSTATE is set.
|
| get_cpu_fpsimd_context();
|
| /* With TIF_SME userspace shouldn't generate any traps */
| if (test_and_set_thread_flag(TIF_SME))
| WARN_ON(1);
|
| if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
| unsigned long vq_minus_one =
| sve_vq_from_vl(task_get_sme_vl(current)) - 1;
| sme_set_vq(vq_minus_one);
|
| fpsimd_bind_task_to_cpu();
| }
|
| put_cpu_fpsimd_context();
|
| // Preempted; migrated from CPU 1 to CPU 0.
| // task->fpsimd_cpu is still 0
| // If per_cpu_ptr(&fpsimd_last_state, 0) is still task then:
| // - Stale HW state is reused (with SME traps enabled)
| // - TIF_FOREIGN_FPSTATE is cleared
| // - A return to userspace skips HW state restore
| }
Fix the case where the state is not live and TIF_FOREIGN_FPSTATE is set
by calling fpsimd_flush_task_state() to detach from the saved CPU
state. This ensures that a subsequent context switch will not reuse the
stale CPU state, and will instead set TIF_FOREIGN_FPSTATE, forcing the
new state to be reloaded from memory prior to a return to userspace.
Note: this was originallly posted as [1].
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/linux-arm-kernel/20241204-arm64-sme-reenable-v2-1-bae87728251d@kernel.org/
[ Rutland: rewrite commit message ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When a task's SVE vector length (NSVL) is changed, and the task happens
to have SVCR.{SM,ZA}=={0,0}, vec_set_vector_length() opportunistically
frees the task's sme_state and clears TIF_SME.
The opportunistic freeing was added with no rationale in commit:
d4d5be94a8787242 ("arm64/fpsimd: Ensure SME storage is allocated after SVE VL changes")
That commit fixed an unrelated problem where the task's sve_state was
freed while it could be used to store streaming mode register state,
where the fix was to re-allocate the task's sve_state.
There is no need to free and/or reallocate the task's sme_state when the
SVE vector length changes, and there is no need to clear TIF_SME. Given
the SME vector length (SVL) doesn't change, the task's sme_state remains
correctly sized.
Remove the unnecessary opportunistic freeing of the task's sme_state
when the task's SVE vector length is changed. The task's sme_state is
still freed when the SME vector length is changed.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When task_fpsimd_load() loads the saved FPSIMD/SVE/SME state, it
configures EL0 SVE traps by calling sve_user_{enable,disable}(). This is
unnecessary, and this is suspicious/confusing as task_fpsimd_load() does
not configure EL0 SME traps.
All calls to task_fpsimd_load() are followed by a call to
fpsimd_bind_task_to_cpu(), where the latter configures traps for SVE and
SME dependent upon the current values of TIF_SVE and TIF_SME, overriding
any trap configuration performed by task_fpsimd_load().
The calls to sve_user_{enable,disable}() calls in task_fpsimd_load()
have been redundant (though benign) since they were introduced in
commit:
a0136be443d51803 ("arm64/fpsimd: Load FP state based on recorded data type")
Remove the unnecessary and confusing SVE trap manipulation.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
There have been no users of fpsimd_force_sync_to_sve() since commit:
bbc6172eefdb276b ("arm64/fpsimd: SME no longer requires SVE register state")
Remove fpsimd_force_sync_to_sve().
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The SME trap handler consumes RES0 bits from the ESR when determining
the reason for the trap, and depends upon those bits reading as zero.
This may break in future when those RES0 bits are allocated a meaning
and stop reading as zero.
For SME traps taken with ESR_ELx.EC == 0b011101, the specific reason for
the trap is indicated by ESR_ELx.ISS.SMTC ("SME Trap Code"). This field
occupies bits [2:0] of ESR_ELx.ISS, and as of ARM DDI 0487 L.a, bits
[24:3] of ESR_ELx.ISS are RES0. ESR_ELx.ISS itself occupies bits [24:0]
of ESR_ELx.
Extract the SMTC field specifically, matching the way we handle ESR_ELx
fields elsewhere, and ensuring that the handler is future-proof.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix max_pfn calculation when hotplugging memory so that it never
decreases
- Fix dereference of unused source register in the MOPS SET operation
fault handling
- Fix NULL calling in do_compat_alignment_fixup() when the 32-bit user
space does an unaligned LDREX/STREX
- Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs
- Drop unused code pud accessors (special/mkspecial)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Don't call NULL in do_compat_alignment_fixup()
arm64: Add support for HIP09 Spectre-BHB mitigation
arm64: mm: Drop dead code for pud special bit handling
arm64: mops: Do not dereference src reg for a set operation
arm64: mm: Correct the update of max_pfn
|
|
Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering the
vdso, vvar, and compat-mode vectors and sigpage mappings.
Production release testing passes on Android and Chrome OS.
Link: https://lkml.kernel.org/r/20250305021711.3867874-5-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Berg <benjamin@sipsolutions.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Elliot Hughes <enh@google.com>
Cc: Florian Faineli <f.fainelli@gmail.com>
Cc: Greg Ungerer <gerg@kernel.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Waleij <linus.walleij@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <mike.rapoport@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
|
|
do_alignment_t32_to_handler() only fixes up alignment faults for
specific instructions; it returns NULL otherwise (e.g. LDREX). When
that's the case, signal to the caller that it needs to proceed with the
regular alignment fault handling (i.e. SIGBUS). Without this patch, the
kernel panics:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Mem abort info:
ESR = 0x0000000086000006
EC = 0x21: IABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x06: level 2 translation fault
user pgtable: 4k pages, 48-bit VAs, pgdp=00000800164aa000
[0000000000000000] pgd=0800081fdbd22003, p4d=0800081fdbd22003, pud=08000815d51c6003, pmd=0000000000000000
Internal error: Oops: 0000000086000006 [#1] SMP
Modules linked in: cfg80211 rfkill xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat br_netfilter veth nvme_fa>
libcrc32c crc32c_generic raid0 multipath linear dm_mod dax raid1 md_mod xhci_pci nvme xhci_hcd nvme_core t10_pi usbcore igb crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_ce crct10dif_common usb_common i2c_algo_bit i2c>
CPU: 2 PID: 3932954 Comm: WPEWebProcess Not tainted 6.1.0-31-arm64 #1 Debian 6.1.128-1
Hardware name: GIGABYTE MP32-AR1-00/MP32-AR1-00, BIOS F18v (SCP: 1.08.20211002) 12/01/2021
pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : 0x0
lr : do_compat_alignment_fixup+0xd8/0x3dc
sp : ffff80000f973dd0
x29: ffff80000f973dd0 x28: ffff081b42526180 x27: 0000000000000000
x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
x23: 0000000000000004 x22: 0000000000000000 x21: 0000000000000001
x20: 00000000e8551f00 x19: ffff80000f973eb0 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : ffffaebc949bc488
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
x5 : 0000000000400000 x4 : 0000fffffffffffe x3 : 0000000000000000
x2 : ffff80000f973eb0 x1 : 00000000e8551f00 x0 : 0000000000000001
Call trace:
0x0
do_alignment_fault+0x40/0x50
do_mem_abort+0x4c/0xa0
el0_da+0x48/0xf0
el0t_32_sync_handler+0x110/0x140
el0t_32_sync+0x190/0x194
Code: bad PC value
---[ end trace 0000000000000000 ]---
Signed-off-by: Angelos Oikonomopoulos <angelos@igalia.com>
Fixes: 3fc24ef32d3b ("arm64: compat: Implement misalignment fixups for multiword loads")
Cc: <stable@vger.kernel.org> # 6.1.x
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250401085150.148313-1-angelos@igalia.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull modules updates from Petr Pavlu:
- Use RCU instead of RCU-sched
The mix of rcu_read_lock(), rcu_read_lock_sched() and
preempt_disable() in the module code and its users has
been replaced with just rcu_read_lock()
- The rest of changes are smaller fixes and updates
* tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (32 commits)
MAINTAINERS: Update the MODULE SUPPORT section
module: Remove unnecessary size argument when calling strscpy()
module: Replace deprecated strncpy() with strscpy()
params: Annotate struct module_param_attrs with __counted_by()
bug: Use RCU instead RCU-sched to protect module_bug_list.
static_call: Use RCU in all users of __module_text_address().
kprobes: Use RCU in all users of __module_text_address().
bpf: Use RCU in all users of __module_text_address().
jump_label: Use RCU in all users of __module_text_address().
jump_label: Use RCU in all users of __module_address().
x86: Use RCU in all users of __module_address().
cfi: Use RCU while invoking __module_address().
powerpc/ftrace: Use RCU in all users of __module_text_address().
LoongArch: ftrace: Use RCU in all users of __module_text_address().
LoongArch/orc: Use RCU in all users of __module_address().
arm64: module: Use RCU in all users of __module_text_address().
ARM: module: Use RCU in all users of __module_text_address().
module: Use RCU in all users of __module_text_address().
module: Use RCU in all users of __module_address().
module: Use RCU in search_module_extables().
...
|
|
The HIP09 processor is vulnerable to the Spectre-BHB (Branch History
Buffer) attack, which can be exploited to leak information through
branch prediction side channels. This commit adds the MIDR of HIP09
to the list for software mitigation.
Signed-off-by: Jinqian Yang <yangjinqian1@huawei.com>
Link: https://lore.kernel.org/r/20250325141900.2057314-1-yangjinqian1@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|