path: root/arch/arm64/kernel
Age | Commit message | Author
6 days | Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds
Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architectures must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invocation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleanups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscall arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value" (a rough userspace sketch of the new request follows this entry). This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecessary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. 
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
7 days | Merge tag 'efi-next-for-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi | Linus Torvalds
Pull EFI updates from Ard Biesheuvel: "Not a lot going on in the EFI tree this cycle. The only thing that stands out is the new support for SBAT metadata, which was a bit contentious when it was first proposed, because in the initial incarnation, it would have required us to maintain a revocation index, and bump it each time a vulnerability affecting UEFI secure boot got fixed. This was shot down for obvious reasons. This time, only the changes needed to emit the SBAT section into the PE/COFF image are being carried upstream, and it is up to the distros to decide what to put in there when creating and signing the build. This only has the EFI zboot bits (which the distros will be using for arm64); the x86 bzImage changes should be arriving next cycle, presumably via the -tip tree. Summary: - Add support for emitting a .sbat section into the EFI zboot image, so that downstreams can easily include revocation metadata in the signed EFI images - Align PE symbolic constant names with other projects - Bug fix for the efi_test module - Log the physical address and size of the EFI memory map when failing to map it - A kerneldoc fix for the EFI stub code" * tag 'efi-next-for-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: include: pe.h: Fix PE definitions efi/efi_test: Fix missing pending status update in getwakeuptime efi: zboot specific mechanism for embedding SBAT section efi/libstub: Describe missing 'out' parameter in efi_load_initrd efi: Improve logging around memmap init
8 days | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull kvm updates from Paolo Bonzini: "As far as x86 goes this pull request "only" includes TDX host support. Quotes are appropriate because (at 6k lines and 100+ commits) it is much bigger than the rest, which will come later this week and consists mostly of bugfixes and selftests. s390 changes will also come in the second batch. ARM: - Add large stage-2 mapping (THP) support for non-protected guests when pKVM is enabled, clawing back some performance. - Enable nested virtualisation support on systems that support it, though it is disabled by default. - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes. - Large rework of the way KVM tracks architecture features and links them with the effects of control bits. While this has no functional impact, it ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture. - Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above. - New selftest checking the pKVM ownership transition rules - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it. - Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts. - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process. - Add a new selftest for the SVE host state being corrupted by a guest. - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW. - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised. - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion. - and the usual random cleanups. LoongArch: - Don't flush tlb if the host supports hardware page table walks. - Add KVM selftests support. RISC-V: - Add vector registers to get-reg-list selftest - VCPU reset related improvements - Remove scounteren initialization from VCPU reset - Support VCPU reset from userspace using set_mpstate() ioctl x86: - Initial support for TDX in KVM. This finally makes it possible to use the TDX module to run confidential guests on Intel processors. This is quite a large series, including support for private page tables (managed by the TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs to userspace, and handling several special VM exits from the TDX module. 
This has been in the works for literally years and it's not really possible to describe everything here, so I'll defer to the various merge commits up to and including commit 7bcf7246c42a ('Merge branch 'kvm-tdx-finish-initial' into HEAD')" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits) x86/tdx: mark tdh_vp_enter() as __flatten Documentation: virt/kvm: remove unreferenced footnote RISC-V: KVM: lock the correct mp_state during reset KVM: arm64: Fix documentation for vgic_its_iter_next() KVM: arm64: np-guest CMOs with PMD_SIZE fixmap KVM: arm64: Stage-2 huge mappings for np-guests KVM: arm64: Add a range to pkvm_mappings KVM: arm64: Convert pkvm_mappings to interval tree KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest() KVM: arm64: Add a range to __pkvm_host_wrprotect_guest() KVM: arm64: Add a range to __pkvm_host_unshare_guest() KVM: arm64: Add a range to __pkvm_host_share_guest() KVM: arm64: Introduce for_each_hyp_page KVM: arm64: Handle huge mappings for np-guest CMOs KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET RISC-V: KVM: Remove scounteren initialization KVM: RISC-V: remove unnecessary SBI reset state ...
9 days | Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux | Linus Torvalds
Pull arm64 updates from Will Deacon: "The headline feature is the re-enablement of support for Arm's Scalable Matrix Extension (SME) thanks to a bumper crop of fixes from Mark Rutland. If matrices aren't your thing, then Ryan's page-table optimisation work is much more interesting. Summary: ACPI, EFI and PSCI: - Decouple Arm's "Software Delegated Exception Interface" (SDEI) support from the ACPI GHES code so that it can be used by platforms booted with device-tree - Remove unnecessary per-CPU tracking of the FPSIMD state across EFI runtime calls - Fix a node refcount imbalance in the PSCI device-tree code CPU Features: - Ensure register sanitisation is applied to fields in ID_AA64MMFR4 - Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests can reliably query the underlying CPU types from the VMM - Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes to our context-switching, signal handling and ptrace code Entry code: - Hook up TIF_NEED_RESCHED_LAZY so that CONFIG_PREEMPT_LAZY can be selected Memory management: - Prevent BSS exports from being used by the early PI code - Propagate level and stride information to the low-level TLB invalidation routines when operating on hugetlb entries - Use the page-table contiguous hint for vmap() mappings with VM_ALLOW_HUGE_VMAP where possible - Optimise vmalloc()/vmap() page-table updates to use "lazy MMU mode" and hook this up on arm64 so that the trailing DSB (used to publish the updates to the hardware walker) can be deferred until the end of the mapping operation - Extend mmap() randomisation for 52-bit virtual addresses (on par with 48-bit addressing) and remove limited support for randomisation of the linear map Perf and PMUs: - Add support for probing the CMN-S3 driver using ACPI - Minor driver fixes to the CMN, Arm-NI and amlogic PMU drivers Selftests: - Fix FPSIMD and SME tests to align with the freshly re-enabled SME support - Fix default setting of the OUTPUT variable so that tests are installed in the right location vDSO: - Replace raw counter access from inline assembly code with a call to the __arch_counter_get_cntvct() helper function Miscellaneous: - Add some missing header inclusions to the CCA headers - Rework rendering of /proc/cpuinfo to follow the x86 approach and avoid repeated buffer expansion (the user-visible format remains identical) - Remove redundant selection of CONFIG_CRC32 - Extend early error message when failing to map the device-tree blob" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (83 commits) arm64: cputype: Add cputype definition for HIP12 arm64: el2_setup.h: Make __init_el2_fgt labels consistent, again perf/arm-cmn: Add CMN S3 ACPI binding arm64/boot: Disallow BSS exports to startup code arm64/boot: Move global CPU override variables out of BSS arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace perf/arm-cmn: Initialise cmn->cpu earlier kselftest/arm64: Set default OUTPUT path when undefined arm64: Update comment regarding values in __boot_cpu_mode arm64: mm: Drop redundant check in pmd_trans_huge() arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1 arm64/mm: Permit lazy_mmu_mode to be nested arm64/mm: Disable barrier batching in interrupt contexts arm64/cpuinfo: only show one cpu's info in c_show() arm64/mm: Batch barriers when updating kernel mappings mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes arm64/mm: Support huge pte-mapped pages in vmap
mm/vmalloc: Gracefully unmap huge ptes mm/vmalloc: Warn on improper use of vunmap_range() arm64/mm: Hoist barriers out of set_ptes_anysz() loop ...
10 days | Merge branch 'for-next/sme-fixes' into for-next/core | Will Deacon
* for-next/sme-fixes: (35 commits) arm64/fpsimd: Allow CONFIG_ARM64_SME to be selected arm64/fpsimd: ptrace: Gracefully handle errors arm64/fpsimd: ptrace: Mandate SVE payload for streaming-mode state arm64/fpsimd: ptrace: Do not present register data for inactive mode arm64/fpsimd: ptrace: Save task state before generating SVE header arm64/fpsimd: ptrace/prctl: Ensure VL changes leave task in a valid state arm64/fpsimd: ptrace/prctl: Ensure VL changes do not resurrect stale data arm64/fpsimd: Make clone() compatible with ZA lazy saving arm64/fpsimd: Clear PSTATE.SM during clone() arm64/fpsimd: Consistently preserve FPSIMD state during clone() arm64/fpsimd: Remove redundant task->mm check arm64/fpsimd: signal: Use SMSTOP behaviour in setup_return() arm64/fpsimd: Add task_smstop_sm() arm64/fpsimd: Factor out {sve,sme}_state_size() helpers arm64/fpsimd: Clarify sve_sync_*() functions arm64/fpsimd: ptrace: Consistently handle partial writes to NT_ARM_(S)SVE arm64/fpsimd: signal: Consistently read FPSIMD context arm64/fpsimd: signal: Mandate SVE payload for streaming-mode state arm64/fpsimd: signal: Clear PSTATE.SM when restoring FPSIMD frame only arm64/fpsimd: Do not discard modified SVE state ...
10 days | Merge branch 'for-next/mm' into for-next/core | Will Deacon
* for-next/mm: arm64/boot: Disallow BSS exports to startup code arm64/boot: Move global CPU override variables out of BSS arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace arm64: mm: Drop redundant check in pmd_trans_huge() arm64/mm: Permit lazy_mmu_mode to be nested arm64/mm: Disable barrier batching in interrupt contexts arm64/mm: Batch barriers when updating kernel mappings mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes arm64/mm: Support huge pte-mapped pages in vmap mm/vmalloc: Gracefully unmap huge ptes mm/vmalloc: Warn on improper use of vunmap_range() arm64/mm: Hoist barriers out of set_ptes_anysz() loop arm64: hugetlb: Use __set_ptes_anysz() and __ptep_get_and_clear_anysz() arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() mm/page_table_check: Batch-check pmds/puds just like ptes arm64: hugetlb: Refine tlb maintenance scope arm64: hugetlb: Cleanup huge_pte size discovery mechanisms arm64: pageattr: Explicitly bail out when changing permissions for vmalloc_huge mappings arm64: Support ARM64_VA_BITS=52 when setting ARCH_MMAP_RND_BITS_MAX arm64/mm: Remove randomization of the linear map
10 days | Merge branch 'for-next/misc' into for-next/core | Will Deacon
* for-next/misc: arm64/cpuinfo: only show one cpu's info in c_show() arm64: Extend pr_crit message on invalid FDT arm64: Kconfig: remove unnecessary selection of CRC32 arm64: Add missing includes for mem_encrypt
10 days | Merge branch 'for-next/fixes' into for-next/core | Will Deacon
Merge in for-next/fixes, as subsequent improvements to our early PI code that disallow BSS exports depend on the 'arm64_use_ng_mappings' fix here. * for-next/fixes: arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation arm64: errata: Add missing sentinels to Spectre-BHB MIDR arrays
10 days | Merge branch 'for-next/entry' into for-next/core | Will Deacon
* for-next/entry: arm64: el2_setup.h: Make __init_el2_fgt labels consistent, again arm64: Update comment regarding values in __boot_cpu_mode arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1 arm64: enable PREEMPT_LAZY
10 days | Merge branch 'for-next/efi' into for-next/core | Will Deacon
* for-next/efi: arm64/fpsimd: Avoid unnecessary per-CPU buffers for EFI runtime calls
11 days | Merge tag 'kvmarm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD | Paolo Bonzini
KVM/arm64 updates for 6.16 * New features: - Add large stage-2 mapping support for non-protected pKVM guests, clawing back some performance. - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes. - Enable nested virtualisation support on systems that support it (yes, it has been a long time coming), though it is disabled by default. * Improvements, fixes and cleanups: - Large rework of the way KVM tracks architecture features and links them with the effects of control bits. This ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture. - Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above. - New selftest checking the pKVM ownership transition rules - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it. - Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts. - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process. - Add a new selftest for the SVE host state being corrupted by a guest. - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW. - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised. - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion. - and the usual random cleanups.
2025-05-23 | Merge branch kvm-arm64/misc-6.16 into kvmarm-master/next | Marc Zyngier
* kvm-arm64/misc-6.16: : . : Misc changes and improvements for 6.16: : : - Add a new selftest for the SVE host state being corrupted by a guest : : - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, : ensuring that the window for interrupts is slightly bigger, and avoiding : a pretty bad erratum on the AmpereOne HW : : - Replace a couple of open-coded on/off strings with str_on_off() : : - Get rid of the pKVM memblock sorting, which now appears to be superfluous : : - Drop superfluous clearing of ICH_LR_EOI in the LR when nesting : : - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from : a pretty bad case of TLB corruption unless accesses to HCR_EL2 are : heavily synchronised : : - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables : in a human-friendly fashion : . KVM: arm64: Fix documentation for vgic_its_iter_next() KVM: arm64: vgic-its: Add debugfs interface to expose ITS tables arm64: errata: Work around AmpereOne's erratum AC04_CPU_23 KVM: arm64: nv: Remove clearing of ICH_LR<n>.EOI if ICH_LR<n>.HW == 1 KVM: arm64: Drop sort_memblock_regions() KVM: arm64: selftests: Add test for SVE host corruption KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode KVM: arm64: Replace ternary flags with str_on_off() helper Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 | Merge branch kvm-arm64/fgt-masks into kvmarm-master/next | Marc Zyngier
* kvm-arm64/fgt-masks: (43 commits) : . : Large rework of the way KVM deals with trap bits in conjunction with : the CPU feature registers. It now draws a direct link between : the feature set, the system registers that need to UNDEF to match : the configuration, and the bits that need to behave as RES0 or RES1 in : the trap registers that are visible to the guest. : : Best of all, these definitions are mostly automatically generated : from the JSON description published by ARM under a permissive : license. : . KVM: arm64: Handle TSB CSYNC traps KVM: arm64: Add FGT descriptors for FEAT_FGT2 KVM: arm64: Allow sysreg ranges for FGT descriptors KVM: arm64: Add context-switch for FEAT_FGT2 registers KVM: arm64: Add trap routing for FEAT_FGT2 registers KVM: arm64: Add sanitisation for FEAT_FGT2 registers KVM: arm64: Add FEAT_FGT2 registers to the VNCR page KVM: arm64: Use HCR_EL2 feature map to drive fixed-value bits KVM: arm64: Use HCRX_EL2 feature map to drive fixed-value bits KVM: arm64: Allow kvm_has_feat() to take variable arguments KVM: arm64: Use FGT feature maps to drive RES0 bits KVM: arm64: Validate FGT register descriptions against RES0 masks KVM: arm64: Switch to table-driven FGU configuration KVM: arm64: Handle PSB CSYNC traps KVM: arm64: Use KVM-specific HCRX_EL2 RES0 mask KVM: arm64: Remove hand-crafted masks for FGT registers KVM: arm64: Use computed FGT masks to setup FGT registers KVM: arm64: Propagate FGT masks to the nVHE hypervisor KVM: arm64: Unconditionally configure fine-grain traps KVM: arm64: Use computed masks as sanitisers for FGT registers ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 | Merge branch kvm-arm64/mte-frac into kvmarm-master/next | Marc Zyngier
* kvm-arm64/mte-frac: : . : Prevent FEAT_MTE_ASYNC from being accidentally exposed to a guest, : courtesy of Ben Horgan. From the cover letter: : : "The ID_AA64PFR1_EL1.MTE_frac field is currently hidden from KVM. : However, when ID_AA64PFR1_EL1.MTE==2, ID_AA64PFR1_EL1.MTE_frac==0 : indicates that MTE_ASYNC is supported. On a host with : ID_AA64PFR1_EL1.MTE==2 but without MTE_ASYNC support a guest with the : MTE capability enabled will incorrectly see MTE_ASYNC advertised as : supported. This series fixes that." : . KVM: selftests: Confirm exposing MTE_frac does not break migration KVM: arm64: Make MTE_frac masking conditional on MTE capability arm64/sysreg: Expose MTE_frac so that it is visible to KVM Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 | Merge branch kvm-arm64/ubsan-el2 into kvmarm-master/next | Marc Zyngier
* kvm-arm64/ubsan-el2: : . : Add UBSAN support to the EL2 portion of KVM, reusing most of the : existing logic provided by CONFIG_UBSAN_TRAP. : : Patches courtesy of Mostafa Saleh. : . KVM: arm64: Handle UBSAN faults KVM: arm64: Introduce CONFIG_UBSAN_KVM_EL2 ubsan: Remove regs from report_ubsan_failure() arm64: Introduce esr_is_ubsan_brk() Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 | include: pe.h: Fix PE definitions | Pali Rohár
* Rename constants to their standard PE names: - MZ_MAGIC -> IMAGE_DOS_SIGNATURE - PE_MAGIC -> IMAGE_NT_SIGNATURE - PE_OPT_MAGIC_PE32_ROM -> IMAGE_ROM_OPTIONAL_HDR_MAGIC - PE_OPT_MAGIC_PE32 -> IMAGE_NT_OPTIONAL_HDR32_MAGIC - PE_OPT_MAGIC_PE32PLUS -> IMAGE_NT_OPTIONAL_HDR64_MAGIC - IMAGE_DLL_CHARACTERISTICS_NX_COMPAT -> IMAGE_DLLCHARACTERISTICS_NX_COMPAT * Import constants and their descriptions from the readpe and file projects, which contain current, up-to-date information: - IMAGE_FILE_MACHINE_* - IMAGE_FILE_* - IMAGE_SUBSYSTEM_* - IMAGE_DLLCHARACTERISTICS_* - IMAGE_DLLCHARACTERISTICS_EX_* - IMAGE_DEBUG_TYPE_* * Add missing IMAGE_SCN_* constants and update their incorrect descriptions * Fix incorrect value of IMAGE_SCN_MEM_PURGEABLE constant * Add description for win32_version and loader_flags PE fields Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
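For orientation, the renamed constants correspond to the standard PE/COFF values; a small excerpt is shown below (values taken from the PE/COFF specification, not copied from the patch):

/* A few of the renamed constants and their standard PE/COFF values */
#define IMAGE_DOS_SIGNATURE		0x5A4D		/* "MZ"     (was MZ_MAGIC) */
#define IMAGE_NT_SIGNATURE		0x00004550	/* "PE\0\0" (was PE_MAGIC) */
#define IMAGE_ROM_OPTIONAL_HDR_MAGIC	0x0107		/* was PE_OPT_MAGIC_PE32_ROM */
#define IMAGE_NT_OPTIONAL_HDR32_MAGIC	0x010b		/* was PE_OPT_MAGIC_PE32 */
#define IMAGE_NT_OPTIONAL_HDR64_MAGIC	0x020b		/* was PE_OPT_MAGIC_PE32PLUS */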
2025-05-21 | Merge branch kvm-arm64/pkvm-selftest-6.16 into kvm-arm64/pkvm-np-thp-6.16 | Marc Zyngier
* kvm-arm64/pkvm-selftest-6.16: : . : pKVM selftests covering the memory ownership transitions by : Quentin Perret. From the initial cover letter: : : "We have recently found a bug [1] in the pKVM memory ownership : transitions by code inspection, but it could have been caught with a : test. : : Introduce a boot-time selftest exercising all the known pKVM memory : transitions and importantly checks the rejection of illegal transitions. : : The new test is hidden behind a new Kconfig option separate from : CONFIG_EL2_NVHE_DEBUG on purpose as that has side effects on the : transition checks ([1] doesn't reproduce with EL2 debug enabled). : : [1] https://lore.kernel.org/kvmarm/20241128154406.602875-1-qperret@google.com/" : . KVM: arm64: Extend pKVM selftest for np-guests KVM: arm64: Selftest for pKVM transitions KVM: arm64: Don't WARN from __pkvm_host_share_guest() KVM: arm64: Add .hyp.data section Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 | arm64: errata: Work around AmpereOne's erratum AC04_CPU_23 | D Scott Phillips
On AmpereOne AC04, updates to HCR_EL2 can rarely corrupt simultaneous translations for data addresses initiated by load/store instructions. Only instruction-initiated translations are vulnerable, not, for example, translations resulting from prefetches. A DSB before the store to HCR_EL2 is sufficient to prevent older instructions from hitting the window for corruption, and an ISB after is sufficient to prevent younger instructions from hitting the window for corruption. Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20250513184514.2678288-1-scott@os.amperecomputing.com Signed-off-by: Marc Zyngier <maz@kernel.org>
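A minimal sketch of the synchronisation pattern described above (the barrier scope and the helper name are assumptions made for illustration; the in-tree workaround may differ in detail):

/*
 * AC04_CPU_23 pattern: a DSB before the HCR_EL2 write keeps older
 * instructions out of the corruption window, an ISB afterwards keeps
 * younger instructions out of it.
 */
static inline void write_hcr_el2_sync(u64 val)
{
	asm volatile(
	"	dsb	nsh\n"
	"	msr	hcr_el2, %x0\n"
	"	isb\n"
	: : "rZ" (val) : "memory");
}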
2025-05-16 | arm64/boot: Disallow BSS exports to startup code | Ard Biesheuvel
BSS might be uninitialized when entering the startup code, so forbid the use by the startup code of any variables that live after __bss_start in the linker map. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Link: https://lore.kernel.org/r/20250508114328.2460610-8-ardb+git@google.com [will: Drop export of 'memstart_offset_seed', as this has been removed] Signed-off-by: Will Deacon <will@kernel.org>
2025-05-16 | arm64/boot: Move global CPU override variables out of BSS | Ard Biesheuvel
Accessing BSS will no longer be permitted from the startup code in arch/arm64/kernel/pi, as some of it executes before BSS is cleared. Clearing BSS earlier would involve managing cache coherency explicitly in software, which is a hassle we prefer to avoid. So move some variables that are assigned by the startup code out of BSS and into .data. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Link: https://lore.kernel.org/r/20250508114328.2460610-7-ardb+git@google.com Signed-off-by: Will Deacon <will@kernel.org>
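A small sketch of the kind of change this implies (the variable name below is made up purely for illustration; __section() is the usual kernel annotation for placing an object in a specific section):

/* Before: zero-initialised, so the linker places it in .bss */
u64 boot_cpu_override_val;

/* After: explicitly placed in .data, so the startup code may safely
 * write it before .bss has been cleared. */
u64 boot_cpu_override_val __section(".data");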
2025-05-16 | arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace | Ard Biesheuvel
init_pgdir[] is only referenced from the startup code, but lives after BSS in the linker map. Before tightening the rules about accessing BSS from startup code, move init_pgdir[] into the __pi_ namespace, so it does not need to be exported explicitly. For symmetry, do the same with init_idmap_pgdir[], although it lives before BSS. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Link: https://lore.kernel.org/r/20250508114328.2460610-6-ardb+git@google.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-16 | arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1 | Anshuman Khandual
mov_q cannot really move PIE_E[0|1] macros into a general purpose register as expected if those macro constants contain some 128 bit layout elements, that are required for D128 page tables. The primary issue is that for D128, PIE_E[0|1] are defined in terms of 128-bit types with shifting and masking, which the assembler can't accommodate. Instead pre-calculate these PIRE0_EL1/PIR_EL1 constants into asm-offsets.h based PIE_E0_ASM/PIE_E1_ASM which can then be used in arch/arm64/mm/proc.S. While here also drop PTE_MAYBE_NG/PTE_MAYBE_SHARED assembly overrides which are not required any longer, as the compiler toolchains are smart enough to compute both the PIE_[E0|E1]_ASM constants in all scenarios. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/r/20250429050511.1663235-1-anshuman.khandual@arm.com Signed-off-by: Will Deacon <will@kernel.org>
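The pre-calculation described above follows the usual asm-offsets.c pattern, roughly as sketched below (a sketch, not the exact hunk from the patch):

/* arch/arm64/kernel/asm-offsets.c: emit fully evaluated constants so that
 * proc.S can load them with mov_q instead of evaluating PIE_E[0|1] itself. */
DEFINE(PIE_E0_ASM, PIE_E0);
DEFINE(PIE_E1_ASM, PIE_E1);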
2025-05-16 | arm64/sysreg: Expose MTE_frac so that it is visible to KVM | Ben Horgan
KVM exposes the sanitised ID registers to guests. Currently these ignore the ID_AA64PFR1_EL1.MTE_frac field, meaning guests always see a value of zero. This is a problem for platforms without the MTE_ASYNC feature where ID_AA64PFR1_EL1.MTE==0x2 and ID_AA64PFR1_EL1.MTE_frac==0xf. KVM forces MTE_frac to zero, meaning the guest believes MTE_ASYNC is supported, when no async fault will ever occur. Before KVM can fix this, the architecture needs to sanitise the ID register field for MTE_frac. Linux itself does not use the MTE_frac field and just assumes MTE async faults can be generated if MTE is supported. Signed-off-by: Ben Horgan <ben.horgan@arm.com> Link: https://lore.kernel.org/r/20250512114112.359087-2-ben.horgan@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-11 | arm64/mm: define ptdesc_t | Anshuman Khandual
Define ptdesc_t type which describes the basic page table descriptor layout on arm64 platform. Subsequently all level specific pxxval_t descriptors are derived from ptdesc_t thus establishing a common original format, which can also be appropriate for page table entries, masks and protection values etc which are used at all page table levels. Link: https://lkml.kernel.org/r/20250407053113.746295-4-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 | Merge tag 'arm64_cbpf_mitigation_2025_05_08' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux | Linus Torvalds
Pull arm64 cBPF BHB mitigation from James Morse: "This adds the BHB mitigation into the code JITted for cBPF programs as these can be loaded by unprivileged users via features like seccomp. The existing mechanisms to disable the BHB mitigation will also prevent the mitigation being JITted. In addition, cBPF programs loaded by processes with the SYS_ADMIN capability are not mitigated as these could equally load an eBPF program that does the same thing. For good measure, the list of 'k' values for CPUs' local mitigations is updated from the version on arm's website" * tag 'arm64_cbpf_mitigation_2025_05_08' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: proton-pack: Add new CPUs 'k' values for branch mitigation arm64: bpf: Only mitigate cBPF programs loaded by unprivileged users arm64: bpf: Add BHB mitigation to the epilogue for cBPF programs arm64: proton-pack: Expose whether the branchy loop k value arm64: proton-pack: Expose whether the platform is mitigated by firmware arm64: insn: Add support for encoding DSB
2025-05-09 | Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux | Linus Torvalds
Pull arm64 fix from Catalin Marinas: "Move the arm64_use_ng_mappings variable from the .bss to the .data section as it is accessed very early during boot with the MMU off and before the .bss has been initialised. This could lead to an incorrect idmap page table" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation
2025-05-09 | arm64/cpuinfo: only show one cpu's info in c_show() | Ye Bin
Currently, when ARM64 displays CPU information, every call to c_show() assembles all CPU information. However, as the number of CPUs increases, this can lead to insufficient buffer space due to excessive assembly in a single call, causing repeated expansion and multiple calls to c_show(). To avoid these wasted c_show() calls, only one CPU's information is assembled each time c_show() is called. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20250421062947.4072855-1-yebin@huaweicloud.com Signed-off-by: Will Deacon <will@kernel.org>
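A sketch of the seq_file pattern this describes, mirroring the x86 /proc/cpuinfo approach (illustrative only; the per-CPU field printing is reduced to a single line here):

static void *c_start(struct seq_file *m, loff_t *pos)
{
	*pos = cpumask_next(*pos - 1, cpu_online_mask);
	return *pos < nr_cpu_ids ? (void *)(uintptr_t)(*pos + 1) : NULL;
}

static void *c_next(struct seq_file *m, void *v, loff_t *pos)
{
	++*pos;
	return c_start(m, pos);
}

static void c_stop(struct seq_file *m, void *v)
{
}

static int c_show(struct seq_file *m, void *v)
{
	unsigned int cpu = (uintptr_t)v - 1;	/* exactly one CPU per invocation */

	seq_printf(m, "processor\t: %u\n", cpu);
	/* ... print the remaining fields for this CPU only ... */
	return 0;
}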
2025-05-09 | arm64/mm: Batch barriers when updating kernel mappings | Ryan Roberts
Because the kernel can't tolerate page faults for kernel mappings, when setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a dsb(ishst) to ensure that the store to the pgtable is observed by the table walker immediately. Additionally it emits an isb() to ensure that any already speculatively determined invalid mapping fault gets canceled. We can improve the performance of vmalloc operations by batching these barriers until the end of a set of entry updates. arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() provide the required hooks. vmalloc improves by up to 30% as a result. Two new TIF_ flags are created; TIF_LAZY_MMU tells us if the task is in the lazy mode and can therefore defer any barriers until exit from the lazy mode. TIF_LAZY_MMU_PENDING is used to remember if any pte operation was performed while in the lazy mode that required barriers. Then when leaving lazy mode, if that flag is set, we emit the barriers. Since arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() are used for both user and kernel mappings, we need the second flag to avoid emitting barriers unnecessarily if only user mappings were updated. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Luiz Capitulino <luizcap@redhat.com> Link: https://lore.kernel.org/r/20250422081822.1836315-12-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
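A sketch of the batching scheme described above; TIF_LAZY_MMU and TIF_LAZY_MMU_PENDING are the flags named in the text, while the helper bodies here are illustrative rather than the exact kernel code:

static inline void emit_pte_barriers(void)
{
	dsb(ishst);	/* publish pgtable stores to the hardware walker */
	isb();		/* cancel already-speculated invalid-mapping faults */
}

static inline void queue_pte_barriers(void)
{
	if (test_thread_flag(TIF_LAZY_MMU))
		set_thread_flag(TIF_LAZY_MMU_PENDING);	/* defer until leave */
	else
		emit_pte_barriers();
}

static inline void arch_enter_lazy_mmu_mode(void)
{
	set_thread_flag(TIF_LAZY_MMU);
}

static inline void arch_leave_lazy_mmu_mode(void)
{
	if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
		emit_pte_barriers();
	clear_thread_flag(TIF_LAZY_MMU);
}

Callers such as the vmalloc page-table update paths would bracket their batched PTE writes with arch_enter_lazy_mmu_mode()/arch_leave_lazy_mmu_mode(), paying for the DSB/ISB once per batch instead of once per entry.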
2025-05-08 | arm64: proton-pack: Add new CPUs 'k' values for branch mitigation | James Morse
Update the list of 'k' values for the branch mitigation from arm's website. Add the values for Cortex-X1C. The MIDR_EL1 value can be found here: https://developer.arm.com/documentation/101968/0002/Register-descriptions/AArch> Link: https://developer.arm.com/documentation/110280/2-0/?lang=en Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
2025-05-08 | arm64/fpsimd: ptrace: Gracefully handle errors | Mark Rutland
Within sve_set_common() we do not handle error conditions correctly: * When writing to NT_ARM_SSVE, if sme_alloc() fails, the task will be left with task->thread.sme_state==NULL, but TIF_SME will be set and task->thread.fp_type==FP_STATE_SVE. This will result in a subsequent null pointer dereference when the task's state is loaded or otherwise manipulated. * When writing to NT_ARM_SSVE, if sve_alloc() fails, the task will be left with task->thread.sve_state==NULL, but TIF_SME will be set, PSTATE.SM will be set, and task->thread.fp_type==FP_STATE_FPSIMD. This is not a legitimate state, and can result in various problems, including a subsequent null pointer dereference and/or the task inheriting stale streaming mode register state the next time its state is loaded into hardware. * When writing to NT_ARM_SSVE, if the VL is changed but the resulting VL differs from that in the header, the task will be left with TIF_SME set, PSTATE.SM set, but task->thread.fp_type==FP_STATE_FPSIMD. This is not a legitimate state, and can result in various problems as described above. Avoid these problems by allocating memory earlier, and by changing the task's saved fp_type to FP_STATE_SVE before skipping register writes due to a change of VL. To make early returns simpler, I've moved the call to fpsimd_flush_task_state() earlier. As the tracee's state has already been saved, and the tracee is known to be blocked for the duration of sve_set_common(), it doesn't matter whether this is called at the start or the end. For consistency I've moved the setting of TIF_SVE earlier. This will be cleared when loading FPSIMD-only state, and so moving this has no resulting functional change. Note that we only allocate the memory for SVE state when SVE register contents are provided, avoiding unnecessary memory allocations for tasks which only use FPSIMD. Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") Fixes: 5d0a8d2fba50 ("arm64/ptrace: Ensure that SME is set up for target when writing SSVE state") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-20-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 | arm64/fpsimd: ptrace: Mandate SVE payload for streaming-mode state | Mark Rutland
When a task has PSTATE.SM==1, reads of NT_ARM_SSVE are required to always present a header with SVE_PT_REGS_SVE, and register data in SVE format. Reads of NT_ARM_SSVE must never present register data in FPSIMD format. Within the kernel, we always expect streaming SVE data to be stored in SVE format. Currently a user can write to NT_ARM_SSVE with a header presenting SVE_PT_REGS_FPSIMD rather than SVE_PT_REGS_SVE, placing the task's FPSIMD/SVE data into an invalid state. To fix this we can either: (a) Forbid such writes. (b) Accept such writes, and immediately convert data into SVE format. Take the simple option and forbid such writes. Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-19-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 | arm64/fpsimd: ptrace: Do not present register data for inactive mode | Mark Rutland
The SME ptrace ABI is written around the incorrect assumption that SVE_PT_REGS_FPSIMD and SVE_PT_REGS_SVE are independent bit flags, where it is possible for both to be clear. In reality they are different values for bit 0 of the header flags, where SVE_PT_REGS_FPSIMD is 0 and SVE_PT_REGS_SVE is 1. In cases where code was written expecting that neither bit flag would be set, the value is equivalent to SVE_PT_REGS_FPSIMD. One consequence of this is that reads of the NT_ARM_SVE or NT_ARM_SSVE will erroneously present data from the other mode: * When PSTATE.SM==1, reads of NT_ARM_SVE will present a header with SVE_PT_REGS_FPSIMD, and FPSIMD-formatted data from streaming mode. * When PSTATE.SM==0, reads of NT_ARM_SSVE will present a header with SVE_PT_REGS_FPSIMD, and FPSIMD-formatted data from non-streaming mode. The original intent was that no register data would be provided in these cases, as described in commit: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Luckily, debuggers do not consume the bogus register data. Both GDB and LLDB read the NT_ARM_SSVE regset before the NT_ARM_SVE regset, and assume that when the NT_ARM_SSVE header presents SVE_PT_REGS_FPSIMD, it is necessary to read register contents from the NT_ARM_SVE regset, regardless of whether the NT_ARM_SSVE regset provided bogus register data. Fix the code to stop presenting register data from the inactive mode. At the same time, make the manipulation of the flag clearer, and remove the bogus comment from sve_set_common(). I've given this a quick spin with GDB and LLDB, and both seem happy. Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-18-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
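For reference, the header flag layout being described, as it appears in the arm64 ptrace UAPI:

/* Bit 0 of the NT_ARM_SVE/NT_ARM_SSVE header flags selects the payload
 * format; the two "flags" are really the two values of that single bit. */
#define SVE_PT_REGS_MASK	(1 << 0)
#define SVE_PT_REGS_FPSIMD	0
#define SVE_PT_REGS_SVE		SVE_PT_REGS_MASK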
2025-05-08 | arm64/fpsimd: ptrace: Save task state before generating SVE header | Mark Rutland
As sve_init_header_from_task() consumes the saved value of PSTATE.SM and the saved fp_type, both must be saved before the header is generated. When generating a coredump for the current task, sve_get_common() calls sve_init_header_from_task() before saving the task's state. Consequently the header may be bogus, and the contents of the regset may be misleading. Fix this by saving the task's state before generating the header. Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Fixes: b017a0cea627 ("arm64/ptrace: Use saved floating point state type to determine SVE layout") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-17-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 | arm64/fpsimd: ptrace/prctl: Ensure VL changes leave task in a valid state | Mark Rutland
Currently, vec_set_vector_length() can manipulate a task into an invalid state as a result of a prctl/ptrace syscall which changes the SVE/SME vector length, resulting in several problems: (1) When changing the SVE vector length, if the task initially has PSTATE.ZA==1, and sve_alloc() fails to allocate memory, the task will be left with PSTATE.ZA==1 and sve_state==NULL. This is not a legitimate state, and could result in a subsequent null pointer dereference. (2) When changing the SVE vector length, if the task initially has PSTATE.SM==1, the task will be left with PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. Streaming mode state always needs to be saved in SVE format, so this is not a legitimate state. Attempting to restore this state may cause a task to erroneously inherit stale streaming mode predicate registers and FFR contents, behaving non-deterministically and potentially leaving information from another task. While in this state, reads of the NT_ARM_SSVE regset will indicate that the registers are not stored in SVE format. For the NT_ARM_SSVE regset specifically, debuggers interpret this as meaning that PSTATE.SM==0. (3) When changing the SME vector length, if the task initially has PSTATE.SM==1, the lower 128 bits of the task's streaming mode vector state will be migrated to non-streaming mode, rather than these bits being zeroed as is usually the case for changes to PSTATE.SM. To fix the first issue, we can eagerly allocate the new sve_state and sme_state before modifying the task. This makes it possible to handle memory allocation failure without modifying the task state at all, and removes the need to clear TIF_SVE and TIF_SME. To fix the second issue, we either need to clear PSTATE.SM or not change the saved fp_type. Given we're going to eagerly allocate sve_state and sme_state, the simplest option is to preserve PSTATE.SM and the saved fp_type, and consistently truncate the SVE state. This ensures that the task always stays in a valid state, and by virtue of not exiting streaming mode, this also sidesteps the third issue. I believe these changes should not be problematic for realistic usage: * When the SVE/SME vector length is changed via prctl(), syscall entry will have cleared PSTATE.SM. Unless the task's state has been manipulated via ptrace after entry, the task will have PSTATE.SM==0. * When the SVE/SME vector length is changed via a write to the NT_ARM_SVE or NT_ARM_SSVE regsets, PSTATE.SM will be forced immediately after the length change, and new vector state will be copied from userspace. * When the SME vector length is changed via a write to the NT_ARM_ZA regset, the (S)SVE state is clobbered today, so anyone who cares about the specific state would need to install this after writing to the NT_ARM_ZA regset. As we need to free the old SVE state while TIF_SVE may still be set, we cannot use sve_free(), and using kfree() directly makes it clear that the free pairs with the subsequent assignment. As this leaves sve_free() unused, I've removed the existing sve_free() and renamed __sve_free() to mirror sme_free(). 
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME") Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-16-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
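A schematic of the "allocate first, modify last" ordering adopted above (purely illustrative; the function name and size parameters are placeholders, not the actual kernel change):

/* Perform every allocation the operation needs before touching the task,
 * so an allocation failure leaves the task state completely untouched. */
static int change_vl(struct task_struct *task, size_t sve_sz, size_t sme_sz)
{
	void *new_sve = kzalloc(sve_sz, GFP_KERNEL);
	void *new_sme = kzalloc(sme_sz, GFP_KERNEL);

	if (!new_sve || !new_sme) {
		kfree(new_sve);
		kfree(new_sme);
		return -ENOMEM;
	}

	/* Only now modify the task: free the old buffers directly with
	 * kfree() and swap in the new ones, keeping the state consistent. */
	kfree(task->thread.sve_state);
	task->thread.sve_state = new_sve;
	kfree(task->thread.sme_state);
	task->thread.sme_state = new_sme;
	return 0;
}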
2025-05-08 | arm64/fpsimd: ptrace/prctl: Ensure VL changes do not resurrect stale data | Mark Rutland
The SVE/SME vector lengths can be changed via prctl/ptrace syscalls. Changes to the SVE/SME vector lengths are documented as preserving the lower 128 bits of the Z registers (i.e. the bits shared with the FPSIMD V registers). To ensure this, vec_set_vector_length() explicitly copies register values from a task's saved SVE state to its saved FPSIMD state when dropping the task to FPSIMD-only. The logic for this was not updated when FPSIMD/SVE state tracking was changed across commits: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") Since the last commit above, a task's FPSIMD/SVE state may be stored in FPSIMD format while TIF_SVE is set, and the stored SVE state is stale. When vec_set_vector_length() encounters this case, it will erroneously clobber the live FPSIMD state with stale SVE state by using sve_to_fpsimd(). Fix this by using fpsimd_sync_from_effective_state() instead. Related issues with streaming mode state will be addressed in subsequent patches. Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-15-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 | arm64/fpsimd: Make clone() compatible with ZA lazy saving | Mark Rutland
Linux is intended to be compatible with userspace written to Arm's AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension (SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's ZA tile is lazily callee-saved and caller-restored. In this scheme, TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer" pointer. This scheme has been implemented in GCC and LLVM, with necessary runtime support implemented in glibc and bionic. AAPCS64 does not specify how the ZA lazy saving scheme is expected to interact with thread creation mechanisms such as fork() and pthread_create(), which would be implemented in terms of the Linux clone syscall. The behaviour implemented by Linux and glibc/bionic doesn't always compose safely, as explained below. Currently the clone syscall is implemented such that PSTATE.ZA and the ZA tile are always inherited by the new task, and TPIDR2_EL0 is inherited unless the 'flags' argument includes CLONE_SETTLS, in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense: (a) TPIDR2_EL0 is part of the calling convention, and changes as control is passed between functions. It is *NOT* used for thread local storage, despite superficial similarity to TPIDR_EL0, which is used as the TLS register. (b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call standard, and some combinations of states are illegal. In general, manipulating the two independently is not guaranteed to be safe. In practice, code which is compliant with the procedure call standard may issue a clone syscall while in the "ZA dormant" state, where PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to be saved. This can cause a variety of problems, including: * If the implementation of pthread_create() passes CLONE_SETTLS, the new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the procedure call standard this is not a legitimate state for most functions. This can cause data corruption (e.g. as code may rely on PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will zero the ZA tile contents), and may result in other undefined behaviour. * If the implementation of pthread_create() does not pass CLONE_SETTLS, the new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a TPIDR2 block on the parent thread's stack. This can result in a variety of problems, e.g. - The child may write back to the parent's za_save_buffer, corrupting its contents. - The child may read from the TPIDR2 block after the parent has reused this memory for something else, and consequently the child may abort or clobber arbitrary memory. Ideally we'd require that userspace ensures that a task is in the "ZA off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a clone syscall, and have the kernel force this state for new threads. Unfortunately, contemporary C libraries do not do this, and simply forcing this state within the implementation of clone would break fork(). Instead, we can bodge around this by considering the CLONE_VM flag, and manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. 
CLONE_VM indicates that the new task will run in the same address space as its parent, and in that case it doesn't make sense to inherit a stale pointer to the parent's TPIDR2 block: * For fork(), CLONE_VM will not be set, and it is safe to inherit both PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the address space, and cannot clobber its parent's stack. * For pthread_create() and vfork(), CLONE_VM will be set, and discarding PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing assumptions in userspace. Implement this behaviour for clone(). We currently inherit PSTATE.ZA in arch_dup_task_struct(), but this does not have access to the clone flags, so move this logic under copy_thread(). Documentation is updated to describe the new behaviour. [1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf [2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Kiss <daniel.kiss@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Richard Sandiford <richard.sandiford@arm.com> Cc: Sander De Smalen <sander.desmalen@arm.com> Cc: Tamas Petz <tamas.petz@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yury Khrustalev <yury.khrustalev@arm.com> Acked-by: Yury Khrustalev <yury.khrustalev@arm.com> Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
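A minimal sketch of the behaviour described above, assuming it lives in copy_thread() once the clone flags are visible (field and flag names follow the existing arm64 code; the exact hunk may differ):

  if (args->flags & CLONE_VM) {
          /*
           * Same address space as the parent: drop PSTATE.ZA and
           * TPIDR2_EL0 as a pair so the child cannot inherit a stale
           * pointer to a TPIDR2 block on the parent's stack.
           */
          p->thread.svcr &= ~SVCR_ZA_MASK;
          p->thread.tpidr2_el0 = 0;
  }
  /*
   * !CLONE_VM (fork()): PSTATE.ZA, the ZA contents, and TPIDR2_EL0 are
   * inherited; the child has its own copy of the address space, so it
   * cannot clobber the parent's za_save_buffer.
   */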
2025-05-08arm64/fpsimd: Clear PSTATE.SM during clone()Mark Rutland
Currently arch_dup_task_struct() doesn't handle cases where the parent task has PSTATE.SM==1. Since syscall entry exits streaming mode, the parent will usually have PSTATE.SM==0, but this can be changed by ptrace after syscall entry. When this happens, arch_dup_task_struct() will initialise the new task into an invalid state. The new task inherits the parent's configuration of PSTATE.SM, but fp_type is set to FP_STATE_FPSIMD, TIF_SVE and TIF_SME may be cleared, and both sve_state and sme_state may be set to NULL. This can result in a variety of problems whenever the new task's state is manipulated, including kernel NULL pointer dereferences and leaking of streaming mode state between tasks. When ptrace is not involved, the parent will have PSTATE.SM==0 as a result of syscall entry, and the documentation in Documentation/arch/arm64/sme.rst says: | On process creation (eg, clone()) the newly created process will have | PSTATE.SM cleared. ... so make this true by using task_smstop_sm() to exit streaming mode in the child task, avoiding the problems above. Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-13-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
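Sketched below is the shape of that fix, assuming it sits in the child-setup path of clone handling (the guard shown is an assumption; the real change may call the helper unconditionally):

  /* Exit streaming mode in the child, as documented in sme.rst. */
  if (system_supports_sme() && (p->thread.svcr & SVCR_SM_MASK))
          task_smstop_sm(p);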
2025-05-08arm64/fpsimd: Consistently preserve FPSIMD state during clone()Mark Rutland
In arch_dup_task_struct() we try to ensure that the child task inherits the FPSIMD state of its parent, but this depends on the parent task's saved state being in FPSIMD format, which is not always the case. Consequently the child task may inherit stale FPSIMD state in some cases. This can happen when the parent's state has been modified by ptrace since syscall entry, as writes to the NT_ARM_SVE regset may save state in SVE format. This has been possible since commit: bc0ee4760364 ("arm64/sve: Core task context handling") More recently it has been possible for a task's FPSIMD/SVE state to be saved before lazy discarding was guaranteed to occur, in which case preemption could cause the effective FPSIMD state to be saved in SVE format non-deterministically. This has been possible since commit: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") Fix this by saving the parent task's effective FPSIMD state into FPSIMD format before copying the task_struct. As this requires modifying the parent's fpsimd_state, we must save+flush the state to avoid racing with concurrent manipulation. Similar issues exist when the parent has streaming mode state, and will be addressed by subsequent patches. Fixes: bc0ee4760364 ("arm64/sve: Core task context handling") Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-12-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
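A sketch of the ordering described above; the save+flush helper name is an assumption based on this series, and only the ordering is the point:

  /*
   * Save and flush the parent's live state so nothing races with the
   * conversion, then force the saved state into FPSIMD format before
   * the task_struct copy inherits it.
   */
  fpsimd_save_and_flush_current_state();          /* assumed helper name */
  fpsimd_sync_from_effective_state(src);          /* saved state -> FPSIMD format */
  *dst = *src;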
2025-05-08arm64/fpsimd: Remove redundant task->mm checkMark Rutland
For historical reasons, arch_dup_task_struct() only calls fpsimd_preserve_current_state() when current->mm is non-NULL, but this is no longer necessary. Historically TIF_FOREIGN_FPSTATE was only managed for user threads, and was never set for kernel threads. At the time, various functions attempted to avoid saving/restoring state for kernel threads by checking task_struct::mm to try to distinguish user threads from kernel threads. We added the current->mm check to arch_dup_task_struct() in commit: 6eb6c80187c5 ("arm64: kernel thread don't need to save fpsimd context.") ... where the intent was to avoid pointlessly saving state for kernel threads, which never had live state (and the saved state would never be consumed). Subsequently we began setting TIF_FOREIGN_FPSTATE for kernel threads, and removed most of the task_struct::mm checks in commit: df3fb9682045 ("arm64: fpsimd: Eliminate task->mm checks") ... but we missed the check in arch_dup_task_struct(), which is similarly redundant. Remove the redundant check from arch_dup_task_struct(). Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-11-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
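Concretely, the change reduces to dropping the guard (illustrative before/after):

  /* Before: */
  if (current->mm)
          fpsimd_preserve_current_state();

  /* After: */
  fpsimd_preserve_current_state();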
2025-05-08arm64/fpsimd: signal: Use SMSTOP behaviour in setup_return()Mark Rutland
Historically the behaviour of setup_return() was nondeterministic, depending on whether the task's FPSIMD/SVE/SME state happened to be live. We fixed most of that in commit: 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early") ... but we didn't decide on how clearing PSTATE.SM should behave, and left a TODO comment to that effect. Use the new task_smstop_sm() helper to make this behave as if an SMSTOP instruction was used to exit streaming mode. This would have been the most common behaviour prior to the commit above. Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-10-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08arm64/fpsimd: Add task_smstop_sm()Mark Rutland
In a few places we want to transition a task from streaming mode to non-streaming mode, e.g. signal delivery where we historically tried to use an SMSTOP SM instruction. Add a new helper to manipulate a task's state in the same way as an SMSTOP SM instruction. I have not added a corresponding helper to simulate the effects of SMSTART SM. Only ptrace transitions a task into streaming mode, and ptrace has distinct semantics for such transitions. Per ARM DDI 0487 L.a, section B1.4.6: | RRSWFQ | When the Effective value of PSTATE.SM is changed by any method from 0 | to 1, an entry to Streaming SVE mode is performed, and all implemented | bits of Streaming SVE register state are set to zero. | RKFRQZ | When the Effective value of PSTATE.SM is changed by any method from 1 | to 0, an exit from Streaming SVE mode is performed, and in the | newly-entered mode, all implemented bits of the SVE scalable vector | registers, SVE predicate registers, and FFR, are set to zero. Per ARM DDI 0487 L.a, section C5.2.9: | On entry to or exit from Streaming SVE mode, FPMR is set to 0 Per ARM DDI 0487 L.a, section C5.2.10: | On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC, | IXC, IDC, QC} are set to 1 and the remaining bits are set to 0. This means bits 0, 1, 2, 3, 4, 7, and 27 respectively, i.e. 0x0800009f Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-9-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
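A minimal sketch of such a helper, following the architectural effects quoted above (the field names and reset details are assumptions about the actual implementation):

  void task_smstop_sm(struct task_struct *task)
  {
          if (!(task->thread.svcr & SVCR_SM_MASK))
                  return;

          /* Zero the FPSIMD/SVE register view, per rule RKFRQZ. */
          memset(&task->thread.uw.fpsimd_state.vregs, 0,
                 sizeof(task->thread.uw.fpsimd_state.vregs));

          /* FPMR is set to 0 and FPSR to 0x0800009f, per C5.2.9/C5.2.10. */
          task->thread.uw.fpmr = 0;                       /* field location assumed */
          task->thread.uw.fpsimd_state.fpsr = 0x0800009f;

          task->thread.svcr &= ~SVCR_SM_MASK;
          task->thread.fp_type = FP_STATE_FPSIMD;
  }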
2025-05-08arm64/fpsimd: Factor out {sve,sme}_state_size() helpersMark Rutland
In subsequent patches we'll need to determine the SVE/SME state size for a given SVE VL and SME VL regardless of whether a task is currently configured with those VLs. Split the sizing logic out of sve_state_size() and sme_state_size() so that we don't need to open-code this logic elsewhere. At the same time, apply minor cleanups: * Move sve_state_size() into fpsimd.h, matching the placement of sme_state_size(). * Remove the feature checks from sve_state_size(). We only call sve_state_size() when at least one of SVE and SME are supported, and when either of the two is not supported, the task's corresponding SVE/SME vector length will be zero. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-8-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
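The split can be pictured as VL-parameterised helpers along these lines (names and exact formulas are illustrative, reusing the existing sigcontext size macros):

  static inline size_t __sve_state_size(unsigned int sve_vl, unsigned int sme_vl)
  {
          unsigned int vl = max(sve_vl, sme_vl);

          return SVE_SIG_REGS_SIZE(sve_vq_from_vl(vl));
  }

  static inline size_t __sme_state_size(unsigned int sme_vl)
  {
          size_t size = ZA_SIG_REGS_SIZE(sve_vq_from_vl(sme_vl));

          if (system_supports_sme2())
                  size += ZT_SIG_REG_SIZE;

          return size;
  }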
2025-05-08arm64/fpsimd: Clarify sve_sync_*() functionsMark Rutland
The sve_sync_{to,from}_fpsimd*() functions are intended to extract/insert the currently effective FPSIMD state of a task regardless of whether the task's state is saved in FPSIMD format or SVE format. Historically they were only used by ptrace, but sve_sync_to_fpsimd() is now used more widely, and sve_sync_from_fpsimd_zeropad() may be used more widely in future. When FPSIMD/SVE state tracking was changed across commits: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") ... sve_sync_to_fpsimd() was updated to consider task->thread.fp_type rather than the task's TIF_SVE and PSTATE.SM, but (apparently due to an oversight) sve_sync_from_fpsimd_zeropad() was left as-is, leaving the two inconsistent. Due to this, sve_sync_from_fpsimd_zeropad() may copy state from task->thread.uw.fpsimd_state into task->thread.sve_state when task->thread.fp_type == FP_STATE_FPSIMD. This is redundant (but benign) as task->thread.uw.fpsimd_state is the effective state that will be restored, and task->thread.sve_state will not be consumed. For consistency, and to avoid the redundant work, it is better for sve_sync_from_fpsimd_zeropad() to consider task->thread.fp_type alone, matching sve_sync_to_fpsimd(). The naming of both functions is somewhat unfortunate, as it is unclear when and why they copy state. It would be better to describe them in terms of the effective state. Considering all of the above, clean this up: * Adjust sve_sync_from_fpsimd_zeropad() to consider task->thread.fp_type. * Update comments to clarify the intended semantics/usage. I've removed the description that task->thread.sve_state must have been allocated, as this is only necessary when task->thread.fp_type == FP_STATE_SVE, which itself implies that task->thread.sve_state must have been allocated. * Rename the functions to more clearly indicate when/why they copy state: - sve_sync_to_fpsimd() => fpsimd_sync_from_effective_state() - sve_sync_from_fpsimd_zeropad() => fpsimd_sync_to_effective_state_zeropad() Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-7-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
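For illustration, the zero-padding direction ends up shaped roughly as follows after the rename (a sketch, not the exact body):

  void fpsimd_sync_to_effective_state_zeropad(struct task_struct *task)
  {
          unsigned int vq;
          void *sst = task->thread.sve_state;
          struct user_fpsimd_state const *fst = &task->thread.uw.fpsimd_state;

          if (task->thread.fp_type != FP_STATE_SVE)
                  return; /* uw.fpsimd_state is already the effective state */

          vq = sve_vq_from_vl(thread_get_cur_vl(&task->thread));
          memset(sst, 0, SVE_SIG_REGS_SIZE(vq));
          __fpsimd_to_sve(sst, fst, vq);  /* V regs into the low 128 bits of Z regs */
  }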
2025-05-08arm64/fpsimd: ptrace: Consistently handle partial writes to NT_ARM_(S)SVEMark Rutland
Partial writes to the NT_ARM_SVE and NT_ARM_SSVE regsets using a payload are handled inconsistently and non-deterministically. A comment within sve_set_common() indicates that we intended that a partial write would preserve any effective FPSIMD/SVE state which was not overwritten, but this has never worked consistently, and during syscalls the FPSIMD vector state may be non-deterministically preserved and may be erroneously migrated between streaming and non-streaming SVE modes. The simplest fix is to handle a partial write by consistently zeroing the remaining state. As detailed below I do not believe this will adversely affect any real usage. Neither GDB nor LLDB attempt partial writes to these regsets, and the documentation (in Documentation/arch/arm64/sve.rst) has always indicated that state preservation was not guaranteed, as it says: | The effect of writing a partial, incomplete payload is unspecified. When the logic was originally introduced in commit: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support") ... there were two potential behaviours, depending on TIF_SVE: * When TIF_SVE was clear, all SVE state would be zeroed, excluding the low 128 bits of vectors shared with FPSIMD, FPSR, and FPCR. * When TIF_SVE was set, all SVE state would be zeroed, including the low 128 bits of vectors shared with FPSIMD, but excluding FPSR and FPCR. Note that as writing to NT_ARM_SVE would set TIF_SVE, partial writes to NT_ARM_SVE would not be idempotent, and if a first write preserved the low 128 bits, a subsequent (potentially identical) partial write would discard the low 128 bits. When support for the NT_ARM_SSVE regset was added in commit: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") ... the above behaviour was retained for writes to the NT_ARM_SVE regset, though writes to the NT_ARM_SSVE regset would always zero the SVE registers and would not inherit FPSIMD register state. This happened as fpsimd_sync_to_sve() only copied the FPSIMD regs when TIF_SVE was clear and PSTATE.SM==0. Subsequently, when FPSIMD/SVE state tracking was changed across commits: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 ("arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") ... there was no corresponding update to the ptrace code, nor to fpsimd_sync_to_sve(), which still considers TIF_SVE and PSTATE.SM rather than the saved fp_type. The saved state can be in the FPSIMD format regardless of whether TIF_SVE is set or clear, and the saved type can change non-deterministically during syscalls. Consequently a subsequent partial write to the NT_ARM_SVE or NT_ARM_SSVE regsets may non-deterministically preserve the FPSIMD state, and may migrate this state between streaming and non-streaming modes. Clean this up by never attempting to preserve ANY state when writing an SVE payload to the NT_ARM_SVE/NT_ARM_SSVE regsets, zeroing all relevant state including FPSR and FPCR. This simplifies the code, makes the behaviour deterministic, and avoids migrating state between streaming and non-streaming modes. As above, I do not believe this should adversely affect existing userspace applications. At the same time, remove fpsimd_sync_to_sve(). It is no longer used, doesn't do what its documentation implies, and gets in the way of other cleanups and fixes.
Fixes: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-6-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
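Conceptually, the new handling of a short payload reduces to something like this inside sve_set_common(), before the user payload is copied in (a sketch; the real code also deals with the SVE_PT_* layout and the FPSR/FPCR positions):

  /*
   * Zero all relevant state (Z/P/FFR, the FPSIMD view, FPSR and FPCR) so
   * that anything the partial payload does not cover is deterministically
   * zero rather than stale.
   */
  memset(target->thread.sve_state, 0, sve_state_size(target));
  memset(&target->thread.uw.fpsimd_state, 0,
         sizeof(target->thread.uw.fpsimd_state));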
2025-05-08arm64: bpf: Add BHB mitigation to the epilogue for cBPF programsJames Morse
A malicious BPF program may manipulate the branch history to influence what the hardware speculates will happen next. On exit from a BPF program, emit the BHB mitigation sequence. This is only applied for 'classic' cBPF programs that are loaded by seccomp. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2025-05-08arm64: proton-pack: Expose whether the branchy loop k valueJames Morse
Add a helper to expose the k value of the branchy loop. This is needed by the BPF JIT to generate the mitigation sequence in BPF programs. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
2025-05-08arm64: proton-pack: Expose whether the platform is mitigated by firmwareJames Morse
is_spectre_bhb_fw_affected() allows the caller to determine if the CPU is known to need a firmware mitigation. CPUs are either on the list of CPUs we know about, or firmware has been queried and reported that the platform is affected - and mitigated by firmware. This helper is not useful to determine if the platform is mitigated by firmware. A CPU could be on the known list, but the firmware mitigation may not be implemented. It's affected but not mitigated. spectre_bhb_enable_mitigation() handles this distinction by checking the firmware state before enabling the mitigation. Add a helper to expose this state. This will be used by the BPF JIT to determine if calling firmware for a mitigation is necessary and supported. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
2025-05-08arm64/fpsimd: signal: Consistently read FPSIMD contextMark Rutland
For historical reasons, restore_sve_fpsimd_context() has an open-coded copy of the logic from read_fpsimd_context(), which is used to either restore an FPSIMD-only context, or to merge FPSIMD state into an SVE state when restoring an SVE+FPSIMD context. The logic is *almost* identical. Refactor the logic to avoid duplication and make this clearer. This comes with two functional changes that I do not believe will be problematic in practice: * The user_fpsimd_state::size field will be checked in all restore paths that consume user_fpsimd_state. The kernel always populates this field when delivering a signal, and so this should contain the expected value unless it has been corrupted. * If a read of user_fpsimd_state fails, we will return early without modifying TIF_SVE, the saved SVCR, or the saved fp_type. This will leave the task in a consistent state, without potentially resurrecting stale FPSIMD state. A read of user_fpsimd_state should never fail unless the structure has been corrupted or the stack has been unmapped. Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-5-mark.rutland@arm.com [will: Ensure read_fpsimd_context() returns negative error code or zero] Signed-off-by: Will Deacon <will@kernel.org>
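The shared reader ends up with roughly this shape (a sketch based on the pre-existing restore path; the exact signature and struct plumbing may differ):

  static int read_fpsimd_context(struct user_fpsimd_state *fpsimd,
                                 struct user_ctxs *user)
  {
          int err;

          if (user->fpsimd_size != sizeof(struct fpsimd_context))
                  return -EINVAL;

          err = __copy_from_user(fpsimd->vregs, &(user->fpsimd->vregs),
                                 sizeof(fpsimd->vregs));
          __get_user_error(fpsimd->fpsr, &(user->fpsimd->fpsr), err);
          __get_user_error(fpsimd->fpcr, &(user->fpsimd->fpcr), err);

          return err ? -EFAULT : 0;
  }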
2025-05-08arm64/fpsimd: signal: Mandate SVE payload for streaming-mode stateMark Rutland
Non-streaming SVE state may be preserved without an SVE payload, in which case the SVE context only has a header with VL==0, and all state can be restored from the FPSIMD context. Streaming SVE state is always preserved with an SVE payload, where the SVE context header has VL!=0, and the SVE_SIG_FLAG_SM flag is set. The kernel never preserves an SVE context where SVE_SIG_FLAG_SM is set without an SVE payload. However, restore_sve_fpsimd_context() doesn't forbid restoring such a context, and will handle this case by clearing PSTATE.SM and restoring the FPSIMD context into non-streaming mode, which isn't consistent with the SVE_SIG_FLAG_SM flag. Forbid this case, and mandate an SVE payload when the SVE_SIG_FLAG_SM flag is set. This avoids an awkward ABI quirk and reduces the risk that later rework to this code permits configuring a task with PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. I've marked this as a fix given that we never intended to support this case, and we don't want anyone to start relying upon the old behaviour once we re-enable SME. Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-4-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
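The mandated check itself is small; conceptually (a sketch, where 'flags' and 'vl' are the values read from the user's sve_context header):

  /* A streaming-mode SVE context must carry an SVE register payload. */
  if ((flags & SVE_SIG_FLAG_SM) && !vl)
          return -EINVAL;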
2025-05-08arm64/fpsimd: signal: Clear PSTATE.SM when restoring FPSIMD frame onlyMark Rutland
On systems with SVE and/or SME, the kernel will always create SVE and FPSIMD signal frames when delivering a signal, but a user can manipulate signal frames such that a signal return only observes an FPSIMD signal frame. When this happens, restore_fpsimd_context() will restore state such that fp_type==FP_STATE_FPSIMD, but will leave PSTATE.SM as-is. It is possible for a user to set PSTATE.SM between syscall entry and execution of the sigreturn logic (e.g. via ptrace), and consequently the sigreturn may result in the task having PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. For various reasons it is not legitimate for a task to be in a state where PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. Portions of the user ABI are written with the requirement that streaming SVE state is always presented in SVE format rather than FPSIMD format, and as there is no mechanism to permit access to only the FPSIMD subset of streaming SVE state, streaming SVE state must always be saved and restored in SVE format. Fix restore_fpsimd_context() to clear PSTATE.SM when restoring an FPSIMD signal frame without an SVE signal frame. This matches the current behaviour when an SVE signal frame is present, but the SVE signal frame has no register payload (e.g. as is the case on SME-only systems which lack SVE). This change should have no effect for applications which do not alter signal frames (i.e. almost all applications). I do not expect non-{malicious,buggy} applications to hide the SVE signal frame, but I've chosen to clear PSTATE.SM rather than mandating the presence of an SVE signal frame in case there is some legacy (non-SME) usage that I am not currently aware of. For context, the SME handling was originally introduced in commit: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") ... and subsequently updated/fixed to handle SME-only systems in commits: 7dde62f0687c ("arm64/signal: Always accept SVE signal frames on SME only systems") f26cd7372160 ("arm64/signal: Always allocate SVE signal frames on SME only systems") Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-3-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
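In effect the fix boils down to the following on the FPSIMD-only restore path (a sketch; the real change may also need to account for live register state):

  /*
   * Only an FPSIMD frame was present: exit streaming mode so the task
   * cannot be left with PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD.
   */
  current->thread.svcr &= ~SVCR_SM_MASK;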