summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)Author
2025-04-17powerpc/pseries: Add papr-indices char driver for ibm,get-indicesHaren Myneni
The RTAS call ibm,get-indices is used to obtain indices and location codes for a specified indicator or sensor token. The current implementation uses rtas_get_indices() API provided by librtas library which allocates RMO buffer and issue this RTAS call in the user space. But writable mapping /dev/mem access by the user space is prohibited under system lockdown. To overcome the restricted access in the user space, the kernel provide interfaces to collect indices data from the hypervisor. This patch adds papr-indices character driver and expose standard interfaces such as open / ioctl/ read to user space in ways that are compatible with lockdown. PAPR (2.13 7.3.17 ibm,get-indices RTAS Call) describes the following steps to retrieve all indices data: - User input parameters to the RTAS call: sensor or indicator, and indice type - ibm,get-indices is sequence RTAS call which means has to issue multiple times to get the entire list of indicators or sensors of a particular type. The hypervisor expects the first RTAS call with the sequence 1 and the subsequent calls with the sequence number returned from the previous calls. - The OS may not interleave calls to ibm,get-indices for different indicator or sensor types. Means other RTAS calls with different type should not be issued while the previous type sequence is in progress. So collect the entire list of indices and copied to buffer BLOB during ioctl() and expose this buffer to the user space with the file descriptor. - The hypervisor fills the work area with a specific format but does not return the number of bytes written to the buffer. Instead of parsing the data for each call to determine the data length, copy the work area size (RTAS_GET_INDICES_BUF_SIZE) to the buffer. Return work-area size of data to the user space for each read() call. Expose these interfaces to user space with a /dev/papr-indices character device using the following programming model: int devfd = open("/dev/papr-indices", O_RDONLY); int fd = ioctl(devfd, PAPR_INDICES_IOC_GET, struct papr_indices_io_block) - Collect all indices data for the specified token to the buffer char *buf = malloc(RTAS_GET_INDICES_BUF_SIZE); length = read(fd, buf, RTAS_GET_INDICES_BUF_SIZE) - RTAS_GET_INDICES_BUF_SIZE of data is returned to the user space. - The user space retrieves the indices and their location codes from the buffer - Should issue multiple read() calls until reaches the end of BLOB buffer. The read() should use the file descriptor obtained from ioctl to get the data that is exposed to file descriptor. Implemented support in librtas (rtas_get_indices()) for this new ABI for system lockdown. Signed-off-by: Haren Myneni <haren@linux.ibm.com> Tested-by: Sathvika Vasireddy <sv@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416225743.596462-4-haren@linux.ibm.com
2025-04-17powerpc/pseries: Define papr_indices_io_block for papr-indices ioctlsHaren Myneni
To issue ibm,get-indices, ibm,set-dynamic-indicator and ibm,get-dynamic-sensor-state in the user space, the RMO buffer is allocated for the work area which is restricted under system lockdown. So instead of user space execution, the kernel will provide /dev/papr-indices interface to execute these RTAS calls. The user space assigns data in papr_indices_io_block struct depends on the specific HCALL and passes to the following ioctls: PAPR_INDICES_IOC_GET: Use for ibm,get-indices. Returns a get-indices handle fd to read data. PAPR_DYNAMIC_SENSOR_IOC_GET: Use for ibm,get-dynamic-sensor-state. Updates the sensor state in papr_indices_io_block.dynamic_param.state PAPR_DYNAMIC_INDICATOR_IOC_SET: Use for ibm,set-dynamic-indicator. Sets the new state for the input indicator. Signed-off-by: Haren Myneni <haren@linux.ibm.com> Tested-by: Sathvika Vasireddy <sv@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416225743.596462-3-haren@linux.ibm.com
2025-04-17powerpc/pseries: Define common functions for RTAS sequence callsHaren Myneni
The RTAS call can be normal where retrieves the data form the hypervisor once or sequence based RTAS call which has to issue multiple times until the complete data is obtained. For some of these sequence RTAS calls, the OS should not interleave calls with different input until the sequence is completed. The data is collected for each call and copy to the buffer for the entire sequence during ioctl() handle and then expose this buffer to the user space with read() handle. One such sequence RTAS call is ibm,get-vpd and its support is already included in the current code. To add the similar support for other sequence based calls, move the common functions in to separate file and update papr_rtas_sequence struct with the following callbacks so that RTAS call specific code will be defined and executed to complete the sequence. struct papr_rtas_sequence { int error; void params; void (*begin) (struct papr_rtas_sequence *); void (*end) (struct papr_rtas_sequence *); const char * (*work) (struct papr_rtas_sequence *, size_t *); }; params: Input parameters used to pass for RTAS call. Begin: RTAS call specific function to initialize data including work area allocation. End: RTAS call specific function to free up resources (free work area) after the sequence is completed. Work: The actual RTAS call specific function which collects the data from the hypervisor. Signed-off-by: Haren Myneni <haren@linux.ibm.com> Tested-by: Sathvika Vasireddy <sv@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416225743.596462-2-haren@linux.ibm.com
2025-04-17powerpc/crc: Include uaccess.h and othersHerbert Xu
The powerpc crc code was relying on pagefault_disable from being pulled in by random header files. Fix this by explicitly including uaccess.h. Also add other missing header files to prevent similar problems in future. Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 7ba8df47810f ("asm-generic: Make simd.h more resilient") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-16powerpc: enable dynamic preemptionShrikanth Hegde
Once the lazy preemption is supported, it would be desirable to change the preemption models at runtime. So add support for dynamic preemption using DYNAMIC_KEY. ::Tested lightly on Power10 LPAR Performance numbers indicate that, preempt=none(no dynamic) and preempt=none(dynamic) are close. cat /sys/kernel/debug/sched/preempt (none) voluntary full lazy perf stat -e probe:__cond_resched -a sleep 1 Performance counter stats for 'system wide': 1,253 probe:__cond_resched echo full > /sys/kernel/debug/sched/preempt cat /sys/kernel/debug/sched/preempt none voluntary (full) lazy perf stat -e probe:__cond_resched -a sleep 1 Performance counter stats for 'system wide': 0 probe:__cond_resched Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250210184334.567383-2-sshegde@linux.ibm.com
2025-04-16fadump: Use str_yes_no() helper in fadump_show_config()Thorsten Blum
Remove hard-coded strings by using the str_yes_no() helper function. Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250209081704.2758-2-thorsten.blum@linux.dev
2025-04-16KVM: powerpc: Enable commented out BUILD_BUG_ON() assertionThorsten Blum
The BUILD_BUG_ON() assertion was commented out in commit 38634e676992 ("powerpc/kvm: Remove problematic BUILD_BUG_ON statement") and fixed in commit c0a187e12d48 ("KVM: powerpc: Fix BUILD_BUG_ON condition"), but not enabled. Enable it now that this no longer breaks and remove the comment. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250411084222.6916-1-thorsten.blum@linux.dev
2025-04-16powerpc: mpic: Use str_enabled_disabled() helper functionThorsten Blum
Remove hard-coded strings by using the str_enabled_disabled() helper function. Use pr_debug() instead of printk(KERN_DEBUG) to silence a checkpatch warning. Reviewed-by: Ricardo B. Marlière <ricardo@marliere.net> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250219112053.3352-2-thorsten.blum@linux.dev
2025-04-16powerpc/ps3: Use str_write_read() in ps3_notification_read_write()Thorsten Blum
Remove hard-coded strings by using the str_write_read() helper function. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250219111445.2875-2-thorsten.blum@linux.dev
2025-04-16powerpc/kvm-hv-pmu: Add perf-events for Hostwide countersVaibhav Jain
Update 'kvm-hv-pmu.c' to add five new perf-events mapped to the five Hostwide counters. Since these newly introduced perf events are at system wide scope and can be read from any L1-Lpar CPU, 'kvmppc_pmu' scope and capabilities are updated appropriately. Also introduce two new helpers. First is kvmppc_update_l0_stats() that uses the infrastructure introduced in previous patches to issues the H_GUEST_GET_STATE hcall L0-PowerVM to fetch guest-state-buffer holding the latest values of these counters which is then parsed and 'l0_stats' variable updated. Second helper is kvmppc_pmu_event_update() which is called from 'kvmppv_pmu' callbacks and uses kvmppc_update_l0_stats() to update 'l0_stats' and the update the 'struct perf_event's event-counter. Some minor updates to kvmppc_pmu_{add, del, read}() to remove some debug scaffolding code. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Athira Rajeev <atrajeev@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416162740.93143-7-vaibhav@linux.ibm.com
2025-04-16powerpc/kvm-hv-pmu: Implement GSB message-ops for hostwide countersVaibhav Jain
Implement and setup necessary structures to send a prepolulated Guest-State-Buffer(GSB) requesting hostwide counters to L0-PowerVM and have the returned GSB holding the values of these counters parsed. This is done via existing GSB implementation and with the newly added support of Hostwide elements in GSB. The request to L0-PowerVM to return Hostwide counters is done using a pre-allocated GSB named 'gsb_l0_stats'. To be able to populate this GSB with the needed Guest-State-Elements (GSIDs) a instance of 'struct kvmppc_gs_msg' named 'gsm_l0_stats' is introduced. The 'gsm_l0_stats' is tied to an instance of 'struct kvmppc_gs_msg_ops' named 'gsb_ops_l0_stats' which holds various callbacks to be compute the size ( hostwide_get_size() ), populate the GSB ( hostwide_fill_info() ) and refresh ( hostwide_refresh_info() ) the contents of 'l0_stats' that holds the Hostwide counters returned from L0-PowerVM. To protect these structures from simultaneous access a spinlock 'lock_l0_stats' has been introduced. The allocation and initialization of the above structures is done in newly introduced kvmppc_init_hostwide() and similarly the cleanup is performed in newly introduced kvmppc_cleanup_hostwide(). Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416162740.93143-6-vaibhav@linux.ibm.com
2025-04-16kvm powerpc/book3s-apiv2: Introduce kvm-hv specific PMUVaibhav Jain
Introduce a new PMU named 'kvm-hv' inside a new module named 'kvm-hv-pmu' to report Book3s kvm-hv specific performance counters. This will expose KVM-HV specific performance attributes to user-space via kernel's PMU infrastructure and would enableusers to monitor active kvm-hv based guests. The patch creates necessary scaffolding to for the new PMU callbacks and introduces the new kernel module name 'kvm-hv-pmu' which is built with CONFIG_KVM_BOOK3S_HV_PMU. The patch doesn't introduce any perf-events yet, which will be introduced in later patches Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416162740.93143-5-vaibhav@linux.ibm.com
2025-04-16kvm powerpc/book3s-apiv2: Add kunit tests for Hostwide GSB elementsVaibhav Jain
Update 'test-guest-state-buffer.c' to add two new KUNIT test cases for validating correctness of changes to Guest-state-buffer management infrastructure for adding support for Hostwide GSB elements. The newly introduced test test_gs_hostwide_msg() checks if the Hostwide elements can be set and parsed from a Guest-state-buffer. The second kunit test test_gs_hostwide_counters() checks if the Hostwide GSB elements can be send to the L0-PowerVM hypervisor via the H_GUEST_SET_STATE hcall and ensures that the returned guest-state-buffer has all the 5 Hostwide stat counters present. Below is the KATP test report with the newly added KUNIT tests: KTAP version 1 # Subtest: guest_state_buffer_test # module: test_guest_state_buffer 1..7 ok 1 test_creating_buffer ok 2 test_adding_element ok 3 test_gs_bitmap ok 4 test_gs_parsing ok 5 test_gs_msg ok 6 test_gs_hostwide_msg # test_gs_hostwide_counters: Guest Heap Size=0 bytes # test_gs_hostwide_counters: Guest Heap Size Max=10995367936 bytes # test_gs_hostwide_counters: Guest Page-table Size=2178304 bytes # test_gs_hostwide_counters: Guest Page-table Size Max=2147483648 bytes # test_gs_hostwide_counters: Guest Page-table Reclaim Size=0 bytes ok 7 test_gs_hostwide_counters # guest_state_buffer_test: pass:7 fail:0 skip:0 total:7 # Totals: pass:7 fail:0 skip:0 total:7 ok 1 guest_state_buffer_test Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416162740.93143-4-vaibhav@linux.ibm.com
2025-04-16kvm powerpc/book3s-apiv2: Add support for Hostwide GSB elementsVaibhav Jain
Add support for adding and parsing Hostwide elements to the Guest-state-buffer data structure used in apiv2. These elements are used to share meta-information pertaining to entire L1-Lpar and this meta-information is maintained by L0-PowerVM hypervisor. Example of this include the amount of the page-table memory currently used by L0-PowerVM for hosting the Shadow-Pagetable of all active L2-Guests. More of the are documented in kernel-documentation at [1]. The Hostwide GSB elements are currently only support with H_GUEST_SET_STATE hcall with a special flag namely 'KVMPPC_GS_FLAGS_HOST_WIDE'. The patch introduces new defs for the 5 new Hostwide GSB elements including their GSIDs as well as introduces a new class of GSB elements namely 'KVMPPC_GS_CLASS_HOSTWIDE' to indicate to GSB construction/parsing infrastructure in 'kvm/guest-state-buffer.c'. Also gs_msg_ops_vcpu_get_size(), kvmppc_gsid_type() and kvmppc_gse_{flatten,unflatten}_iden() are updated to appropriately indicate the needed size for these Hostwide GSB elements as well as how to flatten/unflatten their GSIDs so that they can be marked as available in GSB bitmap. [1] Documention/arch/powerpc/kvm-nested.rst Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250416162740.93143-3-vaibhav@linux.ibm.com
2025-04-16powerpc/boot: Check for ld-option supportMadhavan Srinivasan
Commit 579aee9fc594 ("powerpc: suppress some linker warnings in recent linker versions") enabled support to add linker option "--no-warn-rwx-segments", if the version is greater than 2.39. Similar build warning were reported recently from linker version 2.35.2. ld: warning: arch/powerpc/boot/zImage.epapr has a LOAD segment with RWX permissions ld: warning: arch/powerpc/boot/zImage.pseries has a LOAD segment with RWX permissions Fix the warning by checking for "--no-warn-rwx-segments" option support in linker to enable it, instead of checking for the version range. Fixes: 579aee9fc594 ("powerpc: suppress some linker warnings in recent linker versions") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/linuxppc-dev/61cf556c-4947-4bd6-af63-892fc0966dad@linux.ibm.com/ Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250401004218.24869-1-maddy@linux.ibm.com
2025-04-16crypto: poly1305 - centralize the shash wrappers for arch codeEric Biggers
Following the example of the crc32, crc32c, and chacha code, make the crypto subsystem register both generic and architecture-optimized poly1305 shash algorithms, both implemented on top of the appropriate library functions. This eliminates the need for every architecture to implement the same shash glue code. Note that the poly1305 shash requires that the key be prepended to the data, which differs from the library functions where the key is simply a parameter to poly1305_init(). Previously this was handled at a fairly low level, polluting the library code with shash-specific code. Reorganize things so that the shash code handles this quirk itself. Also, to register the architecture-optimized shashes only when architecture-optimized code is actually being used, add a function poly1305_is_arch_optimized() and make each arch implement it. Change each architecture's Poly1305 module_init function to arch_initcall so that the CPU feature detection is guaranteed to run before poly1305_is_arch_optimized() gets called by crypto/poly1305.c. (In cases where poly1305_is_arch_optimized() just returns true unconditionally, using arch_initcall is not strictly needed, but it's still good to be consistent across architectures.) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-16crypto: powerpc/poly1305 - implement library instead of shashEric Biggers
Currently the Power10 optimized Poly1305 is only wired up to the crypto_shash API, which makes it unavailable to users of the library API. The crypto_shash API for Poly1305 is going to change to be implemented on top of the library API, so the library API needs to be supported. And of course it's needed anyway to serve the library users. Therefore, change the Power10 optimized Poly1305 code to implement the library API instead of the crypto_shash API. Cc: Danny Tsen <dtsen@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-16crypto: powerpc - Include uaccess.h and othersHerbert Xu
The powerpc aes/ghash code was relying on pagefault_disable from being pulled in by random header files. Fix this by explicitly including uaccess.h. Also add other missing header files to prevent similar problems in future. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-15powerpc: Add check to select PPC_RADIX_BROADCAST_TLBIEMadhavan Srinivasan
Commit 3d45a3d0d2e6 ("powerpc: Define config option for processors with broadcast TLBIE") added a config option PPC_RADIX_BROADCAST_TLBIE to support processors with broadcast TLBIE. Since this option is relevant only for RADIX_MMU, add a check as a dependency to enable PPC_RADIX_BROADCAST_TLBIE in both powernv and pseries configs. This fixes the unmet config dependency warning reported WARNING: unmet direct dependencies detected for PPC_RADIX_BROADCAST_TLBIE Depends on [n]: PPC_RADIX_MMU [=n] Selected by [y]: - PPC_PSERIES [=y] && PPC64 [=y] && PPC_BOOK3S [=y] Reported-by: kernel test robot <lkp@intel.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Closes: https://lore.kernel.org/oe-kbuild-all/202504051857.jRqxM60c-lkp@intel.com/ Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250407084029.357710-1-maddy@linux.ibm.com
2025-04-15powerpc64/ftrace: fix module loading without patchable function entriesAnthony Iliopoulos
get_stubs_size assumes that there must always be at least one patchable function entry, which is not always the case (modules that export data but no code), otherwise it returns -ENOEXEC and thus the section header sh_size is set to that value. During module_memory_alloc() the size is passed to execmem_alloc() after being page-aligned and thus set to zero which will cause it to fail the allocation (and thus module loading) as __vmalloc_node_range() checks for zero-sized allocs and returns null: [ 115.466896] module_64: cast_common: doesn't contain __patchable_function_entries. [ 115.469189] ------------[ cut here ]------------ [ 115.469496] WARNING: CPU: 0 PID: 274 at mm/vmalloc.c:3778 __vmalloc_node_range_noprof+0x8b4/0x8f0 ... [ 115.478574] ---[ end trace 0000000000000000 ]--- [ 115.479545] execmem: unable to allocate memory Fix this by removing the check completely, since it is anyway not helpful to propagate this as an error upwards. Fixes: eec37961a56a ("powerpc64/ftrace: Move ftrace sequence out of line") Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20250204231821.39140-1-ailiop@suse.com
2025-04-15book3s64/radix : Align section vmemmap start address to PAGE_SIZEDonet Tom
A vmemmap altmap is a device-provided region used to provide backing storage for struct pages. For each namespace, the altmap should belong to that same namespace. If the namespaces are created unaligned, there is a chance that the section vmemmap start address could also be unaligned. If the section vmemmap start address is unaligned, the altmap page allocated from the current namespace might be used by the previous namespace also. During the free operation, since the altmap is shared between two namespaces, the previous namespace may detect that the page does not belong to its altmap and incorrectly assume that the page is a normal page. It then attempts to free the normal page, which leads to a kernel crash. Kernel attempted to read user page (18) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000018 Faulting instruction address: 0xc000000000530c7c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries CPU: 32 PID: 2104 Comm: ndctl Kdump: loaded Tainted: G W NIP: c000000000530c7c LR: c000000000530e00 CTR: 0000000000007ffe REGS: c000000015e57040 TRAP: 0300 Tainted: G W MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 84482404 CFAR: c000000000530dfc DAR: 0000000000000018 DSISR: 40000000 IRQMASK: 0 GPR00: c000000000530e00 c000000015e572e0 c000000002c5cb00 c00c000101008040 GPR04: 0000000000000000 0000000000000007 0000000000000001 000000000000001f GPR08: 0000000000000005 0000000000000000 0000000000000018 0000000000002000 GPR12: c0000000001d2fb0 c0000060de6b0080 0000000000000000 c0000060dbf90020 GPR16: c00c000101008000 0000000000000001 0000000000000000 c000000125b20f00 GPR20: 0000000000000001 0000000000000000 ffffffffffffffff c00c000101007fff GPR24: 0000000000000001 0000000000000000 0000000000000000 0000000000000000 GPR28: 0000000004040201 0000000000000001 0000000000000000 c00c000101008040 NIP [c000000000530c7c] get_pfnblock_flags_mask+0x7c/0xd0 LR [c000000000530e00] free_unref_page_prepare+0x130/0x4f0 Call Trace: free_unref_page+0x50/0x1e0 free_reserved_page+0x40/0x68 free_vmemmap_pages+0x98/0xe0 remove_pte_table+0x164/0x1e8 remove_pmd_table+0x204/0x2c8 remove_pud_table+0x1c4/0x288 remove_pagetable+0x1c8/0x310 vmemmap_free+0x24/0x50 section_deactivate+0x28c/0x2a0 __remove_pages+0x84/0x110 arch_remove_memory+0x38/0x60 memunmap_pages+0x18c/0x3d0 devm_action_release+0x30/0x50 release_nodes+0x68/0x140 devres_release_group+0x100/0x190 dax_pmem_compat_release+0x44/0x80 [dax_pmem_compat] device_for_each_child+0x8c/0x100 [dax_pmem_compat_remove+0x2c/0x50 [dax_pmem_compat] nvdimm_bus_remove+0x78/0x140 [libnvdimm] device_remove+0x70/0xd0 Another issue is that if there is no altmap, a PMD-sized vmemmap page will be allocated from RAM, regardless of the alignment of the section start address. If the section start address is not aligned to the PMD size, a VM_BUG_ON will be triggered when setting the PMD-sized page to page table. In this patch, we are aligning the section vmemmap start address to PAGE_SIZE. After alignment, the start address will not be part of the current namespace, and a normal page will be allocated for the vmemmap mapping of the current section. For the remaining sections, altmaps will be allocated. During the free operation, the normal page will be correctly freed. In the same way, a PMD_SIZE vmemmap page will be allocated only if the section start address is PMD_SIZE-aligned; otherwise, it will fall back to a PAGE-sized vmemmap allocation. Without this patch ================== NS1 start NS2 start _________________________________________________________ | NS1 | NS2 | --------------------------------------------------------- | Altmap| Altmap | .....|Altmap| Altmap | ........... | NS1 | NS1 | | NS2 | NS2 | In the above scenario, NS1 and NS2 are two namespaces. The vmemmap for NS1 comes from Altmap NS1, which belongs to NS1, and the vmemmap for NS2 comes from Altmap NS2, which belongs to NS2. The vmemmap start for NS2 is not aligned, so Altmap NS2 is shared by both NS1 and NS2. During the free operation in NS1, Altmap NS2 is not part of NS1's altmap, causing it to attempt to free an invalid page. With this patch =============== NS1 start NS2 start _________________________________________________________ | NS1 | NS2 | --------------------------------------------------------- | Altmap| Altmap | .....| Normal | Altmap | Altmap |....... | NS1 | NS1 | | Page | NS2 | NS2 | If the vmemmap start for NS2 is not aligned then we are allocating a normal page. NS1 and NS2 vmemmap will be freed correctly. Fixes: 368a0590d954 ("powerpc/book3s64/vmemmap: switch radix to use a different vmemmap handling function") Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/8f98ec2b442977c618f7256cec88eb17dde3f2b9.1741609795.git.donettom@linux.ibm.com
2025-04-15book3s64/radix: Fix compile errors when CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=nRitesh Harjani (IBM)
Fix compile errors when CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=n Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/8231763344223c193e3452eab0ae8ea966aff466.1741609795.git.donettom@linux.ibm.com
2025-04-11net: Retire DCCP socket.Kuniyuki Iwashima
DCCP was orphaned in 2021 by commit 054c4610bd05 ("MAINTAINERS: dccp: move Gerrit Renker to CREDITS"), which noted that the last maintainer had been inactive for five years. In recent years, it has become a playground for syzbot, and most changes to DCCP have been odd bug fixes triggered by syzbot. Apart from that, the only changes have been driven by treewide or networking API updates or adjustments related to TCP. Thus, in 2023, we announced we would remove DCCP in 2025 via commit b144fcaf46d4 ("dccp: Print deprecation notice."). Since then, only one individual has contacted the netdev mailing list. [0] There is ongoing research for Multipath DCCP. The repository is hosted on GitHub [1], and development is not taking place through the upstream community. While the repository is published under the GPLv2 license, the scheduling part remains proprietary, with a LICENSE file [2] stating: "This is not Open Source software." The researcher mentioned a plan to address the licensing issue, upstream the patches, and step up as a maintainer, but there has been no further communication since then. Maintaining DCCP for a decade without any real users has become a burden. Therefore, it's time to remove it. Removing DCCP will also provide significant benefits to TCP. It allows us to freely reorganize the layout of struct inet_connection_sock, which is currently shared with DCCP, and optimize it to reduce the number of cachelines accessed in the TCP fast path. Note that we keep DCCP netfilter modules as requested. [3] Link: https://lore.kernel.org/netdev/20230710182253.81446-1-kuniyu@amazon.com/T/#u #[0] Link: https://github.com/telekom/mp-dccp #[1] Link: https://github.com/telekom/mp-dccp/blob/mpdccp_v03_k5.10/net/dccp/non_gpl_scheduler/LICENSE #[2] Link: https://lore.kernel.org/netdev/Z_VQ0KlCRkqYWXa-@calendula/ #[3] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> (LSM and SELinux) Acked-by: Casey Schaufler <casey@schaufler-ca.com> Link: https://patch.msgid.link/20250410023921.11307-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09perf/arch: Record sample last_period before updating on the x86 and PowerPC ↵Mark Barnett
platforms This change alters the PowerPC and x86 driver implementations to record the last sample period before the event is updated for the next period. A common pattern in PMU driver implementations is to have a "*_event_set_period" function which takes care of updating the various period-related fields in a perf_event structure. In most cases, the drivers choose to call this function after initializing a sample data structure with perf_sample_data_init. The x86 and PowerPC drivers deviate from this, choosing to update the period before initializing the sample data. When using an event with an alternate sample period, this causes an incorrect period to be written to the sample data that gets reported to userspace. Signed-off-by: Mark Barnett <mark.barnett@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250408171530.140858-2-mark.barnett@arm.com
2025-04-08Merge tag 'crc-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC cleanups from Eric Biggers: "Finish cleaning up the CRC kconfig options by removing the remaining unnecessary prompts and an unnecessary 'default y', removing CONFIG_LIBCRC32C, and documenting all the CRC library options" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: lib/crc: remove CONFIG_LIBCRC32C lib/crc: document all the CRC library kconfig options lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF lib/crc: remove unnecessary prompt for CONFIG_CRC16 lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
2025-04-07crypto: powerpc/chacha - remove the skcipher algorithmsEric Biggers
Since crypto/chacha.c now registers chacha20-$(ARCH), xchacha20-$(ARCH), and xchacha12-$(ARCH) skcipher algorithms that use the architecture's ChaCha and HChaCha library functions, individual architectures no longer need to do the same. Therefore, remove the redundant skcipher algorithms and leave just the library functions. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-07crypto: chacha - centralize the skcipher wrappers for arch codeEric Biggers
Following the example of the crc32 and crc32c code, make the crypto subsystem register both generic and architecture-optimized chacha20, xchacha20, and xchacha12 skcipher algorithms, all implemented on top of the appropriate library functions. This eliminates the need for every architecture to implement the same skcipher glue code. To register the architecture-optimized skciphers only when architecture-optimized code is actually being used, add a function chacha_is_arch_optimized() and make each arch implement it. Change each architecture's ChaCha module_init function to arch_initcall so that the CPU feature detection is guaranteed to run before chacha_is_arch_optimized() gets called by crypto/chacha.c. In the case of s390, remove the CPU feature based module autoloading, which is no longer needed since the module just gets pulled in via function linkage. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-06Merge tag 'timers-cleanups-2025-04-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A set of final cleanups for the timer subsystem: - Convert all del_timer[_sync]() instances over to the new timer_delete[_sync]() API and remove the legacy wrappers. Conversion was done with coccinelle plus some manual fixups as coccinelle chokes on scoped_guard(). - The final cleanup of the hrtimer_init() to hrtimer_setup() conversion. This has been delayed to the end of the merge window, so that all patches which have been merged through other trees are in mainline and all new users are catched. Doing this right before rc1 ensures that new code which is merged post rc1 is not introducing new instances of the original functionality" * tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tracing/timers: Rename the hrtimer_init event to hrtimer_setup hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack() hrtimers: Rename debug_init() to debug_setup() hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper() hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns() hrtimers: Make callback function pointer private hrtimers: Merge __hrtimer_init() into __hrtimer_setup() hrtimers: Switch to use __htimer_setup() hrtimers: Delete hrtimer_init() treewide: Convert new and leftover hrtimer_init() users treewide: Switch/rename to timer_delete[_sync]()
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-04lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_TEric Biggers
All modules that need CONFIG_CRC_ITU_T already select it, so there is no need to bother users about the option. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250401221600.24878-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIFEric Biggers
All modules that need CONFIG_CRC_T10DIF already select it, so there is no need to bother users about the option. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250401221600.24878-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04lib/crc: remove unnecessary prompt for CONFIG_CRC16Eric Biggers
All modules that need CONFIG_CRC16 already select it, so there is no need to bother users about the option. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250401221600.24878-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITTEric Biggers
All modules that need CONFIG_CRC_CCITT already select it, so there is no need to bother users about the option. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250401221600.24878-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04irqdomain: Rename irq_get_default_host() to irq_get_default_domain()Jiri Slaby (SUSE)
Naming interrupt domains host is confusing at best and the irqdomain code uses both domain and host inconsistently. Therefore rename irq_get_default_host() to irq_get_default_domain(). Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-4-jirislaby@kernel.org
2025-04-04irqdomain: Rename irq_set_default_host() to irq_set_default_domain()Jiri Slaby (SUSE)
Naming interrupt domains host is confusing at best and the irqdomain code uses both domain and host inconsistently. Therefore rename irq_set_default_host() to irq_set_default_domain(). Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-3-jirislaby@kernel.org
2025-04-03Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull dcache fixes from Al Viro: "Fixes for bugs caught as part of tree-in-dcache work. Mostly dentry refcount mishandling" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: hypfs_create_cpu_files(): add missing check for hypfs_mkdir() failure qibfs: fix _another_ leak spufs: fix a leak in spufs_create_context() spufs: fix gang directory lifetimes spufs: fix a leak on spufs_new_file() failure
2025-04-02Merge tag 'powerpc-6.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull more powerpc updates from Michael Ellerman: - Remove the IBM CAPI (cxl) driver Thanks to Andrew Donnellan. * tag 'powerpc-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: docs: Fix references to IBM CAPI (cxl) removal version cxl: Remove driver
2025-04-02Merge branch 'topic/cxl' into nextMichael Ellerman
This merges in the removal of the IBM CAPI "cxl" driver.
2025-04-01Merge tag 'driver-core-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updatesk from Greg KH: "Here is the big set of driver core updates for 6.15-rc1. Lots of stuff happened this development cycle, including: - kernfs scaling changes to make it even faster thanks to rcu - bin_attribute constify work in many subsystems - faux bus minor tweaks for the rust bindings - rust binding updates for driver core, pci, and platform busses, making more functionaliy available to rust drivers. These are all due to people actually trying to use the bindings that were in 6.14. - make Rafael and Danilo full co-maintainers of the driver core codebase - other minor fixes and updates" * tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits) rust: platform: require Send for Driver trait implementers rust: pci: require Send for Driver trait implementers rust: platform: impl Send + Sync for platform::Device rust: pci: impl Send + Sync for pci::Device rust: platform: fix unrestricted &mut platform::Device rust: pci: fix unrestricted &mut pci::Device rust: device: implement device context marker rust: pci: use to_result() in enable_device_mem() MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers rust/kernel/faux: mark Registration methods inline driver core: faux: only create the device if probe() succeeds rust/faux: Add missing parent argument to Registration::new() rust/faux: Drop #[repr(transparent)] from faux::Registration rust: io: fix devres test with new io accessor functions rust: io: rename `io::Io` accessors kernfs: Move dput() outside of the RCU section. efi: rci2: mark bin_attribute as __ro_after_init rapidio: constify 'struct bin_attribute' firmware: qemu_fw_cfg: constify 'struct bin_attribute' powerpc/perf/hv-24x7: Constify 'struct bin_attribute' ...
2025-04-01Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "powerpc/crash: use generic crashkernel reservation" from Sourabh Jain changes powerpc's kexec code to use more of the generic layers. - The series "get_maintainer: report subsystem status separately" from Vlastimil Babka makes some long-requested improvements to the get_maintainer output. - The series "ucount: Simplify refcounting with rcuref_t" from Sebastian Siewior cleans up and optimizing the refcounting in the ucount code. - The series "reboot: support runtime configuration of emergency hw_protection action" from Ahmad Fatoum improves the ability for a driver to perform an emergency system shutdown or reboot. - The series "Converge on using secs_to_jiffies() part two" from Easwar Hariharan performs further migrations from msecs_to_jiffies() to secs_to_jiffies(). - The series "lib/interval_tree: add some test cases and cleanup" from Wei Yang permits more userspace testing of kernel library code, adds some more tests and performs some cleanups. - The series "hung_task: Dump the blocking task stacktrace" from Masami Hiramatsu arranges for the hung_task detector to dump the stack of the blocking task and not just that of the blocked task. - The series "resource: Split and use DEFINE_RES*() macros" from Andy Shevchenko provides some cleanups to the resource definition macros. - Plus the usual shower of singleton patches - please see the individual changelogs for details. * tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) mailmap: consolidate email addresses of Alexander Sverdlin fs/procfs: fix the comment above proc_pid_wchan() relay: use kasprintf() instead of fixed buffer formatting resource: replace open coded variant of DEFINE_RES() resource: replace open coded variants of DEFINE_RES_*_NAMED() resource: replace open coded variant of DEFINE_RES_NAMED_DESC() resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED() samples: add hung_task detector mutex blocking sample hung_task: show the blocker task if the task is hung on mutex kexec_core: accept unaccepted kexec segments' destination addresses watchdog/perf: optimize bytes copied and remove manual NUL-termination lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap() lib/interval_tree: skip the check before go to the right subtree lib/interval_tree: add test case for span iteration lib/interval_tree: add test case for interval_tree_iter_xxx() helpers lib/rbtree: add random seed lib/rbtree: split tests lib/rbtree: enable userland test suite for rbtree related data structure checkpatch: describe --min-conf-desc-length scripts/gdb/symbols: determine KASLR offset on s390 ...
2025-04-01Merge tag 'mm-stable-2025-03-30-16-52' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
2025-03-30Merge tag 'modules-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull modules updates from Petr Pavlu: - Use RCU instead of RCU-sched The mix of rcu_read_lock(), rcu_read_lock_sched() and preempt_disable() in the module code and its users has been replaced with just rcu_read_lock() - The rest of changes are smaller fixes and updates * tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (32 commits) MAINTAINERS: Update the MODULE SUPPORT section module: Remove unnecessary size argument when calling strscpy() module: Replace deprecated strncpy() with strscpy() params: Annotate struct module_param_attrs with __counted_by() bug: Use RCU instead RCU-sched to protect module_bug_list. static_call: Use RCU in all users of __module_text_address(). kprobes: Use RCU in all users of __module_text_address(). bpf: Use RCU in all users of __module_text_address(). jump_label: Use RCU in all users of __module_text_address(). jump_label: Use RCU in all users of __module_address(). x86: Use RCU in all users of __module_address(). cfi: Use RCU while invoking __module_address(). powerpc/ftrace: Use RCU in all users of __module_text_address(). LoongArch: ftrace: Use RCU in all users of __module_text_address(). LoongArch/orc: Use RCU in all users of __module_address(). arm64: module: Use RCU in all users of __module_text_address(). ARM: module: Use RCU in all users of __module_text_address(). module: Use RCU in all users of __module_text_address(). module: Use RCU in all users of __module_address(). module: Use RCU in search_module_extables(). ...
2025-03-29Merge tag 'v6.15-p1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Remove legacy compression interface - Improve scatterwalk API - Add request chaining to ahash and acomp - Add virtual address support to ahash and acomp - Add folio support to acomp - Remove NULL dst support from acomp Algorithms: - Library options are fuly hidden (selected by kernel users only) - Add Kerberos5 algorithms - Add VAES-based ctr(aes) on x86 - Ensure LZO respects output buffer length on compression - Remove obsolete SIMD fallback code path from arm/ghash-ce Drivers: - Add support for PCI device 0x1134 in ccp - Add support for rk3588's standalone TRNG in rockchip - Add Inside Secure SafeXcel EIP-93 crypto engine support in eip93 - Fix bugs in tegra uncovered by multi-threaded self-test - Fix corner cases in hisilicon/sec2 Others: - Add SG_MITER_LOCAL to sg miter - Convert ubifs, hibernate and xfrm_ipcomp from legacy API to acomp" * tag 'v6.15-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (187 commits) crypto: testmgr - Add multibuffer acomp testing crypto: acomp - Fix synchronous acomp chaining fallback crypto: testmgr - Add multibuffer hash testing crypto: hash - Fix synchronous ahash chaining fallback crypto: arm/ghash-ce - Remove SIMD fallback code path crypto: essiv - Replace memcpy() + NUL-termination with strscpy() crypto: api - Call crypto_alg_put in crypto_unregister_alg crypto: scompress - Fix incorrect stream freeing crypto: lib/chacha - remove unused arch-specific init support crypto: remove obsolete 'comp' compression API crypto: compress_null - drop obsolete 'comp' implementation crypto: cavium/zip - drop obsolete 'comp' implementation crypto: zstd - drop obsolete 'comp' implementation crypto: lzo - drop obsolete 'comp' implementation crypto: lzo-rle - drop obsolete 'comp' implementation crypto: lz4hc - drop obsolete 'comp' implementation crypto: lz4 - drop obsolete 'comp' implementation crypto: deflate - drop obsolete 'comp' implementation crypto: 842 - drop obsolete 'comp' implementation crypto: nx - Migrate to scomp API ...
2025-03-27Merge tag 'powerpc-6.15-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Madhavan Srinivasan: - Remove support for IBM Cell Blades - SMP support for microwatt platform - Support for inline static calls on PPC32 - Enable pmu selftests for power11 platform - Enable hardware trace macro (HTM) hcall support - Support for limited address mode capability - Changes to RMA size from 512 MB to 768 MB to handle fadump - Misc fixes and cleanups Thanks to Abhishek Dubey, Amit Machhiwal, Andreas Schwab, Arnd Bergmann, Athira Rajeev, Avnish Chouhan, Christophe Leroy, Disha Goel, Donet Tom, Gaurav Batra, Gautam Menghani, Hari Bathini, Kajol Jain, Kees Cook, Mahesh Salgaonkar, Michael Ellerman, Paul Mackerras, Ritesh Harjani (IBM), Sathvika Vasireddy, Segher Boessenkool, Sourabh Jain, Vaibhav Jain, and Venkat Rao Bagalkote. * tag 'powerpc-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (61 commits) powerpc/kexec: fix physical address calculation in clear_utlb_entry() crypto: powerpc: Mark ghashp8-ppc.o as an OBJECT_FILES_NON_STANDARD powerpc: Fix 'intra_function_call not a direct call' warning powerpc/perf: Fix ref-counting on the PMU 'vpa_pmu' KVM: PPC: Enable CAP_SPAPR_TCE_VFIO on pSeries KVM guests powerpc/prom_init: Fixup missing #size-cells on PowerBook6,7 powerpc/microwatt: Add SMP support powerpc: Define config option for processors with broadcast TLBIE powerpc/microwatt: Define an idle power-save function powerpc/microwatt: Device-tree updates powerpc/microwatt: Select COMMON_CLK in order to get the clock framework net: toshiba: Remove reference to PPC_IBM_CELL_BLADE net: spider_net: Remove powerpc Cell driver cpufreq: ppc_cbe: Remove powerpc Cell driver genirq: Remove IRQ_EDGE_EOI_HANDLER docs: Remove reference to removed CBE_CPUFREQ_SPU_GOVERNOR powerpc: Remove UDBG_RTAS_CONSOLE powerpc/io: Use standard barrier macros in io.c powerpc/io: Rename _insw_ns() etc. powerpc/io: Use generic raw accessors ...
2025-03-27Merge tag 'asm-generic-6.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic updates from Arnd Bergmann: "This is mainly set of cleanups of asm-generic/io.h, resolving problems with inconsistent semantics of ioread64/iowrite64 that were causing runtime and build issues. The "GENERIC_IOMAP" version that switches between inb()/outb() and readb()/writeb() style accessors is now only used on architectures that have PC-style ISA devices that are not memory mapped (x86, uml, m68k-q40 and powerpc-powernv), while alpha and parisc use a more complicated variant and everything else just maps the ioread interfaces to plan MMIO (readb/writeb etc). In addition there are two small changes from Raag Jadav to simplify the asm-generic/io.h indirect inclusions and from Jann Horn to fix a corner case with read_word_at_a_time" * tag 'asm-generic-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: rwonce: fix crash by removing READ_ONCE() for unaligned read rwonce: handle KCSAN like KASAN in read_word_at_a_time() m68k: coldfire: select PCI_IOMAP for PCI mips: export pci_iounmap() mips: fix PCI_IOBASE definition m68k/nommu: stop using GENERIC_IOMAP mips: drop GENERIC_IOMAP wrapper powerpc: asm/io.h: remove split ioread64/iowrite64 helpers parisc: stop using asm-generic/iomap.h sh: remove duplicate ioread/iowrite helpers alpha: stop using asm-generic/iomap.h io.h: drop unused headers drm/draw: include missing headers asm-generic/io.h: rework split ioread64/iowrite64 helpers
2025-03-26Merge tag 'net-next-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Continue Netlink conversions to per-namespace RTNL lock (IPv4 routing, routing rules, routing next hops, ARP ioctls) - Continue extending the use of netdev instance locks. As a driver opt-in protect queue operations and (in due course) ethtool operations with the instance lock and not RTNL lock. - Support collecting TCP timestamps (data submitted, sent, acked) in BPF, allowing for transparent (to the application) and lower overhead tracking of TCP RPC performance. - Tweak existing networking Rx zero-copy infra to support zero-copy Rx via io_uring. - Optimize MPTCP performance in single subflow mode by 29%. - Enable GRO on packets which went thru XDP CPU redirect (were queued for processing on a different CPU). Improving TCP stream performance up to 2x. - Improve performance of contended connect() by 200% by searching for an available 4-tuple under RCU rather than a spin lock. Bring an additional 229% improvement by tweaking hash distribution. - Avoid unconditionally touching sk_tsflags on RX, improving performance under UDP flood by as much as 10%. - Avoid skb_clone() dance in ping_rcv() to improve performance under ping flood. - Avoid FIB lookup in netfilter if socket is available, 20% perf win. - Rework network device creation (in-kernel) API to more clearly identify network namespaces and their roles. There are up to 4 namespace roles but we used to have just 2 netns pointer arguments, interpreted differently based on context. - Use sysfs_break_active_protection() instead of trylock to avoid deadlocks between unregistering objects and sysfs access. - Add a new sysctl and sockopt for capping max retransmit timeout in TCP. - Support masking port and DSCP in routing rule matches. - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST. - Support specifying at what time packet should be sent on AF_XDP sockets. - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users. - Add Netlink YAML spec for WiFi (nl80211) and conntrack. - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols which only need to be exported when IPv6 support is built as a module. - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to normal bridging. - Allow users to specify source port range for GENEVE tunnels. - netconsole: allow attaching kernel release, CPU ID and task name to messages as metadata Driver API: - Continue rework / fixing of Energy Efficient Ethernet (EEE) across the SW layers. Delegate the responsibilities to phylink where possible. Improve its handling in phylib. - Support symmetric OR-XOR RSS hashing algorithm. - Support tracking and preserving IRQ affinity by NAPI itself. - Support loopback mode speed selection for interface selftests. Device drivers: - Remove the IBM LCS driver for s390 - Remove the sb1000 cable modem driver - Add support for SFP module access over SMBus - Add MCTP transport driver for MCTP-over-USB - Enable XDP metadata support in multiple drivers - Ethernet high-speed NICs: - Broadcom (bnxt): - add PCIe TLP Processing Hints (TPH) support for new AMD platforms - support dumping RoCE queue state for debug - opt into instance locking - Intel (100G, ice, idpf): - ice: rework MSI-X IRQ management and distribution - ice: support for E830 devices - iavf: add support for Rx timestamping - iavf: opt into instance locking - nVidia/Mellanox: - mlx4: use page pool memory allocator for Rx - mlx5: support for one PTP device per hardware clock - mlx5: support for 200Gbps per-lane link modes - mlx5: move IPSec policy check after decryption - AMD/Solarflare: - support FW flashing via devlink - Cisco (enic): - use page pool memory allocator for Rx - enable 32, 64 byte CQEs - get max rx/tx ring size from the device - Meta (fbnic): - support flow steering and RSS configuration - report queue stats - support TCP segmentation - support IRQ coalescing - support ring size configuration - Marvell/Cavium: - support AF_XDP - Wangxun: - support for PTP clock and timestamping - Huawei (hibmcge): - checksum offload - add more statistics - Ethernet virtual: - VirtIO net: - aggressively suppress Tx completions, improve perf by 96% with 1 CPU and 55% with 2 CPUs - expose NAPI to IRQ mapping and persist NAPI settings - Google (gve): - support XDP in DQO RDA Queue Format - opt into instance locking - Microsoft vNIC: - support BIG TCP - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - cleanup Tx and Tx clock setting and other link-focused cleanups - enable SGMII and 2500BASEX mode switching for Intel platforms - support Sophgo SG2044 - Broadcom switches (b53): - support for BCM53101 - TI: - iep: add perout configuration support - icssg: support XDP - Cadence (macb): - implement BQL - Xilinx (axinet): - support dynamic IRQ moderation and changing coalescing at runtime - implement BQL - report standard stats - MediaTek: - support phylink managed EEE - Intel: - igc: don't restart the interface on every XDP program change - RealTek (r8169): - support reading registers of internal PHYs directly - increase max jumbo packet size on RTL8125/RTL8126 - Airoha: - support for RISC-V NPU packet processing unit - enable scatter-gather and support MTU up to 9kB - Tehuti (tn40xx): - support cards with TN4010 MAC and an Aquantia AQR105 PHY - Ethernet PHYs: - support for TJA1102S, TJA1121 - dp83tg720: add randomized polling intervals for link detection - dp83822: support changing the transmit amplitude voltage - support for LEDs on 88q2xxx - CAN: - canxl: support Remote Request Substitution bit access - flexcan: add S32G2/S32G3 SoC - WiFi: - remove cooked monitor support - strict mode for better AP testing - basic EPCS support - OMI RX bandwidth reduction support - batman-adv: add support for jumbo frames - WiFi drivers: - RealTek (rtw88): - support RTL8814AE and RTL8814AU - RealTek (rtw89): - switch using wiphy_lock and wiphy_work - add BB context to manipulate two PHY as preparation of MLO - improve BT-coexistence mechanism to play A2DP smoothly - Intel (iwlwifi): - add new iwlmld sub-driver for latest HW/FW combinations - MediaTek (mt76): - preparation for mt7996 Multi-Link Operation (MLO) support - Qualcomm/Atheros (ath12k): - continued work on MLO - Silabs (wfx): - Wake-on-WLAN support - Bluetooth: - add support for skb TX SND/COMPLETION timestamping - hci_core: enable buffer flow control for SCO/eSCO - coredump: log devcd dumps into the monitor - Bluetooth drivers: - intel: add support to configure TX power - nxp: handle bootloader error during cmd5 and cmd7" * tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits) unix: fix up for "apparmor: add fine grained af_unix mediation" mctp: Fix incorrect tx flow invalidation condition in mctp-i2c net: usb: asix: ax88772: Increase phy_name size net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string net: libwx: fix Tx L4 checksum net: libwx: fix Tx descriptor content for some tunnel packets atm: Fix NULL pointer dereference net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus net: phy: aquantia: add essential functions to aqr105 driver net: phy: aquantia: search for firmware-name in fwnode net: phy: aquantia: add probe function to aqr105 for firmware loading net: phy: Add swnode support to mdiobus_scan gve: add XDP DROP and PASS support for DQ gve: update XDP allocation path support RX buffer posting gve: merge packet buffer size fields gve: update GQ RX to use buf_size gve: introduce config-based allocation for XDP gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics ...
2025-03-25Merge tag 'crc-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: "Another set of improvements to the kernel's CRC (cyclic redundancy check) code: - Rework the CRC64 library functions to be directly optimized, like what I did last cycle for the CRC32 and CRC-T10DIF library functions - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ support and acceleration for crc64_be and crc64_nvme - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for crc_t10dif, crc64_be, and crc64_nvme - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they are no longer needed there - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect - Add kunit test cases for crc64_nvme and crc7 - Eliminate redundant functions for calculating the Castagnoli CRC32, settling on just crc32c() - Remove unnecessary prompts from some of the CRC kconfig options - Further optimize the x86 crc32c code" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits) x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512 lib/crc: remove unnecessary prompt for CONFIG_CRC64 lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C lib/crc: remove unnecessary prompt for CONFIG_CRC8 lib/crc: remove unnecessary prompt for CONFIG_CRC7 lib/crc: remove unnecessary prompt for CONFIG_CRC4 lib/crc7: unexport crc7_be_syndrome_table lib/crc_kunit.c: update comment in crc_benchmark() lib/crc_kunit.c: add test and benchmark for crc7_be() x86/crc32: optimize tail handling for crc32c short inputs riscv/crc64: add Zbc optimized CRC64 functions riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function riscv/crc32: reimplement the CRC32 functions using new template riscv/crc: add "template" for Zbc optimized CRC functions x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings x86/crc32: improve crc32c_arch() code generation with clang x86/crc64: implement crc64_be and crc64_nvme using new template x86/crc-t10dif: implement crc_t10dif using new template x86/crc32: implement crc32_le using new template x86/crc: add "template" for [V]PCLMULQDQ based CRC functions ...
2025-03-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Nested virtualization support for VGICv3, giving the nested hypervisor control of the VGIC hardware when running an L2 VM - Removal of 'late' nested virtualization feature register masking, making the supported feature set directly visible to userspace - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers - Paravirtual interface for discovering the set of CPU implementations where a VM may run, addressing a longstanding issue of guest CPU errata awareness in big-little systems and cross-implementation VM migration - Userspace control of the registers responsible for identifying a particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1), allowing VMs to be migrated cross-implementation - pKVM updates, including support for tracking stage-2 page table allocations in the protected hypervisor in the 'SecPageTable' stat - Fixes to vPMU, ensuring that userspace updates to the vPMU after KVM_RUN are reflected into the backing perf events LoongArch: - Remove unnecessary header include path - Assume constant PGD during VM context switch - Add perf events support for guest VM RISC-V: - Disable the kernel perf counter during configure - KVM selftests improvements for PMU - Fix warning at the time of KVM module removal x86: - Add support for aging of SPTEs without holding mmu_lock. Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs. This includes an implementation of per-rmap-entry locking; aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas locking an rmap for write requires taking both the per-rmap spinlock and the mmu_lock. Note that this decreases slightly the accuracy of accessed-page information, because changes to the SPTE outside aging might not use atomic operations even if they could race against a clear of the Accessed bit. This is deliberate because KVM and mm/ tolerate false positives/negatives for accessed information, and testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information. - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to coalesce updates when multiple pieces of vCPU state are changing, e.g. as part of a nested transition - Fix a variety of nested emulation bugs, and add VMX support for synthesizing nested VM-Exit on interception (instead of injecting #UD into L2) - Drop "support" for async page faults for protected guests that do not set SEND_ALWAYS (i.e. that only want async page faults at CPL3) - Bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. Particularly, destroy vCPUs before the MMU, despite the latter being a VM-wide operation - Add common secure TSC infrastructure for use within SNP and in the future TDX - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make sense to use the capability if the relevant registers are not available for reading or writing - Don't take kvm->lock when iterating over vCPUs in the suspend notifier to fix a largely theoretical deadlock - Use the vCPU's actual Xen PV clock information when starting the Xen timer, as the cached state in arch.hv_clock can be stale/bogus - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only accounts for kvmclock, and there's no evidence that the flag is actually supported by Xen guests - Clean up the per-vCPU "cache" of its reference pvclock, and instead only track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately expensive to compute, and rarely changes for modern setups) - Don't write to the Xen hypercall page on MSR writes that are initiated by the host (userspace or KVM) to fix a class of bugs where KVM can write to guest memory at unexpected times, e.g. during vCPU creation if userspace has set the Xen hypercall MSR index to collide with an MSR that KVM emulates - Restrict the Xen hypercall MSR index to the unofficial synthetic range to reduce the set of possible collisions with MSRs that are emulated by KVM (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside in the synthetic range) - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID entries when updating PV clocks; there is no guarantee PV clocks will be updated between TSC frequency changes and CPUID emulation, and guest reads of the TSC leaves should be rare, i.e. are not a hot path x86 (Intel): - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1 - Pass XFD_ERR as the payload when injecting #NM, as a preparatory step for upcoming FRED virtualization support - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits, both as a general cleanup and in anticipation of adding support for emulating Mode-Based Execution Control (MBEC) - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest state while KVM is in the middle of emulating nested VM-Enter - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs in anticipation of adding sanity checks for secondary exit controls (the primary field is out of bits) x86 (AMD): - Ensure the PSP driver is initialized when both the PSP and KVM modules are built-in (the initcall framework doesn't handle dependencies) - Use long-term pins when registering encrypted memory regions, so that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to excessive fragmentation - Add macros and helpers for setting GHCB return/error codes - Add support for Idle HLT interception, which elides interception if the vCPU has a pending, unmasked virtual IRQ when HLT is executed - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical address - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g. because the vCPU was "destroyed" via SNP's AP Creation hypercall - Reject SNP AP Creation if the requested SEV features for the vCPU don't match the VM's configured set of features Selftests: - Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data instead of executing code. The theory is that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters - Fix a flaw in the Intel PMU counters test where it asserts that an event is counting correctly without actually knowing what the event counts on the underlying hardware - Fix a variety of flaws, bugs, and false failures/passes dirty_log_test, and improve its coverage by collecting all dirty entries on each iteration - Fix a few minor bugs related to handling of stats FDs - Add infrastructure to make vCPU and VM stats FDs available to tests by default (open the FDs during VM/vCPU creation) - Relax an assertion on the number of HLT exits in the xAPIC IPI test when running on a CPU that supports AMD's Idle HLT (which elides interception of HLT if a virtual IRQ is pending and unmasked)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits) RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed RISC-V: KVM: Teardown riscv specific bits after kvm_exit LoongArch: KVM: Register perf callbacks for guest LoongArch: KVM: Implement arch-specific functions for guest perf LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel() LoongArch: KVM: Remove PGD saving during VM context switch LoongArch: KVM: Remove unnecessary header include path KVM: arm64: Tear down vGIC on failed vCPU creation KVM: arm64: PMU: Reload when resetting KVM: arm64: PMU: Reload when user modifies registers KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs KVM: arm64: PMU: Assume PMU presence in pmu-emul.c KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu KVM: arm64: Factor out pKVM hyp vcpu creation to separate function KVM: arm64: Initialize HCRX_EL2 traps in pKVM KVM: arm64: Factor out setting HCRX_EL2 traps into separate function KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected KVM: x86: Add infrastructure for secure TSC KVM: x86: Push down setting vcpu.arch.user_set_tsc ...
2025-03-25Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "Nothing major this time around. Apart from the usual perf/PMU updates, some page table cleanups, the notable features are average CPU frequency based on the AMUv1 counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in the uaccess routines. Perf and PMUs: - Support for the 'Rainier' CPU PMU from Arm - Preparatory driver changes and cleanups that pave the way for BRBE support - Support for partial virtualisation of the Apple-M1 PMU - Support for the second event filter in Arm CSPMU designs - Minor fixes and cleanups (CMN and DWC PMUs) - Enable EL2 requirements for FEAT_PMUv3p9 Power, CPU topology: - Support for AMUv1-based average CPU frequency - Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It adds a generic topology_is_primary_thread() function overridden by x86 and powerpc New(ish) features: - MOPS (memcpy/memset) support for the uaccess routines Security/confidential compute: - Fix the DMA address for devices used in Realms with Arm CCA. The CCA architecture uses the address bit to differentiate between shared and private addresses - Spectre-BHB: assume CPUs Linux doesn't know about vulnerable by default Memory management clean-ups: - Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs - Some minor page table accessor clean-ups - PIE/POE (permission indirection/overlay) helpers clean-up Kselftests: - MTE: skip hugetlb tests if MTE is not supported on such mappings and user correct naming for sync/async tag checking modes Miscellaneous: - Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people request) - Sysreg updates for new register fields - CPU type info for some Qualcomm Kryo cores" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits) arm64: mm: Don't use %pK through printk perf/arm_cspmu: Fix missing io.h include arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists arm64: cputype: Add MIDR_CORTEX_A76AE arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list arm64/sysreg: Enforce whole word match for open/close tokens arm64/sysreg: Fix unbalanced closing block arm64: Kconfig: Enable HOTPLUG_SMT arm64: topology: Support SMT control on ACPI based system arch_topology: Support SMT control for OF based system cpu/SMT: Provide a default topology_is_primary_thread() arm64/mm: Define PTDESC_ORDER perf/arm_cspmu: Add PMEVFILT2R support perf/arm_cspmu: Generalise event filtering perf/arm_cspmu: Move register definitons to header arm64/kernel: Always use level 2 or higher for early mappings arm64/mm: Drop PXD_TABLE_BIT arm64/mm: Check pmd_table() in pmd_trans_huge() ...
2025-03-25Merge tag 'timers-vdso-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull VDSO infrastructure updates from Thomas Gleixner: - Consolidate the VDSO storage The VDSO data storage and data layout has been largely architecture specific for historical reasons. That increases the maintenance effort and causes inconsistencies over and over. There is no real technical reason for architecture specific layouts and implementations. The architecture specific details can easily be integrated into a generic layout, which also reduces the amount of duplicated code for managing the mappings. Convert all architectures over to a unified layout and common mapping infrastructure. This splits the VDSO data layout into subsystem specific blocks, timekeeping, random and architecture parts, which provides a better structure and allows to improve and update the functionalities without conflict and interaction. - Rework the timekeeping data storage The current implementation is designed for exposing system timekeeping accessors, which was good enough at the time when it was designed. PTP and Time Sensitive Networking (TSN) change that as there are requirements to expose independent PTP clocks, which are not related to system timekeeping. Replace the monolithic data storage by a structured layout, which allows to add support for independent PTP clocks on top while reusing both the data structures and the time accessor implementations. * tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits) sparc/vdso: Always reject undefined references during linking x86/vdso: Always reject undefined references during linking vdso: Rework struct vdso_time_data and introduce struct vdso_clock vdso: Move architecture related data before basetime data powerpc/vdso: Prepare introduction of struct vdso_clock arm64/vdso: Prepare introduction of struct vdso_clock x86/vdso: Prepare introduction of struct vdso_clock time/namespace: Prepare introduction of struct vdso_clock vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct vdso/vsyscall: Prepare introduction of struct vdso_clock vdso/gettimeofday: Prepare helper functions for introduction of struct vdso_clock vdso/gettimeofday: Prepare do_coarse_timens() for introduction of struct vdso_clock vdso/gettimeofday: Prepare do_coarse() for introduction of struct vdso_clock vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock vdso/gettimeofday: Prepare do_hres() for introduction of struct vdso_clock vdso/gettimeofday: Prepare introduction of struct vdso_clock vdso/helpers: Prepare introduction of struct vdso_clock vdso/datapage: Define vdso_clock to prepare for multiple PTP clocks vdso: Make vdso_time_data cacheline aligned arm64: Make asm/cache.h compatible with vDSO ...