path: root/arch/s390
2025-03-24  Merge tag 'execve-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux  [Linus Torvalds]
Pull execve updates from Kees Cook:
 - elf: Define and use note name macros (Akihiko Odaki)
 - elf: add remaining SHF_ flag macros (Timur Tabi)
 - binfmt: Remove loader from linux_binprm struct (Yonatan Goldschmidt)
 - binfmt_elf_fdpic: fix variable set but not used warning (sunliming)

* tag 'execve-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_elf_fdpic: fix variable set but not used warning
  elf: add remaining SHF_ flag macros
  binfmt: Remove loader from linux_binprm struct
  crash: Remove KEXEC_CORE_NOTE_NAME
  s390/crash: Use note name macros
  crash: Use note name macros
  powerpc/crash: Use note name macros
  binfmt_elf: Use note name macros
  elf: Define note name macros
2025-03-24  Merge tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  [Linus Torvalds]
Pull vfs mount updates from Christian Brauner:

 - Mount notifications

   The day has come where we finally provide a new api to listen for mount topology changes outside of /proc/<pid>/mountinfo. A mount namespace file descriptor can be supplied and registered with fanotify to listen for mount topology changes. Currently notifications for mount, umount and moving mounts are generated. The generated notification record contains the unique mount id of the mount. The listmount() and statmount() api can be used to query detailed information about the mount using the received unique mount id. This allows userspace to figure out exactly how the mount topology changed without having to generate diffs of /proc/<pid>/mountinfo in userspace.

 - Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api

 - Support detached mounts in overlayfs

   Since last cycle we support specifying overlayfs layers via file descriptors. However, we don't allow detached mounts, which means userspace cannot use file descriptors received via open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach them to a mount namespace via move_mount() first. This is cumbersome and means they have to undo mounts via umount(). Allow them to directly use detached mounts.

 - Allow to retrieve idmappings with statmount

   Currently it isn't possible to figure out what idmapping has been attached to an idmapped mount. Add an extension to statmount() which allows reading the idmapping from the mount.

 - Allow creating idmapped mounts from mounts that are already idmapped

   So far it isn't possible to allow the creation of idmapped mounts from already idmapped mounts as this has significant lifetime implications. Make the creation of idmapped mounts atomic by allowing struct mount_attr to be passed together with the open_tree_attr() system call, which solves these issues without complicating VFS lookup in any way. The system call has in general the benefit that creating a detached mount and applying mount attributes to it becomes an atomic operation for userspace.

 - Add a way to query statmount() for supported options

   Allow userspace to query which mount information can be retrieved through statmount().

 - Allow superblock owners to force unmount

* tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits)
  umount: Allow superblock owners to force umount
  selftests: add tests for mount notification
  selinux: add FILE__WATCH_MOUNTNS
  samples/vfs: fix printf format string for size_t
  fs: allow changing idmappings
  fs: add kflags member to struct mount_kattr
  fs: add open_tree_attr()
  fs: add copy_mount_setattr() helper
  fs: add vfs_open_tree() helper
  statmount: add a new supported_mask field
  samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP
  selftests: add tests for using detached mount with overlayfs
  samples/vfs: check whether flag was raised
  statmount: allow to retrieve idmappings
  uidgid: add map_id_range_up()
  fs: allow detached mounts in clone_private_mount()
  selftests/overlayfs: test specifying layers as O_PATH file descriptors
  fs: support O_PATH fds with FSCONFIG_SET_FD
  vfs: add notifications for mount attach and detach
  fanotify: notify on mount attach and detach
  ...
2025-03-21  s390/pci: Support mmap() of PCI resources except for ISM devices  [Niklas Schnelle]
So far s390 does not allow mmap() of PCI resources to user-space via the usual mechanisms, though it does use it for RDMA. For the PCI sysfs resource files and /proc/bus/pci it defines neither HAVE_PCI_MMAP nor ARCH_GENERIC_PCI_MMAP_RESOURCE. For vfio-pci s390 previously relied on disabled VFIO_PCI_MMAP and now relies on setting pdev->non_mappable_bars for all devices. This is partly because access to mapped PCI resources from user-space requires special PCI load/store memory-I/O (MIO) instructions, or the special MMIO syscalls when these are not available. Still, such access is possible and useful not just for RDMA, in fact not being able to mmap() PCI resources has previously caused extra work when testing devices. One thing that doesn't work with PCI resources mapped to user-space though is the s390 specific virtual ISM device. Not only because the BAR size of 256 TiB prevents mapping the whole BAR but also because access requires use of the legacy PCI instructions which are not accessible to user-space on systems with the newer MIO PCI instructions. Now with the pdev->non_mappable_bars flag ISM can be excluded from mapping its resources while making this functionality available for all other PCI devices. To this end introduce a minimal implementation of PCI_QUIRKS and use that to set pdev->non_mappable_bars for ISM devices only. Then also set ARCH_GENERIC_PCI_MMAP_RESOURCE to take advantage of the generic implementation of pci_mmap_resource_range() enabling only the newer sysfs mmap() interface. This follows the recommendation in Documentation/PCI/sysfs-pci.rst. Link: https://lore.kernel.org/r/20250226-vfio_pci_mmap-v7-3-c5c0f1d26efd@linux.ibm.com Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-21  s390/pci: Introduce pdev->non_mappable_bars and replace VFIO_PCI_MMAP  [Niklas Schnelle]
The ability to map PCI resources to user-space is controlled by global defines. For vfio there is VFIO_PCI_MMAP which is only disabled on s390 and controls mapping of PCI resources using vfio-pci with a fallback option via the pread()/pwrite() interface. For the PCI core there is ARCH_GENERIC_PCI_MMAP_RESOURCE which enables a generic implementation for mapping PCI resources plus the newer sysfs interface. Then there is HAVE_PCI_MMAP which can be used with custom definitions of pci_mmap_resource_range() and the historical /proc/bus/pci interface. Both mechanisms are all or nothing. For s390 mapping PCI resources is possible and useful for testing and certain applications such as QEMU's vfio-pci based user-space NVMe driver. For certain devices, however, access to PCI resources via mappings to user-space is not possible and these must be excluded from the general PCI resource mapping mechanisms. Introduce pdev->non_mappable_bars to indicate that a PCI device's BARs cannot be accessed via mappings to user-space. In the future this enables per-device restrictions of PCI resource mapping. For now, set this flag for all PCI devices on s390 in line with the existing, general disable of PCI resource mapping. As s390 is the only user of the VFIO_PCI_MMAP Kconfig option this can already be replaced with a check of this new flag. Also add similar checks in the other code protected by HAVE_PCI_MMAP and ARCH_GENERIC_PCI_MMAP in preparation for enabling these for supported devices. Link: https://lore.kernel.org/lkml/20250212132808.08dcf03c.alex.williamson@redhat.com/ Link: https://lore.kernel.org/r/20250226-vfio_pci_mmap-v7-2-c5c0f1d26efd@linux.ibm.com Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-21  s390/pci: Fix s390_mmio_read/write syscall page fault handling  [Niklas Schnelle]
The s390 MMIO syscalls, when using the classic PCI instructions, do not cause a page fault when follow_pfnmap_start() fails due to the page not being present. Besides being a general deficiency this breaks vfio-pci's mmap() handling once VFIO_PCI_MMAP gets enabled as this lazily maps on first access. Fix this by following a failed follow_pfnmap_start() with fixup_user_fault() and retrying the follow_pfnmap_start(). Also fix a VM_READ vs VM_WRITE mixup in the read syscall. Link: https://lore.kernel.org/r/20250226-vfio_pci_mmap-v7-1-c5c0f1d26efd@linux.ibm.com Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
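A minimal sketch of the retry described above (hedged: follow_pfnmap_start() and fixup_user_fault() are the upstream interfaces, but the local variable names and surrounding locking are illustrative only, not the literal patch):

  ret = follow_pfnmap_start(&args);
  if (ret) {
          /*
           * The VMA may be populated lazily (e.g. vfio-pci once mmap()
           * support is enabled): fault the page in, then retry the lookup.
           */
          ret = fixup_user_fault(current->mm, mmio_addr,
                                 write ? FAULT_FLAG_WRITE : 0, NULL);
          if (ret)
                  return ret;
          ret = follow_pfnmap_start(&args);
          if (ret)
                  return ret;
  }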
2025-03-21  crypto: lib/chacha - remove unused arch-specific init support  [Eric Biggers]
All implementations of chacha_init_arch() just call chacha_init_generic(), so it is pointless. Just delete it, and replace chacha_init() with what was previously chacha_init_generic(). Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-21  crypto: scatterwalk - simplify map and unmap calling convention  [Eric Biggers]
Now that the address returned by scatterwalk_map() is always being stored into the same struct scatter_walk that is passed in, make scatterwalk_map() do so itself and return void. Similarly, now that scatterwalk_unmap() is always being passed the address field within a struct scatter_walk, make scatterwalk_unmap() take a pointer to struct scatter_walk instead of the address directly. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-20  Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD  [Paolo Bonzini]
The immediate issue being fixed here is an nVMX bug where KVM fails to detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI). However, checking for a pending interrupt accesses the legacy PIC, and x86's kvm_arch_destroy_vm() currently frees the PIC before destroying vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results in a NULL pointer deref; that's a prerequisite for the nVMX fix. The remaining patches attempt to bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. E.g. KVM currently unloads each vCPU's MMUs in a separate operation from destroying vCPUs, all because when guest SMP support was added, KVM had a kludgy MMU teardown flow that broke when a VM had more than one vCPU. And that oddity lived on, for 18 years... Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-20  Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next  [Joerg Roedel]
2025-03-18  s390: Use inline qualifier for all EX_TABLE and ALTERNATIVE inline assemblies  [Heiko Carstens]
Use asm_inline for all inline assemblies which make use of the EX_TABLE or ALTERNATIVE macros. These macros expand to many lines and the compiler assumes the number of lines within an inline assembly is the same as the number of instructions within an inline assembly. This has an effect on inlining and loop unrolling decisions. In order to avoid incorrect assumptions use asm_inline, which tells the compiler that an inline assembly has the smallest possible size. In order to avoid confusion when asm_inline should be used or not, since a couple of inline assemblies are quite large: the rule is to always use asm_inline whenever the EX_TABLE or ALTERNATIVE macro is used. In specific cases there may be reasons to not follow this guideline, but that should be documented with the corresponding code. Using the inline qualifier everywhere has only a small effect on the kernel image size: add/remove: 0/10 grow/shrink: 19/8 up/down: 1492/-1858 (-366) The only location where this seems to matter is load_unaligned_zeropad() from word-at-a-time.h where the compiler inlines more functions within the dcache code, which is indeed code where performance matters. Suggested-by: Juergen Christ <jchrist@linux.ibm.com> Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/kfence: Split kfence pool into 4k mappings in arch_kfence_init_pool()  [Vasily Gorbik]
Since commit d08d4e7cd6bf ("s390/mm: use full 4KB page for 2KB PTE"), there is no longer any reason to avoid splitting the kfence pool into 4k mappings in arch_kfence_init_pool(). Remove the architecture-specific kfence_split_mapping(). Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/ptrace: Avoid KASAN false positives in regs_get_kernel_stack_nth()  [Vasily Gorbik]
With recent ftrace changes, argument tracing has been added to the function tracer. As a result, ftrace opportunistically reads the first FTRACE_REGS_MAX_ARGS (i.e., 6) registers. On s390, only five arguments are passed in registers, and the sixth is read from the stack. If a function has fewer than 6 arguments, the following KASAN report may be observed:

  BUG: KASAN: stack-out-of-bounds in regs_get_kernel_stack_nth+0xa8/0xb0
  Read of size 8 at addr 00007f7fe066fdb8 by task swapper/31/0
  CPU: 31 UID: 0 PID: 0 Comm: swapper/31 Not tainted 6.14.0-rc4-00006-g76fe0337c219 #16
  Hardware name: IBM 3931 A01 704 (KVM/Linux)
  Call Trace:
   [<00007fffe0147224>] dump_stack_lvl+0x104/0x168
   [<00007fffe011381c>] print_address_description.constprop.0+0x34/0x338
   [<00007fffe0113b64>] print_report+0x44/0x138
   [<00007fffe0ad9422>] kasan_report+0xc2/0x180
   [<00007fffe0159ff8>] regs_get_kernel_stack_nth+0xa8/0xb0
   [<00007fffe05ebeda>] trace_function+0x23a/0x4d0
   [<00007fffe0615d32>] irqsoff_tracer_call+0xd2/0x110
   [<00007fffe2b4e34c>] ftrace_common+0x1c/0x40
   [<00007fffe0150826>] arch_cpu_idle_enter+0x6/0x10
   [<00007fffe035a1c8>] do_idle+0x168/0x2e0
   [<00007fffe035a9d0>] cpu_startup_entry+0x90/0xb0
   [<00007fffe017d25a>] smp_start_secondary+0x3da/0x4e0
   [<00007fffe2b4e20a>] restart_int_handler+0x72/0x88
  no locks held by swapper/31/0.
  The buggy address belongs to stack of task swapper/31/0
   and is located at offset 0 in frame: do_idle+0x0/0x2e0
  This frame has 1 object:
   [32, 40) '__mask'
  The buggy address belongs to the virtual mapping at
   [00007f7fe0660000, 00007f7fe0671000) created by: dup_task_struct+0x66/0x4e0
  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x80f23
  flags: 0x3ffff00000000000(node=0|zone=1|lastcpupid=0x1ffff)
  raw: 3ffff00000000000 0000000000000000 0000000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
  page dumped because: kasan: bad access detected
  Memory state around the buggy address:
   00007f7fe066fc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   00007f7fe066fd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >00007f7fe066fd80: 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f3 f3 f3 00
                                           ^
   00007f7fe066fe00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   00007f7fe066fe80: 00 f1 f1 f1 f1 00 f2 f2 f2 00 00 f3 f3 00 00 00

The function regs_get_kernel_stack_nth() verifies that the requested argument is located on the stack, making it safe to read even if it is not actually present. Make use of the READ_ONCE_NOCHECK() helper to silence KASAN reports in this case. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
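A minimal sketch of the resulting access pattern (not the literal s390 implementation; the helpers kernel_stack_pointer() and regs_within_kernel_stack() are assumed from the surrounding ptrace code):

  unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs, unsigned int n)
  {
          unsigned long addr;

          addr = kernel_stack_pointer(regs) + n * sizeof(unsigned long);
          if (!regs_within_kernel_stack(regs, addr))
                  return 0;
          /*
           * The slot lies within the kernel stack but may not hold a real
           * argument; READ_ONCE_NOCHECK() reads it without tripping KASAN.
           */
          return READ_ONCE_NOCHECK(*(unsigned long *)addr);
  }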
2025-03-18  s390/boot: Ignore vmlinux.map  [WangYuli]
When building with CONFIG_VMLINUX_MAP=y, a decompressor vmlinux.map file is generated in the boot directory. Add this file to .gitignore to ensure Git does not track it. Signed-off-by: WangYuli <wangyuli@uniontech.com> Link: https://lore.kernel.org/r/F884C733016D6715+20250311030824.675683-1-wangyuli@uniontech.com Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/sysctl: Remove "vm/allocate_pgste" sysctl  [Heiko Carstens]
Remove the no longer needed "vm/allocate_pgste" sysctl. It has no effect anymore. However, this is a user space visible change. It shouldn't cause any problems; if it does, this needs to be partially reverted. Note that some distributions set vm/allocate_pgste=1 in one of the various sysctl configuration files. Besides a warning about the (now) non-existent procfs file this doesn't cause any problems. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390: Remove 2k vs 4k page table leftovers  [Heiko Carstens]
Since commit d08d4e7cd6bf ("s390/mm: use full 4KB page for 2KB PTE") 4k page tables are always allocated, however there is still some (now) obsolete code left which deals with switching from 2k to 4k page tables for qemu/kvm processes. Remove the no longer needed code. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/tlb: Use mm_has_pgste() instead of mm_alloc_pgste()  [Heiko Carstens]
An mm has pgstes only after s390_enable_sie() has been called, while mm_alloc_pgste() may be always true (e.g. via sysctl setting). Limit the calls to gmap_unlink() in pte_free_tlb() to those cases where there might be something to unlink. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
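A sketch of the resulting check in pte_free_tlb() (only the gmap_unlink() gating is shown; the rest of the function is omitted and the surrounding details are assumptions):

  /* Only mms that went through s390_enable_sie() actually have pgstes. */
  if (mm_has_pgste(tlb->mm))
          gmap_unlink(tlb->mm, (unsigned long *)pte, address);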
2025-03-18  s390/lowcore: Use lghi instead of llilh to clear register  [Heiko Carstens]
lghi is the fastest way to clear a register. Use that instead of llilh. Suggested-by: Juergen Christ <jchrist@linux.ibm.com> Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/syscall: Merge __do_syscall() and do_syscall()  [Heiko Carstens]
The compiler inlines do_syscall() into __do_syscall(). Therefore do this in C code as well, since this makes the code easier to understand. Also adjust and add various unlikely() and likely() annotations. Furthermore this allows to replace the separate exit_to_user_mode() and syscall_exit_to_user_mode_work() calls with a combined syscall_exit_to_user_mode() call which results in slightly better code. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/spinlock: Implement SPINLOCK_LOCKVAL with inline assembly  [Heiko Carstens]
Implement SPINLOCK_LOCKVAL with an inline assembly, which makes use of the ALTERNATIVE macro, to read spinlock_lockval from lowcore. Provide an alternative instruction with a different offset in case lowcore is relocated. This replaces sequences of two instructions with one instruction.

Before:
  10602a: a7 78 00 00          lhi   %r7,0
  10602e: a5 8e 00 00          llilh %r8,0
  106032: 58 d0 83 ac          l     %r13,940(%r8)
  106036: ba 7d b5 80          cs    %r7,%r13,1408(%r11)

After:
  10602a: a7 88 00 00          lhi   %r8,0
  10602e: e3 70 03 ac 00 58    ly    %r7,940
  106034: ba 87 b5 80          cs    %r8,%r7,1408(%r11)

Kernel image size change: add/remove: 756/750 grow/shrink: 646/3435 up/down: 30778/-46326 (-15548)

Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/smp: Implement raw_smp_processor_id() with inline assembly  [Heiko Carstens]
Implement raw_smp_processor_id() with an inline assembly, which makes use of the ALTERNATIVE macro, to read cpu_nr from lowcore. Provide an alternative instruction with a different offset in case lowcore is relocated. This replaces sequences of two instructions with one instruction.

Before:
  1000b6: a5 1e 00 00          llilh %r1,0
  1000ba: 58 20 13 a0          l     %r2,928(%r1)

After:
  1000b6: e3 20 03 a0 00 58    ly    %r2,928

Kernel image size change: add/remove: 753/755 grow/shrink: 230/1510 up/down: 30538/-35832 (-5294)

Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/current: Implement current with inline assembly  [Heiko Carstens]
Implement current with an inline assembly, which makes use of the ALTERNATIVE macro, to read current from lowcore. Provide an alternative instruction with a different offset in case lowcore is relocated. This replaces sequences of two instructions with one instruction.

Before:
  100076: a5 1e 00 00          llilh %r1,0
  10007a: e3 40 13 40 00 04    lg    %r4,832(%r1)

After:
  100076: e3 10 03 40 00 04    lg    %r1,832

Kernel image size change: add/remove: 3/17 grow/shrink: 166/2204 up/down: 7122/-24594 (-17472)

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390/lowcore: Use inline qualifier for get_lowcore() inline assembly  [Heiko Carstens]
Use asm_inline to let the compiler know that the get_lowcore() inline assembly has the smallest possible size. The ALTERNATIVE construct is used to generate a single instruction, however the macro expands to multiple lines. GCC uses the number of lines of an inline assembly to count the number of instructions within an inline assembly, which then has an effect on inlining decisions. In order to avoid incorrect assumptions use asm_inline. The result is that more functions are inlined, which results in a small growth of the kernel image: add/remove: 59/480 grow/shrink: 854/647 up/down: 168780/-162394 (6386) Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18  s390: Move s390 sysctls into their own file under arch/s390  [joel granados]
Move s390 sysctls (spin_retry and userprocess_debug) into their own files under arch/s390. Create two new sysctl tables (s390_{fault,spin}_sysctl_table) which will be initialized with an arch_initcall, placing them after their original place in proc_root_init. This is part of a greater effort to move ctl tables into their respective subsystems, which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: joel granados <joel.granados@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20250306-jag-mv_ctltables-v2-6-71b243c8d3f8@kernel.org Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
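As a rough illustration of the pattern (the table and function names below are assumptions, not the literal patch), such a per-architecture table registered from an arch_initcall could look like:

  static struct ctl_table s390_spin_sysctl_table[] = {
          {
                  .procname     = "spin_retry",
                  .data         = &spin_retry,
                  .maxlen       = sizeof(int),
                  .mode         = 0644,
                  .proc_handler = proc_dointvec,
          },
  };

  static int __init init_s390_spin_sysctl(void)
  {
          register_sysctl_init("kernel", s390_spin_sysctl_table);
          return 0;
  }
  arch_initcall(init_s390_spin_sysctl);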
2025-03-17  arch, mm: make releasing of memory to page allocator more explicit  [Mike Rapoport (Microsoft)]
The point where the memory is released from memblock to the buddy allocator is hidden inside arch-specific mem_init()s and the call to memblock_free_all() is needlessly duplicated in every architecture. After introduction of the arch_mm_preinit() hook, the mem_init() implementation on many architectures only contains the call to memblock_free_all(). Pull the memblock_free_all() call into mm_core_init() and drop mem_init() on relevant architectures to make it more explicit where the free memory is released from memblock to the buddy allocator and to reduce code duplication in architecture specific code. Link: https://lkml.kernel.org/r/20250313135003.836600-14-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17  arch, mm: introduce arch_mm_preinit  [Mike Rapoport (Microsoft)]
Currently, implementation of mem_init() in every architecture consists of one or more of the following: * initializations that must run before page allocator is active, for instance swiotlb_init() * a call to memblock_free_all() to release all the memory to the buddy allocator * initializations that must run after page allocator is ready and there is no arch-specific hook other than mem_init() for that, like for example register_page_bootmem_info() in x86 and sparc64 or simple setting of mem_init_done = 1 in several architectures * a bunch of semi-related stuff that apparently had no better place to live, for example a ton of BUILD_BUG_ON()s in parisc. Introduce arch_mm_preinit() that will be the first thing called from mm_core_init(). On architectures that have initializations that must happen before the page allocator is ready, move those into arch_mm_preinit() along with the code that does not depend on ordering with page allocator setup. On several architectures this results in reduction of mem_init() to a single call to memblock_free_all() that allows its consolidation next. Link: https://lkml.kernel.org/r/20250313135003.836600-13-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17  arch, mm: set high_memory in free_area_init()  [Mike Rapoport (Microsoft)]
high_memory defines upper bound on the directly mapped memory. This bound is defined by the beginning of ZONE_HIGHMEM when a system has high memory and by the end of memory otherwise. All this is known to generic memory management initialization code that can set high_memory while initializing core mm structures. Add a generic calculation of high_memory to free_area_init() and remove per-architecture calculation except for the architectures that set and use high_memory earlier than that. Link: https://lkml.kernel.org/r/20250313135003.836600-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17  arch, mm: set max_mapnr when allocating memory map for FLATMEM  [Mike Rapoport (Microsoft)]
max_mapnr is essentially the size of the memory map for systems that use FLATMEM. There is no reason to calculate it in each and every architecture when it's anyway calculated in alloc_node_mem_map(). Drop setting of max_mapnr from architecture code and set it once in alloc_node_mem_map(). While on it, move definition of mem_map and max_mapnr to mm/mm_init.c so there won't be two copies for MMU and !MMU variants. Link: https://lkml.kernel.org/r/20250313135003.836600-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Tested-by: Mark Brown <broonie@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17  s390: make setup_zero_pages() use memblock  [Mike Rapoport (Microsoft)]
Allocating the zero pages from memblock is simpler because the memory is already reserved. This will also help with pulling out memblock_free_all() to the generic code and reducing code duplication in arch::mem_init(). Link: https://lkml.kernel.org/r/20250313135003.836600-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren (csky) <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Richard Weinberger <richard@nod.at> Cc: Russel King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
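A rough sketch of the allocation described above (order computation and cache handling omitted; the variable names follow the existing s390 code but are assumptions here):

  void *p;

  /* The zero pages are needed before the buddy allocator is up. */
  p = memblock_alloc(PAGE_SIZE << order, PAGE_SIZE << order);
  if (!p)
          panic("Out of memory");
  empty_zero_page = (unsigned long)p;
  zero_page_mask = ((PAGE_SIZE << order) - 1) & PAGE_MASK;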
2025-03-17  hypfs_create_cpu_files(): add missing check for hypfs_mkdir() failure  [Al Viro]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-03-17  s390: Rely on generic printing of preemption model  [Sebastian Andrzej Siewior]
die() later invokes show_regs() -> show_regs_print_info(), which prints the current preemption model. Remove it from the initial line. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20250314160810.2373416-7-bigeasy@linutronix.de
2025-03-17  perf: Supply task information to sched_task()  [Kan Liang]
To save/restore LBR call stack data in system-wide mode, the task_struct information is required. Extend the parameters of sched_task() to supply task_struct information. When schedule in, the LBR call stack data for new task will be restored. When schedule out, the LBR call stack data for old task will be saved. Only need to pass the required task_struct information. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com
2025-03-17  KVM: s390: Don't use %pK through debug printing  [Thomas Weißschuh]
Restricted pointers ("%pK") are only meant to be used when directly printing to a file from task context. Otherwise it can unintentionally expose security sensitive, raw pointer values. Use regular pointer formatting instead. Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Link: https://lore.kernel.org/r/20250217-restricted-pointers-s390-v1-2-0e4ace75d8aa@linutronix.de Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20250217-restricted-pointers-s390-v1-2-0e4ace75d8aa@linutronix.de>
2025-03-17  KVM: s390: Don't use %pK through tracepoints  [Thomas Weißschuh]
Restricted pointers ("%pK") are not meant to be used through TP_format(). It can unintentionally expose security sensitive, raw pointer values. Use regular pointer formatting instead. Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Link: https://lore.kernel.org/r/20250217-restricted-pointers-s390-v1-1-0e4ace75d8aa@linutronix.de Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20250217-restricted-pointers-s390-v1-1-0e4ace75d8aa@linutronix.de>
2025-03-17  mm: rename GENERIC_PTDUMP and PTDUMP_CORE  [Anshuman Khandual]
Platforms subscribe into the generic ptdump implementation via GENERIC_PTDUMP, but generic ptdump gets enabled via PTDUMP_CORE. This combination of configs is confusing, as they sound very similar and do not differentiate between a platform's feature subscription and feature enablement for ptdump. Rename the configs to ARCH_HAS_PTDUMP and PTDUMP, making them clearer and improving readability. Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/cma: introduce cma_intersects function  [Frank van der Linden]
Now that CMA areas can have multiple physical ranges, code can't assume a CMA struct represents a base_pfn plus a size, as returned from cma_get_base. Most cases are ok though, since they all explicitly refer to CMA areas that were created using existing interfaces (cma_declare_contiguous_nid or cma_init_reserved_mem), which guarantees they have just one physical range. An exception is the s390 code, which walks all CMA ranges to see if they intersect with a range of memory that is about to be hotremoved. So, in the future, it might run in to multi-range areas. To keep this check working, define a cma_intersects function. This just checks if a physaddr range intersects any of the ranges. Use it in the s390 check. Link: https://lkml.kernel.org/r/20250228182928.2645936-4-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm/ioremap: pass pgprot_t to ioremap_prot() instead of unsigned long  [Ryan Roberts]
ioremap_prot() currently accepts pgprot_val parameter as an unsigned long, thus implicitly assuming that pgprot_val and pgprot_t could never be bigger than unsigned long. But this assumption soon will not be true on arm64 when using D128 pgtables. In 128 bit page table configuration, unsigned long is 64 bit, but pgprot_t is 128 bit. Passing platform abstracted pgprot_t argument is better as compared to size based data types. Let's change the parameter to directly pass pgprot_t like another similar helper generic_ioremap_prot(). Without this change in place, D128 configuration does not work on arm64 as the top 64 bits gets silently stripped when passing the protection value to this function. Link: https://lkml.kernel.org/r/20250218101954.415331-1-anshuman.khandual@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
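The interface change boils down to the following prototype change (a sketch based on the description above; parameter names are illustrative):

  /* Before: the protection value is squeezed through an unsigned long. */
  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
                             unsigned long prot_val);

  /* After: the abstracted pgprot_t is passed directly. */
  void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
                             pgprot_t prot);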
2025-03-16  mm: zbud: remove zbud  [Yosry Ahmed]
The zbud compressed pages allocator is rarely used; most users use zsmalloc. zbud consumes much more memory (only stores 1 or 2 compressed pages per physical page). The only advantage of zbud is a marginal performance improvement that by no means justifies the memory overhead. Historically, zsmalloc had significantly worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. In a kernel build test on tmpfs in a limited cgroup, zbud took 2-3% less time than zsmalloc, but at the cost of using ~32% more memory (1.5G vs 1.13G). The tradeoff does not make sense for zbud in any practical scenario. The only alleged advantage of zbud is not having the dependency on CONFIG_MMU, but CONFIG_SWAP already depends on CONFIG_MMU anyway, and zbud is only used by zswap. Remove zbud after z3fold's removal, leaving zsmalloc as the one and only zpool allocator. Leave the removal of the zpool API (and its associated config options) to a followup cleanup after no more allocators show up. Deprecating zbud for a few cycles before removing it was initially proposed [2], like z3fold was marked as deprecated for 2 cycles [3]. However, Johannes rightfully pointed out that the 2 cycles is too short for most downstream consumers, and z3fold was deprecated first only as a courtesy anyway.
[1] https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/Z5gdnSX5Lv-nfjQL@google.com/
[3] https://lore.kernel.org/lkml/20240904233343.933462-1-yosryahmed@google.com/
Link: https://lkml.kernel.org/r/20250129180633.3501650-3-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17  kbuild: Create intermediate vmlinux build with relocations preserved  [Ard Biesheuvel]
The imperative paradigm used to build vmlinux, extract some info from it or perform some checks on it, and subsequently modify it again goes against the declarative paradigm that is usually employed for defining make rules. In particular, the Makefile.postlink files that consume their input via an output rule result in some dodgy logic in the decompressor makefiles for RISC-V and x86, given that the vmlinux.relocs input file needed to generate the arch-specific relocation tables may not exist or be out of date, but cannot be constructed using the ordinary Make dependency based rules, because the info needs to be extracted while vmlinux is in its ephemeral, non-stripped form. So instead, for architectures that require the static relocations that are emitted into vmlinux when passing --emit-relocs to the linker, and are subsequently stripped out again, introduce an intermediate vmlinux target called vmlinux.unstripped, and organize the rest of the build logic accordingly:
 - vmlinux.unstripped is created only once, and not updated again
 - build rules under arch/*/boot can depend on vmlinux.unstripped without running the risk of the data disappearing or being out of date
 - the final vmlinux generated by the build is not bloated with static relocations that are never needed again after the build completes.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-03-17  kbuild: Introduce Kconfig symbol for linking vmlinux with relocations  [Ard Biesheuvel]
Some architectures build vmlinux with static relocations preserved, but strip them again from the final vmlinux image. Arch specific tools consume these static relocations in order to construct relocation tables for KASLR. The fact that vmlinux is created, consumed and subsequently updated goes against the typical, declarative paradigm used by Make, which is based on rules and dependencies. So as a first step towards cleaning this up, introduce a Kconfig symbol to declare that the arch wants to consume the static relocations emitted into vmlinux. This will be wired up further in subsequent patches. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-03-15  bpf: Introduce load-acquire and store-release instructions  [Peilin Ye]
Introduce BPF instructions with load-acquire and store-release semantics, as discussed in [1]. Define 2 new flags:

  #define BPF_LOAD_ACQ    0x100
  #define BPF_STORE_REL   0x110

A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_LOAD_ACQ (0x100). Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_STORE_REL (0x110). Unlike existing atomic read-modify-write operations that only support BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an exception, however, 64-bit load-acquires/store-releases are not supported on 32-bit architectures (to fix a build error reported by the kernel test robot). An 8- or 16-bit load-acquire zero-extends the value before writing it to a 32-bit register, just like ARM64 instruction LDARH and friends. Similar to existing atomic read-modify-write operations, misaligned load-acquires/store-releases are not allowed (even if BPF_F_ANY_ALIGNMENT is set).

As an example, consider the following 64-bit load-acquire BPF instruction (assuming little-endian):

  db 10 00 00 00 01 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))
  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000100): BPF_LOAD_ACQ

Similarly, a 16-bit BPF store-release:

  cb 21 00 00 10 01 00 00  store_release((u16 *)(r1 + 0x0), w2)
  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000110): BPF_STORE_REL

In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new instructions, until the corresponding JIT compiler supports them in arena.

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
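From BPF C code these instructions are normally reached through the compiler's atomic builtins; a hedged sketch, assuming a Clang/LLVM version that lowers acquire/release orderings to the new encodings for the BPF target (the program and symbol names are illustrative only):

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  long flag;
  long data;

  SEC("raw_tp")
  int acquire_release_example(void *ctx)
  {
          /* store-release: publish data before setting the flag */
          __atomic_store_n(&data, 42, __ATOMIC_RELEASE);
          __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);

          /* load-acquire: observe the flag before reading data */
          if (__atomic_load_n(&flag, __ATOMIC_ACQUIRE))
                  bpf_printk("data=%ld", data);
          return 0;
  }

  char _license[] SEC("license") = "GPL";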
2025-03-15  crypto: scatterwalk - Change scatterwalk_next calling convention  [Herbert Xu]
Rather than returning the address and storing the length into an argument pointer, add an address field to the walk struct and use that to store the address. The length is returned directly. Change the done functions to use this stored address instead of getting it from the caller. Split the address into two using a union. The user should only access the const version so that it is never changed. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-14  KVM: s390: pv: fix race when making a page secure  [Claudio Imbrenda]
Holding the pte lock for the page that is being converted to secure is needed to avoid races. A previous commit removed the locking, which caused issues. Fix by locking the pte again. Fixes: 5cbe24350b7d ("KVM: s390: move pv gmap functions into kvm") Reported-by: David Hildenbrand <david@redhat.com> Tested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> [david@redhat.com: replace use of get_locked_pte() with folio_walk_start()] Link: https://lore.kernel.org/r/20250312184912.269414-2-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250312184912.269414-2-imbrenda@linux.ibm.com>
2025-03-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [Paolo Abeni]
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py")
  de94e8697405 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175ccad4 ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f913 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552eda ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-11  Merge branch 'strict-mm-typechecks-support' into features  [Vasily Gorbik]
Heiko writes:

"The recent large kernel Rust thread where Linus commented about that structures may be returned in registers [1] made me again aware that this is not true for s390 where the ABI defines that structures are returned in a return value buffer allocated by the caller. This was also mentioned by Alexander Gordeev a couple of weeks ago.

In theory the -freg-struct-return compiler flag would allow to return small structures in registers, however that has not been implemented for s390. Juergen Christ did an experimental gcc implementation which shows the benefit of such a change (bloat-o-meter):

  add/remove: 3/2 grow/shrink: 12/441 up/down: 740/-7182 (-6442)

This result is not very impressive, and doesn't seem to justify a new ABI for the kernel.

However there is still the existing STRICT_MM_TYPECHECKS which can be used to change some mm types from structures to simple scalar types. Changing the mm types results in:

  add/remove: 2/8 grow/shrink: 25/116 up/down: 3902/-6204 (-2302)

Which is already a third of the possible savings which would be the result of the described ABI change.

Therefore add support for a configurable STRICT_MM_TYPECHECKS which allows to generate better code, but also allows to have type checking for debug builds."

[1] https://lore.kernel.org/all/CAHk-=wgb1g9VVHRaAnJjrfRFWAOVT2ouNOMqt0js8h3D6zvHDw@mail.gmail.com/

* strict-mm-typechecks-support:
  s390/mm: Add configurable STRICT_MM_TYPECHECKS
  s390/mm: Convert pgste_val() into function
  s390/mm: Convert pgprot_val() into function
  s390/mm: Use pgprot_val() instead of open coding

Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-11  s390/syscall: Simplify syscall_get_arguments()  [Sven Schnelle]
Replace the while loop and if statement with a simple for loop to make the code easier to understand. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
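A sketch of the simplified loop (compat truncation omitted; on s390 the syscall arguments live in %r2-%r7, with the original first argument preserved in orig_gpr2 - details are assumptions, not the literal patch):

  static inline void syscall_get_arguments(struct task_struct *task,
                                           struct pt_regs *regs,
                                           unsigned long *args)
  {
          int i;

          for (i = 1; i < 6; i++)
                  args[i] = regs->gprs[2 + i];
          args[0] = regs->orig_gpr2;
  }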
2025-03-11  s390: Remove ioremap_wt() and pgprot_writethrough()  [Niklas Schnelle]
It turns out that while s390 architecture calls its memory-I/O mapping variants write-through and write-back the implementation of ioremap_wt() and pgprot_writethrough() does not match Linux notion of ioremap_wt(). In particular Linux expects ioremap_wt() to be weaker still than ioremap_wc(), allowing not just gathering and re-ordering but also reads to be served from cache. Instead s390's implementation is equivalent to normal ioremap() while its ioremap_wc() allows re-ordering. Note that there are no known users of ioremap_wt() on s390 and the resulting behavior is in line with asm-generic defining ioremap_wt() as ioremap(), if undefined, so no breakage is expected. As s390 does not have a mapping type matching the Linux notion of ioremap_wt() and pgprot_writethrough(), simply drop them and rely on the asm-generic fallbacks instead. Fixes: b02002cc4c0f ("s390/pci: Implement ioremap_wc/prot() with MIO") Fixes: b43b3fff042d ("s390: mm: convert to GENERIC_IOREMAP") Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-11  s390/mm: Add configurable STRICT_MM_TYPECHECKS  [Heiko Carstens]
Add support for configurable STRICT_MM_TYPECHECKS. The s390 ABI defines that return values with complex types like structures and unions are returned in a return value buffer allocated by the caller. This is also true for small structures and unions which would fit into a register. On the other hand when such types are passed as arguments to functions they are passed in registers, if they are small enough. This leads to inefficient code when such a return value of a function call is then passed as argument to a subsequent function call. This is especially true for all mm types, like pte_t and others, which are only for type checking reasons defined as a structure. This however can be bypassed with the STRICT_MM_TYPECHECKS feature, which is used by a few other architectures, which seem to have the same problem. Add CONFIG_STRICT_MM_TYPECHECKS which can be used to change the type of pte_t and other structures. If the config option is not enabled the types are defined to unsigned long, allowing for better code generation, however there is no type checking anymore. If it is enabled the types are structures like before so that type checking is performed, but less efficient code is generated. The option is always enabled in debug_defconfig, and for convenience an mmtypes.config topic target is added, which allows to easily enable it, in case memory management code is changed. CONFIG_STRICT_MM_TYPECHECKS and STRICT_MM_TYPECHECKS are kept separate, since STRICT_MM_TYPECHECKS is common across architectures and common code. Therefore use the same define also for s390 code. Add CONFIG_STRICT_MM_TYPECHECKS to make it build time configurable. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
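The general idea behind STRICT_MM_TYPECHECKS (a sketch of the pattern, not the s390 patch itself) is to switch a type like pte_t between a wrapped struct and a bare scalar:

  #ifdef CONFIG_STRICT_MM_TYPECHECKS
  /* Type checking: pte_t is a distinct struct type. */
  typedef struct { unsigned long pte; } pte_t;
  #define __pte(x)      ((pte_t) { (x) })

  static inline unsigned long pte_val(pte_t pte)
  {
          return pte.pte;
  }
  #else
  /* Better code generation: pte_t is a plain scalar. */
  typedef unsigned long pte_t;
  #define __pte(x)      (x)
  #define pte_val(x)    (x)
  #endif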
2025-03-11  s390/mm: Convert pgste_val() into function  [Heiko Carstens]
Similar to all other *_val() functions convert the last remaining architecture specific mm primitive pgste_val() into a function. Add set_pgste_bit() and clear_pgste_bit() helper functions which allow to clear and set pgste bits. This is also similar to e.g. set_pte_bit() and other helper functions. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
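A sketch of what such accessors typically look like (assuming pgste_t wraps a single unsigned long, as the other s390 mm types do; the real patch may differ in detail):

  static inline unsigned long pgste_val(pgste_t pgste)
  {
          return pgste.pgste;
  }

  static inline pgste_t set_pgste_bit(pgste_t pgste, unsigned long mask)
  {
          return (pgste_t){ pgste_val(pgste) | mask };
  }

  static inline pgste_t clear_pgste_bit(pgste_t pgste, unsigned long mask)
  {
          return (pgste_t){ pgste_val(pgste) & ~mask };
  }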
2025-03-11  s390/mm: Convert pgprot_val() into function  [Heiko Carstens]
Convert pgprot_val() into a function similar to other mm primitives like e.g. pte_val(). This disallows usage as an lvalue; however there aren't any such users left, except for some architecture specific ones. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-11  s390/mm: Use pgprot_val() instead of open coding  [Heiko Carstens]
Use pgprot_val() to get the page protection value, instead of accessing the structure member directly. The type of pgprot_t is supposed to be hidden from all users so that it can be changed; e.g. for STRICT_MM_TYPECHECKS. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>