path: root/tools
Age  Commit message  Author
2025-05-26  Merge tag 'kvmarm-6.16' of ↵  Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.16 * New features: - Add large stage-2 mapping support for non-protected pKVM guests, clawing back some performance. - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes. - Enable nested virtualisation support on systems that support it (yes, it has been a long time coming), though it is disabled by default. * Improvements, fixes and cleanups: - Large rework of the way KVM tracks architecture features and links them with the effects of control bits. This ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture. - Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above. - New selftest checking the pKVM ownership transition rules - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it. - Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts. - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process. - Add a new selftest for the SVE host state being corrupted by a guest. - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW. - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised. - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion. - and the usual random cleanups.
2025-05-26  Merge branch 'pm-tools'  Rafael J. Wysocki
Merge a cpupower utility update for 6.16-rc1 that adds a systemd service to run cpupower and changes binding's Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli). * pm-tools: cpupower: do not install files to /etc/default/ cpupower: do not call systemctl at install time cpupower: do not write DESTDIR to cpupower.service cpupower: change binding's makefile to use -lcpupower cpupower: add a systemd service to run cpupower
2025-05-26  Merge branches 'pm-runtime' and 'pm-sleep'  Rafael J. Wysocki
Merge updates related to system sleep handling and runtime PM for 6.16-rc1: - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja Kalla). - Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki). - Add new devm_ functions for enabling runtime PM and runtime PM reference counting (Bence Csókás). - Remove size arguments from strscpy() calls in the hibernation core code (Thorsten Blum). - Adjust the handling of devices with asynchronous suspend enabled during system suspend and resume to start resuming them immediately after resuming their parents and to start suspending such a device immediately after suspending its first child (Rafael Wysocki). - Adjust messages printed during tasks freezing to avoid using pr_cont() (Andrew Sayers, Paul Menzel). - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan Zhang). - Add missing wakeup source attribute relax_count to sysfs and remove the space character at the end of the string produced by pm_show_wakelocks() (Zijun Hu). - Add configurable pm_test delay for hibernation (Zihuan Zhang). - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the cypd4226 device on Tegra boards from suspending prematurely (Jon Hunter). - Unbreak printing PM debug messages during hibernation and clean up some related code (Rafael Wysocki).
* pm-runtime: PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn() PM: sysfs: Move debug runtime PM attributes to runtime_attrs[] PM: runtime: Add new devm functions * pm-sleep: PM: freezer: Rewrite restarting tasks log to remove stray *done.* PM: sleep: Introduce pm_sleep_transition_in_progress() PM: sleep: Introduce pm_suspend_in_progress() PM: sleep: Print PM debug messages during hibernation ucsi_ccg: Disable async suspend in ucsi_ccg_probe() PM: hibernate: add configurable delay for pm_test PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks() PM: wakeup: Add missing wakeup source attribute relax_count PM: sleep: Remove unnecessary !! PM: sleep: Use two lines for "Restarting..." / "done" messages PM: sleep: Make suspend of devices more asynchronous PM: sleep: Suspend async parents after suspending children PM: sleep: Resume children after resuming the parent PM: hibernate: Remove size arguments when calling strscpy()
2025-05-26  Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux  Linus Torvalds
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. 
This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-26  Merge tag 'vfs-6.16-rc1.selftests' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs selftests updates from Christian Brauner: "This contains various cleanups, fixes, and extensions for our filesystem selftests" * tag 'vfs-6.16-rc1.selftests' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/fs/mount-notify: add a test variant running inside userns selftests/filesystems: create setup_userns() helper selftests/filesystems: create get_unique_mnt_id() helper selftests/fs/mount-notify: build with tools include dir selftests/mount_settattr: remove duplicate syscall definitions selftests/pidfd: move syscall definitions into wrappers.h selftests/fs/statmount: build with tools include dir selftests/filesystems: move wrapper.h out of overlayfs subdir selftests/mount_settattr: ensure that ext4 filesystem can be created selftests/mount_settattr: add missing STATX_MNT_ID_UNIQUE define selftests/mount_settattr: don't define sys_open_tree() twice
2025-05-26  Merge tag 'vfs-6.16-rc1.coredump' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull coredump updates from Christian Brauner: "This adds support for sending coredumps over an AF_UNIX socket. It also makes (implicit) use of the new SO_PEERPIDFD ability to hand out pidfds for reaped peer tasks. The new coredump socket will allow userspace to not have to rely on usermode helpers for processing coredumps and provides a safe way to handle them instead of relying on super privileged coredumping helpers. This will also be significantly more lightweight since the kernel doesn't have to do a fork()+exec() for each crashing process to spawn a usermodehelper. Instead the kernel just connects to the AF_UNIX socket and userspace can process it concurrently however it sees fit. Support for userspace is incoming starting with systemd-coredump. There's more work coming in that direction next cycle. The rest below goes into some details and background. Coredumping currently supports two modes: (1) Dumping directly into a file somewhere on the filesystem. (2) Dumping into a pipe connected to a usermode helper process spawned as a child of the system_unbound_wq or kthreadd. For simplicity I'm mostly ignoring (1). There's probably still some users of (1) out there but processing coredumps in this way can be considered adventurous especially in the face of set*id binaries. The most common option should be (2) by now. It works by allowing userspace to put a string into /proc/sys/kernel/core_pattern like: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h The "|" at the beginning indicates to the kernel that a pipe must be used. The path following the pipe indicator is a path to a binary that will be spawned as a usermode helper process. Any additional parameters pass information about the task that is generating the coredump to the binary that processes the coredump. In the example the core_pattern shown causes the kernel to spawn systemd-coredump as a usermode helper.
There's various conceptual consequences of this (non-exhaustive list): - systemd-coredump is spawned with file descriptor number 0 (stdin) connected to the read-end of the pipe. All other file descriptors are closed. That specifically includes 1 (stdout) and 2 (stderr). This has already caused bugs because userspace assumed that this cannot happen (Whether or not this is a sane assumption is irrelevant) - systemd-coredump will be spawned as a child of system_unbound_wq. So it is not a child of any userspace process and specifically not a child of PID 1. It cannot be waited upon and is in a weird hybrid upcall which are difficult for userspace to control correctly - systemd-coredump is spawned with full kernel privileges. This necessitates all kinds of weird privilege dropping exercises in userspace to make this safe - A new usermode helper has to be spawned for each crashing process. This adds a new mode: (3) Dumping into an AF_UNIX socket. Userspace can set /proc/sys/kernel/core_pattern to: @/path/to/coredump.socket The "@" at the beginning indicates to the kernel that an AF_UNIX coredump socket will be used to process coredumps. The coredump socket must be located in the initial mount namespace. When a task coredumps it opens a client socket in the initial network namespace and connects to the coredump socket: - The coredump server uses SO_PEERPIDFD to get a stable handle on the connected crashing task. The retrieved pidfd will provide a stable reference even if the crashing task gets SIGKILLed while generating the coredump. That is a huge attack vector right now - By setting core_pipe_limit non-zero userspace can guarantee that the crashing task cannot be reaped behind its back and thus process all necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to detect whether /proc/<pid> still refers to the same process. The core_pipe_limit isn't used to rate-limit connections to the socket.
This can simply be done via AF_UNIX socket directly. - The pidfd for the crashing task will contain information about how the task coredumps. The PIDFD_GET_INFO ioctl gained a new flag PIDFD_INFO_COREDUMP which can be used to retrieve the coredump information. If the coredump server gets a new coredump client connection the kernel guarantees that PIDFD_INFO_COREDUMP information is available. Currently the following information is provided in the new @coredump_mask extension to struct pidfd_info: * PIDFD_COREDUMPED is raised if the task did actually coredump * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g., undumpable) * PIDFD_COREDUMP_USER is raised if this is a regular coredump and doesn't need special care by the coredump server * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be treated as sensitive and the coredump server should restrict access to the generated coredump to sufficiently privileged users" * tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: mips, net: ensure that SOCK_COREDUMP is defined selftests/coredump: add tests for AF_UNIX coredumps selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure coredump: validate socket name as it is written coredump: show supported coredump modes pidfs, coredump: add PIDFD_INFO_COREDUMP coredump: add coredump socket coredump: reflow dump helpers a little coredump: massage do_coredump() coredump: massage format_corename()
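The core_pattern prefixes and the coredump_mask flags described in this pull request can be sketched in a few lines of Python. This is only an illustration, not kernel or systemd code: the helper names classify_core_pattern() and decode_coredump_mask() are hypothetical, and the bit positions given to the PIDFD_COREDUMP* flags are assumptions made for the example. A real coredump server should take the values from the kernel's uapi headers.

```python
from enum import IntFlag


def classify_core_pattern(pattern):
    """Return which coredump mode a core_pattern string selects."""
    if pattern.startswith("|"):
        return "pipe"    # mode (2): pipe to a usermode helper
    if pattern.startswith("@"):
        return "socket"  # mode (3): AF_UNIX coredump socket
    return "file"        # mode (1): dump into a file


class CoredumpMask(IntFlag):
    # Bit positions are ASSUMED for illustration only.
    PIDFD_COREDUMPED = 1 << 0
    PIDFD_COREDUMP_SKIP = 1 << 1
    PIDFD_COREDUMP_USER = 1 << 2
    PIDFD_COREDUMP_ROOT = 1 << 3


def decode_coredump_mask(mask):
    """List the names of the flags set in a coredump_mask value."""
    return [flag.name for flag in CoredumpMask if mask & flag]
```

A server following the description above would check for PIDFD_COREDUMP_ROOT before deciding who may read the resulting dump.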
2025-05-26  Merge tag 'vfs-6.16-rc1.pidfs' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: "Features: - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD socket option. SO_PEERPIDFD is a socket option that allows retrieving a pidfd for the process that called connect() or listen(). This is heavily used to safely authenticate clients in userspace avoiding security bugs due to pid recycling races (dbus, polkit, systemd, etc.). SO_PEERPIDFD currently doesn't support handing out pidfds if the sk->sk_peer_pid thread-group leader has already been reaped. In this case it currently returns EINVAL. Userspace still wants to get a pidfd for a reaped process to have a stable handle it can pass on. This is especially useful now that it is possible to retrieve exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag. Another summary has been provided by David Rheinsberg: > A pidfd can outlive the task it refers to, and thus user-space > must already be prepared that the task underlying a pidfd is > gone at the time they get their hands on the pidfd. For > instance, resolving the pidfd to a PID via the fdinfo must be > prepared to read `-1`. > > Despite user-space knowing that a pidfd might be stale, several > kernel APIs currently add another layer that checks for this. In > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was > already reaped, but returns a stale pidfd if the task is reaped > immediately after the respective alive-check. > > This has the unfortunate effect that user-space now has two ways > to check for the exact same scenario: A syscall might return > EINVAL/ESRCH/... *or* the pidfd might be stale, even though > there is no particular reason to distinguish both cases. This > also propagates through user-space APIs, which pass on pidfds. > They must be prepared to pass on `-1` *or* the pidfd, because > there is no guaranteed way to get a stale pidfd from the kernel.
> > Userspace must already deal with a pidfd referring to a reaped > task as the task may exit and get reaped at any time while there > are still many pidfds referring to it. In order to allow handing out reaped pidfds, SO_PEERPIDFD needs to ensure that PIDFD_INFO_EXIT information is available whenever a pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped pidfds are only handed out if it is guaranteed that the caller sees the exit information: TEST_F(pidfd_info, success_reaped) { struct pidfd_info info = { .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT, }; /* * Process has already been reaped and PIDFD_INFO_EXIT been set. * Verify that we can retrieve the exit status of the process. */ ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0); ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS)); ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT)); ASSERT_TRUE(WIFEXITED(info.exit_code)); ASSERT_EQ(WEXITSTATUS(info.exit_code), 0); } To hand out pidfds for reaped processes we thus allocate a pidfs entry for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is stashed and drop it when the socket is destroyed. This guarantees that exit information will always be recorded for the sk->sk_peer_pid task and we can hand out pidfds for reaped processes. - Hand a pidfd to the coredump usermode helper process. Give userspace a way to instruct the kernel to install a pidfd for the crashing process into the process started as a usermode helper. There's still tricky race-windows that cannot be easily or sometimes not closed at all by userspace.
There's various ways like looking at the start time of a process to make sure that the usermode helper process is started after the crashing process but it's all very very brittle and fraught with peril The crashed-but-not-reaped process can be killed by userspace before coredump processing programs like systemd-coredump have had time to manually open a PIDFD from the PID the kernel provides them, which means they can be tricked into reading from an arbitrary process, and they run with full privileges as they are usermode helper processes Even if that specific race-window wouldn't exist it's still the safest and cleanest way to let the kernel provide the pidfd directly instead of requiring userspace to do it manually. In parallel with this commit we already have systemd adding support for this in [1] When the usermode helper process is forked we install a pidfd file descriptor three into the usermode helper's file descriptor table so it's available to the exec'd program Since usermode helpers are either children of the system_unbound_wq workqueue or kthreadd we know that the file descriptor table is empty and can thus always use three as the file descriptor number Note, that we'll install a pidfd for the thread-group leader even if a subthread is calling do_coredump(). 
We know that task linkage hasn't been removed yet and even if this @current isn't the actual thread-group leader we know that the thread-group leader cannot be reaped until @current has exited. - Allow telling when a task has not been found from finding the wrong task when creating a pidfd. We currently report EINVAL whenever a struct pid has no task attached anymore thereby conflating two concepts: (1) The task has already been reaped (2) The caller requested a pidfd for a thread-group leader but the pid actually references a struct pid that isn't used as a thread-group leader. This is causing issues for non-threaded workloads, which expect ESRCH to be reported, not EINVAL. So allow userspace to reliably distinguish between (1) and (2) - Make it possible to detect when a pidfs entry would outlive the struct pid it pinned - Add a range of new selftests Cleanups: - Remove unneeded NULL check from pidfd_prepare() for passed struct pid - Avoid pointless reference count bump during release_task() Fixes: - Various fixes to the pidfd and coredump selftests - Fix error handling for replace_fd() when spawning coredump usermode helper" * tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: detect refcount bugs coredump: hand a pidfd to the usermode coredump helper coredump: fix error handling for replace_fd() pidfs: move O_RDWR into pidfs_alloc_file() selftests: coredump: Raise timeout to 2 minutes selftests: coredump: Fix test failure for slow machines selftests: coredump: Properly initialize pointer net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid pidfs: get rid of __pidfd_prepare() net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid pidfs: register pid in pidfs net, pidfd: report EINVAL for ESRCH release_task: kill the no longer needed get/put_pid(thread_pid) pidfs: ensure consistent ENOENT/ESRCH reporting exit: move wake_up_all() pidfd waiters into __unhash_process() selftest/pidfd: add test
for thread-group leader pidfd open for thread pidfd: improve uapi when task isn't found pidfd: remove unneeded NULL check from pidfd_prepare() selftests/pidfd: adapt to recent changes
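As a rough illustration of the SO_PEERPIDFD option discussed in this pull request, the sketch below asks an AF_UNIX socket for a pidfd of its peer. It is a minimal sketch under stated assumptions: the helper names peer_pidfd() and demo() are hypothetical, the fallback value 77 for SO_PEERPIDFD is the asm-generic number and may differ on some architectures, and kernels without the option are simply reported as unsupported. It shows only the basic call, not the new behaviour for reaped peers.

```python
import os
import socket
import struct

# ASSUMPTION: 77 is the asm-generic value; some architectures number it differently.
SO_PEERPIDFD = getattr(socket, "SO_PEERPIDFD", 77)


def peer_pidfd(sock):
    """Return a pidfd for the peer of an AF_UNIX socket, or None if unavailable."""
    try:
        raw = sock.getsockopt(socket.SOL_SOCKET, SO_PEERPIDFD, struct.calcsize("i"))
    except OSError:
        return None  # e.g. the running kernel does not know this option
    if len(raw) != struct.calcsize("i"):
        return None
    return struct.unpack("i", raw)[0]


def demo():
    """Create a socketpair and try to fetch a pidfd for the peer (this process)."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        fd = peer_pidfd(a)
        if fd is None:
            return False
        os.close(fd)  # the pidfd is an ordinary descriptor and must be closed
        return True
    finally:
        a.close()
        b.close()
```

A server would normally keep the returned descriptor open for as long as it needs a stable handle on the client, rather than closing it immediately as demo() does.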
2025-05-26  Merge tag 'nf-next-25-05-23' of ↵  Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next, specifically 26 patches: 5 patches adding/updating selftests, 4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables): 1) Improve selftest coverage for pipapo 4 bit group format, from Florian Westphal. 2) Fix incorrect dependencies when compiling a kernel without legacy ip{6}tables support, also from Florian. 3) Two patches to fix nft_fib vrf issues, including selftest updates to improve coverage, also from Florian Westphal. 4) Fix incorrect nesting in nft_tunnel's GENEVE support, from Fernando F. Mancera. 5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure and nft_inner to match in inner headers, from Sebastian Andrzej Siewior. 6) Integrate conntrack information into nft trace infrastructure, from Florian Westphal. 7) A series of 13 patches to allow to specify wildcard netdevice in netdev basechain and flowtables, eg. table netdev filter { chain ingress { type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept; } } This also allows for runtime hook registration on NETDEV_{UN}REGISTER event, from Phil Sutter. 
netfilter pull request 25-05-23 * tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits) selftests: netfilter: Torture nftables netdev hooks netfilter: nf_tables: Add notifications for hook changes netfilter: nf_tables: Support wildcard netdev hook specs netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc() netfilter: nf_tables: Handle NETDEV_CHANGENAME events netfilter: nf_tables: Wrap netdev notifiers netfilter: nf_tables: Respect NETDEV_REGISTER events netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook() netfilter: nf_tables: Introduce nft_register_flowtable_ops() netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}() netfilter: nf_tables: Introduce functions freeing nft_hook objects netfilter: nf_tables: add packets conntrack state to debug trace info netfilter: conntrack: make nf_conntrack_id callable without a module dependency netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx netfilter: nf_dup{4, 6}: Move duplication check to task_struct netfilter: nft_tunnel: fix geneve_opt dump selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs ... ==================== Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
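The wildcard device specification in item 7 (devices = { eth0, eth1, vlan* }) can be illustrated with a small matching sketch. This is not the nf_tables implementation: hook_matches() and matching_devices() are hypothetical helpers, and the sketch assumes that a trailing "*" means "any interface name beginning with this prefix" while every other spec must match exactly.

```python
def hook_matches(spec, ifname):
    """Match an interface name against a device spec with an optional trailing '*'."""
    if spec.endswith("*"):
        # ASSUMPTION: a trailing '*' is treated as a plain prefix match.
        return ifname.startswith(spec[:-1])
    return ifname == spec


def matching_devices(specs, ifnames):
    """Return the interface names selected by at least one device spec."""
    return [name for name in ifnames if any(hook_matches(s, name) for s in specs)]
```

With the specs from the example ruleset, a newly registered vlan10 interface would be selected while eth2 would not, which is the situation the runtime hook registration on NETDEV_{UN}REGISTER events is meant to handle.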
2025-05-26  Merge branch 'acpica'  Rafael J. Wysocki
Merge ACPICA updates, including two upstream releases 20241212 and 20250404, for 6.16-rc1: - Fix two ACPICA SLAB cache leaks (Seunghun Han). - Add EINJv2 get error type action and define Error Injection Actions in hex values to avoid inconsistencies between the specification and the code (Zaid Alali). - Fix typo in comments for SRAT structures (Adam Lackorzynski). - Prevent possible loss of data in ACPICA because of u32 to u8 conversions (Saket Dumbre). - Fix reading FFixedHW operation regions in ACPICA (Daniil Tatianin). - Add support for printing AML arguments when the ACPICA debug level is ACPI_LV_TRACE_POINT (Mario Limonciello). - Drop a stale comment about the file content from actbl2.h (Sudeep Holla). - Apply pack(1) to union aml_resource (Tamir Duberstein). - Fix overflow check in the ACPICA version of vsnprintf() (gldrk). - Interpret SIDP structures in DMAR added revision 3.4 of the VT-d specification (Alexey Neyman). - Add typedef and other definitions related to MRRM to ACPICA (Tony Luck). - Add definitions for RIMT to ACPICA (Sunil V L). - Fix spelling mistake "Incremement" -> "Increment" in the ACPICA utilities code (Colin Ian King). - Add typedef and other definitions for ERDT to ACPICA (Tony Luck). - Introduce ACPI_NONSTRING and use it (Kees Cook, Ahmed Salem). - Rename structure and field names of the RAS2 table in actbl2.h (Shiju Jose). - Fix up whitespace in acpica/utcache.c (Zhe Qiao). - Avoid sequence overread in a call to strncmp() in ap_get_table_length() and replace strncpy() with memcpy() in ACPICA in some places (Ahmed Salem). - Update copyright year in all ACPICA files (Saket Dumbre). 
* acpica: (30 commits) ACPICA: Update copyright year ACPICA: Logfile: Changes for version 20250404 ACPICA: Replace strncpy() with memcpy() ACPICA: Apply ACPI_NONSTRING in more places ACPICA: Avoid sequence overread in call to strncmp() ACPICA: Adjust the position of code lines ACPICA: actbl2.h: ACPI 6.5: RAS2: Rename structure and field names of the RAS2 table ACPICA: Apply ACPI_NONSTRING ACPICA: Introduce ACPI_NONSTRING ACPICA: actbl2.h: ERDT: Add typedef and other definitions ACPICA: infrastructure: Add new DMT_BUF types and shorten a long name ACPICA: Utilities: Fix spelling mistake "Incremement" -> "Increment" ACPICA: MRRM: Some cleanups ACPICA: actbl2: Add definitions for RIMT ACPICA: actbl2.h: MRRM: Add typedef and other definitions ACPICA: infrastructure: Add new header and ACPI_DMT_BUF26 types ACPICA: Interpret SIDP structures in DMAR ACPICA: utilities: Fix overflow check in vsnprintf() ACPICA: Apply pack(1) to union aml_resource ACPICA: Drop stale comment about the header file content ...
2025-05-26  Merge tag 'linux-can-next-for-6.16-20250522' of ↵  Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2025-05-22 this is a pull request of 22 patches for net-next/main. The series by Biju Das contains 19 patches and adds RZ/G3E CANFD support to the rcar_canfd driver. The patch by Vincent Mailhol adds a struct data_bittiming_params to group FD parameters as a preparation patch for CAN-XL support. Felix Maurer's patch imports tst-filter from can-tests into the kernel self tests and Vincent Mailhol adds support for physical CAN interfaces. linux-can-next-for-6.16-20250522 * tag 'linux-can-next-for-6.16-20250522' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (22 commits) selftests: can: test_raw_filter.sh: add support of physical interfaces selftests: can: Import tst-filter from can-tests can: dev: add struct data_bittiming_params to group FD parameters can: rcar_canfd: Add RZ/G3E support can: rcar_canfd: Enhance multi_channel_irqs handling can: rcar_canfd: Add external_clk variable to struct rcar_canfd_hw_info can: rcar_canfd: Add sh variable to struct rcar_canfd_hw_info can: rcar_canfd: Add struct rcanfd_regs variable to struct rcar_canfd_hw_info can: rcar_canfd: Add shared_can_regs variable to struct rcar_canfd_hw_info can: rcar_canfd: Add ch_interface_mode variable to struct rcar_canfd_hw_info can: rcar_canfd: Add {nom,data}_bittiming variables to struct rcar_canfd_hw_info can: rcar_canfd: Add max_cftml variable to struct rcar_canfd_hw_info can: rcar_canfd: Add max_aflpn variable to struct rcar_canfd_hw_info can: rcar_canfd: Add rnc_field_width variable to struct rcar_canfd_hw_info can: rcar_canfd: Update RCANFD_GAFLCFG macro can: rcar_canfd: Add rcar_canfd_setrnc() can: rcar_canfd: Drop the mask operation in RCANFD_GAFLCFG_SETRNC macro can: rcar_canfd: Update RCANFD_GERFL_ERR macro can: rcar_canfd: Drop RCANFD_GAFLCFG_GETRNC macro can: rcar_canfd: Use of_get_available_child_by_name() ... 
==================== Link: https://patch.msgid.link/20250522084128.501049-1-mkl@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26  Merge tag 'vfs-6.16-rc1.misc' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Use folios for symlinks in the page cache. FUSE already uses folios for its symlinks. Mirror that conversion in the generic code and the NFS code. That lets us get rid of a few folio->page->folio conversions in this path, and some of the few remaining users of read_cache_page() / read_mapping_page() - Try and make a few filesystem operations killable on the VFS inode->i_mutex level - Add sysctl vfs_cache_pressure_denom for bulk file operations. Some workloads need to preserve more dentries than we currently allow through our sysctl interface. On HDFS servers with 12 HDDs per server, HDFS datanode startup involves scanning all files and caching their metadata (including dentries and inodes) in memory. Each HDD contains approximately 2 million files, resulting in a total of ~20 million cached dentries after initialization. To minimize dentry reclamation, they set vfs_cache_pressure to 1. Despite this configuration, memory pressure conditions can still trigger reclamation of up to 50% of cached dentries, reducing the cache from 20 million to approximately 10 million entries. During the subsequent cache rebuild period, any HDFS datanode restart operation incurs substantial latency penalties until full cache recovery completes. To maintain service stability, more dentries need to be preserved during memory reclamation. The current minimum reclaim ratio (1/100 of total dentries) remains too aggressive for such workload.
This patch introduces vfs_cache_pressure_denom for more granular cache pressure control. The configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] effectively maintains the full 20 million dentry cache under memory pressure, preventing datanode restart performance degradation - Avoid some jumps in inode_permission() using likely()/unlikely() - Avoid a memory access which is most likely a cache miss when descending into devcgroup_inode_permission() - Add fastpath predicts for stat() and fdput() - Anonymous inodes currently don't come with a proper mode causing issues in the kernel when we want to add useful VFS debug asserts. Fix that by giving them a proper mode and masking it off when we report it to userspace which relies on them not having any mode - Anonymous inodes currently allow to change inode attributes because the VFS falls back to simple_setattr() if i_op->setattr isn't implemented. This means the ownership and mode for every single user of anon_inode_inode can be changed. Block that as it's either useless or actively harmful.
If specific ownership is needed, the respective subsystem should allocate anonymous inodes from their own private superblock - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock - Add proper tests for anonymous inode behavior - Make it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead() Cleanups: - Port pidfs to the new anon_inode_{g,s}etattr() helpers - Try to remove the uselib() system call - Add unlikely branch hint return path for poll - Add unlikely branch hint on return path for core_sys_select - Don't allow signals to interrupt getdents copying for fuse - Provide a size hint to dir_context during readdir() - Use writeback_iter directly in mpage_writepages - Update compression and mtime descriptions in initramfs documentation - Update main netfs API document - Remove useless plus one in super_cache_scan() - Remove unnecessary NULL-check guards during setns() - Add separate {get,put}_cgroup_ns no-op cases Fixes: - Fix typo in root= kernel parameter description - Use KERN_INFO for infof()|info_plog()|infofc() - Correct comments of fs_validate_description() - Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep() - Delete macro fsparam_u32hex() - Remove unused and problematic validate_constant_table() - Fix potential unsigned integer underflow in fs_name() - Make file-nr output the total allocated file handles" * tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits) fs: Pass a folio to page_put_link() nfs: Use a folio in nfs_get_link() fs: Convert __page_get_link() to use a folio fs/read_write: make default_llseek() killable fs/open: make do_truncate() killable fs/open: make chmod_common() and chown_common() killable include/linux/fs.h: add inode_lock_killable() readdir: supply dir_context.count as readdir buffer size hint vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations fuse: don't allow signals to 
interrupt getdents copying Documentation: fix typo in root= kernel parameter description include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards fs: use writeback_iter directly in mpage_writepages fs: remove useless plus one in super_cache_scan() fs: add S_ANON_INODE fs: remove uselib() system call device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission() fs/fs_parse: Remove unused and problematic validate_constant_table() fs: touch up predicts in inode_permission() ...
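The vfs_cache_pressure_denom item in the pull request above comes down to simple arithmetic. The sketch below assumes the reclaim target is the object count scaled by vfs_cache_pressure / vfs_cache_pressure_denom. The helper name `pressure_ratio` is hypothetical and this is an illustration, not the kernel's code.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's implementation): scale an object
 * count by pressure/denom, splitting the division so that the
 * intermediate product stays small.
 */
uint64_t pressure_ratio(uint64_t objects, uint64_t pressure, uint64_t denom)
{
	uint64_t quot = objects / denom;
	uint64_t rem = objects % denom;

	return quot * pressure + (rem * pressure) / denom;
}
```

Under that assumption, 20,000,000 dentries with vfs_cache_pressure=1 give a target of 200,000 with a denominator of 100, and 2,000 with a denominator of 10000.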
2025-05-26selftests: ncdevmem: add tx test with multiple IOVsStanislav Fomichev
Use prime 3 for length to make offset slowly drift away. Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Acked-by: Mina Almasry <almasrymina@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26selftests: ncdevmem: make chunking optionalStanislav Fomichev
Add new -z argument to specify max IOV size. By default, use single large IOV. Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-25Merge branch 'locking/futex' into locking/core, to pick up pending futex changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-24perf tests switch-tracking: Fix timestamp comparisonLeo Yan
The test might fail on the Arm64 platform with the error: # perf test -vvv "Track with sched_switch" Missing sched_switch events # The issue is caused by incorrect handling of timestamp comparisons. The comparison result, a signed 64-bit value, was being directly cast to an int, leading to incorrect sorting for sched events. The case does not fail everytime, usually I can trigger the failure after run 20 ~ 30 times: # while true; do perf test "Track with sched_switch"; done 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : FAILED! 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok 106: Track with sched_switch : FAILED! 106: Track with sched_switch : Ok 106: Track with sched_switch : Ok I used cross compiler to build Perf tool on my host machine and tested on Debian / Juno board. Generally, I think this issue is not very specific to GCC versions. As both internal CI and my local env can reproduce the issue. My Host Build compiler: # aarch64-linux-gnu-gcc --version aarch64-linux-gnu-gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Juno Board: # lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 12 (bookworm) Release: 12 Codename: bookworm Fix this by explicitly returning 0, 1, or -1 based on whether the result is zero, positive, or negative. 
Fixes: d44bc558297222d9 ("perf tests: Add a test for tracking with sched_switch") Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Leo Yan <leo.yan@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: James Clark <james.clark@linaro.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250331172759.115604-1-leo.yan@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
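The failure mode described in the commit above can be sketched in isolation. The function names below are hypothetical and the code is not taken from the perf sources; it only contrasts a comparator that narrows a 64-bit difference to int with one that returns -1, 0 or 1 explicitly.

```c
#include <stdint.h>

/*
 * Truncating comparator, shown only to illustrate the problem: the 64-bit
 * difference is narrowed to int, so on typical ABIs the upper 32 bits
 * (including the real sign) are dropped.
 */
int cmp_time_truncating(uint64_t a, uint64_t b)
{
	int64_t diff = (int64_t)(a - b);

	return (int)diff;
}

/* Three-way comparator that only ever returns -1, 0 or 1. */
int cmp_time(uint64_t a, uint64_t b)
{
	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}
```

For timestamps 0 and 1 << 32 the difference is -2^32, whose low 32 bits are all zero, so the truncating version typically reports the two as equal while cmp_time() orders them correctly. A sort such as qsort() relies on that ordering being consistent.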
2025-05-23Merge branch 'for-6.16/cxl-features-ras' into cxl-for-nextDave Jiang
Add CXL RAS Features support. Features include "patrol scrub control", "error check scrub", "perform maintenance", and "memory sparing". This support connects the RAS Features to EDAC.
2025-05-23cxl/edac: Add CXL memory device patrol scrub control featureShiju Jose
CXL spec 3.2 section 8.2.10.9.11.1 describes the device patrol scrub control feature. The device patrol scrub proactively locates and corrects errors in a regular cycle. Allow specifying the number of hours within which the patrol scrub must be completed, subject to minimum and maximum limits reported by the device. Also allow disabling scrub, so that error rates can be traded off against performance. Add support for patrol scrub control on CXL memory devices. Register with the EDAC device driver, which retrieves the scrub attribute descriptors from EDAC scrub and exposes the sysfs scrub control attributes to userspace. For example, scrub control for the CXL memory device "cxl_mem0" is exposed in /sys/bus/edac/devices/cxl_mem0/scrubX/. Additionally, add support for region-based CXL memory patrol scrub control. CXL memory regions may be interleaved across one or more CXL memory devices. For example, region-based scrub control for "cxl_region1" is exposed in /sys/bus/edac/devices/cxl_region1/scrubX/. [dj: A few formatting fixes from Jonathan] Reviewed-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20250521124749.817-4-shiju.jose@huawei.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-23libbpf: Use mmap to parse vmlinux BTF from sysfsLorenz Bauer
Teach libbpf to use mmap when parsing vmlinux BTF from /sys. We don't apply this to fall-back paths on the regular file system because there is no way to ensure that modifications underlying the MAP_PRIVATE mapping are not visible to the process. Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250520-vmlinux-mmap-v5-3-e8c941acc414@isovalent.com
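A minimal sketch of the idea in the commit above: map a file read-only with a private mapping and check for the BTF magic. This is not libbpf's code. The helper names are hypothetical, it assumes fstat() reports a usable size for the file, and it only checks the magic in native byte order.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BTF_MAGIC_SKETCH 0xeB9F /* same value as BTF_MAGIC in the UAPI btf.h */

/* Map a file read-only; returns NULL on any failure (hypothetical helper). */
void *map_readonly(const char *path, size_t *len)
{
	struct stat st;
	void *addr;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}
	addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd); /* the mapping stays valid after close() */
	if (addr == MAP_FAILED)
		return NULL;
	*len = (size_t)st.st_size;
	return addr;
}

/* Check the leading 16-bit magic, native byte order only. */
int has_btf_magic(const void *data, size_t len)
{
	uint16_t magic;

	if (len < sizeof(magic))
		return 0;
	memcpy(&magic, data, sizeof(magic));
	return magic == BTF_MAGIC_SKETCH;
}
```

A caller would pass "/sys/kernel/btf/vmlinux" to map_readonly(), fall back to read() if it returns NULL (for example on kernels that do not support mmap on that file), and munmap() the region when done.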
2025-05-23selftests: bpf: Add a test for mmapable vmlinux BTFLorenz Bauer
Add a basic test for the ability to mmap /sys/kernel/btf/vmlinux. Ensure that the data is valid BTF and that it is padded with zero. Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20250520-vmlinux-mmap-v5-2-e8c941acc414@isovalent.com
2025-05-23tools: hv: Enable debug logs for hv_kvp_daemonShradha Gupta
Allow the KVP daemon to log the KVP updates triggered in the VM with a new debug flag (-d). When the daemon is started with this flag, it logs updates and debug information in syslog with loglevel LOG_DEBUG. This information comes in handy for debugging issues where the key-value pairs for certain pools show mismatched or incorrect values. The distro-vendors can further consume these changes and modify the respective service files to redirect the logs to specific files as needed. Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Naman Jain <namjain@linux.microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Link: https://lore.kernel.org/r/1744715978-8185-1-git-send-email-shradhagupta@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1744715978-8185-1-git-send-email-shradhagupta@linux.microsoft.com>
2025-05-23selftests: ublk: add test for UBLK_F_QUIESCEMing Lei
Add test generic_11 to cover the new control command UBLK_U_CMD_QUIESCE_DEV. Add a 'quiesce -n dev_id' sub-command to the ublk utility for transitioning the device to the quiesced state, then verify the feature via generic_11 by doing quiesce and recovery. Cc: Yoav Cohen <yoav@nvidia.com> Link: https://lore.kernel.org/linux-block/DM4PR12MB632807AB7CDCE77D1E5AB7D0A9B92@DM4PR12MB6328.namprd12.prod.outlook.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250522163523.406289-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZEMing Lei
Add test generic_10 for covering new control command of UBLK_U_CMD_UPDATE_SIZE. Add 'update_size -s|--size size_in_bytes' sub-command on ublk utility for supporting this feature, then verify the feature via generic_10. Cc: Omri Mann <omri@nvidia.com> Cc: Jared Holzman <jholzman@nvidia.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250522163523.406289-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23selftests: netfilter: Torture nftables netdev hooksPhil Sutter
Add a ruleset which binds to various interface names via netdev-family chains and flowtables and massage the notifiers by frequently renaming interfaces to match these names. While doing so: - Keep an 'nft monitor' running in background to receive the notifications - Loop over 'nft list ruleset' to exercise ruleset dump codepath - Have iperf running so the involved chains/flowtables see traffic If supported, also test interface wildcard support separately by creating a flowtable with 'wild*' interface spec and quickly add/remove matching dummy interfaces. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFsFlorian Westphal
Replace the existing VRF test with a more comprehensive one. It tests the following combinations: - fib type (returns address type, e.g. unicast) - fib oif (route output interface index) - both with and without the 'iif' keyword (changes the result, e.g. 'fib daddr type local' will be true when the destination address is configured on the local machine, but 'fib daddr . iif type local' will only be true when the destination address is configured on the incoming interface). Add all types of addresses to test with for both ipv4 and ipv6: - local address on the incoming interface - local address on another interface - local address on another interface that's part of a vrf - address on another host The ruleset stores obtained results from 'fib' in nftables sets and then queries the sets to check that it has the expected results. Perform one pass while packets are coming in on an interface NOT part of a VRF and then again when it was added and make sure fib returns the expected routes and address types for the various addresses in the setup. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23Merge branch kvm-arm64/misc-6.16 into kvmarm-master/nextMarc Zyngier
* kvm-arm64/misc-6.16: : . : Misc changes and improvements for 6.16: : : - Add a new selftest for the SVE host state being corrupted by a guest : : - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, : ensuring that the window for interrupts is slightly bigger, and avoiding : a pretty bad erratum on the AmpereOne HW : : - Replace a couple of open-coded on/off strings with str_on_off() : : - Get rid of the pKVM memblock sorting, which now appears to be superfluous : : - Drop superfluous clearing of ICH_LR_EOI in the LR when nesting : : - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from : a pretty bad case of TLB corruption unless accesses to HCR_EL2 are : heavily synchronised : : - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables : in a human-friendly fashion : . KVM: arm64: Fix documentation for vgic_its_iter_next() KVM: arm64: vgic-its: Add debugfs interface to expose ITS tables arm64: errata: Work around AmpereOne's erratum AC04_CPU_23 KVM: arm64: nv: Remove clearing of ICH_LR<n>.EOI if ICH_LR<n>.HW == 1 KVM: arm64: Drop sort_memblock_regions() KVM: arm64: selftests: Add test for SVE host corruption KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode KVM: arm64: Replace ternary flags with str_on_off() helper Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23Merge branch kvm-arm64/fgt-masks into kvmarm-master/nextMarc Zyngier
* kvm-arm64/fgt-masks: (43 commits) : . : Large rework of the way KVM deals with trap bits in conjunction with : the CPU feature registers. It now draws a direct link between : the feature set, the system registers that need to UNDEF to match : the configuration and bits that need to behave as RES0 or RES1 in : the trap registers that are visible to the guest. : : Best of all, these definitions are mostly automatically generated : from the JSON description published by ARM under a permissive : license. : . KVM: arm64: Handle TSB CSYNC traps KVM: arm64: Add FGT descriptors for FEAT_FGT2 KVM: arm64: Allow sysreg ranges for FGT descriptors KVM: arm64: Add context-switch for FEAT_FGT2 registers KVM: arm64: Add trap routing for FEAT_FGT2 registers KVM: arm64: Add sanitisation for FEAT_FGT2 registers KVM: arm64: Add FEAT_FGT2 registers to the VNCR page KVM: arm64: Use HCR_EL2 feature map to drive fixed-value bits KVM: arm64: Use HCRX_EL2 feature map to drive fixed-value bits KVM: arm64: Allow kvm_has_feat() to take variable arguments KVM: arm64: Use FGT feature maps to drive RES0 bits KVM: arm64: Validate FGT register descriptions against RES0 masks KVM: arm64: Switch to table-driven FGU configuration KVM: arm64: Handle PSB CSYNC traps KVM: arm64: Use KVM-specific HCRX_EL2 RES0 mask KVM: arm64: Remove hand-crafted masks for FGT registers KVM: arm64: Use computed FGT masks to setup FGT registers KVM: arm64: Propagate FGT masks to the nVHE hypervisor KVM: arm64: Unconditionally configure fine-grain traps KVM: arm64: Use computed masks as sanitisers for FGT registers ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23Merge branch kvm-arm64/mte-frac into kvmarm-master/nextMarc Zyngier
* kvm-arm64/mte-frac: : . : Prevent FEAT_MTE_ASYNC from being accidentally exposed to a guest, : courtesy of Ben Horgan. From the cover letter: : : "The ID_AA64PFR1_EL1.MTE_frac field is currently hidden from KVM. : However, when ID_AA64PFR1_EL1.MTE==2, ID_AA64PFR1_EL1.MTE_frac==0 : indicates that MTE_ASYNC is supported. On a host with : ID_AA64PFR1_EL1.MTE==2 but without MTE_ASYNC support a guest with the : MTE capability enabled will incorrectly see MTE_ASYNC advertised as : supported. This series fixes that." : . KVM: selftests: Confirm exposing MTE_frac does not break migration KVM: arm64: Make MTE_frac masking conditional on MTE capability arm64/sysreg: Expose MTE_frac so that it is visible to KVM Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23selftest: af_unix: Test SO_PASSRIGHTS.Kuniyuki Iwashima
scm_rights.c has various patterns of tests to exercise GC. Let's add cases where SO_PASSRIGHTS is disabled. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23af_unix: Introduce SO_PASSRIGHTS.Kuniyuki Iwashima
As long as recvmsg() or recvmmsg() is used with cmsg, it is not possible to avoid receiving file descriptors via SCM_RIGHTS. This behaviour has occasionally been flagged as problematic, as it can be (ab)used to trigger DoS during close(), for example, by passing a FUSE-controlled fd or a hung NFS fd. For instance, as noted on the uAPI Group page [0], an untrusted peer could send a file descriptor pointing to a hung NFS mount and then close it. Once the receiver calls recvmsg() with msg_control, the descriptor is automatically installed, and then the responsibility for the final close() now falls on the receiver, which may result in blocking the process for a long time. Regarding this, systemd calls cmsg_close_all() [1] after each recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS. However, this cannot work around the issue at all, because the final fput() may still occur on the receiver's side once sendmsg() with SCM_RIGHTS succeeds. Also, even filtering by LSM at recvmsg() does not work for the same reason. Thus, we need a better way to refuse SCM_RIGHTS at sendmsg(). Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS. Note that this option is enabled by default for backward compatibility. Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0] Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
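From userspace, opting out of SCM_RIGHTS as described in the commit above would presumably be an ordinary setsockopt() call. The sketch below is not taken from the patch. The helper name is hypothetical, and the fallback value 83 for SO_PASSRIGHTS is an assumption about the asm-generic UAPI header (some architectures use different numbers), so prefer the definition from your installed headers.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_PASSRIGHTS
#define SO_PASSRIGHTS 83 /* assumption: asm-generic value, verify locally */
#endif

/*
 * Hypothetical helper: ask the kernel to refuse SCM_RIGHTS for this
 * AF_UNIX socket. Returns 0 on success or a negative errno value.
 */
int disable_scm_rights(int fd)
{
	int off = 0;

	if (setsockopt(fd, SOL_SOCKET, SO_PASSRIGHTS, &off, sizeof(off)) < 0)
		return -errno;
	return 0;
}
```

On kernels without this option the call is expected to fail (typically with ENOPROTOOPT), so a caller that needs the protection should treat a non-zero return as "not available" rather than ignore it.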
2025-05-23tcp: Restrict SO_TXREHASH to TCP socket.Kuniyuki Iwashima
sk->sk_txrehash is only used for TCP. Let's restrict SO_TXREHASH to TCP to reflect this. Later, we will make sk_txrehash a part of the union for other protocol families. Note that we need to modify BPF selftest not to get/set SO_TXREHASH for non-TCP sockets. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-22perf pmu intel: Adjust cpumasks for sub-NUMA clusters on graniterapidsIan Rogers
On graniterapids the cache home agent (CHA) and memory controller (IMC) PMUs all have their cpumask set to per-socket information. In order for per NUMA node aggregation to work correctly the PMUs cpumask needs to be set to CPUs for the relevant sub-NUMA grouping. For example, on a 2 socket graniterapids machine with sub NUMA clustering of 3, for uncore_cha and uncore_imc PMUs the cpumask is "0,120" leading to aggregation only on NUMA nodes 0 and 3: ``` $ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1 Performance counter stats for 'system wide': N0 1 277,835,681,344 UNC_CHA_CLOCKTICKS N0 1 19,242,894,228 UNC_M_CLOCKTICKS N3 1 277,803,448,124 UNC_CHA_CLOCKTICKS N3 1 19,240,741,498 UNC_M_CLOCKTICKS 1.002113847 seconds time elapsed ``` By updating the PMUs cpumasks to "0,120", "40,160" and "80,200" then the correctly 6 NUMA node aggregations are achieved: ``` $ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1 Performance counter stats for 'system wide': N0 1 92,748,667,796 UNC_CHA_CLOCKTICKS N0 0 6,424,021,142 UNC_M_CLOCKTICKS N1 0 92,753,504,424 UNC_CHA_CLOCKTICKS N1 1 6,424,308,338 UNC_M_CLOCKTICKS N2 0 92,751,170,084 UNC_CHA_CLOCKTICKS N2 0 6,424,227,402 UNC_M_CLOCKTICKS N3 1 92,745,944,144 UNC_CHA_CLOCKTICKS N3 0 6,423,752,086 UNC_M_CLOCKTICKS N4 0 92,725,793,788 UNC_CHA_CLOCKTICKS N4 1 6,422,393,266 UNC_M_CLOCKTICKS N5 0 92,717,504,388 UNC_CHA_CLOCKTICKS N5 0 6,421,842,618 UNC_M_CLOCKTICKS 1.003406645 seconds time elapsed ``` In general, having the perf tool adjust cpumasks isn't desirable as ideally the PMU driver would be advertising the correct cpumask. 
Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Weilin Wang <weilin.wang@intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250515181417.491401-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
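The cpumasks quoted in the commit above ("0,120", "40,160", "80,200") follow a regular pattern. The sketch below reproduces it under a loudly stated assumption: CPUs are numbered contiguously, 120 per socket, and split evenly between 3 sub-NUMA clusters. That layout matches the quoted masks but is not stated in the commit, and the helper is hypothetical rather than what the perf tool does.

```c
/*
 * Hypothetical helper: first CPU of a sub-NUMA cluster, assuming CPUs are
 * numbered contiguously per socket and split evenly between clusters.
 */
int snc_first_cpu(int socket, int cluster, int cpus_per_socket,
		  int clusters_per_socket)
{
	int cpus_per_cluster = cpus_per_socket / clusters_per_socket;

	return socket * cpus_per_socket + cluster * cpus_per_cluster;
}
```

With 120 CPUs per socket and 3 clusters, cluster 1 yields CPUs 40 and 160 for sockets 0 and 1, which is the second mask in the commit message.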
2025-05-22perf tests trace_summary.sh: Run in exclusive modeArnaldo Carvalho de Melo
This test succeeds only when running alone, probably because some tests add the vfs_getname probe, which gets used by 'perf trace' and alters how it does syscall arg pathname resolution. This should be removed or made a fallback to the preferred BPF mode of getting syscall parameters, but till then, run this in exclusive mode. For reference, here are some of the tests that run close to this one: 127: perf record offcpu profiling tests : Ok 128: perf all PMU test : Ok 129: perf stat --bpf-counters test : Ok 130: Check Arm CoreSight trace data recording and synthesized samples: Skip 131: Check Arm CoreSight disassembly script completes without errors : Skip 132: Check Arm SPE trace data recording and synthesized samples : Skip 133: Test data symbol : Ok 134: Miscellaneous Intel PT testing : Skip 135: test Intel TPEBS counting mode : Skip 136: perf script task-analyzer tests : Ok 137: Check open filename arg using perf trace + vfs_getname : Ok 138: perf trace summary : Ok Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/aC-hHTgArwlF_zu9@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-22perf test: Add cgroup summary test case for 'perf trace'Namhyung Kim
$ sudo ./perf test -vv 112 112: perf trace summary: --- start --- test child forked, pid 1018940 testing: perf trace -s -- true testing: perf trace -S -- true testing: perf trace -s --summary-mode=thread -- true testing: perf trace -S --summary-mode=total -- true testing: perf trace -as --summary-mode=thread --no-bpf-summary -- true testing: perf trace -as --summary-mode=total --no-bpf-summary -- true testing: perf trace -as --summary-mode=thread --bpf-summary -- true testing: perf trace -as --summary-mode=total --bpf-summary -- true testing: perf trace -aS --summary-mode=total --bpf-summary -- true testing: perf trace -as --summary-mode=cgroup --bpf-summary -- true testing: perf trace -aS --summary-mode=cgroup --bpf-summary -- true ---- end(0) ---- 112: perf trace summary : Ok Reviewed-by: Howard Chu <howardchu95@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522142551.1062417-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-22perf python: Add counting.py as example for counting perf eventsGautam Menghani
Add counting.py - a python version of counting.c to demonstrate measuring and reading of counts for given perf events. Committer testing: Build perf and make the generated python binding somewhere you can point to to avoid using the one in the distro python3-perf (fedora, may be different in other distros): $ make -k O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin Copy /tmp/build/perf-tools-next/python/perf.cpython-313-x86_64-linux-gnu.so to somewhere outside this toolbox container and then use it with root: # export PYTHONPATH=/root/python/ # ls -la /root/python/ total 10640 drwxr-xr-x. 1 root root 72 May 21 11:40 . dr-xr-x---. 1 root root 574 May 21 11:40 .. -rwxr-xr-x. 1 acme acme 10894360 May 21 11:40 perf.cpython-313-x86_64-linux-gnu.so # tools/perf/python/counting.py | head -5 For evsel(software/cpu-clock/) val: 2930946 enable: 2932479 run: 2932479 For evsel(software/cpu-clock/) val: 2924975 enable: 2926267 run: 2926267 For evsel(software/cpu-clock/) val: 2921017 enable: 2922430 run: 2922430 For evsel(software/cpu-clock/) val: 2914966 enable: 2916549 run: 2916549 For evsel(software/cpu-clock/) val: 2910027 enable: 2911589 run: 2911589 # Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> [ make the API take a CPU and thread then compute from these the appropriate indices. 
] Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/linux-perf-users/CAP-5=fWb-=hCYmpg7U5N9C94EucQGTOS7YwR2-fo4ptOexzxyg@mail.gmail.com/ Link: https://lore.kernel.org/r/20250519195148.1708988-8-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-22perf python: Add evlist close supportGautam Menghani
Add support for the evlist close function. Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250519195148.1708988-7-irogers@google.com Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-22perf python: Add evsel read methodGautam Menghani
Add the evsel read method to enable python to read counter data for the given evsel. Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/linux-perf-users/20250512055748.479786-1-gautam@linux.ibm.com/ Link: https://lore.kernel.org/r/20250519195148.1708988-6-irogers@google.com [ make the API take a CPU and thread then compute from these the appropriate indices. ] Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-22perf python: Add support for 'struct perf_counts_values' to return counter dataGautam Menghani
Add support for the perf_counts_values struct to enable the python bindings to read and return the counter data. Committer notes: Use T_ULONG instead of Py_T_ULONG, as all the other PyMemberDef arrays, fixing the build with older python3 versions. Use { .name = NULL, } to finish the new PyMemberDef pyrf_counts_values_members array, again as the other arrays to please some clang versions, ditto for PyGetSetDef. Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250519195148.1708988-5-irogers@google.com Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-22selftests/eventfd: correct test name and improve messagesRyan Chung
- Rename test from eventfd_chek_flag_cloexec_and_nonblock to eventfd_check_flag_cloexec_and_nonblock. - Make the RDWR‐flag comment declarative: “The kernel automatically adds the O_RDWR flag.” - Update semaphore‐flag failure message to: “eventfd semaphore flag check failed: …” Link: https://lkml.kernel.org/r/20250513074411.6965-1-seokwoo.chung130@gmail.com Signed-off-by: Ryan Chung <seokwoo.chung130@gmail.com> Reviewed-by: Wen Yang <wen.yang@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-22selftests/damon/_damon_sysfs: read tried regions directories in orderSeongJae Park
Kdamond.update_schemes_tried_regions() reads and stores tried regions information out of address order. It makes debugging a test failure difficult. Change the behavior to do the reading and writing in the address order. Link: https://lkml.kernel.org/r/20250513002715.40126-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-22selftests/mm: deduplicate second mmap() of 5*PAGE_SIZE at baseMark Brown
The map_fixed_noreplace test does two blocks of tests starting from a mapping of 5 pages at the base address, logging a test result for each initial mapping. These are logged with the same test name, causing test automation software to see two reports for the same test in a single run. Tweak the log message for the second one to deduplicate. Link: https://lkml.kernel.org/r/20250518-selftests-mm-map-fixed-noreplace-dup-v1-1-1a11a62c5e9f@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-22selftests/mm: add simple VM_PFNMAP tests based on mmap'ing /dev/memDavid Hildenbrand
Let's test some basic functionality using /dev/mem. These tests will implicitly cover some PAT (Page Attribute Table) handling on x86. These tests will only run when /dev/mem access to the first two pages in physical address space is possible and allowed; otherwise, the tests are skipped. On current x86-64 with PAT inside a VM, all tests pass: TAP version 13 1..6 # Starting 6 tests from 1 test cases. # RUN pfnmap.madvise_disallowed ... # OK pfnmap.madvise_disallowed ok 1 pfnmap.madvise_disallowed # RUN pfnmap.munmap_split ... # OK pfnmap.munmap_split ok 2 pfnmap.munmap_split # RUN pfnmap.mremap_fixed ... # OK pfnmap.mremap_fixed ok 3 pfnmap.mremap_fixed # RUN pfnmap.mremap_shrink ... # OK pfnmap.mremap_shrink ok 4 pfnmap.mremap_shrink # RUN pfnmap.mremap_expand ... # OK pfnmap.mremap_expand ok 5 pfnmap.mremap_expand # RUN pfnmap.fork ... # OK pfnmap.fork ok 6 pfnmap.fork # PASSED: 6 / 6 tests passed. # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0 However, we are able to trigger: [ 27.888251] x86/PAT: pfnmap:1790 freeing invalid memtype [mem 0x00000000-0x00000fff] There are probably more things worth testing in the future, such as MAP_PRIVATE handling. But this set of tests is sufficient to cover most of the things we will rework regarding PAT handling. Link: https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-22  selftests/bpf: sockmap_listen cleanup: Drop af_inet SOCK_DGRAM redir tests  Michal Luczaj
Remove tests covered by sockmap_redir. Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-8-a1ea723f7e7e@rbox.co
2025-05-22  selftests/bpf: sockmap_listen cleanup: Drop af_unix redir tests  Michal Luczaj
Remove tests covered by sockmap_redir. Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-7-a1ea723f7e7e@rbox.co
2025-05-22  selftests/bpf: sockmap_listen cleanup: Drop af_vsock redir tests  Michal Luczaj
Remove tests covered by sockmap_redir. Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-6-a1ea723f7e7e@rbox.co
2025-05-22  selftests/bpf: Add selftest for sockmap/hashmap redirection  Michal Luczaj
Test redirection logic. All supported and unsupported redirect combinations are tested for success and failure respectively.

  BPF_MAP_TYPE_SOCKMAP
  BPF_MAP_TYPE_SOCKHASH
    x
  sk_msg-to-egress
  sk_msg-to-ingress
  sk_skb-to-egress
  sk_skb-to-ingress
    x
  AF_INET, SOCK_STREAM
  AF_INET6, SOCK_STREAM
  AF_INET, SOCK_DGRAM
  AF_INET6, SOCK_DGRAM
  AF_UNIX, SOCK_STREAM
  AF_UNIX, SOCK_DGRAM
  AF_VSOCK, SOCK_STREAM
  AF_VSOCK, SOCK_SEQPACKET

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-5-a1ea723f7e7e@rbox.co
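The cross product listed in the commit above (map type x redirect type x socket kind) can be walked with three nested loops. This is only an illustrative sketch: for_each_combination() and print_combination() are hypothetical names, and the sketch uses one socket kind per cell, whereas the selftest itself may pair an input socket kind with an output socket kind.

```c
#include <stddef.h>
#include <stdio.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* The three axes, copied from the commit message. */
static const char *const map_types[] = {
	"BPF_MAP_TYPE_SOCKMAP", "BPF_MAP_TYPE_SOCKHASH",
};
static const char *const redir_types[] = {
	"sk_msg-to-egress", "sk_msg-to-ingress",
	"sk_skb-to-egress", "sk_skb-to-ingress",
};
static const char *const sock_kinds[] = {
	"AF_INET, SOCK_STREAM",  "AF_INET6, SOCK_STREAM",
	"AF_INET, SOCK_DGRAM",   "AF_INET6, SOCK_DGRAM",
	"AF_UNIX, SOCK_STREAM",  "AF_UNIX, SOCK_DGRAM",
	"AF_VSOCK, SOCK_STREAM", "AF_VSOCK, SOCK_SEQPACKET",
};

/*
 * Hypothetical helper: call visit() once per combination (visit may be
 * NULL) and return how many combinations were walked.
 */
size_t for_each_combination(void (*visit)(const char *map,
					  const char *redir,
					  const char *sock))
{
	size_t m, r, s, count = 0;

	for (m = 0; m < ARRAY_LEN(map_types); m++)
		for (r = 0; r < ARRAY_LEN(redir_types); r++)
			for (s = 0; s < ARRAY_LEN(sock_kinds); s++) {
				if (visit)
					visit(map_types[m], redir_types[r],
					      sock_kinds[s]);
				count++;
			}
	return count;
}

/* Example visitor: print one line per combination. */
void print_combination(const char *map, const char *redir, const char *sock)
{
	printf("%s / %s / %s\n", map, redir, sock);
}
```

Calling for_each_combination(print_combination) prints one line per cell; with the axes above that is 2 x 4 x 8 = 64 lines.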
2025-05-22  selftests/bpf: Introduce verdict programs for sockmap_redir  Michal Luczaj
Instead of piggybacking on test_sockmap_listen, introduce test_sockmap_redir specifically for sockmap redirection tests. Suggested-by: Jiayuan Chen <mrpre@163.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-4-a1ea723f7e7e@rbox.co
2025-05-22  selftests/bpf: Add u32()/u64() to sockmap_helpers  Michal Luczaj
Add integer wrappers for convenient sockmap usage. While there, fix misaligned trailing slashes. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-3-a1ea723f7e7e@rbox.co
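Integer wrappers of this kind are handy when an API takes keys and values by pointer, since they let a caller pass a constant without declaring a temporary variable. The sketch below assumes they are built on C99 compound literals; the definitions shown are an assumption, not a copy of sockmap_helpers, and read_u32()/read_u64() are hypothetical stand-ins for a pointer-taking API.

```c
#include <stdint.h>
#include <string.h>

/*
 * Assumed definitions: a compound literal yields an addressable object,
 * so &u32(0) is a valid pointer for the rest of the enclosing block.
 */
#define u32(v) ((uint32_t){ (v) })
#define u64(v) ((uint64_t){ (v) })

/* Hypothetical stand-ins for an API that takes its argument by pointer. */
uint32_t read_u32(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

uint64_t read_u64(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

With these in place, a call such as read_u32(&u32(7)) replaces the two-step pattern of declaring a local uint32_t and passing its address.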
2025-05-22  selftests/bpf: Add socket_kind_to_str() to socket_helpers  Michal Luczaj
Add function that returns string representation of socket's domain/type. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-2-a1ea723f7e7e@rbox.co
2025-05-22  selftests/bpf: Support af_unix SOCK_DGRAM socket pair creation  Michal Luczaj
Handle af_unix in init_addr_loopback(). For pair creation, bind() the peer socket to make SOCK_DGRAM connect() happy. Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-1-a1ea723f7e7e@rbox.co
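The need for bind() follows from how AF_UNIX datagram sockets are addressed: connect() needs the target to have an address, so a socket that was never bound cannot be connected to. The sketch below illustrates the idea with Linux autobind (binding with only the address family, which assigns an abstract address). It is not the selftest's code; unix_dgram_autobind() and unix_dgram_pair() are hypothetical names.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: autobind fd and fetch the address it was given. */
int unix_dgram_autobind(int fd, struct sockaddr_un *addr, socklen_t *len)
{
	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;

	/* Passing only sun_family asks Linux to pick an abstract address. */
	if (bind(fd, (struct sockaddr *)addr, sizeof(sa_family_t)))
		return -1;

	*len = sizeof(*addr);
	return getsockname(fd, (struct sockaddr *)addr, len);
}

/*
 * Hypothetical helper: create two AF_UNIX SOCK_DGRAM sockets connected
 * to each other. Both are bound first so that each connect() has a
 * target address. Returns 0 on success, -1 on failure.
 */
int unix_dgram_pair(int *a, int *b)
{
	struct sockaddr_un addr_a, addr_b;
	socklen_t len_a, len_b;
	int s0, s1;

	s0 = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (s0 < 0)
		return -1;
	s1 = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (s1 < 0) {
		close(s0);
		return -1;
	}

	if (unix_dgram_autobind(s0, &addr_a, &len_a) ||
	    unix_dgram_autobind(s1, &addr_b, &len_b) ||
	    connect(s0, (struct sockaddr *)&addr_b, len_b) ||
	    connect(s1, (struct sockaddr *)&addr_a, len_a)) {
		close(s0);
		close(s1);
		return -1;
	}

	*a = s0;
	*b = s1;
	return 0;
}
```

After unix_dgram_pair(&a, &b) succeeds, a datagram sent with send() on one descriptor can be read with recv() on the other, in either direction.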
2025-05-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  Jakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc8).

Conflicts:
  80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X")
  4bcc063939a5 ("ice, irdma: fix an off by one in error handling code")
  c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au

No extra adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>