path: root/kernel
Age  Commit message  Author
2025-03-24  Merge tag 'sched_ext-for-6.15' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

- Add mechanism to count and report internal events. This significantly improves visibility on subtle corner conditions.
- The default idle CPU selection logic is revamped and improved in multiple ways including being made topology aware.
- sched_ext was disabling ttwu_queue for simplicity, which can be costly when hardware topology is more complex. Implement SCX_OPS_ALLOW_QUEUED_WAKEUP so that BPF schedulers can selectively enable ttwu_queue.
- tools/sched_ext updates to improve compatibility among others.
- Other misc updates and fixes.
- sched_ext/for-6.14-fixes were pulled a few times to receive prerequisite fixes and resolve conflicts.

* tag 'sched_ext-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (42 commits)
  sched_ext: idle: Refactor scx_select_cpu_dfl()
  sched_ext: idle: Honor idle flags in the built-in idle selection policy
  sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()
  sched_ext: Add trace point to track sched_ext core events
  sched_ext: Change the event type from u64 to s64
  sched_ext: Documentation: add task lifecycle summary
  tools/sched_ext: Provide a compatible helper for scx_bpf_events()
  selftests/sched_ext: Add NUMA-aware scheduler test
  tools/sched_ext: Provide consistent access to scx flags
  sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
  sched_ext: idle: Introduce scx_bpf_nr_node_ids()
  sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
  sched_ext: idle: Per-node idle cpumasks
  sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
  sched_ext: idle: Make idle static keys private
  sched/topology: Introduce for_each_node_numadist() iterator
  mm/numa: Introduce nearest_node_nodemask()
  nodemask: numa: reorganize inclusion path
  nodemask: add nodes_copy()
  tools/sched_ext: Sync with scx repo
  ...
2025-03-24  Merge tag 'cgroup-for-6.15' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Add deprecation info messages to cgroup1-only features - rstat updates including a bug fix and breaking up a critical section to reduce interrupt latency impact - Other misc and doc updates * tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: Cleanup flushing functions and locking cgroup/rstat: avoid disabling irqs for O(num_cpu) mm: Fix a build breakage in memcontrol-v1.c blk-cgroup: Simplify policy files registration cgroup: Update file naming comment cgroup: Add deprecation message to legacy freezer controller mm: Add transformation message for per-memcg swappiness RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level cgroup/cpuset-v1: Add deprecation messages to memory_migrate cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall cgroup: Print message when /proc/cgroups is read on v2-only system cgroup/blkio: Add deprecation messages to reset_stats cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers cgroup/rstat: Fix forceidle time in cpu.stat cgroup/misc: Remove unused misc_cg_res_total_usage cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c cgroup: update comment about dropping cgroup kn refs
2025-03-24  Merge tag 'slab-for-6.15' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Move the TINY_RCU kvfree_rcu() implementation from RCU to SLAB subsystem and cleanup its integration (Vlastimil Babka) Following the move of the TREE_RCU batching kvfree_rcu() implementation in 6.14, move also the simpler TINY_RCU variant. Refactor the #ifdef guards so that the simple implementation is also used with SLUB_TINY. Remove the need for RCU to recognize fake callback function pointers (__is_kvfree_rcu_offset()) when handling call_rcu() by implementing a callback that calculates the object's address from the embedded rcu_head address without knowing its offset. - Improve kmalloc cache randomization in kvmalloc (GONG Ruiqi) Due to an extra layer of function call, all kvmalloc() allocations used the same set of random caches. Thanks to moving the kvmalloc() implementation to slub.c, this is improved and randomization now works for kvmalloc. - Various improvements to debugging, testing and other cleanups (Hyesoo Yu, Lilith Gkini, Uladzislau Rezki, Matthew Wilcox, Kevin Brodsky, Ye Bin) * tag 'slab-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slub: Handle freelist cycle in on_freelist() mm/slab: call kmalloc_noprof() unconditionally in kmalloc_array_noprof() slab: Mark large folios for debugging purposes kunit, slub: Add test_kfree_rcu_wq_destroy use case mm, slab: cleanup slab_bug() parameters mm: slub: call WARN() when detecting a slab corruption mm: slub: Print the broken data before restoring them slab: Achieve better kmalloc caches randomization in kvmalloc slab: Adjust placement of __kvmalloc_node_noprof mm/slab: simplify SLAB_* flag handling slab: don't batch kvfree_rcu() with SLUB_TINY rcu, slab: use a regular callback function for kvfree_rcu rcu: remove trace_rcu_kvfree_callback slab, rcu: move TINY_RCU variant of kvfree_rcu() to SLAB
2025-03-24  Merge tag 'seccomp-v6.15-rc1' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: - avoid the lock trip seccomp_filter_release in common case (Mateusz Guzik) - remove unused 'sd' argument through-out (Oleg Nesterov) - selftests/seccomp: Add hard-coded __NR_uretprobe for x86_64 * tag 'seccomp-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: avoid the lock trip seccomp_filter_release in common case seccomp: remove the 'sd' argument from __seccomp_filter() seccomp: remove the 'sd' argument from __secure_computing() seccomp: fix the __secure_computing() stub for !HAVE_ARCH_SECCOMP_FILTER seccomp/mips: change syscall_trace_enter() to use secure_computing() selftests/seccomp: Add hard-coded __NR_uretprobe for x86_64
2025-03-24  Merge tag 'hardening-v6.15-rc1' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

"As usual, it's scattered changes all over. Patches touching things outside of our traditional areas in the tree have been Acked by maintainers or were trivial changes:

- loadpin: remove unsupported MODULE_COMPRESS_NONE (Arulpandiyan Vadivel)
- samples/check-exec: Fix script name (Mickaël Salaün)
- yama: remove needless locking in yama_task_prctl() (Oleg Nesterov)
- lib/string_choices: Sort by function name (R Sundar)
- hardening: Allow default HARDENED_USERCOPY to be set at compile time (Mel Gorman)
- uaccess: Split out compile-time checks into ucopysize.h
- kbuild: clang: Support building UM with SUBARCH=i386
- x86: Enable i386 FORTIFY_SOURCE on Clang 16+
- ubsan/overflow: Rework integer overflow sanitizer option
- Add missing __nonstring annotations for callers of memtostr*()/strtomem*()
- Add __must_be_noncstr() and have memtostr*()/strtomem*() check for it
- Introduce __nonstring_array for silencing future GCC 15 warnings"

* tag 'hardening-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
  compiler_types: Introduce __nonstring_array
  hardening: Enable i386 FORTIFY_SOURCE on Clang 16+
  x86/build: Remove -ffreestanding on i386 with GCC
  ubsan/overflow: Enable ignorelist parsing and add type filter
  ubsan/overflow: Enable pattern exclusions
  ubsan/overflow: Rework integer overflow sanitizer option to turn on everything
  samples/check-exec: Fix script name
  yama: don't abuse rcu_read_lock/get_task_struct in yama_task_prctl()
  kbuild: clang: Support building UM with SUBARCH=i386
  loadpin: remove MODULE_COMPRESS_NONE as it is no longer supported
  lib/string_choices: Rearrange functions in sorted order
  string.h: Validate memtostr*()/strtomem*() arguments more carefully
  compiler.h: Introduce __must_be_noncstr()
  nilfs2: Mark on-disk strings as nonstring
  uapi: stddef.h: Introduce __kernel_nonstring
  x86/tdx: Mark message.bytes as nonstring
  string: kunit: Mark nonstring test strings as __nonstring
  scsi: qla2xxx: Mark device strings as nonstring
  scsi: mpt3sas: Mark device strings as nonstring
  scsi: mpi3mr: Mark device strings as nonstring
  ...
2025-03-24  Merge tag 'execve-v6.15-rc1' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - elf: Define and use note name macros (Akihiko Odaki) - elf: add remaining SHF_ flag macros (Timur Tabi) - binfmt: Remove loader from linux_binprm struct (Yonatan Goldschmidt) - binfmt_elf_fdpic: fix variable set but not used warning (sunliming) * tag 'execve-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf_fdpic: fix variable set but not used warning elf: add remaining SHF_ flag macros binfmt: Remove loader from linux_binprm struct crash: Remove KEXEC_CORE_NOTE_NAME s390/crash: Use note name macros crash: Use note name macros powerpc/crash: Use note name macros binfmt_elf: Use note name macros elf: Define note name macros
2025-03-24  rv: Add scpd, snep and sncid per-cpu monitors  Gabriele Monaco
Add 3 per-cpu monitors as part of the sched model: * scpd: schedule called with preemption disabled Monitor to ensure schedule is called with preemption disabled * snep: schedule does not enable preempt Monitor to ensure schedule does not enable preempt * sncid: schedule not called with interrupt disabled Monitor to ensure schedule is not called with interrupt disabled To: Ingo Molnar <mingo@redhat.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-6-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24  rv: Add snroc per-task monitor  Gabriele Monaco
Add a per-task monitor as part of the sched model: * snroc: set non runnable on its own context Monitor to ensure set_state happens only in the respective task's context To: Ingo Molnar <mingo@redhat.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-5-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24  rv: Add sco and tss per-cpu monitors  Gabriele Monaco
Add 2 per-cpu monitors as part of the sched model: * sco: scheduling context operations Monitor to ensure sched_set_state happens only in thread context * tss: task switch while scheduling Monitor to ensure sched_switch happens only in scheduling context To: Ingo Molnar <mingo@redhat.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-4-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24  rv: Add option for nested monitors and include sched  Gabriele Monaco
Monitors describing complex systems, such as the scheduler, can easily grow to the point where they are just hard to understand because of the many possible state transitions. Often it is possible to break such descriptions into smaller monitors, sharing some or all events. Enabling those smaller monitors concurrently is, in fact, testing the system as if we had one single larger monitor. Splitting models into multiple specifications is not only easier to understand, but gives some more clues when we see errors.

Add the possibility to create container monitors, whose only purpose is to host other nested monitors. Enabling a container monitor enables all nested ones, but it's still possible to enable nested monitors independently. Add the sched monitor as first container, for now empty.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-3-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
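The container behaviour described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct monitor` and `enable_monitor()` are hypothetical names invented for the illustration and are not the kernel's RV data structures or API. It shows the one rule the commit message states: enabling a container enables every nested monitor, while a nested monitor can still be enabled on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; NOT the kernel's RV monitor structure. */
struct monitor {
    const char *name;
    const struct monitor *parent;   /* NULL for a top-level monitor */
    bool enabled;
};

/*
 * Enable the monitor called `name`. If it is a container, every monitor
 * whose parent is that container is enabled as well. Enabling a nested
 * monitor leaves its container and siblings untouched.
 * Returns 0 on success, -1 if no monitor has that name.
 */
int enable_monitor(struct monitor *mons, size_t n, const char *name)
{
    struct monitor *target = NULL;

    assert(mons != NULL && name != NULL);

    for (size_t i = 0; i < n; i++)
        if (strcmp(mons[i].name, name) == 0)
            target = &mons[i];
    if (!target)
        return -1;

    target->enabled = true;
    for (size_t i = 0; i < n; i++)
        if (mons[i].parent == target)
            mons[i].enabled = true;
    return 0;
}
```

Calling `enable_monitor()` with a nested monitor's name flips only that entry; calling it with the container's name flips the container and all of its children.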
2025-03-24  sched: Add sched tracepoints for RV task model  Gabriele Monaco
Add the following tracepoints: * sched_entry(bool preempt, ip) Called while entering __schedule * sched_exit(bool is_switch, ip) Called while exiting __schedule * sched_set_state(task, curr_state, state) Called when a task changes its state (to and from running) These tracepoints are useful to describe the Linux task model and are adapted from the patches by Daniel Bristot de Oliveira (https://bristot.me/linux-task-model/). Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-2-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24  tracing: Do not use PERF enums when perf is not defined  Steven Rostedt
An update was made to up the module ref count when a synthetic event is registered for both trace and perf events. But if perf is not configured in, the perf enums used will cause the kernel to fail to build. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Link: https://lore.kernel.org/20250323152151.528b5ced@batman.local.home Fixes: 21581dd4e7ff ("tracing: Ensure module defining synth event cannot be unloaded while tracing") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503232230.TeREVy8R-lkp@intel.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24  Merge tag 'kernel-6.15-rc1.tasklist_lock' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull tasklist_lock optimizations from Christian Brauner: "According to the performance testbots this brings a 23% performance increase when creating new processes: - Reduce tasklist_lock hold time on exit: - Perform add_device_randomness() without tasklist_lock - Perform free_pid() calls outside of tasklist_lock - Drop irq disablement around pidmap_lock - Add some tasklist_lock asserts - Call flush_sigqueue() lockless by changing release_task() - Don't pointlessly clear TIF_SIGPENDING in __exit_signal() -> clear_tsk_thread_flag()" * tag 'kernel-6.15-rc1.tasklist_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pid: drop irq disablement around pidmap_lock pid: perform free_pid() calls outside of tasklist_lock pid: sprinkle tasklist_lock asserts exit: hoist get_pid() in release_task() outside of tasklist_lock exit: perform add_device_randomness() without tasklist_lock exit: kill the pointless __exit_signal()->clear_tsk_thread_flag(TIF_SIGPENDING) exit: change the release_task() paths to call flush_sigqueue() lockless
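The common idea behind several of these changes (for example, performing free_pid() calls outside of tasklist_lock) is to do only the unlinking under the contended lock and defer the more expensive release work until after the lock is dropped. The following is a minimal userspace sketch of that pattern, not the kernel code: `struct table`, `reap_dead()` and the spin flag standing in for tasklist_lock are all hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    struct node *next;
    int dead;
};

struct table {
    atomic_flag lock;       /* hypothetical stand-in for a contended lock */
    struct node *head;
};

static void table_lock(struct table *t)
{
    while (atomic_flag_test_and_set(&t->lock))
        ;                   /* spin until the flag was previously clear */
}

static void table_unlock(struct table *t)
{
    atomic_flag_clear(&t->lock);
}

/*
 * Unlink dead nodes while holding the lock, but free them only after
 * the lock has been released, so the hold time covers pointer updates
 * only. Returns the number of nodes freed.
 */
int reap_dead(struct table *t)
{
    struct node *doomed = NULL, **pp;
    int freed = 0;

    assert(t != NULL);

    table_lock(t);
    pp = &t->head;
    while (*pp) {
        struct node *n = *pp;

        if (n->dead) {
            *pp = n->next;      /* unlink from the shared list */
            n->next = doomed;   /* park on a private list */
            doomed = n;
        } else {
            pp = &n->next;
        }
    }
    table_unlock(t);

    while (doomed) {            /* expensive part runs without the lock */
        struct node *n = doomed;

        doomed = n->next;
        free(n);
        freed++;
    }
    return freed;
}
```

The private `doomed` list is what makes the deferral safe in this sketch: once a node is unlinked under the lock, no other lock holder can reach it, so freeing it later needs no further synchronization.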
2025-03-24  Merge tag 'vfs-6.15-rc1.async.dir' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async dir updates from Christian Brauner: "This contains cleanups that fell out of the work from async directory handling: - Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places - Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely - Drop an unnecessary call to fh_update() from nfsd_create_locked() - Change i_op->mkdir() to return a struct dentry Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand. - Repack DENTRY_* and LOOKUP_* flags" * tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: fix inline emphasis warning VFS: Change vfs_mkdir() to return the dentry. nfs: change mkdir inode_operation to return alternate dentry if needed. fuse: return correct dentry for ->mkdir ceph: return the correct dentry on mkdir hostfs: store inode in dentry after mkdir if possible. Change inode_operations.mkdir to return struct dentry * nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() VFS: add common error checks to lookup_one_qstr_excl() VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry VFS: repack LOOKUP_ bit flags. VFS: repack DENTRY_ flags.
2025-03-24  Merge tag 'vfs-6.15-rc1.pidfs' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pidfs updates from Christian Brauner:

- Allow retrieving exit information after a process has been reaped through pidfds via the new PIDFD_INFO_EXIT extension for the PIDFD_GET_INFO ioctl. Various tools need access to information about a process/task even after it has already been reaped.

  Pidfd polling allows waiting on either task exit or for a task to have been reaped. The contract for PIDFD_INFO_EXIT is simply that EPOLLHUP must be observed before exit information can be retrieved, i.e., exit information is only provided once the task has been reaped and then can be retrieved as long as the pidfd is open.

- Add PIDFD_SELF_{THREAD,THREAD_GROUP} sentinels allowing userspace to forgo allocating a file descriptor for their own process. This is useful in scenarios where users want to act on their own process through pidfds and is akin to AT_FDCWD.

- Improve premature thread-group leader and subthread exec behavior when polling on pidfds:

  (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs.

  (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited.

  Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded no exit notification is generated for the old thread-group leader.

  The correct behavior is to simply not generate an exit notification on the struct pid of a subthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit prematurely as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point.

  After this pull no exit notifications will be generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread execs before that, no exit notification will be generated until that task exits or it creates subthreads and repeats the cycle. This means an exit notification indicates the ability for the parent to reap the child.

* tag 'vfs-6.15-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  selftests/pidfd: third test for multi-threaded exec polling
  selftests/pidfd: second test for multi-threaded exec polling
  selftests/pidfd: first test for multi-threaded exec polling
  pidfs: improve multi-threaded exec and premature thread-group leader exit polling
  pidfs: ensure that PIDFS_INFO_EXIT is available
  selftests/pidfd: add seventh PIDFD_INFO_EXIT selftest
  selftests/pidfd: add sixth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add fifth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add fourth PIDFD_INFO_EXIT selftest
  selftests/pidfd: add third PIDFD_INFO_EXIT selftest
  selftests/pidfd: add second PIDFD_INFO_EXIT selftest
  selftests/pidfd: add first PIDFD_INFO_EXIT selftest
  selftests/pidfd: expand common pidfd header
  pidfs/selftests: ensure correct headers for ioctl handling
  selftests/pidfd: fix header inclusion
  pidfs: allow to retrieve exit information
  pidfs: record exit code and cgroupid at exit
  pidfs: use private inode slab cache
  pidfs: move setting flags into pidfs_alloc_file()
  pidfd: rely on automatic cleanup in __pidfd_prepare()
  ...
2025-03-24  Merge tag 'vfs-6.15-rc1.pipe' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pipe updates from Christian Brauner: - Introduce struct file_operations pipeanon_fops - Don't update {a,c,m}time for anonymous pipes to avoid the performance costs associated with it - Change pipe_write() to never add a zero-sized buffer - Limit the slots in pipe_resize_ring() - Use pipe_buf() to retrieve the pipe buffer everywhere - Drop an always true check in anon_pipe_write() - Cache 2 pages instead of 1 - Avoid spurious calls to prepare_to_wait_event() in ___wait_event() * tag 'vfs-6.15-rc1.pipe' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs/splice: Use pipe_buf() helper to retrieve pipe buffer fs/pipe: Use pipe_buf() helper to retrieve pipe buffer kernel/watch_queue: Use pipe_buf() to retrieve the pipe buffer fs/pipe: Limit the slots in pipe_resize_ring() wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() pipe: cache 2 pages instead of 1 pipe: drop an always true check in anon_pipe_write() pipe: change pipe_write() to never add a zero-sized buffer pipe: don't update {a,c,m}time for anonymous pipes pipe: introduce struct file_operations pipeanon_fops
2025-03-24  Merge tag 'vfs-6.15-rc1.mount' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount updates from Christian Brauner:

- Mount notifications

  The day has come where we finally provide a new api to listen for mount topology changes outside of /proc/<pid>/mountinfo. A mount namespace file descriptor can be supplied and registered with fanotify to listen for mount topology changes.

  Currently notifications for mount, umount and moving mounts are generated. The generated notification record contains the unique mount id of the mount.

  The listmount() and statmount() api can be used to query detailed information about the mount using the received unique mount id.

  This allows userspace to figure out exactly how the mount topology changed without having to generate diffs of /proc/<pid>/mountinfo in userspace.

- Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api

- Support detached mounts in overlayfs

  Since last cycle we support specifying overlayfs layers via file descriptors. However, we don't allow detached mounts which means userspace cannot use file descriptors received via open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach them to a mount namespace via move_mount() first.

  This is cumbersome and means they have to undo mounts via umount(). Allow them to directly use detached mounts.

- Allow to retrieve idmappings with statmount

  Currently it isn't possible to figure out what idmapping has been attached to an idmapped mount. Add an extension to statmount() which allows reading the idmapping from the mount.

- Allow creating idmapped mounts from mounts that are already idmapped

  So far it isn't possible to allow the creation of idmapped mounts from already idmapped mounts as this has significant lifetime implications. Make the creation of idmapped mounts atomic by allowing struct mount_attr to be passed together with the open_tree_attr() system call, which solves these issues without complicating VFS lookup in any way.

  The system call has in general the benefit that creating a detached mount and applying mount attributes to it becomes an atomic operation for userspace.

- Add a way to query statmount() for supported options

  Allow userspace to query which mount information can be retrieved through statmount().

- Allow superblock owners to force unmount

* tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits)
  umount: Allow superblock owners to force umount
  selftests: add tests for mount notification
  selinux: add FILE__WATCH_MOUNTNS
  samples/vfs: fix printf format string for size_t
  fs: allow changing idmappings
  fs: add kflags member to struct mount_kattr
  fs: add open_tree_attr()
  fs: add copy_mount_setattr() helper
  fs: add vfs_open_tree() helper
  statmount: add a new supported_mask field
  samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP
  selftests: add tests for using detached mount with overlayfs
  samples/vfs: check whether flag was raised
  statmount: allow to retrieve idmappings
  uidgid: add map_id_range_up()
  fs: allow detached mounts in clone_private_mount()
  selftests/overlayfs: test specifying layers as O_PATH file descriptors
  fs: support O_PATH fds with FSCONFIG_SET_FD
  vfs: add notifications for mount attach and detach
  fanotify: notify on mount attach and detach
  ...
2025-03-24  Merge tag 'vfs-6.15-rc1.misc' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:

"Features:

- Add CONFIG_DEBUG_VFS infrastructure:
  - Catch invalid modes in open
  - Use the new debug macros in inode_set_cached_link()
  - Use debug-only asserts around fd allocation and install
- Place f_ref to 3rd cache line in struct file to resolve false sharing

Cleanups:

- Start using anon_inode_getfile_fmode() helper in various places
- Don't take f_lock during SEEK_CUR if exclusion is guaranteed by f_pos_lock
- Add unlikely() to kcmp()
- Remove legacy ->remount_fs method from ecryptfs after port to the new mount api
- Remove invalidate_inodes() in favour of evict_inodes()
- Simplify ep_busy_loop by removing unused argument
- Avoid mmap sem relocks when coredumping with many missing pages
- Inline getname()
- Inline new_inode_pseudo() and de-staticize alloc_inode()
- Dodge an atomic in putname if ref == 1
- Consistently deref the files table with rcu_dereference_raw()
- Dedup handling of struct filename init and refcounts bumps
- Use wq_has_sleeper() in end_dir_add()
- Drop the lock trip around I_NEW wake up in evict()
- Load the ->i_sb pointer once in inode_sb_list_{add,del}
- Predict not reaching the limit in alloc_empty_file()
- Tidy up do_sys_openat2() with likely/unlikely
- Call inode_sb_list_add() outside of inode hash lock
- Sort out fd allocation vs dup2 race commentary
- Turn page_offset() into a wrapper around folio_pos()
- Remove locking in exportfs around ->get_parent() call
- try_lookup_one_len() does not need any locks in autofs
- Fix return type of several functions from long to int in open
- Fix return type of several functions from long to int in ioctls

Fixes:

- Fix watch queue accounting mismatch"

* tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: sort out fd allocation vs dup2 race commentary, take 2
  fs: call inode_sb_list_add() outside of inode hash lock
  fs: tidy up do_sys_openat2() with likely/unlikely
  fs: predict not reaching the limit in alloc_empty_file()
  fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
  fs: drop the lock trip around I_NEW wake up in evict()
  fs: use wq_has_sleeper() in end_dir_add()
  VFS/autofs: try_lookup_one_len() does not need any locks
  fs: dedup handling of struct filename init and refcounts bumps
  fs: consistently deref the files table with rcu_dereference_raw()
  exportfs: remove locking around ->get_parent() call.
  fs: use debug-only asserts around fd allocation and install
  fs: dodge an atomic in putname if ref == 1
  vfs: Remove invalidate_inodes()
  ecryptfs: remove NULL remount_fs from super_operations
  watch_queue: fix pipe accounting mismatch
  fs: place f_ref to 3rd cache line in struct file to resolve false sharing
  epoll: simplify ep_busy_loop by removing always 0 argument
  fs: Turn page_offset() into a wrapper around folio_pos()
  kcmp: improve performance adding an unlikely hint to task comparisons
  ...
2025-03-24  Merge branch 'pm-sleep'  Rafael J. Wysocki
Merge updates related to system sleep for 6.15-rc1 including fixes, cleanups and a rework of the "smart suspend" driver flag handling to avoid issues that may occur when drivers using it depend on some other drivers:

- Rework the handling of the "smart suspend" driver flag in the PM core to avoid issues that may occur when drivers using it depend on some other drivers and clean up the related PM core code (Rafael Wysocki, Colin Ian King).
- Fix the handling of devices with the power.direct_complete flag set if device_suspend() returns an error for at least one device to avoid situations in which some of them may not be resumed (Rafael Wysocki).
- Use mutex_trylock() in hibernate_compressor_param_set() to avoid a possible deadlock that may occur if the "compressor" hibernation module parameter is accessed during the registration of a new ieee80211 device (Lizhi Xu).
- Suppress sleeping parent warning in device_pm_add() in the case when new children are added under a device with the power.direct_complete set after it has been processed by device_resume() (Xu Yang).
- Remove needless return in three void functions related to system wakeup (Zijun Hu).
- Replace deprecated kmap_atomic() with kmap_local_page() in the hibernation core code (David Reaver).
- Remove unused helper functions related to system sleep (David Alan Gilbert).
- Clean up s2idle_enter() so it does not lock and unlock CPU offline in vain and update comments in it (Ulf Hansson).
- Clean up broken white space in dpm_wait_for_children() (Geert Uytterhoeven).
* pm-sleep: PM: sleep: Fix bit masking operation PM: sleep: Fix handling devices with direct_complete set on errors PM: sleep: core: Fix indentation in dpm_wait_for_children() PM: s2idle: Extend comment in s2idle_enter() PM: s2idle: Drop redundant locks when entering s2idle PM: sleep: Remove unused pm_generic_ wrappers PM: sleep: Rearrange dpm_async_fn() and async state clearing PM: sleep: Rename power.async_in_progress to power.work_in_progress PM: core: Tweak pm_runtime_block_if_disabled() return value PM: runtime: Convert pm_runtime_blocked() to static inline PM: sleep: Update power.smart_suspend under PM spinlock PM: sleep: Adjust check before setting power.must_resume PM: wakeup: Remove needless return in three void APIs PM: sleep: Suppress sleeping parent warning in special case PM: hibernate: Avoid deadlock in hibernate_compressor_param_set() PM: sleep: Avoid unnecessary checks in device_prepare_smart_suspend() PM: sleep: Use DPM_FLAG_SMART_SUSPEND conditionally PM: runtime: Introduce pm_runtime_blocked() PM: Block enabling of runtime PM during system suspend PM: hibernate: Replace deprecated kmap_atomic() with kmap_local_page()
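The mutex_trylock() item above relies on a general technique: a parameter setter that might be invoked while the lock is already held elsewhere in the call chain should fail fast instead of blocking. The sketch below models that in userspace. It is an illustration only; `transition_lock`, `compressor_param_set()` and the choice of `-EBUSY` as the failure code are assumptions made for the example and are not taken from the kernel patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel mutex and module parameter. */
atomic_flag transition_lock = ATOMIC_FLAG_INIT;
char compressor[16] = "lzo";

/* Succeeds only if the flag was previously clear; never waits. */
bool transition_trylock(void)
{
    return !atomic_flag_test_and_set(&transition_lock);
}

void transition_unlock(void)
{
    atomic_flag_clear(&transition_lock);
}

/*
 * Set the parameter without ever blocking on the lock: if someone
 * else holds it, report -EBUSY and leave the value unchanged.
 */
int compressor_param_set(const char *val)
{
    assert(val != NULL);

    if (strlen(val) >= sizeof(compressor))
        return -EINVAL;
    if (!transition_trylock())
        return -EBUSY;

    strcpy(compressor, val);    /* length was checked above */
    transition_unlock();
    return 0;
}
```

The trade-off is that callers must cope with a transient failure, which is usually acceptable for a rarely written tunable and much cheaper than a deadlock.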
2025-03-24  Merge branches 'pm-em' and 'pm-runtime'  Rafael J. Wysocki
Merge Energy Model handling code updates and updates of the runtime PM core code for 6.15-rc1: - Clean up the Energy Model handling code somewhat (Rafael Wysocki). - Use kfree_rcu() to simplify the handling of runtime Energy Model updates (Li RongQing). - Add an entry for the Energy Model framework to MAINTAINERS as properly maintained (Lukasz Luba). - Address RCU-related sparse warnings in the Energy Model code (Rafael Wysocki). - Remove ENERGY_MODEL dependency on SMP and allow it to be selected when DEVFREQ is set without CPUFREQ so it can be used on a wider range of systems (Jeson Gao). - Unify error handling during runtime suspend and runtime resume in the core to help drivers to implement more consistent runtime PM error handling (Rafael Wysocki). - Drop a redundant check from pm_runtime_force_resume() and rearrange documentation related to __pm_runtime_disable() (Rafael Wysocki). * pm-em: PM: EM: Rework the depends on for CONFIG_ENERGY_MODEL PM: EM: Address RCU-related sparse warnings PM: EM: Consify two parameters of em_dev_register_perf_domain() MAINTAINERS: Add Energy Model framework as properly maintained PM: EM: use kfree_rcu() to simplify the code PM: EM: Slightly reduce em_check_capacity_update() overhead PM: EM: Drop unused parameter from em_adjust_new_capacity() * pm-runtime: PM: runtime: Unify error handling during suspend and resume PM: runtime: Drop status check from pm_runtime_force_resume() PM: Rearrange documentation related to __pm_runtime_disable()
2025-03-23tracing: Use hashtable.h for event_hashSasha Levin
Convert the event_hash array in trace_output.c to use the generic hashtable implementation from hashtable.h instead of the manually implemented hash table. This simplifies the code and makes it more maintainable by using the standard hashtable API defined in hashtable.h. Rename EVENT_HASHSIZE to EVENT_HASH_BITS to properly reflect its new meaning as the number of bits for the hashtable size. Link: https://lore.kernel.org/20250323132800.3010783-1-sashal@kernel.org Link: https://lore.kernel.org/20250319190545.3058319-1-sashal@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
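The rename from a size to a bit count can be illustrated with a minimal user-space sketch. It is not the kernel's hashtable.h API: the names demo_event, bucket_of(), demo_event_add() and demo_event_find() are hypothetical, the value 7 for EVENT_HASH_BITS is an assumption, and the multiplicative hash only mimics the general style of bits-based bucket selection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: 7 bits, i.e. 2^7 = 128 buckets. */
#define EVENT_HASH_BITS 7
#define DEMO_HASH_SIZE  (1u << EVENT_HASH_BITS)

/* Hypothetical stand-in for a trace event keyed by its type number. */
struct demo_event {
    int type;
    struct demo_event *next;
};

static struct demo_event *demo_hash[DEMO_HASH_SIZE];

/* Multiplicative hash: keep the top EVENT_HASH_BITS bits of the product. */
static unsigned int bucket_of(int type)
{
    return ((uint32_t)type * 0x61C88647u) >> (32 - EVENT_HASH_BITS);
}

/* Insert at the head of the bucket's chain. */
static void demo_event_add(struct demo_event *ev)
{
    unsigned int b;

    assert(ev != NULL);
    b = bucket_of(ev->type);
    ev->next = demo_hash[b];
    demo_hash[b] = ev;
}

/* Walk only the one bucket that the key hashes to. */
static struct demo_event *demo_event_find(int type)
{
    struct demo_event *ev;

    for (ev = demo_hash[bucket_of(type)]; ev; ev = ev->next)
        if (ev->type == type)
            return ev;
    return NULL;
}
```

With a bit count as the single parameter, the table size and the hash shift are both derived from it, so the two cannot drift apart.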
2025-03-23tracing: Ensure module defining synth event cannot be unloaded while tracingDouglas Raillard
Currently, using synth_event_delete() will fail if the event is being used (tracing in progress), but that is normally done in the module exit function. At that stage, failing is problematic as returning a non-zero status means the module will become locked (impossible to unload or reload again). Instead, ensure the module exit function does not get called in the first place by increasing the module refcnt when the event is enabled. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 35ca5207c2d11 ("tracing: Add synthetic event command generation functions") Link: https://lore.kernel.org/20250318180906.226841-1-douglas.raillard@arm.com Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-23tracing: fix return value in __ftrace_event_enable_disable for TRACE_REG_UNREGISTERGabriele Paoloni
When __ftrace_event_enable_disable invokes the class callback to unregister the event, the return value is not reported up to the caller, hence leading to event unregister failures being silently ignored. This patch assigns the result of the event unregister callback to the ret variable, so that its return value is stored and reported to the caller, and it raises a warning in case of error. Link: https://lore.kernel.org/20250321170821.101403-1-gpaoloni@redhat.com Signed-off-by: Gabriele Paoloni <gpaoloni@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-23tracing/osnoise: Fix possible recursive locking for cpus_read_lock()Ran Xiaokai
Lockdep reports this deadlock log: osnoise: could not start sampling thread ============================================ WARNING: possible recursive locking detected -------------------------------------------- CPU0 ---- lock(cpu_hotplug_lock); lock(cpu_hotplug_lock); Call Trace: <TASK> print_deadlock_bug+0x282/0x3c0 __lock_acquire+0x1610/0x29a0 lock_acquire+0xcb/0x2d0 cpus_read_lock+0x49/0x120 stop_per_cpu_kthreads+0x7/0x60 start_kthread+0x103/0x120 osnoise_hotplug_workfn+0x5e/0x90 process_one_work+0x44f/0xb30 worker_thread+0x33e/0x5e0 kthread+0x206/0x3b0 ret_from_fork+0x31/0x50 ret_from_fork_asm+0x11/0x20 </TASK> This is the deadlock scenario: osnoise_hotplug_workfn() guard(cpus_read_lock)(); // first lock call start_kthread(cpu) if (IS_ERR(kthread)) { stop_per_cpu_kthreads(); { cpus_read_lock(); // second lock call, causes the AA deadlock } } It is not necessary to call stop_per_cpu_kthreads(), which stops the osnoise kthread on every other CPU in the system, if a failure occurs during hotplug of a certain CPU. For start_per_cpu_kthreads(), if the start_kthread() call fails, this function calls stop_per_cpu_kthreads() to handle the error. Therefore, similarly, there is no need to call stop_per_cpu_kthreads() again within start_kthread(). So just remove stop_per_cpu_kthreads() from start_kthread() to solve this issue. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250321095249.2739397-1-ranxiaokai627@163.com Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
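The scenario above can be sketched in user space as follows. This is only a model under stated assumptions: all demo_* names are hypothetical, a plain depth counter stands in for cpu_hotplug_lock, and the cleanup_inside parameter exists solely to compare the old and new behaviour of the error path.

```c
#include <assert.h>

static int demo_lock_depth;   /* stand-in for cpu_hotplug_lock read depth */
static int demo_aa_detected;  /* set when the lock is taken while already held */

static void demo_cpus_read_lock(void)
{
    if (demo_lock_depth++ > 0)
        demo_aa_detected = 1;     /* what lockdep would flag as AA */
}

static void demo_cpus_read_unlock(void)
{
    assert(demo_lock_depth > 0);
    demo_lock_depth--;
}

/* Cleanup helper that takes the lock itself. */
static void demo_stop_per_cpu_kthreads(void)
{
    demo_cpus_read_lock();
    /* ... stop every per-CPU thread ... */
    demo_cpus_read_unlock();
}

/*
 * cleanup_inside != 0 models the old code, where the error path called
 * the cleanup helper; cleanup_inside == 0 models the fixed code, where
 * the error is only returned to the caller.
 */
static int demo_start_kthread(int fail, int cleanup_inside)
{
    if (fail) {
        if (cleanup_inside)
            demo_stop_per_cpu_kthreads();
        return -1;
    }
    return 0;
}

/* Hotplug work function: holds the lock around the start attempt. */
static int demo_hotplug_workfn(int fail, int cleanup_inside)
{
    int ret;

    demo_cpus_read_lock();
    ret = demo_start_kthread(fail, cleanup_inside);
    demo_cpus_read_unlock();
    return ret;
}
```

Calling demo_hotplug_workfn(1, 1) sets demo_aa_detected, while demo_hotplug_workfn(1, 0) reports the same failure without taking the lock a second time.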
2025-03-23tracing: Align synth event print fmtDouglas Raillard
The vast majority of ftrace event print fmts consist of space-separated field=value pairs. Synthetic events currently use comma-separated field=value pairs, which stick out from events created via more classical means. Align the format of synth events so they look just like any other event, for better consistency and less headache when doing crude text-based data processing. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250319215028.1680278-1-douglas.raillard@arm.com Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-22bpf: Fix out-of-bounds read in check_atomic_load/store()Kohei Enju
syzbot reported the following splat [0]. In check_atomic_load/store(), register validity is not checked before atomic_ptr_type_ok(). This causes the out-of-bounds read in is_ctx_reg() called from atomic_ptr_type_ok() when the register number is MAX_BPF_REG or greater. Call check_load_mem()/check_store_reg() before atomic_ptr_type_ok() to avoid the OOB read. However, some tests introduced by commit ff3afe5da998 ("selftests/bpf: Add selftests for load-acquire and store-release instructions") assume calling atomic_ptr_type_ok() before checking register validity. Therefore the swapping of order unintentionally changes verifier messages of these tests. For example in the test load_acquire_from_pkt_pointer(), expected message is 'BPF_ATOMIC loads from R2 pkt is not allowed' although actual messages are different. validate_msgs:FAIL:754 expect_msg VERIFIER LOG: ============= Global function load_acquire_from_pkt_pointer() doesn't return scalar. Only those are supported. 0: R1=ctx() R10=fp0 ; asm volatile ( @ verifier_load_acquire.c:140 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx() R2_w=pkt(r=0) 1: (d3) r0 = load_acquire((u8 *)(r2 +0)) invalid access to packet, off=0 size=1, R2(id=0,off=0,r=0) R2 offset is outside of the packet processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 ============= EXPECTED SUBSTR: 'BPF_ATOMIC loads from R2 pkt is not allowed' #505/19 verifier_load_acquire/load-acquire from pkt pointer:FAIL This is because instructions in the test don't pass check_load_mem() and therefore don't enter the atomic_ptr_type_ok() path. In this case, we have to modify instructions so that they pass the check_load_mem() and trigger atomic_ptr_type_ok(). Similarly for store-release tests, we need to modify instructions so that they pass check_store_reg(). 
Like load_acquire_from_pkt_pointer(), modify instructions in: load_acquire_from_sock_pointer() store_release_to_ctx_pointer() store_release_to_pkt_pointer() Also in store_release_to_sock_pointer(), check_store_reg() returns error early and atomic_ptr_type_ok() is not triggered, since write to sock pointer is not possible in general. We might be able to remove the test, but for now let's leave it and just change the expected message. [0] BUG: KASAN: slab-out-of-bounds in is_ctx_reg kernel/bpf/verifier.c:6185 [inline] BUG: KASAN: slab-out-of-bounds in atomic_ptr_type_ok+0x3d7/0x550 kernel/bpf/verifier.c:6223 Read of size 4 at addr ffff888141b0d690 by task syz-executor143/5842 CPU: 1 UID: 0 PID: 5842 Comm: syz-executor143 Not tainted 6.14.0-rc3-syzkaller-gf28214603dc6 #0 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0x16e/0x5b0 mm/kasan/report.c:521 kasan_report+0x143/0x180 mm/kasan/report.c:634 is_ctx_reg kernel/bpf/verifier.c:6185 [inline] atomic_ptr_type_ok+0x3d7/0x550 kernel/bpf/verifier.c:6223 check_atomic_store kernel/bpf/verifier.c:7804 [inline] check_atomic kernel/bpf/verifier.c:7841 [inline] do_check+0x89dd/0xedd0 kernel/bpf/verifier.c:19334 do_check_common+0x1678/0x2080 kernel/bpf/verifier.c:22600 do_check_main kernel/bpf/verifier.c:22691 [inline] bpf_check+0x165c8/0x1cca0 kernel/bpf/verifier.c:23821 Reported-by: syzbot+a5964227adc0f904549c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a5964227adc0f904549c Tested-by: syzbot+a5964227adc0f904549c@syzkaller.appspotmail.com Fixes: e24bbad29a8d ("bpf: Introduce load-acquire and store-release instructions") Fixes: ff3afe5da998 ("selftests/bpf: Add selftests for load-acquire and store-release instructions") Signed-off-by: Kohei Enju <enjuk@amazon.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: 
https://lore.kernel.org/r/20250322045340.18010-5-enjuk@amazon.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-22tracing: Fix use-after-free in print_graph_function_flags during tracer switchingTengda Wu
Kairui reported a UAF issue in print_graph_function_flags() during ftrace stress testing [1]. This issue can be reproduced by putting an 'mdelay(10)' after 'mutex_unlock(&trace_types_lock)' in s_start(), and executing the following script: $ echo function_graph > current_tracer $ cat trace > /dev/null & $ sleep 5 # Ensure the 'cat' reaches the 'mdelay(10)' point $ echo timerlat > current_tracer The root cause lies in the two calls to print_graph_function_flags within print_trace_line during each s_show(): * One through 'iter->trace->print_line()'; * Another through 'event->funcs->trace()', which is hidden in print_trace_fmt() before print_trace_line returns. Tracer switching only updates the former, while the latter continues to use the print_line function of the old tracer, which in the script above is print_graph_function_flags. Moreover, when switching from the 'function_graph' tracer to the 'timerlat' tracer, s_start only calls graph_trace_close of the 'function_graph' tracer to free 'iter->private', but does not set it to NULL. This provides an opportunity for 'event->funcs->trace()' to use an invalid 'iter->private'. To fix this issue, set 'iter->private' to NULL immediately after freeing it in graph_trace_close(), ensuring that an invalid pointer is not passed to other tracers. Additionally, clean up the unnecessary 'iter->private = NULL' during each 'cat trace' when using wakeup and irqsoff tracers.
[1] https://lore.kernel.org/all/20231112150030.84609-1-ryncsn@gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Link: https://lore.kernel.org/20250320122137.23635-1-wutengda@huaweicloud.com Fixes: eecb91b9f98d ("tracing: Fix memleak due to race between current_tracer and trace") Closes: https://lore.kernel.org/all/CAMgjq7BW79KDSCyp+tZHjShSzHsScSiJxn5ffskp-QzVM06fxw@mail.gmail.com/ Reported-by: Kairui Song <kasong@tencent.com> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
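The core of the fix, clearing the pointer right after freeing it so that a later reader sees NULL rather than a dangling address, can be sketched in user space. The demo_* names and the 64-byte allocation are hypothetical; only the free-then-NULL pattern is taken from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the iterator carrying tracer-private data. */
struct demo_iter {
    void *private;
};

static int demo_graph_trace_open(struct demo_iter *iter)
{
    iter->private = calloc(1, 64);
    return iter->private ? 0 : -1;
}

/* Free the private data and clear the pointer in the same place. */
static void demo_graph_trace_close(struct demo_iter *iter)
{
    free(iter->private);
    iter->private = NULL;
}

/*
 * A print path that may still run after the tracer was switched:
 * it can only detect "no data" if the pointer was cleared on close.
 * Returns 1 when private data was used, 0 when it fell back.
 */
static int demo_print_line(const struct demo_iter *iter)
{
    assert(iter != NULL);
    if (!iter->private)
        return 0;
    return 1;
}
```

Because free(NULL) is a no-op, clearing the pointer also makes a repeated close harmless in this sketch.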
2025-03-22tracing: Disable branch profiling in noinstr codeJosh Poimboeuf
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update() for each use of likely() or unlikely(). That breaks noinstr rules if the affected function is annotated as noinstr. Disable branch profiling for files with noinstr functions. In addition to some individual files, this also includes the entire arch/x86 subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle directories, all of which are noinstr-heavy. Due to the nature of how sched binaries are built by combining multiple .c files into one, branch profiling is disabled more broadly across the sched code than would otherwise be needed. This fixes many warnings like the following: vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section ... Reported-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-21relay: use kasprintf() instead of fixed buffer formattingAndy Shevchenko
Improve readability and maintainability by replacing hard-coded string allocation and formatting with the kasprintf() helper. It also eliminates the GCC compiler warning (which becomes an error with CONFIG_WERROR=y, the default): kernel/relay.c:357:42: error: `snprintf' output may be truncated before the last format character [-Werror=format-truncation=] Link: https://lkml.kernel.org/r/20250317212948.1811176-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
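A user-space analogue of the idea, sizing the buffer from the formatted length instead of guessing a fixed size, might look like the sketch below. It is not kasprintf() itself: demo_asprintf() and demo_relay_name() are hypothetical helpers built on the standard vsnprintf(), and the "%s%u" name format is an assumption for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Format into a freshly allocated buffer of exactly the right size.
 * Returns NULL on formatting or allocation failure; the caller frees.
 */
static char *demo_asprintf(const char *fmt, ...)
{
    va_list ap, ap2;
    char *buf = NULL;
    int len;

    assert(fmt != NULL);
    va_start(ap, fmt);
    va_copy(ap2, ap);
    len = vsnprintf(NULL, 0, fmt, ap);      /* first pass: measure only */
    va_end(ap);

    if (len >= 0) {
        buf = malloc((size_t)len + 1);
        if (buf)
            vsnprintf(buf, (size_t)len + 1, fmt, ap2);  /* second pass: write */
    }
    va_end(ap2);
    return buf;
}

/* Hypothetical per-CPU file name builder: base name followed by CPU number. */
static char *demo_relay_name(const char *base, unsigned int cpu)
{
    return demo_asprintf("%s%u", base, cpu);
}
```

Since the length is computed from the actual arguments, there is no fixed buffer for the compiler to warn about and no silent truncation for long base names.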
2025-03-21resource: replace open coded variant of DEFINE_RES()Andy Shevchenko
Replace open coded variant of DEFINE_RES(). No functional changes intended. Link: https://lkml.kernel.org/r/20250317181412.1560630-5-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21resource: replace open coded variants of DEFINE_RES_*_NAMED()Andy Shevchenko
Replace open coded variants of DEFINE_RES_*_NAMED(). Link: https://lkml.kernel.org/r/20250317181412.1560630-4-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21resource: replace open coded variant of DEFINE_RES_NAMED_DESC()Andy Shevchenko
Replace open coded variant of DEFINE_RES_NAMED_DESC(). Link: https://lkml.kernel.org/r/20250317181412.1560630-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21hung_task: show the blocker task if the task is hung on mutexMasami Hiramatsu (Google)
Patch series "hung_task: Dump the blocking task stacktrace", v4. The hung_task detector is very useful for detecting lockups. However, since it only dumps the blocked (uninterruptible sleep) processes, it is not enough to identify the root cause of the lockup. For example, if a process holds a mutex and sleeps for a long time in interruptible state while waiting for an event, the other processes will wait on the mutex in uninterruptible state. In this case, the waiter processes are dumped, but the blocker process is not shown because it is sleeping in interruptible state. This adds a feature to dump the blocker task which holds a mutex when detecting a hung task. e.g. INFO: task cat:115 blocked for more than 122 seconds. Not tainted 6.14.0-rc3-00003-ga8946be3de00 #156 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:cat state:D stack:13432 pid:115 tgid:115 ppid:106 task_flags:0x400100 flags:0x00000002 Call Trace: <TASK> __schedule+0x731/0x960 ? schedule_preempt_disabled+0x54/0xa0 schedule+0xb7/0x140 ? __mutex_lock+0x51b/0xa60 ? __mutex_lock+0x51b/0xa60 schedule_preempt_disabled+0x54/0xa0 __mutex_lock+0x51b/0xa60 read_dummy+0x23/0x70 full_proxy_read+0x6a/0xc0 vfs_read+0xc2/0x340 ? __pfx_direct_file_splice_eof+0x10/0x10 ? do_sendfile+0x1bd/0x2e0 ksys_read+0x76/0xe0 do_syscall_64+0xe3/0x1c0 ? exc_page_fault+0xa9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x4840cd RSP: 002b:00007ffe99071828 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd RDX: 0000000000001000 RSI: 00007ffe99071870 RDI: 0000000000000003 RBP: 00007ffe99071870 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000 R13: 00000000132fd3a0 R14: 0000000000000001 R15: ffffffffffffffff </TASK> INFO: task cat:115 is blocked on a mutex likely owned by task cat:114.
task:cat state:S stack:13432 pid:114 tgid:114 ppid:106 task_flags:0x400100 flags:0x00000002 Call Trace: <TASK> __schedule+0x731/0x960 ? schedule_timeout+0xa8/0x120 schedule+0xb7/0x140 schedule_timeout+0xa8/0x120 ? __pfx_process_timeout+0x10/0x10 msleep_interruptible+0x3e/0x60 read_dummy+0x2d/0x70 full_proxy_read+0x6a/0xc0 vfs_read+0xc2/0x340 ? __pfx_direct_file_splice_eof+0x10/0x10 ? do_sendfile+0x1bd/0x2e0 ksys_read+0x76/0xe0 do_syscall_64+0xe3/0x1c0 ? exc_page_fault+0xa9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x4840cd RSP: 002b:00007ffe3e0147b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd RDX: 0000000000001000 RSI: 00007ffe3e014800 RDI: 0000000000000003 RBP: 00007ffe3e014800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000 R13: 000000001a0a93a0 R14: 0000000000000001 R15: ffffffffffffffff </TASK> TBD: We can extend this feature to cover other locks like rwsem and rt_mutex, but rwsem requires to dump all the tasks which acquire and wait that rwsem. We can follow the waiter link but the output will be a bit different compared with mutex case. This patch (of 2): The "hung_task" shows a long-time uninterruptible slept task, but most often, it's blocked on a mutex acquired by another task. Without dumping such a task, investigating the root cause of the hung task problem is very difficult. This introduce task_struct::blocker_mutex to point the mutex lock which this task is waiting for. Since the mutex has "owner" information, we can find the owner task and dump it with hung tasks. Note: the owner can be changed while dumping the owner task, so this is "likely" the owner of the mutex. With this change, the hung task shows blocker task's info like below; INFO: task cat:115 blocked for more than 122 seconds. Not tainted 6.14.0-rc3-00003-ga8946be3de00 #156 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:cat state:D stack:13432 pid:115 tgid:115 ppid:106 task_flags:0x400100 flags:0x00000002 Call Trace: <TASK> __schedule+0x731/0x960 ? schedule_preempt_disabled+0x54/0xa0 schedule+0xb7/0x140 ? __mutex_lock+0x51b/0xa60 ? __mutex_lock+0x51b/0xa60 schedule_preempt_disabled+0x54/0xa0 __mutex_lock+0x51b/0xa60 read_dummy+0x23/0x70 full_proxy_read+0x6a/0xc0 vfs_read+0xc2/0x340 ? __pfx_direct_file_splice_eof+0x10/0x10 ? do_sendfile+0x1bd/0x2e0 ksys_read+0x76/0xe0 do_syscall_64+0xe3/0x1c0 ? exc_page_fault+0xa9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x4840cd RSP: 002b:00007ffe99071828 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd RDX: 0000000000001000 RSI: 00007ffe99071870 RDI: 0000000000000003 RBP: 00007ffe99071870 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000 R13: 00000000132fd3a0 R14: 0000000000000001 R15: ffffffffffffffff </TASK> INFO: task cat:115 is blocked on a mutex likely owned by task cat:114. task:cat state:S stack:13432 pid:114 tgid:114 ppid:106 task_flags:0x400100 flags:0x00000002 Call Trace: <TASK> __schedule+0x731/0x960 ? schedule_timeout+0xa8/0x120 schedule+0xb7/0x140 schedule_timeout+0xa8/0x120 ? __pfx_process_timeout+0x10/0x10 msleep_interruptible+0x3e/0x60 read_dummy+0x2d/0x70 full_proxy_read+0x6a/0xc0 vfs_read+0xc2/0x340 ? __pfx_direct_file_splice_eof+0x10/0x10 ? do_sendfile+0x1bd/0x2e0 ksys_read+0x76/0xe0 do_syscall_64+0xe3/0x1c0 ? 
exc_page_fault+0xa9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x4840cd RSP: 002b:00007ffe3e0147b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd RDX: 0000000000001000 RSI: 00007ffe3e014800 RDI: 0000000000000003 RBP: 00007ffe3e014800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000 R13: 000000001a0a93a0 R14: 0000000000000001 R15: ffffffffffffffff </TASK> [akpm@linux-foundation.org: implement debug_show_blocker() in C rather than in CPP] Link: https://lkml.kernel.org/r/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com Link: https://lkml.kernel.org/r/174046695384.2194069.16796289525958195643.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21fork: use __vmalloc_node() for stack allocationUladzislau Rezki (Sony)
Replace __vmalloc_node_range() with __vmalloc_node(). The latter requires fewer parameters and passes exactly the same arguments, some of which are now hidden inside __vmalloc_node(). This does not change any functionality. It makes the code a bit simpler. Link: https://lkml.kernel.org/r/20250317163614.166502-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21tracepoint: Print the function symbol when tracepoint_debug is setHuang Shijie
When tracepoint_debug is set, the kernel log may contain output like: [ 380.013843] Probe 0 : 00000000f0d68cda This is not readable, so print the function symbol instead. After this patch, the output may become: [ 55.225555] Probe 0 : perf_trace_sched_wakeup_template+0x0/0x20 Link: https://lore.kernel.org/20250307033858.4134-1-shijie@os.amperecomputing.com Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-21timekeeping: Fix possible inconsistencies in _COARSE clockidsJohn Stultz
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time inconsistencies. Lei tracked down that this was being caused by the adjustment tk->tkr_mono.xtime_nsec -= offset; which is made to compensate for the unaccumulated cycles in offset when the multiplicator is adjusted forward, so that the non-_COARSE clockids don't see inconsistencies. However, the _COARSE clockid getter functions use the adjusted xtime_nsec value directly and do not compensate the negative offset via the clocksource delta multiplied with the new multiplicator. In that case the caller can observe time going backwards in consecutive calls. By design, this negative adjustment should be fine, because the logic run from timekeeping_adjust() is done after it accumulated approximately multiplicator * interval_cycles into xtime_nsec. The accumulated value is always larger than the mult_adj * offset value, which is subtracted from xtime_nsec. Both operations are done together under the tk_core.lock, so the net change to xtime_nsec is always positive. However, do_adjtimex() calls into timekeeping_advance() as well, to apply the NTP frequency adjustment immediately. In this case, timekeeping_advance() does not return early when the offset is smaller than interval_cycles. In that case there is no time accumulated into xtime_nsec. But the subsequent call into timekeeping_adjust(), which modifies the multiplicator, subtracts from xtime_nsec to correct for the new multiplicator. Here because there was no accumulation, xtime_nsec becomes smaller than before, which opens a window up to the next accumulation, where the _COARSE clockid getters, which don't compensate for the offset, can observe the inconsistency. To fix this, rework the timekeeping_advance() logic so that when invoked from do_adjtimex(), the time is immediately forwarded to accumulate also the sub-interval portion into xtime.
That means the remaining offset becomes zero and the subsequent multiplier adjustment therefore does not modify xtime_nsec. There is another related inconsistency. If xtime is forwarded due to the instantaneous multiplier adjustment, the NTP error, which was accumulated with the previous setting, becomes meaningless. Therefore clear the NTP error as well, after forwarding the clock for the instantaneous multiplier update. Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE") Reported-by: Lei Chen <lei.chen@smartx.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250320200306.1712599-1-jstultz@google.com Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
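The arithmetic behind the inconsistency can be worked through with a small user-space model. This is a sketch under simplifying assumptions, not the kernel's timekeeping code: struct demo_tk and the demo_* functions are hypothetical, the multiplier adjustment is taken as a positive integer, and the numbers in the usage note are chosen only to make the effect visible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the timekeeper state. */
struct demo_tk {
    uint64_t xtime_nsec;  /* accumulated time, in shifted nanoseconds */
    uint32_t mult;        /* cycles -> shifted nanoseconds multiplier */
    uint32_t shift;
    uint64_t offset;      /* cycles not yet accumulated into xtime_nsec */
};

/* Non-COARSE read: accumulated time plus the pending cycle delta. */
static uint64_t demo_fine_ns(const struct demo_tk *tk)
{
    return (tk->xtime_nsec + tk->offset * tk->mult) >> tk->shift;
}

/* COARSE read: uses xtime_nsec directly, ignoring the pending offset. */
static uint64_t demo_coarse_ns(const struct demo_tk *tk)
{
    return tk->xtime_nsec >> tk->shift;
}

/*
 * Raise the multiplier and compensate xtime_nsec so that the fine
 * read stays continuous across the change.
 */
static void demo_adjust_mult(struct demo_tk *tk, uint32_t mult_adj)
{
    assert(tk->xtime_nsec >= tk->offset * mult_adj);
    tk->mult += mult_adj;
    tk->xtime_nsec -= tk->offset * mult_adj;
}

/* The fix in this model: fold the pending offset in before adjusting. */
static void demo_forward(struct demo_tk *tk)
{
    tk->xtime_nsec += tk->offset * tk->mult;
    tk->offset = 0;
}
```

With xtime_nsec = 256000, mult = 256, shift = 8 and offset = 100, the coarse read is 1000 ns. Adjusting the multiplier by 16 without forwarding lowers xtime_nsec to 254400, so the coarse read drops to 993 ns while the fine read stays at 1100 ns. Forwarding first makes the offset zero, the adjustment subtracts nothing, and the coarse read moves to 1100 ns.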
2025-03-21Merge tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull scheduler fix from Ingo Molnar: "Revert a scheduler performance optimization that regressed other workloads" * tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
2025-03-21PM: hibernate: Use crypto_acomp interfaceHerbert Xu
Replace the legacy crypto compression interface with the new acomp interface. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-21PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flagRoger Pau Monne
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both MSI and MSI-X entries globally, regardless of the IRQ chip they are using. Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over event channels, to prevent PCI code from attempting to toggle the maskbit, as it's Xen that controls the bit. However, the pci_msi_ignore_mask being global will affect devices that use MSI interrupts but are not routing those interrupts over event channels (not using the Xen pIRQ chip). One example is devices behind a VMD PCI bridge. In that scenario the VMD bridge configures MSI(-X) using the normal IRQ chip (the pIRQ one in the Xen case), and devices behind the bridge configure the MSI entries using indexes into the VMD bridge MSI table. The VMD bridge then demultiplexes such interrupts and delivers them to the destination device(s). Having pci_msi_ignore_mask set in that scenario prevents (un)masking of MSI entries for devices behind the VMD bridge. Move the signaling of no entry masking into the MSI domain flags, as that allows setting it on a per-domain basis. Set it for the Xen MSI domain that uses the pIRQ chip, while leaving it unset for the rest of the cases. Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and with Xen dropping usage the variable is unneeded. This fixes using devices behind a VMD bridge on Xen PV hardware domains. Although devices behind a VMD bridge are not known to Xen, that doesn't mean Linux cannot use them. With the usage of VMD_FEAT_CAN_BYPASS_MSI_REMAP inhibited and the pci_msi_ignore_mask bodge removed, devices behind a VMD bridge work fine when used from a Linux Xen hardware domain. That's the whole point of the series.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Message-ID: <20250219092059.90850-4-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-20Merge tag 'dma-mapping-6.14-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linuxLinus Torvalds
Pull dma-mapping fix from Marek Szyprowski: - fix missing clear bdr in check_ram_in_range_map() (Baochen Qiang) * tag 'dma-mapping-6.14-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: fix missing clear bdr in check_ram_in_range_map()
2025-03-20bpf: Add struct_ops context information to struct bpf_prog_auxJuntong Deng
This patch adds struct_ops context information to struct bpf_prog_aux. This context information will be used in the kfunc filter. Currently the added context information includes struct_ops member offset and a pointer to struct bpf_struct_ops. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://patch.msgid.link/20250319215358.2287371-2-ameryhung@gmail.com
2025-03-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.14-rc8). Conflict: tools/testing/selftests/net/Makefile 03544faad761 ("selftest: net: add proc_net_pktgen") 3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops") tools/testing/selftests/net/config: 85cb3711acb8 ("selftests: net: Add test cases for link and peer netns") 3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops") Adjacent commits: tools/testing/selftests/net/Makefile c935af429ec2 ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY") 355d940f4d5a ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20cgroup: rstat: Cleanup flushing functions and lockingYosry Ahmed
Now that the rstat lock is being re-acquired on every CPU iteration in cgroup_rstat_flush_locked(), having the caller initially acquire the lock is unnecessary and unclear. Inline cgroup_rstat_flush_locked() into cgroup_rstat_flush() and move the lock/unlock calls to the beginning and end of the loop body to make the critical section obvious. cgroup_rstat_flush_hold/release() do not make much sense with the lock being dropped and reacquired internally. Since they have no external callers, remove them and explicitly acquire the lock in cgroup_base_stat_cputime_show() instead. This leaves the code with a single flushing function, cgroup_rstat_flush(). Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-20printk/panic: Add option to allow non-panic CPUs to write to the ring buffer.Donghyeok Choe
Commit 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing to ringbuffer") aimed to isolate panic-related messages. However, when panic() itself malfunctions, messages from non-panic CPUs become crucial for debugging. While commit bcc954c6caba ("printk/panic: Allow cpu backtraces to be written into ringbuffer during panic") enables non-panic CPU backtraces, it may not provide sufficient diagnostic information. Introduce the "debug_non_panic_cpus" command-line option, enabling non-panic CPU messages to be stored in the ring buffer during a panic. This also prevents discarding non-finalized messages from non-panic CPUs during console flushing, providing a more comprehensive view of system state during critical failures. Link: https://lore.kernel.org/all/Z8cLEkqLL2IOyNIj@pathway/ Signed-off-by: Donghyeok Choe <d7271.choe@samsung.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20250318022320.2428155-1-d7271.choe@samsung.com [pmladek@suse.com: Added documentation, added module_parameter, removed printk_ prefix.] Tested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-03-20pidfs: improve multi-threaded exec and premature thread-group leader exit ↵Christian Brauner
polling This is another attempt trying to make pidfd polling for multi-threaded exec and premature thread-group leader exit consistent. A quick recap of these two cases: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded, an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded, no exit notification is generated for the old thread-group leader. The correct behavior would be to simply not generate an exit notification on the struct pid of a subthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group leader may exit prematurely as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. So far there was no way to distinguish between (1) and (2) internally. This tiny series tries to address this problem by discarding PIDFD_THREAD notifications on premature thread-group leader exit. If that works correctly then no exit notifications are generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped.
If a subthread should exec afterwards, no exit notification will be generated until that task exits or it creates subthreads and repeats the cycle. Co-Developed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20tracing: Constify struct event_trigger_opsChristophe JAILLET
The 'struct event_trigger_ops' instances are not modified. Constifying these structures moves some data to a read-only section, which increases overall security, especially when the structure holds function pointers. On x86_64, with allmodconfig, as an example:

Before:
======
   text    data     bss     dec     hex filename
  31368    9024    6200   46592    b600 kernel/trace/trace_events_trigger.o

After:
=====
   text    data     bss     dec     hex filename
  31752    8608    6200   46560    b5e0 kernel/trace/trace_events_trigger.o

Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/66e8f990e649678e4be37d4d1a19158ca0dea2f4.1741521295.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
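As a minimal sketch of the pattern, the following declares an ops table as const so that the toolchain can place it, function pointer included, in read-only data. struct trigger_ops, double_ops and the helper functions are hypothetical and far simpler than the kernel's struct event_trigger_ops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table: a name plus one callback. */
struct trigger_ops {
	const char *name;
	int (*trigger)(int data);
};

static int double_it(int data)
{
	return data * 2;
}

/*
 * 'const' lets the toolchain place the table in a read-only section, so
 * the function pointer cannot be overwritten through a stray store.
 */
static const struct trigger_ops double_ops = {
	.name    = "double",
	.trigger = double_it,
};

/* Users take a pointer-to-const, so they cannot modify the table either. */
int run_trigger(const struct trigger_ops *ops, int data)
{
	assert(ops != NULL && ops->trigger != NULL);
	return ops->trigger(data);
}

const struct trigger_ops *get_double_ops(void)
{
	return &double_ops;
}
```

An attempt such as `get_double_ops()->trigger = NULL;` is rejected at compile time. The shift of bytes out of the data column in the size output above reflects the same kind of move for the real structures.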
2025-03-20Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', ↵Joerg Roedel
'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next
2025-03-19sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditionalIngo Molnar
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well. Reflect this reality and enable this functionality unconditionally. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-19sched/debug: Make 'const_debug' tunables unconditional __read_mostlyIngo Molnar
With CONFIG_SCHED_DEBUG becoming unconditional, remove the extra 'const_debug' indirection towards __read_mostly. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-3-mingo@kernel.org
2025-03-19sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()Ingo Molnar
The scheduler has this special SCHED_WARN_ON() facility that depends on CONFIG_SCHED_DEBUG. Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN_ON() to WARN_ON_ONCE(). Note that the warning output isn't 100% equivalent, because SCHED_WARN_ON() was defined as: #define SCHED_WARN_ON(x) WARN_ONCE(x, #x) and so would output the 'x' condition as well, while WARN_ON_ONCE() will only show a backtrace. Hopefully these are rare enough to not really matter. If it does matter, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
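The difference in output can be reproduced with a userspace imitation of the two macros. Everything prefixed USER_ below is a hypothetical stand-in written for this sketch. Unlike the kernel macros, these do not evaluate to the condition and print no backtrace; they only show the once-per-call-site behavior and whether the stringified condition is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

int warn_count;			/* number of warnings emitted so far */
const char *last_cond_text;	/* condition text of the last warning, or NULL */

static void warn_emit(const char *where, const char *cond_text)
{
	assert(where != NULL);
	warn_count++;
	last_cond_text = cond_text;
	fprintf(stderr, "WARNING in %s%s%s\n", where,
		cond_text ? ": " : "", cond_text ? cond_text : "");
}

/* Warn at most once per call site; each expansion has its own flag. */
#define USER_WARN_ONCE(cond, text)				\
	do {							\
		static bool warned;				\
		if ((cond) && !warned) {			\
			warned = true;				\
			warn_emit(__func__, (text));		\
		}						\
	} while (0)

/* Old style: the stringified condition is part of the message. */
#define USER_SCHED_WARN_ON(x)	USER_WARN_ONCE(x, #x)
/* New style: no condition text, only the location. */
#define USER_WARN_ON_ONCE(x)	USER_WARN_ONCE(x, NULL)

void check_old_style(int nr_running)
{
	USER_SCHED_WARN_ON(nr_running < 0);
}

void check_new_style(int nr_running)
{
	USER_WARN_ON_ONCE(nr_running < 0);
}
```

Calling check_old_style(-1) reports the text "nr_running < 0", while check_new_style(-1) reports only where the warning fired. That missing text is the non-equivalence the message points out.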