summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-03-15bpf/helpers: Introduce bpf_dynptr_copy kfuncMykyta Yatsenko
Introducing bpf_dynptr_copy kfunc allowing copying data from one dynptr to another. This functionality is useful in scenarios such as capturing XDP data to a ring buffer. The implementation consists of 4 branches: * A fast branch for contiguous buffer capacity in both source and destination dynptrs * 3 branches utilizing __bpf_dynptr_read and __bpf_dynptr_write to copy data to/from non-contiguous buffer Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf/helpers: Refactor bpf_dynptr_read and bpf_dynptr_writeMykyta Yatsenko
Refactor bpf_dynptr_read and bpf_dynptr_write helpers: extract code into the static functions namely __bpf_dynptr_read and __bpf_dynptr_write, this allows calling these without compiler warnings. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15Revert "kheaders: Ignore silly-rename files"Masahiro Yamada
This reverts commit 973b710b8821c3401ad7a25360c89e94b26884ac. As I mentioned in the review [1], I do not believe this was the correct fix. Commit 41a00051283e ("kheaders: prevent `find` from seeing perl temp files") addressed the root cause of the issue. I asked David to test it but received no response. Commit 973b710b8821 ("kheaders: Ignore silly-rename files") merely worked around the issue by excluding such files, rather than preventing their creation. I have reverted the latter commit, hoping the issue has already been resolved by the former. If the silly-rename files come back, I will restore this change (or preferably, investigate the root cause). [1]: https://lore.kernel.org/lkml/CAK7LNAQndCMudAtVRAbfSfnV+XhSMDcnP-s1_GAQh8UiEdLBSg@mail.gmail.com/ Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-03-15Revert "sched/core: Reduce cost of sched_move_task when config autogroup"Dietmar Eggemann
This reverts commit eff6c8ce8d4d7faef75f66614dd20bb50595d261. Hazem reported a 30% drop in UnixBench spawn test with commit eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM (aarch64) (single level MC sched domain): https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com There is an early bail from sched_move_task() if p->sched_task_group is equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope' (Ubuntu '22.04.5 LTS'). So in: do_exit() sched_autogroup_exit_task() sched_move_task() if sched_get_task_group(p) == p->sched_task_group return /* p is enqueued */ dequeue_task() \ sched_change_group() | task_change_group_fair() | detach_task_cfs_rq() | (1) set_task_rq() | attach_task_cfs_rq() | enqueue_task() / (1) isn't called for p anymore. Turns out that the regression is related to sgs->group_util in group_is_overloaded() and group_has_capacity(). If (1) isn't called for all the 'spawn' tasks then sgs->group_util is ~900 and sgs->group_capacity = 1024 (single CPU sched domain) and this leads to group_is_overloaded() returning true (2) and group_has_capacity() false (3) much more often compared to the case when (1) is called. I.e. there are much more cases of 'group_is_overloaded' and 'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which then returns much more often a CPU != smp_processor_id() (5). This isn't good for these extremely short running tasks (FORK + EXIT) and also involves calling sched_balance_find_dst_group_cpu() unnecessary (single CPU sched domain). Instead if (1) is called for 'p->flags & PF_EXITING' then the path (4),(6) is taken much more often. select_task_rq_fair(..., wake_flags = WF_FORK) cpu = smp_processor_id() new_cpu = sched_balance_find_dst_cpu(..., cpu, ...) group = sched_balance_find_dst_group(..., cpu) do { update_sg_wakeup_stats() sgs->group_type = group_classify() if group_is_overloaded() (2) return group_overloaded if !group_has_capacity() (3) return group_fully_busy return group_has_spare (4) } while group if local_sgs.group_type > idlest_sgs.group_type return idlest (5) case group_has_spare: if local_sgs.idle_cpus >= idlest_sgs.idle_cpus return NULL (6) Unixbench Tests './Run -c 4 spawn' on: (a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4') and Ubuntu 22.04.5 LTS (aarch64). Shell & test run in '/user.slice/user-1000.slice/session-1.scope'. w/o patch w/ patch 21005 27120 (b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and Ubuntu 22.04.5 LTS (x86_64). Shell & test run in '/A'. w/o patch w/ patch 67675 88806 CONFIG_SCHED_AUTOGROUP=y & /sys/proc/kernel/sched_autogroup_enabled equal 0 or 1. Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Hagar Hemdan <hagarhem@amazon.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250314151345.275739-1-dietmar.eggemann@arm.com
2025-03-15sched/uclamp: Optimize sched_uclamp_used static key enablingXuewen Yan
Repeat calls of static_branch_enable() to an already enabled static key introduce overhead, because it calls cpus_read_lock(). Users may frequently set the uclamp value of tasks, triggering the repeat enabling of the sched_uclamp_used static key. Optimize this and avoid repeat calls to static_branch_enable() by checking whether it's enabled already. [ mingo: Rewrote the changelog for legibility ] Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250219093747.2612-2-xuewen.yan@unisoc.com
2025-03-15sched/uclamp: Use the uclamp_is_used() helper instead of open-coding itXuewen Yan
Don't open-code static_branch_unlikely(&sched_uclamp_used), we have the uclamp_is_used() wrapper around it. [ mingo: Clean up the changelog ] Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250219093747.2612-1-xuewen.yan@unisoc.com
2025-03-15tracing: tprobe-events: Fix leakage of module refcountMasami Hiramatsu (Google)
When enabling the tracepoint at loading module, the target module refcount is incremented by find_tracepoint_in_module(). But it is unnecessary because the module is not unloaded while processing module loading callbacks. Moreover, the refcount is not decremented in that function. To be clear the module refcount handling, move the try_module_get() callsite to trace_fprobe_create_internal(), where it is actually required. Link: https://lore.kernel.org/all/174182761071.83274.18334217580449925882.stgit@devnote2/ Fixes: 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org
2025-03-15tracing: tprobe-events: Fix to clean up tprobe correctly when module unloadMasami Hiramatsu (Google)
When unloading module, the tprobe events are not correctly cleaned up. Thus it becomes `fprobe-event` and never be enabled again even if loading the same module again. For example; # cd /sys/kernel/tracing # modprobe trace_events_sample # echo 't:my_tprobe foo_bar' >> dynamic_events # cat dynamic_events t:tracepoints/my_tprobe foo_bar # rmmod trace_events_sample # cat dynamic_events f:tracepoints/my_tprobe foo_bar As you can see, the second time my_tprobe starts with 'f' instead of 't'. This unregisters the fprobe and tracepoint callback when module is unloaded but marks the fprobe-event is tprobe-event. Link: https://lore.kernel.org/all/174158724946.189309.15826571379395619524.stgit@mhiramat.tok.corp.google.com/ Fixes: 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-03-14Merge tag 'sched-urgent-2025-03-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a sleeping-while-atomic bug caused by a recent optimization utilizing static keys that didn't consider that the static_key_disable() call could be triggered in atomic context. Revert the optimization" * tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Don't define sched_clock_irqtime as static key
2025-03-14Merge tag 'locking-urgent-2025-03-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc locking fixes from Ingo Molnar: - Restrict the Rust runtime from unintended access to dynamically allocated LockClassKeys - KernelDoc annotation fix - Fix a lock ordering bug in semaphore::up(), related to trying to printk() and wake up the console within critical sections * tag 'locking-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/semaphore: Use wake_q to wake up processes outside lock critical section locking/rtmutex: Use the 'struct' keyword in kernel-doc comment rust: lockdep: Remove support for dynamically allocated LockClassKeys
2025-03-14sched_ext: idle: Refactor scx_select_cpu_dfl()Andrea Righi
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found. No functional changes, this is purely a refactoring. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14sched_ext: idle: Honor idle flags in the built-in idle selection policyAndrea Righi
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core. This functionality will be exposed through a dedicated kfunc in a separate patch. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14tracing: Correct the refcount if the hist/hist_debug file fails to openTengda Wu
The function event_{hist,hist_debug}_open() maintains the refcount of 'file->tr' and 'file' through tracing_open_file_tr(). However, it does not roll back these counts on subsequent failure paths, resulting in a refcount leak. A very obvious case is that if the hist/hist_debug file belongs to a specific instance, the refcount leak will prevent the deletion of that instance, as it relies on the condition 'tr->ref == 1' within __remove_instance(). Fix this by calling tracing_release_file_tr() on all failure paths in event_{hist,hist_debug}_open() to correct the refcount. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Link: https://lore.kernel.org/20250314065335.1202817-1-wutengda@huaweicloud.com Fixes: 1cc111b9cddc ("tracing: Fix uaf issue when open the hist or hist_debug file") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: tools/testing/selftests/drivers/net/ping.py 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") de94e8697405 ("selftests: drv-net: store addresses in dict indexed by ipver") https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/ net/core/devmem.c a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations") https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/ Adjacent changes: tools/testing/selftests/net/Makefile 6f50175ccad4 ("selftests: Add IPv6 link-local address generation tests for GRE devices.") 2e5584e0f913 ("selftests/net: expand cmsg_ipv6.sh with ipv4") drivers/net/ethernet/broadcom/bnxt/bnxt.c 661958552eda ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13genirq/msi: Rename msi_[un]lock_descs()Thomas Gleixner
Now that all abuse is gone and the legit users are converted to guard(msi_descs_lock), rename the lock functions and document them as internal. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com> Link: https://lore.kernel.org/all/20250313130322.027190131@linutronix.de
2025-03-13genirq/msi: Use lock guards for MSI descriptor lockingThomas Gleixner
Provide a lock guard for MSI descriptor locking and update the core code accordingly. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/all/20250313130321.506045185@linutronix.de
2025-03-13sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()Andrea Righi
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give tasks that are queued to the local DSQ a chance to migrate to other CPUs, when a CPU is taken by a higher scheduling class. However, there is no point re-enqueuing tasks that can only run on that particular CPU, as they would simply be re-added to the same local DSQ without any benefit. Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-13genirq/msi: Make a few functions staticThomas Gleixner
None of these functions are used outside of the MSI core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250309084110.204054172@linutronix.de
2025-03-13posix-timers: Provide a mechanism to allocate a given timer IDThomas Gleixner
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers with the same timer ID on restore. It uses sys_timer_create() and relies on the monotonic increasing timer ID provided by this syscall. It creates and deletes timers until the desired ID is reached. This is can loop for a long time, when the checkpointed process had a very sparse timer ID range. It has been debated to implement a new syscall to allow the creation of timers with a given timer ID, but that's tideous due to the 32/64bit compat issues of sigevent_t and of dubious value. The restore mechanism of CRIU creates the timers in a state where all threads of the restored process are held on a barrier and cannot issue syscalls. That means the restorer task has exclusive control. This allows to address this issue with a prctl() so that the restorer thread can do: if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON)) goto linear_mode; create_timers_with_explicit_ids(); prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF); This is backwards compatible because the prctl() fails on older kernels and CRIU can fall back to the linear timer ID mechanism. CRIU versions which do not know about the prctl() just work as before. Implement the prctl() and modify timer_create() so that it copies the requested timer ID from userspace by utilizing the existing timer_t pointer, which is used to copy out the allocated timer ID on success. If the prctl() is disabled, which it is by default, timer_create() works as before and does not try to read from the userspace pointer. There is no problem when a broken or rogue user space application enables the prctl(). If the user space pointer does not contain a valid ID, then timer_create() fails. If the data is not initialized, but constains a random valid ID, timer_create() will create that random timer ID or fail if the ID is already given out. As CRIU must use the raw syscall to avoid manipulating the internal state of the restored process, this has no library dependencies and can be adopted by CRIU right away. Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with the create/delete method. With the prctl() it takes 3 microseconds. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Tested-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx
2025-03-13posix-timers: Make per process list RCU safeThomas Gleixner
Preparatory change to remove the sighand locking from the /proc/$PID/timers iterator. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155624.403223080@linutronix.de
2025-03-13posix-timers: Avoid false cacheline sharingThomas Gleixner
struct k_itimer has the hlist_node, which is used for lookup in the hash bucket, and the timer lock in the same cache line. That's obviously bad, if one CPU fiddles with a timer and the other is walking the hash bucket on which that timer is queued. Avoid this by restructuring struct k_itimer, so that the read mostly (only modified during setup and teardown) fields are in the first cache line and the lock and the rest of the fields which get written to are in cacheline 2-N. Reduces cacheline contention in a test case of 64 processes creating and accessing 20000 timers each by almost 30% according to perf. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155624.341108067@linutronix.de
2025-03-13posix-timers: Switch to jhash32()Thomas Gleixner
The hash distribution of hash_32() is suboptimal. jhash32() provides a way better distribution, which evens out the length of the hash bucket lists, which in turn avoids large outliers in list walk times. Due to the sparse ID space (thanks CRIU) there is no guarantee that the timers will be fully evenly distributed over the hash buckets, but the behaviour is way better than with hash_32() even for randomly sparse ID spaces. For a pathological test case with 64 processes creating and accessing 20000 timers each, this results in a runtime reduction of ~10% and a significantly reduced runtime variation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250308155624.279080328@linutronix.de
2025-03-13posix-timers: Improve hash table performanceThomas Gleixner
Eric and Ben reported a significant performance bottleneck on the global hash, which is used to store posix timers for lookup. Eric tried to do a lockless validation of a new timer ID before trying to insert the timer, but that does not solve the problem. For the non-contended case this is a pointless exercise and for the contended case this extra lookup just creates enough interleaving that all tasks can make progress. There are actually two real solutions to the problem: 1) Provide a per process (signal struct) xarray storage 2) Implement a smarter hash like the one in the futex code #1 works perfectly fine for most cases, but the fact that CRIU enforced a linear increasing timer ID to restore timers makes this problematic. It's easy enough to create a sparse timer ID space, which amounts very fast to a large junk of memory consumed for the xarray. 2048 timers with a ID offset of 512 consume more than one megabyte of memory for the xarray storage. #2 The main advantage of the futex hash is that it uses per hash bucket locks instead of a global hash lock. Aside of that it is scaled according to the number of CPUs at boot time. Experiments with artifical benchmarks have shown that a scaled hash with per bucket locks comes pretty close to the xarray performance and in some scenarios it performes better. Test 1: A single process creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them: mainline Eric newhash xarray create 23 ms 23 ms 9 ms 8 ms getoverrun 14 ms 14 ms 5 ms 4 ms Test 2: A single process creates 50000 timers and afterwards invokes timer_getoverrun(2) on each of them: mainline Eric newhash xarray create 98 ms 219 ms 20 ms 18 ms getoverrun 62 ms 62 ms 10 ms 9 ms Test 3: A single process creates 100000 timers and afterwards invokes timer_getoverrun(2) on each of them: mainline Eric newhash xarray create 313 ms 750 ms 48 ms 33 ms getoverrun 261 ms 260 ms 20 ms 14 ms Erics changes create quite some overhead in the create() path due to the double list walk, as the main issue according to perf is the list walk itself. With 100k timers each hash bucket contains ~200 timers, which in the worst case need to be all inspected. The same problem applies for getoverrun() where the lookup has to walk through the hash buckets to find the timer it is looking for. The scaled hash obviously reduces hash collisions and lock contention significantly. This becomes more prominent with concurrency. Test 4: A process creates 63 threads and all threads wait on a barrier before each instance creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them. The threads are pinned on seperate CPUs to achive maximum concurrency. The numbers are the average times per thread: mainline Eric newhash xarray create 180239 ms 38599 ms 579 ms 813 ms getoverrun 2645 ms 2642 ms 32 ms 7 ms Test 5: A process forks 63 times and all forks wait on a barrier before each instance creates 20000 timers and afterwards invokes timer_getoverrun(2) on each of them. The processes are pinned on seperate CPUs to achive maximum concurrency. The numbers are the average times per process: mainline eric newhash xarray create 157253 ms 40008 ms 83 ms 60 ms getoverrun 2611 ms 2614 ms 40 ms 4 ms So clearly the reduction of lock contention with Eric's changes makes a significant difference for the create() loop, but it does not mitigate the problem of long list walks, which is clearly visible on the getoverrun() side because that is purely dominated by the lookup itself. Once the timer is found, the syscall just reads from the timer structure with no other locks or code paths involved and returns. The reason for the difference between the thread and the fork case for the new hash and the xarray is that both suffer from contention on sighand::siglock and the xarray suffers additionally from contention on the xarray lock on insertion. The only case where the reworked hash slighly outperforms the xarray is a tight loop which creates and deletes timers. Test 4: A process creates 63 threads and all threads wait on a barrier before each instance runs a loop which creates and deletes a timer 100000 times in a row. The threads are pinned on seperate CPUs to achive maximum concurrency. The numbers are the average times per thread: mainline Eric newhash xarray loop 5917 ms 5897 ms 5473 ms 7846 ms Test 5: A process forks 63 times and all forks wait on a barrier before each each instance runs a loop which creates and deletes a timer 100000 times in a row. The processes are pinned on seperate CPUs to achive maximum concurrency. The numbers are the average times per process: mainline Eric newhash xarray loop 5137 ms 7828 ms 891 ms 872 ms In both test there is not much contention on the hash, but the ucount accounting for the signal and in the thread case the sighand::siglock contention (plus the xarray locking) contribute dominantly to the overhead. As the memory consumption of the xarray in the sparse ID case is significant, the scaled hash with per bucket locks seems to be the better overall option. While the xarray has faster lookup times for a large number of timers, the actual syscall usage, which requires the lookup is not an extreme hotpath. Most applications utilize signal delivery and all syscalls except timer_getoverrun(2) are all but cheap. So implement a scaled hash with per bucket locks, which offers the best tradeoff between performance and memory consumption. Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155624.216091571@linutronix.de
2025-03-13posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_tEric Dumazet
The global hash_lock protecting the posix timer hash table can be heavily contended especially when there is an extensive linear search for a timer ID. Timer IDs are handed out by monotonically increasing next_posix_timer_id and then validating that there is no timer with the same ID in the hash table. Both operations happen with the global hash lock held. To reduce the hash lock contention the hash will be reworked to a scaled hash with per bucket locks, which requires to handle the ID counter lockless. Prepare for this by making next_posix_timer_id an atomic_t, which can be used lockless with atomic_inc_return(). [ tglx: Adopted from Eric's series, massaged change log and simplified it ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155624.151545978@linutronix.de
2025-03-13posix-timers: Make lock_timer() use guard()Peter Zijlstra
The lookup and locking of posix timers requires the same repeating pattern at all usage sites: tmr = lock_timer(tiner_id); if (!tmr) return -EINVAL; .... unlock_timer(tmr); Solve this with a guard implementation, which works in most places out of the box except for those, which need to unlock the timer inside the guard scope. Though the only places where this matters are timer_delete() and timer_settime(). In both cases the timer pointer needs to be preserved across the end of the scope, which is solved by storing the pointer in a variable outside of the scope. timer_settime() also has to protect the timer with RCU before unlocking, which obviously can't use guard(rcu) before leaving the guard scope as that guard is cleaned up before the unlock. Solve this by providing the RCU protection open coded. [ tglx: Made it work and added change log ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250224162103.GD11590@noisy.programming.kicks-ass.net Link: https://lore.kernel.org/all/20250308155624.087465658@linutronix.de
2025-03-13posix-timers: Rework timer removalThomas Gleixner
sys_timer_delete() and the do_exit() cleanup function itimer_delete() are doing the same thing, but have needlessly different implementations instead of sharing the code. The other oddity of timer deletion is the fact that the timer is not invalidated before the actual deletion happens, which allows concurrent lookups to succeed. That's wrong because a timer which is in the process of being deleted should not be visible and any actions like signal queueing, delivery and rearming should not happen once the task, which invoked timer_delete(), has the timer locked. Rework the code so that: 1) The signal queueing and delivery code ignore timers which are marked invalid 2) The deletion implementation between sys_timer_delete() and itimer_delete() is shared 3) The timer is invalidated and removed from the linked lists before the deletion callback of the relevant clock is invoked. That requires to rework timer_wait_running() as it does a lookup of the timer when relocking it at the end. In case of deletion this lookup would fail due to the preceding invalidation and the wait loop would terminate prematurely. But due to the preceding invalidation the timer cannot be accessed by other tasks anymore, so there is no way that the timer has been freed after the timer lock has been dropped. Move the re-validation out of timer_wait_running() and handle it at the only other usage site, timer_settime(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/87zfht1exf.ffs@tglx
2025-03-13posix-timers: Simplify lock/unlock_timer()Thomas Gleixner
Since the integration of sigqueue into the timer struct, lock_timer() is only used in task context. So taking the lock with irqsave() is not longer required. Convert it to use spin_[un]lock_irq(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.959825668@linutronix.de
2025-03-13posix-timers: Use guards in a few placesThomas Gleixner
Switch locking and RCU to guards where applicable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.892762130@linutronix.de
2025-03-13posix-timers: Remove SLAB_PANIC from kmem cacheThomas Gleixner
There is no need to panic when the posix-timer kmem_cache can't be created. timer_create() will fail with -ENOMEM and that's it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.829215801@linutronix.de
2025-03-13posix-timers: Remove a few paranoid warningsThomas Gleixner
Warnings about a non-initialized timer or non-existing callbacks are just useful for implementing new posix clocks, but there a NULL pointer dereference is expected anyway. :) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.765462334@linutronix.de
2025-03-13posix-timers: Cleanup includesThomas Gleixner
Remove pointless includes and sort the remaining ones alphabetically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.701301552@linutronix.de
2025-03-13posix-timers: Add cond_resched() to posix_timer_add() search loopEric Dumazet
With a large number of POSIX timers the search for a valid ID might cause a soft lockup on PREEMPT_NONE/VOLUNTARY kernels. Add cond_resched() to the loop to prevent that. [ tglx: Split out from Eric's series ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de
2025-03-13posix-timers: Initialise timer before adding it to the hash tableEric Dumazet
A timer is only valid in the hashtable when both timer::it_signal and timer::it_id are set to their final values, but timers are added without those values being set. The timer ID is allocated when the timer is added to the hash in invalid state. The ID is taken from a monotonically increasing per process counter which wraps around after reaching INT_MAX. The hash insertion validates that there is no timer with the allocated ID in the hash table which belongs to the same process. That opens a mostly theoretical race condition: If other threads of the same process manage to create/delete timers in rapid succession before the newly created timer is fully initialized and wrap around to the timer ID which was handed out, then a duplicate timer ID will be inserted into the hash table. Prevent this by: 1) Setting timer::it_id before inserting the timer into the hashtable. 2) Storing the signal pointer in timer::it_signal with bit 0 set before inserting it into the hashtable. Bit 0 acts as a invalid bit, which means that the regular lookup for sys_timer_*() will fail the comparison with the signal pointer. But the lookup on insertion masks out bit 0 and can therefore detect a timer which is not yet valid, but allocated in the hash table. Bit 0 in the pointer is cleared once the initialization of the timer completed. [ tglx: Fold ID and signal iniitializaion into one patch and massage change log and comments. ] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250219125522.2535263-3-edumazet@google.com Link: https://lore.kernel.org/all/20250308155623.572035178@linutronix.de
2025-03-13posix-timers: Ensure that timer initialization is fully visibleThomas Gleixner
Frederic pointed out that the memory operations to initialize the timer are not guaranteed to be visible, when __lock_timer() observes timer::it_signal valid under timer::it_lock: T0 T1 --------- ----------- do_timer_create() // A new_timer->.... = .... spin_lock(current->sighand) // B WRITE_ONCE(new_timer->it_signal, current->signal) spin_unlock(current->sighand) sys_timer_*() t = __lock_timer() spin_lock(&timr->it_lock) // observes B if (timr->it_signal == current->signal) return timr; if (!t) return; // Is not guaranteed to observe A Protect the write of timer::it_signal, which makes the timer valid, with timer::it_lock as well. This guarantees that T1 must observe the initialization A completely, when it observes the valid signal pointer under timer::it_lock. sighand::siglock must still be taken to protect the signal::posix_timers list. Reported-by: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20250308155623.507944489@linutronix.de
2025-03-13clocksource: Remove unnecessary strscpy() size argumentThorsten Blum
The size argument of strscpy() is only required when the destination pointer is not a fixed sized array or when the copy needs to be smaller than the size of the fixed sized destination array. For fixed sized destination arrays and full copies, strscpy() automatically determines the length of the destination buffer if the size argument is omitted. This makes the explicit sizeof() unnecessary. Remove it. [ tglx: Massaged change log ] Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250311110624.495718-2-thorsten.blum@linux.dev
2025-03-13timer_list: Don't use %pK through printk()Thomas Weißschuh
This reverts commit f590308536db ("timer debug: Hide kernel addresses via %pK in /proc/timer_list") The timer list helper SEQ_printf() uses either the real seq_printf() for procfs output or vprintk() to print to the kernel log, when invoked from SysRq-q. It uses %pK for printing pointers. In the past %pK was prefered over %p as it would not leak raw pointer values into the kernel log. Since commit ad67b74d2469 ("printk: hash addresses printed with %p") the regular %p has been improved to avoid this issue. Furthermore, restricted pointers ("%pK") were never meant to be used through printk(). They can still unintentionally leak raw pointers or acquire sleeping looks in atomic contexts. Switch to the regular pointer formatting which is safer, easier to reason about and sufficient here. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Link: https://lore.kernel.org/all/20250311-restricted-pointers-timer-v1-1-6626b91e54ab@linutronix.de
2025-03-12Merge tag 'sched_ext-for-6.14-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "BPF schedulers could trigger a crash by passing in an invalid CPU to the scx_bpf_select_cpu_dfl() helper. Fix it by verifying input validity" * tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
2025-03-12PM: s2idle: Extend comment in s2idle_enter()Ulf Hansson
The s2idle_lock must be held while checking for a pending wakeup and while moving into S2IDLE_STATE_ENTER, to make sure a wakeup doesn't get lost. Let's extend the comment in the code to make this clear. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20250311160827.1129643-3-ulf.hansson@linaro.org [ rjw: Rewrote the new comment ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-03-12PM: s2idle: Drop redundant locks when entering s2idleUlf Hansson
The calls to cpus_read_lock|unlock() protects us from getting CPUS hotplugged, while entering suspend-to-idle. However, when s2idle_enter() is called we should be far beyond the point when CPUs may be hotplugged. Let's therefore simplify the code and drop the use of the lock. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20250311160827.1129643-2-ulf.hansson@linaro.org [ rjw: Rewrote the new comment ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-03-12dma-mapping: fix missing clear bdr in check_ram_in_range_map()Baochen Qiang
As discussed in [1], if 'bdr' is set once, it would never get cleared, hence 0 is always returned. Refactor the range check hunk into a new helper dma_find_range(), which allows 'bdr' to be cleared in each iteration. Link: https://lore.kernel.org/all/64931fac-085b-4ff3-9314-84bac2fa9bdb@quicinc.com/ # [1] Fixes: a409d9600959 ("dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM") Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com> Link: https://lore.kernel.org/r/20250307030350.69144-1-quic_bqiang@quicinc.com Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2025-03-11blk-cgroup: Simplify policy files registrationMichal Koutný
Use one set of files when there is no difference between default and legacy files, similar to regular subsys files registration. No functional change. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11cgroup: Add deprecation message to legacy freezer controllerMichal Koutný
As explained in the commit 76f969e8948d8 ("cgroup: cgroup v2 freezer"), the original freezer is imperfect, some users may unwittingly rely on it when there exists the alternative of v2. Print a message when it happens and explain that in the docs. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_levelMichal Koutný
This is not a properly hierarchical resource, it might be better implemented based on a sched_attr. Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11cgroup/cpuset-v1: Add deprecation messages to memory_migrateMichal Koutný
Memory migration (between cgroups) was given up in v2 due to performance reasons of its implementation. Migration between NUMA nodes within one memcg may still make sense to modify affinity at runtime though. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwallMichal Koutný
The concept of exclusive memory affinity may require complex approaches like with cpuset v2 cpu partitions. There is so far no implementation in cpuset v2. Specific kernel memory affinity may cause unintended (global) bottlenecks like kmem limits. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11cgroup: Print message when /proc/cgroups is read on v2-only systemMichal Koutný
As a followup to commits 6c2920926b10e ("cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation") and ab03125268679 ("cgroup: Show # of subsystem CSSes in cgroup.stat"), add a runtime message to users who read status of controllers in /proc/cgroups on v2-only system. The detection is based on a) no controllers are attached to v1, b) default hierarchy is mounted (the latter is for setups that never mount v2 but read /proc/cgroups upon boot when controllers default to v2, so that this code may be backported to older kernels). Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and ↵Michal Koutný
memory_spread_slab There is MPOL_INTERLEAVE for user explicit allocations. Deprecate spreading of allocations that users carry out unwittingly. Use straight warning level for slab spreading since such a knob is unnecessarily intertwined with slab allocator. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and ↵Michal Koutný
memory_pressure_enabled These two v1 feature have analogues in cgroup v2. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11stop-machine: Add comment for rcu_momentary_eqs()Paul E. McKenney
Add a comment to explain the purpose of the rcu_momentary_eqs() call from multi_cpu_stop(), which is to suppress false-positive RCU CPU stall warnings. Reported-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/87wmeuanti.ffs@tglx/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org>
2025-03-11printk: Check CON_SUSPEND when unblanking a consoleMarcos Paulo de Souza
The commit 9e70a5e109a4 ("printk: Add per-console suspended state") introduced the CON_SUSPENDED flag for consoles. The suspended consoles will stop receiving messages, so don't unblank suspended consoles because it won't be showing anything either way. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/20250226-printk-renaming-v1-5-0b878577f2e6@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>