path: root/kernel
2025-02-21  ftrace: Correct preemption accounting for function tracing.  (Sebastian Andrzej Siewior)
The function tracer should record the preemption level at the point when
the function is invoked. If the tracing subsystem decrements the
preemption counter, it needs to correct this before feeding the data into
the trace buffer. This was broken in the commit cited below while
shifting the preempt-disabled section.

Use tracing_gen_ctx_dec(), which properly subtracts one from the
preemption counter on a preemptible kernel.

Cc: stable@vger.kernel.org
Cc: Wander Lairson Costa <wander@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20250220140749.pfw8qoNZ@linutronix.de
Fixes: ce5e48036c9e7 ("ftrace: disable preemption when recursion locked")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
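A minimal user-space sketch of the off-by-one (the two helpers below are
toy stand-ins for the kernel functions of the same names, not their real
implementations): once the tracer itself has bumped the preemption
counter, recording the raw value over-reports the traced context by one
level.

    #include <stdio.h>

    static unsigned int preempt_count;

    /* toy stand-ins for the kernel helpers */
    static unsigned int tracing_gen_ctx(void)     { return preempt_count; }
    static unsigned int tracing_gen_ctx_dec(void) { return preempt_count - 1; }

    int main(void)
    {
            preempt_count++;        /* the tracer's own preempt_disable() */
            printf("recorded raw: %u, corrected: %u\n",
                   tracing_gen_ctx(), tracing_gen_ctx_dec());
            preempt_count--;
            return 0;
    }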
2025-02-21  fprobe: Fix accounting of when to unregister from function graph  (Steven Rostedt)
When adding a new fprobe, it will update the function hash to the
functions the fprobe is attached to and register with function graph to
have it call the registered functions. The fprobe_graph_active variable
keeps track of the number of fprobes that are using function graph.

If two fprobes attach to the same function, it increments
fprobe_graph_active for each of them. But when they are removed, the
first fprobe to be removed will see that the function it is attached to
is also used by another fprobe, and it will not remove that function
from function_graph. The logic will skip decrementing the
fprobe_graph_active variable.

This causes the fprobe_graph_active variable to not go to zero when all
fprobes are removed, and in doing so it does not unregister from
function graph. As the fgraph ops hash will now be empty, and an empty
filter hash means all functions are enabled, this triggers function
graph to add a callback to the fprobe infrastructure for every function!

 # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
 # echo "f:myevent2 kernel_clone%return" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
 kernel_clone (1) tramp: 0xffffffffc0024000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

 # > /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
 trace_initcall_start_cb (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 run_init_process (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 try_to_run_init_process (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 x86_pmu_show_pmu_cap (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 cleanup_rapl_pmus (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 uncore_free_pcibus_map (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 uncore_types_exit (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 uncore_pci_exit.part.0 (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 kvm_shutdown (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 vmx_dump_msrs (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
 [..]
 # cat /sys/kernel/tracing/enabled_functions | wc -l
 54702

If a fprobe is being removed and all its functions are also traced by
other fprobes, still decrement the fprobe_graph_active counter.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.565129766@goodmis.org
Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer")
Closes: https://lore.kernel.org/all/20250217114918.10397-A-hca@linux.ibm.com/
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21  fprobe: Always unregister fgraph function from ops  (Steven Rostedt)
When the last fprobe is removed, it calls unregister_ftrace_graph() to
remove the graph_ops from function graph. The issue is that when it does
so, it returns before removing the function from its graph ops via
ftrace_set_filter_ips(). This leaves the last function lingering in the
fprobe's fgraph ops, and if a probe is added it also enables that last
function (even though the callback will just drop it, it does add
unneeded overhead to make that call).

 # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
 kernel_clone (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

 # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
 kernel_clone (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 schedule_timeout (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

 # > /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions

 # echo "f:myevent3 kmem_cache_free" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
 kmem_cache_free (1) tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 schedule_timeout (1) tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

The above enabled a fprobe on kernel_clone, and then on schedule_timeout.
The content of enabled_functions shows the functions that have a
callback attached to them. The fprobe attached to those functions
properly.

Then the fprobes were cleared, and enabled_functions was empty after
that. But after adding a fprobe on kmem_cache_free, enabled_functions
shows that schedule_timeout was attached again. This is because it was
still left in the fprobe ops that is used to tell function graph what
functions it wants callbacks from.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.393254452@goodmis.org
Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
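A rough sketch of the corrected teardown ordering (heavily simplified;
addrs and num are illustrative, and the real code lives in
kernel/trace/fprobe.c): the filter IPs are scrubbed from the fgraph ops
whether or not this was the last fprobe.

    /* fprobe teardown, heavily simplified */
    if (!--fprobe_graph_active)
            unregister_ftrace_graph(&fprobe_graph_ops);

    /* Always remove this fprobe's functions from the fgraph ops hash,
     * even on the path that just unregistered the graph ops above;
     * otherwise a stale IP lingers and is re-enabled by the next
     * registration. remove=1, reset=0 deletes only the given IPs. */
    ftrace_set_filter_ips(&fprobe_graph_ops.ops, addrs, num, 1, 0);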
2025-02-21  ftrace: Do not add duplicate entries in subops manager ops  (Steven Rostedt)
Check if a function is already in the manager ops of a subops. A manager
ops contains multiple subops, and if two or more subops are tracing the
same function, the manager ops only needs a single entry in its hash.

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.226762894@goodmis.org
Fixes: 4f554e955614f ("ftrace: Add ftrace_set_filter_ips function")
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21  ftrace: Fix accounting of adding subops to a manager ops  (Steven Rostedt)
Function graph uses a subops and manager ops mechanism to attach to
ftrace. The manager ops connects to ftrace, and the functions it
connects to are defined by a list of subops that it manages.

The function hash that defines what the above ops attaches to limits the
functions to attach if the hash has any content. If the hash is empty,
it means to trace all functions.

The creation of the manager ops hash is done by iterating over all the
subops hashes. If any of the subops hashes is empty, it means that the
manager ops hash must trace all functions as well.

The issue is in the creation of the manager ops. When a second subops is
attached, a new hash is created by starting it as NULL and adding the
subops one at a time. But the NULL hash is mistaken as an empty hash,
and once an empty hash is found, it stops the loop of subops and just
enables all functions.

 # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
 kernel_clone (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60

 # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
 # cat /sys/kernel/tracing/enabled_functions
 trace_initcall_start_cb (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 run_init_process (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 try_to_run_init_process (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 x86_pmu_show_pmu_cap (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 cleanup_rapl_pmus (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 uncore_free_pcibus_map (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 uncore_types_exit (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 uncore_pci_exit.part.0 (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 kvm_shutdown (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 vmx_dump_msrs (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 vmx_cleanup_l1d_flush (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
 [..]

Fix this by initializing the new hash to NULL, and if the hash is NULL
do not treat it as an empty hash but instead allocate it by copying the
content of the first subops. Then on subsequent iterations, the new hash
will not be NULL but will hold the content of the previous subops. If
that first subops was attached to all functions, the new hash can then
correctly conclude that the manager ops also needs to attach to all
functions.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.060300046@goodmis.org
Fixes: 5fccc7552ccbc ("ftrace: Add subops logic to allow one ops to manage many")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
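The NULL-versus-empty pitfall can be shown with a toy model that folds
bitmask filters in place of ftrace hashes (a sketch, not the kernel
code; in the buggy version, 0 stands in for both NULL and the match-all
empty hash):

    #include <stdio.h>

    #define MATCH_ALL 0u    /* empty filter: trace every function */

    /* Buggy fold: 0 doubles as "nothing accumulated yet" (NULL) and as
     * the match-all empty filter, so the loop bails out immediately. */
    static unsigned fold_buggy(const unsigned *sub, int n)
    {
            unsigned acc = 0;               /* "NULL": nothing folded yet */

            for (int i = 0; i < n; i++) {
                    if (acc == MATCH_ALL)   /* mistakes NULL for empty */
                            return MATCH_ALL;
                    acc |= sub[i];
            }
            return acc;
    }

    /* Fixed fold: copy the first filter to start, so an accumulator of
     * MATCH_ALL really means one of the subops traces everything. */
    static unsigned fold_fixed(const unsigned *sub, int n)
    {
            unsigned acc = 0;
            int started = 0;

            for (int i = 0; i < n; i++) {
                    if (!started) {
                            acc = sub[i];   /* copy first subops filter */
                            started = 1;
                    } else if (acc == MATCH_ALL || sub[i] == MATCH_ALL) {
                            return MATCH_ALL;
                    } else {
                            acc |= sub[i];
                    }
            }
            return acc;
    }

    int main(void)
    {
            unsigned subs[] = { 0x1, 0x2 }; /* two subops, one function each */

            printf("buggy: %#x (match-all!), fixed: %#x\n",
                   fold_buggy(subs, 2), fold_fixed(subs, 2));
            return 0;
    }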
2025-02-21  rseq: Fix rseq registration with CONFIG_DEBUG_RSEQ  (Michael Jeanson)
With CONFIG_DEBUG_RSEQ=y, at rseq registration the read-only fields are
copied from user-space; if this copy fails, the syscall returns -EFAULT
and the registration should not be activated - but it erroneously is.

Move the activation of the registration after the copy of the fields to
fix this bug.

Fixes: 7d5265ffcd8b ("rseq: Validate read-only fields under DEBUG_RSEQ config")
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20250219205330.324770-1-mjeanson@efficios.com
2025-02-20  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Linus Torvalds)
Pull BPF fixes from Daniel Borkmann:

 - Fix a soft-lockup in BPF arena_map_free on 64k page size kernels
   (Alan Maguire)

 - Fix a missing allocation failure check in BPF verifier's
   acquire_lock_state (Kumar Kartikeya Dwivedi)

 - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb
   to the raw_tp_null_args set (Kuniyuki Iwashima)

 - Fix a deadlock when freeing BPF cgroup storage (Abel Wu)

 - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex
   (Andrii Nakryiko)

 - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is
   accessing skb data not containing an Ethernet header (Shigeru
   Yoshida)

 - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai)

 - Several BPF sockmap fixes to address incorrect TCP copied_seq
   calculations, which prevented correct data reads from recv(2) in
   user space (Jiayuan Chen)

 - Two fixes for BPF map lookup nullness elision (Daniel Xu)

 - Fix a NULL-pointer dereference from vmlinux BTF lookup in
   bpf_sk_storage_tracing_allowed (Jared Kangas)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests: bpf: test batch lookup on array of maps with holes
  bpf: skip non exist keys in generic_map_lookup_batch
  bpf: Handle allocation failure in acquire_lock_state
  bpf: verifier: Disambiguate get_constant_map_key() errors
  bpf: selftests: Test constant key extraction on irrelevant maps
  bpf: verifier: Do not extract constant map keys for irrelevant maps
  bpf: Fix softlockup in arena_map_free on 64k page kernel
  net: Add rx_skb of kfree_skb to raw_tp_null_args[].
  bpf: Fix deadlock when freeing cgroup storage
  selftests/bpf: Add strparser test for bpf
  selftests/bpf: Fix invalid flag of recv()
  bpf: Disable non stream socket for strparser
  bpf: Fix wrong copied_seq calculation
  strparser: Add read_sock callback
  bpf: avoid holding freeze_mutex during mmap operation
  bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
  selftests/bpf: Adjust data size to have ETH_HLEN
  bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type()
  bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
2025-02-21  Merge tag 'drm-misc-fixes-2025-02-20' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes  (Dave Airlie)
A reset signal polarity fix for the jd9365da-h3 panel, a folio handling
fix and config fix in nouveau, a dmem cgroup descendant pool handling
fix, and a missing header for amdxdna.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250220-glorious-cockle-of-might-5b35f7@houat
2025-02-19  cgroup/dmem: Don't open-code css_for_each_descendant_pre  (Friedrich Vock)
The current implementation has a bug: If the current css doesn't contain
any pool that is a descendant of the "pool" (i.e. when found_descendant
== false), then "pool" will point to some unrelated pool. If the current
css has a child, we'll overwrite parent_pool with this unrelated pool on
the next iteration.

Since we can just check whether a pool refers to the same region to
determine whether or not it's related, all the additional pool tracking
is unnecessary, so just switch to using css_for_each_descendant_pre for
traversal.

Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250127152754.21325-1-friedrich.vock@gmx.de
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
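For reference, the traversal helper already encodes the pre-order walk,
so the hand-rolled parent/child bookkeeping can be dropped. A rough
sketch of the pattern (everything except the iterator macro and the RCU
locking is illustrative; root_css names some subtree root):

    struct cgroup_subsys_state *css;

    rcu_read_lock();
    css_for_each_descendant_pre(css, root_css) {
            /* look up this css's pool by comparing the region it
             * charges against, instead of tracking parent pools */
    }
    rcu_read_unlock();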
2025-02-18  bpf: skip non exist keys in generic_map_lookup_batch  (Yan Zhai)
The generic_map_lookup_batch currently returns EINTR if it fails with
ENOENT and retries several times on bpf_map_copy_value. The next batch
would start from the same location, presuming it's a transient issue.
This is incorrect if a map can actually have "holes", i.e.
"get_next_key" can return a key that does not point to a valid value. At
least the array of maps type may legitimately contain such holes. Right
now when these holes show up, generic batch lookup cannot proceed any
more. It will always fail with EINTR errors.

Rather, do not retry in generic_map_lookup_batch. If it finds a
non-existing element, skip to the next key. This simple solution comes
with a price that transient errors may not be recovered, and the
iteration might cycle back to the first key under parallel deletion. For
example, Hou Tao <houtao@huaweicloud.com> pointed out the following
scenario:

For LPM trie map:
 (1) ->map_get_next_key(map, prev_key, key) returns a valid key
 (2) bpf_map_copy_value() returns -ENOENT
     It means the key must have been deleted concurrently.
 (3) goto next_key
     It swaps the prev_key and key
 (4) ->map_get_next_key(map, prev_key, key) again
     prev_key points to a non-existing key; for LPM trie this is treated
     just like the prev_key=NULL case, so the returned key will be
     duplicated.

With the retry logic, the iteration can continue to the key next to the
deleted one. But if we directly skip to the next key, the iteration loop
would restart from the first key for the lpm_trie type.

However, not all races may be recovered. For example, if the current key
is deleted after instead of before bpf_map_copy_value, or if the
prev_key also gets deleted, then the loop will still restart from the
first key for lpm_trie anyway.

For generic lookup it might be better to stay simple, i.e. just skip to
the next key. To guarantee that the output keys are not duplicated, it
is better to implement map type specific batch operations, which can
properly lock the trie and synchronize with concurrent mutators.

Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op")
Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/85618439eea75930630685c467ccefeac0942e2b.1739171594.git.yan@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
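The simplified shape of the fixed loop might look like this (a sketch
based on the description above, not the exact code in
kernel/bpf/syscall.c; elem_flags, scratch, and the copy-out step are
elided or illustrative):

    for (cp = 0; cp < max_count;) {
            err = map->ops->map_get_next_key(map, prev_key, key);
            if (err)
                    break;
            err = bpf_map_copy_value(map, key, value, elem_flags);
            if (err == -ENOENT)
                    goto next_key;  /* hole: skip it, no -EINTR retry */
            if (err)
                    break;
            /* ... copy key and value out to the user buffers, cp++ ... */
    next_key:
            if (!prev_key)
                    prev_key = scratch;  /* second scratch key buffer */
            swap(prev_key, key);         /* fetched key becomes the cursor */
    }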
2025-02-18  uprobes: Don't use %pK through printk  (Thomas Weißschuh)
Restricted pointers ("%pK") are not meant to be used through printk().
It can unintentionally expose security sensitive, raw pointer values.

Use regular pointer formatting instead.

For more background, see:
https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250217-restricted-pointers-uprobes-v1-1-e8cbe5bb22a7@linutronix.de
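The change is of roughly this shape (the message text is invented for
illustration, not the actual uprobes call site; note that plain %p has
printed a hashed value by default since v4.15):

    pr_warn("uprobe at %pK failed\n", addr);  /* before: can leak the raw value */
    pr_warn("uprobe at %p failed\n", addr);   /* after: prints a hashed value */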
2025-02-18  sched: Compact RSEQ concurrency IDs with reduced threads and affinity  (Mathieu Desnoyers)
When a process reduces its number of threads or clears bits in its CPU
affinity mask, the mm_cid allocation should eventually converge towards
smaller values. However, the change introduced by:

  commit 7e019dcc470f ("sched: Improve cache locality of RSEQ
  concurrency IDs for intermittent workloads")

adds a per-mm/CPU recent_cid which is never unset unless a thread
migrates.

This is a tradeoff between:

 A) Preserving cache locality after a transition from many threads to
    few threads, or after reducing the hamming weight of the allowed
    CPU mask.

 B) Making the mm_cid upper bounds wrt nr threads and allowed CPU mask
    easy to document and understand.

 C) Allowing applications to eventually react to mm_cid compaction
    after reduction of the nr threads or allowed CPU mask, making the
    tracking of mm_cid compaction easier by shrinking it back towards 0
    or not.

 D) Making sure applications that periodically reduce and then increase
    again the nr threads or allowed CPU mask still benefit from good
    cache locality with mm_cid.

Introduce the following changes:

 * After shrinking the number of threads or reducing the number of
   allowed CPUs, reduce the value of max_nr_cid so expansion of CID
   allocation will preserve cache locality if the number of threads or
   allowed CPUs increase again.

 * Only re-use a recent_cid if it is within the max_nr_cid upper bound,
   else find the first available CID.

Fixes: 7e019dcc470f ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lkml.kernel.org/r/20250210153253.460471-2-gmonaco@redhat.com
2025-02-17  Merge tag 'vfs-6.14-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)
Pull vfs fixes from Christian Brauner:

 "It was reported that the acct(2) system call can be used to trigger a
  NULL deref in cases where it is set to write to a file that triggers
  an internal lookup. This can e.g., happen when pointing acct(2) to
  /sys/power/resume. At the point where the write to this file happens
  the calling task has already exited and called exit_fs() but an
  internal lookup might be triggered through lookup_bdev(). This may
  trigger a NULL-deref when accessing current->fs.

  Reorganize the code so that the final write happens from the
  workqueue but with the caller's credentials. This preserves the
  (strange) permission model and has almost no regression risk.

  Also block access to kernel internal filesystems as well as procfs
  and sysfs in the first place.

  Various fixes for netfslib:

   - Fix a number of read-retry hangs, including:
      - Incorrect getting/putting of references on subreqs as we retry
        them
      - Failure to track whether a last old subrequest in a retried set
        is superfluous
      - Inconsistency in the usage of wait queues used for subrequests
        (ie. using clear_and_wake_up_bit() whilst waiting on a private
        waitqueue)

   - Add stats counters for retries and publish in
     /proc/fs/netfs/stats. This is not a fix per se, but is useful in
     debugging and shouldn't otherwise change the operation of the code

   - Fix the ordering of queuing subrequests with respect to setting
     the request flag that says we've now queued them all"

* tag 'vfs-6.14-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  netfs: Fix setting NETFS_RREQ_ALL_QUEUED to be after all subreqs queued
  netfs: Add retry stat counters
  netfs: Fix a number of read-retry hangs
  acct: block access to kernel internal filesystems
  acct: perform last write from workqueue
2025-02-16  Merge tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull irq Kconfig cleanup from Borislav Petkov:

 - Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS

* tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
2025-02-16  Merge tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fix from Borislav Petkov:

 - Clarify what happens when a task is woken up from the wake queue and
   make clear its removal from that queue is atomic

* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Clarify wake_up_q()'s write to task->wake_q.next
2025-02-15  Merge tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds)
Pull trace ring buffer fixes from Steven Rostedt:

 - Enable resize on mmap() error

   When a process mmaps a ring buffer, its size is locked and resizing
   is disabled. But if the user passes in a wrong parameter, the mmap()
   can fail after the resize was disabled and the mmap() exits with
   error without reenabling the ring buffer resize. This prevents the
   ring buffer from ever being resized after that. Reenable resizing of
   the ring buffer on mmap() error.

 - Have resizing return proper error and not always -ENOMEM

   If the ring buffer is mmapped by one task and another task tries to
   resize the buffer it will error with -ENOMEM. This is confusing to
   the user as there may be plenty of memory available. Have it return
   the error that actually happens (in this case -EBUSY) where the user
   can understand why the resize failed.

 - Test the sub-buffer array to validate persistent memory buffer

   On boot up, the initialization of the persistent memory buffer will
   do a validation check to see if the content of the data is valid,
   and if so, it will use the memory as is, otherwise it re-initializes
   it. There's meta data in this persistent memory that keeps track of
   which sub-buffer is the reader page and an array that states the
   order of the sub-buffers. The values in this array are indexes into
   the sub-buffers. The validator checks to make sure that all the
   entries in the array are within the sub-buffer list index, but it
   does not check for duplications.

   While working on this code, the array got corrupted and had
   duplicates, where not all the sub-buffers were accounted for. This
   passed the validator as all entries were valid, but the link list
   was incorrect and could have caused a crash. The corruption only
   produced incorrect data, but it could have been more severe.

   To fix this, create a bitmask that covers all the sub-buffer indexes
   and set it to all zeros. While iterating the array checking the
   values of the array content, have it set a bit corresponding to the
   index in the array. If the bit was already set, then it is a
   duplicate and mark the buffer as invalid and reset it.

 - Prevent mmap()ing persistent ring buffer

   The persistent ring buffer uses vmap() to map the persistent memory.
   Currently, the mmap() logic only uses virt_to_page() to get the page
   from the ring buffer memory and use that to map to user space. This
   works because a normal ring buffer uses alloc_page() to allocate its
   memory. But because the persistent ring buffer uses vmap() it causes
   a kernel crash.

   Fixing this to work with vmap() is not hard, but since mmap() on
   persistent memory buffers never worked, just have mmap() return
   -ENODEV (what was returned before mmap() existed for persistent
   memory ring buffers, as they never supported mmap). Normal buffers
   will still allow mmap(). Implementing mmap() for persistent memory
   ring buffers can wait till the next merge window.

 - Fix polling on persistent ring buffers

   There's a "buffer_percent" option (default set to 50), that is used
   to have reads of the ring buffer binary data block until the buffer
   fills to that percentage. The field "pages_touched" is incremented
   every time a new sub-buffer has content added to it. This field is
   used in the calculations to determine the amount of content in the
   buffer and if it exceeds the "buffer_percent" then it will wake the
   task polling on the buffer.

   As persistent ring buffers can be created by the content from a
   previous boot, the "pages_touched" field was not updated. This means
   that if a task were to poll on the persistent buffer, it would block
   even if the buffer was completely full. It would block even if the
   "buffer_percent" was zero, because with "pages_touched" as zero, it
   would be calculated as the buffer having no content. Update
   pages_touched when initializing the persistent ring buffer from a
   previous boot.

* tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Update pages_touched to reflect persistent buffer content
  tracing: Do not allow mmap() of persistent ring buffer
  ring-buffer: Validate the persistent meta data subbuf array
  tracing: Have the error of __tracing_resize_ring_buffer() passed to user
  ring-buffer: Unlock resize on mmap error
2025-02-15  ring-buffer: Update pages_touched to reflect persistent buffer content  (Steven Rostedt)
The pages_touched field represents the number of subbuffers in the ring
buffer that have content that can be read. This is used in the
accounting of "dirty_pages" and "buffer_percent" to allow the user to
wait for the buffer to be filled to a certain amount before it reads
the buffer in blocking mode.

The persistent buffer never updated this value, so it was set to zero,
and this accounting would treat the buffer as having no content. This
would cause user space to wait for content even though there's enough
content in the ring buffer to satisfy the buffer_percent.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home
Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15  tracing: Do not allow mmap() of persistent ring buffer  (Steven Rostedt)
When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:

 BUG: unable to handle page fault for address: ffffe97bd00025c8
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 2862f3067 P4D 2862f3067 PUD 0
 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
 CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:validate_page_before_insert+0x5/0xb0
 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
 FS:  00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x1f
  ? __die+0x2e/0x40
  ? page_fault_oops+0x157/0x2b0
  ? search_module_extables+0x53/0x80
  ? validate_page_before_insert+0x5/0xb0
  ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
  ? __bad_area_nosemaphore+0x16e/0x1b0
  ? bad_area_nosemaphore+0x16/0x20
  ? do_kern_addr_fault+0x77/0x90
  ? exc_page_fault+0x22b/0x230
  ? asm_exc_page_fault+0x2b/0x30
  ? validate_page_before_insert+0x5/0xb0
  ? vm_insert_pages+0x151/0x400
  __rb_map_vma+0x21f/0x3f0
  ring_buffer_map+0x21b/0x2f0
  tracing_buffers_mmap+0x70/0xd0
  __mmap_region+0x6f0/0xbd0
  mmap_region+0x7f/0x130
  do_mmap+0x475/0x610
  vm_mmap_pgoff+0xf2/0x1d0
  ksys_mmap_pgoff+0x166/0x200
  __x64_sys_mmap+0x37/0x50
  x64_sys_call+0x1670/0x1d70
  do_syscall_64+0xbb/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

The reason was that the code that maps the ring buffer pages to user
space has:

	page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);

And uses that in:

	vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

But virt_to_page() does not work with vmap()'d memory, which is what the
persistent ring buffer has. It is rather trivial to allow this, but for
now just disable mmap() of instances that have their ring buffer from
the reserve_mem option.

If an mmap() is performed on a persistent buffer it will return -ENODEV,
just like it would if the .mmap field wasn't defined in the
file_operations structure.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home
Fixes: 9b7bdf6f6ece6 ("tracing: Have trace_printk not use binary prints if boot buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14  Merge tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds)
Pull sched_ext fixes from Tejun Heo:

 - Fix lock imbalance in a corner case of dispatch_to_local_dsq()

 - Migration disabled tasks were confusing some BPF schedulers and
   their handling had a bug. Fix it and simplify the default behavior
   by dispatching them automatically

 - ops.tick(), ops.disable() and ops.exit_task() were incorrectly
   disallowing kfuncs that require the task argument to be the task the
   rq operation is currently operating on and thus is rq-locked. Allow
   them.

 - Fix autogroup migration handling bug which was occasionally
   triggering a warning in the cgroup migration path

 - tools/sched_ext, selftest and other misc updates

* tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
  sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.
  sched_ext: selftests: Fix grammar in tests description
  sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
  sched_ext: Fix migration disabled handling in targeted dispatches
  sched_ext: Implement auto local dispatching of migration disabled tasks
  sched_ext: Fix incorrect time delta calculation in time_delta()
  sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
  sched_ext: selftests/dsp_local_on: Fix selftest on UP systems
  tools/sched_ext: Add helper to check task migration state
  sched_ext: Fix incorrect autogroup migration detection
  sched_ext: selftests/dsp_local_on: Fix sporadic failures
  selftests/sched_ext: Fix enum resolution
  sched_ext: Include task weight in the error state dump
  sched_ext: Fixes typos in comments
2025-02-14  Merge tag 'cgroup-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)
Pull cgroup fixes from Tejun Heo:

 - Fix a race window where a newly forked task could escape cgroup.kill

 - Remove incorrectly included steal time from cpu.stat::usage_usec

 - Minor update in selftest

* tag 'cgroup-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Remove steal time from usage_usec
  selftests/cgroup: use bash in test_cpuset_v1_hp.sh
  cgroup: fix race between fork and cgroup.kill
2025-02-14  Merge tag 'wq-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds)
Pull workqueue fix from Tejun Heo:

 - Fix a regression where a worker pool can be freed before rescuer
   workers are done with it, leading to use-after-free

* tag 'wq-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Put the pwq after detaching the rescuer from the pool
2025-02-14  ring-buffer: Validate the persistent meta data subbuf array  (Steven Rostedt)
The meta data for a mapped ring buffer contains an array of indexes of
all the subbuffers. The first entry is the reader page, and the rest of
the entries lay out the order of the subbuffers in how the ring buffer
link list is to be created.

The validator currently makes sure that all the entries are within the
range of 0 and nr_subbufs. But it does not check if there are any
duplicates.

While working on the ring buffer, I corrupted this array, where I added
duplicates. The validator did not catch it and created the ring buffer
link list on top of it. Luckily, the corruption was only that the
reader page was also in the writer path and only presented corrupted
data but did not crash the kernel. But if there were duplicates in the
writer side, then it could corrupt the ring buffer link list and cause
a crash.

Create a bitmask array with the size of the number of subbuffers. Then
clear it. When walking through the subbuf array checking to see if the
entries are within the range, test if its bit is already set in the
subbuf_mask. If it is, then there are duplicates and the validation
fails. If not, set the corresponding bit and continue.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214102820.7509ddea@gandalf.local.home
Fixes: c76883f18e59b ("ring-buffer: Add test if range of boot buffer is valid")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
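A standalone user-space sketch of the duplicate check (the kernel uses
its bitmap helpers rather than this hand-rolled mask): each index sets a
bit as it is validated, so a bit that is already set flags a duplicate
that a range-only check would miss.

    #include <stdio.h>
    #include <stdlib.h>
    #include <limits.h>

    static int validate_subbufs(const unsigned *ids, unsigned nr_subbufs)
    {
            const size_t bits_per_word = sizeof(long) * CHAR_BIT;
            size_t words = (nr_subbufs + bits_per_word - 1) / bits_per_word;
            unsigned long *mask = calloc(words, sizeof(long));
            int ok = 1;

            if (!mask)
                    return 0;

            for (unsigned i = 0; i < nr_subbufs; i++) {
                    unsigned id = ids[i];

                    if (id >= nr_subbufs ||
                        (mask[id / bits_per_word] & (1UL << (id % bits_per_word)))) {
                            ok = 0;         /* out of range, or a duplicate */
                            break;
                    }
                    mask[id / bits_per_word] |= 1UL << (id % bits_per_word);
            }
            free(mask);
            return ok;
    }

    int main(void)
    {
            unsigned good[] = { 3, 0, 1, 2 };
            unsigned dup[]  = { 3, 0, 1, 1 };  /* passes a range-only check */

            printf("good: %d, dup: %d\n",
                   validate_subbufs(good, 4), validate_subbufs(dup, 4));
            return 0;
    }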
2025-02-14  tracing: Have the error of __tracing_resize_ring_buffer() passed to user  (Steven Rostedt)
Currently if __tracing_resize_ring_buffer() returns an error, the
tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory
issue that caused the function to fail. If the ring buffer is memory
mapped, then the resizing of the ring buffer will be disabled. But if
the user tries to resize the buffer, it will get an -ENOMEM returned,
which is confusing because there is plenty of memory. The actual error
returned was -EBUSY, which would make much more sense to the user.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-02-14  ring-buffer: Unlock resize on mmap error  (Steven Rostedt)
Memory mapping the tracing ring buffer will disable resizing the
buffer. But if there's an error in the memory mapping, like an invalid
parameter, the function exits out without re-enabling the resizing of
the ring buffer, preventing the ring buffer from being resized after
that.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213131957.530ec3c5@gandalf.local.home
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
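The fix follows the usual re-enable-on-error-path pattern; roughly (a
sketch of the shape of ring_buffer_map(), not the exact diff):

    mutex_lock(&cpu_buffer->mapping_lock);
    /* ... */
    atomic_inc(&cpu_buffer->resize_disabled);  /* no resizing while mapped */

    err = __rb_map_vma(cpu_buffer, vma);
    if (err) {
            /* the missing piece: re-enable resizing when the map fails */
            atomic_dec(&cpu_buffer->resize_disabled);
    }

    mutex_unlock(&cpu_buffer->mapping_lock);
    return err;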
2025-02-14  workqueue: Log additional details when rejecting work  (Will Deacon)
Syzbot regularly runs into the following warning on arm64:

 | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 current_wq_worker kernel/workqueue_internal.h:69 [inline]
 | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 is_chained_work kernel/workqueue.c:2199 [inline]
 | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 __queue_work+0xe50/0x1308 kernel/workqueue.c:2256
 | Modules linked in:
 | CPU: 1 UID: 0 PID: 6023 Comm: klogd Not tainted 6.13.0-rc2-syzkaller-g2e7aff49b5da #0
 | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
 | pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 | pc : __queue_work+0xe50/0x1308 kernel/workqueue_internal.h:69
 | lr : current_wq_worker kernel/workqueue_internal.h:69 [inline]
 | lr : is_chained_work kernel/workqueue.c:2199 [inline]
 | lr : __queue_work+0xe50/0x1308 kernel/workqueue.c:2256
 [...]
 | __queue_work+0xe50/0x1308 kernel/workqueue.c:2256 (L)
 | delayed_work_timer_fn+0x74/0x90 kernel/workqueue.c:2485
 | call_timer_fn+0x1b4/0x8b8 kernel/time/timer.c:1793
 | expire_timers kernel/time/timer.c:1839 [inline]
 | __run_timers kernel/time/timer.c:2418 [inline]
 | __run_timer_base+0x59c/0x7b4 kernel/time/timer.c:2430
 | run_timer_base kernel/time/timer.c:2439 [inline]
 | run_timer_softirq+0xcc/0x194 kernel/time/timer.c:2449

The warning is probably because we are trying to queue work into a
destroyed workqueue, but the softirq context makes it hard to pinpoint
the problematic caller.

Extend the warning diagnostics to print both the function we are trying
to queue as well as the name of the workqueue.

Cc: Tejun Heo <tj@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Link: https://syzkaller.appspot.com/bug?extid=e13e654d315d4da1277c
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
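The extra detail comes from printk's symbolic-pointer format; the
rejection path can warn with something of roughly this shape
(simplified; the exact message text may differ from the patch):

    /* %ps resolves the work function pointer to its symbol name */
    WARN_ONCE(1, "workqueue: cannot queue %ps on wq %s\n",
              work->func, wq->name);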
2025-02-13  sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx  (Chuyi Zhou)
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of
the current task, the following error will occur:

 scx_foo[3795244] triggered exit kind 1024:
 runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of
SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error
during the subsequent scx_kf_allowed_on_arg_tasks() check.

SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b
("sched_ext: Track tasks that are subjects of the in-flight SCX
operation") to ensure the task's rq lock is held when accessing the
task's sched_group. Since ops.tick() is marked as SCX_KF_TERMINAL and
task_tick_scx() is protected by the rq lock, we can use
SCX_CALL_OP_TASK() to avoid the above issue. Similarly, the same
changes should be made for ops.disable() and ops.exit_task(), as they
are also protected by task_rq_lock() and it's safe to access the task's
task_group.

Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13  genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS  (Anup Patel)
CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS is not used anymore, hence remove
it.

Fixes: f94a18249b7f ("genirq: Remove IRQ_MOVE_PCNTXT and related code")
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250209041655.331470-7-apatel@ventanamicro.com
2025-02-12  acct: block access to kernel internal filesystems  (Christian Brauner)
There's no point in allowing anything kernel internal, nor procfs or
sysfs.

Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com
Link: https://lore.kernel.org/r/20250211-work-acct-v1-2-1c16aecab8b3@kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12  acct: perform last write from workqueue  (Christian Brauner)
In [1] it was reported that the acct(2) system call can be used to
trigger a NULL deref in cases where it is set to write to a file that
triggers an internal lookup. This can e.g., happen when pointing
acct(2) to /sys/power/resume. At the point where the write to this file
happens the calling task has already exited and called exit_fs(). A
lookup will thus trigger a NULL-deref when accessing current->fs.

Reorganize the code so that the final write happens from the workqueue
but with the caller's credentials. This preserves the (strange)
permission model and has almost no regression risk.

This API should stop to exist though.

Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
Link: https://lore.kernel.org/r/20250211-work-acct-v1-1-1c16aecab8b3@kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
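Running with the caller's credentials from a work item is done with the
kernel's credential override API; a minimal sketch, where struct
acct_ctx and its members are invented for illustration:

    static void acct_write_work(struct work_struct *work)
    {
            /* acct_ctx and its fields are hypothetical in this sketch */
            struct acct_ctx *ctx = container_of(work, struct acct_ctx, work);
            const struct cred *old;

            old = override_creds(ctx->cred);  /* act as the original caller */
            /* ... perform the final write to ctx->file ... */
            revert_creds(old);                /* restore the worker's creds */
    }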
2025-02-10  sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()  (Tejun Heo)
While fixing migration disabled task handling, 32966821574c ("sched_ext:
Fix migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU.
While this is eventually true for migration disabled tasks that are
switched out, the ->cpus_ptr update is performed by
migrate_disable_switch() which is called right before context_switch()
in __schedule(). However, the task is enqueued earlier during
pick_next_task() via put_prev_task_scx(), so there is a race window
where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and
task_can_run_on_remote_rq() will subsequently trigger
SCHED_WARN(is_migration_disabled()).

 WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
 Sched_ext: layered (enabled+all), task: runnable_at=-10ms
 RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
 ...
 <TASK>
 consume_dispatch_q+0xab/0x220
 scx_bpf_dsq_move_to_local+0x58/0xd0
 bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
 bpf__sched_ext_ops_dispatch+0x4b/0xab
 balance_one+0x1fe/0x3b0
 balance_scx+0x61/0x1d0
 prev_balance+0x46/0xc0
 __pick_next_task+0x73/0x1c0
 __schedule+0x206/0x1730
 schedule+0x3a/0x160
 __do_sys_sched_yield+0xe/0x20
 do_syscall_64+0xbb/0x1e0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path.
Also, perform the migration disabled test before the
task_allowed_on_cpu() test so that BPF schedulers which fail to handle
migration disabled tasks can be noticed easily.

While at it, adjust the scx_ops_error() message for the
!task_allowed_on_cpu() case for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-08  sched_ext: Fix migration disabled handling in targeted dispatches  (Tejun Heo)
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the
task can be migrated to the target CPU using
task_can_run_on_remote_rq(). If the task can't be migrated to the
targeted CPU, it is bounced through a global DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU, but the callers don't uphold the
assumption and may call the function when the task is already on the
target CPU. When such a task has migration disabled,
task_can_run_on_remote_rq() ends up returning %false incorrectly,
unnecessarily bouncing the task to a global DSQ.

Fix it by updating the callers to only call
task_can_run_on_remote_rq() when the task is on a different CPU than
the target CPU. As this is a bit subtle, for clarity and documentation:

 - Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task
   is on the same CPU as the target CPU.

 - The is_migration_disabled() test in task_can_run_on_remote_rq()
   cannot trigger if the task is on a different CPU than the target CPU
   as the preceding task_allowed_on_cpu() test should fail beforehand.
   Convert the test into SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e0973 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
2025-02-08  sched_ext: Implement auto local dispatching of migration disabled tasks  (Tejun Heo)
Migration disabled tasks are special and pinned to their previous CPUs.
They tripped up some unsuspecting BPF schedulers as their
->nr_cpus_allowed may not agree with the bits set in ->cpus_ptr. Make it
easier for BPF schedulers by automatically dispatching them to the
pinned local DSQs by default. If a BPF scheduler wants to handle
migration disabled tasks explicitly, it can set
SCX_OPS_ENQ_MIGRATION_DISABLED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-08  Merge tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux  (Linus Torvalds)
Pull seccomp fix from Kees Cook:

 "This is really a work-around for x86_64 having grown a syscall to
  implement uretprobe, which has caused problems since v6.11. This may
  change in the future, but for now, this fixes the unintended seccomp
  filtering when uretprobe switched away from traps, and does so with
  something that should be easy to backport.

   - Allow uretprobe on x86_64 to avoid behavioral complications (Eyal
     Birger)"

* tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: validate uretprobe syscall passes through seccomp
  seccomp: passthrough uretprobe systemcall without filtering
2025-02-08  Merge tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds)
Pull ftrace fix from Steven Rostedt:

 "Function graph fix of notrace functions.

  When the function graph tracer was restructured to use the global
  section of the meta data in the shadow stack, the bit logic was
  changed. There's a TRACE_GRAPH_NOTRACE_BIT that is the bit number in
  the mask that tells if the function graph tracer is currently in the
  "notrace" mode. The TRACE_GRAPH_NOTRACE is the mask with that bit
  set. But when the code was restructured, the TRACE_GRAPH_NOTRACE_BIT
  was used when it should have been the TRACE_GRAPH_NOTRACE mask. This
  made notrace not work properly"

* tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
2025-02-08  Merge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull timer fixes from Ingo Molnar:

 "Fix a PREEMPT_RT bug in the clocksource verification code that caused
  false positive warnings.

  Also fix a timer migration setup bug when new CPUs are added"

* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix off-by-one root mis-connection
  clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
2025-02-08  Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Ingo Molnar:

 "Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
  SCHED_WARN_ON() on certain workloads. The bug is believed to be
  (accidentally) self-correcting, hence no behavioral side effects are
  expected.

  Also print se.slice in debug output, since this value can now be set
  via the syscall ABI and can be useful to track"

* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Provide slice length for fair tasks
  sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
2025-02-08  Merge tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking fix from Ingo Molnar:

 "Fix a dangling pointer bug in the futex code used by the uring code.

  It isn't causing problems at the moment due to uring ABI limitations
  leaving it essentially unused in current usages, but it is a good
  idea to fix it nevertheless"

* tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Pass in task to futex_queue()
2025-02-08  sched: Clarify wake_up_q()'s write to task->wake_q.next  (Jann Horn)
Clarify that wake_up_q() does an atomic write to task->wake_q.next,
after which a concurrent __wake_q_add() can immediately overwrite
task->wake_q.next again.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250129-sched-wakeup-prettier-v1-1-2f51f5f663fa@google.com
2025-02-08  fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT  (Steven Rostedt)
The code was restructured so that the function graph notrace logic,
which avoids tracing a function and all its children, is done by
setting a NOTRACE flag when the function that is not to be traced is
hit.

There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags,
and a TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.

For example:

 # cd /sys/kernel/tracing
 # echo set_track_prepare stack_trace_save  > set_graph_notrace
 # echo function_graph > current_tracer
 # cat trace
 [..]
 0)               |        __slab_free() {
 0)               |          free_to_partial_list() {
 0)               |            arch_stack_walk() {
 0)               |              __unwind_start() {
 0)   0.501 us    |                get_stack_info();

Where a non filter trace looks like:

 # echo > set_graph_notrace
 # cat trace
 0)               |        free_to_partial_list() {
 0)               |          set_track_prepare() {
 0)               |            stack_trace_save() {
 0)               |              arch_stack_walk() {
 0)               |                __unwind_start() {

Where the filter should look like:

 # cat trace
 0)               |        free_to_partial_list() {
 0)               |          _raw_spin_lock_irqsave() {
 0)   0.350 us    |            preempt_count_add();
 0)   0.351 us    |            do_raw_spin_lock();
 0)   2.440 us    |          }

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes: b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
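A standalone demo of the bit-number versus bit-mask mixup (the _BIT
value below is illustrative, not the kernel's actual bit assignment):
testing flags against the bit number instead of the mask silently
checks the wrong bits.

    #include <stdio.h>

    #define TRACE_GRAPH_NOTRACE_BIT  1                               /* bit number */
    #define TRACE_GRAPH_NOTRACE      (1 << TRACE_GRAPH_NOTRACE_BIT)  /* the mask */

    int main(void)
    {
            unsigned flags = TRACE_GRAPH_NOTRACE;  /* notrace mode is active */

            /* buggy: ANDs with 0x1 (the bit number), missing bit 1 */
            printf("buggy test:   %d\n", !!(flags & TRACE_GRAPH_NOTRACE_BIT));
            /* correct: ANDs with 0x2 (the mask) */
            printf("correct test: %d\n", !!(flags & TRACE_GRAPH_NOTRACE));
            return 0;
    }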
2025-02-07  bpf: Handle allocation failure in acquire_lock_state  (Kumar Kartikeya Dwivedi)
The acquire_lock_state function needs to handle possible NULL values
returned by acquire_reference_state, and return -ENOMEM.

Fixes: 769b0f1c8214 ("bpf: Refactor {acquire,release}_reference_state")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250206105435.2159977-24-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07  bpf: verifier: Disambiguate get_constant_map_key() errors  (Daniel Xu)
Refactor get_constant_map_key() to disambiguate the constant key value
from potential error values. In the case that the key is negative, it
could be confused for an error.

It's not currently an issue, as the verifier seems to track s32 spills
as u32. So even if the program wrongly uses a negative value for an
arraymap key, the verifier just thinks it's an impossibly high value
which gets correctly discarded. Refactor anyways to make things cleaner
and prevent potential future issues.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/dfe144259ae7cfc98aa63e1b388a14869a10632a.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07  bpf: verifier: Do not extract constant map keys for irrelevant maps  (Daniel Xu)
Previously, we were trying to extract constant map keys for all
bpf_map_lookup_elem(), regardless of map type. This is an issue if the
map has a u64 key and the value is very high, as it can be interpreted
as a negative signed value. This in turn is treated as an error value
by check_func_arg() which causes a valid program to be incorrectly
rejected.

Fix by only extracting constant map keys for relevant maps. This fix
works because nullness elision is only allowed for {PERCPU_}ARRAY maps,
and keys for these are within u32 range. See next commit for an example
via selftest.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/aa868b642b026ff87ba6105ea151bc8693b35932.1738689872.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-07  cgroup: Remove steal time from usage_usec  (Muhammad Adeel)
The CPU usage time is the time when user, system or both are using the
CPU. Steal time is the time when the CPU is waiting to be run by the
hypervisor. It should not be added to the CPU usage time, hence remove
it from the usage_usec entry.

Fixes: 936f2a70f2077 ("cgroup: add cpu.stat file to root cgroup")
Acked-by: Axel Busch <axel.busch@ibm.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07  timers/migration: Fix off-by-one root mis-connection  (Frederic Weisbecker)
Before attaching a new root to the old root, the children counter of
the new root is checked to verify that only the upcoming CPU's top
group has been connected to it. However, since the recently added
commit b729cc1ec21a ("timers/migration: Fix another race between
hotplug and idle entry/exit") this check is not valid anymore because
the old root is pre-accounted as a child to the new root. Therefore
after connecting the upcoming CPU's top group to the new root, the
expected children count must be 2 and not 1 anymore.

This omission results in the old root not being connected to the new
root. Then eventually the system may run with more than one top level,
which defeats the purpose of a single idle migrator.

Also the old root is pre-accounted but not connected upon the new root
creation. But it can be connected to the new root later on. Therefore
the old root may be accounted twice to the new root. The propagation of
such overcommit can end up creating a double final top-level root with
a groupmask incorrectly initialized. Although harmless given that the
final top level roots will never have a parent to walk up to, this
oddity opportunistically reported the core issue:

 WARNING: CPU: 8 PID: 0 at kernel/time/timer_migration.c:543 tmigr_requires_handle_remote
 CPU: 8 UID: 0 PID: 0 Comm: swapper/8
 RIP: 0010:tmigr_requires_handle_remote
 Call Trace:
  <IRQ>
  ? tmigr_requires_handle_remote
  ? hrtimer_run_queues
  update_process_times
  tick_periodic
  tick_handle_periodic
  __sysvec_apic_timer_interrupt
  sysvec_apic_timer_interrupt
  </IRQ>

Fix the problem by taking the old root into account in the children
count of the new root so the connection is not omitted.

Also warn when more than one top level group exists to better detect
similar issues in the future.

Fixes: b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit")
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250205160220.39467-1-frederic@kernel.org
2025-02-06seccomp: passthrough uretprobe systemcall without filteringEyal Birger
When attaching uretprobes to processes running inside docker, the
attached process segfaults when encountering the retprobe. The reason
is that now that uretprobe is a system call, the default seccomp
filters in docker block it, as they only allow a specific set of known
syscalls. This is true for other userspace applications which use
seccomp to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which
is not used by userspace application code directly, it is impractical
and there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.

Pass this system call through seccomp without depending on
configuration.

Note: uretprobe is currently only x86_64 and isn't expected to ever be
supported on i386.

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Rafael Buchbinder <rafi@rbk.io>
Closes: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2AqOnHQ@mail.gmail.com/
Link: https://lore.kernel.org/lkml/20250121182939.33d05470@gandalf.local.home/T/#me2676c378eff2d6a33f3054fed4a5f3afa64e65b
Link: https://lore.kernel.org/lkml/20250128145806.1849977-1-eyal.birger@gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20250202162921.335813-2-eyal.birger@gmail.com
[kees: minimized changes for easier backporting, tweaked commit log]
Signed-off-by: Kees Cook <kees@kernel.org>
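A minimal sketch of the passthrough idea (helper name hypothetical;
the actual change lives in kernel/seccomp.c):

        #ifdef CONFIG_X86_64
        /* uretprobe is a kernel implementation detail, never issued by
         * application code directly, so exempt it from filtering. */
        static inline bool seccomp_uretprobe_exempt(int this_syscall)
        {
                return this_syscall == __NR_uretprobe;
        }
        #else
        static inline bool seccomp_uretprobe_exempt(int this_syscall)
        {
                return false;
        }
        #endif

        /* ... early in the seccomp entry path: */
        if (seccomp_uretprobe_exempt(this_syscall))
                return 0;  /* allow, regardless of loaded filters */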
2025-02-06bpf: Fix softlockup in arena_map_free on 64k page kernelAlan Maguire
On an aarch64 kernel with CONFIG_PAGE_SIZE_64KB=y, arena_htab tests
cause a segmentation fault and soft lockup. The same failure is not
observed with 4k pages on aarch64.

It turns out arena_map_free() is calling
apply_to_existing_page_range() with the address returned by
bpf_arena_get_kern_vm_start(). If this address is not page-aligned,
the code ends up calling apply_to_pte_range() with that unaligned
address, causing the soft lockup.

Fix it by rounding up GUARD_SZ to PAGE_SIZE << 1 so that the division
by 2 in bpf_arena_get_kern_vm_start() returns a page-aligned value.

Fixes: 317460317a02 ("bpf: Introduce bpf_arena.")
Reported-by: Colm Harrington <colm.harrington@oracle.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20250205170059.427458-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
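A hedged reconstruction of the alignment fix in kernel/bpf/arena.c
(the exact GUARD_SZ expression and surrounding code may differ):

        /* The guard region was 64KB (1 << 16). With 64k pages,
         * GUARD_SZ / 2 is then 32KB, which is not page-aligned.
         * Rounding GUARD_SZ up to PAGE_SIZE << 1 keeps GUARD_SZ / 2
         * a whole number of pages on every page size. */
        #define GUARD_SZ round_up(1ull << 16, PAGE_SIZE << 1)

        u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
        {
                /* now page-aligned, so apply_to_existing_page_range()
                 * never sees an unaligned start address */
                return arena ?
                       (u64)(long)arena->kern_vm->addr + GUARD_SZ / 2 : 0;
        }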
2025-02-06net: Add rx_skb of kfree_skb to raw_tp_null_args[].Kuniyuki Iwashima
Yan Zhai reported a BPF prog could trigger a null-ptr-deref [0] in
trace_kfree_skb if the prog does not check if rx_sk is NULL.

Commit c53795d48ee8 ("net: add rx_sk to trace_kfree_skb") added rx_sk
to trace_kfree_skb, but rx_sk is optional and could be NULL.

Let's add kfree_skb to raw_tp_null_args[] to let the BPF verifier
validate such a prog and prevent the issue.

Now we fail to load such a prog:

 libbpf: prog 'drop': -- BEGIN PROG LOAD LOG --
 0: R1=ctx() R10=fp0
 ; int BPF_PROG(drop, struct sk_buff *skb, void *location, @ kfree_skb_sk_null.bpf.c:21
 0: (79) r3 = *(u64 *)(r1 +24)
 func 'kfree_skb' arg3 has btf_id 5253 type STRUCT 'sock'
 1: R1=ctx() R3_w=trusted_ptr_or_null_sock(id=1)
 ; bpf_printk("sk: %d, %d\n", sk, sk->__sk_common.skc_family); @ kfree_skb_sk_null.bpf.c:24
 1: (69) r4 = *(u16 *)(r3 +16)
 R3 invalid mem access 'trusted_ptr_or_null_'
 processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
 -- END PROG LOAD LOG --

Note this fix requires commit 838a10bd2ebf ("bpf: Augment raw_tp
arguments with PTR_MAYBE_NULL").

[0]:
 BUG: kernel NULL pointer dereference, address: 0000000000000010
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 PREEMPT SMP
 RIP: 0010:bpf_prog_5e21a6db8fcff1aa_drop+0x10/0x2d
 Call Trace:
  <TASK>
  ? __die+0x1f/0x60
  ? page_fault_oops+0x148/0x420
  ? search_bpf_extables+0x5b/0x70
  ? fixup_exception+0x27/0x2c0
  ? exc_page_fault+0x75/0x170
  ? asm_exc_page_fault+0x22/0x30
  ? bpf_prog_5e21a6db8fcff1aa_drop+0x10/0x2d
  bpf_trace_run4+0x68/0xd0
  ? unix_stream_connect+0x1f4/0x6f0
  sk_skb_reason_drop+0x90/0x120
  unix_stream_connect+0x1f4/0x6f0
  __sys_connect+0x7f/0xb0
  __x64_sys_connect+0x14/0x20
  do_syscall_64+0x47/0xc30
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: c53795d48ee8 ("net: add rx_sk to trace_kfree_skb")
Reported-by: Yan Zhai <yan@cloudflare.com>
Closes: https://lore.kernel.org/netdev/Z50zebTRzI962e6X@debian.debian/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250201030142.62703-1-kuniyu@amazon.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
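For illustration, a minimal tp_btf program (hypothetical, mirroring
the verifier log above) that now loads only with the explicit NULL
check on rx_sk:

        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        SEC("tp_btf/kfree_skb")
        int BPF_PROG(drop, struct sk_buff *skb, void *location,
                     enum skb_drop_reason reason, struct sock *rx_sk)
        {
                /* rx_sk is trusted_ptr_or_null_sock: dereferencing it
                 * without this check is rejected at load time. */
                if (!rx_sk)
                        return 0;

                bpf_printk("family: %d", rx_sk->__sk_common.skc_family);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";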
2025-02-04Merge tag 'kthreads-fixes-2025-02-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthreads fix from Frederic Weisbecker:

 - Properly handle the return value when allocation fails for the
   preferred affinity

* tag 'kthreads-fixes-2025-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: Fix return value on kzalloc() failure in kthread_affine_preferred()
2025-02-04kthread: Fix return value on kzalloc() failure in kthread_affine_preferred()Yu-Chun Lin
kthread_affine_preferred() incorrectly returns 0 instead of -ENOMEM
when kzalloc() fails. Return 'ret' to ensure the correct error code is
propagated.

Fixes: 4d13f4304fa4 ("kthread: Implement preferred affinity")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501301528.t0cZVbnq-lkp@intel.com/
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
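A sketch of the corrected error path (abridged and approximate, not
the literal kernel/kthread.c diff):

        int kthread_affine_preferred(struct task_struct *p,
                                     const struct cpumask *mask)
        {
                struct kthread *kthread = to_kthread(p);
                int ret = 0;

                /* ... validation ... */

                kthread->preferred_affinity =
                        kzalloc(sizeof(struct cpumask), GFP_KERNEL);
                if (!kthread->preferred_affinity) {
                        ret = -ENOMEM;
                        goto out;
                }

                /* ... apply the preferred affinity ... */
        out:
                return ret;  /* previously 0 was returned even on failure */
        }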
2025-02-03Merge tag 'timers-urgent-2025-02-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:

 - Properly cast the input to secs_to_jiffies() to unsigned long, as
   otherwise the result uses the data type of the input variable, which
   causes result range checks to fail if the input data type is signed
   and smaller than unsigned long.

 - Handle late armed hrtimers gracefully on CPU hotplug

   There are legitimate cases where a hrtimer is (re)armed on an
   outgoing CPU after the timers have been migrated away. This triggers
   warnings and caused people to implement horrible workarounds in RCU.
   But those workarounds are incomplete and do not cover e.g. the
   scheduler hrtimers.

   Stop this by force-moving timers which are enqueued on the current
   CPU after timer migration, so that they are queued on a remote
   online CPU. This allows the workarounds to be undone in a separate
   step.

 - Demote a warning level printk() to info level in the clocksource
   watchdog code, as there is no point in emitting a warning level
   message for a purely informational message.

 - Mark a helper function __always_inline and move it into the existing
   #ifdef block to avoid 'unused function' warnings from clang

* tag 'timers-urgent-2025-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jiffies: Cast to unsigned long in secs_to_jiffies() conversion
  clocksource: Use pr_info() for "Checking clocksource synchronization" message
  hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING
  hrtimers: Mark is_migration_base() with __always_inline
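For the secs_to_jiffies() item, a hedged illustration of the
integer-promotion problem (simplified from include/linux/jiffies.h):

        /* Before (simplified): the result takes the type of _secs, so
         * a signed or narrow input can overflow or fail range checks.
         *
         *   #define secs_to_jiffies(_secs)  ((_secs) * HZ)
         */

        /* After: force unsigned long arithmetic so checks on the
         * result behave regardless of the input's signedness/width. */
        #define secs_to_jiffies(_secs)  ((unsigned long)(_secs) * HZ)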