path: root/kernel
2025-02-27sched/core: Prevent rescheduling when interrupts are disabledThomas Gleixner
David reported a warning observed while loop testing kexec jump:

  Interrupts enabled after irqrouter_resume+0x0/0x50
  WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
   kernel_kexec+0xf6/0x180
   __do_sys_reboot+0x206/0x250
   do_syscall_64+0x95/0x180

The corresponding interrupt flag trace:

  hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
  hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90

That means __up_console_sem() was invoked with interrupts enabled. Further instrumentation revealed that in the interrupt disabled section of kexec jump one of the syscore_suspend() callbacks woke up a task, which set the NEED_RESCHED flag. A later callback in the resume path invoked cond_resched() which in turn led to the invocation of the scheduler:

   __cond_resched+0x21/0x60
   down_timeout+0x18/0x60
   acpi_os_wait_semaphore+0x4c/0x80
   acpi_ut_acquire_mutex+0x3d/0x100
   acpi_ns_get_node+0x27/0x60
   acpi_ns_evaluate+0x1cb/0x2d0
   acpi_rs_set_srs_method_data+0x156/0x190
   acpi_pci_link_set+0x11c/0x290
   irqrouter_resume+0x54/0x60
   syscore_resume+0x6a/0x200
   kernel_kexec+0x145/0x1c0
   __do_sys_reboot+0xeb/0x240
   do_syscall_64+0x95/0x180

This is a long-standing problem, which probably became more visible with the recent printk changes. Something does a task wakeup and the scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and invokes schedule() from a completely bogus context. The scheduler enables interrupts after context switching, which causes the above warning at the end. Quite a few of the code paths in syscore_suspend()/resume() can result in triggering a wakeup with exactly the same consequences. They might not have done so yet, but as they share a lot of code with normal operations it's just a question of time. The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling models. Full preemption is not affected as cond_resched() is disabled and the preemption check preemptible() takes the interrupt disabled flag into account. Cure the problem by adding a corresponding check into cond_resched(). Reported-by: David Woodhouse <dwmw@amazon.co.uk> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
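As a rough sketch of the kind of guard being described (illustrative pseudocode, not the literal patch; the real cond_resched() plumbing uses different helpers):

  #include <linux/irqflags.h>
  #include <linux/sched.h>

  static int cond_resched_sketch(void)
  {
          /*
           * A wakeup performed while interrupts were disabled may have set
           * NEED_RESCHED. Scheduling from here would be bogus and would
           * re-enable interrupts behind the caller's back, so bail out and
           * let a sane context handle the reschedule instead.
           */
          if (irqs_disabled())
                  return 0;

          if (should_resched(0)) {
                  schedule();
                  return 1;
          }
          return 0;
  }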
2025-02-27x86/bpf: Fix BPF percpu accessesBrian Gerst
Due to this recent commit in the x86 tree: 9d7de2aa8b41 ("Use relative percpu offsets") percpu addresses went from positive offsets from the GSBASE to negative kernel virtual addresses. The BPF verifier has an optimization for x86-64 that loads the address of cpu_number into a register, but was only doing a 32-bit load which truncates negative addresses. Change it to a 64-bit load so that the address is properly sign-extended. Fixes: 9d7de2aa8b41 ("Use relative percpu offsets") Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250227195302.1667654-1-brgerst@gmail.com
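The truncation problem is easy to demonstrate in isolation; the following stand-alone C program (with a made-up address value) shows how keeping only the low 32 bits of a negative kernel virtual address corrupts it:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          /* A percpu symbol address that is now a negative kernel VA
           * (value invented for illustration).
           */
          uint64_t addr = 0xffffffffbfff1234ULL;

          /* What a 32-bit load followed by zero-extension yields: */
          uint64_t truncated = (uint32_t)addr;

          /* What the fixed 64-bit load preserves: */
          uint64_t full = addr;

          printf("truncated: %#llx\nfull:      %#llx\n",
                 (unsigned long long)truncated, (unsigned long long)full);
          return 0;
  }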
2025-02-27Change inode_operations.mkdir to return struct dentry *NeilBrown
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
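As a rough sketch of the new contract (a hypothetical filesystem; the helper name is made up and the prototype abbreviated), an ->mkdir implementation now looks like:

  static struct dentry *examplefs_mkdir(struct mnt_idmap *idmap,
                                        struct inode *dir,
                                        struct dentry *dentry,
                                        umode_t mode)
  {
          int err = examplefs_do_mkdir(dir, dentry, mode); /* hypothetical */

          if (err)
                  return ERR_PTR(err);    /* an error occurred */

          return NULL;                    /* created; the passed-in dentry was used */
          /* or: return new_dentry;          a different dentry was spliced in */
  }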
2025-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27bpf: Use try_alloc_pages() to allocate pages for bpf needs.Alexei Starovoitov
Use try_alloc_pages() and free_pages_nolock() for BPF needs when context doesn't allow using normal alloc_pages. This is a prerequisite for further work. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250222024427.30294-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-27cgroup/rstat: Fix forceidle time in cpu.statAbel Wu
The commit b824766504e4 ("cgroup/rstat: add force idle show helper") retrieves forceidle_time outside cgroup_rstat_lock for non-root cgroups which can be potentially inconsistent with other stats. Rather than reverting that commit, fix it in a way that retains the effort of cleaning up the ifdef-messes. Fixes: b824766504e4 ("cgroup/rstat: add force idle show helper") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-27bpf: cpumap: switch to napi_skb_cache_get_bulk()Alexander Lobakin
Now that cpumap uses GRO, which drops unused skb heads to the NAPI cache, use napi_skb_cache_get_bulk() to try to reuse cached entries and lower MM layer pressure. Always disable the BH before checking and running the cpumap-pinned XDP prog and don't re-enable it in between that and allocating an skb bulk, as we can access the NAPI caches only from the BH context. The better GRO aggregates packets, the fewer new skbs will be allocated. If an aggregated skb contains 16 frags, this means 15 skbs were returned to the cache, so the next 15 skbs will be built without allocating anything. The same trafficgen UDP GRO test now shows:

                 GRO off   GRO on
  threaded GRO   2.3       4      Mpps
  thr bulk GRO   2.4       4.7    Mpps
  diff           +4        +17    %

Comparing to the baseline cpumap:

  baseline       2.7       N/A    Mpps
  thr bulk GRO   2.4       4.7    Mpps
  diff           -11       +74    %

Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27bpf: cpumap: reuse skb array instead of a linked list to chain skbsAlexander Lobakin
cpumap still uses linked lists to store a list of skbs to pass to the stack. Now that we don't use listified Rx in favor of napi_gro_receive(), the linked list is now unneeded overhead. Inside the polling loop, we already have an array of skbs. Let's reuse it for skbs passed to cpumap (generic XDP) and keep them there in case of XDP_PASS when a program is installed to the map itself. Don't list regular xdp_frames after converting them to skbs as well; store them in the mentioned array (but *before* generic skbs as the latter have lower priority) and call gro_receive_skb() for each array element after they're done. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27bpf: cpumap: switch to GRO from netif_receive_skb_list()Alexander Lobakin
cpumap has its own BH context based on a kthread. It has a sane batch size of 8 frames per cycle. GRO can be used here on its own. Adjust cpumap calls to the upper stack to use the GRO API instead of netif_receive_skb_list(), which processes skbs in batches but doesn't involve the GRO layer at all. In plenty of tests, GRO performs better than listified receiving even given that it has to calculate full frame checksums on the CPU. As GRO passes the skbs to the upper stack in batches of @gro_normal_batch, i.e. 8 by default, and skb->dev points to the device where the frame comes from, it is enough to disable the GRO netdev feature on it to completely restore the original behaviour: untouched frames will still be bulked and passed to the upper stack by 8, as it was with netif_receive_skb_list(). Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27Merge patch series "prep patches for my mkdir series"Christian Brauner
NeilBrown <neilb@suse.de> says: These two patches are cleanups and dependencies for my mkdir changes and subsequent directory locking changes. * patches from https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de: (2 commits) nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() Link: https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26trace/osnoise: Add trace events for samplesTomas Glozar
Add trace events that fire at osnoise and timerlat sample generation, in addition to the already existing noise and threshold events. This allows processing the samples directly in the kernel, either with ftrace triggers or with BPF. Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com Signed-off-by: Tomas Glozar <tglozar@redhat.com> Tested-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27tracing: fprobe-events: Log error for exceeding the number of entry argsMasami Hiramatsu (Google)
Add an error message when the number of entry arguments exceeds the maximum size of entry data. This is currently checked when registering the fprobe, but in this case no error message is shown in the error_log file. Link: https://lore.kernel.org/all/174055074269.4079315.17809232650360988538.stgit@mhiramat.tok.corp.google.com/ Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and fprobe)") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27tracing: tprobe-events: Reject invalid tracepoint nameMasami Hiramatsu (Google)
Commit 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") allows the user to set a tprobe on a non-existent tracepoint, but it does not check that the tracepoint name is acceptable. So a tprobe can end up with invalid characters in its event name (e.g. with a subsystem prefix). In this case, the event is not shown in the events directory. Reject such invalid tracepoint names. The tracepoint name must consist only of alphabetic characters, digits, or '_'. Link: https://lore.kernel.org/all/174055073461.4079315.15875502830565214255.stgit@mhiramat.tok.corp.google.com/ Fixes: 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: stable@vger.kernel.org
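The accepted character set can be expressed as a small stand-alone check (illustrative only; the in-tree helper may differ):

  #include <ctype.h>
  #include <stdbool.h>

  static bool is_valid_tracepoint_name(const char *name)
  {
          if (!name || !*name)
                  return false;

          for (; *name; name++) {
                  /* Only alphanumeric characters and '_' are allowed. */
                  if (!isalnum((unsigned char)*name) && *name != '_')
                          return false;
          }
          return true;
  }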
2025-02-27tracing: tprobe-events: Fix a memory leak when tprobe with $retvalMasami Hiramatsu (Google)
Fix a memory leak when a tprobe is defined with $retval. This combination is not allowed, but parse_symbol_and_return() does not free *symbol, which should not be used if it returns an error. Thus, the *symbol memory is leaked in that error path. Link: https://lore.kernel.org/all/174055072650.4079315.3063014346697447838.stgit@mhiramat.tok.corp.google.com/ Fixes: ce51e6153f77 ("tracing: fprobe-event: Fix to check tracepoint event and return") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: stable@vger.kernel.org
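The general shape of this kind of fix is to release the allocation on the rejected path before returning, roughly like the following simplified, hypothetical sketch (not the actual function body):

  #include <linux/slab.h>
  #include <linux/errno.h>

  static int parse_symbol_and_return_sketch(const char *arg, char **symbol,
                                            bool uses_retval)
  {
          *symbol = kstrdup(arg, GFP_KERNEL);
          if (!*symbol)
                  return -ENOMEM;

          if (uses_retval) {
                  /*
                   * $retval is not allowed for a tprobe: free the string
                   * allocated above so the error path does not leak it.
                   */
                  kfree(*symbol);
                  *symbol = NULL;
                  return -EINVAL;
          }
          return 0;
  }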
2025-02-26Merge tag 'wq-for-6.14-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue update from Tejun Heo: "This contains a patch to improve debug visibility. While it isn't a fix, the change carries virtually no risk and makes it substantially easier to chase down a class of problems" * tag 'wq-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Log additional details when rejecting work
2025-02-26Merge tag 'sched_ext-for-6.14-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "pick_task_scx() has a workaround to avoid stalling when the fair class's balance() says yes but pick_task() says no. The workaround was incorrectly deciding to keep the prev task running if the task is on SCX even when the task is in a sleeping state, which can lead to several confusing failure modes. Fix it by testing whether the prev task is currently queued on SCX instead" * tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
2025-02-26perf: Remove unnecessary parameter of security checkLuo Gengkun
It seems that the attr parameter was never used in security checks since it was first introduced by: commit da97e18458fb ("perf_event: Add support for LSM and SELinux checks") so remove it. Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-02-26bpf: Fix deadlock between rcu_tasks_trace and event_mutex.Alexei Starovoitov
Fix the following deadlock:

  CPU A
    _free_event()
    perf_kprobe_destroy()
    mutex_lock(&event_mutex)
    perf_trace_event_unreg()
    synchronize_rcu_tasks_trace()

There are several paths where _free_event() grabs event_mutex and calls sync_rcu_tasks_trace. Above is one such case.

  CPU B
    bpf_prog_test_run_syscall()
    rcu_read_lock_trace()
    bpf_prog_run_pin_on_cpu()
    bpf_prog_load()
    bpf_tracing_func_proto()
    trace_set_clr_event()
    mutex_lock(&event_mutex)

Delegate trace_set_clr_event() to a workqueue to avoid such lock dependency. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250224221637.4780-1-alexei.starovoitov@gmail.com
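One way to picture the delegation (names are hypothetical; only the pattern matters): the mutex-taking call is queued to a workqueue, so it runs on a kworker that is never inside an rcu_read_lock_trace() section.

  #include <linux/workqueue.h>
  #include <linux/trace_events.h>

  static void set_clr_event_work_fn(struct work_struct *work)
  {
          /*
           * Runs on a kworker in plain process context, outside any
           * rcu_read_lock_trace() section, so taking event_mutex here
           * cannot close the loop with synchronize_rcu_tasks_trace().
           */
          trace_set_clr_event("syscalls", NULL, 1);
  }

  static DECLARE_WORK(set_clr_event_work, set_clr_event_work_fn);

  static void request_set_clr_event(void)
  {
          schedule_work(&set_clr_event_work);
  }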
2025-02-26posix-clock: Remove duplicate compat ioctl() handlerThomas Weißschuh
The normal and compat ioctl handlers are identical, which is fine as compat ioctls are detected and handled dynamically inside the underlying clock implementation. The duplicate definition however is unnecessary. Just reuse the regular ioctl handler also for compat ioctls. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lore.kernel.org/all/20250225-posix-clock-compat-cleanup-v2-1-30de86457a2b@weissschuh.net
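The resulting shape of the file_operations table is then roughly the following sketch (abbreviated, not a verbatim copy of kernel/time/posix-clock.c):

  static const struct file_operations posix_clock_file_operations_sketch = {
          .owner          = THIS_MODULE,
          .unlocked_ioctl = posix_clock_ioctl,
          /*
           * Point compat callers at the very same handler: the underlying
           * clock implementation already copes with compat ioctls.
           */
          .compat_ioctl   = posix_clock_ioctl,
          .open           = posix_clock_open,
          .release        = posix_clock_release,
          .read           = posix_clock_read,
          .poll           = posix_clock_poll,
  };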
2025-02-26genirq: Remove IRQ_EDGE_EOI_HANDLERMichael Ellerman
The powerpc Cell blade support, now removed, was the only user of IRQ_EDGE_EOI_HANDLER, so remove it. Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20241218105523.416573-21-mpe@ellerman.id.au
2025-02-26static_call_inline: Provide trampoline address when updating sitesChristophe Leroy
In preparation of support of inline static calls on powerpc, provide trampoline address when updating sites, so that when the destination function is too far for a direct function call, the call site is patched with a call to the trampoline. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/5efe0cffc38d6f69b1ec13988a99f1acff551abf.1733245362.git.christophe.leroy@csgroup.eu
2025-02-26rseq: Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=yMichael Jeanson
With CONFIG_DEBUG_RSEQ=y, an in-kernel copy of the read-only fields is kept synchronized with the user-space fields. Ensure the updates are done in lockstep in case we error out on a write to user-space. Fixes: 7d5265ffcd8b ("rseq: Validate read-only fields under DEBUG_RSEQ config") Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20250225202500.731245-1-mjeanson@efficios.com
2025-02-26futex: Use a hashmask instead of hashsizeSebastian Andrzej Siewior
The global hash uses futex_hashsize to save the number of hash buckets that were allocated during system boot. On each futex_hash() invocation, one is subtracted from this number to get the mask. This can be optimized by saving the mask directly, avoiding the subtraction on each futex_hash() invocation. Rename futex_hashsize to futex_hashmask and save the mask of the allocated hash map. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/all/20250226091057.bX8vObR4@linutronix.de
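The difference in the hot path can be illustrated with a trivial stand-alone program (bucket count invented for the example):

  #include <stdio.h>

  static unsigned long hashsize = 256;           /* buckets, power of two */
  static unsigned long hashmask = 256 - 1;       /* computed once at boot */

  static unsigned long bucket_via_size(unsigned long hash)
  {
          return hash & (hashsize - 1);          /* subtraction on every lookup */
  }

  static unsigned long bucket_via_mask(unsigned long hash)
  {
          return hash & hashmask;                /* mask is ready to use */
  }

  int main(void)
  {
          unsigned long h = 0x1234fUL;

          printf("%lu %lu\n", bucket_via_size(h), bucket_via_mask(h));
          return 0;
  }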
2025-02-26x86/cfi: Add 'cfi=warn' boot optionPeter Zijlstra
Rebuilding with CONFIG_CFI_PERMISSIVE=y enabled is such a pain, esp. since clang is so slow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250224124159.924496481@infradead.org
2025-02-25selftests/bpf: Test gen_pro/epilogue that generate kfuncsAmery Hung
Test gen_prologue and gen_epilogue that generate kfuncs that have not been seen in the main program. The main bpf program and return value checks are identical to pro_epilogue.c introduced in commit 47e69431b57a ("selftests/bpf: Test gen_prologue and gen_epilogue"). However, now when bpf_testmod_st_ops detects a program name with the prefix "test_kfunc_", it generates slightly different prologue and epilogue: they still add 1000 to args->a in the prologue, add 10000 to args->a and set r0 to 2 * args->a in the epilogue, but involve kfuncs. At a high level, the alternative version of the prologue and epilogue looks like this:

  cgrp = bpf_cgroup_from_id(0);
  if (cgrp)
          bpf_cgroup_release(cgrp);
  else
          /* Perform what original bpf_testmod_st_ops prologue or
           * epilogue does
           */

Since 0 is never a valid cgroup id, the original prologue or epilogue logic will be performed. As a result, the __retval check should expect the exact same return value. Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25bpf: Search and add kfuncs in struct_ops prologue and epilogueAmery Hung
Currently, add_kfunc_call() is only invoked once before the main verification loop. Therefore, the verifier could not find the bpf_kfunc_btf_tab of a new kfunc call which is not seen in user-defined struct_ops operators but is introduced in gen_prologue or gen_epilogue during do_misc_fixup(). Fix this by searching for kfuncs in the patching instruction buffer and adding them to prog->aux->kfunc_tab. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25bpf: abort verification if env->cur_state->loop_entry != NULLEduard Zingerman
In addition to warning, abort verification with -EFAULT. If env->cur_state->loop_entry != NULL, something is irrecoverably buggy. Fixes: bbbc02b7445e ("bpf: copy_verifier_state() should copy 'loop_entry' field") Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250225003838.135319-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
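The shape of such a guard, sketched as a stand-alone helper (hypothetical name, simplified message, not the actual patch):

  static int check_loop_entry_consistency(struct bpf_verifier_env *env)
  {
          /*
           * At this point cur_state->loop_entry must already have been
           * consumed; if it is still set, internal verifier state is
           * corrupted, so warn and fail hard instead of continuing.
           */
          if (env->cur_state->loop_entry) {
                  verbose(env, "verifier bug: loop_entry still set\n");
                  return -EFAULT;
          }
          return 0;
  }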
2025-02-25uprobes: Remove too strict lockdep_assert() condition in hprobe_expire()Andrii Nakryiko
hprobe_expire() is used to atomically switch a pending uretprobe instance (struct return_instance) from being SRCU protected to being refcounted. This can be done from the background timer thread, or synchronously within the current thread when the task is forked. In the former case, return_instance has to be protected through an RCU read lock, and that's what hprobe_expire() used to check with lockdep_assert(rcu_read_lock_held()). But in the latter case (hprobe_expire() called from dup_utask()) there is no RCU lock being held, and it's both unnecessary and inconvenient. Inconvenient due to the intervening memory allocations inside dup_return_instance()'s loop. Unnecessary because dup_utask() is called synchronously in the current thread, and no uretprobe can run at that point, so return_instance can't be freed either. So drop the rcu_read_lock_held() condition, and expand the corresponding comment to explain the necessary lifetime guarantees. The lockdep_assert()-detected issue is a false positive. Fixes: dd1a7567784e ("uprobes: SRCU-protect uretprobe lifetime (with timeout)") Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250225223214.2970740-1-andrii@kernel.org
2025-02-25tracing: Add traceoff_after_boot optionSteven Rostedt
Sometimes tracing is used to debug issues during the boot process. Since the trace buffer has a limited amount of storage, it may be prudent to disable tracing after the boot is finished, otherwise the critical information may be overwritten. With this option, the main tracing buffer will be turned off at the end of the boot process. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Borislav Petkov <bp@alien8.de> Link: https://lore.kernel.org/20250208103017.48a7ec83@batman.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behaviorAndrea Righi
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node() should always return a CPU from the specified node, regardless of its idle state. Also clarify this logic in the function documentation. Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called ↵Tejun Heo
without balance() a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") added a workaround to handle the cases where pick_task_scx() is called without a preceding balance_scx(), which is due to a fair class bug where pick_task_fair() may return NULL after a true return from balance_fair(). The workaround detects when pick_task_scx() is called without preceding balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid stalling. Unfortunately, the workaround code was testing whether @prev was on SCX to decide whether to keep the task running. This is incorrect as the task may be on SCX but no longer runnable. This could lead to a non-runnable task being returned from pick_task_scx(), which causes interesting confusions and failures. e.g. A common failure mode is the task ending up with (!on_rq && on_cpu) state which can cause potential wakers to busy loop, which can easily lead to deadlocks. Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes @prev_on_scx only used in one place. Open code the usage and improve the comment while at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Pat Cody <patcody@meta.com> Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-25ftrace: Check against is_kernel_text() instead of kaslr_offset()Steven Rostedt
As kaslr_offset() is architecture dependent and also may not be defined by all architectures, when zeroing out unused weak functions, do not check against kaslr_offset(), but instead check if the address is within the kernel text sections. If KASLR added a shift to the zeroed out function, it would still not be located in the kernel text. This is a more robust way to test if the text is valid or not. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: "Arnd Bergmann" <arnd@arndb.de> Link: https://lore.kernel.org/20250225182054.471759017@goodmis.org Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table") Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: Mark Brown <broonie@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20250224180805.GA1536711@ax162/ Closes: https://lore.kernel.org/all/5225b07b-a9b2-4558-9d5f-aa60b19f6317@sirena.org.uk/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
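Conceptually the validity test turns into something like the sketch below (helper name hypothetical):

  static bool mcount_addr_is_usable(unsigned long addr)
  {
          /*
           * A zeroed-out weak-function entry may still carry the KASLR
           * shift, but it will never point into the kernel text sections,
           * so checking the text range works on every architecture,
           * with or without kaslr_offset().
           */
          return is_kernel_text(addr);
  }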
2025-02-25ftrace: Test mcount_loc addr before calling ftrace_call_addr()Steven Rostedt
The addresses in the mcount_loc can be zeroed and then moved by KASLR making them invalid addresses. ftrace_call_addr() for ARM 64 expects a valid address to kernel text. If the addr read from the mcount_loc section is invalid, it must not call ftrace_call_addr(). Move the addr check before calling ftrace_call_addr() in ftrace_process_locs(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/20250225182054.290128736@goodmis.org Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table") Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: "Arnd Bergmann" <arnd@arndb.de> Tested-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20250225025631.GA271248@ax162/ Closes: https://lore.kernel.org/all/91523154-072b-437b-bbdc-0b70e9783fd0@app.fastmail.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25perf/core: Fix low freq setting via IOC_PERIODKan Liang
A low attr::freq value cannot be set via IOC_PERIOD on some platforms. The perf_event_check_period() introduced in: 81ec3f3c4c4d ("perf/x86: Add check_period PMU callback") was intended to check the period, rather than the frequency. A low frequency may be mistakenly rejected by limit_period(). Fix it. Fixes: 81ec3f3c4c4d ("perf/x86: Add check_period PMU callback") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250117151913.3043942-2-kan.liang@linux.intel.com Closes: https://lore.kernel.org/lkml/20250115154949.3147-1-ravi.bangoria@amd.com/
2025-02-24padata: switch padata_find_next() to using cpumask_next_wrap()Yury Norov
Calling cpumask_next_wrap_old() with starting CPU == -1 effectively means the request to find next CPU, wrapping around if needed. cpumask_next_wrap() is the proper replacement for that. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2025-02-24cpumask: deprecate cpumask_next_wrap()Yury Norov
The next patch aligns the implementation of cpumask_next_wrap() with find_next_bit_wrap(), and it changes the function signature. To make the transition smooth, this patch deprecates the current implementation by adding an _old suffix. The following patches switch current users to the new implementation one by one. No functional changes were intended. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2025-02-24bpf: Fix kmemleak warning for percpu hashmapYonghong Song
Vlad Poenaru reported the following kmemleak issue:

  unreferenced object 0x606fd7c44ac8 (size 32):
    backtrace (crc 0):
      pcpu_alloc_noprof+0x730/0xeb0
      bpf_map_alloc_percpu+0x69/0xc0
      prealloc_init+0x9d/0x1b0
      htab_map_alloc+0x363/0x510
      map_create+0x215/0x3a0
      __sys_bpf+0x16b/0x3e0
      __x64_sys_bpf+0x18/0x20
      do_syscall_64+0x7b/0x150
      entry_SYSCALL_64_after_hwframe+0x4b/0x53

Further investigation shows the reason is a store of the percpu pointer in htab_elem_set_ptr() that is not 8-byte aligned:

  *(void __percpu **)(l->key + key_size) = pptr;

Note that the whole htab_elem alignment is 8 (for x86_64). If the key_size is 4, that means pptr is stored at a location which is 4-byte aligned but not 8-byte aligned. In mm/kmemleak.c, scan_block() scans the memory with an 8-byte stride, so it won't detect the above pptr, hence the reported memory leak. In htab_map_alloc(), we already have:

  htab->elem_size = sizeof(struct htab_elem) + round_up(htab->map.key_size, 8);
  if (percpu)
          htab->elem_size += sizeof(void *);
  else
          htab->elem_size += round_up(htab->map.value_size, 8);

So storing pptr with 8-byte alignment won't cause any problem and can fix kmemleak too. The issue can be reproduced with a bpf selftest as well:
  1. Enable the CONFIG_DEBUG_KMEMLEAK config.
  2. Add a getchar() before skel destroy in test_hash_map() in prog_tests/for_each.c. The purpose is to keep the map available so the kmemleak can be detected.
  3. Run './test_progs -t for_each/hash_map &' and a kmemleak should be reported.
Reported-by: Vlad Poenaru <thevlad@meta.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250224175514.2207227-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
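The effect of the alignment on kmemleak's scanner can be illustrated with a tiny stand-alone program:

  #include <stdio.h>

  #define round_up(x, a)  (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

  int main(void)
  {
          unsigned long key_size = 4;

          /* Old layout: pptr stored right after the key. */
          unsigned long old_off = key_size;
          /* Fixed layout: pptr stored at the next 8-byte boundary. */
          unsigned long new_off = round_up(key_size, 8);

          /*
           * kmemleak's scan_block() walks memory with an 8-byte stride, so
           * only pointers stored at offsets that are multiples of 8
           * (relative to an 8-byte aligned base) can be found.
           */
          printf("old offset %lu -> %s by an 8-byte scan\n",
                 old_off, old_off % 8 ? "missed" : "found");
          printf("new offset %lu -> %s by an 8-byte scan\n",
                 new_off, new_off % 8 ? "missed" : "found");
          return 0;
  }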
2025-02-24uprobes: Reject the shared zeropage in uprobe_write_opcode()Tong Tiangen
We triggered the following crash in syzkaller tests: BUG: Bad page state in process syz.7.38 pfn:1eff3 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1eff3 flags: 0x3fffff00004004(referenced|reserved|node=0|zone=1|lastcpupid=0x1fffff) raw: 003fffff00004004 ffffe6c6c07bfcc8 ffffe6c6c07bfcc8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000fffffffe 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x32/0x50 bad_page+0x69/0xf0 free_unref_page_prepare+0x401/0x500 free_unref_page+0x6d/0x1b0 uprobe_write_opcode+0x460/0x8e0 install_breakpoint.part.0+0x51/0x80 register_for_each_vma+0x1d9/0x2b0 __uprobe_register+0x245/0x300 bpf_uprobe_multi_link_attach+0x29b/0x4f0 link_create+0x1e2/0x280 __sys_bpf+0x75f/0xac0 __x64_sys_bpf+0x1a/0x30 do_syscall_64+0x56/0x100 entry_SYSCALL_64_after_hwframe+0x78/0xe2 BUG: Bad rss-counter state mm:00000000452453e0 type:MM_FILEPAGES val:-1 The following syzkaller test case can be used to reproduce: r2 = creat(&(0x7f0000000000)='./file0\x00', 0x8) write$nbd(r2, &(0x7f0000000580)=ANY=[], 0x10) r4 = openat(0xffffffffffffff9c, &(0x7f0000000040)='./file0\x00', 0x42, 0x0) mmap$IORING_OFF_SQ_RING(&(0x7f0000ffd000/0x3000)=nil, 0x3000, 0x0, 0x12, r4, 0x0) r5 = userfaultfd(0x80801) ioctl$UFFDIO_API(r5, 0xc018aa3f, &(0x7f0000000040)={0xaa, 0x20}) r6 = userfaultfd(0x80801) ioctl$UFFDIO_API(r6, 0xc018aa3f, &(0x7f0000000140)) ioctl$UFFDIO_REGISTER(r6, 0xc020aa00, &(0x7f0000000100)={{&(0x7f0000ffc000/0x4000)=nil, 0x4000}, 0x2}) ioctl$UFFDIO_ZEROPAGE(r5, 0xc020aa04, &(0x7f0000000000)={{&(0x7f0000ffd000/0x1000)=nil, 0x1000}}) r7 = bpf$PROG_LOAD(0x5, &(0x7f0000000140)={0x2, 0x3, &(0x7f0000000200)=ANY=[@ANYBLOB="1800000000120000000000000000000095"], &(0x7f0000000000)='GPL\x00', 0x7, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, @fallback=0x30, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x10, 0x0, @void, @value}, 0x94) bpf$BPF_LINK_CREATE_XDP(0x1c, &(0x7f0000000040)={r7, 0x0, 0x30, 0x1e, @val=@uprobe_multi={&(0x7f0000000080)='./file0\x00', &(0x7f0000000100)=[0x2], 0x0, 0x0, 0x1}}, 0x40) The cause is that zero pfn is set to the PTE without increasing the RSS count in mfill_atomic_pte_zeropage() and the refcount of zero folio does not increase accordingly. Then, the operation on the same pfn is performed in uprobe_write_opcode()->__replace_page() to unconditional decrease the RSS count and old_folio's refcount. Therefore, two bugs are introduced: 1. The RSS count is incorrect, when process exit, the check_mm() report error "Bad rss-count". 2. The reserved folio (zero folio) is freed when folio->refcount is zero, then free_pages_prepare->free_page_is_bad() report error "Bad page state". There is more, the following warning could also theoretically be triggered: __replace_page() -> ... -> folio_remove_rmap_pte() -> VM_WARN_ON_FOLIO(is_zero_folio(folio), folio) Considering that uprobe hit on the zero folio is a very rare case, just reject zero old folio immediately after get_user_page_vma_remote(). 
[ mingo: Cleaned up the changelog ] Fixes: 7396fa818d62 ("uprobes/core: Make background page replacement logic account for rss_stat counters") Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints") Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250224031149.1598949-1-tongtiangen@huawei.com
2025-02-24seccomp: avoid the lock trip seccomp_filter_release in common caseMateusz Guzik
The vast majority of threads don't have any seccomp filters, while the lock taken here is shared between all threads in a given process and is frequently used. Safety of the check relies on the following:
  - seccomp_filter_release is only legally called for PF_EXITING threads
  - SIGNAL_GROUP_EXIT is only ever set with the sighand lock held
  - PF_EXITING is only ever set with the sighand lock held *or* after SIGNAL_GROUP_EXIT is set *or* the process is single-threaded
  - seccomp_sync_threads holds the sighand lock and skips all threads if SIGNAL_GROUP_EXIT is set, PF_EXITING threads if not
The resulting reduction of contention gives me a 5% boost in a microbenchmark spawning and killing threads within the same process. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250213170911.1140187-1-mjguzik@gmail.com Signed-off-by: Kees Cook <kees@kernel.org>
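In sketch form, the common case becomes (simplified, not the verbatim patch):

  void seccomp_filter_release_sketch(struct task_struct *tsk)
  {
          /*
           * The common case: no filter was ever attached to this thread,
           * so there is nothing to release and no reason to touch the
           * sighand lock shared by the whole process.
           */
          if (!tsk->seccomp.filter)
                  return;

          /* Slow path: take the lock and detach the filter chain. */
  }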
2025-02-24perf/core: Order the PMU list to fix warning about unordered pmu_ctx_listLuo Gengkun
Syzkaller triggers a warning due to prev_epc->pmu != next_epc->pmu in perf_event_swap_task_ctx_data(). vmcore shows that two lists have the same perf_event_pmu_context, but not in the same order. The problem is that the order of pmu_ctx_list for the parent is impacted by the time when an event/PMU is added, while the order for a child is impacted by the event order in the pinned_groups and flexible_groups. So the order of pmu_ctx_list in the parent and child may be different. To fix this problem, insert the perf_event_pmu_context into its proper place after iteration of the pmu_ctx_list. The following testcase can trigger the above warning:

  # perf record -e cycles --call-graph lbr -- taskset -c 3 ./a.out &
  # perf stat -e cpu-clock,cs -p xxx   // xxx is the pid of a.out

  test.c

  void main()
  {
          int count = 0;
          pid_t pid;

          printf("%d running\n", getpid());
          sleep(30);
          printf("running\n");

          pid = fork();
          if (pid == -1) {
                  printf("fork error\n");
                  return;
          }
          if (pid == 0) {
                  while (1) {
                          count++;
                  }
          } else {
                  while (1) {
                          count++;
                  }
          }
  }

The testcase first opens an LBR event, so it will allocate task_ctx_data, and then opens tracepoint and software events, so the parent context will have 3 different perf_event_pmu_contexts. On inheritance, the child ctx will insert the perf_event_pmu_context in another order and the warning will trigger. [ mingo: Tidied up the changelog. ] Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250122073356.1824736-1-luogengkun@huaweicloud.com
2025-02-24perf/core: Add RCU read lock protection to perf_iterate_ctx()Breno Leitao
The perf_iterate_ctx() function performs RCU list traversal but currently lacks RCU read lock protection. This causes lockdep warnings when running perf probe with unshare(1) under CONFIG_PROVE_RCU_LIST=y:

  WARNING: suspicious RCU usage
  kernel/events/core.c:8168 RCU-list traversed in non-reader section!!
  Call Trace:
    lockdep_rcu_suspicious
    ? perf_event_addr_filters_apply
    perf_iterate_ctx
    perf_event_exec
    begin_new_exec
    ? load_elf_phdrs
    load_elf_binary
    ? lock_acquire
    ? find_held_lock
    ? bprm_execve
    bprm_execve
    do_execveat_common.isra.0
    __x64_sys_execve
    do_syscall_64
    entry_SYSCALL_64_after_hwframe

This protection was previously present but was removed in commit bd2756811766 ("perf: Rewrite core context handling"). Add back the necessary rcu_read_lock()/rcu_read_unlock() pair around the perf_iterate_ctx() call in perf_event_exec(). [ mingo: Use scoped_guard() as suggested by Peter ] Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250117-fix_perf_rcu-v1-1-13cb9210fc6a@debian.org
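The essence of the fix, shown with an explicit lock/unlock pair rather than the scoped guard used in the final patch (sketch only):

  static void perf_event_exec_sketch(struct perf_event_context *ctx)
  {
          /*
           * perf_iterate_ctx() walks an RCU-protected list, so the
           * traversal must sit inside an RCU read-side critical section.
           */
          rcu_read_lock();
          perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
          rcu_read_unlock();
  }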
2025-02-24sched_ext: idle: Introduce scx_bpf_nr_node_ids()Andrea Righi
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-23bpf: Refactor check_ctx_access()Amery Hung
Reduce the variable passing madness surrounding check_ctx_access(). Currently, check_mem_access() passes many pointers to local variables to check_ctx_access(). They are used to initialize "struct bpf_insn_access_aux info" in check_ctx_access() and then passed to is_valid_access(). Then, check_ctx_access() takes the data out of info and writes it back through the pointers to pass it back. This can be simplified by moving info up to check_mem_access(). No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20250221175644.1822383-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-22Merge tag 'sched-urgent-2025-02-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq fixes from Ingo Molnar: - Fix overly spread-out RSEQ concurrency ID allocation pattern that regressed certain workloads - Fix RSEQ registration syscall behavior on -EFAULT errors when CONFIG_DEBUG_RSEQ=y (This debug option is disabled on most distributions) * tag 'sched-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: Fix rseq registration with CONFIG_DEBUG_RSEQ sched: Compact RSEQ concurrency IDs with reduced threads and affinity
2025-02-22Merge tag 'perf-urgent-2025-02-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fixes from Ingo Molnar: "Fix x86 Intel Lion Cove CPU event constraints, and fix uprobes debug/error printk output pointer-value verbosity" * tag 'perf-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix event constraints for LNC uprobes: Don't use %pK through printk
2025-02-22Merge tag 'ftrace-v6.14-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt:

 "Function graph accounting fixes:

  - Fix the manager ops hashes

    The function graph registers a "manager ops" and "sub-ops" to ftrace. The manager ops does not have any callback but calls the sub-ops callbacks. The manager ops hash (what is used to tell ftrace what functions to attach to) is built from the sub-ops it manages. There was an error in the way it built the hash. An empty hash means to attach to all functions. When the manager ops had one sub-ops it properly copied its hash. But when the manager ops had more than one sub-ops, it went into a loop to make a set of all functions it needed to add to the hash. If any of the subops hashes was empty, that would mean to attach to all functions. The error was that the first iteration of the loop passed in an empty hash to start with in order to add the other hashes. That starting hash was mistaken as meaning to attach to all functions. This made the manager ops attach to all functions whenever it had two or more sub-ops, even if each sub-op was attached to only a single function.

  - Do not add duplicate entries to the manager ops hash

    If two or more subops hashes trace the same function, an entry for that function will be added to the manager ops for each subops. This causes waste and extra overhead.

 Fprobe accounting fixes:

  - Remove last function from fprobe hash

    Fprobes has an ftrace hash to manage which functions an fprobe is attached to. It also has a counter of how many fprobes are attached. When the last fprobe is removed, it unregisters the fprobe from ftrace but does not remove the functions the last fprobe was attached to from the hash. This leaves the old functions attached. When a new fprobe is added, the fprobe infrastructure attaches not only to the functions of the new fprobe, but also to the functions of the last fprobe.

  - Fix accounting of the fprobe counter

    When an fprobe is added, it updates a counter. If the counter goes from zero to one, it attaches its ops to ftrace. When an fprobe is removed, the counter is decremented. If the counter goes from 1 to zero, it removes the fprobe's ops from ftrace. There was an issue where if two fprobes trace the same function, the addition of each fprobe would increment the counter. But when removing the first of the fprobes, it would notice that another fprobe is still attached to one of its functions, so it does not remove the functions from the ftrace ops. But it also did not decrement the counter, so when the last fprobe is removed, the counter is still one. This leaves the fprobe's callback still registered with ftrace, and it is still called by the functions defined by the fprobe's ops hash. Worse yet, because all the functions from the fprobe ops hash have been removed, that tells ftrace that it wants to trace all functions. Thus, this puts the system in a state where every function is calling the fprobe callback handler (which does nothing as there are no registered fprobes), which causes a good 13% slowdown of the entire system.

 Other updates:

  - Add a selftest to test the above issues to prevent regressions.

  - Fix preempt count accounting in function tracing

    Better recursion protection was added to function tracing which added another layer of preempt disable. As the preempt_count gets traced in the event, it needs to subtract the amount of preempt disabling the tracer does to record what the preempt_count was when the trace was triggered.
- Fix memory leak in output of set_event A variable is passed by the seq_file functions in the location that is set by the return of the next() function. The start() function allocates it and the stop() function frees it. But when the last item is found, the next() returns NULL which leaks the data that was allocated in start(). The m->private is used for something else, so have next() free the data when it returns NULL, as stop() will then just receive NULL in that case" * tag 'ftrace-v6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix memory leak when reading set_event file ftrace: Correct preemption accounting for function tracing. selftests/ftrace: Update fprobe test to check enabled_functions file fprobe: Fix accounting of when to unregister from function graph fprobe: Always unregister fgraph function from ops ftrace: Do not add duplicate entries in subops manager ops ftrace: Fix accounting of adding subops to a manager ops
2025-02-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-02-20 We've added 19 non-merge commits during the last 8 day(s) which contain a total of 35 files changed, 1126 insertions(+), 53 deletions(-). The main changes are: 1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing 2) Add network TX timestamping support to BPF sock_ops, from Jason Xing 3) Add TX metadata Launch Time support, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: igc: Add launch time support to XDP ZC igc: Refactor empty frame insertion for launch time support net: stmmac: Add launch time support to XDP ZC selftests/bpf: Add launch time request to xdp_hw_metadata xsk: Add launch time hardware offload support to XDP Tx metadata selftests/bpf: Add simple bpf tests in the tx path for timestamping feature bpf: Support selective sampling for bpf timestamping bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING bpf: Disable unsafe helpers in TX timestamping callbacks bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping bpf: Add networking timestamping support to bpf_get/setsockopt() selftests/bpf: Add rto max for bpf_setsockopt test bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt ==================== Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21Merge tag 'drm-fixes-2025-02-22' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Dave Airlie: "Weekly drm fixes pull request, lots of small things all over, msm has a bunch of things but all very small, xe, i915, a fix for the cgroup dmem controller. core: - remove MAINTAINERS entry cgroup/dmem: - use correct function for pool descendants panel: - fix signal polarity issue jd9365da-h3 nouveau: - folio handling fix - config fix amdxdna: - fix missing header xe: - Fix error handling in xe_irq_install - Fix devcoredump format i915: - Use spin_lock_irqsave() in interruptible context on guc submission - Fixes on DDI and TRANS programming - Make sure all planes in use by the joiner have their crtc included - Fix 128b/132b modeset issues msm: - More catalog fixes: - to skip watchdog programming through top block if its not present - fix the setting of WB mask to ensure the WB input control is programmed correctly through ping-pong - drop lm_pair for sm6150 as that chipset does not have any 3dmerge block - Fix the mode validation logic for DP/eDP to account for widebus (2ppc) to allow high clock resolutions - Fix to disable dither during encoder disable as otherwise this was causing kms_writeback failure due to resource sharing between WB and DSI paths as DSI uses dither but WB does not - Fixes for virtual planes, namely to drop extraneous return and fix uninitialized variables - Fix to avoid spill-over of DSC encoder block bits when programming the bits-per-component - Fixes in the DSI PHY to protect against concurrent access of PHY_CMN_CLK_CFG regs between clock and display drivers - Core/GPU: - Fix non-blocking fence wait incorrectly rounding up to 1 jiffy timeout - Only print GMU fw version once, instead of each time the GPU resumes" * tag 'drm-fixes-2025-02-22' of https://gitlab.freedesktop.org/drm/kernel: (28 commits) drm/i915/dp: Fix disabling the transcoder function in 128b/132b mode drm/i915/dp: Fix error handling during 128b/132b link training accel/amdxdna: Add missing include linux/slab.h MAINTAINERS: Remove myself drm/nouveau/pmu: Fix gp10b firmware guard cgroup/dmem: Don't open-code css_for_each_descendant_pre drm/xe/guc: Fix size_t print format drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump drm/i915: Make sure all planes in use by the joiner have their crtc included drm/i915/ddi: Fix HDMI port width programming in DDI_BUF_CTL drm/i915/dsi: Use TRANS_DDI_FUNC_CTL's own port width macro drm/xe: Fix error handling in xe_irq_install() drm/i915/gt: Use spin_lock_irqsave() in interruptible context drm/msm/dsi/phy: Do not overwite PHY_CMN_CLK_CFG1 when choosing bitclk source drm/msm/dsi/phy: Protect PHY_CMN_CLK_CFG1 against clock driver drm/msm/dsi/phy: Protect PHY_CMN_CLK_CFG0 updated from driver side drm/msm/dpu: Drop extraneous return in dpu_crtc_reassign_planes() drm/msm/dpu: Don't leak bits_per_component into random DSC_ENC fields drm/msm/dpu: Disable dither in phys encoder cleanup drm/msm/dpu: Fix uninitialized variable ...
2025-02-21sched: Add unlikey branch hints to several system callsColin Ian King
Adding an unlikely() hint on early error return paths improves the run-time performance of several sched related system calls. Benchmarking on an i9-12900 shows the following per system call performance improvements:

                       before     after      improvement
  sched_getattr        182.4ns    170.6ns    ~6.5%
  sched_setattr        284.3ns    267.6ns    ~5.9%
  sched_getparam       161.6ns    148.1ns    ~8.4%
  sched_setparam       1265.4ns   1227.6ns   ~3.0%
  sched_getscheduler   129.4ns    118.2ns    ~8.7%
  sched_setscheduler   1237.3ns   1216.7ns   ~1.7%

Results are based on running 20 tests with turbo disabled (to reduce clock freq turbo changes), with 10 second run per test based on the number of system calls per second. The % standard deviation of the measurements for the 20 tests was 0.05% to 0.40%, so the results are reliable. Tested on kernel build with gcc 14.2.1 Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250219142423.45516-1-colin.i.king@gmail.com
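The pattern being applied is the classic cold-error-path annotation; for an early parameter check of the kind these syscalls perform, the change is simply:

  /* before */
  if (!param || pid < 0)
          return -EINVAL;

  /*
   * after: mark the error return as the cold path, so the compiler keeps
   * the common case as straight-line code and moves the error return out
   * of the hot path
   */
  if (unlikely(!param || pid < 0))
          return -EINVAL;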
2025-02-21sched/core: Remove duplicate included header file stats.hThorsten Blum
The header file stats.h is included twice. Remove the redundant include and the following make includecheck warning: stats.h is included more than once Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250219111756.3070-2-thorsten.blum@linux.dev