path: root/kernel
Age | Commit message | Author
2024-12-26 | tracing: Switch trace_events_synth.c code over to use guard() | Steven Rostedt
There are a couple functions in trace_events_synth.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.371082515@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
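The "goto out" to guard() conversion described above can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming GCC or Clang (it relies on `__attribute__((cleanup))`, the compiler feature the kernel's guard() is built on) and a pthread mutex standing in for a kernel mutex. `MUTEX_GUARD`, `mutex_guard_release`, `counter_add` and `counter_lock_is_free` are hypothetical names for illustration, not kernel API.

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* Cleanup handler: runs automatically when the guarded variable leaves scope. */
static void mutex_guard_release(pthread_mutex_t **lock)
{
	pthread_mutex_unlock(*lock);
}

/*
 * Hypothetical userspace stand-in for the kernel's guard(mutex)(&lock).
 * Takes the lock now and releases it when the enclosing scope ends.
 * Only one guard per scope, since the variable name is fixed.
 */
#define MUTEX_GUARD(lock)						\
	pthread_mutex_t *guard_ptr_					\
		__attribute__((cleanup(mutex_guard_release), unused)) =	\
		(pthread_mutex_lock(lock), (lock))

/* Adds delta to the counter; every return path drops the lock. */
static int counter_add(int delta)
{
	MUTEX_GUARD(&counter_lock);

	if (delta < 0)
		return -EINVAL;	/* no "goto out": the lock is released here too */

	counter += delta;
	return counter;
}

/* Returns 1 if the lock is currently free, 0 if it is still held. */
static int counter_lock_is_free(void)
{
	if (pthread_mutex_trylock(&counter_lock) != 0)
		return 0;
	pthread_mutex_unlock(&counter_lock);
	return 1;
}
```

Calling `counter_add(-1)` takes the early error return, and `counter_lock_is_free()` afterwards still reports the lock as free, which is the property the conversions in this series depend on.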
2024-12-26 | tracing: Switch trace_events_filter.c code over to use guard() | Steven Rostedt
There are a couple functions in trace_events_filter.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.200737679@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 | tracing: Switch trace_events_trigger.c code over to use guard() | Steven Rostedt
There are a few functions in trace_events_trigger.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Also use __free() to free a temporary buffer in event_trigger_regex_write(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241220110621.639d3bc8@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
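The __free() usage mentioned above follows the same scope-based idea as guard(), applied to memory instead of locks. A minimal userspace sketch, assuming GCC or Clang and using malloc()/free() in place of the kernel allocators; `AUTO_FREE`, `free_charp` and `trimmed_length` are hypothetical names and do not reflect the actual code in event_trigger_regex_write().

```c
#include <stdlib.h>
#include <string.h>

/* Cleanup handler: frees the buffer when the annotated variable leaves scope. */
static void free_charp(char **p)
{
	free(*p);
}

/* Hypothetical userspace stand-in for the kernel's __free(kfree) annotation. */
#define AUTO_FREE __attribute__((cleanup(free_charp)))

/*
 * Copies the input into a temporary buffer, strips one trailing newline,
 * and returns the resulting length, or -1 on failure. The temporary
 * buffer is freed on every return path without an explicit free().
 */
static long trimmed_length(const char *input)
{
	size_t len = strlen(input);
	char *buf AUTO_FREE = malloc(len + 1);

	if (!buf)
		return -1;
	memcpy(buf, input, len + 1);

	if (len == 0)
		return -1;	/* early return: buf is still freed */

	if (buf[len - 1] == '\n')
		buf[--len] = '\0';
	return (long)len;
}
```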
2024-12-26 | tracing: Switch trace_events_hist.c code over to use guard() | Steven Rostedt
There are a couple functions in trace_events_hist.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.694601480@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 | tracing: Switch trace_events.c code over to use guard() | Steven Rostedt
There are several functions in trace_events.c that have "goto out;" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Some locations did some simple arithmetic after releasing the lock. As this causes no real overhead for holding a mutex while processing the file position (*ppos += cnt;) let the lock be held over this logic too. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.522546095@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 | tracing: Simplify event_enable_func() goto_reg logic | Steven Rostedt
Currently there's an "out_reg:" label that gets jumped to if there's no parameters to process. Instead, make it a proper "if (param) { }" block as there's not much to do for the parameter processing, and remove the "out_reg:" label. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.354746196@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 | tracing: Simplify event_enable_func() goto out_free logic | Steven Rostedt
The event_enable_func() function allocates the data descriptor early in the function just to assign its data->count value via: kstrtoul(number, 0, &data->count); This makes the code more complex as there are several error paths before the data descriptor is actually used. This means there needs to be a goto out_free; to clean it up. Use a local variable "count" to do the update and move the data allocation just before it is used. This removes the "out_free" label as the data can be freed on the failure path of where it is used. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.190820140@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
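The restructuring described above (parse into a local variable first, allocate only when the descriptor is actually needed) might look roughly like this in userspace. This is a sketch under stated assumptions: strtoul() stands in for kstrtoul(), malloc() for the kernel allocator, and `struct enable_data` and `make_enable_data` are hypothetical names, not the real tracing code.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical descriptor; the real structure has more fields. */
struct enable_data {
	unsigned long count;
};

/*
 * Parses number into a local variable first, then allocates the
 * descriptor only after parsing has succeeded. Returns 0 and stores the
 * descriptor in *out, or a negative errno value.
 */
static int make_enable_data(const char *number, struct enable_data **out)
{
	struct enable_data *data;
	unsigned long count;
	char *end;

	errno = 0;
	count = strtoul(number, &end, 0);
	if (errno || end == number || *end != '\0')
		return -EINVAL;	/* nothing allocated yet, nothing to free */

	data = malloc(sizeof(*data));
	if (!data)
		return -ENOMEM;

	data->count = count;
	*out = data;
	return 0;
}
```

Because the parse error is detected before any allocation, the error path needs no cleanup label.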
2024-12-26 | tracing: Have event_enable_write() just return error on error | Steven Rostedt
The event_enable_write() function is inconsistent in how it returns errors. Sometimes it updates the ppos parameter and sometimes it doesn't. Simplify the code to just return an error or the count if there isn't an error. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.025284170@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 | tracing: Return -EINVAL if a boot tracer tries to enable the mmiotracer at boot | Steven Rostedt
The mmiotracer is not set to be enabled at boot up from the kernel command line. If the boot command line tries to enable that tracer, it will fail to be enabled. The return code is currently zero when that happens so the caller just thinks it was enabled. Return -EINVAL in this case. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201344.854254394@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 | tracing: Switch trace.c code over to use guard() | Steven Rostedt
There are several functions in trace.c that have "goto out;" or equivalent on error in order to release locks or free values that were allocated. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex or freeing on error over to using the guard(mutex)() and __free() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. There's one place that should probably return an error but instead returns 0. This does not change the return as the only changes are to do the conversion without changing the logic. Fixing that location will have to come later. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20241224221413.7b8c68c3@batman.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-24 | sched_ext: initialize kit->cursor.flags | Henry Huang
struct bpf_iter_scx_dsq *it may not be initialized. If we didn't call scx_bpf_dsq_move_set_vtime and scx_bpf_dsq_move_set_slice before scx_bpf_dsq_move, it would cause unexpected behaviors: 1. Assign a huge slice into p->scx.slice 2. Assign an invalid vtime into p->scx.dsq_vtime Signed-off-by: Henry Huang <henry.hj@antgroup.com> Fixes: 6462dd53a260 ("sched_ext: Compact struct bpf_iter_scx_dsq_kern") Cc: stable@vger.kernel.org # v6.12 Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-24 | sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology() | Thorsten Blum
Remove hard-coded strings by using the str_enabled_disabled() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
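For readers unfamiliar with the helper, the idea is simply to map a boolean to one of two fixed strings instead of open-coding the ternary at each call site. A minimal userspace sketch follows; the helper mirrors the kernel function of the same name, while `format_status` and the "feature" wording are made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Userspace copy of the idea behind the kernel helper of the same name. */
static inline const char *str_enabled_disabled(bool v)
{
	return v ? "enabled" : "disabled";
}

/* Formats a status message into buf; returns what snprintf() returns. */
static int format_status(char *buf, size_t len, bool on)
{
	return snprintf(buf, len, "feature %s", str_enabled_disabled(on));
}
```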
2024-12-24 | workqueue: add printf attribute to __alloc_workqueue() | Su Hui
Fix a compiler warning with W=1: kernel/workqueue.c: error: function ‘__alloc_workqueue’ might be a candidate for ‘gnu_printf’ format attribute[-Werror=suggest-attribute=format] 5657 | name_len = vsnprintf(wq->name, sizeof(wq->name), fmt, args); | ^~~~~~~~ Fixes: 9b59a85a84dc ("workqueue: Don't call va_start / va_end twice") Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Tejun Heo <tj@kernel.org>
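The warning above asks for a format attribute on a function that forwards a format string to vsnprintf(). A small userspace sketch of how such an annotation is written with GCC or Clang is shown below; `set_name_v` and `set_name` are hypothetical functions, and the argument positions are those of this sketch, not of __alloc_workqueue().

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * format(printf, 3, 0): argument 3 is the format string; 0 means the
 * values arrive as a va_list, so only the format string itself is checked.
 */
__attribute__((format(printf, 3, 0)))
static int set_name_v(char *name, size_t len, const char *fmt, va_list args)
{
	return vsnprintf(name, len, fmt, args);
}

/* format(printf, 3, 4): the variadic values start at argument 4. */
__attribute__((format(printf, 3, 4)))
static int set_name(char *name, size_t len, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = set_name_v(name, len, fmt, args);
	va_end(args);
	return ret;
}
```

With the attribute in place, the compiler can check callers' format strings against their arguments, for example flagging `set_name(buf, len, "%s", 7)`.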
2024-12-24 | kheaders: Simplify attribute through __BIN_ATTR_SIMPLE_RO() | Thomas Weißschuh
The utility macro from the sysfs core is sufficient to implement this attribute. Make use of it. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20241221-sysfs-const-bin_attr-kheaders-v2-1-8205538aa012@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-23 | tracing: Prevent bad count for tracing_cpumask_write | Lizhi Xu
If a large count is provided, it will trigger a warning in bitmap_parse_user. Also reject a count of zero. Cc: stable@vger.kernel.org Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions") Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com Reported-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd Tested-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
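The kind of check described above amounts to validating the user-supplied length before handing it to a parser. A minimal sketch, assuming an arbitrary upper bound; `MAX_WRITE_SIZE` and `checked_write_len` are hypothetical, and the limit used by the actual fix is not stated in this log.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_WRITE_SIZE 4096	/* hypothetical limit for this sketch */

/*
 * Returns the count if it is usable, or -EINVAL for an empty or
 * oversized write.
 */
static long checked_write_len(size_t count)
{
	if (count == 0 || count > MAX_WRITE_SIZE)
		return -EINVAL;
	return (long)count;
}
```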
2024-12-23 | fgraph: Get ftrace recursion lock in function_graph_enter | Masami Hiramatsu (Google)
Get the ftrace recursion lock in the generic function_graph_enter() instead of in each architecture's code. This makes all function_graph tracer callbacks run in a non-preemptive state. On x86 and powerpc, this is already the default, but on the other architectures, this will be new. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173379653720.973433.18438622234884980494.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 | ftrace: Switch ftrace.c code over to use guard() | Steven Rostedt
There are a few functions in ftrace.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.718001540@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 | ftrace: Remove unneeded goto jumps | Steven Rostedt
There are some goto jumps that exit a function just to return a value. The code after the label doesn't free anything nor does it do any unlocks. It simply returns the variable that was set before the jump. Remove these unneeded goto jumps. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.544855549@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
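As a generic illustration of the cleanup described above, the two functions below behave identically; the second one drops the label because nothing happens after it except the return. This is a made-up example (`lookup_before` and `lookup_after` are hypothetical), not code from ftrace.c.

```c
#include <errno.h>
#include <stddef.h>

/* Before: the label only returns, so the jump adds nothing. */
static int lookup_before(const int *table, size_t n, int key)
{
	int ret = -ENOENT;
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i] == key) {
			ret = (int)i;
			goto out;
		}
	}
out:
	return ret;
}

/* After: return directly at the point where the result is known. */
static int lookup_after(const int *table, size_t n, int key)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i] == key)
			return (int)i;
	}
	return -ENOENT;
}
```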
2024-12-23 | ftrace: Do not disable interrupts in profiler | Steven Rostedt
The function profiler disables interrupts before processing. This was there since the profiler was introduced back in 2009 when there were recursion issues to deal with. The function tracer is much more robust today and has its own internal recursion protection. There's no reason to disable interrupts in the function profiler. Instead, just disable preemption and use the guard() infrastructure while at it.

Before this change:

  ~# echo 1 > /sys/kernel/tracing/function_profile_enabled
  ~# perf stat -r 10 ./hackbench 10
  Time: 3.099
  Time: 2.556
  Time: 2.500
  Time: 2.705
  Time: 2.985
  Time: 2.959
  Time: 2.859
  Time: 2.621
  Time: 2.742
  Time: 2.631

  Performance counter stats for '/work/c/hackbench 10' (10 runs):

        23,156.77 msec task-clock       #    6.951 CPUs utilized     ( +- 2.36% )
           18,306      context-switches #  790.525 /sec              ( +- 5.95% )
              495      cpu-migrations   #   21.376 /sec              ( +- 8.61% )
           11,522      page-faults      #  497.565 /sec              ( +- 1.80% )
   47,967,124,606      cycles           #    2.071 GHz               ( +- 0.41% )
   80,009,078,371      instructions     #    1.67  insn per cycle    ( +- 0.34% )
   16,389,249,798      branches         #  707.752 M/sec             ( +- 0.36% )
      139,943,109      branch-misses    #    0.85% of all branches   ( +- 0.61% )

  3.332 +- 0.101 seconds time elapsed ( +- 3.04% )

After this change:

  ~# echo 1 > /sys/kernel/tracing/function_profile_enabled
  ~# perf stat -r 10 ./hackbench 10
  Time: 1.869
  Time: 1.428
  Time: 1.575
  Time: 1.569
  Time: 1.685
  Time: 1.511
  Time: 1.611
  Time: 1.672
  Time: 1.724
  Time: 1.715

  Performance counter stats for '/work/c/hackbench 10' (10 runs):

        13,578.21 msec task-clock       #    6.931 CPUs utilized     ( +- 2.23% )
           12,736      context-switches #  937.973 /sec              ( +- 3.86% )
              341      cpu-migrations   #   25.114 /sec              ( +- 5.27% )
           11,378      page-faults      #  837.960 /sec              ( +- 1.74% )
   27,638,039,036      cycles           #    2.035 GHz               ( +- 0.27% )
   45,107,762,498      instructions     #    1.63  insn per cycle    ( +- 0.23% )
    8,623,868,018      branches         #  635.125 M/sec             ( +- 0.27% )
      125,738,443      branch-misses    #    1.46% of all branches   ( +- 0.32% )

  1.9590 +- 0.0484 seconds time elapsed ( +- 2.47% )

Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.373853944@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 | fgraph: Remove unnecessary disabling of interrupts and recursion | Steven Rostedt
The function graph tracer disables interrupts as well as prevents recursion via NMIs when recording the graph tracer code. There's no reason to do this today. That disabling goes back to 2008 when the function graph tracer was first introduced and recursion protection wasn't part of the code. Today, there's no reason to disable interrupts or prevent the code from recursing as the infrastructure can easily handle it.

Before this change:

  ~# echo function_graph > /sys/kernel/tracing/current_tracer
  ~# perf stat -r 10 ./hackbench 10
  Time: 4.240
  Time: 4.236
  Time: 4.106
  Time: 4.014
  Time: 4.314
  Time: 3.830
  Time: 4.063
  Time: 4.323
  Time: 3.763
  Time: 3.727

  Performance counter stats for '/work/c/hackbench 10' (10 runs):

        33,937.20 msec task-clock       #    7.008 CPUs utilized     ( +- 1.85% )
           18,220      context-switches #  536.874 /sec              ( +- 6.41% )
              624      cpu-migrations   #   18.387 /sec              ( +- 9.07% )
           11,319      page-faults      #  333.528 /sec              ( +- 1.97% )
   76,657,643,617      cycles           #    2.259 GHz               ( +- 0.40% )
  141,403,302,768      instructions     #    1.84  insn per cycle    ( +- 0.37% )
   25,518,463,888      branches         #  751.932 M/sec             ( +- 0.35% )
      156,151,050      branch-misses    #    0.61% of all branches   ( +- 0.63% )

  4.8423 +- 0.0892 seconds time elapsed ( +- 1.84% )

After this change:

  ~# echo function_graph > /sys/kernel/tracing/current_tracer
  ~# perf stat -r 10 ./hackbench 10
  Time: 3.340
  Time: 3.192
  Time: 3.129
  Time: 2.579
  Time: 2.589
  Time: 2.798
  Time: 2.791
  Time: 2.955
  Time: 3.044
  Time: 3.065

  Performance counter stats for './hackbench 10' (10 runs):

        24,416.30 msec task-clock       #    6.996 CPUs utilized     ( +- 2.74% )
           16,764      context-switches #  686.590 /sec              ( +- 5.85% )
              469      cpu-migrations   #   19.208 /sec              ( +- 6.14% )
           11,519      page-faults      #  471.775 /sec              ( +- 1.92% )
   53,895,628,450      cycles           #    2.207 GHz               ( +- 0.52% )
  105,552,664,638      instructions     #    1.96  insn per cycle    ( +- 0.47% )
   17,808,672,667      branches         #  729.376 M/sec             ( +- 0.48% )
      133,075,435      branch-misses    #    0.75% of all branches   ( +- 0.59% )

  3.490 +- 0.112 seconds time elapsed ( +- 3.22% )

Also removed unneeded "unlikely()" around the retaddr code.

Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.204074053@goodmis.org Fixes: 9cd2992f2d6c8 ("fgraph: Have set_graph_notrace only affect function_graph tracer") # Performance only Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 | blktrace: remove redundant return at end of function | Colin Ian King
A recent change added return 0 before an existing return statement at the end of function blk_trace_setup. The final return is now redundant, so remove it. Fixes: 64d124798244 ("blktrace: move copy_[to|from]_user() out of ->debugfs_lock") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20241204150450.399005-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 | blktrace: move copy_[to|from]_user() out of ->debugfs_lock | Ming Lei
Move copy_[to|from]_user() out of ->debugfs_lock and cut the dependency between mm->mmap_lock and q->debugfs_lock, which avoids many lockdep false-positive warnings. Obviously ->debugfs_lock isn't needed for copy_[to|from]_user(). The only behavior change is to call blk_trace_remove() in the setup failure path by re-grabbing ->debugfs_lock, and this is fine since concurrent setup() & remove() are covered. Reported-by: syzbot+91585b36b538053343e4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-block/67450fd4.050a0220.1286eb.0007.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/6742e584.050a0220.1cc393.0038.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/6742a600.050a0220.1cc393.002e.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/67420102.050a0220.1cc393.0019.GAE@google.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241128125029.4152292-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 | blktrace: don't centralize grabbing q->debugfs_mutex in blk_trace_ioctl | Ming Lei
Call each handler directly and have each handler grab q->debugfs_mutex itself, in preparation for killing the dependency between ->debugfs_mutex and ->mmap_lock. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241128125029.4152292-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-24 | tracing/kprobe: Make trace_kprobe's module callback called after jump_label update | Masami Hiramatsu (Google)
Make sure the trace_kprobe's module notifier callback function is called after jump_label's callback is called. Since the trace_kprobe's callback eventually checks the jump_label address while registering a new kprobe on the loading module, jump_label must be updated before this registration happens. Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/ Fixes: 614243181050 ("tracing/kprobes: Support module init function probing") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-23 | Merge back earlier cpufreq material for 6.14 | Rafael J. Wysocki
2024-12-22 | stackleak: Use str_enabled_disabled() helper in stack_erasing_sysctl() | Thorsten Blum
Remove hard-coded strings by using the str_enabled_disabled() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20241222223157.135164-2-thorsten.blum@linux.dev Signed-off-by: Kees Cook <kees@kernel.org>
2024-12-22 | tracing: Add task_prctl_unknown tracepoint | Marco Elver
prctl() is a complex syscall which multiplexes its functionality based on a large set of PR_* options. Currently we count 64 such options. The return value of unknown options is -EINVAL, and doesn't distinguish from known options that were passed invalid args that also return -EINVAL. To understand if programs are attempting to use prctl() options not yet available on the running kernel, provide the task_prctl_unknown tracepoint. Note, this tracepoint is in an unlikely cold path, and would therefore be suitable for continuous monitoring (e.g. via perf_event_open). While the above is likely the simplest usecase, additionally this tracepoint can help unlock some testing scenarios (where probing sys_enter or sys_exit causes undesirable performance overheads): a. unprivileged triggering of a test module: test modules may register a probe to be called back on task_prctl_unknown, and pick a very large unknown prctl() option upon which they perform a test function for an unprivileged user; b. unprivileged triggering of an eBPF program function: similar as idea (a). Example trace_pipe output: test-380 [001] ..... 78.142904: task_prctl_unknown: option=1234 arg2=101 arg3=102 arg4=103 arg5=104 Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20241108113455.2924361-1-elver@google.com Signed-off-by: Kees Cook <kees@kernel.org>
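The ambiguity described above is visible from userspace: a caller only sees -1 with errno set to EINVAL and cannot tell an unknown option from a known option given bad arguments. A minimal Linux-only sketch using the option and argument values from the example trace line; `prctl_fails_einval` is a hypothetical helper, and the result assumes the running kernel does not define option 1234.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Returns 1 if prctl() fails with EINVAL for this option, else 0.
 * EINVAL alone does not say whether the option is unknown or the
 * arguments were rejected, which is the gap the tracepoint fills.
 */
static int prctl_fails_einval(int option)
{
	errno = 0;
	return prctl(option, 101UL, 102UL, 103UL, 104UL) == -1 &&
	       errno == EINVAL;
}
```

With the tracepoint enabled, a call such as `prctl_fails_einval(1234)` would be expected to produce a task_prctl_unknown event like the one quoted in the commit message.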
2024-12-22 | Merge tag 'lockdep-for-tip.20241220' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux into locking/core | Peter Zijlstra
Lockdep changes for v6.14:
- Use swap() macro in the ww_mutex test.
- Minor fixes and documentation for lockdep configs on internal data structure sizes.
- Some "-Wunused-function" warning fixes for Clang.
Rust locking changes for v6.14:
- Add Rust locking files into LOCKING PRIMITIVES maintainer entry.
- Add `Lock<(), ..>::from_raw()` function to support abstraction on low level locking.
- Expose `Guard::new()` for public usage and add type alias for spinlock and mutex guards.
- Add lockdep checking when creating a new lock `Guard`.
2024-12-22 | watch_queue: Use page->private instead of page->index | Matthew Wilcox (Oracle)
We are attempting to eliminate page->index, so use page->private instead. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241125175443.2911738-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-21 | Merge tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds
Pull misc fixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM. The usual bunch of singletons and doubletons - please see the relevant changelogs for details"
* tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
  mm: huge_memory: handle strsep not finding delimiter
  alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG
  alloc_tag: fix module allocation tags populated area calculation
  mm/codetag: clear tags before swap
  mm/vmstat: fix a W=1 clang compiler warning
  mm: convert partially_mapped set/clear operations to be atomic
  nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
  vmalloc: fix accounting with i915
  mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy()
  fork: avoid inappropriate uprobe access to invalid mm
  nilfs2: prevent use of deleted inode
  zram: fix uninitialized ZRAM not releasing backing device
  zram: refuse to use zero sized block device as backing device
  mm: use clear_user_(high)page() for arch with special user folio handling
  mm: introduce cpu_icache_is_aliasing() across all architectures
  mm: add RCU annotation to pte_offset_map(_lock)
  mm: correctly reference merged VMA
  mm: use aligned address in copy_user_gigantic_page()
  mm: use aligned address in clear_gigantic_page()
  mm: shmem: fix ShmemHugePages at swapout
  ...
2024-12-21 | Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | Linus Torvalds
Pull BPF fixes from Daniel Borkmann:
- Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP systems (Andrea Righi)
- Fix BPF USDT selftests helper code to use asm constraint "m" for LoongArch (Tiezhu Yang)
- Fix BPF selftest compilation error in get_uprobe_offset when PROCMAP_QUERY is not defined (Jerome Marchand)
- Fix BPF bpf_skb_change_tail helper when used in context of BPF sockmap to handle negative skb header offsets (Cong Wang)
- Several fixes to BPF sockmap code, among others, in the area of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test bpf_skb_change_tail() in TC ingress
  selftests/bpf: Introduce socket_helpers.h for TC tests
  selftests/bpf: Add a BPF selftest for bpf_skb_change_tail()
  bpf: Check negative offsets in __bpf_skb_min_len()
  tcp_bpf: Fix copied value in tcp_bpf_sendmsg
  skmsg: Return copied bytes in sk_msg_memcopy_from_iter
  tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
  tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
  selftests/bpf: Fix compilation error in get_uprobe_offset()
  selftests/bpf: Use asm constraint "m" for LoongArch
  bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
2024-12-20 | kheaders: Ignore silly-rename files | David Howells
Tell tar to ignore silly-rename files (".__afs*" and ".nfs*") when building the header archive. These occur when a file that is open is unlinked locally, but hasn't yet been closed. Such files are visible to the user via the getdents() syscall and so programs may want to do things with them. During the kernel build, such files may be made during the processing of header files and the cleanup may get deferred by fput() which may result in tar seeing these files when it reads the directory, but they may have disappeared by the time it tries to open them, causing tar to fail with an error. Further, we don't want to include them in the tarball if they still exist. With CONFIG_HEADERS_INSTALL=y, something like the following may be seen: find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory tar: ./include/linux/greybus/.__afs3C95: File removed before we read it The find warning doesn't seem to cause a problem. Fix this by telling tar, when called from gen_kheaders.sh, to exclude such files. This only affects afs and nfs; cifs uses the Windows Hidden attribute to prevent the file from being seen. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20241213135013.2964079-2-dhowells@redhat.com cc: Masahiro Yamada <masahiroy@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-kernel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 | Merge tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds
Pull ring-buffer fixes from Steven Rostedt:
- Fix possible overflow of mmapped ring buffer with bad offset
  If the mmap() to the ring buffer passes in a start address that is past the end of the mmapped file, it is not caught and a slab-out-of-bounds is triggered. Add a check to make sure the start address is within the bounds
- Do not use TP_printk() to boot mapped ring buffers
  As a boot mapped ring buffer's data may have pointers that map to the previous boot's memory map, it is unsafe to allow the TP_printk() to be used to read the boot mapped buffer's events. If a TP_printk() points to a static string from within the kernel it will not match the current kernel mapping if KASLR is active, and it can fault. Have it simply print out the raw fields.
* tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers
  ring-buffer: Fix overflow in __rb_map_vma
2024-12-20 | sched/wake_q: Add helper to call wake_up_q after unlock with preemption disabled | John Stultz
A common pattern seen when wake_qs are used to defer a wakeup until after a lock is released is something like:
  preempt_disable();
  raw_spin_unlock(lock);
  wake_up_q(wake_q);
  preempt_enable();
So create some raw_spin_unlock*_wake() helper functions to clean this up. Applies on top of the fix I submitted here: https://lore.kernel.org/lkml/20241212222138.2400498-1-jstultz@google.com/ NOTE: I recognise that the unlock()/unlock_irq()/unlock_irqrestore() variants create their own duplication, and we could use a macro to generate the similar functions, but I often dislike how those generation macros make finding the actual implementation harder, so I left the three functions as is. If folks would prefer otherwise, let me know and I'll switch it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241217040803.243420-1-jstultz@google.com
2024-12-20 | Merge branch 'locking/urgent' | Peter Zijlstra
Sync with urgent -- avoid conflicts. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-12-20 | docs: Update Schedstat version to 17 | Swapnil Sapkal
Update the Schedstat version to 17 as more fields are added to report different kinds of imbalances in the sched domain. Also, the domain field now prints the corresponding domain name. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-7-swapnil.sapkal@amd.com
2024-12-20 | sched/stats: Print domain name in /proc/schedstat | K Prateek Nayak
Currently, there does not exist a straightforward way to extract the names of the sched domains and match them to the per-cpu domain entry in /proc/schedstat other than looking at the debugfs files which are only visible after enabling "verbose" debug after commit 34320745dfc9 ("sched/debug: Put sched/domains files under the verbose flag"). Since tools like `perf sched stats`[1] require displaying per-domain information in a user-friendly manner, display the names of the sched domains, alongside their level, in /proc/schedstat. Domain names also make the /proc/schedstat data unambiguous when some of the cpus are offline. For example, on a 128-cpu AMD Zen3 machine where CPU0 and CPU64 are SMT siblings and CPU64 is offline:
Before:
  cpu0 ...
  domain0 ...
  domain1 ...
  cpu1 ...
  domain0 ...
  domain1 ...
  domain2 ...
After:
  cpu0 ...
  domain0 MC ...
  domain1 PKG ...
  cpu1 ...
  domain0 SMT ...
  domain1 MC ...
  domain2 PKG ...
[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20241220063224.17767-6-swapnil.sapkal@amd.com
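A tool consuming the new output could pick the level and name off the front of each domain line. The following is a minimal sketch, assuming only the "domain<N> <name> ..." shape shown in the example above; `parse_domain_line` is a hypothetical helper, and the remaining fields on the line are deliberately ignored because their layout is not given here.

```c
#include <stdio.h>

/*
 * Extracts the level and name from a line shaped like "domain0 MC ...".
 * Returns 0 on success, -1 if the line does not match. On the older
 * format without names, the second token would be whatever field
 * follows the level, so callers should check the schedstat version first.
 */
static int parse_domain_line(const char *line, int *level, char name[16])
{
	if (sscanf(line, "domain%d %15s", level, name) != 2)
		return -1;
	return 0;
}
```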
2024-12-20sched: Move sched domain name out of CONFIG_SCHED_DEBUGSwapnil Sapkal
The /proc/schedstat file shows cpu and sched domain level scheduler statistics. It does not show the domain name; it shows the domain level instead. It would be very useful for tools like `perf sched stats`[1] to aggregate domain level stats if domain names were shown in /proc/schedstat. But the sched domain name is guarded by CONFIG_SCHED_DEBUG. As per the discussion[2], move the sched domain name out of CONFIG_SCHED_DEBUG. [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ [2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/ Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-5-swapnil.sapkal@amd.com
2024-12-20sched: Report the different kinds of imbalances in /proc/schedstatSwapnil Sapkal
In /proc/schedstat, lb_imbalance reports the sum of imbalances discovered in sched domains with each call to sched_balance_rq(), which is not very useful because lb_imbalance does not mention whether the imbalance is due to load, utilization, nr_tasks or misfit_tasks. Remove this field from /proc/schedstat. Currently there is no field in /proc/schedstat to report different types of imbalances. Introduce new fields in /proc/schedstat to report the total imbalances in load, utilization, nr_tasks or misfit_tasks. Added fields to /proc/schedstat: - lb_imbalance_load: Total imbalance due to load. - lb_imbalance_util: Total imbalance due to utilization. - lb_imbalance_task: Total imbalance due to number of tasks. - lb_imbalance_misfit: Total imbalance due to misfit tasks. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241220063224.17767-4-swapnil.sapkal@amd.com
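The idea of splitting one summed counter into per-cause counters can be sketched as below. This is an illustrative user-space model only: the struct, enum, and function names prefixed with demo_ are hypothetical, and only the four lb_imbalance_* field names come from the commit message.

```c
/* Hypothetical cause of an imbalance; mirrors the four kinds named above. */
enum demo_imbalance_kind {
	DEMO_IMB_LOAD,
	DEMO_IMB_UTIL,
	DEMO_IMB_TASK,
	DEMO_IMB_MISFIT,
};

/* One counter per kind, replacing a single summed lb_imbalance. */
struct demo_lb_stats {
	unsigned long lb_imbalance_load;
	unsigned long lb_imbalance_util;
	unsigned long lb_imbalance_task;
	unsigned long lb_imbalance_misfit;
};

/* Add an observed imbalance to the counter matching its cause. */
void demo_account_imbalance(struct demo_lb_stats *stats,
			    enum demo_imbalance_kind kind,
			    unsigned long imbalance)
{
	switch (kind) {
	case DEMO_IMB_LOAD:
		stats->lb_imbalance_load += imbalance;
		break;
	case DEMO_IMB_UTIL:
		stats->lb_imbalance_util += imbalance;
		break;
	case DEMO_IMB_TASK:
		stats->lb_imbalance_task += imbalance;
		break;
	case DEMO_IMB_MISFIT:
		stats->lb_imbalance_misfit += imbalance;
		break;
	}
}
```

Keeping the counters separate avoids adding values with different units (load, utilization, task counts) into one number, which is the problem the commit message describes.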
2024-12-20sched/fair: Cleanup in migrate_degrades_locality() to improve readabilityPeter Zijlstra
migrate_degrades_locality() would return {1, 0, -1} respectively to indicate that migration would degrade locality, would improve locality, or would be ambivalent to locality improvements. This patch improves readability by changing the return value to mean: * Any positive value degrades locality * 0 migration doesn't affect locality * Any negative value improves locality [Swapnil: Fixed comments around code and wrote commit log] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-3-swapnil.sapkal@amd.com
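The new sign-based convention can be read by a caller as in the sketch below. The enum and the classify_locality() helper are hypothetical names introduced for illustration; only the three-way meaning of the return value is taken from the commit message.

```c
/* Hypothetical labels for the three outcomes described above. */
enum locality_effect {
	LOCALITY_IMPROVES = -1,
	LOCALITY_NEUTRAL  = 0,
	LOCALITY_DEGRADES = 1,
};

/*
 * Interpret a return value under the new convention:
 *   > 0  migration degrades locality
 *   == 0 migration does not affect locality
 *   < 0  migration improves locality
 */
enum locality_effect classify_locality(long ret)
{
	if (ret > 0)
		return LOCALITY_DEGRADES;
	if (ret < 0)
		return LOCALITY_IMPROVES;
	return LOCALITY_NEUTRAL;
}
```

With this layout the "no effect" case is the zero value, and the direction of the effect is carried by the sign alone.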
2024-12-20sched/fair: Fix value reported by hot tasks pulled in /proc/schedstatPeter Zijlstra
In /proc/schedstat, lb_hot_gained reports the number of hot tasks pulled during load balance. This value is incremented in can_migrate_task() if the task is migratable and hot. After incrementing the value, the load balancer can still decide not to migrate this task, leading to wrong accounting. Fix this by incrementing stats when hot tasks are detached. This issue only exists in detach_tasks(), where we can decide not to migrate a hot task even if it is migratable. However, in detach_one_task(), we migrate it unconditionally. [Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log] Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-2-swapnil.sapkal@amd.com
2024-12-20sched/fair: Update comments after sched_tick() rename.Sebastian Andrzej Siewior
scheduler_tick() was renamed to sched_tick() in 86dd6c04ef9f2 ("sched/balancing: Rename scheduler_tick() => sched_tick()"). Update comments still referring to scheduler_tick. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241219085839.302378-1-bigeasy@linutronix.de
2024-12-19lockdep: Move lockdep_assert_locked() under #ifdef CONFIG_PROVE_LOCKINGAndy Shevchenko
When lockdep_assert_locked() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y, CONFIG_LOCKDEP=y and CONFIG_PROVE_LOCKING=n: kernel/locking/lockdep.c:160:20: error: unused function 'lockdep_assert_locked' [-Werror,-Wunused-function] Fix this by moving it under the respective ifdeffery. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). [Boqun: add more config information of the error] Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20241202193445.769567-1-andriy.shevchenko@linux.intel.com
2024-12-19lockdep: Mark chain_hlock_class_idx() with __maybe_unusedAndy Shevchenko
When chain_hlock_class_idx() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y, CONFIG_LOCKDEP=y and CONFIG_PROVE_LOCKING=n: kernel/locking/lockdep.c:435:28: error: unused function 'chain_hlock_class_idx' [-Werror,-Wunused-function] Fix this by marking it with __maybe_unused. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). [Boqun: add more config information of the error] Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20241209170810.1485183-1-andriy.shevchenko@linux.intel.com
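The two fixes used in the lockdep commits above (an #ifdef around the helper, and __maybe_unused on the helper) can be compared in a small user-space sketch. All demo_ names and the DEMO_PROVE_LOCKING macro are hypothetical stand-ins; the __maybe_unused definition below assumes a GCC- or Clang-compatible compiler and is only defined here because this is not kernel code.

```c
/* In the kernel this comes from the compiler attribute headers. */
#ifndef __maybe_unused
#define __maybe_unused __attribute__((__unused__))
#endif

/*
 * Option 1: compile the helper only when its sole caller is built.
 * With DEMO_PROVE_LOCKING undefined, the function does not exist,
 * so there is nothing for -Wunused-function to flag.
 */
#ifdef DEMO_PROVE_LOCKING
static inline int demo_assert_locked(int held)
{
	return held;
}
#endif

/*
 * Option 2: keep the helper unconditionally, but tell the compiler
 * that it may legitimately end up without callers.
 */
static inline __maybe_unused unsigned int demo_low_bits(unsigned int v)
{
	return v & 0xffu;
}

/* A caller whose use of the option 1 helper depends on the config. */
int demo_check(int held)
{
#ifdef DEMO_PROVE_LOCKING
	return demo_assert_locked(held);
#else
	(void)held;
	return 1;
#endif
}
```

Option 1 keeps the helper out of the build entirely, while option 2 is the lighter touch when the set of configurations that use the helper is harder to express as a single #ifdef.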
2024-12-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/renesas/rswitch.h 32fd46f5b69e ("net: renesas: rswitch: remove speed from gwca structure") 922b4b955a03 ("net: renesas: rswitch: rework ts tags management") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM workerTvrtko Ursulin
After commit 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM") amdgpu started seeing the following warning: [ ] workqueue: WQ_MEM_RECLAIM sdma0:drm_sched_run_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM events:amdgpu_device_delay_enable_gfx_off [amdgpu] ... [ ] Workqueue: sdma0 drm_sched_run_job_work [gpu_sched] ... [ ] Call Trace: [ ] <TASK> ... [ ] ? check_flush_dependency+0xf5/0x110 ... [ ] cancel_delayed_work_sync+0x6e/0x80 [ ] amdgpu_gfx_off_ctrl+0xab/0x140 [amdgpu] [ ] amdgpu_ring_alloc+0x40/0x50 [amdgpu] [ ] amdgpu_ib_schedule+0xf4/0x810 [amdgpu] [ ] ? drm_sched_run_job_work+0x22c/0x430 [gpu_sched] [ ] amdgpu_job_run+0xaa/0x1f0 [amdgpu] [ ] drm_sched_run_job_work+0x257/0x430 [gpu_sched] [ ] process_one_work+0x217/0x720 ... [ ] </TASK> The intent of the verification done in check_flush_dependency() is to ensure forward progress during memory reclaim, by flagging cases when either a memory reclaim process, or a memory reclaim work item is flushed from a context not marked as memory reclaim safe. This is correct when flushing, but when called from the cancel(_delayed)_work_sync() paths it is a false positive because work is either already running, or will not be running at all. Therefore cancelling it is safe and we can relax the warning criteria by letting the helper know of the calling context. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: fca839c00a12 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue") References: 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM") Cc: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v4.5+ Signed-off-by: Tejun Heo <tj@kernel.org>
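"Letting the helper know of the calling context" can be pictured as an extra flag that short-circuits the check. The sketch below is a simplified user-space model, not the workqueue implementation: the demo_ types, the DEMO_WQ_MEM_RECLAIM flag, and the from_cancel parameter name are assumptions made for illustration, and the real check considers more state than is modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the WQ_MEM_RECLAIM workqueue flag. */
#define DEMO_WQ_MEM_RECLAIM (1u << 0)

struct demo_wq {
	unsigned int flags;
};

/*
 * Decide whether waiting on work queued on target_wq, from a worker
 * running on current_wq (NULL if the caller is not a worker), should
 * trigger the warning.
 *
 * from_cancel is true on the cancel paths, where per the commit
 * message the work is either already running or will never run, so
 * the warning would be a false positive.
 */
bool demo_flush_should_warn(const struct demo_wq *current_wq,
			    const struct demo_wq *target_wq,
			    bool from_cancel)
{
	if (from_cancel)
		return false;

	/* Waiting on a reclaim-safe workqueue is always fine. */
	if (target_wq->flags & DEMO_WQ_MEM_RECLAIM)
		return false;

	/* A reclaim-safe worker waiting on a non-reclaim workqueue. */
	return current_wq != NULL &&
	       (current_wq->flags & DEMO_WQ_MEM_RECLAIM) != 0;
}
```

In this model the amdgpu case from the trace corresponds to a reclaim-safe worker cancelling work on a non-reclaim workqueue: with from_cancel set the check stays silent, while a genuine flush from the same context still warns.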
2024-12-18fork: avoid inappropriate uprobe access to invalid mmLorenzo Stoakes
If dup_mmap() encounters an issue, currently uprobe is able to access the relevant mm via the reverse mapping (in build_map_info()), and if we are very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which we establish as part of the fork error path. This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which in turn invokes find_mergeable_anon_vma() that uses a VMA iterator, invoking vma_iter_load() which uses the advanced maple tree API and thus is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()"). This change was made on the assumption that only process tear-down code would actually observe (and make use of) these values. However this very unlikely but still possible edge case with uprobes exists and unfortunately does make these observable. The uprobe operation prevents races against the dup_mmap() operation via the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap() and dropped via uprobe_end_dup_mmap(), and held across register_for_each_vma() prior to invoking build_map_info() which does the reverse mapping lookup. Currently these are acquired and dropped within dup_mmap(), which exposes the race window prior to error handling in the invoking dup_mm() which tears down the mm. We can avoid all this by just moving the invocation of uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm() and only release this lock once the dup_mmap() operation succeeds or clean up is done. This means that the uprobe code can never observe an incompletely constructed mm and resolves the issue in this case. 
Link: https://lkml.kernel.org/r/20241210172412.52995-1-lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: syzbot+2d788f4f7cb660dac4b7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18bpf: bpf_local_storage: Always use bpf_mem_alloc in PREEMPT_RTMartin KaFai Lau
In PREEMPT_RT, kmalloc(GFP_ATOMIC) is still not safe in non-preemptible context. bpf_mem_alloc must be used in PREEMPT_RT. This patch enforces bpf_mem_alloc in bpf_local_storage when CONFIG_PREEMPT_RT is enabled. [ 35.118559] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [ 35.118566] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1832, name: test_progs [ 35.118569] preempt_count: 1, expected: 0 [ 35.118571] RCU nest depth: 1, expected: 1 [ 35.118577] INFO: lockdep is turned off. ... [ 35.118647] __might_resched+0x433/0x5b0 [ 35.118677] rt_spin_lock+0xc3/0x290 [ 35.118700] ___slab_alloc+0x72/0xc40 [ 35.118723] __kmalloc_noprof+0x13f/0x4e0 [ 35.118732] bpf_map_kzalloc+0xe5/0x220 [ 35.118740] bpf_selem_alloc+0x1d2/0x7b0 [ 35.118755] bpf_local_storage_update+0x2fa/0x8b0 [ 35.118784] bpf_sk_storage_get_tracing+0x15a/0x1d0 [ 35.118791] bpf_prog_9a118d86fca78ebb_trace_inet_sock_set_state+0x44/0x66 [ 35.118795] bpf_trace_run3+0x222/0x400 [ 35.118820] __bpf_trace_inet_sock_set_state+0x11/0x20 [ 35.118824] trace_inet_sock_set_state+0x112/0x130 [ 35.118830] inet_sk_state_store+0x41/0x90 [ 35.118836] tcp_set_state+0x3b3/0x640 There is no need to adjust the gfp_flags passed to bpf_mem_cache_alloc_flags(), which only honors GFP_KERNEL. The verifier has ensured GFP_KERNEL is passed only in sleepable context. This has been an issue since bpf_local_storage was first introduced ~5 years ago, so this patch targets bpf-next. bpf_mem_alloc is needed to solve it, so the Fixes tag is set to the commit when bpf_mem_alloc was first used in the bpf_local_storage. Fixes: 08a7ce384e33 ("bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241218193000.2084281-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-18PM: EM: Move sched domains rebuild function from schedutil to EMRafael J. Wysocki
Function sugov_eas_rebuild_sd() defined in the schedutil cpufreq governor implements generic functionality that may be useful in other places. In particular, there is a plan to use it in the intel_pstate driver in the future. For this reason, move it from schedutil to the energy model code and rename it to em_rebuild_sched_domains(). This also helps to get rid of some #ifdeffery in schedutil which is a plus. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
2024-12-18trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffersSteven Rostedt
The TP_printk() of a TRACE_EVENT() is a generic printf format that any developer can create for their event. It may include pointers to strings and such. A boot mapped buffer may contain data from a previous kernel where the string addresses are different. One solution is to copy the event content and update the pointers by the recorded delta, but a simpler solution (for now) is to just use the print_fields() function to print these events. The print_fields() function just iterates the fields and prints them according to what type they are, and ignores the TP_printk() format from the event itself. To understand the difference, when printing via TP_printk() the output looks like this: 4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false 4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false 4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false 4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields) the same event output looks like this: 4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0) 4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0) 4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0) 4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache
Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20241218141507.28389a1d@gandalf.local.home Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>