summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-12-10bpf: check changes_pkt_data property for extension programsEduard Zingerman
When processing calls to global sub-programs, verifier decides whether to invalidate all packet pointers in current state depending on the changes_pkt_data property of the global sub-program. Because of this, an extension program replacing a global sub-program must be compatible with changes_pkt_data property of the sub-program being replaced. This commit: - adds changes_pkt_data flag to struct bpf_prog_aux: - this flag is set in check_cfg() for main sub-program; - in jit_subprogs() for other sub-programs; - modifies bpf_check_attach_btf_id() to check changes_pkt_data flag; - moves call to check_attach_btf_id() after the call to check_cfg(), because it needs changes_pkt_data flag to be set: bpf_check: ... ... - check_attach_btf_id resolve_pseudo_ldimm64 resolve_pseudo_ldimm64 --> bpf_prog_is_offloaded bpf_prog_is_offloaded check_cfg check_cfg + check_attach_btf_id ... ... The following fields are set by check_attach_btf_id(): - env->ops - prog->aux->attach_btf_trace - prog->aux->attach_func_name - prog->aux->attach_func_proto - prog->aux->dst_trampoline - prog->aux->mod - prog->aux->saved_dst_attach_type - prog->aux->saved_dst_prog_type - prog->expected_attach_type Neither of these fields are used by resolve_pseudo_ldimm64() or bpf_prog_offload_verifier_prep() (for netronome and netdevsim drivers), so the reordering is safe. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10bpf: track changes_pkt_data property for global functionsEduard Zingerman
When processing calls to certain helpers, verifier invalidates all packet pointers in a current state. For example, consider the following program: __attribute__((__noinline__)) long skb_pull_data(struct __sk_buff *sk, __u32 len) { return bpf_skb_pull_data(sk, len); } SEC("tc") int test_invalidate_checks(struct __sk_buff *sk) { int *p = (void *)(long)sk->data; if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP; skb_pull_data(sk, 0); *p = 42; return TCX_PASS; } After a call to bpf_skb_pull_data() the pointer 'p' can't be used safely. See function filter.c:bpf_helper_changes_pkt_data() for a list of such helpers. At the moment verifier invalidates packet pointers when processing helper function calls, and does not traverse global sub-programs when processing calls to global sub-programs. This means that calls to helpers done from global sub-programs do not invalidate pointers in the caller state. E.g. the program above is unsafe, but is not rejected by verifier. This commit fixes the omission by computing field bpf_subprog_info->changes_pkt_data for each sub-program before main verification pass. changes_pkt_data should be set if: - subprogram calls helper for which bpf_helper_changes_pkt_data returns true; - subprogram calls a global function, for which bpf_subprog_info->changes_pkt_data should be set. The verifier.c:check_cfg() pass is modified to compute this information. The commit relies on depth first instruction traversal done by check_cfg() and absence of recursive function calls: - check_cfg() would eventually visit every call to subprogram S in a state when S is fully explored; - when S is fully explored: - every direct helper call within S is explored (and thus changes_pkt_data is set if needed); - every call to subprogram S1 called by S was visited with S1 fully explored (and thus S inherits changes_pkt_data from S1). The downside of such approach is that dead code elimination is not taken into account: if a helper call inside global function is dead because of current configuration, verifier would conservatively assume that the call occurs for the purpose of the changes_pkt_data computation. Reported-by: Nick Zavaritsky <mejedi@gmail.com> Closes: https://lore.kernel.org/bpf/0498CA22-5779-4767-9C0C-A9515CEA711F@gmail.com/ Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10bpf: refactor bpf_helper_changes_pkt_data to use helper numberEduard Zingerman
Use BPF helper number instead of function pointer in bpf_helper_changes_pkt_data(). This would simplify usage of this function in verifier.c:check_cfg() (in a follow-up patch), where only helper number is easily available and there is no real need to lookup helper proto. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10bpf: add find_containing_subprog() utility functionEduard Zingerman
Add a utility function, looking for a subprogram containing a given instruction index, rewrite find_subprog() to use this function. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241210041100.1898468-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_progJiri Olsa
Syzbot reported [1] crash that happens for following tracing scenario: - create tracepoint perf event with attr.inherit=1, attach it to the process and set bpf program to it - attached process forks -> chid creates inherited event the new child event shares the parent's bpf program and tp_event (hence prog_array) which is global for tracepoint - exit both process and its child -> release both events - first perf_event_detach_bpf_prog call will release tp_event->prog_array and second perf_event_detach_bpf_prog will crash, because tp_event->prog_array is NULL The fix makes sure the perf_event_detach_bpf_prog checks prog_array is valid before it tries to remove the bpf program from it. [1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc95e896a7ef9ff93b0b8ad Fixes: 0ee288e69d03 ("bpf,perf: Fix perf_event_detach_bpf_prog error handling") Reported-by: syzbot+2e0d2840414ce817aaac@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241208142507.1207698-1-jolsa@kernel.org
2024-12-10bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavorsJann Horn
Uprobes always use bpf_prog_run_array_uprobe() under tasks-trace-RCU protection. But it is possible to attach a non-sleepable BPF program to a uprobe, and non-sleepable BPF programs are freed via normal RCU (see __bpf_prog_put_noref()). This leads to UAF of the bpf_prog because a normal RCU grace period does not imply a tasks-trace-RCU grace period. Fix it by explicitly waiting for a tasks-trace-RCU grace period after removing the attachment of a bpf_prog to a perf_event. Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20241210-bpf-fix-actual-uprobe-uaf-v1-1-19439849dd44@google.com
2024-12-10sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_taskJohn Stultz
Commit 8b5e770ed7c0 ("sched/deadline: Optimize pull_dl_task()") added a goto label seems would be better written as a while loop. So replace the goto with a while loop, to make it easier to read. Reported-by: Todd Kjos <tkjos@google.com> Signed-off-by: John Stultz <jstultz@google.com> Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20241206000009.1226085-1-jstultz@google.com
2024-12-10rseq: Validate read-only fields under DEBUG_RSEQ configMathieu Desnoyers
The rseq uapi requires cooperation between users of the rseq fields to ensure that all libraries and applications using rseq within a process do not interfere with each other. This is especially important for fields which are meant to be read-only from user-space, as documented in uapi/linux/rseq.h: - cpu_id_start, - cpu_id, - node_id, - mm_cid. Storing to those fields from a user-space library prevents any sharing of the rseq ABI with other libraries and applications, as other users are not aware that the content of those fields has been altered by a third-party library. This is unfortunately the current behavior of tcmalloc: it purposefully overlaps part of a cached value with the cpu_id_start upper bits to get notified about preemption, because the kernel clears those upper bits before returning to user-space. This behavior does not conform to the rseq uapi header ABI. This prevents tcmalloc from using rseq when rseq is registered by the GNU C library 2.35+. It requires tcmalloc users to disable glibc rseq registration with a glibc tunable, which is a sad state of affairs. Considering that tcmalloc and the GNU C library are the two first upstream projects using rseq, and that they are already incompatible due to use of this hack, adding kernel-level validation of all read-only fields content is necessary to ensure future users of rseq abide by the rseq ABI requirements. Validate that user-space does not corrupt the read-only fields and conform to the rseq uapi header ABI when the kernel is built with CONFIG_DEBUG_RSEQ=y. This is done by storing a copy of the read-only fields in the task_struct, and validating the prior values present in user-space before updating them. If the values do not match, print a warning on the console (printk_ratelimited()). This is a first step to identify misuses of the rseq ABI by printing a warning on the console. After a giving some time to userspace to correct its use of rseq, the plan is to eventually terminate offending processes with SIGSEGV. This change is expected to produce warnings for the upstream tcmalloc implementation, but tcmalloc developers mentioned they were open to adapt their implementation to kernel-level change. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://github.com/google/tcmalloc/issues/144
2024-12-09Merge tag 'sched_urgent_for_v6.13_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Remove wrong enqueueing of a task for a later wakeup when a task blocks on a RT mutex - Do not setup a new deadline entity on a boosted task as that has happened already - Update preempt= kernel command line param - Prevent needless softirqd wakeups in the idle task's context - Detect the case where the idle load balancer CPU becomes busy and avoid unnecessary load balancing invocation - Remove an unnecessary load balancing need_resched() call in nohz_csd_func() - Allow for raising of SCHED_SOFTIRQ softirq type on RT but retain the warning to catch any other cases - Remove a wrong warning when a cpuset update makes the task affinity no longer a subset of the cpuset * tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutex sched/deadline: Fix warning in migrate_enable for boosted tasks sched/core: Update kernel boot parameters for LAZY preempt. sched/core: Prevent wakeup of ksoftirqd during idle load balance sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel sched: fix warning in sched_setaffinity sched/deadline: Fix replenish_dl_new_period dl_server condition
2024-12-09futex: fix user access on powerpcLinus Torvalds
The powerpc user access code is special, and unlike other architectures distinguishes between user access for reading and writing. And commit 43a43faf5376 ("futex: improve user space accesses") messed that up. It went undetected elsewhere, but caused ppc32 to fail early during boot, because the user access had been started with user_read_access_begin(), but then finished off with just a plain "user_access_end()". Note that the address-masking user access helpers don't even have that read-vs-write distinction, so if powerpc ever wants to do address masking tricks, we'll have to do some extra work for it. [ Make sure to also do it for the EFAULT case, as pointed out by Christophe Leroy ] Reported-by: Andreas Schwab <schwab@linux-m68k.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/87bjxl6b0i.fsf@igel.home/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-09uprobes: Guard against kmemdup() failing in dup_return_instance()Andrii Nakryiko
If kmemdup() failed to alloc memory, don't proceed with extra_consumers copy. Fixes: e62f2d492728 ("uprobes: Simplify session consumer tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241206183436.968068-1-andrii@kernel.org
2024-12-09perf/core: Export perf_exclude_event()Namhyung Kim
While at it, rename the same function in s390 cpum_sf PMU. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Acked-by: Thomas Richter <tmricht@linux.ibm.com> Link: https://lore.kernel.org/r/20241203180441.1634709-2-namhyung@kernel.org
2024-12-09uprobes: Reuse return_instances between multiple uretprobes within taskAndrii Nakryiko
Instead of constantly allocating and freeing very short-lived struct return_instance, reuse it as much as possible within current task. For that, store a linked list of reusable return_instances within current->utask. The only complication is that ri_timer() might be still processing such return_instance. And so while the main uretprobe processing logic might be already done with return_instance and would be OK to immediately reuse it for the next uretprobe instance, it's not correct to unconditionally reuse it just like that. Instead we make sure that ri_timer() can't possibly be processing it by using seqcount_t, with ri_timer() being "a writer", while free_ret_instance() being "a reader". If, after we unlink return instance from utask->return_instances list, we know that ri_timer() hasn't gotten to processing utask->return_instances yet, then we can be sure that immediate return_instance reuse is OK, and so we put it onto utask->ri_pool for future (potentially, almost immediate) reuse. This change shows improvements both in single CPU performance (by avoiding relatively expensive kmalloc/free combon) and in terms of multi-CPU scalability, where you can see that per-CPU throughput doesn't decline as steeply with increased number of CPUs (which were previously attributed to kmalloc()/free() through profiling): BASELINE (latest perf/core) =========================== uretprobe-nop ( 1 cpus): 1.898 ± 0.002M/s ( 1.898M/s/cpu) uretprobe-nop ( 2 cpus): 3.574 ± 0.011M/s ( 1.787M/s/cpu) uretprobe-nop ( 3 cpus): 5.279 ± 0.066M/s ( 1.760M/s/cpu) uretprobe-nop ( 4 cpus): 6.824 ± 0.047M/s ( 1.706M/s/cpu) uretprobe-nop ( 5 cpus): 8.339 ± 0.060M/s ( 1.668M/s/cpu) uretprobe-nop ( 6 cpus): 9.812 ± 0.047M/s ( 1.635M/s/cpu) uretprobe-nop ( 7 cpus): 11.030 ± 0.048M/s ( 1.576M/s/cpu) uretprobe-nop ( 8 cpus): 12.453 ± 0.126M/s ( 1.557M/s/cpu) uretprobe-nop (10 cpus): 14.838 ± 0.044M/s ( 1.484M/s/cpu) uretprobe-nop (12 cpus): 17.092 ± 0.115M/s ( 1.424M/s/cpu) uretprobe-nop (14 cpus): 19.576 ± 0.022M/s ( 1.398M/s/cpu) uretprobe-nop (16 cpus): 22.264 ± 0.015M/s ( 1.391M/s/cpu) uretprobe-nop (24 cpus): 33.534 ± 0.078M/s ( 1.397M/s/cpu) uretprobe-nop (32 cpus): 43.262 ± 0.127M/s ( 1.352M/s/cpu) uretprobe-nop (40 cpus): 53.252 ± 0.080M/s ( 1.331M/s/cpu) uretprobe-nop (48 cpus): 55.778 ± 0.045M/s ( 1.162M/s/cpu) uretprobe-nop (56 cpus): 56.850 ± 0.227M/s ( 1.015M/s/cpu) uretprobe-nop (64 cpus): 62.005 ± 0.077M/s ( 0.969M/s/cpu) uretprobe-nop (72 cpus): 66.445 ± 0.236M/s ( 0.923M/s/cpu) uretprobe-nop (80 cpus): 68.353 ± 0.180M/s ( 0.854M/s/cpu) THIS PATCHSET (on top of latest perf/core) ========================================== uretprobe-nop ( 1 cpus): 2.253 ± 0.004M/s ( 2.253M/s/cpu) uretprobe-nop ( 2 cpus): 4.281 ± 0.003M/s ( 2.140M/s/cpu) uretprobe-nop ( 3 cpus): 6.389 ± 0.027M/s ( 2.130M/s/cpu) uretprobe-nop ( 4 cpus): 8.328 ± 0.005M/s ( 2.082M/s/cpu) uretprobe-nop ( 5 cpus): 10.353 ± 0.001M/s ( 2.071M/s/cpu) uretprobe-nop ( 6 cpus): 12.513 ± 0.010M/s ( 2.086M/s/cpu) uretprobe-nop ( 7 cpus): 14.525 ± 0.017M/s ( 2.075M/s/cpu) uretprobe-nop ( 8 cpus): 15.633 ± 0.013M/s ( 1.954M/s/cpu) uretprobe-nop (10 cpus): 19.532 ± 0.011M/s ( 1.953M/s/cpu) uretprobe-nop (12 cpus): 21.405 ± 0.009M/s ( 1.784M/s/cpu) uretprobe-nop (14 cpus): 24.857 ± 0.020M/s ( 1.776M/s/cpu) uretprobe-nop (16 cpus): 26.466 ± 0.018M/s ( 1.654M/s/cpu) uretprobe-nop (24 cpus): 40.513 ± 0.222M/s ( 1.688M/s/cpu) uretprobe-nop (32 cpus): 54.180 ± 0.074M/s ( 1.693M/s/cpu) uretprobe-nop (40 cpus): 66.100 ± 0.082M/s ( 1.652M/s/cpu) uretprobe-nop (48 cpus): 70.544 ± 0.068M/s ( 1.470M/s/cpu) uretprobe-nop (56 cpus): 74.494 ± 0.055M/s ( 1.330M/s/cpu) uretprobe-nop (64 cpus): 79.317 ± 0.029M/s ( 1.239M/s/cpu) uretprobe-nop (72 cpus): 84.875 ± 0.020M/s ( 1.179M/s/cpu) uretprobe-nop (80 cpus): 92.318 ± 0.224M/s ( 1.154M/s/cpu) For reference, with uprobe-nop we hit the following throughput: uprobe-nop (80 cpus): 143.485 ± 0.035M/s ( 1.794M/s/cpu) So now uretprobe stays a bit closer to that performance. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-5-andrii@kernel.org
2024-12-09uprobes: Ensure return_instance is detached from the list before freeingAndrii Nakryiko
Ensure that by the time we call free_ret_instance() to clean up an instance of struct return_instance it isn't reachable from utask->return_instances anymore. free_ret_instance() is called in a few different situations, all but one of which already are fine w.r.t. return_instance visibility: - uprobe_free_utask() guarantees that ri_timer() won't be called (through timer_delete_sync() call), and so there is no need to unlink anything, because entire utask is being freed; - uprobe_handle_trampoline() is already unlinking to-be-freed return_instance with rcu_assign_pointer() before calling free_ret_instance(). Only cleanup_return_instances() violates this property, which so far is not causing problems due to RCU-delayed freeing of return_instance, which we'll change in the next patch. So make sure we unlink return_instance before passing it into free_ret_instance(), as otherwise reuse will be unsafe. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-4-andrii@kernel.org
2024-12-09uprobes: Decouple return_instance list traversal and freeingAndrii Nakryiko
free_ret_instance() has two unrelated responsibilities: actually cleaning up return_instance's resources and freeing memory, and also helping with utask->return_instances list traversal by returning the next alive pointer. There is no reason why these two aspects have to be mixed together, so turn free_ret_instance() into void-returning function and make callers do list traversal on their own. We'll use this simplification in the next patch that will guarantee that to-be-freed return_instance isn't reachable from utask->return_instances list. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-3-andrii@kernel.org
2024-12-09uprobes: Simplify session consumer trackingAndrii Nakryiko
In practice, each return_instance will typically contain either zero or one return_consumer, depending on whether it has any uprobe session consumer attached or not. It's highly unlikely that more than one uprobe session consumers will be attached to any given uprobe, so there is no need to optimize for that case. But the way we currently do memory allocation and accounting is by pre-allocating the space for 4 session consumers in contiguous block of memory next to struct return_instance fixed part. This is unnecessarily wasteful. This patch changes this to keep struct return_instance fixed-sized with one pre-allocated return_consumer, while (in a highly unlikely scenario) allowing for more session consumers in a separate dynamically allocated and reallocated array. We also simplify accounting a bit by not maintaining a separate temporary capacity for consumers array, and, instead, relying on krealloc() to be a no-op if underlying memory can accommodate a slightly bigger allocation (but again, it's very uncommon scenario to even have to do this reallocation). All this gets rid of ri_size(), simplifies push_consumer() and removes confusing ri->consumers_cnt re-assignment, while containing this singular preallocated consumer logic contained within a few simple preexisting helpers. Having fixed-sized struct return_instance simplifies and speeds up return_instance reuse that we ultimately add later in this patch set, see follow up patches. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-2-andrii@kernel.org
2024-12-09sched/fair: Untangle NEXT_BUDDY and pick_next_task()Peter Zijlstra
There are 3 sites using set_next_buddy() and only one is conditional on NEXT_BUDDY, the other two sites are unconditional; to note: - yield_to_task() - cgroup dequeue / pick optimization However, having NEXT_BUDDY control both the wakeup-preemption and the picking side of things means its near useless. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241129101541.GA33464@noisy.programming.kicks-ass.net
2024-12-09sched/fair: Mark m*_vruntime() with __maybe_unusedAndy Shevchenko
When max_vruntime() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: kernel/sched/fair.c:526:19: error: unused function 'max_vruntime' [-Werror,-Wunused-function] 526 | static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime) | ^~~~~~~~~~~~ Fix this by marking them with __maybe_unused (all cases for the sake of symmetry). See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241202173546.634433-1-andriy.shevchenko@linux.intel.com
2024-12-09sched/fair: Fix variable declaration positionVincent Guittot
Move variable declaration at the beginning of the function Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-12-vincent.guittot@linaro.org
2024-12-09sched/fair: Do not try to migrate delayed dequeue taskVincent Guittot
Migrating a delayed dequeued task doesn't help in balancing the number of runnable tasks in the system. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-11-vincent.guittot@linaro.org
2024-12-09sched/fair: Rename cfs_rq.nr_running into nr_queuedVincent Guittot
Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the reality as the value includes both the ready to run tasks and the delayed dequeue tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-10-vincent.guittot@linaro.org
2024-12-09sched/fair: Remove unused cfs_rq.idle_nr_runningVincent Guittot
cfs_rq.idle_nr_running field is not used anywhere so we can remove the useless associated computation. Last user went in commit 5e963f2bd465 ("sched/fair: Commit to EEVDF"). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-9-vincent.guittot@linaro.org
2024-12-09sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idleVincent Guittot
Use same naming convention as others starting with h_nr_* and rename idle_h_nr_running into h_nr_idle. The "running" is not correct anymore as it includes delayed dequeue tasks as well. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-8-vincent.guittot@linaro.org
2024-12-09sched/fair: Removed unsued cfs_rq.h_nr_delayedVincent Guittot
h_nr_delayed is not used anymore. We now have: - h_nr_runnable which tracks tasks ready to run - h_nr_queued which tracks enqueued tasks either ready to run or delayed dequeue Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-7-vincent.guittot@linaro.org
2024-12-09sched/fair: Use the new cfs_rq.h_nr_runnableVincent Guittot
Use the new h_nr_runnable that tracks only queued and runnable tasks in the statistics that are used to balance the system: - PELT runnable_avg - deciding if a group is overloaded or has spare capacity - numa stats - reduced capacity management - load balance - nohz kick It should be noticed that the rq->nr_running still counts the delayed dequeued tasks as delayed dequeue is a fair feature that is meaningless at core level. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-6-vincent.guittot@linaro.org
2024-12-09sched/fair: Add new cfs_rq.h_nr_runnableVincent Guittot
With delayed dequeued feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed. As a result, it stays also visible in the statistics that are used to balance the system and in particular the field cfs.h_nr_queued when the sched_entity is associated to a task. Create a new h_nr_runnable that tracks only queued and runnable tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
2024-12-09sched/fair: Rename h_nr_running into h_nr_queuedVincent Guittot
With delayed dequeued feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed but can't run. Rename h_nr_running into h_nr_queued to reflect this new behavior. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
2024-12-09Merge branch 'sched/urgent'Peter Zijlstra
Sync with urgent bits as a base for further work. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-12-09sched/eevdf: More PELT vs DELAYED_DEQUEUEPeter Zijlstra
Vincent and Dietmar noted that while commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes the entity runnable stats, it does not adjust the cfs_rq runnable stats, which are based off of h_nr_running. Track h_nr_delayed such that we can discount those and adjust the signal. Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") Closes: https://lore.kernel.org/lkml/a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com/ Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/ [ Fixes checkpatch warnings and rebased ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
2024-12-09sched/fair: Fix sched_can_stop_tick() for fair tasksVincent Guittot
We can't stop the tick of a rq if there are at least 2 tasks enqueued in the whole hierarchy and not only at the root cfs rq. rq->cfs.nr_running tracks the number of sched_entity at one level whereas rq->cfs.h_nr_running tracks all queued tasks in the hierarchy. Fixes: 11cc374f4643b ("sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-2-vincent.guittot@linaro.org
2024-12-09sched/fair: Fix NEXT_BUDDYK Prateek Nayak
Adam reports that enabling NEXT_BUDDY insta triggers a WARN in pick_next_entity(). Moving clear_buddies() up before the delayed dequeue bits ensures no ->next buddy becomes delayed. Further ensure no new ->next buddy ever starts as delayed. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Adam Li <adamli@os.amperecomputing.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Adam Li <adamli@os.amperecomputing.com> Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
2024-12-09livepatch: Add stack_order sysfs attributeWardenjohn
Add "stack_order" sysfs attribute which holds the order in which a live patch module was loaded into the system. A user can then determine an active live patched version of a function. cat /sys/kernel/livepatch/livepatch_1/stack_order -> 1 means that livepatch_1 is the first live patch applied cat /sys/kernel/livepatch/livepatch_module/stack_order -> N means that livepatch_module is the Nth live patch applied Suggested-by: Petr Mladek <pmladek@suse.com> Suggested-by: Miroslav Benes <mbenes@suse.cz> Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Wardenjohn <zhangwarden@gmail.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: https://lore.kernel.org/r/20241008014856.3729-2-zhangwarden@gmail.com [pmladek@suse.com: Updated kernel version and date in the ABI documentation.] Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-12-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. Trivial conflict: tools/testing/selftests/bpf/prog_tests/verifier.c Adjacent changes in: Auto-merging kernel/bpf/verifier.c Auto-merging samples/bpf/Makefile Auto-merging tools/testing/selftests/bpf/.gitignore Auto-merging tools/testing/selftests/bpf/Makefile Auto-merging tools/testing/selftests/bpf/prog_tests/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-08Merge tag 'irq_urgent_for_v6.13_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Fix a /proc/interrupts formatting regression - Have the BCM2836 interrupt controller enter power management states properly - Other fixlets * tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing genirq/proc: Add missing space separator back irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND irqchip/gic-v3: Fix irq_complete_ack() comment
2024-12-08Merge tag 'timers_urgent_for_v6.13_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Handle the case where clocksources with small counter width can, in conjunction with overly long idle sleeps, falsely trigger the negative motion detection of clocksources * tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Make negative motion detection more robust
2024-12-08Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM. The usual bunch of singletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits) iio: magnetometer: yas530: use signed integer type for clamp limits sched/numa: fix memory leak due to the overwritten vma->numab_state mm/damon: fix order of arguments in damos_before_apply tracepoint lib: stackinit: hide never-taken branch from compiler mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio() scatterlist: fix incorrect func name in kernel-doc mm: correct typo in MMAP_STATE() macro mm: respect mmap hint address when aligning for THP mm: memcg: declare do_memsw_account inline mm/codetag: swap tags when migrate pages ocfs2: update seq_file index in ocfs2_dlm_seq_next stackdepot: fix stack_depot_save_flags() in NMI context mm: open-code page_folio() in dump_page() mm: open-code PageTail in folio_flags() and const_folio_flags() mm: fix vrealloc()'s KASAN poisoning logic Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()" selftests/damon: add _damon_sysfs.py to TEST_FILES selftest: hugetlb_dio: fix test naming ocfs2: free inode when ocfs2_get_init_inode() fails nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry() ...
2024-12-08tracing/eprobe: Fix to release eprobe when failed to add dyn_eventMasami Hiramatsu (Google)
Fix eprobe event to unregister event call and release eprobe when it fails to add dynamic event correctly. Link: https://lore.kernel.org/all/173289886698.73724.1959899350183686006.stgit@devnote2/ Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-06Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Daniel Borkmann:: - Fix several issues for BPF LPM trie map which were found by syzbot and during addition of new test cases (Hou Tao) - Fix a missing process_iter_arg register type check in the BPF verifier (Kumar Kartikeya Dwivedi, Tao Lyu) - Fix several correctness gaps in the BPF verifier when interacting with the BPF stack without CAP_PERFMON (Kumar Kartikeya Dwivedi, Eduard Zingerman, Tao Lyu) - Fix OOB BPF map writes when deleting elements for the case of xsk map as well as devmap (Maciej Fijalkowski) - Fix xsk sockets to always clear DMA mapping information when unmapping the pool (Larysa Zaremba) - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge after sent bytes have been finalized (Zijian Zhang) - Fix BPF sockmap with vsocks which was missing a queue check in poll and sockmap cleanup on close (Michal Luczaj) - Fix tools infra to override makefile ARCH variable if defined but empty, which addresses cross-building tools. (Björn Töpel) - Fix two resolve_btfids build warnings on unresolved bpf_lsm symbols (Thomas Weißschuh) - Fix a NULL pointer dereference in bpftool (Amir Mohammadi) - Fix BPF selftests to check for CONFIG_PREEMPTION instead of CONFIG_PREEMPT (Sebastian Andrzej Siewior) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (31 commits) selftests/bpf: Add more test cases for LPM trie selftests/bpf: Move test_lpm_map.c to map_tests bpf: Use raw_spinlock_t for LPM trie bpf: Switch to bpf mem allocator for LPM trie bpf: Fix exact match conditions in trie_get_next_key() bpf: Handle in-place update for full LPM trie correctly bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem bpf: Remove unnecessary check when updating LPM trie selftests/bpf: Add test for narrow spill into 64-bit spilled scalar selftests/bpf: Add test for reading from STACK_INVALID slots selftests/bpf: Introduce __caps_unpriv annotation for tests bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGS bpf: Zero index arg error string for dynptr and iter selftests/bpf: Add tests for iter arg check bpf: Ensure reg is PTR_TO_STACK in process_iter_arg tools: Override makefile ARCH variable if defined, but empty selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap ...
2024-12-06bpf: Use raw_spinlock_t for LPM trieHou Tao
After switching from kmalloc() to the bpf memory allocator, there will be no blocking operation during the update of LPM trie. Therefore, change trie->lock from spinlock_t to raw_spinlock_t to make LPM trie usable in atomic context, even on RT kernels. The max value of prefixlen is 2048. Therefore, update or deletion operations will find the target after at most 2048 comparisons. Constructing a test case which updates an element after 2048 comparisons under a 8 CPU VM, and the average time and the maximal time for such update operation is about 210us and 900us. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Switch to bpf mem allocator for LPM trieHou Tao
Multiple syzbot warnings have been reported. These warnings are mainly about the lock order between trie->lock and kmalloc()'s internal lock. See report [1] as an example: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted ------------------------------------------------------ syz.3.2069/15008 is trying to acquire lock: ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ... but task is already holding lock: ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ... which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&trie->lock){-.-.}-{2:2}: __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3a/0x60 trie_delete_elem+0xb0/0x820 ___bpf_prog_run+0x3e51/0xabd0 __bpf_prog_run32+0xc1/0x100 bpf_dispatcher_nop_func ...... bpf_trace_run2+0x231/0x590 __bpf_trace_contention_end+0xca/0x110 trace_contention_end.constprop.0+0xea/0x170 __pv_queued_spin_lock_slowpath+0x28e/0xcc0 pv_queued_spin_lock_slowpath queued_spin_lock_slowpath queued_spin_lock do_raw_spin_lock+0x210/0x2c0 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x42/0x60 __put_partials+0xc3/0x170 qlink_free qlist_free_all+0x4e/0x140 kasan_quarantine_reduce+0x192/0x1e0 __kasan_slab_alloc+0x69/0x90 kasan_slab_alloc slab_post_alloc_hook slab_alloc_node kmem_cache_alloc_node_noprof+0x153/0x310 __alloc_skb+0x2b1/0x380 ...... -> #0 (&n->list_lock){-.-.}-{2:2}: check_prev_add check_prevs_add validate_chain __lock_acquire+0x2478/0x3b30 lock_acquire lock_acquire+0x1b1/0x560 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3a/0x60 get_partial_node.part.0+0x20/0x350 get_partial_node get_partial ___slab_alloc+0x65b/0x1870 __slab_alloc.constprop.0+0x56/0xb0 __slab_alloc_node slab_alloc_node __do_kmalloc_node __kmalloc_node_noprof+0x35c/0x440 kmalloc_node_noprof bpf_map_kmalloc_node+0x98/0x4a0 lpm_trie_node_alloc trie_update_elem+0x1ef/0xe00 bpf_map_update_value+0x2c1/0x6c0 map_update_elem+0x623/0x910 __sys_bpf+0x90c/0x49a0 ... other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&trie->lock); lock(&n->list_lock); lock(&trie->lock); lock(&n->list_lock); *** DEADLOCK *** [1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7 A bpf program attached to trace_contention_end() triggers after acquiring &n->list_lock. The program invokes trie_delete_elem(), which then acquires trie->lock. However, it is possible that another process is invoking trie_update_elem(). trie_update_elem() will acquire trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke get_partial_node() and try to acquire &n->list_lock (not necessarily the same lock object). Therefore, lockdep warns about the circular locking dependency. Invoking kmalloc() before acquiring trie->lock could fix the warning. However, since BPF programs call be invoked from any context (e.g., through kprobe/tracepoint/fentry), there may still be lock ordering problems for internal locks in kmalloc() or trie->lock itself. To eliminate these potential lock ordering problems with kmalloc()'s internal locks, replacing kmalloc()/kfree()/kfree_rcu() with equivalent BPF memory allocator APIs that can be invoked in any context. The lock ordering problems with trie->lock (e.g., reentrance) will be handled separately. Three aspects of this change require explanation: 1. Intermediate and leaf nodes are allocated from the same allocator. Since the value size of LPM trie is usually small, using a single alocator reduces the memory overhead of the BPF memory allocator. 2. Leaf nodes are allocated before disabling IRQs. This handles cases where leaf_size is large (e.g., > 4KB - 8) and updates require intermediate node allocation. If leaf nodes were allocated in IRQ-disabled region, the free objects in BPF memory allocator would not be refilled timely and the intermediate node allocation may fail. 3. Paired migrate_{disable|enable}() calls for node alloc and free. The BPF memory allocator uses per-CPU struct internally, these paired calls are necessary to guarantee correctness. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Fix exact match conditions in trie_get_next_key()Hou Tao
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify an exact match, However, it is incorrect because when the target key doesn't fully match the found node (e.g., node->prefixlen != matchlen), these two nodes may also have the same prefixlen. It will return expected result when the passed key exist in the trie. However when a recently-deleted key or nonexistent key is passed to trie_get_next_key(), it may skip keys and return incorrect result. Fix it by using node->prefixlen == matchlen to identify exact matches. When the condition is true after the search, it also implies node->prefixlen equals key->prefixlen, otherwise, the search would return NULL instead. Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Handle in-place update for full LPM trie correctlyHou Tao
When a LPM trie is full, in-place updates of existing elements incorrectly return -ENOSPC. Fix this by deferring the check of trie->n_entries. For new insertions, n_entries must not exceed max_entries. However, in-place updates are allowed even when the trie is full. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trieHou Tao
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST flags. These flags can be specified by users and are relevant since LPM trie supports exact matches during update. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elemHou Tao
There is no need to call kfree(im_node) when updating element fails, because im_node must be NULL. Remove the unnecessary kfree() for im_node. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Remove unnecessary check when updating LPM trieHou Tao
When "node->prefixlen == matchlen" is true, it means that the node is fully matched. If "node->prefixlen == key->prefixlen" is false, it means the prefix length of key is greater than the prefix length of node, otherwise, matchlen will not be equal with node->prefixlen. However, it also implies that the prefix length of node must be less than max_prefixlen. Therefore, "node->prefixlen == trie->max_prefixlen" will always be false when the check of "node->prefixlen == key->prefixlen" returns false. Remove this unnecessary comparison. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-05sched/numa: fix memory leak due to the overwritten vma->numab_stateAdrian Huang
[Problem Description] When running the hackbench program of LTP, the following memory leak is reported by kmemleak. # /opt/ltp/testcases/bin/hackbench 20 thread 1000 Running with 20*40 (== 800) tasks. # dmesg | grep kmemleak ... kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak) kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak) # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888cd8ca2c40 (size 64): comm "hackbench", pid 17142, jiffies 4299780315 hex dump (first 32 bytes): ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00 .tI.....L.I..... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc bff18fd4): [<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0 [<ffffffff8113f715>] task_numa_work+0x725/0xa00 [<ffffffff8110f878>] task_work_run+0x58/0x90 [<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0 [<ffffffff81dd78d5>] do_syscall_64+0x85/0x150 [<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e ... This issue can be consistently reproduced on three different servers: * a 448-core server * a 256-core server * a 192-core server [Root Cause] Since multiple threads are created by the hackbench program (along with the command argument 'thread'), a shared vma might be accessed by two or more cores simultaneously. When two or more cores observe that vma->numab_state is NULL at the same time, vma->numab_state will be overwritten. Although current code ensures that only one thread scans the VMAs in a single 'numa_scan_period', there might be a chance for another thread to enter in the next 'numa_scan_period' while we have not gotten till numab_state allocation [1]. Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000` cannot the reproduce the issue. It is verified with 200+ test runs. [Solution] Use the cmpxchg atomic operation to ensure that only one thread executes the vma->numab_state assignment. [1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/ Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reported-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05bpf, xdp: constify some bpf_prog * function argumentsAlexander Lobakin
In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05audit: fix suffixed '/' filename matchingRicardo Robaina
When the user specifies a directory to delete with the suffix '/', the audit record fails to collect the filename, resulting in the following logs: type=PATH msg=audit(10/30/2024 14:11:17.796:6304) : item=2 name=(null) type=PATH msg=audit(10/30/2024 14:11:17.796:6304) : item=1 name=(null) It happens because the value of the variables dname, and n->name->name in __audit_inode_child() differ only by the suffix '/'. This commit treats this corner case by handling pathname's trailing slashes in audit_compare_dname_path(). Steps to reproduce the issue: # auditctl -w /tmp $ mkdir /tmp/foo $ rm -r /tmp/foo/ # ausearch -i | grep PATH | tail -3 The first version of this patch was based on a GitHub patch/PR by user @hqh2010 [1]. Link: https://github.com/linux-audit/audit-kernel/pull/148 [1] Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: subject tweak, trim old metadata] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-05Merge tag 'audit-pr-20241205' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit build problem workaround from Paul Moore: "A minor audit patch that shuffles some code slightly to workaround a GCC bug affecting a number of people. The GCC folks have been able to reproduce the problem and are discussing solutions (see the bug report link in the commit), but since the workaround is trivial let's do that in the kernel so we can unblock people who are hitting this" * tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: workaround a GCC bug triggered by task comm changes
2024-12-05Merge tag 'trace-v6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix trace histogram sort function cmp_entries_dup() The sort function cmp_entries_dup() returns either 1 or 0, and not -1 if parameter "a" is less than "b" by memcmp(). - Fix archs that call trace_hardirqs_off() without RCU watching Both x86 and arm64 no longer call any tracepoints with RCU not watching. It was assumed that it was safe to get rid of trace_*_rcuidle() version of the tracepoint calls. This was needed to get rid of the SRCU protection and be able to implement features like faultable traceponits and add rust tracepoints. Unfortunately, there were a few architectures that still relied on that logic. There's only one file that has tracepoints that are called without RCU watching. Add macro logic around the tracepoints for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined will check if the code is in the idle path (the only place RCU isn't watching), and enable RCU around calling the tracepoint, but only do it if the tracepoint is enabled. * tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix archs that still call tracepoints without RCU watching tracing: Fix cmp_entries_dup() to respect sort() comparison rules