summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2024-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. In particular to bring the fix in commit aa30eb3260b2 ("bpf: Force checkpoint when jmp history is too long"). The follow up verifier work depends on it. And the fix in commit 6801cf7890f2 ("selftests/bpf: Use -4095 as the bad address for bits iterator"). It's fixing instability of BPF CI on s390 arch. No conflicts. Adjacent changes in: Auto-merging arch/Kconfig Auto-merging kernel/bpf/helpers.c Auto-merging kernel/bpf/memalloc.c Auto-merging kernel/bpf/verifier.c Auto-merging mm/slab_common.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf: Add kernel symbol for struct_ops trampolineXu Kuohai
Without kernel symbols for struct_ops trampoline, the unwinder may produce unexpected stacktraces. For example, the x86 ORC and FP unwinders check if an IP is in kernel text by verifying the presence of the IP's kernel symbol. When a struct_ops trampoline address is encountered, the unwinder stops due to the absence of symbol, resulting in an incomplete stacktrace that consists only of direct and indirect child functions called from the trampoline. The arm64 unwinder is another example. While the arm64 unwinder can proceed across a struct_ops trampoline address, the corresponding symbol name is displayed as "unknown", which is confusing. Thus, add kernel symbol for struct_ops trampoline. The name is bpf__<struct_ops_name>_<member_name>, where <struct_ops_name> is the type name of the struct_ops, and <member_name> is the name of the member that the trampoline is linked to. Below is a comparison of stacktraces captured on x86 by perf record, before and after this patch. Before: ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms]) ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms]) ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms]) After: ffffffff811656bd __lock_acquire+0x30d ([kernel.kallsyms]) ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms]) ffffffff81309024 __bpf_prog_enter+0x34 ([kernel.kallsyms]) ffffffffc000d7e9 bpf__tcp_congestion_ops_cong_avoid+0x3e ([kernel.kallsyms]) ffffffff81f250a5 tcp_ack+0x10d5 ([kernel.kallsyms]) ffffffff81f27c66 tcp_rcv_established+0x3b6 ([kernel.kallsyms]) ffffffff81f3ad03 tcp_v4_do_rcv+0x193 ([kernel.kallsyms]) ffffffff81d65a18 __release_sock+0xd8 ([kernel.kallsyms]) ffffffff81d65af4 release_sock+0x34 ([kernel.kallsyms]) ffffffff81f15c4b tcp_sendmsg+0x3b ([kernel.kallsyms]) ffffffff81f663d7 inet_sendmsg+0x47 ([kernel.kallsyms]) ffffffff81d5ab40 sock_write_iter+0x160 ([kernel.kallsyms]) ffffffff8149c67b vfs_write+0x3fb ([kernel.kallsyms]) ffffffff8149caf6 ksys_write+0xc6 ([kernel.kallsyms]) ffffffff8149cb5d __x64_sys_write+0x1d ([kernel.kallsyms]) ffffffff81009200 x64_sys_call+0x1d30 ([kernel.kallsyms]) ffffffff82232d28 do_syscall_64+0x68 ([kernel.kallsyms]) ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms]) Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112145849.3436772-4-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf: Use function pointers count as struct_ops links countXu Kuohai
Only function pointers in a struct_ops structure can be linked to bpf progs, so set the links count to the function pointers count, instead of the total members count in the structure. Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20241112145849.3436772-3-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf: Remove unused member rcu from bpf_struct_ops_mapXu Kuohai
The rcu member in bpf_struct_ops_map is not used after commit b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.") Remove it. Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20241112145849.3436772-2-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf: Support private stack for struct_ops progsYonghong Song
For struct_ops progs, whether a particular prog uses private stack depends on prog->aux->priv_stack_requested setting before actual insn-level verification for that prog. One particular implementation is to piggyback on struct_ops->check_member(). The next patch has an example for this. The struct_ops->check_member() sets prog->aux->priv_stack_requested to be true which enables private stack usage. The struct_ops prog follows the same rule as kprobe/tracing progs after function bpf_enable_priv_stack(). For example, even a struct_ops prog requests private stack, it could still use normal kernel stack if the stack size is small (< 64 bytes). Similar to tracing progs, nested same cpu same prog run will be skipped. A field (recursion_detected()) is added to bpf_prog_aux structure. If bpf_prog->aux->recursion_detected is implemented by the struct_ops subsystem and nested same cpu/prog happens, the function will be triggered to report an error, collect related info, etc. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163933.2224962-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf: Enable private stack for eligible subprogsYonghong Song
If private stack is used by any subprog, set that subprog prog->aux->jits_use_priv_stack to be true so later jit can allocate private stack for that subprog properly. Also set env->prog->aux->jits_use_priv_stack to be true if any subprog uses private stack. This is a use case for a single main prog (no subprogs) to use private stack, and also a use case for later struct-ops progs where env->prog->aux->jits_use_priv_stack will enable recursion check if any subprog uses private stack. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163912.2224007-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf: Find eligible subprogs for private stack supportYonghong Song
Private stack will be allocated with percpu allocator in jit time. To avoid complexity at runtime, only one copy of private stack is available per cpu per prog. So runtime recursion check is necessary to avoid stack corruption. Current private stack only supports kprobe/perf_event/tp/raw_tp which has recursion check in the kernel, and prog types that use bpf trampoline recursion check. For trampoline related prog types, currently only tracing progs have recursion checking. To avoid complexity, all async_cb subprogs use normal kernel stack including those subprogs used by both main prog subtree and async_cb subtree. Any prog having tail call also uses kernel stack. To avoid jit penalty with private stack support, a subprog stack size threshold is set such that only if the stack size is no less than the threshold, private stack is supported. The current threshold is 64 bytes. This avoids jit penality if the stack usage is small. A useless 'continue' is also removed from a loop in func check_max_stack_depth(). Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163907.2223839-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-11bpf: Drop special callback reference handlingKumar Kartikeya Dwivedi
Logic to prevent callbacks from acquiring new references for the program (i.e. leaving acquired references), and releasing caller references (i.e. those acquired in parent frames) was introduced in commit 9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks"). This was necessary because back then, the verifier simulated each callback once (that could potentially be executed N times, where N can be zero). This meant that callbacks that left lingering resources or cleared caller resources could do it more than once, operating on undefined state or leaking memory. With the fixes to callback verification in commit ab5cfac139ab ("bpf: verify callbacks as if they are called unknown number of times"), all of this extra logic is no longer necessary. Hence, drop it as part of this commit. Cc: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Refactor active lock managementKumar Kartikeya Dwivedi
When bpf_spin_lock was introduced originally, there was deliberation on whether to use an array of lock IDs, but since bpf_spin_lock is limited to holding a single lock at any given time, we've been using a single ID to identify the held lock. In preparation for introducing spin locks that can be taken multiple times, introduce support for acquiring multiple lock IDs. For this purpose, reuse the acquired_refs array and store both lock and pointer references. We tag the entry with REF_TYPE_PTR or REF_TYPE_LOCK to disambiguate and find the relevant entry. The ptr field is used to track the map_ptr or btf (for bpf_obj_new allocations) to ensure locks can be matched with protected fields within the same "allocation", i.e. bpf_obj_new object or map value. The struct active_lock is changed to an int as the state is part of the acquired_refs array, and we only need active_lock as a cheap way of detecting lock presence. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Call free_htab_elem() after htab_unlock_bucket()Hou Tao
For htab of maps, when the map is removed from the htab, it may hold the last reference of the map. bpf_map_fd_put_ptr() will invoke bpf_map_free_id() to free the id of the removed map element. However, bpf_map_fd_put_ptr() is invoked while holding a bucket lock (raw_spin_lock_t), and bpf_map_free_id() attempts to acquire map_idr_lock (spinlock_t), triggering the following lockdep warning: ============================= [ BUG: Invalid wait context ] 6.11.0-rc4+ #49 Not tainted ----------------------------- test_maps/4881 is trying to lock: ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70 other info that might help us debug this: context-{5:5} 2 locks held by test_maps/4881: #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270 #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80 stack backtrace: CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ... Call Trace: <TASK> dump_stack_lvl+0x6e/0xb0 dump_stack+0x10/0x20 __lock_acquire+0x73e/0x36c0 lock_acquire+0x182/0x450 _raw_spin_lock_irqsave+0x43/0x70 bpf_map_free_id.part.0+0x21/0x70 bpf_map_put+0xcf/0x110 bpf_map_fd_put_ptr+0x9a/0xb0 free_htab_elem+0x69/0xe0 htab_map_update_elem+0x50f/0xa80 bpf_fd_htab_map_update_elem+0x131/0x270 htab_map_update_elem+0x50f/0xa80 bpf_fd_htab_map_update_elem+0x131/0x270 bpf_map_update_value+0x266/0x380 __sys_bpf+0x21bb/0x36b0 __x64_sys_bpf+0x45/0x60 x64_sys_call+0x1b2a/0x20d0 do_syscall_64+0x5d/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e One way to fix the lockdep warning is using raw_spinlock_t for map_idr_lock as well. However, bpf_map_alloc_id() invokes idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a similar lockdep warning because the slab's lock (s->cpu_slab->lock) is still a spinlock. Instead of changing map_idr_lock's type, fix the issue by invoking htab_put_fd_value() after htab_unlock_bucket(). However, only deferring the invocation of htab_put_fd_value() is not enough, because the old map pointers in htab of maps can not be saved during batched deletion. Therefore, also defer the invocation of free_htab_elem(), so these to-be-freed elements could be linked together similar to lru map. There are four callers for ->map_fd_put_ptr: (1) alloc_htab_elem() (through htab_put_fd_value()) It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of htab_put_fd_value() can not simply move after htab_unlock_bucket(), because the old element has already been stashed in htab->extra_elems. It may be reused immediately after htab_unlock_bucket() and the invocation of htab_put_fd_value() after htab_unlock_bucket() may release the newly-added element incorrectly. Therefore, saving the map pointer of the old element for htab of maps before unlocking the bucket and releasing the map_ptr after unlock. Beside the map pointer in the old element, should do the same thing for the special fields in the old element as well. (2) free_htab_elem() (through htab_put_fd_value()) Its caller includes __htab_map_lookup_and_delete_elem(), htab_map_delete_elem() and __htab_map_lookup_and_delete_batch(). For htab_map_delete_elem(), simply invoke free_htab_elem() after htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just like lru map, linking the to-be-freed element into node_to_free list and invoking free_htab_elem() for these element after unlock. It is safe to reuse batch_flink as the link for node_to_free, because these elements have been removed from the hash llist. Because htab of maps doesn't support lookup_and_delete operation, __htab_map_lookup_and_delete_elem() doesn't have the problem, so kept it as is. (3) fd_htab_map_free() It invokes ->map_fd_put_ptr without raw_spinlock_t. (4) bpf_fd_htab_map_update_elem() It invokes ->map_fd_put_ptr without raw_spinlock_t. After moving free_htab_elem() outside htab bucket lock scope, using pcpu_freelist_push() instead of __pcpu_freelist_push() to disable the irq before freeing elements, and protecting the invocations of bpf_mem_cache_free() with migrate_{disable|enable} pair. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241106063542.357743-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Add support for uprobe multi session attachJiri Olsa
Adding support to attach BPF program for entry and return probe of the same function. This is common use case which at the moment requires to create two uprobe multi links. Adding new BPF_TRACE_UPROBE_SESSION attach type that instructs kernel to attach single link program to both entry and exit probe. It's possible to control execution of the BPF program on return probe simply by returning zero or non zero from the entry BPF program execution to execute or not the BPF program on return probe respectively. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org
2024-11-11bpf: Allow return values 0 and 1 for kprobe sessionJiri Olsa
The kprobe session program can return only 0 or 1, instruct verifier to check for that. Fixes: 535a3692ba72 ("bpf: Add support for kprobe session attach") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-2-jolsa@kernel.org
2024-11-04bpf: Mark raw_tp arguments with PTR_MAYBE_NULLKumar Kartikeya Dwivedi
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0]. Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing explicit NULL checks to be deleted, and accesses to such pointers potentially crashing the kernel. To fix this, mark raw_tp arguments as PTR_MAYBE_NULL, and then special case the dereference and pointer arithmetic to permit it, and allow passing them into helpers/kfuncs; these exceptions are made for raw_tp programs only. Ensure that we don't do this when ref_obj_id > 0, as in that case this is an acquired object and doesn't need such adjustment. The reason we do mask_raw_tp_trusted_reg logic is because other will recheck in places whether the register is a trusted_reg, and then consider our register as untrusted when detecting the presence of the PTR_MAYBE_NULL flag. To allow safe dereference, we enable PROBE_MEM marking when we see loads into trusted pointers with PTR_MAYBE_NULL. While trusted raw_tp arguments can also be passed into helpers or kfuncs where such broken assumption may cause issues, a future patch set will tackle their case separately, as PTR_TO_BTF_ID (without PTR_TRUSTED) can already be passed into helpers and causes similar problems. Thus, they are left alone for now. It is possible that these checks also permit passing non-raw_tp args that are trusted PTR_TO_BTF_ID with null marking. In such a case, allowing dereference when pointer is NULL expands allowed behavior, so won't regress existing programs, and the case of passing these into helpers is the same as above and will be dealt with later. Also update the failure case in tp_btf_nullable selftest to capture the new behavior, as the verifier will no longer cause an error when directly dereference a raw tracepoint argument marked as __nullable. [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb Reviewed-by: Jiri Olsa <jolsa@kernel.org> Reported-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241104171959.2938862-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03bpf: Unify resource leak checksKumar Kartikeya Dwivedi
There are similar checks for covering locks, references, RCU read sections and preempt_disable sections in 3 places in the verifer, i.e. for tail calls, bpf_ld_[abs, ind], and exit path (for BPF_EXIT and bpf_throw). Unify all of these into a common check_resource_leak function to avoid code duplication. Also update the error strings in selftests to the new ones in the same change to ensure clean bisection. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241103225940.1408302-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03bpf: Tighten tail call checks for lingering locks, RCU, preempt_disableKumar Kartikeya Dwivedi
There are three situations when a program logically exits and transfers control to the kernel or another program: bpf_throw, BPF_EXIT, and tail calls. The former two check for any lingering locks and references, but tail calls currently do not. Expand the checks to check for spin locks, RCU read sections and preempt disabled sections. Spin locks are indirectly preventing tail calls as function calls are disallowed, but the checks for preemption and RCU are more relaxed, hence ensure tail calls are prevented in their presence. Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Fixes: fc7566ad0a82 ("bpf: Introduce bpf_preempt_[disable,enable] kfuncs") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241103225940.1408302-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-01bpf: ensure RCU Tasks Trace GP for sleepable raw tracepoint BPF linksAndrii Nakryiko
Now that kernel supports sleepable tracepoints, the fact that bpf_probe_unregister() is asynchronous, i.e., that it doesn't wait for any in-flight tracepoints to conclude before returning, we now need to delay BPF raw tp link's deallocation and bpf_prog_put() of its underlying BPF program (regardless of program's own sleepable semantics) until after full RCU Tasks Trace GP. With that GP over, we'll have a guarantee that no tracepoint can reach BPF link and thus its BPF program. We use newly added tracepoint_is_faultable() check to know when this RCU Tasks Trace GP is necessary and utilize BPF link's own sleepable flag passed through bpf_link_init_sleepable() initializer. Link: https://lore.kernel.org/20241101181754.782341-3-andrii@kernel.org Tested-by: Jordan Rife <jrife@google.com> Reported-by: Jordan Rife <jrife@google.com> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-01bpf: decouple BPF link/attach hook and BPF program sleepable semanticsAndrii Nakryiko
BPF link's lifecycle protection scheme depends on both BPF hook and BPF program. If *either* of those require RCU Tasks Trace GP, then we need to go through a chain of GPs before putting BPF program refcount and deallocating BPF link memory. This patch adds bpf_link-specific sleepable flag, which can be set to true even if underlying BPF program is not sleepable itself. If either link->sleepable or link->prog->sleepable is true, we'll go through a chain of RCU Tasks Trace GP and RCU GP before putting BPF program and freeing memory. This will be used to protect BPF link for sleepable (faultable) raw tracepoints in the next patch. Link: https://lore.kernel.org/20241101181754.782341-2-andrii@kernel.org Tested-by: Jordan Rife <jrife@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-01bpf: put bpf_link's program when link is safe to be deallocatedAndrii Nakryiko
In general, BPF link's underlying BPF program should be considered to be reachable through attach hook -> link -> prog chain, and, pessimistically, we have to assume that as long as link's memory is not safe to free, attach hook's code might hold a pointer to BPF program and use it. As such, it's not (generally) correct to put link's program early before waiting for RCU GPs to go through. More eager bpf_prog_put() that we currently do is mostly correct due to BPF program's release code doing similar RCU GP waiting, but as will be shown in the following patches, BPF program can be non-sleepable (and, thus, reliant on only "classic" RCU GP), while BPF link's attach hook can have sleepable semantics and needs to be protected by RCU Tasks Trace, and for such cases BPF link has to go through RCU Tasks Trace + "classic" RCU GPs before being deallocated. And so, if we put BPF program early, we might free BPF program before we free BPF link, leading to use-after-free situation. So, this patch defers bpf_prog_put() until we are ready to perform bpf_link's deallocation. At worst, this delays BPF program freeing by one extra RCU GP, but that seems completely acceptable. Alternatively, we'd need more elaborate ways to determine BPF hook, BPF link, and BPF program lifetimes, and how they relate to each other, which seems like an unnecessary complication. Note, for most BPF links we still will perform eager bpf_prog_put() and link dealloc, so for those BPF links there are no observable changes whatsoever. Only BPF links that use deferred dealloc might notice slightly delayed freeing of BPF programs. Also, to reduce code and logic duplication, extract program put + link dealloc logic into bpf_link_dealloc() helper. Link: https://lore.kernel.org/20241101181754.782341-1-andrii@kernel.org Tested-by: Jordan Rife <jrife@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-01bpf: Add open coded version of kmem_cache iteratorNamhyung Kim
Add a new open coded iterator for kmem_cache which can be called from a BPF program like below. It doesn't take any argument and traverses all kmem_cache entries. struct kmem_cache *pos; bpf_for_each(kmem_cache, pos) { ... } As it needs to grab slab_mutex, it should be called from sleepable BPF programs only. Also update the existing iterator code to use the open coded version internally as suggested by Andrii. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241030222819.1800667-1-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-31Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to force a checkpoint when the program's jump history becomes too long (Eduard Zingerman) - Add several fixes to the BPF bits iterator addressing issues like memory leaks and overflow problems (Hou Tao) - Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong) - Fix BPF test infra's LIVE_FRAME frame update after a page has been recycled (Toke Høiland-Jørgensen) - Fix BPF verifier and undo the 40-bytes extra stack space for bpf_fastcall patterns due to various bugs (Eduard Zingerman) - Fix a BPF sockmap race condition which could trigger a NULL pointer dereference in sock_map_link_update_prog (Cong Wang) - Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under the socket lock (Jiayuan Chen) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled selftests/bpf: Add three test cases for bits_iter bpf: Use __u64 to save the bits in bits iterator bpf: Check the validity of nr_words in bpf_iter_bits_new() bpf: Add bpf_mem_alloc_check_size() helper bpf: Free dynamically allocated bits in bpf_iter_bits_destroy() bpf: disallow 40-bytes extra stack for bpf_fastcall patterns selftests/bpf: Add test for trie_get_next_key() bpf: Fix out-of-bounds write in trie_get_next_key() selftests/bpf: Test with a very short loop bpf: Force checkpoint when jmp history is too long bpf: fix filed access without lock sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
2024-10-30bpf: Use __u64 to save the bits in bits iteratorHou Tao
On 32-bit hosts (e.g., arm32), when a bpf program passes a u64 to bpf_iter_bits_new(), bpf_iter_bits_new() will use bits_copy to store the content of the u64. However, bits_copy is only 4 bytes, leading to stack corruption. The straightforward solution would be to replace u64 with unsigned long in bpf_iter_bits_new(). However, this introduces confusion and problems for 32-bit hosts because the size of ulong in bpf program is 8 bytes, but it is treated as 4-bytes after passed to bpf_iter_bits_new(). Fix it by changing the type of both bits and bit_count from unsigned long to u64. However, the change is not enough. The main reason is that bpf_iter_bits_next() uses find_next_bit() to find the next bit and the pointer passed to find_next_bit() is an unsigned long pointer instead of a u64 pointer. For 32-bit little-endian host, it is fine but it is not the case for 32-bit big-endian host. Because under 32-bit big-endian host, the first iterated unsigned long will be the bits 32-63 of the u64 instead of the expected bits 0-31. Therefore, in addition to changing the type, swap the two unsigned longs within the u64 for 32-bit big-endian host. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30bpf: Check the validity of nr_words in bpf_iter_bits_new()Hou Tao
Check the validity of nr_words in bpf_iter_bits_new(). Without this check, when multiplication overflow occurs for nr_bits (e.g., when nr_words = 0x0400-0001, nr_bits becomes 64), stack corruption may occur due to bpf_probe_read_kernel_common(..., nr_bytes = 0x2000-0008). Fix it by limiting the maximum value of nr_words to 511. The value is derived from the current implementation of BPF memory allocator. To ensure compatibility if the BPF memory allocator's size limitation changes in the future, use the helper bpf_mem_alloc_check_size() to check whether nr_bytes is too larger. And return -E2BIG instead of -ENOMEM for oversized nr_bytes. Fixes: 4665415975b0 ("bpf: Add bits iterator") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30bpf: Add bpf_mem_alloc_check_size() helperHou Tao
Introduce bpf_mem_alloc_check_size() to check whether the allocation size exceeds the limitation for the kmalloc-equivalent allocator. The upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than non-percpu allocation, so a percpu argument is added to the helper. The helper will be used in the following patch to check whether the size parameter passed to bpf_mem_alloc() is too big. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()Hou Tao
bpf_iter_bits_destroy() uses "kit->nr_bits <= 64" to check whether the bits are dynamically allocated. However, the check is incorrect and may cause a kmemleak as shown below: unreferenced object 0xffff88812628c8c0 (size 32): comm "swapper/0", pid 1, jiffies 4294727320 hex dump (first 32 bytes): b0 c1 55 f5 81 88 ff ff f0 f0 f0 f0 f0 f0 f0 f0 ..U........... f0 f0 f0 f0 f0 f0 f0 f0 00 00 00 00 00 00 00 00 .............. backtrace (crc 781e32cc): [<00000000c452b4ab>] kmemleak_alloc+0x4b/0x80 [<0000000004e09f80>] __kmalloc_node_noprof+0x480/0x5c0 [<00000000597124d6>] __alloc.isra.0+0x89/0xb0 [<000000004ebfffcd>] alloc_bulk+0x2af/0x720 [<00000000d9c10145>] prefill_mem_cache+0x7f/0xb0 [<00000000ff9738ff>] bpf_mem_alloc_init+0x3e2/0x610 [<000000008b616eac>] bpf_global_ma_init+0x19/0x30 [<00000000fc473efc>] do_one_initcall+0xd3/0x3c0 [<00000000ec81498c>] kernel_init_freeable+0x66a/0x940 [<00000000b119f72f>] kernel_init+0x20/0x160 [<00000000f11ac9a7>] ret_from_fork+0x3c/0x70 [<0000000004671da4>] ret_from_fork_asm+0x1a/0x30 That is because nr_bits will be set as zero in bpf_iter_bits_next() after all bits have been iterated. Fix the issue by setting kit->bit to kit->nr_bits instead of setting kit->nr_bits to zero when the iteration completes in bpf_iter_bits_next(). In addition, use "!nr_bits || bits >= nr_bits" to check whether the iteration is complete and still use "nr_bits > 64" to indicate whether bits are dynamically allocated. The "!nr_bits" check is necessary because bpf_iter_bits_new() may fail before setting kit->nr_bits, and this condition will stop the iteration early instead of accessing the zeroed or freed kit->bits. Considering the initial value of kit->bits is -1 and the type of kit->nr_bits is unsigned int, change the type of kit->nr_bits to int. The potential overflow problem will be handled in the following patch. Fixes: 4665415975b0 ("bpf: Add bits iterator") Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29bpf: disallow 40-bytes extra stack for bpf_fastcall patternsEduard Zingerman
Hou Tao reported an issue with bpf_fastcall patterns allowing extra stack space above MAX_BPF_STACK limit. This extra stack allowance is not integrated properly with the following verifier parts: - backtracking logic still assumes that stack can't exceed MAX_BPF_STACK; - bpf_verifier_env->scratched_stack_slots assumes only 64 slots are available. Here is an example of an issue with precision tracking (note stack slot -8 tracked as precise instead of -520): 0: (b7) r1 = 42 ; R1_w=42 1: (b7) r2 = 42 ; R2_w=42 2: (7b) *(u64 *)(r10 -512) = r1 ; R1_w=42 R10=fp0 fp-512_w=42 3: (7b) *(u64 *)(r10 -520) = r2 ; R2_w=42 R10=fp0 fp-520_w=42 4: (85) call bpf_get_smp_processor_id#8 ; R0_w=scalar(...) 5: (79) r2 = *(u64 *)(r10 -520) ; R2_w=42 R10=fp0 fp-520_w=42 6: (79) r1 = *(u64 *)(r10 -512) ; R1_w=42 R10=fp0 fp-512_w=42 7: (bf) r3 = r10 ; R3_w=fp0 R10=fp0 8: (0f) r3 += r2 mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 7: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 6: (79) r1 = *(u64 *)(r10 -512) mark_precise: frame0: regs=r2 stack= before 5: (79) r2 = *(u64 *)(r10 -520) mark_precise: frame0: regs= stack=-8 before 4: (85) call bpf_get_smp_processor_id#8 mark_precise: frame0: regs= stack=-8 before 3: (7b) *(u64 *)(r10 -520) = r2 mark_precise: frame0: regs=r2 stack= before 2: (7b) *(u64 *)(r10 -512) = r1 mark_precise: frame0: regs=r2 stack= before 1: (b7) r2 = 42 9: R2_w=42 R3_w=fp42 9: (95) exit This patch disables the additional allowance for the moment. Also, two test cases are removed: - bpf_fastcall_max_stack_ok: it fails w/o additional stack allowance; - bpf_fastcall_max_stack_fail: this test is no longer necessary, stack size follows regular rules, pattern invalidation is checked by other test cases. Reported-by: Hou Tao <houtao@huaweicloud.com> Closes: https://lore.kernel.org/bpf/20241023022752.172005-1-houtao@huaweicloud.com/ Fixes: 5b5f51bff1b6 ("bpf: no_caller_saved_registers attribute for helper calls") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241029193911.1575719-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29Merge tag 'cgroup-for-6.12-rc5-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cgroup_bpf_release_fn() could saturate system_wq with cgrp->bpf.release_work which can then form a circular dependency leading to deadlocks. Fix by using a dedicated workqueue. The system_wq's max concurrency limit is being increased separately. - Fix theoretical off-by-one bug when enforcing max cgroup hierarchy depth * tag 'cgroup-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Fix potential overflow issue when checking max_depth cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
2024-10-29bpf: Fix out-of-bounds write in trie_get_next_key()Byeonguk Jeong
trie_get_next_key() allocates a node stack with size trie->max_prefixlen, while it writes (trie->max_prefixlen + 1) nodes to the stack when it has full paths from the root to leaves. For example, consider a trie with max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ... 0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with .prefixlen = 8 make 9 nodes be written on the node stack with size 8. Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Tested-by: Hou Tao <houtao1@huawei.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29bpf: Force checkpoint when jmp history is too longEduard Zingerman
A specifically crafted program might trick verifier into growing very long jump history within a single bpf_verifier_state instance. Very long jump history makes mark_chain_precision() unreasonably slow, especially in case if verifier processes a loop. Mitigate this by forcing new state in is_state_visited() in case if current state's jump history is too long. Use same constant as in `skip_inf_loop_check`, but multiply it by arbitrarily chosen value 2 to account for jump history containing not only information about jumps, but also information about stack access. For an example of problematic program consider the code below, w/o this patch the example is processed by verifier for ~15 minutes, before failing to allocate big-enough chunk for jmp_history. 0: r7 = *(u16 *)(r1 +0);" 1: r7 += 0x1ab064b9;" 2: if r7 & 0x702000 goto 1b; 3: r7 &= 0x1ee60e;" 4: r7 += r1;" 5: if r7 s> 0x37d2 goto +0;" 6: r0 = 0;" 7: exit;" Perf profiling shows that most of the time is spent in mark_chain_precision() ~95%. The easiest way to explain why this program causes problems is to apply the following patch: diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0c216e71cec7..4b4823961abe 100644 \--- a/include/linux/bpf.h \+++ b/include/linux/bpf.h \@@ -1926,7 +1926,7 @@ struct bpf_array { }; }; -#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */ +#define BPF_COMPLEXITY_LIMIT_INSNS 256 /* yes. 1M insns */ #define MAX_TAIL_CALL_CNT 33 /* Maximum number of loops for bpf_loop and bpf_iter_num. diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f514247ba8ba..75e88be3bb3e 100644 \--- a/kernel/bpf/verifier.c \+++ b/kernel/bpf/verifier.c \@@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) skip_inf_loop_check: if (!force_new_state && env->jmps_processed - env->prev_jmps_processed < 20 && - env->insn_processed - env->prev_insn_processed < 100) + env->insn_processed - env->prev_insn_processed < 100) { + verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n", + env->insn_idx, + env->jmps_processed - env->prev_jmps_processed, + cur->jmp_history_cnt); add_new_state = false; + } goto miss; } /* If sl->state is a part of a loop and this loop's entry is a part of \@@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) if (!add_new_state) return 0; + verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n", + env->insn_idx); + /* There were no equivalent states, remember the current one. * Technically the current state is not proven to be safe yet, * but it will either reach outer most bpf_exit (which means it's safe) And observe verification log: ... is_state_visited: new checkpoint at 5, resetting env->jmps_processed 5: R1=ctx() R7=ctx(...) 5: (65) if r7 s> 0x37d2 goto pc+0 ; R7=ctx(...) 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0 6: R1=ctx() R7=ctx(...) R10=fp0 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74 from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... mark_precise 152 steps for r7 ... 2: R7_w=scalar(...) is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... BPF program is too large. Processed 257 insn The log output shows that checkpoint at label (1) is never created, because it is suppressed by `skip_inf_loop_check` logic: a. When 'if' at (2) is processed it pushes a state with insn_idx (1) onto stack and proceeds to (3); b. At (5) checkpoint is created, and this resets env->{jmps,insns}_processed. c. Verification proceeds and reaches `exit`; d. State saved at step (a) is popped from stack and is_state_visited() considers if checkpoint needs to be added, but because env->{jmps,insns}_processed had been just reset at step (b) the `skip_inf_loop_check` logic forces `add_new_state` to false. e. Verifier proceeds with current state, which slowly accumulates more and more entries in the jump history. The accumulation of entries in the jump history is a problem because of two factors: - it eventually exhausts memory available for kmalloc() allocation; - mark_chain_precision() traverses the jump history of a state, meaning that if `r7` is marked precise, verifier would iterate ever growing jump history until parent state boundary is reached. (note: the log also shows a REG INVARIANTS VIOLATION warning upon jset processing, but that's another bug to fix). With this patch applied, the example above is rejected by verifier under 1s of time, reaching 1M instructions limit. The program is a simplified reproducer from syzbot report. Previous discussion could be found at [1]. The patch does not cause any changes in verification performance, when tested on selftests from veristat.cfg and cilium programs taken from [2]. [1] https://lore.kernel.org/bpf/20241009021254.2805446-1-eddyz87@gmail.com/ [2] https://github.com/anakryiko/cilium Changelog: - v1 -> v2: - moved patch to bpf tree; - moved force_new_state variable initialization after declaration and shortened the comment. v1: https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/ Fixes: 2589726d12a1 ("bpf: introduce bounded loops") Reported-by: syzbot+7e46cdef14bf496a3ab4@syzkaller.appspotmail.com Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com Closes: https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
2024-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: include/linux/bpf.h include/uapi/linux/bpf.h kernel/bpf/btf.c kernel/bpf/helpers.c kernel/bpf/syscall.c kernel/bpf/verifier.c kernel/trace/bpf_trace.c mm/slab_common.c tools/include/uapi/linux/bpf.h tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Add uptr support in the map_value of the task local storage.Martin KaFai Lau
This patch adds uptr support in the map_value of the task local storage. struct map_value { struct user_data __uptr *uptr; }; struct { __uint(type, BPF_MAP_TYPE_TASK_STORAGE); __uint(map_flags, BPF_F_NO_PREALLOC); __type(key, int); __type(value, struct value_type); } datamap SEC(".maps"); A new bpf_obj_pin_uptrs() is added to pin the user page and also stores the kernel address back to the uptr for the bpf prog to use later. It currently does not support the uptr pointing to a user struct across two pages. It also excludes PageHighMem support to keep it simple. As of now, the 32bit bpf jit is missing other more crucial bpf features. For example, many important bpf features depend on bpf kfunc now but so far only one arch (x86-32) supports it which was added by me as an example when kfunc was first introduced to bpf. The uptr can only be stored to the task local storage by the syscall update_elem. Meaning the uptr will not be considered if it is provided by the bpf prog through bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE). This is enforced by only calling bpf_local_storage_update(swap_uptrs==true) in bpf_pid_task_storage_update_elem. Everywhere else will have swap_uptrs==false. This will pump down to bpf_selem_alloc(swap_uptrs==true). It is the only case that bpf_selem_alloc() will take the uptr value when updating the newly allocated selem. bpf_obj_swap_uptrs() is added to swap the uptr between the SDATA(selem)->data and the user provided map_value in "void *value". bpf_obj_swap_uptrs() makes the SDATA(selem)->data takes the ownership of the uptr and the user space provided map_value will have NULL in the uptr. The bpf_obj_unpin_uptrs() is called after map->ops->map_update_elem() returning error. If the map->ops->map_update_elem has reached a state that the local storage has taken the uptr ownership, the bpf_obj_unpin_uptrs() will be a no op because the uptr is NULL. A "__"bpf_obj_unpin_uptrs is added to make this error path unpin easier such that it does not have to check the map->record is NULL or not. BPF_F_LOCK is not supported when the map_value has uptr. This can be revisited later if there is a use case. A similar swap_uptrs idea can be considered. The final bit is to do unpin_user_page in the bpf_obj_free_fields(). The earlier patch has ensured that the bpf_obj_free_fields() has gone through the rcu gp when needed. Cc: linux-mm@kvack.org Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20241023234759.860539-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Postpone bpf_obj_free_fields to the rcu callbackMartin KaFai Lau
A later patch will enable the uptr usage in the task_local_storage map. This will require the unpin_user_page() to be done after the rcu task trace gp for the cases that the uptr may still be used by a bpf prog. The bpf_obj_free_fields() will be the one doing unpin_user_page(), so this patch is to postpone calling bpf_obj_free_fields() to the rcu callback. The bpf_obj_free_fields() is only required to be done in the rcu callback when bpf->bpf_ma==true and reuse_now==false. bpf->bpf_ma==true case is because uptr will only be enabled in task storage which has already been moved to bpf_mem_alloc. The bpf->bpf_ma==false case can be supported in the future also if there is a need. reuse_now==false when the selem (aka storage) is deleted by bpf prog (bpf_task_storage_delete) or by syscall delete_elem(). In both cases, bpf_obj_free_fields() needs to wait for rcu gp. A few words on reuse_now==true. reuse_now==true when the storage's owner (i.e. the task_struct) is destructing or the map itself is doing map_free(). In both cases, no bpf prog should have a hold on the selem and its uptrs, so there is no need to postpone bpf_obj_free_fields(). reuse_now==true should be the common case for local storage usage where the storage exists throughout the lifetime of its owner (task_struct). The bpf_obj_free_fields() needs to use the map->record. Doing bpf_obj_free_fields() in a rcu callback will require the bpf_local_storage_map_free() to wait for rcu_barrier. An optimization could be only waiting for rcu_barrier when the map has uptr in its map_value. This will require either yet another rcu callback function or adding a bool in the selem to flag if the SDATA(selem)->smap is still valid. This patch chooses to keep it simple and wait for rcu_barrier for maps that use bpf_mem_alloc. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Postpone bpf_selem_free() in bpf_selem_unlink_storage_nolock()Martin KaFai Lau
In a later patch, bpf_selem_free() will call unpin_user_page() through bpf_obj_free_fields(). unpin_user_page() may take spin_lock. However, some bpf_selem_free() call paths have held a raw_spin_lock. Like this: raw_spin_lock_irqsave() bpf_selem_unlink_storage_nolock() bpf_selem_free() unpin_user_page() spin_lock() To avoid spinlock nested in raw_spinlock, bpf_selem_free() should be done after releasing the raw_spinlock. The "bool reuse_now" arg is replaced with "struct hlist_head *free_selem_list" in bpf_selem_unlink_storage_nolock(). The bpf_selem_unlink_storage_nolock() will append the to-be-free selem at the free_selem_list. The caller of bpf_selem_unlink_storage_nolock() will need to call the new bpf_selem_free_list(free_selem_list, reuse_now) to free the selem after releasing the raw_spinlock. Note that the selem->snode cannot be reused for linking to the free_selem_list because the selem->snode is protected by the raw_spinlock that we want to avoid holding. A new "struct hlist_node free_node;" is union-ized with the rcu_head. Only the first one successfully hlist_del_init_rcu(&selem->snode) will be able to use the free_node. After succeeding hlist_del_init_rcu(&selem->snode), the free_node and rcu_head usage is serialized such that they can share the 16 bytes in a union. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Add "bool swap_uptrs" arg to bpf_local_storage_update() and ↵Martin KaFai Lau
bpf_selem_alloc() In a later patch, the task local storage will only accept uptr from the syscall update_elem and will not accept uptr from the bpf prog. The reason is the bpf prog does not have a way to provide a valid user space address. bpf_local_storage_update() and bpf_selem_alloc() are used by both bpf prog bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE) and bpf syscall update_elem. "bool swap_uptrs" arg is added to bpf_local_storage_update() and bpf_selem_alloc() to tell if it is called by the bpf prog or by the bpf syscall. When swap_uptrs==true, it is called by the syscall. The arg is named (swap_)uptrs because the later patch will swap the uptrs between the newly allocated selem and the user space provided map_value. It will make error handling easier in case map->ops->map_update_elem() fails and the caller can decide if it needs to unpin the uptr in the user space provided map_value or the bpf_local_storage_update() has already taken the uptr ownership and will take care of unpinning it also. Only swap_uptrs==false is passed now. The logic to handle the true case will be added in a later patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Handle BPF_UPTR in verifierKui-Feng Lee
This patch adds BPF_UPTR support to the verifier. Not that only the map_value will support the "__uptr" type tag. This patch enforces only BPF_LDX is allowed to the value of an uptr. After BPF_LDX, it will mark the dst_reg as PTR_TO_MEM | PTR_MAYBE_NULL with size deduced from the field.kptr.btf_id. This will make the dst_reg pointed memory to be readable and writable as scalar. There is a redundant "val_reg = reg_state(env, value_regno);" statement in the check_map_kptr_access(). This patch takes this chance to remove it also. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Support __uptr type tag in BTFKui-Feng Lee
This patch introduces the "__uptr" type tag to BTF. It is to define a pointer pointing to the user space memory. This patch adds BTF logic to pass the "__uptr" type tag. btf_find_kptr() is reused for the "__uptr" tag. The "__uptr" will only be supported in the map_value of the task storage map. However, btf_parse_struct_meta() also uses btf_find_kptr() but it is not interested in "__uptr". This patch adds a "field_mask" argument to btf_find_kptr() which will return BTF_FIELD_IGNORE if the caller is not interested in a “__uptr” field. btf_parse_kptr() is also reused to parse the uptr. The btf_check_and_fixup_fields() is changed to do extra checks on the uptr to ensure that its struct size is not larger than PAGE_SIZE. It is not clear how a uptr pointing to a CO-RE supported kernel struct will be used, so it is also not allowed now. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Check validity of link->type in bpf_link_show_fdinfo()Hou Tao
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing bpf_link_type_strs[link->type] may result in an out-of-bounds access. To spot such missed invocations early in the future, checking the validity of link->type in bpf_link_show_fdinfo() and emitting a warning when such invocations are missed. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com
2024-10-23bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()Andrii Nakryiko
We need `goto next_insn;` at the end of patching instead of `continue;`. It currently works by accident by making verifier re-process patched instructions. Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Fixes: 314a53623cd4 ("bpf: inline bpf_get_branch_snapshot() helper") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20241023161916.2896274-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22bpf: Fix overloading of MEM_UNINIT's meaningDaniel Borkmann
Lonial reported an issue in the BPF verifier where check_mem_size_reg() has the following code: if (!tnum_is_const(reg->var_off)) /* For unprivileged variable accesses, disable raw * mode so that the program is required to * initialize all the memory that the helper could * just partially fill up. */ meta = NULL; This means that writes are not checked when the register containing the size of the passed buffer has not a fixed size. Through this bug, a BPF program can write to a map which is marked as read-only, for example, .rodata global maps. The problem is that MEM_UNINIT's initial meaning that "the passed buffer to the BPF helper does not need to be initialized" which was added back in commit 435faee1aae9 ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type") got overloaded over time with "the passed buffer is being written to". The problem however is that checks such as the above which were added later via 06c1c049721a ("bpf: allow helpers access to variable memory") set meta to NULL in order force the user to always initialize the passed buffer to the helper. Due to the current double meaning of MEM_UNINIT, this bypasses verifier write checks to the memory (not boundary checks though) and only assumes the latter memory is read instead. Fix this by reverting MEM_UNINIT back to its original meaning, and having MEM_WRITE as an annotation to BPF helpers in order to then trigger the BPF verifier checks for writing to memory. Some notes: check_arg_pair_ok() ensures that for ARG_CONST_SIZE{,_OR_ZERO} we can access fn->arg_type[arg - 1] since it must contain a preceding ARG_PTR_TO_MEM. For check_mem_reg() the meta argument can be removed altogether since we do check both BPF_READ and BPF_WRITE. Same for the equivalent check_kfunc_mem_size_reg(). Fixes: 7b3552d3f9f6 ("bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access") Fixes: 97e6d7dab1ca ("bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access") Fixes: 15baa55ff5b0 ("bpf/verifier: allow all functions to read user provided context") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22bpf: Add MEM_WRITE attributeDaniel Borkmann
Add a MEM_WRITE attribute for BPF helper functions which can be used in bpf_func_proto to annotate an argument type in order to let the verifier know that the helper writes into the memory passed as an argument. In the past MEM_UNINIT has been (ab)used for this function, but the latter merely tells the verifier that the passed memory can be uninitialized. There have been bugs with overloading the latter but aside from that there are also cases where the passed memory is read + written which currently cannot be expressed, see also 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22bpf: Preserve param->string when parsing mount optionsHou Tao
In bpf_parse_param(), keep the value of param->string intact so it can be freed later. Otherwise, the kmalloc area pointed to by param->string will be leaked as shown below: unreferenced object 0xffff888118c46d20 (size 8): comm "new_name", pid 12109, jiffies 4295580214 hex dump (first 8 bytes): 61 6e 79 00 38 c9 5c 7e any.8.\~ backtrace (crc e1b7f876): [<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80 [<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0 [<000000003e29b886>] memdup_user+0x32/0xa0 [<0000000007248326>] strndup_user+0x46/0x60 [<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0 [<0000000018657927>] x64_sys_call+0xff/0x9f0 [<00000000c0cabc95>] do_syscall_64+0x3b/0xc0 [<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 6c1752e0b6ca ("bpf: Support symbolic BPF FS delegation mount options") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241022130133.3798232-1-houtao@huaweicloud.com
2024-10-21bpf: Implement bpf_send_signal_task() kfuncPuranjay Mohan
Implement bpf_send_signal_task kfunc that is similar to bpf_send_signal_thread and bpf_send_signal helpers but can be used to send signals to other threads and processes. It also supports sending a cookie with the signal similar to sigqueue(). If the receiving process establishes a handler for the signal using the SA_SIGINFO flag to sigaction(), then it can obtain this cookie via the si_value field of the siginfo_t structure passed as the second argument to the handler. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016084136.10305-2-puranjay@kernel.org
2024-10-18Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to not affect subreg_def marks in its range propagation (Eduard Zingerman) - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx (Dimitar Kanaliev) - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition (Daniel Borkmann) - Fix a NULL pointer dereference in BPF devmap due to missing rxq information (Florian Kauer) - Fix a memory leak in bpf_core_apply (Jiri Olsa) - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs (Hou Tao) - Fix build ID fetching where memory areas backing the file were created with memfd_secret (Andrii Nakryiko) - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid (Jordan Rome) - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks (Michal Luczaj) - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri) - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall (Pu Lehui) - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved (Thomas Weißschuh) - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned (Toke Høiland-Jørgensen) - Fix a BPF selftest compilation error in cgroup-related tests with musl libc (Tony Ambardar) - Several fixes to BPF link info dumps to fill missing fields (Tyrone Wu) - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called (Simon Sundberg) - Ensure that internal and user-facing bpf_redirect flags don't overlap (Toke Høiland-Jørgensen) - Switch to use kvzmalloc to allocate BPF verifier environment (Rik van Riel) - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT (Wander Lairson Costa) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits) lib/buildid: Handle memfd_secret() files in build_id_parse() selftests/bpf: Add test case for delta propagation bpf: Fix print_reg_state's constant scalar dump bpf: Fix incorrect delta propagation between linked registers bpf: Properly test iter/task tid filtering bpf: Fix iter/task tid filtering riscv, bpf: Make BPF_CMPXCHG fully ordered bpf, vsock: Drop static vsock_bpf_prot initialization vsock: Update msg_count on read_skb() vsock: Update rx_bytes on read_skb() bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock selftests/bpf: Add asserts for netfilter link info bpf: Fix link info netfilter flags to populate defrag flag selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx() selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx() bpf: Fix truncation bug in coerce_reg_to_size_sx() selftests/bpf: Assert link info uprobe_multi count & path_size if unset bpf: Fix unpopulated path_size when uprobe_multi fields unset selftests/bpf: Fix cross-compiling urandom_read selftests/bpf: Add test for kfunc module order ...
2024-10-17bpf: Fix print_reg_state's constant scalar dumpDaniel Borkmann
print_reg_state() should not consider adding reg->off to reg->var_off.value when dumping scalars. Scalars can be produced with reg->off != 0 through BPF_ADD_CONST, and thus as-is this can skew the register log dump. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016134913.32249-2-daniel@iogearbox.net
2024-10-17bpf: Fix incorrect delta propagation between linked registersDaniel Borkmann
Nathaniel reported a bug in the linked scalar delta tracking, which can lead to accepting a program with OOB access. The specific code is related to the sync_linked_regs() function and the BPF_ADD_CONST flag, which signifies a constant offset between two scalar registers tracked by the same register id. The verifier attempts to track "similar" scalars in order to propagate bounds information learned about one scalar to others. For instance, if r1 and r2 are known to contain the same value, then upon encountering 'if (r1 != 0x1234) goto xyz', not only does it know that r1 is equal to 0x1234 on the path where that conditional jump is not taken, it also knows that r2 is. Additionally, with env->bpf_capable set, the verifier will track scalars which should be a constant delta apart (if r1 is known to be one greater than r2, then if r1 is known to be equal to 0x1234, r2 must be equal to 0x1233.) The code path for the latter in adjust_reg_min_max_vals() is reached when processing both 32 and 64-bit addition operations. While adjust_reg_min_max_vals() knows whether dst_reg was produced by a 32 or a 64-bit addition (based on the alu32 bool), the only information saved in dst_reg is the id of the source register (reg->id, or'ed by BPF_ADD_CONST) and the value of the constant offset (reg->off). Later, the function sync_linked_regs() will attempt to use this information to propagate bounds information from one register (known_reg) to others, meaning, for all R in linked_regs, it copies known_reg range (and possibly adjusting delta) into R for the case of R->id == known_reg->id. For the delta adjustment, meaning, matching reg->id with BPF_ADD_CONST, the verifier adjusts the register as reg = known_reg; reg += delta where delta is computed as (s32)reg->off - (s32)known_reg->off and placed as a scalar into a fake_reg to then simulate the addition of reg += fake_reg. This is only correct, however, if the value in reg was created by a 64-bit addition. When reg contains the result of a 32-bit addition operation, its upper 32 bits will always be zero. sync_linked_regs() on the other hand, may cause the verifier to believe that the addition between fake_reg and reg overflows into those upper bits. For example, if reg was generated by adding the constant 1 to known_reg using a 32-bit alu operation, then reg->off is 1 and known_reg->off is 0. If known_reg is known to be the constant 0xFFFFFFFF, sync_linked_regs() will tell the verifier that reg is equal to the constant 0x100000000. This is incorrect as the actual value of reg will be 0, as the 32-bit addition will wrap around. Example: 0: (b7) r0 = 0; R0_w=0 1: (18) r1 = 0x80000001; R1_w=0x80000001 3: (37) r1 /= 1; R1_w=scalar() 4: (bf) r2 = r1; R1_w=scalar(id=1) R2_w=scalar(id=1) 5: (bf) r4 = r1; R1_w=scalar(id=1) R4_w=scalar(id=1) 6: (04) w2 += 2147483647; R2_w=scalar(id=1+2147483647,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 7: (04) w4 += 0 ; R4_w=scalar(id=1+0,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 8: (15) if r2 == 0x0 goto pc+1 10: R0=0 R1=0xffffffff80000001 R2=0x7fffffff R4=0xffffffff80000001 R10=fp0 What can be seen here is that r1 is copied to r2 and r4, such that {r1,r2,r4}.id are all the same which later lets sync_linked_regs() to be invoked. Then, in a next step constants are added with alu32 to r2 and r4, setting their ->off, as well as id |= BPF_ADD_CONST. Next, the conditional will bind r2 and propagate ranges to its linked registers. The verifier now believes the upper 32 bits of r4 are r4=0xffffffff80000001, while actually r4=r1=0x80000001. One approach for a simple fix suitable also for stable is to limit the constant delta tracking to only 64-bit alu addition. If necessary at some later point, BPF_ADD_CONST could be split into BPF_ADD_CONST64 and BPF_ADD_CONST32 to avoid mixing the two under the tradeoff to further complicate sync_linked_regs(). However, none of the added tests from dedf56d775c0 ("selftests/bpf: Add tests for add_const") make this necessary at this point, meaning, BPF CI also passes with just limiting tracking to 64-bit alu addition. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20241016134913.32249-1-daniel@iogearbox.net
2024-10-17bpf: Fix iter/task tid filteringJordan Rome
In userspace, you can add a tid filter by setting the "task.tid" field for "bpf_iter_link_info". However, `get_pid_task` when called for the `BPF_TASK_ITER_TID` type should have been using `PIDTYPE_PID` (tid) instead of `PIDTYPE_TGID` (pid). Fixes: f0d74c4da1f0 ("bpf: Parameterize task iterators.") Signed-off-by: Jordan Rome <linux@jordanrome.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016210048.1213935-1-linux@jordanrome.com
2024-10-16bpf: Prevent tailcall infinite loop caused by freplaceLeon Hwang
There is a potential infinite loop issue that can occur when using a combination of tail calls and freplace. In an upcoming selftest, the attach target for entry_freplace of tailcall_freplace.c is subprog_tc of tc_bpf2bpf.c, while the tail call in entry_freplace leads to entry_tc. This results in an infinite loop: entry_tc -> subprog_tc -> entry_freplace --tailcall-> entry_tc. The problem arises because the tail_call_cnt in entry_freplace resets to zero each time entry_freplace is executed, causing the tail call mechanism to never terminate, eventually leading to a kernel panic. To fix this issue, the solution is twofold: 1. Prevent updating a program extended by an freplace program to a prog_array map. 2. Prevent extending a program that is already part of a prog_array map with an freplace program. This ensures that: * If a program or its subprogram has been extended by an freplace program, it can no longer be updated to a prog_array map. * If a program has been added to a prog_array map, neither it nor its subprograms can be extended by an freplace program. Moreover, an extension program should not be tailcalled. As such, return -EINVAL if the program has a type of BPF_PROG_TYPE_EXT when adding it to a prog_array map. Additionally, fix a minor code style issue by replacing eight spaces with a tab for proper formatting. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20241015150207.70264-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-16bpf: Add bpf_task_from_vpid() kfuncJuntong Deng
bpf_task_from_pid() that currently exists looks up the struct task_struct corresponding to the pid in the root pid namespace (init_pid_ns). This patch adds bpf_task_from_vpid() which looks up the struct task_struct corresponding to vpid in the pid namespace of the current process. This is useful for getting information about other processes in the same pid namespace. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Link: https://lore.kernel.org/r/AM6PR03MB5848E50DA58F79CDE65433C399442@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-16mm/bpf: Add bpf_get_kmem_cache() kfuncNamhyung Kim
The bpf_get_kmem_cache() is to get a slab cache information from a virtual address like virt_to_cache(). If the address is a pointer to a slab object, it'd return a valid kmem_cache pointer, otherwise NULL is returned. It doesn't grab a reference count of the kmem_cache so the caller is responsible to manage the access. The returned point is marked as PTR_UNTRUSTED. The intended use case for now is to symbolize locks in slab objects from the lock contention tracepoints. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241010232505.1339892-3-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-15bpf: Fix truncation bug in coerce_reg_to_size_sx()Dimitar Kanaliev
coerce_reg_to_size_sx() updates the register state after a sign-extension operation. However, there's a bug in the assignment order of the unsigned min/max values, leading to incorrect truncation: 0: (85) call bpf_get_prandom_u32#7 ; R0_w=scalar() 1: (57) r0 &= 1 ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) 2: (07) r0 += 254 ; R0_w=scalar(smin=umin=smin32=umin32=254,smax=umax=smax32=umax32=255,var_off=(0xfe; 0x1)) 3: (bf) r0 = (s8)r0 ; R0_w=scalar(smin=smin32=-2,smax=smax32=-1,umin=umin32=0xfffffffe,umax=0xffffffff,var_off=(0xfffffffffffffffe; 0x1)) In the current implementation, the unsigned 32-bit min/max values (u32_min_value and u32_max_value) are assigned directly from the 64-bit signed min/max values (s64_min and s64_max): reg->umin_value = reg->u32_min_value = s64_min; reg->umax_value = reg->u32_max_value = s64_max; Due to the chain assigmnent, this is equivalent to: reg->u32_min_value = s64_min; // Unintended truncation reg->umin_value = reg->u32_min_value; reg->u32_max_value = s64_max; // Unintended truncation reg->umax_value = reg->u32_max_value; Fixes: 1f9a1ea821ff ("bpf: Support new sign-extension load insns") Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Reported-by: Zac Ecob <zacecob@protonmail.com> Signed-off-by: Dimitar Kanaliev <dimitar.kanaliev@siteground.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20241014121155.92887-2-dimitar.kanaliev@siteground.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-14bpf: Add kmem_cache iteratorNamhyung Kim
The new "kmem_cache" iterator will traverse the list of slab caches and call attached BPF programs for each entry. It should check the argument (ctx.s) if it's NULL before using it. Now the iteration grabs the slab_mutex only if it traverse the list and releases the mutex when it runs the BPF program. The kmem_cache entry is protected by a refcount during the execution. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> #slab Link: https://lore.kernel.org/r/20241010232505.1339892-2-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>