summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-01-08sched_ext: keep running prev when prev->scx.slice != 0Henry Huang
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0, @prev will be dispacthed into the local DSQ in put_prev_task_scx(). However, pick_task_scx() is executed before put_prev_task_scx(), so it will not pick @prev. Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx() can pick @prev. Signed-off-by: Henry Huang <henry.hj@antgroup.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-07tracing/hist: Support POLLPRI event for poll on histogramMasami Hiramatsu (Google)
Since POLLIN will not be flushed until the hist file is read, the user needs to repeatedly read() and poll() on the hist file for monitoring the event continuously. But the read() is somewhat redundant when the user is only monitoring for event updates. Add POLLPRI poll event on the hist file so the event returns when a histogram is updated after open(), poll() or read(). Thus it is possible to wait for the next event without having to issue a read(). Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527248770.464571.2536902137325258133.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-07tracing/hist: Add poll(POLLIN) support on hist fileMasami Hiramatsu (Google)
Add poll syscall support on the `hist` file. The Waiter will be waken up when the histogram is updated with POLLIN. Currently, there is no way to wait for a specific event in userspace. So user needs to peek the `trace` periodicaly, or wait on `trace_pipe`. But it is not a good idea to peek at the `trace` for an event that randomly happens. And `trace_pipe` is not coming back until a page is filled with events. This allows a user to wait for a specific event on the `hist` file. User can set a histogram trigger on the event which they want to monitor and poll() on its `hist` file. Since this poll() returns POLLIN, the next poll() will return soon unless a read() happens on that hist file. NOTE: To read the hist file again, you must set the file offset to 0, but just for monitoring the event, you may not need to read the histogram. Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-07tracing: Fix using ret variable in tracing_set_tracer()Steven Rostedt
When the function tracing_set_tracer() switched over to using the guard() infrastructure, it did not need to save the 'ret' variable and would just return the value when an error arised, instead of setting ret and jumping to an out label. When CONFIG_TRACER_SNAPSHOT is enabled, it had code that expected the "ret" variable to be initialized to zero and had set 'ret' while holding an arch_spin_lock() (not used by guard), and then upon releasing the lock it would check 'ret' and exit if set. But because ret was only set when an error occurred while holding the locks, 'ret' would be used uninitialized if there was no error. The code in the CONFIG_TRACER_SNAPSHOT block should be self contain. Make sure 'ret' is also set when no error occurred. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250106111143.2f90ff65@gandalf.local.home Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202412271654.nJVBuwmF-lkp@intel.com/ Fixes: d33b10c0c73ad ("tracing: Switch trace.c code over to use guard()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-01-07kernel/ksysfs.c: simplify bin_attribute definitionThomas Weißschuh
The notes attribute can be implemented in terms of BIN_ATTR_SIMPLE(). This saves memory at runtime and is a preparation for the constification of struct bin_attribute. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20241121-sysfs-const-bin_attr-ksysfs-v1-1-972faced149d@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-06bpf: Allow bpf_for/bpf_repeat calls while holding a spinlockEmil Tsalapatis
Add the bpf_iter_num_* kfuncs called by bpf_for in special_kfunc_list, and allow the calls even while holding a spin lock. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250104202528.882482-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-06sched_ext: Include remaining task time slice in error state dumpAndrea Righi
Report the remaining time slice when dumping task information during an error exit. This information can be useful for tracking incorrect or excessively long time slices in schedulers that implement dynamic time slice logic. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ONAndrea Righi
With commit 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches"), scx_bpf_dsq_insert() can use SCX_DSQ_LOCAL_ON for direct dispatch from ops.enqueue() to target the local DSQ of any CPU. Update the documentation accordingly. Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06sched_ext: idle: small CPU iteration refactoringAndrea Righi
Replace the loop to check if all SMT CPUs are idle with cpumask_subset(). This simplifies the code and slightly improves efficiency, while preserving the original behavior. Note that idle_masks.smt handling remains racy, which is acceptable as it serves as an optimization and is self-correcting. Suggested-and-reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06Merge tag 'vfs-6.13-rc7.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Relax assertions on failure to encode file handles The ->encode_fh() method can fail for various reasons. None of them warrant a WARN_ON(). - Fix overlayfs file handle encoding by allowing encoding an fid from an inode without an alias - Make sure fuse_dir_open() handles FOPEN_KEEP_CACHE. If it's not specified fuse needs to invaludate the directory inode page cache - Fix qnx6 so it builds with gcc-15 - Various fixes for netfslib and ceph and nfs filesystems: - Ignore silly rename files from afs and nfs when building header archives - Fix read result collection in netfslib with multiple subrequests - Handle ENOMEM for netfslib buffered reads - Fix oops in nfs_netfs_init_request() - Parse the secctx command immediately in cachefiles - Remove a redundant smp_rmb() in netfslib - Handle recursion in read retry in netfslib - Fix clearing of folio_queue - Fix missing cancellation of copy-to_cache when the cache for a file is temporarly disabled in netfslib - Sanity check the hfs root record - Fix zero padding data issues in concurrent write scenarios - Fix is_mnt_ns_file() after converting nsfs to path_from_stashed() - Fix missing declaration of init_files - Increase I/O priority when writing revoke records in jbd2 - Flush filesystem device before updating tail sequence in jbd2 * tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits) ovl: support encoding fid from inode with no alias ovl: pass realinode to ovl_encode_real_fh() instead of realdentry fuse: respect FOPEN_KEEP_CACHE on opendir netfs: Fix is-caching check in read-retry netfs: Fix the (non-)cancellation of copy when cache is temporarily disabled netfs: Fix ceph copy to cache on write-begin netfs: Work around recursion by abandoning retry if nothing read netfs: Fix missing barriers by using clear_and_wake_up_bit() netfs: Remove redundant use of smp_rmb() cachefiles: Parse the "secctx" immediately nfs: Fix oops in nfs_netfs_init_request() when copying to cache netfs: Fix enomem handling in buffered reads netfs: Fix non-contiguous donation between completed reads kheaders: Ignore silly-rename files fs: relax assertions on failure to encode file handles fs: fix missing declaration of init_files fs: fix is_mnt_ns_file() iomap: fix zero padding data issue in concurrent append writes iomap: pass byte granular end position to iomap_add_to_ioend jbd2: flush filesystem device before updating tail sequence ...
2025-01-06kernel/cgroup: Add "dmem" memory accounting cgroupMaarten Lankhorst
This code is based on the RDMA and misc cgroup initially, but now uses page_counter. It uses the same min/low/max semantics as the memory cgroup as a result. There's a small mismatch as TTM uses u64, and page_counter long pages. In practice it's not a problem. 32-bits systems don't really come with >=4GB cards and as long as we're consistently wrong with units, it's fine. The device page size may not be in the same units as kernel page size, and each region might also have a different page size (VRAM vs GART for example). The interface is simple: - Call dmem_cgroup_register_region() - Use dmem_cgroup_try_charge to check if you can allocate a chunk of memory, use dmem_cgroup__uncharge when freeing it. This may return an error code, or -EAGAIN when the cgroup limit is reached. In that case a reference to the limiting pool is returned. - The limiting cs can be used as compare function for dmem_cgroup_state_evict_valuable. - After having evicted enough, drop reference to limiting cs with dmem_cgroup_pool_state_put. This API allows you to limit device resources with cgroups. You can see the supported cards in /sys/fs/cgroup/dmem.capacity You need to echo +dmem to cgroup.subtree_control, and then you can partition device memory. Co-developed-by: Friedrich Vock <friedrich.vock@gmx.de> Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> Co-developed-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20241204143112.1250983-1-dev@lankhorst.se Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-01-05Merge tag 'mm-hotfixes-stable-2025-01-04-18-02' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM. The usual bunch of singletons and two doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) MAINTAINERS: change Arınç _NAL's name and email address scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity mm/util: make memdup_user_nul() similar to memdup_user() mm, madvise: fix potential workingset node list_lru leaks mm/damon/core: fix ignored quota goals and filters of newly committed schemes mm/damon/core: fix new damon_target objects leaks on damon_commit_targets() mm/list_lru: fix false warning of negative counter vmstat: disable vmstat_work on vmstat_cpu_down_prep() mm: shmem: fix the update of 'shmem_falloc->nr_unswapped' mm: shmem: fix incorrect index alignment for within_size policy percpu: remove intermediate variable in PERCPU_PTR() mm: zswap: fix race between [de]compression and CPU hotunplug ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit kcov: mark in_softirq_really() as __always_inline docs: mm: fix the incorrect 'FileHugeMapped' field mailmap: modify the entry for Mathieu Othacehe mm/kmemleak: fix sleeping function called from invalid context at print message mm: hugetlb: independent PMD page table shared count maple_tree: reload mas before the second call for mas_empty_area ...
2025-01-04padata: fix sysfs store callback checkThomas Weißschuh
padata_sysfs_store() was copied from padata_sysfs_show() but this check was not adapted. Today there is no attribute which can fail this check, but if there is one it may as well be correct. Fixes: 5e017dc3f8bc ("padata: Added sysfs primitives to padata subsystem") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-01-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h f91a5b808938 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK") 3f330db30638 ("net: reformat kdoc return statements") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03Merge tag 'sched_ext-for-6.13-rc5-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration - Fix a bug where scx_ops_bypass() could call irq_restore twice - Add Andrea and Changwoo as maintainers for better review coverage - selftests and tools/sched_ext build and other fixes * tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix dsq_local_on selftest sched_ext: initialize kit->cursor.flags sched_ext: Fix invalid irq restore in scx_ops_bypass() MAINTAINERS: add me as reviewer for sched_ext MAINTAINERS: add self as reviewer for sched_ext scx: Fix maximal BPF selftest prog sched_ext: fix application of sizeof to pointer selftests/sched_ext: fix build after renames in sched_ext API sched_ext: Add __weak to fix the build errors
2025-01-03Merge tag 'wq-for-6.13-rc5-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Suppress a corner case spurious flush dependency warning - Two trivial changes * tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add printf attribute to __alloc_workqueue() workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker rust: add safety comment in workqueue traits
2025-01-03bpf: Reject struct_ops registration that uses module ptr and the module ↵Martin KaFai Lau
btf_id is missing There is a UAF report in the bpf_struct_ops when CONFIG_MODULES=n. In particular, the report is on tcp_congestion_ops that has a "struct module *owner" member. For struct_ops that has a "struct module *owner" member, it can be extended either by the regular kernel module or by the bpf_struct_ops. bpf_try_module_get() will be used to do the refcounting and different refcount is done based on the owner pointer. When CONFIG_MODULES=n, the btf_id of the "struct module" is missing: WARN: resolve_btfids: unresolved symbol module Thus, the bpf_try_module_get() cannot do the correct refcounting. Not all subsystem's struct_ops requires the "struct module *owner" member. e.g. the recent sched_ext_ops. This patch is to disable bpf_struct_ops registration if the struct_ops has the "struct module *" member and the "struct module" btf_id is missing. The btf_type_is_fwd() helper is moved to the btf.h header file for this test. This has happened since the beginning of bpf_struct_ops which has gone through many changes. The Fixes tag is set to a recent commit that this patch can apply cleanly. Considering CONFIG_MODULES=n is not common and the age of the issue, targeting for bpf-next also. Fixes: 1611603537a4 ("bpf: Create argument information for nullable arguments.") Reported-by: Robert Morris <rtm@csail.mit.edu> Closes: https://lore.kernel.org/bpf/74665.1733669976@localhost/ Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241220201818.127152-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-03Merge tag 'ftrace-v6.13-rc5-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. - Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. * tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix function profiler's filtering functionality fgraph: Add READ_ONCE() when accessing fgraph_array[]
2025-01-02ftrace: Fix function profiler's filtering functionalityKohei Enju
Commit c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering"), function profiler (enabled via function_profile_enabled) has been showing statistics for all functions, ignoring set_ftrace_filter settings. While tracers are instantiated, the function profiler is not. Therefore, it should use the global set_ftrace_filter for consistency. This patch modifies the function profiler to use the global filter, fixing the filtering functionality. Before (filtering not working): ``` root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# sleep 1 root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# head /sys/kernel/tracing/trace_stat/* Function Hit Time Avg s^2 -------- --- ---- --- --- schedule 314 22290594 us 70989.15 us 40372231 us x64_sys_call 1527 8762510 us 5738.382 us 3414354 us schedule_hrtimeout_range 176 8665356 us 49234.98 us 405618876 us __x64_sys_ppoll 324 5656635 us 17458.75 us 19203976 us do_sys_poll 324 5653747 us 17449.83 us 19214945 us schedule_timeout 67 5531396 us 82558.15 us 2136740827 us __x64_sys_pselect6 12 3029540 us 252461.7 us 63296940171 us do_pselect.constprop.0 12 3029532 us 252461.0 us 63296952931 us ``` After (filtering working): ``` root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# sleep 1 root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# head /sys/kernel/tracing/trace_stat/* Function Hit Time Avg s^2 -------- --- ---- --- --- vfs_write 462 68476.43 us 148.217 us 25874.48 us vfs_read 641 9611.356 us 14.994 us 28868.07 us vfs_fstat 890 878.094 us 0.986 us 1.667 us vfs_fstatat 227 757.176 us 3.335 us 18.928 us vfs_statx 226 610.610 us 2.701 us 17.749 us vfs_getattr_nosec 1187 460.919 us 0.388 us 0.326 us vfs_statx_path 297 343.287 us 1.155 us 11.116 us vfs_rename 6 291.575 us 48.595 us 9889.236 us ``` Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250101190820.72534-1-enjuk@amazon.com Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Kohei Enju <enjuk@amazon.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-02fgraph: Add READ_ONCE() when accessing fgraph_array[]Zilin Guan
In __ftrace_return_to_handler(), a loop iterates over the fgraph_array[] elements, which are fgraph_ops. The loop checks if an element is a fgraph_stub to prevent using a fgraph_stub afterward. However, if the compiler reloads fgraph_array[] after this check, it might race with an update to fgraph_array[] that introduces a fgraph_stub. This could result in the stub being processed, but the stub contains a null "func_hash" field, leading to a NULL pointer dereference. To ensure that the gops compared against the fgraph_stub matches the gops processed later, add a READ_ONCE(). A similar patch appears in commit 63a8dfb ("function_graph: Add READ_ONCE() when accessing fgraph_array[]"). Cc: stable@vger.kernel.org Fixes: 37238abe3cb47 ("ftrace/function_graph: Pass fgraph_ops to function graph callbacks") Link: https://lore.kernel.org/20241231113731.277668-1-zilin@seu.edu.cn Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-02kallsyms: Use kthread_run_on_cpu()Frederic Weisbecker
Use the proper API instead of open coding it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-12-31tracing: Have process_string() also allow arraysSteven Rostedt
In order to catch a common bug where a TRACE_EVENT() TP_fast_assign() assigns an address of an allocated string to the ring buffer and then references it in TP_printk(), which can be executed hours later when the string is free, the function test_event_printk() runs on all events as they are registered to make sure there's no unwanted dereferencing. It calls process_string() to handle cases in TP_printk() format that has "%s". It returns whether or not the string is safe. But it can have some false positives. For instance, xe_bo_move() has: TP_printk("move_lacks_source:%s, migrate object %p [size %zu] from %s to %s device_id:%s", __entry->move_lacks_source ? "yes" : "no", __entry->bo, __entry->size, xe_mem_type_to_name[__entry->old_placement], xe_mem_type_to_name[__entry->new_placement], __get_str(device_id)) Where the "%s" references into xe_mem_type_to_name[]. This is an array of pointers that should be safe for the event to access. Instead of flagging this as a bad reference, if a reference points to an array, where the record field is the index, consider it safe. Link: https://lore.kernel.org/all/9dee19b6185d325d0e6fa5f7cbba81d007d99166.camel@sapience.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241231000646.324fb5f7@gandalf.local.home Fixes: 65a25d9f7ac02 ("tracing: Add "%s" check in test_event_printk()") Reported-by: Genes Lists <lists@sapience.com> Tested-by: Gene C <arch@sapience.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-30bpf: Use refcount_t instead of atomic_t for mmap_countPei Xiao
Use an API that resembles more the actual use of mmap_count. Found by cocci: kernel/bpf/arena.c:245:6-25: WARNING: atomic_dec_and_test variation before object free at line 249. Fixes: b90d77e5fd78 ("bpf: Fix remap of arena.") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412292037.LXlYSHKl-lkp@intel.com/ Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn> Link: https://lore.kernel.org/r/6ecce439a6bc81adb85d5080908ea8959b792a50.1735542814.git.xiaopei01@kylinos.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30kcov: mark in_softirq_really() as __always_inlineArnd Bergmann
If gcc decides not to inline in_softirq_really(), objtool warns about a function call with UACCESS enabled: kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc+0x1e: call to in_softirq_really() with UACCESS enabled kernel/kcov.o: warning: objtool: check_kcov_mode+0x11: call to in_softirq_really() with UACCESS enabled Mark this as __always_inline to avoid the problem. Link: https://lkml.kernel.org/r/20241217071814.2261620-1-arnd@kernel.org Fixes: 7d4df2dad312 ("kcov: properly check for softirq context") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Marco Elver <elver@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-30bpf: Remove unused MT_ENTRY defineLorenzo Pieralisi
The range tree introduction removed the need for maple tree usage but missed removing the MT_ENTRY defined value that was used to mark maple tree allocated entries. Remove the MT_ENTRY define. Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Link: https://lore.kernel.org/r/20241223115901.14207-1-lpieralisi@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30bpf: Fix holes in special_kfunc_list if !CONFIG_NETThomas Weißschuh
If the function is not available its entry has to be replaced with BTF_ID_UNUSED instead of skipped. Otherwise the list doesn't work correctly. Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/lkml/CAADnVQJQpVziHzrPCCpGE5=8uzw2OkxP8gqe1FkJ6_XVVyVbNw@mail.gmail.com/ Fixes: 00a5acdbf398 ("bpf: Fix configuration-dependent BTF function references") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20241219-bpf-fix-special_kfunc_list-v1-1-d9d50dd61505@weissschuh.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30bpf, verifier: Improve precision of BPF_MULMatan Shachnai
This patch improves (or maintains) the precision of register value tracking in BPF_MUL across all possible inputs. It also simplifies scalar32_min_max_mul() and scalar_min_max_mul(). As it stands, BPF_MUL is composed of three functions: case BPF_MUL: tnum_mul(); scalar32_min_max_mul(); scalar_min_max_mul(); The current implementation of scalar_min_max_mul() restricts the u64 input ranges of dst_reg and src_reg to be within [0, U32_MAX]: /* Both values are positive, so we can work with unsigned and * copy the result to signed (unless it exceeds S64_MAX). */ if (umax_val > U32_MAX || dst_reg->umax_value > U32_MAX) { /* Potential overflow, we know nothing */ __mark_reg64_unbounded(dst_reg); return; } This restriction is done to avoid unsigned overflow, which could otherwise wrap the result around 0, and leave an unsound output where umin > umax. We also observe that limiting these u64 input ranges to [0, U32_MAX] leads to a loss of precision. Consider the case where the u64 bounds of dst_reg are [0, 2^34] and the u64 bounds of src_reg are [0, 2^2]. While the multiplication of these two bounds doesn't overflow and is sound [0, 2^36], the current scalar_min_max_mul() would set the entire register state to unbounded. Importantly, we update BPF_MUL to allow signed bound multiplication (i.e. multiplying negative bounds) as well as allow u64 inputs to take on values from [0, U64_MAX]. We perform signed multiplication on two bounds [a,b] and [c,d] by multiplying every combination of the bounds (i.e. a*c, a*d, b*c, and b*d) and checking for overflow of each product. If there is an overflow, we mark the signed bounds unbounded [S64_MIN, S64_MAX]. In the case of no overflow, we take the minimum of these products to be the resulting smin, and the maximum to be the resulting smax. The key idea here is that if there’s no possibility of overflow, either when multiplying signed bounds or unsigned bounds, we can safely multiply the respective bounds; otherwise, we set the bounds that exhibit overflow (during multiplication) to unbounded. if (check_mul_overflow(*dst_umax, src_reg->umax_value, dst_umax) || (check_mul_overflow(*dst_umin, src_reg->umin_value, dst_umin))) { /* Overflow possible, we know nothing */ *dst_umin = 0; *dst_umax = U64_MAX; } ... Below, we provide an example BPF program (below) that exhibits the imprecision in the current BPF_MUL, where the outputs are all unbounded. In contrast, the updated BPF_MUL produces a bounded register state: BPF_LD_IMM64(BPF_REG_1, 11), BPF_LD_IMM64(BPF_REG_2, 4503599627370624), BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0), BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0), BPF_ALU64_REG(BPF_AND, BPF_REG_1, BPF_REG_2), BPF_LD_IMM64(BPF_REG_3, 809591906117232263), BPF_ALU64_REG(BPF_MUL, BPF_REG_3, BPF_REG_1), BPF_MOV64_IMM(BPF_REG_0, 1), BPF_EXIT_INSN(), Verifier log using the old BPF_MUL: func#0 @0 0: R1=ctx() R10=fp0 0: (18) r1 = 0xb ; R1_w=11 2: (18) r2 = 0x10000000000080 ; R2_w=0x10000000000080 4: (87) r2 = -r2 ; R2_w=scalar() 5: (87) r2 = -r2 ; R2_w=scalar() 6: (5f) r1 &= r2 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar() 7: (18) r3 = 0xb3c3f8c99262687 ; R3_w=0xb3c3f8c99262687 9: (2f) r3 *= r1 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar() ... Verifier using the new updated BPF_MUL (more precise bounds at label 9) func#0 @0 0: R1=ctx() R10=fp0 0: (18) r1 = 0xb ; R1_w=11 2: (18) r2 = 0x10000000000080 ; R2_w=0x10000000000080 4: (87) r2 = -r2 ; R2_w=scalar() 5: (87) r2 = -r2 ; R2_w=scalar() 6: (5f) r1 &= r2 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar() 7: (18) r3 = 0xb3c3f8c99262687 ; R3_w=0xb3c3f8c99262687 9: (2f) r3 *= r1 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar(smin=0,smax=umax=0x7b96bb0a94a3a7cd,var_off=(0x0; 0x7fffffffffffffff)) ... Finally, we proved the soundness of the new scalar_min_max_mul() and scalar32_min_max_mul() functions. Typically, multiplication operations are expensive to check with bitvector-based solvers. We were able to prove the soundness of these functions using Non-Linear Integer Arithmetic (NIA) theory. Additionally, using Agni [2,3], we obtained the encodings for scalar32_min_max_mul() and scalar_min_max_mul() in bitvector theory, and were able to prove their soundness using 8-bit bitvectors (instead of 64-bit bitvectors that the functions actually use). In conclusion, with this patch, 1. We were able to show that we can improve the overall precision of BPF_MUL. We proved (using an SMT solver) that this new version of BPF_MUL is at least as precise as the current version for all inputs and more precise for some inputs. 2. We are able to prove the soundness of the new scalar_min_max_mul() and scalar32_min_max_mul(). By leveraging the existing proof of tnum_mul [1], we can say that the composition of these three functions within BPF_MUL is sound. [1] https://ieeexplore.ieee.org/abstract/document/9741267 [2] https://link.springer.com/chapter/10.1007/978-3-031-37709-9_12 [3] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai@gmail.com> Link: https://lore.kernel.org/r/20241218032337.12214-2-m.shachnai@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30cpu: Remove unused init_cpu_onlineDr. David Alan Gilbert
The last use of init_cpu_online() was removed by the commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2024-12-29sched_ext: idle: introduce check_builtin_idle_enabled() helperAndrea Righi
Minor refactoring to add a helper function for checking if the built-in idle CPU selection policy is enabled. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29sched_ext: idle: clarify commentsAndrea Righi
Add a comments to clarify about the usage of cpumask_intersects(). Moreover, update scx_select_cpu_dfl() description clarifying that the final step of the idle selection logic involves searching for any idle CPU in the system that the task can use. Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29sched_ext: idle: use assign_cpu() to update the idle cpumaskAndrea Righi
Use the assign_cpu() helper to set or clear the CPU in the idle mask, based on the idle condition. Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29Merge tag 'locking-urgent-2024-12-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix missed rtmutex wakeups causing sporadic boot hangs and other misbehavior" * tag 'locking-urgent-2024-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Make sure we wake anything on the wake_q when we release the lock->wait_lock
2024-12-27verification/dot2k: Auto patch current kernel sourceGabriele Monaco
dot2k suggests a list of changes to the kernel tree while adding a monitor: edit tracepoints header, Makefile, Kconfig and moving the monitor folder. Those changes can be easily run automatically. Add a flag to dot2k to alter the kernel source. The kernel source directory can be either assumed from the PWD, or from the running kernel, if installed. This feature works best if the kernel tree is a git repository, so that its easier to make sure there are no unintended changes. The main RV files (e.g. Makefile) have now a comment placeholder that can be useful for manual editing (e.g. to know where to add new monitors) and it is used by the script to append the required lines. We also slightly adapt the file handling functions in dot2k: __open_file is now called __read_file and also closes the file before returning the content; __create_file is now a more general __write_file, we no longer return on FileExistsError (not thrown while opening), a new __create_file simply calls __write_file specifying the monitor folder in the path. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20241227144752.362911-8-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-27rv: Simplify manual steps in monitor creationGabriele Monaco
While creating a new monitor in RV, besides generating code from dot2k, there are a few manual steps which can be tedious and error prone, like adding the tracepoints, makefile lines and kconfig. This patch restructures the existing monitors to keep some files in the monitor's folder itself, which can be automatically generated by future versions of dot2k. Monitors have now their own Kconfig and tracepoint snippets. For simplicity, the main tracepoint definition, is moved to the RV directory, it defines only the tracepoint classes and includes the monitor-specific tracepoints, which reside in the monitor directory. Tracepoints and Kconfig no longer need to be copied and adapted from existing ones but only need to be included in the main files. The Makefile remains untouched since there's little advantage in having a separated Makefile for each monitor with a single line and including it in the main RV Makefile. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20241227144752.362911-6-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-27Merge tag 'probes-fixes-v6.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: "Change the priority of the module callback of kprobe events so that it is called after the jump label list on the module is updated. This ensures the kprobe can check whether it is not on the jump label address correctly" * tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/kprobe: Make trace_kprobe's module callback called after jump_label update
2024-12-26tracing/fprobe: Remove nr_maxactive from fprobeMasami Hiramatsu (Google)
Remove depercated fprobe::nr_maxactive. This involves fprobe events to rejects the maxactive number. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173519007257.391279.946804046982289337.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26fprobe: Add fprobe_header encoding featureMasami Hiramatsu (Google)
Fprobe store its data structure address and size on the fgraph return stack by __fprobe_header. But most 64bit architecture can combine those to one unsigned long value because 4 MSB in the kernel address are the same. With this encoding, fprobe can consume less space on ret_stack. This introduces asm/fprobe.h to define arch dependent encode/decode macros. Note that since fprobe depends on CONFIG_HAVE_FUNCTION_GRAPH_FREGS, currently only arm64, loongarch, riscv, s390 and x86 are supported. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173519005783.391279.5307910947400277525.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26fprobe: Rewrite fprobe on function-graph tracerMasami Hiramatsu (Google)
Rewrite fprobe implementation on function-graph tracer. Major API changes are: - 'nr_maxactive' field is deprecated. - This depends on CONFIG_DYNAMIC_FTRACE_WITH_ARGS or !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS, and CONFIG_HAVE_FUNCTION_GRAPH_FREGS. So currently works only on x86_64. - Currently the entry size is limited in 15 * sizeof(long). - If there is too many fprobe exit handler set on the same function, it will fail to probe. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/173519003970.391279.14406792285453830996.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNCMasami Hiramatsu (Google)
Add CONFIG_HAVE_FTRACE_GRAPH_FUNC kconfig in addition to ftrace_graph_func macro check. This is for the other feature (e.g. FPROBE) which requires to access ftrace_regs from fgraph_ops::entryfunc() can avoid compiling if the fgraph can not pass the valid ftrace_regs. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173519001472.391279.1174901685282588467.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabledMasami Hiramatsu (Google)
Enable kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is converted from ftrace_regs by ftrace_partial_regs(), thus some registers may always returns 0. But it should be enough for function entry (access arguments) and exit (access return value). Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173519000417.391279.14011193569589886419.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Florent Revest <revest@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGSMasami Hiramatsu (Google)
Allow fprobe events to be enabled with CONFIG_DYNAMIC_FTRACE_WITH_ARGS. With this change, fprobe events mostly use ftrace_regs instead of pt_regs. Note that if the arch doesn't enable HAVE_FTRACE_REGS_HAVING_PT_REGS, fprobe events will not be able to be used from perf. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173518999352.391279.13332699755290175168.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26fprobe: Use ftrace_regs in fprobe exit handlerMasami Hiramatsu (Google)
Change the fprobe exit handler to use ftrace_regs structure instead of pt_regs. This also introduce HAVE_FTRACE_REGS_HAVING_PT_REGS which means the ftrace_regs is including the pt_regs so that ftrace_regs can provide pt_regs without memory allocation. Fprobe introduces a new dependency with that. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: bpf <bpf@vger.kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Song Liu <song@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Cc: Matt Bobrowski <mattbobrowski@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Hao Luo <haoluo@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/173518995092.391279.6765116450352977627.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26fprobe: Use ftrace_regs in fprobe entry handlerMasami Hiramatsu (Google)
This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, then we can enable fprobe on arm64. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173518994037.391279.2786805566359674586.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Florent Revest <revest@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26fgraph: Pass ftrace_regs to retfuncMasami Hiramatsu (Google)
Pass ftrace_regs to the fgraph_ops::retfunc(). If ftrace_regs is not available, it passes a NULL instead. User callback function can access some registers (including return address) via this ftrace_regs. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173518992972.391279.14055405490327765506.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26fgraph: Replace fgraph_ret_regs with ftrace_regsMasami Hiramatsu (Google)
Use ftrace_regs instead of fgraph_ret_regs for tracing return value on function_graph tracer because of simplifying the callback interface. The CONFIG_HAVE_FUNCTION_GRAPH_RETVAL is also replaced by CONFIG_HAVE_FUNCTION_GRAPH_FREGS. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173518991508.391279.16635322774382197642.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26fgraph: Pass ftrace_regs to entryfuncMasami Hiramatsu (Google)
Pass ftrace_regs to the fgraph_ops::entryfunc(). If ftrace_regs is not available, it passes a NULL instead. User callback function can access some registers (including return address) via this ftrace_regs. Note that the ftrace_regs can be NULL when the arch does NOT define: HAVE_DYNAMIC_FTRACE_WITH_ARGS or HAVE_DYNAMIC_FTRACE_WITH_REGS. More specifically, if HAVE_DYNAMIC_FTRACE_WITH_REGS is defined but not the HAVE_DYNAMIC_FTRACE_WITH_ARGS, and the ftrace ops used to register the function callback does not set FTRACE_OPS_FL_SAVE_REGS. In this case, ftrace_regs can be NULL in user callback. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173518990044.391279.17406984900626078579.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26tracing/string: Create and use __free(argv_free) in trace_dynevent.cSteven Rostedt
The function dyn_event_release() uses argv_split() which must be freed via argv_free(). It contains several error paths that do a goto out to call argv_free() for cleanup. This makes the code complex and error prone. Create a new __free() directive __free(argv_free) that will call argv_free() for data allocated with argv_split(), and use it in the dyn_event_release() function. Cc: Kees Cook <kees@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Shevchenko <andy@kernel.org> Cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/20241220103313.4a74ec8e@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26tracing: Switch trace_stat.c code over to use guard()Steven Rostedt
There are a couple functions in trace_stat.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.870318466@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26tracing: Switch trace_stack.c code over to use guard()Steven Rostedt
The function stack_trace_sysctl() uses a goto on the error path to jump to the mutex_unlock() code. Replace the logic to use guard() and let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.684913592@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26tracing: Switch trace_osnoise.c code over to use guard() and __free()Steven Rostedt
The osnoise_hotplug_workfn() grabs two mutexes and cpu_read_lock(). It has various gotos to handle unlocking them. Switch them over to guard() and let the compiler worry about it. The osnoise_cpus_read() has a temporary mask_str allocated and there's some gotos to make sure it gets freed on error paths. Switch that over to __free() to let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.517329690@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>