summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-02-18bpf: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/e4be2486f02a8e0ef5aa42624f1708d23e88ad57.1738746821.git.namcao@linutronix.de
2025-02-18time: Switch to hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/170bb691a0d59917c8268a98c80b607128fc9f7f.1738746821.git.namcao@linutronix.de
2025-02-18perf: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/f611e6d3fc6996bbcf0e19fe234f75edebe4332f.1738746821.git.namcao@linutronix.de
2025-02-18fork: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/174111145b945391e48936d6debcd43caec3e406.1738746821.git.namcao@linutronix.de
2025-02-18sched: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de
2025-02-18kallsyms: Remove KALLSYMS_ABSOLUTE_PERCPUBrian Gerst
x86-64 was the only user. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250123190747.745588-16-brgerst@gmail.com
2025-02-18posix-timers: Invoke cond_resched() during exit_itimers()Benjamin Segall
exit_itimers() loops through every timer in the process to delete it. This requires taking the system-wide hash_lock for each of these timers, and contends with other processes trying to create or delete timers. When a process creates hundreds of thousands of timers, and then exits while other processes contend with it, this can trigger softlockups on CONFIG_PREEMPT=n. Add a cond_resched() invocation into the loop to allow the system to make progress. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/xm2634gg2n23.fsf@google.com
2025-02-18hrtimers: Replace hrtimer_clock_to_base_table with switch-caseAndy Shevchenko
Clang and GCC complain about overlapped initialisers in the hrtimer_clock_to_base_table definition. With `make W=1` and CONFIG_WERROR=y (which is default nowadays) this breaks the build: CC kernel/time/hrtimer.o kernel/time/hrtimer.c:124:21: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides] 124 | [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME, kernel/time/hrtimer.c:122:27: note: previous initialization is here 122 | [0 ... MAX_CLOCKS - 1] = HRTIMER_MAX_CLOCK_BASES, (and similar for CLOCK_MONOTONIC, CLOCK_BOOTTIME, and CLOCK_TAI). hrtimer_clockid_to_base(), which uses the table, is only used in __hrtimer_init(), which is not a hotpath. Therefore replace the table lookup with a switch case in hrtimer_clockid_to_base() to avoid this warning. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250214134424.3367619-1-andriy.shevchenko@linux.intel.com
2025-02-18uprobes: Don't use %pK through printkThomas Weißschuh
Restricted pointers ("%pK") are not meant to be used through printk(). It can unintentionally expose security sensitive, raw pointer values. Use regular pointer formatting instead. For more background, see: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250217-restricted-pointers-uprobes-v1-1-e8cbe5bb22a7@linutronix.de
2025-02-18sched: Compact RSEQ concurrency IDs with reduced threads and affinityMathieu Desnoyers
When a process reduces its number of threads or clears bits in its CPU affinity mask, the mm_cid allocation should eventually converge towards smaller values. However, the change introduced by: commit 7e019dcc470f ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads") adds a per-mm/CPU recent_cid which is never unset unless a thread migrates. This is a tradeoff between: A) Preserving cache locality after a transition from many threads to few threads, or after reducing the hamming weight of the allowed CPU mask. B) Making the mm_cid upper bounds wrt nr threads and allowed CPU mask easy to document and understand. C) Allowing applications to eventually react to mm_cid compaction after reduction of the nr threads or allowed CPU mask, making the tracking of mm_cid compaction easier by shrinking it back towards 0 or not. D) Making sure applications that periodically reduce and then increase again the nr threads or allowed CPU mask still benefit from good cache locality with mm_cid. Introduce the following changes: * After shrinking the number of threads or reducing the number of allowed CPUs, reduce the value of max_nr_cid so expansion of CID allocation will preserve cache locality if the number of threads or allowed CPUs increase again. * Only re-use a recent_cid if it is within the max_nr_cid upper bound, else find the first available CID. Fixes: 7e019dcc470f ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lkml.kernel.org/r/20250210153253.460471-2-gmonaco@redhat.com
2025-02-17bpf: Allow struct_ops prog to return referenced kptrAmery Hung
Allow a struct_ops program to return a referenced kptr if the struct_ops operator's return type is a struct pointer. To make sure the returned pointer continues to be valid in the kernel, several constraints are required: 1) The type of the pointer must matches the return type 2) The pointer originally comes from the kernel (not locally allocated) 3) The pointer is in its unmodified form Implementation wise, a referenced kptr first needs to be allowed to _leak_ in check_reference_leak() if it is in the return register. Then, in check_return_code(), constraints 1-3 are checked. During struct_ops registration, a check is also added to warn about operators with non-struct pointer return. In addition, since the first user, Qdisc_ops::dequeue, allows a NULL pointer to be returned when there is no skb to be dequeued, we will allow a scalar value with value equals to NULL to be returned. In the future when there is a struct_ops user that always expects a valid pointer to be returned from an operator, we may extend tagging to the return value. We can tell the verifier to only allow NULL pointer return if the return value is tagged with MAY_BE_NULL. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250217190640.1748177-5-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-17bpf: Support getting referenced kptr from struct_ops argumentAmery Hung
Allows struct_ops programs to acqurie referenced kptrs from arguments by directly reading the argument. The verifier will acquire a reference for struct_ops a argument tagged with "__ref" in the stub function in the beginning of the main program. The user will be able to access the referenced kptr directly by reading the context as long as it has not been released by the program. This new mechanism to acquire referenced kptr (compared to the existing "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons. In the first use case, Qdisc_ops, an skb is passed to .enqueue in the first argument. This mechanism provides a natural way for users to get a referenced kptr in the .enqueue struct_ops programs and makes sure that a qdisc will always enqueue or drop the skb. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250217190640.1748177-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-17bpf: Make every prog keep a copy of ctx_arg_infoAmery Hung
Currently, ctx_arg_info is read-only in the view of the verifier since it is shared among programs of the same attach type. Make each program have their own copy of ctx_arg_info so that we can use it to store program specific information. In the next patch where we support acquiring a referenced kptr through a struct_ops argument tagged with "__ref", ctx_arg_info->ref_obj_id will be used to store the unique reference object id of the argument. This avoids creating a requirement in the verifier that "__ref" tagged arguments must be the first set of references acquired [0]. [0] https://lore.kernel.org/bpf/20241220195619.2022866-2-amery.hung@gmail.com/ Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250217190640.1748177-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-17Merge tag 'vfs-6.14-rc4.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "It was reported that the acct(2) system call can be used to trigger a NULL deref in cases where it is set to write to a file that triggers an internal lookup. This can e.g., happen when pointing acct(2) to /sys/power/resume. At the point the where the write to this file happens the calling task has already exited and called exit_fs() but an internal lookup might be triggered through lookup_bdev(). This may trigger a NULL-deref when accessing current->fs. Reorganize the code so that the the final write happens from the workqueue but with the caller's credentials. This preserves the (strange) permission model and has almost no regression risk. Also block access to kernel internal filesystems as well as procfs and sysfs in the first place. Various fixes for netfslib: - Fix a number of read-retry hangs, including: - Incorrect getting/putting of references on subreqs as we retry them - Failure to track whether a last old subrequest in a retried set is superfluous - Inconsistency in the usage of wait queues used for subrequests (ie. using clear_and_wake_up_bit() whilst waiting on a private waitqueue) - Add stats counters for retries and publish in /proc/fs/netfs/stats. This is not a fix per se, but is useful in debugging and shouldn't otherwise change the operation of the code - Fix the ordering of queuing subrequests with respect to setting the request flag that says we've now queued them all" * tag 'vfs-6.14-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix setting NETFS_RREQ_ALL_QUEUED to be after all subreqs queued netfs: Add retry stat counters netfs: Fix a number of read-retry hangs acct: block access to kernel internal filesystems acct: perform last write from workqueue
2025-02-17Merge 6.14-rc3 into tty-nextGreg Kroah-Hartman
We need the tty changes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17Merge 6.14-rc3 into driver-core-nextGreg Kroah-Hartman
We need the faux_device changes in here for future work. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-16Merge tag 'irq_urgent_for_v6.14_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq Kconfig cleanup from Borislav Petkov: - Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS * tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
2025-02-16Merge tag 'sched_urgent_for_v6.14_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Clarify what happens when a task is woken up from the wake queue and make clear its removal from that queue is atomic * tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Clarify wake_up_q()'s write to task->wake_q.next
2025-02-16sched_ext: idle: Per-node idle cpumasksAndrea Righi
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems. The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node. The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior. NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks. = Test = Hardware: - System: DGX B200 - CPUs: 224 SMT threads (112 physical cores) - Processor: INTEL(R) XEON(R) PLATINUM 8570 - 2 NUMA nodes Scheduler: - scx_simple [1] (so that we can focus at the built-in idle selection policy and not at the scheduling policy itself) Test: - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs: avg time | stdev ---------+------ before: 52.431s | 2.895 after: 50.342s | 2.895 = Conclusion = Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case. The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%. On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple). Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test. [1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODEAndrea Righi
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Make idle static keys privateAndrea Righi
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation. Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-15Merge tag 'trace-ring-buffer-v6.14-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull trace ring buffer fixes from Steven Rostedt: - Enable resize on mmap() error When a process mmaps a ring buffer, its size is locked and resizing is disabled. But if the user passes in a wrong parameter, the mmap() can fail after the resize was disabled and the mmap() exits with error without reenabling the ring buffer resize. This prevents the ring buffer from ever being resized after that. Reenable resizing of the ring buffer on mmap() error. - Have resizing return proper error and not always -ENOMEM If the ring buffer is mmapped by one task and another task tries to resize the buffer it will error with -ENOMEM. This is confusing to the user as there may be plenty of memory available. Have it return the error that actually happens (in this case -EBUSY) where the user can understand why the resize failed. - Test the sub-buffer array to validate persistent memory buffer On boot up, the initialization of the persistent memory buffer will do a validation check to see if the content of the data is valid, and if so, it will use the memory as is, otherwise it re-initializes it. There's meta data in this persistent memory that keeps track of which sub-buffer is the reader page and an array that states the order of the sub-buffers. The values in this array are indexes into the sub-buffers. The validator checks to make sure that all the entries in the array are within the sub-buffer list index, but it does not check for duplications. While working on this code, the array got corrupted and had duplicates, where not all the sub-buffers were accounted for. This passed the validator as all entries were valid, but the link list was incorrect and could have caused a crash. The corruption only produced incorrect data, but it could have been more severe. To fix this, create a bitmask that covers all the sub-buffer indexes and set it to all zeros. While iterating the array checking the values of the array content, have it set a bit corresponding to the index in the array. If the bit was already set, then it is a duplicate and mark the buffer as invalid and reset it. - Prevent mmap()ing persistent ring buffer The persistent ring buffer uses vmap() to map the persistent memory. Currently, the mmap() logic only uses virt_to_page() to get the page from the ring buffer memory and use that to map to user space. This works because a normal ring buffer uses alloc_page() to allocate its memory. But because the persistent ring buffer use vmap() it causes a kernel crash. Fixing this to work with vmap() is not hard, but since mmap() on persistent memory buffers never worked, just have the mmap() return -ENODEV (what was returned before mmap() for persistent memory ring buffers, as they never supported mmap. Normal buffers will still allow mmap(). Implementing mmap() for persistent memory ring buffers can wait till the next merge window. - Fix polling on persistent ring buffers There's a "buffer_percent" option (default set to 50), that is used to have reads of the ring buffer binary data block until the buffer fills to that percentage. The field "pages_touched" is incremented every time a new sub-buffer has content added to it. This field is used in the calculations to determine the amount of content is in the buffer and if it exceeds the "buffer_percent" then it will wake the task polling on the buffer. As persistent ring buffers can be created by the content from a previous boot, the "pages_touched" field was not updated. This means that if a task were to poll on the persistent buffer, it would block even if the buffer was completely full. It would block even if the "buffer_percent" was zero, because with "pages_touched" as zero, it would be calculated as the buffer having no content. Update pages_touched when initializing the persistent ring buffer from a previous boot. * tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Update pages_touched to reflect persistent buffer content tracing: Do not allow mmap() of persistent ring buffer ring-buffer: Validate the persistent meta data subbuf array tracing: Have the error of __tracing_resize_ring_buffer() passed to user ring-buffer: Unlock resize on mmap error
2025-02-15ring-buffer: Update pages_touched to reflect persistent buffer contentSteven Rostedt
The pages_touched field represents the number of subbuffers in the ring buffer that have content that can be read. This is used in accounting of "dirty_pages" and "buffer_percent" to allow the user to wait for the buffer to be filled to a certain amount before it reads the buffer in blocking mode. The persistent buffer never updated this value so it was set to zero, and this accounting would take it as it had no content. This would cause user space to wait for content even though there's enough content in the ring buffer that satisfies the buffer_percent. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15tracing: Do not allow mmap() of persistent ring bufferSteven Rostedt
When trying to mmap a trace instance buffer that is attached to reserve_mem, it would crash: BUG: unable to handle page fault for address: ffffe97bd00025c8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 2862f3067 P4D 2862f3067 PUD 0 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:validate_page_before_insert+0x5/0xb0 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000 FS: 00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0 Call Trace: <TASK> ? __die_body.cold+0x19/0x1f ? __die+0x2e/0x40 ? page_fault_oops+0x157/0x2b0 ? search_module_extables+0x53/0x80 ? validate_page_before_insert+0x5/0xb0 ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70 ? __bad_area_nosemaphore+0x16e/0x1b0 ? bad_area_nosemaphore+0x16/0x20 ? do_kern_addr_fault+0x77/0x90 ? exc_page_fault+0x22b/0x230 ? asm_exc_page_fault+0x2b/0x30 ? validate_page_before_insert+0x5/0xb0 ? vm_insert_pages+0x151/0x400 __rb_map_vma+0x21f/0x3f0 ring_buffer_map+0x21b/0x2f0 tracing_buffers_mmap+0x70/0xd0 __mmap_region+0x6f0/0xbd0 mmap_region+0x7f/0x130 do_mmap+0x475/0x610 vm_mmap_pgoff+0xf2/0x1d0 ksys_mmap_pgoff+0x166/0x200 __x64_sys_mmap+0x37/0x50 x64_sys_call+0x1670/0x1d70 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f The reason was that the code that maps the ring buffer pages to user space has: page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]); And uses that in: vm_insert_pages(vma, vma->vm_start, pages, &nr_pages); But virt_to_page() does not work with vmap()'d memory which is what the persistent ring buffer has. It is rather trivial to allow this, but for now just disable mmap() of instances that have their ring buffer from the reserve_mem option. If an mmap() is performed on a persistent buffer it will return -ENODEV just like it would if the .mmap field wasn't defined in the file_operations structure. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home Fixes: 9b7bdf6f6ece6 ("tracing: Have trace_printk not use binary prints if boot buffer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15kernfs: Use RCU to access kernfs_node::parent.Sebastian Andrzej Siewior
kernfs_rename_lock is used to obtain stable kernfs_node::{name|parent} pointer. This is a preparation to access kernfs_node::parent under RCU and ensure that the pointer remains stable under the RCU lifetime guarantees. For a complete path, as it is done in kernfs_path_from_node(), the kernfs_rename_lock is still required in order to obtain a stable parent relationship while computing the relevant node depth. This must not change while the nodes are inspected in order to build the path. If the kernfs user never moves the nodes (changes the parent) then the kernfs_rename_lock is not required and the RCU guarantees are sufficient. This "restriction" can be set with KERNFS_ROOT_INVARIANT_PARENT. Otherwise the lock is required. Rename kernfs_node::parent to kernfs_node::__parent to denote the RCU access and use RCU accessor while accessing the node. Make cgroup use KERNFS_ROOT_INVARIANT_PARENT since the parent here can not change. Acked-by: Tejun Heo <tj@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20250213145023.2820193-6-bigeasy@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-14bpf: Fix array bounds error with may_gotoJiayuan Chen
may_goto uses an additional 8 bytes on the stack, which causes the interpreters[] array to go out of bounds when calculating index by stack_size. 1. If a BPF program is rewritten, re-evaluate the stack size. For non-JIT cases, reject loading directly. 2. For non-JIT cases, calculating interpreters[idx] may still cause out-of-bounds array access, and just warn about it. 3. For jit_requested cases, the execution of bpf_func also needs to be warned. So move the definition of function __bpf_prog_ret0_warn out of the macro definition CONFIG_BPF_JIT_ALWAYS_ON. Reported-by: syzbot+d2a2c639d03ac200a4f1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/0000000000000f823606139faa5d@google.com/ Fixes: 011832b97b311 ("bpf: Introduce may_goto instruction") Signed-off-by: Jiayuan Chen <mrpre@163.com> Link: https://lore.kernel.org/r/20250214091823.46042-2-mrpre@163.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-14Merge tag 'sched_ext-for-6.14-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix lock imbalance in a corner case of dispatch_to_local_dsq() - Migration disabled tasks were confusing some BPF schedulers and its handling had a bug. Fix it and simplify the default behavior by dispatching them automatically - ops.tick(), ops.disable() and ops.exit_task() were incorrectly disallowing kfuncs that require the task argument to be the rq operation is currently operating on and thus is rq-locked. Allow them. - Fix autogroup migration handling bug which was occasionally triggering a warning in the cgroup migration path - tools/sched_ext, selftest and other misc updates * tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h. sched_ext: selftests: Fix grammar in tests description sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq() sched_ext: Fix migration disabled handling in targeted dispatches sched_ext: Implement auto local dispatching of migration disabled tasks sched_ext: Fix incorrect time delta calculation in time_delta() sched_ext: Fix lock imbalance in dispatch_to_local_dsq() sched_ext: selftests/dsp_local_on: Fix selftest on UP systems tools/sched_ext: Add helper to check task migration state sched_ext: Fix incorrect autogroup migration detection sched_ext: selftests/dsp_local_on: Fix sporadic failures selftests/sched_ext: Fix enum resolution sched_ext: Include task weight in the error state dump sched_ext: Fixes typos in comments
2025-02-14Merge tag 'cgroup-for-6.14-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix a race window where a newly forked task could escape cgroup.kill - Remove incorrectly included steal time from cpu.stat::usage_usec - Minor update in selftest * tag 'cgroup-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Remove steal time from usage_usec selftests/cgroup: use bash in test_cpuset_v1_hp.sh cgroup: fix race between fork and cgroup.kill
2025-02-14Merge tag 'wq-for-6.14-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: - Fix a regression where a worker pool can be freed before rescuer workers are done with it leading to user-after-free * tag 'wq-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Put the pwq after detaching the rescuer from the pool
2025-02-14ring-buffer: Validate the persistent meta data subbuf arraySteven Rostedt
The meta data for a mapped ring buffer contains an array of indexes of all the subbuffers. The first entry is the reader page, and the rest of the entries lay out the order of the subbuffers in how the ring buffer link list is to be created. The validator currently makes sure that all the entries are within the range of 0 and nr_subbufs. But it does not check if there are any duplicates. While working on the ring buffer, I corrupted this array, where I added duplicates. The validator did not catch it and created the ring buffer link list on top of it. Luckily, the corruption was only that the reader page was also in the writer path and only presented corrupted data but did not crash the kernel. But if there were duplicates in the writer side, then it could corrupt the ring buffer link list and cause a crash. Create a bitmask array with the size of the number of subbuffers. Then clear it. When walking through the subbuf array checking to see if the entries are within the range, test if its bit is already set in the subbuf_mask. If it is, then there is duplicates and fail the validation. If not, set the corresponding bit and continue. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214102820.7509ddea@gandalf.local.home Fixes: c76883f18e59b ("ring-buffer: Add test if range of boot buffer is valid") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14tracing: Have the error of __tracing_resize_ring_buffer() passed to userSteven Rostedt
Currently if __tracing_resize_ring_buffer() returns an error, the tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory issue that caused the function to fail. If the ring buffer is memory mapped, then the resizing of the ring buffer will be disabled. But if the user tries to resize the buffer, it will get an -ENOMEM returned, which is confusing because there is plenty of memory. The actual error returned was -EBUSY, which would make much more sense to the user. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-02-14ring-buffer: Unlock resize on mmap errorSteven Rostedt
Memory mapping the tracing ring buffer will disable resizing the buffer. But if there's an error in the memory mapping like an invalid parameter, the function exits out without re-enabling the resizing of the ring buffer, preventing the ring buffer from being resized after that. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250213131957.530ec3c5@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14workqueue: Log additional details when rejecting workWill Deacon
Syzbot regularly runs into the following warning on arm64: | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 current_wq_worker kernel/workqueue_internal.h:69 [inline] | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 is_chained_work kernel/workqueue.c:2199 [inline] | WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 __queue_work+0xe50/0x1308 kernel/workqueue.c:2256 | Modules linked in: | CPU: 1 UID: 0 PID: 6023 Comm: klogd Not tainted 6.13.0-rc2-syzkaller-g2e7aff49b5da #0 | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 | pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : __queue_work+0xe50/0x1308 kernel/workqueue_internal.h:69 | lr : current_wq_worker kernel/workqueue_internal.h:69 [inline] | lr : is_chained_work kernel/workqueue.c:2199 [inline] | lr : __queue_work+0xe50/0x1308 kernel/workqueue.c:2256 [...] | __queue_work+0xe50/0x1308 kernel/workqueue.c:2256 (L) | delayed_work_timer_fn+0x74/0x90 kernel/workqueue.c:2485 | call_timer_fn+0x1b4/0x8b8 kernel/time/timer.c:1793 | expire_timers kernel/time/timer.c:1839 [inline] | __run_timers kernel/time/timer.c:2418 [inline] | __run_timer_base+0x59c/0x7b4 kernel/time/timer.c:2430 | run_timer_base kernel/time/timer.c:2439 [inline] | run_timer_softirq+0xcc/0x194 kernel/time/timer.c:2449 The warning is probably because we are trying to queue work into a destroyed workqueue, but the softirq context makes it hard to pinpoint the problematic caller. Extend the warning diagnostics to print both the function we are trying to queue as well as the name of the workqueue. Cc: Tejun Heo <tj@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Link: https://syzkaller.appspot.com/bug?extid=e13e654d315d4da1277c Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14sched_ext: Provides a sysfs 'events' to expose core event countersChangwoo Min
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core event counters through the files system interface. Each line of the file shows the event name and its counter value. In addition, the format of scx_dump_event() is adjusted as the event name gets longer. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14x86/ibt: Clean up is_endbr()Peter Zijlstra
Pretty much every caller of is_endbr() actually wants to test something at an address and ends up doing get_kernel_nofault(). Fold the lot into a more convenient helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250207122546.181367417@infradead.org
2025-02-14Merge branch 'x86/mm'Peter Zijlstra
Depends on the simplifications from commit 1d7e707af446 ("Revert "x86/module: prepare module loading for ROX allocations of text"") Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-02-14module: don't annotate ROX memory as kmemleak_not_leak()Mike Rapoport (Microsoft)
The ROX memory allocations are part of a larger vmalloc allocation and annotating them with kmemleak_not_leak() confuses kmemleak. Skip kmemleak_not_leak() annotations for the ROX areas. Fixes: c287c0723329 ("module: switch to execmem API for remapping as RW and restoring ROX") Fixes: 64f6a4e10c05 ("x86: re-enable EXECMEM_ROX support") Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250214084531.3299390-1-rppt@kernel.org
2025-02-14sched/fair: Refactor can_migrate_task() to elimate loopingI Hsin Cheng
The function "can_migrate_task()" utilize "for_each_cpu_and" with a "if" statement inside to find the destination cpu. It's the same logic to find the first set bit of the result of the bitwise-AND of "env->dst_grpmask", "env->cpus" and "p->cpus_ptr". Refactor it by using "cpumask_first_and_and()" to perform bitwise-AND for "env->dst_grpmask", "env->cpus" and "p->cpus_ptr" and pick the first cpu within the intersection as the destination cpu, so we can elimate the need of looping and multiple times of branch. After the refactoring this part of the code can speed up from ~115ns to ~54ns, according to the test below. Ran the test for 5 times and the result is showned in the following table, and the test script is paste in next section. ------------------------------------------------------- |Old method| 130| 118| 115| 109| 106| avg ~115ns| ------------------------------------------------------- |New method| 58| 55| 54| 48| 55| avg ~54ns| ------------------------------------------------------- Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250210103019.283824-1-richard120310@gmail.com
2025-02-14sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasksTianchen Ding
When a task is enqueued and its parent cgroup se is already on_rq, this parent cgroup se will not be enqueued again, and hence the root->min_slice leaves unchanged. The same issue happens when a task is dequeued and its parent cgroup se has other runnable entities, and the parent cgroup se will not be dequeued. Force propagating min_slice when se doesn't need to be enqueued or dequeued. Ensure the se hierarchy always get the latest min_slice. Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250211063659.7180-1-dtcccc@linux.alibaba.com
2025-02-14sched: Don't define sched_clock_irqtime as static keyYafang Shao
The sched_clock_irqtime was defined as a static key in commit 8722903cbb8f ('sched: Define sched_clock_irqtime as static key'). However, this change introduces a 'sleeping in atomic context' warning, as shown below: arch/x86/kernel/tsc.c:1214 mark_tsc_unstable() warn: sleeping in atomic context As analyzed by Dan, the affected code path is as follows: vcpu_load() <- disables preempt -> kvm_arch_vcpu_load() -> mark_tsc_unstable() <- sleeps virt/kvm/kvm_main.c 166 void vcpu_load(struct kvm_vcpu *vcpu) 167 { 168 int cpu = get_cpu(); ^^^^^^^^^^ This get_cpu() disables preemption. 169 170 __this_cpu_write(kvm_running_vcpu, vcpu); 171 preempt_notifier_register(&vcpu->preempt_notifier); 172 kvm_arch_vcpu_load(vcpu, cpu); 173 put_cpu(); 174 } arch/x86/kvm/x86.c 4979 if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) { 4980 s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 : 4981 rdtsc() - vcpu->arch.last_host_tsc; 4982 if (tsc_delta < 0) 4983 mark_tsc_unstable("KVM discovered backwards TSC"); arch/x86/kernel/tsc.c 1206 void mark_tsc_unstable(char *reason) 1207 { 1208 if (tsc_unstable) 1209 return; 1210 1211 tsc_unstable = 1; 1212 if (using_native_sched_clock()) 1213 clear_sched_clock_stable(); --> 1214 disable_sched_clock_irqtime(); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ kernel/jump_label.c 245 void static_key_disable(struct static_key *key) 246 { 247 cpus_read_lock(); ^^^^^^^^^^^^^^^^ This lock has a might_sleep() in it which triggers the static checker warning. 248 static_key_disable_cpuslocked(key); 249 cpus_read_unlock(); 250 } Let revert this change for now as {disable,enable}_sched_clock_irqtime are used in many places, as pointed out by Sean, including the following: The code path in clocksource_watchdog(): clocksource_watchdog() | -> spin_lock(&watchdog_lock); | -> __clocksource_unstable() | -> clocksource.mark_unstable() == tsc_cs_mark_unstable() | -> disable_sched_clock_irqtime() And the code path in sched_clock_register(): /* Cannot register a sched_clock with interrupts on */ local_irq_save(flags); ... /* Enable IRQ time accounting if we have a fast enough sched_clock() */ if (irqtime > 0 || (irqtime == -1 && rate >= 1000000)) enable_sched_clock_irqtime(); local_irq_restore(flags); [lkp@intel.com: reported a build error in the prev version] Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/ Fixes: 8722903cbb8f ("sched: Define sched_clock_irqtime as static key") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Debugged-by: Dan Carpenter <dan.carpenter@linaro.org> Debugged-by: Sean Christopherson <seanjc@google.com> Debugged-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com
2025-02-14sched: Reduce the default slice to avoid tasks getting an extra tickzihan zhou
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which means that we have a default slice of: 0.75 for 1 cpu 1.50 up to 3 cpus 2.25 up to 7 cpus 3.00 for 8 cpus and above. For HZ=250 and HZ=100, because of the tick accuracy, the runtime of tasks is far higher than their slice. For HZ=1000 with 8 cpus or more, the accuracy of tick is already satisfactory, but there is still an issue that tasks will get an extra tick because the tick often arrives a little faster than expected. In this case, the task can only wait until the next tick to consider that it has reached its deadline, and will run 1ms longer. vruntime + sysctl_sched_base_slice = deadline |-----------|-----------|-----------|-----------| 1ms 1ms 1ms 1ms ^ ^ ^ ^ tick1 tick2 tick3 tick4(nearly 4ms) There are two reasons for tick error: clockevent precision and the CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. with CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even without it, because of clockevent precision, tick still often less than 1ms. In order to make scheduling more precise, we changed 0.75 to 0.70, Using 0.70 instead of 0.75 should not change much for other configs and would fix this issue: 0.70 for 1 cpu 1.40 up to 3 cpus 2.10 up to 7 cpus 2.8 for 8 cpus and above. This does not guarantee that tasks can run the slice time accurately every time, but occasionally running an extra tick has little impact. Signed-off-by: zihan zhou <15645113830zzh@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20250208075322.13139-1-15645113830zzh@gmail.com
2025-02-14sched: Cancel the slice protection of the idle entityzihan zhou
A wakeup non-idle entity should preempt idle entity at any time, but because of the slice protection of the idle entity, the non-idle entity has to wait, so just cancel it. This patch is aimed at minimizing the impact of SCHED_IDLE on SCHED_NORMAL. For example, a task with SCHED_IDLE policy that sleeps for 1s and then runs for 3 ms, running cyclictest on the same cpu, has a maximum latency of 3 ms, which is caused by the slice protection of the idle entity. It is unreasonable. With this patch, the cyclictest latency under the same conditions is basically the same on the cpu with idle processes and on empty cpu. [peterz: add helpers] Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption") Signed-off-by: zihan zhou <15645113830zzh@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20250208080850.16300-1-15645113830zzh@gmail.com
2025-02-14Revert "kernel/debug: Mask KGDB NMI upon entry"Douglas Anderson
This reverts commit 5a14fead07bcf4e0acc877a8d9e1d1f40a441153. No architectures ever implemented `enable_nmi` since the later patches in the series adding it never landed. It's been a long time. Drop it. NOTE: this is not a clean revert due to changes in the file in the meantime. Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20250129082535.3.I2254953cd852f31f354456689d68b2d910de3fbe@changeid Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-14Revert "kdb: Implement disable_nmi command"Douglas Anderson
This reverts commit ad394f66fa57ae66014cb74f337e2820bac4c417. No architectures ever implemented `enable_nmi` since the later patches in the series adding it never landed. It's been a long time. Drop it. NOTE: this is not a clean revert due to changes in the file in the meantime. Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20250129082535.2.Ib91bfb95bdcf77591257a84063fdeb5b4dce65b1@changeid Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-13bpf: fs/xattr: Add BPF kfuncs to set and remove xattrsSong Liu
Add the following kfuncs to set and remove xattrs from BPF programs: bpf_set_dentry_xattr bpf_remove_dentry_xattr bpf_set_dentry_xattr_locked bpf_remove_dentry_xattr_locked The _locked version of these kfuncs are called from hooks where dentry->d_inode is already locked. Instead of requiring the user to know which version of the kfuncs to use, the verifier will pick the proper kfunc based on the calling hook. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250130213549.3353349-5-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-13bpf: lsm: Add two more sleepable hooksSong Liu
Add bpf_lsm_inode_removexattr and bpf_lsm_inode_post_removexattr to list sleepable_lsm_hooks. These two hooks are always called from sleepable context. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250130213549.3353349-4-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-13bpf: Add tracepoints with null-able argumentsJiri Olsa
Some of the tracepoints slipped when we did the first scan, adding them now. Fixes: 838a10bd2ebf ("bpf: Augment raw_tp arguments with PTR_MAYBE_NULL") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250210175913.2893549-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-13cpu: export lockdep_assert_cpus_held()Hamza Mahfooz
If CONFIG_HYPERV=m, lockdep_assert_cpus_held() is undefined for HyperV. So, export the function so that GPL drivers can use it more broadly. Cc: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250117203309.192072-1-hamzamahfooz@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250117203309.192072-1-hamzamahfooz@linux.microsoft.com>
2025-02-13sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUPTejun Heo
A task wakeup can be either processed on the waker's CPU or bounced to the wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU avoids the waker's CPU locking and accessing the wakee's rq which can be expensive across cache and node boundaries. When ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu() may be skipped in some cases (racing against the wakee switching out). As this confused some BPF schedulers, there wasn't a good way for a BPF scheduler to tell whether idle CPU selection has been skipped, ops.enqueue() couldn't insert tasks into foreign local DSQs, and the performance difference on machines with simple toplogies were minimal, sched_ext disabled ttwu_queue. However, this optimization makes noticeable difference on more complex topologies and a BPF scheduler now has an easy way tell whether ops.select_cpu() was skipped since 9b671793c7d9 ("sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs since 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches"). Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose to enable ttwu_queue optimization. v2: Update the patch description and comment re. ops.select_cpu() being skipped in some cases as opposed to always as per Neel. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Neel Natu <neelnatu@google.com> Reported-by: Barret Rhoden <brho@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-13sched_ext: Use SCX_CALL_OP_TASK in task_tick_scxChuyi Zhou
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of the current task, the following error will occur: scx_foo[3795244] triggered exit kind 1024: runtime error (called on a task not being operated on) The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error during the subsequent scx_kf_allowed_on_arg_tasks() check. SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") to ensure task's rq lock is held when accessing task's sched_group. Since ops.tick() is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly, the same changes should be made for ops.disable() and ops.exit_task(), as they are also protected by task_rq_lock() and it's safe to access the task's task_group. Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>