summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-12-05clocksource: Make negative motion detection more robustThomas Gleixner
Guenter reported boot stalls on a emulated ARM 32-bit platform, which has a 24-bit wide clocksource. It turns out that the calculated maximal idle time, which limits idle sleeps to prevent clocksource wrap arounds, is close to the point where the negative motion detection triggers. max_idle_ns: 597268854 ns negative motion tripping point: 671088640 ns If the idle wakeup is delayed beyond that point, the clocksource advances far enough to trigger the negative motion detection. This prevents the clock to advance and in the worst case the system stalls completely if the consecutive sleeps based on the stale clock are delayed as well. Cure this by calculating a more robust cut-off value for negative motion, which covers 87.5% of the actual clocksource counter width. Compare the delta against this value to catch negative motion. This is specifically for clock sources with a small counter width as their wrap around time is close to the half counter width. For clock sources with wide counters this is not a problem because the maximum idle time is far from the half counter width due to the math overflow protection constraints. For the case at hand this results in a tripping point of 1174405120ns. Note, that this cannot prevent issues when the delay exceeds the 87.5% margin, but that's not different from the previous unchecked version which allowed arbitrary time jumps. Systems with small counter width are prone to invalid results, but this problem is unlikely to be seen on real hardware. If such a system completely stalls for more than half a second, then there are other more urgent problems than the counter wrapping around. Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.net
2024-12-05tracing: Fix archs that still call tracepoints without RCU watchingSteven Rostedt
Tracepoints require having RCU "watching" as it uses RCU to do updates to the tracepoints. There are some cases that would call a tracepoint when RCU was not "watching". This was usually in the idle path where RCU has "shutdown". For the few locations that had tracepoints without RCU watching, there was an trace_*_rcuidle() variant that could be used. This used SRCU for protection. There are tracepoints that trace when interrupts and preemption are enabled and disabled. In some architectures, these tracepoints are called in a path where RCU is not watching. When x86 and arm64 removed these locations, it was incorrectly assumed that it would be safe to remove the trace_*_rcuidle() variant and also remove the SRCU logic, as it made the code more complex and harder to implement new tracepoint features (like faultable tracepoints and tracepoints in rust). Instead of bringing back the trace_*_rcuidle(), as it will not be trivial to do as new code has already been added depending on its removal, add a workaround to the one file that still requires it (trace_preemptirq.c). If the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if the code is in the idle path, and if so, call ct_irq_enter/exit() which will enable RCU around the tracepoint. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()") Closes: https://lore.kernel.org/all/bddb02de-957a-4df5-8e77-829f55728ea2@roeck-us.net/ Acked-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-05smp/scf: Evaluate local cond_func() before IPI side-effectsMathieu Desnoyers
In smp_call_function_many_cond(), the local cond_func() is evaluated after triggering the remote CPU IPIs. If cond_func() depends on loading shared state updated by other CPU's IPI handlers func(), then triggering execution of remote CPUs IPI before evaluating cond_func() may have unexpected consequences. One example scenario is evaluating a jiffies delay in cond_func(), which is updated by func() in the IPI handlers. This situation can prevent execution of periodic cleanup code on the local CPU. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20241203163558.3455535-1-mathieu.desnoyers@efficios.com
2024-12-05PM: sleep: autosleep: don't include 'pm_wakeup.h' directlyWolfram Sang
The header clearly states that it does not want to be included directly, only via 'device.h'. 'platform_device.h' works equally well. Remove the direct inclusion. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://patch.msgid.link/20241118072917.3853-16-wsa+renesas@sang-engineering.com [ rjw: Subject edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-12-04audit: workaround a GCC bug triggered by task comm changesYafang shao
A build failure has been reported with the following details: In file included from include/linux/string.h:390, from include/linux/bitmap.h:13, from include/linux/cpumask.h:12, from include/linux/smp.h:13, from include/linux/lockdep.h:14, from include/linux/spinlock.h:63, from include/linux/wait.h:9, from include/linux/wait_bit.h:8, from include/linux/fs.h:6, from kernel/auditsc.c:37: In function 'sized_strscpy', inlined from '__audit_ptrace' at kernel/auditsc.c:2732:2: >> include/linux/fortify-string.h:293:17: error: call to '__write_overflow' declared with attribute error: detected write beyond size of object (1st parameter) 293 | __write_overflow(); | ^~~~~~~~~~~~~~~~~~ In function 'sized_strscpy', inlined from 'audit_signal_info_syscall' at kernel/auditsc.c:2759:3: >> include/linux/fortify-string.h:293:17: error: call to '__write_overflow' declared with attribute error: detected write beyond size of object (1st parameter) 293 | __write_overflow(); | ^~~~~~~~~~~~~~~~~~ The issue appears to be a GCC bug, though the root cause remains unclear at this time. For now, let's implement a workaround. A bug report has also been filed with GCC [0]. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117912 [0] Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410171420.1V00ICVG-lkp@intel.com/ Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20241128182435.57a1ea6f@gandalf.local.home/ Reported-by: Zhuo, Qiuxu <qiuxu.zhuo@intel.com> Closes: https://lore.kernel.org/all/CY8PR11MB71348E568DBDA576F17DAFF389362@CY8PR11MB7134.namprd11.prod.outlook.com/ Originally-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/linux-hardening/202410171059.C2C395030@keescook/ Signed-off-by: Yafang shao <laoar.shao@gmail.com> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Yafang Shao <laoar.shao@gmail.com> [PM: subject tweak, description line wrapping] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04sched_ext: Use the NUMA scheduling domain for NUMA optimizationsAndrea Righi
Rely on the NUMA scheduling domain topology, instead of accessing NUMA topology information directly. There is basically no functional change, but in this way we ensure consistent use of the same topology information determined by the scheduling subsystem. Fixes: f6ce6b949304 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-04lsm: replace context+len with lsm_contextCasey Schaufler
Replace the (secctx,seclen) pointer pair with a single lsm_context pointer to allow return of the LSM identifier along with the context and context length. This allows security_release_secctx() to know how to release the context. Callers have been modified to use or save the returned data from the new structure. security_secid_to_secctx() and security_lsmproc_to_secctx() will now return the length value on success instead of 0. Cc: netdev@vger.kernel.org Cc: audit@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: Todd Kjos <tkjos@google.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject tweak, kdoc fix, signedness fix from Dan Carpenter] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04bpf: Fix narrow scalar spill onto 64-bit spilled scalar slotsTao Lyu
When CAP_PERFMON and CAP_SYS_ADMIN (allow_ptr_leaks) are disabled, the verifier aims to reject partial overwrite on an 8-byte stack slot that contains a spilled pointer. However, in such a scenario, it rejects all partial stack overwrites as long as the targeted stack slot is a spilled register, because it does not check if the stack slot is a spilled pointer. Incomplete checks will result in the rejection of valid programs, which spill narrower scalar values onto scalar slots, as shown below. 0: R1=ctx() R10=fp0 ; asm volatile ( @ repro.bpf.c:679 0: (7a) *(u64 *)(r10 -8) = 1 ; R10=fp0 fp-8_w=1 1: (62) *(u32 *)(r10 -8) = 1 attempt to corrupt spilled pointer on stack processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0. Fix this by expanding the check to not consider spilled scalar registers when rejecting the write into the stack. Previous discussion on this patch is at link [0]. [0]: https://lore.kernel.org/bpf/20240403202409.2615469-1-tao.lyu@epfl.ch Fixes: ab125ed3ec1c ("bpf: fix check for attempt to corrupt spilled pointer") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_miscKumar Kartikeya Dwivedi
Inside mark_stack_slot_misc, we should not upgrade STACK_INVALID to STACK_MISC when allow_ptr_leaks is false, since invalid contents shouldn't be read unless the program has the relevant capabilities. The relaxation only makes sense when env->allow_ptr_leaks is true. However, such conversion in privileged mode becomes unnecessary, as invalid slots can be read without being upgraded to STACK_MISC. Currently, the condition is inverted (i.e. checking for true instead of false), simply remove it to restore correct behavior. Fixes: eaf18febd6eb ("bpf: preserve STACK_ZERO slots on partial reg spills") Acked-by: Andrii Nakryiko <andrii@kernel.org> Reported-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Improve verifier log for resource leak on exitKumar Kartikeya Dwivedi
The verifier log when leaking resources on BPF_EXIT may be a bit confusing, as it's a problem only when finally existing from the main prog, not from any of the subprogs. Hence, update the verifier error string and the corresponding selftests matching on it. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Introduce support for bpf_local_irq_{save,restore}Kumar Kartikeya Dwivedi
Teach the verifier about IRQ-disabled sections through the introduction of two new kfuncs, bpf_local_irq_save, to save IRQ state and disable them, and bpf_local_irq_restore, to restore IRQ state and enable them back again. For the purposes of tracking the saved IRQ state, the verifier is taught about a new special object on the stack of type STACK_IRQ_FLAG. This is a 8 byte value which saves the IRQ flags which are to be passed back to the IRQ restore kfunc. Renumber the enums for REF_TYPE_* to simplify the check in find_lock_state, filtering out non-lock types as they grow will become cumbersome and is unecessary. To track a dynamic number of IRQ-disabled regions and their associated saved states, a new resource type RES_TYPE_IRQ is introduced, which its state management functions: acquire_irq_state and release_irq_state, taking advantage of the refactoring and clean ups made in earlier commits. One notable requirement of the kernel's IRQ save and restore API is that they cannot happen out of order. For this purpose, when releasing reference we keep track of the prev_id we saw with REF_TYPE_IRQ. Since reference states are inserted in increasing order of the index, this is used to remember the ordering of acquisitions of IRQ saved states, so that we maintain a logical stack in acquisition order of resource identities, and can enforce LIFO ordering when restoring IRQ state. The top of the stack is maintained using bpf_verifier_state's active_irq_id. To maintain the stack property when releasing reference states, we need to modify release_reference_state to instead shift the remaining array left using memmove instead of swapping deleted element with last that might break the ordering. A selftest to test this subtle behavior is added in late patches. The logic to detect initialized and unitialized irq flag slots, marking and unmarking is similar to how it's done for iterators. No additional checks are needed in refsafe for REF_TYPE_IRQ, apart from the usual check_id satisfiability check on the ref[i].id. We have to perform the same check_ids check on state->active_irq_id as well. To ensure we don't get assigned REF_TYPE_PTR by default after acquire_reference_state, if someone forgets to assign the type, let's also renumber the enum ref_state_type. This way any unassigned types get caught by refsafe's default switch statement, don't assume REF_TYPE_PTR by default. The kfuncs themselves are plain wrappers over local_irq_save and local_irq_restore macros. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Refactor mark_{dynptr,iter}_readKumar Kartikeya Dwivedi
There is possibility of sharing code between mark_dynptr_read and mark_iter_read for updating liveness information of their stack slots. Consolidate common logic into mark_stack_slot_obj_read function in preparation for the next patch which needs the same logic for its own stack slots. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Refactor {acquire,release}_reference_stateKumar Kartikeya Dwivedi
In preparation for introducing support for more reference types which have to add and remove reference state, refactor the acquire_reference_state and release_reference_state functions to share common logic. The acquire_reference_state function simply handles growing the acquired refs and returning the pointer to the new uninitialized element, which can be filled in by the caller. The release_reference_state function simply erases a reference state entry in the acquired_refs array and shrinks it. The callers are responsible for finding the suitable element by matching on various fields of the reference state and requesting deletion through this function. It is not supposed to be called directly. Existing callers of release_reference_state were using it to find and remove state for a given ref_obj_id without scrubbing the associated registers in the verifier state. Introduce release_reference_nomark to provide this functionality and convert callers. We now use this new release_reference_nomark function within release_reference as well. It needs to operate on a verifier state instead of taking verifier env as mark_ptr_or_null_regs requires operating on verifier state of the two branches of a NULL condition check, therefore env->cur_state cannot be used directly. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Consolidate locks and reference state in verifier stateKumar Kartikeya Dwivedi
Currently, state for RCU read locks and preemption is in bpf_verifier_state, while locks and pointer reference state remains in bpf_func_state. There is no particular reason to keep the latter in bpf_func_state. Additionally, it is copied into a new frame's state and copied back to the caller frame's state everytime the verifier processes a pseudo call instruction. This is a bit wasteful, given this state is global for a given verification state / path. Move all resource and reference related state in bpf_verifier_state structure in this patch, in preparation for introducing new reference state types in the future. Since we switch print_verifier_state and friends to print using vstate, we now need to explicitly pass in the verifier state from the caller along with the bpf_func_state, so modify the prototype and callers to do so. To ensure func state matches the verifier state when we're printing data, take in frame number instead of bpf_func_state pointer instead and avoid inconsistencies induced by the caller. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204030400.208005-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04lsm: ensure the correct LSM context releaserCasey Schaufler
Add a new lsm_context data structure to hold all the information about a "security context", including the string, its size and which LSM allocated the string. The allocation information is necessary because LSMs have different policies regarding the lifecycle of these strings. SELinux allocates and destroys them on each use, whereas Smack provides a pointer to an entry in a list that never goes away. Update security_release_secctx() to use the lsm_context instead of a (char *, len) pair. Change its callers to do likewise. The LSMs supporting this hook have had comments added to remind the developer that there is more work to be done. The BPF security module provides all LSM hooks. While there has yet to be a known instance of a BPF configuration that uses security contexts, the possibility is real. In the existing implementation there is potential for multiple frees in that case. Cc: linux-integrity@vger.kernel.org Cc: netdev@vger.kernel.org Cc: audit@vger.kernel.org Cc: netfilter-devel@vger.kernel.org To: Pablo Neira Ayuso <pablo@netfilter.org> Cc: linux-nfs@vger.kernel.org Cc: Todd Kjos <tkjos@google.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04tracing: Fix cmp_entries_dup() to respect sort() comparison rulesKuan-Wei Chiu
The cmp_entries_dup() function used as the comparator for sort() violated the symmetry and transitivity properties required by the sorting algorithm. Specifically, it returned 1 whenever memcmp() was non-zero, which broke the following expectations: * Symmetry: If x < y, then y > x. * Transitivity: If x < y and y < z, then x < z. These violations could lead to incorrect sorting and failure to correctly identify duplicate elements. Fix the issue by directly returning the result of memcmp(), which adheres to the required comparison properties. Cc: stable@vger.kernel.org Fixes: 08d43a5fa063 ("tracing: Add lock-free tracing_map") Link: https://lore.kernel.org/20241203202228.1274403-1-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-03genirq/proc: Add missing space separator backThomas Gleixner
The recent conversion of show_interrupts() to seq_put_decimal_ull_width() caused a formatting regression as it drops a previosuly existing space separator. Add it back by unconditionally inserting a space after the interrupt counts and removing the extra leading space from the chip name prints. Fixes: f9ed1f7c2e26 ("genirq/proc: Use seq_put_decimal_ull_width() for decimal values") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: David Wang <00107082@163.com> Link: https://lore.kernel.org/all/87zfldt5g4.ffs@tglx Closes: https://lore.kernel.org/all/4ce18851-6e9f-bbe-8319-cc5e69fb45c@linux-m68k.org
2024-12-03genirq: Reuse irq_thread_fn() for forced thread caseAndy Shevchenko
rq_forced_thread_fn() uses the same action callback as the non-forced variant but with different locking decorations. Reuse irq_thread_fn() here to make that clear. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241119104339.2112455-3-andriy.shevchenko@linux.intel.com
2024-12-03genirq: Move irq_thread_fn() further up in the codeAndy Shevchenko
In a preparation to reuse irq_thread_fn() move it further up in the code. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241119104339.2112455-2-andriy.shevchenko@linux.intel.com
2024-12-02bpf: Zero index arg error string for dynptr and iterKumar Kartikeya Dwivedi
Andrii spotted that process_dynptr_func's rejection of incorrect argument register type will print an error string where argument numbers are not zero-indexed, unlike elsewhere in the verifier. Fix this by subtracting 1 from regno. The same scenario exists for iterator messages. Fix selftest error strings that match on the exact argument number while we're at it to ensure clean bisection. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241203002235.3776418-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02bpf: Ensure reg is PTR_TO_STACK in process_iter_argTao Lyu
Currently, KF_ARG_PTR_TO_ITER handling missed checking the reg->type and ensuring it is PTR_TO_STACK. Instead of enforcing this in the caller of process_iter_arg, move the check into it instead so that all callers will gain the check by default. This is similar to process_dynptr_func. An existing selftest in verifier_bits_iter.c fails due to this change, but it's because it was passing a NULL pointer into iter_next helper and getting an error further down the checks, but probably meant to pass an uninitialized iterator on the stack (as is done in the subsequent test below it). We will gain coverage for non-PTR_TO_STACK arguments in later patches hence just change the declaration to zero-ed stack object. Fixes: 06accc8779c1 ("bpf: add support for open-coded iterator loops") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Lyu <tao.lyu@epfl.ch> [ Kartikeya: move check into process_iter_arg, rewrite commit log ] Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241203000238.3602922-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02bpf: Refactor bpf_tracing_func_proto() and remove bpf_get_probe_write_proto()Marco Elver
With bpf_get_probe_write_proto() no longer printing a message, we can avoid it being a special case with its own permission check. Refactor bpf_tracing_func_proto() similar to bpf_base_func_proto() to have a section conditional on bpf_token_capable(CAP_SYS_ADMIN), where the proto for bpf_probe_write_user() is returned. Finally, remove the unnecessary bpf_get_probe_write_proto(). This simplifies the code, and adding additional CAP_SYS_ADMIN-only helpers in future avoids duplicating the same CAP_SYS_ADMIN check. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Marco Elver <elver@google.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20241129090040.2690691-2-elver@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02bpf: Remove bpf_probe_write_user() warning messageMarco Elver
The warning message for bpf_probe_write_user() was introduced in 96ae52279594 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers"), with the following in the commit message: Given this feature is meant for experiments, and it has a risk of crashing the system, and running programs, we print a warning on when a proglet that attempts to use this helper is installed, along with the pid and process name. After 8 years since 96ae52279594, bpf_probe_write_user() has found successful applications beyond experiments [1, 2], with no other good alternatives. Despite its intended purpose for "experiments", that doesn't stop Hyrum's law, and there are likely many more users depending on this helper: "[..] it does not matter what you promise [..] all observable behaviors of your system will be depended on by somebody." The ominous "helper that may corrupt user memory!" has offered no real benefit, and has been found to lead to confusion where the system administrator is loading programs with valid use cases. As such, remove the warning message. Link: https://lore.kernel.org/lkml/20240404190146.1898103-1-elver@google.com/ [1] Link: https://lore.kernel.org/r/lkml/CAAn3qOUMD81-vxLLfep0H6rRd74ho2VaekdL4HjKq+Y1t9KdXQ@mail.gmail.com/ [2] Link: https://lore.kernel.org/all/CAEf4Bzb4D_=zuJrg3PawMOW3KqF8JvJm9SwF81_XHR2+u5hkUg@mail.gmail.com/ Signed-off-by: Marco Elver <elver@google.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20241129090040.2690691-1-elver@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISEWaiman Long
As all the non-domain and non-managed_irq housekeeping types have been unified to HK_TYPE_KERNEL_NOISE, replace all these references in the scheduler to use HK_TYPE_KERNEL_NOISE. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-5-longman@redhat.com
2024-12-02sched/isolation: Consolidate housekeeping cpumasks that are always identicalWaiman Long
The housekeeping cpumasks are only set by two boot commandline parameters: "nohz_full" and "isolcpus". When there is more than one of "nohz_full" or "isolcpus", the extra ones must have the same CPU list or the setup will fail partially. The HK_TYPE_DOMAIN and HK_TYPE_MANAGED_IRQ types are settable by "isolcpus" only and their settings can be independent of the other types. The other housekeeping types are all set by "nohz_full" or "isolcpus=nohz" without a way to set them individually. So they all have identical cpumasks. There is actually no point in having different cpumasks for these "nohz_full" only housekeeping types. Consolidate these types to use the same cpumask by aliasing them to the same value. If there is a need to set any of them independently in the future, we can break them out to their own cpumasks again. With this change, the number of cpumasks in the housekeeping structure drops from 9 to 3. Other than that, there should be no other functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-4-longman@redhat.com
2024-12-02sched/isolation: Make "isolcpus=nohz" equivalent to "nohz_full"Waiman Long
The "isolcpus=nohz" boot parameter and flag were used to disable tick when running a single task. Nowsdays, this "nohz" flag is seldomly used as it is included as part of the "nohz_full" parameter. Extend this flag to cover other kernel noises disabled by the "nohz_full" parameter to make them equivalent. This also eliminates the need to use both the "isolcpus" and the "nohz_full" parameters to fully isolated a given set of CPUs. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-3-longman@redhat.com
2024-12-02sched/core: Remove HK_TYPE_SCHEDWaiman Long
The HK_TYPE_SCHED housekeeping type is defined but not set anywhere. So any code that try to use HK_TYPE_SCHED are essentially dead code. So remove HK_TYPE_SCHED and any code that use it. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-2-longman@redhat.com
2024-12-02uprobes: add speculative lockless VMA-to-inode-to-uprobe resolutionAndrii Nakryiko
Given filp_cachep is marked SLAB_TYPESAFE_BY_RCU (and FMODE_BACKING files, a special case, now goes through RCU-delated freeing), we can safely access vma->vm_file->f_inode field locklessly under just rcu_read_lock() protection, which enables looking up uprobe from uprobes_tree completely locklessly and speculatively without the need to acquire mmap_lock for reads. In most cases, anyway, assuming that there are no parallel mm and/or VMA modifications. The underlying struct file's memory won't go away from under us (even if struct file can be reused in the meantime). We rely on newly added mmap_lock_speculate_{try_begin,retry}() helpers to validate that mm_struct stays intact for entire duration of this speculation. If not, we fall back to mmap_lock-protected lookup. The speculative logic is written in such a way that it will safely handle any garbage values that might be read from vma or file structs. Benchmarking results speak for themselves. BEFORE (latest tip/perf/core) ============================= uprobe-nop ( 1 cpus): 3.384 ± 0.004M/s ( 3.384M/s/cpu) uprobe-nop ( 2 cpus): 5.456 ± 0.005M/s ( 2.728M/s/cpu) uprobe-nop ( 3 cpus): 7.863 ± 0.015M/s ( 2.621M/s/cpu) uprobe-nop ( 4 cpus): 9.442 ± 0.008M/s ( 2.360M/s/cpu) uprobe-nop ( 5 cpus): 11.036 ± 0.013M/s ( 2.207M/s/cpu) uprobe-nop ( 6 cpus): 10.884 ± 0.019M/s ( 1.814M/s/cpu) uprobe-nop ( 7 cpus): 7.897 ± 0.145M/s ( 1.128M/s/cpu) uprobe-nop ( 8 cpus): 10.021 ± 0.128M/s ( 1.253M/s/cpu) uprobe-nop (10 cpus): 9.932 ± 0.170M/s ( 0.993M/s/cpu) uprobe-nop (12 cpus): 8.369 ± 0.056M/s ( 0.697M/s/cpu) uprobe-nop (14 cpus): 8.678 ± 0.017M/s ( 0.620M/s/cpu) uprobe-nop (16 cpus): 7.392 ± 0.003M/s ( 0.462M/s/cpu) uprobe-nop (24 cpus): 5.326 ± 0.178M/s ( 0.222M/s/cpu) uprobe-nop (32 cpus): 5.426 ± 0.059M/s ( 0.170M/s/cpu) uprobe-nop (40 cpus): 5.262 ± 0.070M/s ( 0.132M/s/cpu) uprobe-nop (48 cpus): 6.121 ± 0.010M/s ( 0.128M/s/cpu) uprobe-nop (56 cpus): 6.252 ± 0.035M/s ( 0.112M/s/cpu) uprobe-nop (64 cpus): 7.644 ± 0.023M/s ( 0.119M/s/cpu) uprobe-nop (72 cpus): 7.781 ± 0.001M/s ( 0.108M/s/cpu) uprobe-nop (80 cpus): 8.992 ± 0.048M/s ( 0.112M/s/cpu) AFTER ===== uprobe-nop ( 1 cpus): 3.534 ± 0.033M/s ( 3.534M/s/cpu) uprobe-nop ( 2 cpus): 6.701 ± 0.007M/s ( 3.351M/s/cpu) uprobe-nop ( 3 cpus): 10.031 ± 0.007M/s ( 3.344M/s/cpu) uprobe-nop ( 4 cpus): 13.003 ± 0.012M/s ( 3.251M/s/cpu) uprobe-nop ( 5 cpus): 16.274 ± 0.006M/s ( 3.255M/s/cpu) uprobe-nop ( 6 cpus): 19.563 ± 0.024M/s ( 3.261M/s/cpu) uprobe-nop ( 7 cpus): 22.696 ± 0.054M/s ( 3.242M/s/cpu) uprobe-nop ( 8 cpus): 24.534 ± 0.010M/s ( 3.067M/s/cpu) uprobe-nop (10 cpus): 30.475 ± 0.117M/s ( 3.047M/s/cpu) uprobe-nop (12 cpus): 33.371 ± 0.017M/s ( 2.781M/s/cpu) uprobe-nop (14 cpus): 38.864 ± 0.004M/s ( 2.776M/s/cpu) uprobe-nop (16 cpus): 41.476 ± 0.020M/s ( 2.592M/s/cpu) uprobe-nop (24 cpus): 64.696 ± 0.021M/s ( 2.696M/s/cpu) uprobe-nop (32 cpus): 85.054 ± 0.027M/s ( 2.658M/s/cpu) uprobe-nop (40 cpus): 101.979 ± 0.032M/s ( 2.549M/s/cpu) uprobe-nop (48 cpus): 110.518 ± 0.056M/s ( 2.302M/s/cpu) uprobe-nop (56 cpus): 117.737 ± 0.020M/s ( 2.102M/s/cpu) uprobe-nop (64 cpus): 124.613 ± 0.079M/s ( 1.947M/s/cpu) uprobe-nop (72 cpus): 133.239 ± 0.032M/s ( 1.851M/s/cpu) uprobe-nop (80 cpus): 142.037 ± 0.138M/s ( 1.775M/s/cpu) Previously total throughput was maxing out at 11mln/s, and gradually declining past 8 cores. With this change, it now keeps growing with each added CPU, reaching 142mln/s at 80 CPUs (this was measured on a 80-core Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz). Suggested-by: Matthew Wilcox <willy@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20241122035922.3321100-3-andrii@kernel.org
2024-12-02uprobes: simplify find_active_uprobe_rcu() VMA checksAndrii Nakryiko
At the point where find_active_uprobe_rcu() is used we know that VMA in question has triggered software breakpoint, so we don't need to validate vma->vm_flags. Keep only vma->vm_file NULL check. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20241122035922.3321100-2-andrii@kernel.org
2024-12-02mm: convert mm_lock_seq to a proper seqcountSuren Baghdasaryan
Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock variants to increment it, in-line with the usual seqcount usage pattern. This lets us check whether the mmap_lock is write-locked by checking mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be used when implementing mmap_lock speculation functions. As a result vm_lock_seq is also change to be unsigned to match the type of mm_lock_seq.sequence. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com
2024-12-02sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()Valentin Schneider
Andy reported that clang gets upset with CONFIG_CFS_BANDWIDTH=n: kernel/sched/fair.c:6580:20: error: unused function 'cfs_bandwidth_used' [-Werror,-Wunused-function] 6580 | static inline bool cfs_bandwidth_used(void) | ^~~~~~~~~~~~~~~~~~ Indeed, cfs_bandwidth_used() is only used within functions defined under CONFIG_CFS_BANDWIDTH=y. Remove its CONFIG_CFS_BANDWIDTH=n declaration & definition. Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20241127165501.160004-1-vschneid@redhat.com
2024-12-02sched/deadline: Consolidate Timer CancellationWander Lairson Costa
After commit b58652db66c9 ("sched/deadline: Fix task_struct reference leak"), I identified additional calls to hrtimer_try_to_cancel that might also require a dl_server check. It remains unclear whether this omission was intentional or accidental in those contexts. This patch consolidates the timer cancellation logic into dedicated functions, ensuring consistent behavior across all calls. Additionally, it reduces code duplication and improves overall code cleanliness. Note the use of the __always_inline keyword. In some instances, we have a task_struct pointer, dereference the dl member, and then use the container_of macro to retrieve the task_struct pointer again. By inlining the code, the compiler can potentially optimize out this redundant round trip. Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20240724142253.27145-3-wander@redhat.com
2024-12-02sched/deadline: Check bandwidth overflow earlier for hotplugJuri Lelli
Currently we check for bandwidth overflow potentially due to hotplug operations at the end of sched_cpu_deactivate(), after the cpu going offline has already been removed from scheduling, active_mask, etc. This can create issues for DEADLINE tasks, as there is a substantial race window between the start of sched_cpu_deactivate() and the moment we possibly decide to roll-back the operation if dl_bw_deactivate() returns failure in cpuset_cpu_inactive(). An example is a throttled task that sees its replenishment timer firing while the cpu it was previously running on is considered offline, but before dl_bw_deactivate() had a chance to say no and roll-back happened. Fix this by directly calling dl_bw_deactivate() first thing in sched_cpu_deactivate() and do the required calculation in the former function considering the cpu passed as an argument as offline already. By doing so we also simplify sched_cpu_deactivate(), as there is no need anymore for any kind of roll-back if we fail early. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/Zzc1DfPhbvqDDIJR@jlelli-thinkpadt14gen4.remote.csb
2024-12-02sched/deadline: Correctly account for allocated bandwidth during hotplugJuri Lelli
For hotplug operations, DEADLINE needs to check that there is still enough bandwidth left after removing the CPU that is going offline. We however fail to do so currently. Restore the correct behavior by restructuring dl_bw_manage() a bit, so that overflow conditions (not enough bandwidth left) are properly checked. Also account for dl_server bandwidth, i.e. discount such bandwidth in the calculation since NORMAL tasks will be anyway moved away from the CPU as a result of the hotplug operation. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20241114142810.794657-3-juri.lelli@redhat.com
2024-12-02sched/deadline: Restore dl_server bandwidth on non-destructive root domain ↵Juri Lelli
changes When root domain non-destructive changes (e.g., only modifying one of the existing root domains while the rest is not touched) happen we still need to clear DEADLINE bandwidth accounting so that it's then properly restored, taking into account DEADLINE tasks associated to each cpuset (associated to each root domain). After the introduction of dl_servers, we fail to restore such servers contribution after non-destructive changes (as they are only considered on destructive changes when runqueues are attached to the new domains). Fix this by making sure we iterate over the dl_servers attached to domains that have not been destroyed and add their bandwidth contribution back correctly. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20241114142810.794657-2-juri.lelli@redhat.com
2024-12-02sched: add READ_ONCE to task_on_rq_queuedHarshit Agarwal
task_on_rq_queued read p->on_rq without READ_ONCE, though p->on_rq is set with WRITE_ONCE in {activate|deactivate}_task and smp_store_release in __block_task, and also read with READ_ONCE in task_on_rq_migrating. Make all of these accesses pair together by adding READ_ONCE in the task_on_rq_queued. Signed-off-by: Harshit Agarwal <harshit@nutanix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20241114210812.1836587-1-jon@nutanix.com
2024-12-02sched: Don't try to catch up excess steal time.Suleiman Souhlal
When steal time exceeds the measured delta when updating clock_task, we currently try to catch up the excess in future updates. However, this results in inaccurate run times for the future things using clock_task, in some situations, as they end up getting additional steal time that did not actually happen. This is because there is a window between reading the elapsed time in update_rq_clock() and sampling the steal time in update_rq_clock_task(). If the VCPU gets preempted between those two points, any additional steal time is accounted to the outgoing task even though the calculated delta did not actually contain any of that "stolen" time. When this race happens, we can end up with steal time that exceeds the calculated delta, and the previous code would try to catch up that excess steal time in future clock updates, which is given to the next, incoming task, even though it did not actually have any time stolen. This behavior is particularly bad when steal time can be very long, which we've seen when trying to extend steal time to contain the duration that the host was suspended [0]. When this happens, clock_task stays frozen, during which the running task stays running for the whole duration, since its run time doesn't increase. However the race can happen even under normal operation. Ideally we would read the elapsed cpu time and the steal time atomically, to prevent this race from happening in the first place, but doing so is non-trivial. Since the time between those two points isn't otherwise accounted anywhere, neither to the outgoing task nor the incoming task (because the "end of outgoing task" and "start of incoming task" timestamps are the same), I would argue that the right thing to do is to simply drop any excess steal time, in order to prevent these issues. [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/ Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
2024-12-02locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutexJohn Stultz
Anders had bisected a crash using PREEMPT_RT with linux-next and isolated it down to commit 894d1b3db41c ("locking/mutex: Remove wakeups from under mutex::wait_lock"), where it seemed the wake_q structure was somehow getting corrupted causing a null pointer traversal. I was able to easily repoduce this with PREEMPT_RT and managed to isolate down that through various call stacks we were actually calling wake_up_q() twice on the same wake_q. I found that in the problematic commit, I had added the wake_up_q() call in task_blocks_on_rt_mutex() around __ww_mutex_add_waiter(), following a similar pattern in __mutex_lock_common(). However, its just wrong. We haven't dropped the lock->wait_lock, so its contrary to the point of the original patch. And it didn't match the __mutex_lock_common() logic of re-initializing the wake_q after calling it midway in the stack. Looking at it now, the wake_up_q() call is incorrect and should just be removed. So drop the erronious logic I had added. Fixes: 894d1b3db41c ("locking/mutex: Remove wakeups from under mutex::wait_lock") Closes: https://lore.kernel.org/lkml/6afb936f-17c7-43fa-90e0-b9e780866097@app.fastmail.com/ Reported-by: Anders Roxell <anders.roxell@linaro.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Anders Roxell <anders.roxell@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20241114190051.552665-1-jstultz@google.com
2024-12-02sched/deadline: Fix warning in migrate_enable for boosted tasksWander Lairson Costa
When running the following command: while true; do stress-ng --cyclic 30 --timeout 30s --minimize --quiet done a warning is eventually triggered: WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794 setup_new_dl_entity+0x13e/0x180 ... Call Trace: <TASK> ? show_trace_log_lvl+0x1c4/0x2df ? enqueue_dl_entity+0x631/0x6e0 ? setup_new_dl_entity+0x13e/0x180 ? __warn+0x7e/0xd0 ? report_bug+0x11a/0x1a0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x14/0x70 ? asm_exc_invalid_op+0x16/0x20 enqueue_dl_entity+0x631/0x6e0 enqueue_task_dl+0x7d/0x120 __do_set_cpus_allowed+0xe3/0x280 __set_cpus_allowed_ptr_locked+0x140/0x1d0 __set_cpus_allowed_ptr+0x54/0xa0 migrate_enable+0x7e/0x150 rt_spin_unlock+0x1c/0x90 group_send_sig_info+0xf7/0x1a0 ? kill_pid_info+0x1f/0x1d0 kill_pid_info+0x78/0x1d0 kill_proc_info+0x5b/0x110 __x64_sys_kill+0x93/0xc0 do_syscall_64+0x5c/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7f0dab31f92b This warning occurs because set_cpus_allowed dequeues and enqueues tasks with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning is triggered. A boosted task already had its parameters set by rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary, hence the WARN_ON call. Check if we are requeueing a boosted task and avoid calling setup_new_dl_entity if that's the case. Fixes: 295d6d5e3736 ("sched/deadline: Fix switching to -deadline") Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20240724142253.27145-2-wander@redhat.com
2024-12-02sched/core: Prevent wakeup of ksoftirqd during idle load balanceK Prateek Nayak
Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event on from the IPI handler on the idle CPU. If the SMP function is invoked from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd because soft interrupts are handled before ksoftirqd get on the CPU. Adding a trace_printk() in nohz_csd_func() at the spot of raising SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup, and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the current behavior: <idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func <idle>-0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000 <idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED] <idle>-0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED] <idle>-0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120 ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 ... Use __raise_softirq_irqoff() to raise the softirq. The SMP function call is always invoked on the requested CPU in an interrupt handler. It is guaranteed that soft interrupts are handled at the end. Following are the observations with the changes when enabling the same set of events: <idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance <idle>-0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED] <idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED] No unnecessary ksoftirqd wakeups are seen from idle task's context to service the softirq. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1] Reported-by: Julia Lawall <julia.lawall@inria.fr> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
2024-12-02sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning ↵K Prateek Nayak
busy Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the TIF_NEED_RESCHED flag in idle task's thread info and relying on flush_smp_call_function_queue() in idle exit path to run the call-function. A softirq raised by the call-function is handled shortly after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag remains set and is only cleared later when schedule_idle() calls __schedule(). need_resched() check in _nohz_idle_balance() exists to bail out of load balancing if another task has woken up on the CPU currently in-charge of idle load balancing which is being processed in SCHED_SOFTIRQ context. Since the optimization mentioned above overloads the interpretation of TIF_NEED_RESCHED, check for idle_cpu() before going with the existing need_resched() check which can catch a genuine task wakeup on an idle CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as well as the case where ksoftirqd needs to be preempted as a result of new task wakeup or slice expiry. In case of PREEMPT_RT or threadirqs, although the idle load balancing may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd is the only fair task going back to sleep will trigger a newidle balance on the CPU which will alleviate some imbalance if it exists if idle balance fails to do so. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
2024-12-02sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()K Prateek Nayak
The need_resched() check currently in nohz_csd_func() can be tracked to have been added in scheduler_ipi() back in 2011 via commit ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance") Since then, it has travelled quite a bit but it seems like an idle_cpu() check currently is sufficient to detect the need to bail out from an idle load balancing. To justify this removal, consider all the following case where an idle load balancing could race with a task wakeup: o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") a target perceived to be idle (target_rq->nr_running == 0) will return true for ttwu_queue_cond(target) which will offload the task wakeup to the idle target via an IPI. In all such cases target_rq->ttwu_pending will be set to 1 before queuing the wake function. If an idle load balance races here, following scenarios are possible: - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual IPI is sent to the CPU to wake it out of idle. If the nohz_csd_func() queues before sched_ttwu_pending(), the idle load balance will bail out since idle_cpu(target) returns 0 since target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after sched_ttwu_pending() it should see rq->nr_running to be non-zero and bail out of idle load balancing. - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI, the sender will simply set TIF_NEED_RESCHED for the target to put it out of idle and flush_smp_call_function_queue() in do_idle() will execute the call function. Depending on the ordering of the queuing of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in nohz_csd_func() should either see target_rq->ttwu_pending = 1 or target_rq->nr_running to be non-zero if there is a genuine task wakeup racing with the idle load balance kick. o The waker CPU perceives the target CPU to be busy (targer_rq->nr_running != 0) but the CPU is in fact going idle and due to a series of unfortunate events, the system reaches a case where the waker CPU decides to perform the wakeup by itself in ttwu_queue() on the target CPU but target is concurrently selected for idle load balance (XXX: Can this happen? I'm not sure, but we'll consider the mother of all coincidences to estimate the worst case scenario). ttwu_do_activate() calls enqueue_task() which would increment "rq->nr_running" post which it calls wakeup_preempt() which is responsible for setting TIF_NEED_RESCHED (via a resched IPI or by setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key thing to note in this case is that rq->nr_running is already non-zero in case of a wakeup before TIF_NEED_RESCHED is set which would lead to idle_cpu() check returning false. In all cases, it seems that need_resched() check is unnecessary when checking for idle_cpu() first since an impending wakeup racing with idle load balancer will either set the "rq->ttwu_pending" or indicate a newly woken task via "rq->nr_running". Chasing the reason why this check might have existed in the first place, I came across Peter's suggestion on the fist iteration of Suresh's patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was: sched_ttwu_do_pending(list); if (unlikely((rq->idle == current) && rq->nohz_balance_kick && !need_resched())) raise_softirq_irqoff(SCHED_SOFTIRQ); Since the condition to raise the SCHED_SOFIRQ was preceded by sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in the current upstream kernel, the need_resched() check was necessary to catch a newly queued task. Peter suggested modifying it to: if (idle_cpu() && rq->nohz_balance_kick && !need_resched()) raise_softirq_irqoff(SCHED_SOFTIRQ); where idle_cpu() seems to have replaced "rq->idle == current" check. Even back then, the idle_cpu() check would have been sufficient to catch a new task being enqueued. Since commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") overloads the interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based on Peter's suggestion. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
2024-12-02softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernelK Prateek Nayak
do_softirq_post_smp_call_flush() on PREEMPT_RT kernels carries a WARN_ON_ONCE() for any SOFTIRQ being raised from an SMP-call-function. Since do_softirq_post_smp_call_flush() is called with preempt disabled, raising a SOFTIRQ during flush_smp_call_function_queue() can lead to longer preempt disabled sections. Since commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") IPIs to an idle CPU in TIF_POLLING_NRFLAG mode can be optimized out by instead setting TIF_NEED_RESCHED bit in idle task's thread_info and relying on the flush_smp_call_function_queue() in the idle-exit path to run the SMP-call-function. To trigger an idle load balancing, the scheduler queues nohz_csd_function() responsible for triggering an idle load balancing on a target nohz idle CPU and sends an IPI. Only now, this IPI is optimized out and the SMP-call-function is executed from flush_smp_call_function_queue() in do_idle() which can raise a SCHED_SOFTIRQ to trigger the balancing. So far, this went undetected since, the need_resched() check in nohz_csd_function() would make it bail out of idle load balancing early as the idle thread does not clear TIF_POLLING_NRFLAG before calling flush_smp_call_function_queue(). The need_resched() check was added with the intent to catch a new task wakeup, however, it has recently discovered to be unnecessary and will be removed in the subsequent commit after which nohz_csd_function() can raise a SCHED_SOFTIRQ from flush_smp_call_function_queue() to trigger an idle load balance on an idle target in TIF_POLLING_NRFLAG mode. nohz_csd_function() bails out early if "idle_cpu()" check for the target CPU, and does not lock the target CPU's rq until the very end, once it has found tasks to run on the CPU and will not inhibit the wakeup of, or running of a newly woken up higher priority task. Account for this and prevent a WARN_ON_ONCE() when SCHED_SOFTIRQ is raised from flush_smp_call_function_queue(). Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119054432.6405-2-kprateek.nayak@amd.com
2024-12-02sched: fix warning in sched_setaffinityJosh Don
Commit 8f9ea86fdf99b added some logic to sched_setaffinity that included a WARN when a per-task affinity assignment races with a cpuset update. Specifically, we can have a race where a cpuset update results in the task affinity no longer being a subset of the cpuset. That's fine; we have a fallback to instead use the cpuset mask. However, we have a WARN set up that will trigger if the cpuset mask has no overlap at all with the requested task affinity. This shouldn't be a warning condition; its trivial to create this condition. Reproduced the warning by the following setup: - $PID inside a cpuset cgroup - another thread repeatedly switching the cpuset cpus from 1-2 to just 1 - another thread repeatedly setting the $PID affinity (via taskset) to 2 Fixes: 8f9ea86fdf99b ("sched: Always preserve the user requested cpumask") Signed-off-by: Josh Don <joshdon@google.com> Acked-and-tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://lkml.kernel.org/r/20241111182738.1832953-1-joshdon@google.com
2024-12-02sched/deadline: Fix replenish_dl_new_period dl_server conditionJuri Lelli
The condition in replenish_dl_new_period() that checks if a reservation (dl_server) is deferred and is not handling a starvation case is obviously wrong. Fix it. Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
2024-12-02Merge tag 'v6.13-rc1' into perf/core, to refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-12-02pid: allow pid_max to be set per pid namespaceChristian Brauner
The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max by default (cf. [1]). Based on this discussion systemd started bumping pid_max to 2^22. So all new systems now run with a very high pid_max limit with some distros having also backported that change. The decision to bump pid_max is obviously correct. It just doesn't make a lot of sense nowadays to enforce such a low pid number. There's sufficient tooling to make selecting specific processes without typing really large pid numbers available. In any case, there are workloads that have expections about how large pid numbers they accept. Either for historical reasons or architectural reasons. One concreate example is the 32-bit version of Android's bionic libc which requires pid numbers less than 65536. There are workloads where it is run in a 32-bit container on a 64-bit kernel. If the host has a pid_max value greater than 65535 the libc will abort thread creation because of size assumptions of pthread_mutex_t. That's a fairly specific use-case however, in general specific workloads that are moved into containers running on a host with a new kernel and a new systemd can run into issues with large pid_max values. Obviously making assumptions about the size of the allocated pid is suboptimal but we have userspace that does it. Of course, giving containers the ability to restrict the number of processes in their respective pid namespace indepent of the global limit through pid_max is something desirable in itself and comes in handy in general. Independent of motivating use-cases the existence of pid namespaces makes this also a good semantical extension and there have been prior proposals pushing in a similar direction. The trick here is to minimize the risk of regressions which I think is doable. The fact that pid namespaces are hierarchical will help us here. What we mostly care about is that when the host sets a low pid_max limit, say (crazy number) 100 that no descendant pid namespace can allocate a higher pid number in its namespace. Since pid allocation is hierarchial this can be ensured by checking each pid allocation against the pid namespace's pid_max limit. This means if the allocation in the descendant pid namespace succeeds, the ancestor pid namespace can reject it. If the ancestor pid namespace has a higher limit than the descendant pid namespace the descendant pid namespace will reject the pid allocation. The ancestor pid namespace will obviously not care about this. All in all this means pid_max continues to enforce a system wide limit on the number of processes but allows pid namespaces sufficient leeway in handling workloads with assumptions about pid values and allows containers to restrict the number of processes in a pid namespace through the pid_max interface. [1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com - rebased from 5.14-rc1 - a few fixes (missing ns_free_inum on error path, missing initialization, etc) - permission check changes in pid_table_root_permissions - unsigned int pid_max -> int pid_max (keep pid_max type as it was) - add READ_ONCE in alloc_pid() as suggested by Christian - rebased from 6.7 and take into account: * sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) * sysctl: treewide: constify ctl_table_header::ctl_table_arg * pidfd: add pidfs * tracing: Move saved_cmdline code into trace_sched_switch.c Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20241122132459.135120-2-aleksandr.mikhalitsyn@canonical.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02trace: avoid pointless cred reference count bumpChristian Brauner
The creds are allocated via prepare_creds() which has already taken a reference. Link: https://lore.kernel.org/r/20241125-work-cred-v2-25-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02cgroup: avoid pointless cred reference count bumpChristian Brauner
of->file->f_cred already holds a reference count that is stable during the operation. Link: https://lore.kernel.org/r/20241125-work-cred-v2-24-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>