summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2025-04-06Merge tag 'sched-urgent-2025-04-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix a nonsensical Kconfig combination - Remove an unnecessary rseq-notification * tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: Eliminate useless task_work on execve sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-03Merge tag 'sched_ext-for-6.15-rc0-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Calling scx_bpf_create_dsq() with the same ID would succeed creating duplicate DSQs. Fix it to return -EEXIST. - scx_select_cpu_dfl() fixes and cleanups. - Synchronize tool/sched_ext with external scheduler repo. While this isn't a fix. There's no risk to the kernel and it's better if they stay synced closer. * tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: tools/sched_ext: Sync with scx repo sched_ext: initialize built-in idle state before ops.init() sched_ext: create_dsq: Return -EEXIST on duplicate request sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl() sched_ext: idle: Fix return code of scx_select_cpu_dfl()
2025-04-03rseq: Eliminate useless task_work on execveMathieu Desnoyers
Eliminate a useless task_work on execve by moving the call to rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error path of bprm_execve(). The call to rseq_set_notify_resume() from sched_mm_cid_after_execve() is pointless in the success case, because rseq_execve() will clear the rseq pointer before returning to userspace. sched_mm_cid_after_execve() is called from both the success and error paths of bprm_execve(). The call to rseq_set_notify_resume() is needed on error because the mm_cid may have changed. Also move the rseq_execve() to right after sched_mm_cid_after_execve() in bprm_execve(). [ mingo: Merged to a recent upstream kernel, extended the changelog. ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250327132945.1558783-1-mathieu.desnoyers@efficios.com
2025-03-27Merge tag 'trace-latency-v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull latency tracing updates from Steven Rostedt: - Add some trace events to osnoise and timerlat sample generation This adds more information to the osnoise and timerlat tracers as well as allows BPF programs to be attached to these locations to extract even more data. - Fix to DECLARE_TRACE_CONDITION() macro It wasn't used but now will be and it happened to be broken causing the build to fail. - Add scheduler specification monitors to runtime verifier (RV) This is a continuation of Daniel Bristot's work. RV allows monitors to run and react concurrently. Running the cumulative model is equivalent to running single components using the same reactors, with the advantage that it's easier to point out which specification failed in case of error. This update introduces nested monitors to RV, in short, the sysfs monitor folder will contain a monitor named sched, which is nothing but an empty container for other monitors. Controlling the sched monitor (enable, disable, set reactors) controls all nested monitors. The following scheduling monitors are added: - sco: scheduling context operations Monitor to ensure sched_set_state happens only in thread context - tss: task switch while scheduling Monitor to ensure sched_switch happens only in scheduling context - snroc: set non runnable on its own context Monitor to ensure set_state happens only in the respective task's context - scpd: schedule called with preemption disabled Monitor to ensure schedule is called with preemption disabled - snep: schedule does not enable preempt Monitor to ensure schedule does not enable preempt - sncid: schedule not called with interrupt disabled Monitor to ensure schedule is not called with interrupt disabled * tag 'trace-latency-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tools/rv: Allow rv list to filter for container Documentation/rv: Add docs for the sched monitors verification/dot2k: Add support for nested monitors tools/rv: Add support for nested monitors rv: Add scpd, snep and sncid per-cpu monitors rv: Add snroc per-task monitor rv: Add sco and tss per-cpu monitors rv: Add option for nested monitors and include sched sched: Add sched tracepoints for RV task model rv: Add license identifiers to monitor files tracing: Fix DECLARE_TRACE_CONDITION trace/osnoise: Add trace events for samples
2025-03-26sched_ext: initialize built-in idle state before ops.init()Andrea Righi
A BPF scheduler may want to use the built-in idle cpumasks in ops.init() before the scheduler is fully initialized, either directly or through a BPF timer for example. However, this would result in an error, since the idle state has not been properly initialized yet. This can be easily verified by modifying scx_simple to call scx_bpf_get_idle_cpumask() in ops.init(): $ sudo scx_simple DEBUG DUMP =========================================================================== scx_simple[121] triggered exit kind 1024: runtime error (built-in idle tracking is disabled) ... Fix this by properly initializing the idle state before ops.init() is called. With this change applied: $ sudo scx_simple local=2 global=0 local=19 global=11 local=23 global=11 ... Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26sched_ext: create_dsq: Return -EEXIST on duplicate requestJake Hillion
create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently ignore duplicate entries. As a sched_ext scheduler is creating each DSQ for a different purpose this is surprising behaviour. Replace rhashtable_insert_fast which ignores duplicates with rhashtable_lookup_insert_fast that reports duplicates (though doesn't return their value). The rest of the code is structured correctly and this now returns -EEXIST. Tested by adding an extra scx_bpf_create_dsq to scx_simple. Previously this was ignored, now init fails with a -17 code. Also ran scx_lavd which continued to work well. Signed-off-by: Jake Hillion <jake@hillion.co.uk> Acked-by: Andrea Righi <arighi@nvidia.com> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()Tejun Heo
scx_select_cpu_dfl() has a meaningless conditional goto at the end. Remove it. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrea Righi <arighi@nvidia.com>
2025-03-26sched_ext: idle: Fix return code of scx_select_cpu_dfl()Andrea Righi
Return -EBUSY when using %SCX_PICK_IDLE_CORE with scx_select_cpu_dfl() if a fully idle SMT core cannot be found, instead of falling back to @prev_cpu, which is not a fully idle SMT core in this case. Fixes: c414c2171cd9e ("sched_ext: idle: Honor idle flags in the built-in idle selection policy") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-25Merge tag 'timers-cleanups-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-24Merge tag 'sched-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-24Merge tag 'objtool-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - The biggest change is the new option to automatically fail the build on objtool warnings: CONFIG_OBJTOOL_WERROR. While there are no currently known unfixed false positives left, such an expansion in the severity of objtool warnings inevitably creates a risk of build failures, so it's disabled by default and depends on !COMPILE_TEST, so it shouldn't be enabled on allyesconfig/allmodconfig builds and won't be forced on people who just accept build-time defaults in 'make oldconfig'. While the option is strongly recommended, only people who enable it explicitly should see it. (Josh Poimboeuf) - Disable branch profiling in noinstr code with a broad brush that includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf) - Create backup object files on objtool errors and print exact objtool arguments to make failure analysis easier (Josh Poimboeuf) - Improve noreturn handling (Josh Poimboeuf) - Improve rodata handling (Tiezhu Yang) - Support jump tables, switch tables and goto tables on LoongArch (Tiezhu Yang) - Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar) * tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) tracing: Disable branch profiling in noinstr code objtool: Use O_CREAT with explicit mode mask objtool: Add CONFIG_OBJTOOL_WERROR objtool: Create backup on error and print args objtool: Change "warning:" to "error:" for --Werror objtool: Add --Werror option objtool: Add --output option objtool: Upgrade "Linked object detected" warning to error objtool: Consolidate option validation objtool: Remove --unret dependency on --rethunk objtool: Increase per-function WARN_FUNC() rate limit objtool: Update documentation objtool: Improve __noreturn annotation warning objtool: Fix error handling inconsistencies in check() x86/traps: Make exc_double_fault() consistently noreturn LoongArch: Enable jump table for objtool objtool/LoongArch: Add support for goto table objtool/LoongArch: Add support for switch table objtool: Handle PC relative relocation type objtool: Handle different entry size of rodata ...
2025-03-24Merge tag 'rcu-next-v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Boqun Feng: "Documentation: - Add broken-timing possibility to stallwarn.rst - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr() - Document self-propagating callbacks - Point call_srcu() to call_rcu() for detailed memory ordering - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text - Remove references to old grace-period-wait primitives srcu: - Introduce srcu_read_{un,}lock_fast(), which is similar to srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the cost of calling synchronize_rcu() in synchronize_srcu() Moreover, by returning the percpu offset of the counter at srcu_read_lock_fast() time, srcu_read_unlock_fast() can avoid extra pointer dereferencing, which makes it faster than srcu_read_{un,}lock_lite() srcu_read_{un,}lock_fast() are intended to replace rcu_read_{un,}lock_trace() if possible RCU torture: - Add get_torture_init_jiffies() to return the start time of the test - Add a test_boost_holdoff module parameter to allow delaying boosting tests when building rcutorture as built-in - Add grace period sequence number logging at the beginning and end of failure/close-call results - Switch to hexadecimal for the expedited grace period sequence number in the rcu_exp_grace_period trace point - Make cur_ops->format_gp_seqs take buffer length - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool - Complain when invalid SRCU reader_flavor is specified - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU uses atomics even when percpu ops are NMI safe, and use the Kconfig for SRCU lockdep testing Misc: - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes() - Fix get_state_synchronize_rcu_full() GP-start detection - Move RCU Tasks self-tests to core_initcall() - Print segment lengths in show_rcu_nocb_gp_state() - Make RCU watch ct_kernel_exit_state() warning - Flush console log from kernel_power_off() - rcutorture: Allow a negative value for nfakewriters - rcu: Update TREE05.boot to test normal synchronize_rcu() - rcu: Use _full() API to debug synchronize_rcu() Make RCU handle PREEMPT_LAZY better: - Fix header guard for rcu_all_qs() - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY - Update __cond_resched comment about RCU quiescent states - Handle unstable rdp in rcu_read_unlock_strict() - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y - osnoise: Provide quiescent states - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y combination - Limit PREEMPT_RCU configurations - Make rcutorture senario TREE07 and senario TREE10 use PREEMPT_LAZY=y" * tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (59 commits) rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y rcu: limit PREEMPT_RCU configurations rcutorture: Update ->extendables check for lazy preemption rcutorture: Update rcutorture_one_extend_check() for lazy preemption osnoise: provide quiescent states rcu: Use _full() API to debug synchronize_rcu() rcu: Update TREE05.boot to test normal synchronize_rcu() rcutorture: Allow a negative value for nfakewriters Flush console log from kernel_power_off() context_tracking: Make RCU watch ct_kernel_exit_state() warning rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state() rcu-tasks: Move RCU Tasks self-tests to core_initcall() rcu: Fix get_state_synchronize_rcu_full() GP-start detection torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe() srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing rcutorture: Complain when invalid SRCU reader_flavor is specified rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool rcutorture: Make cur_ops->format_gp_seqs take buffer length rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output ...
2025-03-24Merge tag 'sched_ext-for-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Add mechanism to count and report internal events. This significantly improves visibility on subtle corner conditions. - The default idle CPU selection logic is revamped and improved in multiple ways including being made topology aware. - sched_ext was disabling ttwu_queue for simplicity, which can be costly when hardware topology is more complex. Implement SCX_OPS_ALLOWED_QUEUED_WAKEUP so that BPF schedulers can selectively enable ttwu_queue. - tools/sched_ext updates to improve compatibility among others. - Other misc updates and fixes. - sched_ext/for-6.14-fixes were pulled a few times to receive prerequisite fixes and resolve conflicts. * tag 'sched_ext-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (42 commits) sched_ext: idle: Refactor scx_select_cpu_dfl() sched_ext: idle: Honor idle flags in the built-in idle selection policy sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local() sched_ext: Add trace point to track sched_ext core events sched_ext: Change the event type from u64 to s64 sched_ext: Documentation: add task lifecycle summary tools/sched_ext: Provide a compatible helper for scx_bpf_events() selftests/sched_ext: Add NUMA-aware scheduler test tools/sched_ext: Provide consistent access to scx flags sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior sched_ext: idle: Introduce scx_bpf_nr_node_ids() sched_ext: idle: Introduce node-aware idle cpu kfunc helpers sched_ext: idle: Per-node idle cpumasks sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE sched_ext: idle: Make idle static keys private sched/topology: Introduce for_each_node_numadist() iterator mm/numa: Introduce nearest_node_nodemask() nodemask: numa: reorganize inclusion path nodemask: add nodes_copy() tools/sched_ext: Sync with scx repo ...
2025-03-24sched: Add sched tracepoints for RV task modelGabriele Monaco
Add the following tracepoints: * sched_entry(bool preempt, ip) Called while entering __schedule * sched_exit(bool is_switch, ip) Called while exiting __schedule * sched_set_state(task, curr_state, state) Called when a task changes its state (to and from running) These tracepoints are useful to describe the Linux task model and are adapted from the patches by Daniel Bristot de Oliveira (https://bristot.me/linux-task-model/). Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-2-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-22tracing: Disable branch profiling in noinstr codeJosh Poimboeuf
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update() for each use of likely() or unlikely(). That breaks noinstr rules if the affected function is annotated as noinstr. Disable branch profiling for files with noinstr functions. In addition to some individual files, this also includes the entire arch/x86 subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle directories, all of which are noinstr-heavy. Due to the nature of how sched binaries are built by combining multiple .c files into one, branch profiling is disabled more broadly across the sched code than would otherwise be needed. This fixes many warnings like the following: vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section ... Reported-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-21Merge tag 'sched-urgent-2025-03-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Revert a scheduler performance optimization that regressed other workloads" * tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
2025-03-19sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditionalIngo Molnar
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well. Reflect this reality and enable this functionality unconditionally. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-19sched/debug: Make 'const_debug' tunables unconditional __read_mostlyIngo Molnar
With CONFIG_SCHED_DEBUG becoming unconditional, remove the extra 'const_debug' indirection towards __read_mostly. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-3-mingo@kernel.org
2025-03-19sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()Ingo Molnar
The scheduler has this special SCHED_WARN() facility that depends on CONFIG_SCHED_DEBUG. Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN() to WARN_ON_ONCE(). Note that the warning output isn't 100% equivalent: #define SCHED_WARN_ON(x) WARN_ONCE(x, #x) Because SCHED_WARN_ON() would output the 'x' condition as well, while WARN_ONCE() will only show a backtrace. Hopefully these are rare enough to not really matter. If it does, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
2025-03-17sched/topology: Stop exposing partition_sched_domains_lockedJuri Lelli
The are no callers of partition_sched_domains_locked() outside topology.c. Stop exposing such function. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MSC96a8FcqWV3G@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/topology: Remove redundant dl_clear_root_domain callJuri Lelli
We completely clean and restore root domains bandwidth accounting after every root domains change, so the dl_clear_root_domain() call in partition_sched_domains_locked() is redundant. Remove it. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <llong@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRtcX4tz4tcLRR@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Rebuild root domain accounting after every updateJuri Lelli
Rebuilding of root domains accounting information (total_bw) is currently broken on some cases, e.g. suspend/resume on aarch64. Problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile. Fix it by simplify things by making sure bandwidth accounting is cleared and completely restored after root domains changes (after root domains are again stable). To be sure we always call dl_rebuild_rd_accounting while holding cpuset_mutex we also add cpuset_reset_sched_domains() wrapper. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Co-developed-by: Waiman Long <llong@redhat.com> Signed-off-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Generalize unique visiting of root domainsJuri Lelli
Bandwidth checks and updates that work on root domains currently employ a cookie mechanism for efficiency. This mechanism is very much tied to when root domains are first created and initialized. Generalize the cookie mechanism so that it can be used also later at runtime while updating root domains. Also, additionally guard it with sched_domains_mutex, since domains need to be stable while updating them (and it will be required for further dynamic changes). Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/topology: Wrappers for sched_domains_mutexJuri Lelli
Create wrappers for sched_domains_mutex so that it can transparently be used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to do. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Ignore special tasks when rebuilding domainsJuri Lelli
SCHED_DEADLINE special tasks get a fake bandwidth that is only used to make sure sleeping and priority inheritance 'work', but it is ignored for runtime enforcement and admission control. Be consistent with it also when rebuilding root domains. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20250313170011.357208-2-juri.lelli@redhat.com
2025-03-17sched: Add a generic function to return the preemption stringSebastian Andrzej Siewior
The individual architectures often add the preemption model to the begin of the backtrace. This is the case on X86 or ARM64 for the "die" case but not for regular warning. With the addition of DYNAMIC_PREEMPT for PREEMPT_RT we end up with CONFIG_PREEMPT and CONFIG_PREEMPT_RT set simultaneously. That means that everyone who tried to add that piece of information gets it wrong for PREEMPT_RT because PREEMPT is checked first. Provide a generic function which returns the current scheduling model considering LAZY preempt and the current state of PREEMPT_DYNAMIC. The resulting strings are: ┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ ┃ Model ┃ -RT -DYN ┃ +RT -DYN ┃ -RT +DYN ┃ +RT +DYN ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ │NONE │ NONE │ n/a │ PREEMPT(none) │ n/a │ ├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤ │VOLUNTARY │ VOLUNTARY │ n/a │ PREEMPT(voluntary) │ n/a │ ├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤ │FULL │ PREEMPT │ PREEMPT_RT │ PREEMPT(full) │ PREEMPT_{RT,full} │ ├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤ │LAZY │ PREEMPT_LAZY │ PREEMPT_{RT,LAZY} │ PREEMPT(lazy) │ PREEMPT_{RT,lazy} │ └───────────┴──────────────┴───────────────────┴────────────────────┴───────────────────┘ [ The dynamic building of the string can lead to an empty string if the function is invoked simultaneously on two CPUs. ] Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Co-developed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20250314160810.2373416-2-bigeasy@linutronix.de
2025-03-15Revert "sched/core: Reduce cost of sched_move_task when config autogroup"Dietmar Eggemann
This reverts commit eff6c8ce8d4d7faef75f66614dd20bb50595d261. Hazem reported a 30% drop in UnixBench spawn test with commit eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM (aarch64) (single level MC sched domain): https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com There is an early bail from sched_move_task() if p->sched_task_group is equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope' (Ubuntu '22.04.5 LTS'). So in: do_exit() sched_autogroup_exit_task() sched_move_task() if sched_get_task_group(p) == p->sched_task_group return /* p is enqueued */ dequeue_task() \ sched_change_group() | task_change_group_fair() | detach_task_cfs_rq() | (1) set_task_rq() | attach_task_cfs_rq() | enqueue_task() / (1) isn't called for p anymore. Turns out that the regression is related to sgs->group_util in group_is_overloaded() and group_has_capacity(). If (1) isn't called for all the 'spawn' tasks then sgs->group_util is ~900 and sgs->group_capacity = 1024 (single CPU sched domain) and this leads to group_is_overloaded() returning true (2) and group_has_capacity() false (3) much more often compared to the case when (1) is called. I.e. there are much more cases of 'group_is_overloaded' and 'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which then returns much more often a CPU != smp_processor_id() (5). This isn't good for these extremely short running tasks (FORK + EXIT) and also involves calling sched_balance_find_dst_group_cpu() unnecessary (single CPU sched domain). Instead if (1) is called for 'p->flags & PF_EXITING' then the path (4),(6) is taken much more often. select_task_rq_fair(..., wake_flags = WF_FORK) cpu = smp_processor_id() new_cpu = sched_balance_find_dst_cpu(..., cpu, ...) group = sched_balance_find_dst_group(..., cpu) do { update_sg_wakeup_stats() sgs->group_type = group_classify() if group_is_overloaded() (2) return group_overloaded if !group_has_capacity() (3) return group_fully_busy return group_has_spare (4) } while group if local_sgs.group_type > idlest_sgs.group_type return idlest (5) case group_has_spare: if local_sgs.idle_cpus >= idlest_sgs.idle_cpus return NULL (6) Unixbench Tests './Run -c 4 spawn' on: (a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4') and Ubuntu 22.04.5 LTS (aarch64). Shell & test run in '/user.slice/user-1000.slice/session-1.scope'. w/o patch w/ patch 21005 27120 (b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and Ubuntu 22.04.5 LTS (x86_64). Shell & test run in '/A'. w/o patch w/ patch 67675 88806 CONFIG_SCHED_AUTOGROUP=y & /sys/proc/kernel/sched_autogroup_enabled equal 0 or 1. Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Hagar Hemdan <hagarhem@amazon.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250314151345.275739-1-dietmar.eggemann@arm.com
2025-03-15sched/uclamp: Optimize sched_uclamp_used static key enablingXuewen Yan
Repeat calls of static_branch_enable() to an already enabled static key introduce overhead, because it calls cpus_read_lock(). Users may frequently set the uclamp value of tasks, triggering the repeat enabling of the sched_uclamp_used static key. Optimize this and avoid repeat calls to static_branch_enable() by checking whether it's enabled already. [ mingo: Rewrote the changelog for legibility ] Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250219093747.2612-2-xuewen.yan@unisoc.com
2025-03-15sched/uclamp: Use the uclamp_is_used() helper instead of open-coding itXuewen Yan
Don't open-code static_branch_unlikely(&sched_uclamp_used), we have the uclamp_is_used() wrapper around it. [ mingo: Clean up the changelog ] Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250219093747.2612-1-xuewen.yan@unisoc.com
2025-03-14Merge tag 'sched-urgent-2025-03-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a sleeping-while-atomic bug caused by a recent optimization utilizing static keys that didn't consider that the static_key_disable() call could be triggered in atomic context. Revert the optimization" * tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Don't define sched_clock_irqtime as static key
2025-03-14sched_ext: idle: Refactor scx_select_cpu_dfl()Andrea Righi
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found. No functional changes, this is purely a refactoring. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14sched_ext: idle: Honor idle flags in the built-in idle selection policyAndrea Righi
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core. This functionality will be exposed through a dedicated kfunc in a separate patch. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-13sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()Andrea Righi
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give tasks that are queued to the local DSQ a chance to migrate to other CPUs, when a CPU is taken by a higher scheduling class. However, there is no point re-enqueuing tasks that can only run on that particular CPU, as they would simply be re-added to the same local DSQ without any benefit. Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-12Merge tag 'sched_ext-for-6.14-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "BPF schedulers could trigger a crash by passing in an invalid CPU to the scx_bpf_select_cpu_dfl() helper. Fix it by verifying input validity" * tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
2025-03-10sched/clock: Don't define sched_clock_irqtime as static keyYafang Shao
The sched_clock_irqtime was defined as a static key in: 8722903cbb8f ("sched: Define sched_clock_irqtime as static key") However, this change introduces a 'sleeping in atomic context' warning: arch/x86/kernel/tsc.c:1214 mark_tsc_unstable() warn: sleeping in atomic context As analyzed by Dan, the affected code path is as follows: vcpu_load() <- disables preempt -> kvm_arch_vcpu_load() -> mark_tsc_unstable() <- sleeps virt/kvm/kvm_main.c 166 void vcpu_load(struct kvm_vcpu *vcpu) 167 { 168 int cpu = get_cpu(); ^^^^^^^^^^ This get_cpu() disables preemption. 169 170 __this_cpu_write(kvm_running_vcpu, vcpu); 171 preempt_notifier_register(&vcpu->preempt_notifier); 172 kvm_arch_vcpu_load(vcpu, cpu); 173 put_cpu(); 174 } arch/x86/kvm/x86.c 4979 if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) { 4980 s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 : 4981 rdtsc() - vcpu->arch.last_host_tsc; 4982 if (tsc_delta < 0) 4983 mark_tsc_unstable("KVM discovered backwards TSC"); arch/x86/kernel/tsc.c 1206 void mark_tsc_unstable(char *reason) 1207 { 1208 if (tsc_unstable) 1209 return; 1210 1211 tsc_unstable = 1; 1212 if (using_native_sched_clock()) 1213 clear_sched_clock_stable(); --> 1214 disable_sched_clock_irqtime(); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ kernel/jump_label.c 245 void static_key_disable(struct static_key *key) 246 { 247 cpus_read_lock(); ^^^^^^^^^^^^^^^^ This lock has a might_sleep() in it which triggers the static checker warning. 248 static_key_disable_cpuslocked(key); 249 cpus_read_unlock(); 250 } Let revert this change for now as {disable,enable}_sched_clock_irqtime are used in many places, as pointed out by Sean, including the following: The code path in clocksource_watchdog(): clocksource_watchdog() | -> spin_lock(&watchdog_lock); | -> __clocksource_unstable() | -> clocksource.mark_unstable() == tsc_cs_mark_unstable() | -> disable_sched_clock_irqtime() And the code path in sched_clock_register(): /* Cannot register a sched_clock with interrupts on */ local_irq_save(flags); ... /* Enable IRQ time accounting if we have a fast enough sched_clock() */ if (irqtime > 0 || (irqtime == -1 && rate >= 1000000)) enable_sched_clock_irqtime(); local_irq_restore(flags); [ lkp@intel.com: reported a build error in the prev version ] [ mingo: cherry-picked it over into sched/urgent ] Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/ Fixes: 8722903cbb8f ("sched: Define sched_clock_irqtime as static key") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Debugged-by: Dan Carpenter <dan.carpenter@linaro.org> Debugged-by: Sean Christopherson <seanjc@google.com> Debugged-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com
2025-03-06Merge branch 'sched/urgent' into sched/core, to pick up dependent commitsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-06sched/deadline: Use online cpus for validating runtimeShrikanth Hegde
The ftrace selftest reported a failure because writing -1 to sched_rt_runtime_us returns -EBUSY. This happens when the possible CPUs are different from active CPUs. Active CPUs are part of one root domain, while remaining CPUs are part of def_root_domain. Since active cpumask is being used, this results in cpus=0 when a non active CPUs is used in the loop. Fix it by looping over the online CPUs instead for validating the bandwidth calculations. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com
2025-03-05sched/fair: Fix potential memory corruption in child_cfs_rq_on_listZecheng Li
child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq. This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list, making the conversion invalid and potentially leading to memory corruption. Depending on the relative positions of leaf_cfs_rq_list and the task group (tg) pointer within the struct, this can cause a memory fault or access garbage data. The issue arises in list_add_leaf_cfs_rq, where both cfs_rq->leaf_cfs_rq_list and rq->leaf_cfs_rq_list are added to the same leaf list. Also, rq->tmp_alone_branch can be set to rq->leaf_cfs_rq_list. This adds a check `if (prev == &rq->leaf_cfs_rq_list)` after the main conditional in child_cfs_rq_on_list. This ensures that the container_of operation will convert a correct cfs_rq struct. This check is sufficient because only cfs_rqs on the same CPU are added to the list, so verifying the 'prev' pointer against the current rq's list head is enough. Fixes a potential memory corruption issue that due to current struct layout might not be manifesting as a crash but could lead to unpredictable behavior when the layout changes. Fixes: fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling") Signed-off-by: Zecheng Li <zecheng@google.com> Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250304214031.2882646-1-zecheng@google.com
2025-03-04sched_ext: Add trace point to track sched_ext core eventsChangwoo Min
Add tracing support to track sched_ext core events (/sched_ext/sched_ext_event). This may be useful for debugging sched_ext schedulers that trigger a particular event. The trace point can be used as other trace points, so it can be used in, for example, `perf trace` and BPF programs, as follows: ====== $> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"' ====== ====== struct tp_sched_ext_event { struct trace_entry ent; u32 __data_loc_name; s64 delta; }; SEC("tracepoint/sched_ext/sched_ext_event") int rtp_add_event(struct tp_sched_ext_event *ctx) { char event_name[128]; unsigned short offset = ctx->__data_loc_name & 0xFFFF; bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset); bpf_printk("name %s delta %lld", event_name, ctx->delta); return 0; } ====== Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04sched_ext: Change the event type from u64 to s64Changwoo Min
The event count could be negative in the future, so change the event type from u64 to s64. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03sched_ext: Merge branch 'for-6.14-fixes' into for-6.15Tejun Heo
Pull for-6.14-fixes to receive: 9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()") which conflicts with: 337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()Andrea Righi
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel crash. To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and trigger an scx error if an invalid CPU is specified. Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-28Merge tag 'sched-urgent-2025-02-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Prevent cond_resched() based preemption when interrupts are disabled, on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels" * tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Prevent rescheduling when interrupts are disabled
2025-02-27sched/core: Prevent rescheduling when interrupts are disabledThomas Gleixner
David reported a warning observed while loop testing kexec jump: Interrupts enabled after irqrouter_resume+0x0/0x50 WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220 kernel_kexec+0xf6/0x180 __do_sys_reboot+0x206/0x250 do_syscall_64+0x95/0x180 The corresponding interrupt flag trace: hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90 hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90 That means __up_console_sem() was invoked with interrupts enabled. Further instrumentation revealed that in the interrupt disabled section of kexec jump one of the syscore_suspend() callbacks woke up a task, which set the NEED_RESCHED flag. A later callback in the resume path invoked cond_resched() which in turn led to the invocation of the scheduler: __cond_resched+0x21/0x60 down_timeout+0x18/0x60 acpi_os_wait_semaphore+0x4c/0x80 acpi_ut_acquire_mutex+0x3d/0x100 acpi_ns_get_node+0x27/0x60 acpi_ns_evaluate+0x1cb/0x2d0 acpi_rs_set_srs_method_data+0x156/0x190 acpi_pci_link_set+0x11c/0x290 irqrouter_resume+0x54/0x60 syscore_resume+0x6a/0x200 kernel_kexec+0x145/0x1c0 __do_sys_reboot+0xeb/0x240 do_syscall_64+0x95/0x180 This is a long standing problem, which probably got more visible with the recent printk changes. Something does a task wakeup and the scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and invokes schedule() from a completely bogus context. The scheduler enables interrupts after context switching, which causes the above warning at the end. Quite some of the code paths in syscore_suspend()/resume() can result in triggering a wakeup with the exactly same consequences. They might not have done so yet, but as they share a lot of code with normal operations it's just a question of time. The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling models. Full preemption is not affected as cond_resched() is disabled and the preemption check preemptible() takes the interrupt disabled flag into account. Cure the problem by adding a corresponding check into cond_resched(). Reported-by: David Woodhouse <dwmw@amazon.co.uk> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
2025-02-26Merge tag 'sched_ext-for-6.14-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "pick_task_scx() has a workaround to avoid stalling when the fair class's balance() says yes but pick_task() says no. The workaround was incorrectly deciding to keep the prev taks running if the task is on SCX even when the task is in a sleeping state, which can lead to several confusing failure modes. Fix it by testing the prev task is currently queued on SCX instead" * tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
2025-02-25sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behaviorAndrea Righi
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node() should always return a CPU from the specified node, regardless of its idle state. Also clarify this logic in the function documentation. Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called ↵Tejun Heo
without balance() a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") added a workaround to handle the cases where pick_task_scx() is called without prececing balance_scx() which is due to a fair class bug where pick_taks_fair() may return NULL after a true return from balance_fair(). The workaround detects when pick_task_scx() is called without preceding balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid stalling. Unfortunately, the workaround code was testing whether @prev was on SCX to decide whether to keep the task running. This is incorrect as the task may be on SCX but no longer runnable. This could lead to a non-runnable task to be returned from pick_task_scx() which cause interesting confusions and failures. e.g. A common failure mode is the task ending up with (!on_rq && on_cpu) state which can cause potential wakers to busy loop, which can easily lead to deadlocks. Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes @prev_on_scx only used in one place. Open code the usage and improve the comment while at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Pat Cody <patcody@meta.com> Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-24sched_ext: idle: Introduce scx_bpf_nr_node_ids()Andrea Righi
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-21sched: Add unlikey branch hints to several system callsColin Ian King
Adding an unlikely() hint on early error return paths improves the run-time performance of several sched related system calls. Benchmarking on an i9-12900 shows the following per system call performance improvements: before after improvement sched_getattr 182.4ns 170.6ns ~6.5% sched_setattr 284.3ns 267.6ns ~5.9% sched_getparam 161.6ns 148.1ns ~8.4% sched_setparam 1265.4ns 1227.6ns ~3.0% sched_getscheduler 129.4ns 118.2ns ~8.7% sched_setscheduler 1237.3ns 1216.7ns ~1.7% Results are based on running 20 tests with turbo disabled (to reduce clock freq turbo changes), with 10 second run per test based on the number of system calls per second. The % standard deviation of the measurements for the 20 tests was 0.05% to 0.40%, so the results are reliable. Tested on kernel build with gcc 14.2.1 Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250219142423.45516-1-colin.i.king@gmail.com