summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2025-01-21sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeueK Prateek Nayak
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an entity is delayed irrespective of whether the entity corresponds to a task or a cfs_rq. Consider the following scenario: root / \ A B (*) delayed since B is no longer eligible on root | | Task0 Task1 <--- dequeue_task_fair() - task blocks When Task1 blocks (dequeue_entity() for task's se returns true), dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the hierarchy of Task1. However, when the sched_entity corresponding to cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the hierarchy too leading to both dequeue_entity() and set_delayed() decrementing h_nr_runnable for the dequeue of the same task. A SCHED_WARN_ON() to inspect h_nr_runnable post its update in dequeue_entities() like below: cfs_rq->h_nr_runnable -= h_nr_runnable; SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0); is consistently tripped when running wakeup intensive workloads like hackbench in a cgroup. This error is self correcting since cfs_rq are per-cpu and cannot migrate. The entitiy is either picked for full dequeue or is requeued when a task wakes up below it. Both those paths call clear_delayed() which again increments h_nr_runnable of the hierarchy without considering if the entity corresponds to a task or not. h_nr_runnable will eventually reflect the correct value however in the interim, the incorrect values can still influence PELT calculation which uses se->runnable_weight or cfs_rq->h_nr_runnable. Since only delayed tasks take the early return path in dequeue_entities() and enqueue_task_fair(), adjust the h_nr_runnable in {set,clear}_delayed() only when a task is delayed as this path skips the h_nr_* update loops and returns early. For entities corresponding to cfs_rq, the h_nr_* update loop in the caller will do the right thing. Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
2025-01-20Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq updates for 6.14: - Use str_enable_disable()-like helpers in cpufreq (Krzysztof Kozlowski). - Extend the Apple cpufreq driver to support more SoCs (Hector Martin, Nick Chan). - Add new cpufreq driver for Airoha SoCs (Christian Marangi). - Fix using cpufreq-dt as module (Andreas Kemnade). - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter Edwards, Sibi Sankar, Manivannan Sadhasivam). - Fix the maximum supported frequency computation in the ACPI cpufreq driver to avoid relying on unfounded assumptions (Gautham Shenoy). - Fix an amd-pstate driver regression with preferred core rankings not being used (Mario Limonciello). - Fix a precision issue with frequency calculation in the amd-pstate driver (Naresh Solanki). - Add ftrace event to the amd-pstate driver for active mode (Mario Limonciello). - Set default EPP policy on Ryzen processors in amd-pstate (Mario Limonciello). - Clean up the amd-pstate cpufreq driver and optimize it to increase code reuse (Mario Limonciello, Dhananjay Ugwekar). - Use CPPC to get scaling factors between HWP performance levels and frequency in the intel_pstate driver and make it stop using a built -in scaling factor for the Arrow Lake processor (Rafael Wysocki). - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for consistency with CPU offline (Christian Loehle). - Fix superfluous updates caused by need_freq_update in the schedutil cpufreq governor (Sultan Alsawaf). * pm-cpufreq: (40 commits) cpufreq: Use str_enable_disable()-like helpers cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver cpufreq: ACPI: Fix max-frequency computation cpufreq/amd-pstate: Refactor max frequency calculation cpufreq/amd-pstate: Fix prefcore rankings cpufreq: sparc: change kzalloc to kcalloc cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT cpufreq: apple-soc: Increase cluster switch timeout to 400us cpufreq: apple-soc: Use 32-bit read for status register cpufreq: apple-soc: Allow per-SoC configuration of APPLE_DVFS_CMD_PS1 cpufreq: apple-soc: Drop setting the PS2 field on M2+ dt-bindings: cpufreq: apple,cluster-cpufreq: Add A7-A11, T2 compatibles dt-bindings: cpufreq: Document support for Airoha EN7581 CPUFreq cpufreq: fix using cpufreq-dt as module cpufreq: scmi: Register for limit change notifications cpufreq: schedutil: Fix superfluous updates caused by need_freq_update cpufreq: intel_pstate: Use CPUFREQ_POLICY_UNKNOWN ...
2025-01-20Merge branches 'pm-sleep', 'pm-cpuidle' and 'pm-em'Rafael J. Wysocki
Merge updates related to system sleep, a cpuidle update and an Energy Model handling code update for 6.14-rc1: - Allow configuring the system suspend-resume (DPM) watchdog to warn earlier than panic (Douglas Anderson). - Implement devm_device_init_wakeup() helper and introduce a device- managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan). - Remove direct inclusions of 'pm_wakeup.h' which should be only included via 'device.h' (Wolfram Sang). - Clean up two comments in the core system-wide PM code (Rafael Wysocki, Randy Dunlap). - Add Clearwater Forest processor support to the intel_idle cpuidle driver (Artem Bityutskiy). - Move sched domains rebuild function from the schedutil cpufreq governor to the Energy Model handling code (Rafael Wysocki). * pm-sleep: PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq() PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic PM: sleep: convert comment from kernel-doc to plain comment PM: wakeup: implement devm_device_init_wakeup() helper PM: sleep: sysfs: don't include 'pm_wakeup.h' directly PM: sleep: autosleep: don't include 'pm_wakeup.h' directly PM: sleep: Update stale comment in device_resume() * pm-cpuidle: intel_idle: add Clearwater Forest SoC support * pm-em: PM: EM: Move sched domains rebuild function from schedutil to EM
2025-01-19Merge tag 'sched_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Do not adjust the weight of empty group entities and avoid scheduling artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue * tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE sched/fair: Fix EEVDF entity placement bug causing scheduling lag
2025-01-16Merge back earlier cpufreq material for 6.14Rafael J. Wysocki
2025-01-13lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWNNicholas Piggin
CPU unplug first calls __cpu_disable(), and that's where powerpc calls cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all mms in the system. However this CPU may still be using a lazy tlb mm, and its mm_cpumask bit will be cleared from it. The CPU does not switch away from the lazy tlb mm until arch_cpu_idle_dead() calls idle_task_exit(). If that user mm exits in this window, it will not be subject to the lazy tlb mm shootdown and may be freed while in use as a lazy mm by the CPU that is being unplugged. cleanup_cpu_mmu_context() could be moved later, but it looks better to move the lazy tlb mm switching earlier. The problem with doing the lazy mm switching in idle_task_exit() is explained in commit bf2c59fce4074 ("sched/core: Fix illegal RCU from offline CPUs"), which added a wart to switch away from the mm but leave it set in active_mm to be cleaned up later. So instead, switch away from the lazy tlb mm at sched_cpu_wait_empty(), which is the last hotplug state before teardown (CPUHP_AP_SCHED_WAIT_EMPTY). This CPU will never switch to a user thread from this point, so it has no chance to pick up a new lazy tlb mm. This removes the lazy tlb mm handling wart in CPU unplug. With this, idle_task_exit() is not needed anymore and can be cleaned up. This leaves the prototype alone, to be cleaned after this change. herton: took the suggestions from https://lore.kernel.org/all/87jzvyprsw.ffs@tglx/ and made adjustments on the initial patch proposed by Nicholas. Link: https://lkml.kernel.org/r/20230524060455.147699-1-npiggin@gmail.com Link: https://lore.kernel.org/all/20230525205253.E2FAEC433EF@smtp.kernel.org/ Link: https://lkml.kernel.org/r/20241104142318.3295663-1-herton@redhat.com Fixes: 2655421ae69f ("lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13kasan: make kasan_record_aux_stack_noalloc() the default behaviourPeter Zijlstra
kasan_record_aux_stack_noalloc() was introduced to record a stack trace without allocating memory in the process. It has been added to callers which were invoked while a raw_spinlock_t was held. More and more callers were identified and changed over time. Is it a good thing to have this while functions try their best to do a locklessly setup? The only downside of having kasan_record_aux_stack() not allocate any memory is that we end up without a stacktrace if stackdepot runs out of memory and at the same stacktrace was not recorded before To quote Marco Elver from https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/ | I'd be in favor, it simplifies things. And stack depot should be | able to replenish its pool sufficiently in the "non-aux" cases | i.e. regular allocations. Worst case we fail to record some | aux stacks, but I think that's only really bad if there's a bug | around one of these allocations. In general the probabilities | of this being a regression are extremely small [...] Make the kasan_record_aux_stack_noalloc() behaviour default as kasan_record_aux_stack(). [bigeasy@linutronix.de: dressed the diff as patch] Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de Fixes: 7cb3007ce2da ("kasan: generic: introduce kasan_record_aux_stack_noalloc()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: syzbot+39f85d612b7c20d8db48@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67275485.050a0220.3c8d68.0a37.GAE@google.com Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ben Segall <bsegall@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: <kasan-dev@googlegroups.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: syzkaller-bugs@googlegroups.com Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13sched_ext: fix kernel-doc warningsRandy Dunlap
Use the correct function parameter names and function names. Use the correct kernel-doc comment format for struct sched_ext_ops to eliminate a bunch of warnings. ext.c:1418: warning: Excess function parameter 'include_dead' description in 'scx_task_iter_next_locked' ext.c:7261: warning: expecting prototype for scx_bpf_dump(). Prototype was for scx_bpf_dump_bstr() instead ext.c:7352: warning: Excess function parameter 'flags' description in 'scx_bpf_cpuperf_set' ext.c:3150: warning: Function parameter or struct member 'in_fi' not described in 'scx_prio_less' ext.c:4711: warning: Function parameter or struct member 'dur_s' not described in 'scx_softlockup' ext.c:4775: warning: Function parameter or struct member 'bypass' not described in 'scx_ops_bypass' ext.c:7453: warning: Function parameter or struct member 'idle_mask' not described in 'scx_bpf_put_idle_cpumask' ext.c:209: warning: Incorrect use of kernel-doc format: * select_cpu - Pick the target CPU for a task which is being woken up ext.c:236: warning: Incorrect use of kernel-doc format: * enqueue - Enqueue a task on the BPF scheduler ext.c:251: warning: Incorrect use of kernel-doc format: * dequeue - Remove a task from the BPF scheduler ext.c:267: warning: Incorrect use of kernel-doc format: * dispatch - Dispatch tasks from the BPF scheduler and/or user DSQs ext.c:290: warning: Incorrect use of kernel-doc format: * tick - Periodic tick ext.c:300: warning: Incorrect use of kernel-doc format: * runnable - A task is becoming runnable on its associated CPU ext.c:327: warning: Incorrect use of kernel-doc format: * running - A task is starting to run on its associated CPU ext.c:335: warning: Incorrect use of kernel-doc format: * stopping - A task is stopping execution ext.c:346: warning: Incorrect use of kernel-doc format: * quiescent - A task is becoming not runnable on its associated CPU ext.c:366: warning: Incorrect use of kernel-doc format: * yield - Yield CPU ext.c:381: warning: Incorrect use of kernel-doc format: * core_sched_before - Task ordering for core-sched ext.c:399: warning: Incorrect use of kernel-doc format: * set_weight - Set task weight ext.c:408: warning: Incorrect use of kernel-doc format: * set_cpumask - Set CPU affinity ext.c:418: warning: Incorrect use of kernel-doc format: * update_idle - Update the idle state of a CPU ext.c:439: warning: Incorrect use of kernel-doc format: * cpu_acquire - A CPU is becoming available to the BPF scheduler ext.c:449: warning: Incorrect use of kernel-doc format: * cpu_release - A CPU is taken away from the BPF scheduler ext.c:461: warning: Incorrect use of kernel-doc format: * init_task - Initialize a task to run in a BPF scheduler ext.c:476: warning: Incorrect use of kernel-doc format: * exit_task - Exit a previously-running task from the system ext.c:485: warning: Incorrect use of kernel-doc format: * enable - Enable BPF scheduling for a task ext.c:494: warning: Incorrect use of kernel-doc format: * disable - Disable BPF scheduling for a task ext.c:504: warning: Incorrect use of kernel-doc format: * dump - Dump BPF scheduler state on error ext.c:512: warning: Incorrect use of kernel-doc format: * dump_cpu - Dump BPF scheduler state for a CPU on error ext.c:524: warning: Incorrect use of kernel-doc format: * dump_task - Dump BPF scheduler state for a runnable task on error ext.c:535: warning: Incorrect use of kernel-doc format: * cgroup_init - Initialize a cgroup ext.c:550: warning: Incorrect use of kernel-doc format: * cgroup_exit - Exit a cgroup ext.c:559: warning: Incorrect use of kernel-doc format: * cgroup_prep_move - Prepare a task to be moved to a different cgroup ext.c:574: warning: Incorrect use of kernel-doc format: * cgroup_move - Commit cgroup move ext.c:585: warning: Incorrect use of kernel-doc format: * cgroup_cancel_move - Cancel cgroup move ext.c:597: warning: Incorrect use of kernel-doc format: * cgroup_set_weight - A cgroup's weight is being changed ext.c:611: warning: Incorrect use of kernel-doc format: * cpu_online - A CPU became online ext.c:620: warning: Incorrect use of kernel-doc format: * cpu_offline - A CPU is going offline ext.c:633: warning: Incorrect use of kernel-doc format: * init - Initialize the BPF scheduler ext.c:638: warning: Incorrect use of kernel-doc format: * exit - Clean up after the BPF scheduler ext.c:648: warning: Incorrect use of kernel-doc format: * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch ext.c:653: warning: Incorrect use of kernel-doc format: * flags - %SCX_OPS_* flags ext.c:658: warning: Incorrect use of kernel-doc format: * timeout_ms - The maximum amount of time, in milliseconds, that a ext.c:667: warning: Incorrect use of kernel-doc format: * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default ext.c:673: warning: Incorrect use of kernel-doc format: * hotplug_seq - A sequence number that may be set by the scheduler to ext.c:682: warning: Incorrect use of kernel-doc format: * name - BPF scheduler's name ext.c:689: warning: Function parameter or struct member 'select_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enqueue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dequeue' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'tick' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'runnable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'running' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'stopping' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'quiescent' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'yield' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'core_sched_before' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'set_cpumask' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'update_idle' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_acquire' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_release' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'enable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'disable' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump_cpu' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dump_task' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_prep_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_cancel_move' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cgroup_set_weight' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_online' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'cpu_offline' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'init' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'dispatch_max_batch' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'flags' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'timeout_ms' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'exit_dump_len' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'hotplug_seq' not described in 'sched_ext_ops' ext.c:689: warning: Function parameter or struct member 'name' not described in 'sched_ext_ops' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-13psi: Fix race when task wakes up before psi_sched_switch() adjusts flagsChengming Zhou
When running hackbench in a cgroup with bandwidth throttling enabled, following PSI splat was observed: psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 When investigating the series of events leading up to the splat, following sequence was observed: [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120 ... [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0 [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8 # CPU8 goes into newidle balance and releases the rq lock ... # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831) [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1) [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ... psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags for the blocked entity, however, with the introduction of DELAY_DEQUEUE, the block task can wakeup when newidle balance drops the runqueue lock during __schedule(). If a task wakes before psi_sched_switch() adjusts the PSI flags, skip any modifications in psi_enqueue() which would still see the flags of a running task and not a blocked one. Instead, rely on psi_sched_switch() to do the right thing. Since the status returned by try_to_block_task() may no longer be true by the time schedule reaches psi_sched_switch(), check if the task is blocked or not using a combination of task_on_rq_queued() and p->se.sched_delayed checks. [ prateek: Commit message, testing, early bailout in psi_enqueue() ] Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") # 1a6151017ee5 Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Link: https://lore.kernel.org/r/20241227061941.2315-1-kprateek.nayak@amd.com
2025-01-13sched, psi: Don't account irq time if sched_clock_irqtime is disabledYafang Shao
sched_clock_irqtime may be disabled due to the clock source. When disabled, irq_time_read() won't change over time, so there is nothing to account. We can save iterating the whole hierarchy on every tick and context switch. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20250103022409.2544-4-laoar.shao@gmail.com
2025-01-13sched: Don't account irq time if sched_clock_irqtime is disabledYafang Shao
sched_clock_irqtime may be disabled due to the clock source, in which case IRQ time should not be accounted. Let's add a conditional check to avoid unnecessary logic. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com
2025-01-13sched: Define sched_clock_irqtime as static keyYafang Shao
Since CPU time accounting is a performance-critical path, let's define sched_clock_irqtime as a static key to minimize potential overhead. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com
2025-01-13sched/fair: Do not compute overloaded status unnecessarily during lbK Prateek Nayak
Only set sg_overloaded when computing sg_lb_stats() at the highest sched domain since rd->overloaded status is updated only when load balancing at the highest domain. While at it, move setting of sg_overloaded below idle_cpu() check since an idle CPU can never be overloaded. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com
2025-01-13sched/fair: Do not compute NUMA Balancing stats unnecessarily during lbK Prateek Nayak
Aggregate nr_numa_running and nr_preferred_running when load balancing at NUMA domains only. While at it, also move the aggregation below the idle_cpu() check since an idle CPU cannot have any preferred tasks. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com
2025-01-13sched/core: Prioritize migrating eligible tasks in sched_balance_rq()Hao Jia
When the PLACE_LAG scheduling feature is enabled and dst_cfs_rq->nr_queued is greater than 1, if a task is ineligible (lag < 0) on the source cpu runqueue, it will also be ineligible when it is migrated to the destination cpu runqueue. Because we will keep the original equivalent lag of the task in place_entity(). So if the task was ineligible before, it will still be ineligible after migration. So in sched_balance_rq(), we prioritize migrating eligible tasks, and we soft-limit ineligible tasks, allowing them to migrate only when nr_balance_failed is non-zero to avoid load-balancing trying very hard to balance the load. Below are some benchmark test results. From my test results, this patch shows a slight improvement on hackbench. Benchmark ========= All of the benchmarks are done inside a normal cpu cgroup in a clean environment with cpu turbo disabled, and test machine is: Single NUMA machine model is 13th Gen Intel(R) Core(TM) i7-13700, 12 Core/24 HT. Based on master b86545e02e8c. Results ======= hackbench-process-pipes vanilla patched Amean 1 0.5837 ( 0.00%) 0.5733 ( 1.77%) Amean 4 1.4423 ( 0.00%) 1.4503 ( -0.55%) Amean 7 2.5147 ( 0.00%) 2.4773 ( 1.48%) Amean 12 3.9347 ( 0.00%) 3.8880 ( 1.19%) Amean 21 5.3943 ( 0.00%) 5.3873 ( 0.13%) Amean 30 6.7840 ( 0.00%) 6.6660 ( 1.74%) Amean 48 9.8313 ( 0.00%) 9.6100 ( 2.25%) Amean 79 15.4403 ( 0.00%) 14.9580 ( 3.12%) Amean 96 18.4970 ( 0.00%) 17.9533 ( 2.94%) hackbench-process-sockets vanilla patched Amean 1 0.6297 ( 0.00%) 0.6223 ( 1.16%) Amean 4 2.1517 ( 0.00%) 2.0887 ( 2.93%) Amean 7 3.6377 ( 0.00%) 3.5670 ( 1.94%) Amean 12 6.1277 ( 0.00%) 5.9290 ( 3.24%) Amean 21 10.0380 ( 0.00%) 9.7623 ( 2.75%) Amean 30 14.1517 ( 0.00%) 13.7513 ( 2.83%) Amean 48 24.7253 ( 0.00%) 24.2287 ( 2.01%) Amean 79 43.9523 ( 0.00%) 43.2330 ( 1.64%) Amean 96 54.5310 ( 0.00%) 53.7650 ( 1.40%) tbench4 Throughput vanilla patched Hmean 1 255.97 ( 0.00%) 275.01 ( 7.44%) Hmean 2 511.60 ( 0.00%) 544.27 ( 6.39%) Hmean 4 996.70 ( 0.00%) 1006.57 ( 0.99%) Hmean 8 1646.46 ( 0.00%) 1649.15 ( 0.16%) Hmean 16 2259.42 ( 0.00%) 2274.35 ( 0.66%) Hmean 32 4725.48 ( 0.00%) 4735.57 ( 0.21%) Hmean 64 4411.47 ( 0.00%) 4400.05 ( -0.26%) Hmean 96 4284.31 ( 0.00%) 4267.39 ( -0.39%) Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Hao Jia <jiahao1@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241223091446.90208-1-jiahao.kernel@gmail.com
2025-01-13sched/debug: Change need_resched warnings to pr_errDavid Rientjes
need_resched warnings, if enabled, are treated as WARNINGs. If kernel.panic_on_warn is enabled, then this causes a kernel panic. It's highly unlikely that a panic is desired for these warnings, only a stack trace is normally required to debug and resolve. Thus, switch need_resched warnings to simply be a printk with an associated stack trace so they are no longer in scope for panic_on_warn. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Josh Don <joshdon@google.com> Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com
2025-01-13sched/fair: Encapsulate set custom slice in a __setparam_fair() functionVincent Guittot
Similarly to dl, create a __setparam_fair() function to set parameters related to fair class and move it in the fair.c file. No functional changes expected Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org
2025-01-13sched: Fix race between yield_to() and try_to_wake_up()Tianchen Ding
We met a SCHED_WARN in set_next_buddy(): __warn_printk set_next_buddy yield_to_task_fair yield_to kvm_vcpu_yield_to [kvm] ... After a short dig, we found the rq_lock held by yield_to() may not be exactly the rq that the target task belongs to. There is a race window against try_to_wake_up(). CPU0 target_task blocking on CPU1 lock rq0 & rq1 double check task_rq == p_rq, ok woken to CPU2 (lock task_pi & rq2) task_rq = rq2 yield_to_task_fair (w/o lock rq2) In this race window, yield_to() is operating the task w/o the correct lock. Fix this by taking task pi_lock first. Fixes: d95f41220065 ("sched: Add yield_to(task, preempt) functionality") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241231055020.6521-1-dtcccc@linux.alibaba.com
2025-01-13sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUEPeter Zijlstra
Normally dequeue_entities() will continue to dequeue an empty group entity; except DELAY_DEQUEUE changes things -- it retains empty entities such that they might continue to compete and burn off some lag. However, doing this results in update_cfs_group() re-computing the cgroup weight 'slice' for an empty group, which it (rightly) figures isn't much at all. This in turn means that the delayed entity is not competing at the expected weight. Worse, the very low weight causes its lag to be inflated, which combined with avg_vruntime() using scale_load_down(), leads to artifacts. As such, don't adjust the weight for empty group entities and let them compete at their original weight. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
2025-01-12delayacct: add delay min to record delay peakWang Yaxin
Delay accounting can now calculate the average delay of processes, detect the overall system load, and also record the 'delay max' to identify potential abnormal delays. However, 'delay min' can help us identify another useful delay peak. By comparing the difference between 'delay max' and 'delay min', we can understand the optimization space for latency, providing a reference for the optimization of latency performance. Use case ========= bash-4.4# ./getdelays -d -t 242 print delayacct stats ON TGID 242 CPU count real total virtual total delay total delay average delay max delay min 39 156000000 156576579 2111069 0.054ms 0.212296ms 0.031307ms IO count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms SWAP count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms RECLAIM count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms THRASHING count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms COMPACT count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms WPCOPY count delay total delay average delay max delay min 156 11215873 0.072ms 0.207403ms 0.033913ms IRQ count delay total delay average delay max delay min 0 0 0.000ms 0.000000ms 0.000000ms Link: https://lkml.kernel.org/r/20241220173105906EOdsPhzjMLYNJJBqgz1ga@zte.com.cn Co-developed-by: Wang Yong <wang.yong12@zte.com.cn> Signed-off-by: Wang Yong <wang.yong12@zte.com.cn> Co-developed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Co-developed-by: Kun Jiang <jiang.kun2@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Peilin He <he.peilin@zte.com.cn> Cc: tuqiang <tu.qiang35@zte.com.cn> Cc: ye xingchen <ye.xingchen@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12delayacct: add delay max to record delay peakWang Yaxin
Introduce the use cases of delay max, which can help quickly detect potential abnormal delays in the system and record the types and specific details of delay spikes. Problem ======== Delay accounting can track the average delay of processes to show system workload. However, when a process experiences a significant delay, maybe a delay spike, which adversely affects performance, getdelays can only display the average system delay over a period of time. Yet, average delay is unhelpful for diagnosing delay peak. It is not even possible to determine which type of delay has spiked, as this information might be masked by the average delay. Solution ========= the 'delay max' can display delay peak since the system's startup, which can record potential abnormal delays over time, including the type of delay and the maximum delay. This is helpful for quickly identifying crash caused by delay. Use case ========= bash# ./getdelays -d -p 244 print delayacct stats ON PID 244 CPU count real total virtual total delay total delay average delay max 68 192000000 213676651 705643 0.010ms 0.306381ms IO count delay total delay average delay max 0 0 0.000ms 0.000000ms SWAP count delay total delay average delay max 0 0 0.000ms 0.000000ms RECLAIM count delay total delay average delay max 0 0 0.000ms 0.000000ms THRASHING count delay total delay average delay max 0 0 0.000ms 0.000000ms COMPACT count delay total delay average delay max 0 0 0.000ms 0.000000ms WPCOPY count delay total delay average delay max 235 15648284 0.067ms 0.263842ms IRQ count delay total delay average delay max 0 0 0.000ms 0.000000ms [wang.yaxin@zte.com.cn: update docs and fix some spelling errors] Link: https://lkml.kernel.org/r/20241213192700771XKZ8H30OtHSeziGqRVMs0@zte.com.cn Link: https://lkml.kernel.org/r/20241203164848805CS62CQPQWG9GLdQj2_BxS@zte.com.cn Co-developed-by: Wang Yong <wang.yong12@zte.com.cn> Signed-off-by: Wang Yong <wang.yong12@zte.com.cn> Co-developed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Peilin He <he.peilin@zte.com.cn> Cc: tuqiang <tu.qiang35@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: ye xingchen <ye.xingchen@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-10Merge tag 'sched_ext-for-6.13-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix corner case bug where ops.dispatch() couldn't extend the execution of the current task if SCX_OPS_ENQ_LAST is set. - Fix ops.cpu_release() not being called when a SCX task is preempted by a higher priority sched class task. - Fix buitin idle mask being incorrectly left as busy after an idle CPU is picked and kicked. - scx_ops_bypass() was unnecessarily using rq_lock() which comes with rq pinning related sanity checks which could trigger spuriously. Switch to raw_spin_rq_lock(). * tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Refresh idle masks during idle-to-idle transitions sched_ext: switch class when preempted by higher priority scheduler sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass() sched_ext: keep running prev when prev->scx.slice != 0
2025-01-10sched_ext: idle: Refresh idle masks during idle-to-idle transitionsAndrea Righi
With the consolidation of put_prev_task/set_next_task(), see commit 436f3eed5c69 ("sched: Combine the last put_prev_task() and the first set_next_task()"), we are now skipping the transition between these two functions when the previous and the next tasks are the same. As a result, the scx idle state of a CPU is updated only when transitioning to or from the idle thread. While this is generally correct, it can lead to uneven and inefficient core utilization in certain scenarios [1]. A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu() selects and marks an idle CPU as busy, followed by a wake-up via scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU continues running the idle thread, returns to idle, but remains marked as busy, preventing it from being selected again as an idle CPU (until a task eventually runs on it and releases the CPU). For example, running a workload that uses 20% of each CPU, combined with an scx scheduler using proactive wake-ups, results in the following core utilization: CPU 0: 25.7% CPU 1: 29.3% CPU 2: 26.5% CPU 3: 25.5% CPU 4: 0.0% CPU 5: 25.5% CPU 6: 0.0% CPU 7: 10.5% To address this, refresh the idle state also in pick_task_idle(), during idle-to-idle transitions, but only trigger ops.update_idle() on actual state changes to prevent unnecessary updates to the scx scheduler and maintain balanced state transitions. With this change in place, the core utilization in the previous example becomes the following: CPU 0: 18.8% CPU 1: 19.4% CPU 2: 18.0% CPU 3: 18.7% CPU 4: 19.3% CPU 5: 18.9% CPU 6: 18.7% CPU 7: 19.3% [1] https://github.com/sched-ext/scx/pull/1139 Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: Implement scx_bpf_now()Changwoo Min
Returns a high-performance monotonically non-decreasing clock for the current CPU. The clock returned is in nanoseconds. It provides the following properties: 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently to account for execution time and track tasks' runtime properties. Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which eventually reads a hardware timestamp counter -- is neither performant nor scalable. scx_bpf_now() aims to provide a high-performance clock by using the rq clock in the scheduler core whenever possible. 2) High enough resolution for the BPF scheduler use cases: In most BPF scheduler use cases, the required clock resolution is lower than the most accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically uses the rq clock in the scheduler core whenever it is valid. It considers that the rq clock is valid from the time the rq clock is updated (update_rq_clock) until the rq is unlocked (rq_unpin_lock). 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now() guarantees the clock never goes backward when comparing them in the same CPU. On the other hand, when comparing clocks in different CPUs, there is no such guarantee -- the clock can go backward. It provides a monotonically *non-decreasing* clock so that it would provide the same clock values in two different scx_bpf_now() calls in the same CPU during the same period of when the rq clock is valid. An rq clock becomes valid when it is updated using update_rq_clock() and invalidated when the rq is unlocked using rq_unpin_lock(). Let's suppose the following timeline in the scheduler core: T1. rq_lock(rq) T2. update_rq_clock(rq) T3. a sched_ext BPF operation T4. rq_unlock(rq) T5. a sched_ext BPF operation T6. rq_lock(rq) T7. update_rq_clock(rq) For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is set), so scx_bpf_now() calls during [T2, T4) (including T3) will return the rq clock updated at T2. For duration [T4, T7), when a BPF scheduler can still call scx_bpf_now() (T5), we consider the rq clock is invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling scx_bpf_now() at T5, we will return a fresh clock value by calling sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks from a previous scx scheduler, invalidate all the rq clocks when unloading a BPF scheduler. One example of calling scx_bpf_now(), when the rq clock is invalid (like T5), is in scx_central [1]. The scx_central scheduler uses a BPF timer for preemptive scheduling. In every msec, the timer callback checks if the currently running tasks exceed their timeslice. At the beginning of the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central gets the current time. When the BPF timer callback runs, the rq clock could be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh clock value rather than returning the old one (T2). [1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: Relocate scx_enabled() related codeChangwoo Min
scx_enabled() will be used in scx_rq_clock_update/invalidate() in the following patch, so relocate the scx_enabled() related code to the proper location. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-09sched/fair: Fix EEVDF entity placement bug causing scheduling lagPeter Zijlstra
I noticed this in my traces today: turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } -> { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } -> { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 } Which is a weight transition: 1048576 -> 2 -> 1048576. One would expect the lag to shoot out *AND* come back, notably: -1123*1048576/2 = -588775424 -588775424*2/1048576 = -1123 Except the trace shows it is all off. Worse, subsequent cycles shoot it out further and further. This made me have a very hard look at reweight_entity(), and specifically the ->on_rq case, which is more prominent with DELAY_DEQUEUE. And indeed, it is all sorts of broken. While the computation of the new lag is correct, the computation for the new vruntime, using the new lag is broken for it does not consider the logic set out in place_entity(). With the below patch, I now see things like: migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 } migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 } Which isn't perfect yet, but much closer. Reported-by: Doug Smythies <dsmythies@telus.net> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
2025-01-08treewide: Introduce kthread_run_worker[_on_cpu]()Frederic Weisbecker
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08sched,arm64: Handle CPU isolation on last resort fallback rq selectionFrederic Weisbecker
When a kthread or any other task has an affinity mask that is fully offline or unallowed, the scheduler reaffines the task to all possible CPUs as a last resort. This default decision doesn't mix up very well with nohz_full CPUs that are part of the possible cpumask but don't want to be disturbed by unbound kthreads or even detached pinned user tasks. Make the fallback affinity setting aware of nohz_full. Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-01-08sched_ext: switch class when preempted by higher priority schedulerHonglei Wang
ops.cpu_release() function, if defined, must be invoked when preempted by a higher priority scheduler class task. This scenario was skipped in commit f422316d7466 ("sched_ext: Remove switch_class_scx()"). Let's fix it. Fixes: f422316d7466 ("sched_ext: Remove switch_class_scx()") Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()Changwoo Min
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks. For each CPU, it acquires a lock using rq_lock() regardless of whether a CPU is offline or the CPU is currently running a task in a higher scheduler class (e.g., deadline). The rq_lock() is supposed to be used for online CPUs, and the use of rq_lock() may trigger an unnecessary warning in rq_pin_lock(). Therefore, replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass(). Without this change, we observe the following warning: ===== START ===== [ 6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90 ===== END ===== Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()") Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08sched_ext: keep running prev when prev->scx.slice != 0Henry Huang
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0, @prev will be dispacthed into the local DSQ in put_prev_task_scx(). However, pick_task_scx() is executed before put_prev_task_scx(), so it will not pick @prev. Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx() can pick @prev. Signed-off-by: Henry Huang <henry.hj@antgroup.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06sched_ext: Include remaining task time slice in error state dumpAndrea Righi
Report the remaining time slice when dumping task information during an error exit. This information can be useful for tracking incorrect or excessively long time slices in schedulers that implement dynamic time slice logic. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ONAndrea Righi
With commit 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches"), scx_bpf_dsq_insert() can use SCX_DSQ_LOCAL_ON for direct dispatch from ops.enqueue() to target the local DSQ of any CPU. Update the documentation accordingly. Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06sched_ext: idle: small CPU iteration refactoringAndrea Righi
Replace the loop to check if all SMT CPUs are idle with cpumask_subset(). This simplifies the code and slightly improves efficiency, while preserving the original behavior. Note that idle_masks.smt handling remains racy, which is acceptable as it serves as an optimization and is self-correcting. Suggested-and-reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-03Merge tag 'sched_ext-for-6.13-rc5-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration - Fix a bug where scx_ops_bypass() could call irq_restore twice - Add Andrea and Changwoo as maintainers for better review coverage - selftests and tools/sched_ext build and other fixes * tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix dsq_local_on selftest sched_ext: initialize kit->cursor.flags sched_ext: Fix invalid irq restore in scx_ops_bypass() MAINTAINERS: add me as reviewer for sched_ext MAINTAINERS: add self as reviewer for sched_ext scx: Fix maximal BPF selftest prog sched_ext: fix application of sizeof to pointer selftests/sched_ext: fix build after renames in sched_ext API sched_ext: Add __weak to fix the build errors
2024-12-29sched_ext: idle: introduce check_builtin_idle_enabled() helperAndrea Righi
Minor refactoring to add a helper function for checking if the built-in idle CPU selection policy is enabled. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29sched_ext: idle: clarify commentsAndrea Righi
Add a comments to clarify about the usage of cpumask_intersects(). Moreover, update scx_select_cpu_dfl() description clarifying that the final step of the idle selection logic involves searching for any idle CPU in the system that the task can use. Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29sched_ext: idle: use assign_cpu() to update the idle cpumaskAndrea Righi
Use the assign_cpu() helper to set or clear the CPU in the idle mask, based on the idle condition. Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-24sched_ext: initialize kit->cursor.flagsHenry Huang
struct bpf_iter_scx_dsq *it maybe not initialized. If we didn't call scx_bpf_dsq_move_set_vtime and scx_bpf_dsq_move_set_slice before scx_bpf_dsq_move, it would cause unexpected behaviors: 1. Assign a huge slice into p->scx.slice 2. Assign a invalid vtime into p->scx.dsq_vtime Signed-off-by: Henry Huang <henry.hj@antgroup.com> Fixes: 6462dd53a260 ("sched_ext: Compact struct bpf_iter_scx_dsq_kern") Cc: stable@vger.kernel.org # v6.12 Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-24sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology()Thorsten Blum
Remove hard-coded strings by using the str_enabled_disabled() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-23Merge back earlier cpufreq material for 6.14Rafael J. Wysocki
2024-12-20docs: Update Schedstat version to 17Swapnil Sapkal
Update the Schedstat version to 17 as more fields are added to report different kinds of imbalances in the sched domain. Also domain field started printing corresponding domain name. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-7-swapnil.sapkal@amd.com
2024-12-20sched/stats: Print domain name in /proc/schedstatK Prateek Nayak
Currently, there does not exist a straightforward way to extract the names of the sched domains and match them to the per-cpu domain entry in /proc/schedstat other than looking at the debugfs files which are only visible after enabling "verbose" debug after commit 34320745dfc9 ("sched/debug: Put sched/domains files under the verbose flag") Since tools like `perf sched stats`[1] require displaying per-domain information in user friendly manner, display the names of sched domain, alongside their level in /proc/schedstat. Domain names also makes the /proc/schedstat data unambiguous when some of the cpus are offline. For example, on a 128 cpus AMD Zen3 machine where CPU0 and CPU64 are SMT siblings and CPU64 is offline: Before: cpu0 ... domain0 ... domain1 ... cpu1 ... domain0 ... domain1 ... domain2 ... After: cpu0 ... domain0 MC ... domain1 PKG ... cpu1 ... domain0 SMT ... domain1 MC ... domain2 PKG ... [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20241220063224.17767-6-swapnil.sapkal@amd.com
2024-12-20sched: Move sched domain name out of CONFIG_SCHED_DEBUGSwapnil Sapkal
/proc/schedstat file shows cpu and sched domain level scheduler statistics. It does not show domain name instead shows domain level. It will be very useful for tools like `perf sched stats`[1] to aggragate domain level stats if domain names are shown in /proc/schedstat. But sched domain name is guarded by CONFIG_SCHED_DEBUG. As per the discussion[2], move sched domain name out of CONFIG_SCHED_DEBUG. [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ [2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/ Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-5-swapnil.sapkal@amd.com
2024-12-20sched: Report the different kinds of imbalances in /proc/schedstatSwapnil Sapkal
In /proc/schedstat, lb_imbalance reports the sum of imbalances discovered in sched domains with each call to sched_balance_rq(), which is not very useful because lb_imbalance does not mention whether the imbalance is due to load, utilization, nr_tasks or misfit_tasks. Remove this field from /proc/schedstat. Currently there is no field in /proc/schedstat to report different types of imbalances. Introduce new fields in /proc/schedstat to report the total imbalances in load, utilization, nr_tasks or misfit_tasks. Added fields to /proc/schedstat: - lb_imbalance_load: Total imbalance due to load. - lb_imbalance_util: Total imbalance due to utilization. - lb_imbalance_task: Total imbalance due to number of tasks. - lb_imbalance_misfit: Total imbalance due to misfit tasks. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241220063224.17767-4-swapnil.sapkal@amd.com
2024-12-20sched/fair: Cleanup in migrate_degrades_locality() to improve readabilityPeter Zijlstra
migrate_degrade_locality() would return {1, 0, -1} respectively to indicate that migration would degrade-locality, would improve locality, would be ambivalent to locality improvements. This patch improves readability by changing the return value to mean: * Any positive value degrades locality * 0 migration doesn't affect locality * Any negative value improves locality [Swapnil: Fixed comments around code and wrote commit log] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-3-swapnil.sapkal@amd.com
2024-12-20sched/fair: Fix value reported by hot tasks pulled in /proc/schedstatPeter Zijlstra
In /proc/schedstat, lb_hot_gained reports the number hot tasks pulled during load balance. This value is incremented in can_migrate_task() if the task is migratable and hot. After incrementing the value, load balancer can still decide not to migrate this task leading to wrong accounting. Fix this by incrementing stats when hot tasks are detached. This issue only exists in detach_tasks() where we can decide to not migrate hot task even if it is migratable. However, in detach_one_task(), we migrate it unconditionally. [Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log] Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-2-swapnil.sapkal@amd.com
2024-12-20sched/fair: Update comments after sched_tick() rename.Sebastian Andrzej Siewior
scheduler_tick() was renamed to sched_tick() in 86dd6c04ef9f2 ("sched/balancing: Rename scheduler_tick() => sched_tick()"). Update comments still referring to scheduler_tick. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241219085839.302378-1-bigeasy@linutronix.de
2024-12-18PM: EM: Move sched domains rebuild function from schedutil to EMRafael J. Wysocki
Function sugov_eas_rebuild_sd() defined in the schedutil cpufreq governor implements generic functionality that may be useful in other places. In particular, there is a plan to use it in the intel_pstate driver in the future. For this reason, move it from schedutil to the energy model code and rename it to em_rebuild_sched_domains(). This also helps to get rid of some #ifdeffery in schedutil which is a plus. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
2024-12-18cpufreq: schedutil: Fix superfluous updates caused by need_freq_updateSultan Alsawaf (unemployed)
A redundant frequency update is only truly needed when there is a policy limits change with a driver that specifies CPUFREQ_NEED_UPDATE_LIMITS. In spite of that, drivers specifying CPUFREQ_NEED_UPDATE_LIMITS receive a frequency update _all the time_, not just for a policy limits change, because need_freq_update is never cleared. Furthermore, ignore_dl_rate_limit()'s usage of need_freq_update also leads to a redundant frequency update, regardless of whether or not the driver specifies CPUFREQ_NEED_UPDATE_LIMITS, when the next chosen frequency is the same as the current one. Fix the superfluous updates by only honoring CPUFREQ_NEED_UPDATE_LIMITS when there's a policy limits change, and clearing need_freq_update when a requisite redundant update occurs. This is neatly achieved by moving up the CPUFREQ_NEED_UPDATE_LIMITS test and instead setting need_freq_update to false in sugov_update_next_freq(). Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") Signed-off-by: Sultan Alsawaf (unemployed) <sultan@kerneltoast.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20241212015734.41241-2-sultan@kerneltoast.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>