summaryrefslogtreecommitdiff
path: root/kernel/sched/fair.c
AgeCommit message (Collapse)Author
2024-12-09sched/fair: Add new cfs_rq.h_nr_runnableVincent Guittot
With delayed dequeued feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed. As a result, it stays also visible in the statistics that are used to balance the system and in particular the field cfs.h_nr_queued when the sched_entity is associated to a task. Create a new h_nr_runnable that tracks only queued and runnable tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
2024-12-09sched/fair: Rename h_nr_running into h_nr_queuedVincent Guittot
With delayed dequeued feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed but can't run. Rename h_nr_running into h_nr_queued to reflect this new behavior. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
2024-12-09Merge branch 'sched/urgent'Peter Zijlstra
Sync with urgent bits as a base for further work. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-12-09sched/eevdf: More PELT vs DELAYED_DEQUEUEPeter Zijlstra
Vincent and Dietmar noted that while commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes the entity runnable stats, it does not adjust the cfs_rq runnable stats, which are based off of h_nr_running. Track h_nr_delayed such that we can discount those and adjust the signal. Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") Closes: https://lore.kernel.org/lkml/a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com/ Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/ [ Fixes checkpatch warnings and rebased ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
2024-12-09sched/fair: Fix NEXT_BUDDYK Prateek Nayak
Adam reports that enabling NEXT_BUDDY insta triggers a WARN in pick_next_entity(). Moving clear_buddies() up before the delayed dequeue bits ensures no ->next buddy becomes delayed. Further ensure no new ->next buddy ever starts as delayed. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Adam Li <adamli@os.amperecomputing.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Adam Li <adamli@os.amperecomputing.com> Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
2024-12-05sched/numa: fix memory leak due to the overwritten vma->numab_stateAdrian Huang
[Problem Description] When running the hackbench program of LTP, the following memory leak is reported by kmemleak. # /opt/ltp/testcases/bin/hackbench 20 thread 1000 Running with 20*40 (== 800) tasks. # dmesg | grep kmemleak ... kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak) kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak) # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888cd8ca2c40 (size 64): comm "hackbench", pid 17142, jiffies 4299780315 hex dump (first 32 bytes): ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00 .tI.....L.I..... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc bff18fd4): [<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0 [<ffffffff8113f715>] task_numa_work+0x725/0xa00 [<ffffffff8110f878>] task_work_run+0x58/0x90 [<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0 [<ffffffff81dd78d5>] do_syscall_64+0x85/0x150 [<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e ... This issue can be consistently reproduced on three different servers: * a 448-core server * a 256-core server * a 192-core server [Root Cause] Since multiple threads are created by the hackbench program (along with the command argument 'thread'), a shared vma might be accessed by two or more cores simultaneously. When two or more cores observe that vma->numab_state is NULL at the same time, vma->numab_state will be overwritten. Although current code ensures that only one thread scans the VMAs in a single 'numa_scan_period', there might be a chance for another thread to enter in the next 'numa_scan_period' while we have not gotten till numab_state allocation [1]. Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000` cannot the reproduce the issue. It is verified with 200+ test runs. [Solution] Use the cmpxchg atomic operation to ensure that only one thread executes the vma->numab_state assignment. [1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/ Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reported-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-02sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISEWaiman Long
As all the non-domain and non-managed_irq housekeeping types have been unified to HK_TYPE_KERNEL_NOISE, replace all these references in the scheduler to use HK_TYPE_KERNEL_NOISE. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-5-longman@redhat.com
2024-12-02sched/core: Remove HK_TYPE_SCHEDWaiman Long
The HK_TYPE_SCHED housekeeping type is defined but not set anywhere. So any code that try to use HK_TYPE_SCHED are essentially dead code. So remove HK_TYPE_SCHED and any code that use it. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20241030175253.125248-2-longman@redhat.com
2024-12-02sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()Valentin Schneider
Andy reported that clang gets upset with CONFIG_CFS_BANDWIDTH=n: kernel/sched/fair.c:6580:20: error: unused function 'cfs_bandwidth_used' [-Werror,-Wunused-function] 6580 | static inline bool cfs_bandwidth_used(void) | ^~~~~~~~~~~~~~~~~~ Indeed, cfs_bandwidth_used() is only used within functions defined under CONFIG_CFS_BANDWIDTH=y. Remove its CONFIG_CFS_BANDWIDTH=n declaration & definition. Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20241127165501.160004-1-vschneid@redhat.com
2024-12-02sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning ↵K Prateek Nayak
busy Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the TIF_NEED_RESCHED flag in idle task's thread info and relying on flush_smp_call_function_queue() in idle exit path to run the call-function. A softirq raised by the call-function is handled shortly after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag remains set and is only cleared later when schedule_idle() calls __schedule(). need_resched() check in _nohz_idle_balance() exists to bail out of load balancing if another task has woken up on the CPU currently in-charge of idle load balancing which is being processed in SCHED_SOFTIRQ context. Since the optimization mentioned above overloads the interpretation of TIF_NEED_RESCHED, check for idle_cpu() before going with the existing need_resched() check which can catch a genuine task wakeup on an idle CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as well as the case where ksoftirqd needs to be preempted as a result of new task wakeup or slice expiry. In case of PREEMPT_RT or threadirqs, although the idle load balancing may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd is the only fair task going back to sleep will trigger a newidle balance on the CPU which will alleviate some imbalance if it exists if idle balance fails to do so. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
2024-11-19Merge tag 'sched-core-2024-11-18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core facilities: - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes fair-class preemption by delaying preemption requests to the tick boundary, while working as full preemption for RR/FIFO/DEADLINE classes. (Peter Zijlstra) - x86: Enable Lazy preemption (Peter Zijlstra) - riscv: Enable Lazy preemption (Jisheng Zhang) - Initialize idle tasks only once (Thomas Gleixner) - sched/ext: Remove sched_fork() hack (Thomas Gleixner) Fair scheduler: - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie) Idle loop: - Optimize the generic idle loop by removing unnecessary memory barrier (Zhongqiu Han) RSEQ: - Improve cache locality of RSEQ concurrency IDs for intermittent workloads (Mathieu Desnoyers) Waitqueues: - Make wake_up_{bit,var} less fragile (Neil Brown) PSI: - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner) Preparatory patches for proxy execution: - Add move_queued_task_locked helper (Connor O'Brien) - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien) - Split out __schedule() deactivate task logic into a helper (John Stultz) - Split scheduler and execution contexts (Peter Zijlstra) - Make mutex::wait_lock irq safe (Juri Lelli) - Expose __mutex_owner() (Juri Lelli) - Remove wakeups from under mutex::wait_lock (Peter Zijlstra) Misc fixes and cleanups: - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp) - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior) - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert) - remove the DOUBLE_TICK feature (Huang Shijie) - fix the comment for PREEMPT_SHORT (Huang Shijie) - Fix unnused variable warning (Christian Loehle) - No PREEMPT_RT=y for all{yes,mod}config" * tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY. sched: No PREEMPT_RT=y for all{yes,mod}config riscv: add PREEMPT_LAZY support sched, x86: Enable Lazy preemption sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT sched: Add Lazy preemption model sched: Add TIF_NEED_RESCHED_LAZY infrastructure sched/ext: Remove sched_fork() hack sched: Initialize idle tasks only once sched: psi: pass enqueue/dequeue flags to psi callbacks directly sched/uclamp: Fix unnused variable warning sched: Split scheduler and execution contexts sched: Split out __schedule() deactivate task logic into a helper sched: Consolidate pick_*_task to task_is_pushable helper sched: Add move_queued_task_locked helper locking/mutex: Expose __mutex_owner() locking/mutex: Make mutex::wait_lock irq safe locking/mutex: Remove wakeups from under mutex::wait_lock sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads sched: idle: Optimize the generic idle loop by removing needless memory barrier ...
2024-11-05sched: Add Lazy preemption modelPeter Zijlstra
Change fair to use resched_curr_lazy(), which, when the lazy preemption model is selected, will set TIF_NEED_RESCHED_LAZY. This LAZY bit will be promoted to the full NEED_RESCHED bit on tick. As such, the average delay between setting LAZY and actually rescheduling will be TICK_NSEC/2. In short, Lazy preemption will delay preemption for fair class but will function as Full preemption for all the other classes, most notably the realtime (RR/FIFO/DEADLINE) classes. The goal is to bridge the performance gap with Voluntary, such that we might eventually remove that option entirely. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org
2024-10-26sched/numa: Fix the potential null pointer dereference in task_numa_work()Shawn Wang
When running stress-ng-vm-segv test, we found a null pointer dereference error in task_numa_work(). Here is the backtrace: [323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 ...... [323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se ...... [323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) [323676.067115] pc : vma_migratable+0x1c/0xd0 [323676.067122] lr : task_numa_work+0x1ec/0x4e0 [323676.067127] sp : ffff8000ada73d20 [323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010 [323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000 [323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000 [323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8 [323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035 [323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8 [323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4 [323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001 [323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000 [323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000 [323676.067152] Call trace: [323676.067153] vma_migratable+0x1c/0xd0 [323676.067155] task_numa_work+0x1ec/0x4e0 [323676.067157] task_work_run+0x78/0xd8 [323676.067161] do_notify_resume+0x1ec/0x290 [323676.067163] el0_svc+0x150/0x160 [323676.067167] el0t_64_sync_handler+0xf8/0x128 [323676.067170] el0t_64_sync+0x17c/0x180 [323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000) [323676.067177] SMP: stopping secondary CPUs [323676.070184] Starting crashdump kernel... stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error handling function of the system, which tries to cause a SIGSEGV error on return from unmapping the whole address space of the child process. Normally this program will not cause kernel crashes. But before the munmap system call returns to user mode, a potential task_numa_work() for numa balancing could be added and executed. In this scenario, since the child process has no vma after munmap, the vma_next() in task_numa_work() will return a null pointer even if the vma iterator restarts from 0. Recheck the vma pointer before dereferencing it in task_numa_work(). Fixes: 214dbc428137 ("sched: convert to vma iterator") Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # v6.2+ Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.com
2024-10-23sched: Fix pick_next_task_fair() vs try_to_wake_up() racePeter Zijlstra
Syzkaller robot reported KCSAN tripping over the ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task(). The report noted that both pick_next_task_fair() and try_to_wake_up() were concurrently trying to write to the same p->on_rq, violating the assertion -- even though both paths hold rq->__lock. The logical consequence is that both code paths end up holding a different rq->__lock. And looking through ttwu(), this is possible when the __block_task() 'p->on_rq = 0' store is visible to the ttwu() 'p->on_rq' load, which then assumes the task is not queued and continues to migrate it. Rearrange things such that __block_task() releases @p with the store and no code thereafter will use @p again. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: syzbot+0ec1e96c2cdf5c0e512a@syzkaller.appspotmail.com Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20241023093641.GE16066@noisy.programming.kicks-ass.net
2024-10-14sched: Split scheduler and execution contextsPeter Zijlstra
Let's define the "scheduling context" as all the scheduler state in task_struct for the task chosen to run, which we'll call the donor task, and the "execution context" as all state required to actually run the task. Currently both are intertwined in task_struct. We want to logically split these such that we can use the scheduling context of the donor task selected to be scheduled, but use the execution context of a different task to actually be run. To this purpose, introduce rq->donor field to point to the task_struct chosen from the runqueue by the scheduler, and will be used for scheduler state, and preserve rq->curr to indicate the execution context of the task that will actually be run. This patch introduces the donor field as a union with curr, so it doesn't cause the contexts to be split yet, but adds the logic to handle everything separately. [add additional comments and update more sched_class code to use rq::proxy] [jstultz: Rebased and resolved minor collisions, reworked to use accessors, tweaked update_curr_common to use rq_proxy fixing rt scheduling issues] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-8-jstultz@google.com
2024-10-14Merge branch 'tip/sched/urgent'Peter Zijlstra
Sync with sched/urgent to avoid conflicts. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-10-11sched: Fix delayed_dequeue vs switched_from_fair()Peter Zijlstra
Commit 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") and its follow up fixes try to deal with a rather unfortunate situation where is task is enqueued in a new class, even though it shouldn't have been. Mostly because the existing ->switched_to/from() hooks are in the wrong place for this case. This all led to Paul being able to trigger failures at something like once per 10k CPU hours of RCU torture. For now, do the ugly thing and move the code to the right place by ignoring the switch hooks. Note: Clean up the whole sched_class::switch*_{to,from}() thing. Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
2024-10-07sched/fair: optimize the PLACE_LAG when se->vlag is zeroHuang Shijie
When PLACE_LAG is enabled, from the relationship: vl_i = (W + w_i)*vl'_i / W we know that if vl'_i(se->vlag) is zero, the vl_i is zero too. So if se->vlag is zero, there is no need to waste cycles to do the calculation. Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com> Link: https://lkml.kernel.org/r/20241001070021.10626-1-shijie@os.amperecomputing.com
2024-10-07sched/fair: remove the DOUBLE_TICK featureHuang Shijie
The patch "5e963f2bd46 sched/fair: Commit to EEVDF" removed the code following the DOUBLE_TICK: - - if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1) - check_preempt_tick(cfs_rq, curr); The DOUBLE_TICK feature becomes dead code now, so remove it. Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Christoph Lameter (Ampere)" <cl@linux.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20241001065451.10356-1-shijie@os.amperecomputing.com
2024-10-02sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_runningChen Yu
Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt") introduced a mechanism that a wakee with shorter slice could preempt the current running task. It also lower the bar for the current task to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running when the current task has ran out of time slice. But there is a scenario that is problematic. Say, if there is 1 cfs task and 1 rt task, before 85e511df3cec, update_deadline() will not trigger a reschedule, and after 85e511df3cec, since rq->nr_running is 2 and resched is true, a resched_curr() would happen. Some workloads (like the hackbench reported by lkp) do not like over-scheduling. We can see that the preemption rate has been increased by 2.2%: 1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches Restore its previous check criterion. Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt") Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com Reported-by: kernel test robot <oliver.sang@intel.com> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Honglei Wang <jameshongleiwang@126.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20240925085440.358138-1-yu.c.chen@intel.com
2024-10-02sched: Fix sched_delayed vs cfs_bandwidthMike Galbraith
Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq() leads to a couple terminal scenarios. Finish it first, so ENQUEUE_WAKEUP can proceed as it would have sans DELAY_DEQUEUE treatment. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/7515d2e64c989b9e3b828a9e21bcd959b99df06a.camel@gmx.de
2024-09-21Merge tag 'sched_ext-for-6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext support from Tejun Heo: "This implements a new scheduler class called ‘ext_sched_class’, or sched_ext, which allows scheduling policies to be implemented as BPF programs. The goals of this are: - Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. - Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. - Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments" See individual commits for more documentation, but also the cover letter for the latest series: Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/ * tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits) sched: Move update_other_load_avgs() to kernel/sched/pelt.c sched_ext: Don't trigger ops.quiescent/runnable() on migrations sched_ext: Synchronize bypass state changes with rq lock scx_qmap: Implement highpri boosting sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() sched_ext: Compact struct bpf_iter_scx_dsq_kern sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() sched_ext: Move consume_local_task() upward sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() sched_ext: Reorder args for consume_local/remote_task() sched_ext: Restructure dispatch_to_local_dsq() sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON sched_ext: Refactor consume_remote_task() sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate sched_ext: Add missing static to scx_dump_data sched_ext: Add missing static to scx_has_op[] sched_ext: Temporarily work around pick_task_scx() being called without balance_scx() sched_ext: Add a cgroup scheduler which uses flattened hierarchy sched_ext: Add cgroup support ...
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
2024-09-19Merge tag 'sched-core-2024-09-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of Clean up usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousefm and Vincent Guittot) * tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched/cpufreq: Use NSEC_PER_MSEC for deadline task cpufreq/cppc: Use NSEC_PER_MSEC for deadline task sched/deadline: Clarify nanoseconds in uapi sched/deadline: Convert schedtool example to chrt sched/debug: Fix the runnable tasks output sched: Fix sched_delayed vs sched_core kernel/sched: Fix util_est accounting for DELAY_DEQUEUE kthread: Fix task state in kthread worker if being frozen sched/pelt: Use rq_clock_task() for hw_pressure sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() sched: Add put_prev_task(.next) sched: Rework dl_server sched: Combine the last put_prev_task() and the first set_next_task() sched: Rework pick_next_task() sched: Split up put_prev_task_balance() sched: Clean up DL server vs core sched sched: Fixup set_next_task() implementations sched: Use set_next_task(.first) where required sched/fair: Properly deactivate sched_delayed task upon class change ...
2024-09-11Merge branch 'tip/sched/core' into sched_ext/for-6.12Tejun Heo
Pull in tip/sched/core to resolve two merge conflicts: - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") A simple context conflict. The former added __update_blocked_others() in the same #ifdef CONFIG_SMP block that effective_cpu_util() and sched_cpu_util() are in and the latter moved those functions to fair.c. This makes __update_blocked_others() more out of place. Will follow up with a patch to relocate. - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 84d265281d6c ("sched/pelt: Use rq_clock_task() for hw_pressure") The former factored out the body of __update_blocked_others() into update_other_load_avgs(). The latter changed how update_hw_load_avg() is called in the body. Resolved by applying the change to update_other_load_avgs() instead. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10kernel/sched: Fix util_est accounting for DELAY_DEQUEUEDietmar Eggemann
Remove delayed tasks from util_est even they are runnable. Exclude delayed task which are (a) migrating between rq's or (b) in a SAVE/RESTORE dequeue/enqueue. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
2024-09-10sched/pelt: Use rq_clock_task() for hw_pressureChen Yu
commit 97450eb90965 ("sched/pelt: Remove shift of thermal clock") removed the decay_shift for hw_pressure. This commit uses the sched_clock_task() in sched_tick() while it replaces the sched_clock_task() with rq_clock_pelt() in __update_blocked_others(). This could bring inconsistence. One possible scenario I can think of is in ___update_load_sum(): u64 delta = now - sa->last_update_time 'now' could be calculated by rq_clock_pelt() from __update_blocked_others(), and last_update_time was calculated by rq_clock_task() previously from sched_tick(). Usually the former chases after the latter, it cause a very large 'delta' and brings unexpected behavior. Fixes: 97450eb90965 ("sched/pelt: Remove shift of thermal clock") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
2024-09-10sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.cVincent Guittot
Move effective_cpu_util() and sched_cpu_util() functions in fair.c file with others utilization related functions. No functional change. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240904092417.20660-1-vincent.guittot@linaro.org
2024-09-03sched/numa: Fix the vma scan starving issueYujie Liu
Problem statement: Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of the vma scan might create less Numa page fault information. The insufficient information makes it harder for the Numa balancer to make decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to bring back part of the performance. Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a long duration of remote Numa node read was observed by PMU events: A few cores having ~500MB/s remote memory access for ~20 seconds. It causes high core-to-core variance and performance penalty. After the investigation, it is found that many vmas are skipped due to the active PID check. According to the trace events, in most cases, vma_is_accessed() returns false because the history access info stored in pids_active array has been cleared. Proposal: The main idea is to adjust vma_is_accessed() to let it return true easier. Thus compare the diff between mm->numa_scan_seq and vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold, scan the vma. This patch especially helps the cases where there are small number of threads, like the process-based SPECcpu. Without this patch, if the SPECcpu process access the vma at the beginning, then sleeps for a long time, the pid_active array will be cleared. A a result, if this process is woken up again, it never has a chance to set prot_none anymore. Because only the first 2 times of access is granted for vma scan: (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2 to be worse, no other threads within the task can help set the prot_none. This causes information lost. Raghavendra helped test current patch and got the positive result on the AMD platform: autonumabench NUMA01 base patched Amean syst-NUMA01 194.05 ( 0.00%) 165.11 * 14.92%* Amean elsp-NUMA01 324.86 ( 0.00%) 315.58 * 2.86%* Duration User 380345.36 368252.04 Duration System 1358.89 1156.23 Duration Elapsed 2277.45 2213.25 autonumabench NUMA02 Amean syst-NUMA02 1.12 ( 0.00%) 1.09 * 2.93%* Amean elsp-NUMA02 3.50 ( 0.00%) 3.56 * -1.84%* Duration User 1513.23 1575.48 Duration System 8.33 8.13 Duration Elapsed 28.59 29.71 kernbench Amean user-256 22935.42 ( 0.00%) 22535.19 * 1.75%* Amean syst-256 7284.16 ( 0.00%) 7608.72 * -4.46%* Amean elsp-256 159.01 ( 0.00%) 158.17 * 0.53%* Duration User 68816.41 67615.74 Duration System 21873.94 22848.08 Duration Elapsed 506.66 504.55 Intel 256 CPUs/2 Sockets: autonuma benchmark also shows improvements: v6.10-rc5 v6.10-rc5 +patch Amean syst-NUMA01 245.85 ( 0.00%) 230.84 * 6.11%* Amean syst-NUMA01_THREADLOCAL 205.27 ( 0.00%) 191.86 * 6.53%* Amean syst-NUMA02 18.57 ( 0.00%) 18.09 * 2.58%* Amean syst-NUMA02_SMT 2.63 ( 0.00%) 2.54 * 3.47%* Amean elsp-NUMA01 517.17 ( 0.00%) 526.34 * -1.77%* Amean elsp-NUMA01_THREADLOCAL 99.92 ( 0.00%) 100.59 * -0.67%* Amean elsp-NUMA02 15.81 ( 0.00%) 15.72 * 0.59%* Amean elsp-NUMA02_SMT 13.23 ( 0.00%) 12.89 * 2.53%* v6.10-rc5 v6.10-rc5 +patch Duration User 1064010.16 1075416.23 Duration System 3307.64 3104.66 Duration Elapsed 4537.54 4604.73 The SPECcpu remote node access issue disappears with the patch applied. Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Yujie Liu <yujie.liu@intel.com> Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com> Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03Merge branch 'tip/sched/core' into for-6.12Tejun Heo
- Resolve trivial context conflicts from dl_server clearing being moved around. - Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to match sched/core. - Merge sched_class->switch_class() addition from sched_ext with tip/sched/core changes in __pick_next_task(). - Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous behavior where sched_class->put_prev_task() was called before sched_class->pick_next_task(). While this makes sched_ext build and function, the behavior is not in line with other sched classes. The follow-up patches will address the discrepancies and remove sched_class->switch_class(). Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03sched: Add put_prev_task(.next)Peter Zijlstra
In order to tell the previous sched_class what the next task is, add put_prev_task(.next). Notable SCX will use this to: 1) determine the next task will leave the SCX sched class and push the current task to another CPU if possible. 2) statistics on how often and which other classes preempt it Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
2024-09-03sched: Rework dl_serverPeter Zijlstra
When a task is selected through a dl_server, it will have p->dl_server set, such that it can account runtime to the dl_server, see update_curr_task(). Currently p->dl_server is set in pick*task() whenever it goes through the dl_server, clearing it is a bit of a mess though. The trivial solution is clearing it on the final put (now that we have this location). However, this gives a problem when: p = pick_task(rq); if (p) put_prev_set_next_task(rq, prev, next); picks the same task but through a different path, notably when it goes from picking through the dl_server to a direct pick or vice-versa. In that case we cannot readily determine wether we should clear or preserve p->dl_server. An additional complication is pick_*task() setting p->dl_server for a remote pick, it might still need to update runtime before it schedules the core_pick. Close all these holes and remove all the random clearing of p->dl_server by: - having pick_*task() manage rq->dl_server - having the final put_prev_task() clear p->dl_server - having the first set_next_task() set p->dl_server = rq->dl_server - complicate the core_sched code to save/restore rq->dl_server where appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
2024-09-03sched: Combine the last put_prev_task() and the first set_next_task()Peter Zijlstra
Ensure the last put_prev_task() and the first set_next_task() always go together. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
2024-09-03sched: Rework pick_next_task()Peter Zijlstra
The current rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) And many classes implement it directly as such. Change things around to make pick_next_task() optional while also changing the definition to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) The reason is that sched_ext would like to have a 'final' call that knows the next task. By placing put_prev_task() right next to set_next_task() (as it already is for sched_core) this becomes trivial. As a bonus, this is a nice cleanup on its own. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
2024-09-03sched: Clean up DL server vs core schedPeter Zijlstra
Abide by the simple rule: pick_next_task() := pick_task() + set_next_task(.first = true) This allows us to trivially get rid of server_pick_next() and things collapse nicely. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.837303391@infradead.org
2024-09-03sched: Fixup set_next_task() implementationsPeter Zijlstra
The rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) Turns out, there's still a few things in pick_next_task() that are missing from that combination. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.724111109@infradead.org
2024-09-03sched/fair: Properly deactivate sched_delayed task upon class changeValentin Schneider
__sched_setscheduler() goes through an enqueue/dequeue cycle like so: flags := DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; prev_class->dequeue_task(rq, p, flags); new_class->enqueue_task(rq, p, flags); when prev_class := fair_sched_class, this is followed by: dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP); the idea being that since the task has switched classes, we need to drop the sched_delayed logic and have that task be deactivated per its previous dequeue_task(..., DEQUEUE_SLEEP). Unfortunately, this leaves the task on_rq. This is missing the tail end of dequeue_entities() that issues __block_task(), which __sched_setscheduler() won't have done due to not using DEQUEUE_DELAYED - not that it should, as it is pretty much a fair_sched_class specific thing. Make switched_from_fair() properly deactivate sched_delayed tasks upon class changes via __block_task(), as if a dequeue_task(..., DEQUEUE_DELAYED) had been issued. Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Reported-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240829135353.1524260-1-vschneid@redhat.com
2024-09-01mm: create promo_wmark_pages and clean up open-coded sitesKaiyang Zhao
Patch series "mm: print the promo watermark in zoneinfo", v2. This patch (of 2): Define promo_wmark_pages and convert current call sites of wmark_pages with fixed WMARK_PROMO to using it instead. Link: https://lkml.kernel.org/r/20240801232548.36604-1-kaiyang2@cs.cmu.edu Link: https://lkml.kernel.org/r/20240801232548.36604-2-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01memory tiering: introduce folio_use_access_time() checkZi Yan
If memory tiering mode is on and a folio is not in the top tier memory, folio's cpupid field is repurposed to store page access time. Instead of an open coded check, use a function to encapsulate the check. Link: https://lkml.kernel.org/r/20240724130115.793641-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-20Merge branch 'tip/sched/core' into for-6.12Tejun Heo
To receive 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail") which makes sched_class.dequeue_task() return bool instead of void. This leads to compile breakage and will be fixed by a follow-up patch. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20softirq: Remove unused 'action' parameter from action callbackCaleb Sander Mateos
When soft interrupt actions are called, they are passed a pointer to the struct softirq action which contains the action's function pointer. This pointer isn't useful, as the action callback already knows what function it is. And since each callback handles a specific soft interrupt, the callback also knows which soft interrupt number is running. No soft interrupt action callback actually uses this parameter, so remove it from the function pointer signature. This clarifies that soft interrupt actions are global routines and makes it slightly cheaper to call them. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
2024-08-17sched/eevdf: Propagate min_slice up the cgroup hierarchyPeter Zijlstra
In the absence of an explicit cgroup slice configureation, make mixed slice length work with cgroups by propagating the min_slice up the hierarchy. This ensures the cgroup entity gets timely service to service its entities that have this timing constraint set on them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org
2024-08-17sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestionPeter Zijlstra
Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime. The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100. Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency. For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4: {A,B,C,D}(w=1,r=8): ABCD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+*------+-------+--- ---+--*----+-------+--- t=2, V=5.5 t=3, V=7.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+----*--+-------+--- ---+------*+-------+--- Note: 4 identical tasks in FIFO order ~~~ {A,B}(w=1,r=16) C(w=2,r=16) AACCBBCC... +---+---+---+--- t=0, V=1.25 t=2, V=5.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=4, V=8.25 t=6, V=12.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+-------*-------+--- ---+-------+---*---+--- Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q. Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length. Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q ~~~ {A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |--------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.5 t=5, V=11.5 A |---------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.5 A |---------------< B |------< C |--------------< D |--------------< ---+-------+----*--+--- Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because its not the leftmost task, but is eligible with the 0,1,2,3 spread. Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q ~~~ A(w=1,r=16) B(w=1,r=8) C(w=2,r=16) BCCAABCC... +---+---+---+--- t=0, V=1.25 t=1, V=3.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.25 t=5, V=11.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.25 A |--------------< B |------< C |------< ---+-------+----*--+--- Note: 1 heavy and 1 short task -- combine them all. Note: both the short and heavy task end up with a period of 4q ~~~ A(w=1,r=16) B(w=2,r=16) C(w=1,r=8) BBCAABBC... +---+---+---+--- t=0, V=1 t=2, V=5 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=3, V=7 t=5, V=11 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=7, V=15 A |--------------< B |------< C |------< ---+-------+------*+--- Note: as before but permuted ~~~ From all this it can be deduced that, for the steady state: - the total period (P) of a schedule is: W*max(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org
2024-08-17sched/eevdf: Allow shorter slices to wakeup-preemptPeter Zijlstra
Part of the reason to have shorter slices is to improve responsiveness. Allow shorter slices to preempt longer slices on wakeup. Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms | 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms | 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms | 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms | 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms | 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms | 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms | 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms | 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms | 100ms massive_intr 500us cyclictest PREEMPT_SHORT 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms | 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms | 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms | 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms | 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms | 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms | 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms | 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms | 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms | As per the numbers the, this makes cyclictest (short slice) it's max-delay more consistent and consistency drops the sum-delay. The trade-off is that the massive_intr (long slice) gets more context switches and a slight increase in sum-delay. Chunxin contributed did_preempt_short() where a task that lost slice protection from PREEMPT_SHORT gets rescheduled once it becomes in-eligible. [mike: numbers] Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org
2024-08-17sched/fair: Avoid re-setting virtual deadline on 'migrations'Peter Zijlstra
During OSPM24 Youssef noted that migrations are re-setting the virtual deadline. Notably everything that does a dequeue-enqueue, like setting nice, changing preferred numa-node, and a myriad of other random crap, will cause this to happen. This shouldn't be. Preserve the relative virtual deadline across such dequeue/enqueue cycles. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.625119246@infradead.org
2024-08-17sched/eevdf: Fixup PELT vs DELAYED_DEQUEUEPeter Zijlstra
Note that tasks that are kept on the runqueue to burn off negative lag, are not in fact runnable anymore, they'll get dequeued the moment they get picked. As such, don't count this time towards runnable. Thanks to Valentin for spotting I had this backwards initially. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.514088302@infradead.org
2024-08-17sched/fair: Implement DELAY_ZEROPeter Zijlstra
'Extend' DELAY_DEQUEUE by noting that since we wanted to dequeued them at the 0-lag point, truncate lag (eg. don't let them earn positive lag). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.403750550@infradead.org
2024-08-17sched/fair: Implement delayed dequeuePeter Zijlstra
Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by noting that lag is fundamentally a temporal measure. It should not be carried around indefinitely. OTOH it should also not be instantly discarded, doing so will allow a task to game the system by purposefully (micro) sleeping at the end of its time quantum. Since lag is intimately tied to the virtual time base, a wall-time based decay is also insufficient, notably competition is required for any of this to make sense. Instead, delay the dequeue and keep the 'tasks' on the runqueue, competing until they are eligible. Strictly speaking, we only care about keeping them until the 0-lag point, but that is a difficult proposition, instead carry them around until they get picked again, and dequeue them at that point. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org
2024-08-17sched/fair: Implement ENQUEUE_DELAYEDPeter Zijlstra
Doing a wakeup on a delayed dequeue task is about as simple as it sounds -- remove the delayed mark and enjoy the fact it was actually still on the runqueue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.888107381@infradead.org
2024-08-17sched/fair: Prepare pick_next_task() for delayed dequeuePeter Zijlstra
Delayed dequeue's natural end is when it gets picked again. Ensure pick_next_task() knows what to do with delayed tasks. Note, this relies on the earlier patch that made pick_next_task() state invariant -- it will restart the pick on dequeue, because obviously the just dequeued task is no longer eligible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.747330118@infradead.org