path: root/kernel/sched
Age | Commit message | Author
2024-07-29 | sched/core: Fix picking of tasks for core scheduling with DL server | Joel Fernandes (Google)
* Use simple CFS pick_task for DL pick_task

The DL server's pick_task calls CFS's pick_next_task_fair(). This is wrong because core scheduling's pick_task only calls CFS's pick_task() for evaluation / checking of the CFS task (comparing across CPUs), not for actually affirmatively picking the next task. This causes RB tree corruption issues in CFS that were found by syzbot.

* Make pick_task_fair clear DL server

A DL task pick might set ->dl_server, but it is possible the task will never run (say the other HT has a stop task). If the CFS task is later picked directly (say without the DL server), ->dl_server will still be set. So clear it in pick_task_fair(). This fixes the KASAN issue reported by syzbot in set_next_entity().

(DL refactoring suggestions by Vineeth Pillai).

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
2024-07-29 | sched/core: Fix priority checking for DL server picks | Joel Fernandes (Google)
In core scheduling, a DL server pick (which is a CFS task) should be given higher priority than tasks in other classes. Not doing so causes CFS starvation. A kselftest is added later to demonstrate this. Without this fix, a CFS task competing with RT tasks can be completely starved and the DL server's boosting completely ignored. Fix these problems.

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
2024-07-29 | sched/fair: Fair server interface | Daniel Bristot de Oliveira
Add an interface for fair server setup on debugfs. Each CPU has two files under /debug/sched/fair_server/cpu{ID}:

 - runtime: set runtime in ns
 - period:  set period in ns

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set bounds on admission control. The interface also adds the server to the dl bandwidth accounting.

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
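A rough sketch of how such per-CPU debugfs knobs can be wired up is shown below. The directory layout and the nanosecond units come from the description above, while the helper and fops names are placeholders rather than the actual kernel symbols.

```c
#include <linux/debugfs.h>
#include <linux/kernel.h>

/* Placeholder fops for the ns-valued runtime/period files. */
extern const struct file_operations fair_server_runtime_fops;
extern const struct file_operations fair_server_period_fops;

/* Called once per CPU with the shared fair_server debugfs directory. */
static void fair_server_debugfs_add_cpu(struct dentry *fair_server_dir, int cpu)
{
	struct dentry *cpu_dir;
	char name[16];

	/* /sys/kernel/debug/sched/fair_server/cpu<N>/ */
	snprintf(name, sizeof(name), "cpu%d", cpu);
	cpu_dir = debugfs_create_dir(name, fair_server_dir);

	/* Both files take values in nanoseconds. */
	debugfs_create_file("runtime", 0644, cpu_dir, (void *)(long)cpu,
			    &fair_server_runtime_fops);
	debugfs_create_file("period", 0644, cpu_dir, (void *)(long)cpu,
			    &fair_server_period_fops);
}
```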
2024-07-29 | sched/deadline: Deferrable dl server | Daniel Bristot de Oliveira
Among the motivations for the DL servers is the real-time throttling mechanism. This mechanism works by throttling the rt_rq after it has run for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead of throttling the rt_rq. The point is that it boosts without waiting for potential starvation, causing some non-intuitive cases. For example, an IRQ dispatches two tasks on an idle system, a fair one and an RT one. The DL server will be activated, running the fair task before the RT one.

This problem can be avoided by deferring the dl server activation. By setting the defer option, the dl_server will dispatch a SCHED_DEADLINE reservation with replenished runtime, but throttled. The dl_timer will be set for the defer time at (period - runtime) ns from the start time, thus boosting the fair rq at the defer time.

If the fair scheduler has the opportunity to run while waiting for the defer time, the dl server runtime will be consumed. If the runtime is completely consumed before the defer time, the server will be replenished while still in a throttled state. Then, the dl_timer will be reset to the new defer time.

If the fair server reaches the defer time without consuming its runtime, the server will start running, following CBS rules (thus without breaking SCHED_DEADLINE). The server will then keep running (without deferring) until its fair tasks are able to execute under the regular fair scheduler again (end of the starvation).

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/dd175943c72533cd9f0b87767c6499204879cc38.1716811044.git.bristot@kernel.org
2024-07-29 | sched/fair: Add trivial fair server | Peter Zijlstra
Use deadline servers to service fair tasks. This patch adds a fair_server deadline entity which acts as a container for fair entities and can be used to fix starvation when higher priority (wrt fair) tasks are monopolizing CPU(s). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
2024-07-29 | sched/core: Clear prev->dl_server in CFS pick fast path | Youssef Esmat
In case the previous pick was a DL server pick, ->dl_server might be set. Clear it in the fast path as well. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Youssef Esmat <youssefesmat@google.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org
2024-07-29 | sched/core: Add clearing of ->dl_server in put_prev_task_balance() | Joel Fernandes (Google)
Paths using put_prev_task_balance() need to do a pick shortly after. Make sure they also clear the ->dl_server on prev as a part of that. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org
2024-07-29 | sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy | Tianchen Ding
Consider the following cgroup hierarchy:

                 root
                  |
       ------------------------
       |                      |
  normal_cgroup          idle_cgroup
       |                      |
  SCHED_IDLE task_A     SCHED_NORMAL task_B

According to the cgroup hierarchy, A should preempt B. But the current check_preempt_wakeup_fair() treats the cgroup se and the task separately, so B will preempt A unexpectedly.

Unify the wakeup logic by {c,p}se_is_idle only. This makes SCHED_IDLE of a task a relative policy that is effective only within its own cgroup, similar to the behavior of NICE.

Also fix the se_is_idle() definition when !CONFIG_FAIR_GROUP_SCHED.

Fixes: 304000390f88 ("sched: Cgroup SCHED_IDLE support")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20240626023505.1332596-1-dtcccc@linux.alibaba.com
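A sketch of the unified check described above, assuming the usual shape of check_preempt_wakeup_fair() where find_matching_se() first walks both entities up to a common level; this fragment is illustrative, not the verbatim patch.

```c
	/* Walk se (current) and pse (waking task) up to entities that share
	 * a parent, so idleness is compared between sibling entities. */
	find_matching_se(&se, &pse);

	cse_is_idle = se_is_idle(se);
	pse_is_idle = se_is_idle(pse);

	/* Preempt an idle entity in favor of a non-idle one, but never the
	 * other way around; within the same idle level, fall through to the
	 * normal vruntime-based decision. */
	if (cse_is_idle && !pse_is_idle)
		goto preempt;
	if (cse_is_idle != pse_is_idle)
		return;
```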
2024-07-29 | sched: remove HZ_BW feature hedge | Phil Auld
As a hedge against unexpected user issues commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use") included a scheduler feature to disable the new functionality. It's been a few releases (v6.6) and no screams, so remove it. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240515133705.3632915-1-pauld@redhat.com
2024-07-29 | sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clock | Chuyi Zhou
nr_spread_over tracks the number of instances where the difference between a scheduling entity's virtual runtime and the minimum virtual runtime in the runqueue exceeds three times the scheduler latency, indicating significant disparity in task scheduling. Commit that removed its usage: 5e963f2bd: sched/fair: Commit to EEVDF cfs_rq->exec_clock was used to account for time spent executing tasks. Commit that removed its usage: 5d69eca542ee1 sched: Unify runtime accounting across classes cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore in eevdf. Remove them from struct cfs_rq. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20240717143342.593262-1-zhouchuyi@bytedance.com
2024-07-29 | sched/core: Add WARN_ON_ONCE() to check overflow for migrate_disable() | Peilin He
Background
==========

When repeated migrate_disable() calls are made with the corresponding migrate_enable() calls missing, there is a risk of 'migration_disabled' overflowing, because 'migration_disabled' is an unsigned short whose max value is 65535.

In a PREEMPT_RT kernel, if 'migration_disabled' overflows, it may make migrate_disable() ineffective within local_lock_irqsave(). This is because, during the scheduling procedure, the value of 'migration_disabled' will be checked, which can trigger CPU migration. Consequently, the count of 'rcu_read_lock_nesting' may leak due to local_lock_irqsave() and local_unlock_irqrestore() occurring on different CPUs.

Usecase
=======

For example, when I developed a driver, I encountered a warning like "WARNING: CPU: 4 PID: 260 at kernel/rcu/tree_plugin.h:315 rcu_note_context_switch+0xa8/0x4e8". It took me half a month to locate this issue. Ultimately, I discovered that the lack of an overflow detection mechanism in migrate_disable() was the root cause, leading to a significant amount of time spent on problem localization. If an overflow detection mechanism had been added to migrate_disable(), the root cause could have been identified very quickly and easily.

Effect
======

Using WARN_ON_ONCE() to check whether 'migration_disabled' is about to overflow can help developers identify the issue quickly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peilin He <he.peilin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Reviewed-by: Qiang Tu <tu.qiang35@zte.com.cn>
Reviewed-by: Kun Jiang <jiang.kun2@zte.com.cn>
Reviewed-by: Fan Yu <fan.yu9@zte.com.cn>
Link: https://lkml.kernel.org/r/20240716104244764N2jD8gnBpnsLjCDnQGQ8c@zte.com.cn
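A minimal sketch of the kind of check being added, assuming the counter is the unsigned short task_struct::migration_disabled mentioned above; the real migrate_disable() has more to it, so treat this as an illustration only.

```c
#include <linux/sched.h>
#include <linux/bug.h>
#include <linux/limits.h>

static inline void migrate_disable_checked(struct task_struct *p)
{
	/* A missing migrate_enable() pairing would eventually wrap the
	 * unsigned short past 65535; warn once instead of silently
	 * overflowing and losing the migration protection. */
	if (WARN_ON_ONCE(p->migration_disabled == USHRT_MAX))
		return;

	p->migration_disabled++;
}
```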
2024-07-29 | sched: Initialize the vruntime of a new task when it is first enqueued | Zhang Qiao
When creating a new task, we initialize the vruntime of the new task at sched_cgroup_fork(). However, this happens too early and may not be accurate, because it uses the current CPU to initialize the vruntime, while the new task will actually run on the CPU assigned at wake_up_new_task().

To fix this, pass the ENQUEUE_INITIAL flag to activate_task() in wake_up_new_task(); this way, when place_entity() is called in enqueue_entity(), the vruntime of the new task will be initialized there.

In addition, place_entity() in task_fork_fair() was introduced for two reasons:

1. Previously, __enqueue_entity() was in task_new_fair(); in order to provide a vruntime for enqueueing the newly created task, the vruntime assignment equation "se->vruntime = cfs_rq->min_vruntime" was introduced by commit e9acbff6484d ("sched: introduce se->vruntime"). This is the initial state of place_entity().

2. Commit 4d78e7b656aa ("sched: new task placement for vruntime") added the child_runs_first task placement feature, which is based on vruntime and therefore also requires the new task's vruntime value.

After removing child_runs_first and enqueue_entity() from task_fork_fair(), this place_entity() no longer makes sense, so remove it as well.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240627133359.1370598-1-zhangqiao22@huawei.com
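A diff-style sketch of the change described above; the surrounding context of wake_up_new_task() is elided and the exact hunk may differ.

```
 void wake_up_new_task(struct task_struct *p)
 {
 	...
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_INITIAL);
 	...
 }
```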
2024-07-29 | sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate() | Yang Yingliang
If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back.

Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
2024-07-29 | sched/core: Introduce sched_set_rq_on/offline() helper | Yang Yingliang
Introduce the sched_set_rq_on/offline() helper, so it can be called simply from both the normal and the error path. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
2024-07-29 | sched/smt: Fix unbalance sched_smt_present dec/inc | Yang Yingliang
I got the following warning report while doing stress testing:

jump label: negative count!
WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
Call Trace:
<TASK>
 __static_key_slow_dec_cpuslocked+0x16/0x70
 sched_cpu_deactivate+0x26e/0x2a0
 cpuhp_invoke_callback+0x3ad/0x10d0
 cpuhp_thread_fun+0x3f5/0x680
 smpboot_thread_fn+0x56d/0x8d0
 kthread+0x309/0x400
 ret_from_fork+0x41/0x70
 ret_from_fork_asm+0x1b/0x30
</TASK>

When cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the CPU offline fails, but sched_smt_present is decremented before calling sched_cpu_deactivate(). This leads to an unbalanced dec/inc, so fix it by incrementing sched_smt_present in the error path.

Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
2024-07-29 | sched/smt: Introduce sched_smt_present_inc/dec() helper | Yang Yingliang
Introduce the sched_smt_present_inc/dec() helper, so it can be called simply from both the normal and the error path. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
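A sketch of what such a helper pair can look like, built from the existing sched_smt_present static key and the two-sibling check done at CPU hotplug time; the names follow the subject line, the bodies are an assumption rather than the upstream hunk.

```c
#ifdef CONFIG_SCHED_SMT
static void sched_smt_present_inc(int cpu)
{
	/* Flip the key only when this CPU completes an SMT sibling pair. */
	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
		static_branch_inc_cpuslocked(&sched_smt_present);
}

static void sched_smt_present_dec(int cpu)
{
	/* Mirror of the inc path, reusable from the error/rollback path. */
	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
		static_branch_dec_cpuslocked(&sched_smt_present);
}
#else
static inline void sched_smt_present_inc(int cpu) { }
static inline void sched_smt_present_dec(int cpu) { }
#endif
```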
2024-07-29 | sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime | Zheng Zucheng
In extreme test scenarios, the 14th field (utime) in /proc/xx/stat can be greater than sum_exec_runtime: utime = 18446744073709518790 ns, rtime = 135989749728000 ns.

In the cputime_adjust() process, stime becomes greater than rtime due to a mul_u64_u64_div_u64() precision problem:

  before calling mul_u64_u64_div_u64(): stime = 175136586720000, rtime = 135989749728000, utime = 1416780000
  after calling mul_u64_u64_div_u64():  stime = 135989949653530

An unsigned wraparound occurs because rtime is less than stime:

  utime = rtime - stime = 135989749728000 - 135989949653530 = -199925530 = (u64)18446744073709518790

Trigger conditions:
1) The user task runs in kernel mode most of the time
2) ARM64 architecture
3) TICK_CPU_ACCOUNTING=y, CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() precision problem by resetting stime to rtime when stime exceeds it.

Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
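A self-contained userspace demonstration of the wraparound, using the stime/rtime values from the report above together with the clamp that the fix applies; this is plain C for illustration, not kernel code.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t rtime = 135989749728000ULL;  /* sum_exec_runtime from the report */
	uint64_t stime = 135989949653530ULL;  /* scaled stime, slightly larger    */

	/* Without the fix: the subtraction wraps around to a huge u64 value. */
	printf("buggy utime = %llu\n", (unsigned long long)(rtime - stime));

	/* With the fix: clamp stime to rtime before computing utime. */
	if (stime > rtime)
		stime = rtime;
	printf("fixed utime = %llu\n", (unsigned long long)(rtime - stime));

	return 0;
}
```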
2024-07-29 | treewide: context_tracking: Rename CONTEXT_* into CT_STATE_* | Valentin Schneider
Context tracking state related symbols currently use a mix of the CONTEXT_ (e.g. CONTEXT_KERNEL) and CT_STATE_ (e.g. CT_STATE_MASK) prefixes. Clean up the naming and make the ctx_state enum use the CT_STATE_ prefix.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-24 | sysctl: treewide: constify the ctl_table argument of proc_handlers | Joel Granados
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata, which will ensure that proc_handler function pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
virtual patch

@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);

@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }

@r3@
identifier func;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int , void *, size_t *, loff_t *);

@r4@
identifier func, ctl;
@@

int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int , void *, size_t *, loff_t *);

@r5@
identifier func, write, buffer, lenp, ppos;
@@

int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);
```

* Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-16 | Merge tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull scheduler updates from Ingo Molnar:

 - Update Daniel Bristot de Oliveira's entry in MAINTAINERS, and credit him in CREDITS
 - Harmonize the lock-yielding behavior on dynamically selected preemption models with static ones
 - Reorganize the code a bit: split out sched/syscalls.c to reduce the size of sched/core.c
 - Micro-optimize psi_group_change()
 - Fix set_load_weight() for SCHED_IDLE tasks
 - Misc cleanups & fixes

* tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Update MAINTAINERS and CREDITS
  sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks
  sched/psi: Optimise psi_group_change a bit
  sched/core: Drop spinlocks on contention iff kernel is preemptible
  sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
  sched/balance: Skip unnecessary updates to idle load balancer's flags
  idle: Remove stale RCU comment
  sched/headers: Move struct pre-declarations to the beginning of the header
  sched/core: Clean up kernel/sched/sched.h a bit
  sched/core: Simplify prefetch_curr_exec_start()
  sched: Fix spelling in comments
  sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c
2024-07-15 | Merge tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu | Linus Torvalds
Pull RCU updates from Paul McKenney:

 - Update Tasks RCU and Tasks Rude RCU description in Requirements.rst and clarify rcu_assign_pointer() and rcu_dereference() ordering properties
 - Add lockdep assertions for RCU readers, limit inline wakeups for callback-bypass synchronize_rcu(), add an rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter, add Uladzislau Rezki as RCU maintainer, and fix a subtle callback-migration memory-ordering issue
 - Remove a number of redundant memory barriers
 - Remove unnecessary bypass-list lock-contention mitigation, use parking API instead of open-coded ad-hoc equivalent, and upgrade obsolete comments
 - Revert avoidance of a deadlock that can no longer occur and properly synchronize Tasks Trace RCU checking of runqueues
 - Add tests for handling of double-call_rcu() bug, add missing MODULE_DESCRIPTION, and add a script that histograms the number of calls to RCU updaters
 - Fill out SRCU polled-grace-period API

* tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits)
  rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
  rcu: Eliminate lockless accesses to rcu_sync->gp_count
  MAINTAINERS: Add Uladzislau Rezki as RCU maintainer
  rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter
  rcu/exp: Remove redundant full memory barrier at the end of GP
  rcu: Remove full memory barrier on RCU stall printout
  rcu: Remove full memory barrier on boot time eqs sanity check
  rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot
  rcu: Remove superfluous full memory barrier upon first EQS snapshot
  rcu: Remove full ordering on second EQS snapshot
  srcu: Fill out polled grace-period APIs
  srcu: Update cleanup_srcu_struct() comment
  srcu: Add NUM_ACTIVE_SRCU_POLL_OLDSTATE
  srcu: Disable interrupts directly in srcu_gp_end()
  rcu: Disable interrupts directly in rcu_gp_init()
  rcu/tree: Reduce wake up for synchronize_rcu() common case
  rcu/tasks: Fix stale task snaphot for Tasks Trace
  tools/rcu: Add rcu-updaters.sh script
  rcutorture: Add missing MODULE_DESCRIPTION() macros
  rcutorture: Fix rcu_torture_fwd_cb_cr() data race
  ...
2024-07-14 | sched_ext: Fixes incorrect type in bpf_scx_init() | Jiapeng Chong
The type_id is defined as u32, so the check if (type_id < 0) is invalid; hence change its type to s32.

./kernel/sched/ext.c:4958:5-12: WARNING: Unsigned expression compared with zero: type_id < 0.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9523
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
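A diff-style sketch of the fix: only the type of type_id changes, which makes the existing negative-error check meaningful. The error value shown in the context line is illustrative rather than the exact code.

```
-	u32 type_id;
+	s32 type_id;
 	...
 	if (type_id < 0)	/* always false while type_id was u32 */
 		return -EINVAL;
```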
2024-07-12 | sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches | Tejun Heo
In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the local DSQ of any CPU. However, during direct dispatch from ops.select_cpu() and ops.enqueue(), this isn't allowed. This is because dispatching to the local DSQ of a remote CPU requires locking both the task's current and new rq's and such double locking can't be done directly from ops.enqueue(). While waking up a task, as ops.select_cpu() can pick any CPU and both ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still do whatever it wants to do. However, while a task is being enqueued for a different reason, e.g. after its slice expiration, only ops.enqueue() is called and there's no way for the BPF scheduler to directly dispatch to the local DSQ of a remote CPU. This gap in API forces schedulers into work-arounds which are not straightforward or optimal such as skipping direct dispatches in such cases. Implement deferred enqueueing to allow directly dispatching to the local DSQ of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be safely released, the tasks are taken off the list and queued on the target local DSQs using dispatch_to_local_dsq(). v2: - Add missing return after queue_balance_callback() in schedule_deferred(). (David). - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned and thus no longer takes @rf. Updated accordingly. - UP build warning fix. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Andrea Righi <righi.andrea@gmail.com> Acked-by: David Vernet <void@manifault.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Changwoo Min <changwoo@igalia.com>
2024-07-12 | sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP | Tejun Heo
SCX_RQ_BALANCING is used to mark that the rq is currently in balance(). Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether the rq is currently enqueueing for a wakeup. This will be used to implement direct dispatching to local DSQ of another CPU. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-07-12 | sched_ext: Unpin and repin rq lock from balance_scx() | Tejun Heo
sched_ext often needs to migrate tasks across CPUs right before execution and thus uses the balance path to dispatch tasks from the BPF scheduler. balance_scx() is called with the rq locked and pinned but is passed @rf and thus allowed to unpin and unlock. Currently, @rf is passed down the call stack so the rq lock is unpinned just when double locking is needed.

This creates unnecessary complications such as having to explicitly manipulate lock pinning for core scheduling. We also want to use dispatch_to_local_dsq_lock() from other paths which are called with the rq locked but unpinned.

rq lock handling in the dispatch path is straightforward outside the migration implementation and extending the pinning protection down the callstack doesn't add enough meaningful extra protection to justify the extra complexity.

Unpin and repin the rq lock from the outer balance_scx() and drop @rf passing and lock pinning handling from the inner functions. UP is updated to call balance_one() instead of balance_scx() to avoid adding NULL @rf handling to balance_scx(). As this makes balance_scx() unused in UP, it's put inside a CONFIG_SMP block.

No functional changes intended outside of lock annotation updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Righi <righi.andrea@gmail.com>
2024-07-12 | sched_ext: Open-code task_linked_on_dsq() | Tejun Heo
task_linked_on_dsq() exists as a helper because it used to test both the rbtree and list nodes. It now only tests the list node and the list node will soon be used for something else too. The helper doesn't improve anything materially and the naming will become confusing. Open-code the list node testing and remove task_linked_on_dsq() Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-07-12 | sched: Move struct balance_callback definition upward | Tejun Heo
Move struct balance_callback definition upward so that it's visible to class-specific rq struct definitions. This will be used to embed a struct balance_callback in struct scx_rq. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
2024-07-11 | Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-07-09 | sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated | Tejun Heo
When a running task is migrated to another CPU, the stop_task is used to preempt the running task and migrate it. This, expectedly, invokes ops.cpu_release(). If the BPF scheduler then calls scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ including the task which is being migrated. This creates an unnecessary re-enqueue of a task which is about to be deactivated and re-activated for migration anyway. It can also cause confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its allowed CPUs may not agree while migration is pending. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()") Acked-by: David Vernet <void@manifault.com>
2024-07-09 | sched_ext: Reimplement scx_bpf_reenqueue_local() | Tejun Heo
scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same local DSQ, to avoid processing the same tasks repeatedly, it first takes the number of queued tasks and processes the task at the head of the queue that number of times.

This is incorrect as a task can be dispatched to the same local DSQ with SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is exhausted and the succeeding tasks won't be processed at all. Fix it by first moving all candidate tasks to a private list and then processing that list.

While at it, remove the WARNs. They're rather superfluous as later steps will check them anyway.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Acked-by: David Vernet <void@manifault.com>
2024-07-08 | sched_ext: Implement DSQ iterator | Tejun Heo
DSQs are very opaque in the consumption path. The BPF scheduler has no way of knowing which tasks are being considered and which is picked. This patch adds BPF DSQ iterator. - Allows iterating tasks queued on a DSQ in the dispatch order or reverse from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs directly. - Has ordering guarantee where only tasks which were already queued when the iteration started are visible and consumable during the iteration. v5: - Add a comment to the naked list_empty(&dsq->list) test in consume_dispatch_q() to explain the reasoning behind the lockless test and by extension why nldsq_next_task() isn't used there. - scx_qmap changes separated into its own patch. v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong type for the last argument (bool rev instead of u64 flags). Fix it. v3: - Alexei pointed out that the iterator is too big to allocate on stack. Added a prep patch to reduce the size of the cursor. Now bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on 64bit. - u32_before() comparison factored out. v2: - scx_bpf_consume_task() is separated out into a separate patch. - DSQ seq and iter flags don't need to be u64. Use u32. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org
2024-07-08 | sched_ext: Take out ->priq and ->flags from scx_dsq_node | Tejun Heo
struct scx_dsq_node contains two data structure nodes to link the containing task to a DSQ and a flags field that is protected by the lock of the associated DSQ. One reason why they are grouped into a struct is to use the type independently as a cursor node when iterating tasks on a DSQ. However, when iterating, the cursor only needs to be linked on the FIFO list and the rb_node part ends up inflating the size of the iterator data structure unnecessarily making it potentially too expensive to place it on stack. Take ->priq and ->flags out of scx_dsq_node and put them in sched_ext_entity as ->dsq_priq and ->dsq_flags, respectively. scx_dsq_node is renamed to scx_dsq_list_node and the field names are renamed accordingly. This will help implementing DSQ task iterator that can be allocated on stack. No functional change intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: David Vernet <void@manifault.com>
2024-07-08 | sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h | Tejun Heo
While sched_ext was out of tree, everything sched_ext specific which can be put in kernel/sched/ext.h was put there to ease forward porting. However, kernel/sched/sched.h is the better location for some of them. Relocate. - struct sched_enq_and_set_ctx, sched_deq_and_put_task() and sched_enq_and_set_task(). - scx_enabled() and scx_switched_all(). - for_active_class_range() and for_each_active_class(). sched_class declarations are moved above the class iterators for this. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2024-07-08 | sched, sched_ext: Open code for_balance_class_range() | Tejun Heo
For flexibility, sched_ext allows the BPF scheduler to select the CPU to execute a task on at dispatch time so that e.g. a queue can be shared across multiple CPUs. To enable this, the dispatch path is executed from balance() so that a dispatched task can be hot-migrated to its target CPU. This means that sched_ext needs its balance() method invoked before every pick_next_task() even when the CPU is waking up from SCHED_IDLE. for_balance_class_range() defined in kernel/sched/ext.h implements this selective iteration promotion. However, the indirection obfuscates more than helps. Open code the iteration promotion in put_prev_task_balance() and remove for_balance_class_range(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2024-07-08 | sched_ext: Minor cleanups in kernel/sched/ext.h | Tejun Heo
- scx_ops_cpu_preempt is only used in kernel/sched/ext.c and doesn't need to be global. Make it static. - Relocate task_on_scx() so that the inline functions are located together. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-07-08 | sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect | Tejun Heo
sched_domains regulate the load balancing for sched_classes. A machine can be partitioned into multiple sections that are not load-balanced across using either isolcpus= boot param or cpuset partitions. In such cases, tasks that are in one partition are expected to stay within that partition.

cpuset configured partitions are always reflected in each member task's cpumask. As SCX always honors the task cpumasks, the BPF scheduler is automatically in compliance with the configured partitions.

However, for isolcpus= domain isolation, the isolated CPUs are simply omitted from the top-level sched_domain[s] without further restrictions on tasks' cpumasks, so, for example, a task currently running in an isolated CPU may have more CPUs in its allowed cpumask while expected to remain on the same CPU.

There is no straightforward way to enforce this partitioning preemptively on BPF schedulers and erroring out after a violation can be surprising. isolcpus= domain isolation is being replaced with cpuset partitions anyway, so keep it simple and simply disallow loading a BPF scheduler if isolcpus= domain isolation is in effect.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20240626082342.GY31592@noisy.programming.kicks-ass.net
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
2024-07-08 | sched_ext: Account for idle policy when setting p->scx.weight in scx_ops_enable_task() | Tejun Heo
When initializing p->scx.weight, scx_ops_enable_task() wasn't considering whether the task is SCHED_IDLE. Update it to use WEIGHT_IDLEPRIO as the source weight for SCHED_IDLE tasks.

This leaves reweight_task_scx() the sole user of set_task_scx_weight(). Open code it. @weight is going to be provided by sched core in the future anyway.

v2: Use the newly available @lw->weight to set @p->scx.weight in reweight_task_scx().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-07-08 | sched, sched_ext: Simplify dl_prio() case handling in sched_fork() | Tejun Heo
sched_fork() returns with -EAGAIN if dl_prio(@p). a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") added scx_pre_fork() call before it and then scx_cancel_fork() on the exit path. This is silly as the dl_prio() block can just be moved above the scx_pre_fork() call. Move the dl_prio() block above the scx_pre_fork() call and remove the now unnecessary scx_cancel_fork() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Vernet <void@manifault.com>
2024-07-08 | sched/ext: Add BPF function to fetch rq | Hongyan Xia
rq contains many useful fields to implement a custom scheduler. For example, various clock signals like clock_task and clock_pelt can be used to track load. It also contains stats in other sched_classes, which are useful to drive scheduling decisions in ext. tj: Put the new helper below scx_bpf_task_*() helpers. Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-08 | Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11 | Tejun Heo
d32960528702 ("sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks") applied to sched/core changes how reweight_task() is called, causing conflicts with e83edbf88f18 ("sched: Add sched_class->reweight_task()").

Resolve the conflicts by taking set_load_weight() changes from d32960528702 and updating sched_class->reweight_task() to take a pointer to struct load_weight instead of int prio.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-04 | sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks | Tejun Heo
When a task's weight is being changed, set_load_weight() is called with @update_load set. As weight changes aren't trivial for the fair class, set_load_weight() calls fair.c::reweight_task() for fair class tasks.

However, set_load_weight() first tests task_has_idle_policy() on entry and skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as SCHED_IDLE tasks are just fair tasks with a very low weight and they would incorrectly skip load, vlag and position updates.

Fix it by updating reweight_task() to take struct load_weight as idle weight can't be expressed with prio and making set_load_weight() call reweight_task() for SCHED_IDLE tasks too when @update_load is set.

Fixes: 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net
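A sketch of the corrected flow, assuming the post-fix shape in which reweight_task() takes a struct load_weight pointer as described above; identifiers follow the commit text, but the exact body is an approximation rather than the upstream diff.

```c
static void set_load_weight(struct task_struct *p, bool update_load)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight lw;

	if (task_has_idle_policy(p)) {
		/* SCHED_IDLE is just a fair task with a very low weight... */
		lw.weight = scale_load(WEIGHT_IDLEPRIO);
		lw.inv_weight = WMULT_IDLEPRIO;
	} else {
		lw.weight = scale_load(sched_prio_to_weight[prio]);
		lw.inv_weight = sched_prio_to_wmult[prio];
	}

	/* ...so it goes through reweight_task() like any other fair task
	 * instead of returning early and skipping load/vlag updates. */
	if (update_load && p->sched_class == &fair_sched_class)
		reweight_task(p, &lw);
	else
		p->se.load = lw;
}
```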
2024-07-04 | sched/psi: Optimise psi_group_change a bit | Tvrtko Ursulin
The current code loops over the psi_states only to call a helper which then resolves back to the action needed for each state using a switch statement. That is effectively creating a double indirection of a kind which, given how all the states need to be explicitly listed and handled anyway, we can simply remove. Both the for loop and the switch statement that is. The benefit is both in the code size and CPU time spent in this function. YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to 0x2c1. Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20240625135000.38652-1-tursulin@igalia.com
2024-07-01 | sched_ext: Swap argument positions in kcalloc() call to avoid compiler warning | Tejun Heo
alloc_exit_info() calls kcalloc() but puts in the size of the element as the first argument which triggers the following gcc warning: kernel/sched/ext.c:3815:32: warning: ‘kmalloc_array_noprof’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args] Fix it by swapping the positions of the first two arguments. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: http://lkml.kernel.org/r/ZoG6zreEtQhAUr_2@linux.ibm.com
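For reference, kcalloc() takes the element count first and the element size second; below is a short sketch of the corrected call shape (the element type here is a placeholder, not the actual ext.c struct).

```c
#include <linux/slab.h>

struct exit_info_sketch {		/* placeholder element type */
	u64 ts;
	char reason[64];
};

static struct exit_info_sketch *alloc_exit_info_sketch(size_t nr)
{
	/* Count first, sizeof(element) second; the transposed order is what
	 * triggers gcc's -Wcalloc-transposed-args warning quoted above. */
	return kcalloc(nr, sizeof(struct exit_info_sketch), GFP_KERNEL);
}
```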
2024-07-01 | sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath | John Stultz
It was reported that in moving to 6.1, a larger then 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_CPUTIME_ID,...). Using a simple reproducer, I found: 5.10: 100000000 calls in 24345994193 ns => 243.460 ns per call 100000000 calls in 24288172050 ns => 242.882 ns per call 100000000 calls in 24289135225 ns => 242.891 ns per call 6.1: 100000000 calls in 28248646742 ns => 282.486 ns per call 100000000 calls in 28227055067 ns => 282.271 ns per call 100000000 calls in 28177471287 ns => 281.775 ns per call The cause of this was finally narrowed down to the addition of psi_account_irqtime() in update_rq_clock_task(), in commit 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure"). In my initial attempt to resolve this, I leaned towards moving all accounting work out of the clock_gettime() call path, but it wasn't very pretty, so it will have to wait for a later deeper rework. Instead, Peter shared this approach: Rework psi_account_irqtime() to use its own psi_irq_time base for accounting, and move it out of the hotpath, calling it instead from sched_tick() and __schedule(). In testing this, we found the importance of ensuring psi_account_irqtime() is run under the rq_lock, which Johannes Weiner helpfully explained, so also add some lockdep annotations to make that requirement clear. With this change the performance is back in-line with 5.10: 6.1+fix: 100000000 calls in 24297324597 ns => 242.973 ns per call 100000000 calls in 24318869234 ns => 243.189 ns per call 100000000 calls in 24291564588 ns => 242.916 ns per call Reported-by: Jimmy Shiu <jimmyshiu@google.com> Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com
2024-07-01 | sched/deadline: Fix task_struct reference leak | Wander Lairson Costa
During the execution of the following stress test with linux-rt:

  stress-ng --cyclic 30 --timeout 30 --minimize --quiet

kmemleak frequently reported a memory leak concerning the task_struct:

unreferenced object 0xffff8881305b8000 (size 16136):
  comm "stress-ng", pid 614, jiffies 4294883961 (age 286.412s)
  object hex dump (first 32 bytes):
    02 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .@..............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
  debug hex dump (first 16 bytes):
    53 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 S...............
  backtrace:
    [<00000000046b6790>] dup_task_struct+0x30/0x540
    [<00000000c5ca0f0b>] copy_process+0x3d9/0x50e0
    [<00000000ced59777>] kernel_clone+0xb0/0x770
    [<00000000a50befdc>] __do_sys_clone+0xb6/0xf0
    [<000000001dbf2008>] do_syscall_64+0x5d/0xf0
    [<00000000552900ff>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

The issue occurs in start_dl_timer(), which increments the task_struct reference count and sets a timer. The timer callback, dl_task_timer, is supposed to decrement the reference count upon expiration. However, if enqueue_task_dl() is called before the timer expires and cancels it, the reference count is not decremented, leading to the leak.

This patch fixes the reference leak by ensuring the task_struct reference count is properly decremented when the timer is canceled.

Fixes: feff2e65efd8 ("sched/deadline: Unthrottle PI boosted threads while enqueuing")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240620125618.11419-1-wander@redhat.com
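A hedged sketch of the kind of fix described: when the enqueue path cancels a still-pending dl_timer, it must also drop the reference that start_dl_timer() took, because dl_task_timer() will now never run. The helper below is illustrative, not the upstream hunk.

```c
static void cancel_dl_timer_sketch(struct sched_dl_entity *dl_se)
{
	/* hrtimer_try_to_cancel() returns 1 when it removed a timer that was
	 * still queued; in that case the callback never runs and never does
	 * its put_task_struct(), so the reference is dropped here instead. */
	if (hrtimer_try_to_cancel(&dl_se->dl_timer) == 1)
		put_task_struct(dl_task_of(dl_se));
}
```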
2024-07-01 | Revert "sched/fair: Make sure to try to detach at least one movable task" | Josh Don
This reverts commit b0defa7ae03ecf91b8bfd10ede430cff12fcbd06. b0defa7ae03ec changed the load balancing logic to ignore env.max_loop if all tasks examined to that point were pinned. The goal of the patch was to make it more likely to be able to detach a task buried in a long list of pinned tasks. However, this has the unfortunate side effect of creating an O(n) iteration in detach_tasks(), as we now must fully iterate every task on a cpu if all or most are pinned. Since this load balance code is done with rq lock held, and often in softirq context, it is very easy to trigger hard lockups. We observed such hard lockups with a user who affined O(10k) threads to a single cpu. When I discussed this with Vincent he initially suggested that we keep the limit on the number of tasks to detach, but increase the number of tasks we can search. However, after some back and forth on the mailing list, he recommended we instead revert the original patch, as it seems likely no one was actually getting hit by the original issue. Fixes: b0defa7ae03e ("sched/fair: Make sure to try to detach at least one movable task") Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240620214450.316280-1-joshdon@google.com
2024-06-27 | sched_ext: fix typo in set_weight() description | Andrea Righi
Correct eight to weight in the description of the .set_weight() operation in sched_ext_ops. Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-23 | sched_ext: Make scx_bpf_cpuperf_set() @cpu arg signed | David Vernet
The scx_bpf_cpuperf_set() kfunc allows a BPF program to set the relative performance target of a specified CPU. Commit d86adb4fc065 ("sched_ext: Add cpuperf support") defined the @cpu argument to be unsigned. Let's update it to be signed to match the norm for the rest of ext.c and the kernel. Note that the kfunc declaration of scx_bpf_cpuperf_set() in the common.bpf.h header in tools/sched_ext already listed the cpu as signed, so this also fixes the build for tools/sched_ext and the sched_ext selftests due to kfunc declarations now being emitted in vmlinux.h based on BTF (thus causing the compiler to error due to observing conflicting types). Fixes: d86adb4fc065 ("sched_ext: Add cpuperf support") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>
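A diff-style sketch of the signature change being described; the name of the second parameter is an assumption, not taken from the upstream declaration.

```
-__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
+__bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
```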
2024-06-21 | sched_ext: Add cpuperf support | Tejun Heo
sched_ext currently does not integrate with schedutil. When schedutil is the governor, frequencies are left unregulated and usually get stuck close to the highest performance level from running RT tasks. Add CPU performance monitoring and scaling support by integrating into schedutil. The following kfuncs are added: - scx_bpf_cpuperf_cap(): Query the relative performance capacity of different CPUs in the system. - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU relative to its max performance. - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU. This gives direct control over CPU performance setting to the BPF scheduler. The only changes on the schedutil side are accounting for the utilization factor from sched_ext and disabling frequency holding heuristics as it may not apply well to sched_ext schedulers which may have a lot weaker connection between tasks and their current / last CPU. With cpuperf support added, there is no reason to block uclamp. Enable while at it. A toy implementation of cpuperf is added to scx_qmap as a demonstration of the feature. v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util() to avoid factoring in stale util metric. (Christian) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Christian Loehle <christian.loehle@arm.com>
2024-06-21 | cpufreq_schedutil: Refactor sugov_cpu_is_busy() | Tejun Heo
sugov_cpu_is_busy() is used to avoid decreasing the performance level while the CPU is busy and is called by sugov_update_single_freq() and sugov_update_single_perf(). Both callers repeat the same pattern: first test for uclamp and then for busyness. Let's refactor so that the tests aren't repeated.

The new helper is named sugov_hold_freq() and tests both the uclamp exception and CPU busyness. No functional changes.

This will make adding more exception conditions easier.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
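A sketch of what the new helper can look like; sugov_cpu_is_busy() is the existing helper named above, while the exact uclamp test and the body of sugov_hold_freq() here are assumptions rather than the upstream implementation.

```c
static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
{
	/* The uclamp exception: when clamping is in use, never hold. */
	if (uclamp_is_used())
		return false;

	/* Otherwise hold the current frequency only while the CPU is busy. */
	return sugov_cpu_is_busy(sg_cpu);
}
```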