summaryrefslogtreecommitdiff
path: root/tools/sched_ext
AgeCommit message (Collapse)Author
2025-04-08sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecationTejun Heo
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg. v2: Actually include the scx_flatcg update. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-02tools/sched_ext: Sync with scx repoTejun Heo
Synchronize with https://github.com/sched-ext/scx at dc44584874f0 ("kernel: Synchronize with kernel tools/sched_ext"). - READ/WRITE_ONCE() is made more proper and READA_ONCE_ARENA() is dropped. - scale_by_task_weight[_inverse]() helpers added. - Enum defs expanded to cover more and new enums. - Don't trigger fatal error when some enums are missing from kernel BTF. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04sched_ext: Change the event type from u64 to s64Changwoo Min
The event count could be negative in the future, so change the event type from u64 to s64. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03sched_ext: Merge branch 'for-6.14-fixes' into for-6.15Tejun Heo
Pull for-6.14-fixes to receive: 9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()") which conflicts with: 337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-27tools/sched_ext: Provide a compatible helper for scx_bpf_events()Andrea Righi
Introduce __COMPAT_scx_bpf_events() to use scx_bpf_events() in a compatible way also with kernels that don't provide this kfunc. This also fixes the following error with scx_qmap when running on a kernel that does not provide scx_bpf_events(): ; scx_bpf_events(&events, sizeof(events)); @ scx_qmap.bpf.c:777 318: (b7) r2 = 72 ; R2_w=72 async_cb 319: <invalid kfunc call> kfunc 'scx_bpf_events' is referenced but wasn't resolved Fixes: 9865f31d852a4 ("sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25tools/sched_ext: Provide consistent access to scx flagsAndrea Righi
Make all the SCX_OPS_* and SCX_PICK_IDLE_* flags available to the user-space part of the schedulers via the compat interface. This allows schedulers / selftests to set all the ops flags in user-space, rather than having them split between BPF and user-space. Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-24sched_ext: idle: Introduce scx_bpf_nr_node_ids()Andrea Righi
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-18sched_ext: idle: Introduce node-aware idle cpu kfunc helpersAndrea Righi
Introduce a new kfunc to retrieve the node associated to a CPU: int scx_bpf_cpu_node(s32 cpu) Add the following kfuncs to provide BPF schedulers direct access to per-node idle cpumasks information: const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) Moreover, trigger an scx error when any of the non-node aware idle CPU kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Per-node idle cpumasksAndrea Righi
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems. The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node. The split of multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior. NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks. = Test = Hardware: - System: DGX B200 - CPUs: 224 SMT threads (112 physical cores) - Processor: INTEL(R) XEON(R) PLATINUM 8570 - 2 NUMA nodes Scheduler: - scx_simple [1] (so that we can focus at the built-in idle selection policy and not at the scheduling policy itself) Test: - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs: avg time | stdev ---------+------ before: 52.431s | 2.895 after: 50.342s | 2.895 = Conclusion = Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case. The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%. On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple). Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test. [1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODEAndrea Righi
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14tools/sched_ext: Sync with scx repoTejun Heo
Synchronize with https://github.com/sched-ext/scx at d384453984a0 ("kernel: Sync at ad3b301aa05a ("sched_ext: Provides a sysfs 'events' to expose core event counters")"). Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.Chuyi Zhou
Now BPF only supports bpf_list_push_{front,back}_impl kfunc, not bpf_list_ push_{front,back}. This patch fix this issue. Without this patch, if we use bpf_list kfunc in scx, the BPF verifier would complain: libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or module BTFs libbpf: failed to load object 'scx_foo' libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL With this patch, the bpf list kfunc will work as expected. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Fixes: 2a52ca7c98960 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10tools/sched_ext: Update enum_defs.autogen.hChangwoo Min
Add where the script is located to the comment lines of the header file. This helps anyone re-generate the header file if required. Note that this is a sync from the PR [1] in the scx repo. [1] https://github.com/sched-ext/scx/pull/1322 Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-08tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTEDChangwoo Min
This provides compatible testing of SCX_ENQ_CPU_SELECTED. More specifically, it handles two cases: 1. a BPF scheduler is compiled against vmlinux.h where SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not have SCX_ENQ_CPU_SELECTED. In this case, the test result of 'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That test result is semantically incorrect because the kernel before SCX_ENQ_CPU_SELECTED has never skipped select_task_rq_scx(), so the result should be true. 2. a BPF scheduler is compiling against vmlinux.h where SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using SCX_ENQ_CPU_SELECTED causes compilation errors. To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(), which checks if SCX_ENQ_CPU_SELECTED exists in runtime using BPF CO-RE. This consists of three parts: 1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting whether SCX enums are defined in the vmlinux.h or not. 2. Implement __COMPAT_is_enq_cpu_selected(), which provide the test of SCX_ENQ_CPU_SELECTED in a compatible way. 3. Use __COMPAT_is_enq_cpu_selected() in scx_qmap. Note that this is a sync of the relevant PR [1] in the scx repo. [1] https://github.com/sched-ext/scx/pull/1314 Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-08tool/sched_ext: Event counter dumping updatesTejun Heo
- There's no need to dump event counters from both scx_qmap and scx_central. Drop counter dumping from scx_central. - bpf_printk() implies a trailing new line and the explicit new line leads to double new lines. Drop the explicit new lines. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08Merge branch 'for-6.14-fixes' into for-6.15Tejun Heo
Pull to receive: - 2fa0fbeb69ed ("sched_ext: Implement auto local dispatching of migration disabled tasks") - 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches") as planned for-6.15 changes depend on them (e.g. adding event counter for implicit migration disabled task handling).
2025-02-07sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/centralChangwoo Min
Modify the scx_qmap and scx_celtral schedulers to print the SCX_EV_ENQ_SLICE_DFL event every second. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04sched_ext: Print core event count in scx_qmap schedulerChangwoo Min
Modify the scx_qmap scheduler to print the core event counter every second. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04sched_ext: Print core event count in scx_central schedulerChangwoo Min
Modify the scx_central scheduler to print the core event counter every second. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulersChangwoo Min
scx_bpf_events() is added to the header files so the BPF scheduler can use it. Also, scx_read_event() is added to read an event type in a compatible way. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02sched_ext: Fix incorrect time delta calculation in time_delta()Changwoo Min
When (s64)(after - before) > 0, the code returns the result of (s64)(after - before) > 0 while the intended result should be (s64)(after - before). That happens because the middle operand of the ternary operator was omitted incorrectly, returning the result of (s64)(after - before) > 0. Thus, add the middle operand -- (s64)(after - before) -- to return the correct time calculation. Fixes: d07be814fc71 ("sched_ext: Add time helpers for BPF schedulers") Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27tools/sched_ext: Add helper to check task migration stateAndrea Righi
Introduce a new helper for BPF schedulers to determine whether a task can migrate or not (supporting both SMP and UP systems). Fixes: e9fe182772dc ("sched_ext: selftests/dsp_local_on: Fix sporadic failures") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: Use time helpers in BPF schedulersChangwoo Min
Modify the BPF schedulers to use time helpers defined in common.bpf.h Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()Changwoo Min
In the BPF schedulers that use bpf_ktime_get_ns() -- scx_central and scx_flatcg, replace bpf_ktime_get_ns() calls to scx_bpf_now(). Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: Add time helpers for BPF schedulersChangwoo Min
The following functions are added for BPF schedulers: - time_delta(after, before) - time_after(a, b) - time_before(a, b) - time_after_eq(a, b) - time_before_eq(a, b) - time_in_range(a, b, c) - time_in_range_open(a, b, c) Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10sched_ext: Add scx_bpf_now() for BPF schedulerChangwoo Min
scx_bpf_now() is added to the header files so the BPF scheduler can use it. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-12tools/sched_ext: Receive updates from SCX repoTejun Heo
Receive tools/sched_ext updates form https://github.com/sched-ext/scx to sync userspace bits: - scx_bpf_dump_header() added which can be used to print out basic scheduler info on dump. - BPF possible/online CPU iterators added. - CO-RE enums added. The enums are autogenerated from vmlinux.h. Include the generated artifacts in tools/sched_ext to keep the Makefile simpler. - Other misc changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-04sched_ext: fix application of sizeof to pointerguanjing
sizeof when applied to a pointer typed expression gives the size of the pointer. The proper fix in this particular case is to code sizeof(*cpuset) instead of sizeof(cpuset). This issue was detected with the help of Coccinelle. Fixes: 22a920209ab6 ("sched_ext: Implement tickless support") Signed-off-by: guanjing <guanjing@cmss.chinamobile.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-03sched_ext: Add __weak to fix the build errorsHonglei Wang
commit 5cbb302880f5 ("sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()") introduced several new functions which caused compilation errors when compiled with clang. Let's fix this by adding __weak markers. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5cbb302880f5 ("sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()") Acked-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> ↵Tejun Heo
scx_bpf_dsq_move[_vtime]*() In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the third set of renames. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. - scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT macros as they were introduced during v6.12 cycle. Wrap new API in __COMPAT macros too and trigger build errors on both __COMPAT prefixed and naked usages of the old names. The compat features will be dropped after v6.15. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()Tejun Heo
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the second rename. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. The compat features will be dropped after v6.15. v2: Comment and documentation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()Tejun Heo
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the first set of renames. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. The compat features will be dropped after v6.15. v2: Documentation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
2024-11-08sched_ext: Enable the ops breather and eject BPF scheduler on softlockupTejun Heo
On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly behaving BPF scheduler can live-lock the system by making multiple CPUs bang on the same DSQ to the point where soft-lockup detection triggers before SCX's own watchdog can take action. It also seems possible that the machine can be live-locked enough to prevent scx_ops_helper, which is an RT task, from running in a timely manner. Implement scx_softlockup() which is called when three quarters of soft-lockup threshold has passed. The function immediately enables the ops breather and triggers an ops error to initiate ejection of the BPF scheduler. The previous and this patch combined enable the kernel to reliably recover the system from live-lock conditions that can be triggered by a poorly behaving BPF scheduler on Intel dual socket systems. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Douglas Anderson <dianders@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org>
2024-11-05sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new typeTejun Heo
0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()") converted scx_ops_bypass_depth from an atomic to an int. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()")
2024-10-25sched_ext: Make cast_mask() inlineTejun Heo
cast_mask() doesn't do any actual work and is defined in a header file. Force it to be inline. When it is not inlined and the function is not used, it can cause verificaiton failures like the following: # tools/testing/selftests/sched_ext/runner -t minimal ===== START ===== TEST: minimal DESCRIPTION: Verify we can load a fully minimal scheduler OUTPUT: libbpf: prog 'cast_mask': missing BPF prog type, check ELF section name '.text' libbpf: prog 'cast_mask': failed to load: -22 libbpf: failed to load object 'minimal' libbpf: failed to load BPF skeleton 'minimal': -22 ERR: minimal.c:20 Failed to open and load skel not ok 1 minimal # ===== END ===== Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a748db0c8c6a ("tools/sched_ext: Receive misc updates from SCX repo")
2024-10-07sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTEDTejun Heo
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to tell whether ops.select_cpu() was called. This is incorrect as ops.select_cpu() can be skipped in the wakeup path and leads to e.g. incorrectly skipping direct dispatch for tasks that are bound to a single CPU. sched core has been updated to specify ENQUEUE_RQ_SELECTED if ->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update scx_qmap to test it instead of SCX_ENQ_WAKEUP. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-02sched_ext: Add __weak markers to BPF helper function decalarationsVishal Chourasia
Fix build errors by adding __weak markers to BPF helper function declarations in header files. This resolves static assertion failures in scx_qmap.bpf.c and scx_flatcg.bpf.c where functions like scx_bpf_dispatch_from_dsq_set_slice, scx_bpf_dispatch_from_dsq_set_vtime, and scx_bpf_task_cgroup were missing the __weak attribute. [1] https://lore.kernel.org/all/ZvvfUqRNM4-jYQzH@linux.ibm.com Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-26scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBALTejun Heo
scx_flatcg was using SCX_DSQ_GLOBAL for fallback handling. However, it is assuming that SCX_DSQ_GLOBAL isn't automatically consumed, which was true a while ago but is no longer the case. Also, there are further changes planned for SCX_DSQ_GLOBAL which will disallow explicit consumption from it. Switch to a user DSQ for fallback. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-25tools/sched_ext: Receive misc updates from SCX repoTejun Heo
Receive misc tools/sched_ext updates from https://github.com/sched-ext/scx to sync userspace bits. - LSP macros to help language servers. - bpf_cpumask_weight() declaration and cast_mask() helper. - Cosmetic updates to scx_flatcg.bpf.c. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-25sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycleTejun Heo
cgroup support and scx_bpf_dispatch[_vtime]_from_dsq() are newly added since 8bb30798fd6e ("sched_ext: Fixes incorrect type in bpf_scx_init()") which is the current earliest commit targeted by BPF schedulers. Add compat helpers for them and apply them in the example schedulers. These will be dropped after a few kernel releases. The exact backward compatibility window hasn't been decided yet. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-23sched_ext: Provide a sysfs enable_seq counterAndrea Righi
As discussed during the distro-centric session within the sched_ext Microconference at LPC 2024, introduce a sequence counter that is incremented every time a BPF scheduler is loaded. This feature can help distributions in diagnosing potential performance regressions by identifying systems where users are running (or have ran) custom BPF schedulers. Example: arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 0 arighi@virtme-ng~> sudo scx_simple local=1 global=0 ^CEXIT: unregistered from user space arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 1 In this way user-space tools (such as Ubuntu's apport and similar) are able to gather and include this information in bug reports. Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com> Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Phil Auld <pauld@redhat.com> Signed-off-by: Andrea Righi <andrea.righi@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09scx_qmap: Implement highpri boostingTejun Heo
Implement a silly boosting mechanism for nice -20 tasks. The only purpose is demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only works within SHARED_DSQ and makes only minor differences with increased dispatch batch (-b). This exercises moving tasks to a user DSQ and all local DSQs from ops.dispatch() and BPF timerfn. v2: - Updated to use scx_bpf_dispatch_from_dsq_set_{slice|vtime}(). - Drop the workaround for the iterated tasks not being trusted by the verifier. The issue is fixed from BPF side. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-09sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()Tejun Heo
Once a task is put into a DSQ, the allowed operations are fairly limited. Tasks in the built-in local and global DSQs are executed automatically and, ignoring dequeue, there is only one way a task in a user DSQ can be manipulated - scx_bpf_consume() moves the first task to the dispatching local DSQ. This inflexibility sometimes gets in the way and is an area where multiple feature requests have been made. Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context which dosen't hold a rq lock including BPF timers and SYSCALL programs. This is an expansion of an earlier patch which only allowed moving into the dispatching local DSQ: http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument count limit and often won't be needed anyway. Instead provide scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called only when needed and override the specified parameter for the subsequent dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-04sched_ext: Add a cgroup scheduler which uses flattened hierarchyTejun Heo
This patch adds scx_flatcg example scheduler which implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer by compounding the active weight share at each level. This flattening of hierarchy can bring a substantial performance gain when the cgroup hierarchy is nested multiple levels. in a simple benchmark using wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two apache instances competing with 2:1 weight ratio nested four level deep. However, the gain comes at the cost of not being able to properly handle thundering herd of cgroups. For example, if many cgroups which are nested behind a low priority parent cgroup wake up around the same time, they may be able to consume more CPU cycles than they are entitled to. In many use cases, this isn't a real concern especially given the performance gain. Also, there are ways to mitigate the problem further by e.g. introducing an extra scheduling layer on cgroup delegation boundaries. v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of SCX_OPS_KNOB_CGROUP_WEIGHT. v4: - Revert reference counted kptr for cgv_node as the change caused easily reproducible stalls. v3: - Updated to reflect the core API changes including ops.init/exit_task() and direct dispatch from ops.select_cpu(). Fixes and improvements including additional statistics. - Use reference counted kptr for cgv_node instead of xchg'ing against stash location. - Dropped '-p' option. v2: - Use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-09-04sched_ext: Add cgroup supportTejun Heo
Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement or implement only subset of cgroup features. The implemented features can be indicated using %SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features that are not implemented, a warning is triggered. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v7: - cgroup interface file visibility toggling is dropped in favor just warning messages. Dynamically changing interface visiblity caused more confusion than helping. v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE. - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED. v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid locking order conflict w/ cpuset. Better documentation around locking. - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations so that its invocations always match ops.cgroup_move() one-to-one. v4: - Example schedulers moved into their own patches. - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: - Make scx_example_pair switch all tasks by default. - Convert to BPF inline iterators. - scx_bpf_task_cgroup() is added to determine the current cgroup from CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. - scx_example_flatcg added. This demonstrates flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: - Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrea Righi <andrea.righi@canonical.com>
2024-09-03sched_ext: Don't call put_prev_task_scx() before picking the next taskTejun Heo
fd03c5b85855 ("sched: Rework pick_next_task()") changed the definition of pick_next_task() from: pick_next_task() := pick_task() + set_next_task(.first = true) to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) making invoking put_prev_task() pick_next_task()'s responsibility. This reordering allows pick_task() to be shared between regular and core-sched paths and put_prev_task() to know the next task. sched_ext depended on put_prev_task_scx() enqueueing the current task before pick_next_task_scx() is called. While pulling sched/core changes, 70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx() before picking the first task as a workaround. Clean it up and adopt the conventions that other sched classes are following. The operation of keeping running the current task was spread and required the task to be put on the local DSQ before picking: - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still runnable, hasn't exhausted its slice, and thus should keep running. - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the only runnable task. do_enqueue_task() in turn decided whether to use the local DSQ depending on SCX_OPS_ENQ_LAST. Consolidate the logic in balance_one() as it always knows whether it is going to keep the current task. balance_one() now considers all conditions where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell pick_next_task_scx() to keep the current task instead of picking one from the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is updated to pick the current task if SCX_TASK_BAL_KEEP is set. The workaround put_prev_task[_scx]() calls are replaced with put_prev_set_next_task(). This causes two behavior changes observable from the BPF scheduler: - When a task keep running, it no longer goes through enqueue/dequeue cycle and thus ops.stopping/running() transitions. The new behavior is better and all the existing schedulers should be able to handle the new behavior. - The BPF scheduler cannot keep executing the current task by enqueueing SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the BPF scheduler is responsible for resuming execution after each SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling decisions are not made on the local CPU - e.g. central or userspace-driven schedulin - and the new behavior is more logical and shouldn't pose any problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it doesn't fit that well anymore and the last task handling is moved to the end of qmap_dispatch(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-08-27scx_central: Fix smatch checker warningTejun Heo
ARRAY_ELEM_PTR() is an access macro used to help the BPF verifier not confused by offseted memory acceeses by yiedling a valid pointer or NULL in a way that's clear to the verifier. As such, the canonical usage involves checking NULL return from the macro. Note that in many cases, the NULL condition can never happen - they're there just to hint the verifier. In a bpf_loop in scx_central.bpf.c::central_dispatch(), the NULL check was incorrect in that there was another dereference of the pointer in addition to the NULL checked access. This worked as the pointer can never be NULL and the verifier could tell it would never be NULL in this case. However, this still looks wrong and trips smatch: ./tools/sched_ext/scx_central.bpf.c:205 ____central_dispatch() error: we previously assumed 'gimme' could be null (see line 201) ./tools/sched_ext/scx_central.bpf.c 195 196 if (!scx_bpf_dispatch_nr_slots()) 197 break; 198 199 /* central's gimme is never set */ 200 gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); 201 if (gimme && !*gimme) ^^^^^ If gimme is NULL 202 continue; 203 204 if (dispatch_to_cpu(cpu)) --> 205 *gimme = false; Fix the NULL check so that there are no derefs if NULL. This doesn't change actual behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: http://lkml.kernel.org/r/<955e1c3c-ace2-4a1d-b246-15b8196038a3@stanley.mountain>
2024-07-12sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueuesTejun Heo
Because there was no way to directly dispatch to the local DSQ of a remote CPU from ops.enqueue(), scx_qmap skipped looking for an idle CPU on !wakeup enqueues. This restriction was removed and sched_ext now allows SCX_DSQ_LOCAL_ON verdicts for direct dispatches. Factor out pick_direct_dispatch_cpu() from ops.select_cpu() and use it to direct dispatch from ops.enqueue() on !wakeup enqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Andrea Righi <righi.andrea@gmail.com>
2024-07-08sched_ext/scx_qmap: Add an example usage of DSQ iteratorTejun Heo
Implement periodic dumping of the shared DSQ to demonstrate the use of the newly added DSQ iterator. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org
2024-07-08sched_ext: Implement DSQ iteratorTejun Heo
DSQs are very opaque in the consumption path. The BPF scheduler has no way of knowing which tasks are being considered and which is picked. This patch adds BPF DSQ iterator. - Allows iterating tasks queued on a DSQ in the dispatch order or reverse from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs directly. - Has ordering guarantee where only tasks which were already queued when the iteration started are visible and consumable during the iteration. v5: - Add a comment to the naked list_empty(&dsq->list) test in consume_dispatch_q() to explain the reasoning behind the lockless test and by extension why nldsq_next_task() isn't used there. - scx_qmap changes separated into its own patch. v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong type for the last argument (bool rev instead of u64 flags). Fix it. v3: - Alexei pointed out that the iterator is too big to allocate on stack. Added a prep patch to reduce the size of the cursor. Now bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on 64bit. - u32_before() comparison factored out. v2: - scx_bpf_consume_task() is separated out into a separate patch. - DSQ seq and iter flags don't need to be u64. Use u32. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org