diff options
| author | Tejun Heo <tj@kernel.org> | 2025-11-11 09:18:06 -1000 |
|---|---|---|
| committer | Tejun Heo <tj@kernel.org> | 2025-11-12 06:43:44 -1000 |
| commit | 61debc251c1c9150c7bdfd5c028bc2d078e17d22 (patch) | |
| tree | ae23364e9ca058952288b1de78e049bcc486720f /tools/lib/python/abi/helpers.py | |
| parent | 3546119f18647d7ddbba579737d8a222b430cb1c (diff) | |
sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
Bypass mode routes tasks through fallback dispatch queues. Originally a single
global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node")
changed this to per-node DSQs to resolve NUMA-related livelocks.
Dan Schatzberg found per-node DSQs can still livelock when many threads are
pinned to different small CPU subsets: each CPU must scan many incompatible
tasks to find runnable ones, causing severe contention with high CPU counts.
Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
idle CPU selection and direct dispatch handle most cases well.
This introduces a failure mode when tasks concentrate on one CPU in
over-saturated systems. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
separate from local DSQ to enable load balancing: local DSQs use rq locks,
preventing efficient scanning and transfer across CPUs, especially problematic
when systems are already contended.
v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Diffstat (limited to 'tools/lib/python/abi/helpers.py')
0 files changed, 0 insertions, 0 deletions
