path: root/kernel/sched/fair.c
Age  Commit message  Author
2021-03-06  sched/fair: Remove update of blocked load from newidle_balance  (Vincent Guittot)
newidle_balance() runs with both preemption and IRQs disabled, which prevents local IRQs from running during this period. The duration of the blocked-load update of the CPUs varies with the number of CPU cgroups with non-decayed load and extends this critical period to an uncontrolled level. Remove the update from newidle_balance() and trigger a normal ILB that will take care of the update instead. This reduces the IRQ latency from O(nr_cgroups * nr_nohz_cpus) to O(nr_cgroups). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210224133007.28644-2-vincent.guittot@linaro.org
2021-02-26  kernel: delete repeated words in comments  (Randy Dunlap)
Drop repeated words in kernel/events/. {if, the, that, with, time} Drop repeated words in kernel/locking/. {it, no, the} Drop repeated words in kernel/sched/. {in, not} Link: https://lkml.kernel.org/r/20210127023412.26292-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Will Deacon <will@kernel.org> [kernel/locking/] Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-17  sched/features: Distinguish between NORMAL and DEADLINE hrtick  (Juri Lelli)
The HRTICK feature has traditionally been servicing configurations that need precise preemption points for NORMAL tasks. More recently, the feature has been extended to also service DEADLINE tasks with stringent runtime enforcement needs (e.g., runtime < 1ms with HZ=1000). Enabling the HRTICK sched feature currently enables the additional timer and task tick for both classes, which might introduce undesired overhead, for no additional benefit, if one needs it only for one of the two cases. Split the HRTICK sched feature in two (leaving the traditional case name unmodified) so that it can be selectively enabled when needed.
With:
  $ echo HRTICK > /sys/kernel/debug/sched_features
the NORMAL/fair hrtick gets enabled.
With:
  $ echo HRTICK_DL > /sys/kernel/debug/sched_features
the DEADLINE hrtick gets enabled.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210208073554.14629-3-juri.lelli@redhat.com
2021-02-17  rbtree, sched/fair: Use rb_add_cached()  (Peter Zijlstra)
Reduce rbtree boilerplate by using the new helper function. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Davidlohr Bueso <dbueso@suse.de>
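For illustration, a minimal sketch of the kind of pattern this helper enables in the CFS enqueue path; the names below are illustrative, not the exact fair.c code:

    /* Sketch: order entities by vruntime via a 'less' callback. */
    #define node_to_se(node) rb_entry((node), struct sched_entity, run_node)

    static inline bool entity_less(struct rb_node *a, const struct rb_node *b)
    {
            return entity_before(node_to_se(a), node_to_se(b));
    }

    static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
            /* one call replaces the open-coded walk plus rb_link_node() /
             * rb_insert_color_cached() boilerplate */
            rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, entity_less);
    }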
2021-02-17  sched/fair: Merge select_idle_core/cpu()  (Mel Gorman)
Both select_idle_core() and select_idle_cpu() do a loop over the same cpumask. Observe that by clearing the already visited CPUs, we can fold the iteration and iterate a core at a time. All we need to do is remember any non-idle CPU we encountered while scanning for an idle core. This way we'll only iterate every CPU once. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210127135203.19633-5-mgorman@techsingularity.net
2021-02-17  sched/fair: Remove select_idle_smt()  (Mel Gorman)
In order to make the next patch more readable, and to quantify the actual effectiveness of this pass, start by removing it. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210125085909.4600-4-mgorman@techsingularity.net
2021-01-27  sched/fair: Move avg_scan_cost calculations under SIS_PROP  (Mel Gorman)
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP check and while we are at it, exclude the cost of initialising the CPU mask from the average scan cost. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210125085909.4600-3-mgorman@techsingularity.net
2021-01-27  sched/fair: Remove SIS_AVG_CPU  (Mel Gorman)
SIS_AVG_CPU was introduced as a means of avoiding a search when the average search cost indicated that the search would likely fail. It was a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") and later replaced with a proportional search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()"). While there are corner cases where SIS_AVG_CPU is better, it has now been disabled for almost three years. As the intent of SIS_PROP is to reduce the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus on SIS_PROP as a throttling mechanism. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210125085909.4600-2-mgorman@techsingularity.net
2021-01-27  sched/eas: Don't update misfit status if the task is pinned  (Qais Yousef)
If the task is pinned to a CPU, setting the misfit status means that we'll continuously attempt to migrate the task but fail. This continuous failure will cause the balance_interval to increase to a high value, and eventually cause unnecessary, significant delays in balancing the system when a real imbalance happens. Caught while testing uclamp, where the rt-app calibration loop was pinned to CPU 0; shortly after that we spawn another task with a high util_clamp value. The task was failing to migrate after over 40ms of runtime due to balance_interval having been unnecessarily expanded to a very high value by the calibration loop. Not done here, but it could be useful to extend the check for pinning to verify that the affinity of the task has a CPU that fits. We could end up in a similar situation otherwise. Fixes: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Quentin Perret <qperret@google.com> Acked-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210119120755.2425264-1-qais.yousef@arm.com
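A minimal sketch of the kind of guard described here (illustrative, not the verbatim patch): bail out of the misfit update when the task cannot run anywhere else.

    /* Sketch only: skip misfit accounting for pinned tasks. */
    static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
    {
            if (!p || p->nr_cpus_allowed == 1) {
                    rq->misfit_task_load = 0;       /* pinned: nothing to migrate */
                    return;
            }

            if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
                    rq->misfit_task_load = 0;
                    return;
            }

            rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
    }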
2021-01-14  sched: Use task_current() instead of 'rq->curr == p'  (Hui Su)
Use the task_current() function where appropriate. No functional change. Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
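For reference, a sketch of what this substitution amounts to; task_current() is a small helper in kernel/sched/sched.h along these lines:

    static inline int task_current(struct rq *rq, struct task_struct *p)
    {
            return rq->curr == p;
    }

    /* so a check like:    if (rq->curr == p)        */
    /* is written as:      if (task_current(rq, p))  */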
2021-01-14  sched/fair: Reduce cases for active balance  (Vincent Guittot)
Active balance is triggered for a number of voluntary cases, like misfit or pinned-task cases, but also after a number of load-balance attempts have failed to migrate a task. There is no need to use active load balance when the group is overloaded, because an overloaded state means that there is at least one waiting task. Nevertheless, the waiting task is not selected and detached until the threshold becomes higher than its load. This threshold increases with the number of failed load balances (see the condition if ((load >> env->sd->nr_balance_failed) > env->imbalance) in detach_tasks()), and the waiting task will end up being selected after a number of attempts. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20210107103325.30851-4-vincent.guittot@linaro.org
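A small standalone demonstration (assumed values, not kernel code) of how the quoted condition relaxes with failed attempts: each failed balance halves the effective load threshold, so a waiting task is eventually detached without resorting to active balance.

    #include <stdio.h>

    int main(void)
    {
            unsigned long load = 1024, imbalance = 300;

            /* Mirrors the idea of:
             *   if ((load >> env->sd->nr_balance_failed) > env->imbalance)
             *           // task considered too heavy, skipped this round
             */
            for (unsigned int failed = 0; failed < 4; failed++)
                    printf("nr_balance_failed=%u -> effective load %lu -> %s\n",
                           failed, load >> failed,
                           (load >> failed) > imbalance ? "skipped" : "detached");
            return 0;
    }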
2021-01-14  sched/fair: Don't set LBF_ALL_PINNED unnecessarily  (Vincent Guittot)
Setting LBF_ALL_PINNED during active load balance is only valid when there is only one running task on the rq; otherwise it ends up increasing the balance interval even though other tasks could migrate after the next interval, once they become cache-cold for example. The LBF_ALL_PINNED flag is now always set by default. It is then cleared when we find one task that can be pulled, either when calling detach_tasks() or during active migration. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20210107103325.30851-3-vincent.guittot@linaro.org
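A rough sketch of the set-by-default / clear-on-first-pullable-task flow described above (illustrative shape only, not the exact diff):

    /* in load_balance(): assume everything is pinned ... */
    env.flags |= LBF_ALL_PINNED;

    /* ... in detach_tasks() / active migration: */
    list_for_each_entry(p, tasks, se.group_node) {
            if (!can_migrate_task(p, &env))
                    continue;
            env.flags &= ~LBF_ALL_PINNED;   /* one pullable task proves otherwise */
            /* ... detach p ... */
    }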
2021-01-14  sched/fair: Skip idle cfs_rq  (Vincent Guittot)
Don't waste time checking whether an idle cfs_rq could be the busiest queue. Furthermore, the current code can end up selecting a cfs_rq with a high load that is nevertheless idle in the migrate_load case. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20210107103325.30851-2-vincent.guittot@linaro.org
2021-01-14  sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue  (Xuewen Yan)
CPU (root cfs_rq) estimated utilization (util_est) is currently used in dequeue_task_fair() to drive frequency selection before it is updated. With:

  CPU_util        : rq->cfs.avg.util_avg
  CPU_util_est    : rq->cfs.avg.util_est
  CPU_utilization : max(CPU_util, CPU_util_est)
  task_util       : p->se.avg.util_avg
  task_util_est   : p->se.avg.util_est

dequeue_task_fair():

  /* (1) CPU_util and task_util update + inform schedutil about CPU_utilization changes */
  for_each_sched_entity()   /* 2 loops */
    (dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
      -> cpufreq_update_util() -> ... -> sugov_update_[shared|single]
        -> sugov_get_util() -> cpu_util_cfs()

  /* (2) CPU_util_est and task_util_est update */
  util_est_dequeue()

cpu_util_cfs() uses CPU_utilization, which could lead to a false (too high) utilization value for schedutil in task ramp-down or ramp-up scenarios during task dequeue. To mitigate the issue, split the util_est update (2) into: (A) the CPU_util_est update in util_est_dequeue() and (B) the task_util_est update in util_est_update(). Place (A) before (1) and keep (B) where (2) is. The latter is necessary since (B) relies on the task_util update in (1). Fixes: 7f65ea42eb00 ("sched/fair: Add util_est on top of PELT") Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/1608283672-18240-1-git-send-email-xuewen.yan94@gmail.com
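In rough pseudocode (the function names come from the description above; the exact placement and arguments are in the patch itself), the reordered dequeue path looks like:

    static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
    {
            util_est_dequeue(&rq->cfs, p);            /* (A) CPU_util_est update, now first */

            for_each_sched_entity(se) {
                    dequeue_entity(cfs_rq, se, flags);
                    /* update_load_avg() -> cfs_rq_util_change(): schedutil now sees a
                     * CPU_utilization that no longer includes the departing task */
            }

            util_est_update(&rq->cfs, p, task_sleep); /* (B) task_util_est update, kept last */
    }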
2021-01-14  sched: Prevent raising SCHED_SOFTIRQ when CPU is !active  (Anna-Maria Behnsen)
SCHED_SOFTIRQ is raised to trigger periodic load balancing. When a CPU is not active, it should not participate in load balancing. The scheduler uses nohz.idle_cpus_mask to keep track of the CPUs which can do idle load balancing. When bringing a CPU up, it is added to the mask once it reaches the active state, but on teardown the CPU stays in the mask until it goes offline and invokes sched_cpu_dying(). When SCHED_SOFTIRQ is raised on a !active CPU, there might be a pending softirq when stopping the tick, which triggers a warning in the NOHZ code. The SCHED_SOFTIRQ can also be raised by the scheduler tick, which has the same issue. Therefore remove the CPU from nohz.idle_cpus_mask when it is marked inactive and also prevent scheduler_tick() from raising SCHED_SOFTIRQ after this point. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20201215104400.9435-1-anna-maria@linutronix.de
2021-01-14  sched/core: Rename schedutil_cpu_util() and allow rest of the kernel to use it  (Viresh Kumar)
There is nothing schedutil specific in schedutil_cpu_util(), rename it to effective_cpu_util(). Also create and expose another wrapper sched_cpu_util() which can be used by other parts of the kernel, like thermal core (that will be done in a later commit). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/db011961fb3bb8bef1c0eda5cd64564637d3ef31.1607400596.git.viresh.kumar@linaro.org
2020-12-11  sched/fair: Trivial correction of the newidle_balance() comment  (Barry Song)
idle_balance() has been renamed to newidle_balance(). To differentiate it from nohz_idle_balance(), refining the comment will be helpful for readers of the code. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao.hua@hisilicon.com
2020-12-11  sched/fair: Clear SMT siblings after determining the core is not idle  (Mel Gorman)
The clearing of SMT siblings from the SIS mask before checking for an idle core is a small but unnecessary cost. Defer the clearing of the siblings until the scan moves to the next potential target. The cost of this was not measured as it is borderline noise but it should be self-evident. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201130144020.GS3371@techsingularity.net
2020-12-11  sched: Fix kernel-doc markup  (Mauro Carvalho Chehab)
Kernel-doc requires that a kernel-doc markup be immediately below the function prototype, as otherwise it will rename it. So, move the sys_sched_yield() markup to the right place. Also fix the cpu_util() markup; kernel-doc markups should use this format: identifier - description. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/50cd6f460aeb872ebe518a8e9cfffda2df8bdb0a.1606823973.git.mchehab+huawei@kernel.org
2020-11-27  Merge branch 'linus' into sched/core, to resolve semantic conflict  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-24  sched: Limit the amount of NUMA imbalance that can exist at fork time  (Mel Gorman)
At fork time currently, a local node can be allowed to fill completely, leaving the periodic load balancer to fix the problem. This can be problematic in cases where a task creates lots of threads that idle until woken as part of a worker pool, causing a memory bandwidth problem. However, a "real" workload suffers badly from this behaviour. The workload in question is mostly NUMA aware but spawns large numbers of threads that act as a worker pool that can be called from anywhere. These need to spread early to get reasonable behaviour. This patch limits how much a local node can fill before spilling over to another node, and it will not be a universal win. Specifically, very short-lived workloads that fit within a NUMA node would prefer the memory bandwidth. As I cannot describe the "real" workload, the best proxy measure I found for illustration was a page fault microbenchmark. It's not representative of the workload but demonstrates the hazard of the current behaviour.

pft timings
                            5.10.0-rc2              5.10.0-rc2
                       imbalancefloat-v2           forkspread-v2
  Amean elapsed-1     46.37 (  0.00%)      46.05 *  0.69%*
  Amean elapsed-4     12.43 (  0.00%)      12.49 * -0.47%*
  Amean elapsed-7      7.61 (  0.00%)       7.55 *  0.81%*
  Amean elapsed-12     4.79 (  0.00%)       4.80 ( -0.17%)
  Amean elapsed-21     3.13 (  0.00%)       2.89 *  7.74%*
  Amean elapsed-30     3.65 (  0.00%)       2.27 * 37.62%*
  Amean elapsed-48     3.08 (  0.00%)       2.13 * 30.69%*
  Amean elapsed-79     2.00 (  0.00%)       1.90 *  4.95%*
  Amean elapsed-80     2.00 (  0.00%)       1.90 *  4.70%*

This is showing the time to fault regions belonging to threads. The target machine has 80 logical CPUs and two nodes. Note the ~30% gain when the machine is approximately at the point where one node becomes fully utilised. The slower results are borderline noise. Kernel building shows similar benefits around the same balance point. Generally performance was either neutral or better in the tests conducted. The main consideration with this patch is the point where fork stops spreading a task, so some workloads may benefit from different balance points, but it would be a risky tuning parameter. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net
2020-11-24  sched/numa: Allow a floating imbalance between NUMA nodes  (Mel Gorman)
Currently, an imbalance is only allowed when a destination node is almost completely idle. This solved one basic class of problems and was the cautious approach. This patch revisits the possibility that NUMA nodes can be imbalanced until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat superficial -- it's half the cores when HT is enabled. At higher utilisations, balancing should continue as normal and keep things even until scheduler domains are fully busy or over utilised. Note that this is not expected to be a universal win. Any benchmark that prefers spreading as wide as possible with limited communication will favour the old behaviour as there is more memory bandwidth. Workloads that communicate heavily in pairs such as netperf or tbench benefit. For the tests I ran, the vast majority of workloads saw a benefit so it seems to be a worthwhile trade-off. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.net
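A standalone illustration (assumed numbers, hypothetical helper name) of the 25% rule described above: on a node with N CPUs, an imbalance is tolerated while fewer than N/4 CPUs are busy.

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical helper mirroring the described policy, not the kernel code. */
    static bool allow_numa_imbalance(int dst_running, int dst_cpus)
    {
            return dst_running < dst_cpus / 4;
    }

    int main(void)
    {
            int cpus = 40;  /* e.g. one node of a 2-node, 80-CPU machine */

            for (int running = 8; running <= 12; running++)
                    printf("%2d busy of %d CPUs -> %s\n", running, cpus,
                           allow_numa_imbalance(running, cpus) ?
                           "imbalance allowed" : "balance normally");
            return 0;
    }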
2020-11-24  sched: Avoid unnecessary calculation of load imbalance at clone time  (Mel Gorman)
In find_idlest_group(), the load imbalance is only relevant when the group is either overloaded or fully busy, but it is calculated unconditionally. This patch moves the imbalance calculation to the context where it is required. Technically, it is a micro-optimisation, but really the benefit is avoiding confusing one type of imbalance with another depending on the group_type in the next patch. No functional change. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-3-mgorman@techsingularity.net
2020-11-24  sched/numa: Rename nr_running and break out the magic number  (Mel Gorman)
This is simply a preparation patch to make the following patches easier to read. No functional change. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-2-mgorman@techsingularity.net
2020-11-22  Merge tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Thomas Gleixner:
 "A couple of scheduler fixes:
  - Make the conditional update of the overutilized state work correctly by caching the relevant flags state before overwriting them and checking them afterwards.
  - Fix a data race in the wakeup path which caused loadavg on ARM64 platforms to become a random number generator.
  - Fix the ordering of the iowaiter accounting operations so it can't be decremented before it is incremented.
  - Fix a bug in the deadline scheduler vs. priority inheritance when a non-deadline task A has inherited the parameters of a deadline task B and then blocks on a non-deadline task C. The second inheritance step used the static deadline parameters of task A, which are usually 0, instead of further propagating task B's parameters. The zero initialized parameters trigger a bug in the deadline scheduler"
* tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix priority inheritance with multiple scheduling classes
  sched: Fix rq->nr_iowait ordering
  sched: Fix data-race in wakeup
  sched/fair: Fix overutilized update in enqueue_task_fair()
2020-11-17  sched/fair: Fix overutilized update in enqueue_task_fair()  (Quentin Perret)
enqueue_task_fair() attempts to skip the overutilized update for new tasks as their util_avg is not accurate yet. However, the flag we check to do so is overwritten earlier on in the function, which makes the condition pretty much a nop. Fix this by saving the flag early on. Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator") Reported-by: Rick Yiu <rickyiu@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20201112111201.2081902-1-qperret@google.com
2020-11-15  Merge tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler fixes:
  - Address a load balancer regression by making the load balancer use the same logic as the wakeup path to spread tasks in the LLC domain
  - Prefer the CPU on which a task ran last over the local CPU in the fast wakeup path for asymmetric CPU capacity systems, to align with the symmetric case. This ensures more locality and prevents massive migration overhead on those asymmetric systems
  - Fix a memory corruption bug in the scheduler debug code caused by handing a modified buffer pointer to kfree()"
* tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Fix memory corruption caused by multiple small reads of flags
  sched/fair: Prefer prev cpu in asymmetric wakeup path
  sched/fair: Ensure tasks spreading in LLC during LB
2020-11-10  sched/fair: Dissociate wakeup decisions from SD flag value  (Valentin Schneider)
The CFS wakeup code will only ever go through EAS / its fast path on "regular" wakeups (i.e. not on forks or execs). These are currently gated by a check against 'sd_flag', which would be SD_BALANCE_WAKE at wakeup. However, we now have a flag that explicitly tells us whether a wakeup is a "regular" one, so hinge those conditions on that flag instead. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201102184514.2733-4-valentin.schneider@arm.com
2020-11-10  sched: Remove select_task_rq()'s sd_flag parameter  (Valentin Schneider)
Only select_task_rq_fair() uses that parameter to do an actual domain search, other classes only care about what kind of wakeup is happening (fork, exec, or "regular") and thus just translate the flag into a wakeup type. WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to encode the wakeup types we care about. For select_task_rq_fair(), we can simply use the shiny new WF_flag : SD_flag mapping. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201102184514.2733-3-valentin.schneider@arm.com
2020-11-10  sched/fair: Remove superfluous lock section in do_sched_cfs_slack_timer()  (Hui Su)
Since commit ab93a4bc955b ("sched/fair: Remove distribute_running from CFS bandwidth"), there is nothing to protect between raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore() in do_sched_cfs_slack_timer(). Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lkml.kernel.org/r/20201030144621.GA96974@rlk
2020-11-10  sched/fair: Prefer prev cpu in asymmetric wakeup path  (Vincent Guittot)
In the fast wakeup path, the scheduler always checks whether the local or prev CPUs are good candidates for the task before looking for other CPUs in the domain. With commit b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan") the heterogeneous system gained a dedicated path, but it doesn't try to reuse the prev CPU whenever possible. If the previous CPU is idle and belongs to the LLC domain, we should check it first before looking for another CPU, because it remains one of the best candidates and this also stabilizes task placement on the system. This change aligns the asymmetric path behavior with the symmetric one and reduces cases where the task migrates across all CPUs of the sd_asym_cpucapacity domains at wakeup. This change does not impact normal EAS mode, only the overloaded case or when EAS is not used.

- On hikey960 with performance governor (EAS disabled)
  ./perf bench sched pipe -T -l 50000
                 mainline            w/ patch
  # migrations   999364              0
  ops/sec        149313(+/-0.28%)    182587(+/- 0.40)   +22%

- On hikey with performance governor
  ./perf bench sched pipe -T -l 50000
                 mainline            w/ patch
  # migrations   0                   0
  ops/sec        47721(+/-0.76%)     47899(+/- 0.56)    +0.4%

According to the tests on hikey, the patch doesn't impact symmetric systems compared to the current implementation (only tested on arm64). Also, read the uclamped value of the task's utilization at most twice, instead of each time we compare the task's utilization with the CPU's capacity. Fixes: b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20201029161824.26389-1-vincent.guittot@linaro.org
2020-11-10  sched/fair: Ensure tasks spreading in LLC during LB  (Vincent Guittot)
schbench shows a latency increase for the 95th percentile and above since commit 0b0695f2b34a ("sched/fair: Rework load_balance()"). Align the behavior of the load balancer with the wakeup path, which tries to select an idle CPU which belongs to the LLC for a waking task. calculate_imbalance() will use nr_running instead of the spare capacity when CPUs share resources (i.e. cache) at the domain level. This will ensure a better spread of tasks on idle CPUs. Running schbench on a hikey (8 cores, arm64) shows the problem:

tip/sched/core:
  schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
  Latency percentiles (usec)
    50.0th: 33
    75.0th: 45
    90.0th: 51
    95.0th: 4152
   *99.0th: 14288
    99.5th: 14288
    99.9th: 14288
    min=0, max=14276

tip/sched/core + patch:
  schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
  Latency percentiles (usec)
    50.0th: 34
    75.0th: 47
    90.0th: 52
    95.0th: 78
   *99.0th: 94
    99.5th: 94
    99.9th: 94
    min=0, max=94

Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()") Reported-by: Chris Mason <clm@fb.com> Suggested-by: Rik van Riel <riel@surriel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Rik van Riel <riel@surriel.com> Link: https://lkml.kernel.org/r/20201102102457.28808-1-vincent.guittot@linaro.org
2020-11-10  sched/fair: Reorder throttle_cfs_rq() path  (Peng Wang)
As commit: 39f23ce07b93 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list") does in unthrottle_cfs_rq(), throttle_cfs_rq() can also use the same pattern as dequeue_task_fair(). No functional changes. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Phil Auld <pauld@redhat.com> Cc: Ben Segall <bsegall@google.com> Link: https://lore.kernel.org/r/f11dd2e3ab35cc538e2eb57bf0c99b6eaffce127.1604973978.git.rocking@linux.alibaba.com
2020-10-29  sched/fair: Check for idle core in wake_affine  (Julia Lawall)
In the case of a thread wakeup, wake_affine determines whether a core will be chosen for the thread on the socket where the thread ran previously or on the socket of the waker. This is done primarily by comparing the load of the core where the thread ran previously (prev) and the load of the waker (this). Commit 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path") changed the load computation from the runnable load to the load average, where the latter includes the load of threads that have already blocked on the core. When a short-running daemon process happens to run on prev, this change created the situation that prev could appear to have a greater load than this, even when prev is actually idle. When prev and this are on the same socket, the idle prev is detected later, in select_idle_sibling. But if that does not hold, prev is completely ignored, causing the waking thread to move to the socket of the waker. In the case of N mostly active threads on N cores, this triggers other migrations and hurts performance. In contrast, before commit 11f10e5420f6, the load on an idle core was 0, and in the case of a non-idle waker core, the effect of wake_affine was to select prev as the target for searching for a core for the waking thread. To avoid unnecessary migrations, extend wake_affine_idle to check whether the core where the thread previously ran is currently idle, and if so simply return that core as the target. [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable load in wakeup path") This particularly has an impact when using the ondemand power manager, where kworkers run every 0.004 seconds on all cores, increasing the likelihood that an idle core will be considered to have a load. The following numbers were obtained with the benchmarking tool hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel benchmarks (https://www.nas.nasa.gov/publications/npb.html). The tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz. Active (intel_pstate) and passive (intel_cpufreq) power management were used. Times are in seconds. All experiments use all 160 hardware threads.

            v5.9/intel-pstate        v5.9+patch/intel-pstate
  bt.C.c    24.725724+-0.962340      23.349608+-1.607214
  lu.C.x    29.105952+-4.804203      25.249052+-5.561617
  sp.C.x    31.220696+-1.831335      30.227760+-2.429792
  ua.C.x    26.606118+-1.767384      25.778367+-1.263850

            v5.9/ondemand            v5.9+patch/ondemand
  bt.C.c    25.330360+-1.028316      23.544036+-1.020189
  lu.C.x    35.872659+-4.872090      23.719295+-3.883848
  sp.C.x    32.141310+-2.289541      29.125363+-0.872300
  ua.C.x    29.024597+-1.667049      25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks there is no impact on performance. This also has a major impact on the splash2x.volrend benchmark of the parsec benchmark suite, which goes from 1m25 without this patch to 0m45 in active (intel_pstate) mode. Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path") Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
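Conceptually the change amounts to an early exit of roughly this shape in wake_affine_idle() (a sketch of the described behaviour, not the exact patch):

    /* If the CPU the task ran on last is currently idle, just keep it there. */
    if (available_idle_cpu(prev_cpu))
            return prev_cpu;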
2020-10-29  sched: Remove relyance on STRUCT_ALIGNMENT  (Peter Zijlstra)
Florian reported that all of kernel/sched/ is rebuilt when CONFIG_BLK_DEV_INITRD is changed, which, while not a bug, is unexpected. This is due to us including vmlinux.lds.h. Jakub explained that the problem is that we put the alignment requirement on the type instead of on a variable. Type alignment is a minimum; the compiler is free to pick any larger alignment for a specific instance of the type (e.g. the variable). So force the type alignment on all individual variable definitions and remove the undesired dependency on vmlinux.lds.h. Fixes: 85c2ce9104eb ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Suggested-by: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29  sched/fair: Exclude the current CPU from find_new_ilb()  (Peter Zijlstra)
It is possible for find_new_ilb() to select the current CPU, however, this only happens from newidle balancing, in which case need_resched() will be true, and consequently nohz_csd_func() will not trigger the softirq. Exclude the current CPU from becoming an ILB target. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29  sched/fair: Improve the accuracy of sched_stat_wait statistics  (jun qian)
When sched_schedstat changes from 0 to 1, some sched entities may already be in the runqueue with se->statistics.wait_start equal to 0, so the computed wait time (rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start) will be wrong. We need to avoid this scenario. Signed-off-by: jun qian <qianjun.kernel@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lkml.kernel.org/r/20201015064846.19809-1-qianjun.kernel@gmail.com
2020-10-25  treewide: Convert macro and uses of __section(foo) to __section("foo")  (Joe Perches)
Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
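As a concrete illustration of the conversion pattern (identifiers are made up for the example):

    /* before */
    static int foo __section(.init.data);
    static int bar __attribute__((section(".rodata")));

    /* after */
    static int foo __section(".init.data");
    static int bar __section(".rodata");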
2020-10-17  task_work: cleanup notification modes  (Jens Axboe)
A previous commit changed the notification mode from true/false to an int, allowing notify-no, notify-yes, or signal-notify. This was backwards compatible in the sense that any existing true/false user would translate to either 0 (no notification sent) or 1, the latter of which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2. Clean this up properly, and define a proper enum for the notification mode. Now we have:
  - TWA_NONE.   This is 0, same as before the original change, meaning no notification requested.
  - TWA_RESUME. This is 1, same as before the original change, meaning that we use TIF_NOTIFY_RESUME.
  - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the notification.
Clean up all the callers, switching their 0/1/false/true to using the appropriate TWA_* mode for notifications. Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
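The resulting modes, as described above, boil down to an enum of roughly this shape (sketch; the authoritative definition lives in the task_work header):

    enum task_work_notify_mode {
            TWA_NONE,       /* 0: no notification requested                   */
            TWA_RESUME,     /* 1: notify via TIF_NOTIFY_RESUME                */
            TWA_SIGNAL,     /* 2: notify via TIF_SIGPENDING/JOBCTL_TASK_WORK  */
    };

    /* callers now pass an explicit mode, e.g.:
     *   task_work_add(task, callback_head, TWA_SIGNAL);
     * instead of an opaque 0/1/true/false value. */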
2020-10-03  sched/debug: Add new tracepoint to track cpu_capacity  (Vincent Donnefort)
rq->cpu_capacity is a key element in several scheduler parts, such as EAS task placement and load balancing. Tracking this value enables testing and/or debugging by a toolkit. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
2020-10-03  sched/fair: Tweak pick_next_entity()  (Peter Oskolkov)
Currently, pick_next_entity(...) has the following structure (simplified): [...] if (last_buddy_ok()) result = last_buddy; if (next_buddy_ok()) result = next_buddy; [...] The intended behavior is to prefer next buddy over last buddy; the current code somewhat obfuscates this, and also wastes cycles checking the last buddy when eventually the next buddy is picked up. So this patch refactors two 'ifs' above into [...] if (next_buddy_ok()) result = next_buddy; else if (last_buddy_ok()) result = last_buddy; [...] Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guitttot@linaro.org> Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com
2020-09-25  sched/fair: Use dst group while checking imbalance for NUMA balancer  (Barry Song)
Barry Song noted the following Something is wrong. In find_busiest_group(), we are checking if src has higher load, however, in task_numa_find_cpu(), we are checking if dst will have higher load after balancing. It seems it is not sensible to check src. It maybe cause wrong imbalance value, for example, if dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and src_running = env->src_stats.nr_running - 1 results in 1; The current code is thinking imbalance as 0 since src_running is smaller than 2. This is inconsistent with load balancer. Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving a task "from an almost idle domain" to a "domain with spare capacity". This patch forbids movement "from a misplaced domain" to "an almost idle domain" as that is closer to what the CPU load balancer expects. This patch is not a universal win. The old behaviour was intended to allow a task from an almost idle NUMA node to migrate to its preferred node if the destination had capacity but there are corner cases. For example, a NAS compute load could be parallelised to use 1/3rd of available CPUs but not all those potential tasks are active at all times allowing this logic to trigger. An obvious example is specjbb 2005 running various numbers of warehouses on a 2 socket box with 80 cpus. specjbb 5.9.0-rc4 5.9.0-rc4 vanilla dstbalance-v1r1 Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%* Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%* Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%* Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%* Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%* Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%* Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%* Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%* Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%* Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%* Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%* Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%* Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%* Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%* Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%* Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%* Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%* Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%* Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%* Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%* Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%* Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%* Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%* Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%* Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%* Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%* Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%* Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%* Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%* Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%* Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%* Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%* Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%* Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%* Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%* Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%* Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%* Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%* Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%* Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%* Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%* 
Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%* Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%* Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%* Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%* Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%* Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%* Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%* Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%* Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%* Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%* Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%* Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%* Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%* Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%* Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%* Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%* Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%* Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%* Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%* Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%* Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%* Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%* Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%* Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%* Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%* Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%* Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%* Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%* Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%* Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%* Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%* Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%* Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%* Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%* Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%* Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%* Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%* Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%* Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%* The performance at the mid-point is better but not universally better. The patch is a mixed bag depending on the workload, machine and overall levels of utilisation. Sometimes it's better (sometimes much better), other times it is worse (sometimes much worse). Given that there isn't a universally good decision in this section and more people seem to prefer the patch then it may be best to keep the LB decisions consistent and revisit imbalance handling when the load balancer code changes settle down. Jirka Hladky added the following observation. Our results are mostly in line with what you see. We observe big gains (20-50%) when the system is loaded to 1/3 of the maximum capacity and mixed results at the full load - some workloads benefit from the patch at the full load, others not, but performance changes at the full load are mostly within the noise of results (+/-5%). Overall, we think this patch is helpful. [mgorman@techsingularity.net: Rewrote changelog] Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity") Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net
2020-09-25  sched/fair: Minimize concurrent LBs between domain level  (Vincent Guittot)
Sched domains tend to trigger the load balance loop simultaneously, but the larger domains often need more time to collect statistics. This slowness makes the larger domains try to detach tasks from a rq whose tasks have already been migrated somewhere else at a sub-domain level. This is not a real problem for idle LB because the period of the smaller domains will increase as their CPUs get busy, and this will leave time for the higher ones to pull tasks. But this becomes a problem when all CPUs are already busy, because all domains stay synced when they trigger their LB. A simple way to minimize simultaneous LB of all domains is to decrement the busy interval by 1 jiffy. Because of the busy_factor, the interval of a larger domain will not be a multiple of the smaller ones anymore. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org
2020-09-25  sched/fair: Relax constraint on task's load during load balance  (Vincent Guittot)
Some use cases, like 9 always-running tasks on 8 CPUs, can't be balanced, and the load balancer currently migrates the waiting task between the CPUs in an almost random manner. The success of a rq pulling a task depends on the value of nr_balance_failed of its domains and on its ability to be faster than others at detaching it. This behavior results in an unfair distribution of the running time between tasks, because some CPUs will run most of the time, if not always, the same task, whereas others will share their time between several tasks. Instead of using nr_balance_failed as a boolean to relax the condition for detaching a task, the load balancer now uses it to relax the threshold between the task's load and the imbalance. This mechanism prevents the same rq or domain from always winning the load-balance fight. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
2020-09-25  sched/fair: Remove the force parameter of update_tg_load_avg()  (Xianting Tian)
In fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used and sometimes update_tg_load_avg(cfs_rq, false) is used. update_tg_load_avg() has a 'force' parameter, but the current code never passes 1 or true for it, so remove the force parameter. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com
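In practice the change is a call-signature cleanup; a before/after sketch:

    /* before: mixed, always-"false" second argument */
    update_tg_load_avg(cfs_rq, 0);
    update_tg_load_avg(cfs_rq, false);

    /* after: the unused 'force' parameter is gone */
    update_tg_load_avg(cfs_rq);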
2020-09-25  sched/fair: Fix wrong cpu selecting from isolated domain  (Xunlei Pang)
We've met problems where tasks with a full cpumask (e.g. obtained by putting them into a cpuset or setting full affinity) were occasionally migrated to our isolated CPUs in a production environment. After some analysis, we found that it is due to the current select_idle_smt() not considering the sched_domain mask. Steps to reproduce on my 31-CPU hyperthreads machine:
 1. boot with parameter: "isolcpus=domain,2-31" (thread lists: 0,16 and 1,17)
 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
 3. some threads will be migrated to the isolated cpu16~17.
Fix it by checking the valid domain mask in select_idle_smt(). Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()") Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
2020-09-25  sched/numa: Use runnable_avg to classify node  (Vincent Guittot)
Use runnable_avg to classify NUMA node state similarly to what is done for the normal load balancer. This helps to ensure that the NUMA and normal balancers use the same view of the state of the system. Large arm64 system: 2 nodes / 224 CPUs: hackbench -l (256000/#grp) -g #grp

  grp    tip/sched/core       +patchset            improvement
  1      14,008(+/- 4,99 %)   13,800(+/- 3.88 %)    1,48 %
  4       4,340(+/- 5.35 %)    4.283(+/- 4.85 %)    1,33 %
  16      3,357(+/- 0.55 %)    3.359(+/- 0.54 %)   -0,06 %
  32      3,050(+/- 0.94 %)    3.039(+/- 1,06 %)    0,38 %
  64      2.968(+/- 1,85 %)    3.006(+/- 2.92 %)   -1.27 %
  128     3,290(+/-12.61 %)    3,108(+/- 5.97 %)    5.51 %
  256     3.235(+/- 3.95 %)    3,188(+/- 2.83 %)    1.45 %

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20200921072959.16317-1-vincent.guittot@linaro.org
2020-08-26  sched/fair: Simplify the work when reweighting entity  (Jiang Biao)
The code in reweight_entity() can be simplified. For a sched entity on the rq, the entity accounting can be replaced by cfs_rq instantaneous load updates currently called from within the entity accounting. Even though an entity on the rq can't represent a task in reweight_entity() (a task is always dequeued before calling this function) and so the numa task accounting and the rq->cfs_tasks list management of the entity accounting are never called, the redundant cfs_rq->nr_running decrement/increment will be avoided. Signed-off-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20200811113209.34057-1-benbjiang@tencent.com
2020-08-26  sched/fair: Fix wrong negative conversion in find_energy_efficient_cpu()  (Lukasz Luba)
In find_energy_efficient_cpu() 'cpu_cap' could be less than 'util'. This might be because of RT, DL (i.e. a higher sched class than CFS), IRQ or thermal pressure signals, which reduce the capacity value. In such a situation the result of 'cpu_cap - util' might be negative but is stored in an unsigned long. Then it might be compared with another unsigned long when uclamp_rq_util_with() reduces 'util' such that it passes the fits_capacity() check. Prevent this situation and make the arithmetic safer. Fixes: 1d42509e475cd ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20200810083004.26420-1-lukasz.luba@arm.com
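A standalone demonstration of the hazard being fixed (values are made up): subtracting a larger unsigned value wraps around instead of going negative, so later comparisons no longer behave as intended.

    #include <stdio.h>

    int main(void)
    {
            unsigned long cpu_cap = 400, util = 500;  /* capacity pushed below util */

            unsigned long spare = cpu_cap - util;     /* wraps to a huge number */
            printf("buggy spare capacity: %lu\n", spare);

            /* safer: only subtract once we know it cannot underflow */
            if (util < cpu_cap)
                    printf("real spare capacity: %lu\n", cpu_cap - util);
            else
                    printf("no spare capacity on this CPU\n");
            return 0;
    }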
2020-08-26  sched/fair: Ignore cache hotness for SMT migration  (Josh Don)
SMT siblings share caches, so cache hotness should be irrelevant for cross-sibling migration. Signed-off-by: Josh Don <joshdon@google.com> Proposed-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200804193413.510651-1-joshdon@google.com