summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2021-06-18sched: Introduce task_is_running()Peter Zijlstra
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-06-18Merge branch 'sched/urgent' into sched/core, to resolve conflictsIngo Molnar
This commit in sched/urgent moved the cfs_rq_is_decayed() function: a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and this fresh commit in sched/core modified it in the old location: 9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are") Merge the two variants. Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-17sched/fair: Age the average idle timePeter Zijlstra
This is a partial forward-port of Peter Ziljstra's work first posted at: https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/ Currently select_idle_cpu()'s proportional scheme uses the average idle time *for when we are idle*, that is temporally challenged. When a CPU is not at all idle, we'll happily continue using whatever value we did see when the CPU goes idle. To fix this, introduce a separate average idle and age it (the existing value still makes sense for things like new-idle balancing, which happens when we do go idle). The overall goal is to not spend more time scanning for idle CPUs than we're idle for. Otherwise we're inhibiting work. This means that we need to consider the cost over all the wake-ups between consecutive idle periods. To track this, the scan cost is subtracted from the estimated average idle time. The impact of this patch is related to workloads that have domains that are fully busy or overloaded. Without the patch, the scan depth may be too high because a CPU is not reaching idle. Due to the nature of the patch, this is a regression magnet. It potentially wins when domains are almost fully busy or overloaded -- at that point searches are likely to fail but idle is not being aged as CPUs are active so search depth is too large and useless. It will potentially show regressions when there are idle CPUs and a deep search is beneficial. This tbench result on a 2-socket broadwell machine partially illustates the problem 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%* Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%* Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%* Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%* Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%* Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%* Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%* Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%* Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%* Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%* Note that 8 was a regression point where a deeper search would have helped but it gains for high thread counts when searches are useless. Hackbench is a more extreme example although not perfect as the tasks idle rapidly hackbench-process-pipes 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%) Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%) Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%) Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%* Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%* Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%* Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%) Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%* Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%* Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%* Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%* Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%* Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%* Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%* Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%* Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%) Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%) Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%) Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%) Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%) Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%) Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%) Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%) Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%) Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%) Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%) Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%) Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%) Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%) Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%) Again, higher thread counts benefit and the standard deviation shows that results are also a lot more stable when the idle time is aged. The patch potentially matters when a socket was multiple LLCs as the maximum search depth is lower. However, some of the test results were suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and other results were not dramatically different to other mcahines. Given the nature of the patch, Peter's full series is not being forward ported as each part should stand on its own. Preferably they would be merged at different times to reduce the risk of false bisections. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
2021-06-17sched/cpufreq: Consider reduced CPU capacity in energy calculationLukasz Luba
Energy Aware Scheduling (EAS) needs to predict the decisions made by SchedUtil. The map_util_freq() exists to do that. There are corner cases where the max allowed frequency might be reduced (due to thermal). SchedUtil as a CPUFreq governor, is aware of that but EAS is not. This patch aims to address it. SchedUtil stores the maximum allowed frequency in 'sugov_policy::next_freq' field. EAS has to predict that value, which is the real used frequency. That value is made after a call to cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits. In the existing code EAS is not able to predict that real frequency. This leads to energy estimation errors. To avoid wrong energy estimation in EAS (due to frequency miss prediction) make sure that the step which calculates Performance Domain frequency, is also aware of the allowed CPU capacity. Furthermore, modify map_util_freq() to not extend the frequency value. Instead, use map_util_perf() to extend the util value in both places: SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity. In the end, we achieve the same desirable behavior for both subsystems and alignment in regards to the real CPU frequency. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part) Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
2021-06-17sched/fair: Take thermal pressure into account while estimating energyLukasz Luba
Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests made by the SchedUtil governor to properly estimate energy used in the future. It has to take into account CPUs utilization and forecast Performance Domain (PD) frequency. There is a corner case when the max allowed frequency might be reduced due to thermal. SchedUtil is aware of that reduced frequency, so it should be taken into account also in EAS estimations. SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping to 'policy::max'. SchedUtil is responsible to respect that upper limit while setting the frequency through CPUFreq drivers. This effective frequency is stored internally in 'sugov_policy::next_freq' and EAS has to predict that value. In the existing code the raw value of arch_scale_cpu_capacity() is used for clamping the returned CPU utilization from effective_cpu_util(). This patch fixes issue with too big single CPU utilization, by introducing clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU capacity reduced by thermal pressure raw value. Thanks to knowledge about allowed CPU capacity, we don't get too big value for a single CPU utilization, which is then added to the util sum. The util sum is used as a source of information for estimating whole PD energy. To avoid wrong energy estimation in EAS (due to capped frequency), make sure that the calculation of util sum is aware of allowed CPU capacity. This thermal pressure might be visible in scenarios where the CPUs are not heavily loaded, but some other component (like GPU) drastically reduced available power budget and increased the SoC temperature. Thus, we still use EAS for task placement and CPUs are not over-utilized. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
2021-06-17sched/fair: Return early from update_tg_cfs_load() if delta == 0Dietmar Eggemann
In case the _avg delta is 0 there is no need to update se's _avg (level n) nor cfs_rq's _avg (level n-1). These values stay the same. Since cfs_rq's _avg isn't changed, i.e. no load is propagated down, cfs_rq's _sum should stay the same as well. So bail out after se's _sum has been updated. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210601083616.804229-1-dietmar.eggemann@arm.com
2021-06-17sched/pelt: Check that *_avg are null when *_sum areVincent Guittot
Check that we never break the rule that pelt's avg values are null if pelt's sum are. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210601155328.19487-1-vincent.guittot@linaro.org
2021-06-14sched/fair: Correctly insert cfs_rq's to list on unthrottleOdin Ugedal
Fix an issue where fairness is decreased since cfs_rq's can end up not being decayed properly. For two sibling control groups with the same priority, this can often lead to a load ratio of 99/1 (!!). This happens because when a cfs_rq is throttled, all the descendant cfs_rq's will be removed from the leaf list. When they initial cfs_rq is unthrottled, it will currently only re add descendant cfs_rq's if they have one or more entities enqueued. This is not a perfect heuristic. Instead, we insert all cfs_rq's that contain one or more enqueued entities, or it its load is not completely decayed. Can often lead to situations like this for equally weighted control groups: $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1 root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1 Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") [vingo: !SMP build fix] Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210612112815.61678-1-odin@uged.al
2021-06-04sched/debug: Remove obsolete init_schedstats()Eric Dumazet
Revert commit 4698f88c06b8 ("sched/debug: Fix 'schedstats=enable' cmdline option"). After commit 6041186a3258 ("init: initialize jump labels before command line option parsing") we can rely on jump label infra being ready for use when setup_schedstats() is called. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20210602112108.1709635-1-eric.dumazet@gmail.com
2021-06-03Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03sched/fair: Fix util_est UTIL_AVG_UNCHANGED handlingDietmar Eggemann
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent unnecessary util_est updates uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()). Commit 92a801e5d5b7 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early. _task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true). To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est(). The maximal possible util_avg value for a task is 1024 so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value. As a caveat the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should be easy to do. This also fixes an issue report by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation: last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED) Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
2021-06-03sched/pelt: Ensure that *_sum is always synced with *_avgVincent Guittot
Rounding in PELT calculation happening when entities are attached/detached of a cfs_rq can result into situations where util/runnable_avg is not null but util/runnable_sum is. This is normally not possible so we need to ensure that util/runnable_sum stays synced with util/runnable_avg. detach_entity_load_avg() is the last place where we don't sync util/runnable_sum with util/runnbale_avg when moving some sched_entities Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210601085832.12626-1-vincent.guittot@linaro.org
2021-06-01sched: Don't defer CPU pick to migration_cpu_stop()Valentin Schneider
Will reported that the 'XXX __migrate_task() can fail' in migration_cpu_stop() can happen, and it *is* sort of a big deal. Looking at it some more, one will note there is a glaring hole in the deferred CPU selection: (w/ CONFIG_CPUSET=n, so that the affinity mask passed via taskset doesn't get AND'd with cpu_online_mask) $ taskset -pc 0-2 $PID # offline CPUs 3-4 $ taskset -pc 3-5 $PID `\ $PID may stay on 0-2 due to the cpumask_any_distribute() picking an offline CPU and __migrate_task() refusing to do anything due to cpu_is_allowed(). set_cpus_allowed_ptr() goes to some length to pick a dest_cpu that matches the right constraints vs affinity and the online/active state of the CPUs. Reuse that instead of discarding it in the affine_move_task() case. Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210526205751.842360-2-valentin.schneider@arm.com
2021-06-01sched/fair: Fix ascii art by relpacing tabsOdin Ugedal
When using something other than 8 spaces per tab, this ascii art makes not sense, and the reader might end up wondering what this advanced equation "is". Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210518125202.78658-4-odin@uged.al
2021-06-01sched,init: Fix DEBUG_PREEMPT vs early bootPeter Zijlstra
Extend 8fb12156b8db ("init: Pin init task to the boot CPU, initially") to cover the new PF_NO_SETAFFINITY requirement. While there, move wait_for_completion(&kthreadd_done) into kernel_init() to make it absolutely clear it is the very first thing done by the init thread. Fixes: 570a752b7a9b ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/YLS4mbKUrA3Gnb4t@hirez.programming.kicks-ass.net
2021-05-31sched/fair: Make sure to update tg contrib for blocked loadVincent Guittot
During the update of fair blocked load (__update_blocked_fair()), we update the contribution of the cfs in tg->load_avg if cfs_rq's pelt has decayed. Nevertheless, the pelt values of a cfs_rq could have been recently updated while propagating the change of a child. In this case, cfs_rq's pelt will not decayed because it has already been updated and we don't update tg->load_avg. __update_blocked_fair ... for_each_leaf_cfs_rq_safe: child cfs_rq update cfs_rq_load_avg() for child cfs_rq ... update_load_avg(cfs_rq_of(se), se, 0) ... update cfs_rq_load_avg() for parent cfs_rq -propagation of child's load makes parent cfs_rq->load_sum becoming null -UPDATE_TG is not set so it doesn't update parent cfs_rq->tg_load_avg_contrib .. for_each_leaf_cfs_rq_safe: parent cfs_rq update cfs_rq_load_avg() for parent cfs_rq - nothing to do because parent cfs_rq has already been updated recently so cfs_rq->tg_load_avg_contrib is not updated ... parent cfs_rq is decayed list_del_leaf_cfs_rq parent cfs_rq - but it still contibutes to tg->load_avg we must set UPDATE_TG flags when propagting pending load to the parent Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Reported-by: Odin Ugedal <odin@uged.al> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lkml.kernel.org/r/20210527122916.27683-3-vincent.guittot@linaro.org
2021-05-31sched/fair: Keep load_avg and load_sum syncedVincent Guittot
when removing a cfs_rq from the list we only check _sum value so we must ensure that _avg and _sum stay synced so load_sum can't be null whereas load_avg is not after propagating load in the cgroup hierarchy. Use load_avg to compute load_sum similarly to what is done for util_sum and runnable_sum. Fixes: 0e2d2aaaae52 ("sched/fair: Rewrite PELT migration propagation") Reported-by: Odin Ugedal <odin@uged.al> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lkml.kernel.org/r/20210527122916.27683-2-vincent.guittot@linaro.org
2021-05-19sched: Fix a stale comment in pick_next_task()Masahiro Yamada
fair_sched_class->next no longer exists since commit: a87e749e8fa1 ("sched: Remove struct sched_class::next field"). Now the sched_class order is specified by the linker script. Rewrite the comment in a more generic way. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210519063709.323162-1-masahiroy@kernel.org
2021-05-19sched/uclamp: Fix locking around cpu_util_update_eff()Qais Yousef
cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the uclamp_mutex or rcu_read_lock() like other call sites, which is a mistake. The uclamp_mutex is required to protect against concurrent reads and writes that could update the cgroup hierarchy. The rcu_read_lock() is required to traverse the cgroup data structures in cpu_util_update_eff(). Surround the caller with the required locks and add some asserts to better document the dependency in cpu_util_update_eff(). Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups") Reported-by: Quentin Perret <qperret@google.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com
2021-05-19sched/uclamp: Fix wrong implementation of cpu.uclamp.minQais Yousef
cpu.uclamp.min is a protection as described in cgroup-v2 Resource Distribution Model Documentation/admin-guide/cgroup-v2.rst which means we try our best to preserve the minimum performance point of tasks in this group. See full description of cpu.uclamp.min in the cgroup-v2.rst. But the current implementation makes it a limit, which is not what was intended. For example: tg->cpu.uclamp.min = 20% p0->uclamp[UCLAMP_MIN] = 0 p1->uclamp[UCLAMP_MIN] = 50% Previous Behavior (limit): p0->effective_uclamp = 0 p1->effective_uclamp = 20% New Behavior (Protection): p0->effective_uclamp = 20% p1->effective_uclamp = 50% Which is inline with how protections should work. With this change the cgroup and per-task behaviors are the same, as expected. Additionally, we remove the confusing relationship between cgroup and !user_defined flag. We don't want for example RT tasks that are boosted by default to max to change their boost value when they attach to a cgroup. If a cgroup wants to limit the max performance point of tasks attached to it, then cpu.uclamp.max must be set accordingly. Or if they want to set different boost value based on cgroup, then sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max and set the right cpu.uclamp.min for each group to let the RT tasks obtain the desired boost value when attached to that group. As it stands the dependency on !user_defined flag adds an extra layer of complexity that is not required now cpu.uclamp.min behaves properly as a protection. The propagation model of effective cpu.uclamp.min in child cgroups as implemented by cpu_util_update_eff() is still correct. The parent protection sets an upper limit of what the child cgroups will effectively get. Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps) Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com
2021-05-18sched: Make the idle task quack like a per-CPU kthreadValentin Schneider
For all intents and purposes, the idle task is a per-CPU kthread. It isn't created via the same route as other pcpu kthreads however, and as a result it is missing a few bells and whistles: it fails kthread_is_per_cpu() and it doesn't have PF_NO_SETAFFINITY set. Fix the former by giving the idle task a kthread struct along with the KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle() call be called more than once on the same idle task. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510151024.2448573-2-valentin.schneider@arm.com
2021-05-18sched,stats: Further simplify sched_infoPeter Zijlstra
There's no point doing delta==0 updates. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-05-13sched/isolation: Reconcile rcu_nocbs= and nohz_full=Paul Gortmaker
We have a mismatch between RCU and isolation -- in relation to what is considered the maximum valid CPU number. This matters because nohz_full= and rcu_nocbs= are joined at the hip; in fact the former will enforce the latter. So we don't want a CPU mask to be valid for one and denied for the other. The difference 1st appeared as of v4.15; further details are below. As it is confusing to anyone who isn't looking at the code regularly, a reminder is in order; three values exist here: CONFIG_NR_CPUS - compiled in maximum cap on number of CPUs supported. nr_cpu_ids - possible # of CPUs (typically reflects what ACPI says) cpus_present - actual number of present/detected/installed CPUs. For this example, I'll refer to NR_CPUS=64 from "make defconfig" and nr_cpu_ids=6 for ACPI reporting on a board that could run a six core, and present=4 for a quad that is physically in the socket. From dmesg: smpboot: Allowing 6 CPUs, 2 hotplug CPUs setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:6 nr_node_ids:1 rcu: RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=6. smp: Brought up 1 node, 4 CPUs And from userspace, see: paul@trash:/sys/devices/system/cpu$ cat present 0-3 paul@trash:/sys/devices/system/cpu$ cat possible 0-5 paul@trash:/sys/devices/system/cpu$ cat kernel_max 63 Everything is fine if we boot 5x5 for rcu/nohz: Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-5 rcu_nocbs=2-5 root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. ..even though there is no CPU 4 or 5. Both RCU and nohz_full are OK. Now we push that > 6 but less than NR_CPU and with 15x15 we get: Command line: BOOT_IMAGE=/boot/bzImage rcu_nocbs=2-15 nohz_full=2-15 root=/dev/sda1 ro rcu: Note: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs. rcu: Offload RCU callbacks from CPUs: 2-5. These are both functionally equivalent, as we are only changing flags on phantom CPUs that don't exist, but note the kernel interpretation changes. And worse, it only changes for one of the two - which is the problem. RCU doesn't care if you want to restrict the flags on phantom CPUs but clearly nohz_full does after this change from v4.15. edb9382175c3: ("sched/isolation: Move isolcpus= handling to the housekeeping code") - if (cpulist_parse(str, non_housekeeping_mask) < 0) { - pr_warn("Housekeeping: Incorrect nohz_full cpumask\n"); + err = cpulist_parse(str, non_housekeeping_mask); + if (err < 0 || cpumask_last(non_housekeeping_mask) >= nr_cpu_ids) { + pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n"); To be clear, the sanity check on "possible" (nr_cpu_ids) is new here. The goal was reasonable ; not wanting housekeeping to land on a not-possible CPU, but note two things: 1) this is an exclusion list, not an inclusion list; we are tracking non_housekeeping CPUs; not ones who are explicitly assigned housekeeping 2) we went one further in 9219565aa890 ("sched/isolation: Require a present CPU in housekeeping mask") - ensuring that housekeeping was sanity checking against present and not just possible CPUs. To be clear, this means the check added in v4.15 is doubly redundant. And more importantly, overly strict/restrictive. We care now, because the bitmap boot arg parsing now knows that a value of "N" is NR_CPUS; the size of the bitmap, but the bitmap code doesn't know anything about the subtleties of our max/possible/present CPU specifics as outlined above. So drop the check added in v4.15 (edb9382175c3) and make RCU and nohz_full both in alignment again on NR_CPUS so "N" works for both, and then they can fall back to nr_cpu_ids internally just as before. Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-N rcu_nocbs=2-N root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. As shown above, with this change, RCU and nohz_full are in sync, even with the use of the "N" placeholder. Same result is achieved with "15". Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20210419042659.1134916-1-paul.gortmaker@windriver.com
2021-05-12sched: Make multiple runqueue task counters 32-bitAlexey Dobriyan
Make: struct dl_rq::dl_nr_migratory struct dl_rq::dl_nr_running struct rt_rq::rt_nr_boosted struct rt_rq::rt_nr_migratory struct rt_rq::rt_nr_total struct rq::nr_uninterruptible 32-bit. If total number of tasks can't exceed 2**32 (and less due to futex pid limits), then per-runqueue counters can't as well. This patchset has been sponsored by REX Prefix Eradication Society. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-4-adobriyan@gmail.com
2021-05-12sched: Make nr_iowait_cpu() return 32-bit valueAlexey Dobriyan
Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-3-adobriyan@gmail.com
2021-05-12sched: Make nr_iowait() return 32-bit valueAlexey Dobriyan
Creating 2**32 tasks to wait in D-state is impossible and wasteful. Return "unsigned int" and save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-2-adobriyan@gmail.com
2021-05-12sched: Make nr_running() return 32-bit valueAlexey Dobriyan
Creating 2**32 tasks is impossible due to futex pid limits and wasteful anyway. Nobody has done it. Bring nr_running() into 32-bit world to save on REX prefixes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210422200228.1423391-1-adobriyan@gmail.com
2021-05-12sched: Fix leftover comment typosIngo Molnar
A few more snuck in. Also capitalize 'CPU' while at it. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-12sched/core: Initialize the idle task with preemption disabledValentin Schneider
As pointed out by commit de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread") init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu(). Given smp_init() already initializes the idle tasks for all *possible* CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> idle_init(). Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get(). Secondary startups were patched via coccinelle: @begone@ @@ -preempt_disable(); ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-12sched: prctl() core-scheduling interfaceChris Hyser
This patch provides support for setting and copying core scheduling 'task cookies' between threads (PID), processes (TGID), and process groups (PGID). The value of core scheduling isn't that tasks don't share a core, 'nosmt' can do that. The value lies in exploiting all the sharing opportunities that exist to recover possible lost performance and that requires a degree of flexibility in the API. From a security perspective (and there are others), the thread, process and process group distinction is an existent hierarchal categorization of tasks that reflects many of the security concerns about 'data sharing'. For example, protecting against cache-snooping by a thread that can just read the memory directly isn't all that useful. With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a mechanism to create and share cookies. CREATE/SHARE_TO specify a target pid with enum pidtype used to specify the scope of the targeted tasks. For example, PIDTYPE_TGID will share the cookie with the process and all of it's threads as typically desired in a security scenario. API: prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie) prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL) prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL) where 'tgtpid/srcpid == 0' implies the current process and pidtype is kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}. For return values, EINVAL, ENOMEM are what they say. ESRCH means the tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission access to tgtpid/srcpid. ENODEV indicates your machines lacks SMT. [peterz: complete rewrite] Signed-off-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123309.039845339@infradead.org
2021-05-12sched: Inherit task cookie on fork()Peter Zijlstra
Note that sched_core_fork() is called from under tasklist_lock, and not from sched_fork() earlier. This avoids a few races later. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.980003687@infradead.org
2021-05-12sched: Trivial core scheduling cookie managementPeter Zijlstra
In order to not have to use pid_struct, create a new, smaller, structure to manage task cookies for core scheduling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.919768100@infradead.org
2021-05-12sched: Migration changes for core schedulingAubrey Li
- Don't migrate if there is a cookie mismatch Load balance tries to move task from busiest CPU to the destination CPU. When core scheduling is enabled, if the task's cookie does not match with the destination CPU's core cookie, this task may be skipped by this CPU. This mitigates the forced idle time on the destination CPU. - Select cookie matched idle CPU In the fast path of task wakeup, select the first cookie matched idle CPU instead of the first idle CPU. - Find cookie matched idlest CPU In the slow path of task wakeup, find the idlest CPU whose core cookie matches with task's cookie Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.860083871@infradead.org
2021-05-12sched: Trivial forced-newidle balancerPeter Zijlstra
When a sibling is forced-idle to match the core-cookie; search for matching tasks to fill the core. rcu_read_unlock() can incur an infrequent deadlock in sched_core_balance(). Fix this by using the RCU-sched flavor instead. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.800048269@infradead.org
2021-05-12sched/fair: Snapshot the min_vruntime of CPUs on force idleJoel Fernandes (Google)
During force-idle, we end up doing cross-cpu comparison of vruntimes during pick_next_task. If we simply compare (vruntime-min_vruntime) across CPUs, and if the CPUs only have 1 task each, we will always end up comparing 0 with 0 and pick just one of the tasks all the time. This starves the task that was not picked. To fix this, take a snapshot of the min_vruntime when entering force idle and use it for comparison. This min_vruntime snapshot will only be used for cross-CPU vruntime comparison, and nothing else. A note about the min_vruntime snapshot and force idling: During selection: When we're not fi, we need to update snapshot. when we're fi and we were not fi, we must update snapshot. When we're fi and we were already fi, we must not update snapshot. Which gives: fib fi update 0 0 1 0 1 1 1 0 1 1 1 0 Where: fi: force-idled now fib: force-idled before So the min_vruntime snapshot needs to be updated when: !(fib && fi). Also, the cfs_prio_less() function needs to be aware of whether the core is in force idle or not, since it will be use this information to know whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass this information along via pick_task() -> prio_less(). Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.738542617@infradead.org
2021-05-12sched: Fix priority inversion of cookied task with siblingJoel Fernandes (Google)
The rationale is as follows. In the core-wide pick logic, even if need_sync == false, we need to go look at other CPUs (non-local CPUs) to see if they could be running RT. Say the RQs in a particular core look like this: Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task. rq0 rq1 CFS1 (tagged) RT1 (no tag) CFS2 (tagged) Say schedule() runs on rq0. Now, it will enter the above loop and pick_task(RT) will return NULL for 'p'. It will enter the above if() block and see that need_sync == false and will skip RT entirely. The end result of the selection will be (say prio(CFS1) > prio(CFS2)): rq0 rq1 CFS1 IDLE When it should have selected: rq0 rq1 IDLE RT Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.678425748@infradead.org
2021-05-12sched/fair: Fix forced idle sibling starvation corner caseVineeth Pillai
If there is only one long running local task and the sibling is forced idle, it might not get a chance to run until a schedule event happens on any cpu in the core. So we check for this condition during a tick to see if a sibling is starved and then give it a chance to schedule. Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.617407840@infradead.org
2021-05-12sched: Add core wide task selection and schedulingPeter Zijlstra
Instead of only selecting a local task, select a task for all SMT siblings for every reschedule on the core (irrespective which logical CPU does the reschedule). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.557559654@infradead.org
2021-05-12sched: Basic tracking of matching tasksPeter Zijlstra
Introduce task_struct::core_cookie as an opaque identifier for core scheduling. When enabled; core scheduling will only allow matching task to be on the core; where idle matches everything. When task_struct::core_cookie is set (and core scheduling is enabled) these tasks are indexed in a second RB-tree, first on cookie value then on scheduling function, such that matching task selection always finds the most elegible match. NOTE: *shudder* at the overhead... NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative is per class tracking of cookies and that just duplicates a lot of stuff for no raisin (the 2nd copy lives in the rt-mutex PI code). [Joel: folded fixes] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.496975854@infradead.org
2021-05-12sched: Introduce sched_class::pick_task()Peter Zijlstra
Because sched_class::pick_next_task() also implies sched_class::set_next_task() (and possibly put_prev_task() and newidle_balance) it is not state invariant. This makes it unsuitable for remote task selection. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [Vineeth: folded fixes] Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.437092775@infradead.org
2021-05-12sched: Allow sched_core_put() from atomic contextPeter Zijlstra
Stuff the meat of sched_core_put() into a work such that we can use sched_core_put() from atomic context. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.377455632@infradead.org
2021-05-12sched: Optimize rq_lockp() usagePeter Zijlstra
rq_lockp() includes a static_branch(), which is asm-goto, which is asm volatile which defeats regular CSE. This means that: if (!static_branch(&foo)) return simple; if (static_branch(&foo) && cond) return complex; Doesn't fold and we get horrible code. Introduce __rq_lockp() without the static_branch() on. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.316696988@infradead.org
2021-05-12sched: Core-wide rq->lockPeter Zijlstra
Introduce the basic infrastructure to have a core wide rq->lock. This relies on the rq->__lock order being in increasing CPU number (inside a core). It is also constrained to SMT8 per lockdep (and SMT256 per preempt_count). Luckily SMT8 is the max supported SMT count for Linux (Mips, Sparc and Power are known to have this). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/YJUNfzSgptjX7tG6@hirez.programming.kicks-ass.net
2021-05-12sched: Prepare for Core-wide rq->lockPeter Zijlstra
When switching on core-sched, CPUs need to agree which lock to use for their RQ. The new rule will be that rq->core_enabled will be toggled while holding all rq->__locks that belong to a core. This means we need to double check the rq->core_enabled value after each lock acquire and retry if it changed. This also has implications for those sites that take multiple RQ locks, they need to be careful that the second lock doesn't end up being the first lock. Verify the lock pointer after acquiring the first lock, because if they're on the same core, holding any of the rq->__lock instances will pin the core state. While there, change the rq->__lock order to CPU number, instead of rq address, this greatly simplifies the next patch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/YJUNY0dmrJMD/BIm@hirez.programming.kicks-ass.net
2021-05-12sched: Wrap rq::lock accessPeter Zijlstra
In preparation of playing games with rq->lock, abstract the thing using an accessor. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
2021-05-12sched: Provide raw_spin_rq_*lock*() helpersPeter Zijlstra
In prepration for playing games with rq->lock, add some rq_lock wrappers. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.075967879@infradead.org
2021-05-12sched/fair: Add a few assertionsPeter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.015639083@infradead.org
2021-05-12sched: Simplify sched_info_on()Peter Zijlstra
The situation around sched_info is somewhat complicated, it is used by sched_stats and delayacct and, indirectly, kvm. If SCHEDSTATS=Y (but disabled by default) sched_info_on() is unconditionally true -- this is the case for all distro kernel configs I checked. If for some reason SCHEDSTATS=N, but TASK_DELAY_ACCT=Y, then sched_info_on() can return false when delayacct is disabled, presumably because there would be no other users left; except kvm is. Instead of complicating matters further by accurately accounting sched_stat and kvm state, simply unconditionally enable when SCHED_INFO=Y, matching the common distro case. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210505111525.121458839@infradead.org
2021-05-12sched: Rename sched_info_{queued,dequeued}Peter Zijlstra
For consistency, rename {queued,dequeued} to {enqueue,dequeue}. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Link: https://lkml.kernel.org/r/20210505111525.061402904@infradead.org
2021-05-12sched/fair: Fix negative energy delta in find_energy_efficient_cpu()Pierre Gondois
find_energy_efficient_cpu() (feec()) searches the best energy CPU to place a task on. To do so, compute_energy() estimates the energy impact of placing the task on a CPU, based on CPU and task utilization signals. Utilization signals can be concurrently updated while evaluating a performance domain (pd). In some cases, this leads to having a 'negative delta', i.e. placing the task in the pd is seen as an energy gain. Thus, any further energy comparison is biased. In case of a 'negative delta', return prev_cpu since: 1. a 'negative delta' happens in less than 0.5% of feec() calls, on a Juno with 6 CPUs (4 little, 2 big) 2. it is unlikely to have two consecutive 'negative delta' for a task, so if the first call fails, feec() will correctly place the task in the next feec() call 3. EAS current behavior tends to select prev_cpu if the task doesn't raise the OPP of its current pd. prev_cpu is EAS's generic decision 4. prev_cpu should be preferred to returning an error code. In the latter case, select_idle_sibling() would do the placement, selecting a big (and not energy efficient) CPU. As 3., the task would potentially reside on the big CPU for a long time Reported-by: Xuewen Yan <xuewen.yan@unisoc.com> Suggested-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Link: https://lkml.kernel.org/r/20210504090743.9688-3-Pierre.Gondois@arm.com