summaryrefslogtreecommitdiff
path: root/kernel/sched/core.c
AgeCommit message (Collapse)Author
2016-08-10sched/cputime: Mitigate performance regression in times()/clock_gettime()Giovanni Gherdovich
Commit: 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") fixed a problem whereby clock_nanosleep() followed by clock_gettime() could allow a task to wake early. It addressed the problem by calling the scheduling classes update_curr() when the cputimer starts. Said change induced a considerable performance regression on the syscalls times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some debuggers and applications that monitor their own performance that accidentally depend on the performance of these specific calls. This patch mitigates the performace loss by prefetching data in the CPU cache, as stalls due to cache misses appear to be where most time is spent in our benchmarks. Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated with a variable number of threads, from 2 to 4*num_cpus; the results are in seconds and correspond to the average of 10 runs; the percentage gain is computed with (before-after)/before so a positive value is an improvement (it's faster). The improvement varies between a few percents for 5-20 threads and more than 10% for 2 or >20 threads. pound_clock_gettime: threads 4.7-rc7 patched 4.7-rc7 [num] [secs] [secs (percent)] 2 3.48 3.06 ( 11.83%) 5 3.33 3.25 ( 2.40%) 8 3.37 3.26 ( 3.30%) 12 3.32 3.37 ( -1.60%) 21 4.01 3.90 ( 2.74%) 30 3.63 3.36 ( 7.41%) 48 3.71 3.11 ( 16.27%) 79 3.75 3.16 ( 15.74%) 110 3.81 3.25 ( 14.80%) 128 3.88 3.31 ( 14.76%) pound_times: threads 4.7-rc7 patched 4.7-rc7 [num] [secs] [secs (percent)] 2 3.65 3.25 ( 11.03%) 5 3.45 3.17 ( 7.92%) 8 3.52 3.22 ( 8.69%) 12 3.29 3.36 ( -2.04%) 21 4.07 3.92 ( 3.78%) 30 3.87 3.40 ( 12.17%) 48 3.79 3.16 ( 16.61%) 79 3.88 3.28 ( 15.42%) 110 3.90 3.38 ( 13.35%) 128 4.00 3.38 ( 15.45%) pound_clock_gettime and pound_clock_gettime are two benchmarks included in the MMTests framework. They launch a given number of threads which repeatedly call times() or clock_gettimes(). The results above can be reproduced with cloning MMTests from github.com and running the "poundtime" workload: $ git clone https://github.com/gormanm/mmtests.git $ cd mmtests $ cp configs/config-global-dhp__workload_poundtime config $ ./run-mmtests.sh --run-monitor $(uname -r) The above will run "poundtime" measuring the kernel currently running on the machine; Once a new kernel is installed and the machine rebooted, running again $ cd mmtests $ ./run-mmtests.sh --run-monitor $(uname -r) will produce results to compare with. A comparison table will be output with: $ cd mmtests/work/log $ ../../compare-kernels.sh the table will contain a lot of entries; grepping for "Amean" (as in "arithmetic mean") will give the tables presented above. The source code for the two benchmarks is reported at the end of this changelog for clairity. The cache misses addressed by this patch were found using a combination of `perf top`, `perf record` and `perf annotate`. The incriminated lines were found to be struct sched_entity *curr = cfs_rq->curr; and delta_exec = now - curr->exec_start; in the function update_curr() from kernel/sched/fair.c. This patch prefetches the data from memory just before update_curr is called in the interested execution path. A comparison of the total number of cycles before and after the patch follows; the data is obtained using `perf stat -r 10 -ddd <program>` running over the same sequence of number of threads used above (a positive gain is an improvement): threads cycles before cycles after gain 2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88% 5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85% 8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74% 12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74% 21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89% 30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92% 48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10% 79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33% 110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21% 128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99% A comparison of cache miss vs total cache loads ratios, before and after the patch (again from the `perf stat -r 10 -ddd <program>` tables): threads L1 misses/total*100 L1 misses/total*100 gain before after 2 7.43 +-4.90% 7.36 +-4.70% 0.94% 5 13.09 +-4.74% 13.52 +-3.73% -3.28% 8 13.79 +-5.61% 12.90 +-3.27% 6.45% 12 11.57 +-2.44% 8.71 +-1.40% 24.72% 21 12.39 +-3.92% 9.97 +-1.84% 19.53% 30 13.91 +-2.53% 11.73 +-2.28% 15.67% 48 13.71 +-1.59% 12.32 +-1.97% 10.14% 79 14.44 +-0.66% 13.40 +-1.06% 7.20% 110 15.86 +-0.50% 14.46 +-0.59% 8.83% 128 16.51 +-0.32% 15.06 +-0.78% 8.78% As a final note, the following shows the evolution of performance figures in the "poundtime" benchmark and pinpoints commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a major source of degradation, mostly unaddressed to this day (figures expressed in seconds). pound_clock_gettime: threads parent of 6e998916dfe3 4.7-rc7 6e998916dfe3 itself 2 2.23 3.68 ( -64.56%) 3.48 (-55.48%) 5 2.83 3.78 ( -33.42%) 3.33 (-17.43%) 8 2.84 4.31 ( -52.12%) 3.37 (-18.76%) 12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%) 21 3.14 4.63 ( -47.36%) 4.01 (-27.71%) 30 3.28 5.75 ( -75.37%) 3.63 (-10.80%) 48 3.02 6.05 (-100.56%) 3.71 (-22.99%) 79 2.88 6.30 (-118.90%) 3.75 (-30.26%) 110 2.95 6.46 (-119.00%) 3.81 (-29.24%) 128 3.05 6.42 (-110.08%) 3.88 (-27.04%) pound_times: threads parent of 6e998916dfe3 4.7-rc7 6e998916dfe3 itself 2 2.27 3.73 ( -64.71%) 3.65 (-61.14%) 5 2.78 3.77 ( -35.56%) 3.45 (-23.98%) 8 2.79 4.41 ( -57.71%) 3.52 (-26.05%) 12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%) 21 3.10 4.61 ( -48.74%) 4.07 (-31.34%) 30 3.33 5.75 ( -72.53%) 3.87 (-16.01%) 48 2.96 6.06 (-105.04%) 3.79 (-28.10%) 79 2.88 6.24 (-116.83%) 3.88 (-34.81%) 110 2.98 6.37 (-114.08%) 3.90 (-31.12%) 128 3.10 6.35 (-104.61%) 4.00 (-28.87%) The source code of the two benchmarks follows. To compile the two: NR_THREADS=42 for FILE in pound_times pound_clock_gettime; do gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE done ==== BEGIN pound_times.c ==== struct tms start; void *pound (void *threadid) { struct tms end; int oldutime = 0; int utime; int i; for (i = 0; i < 5000000 / NUM_THREADS; i++) { times(&end); utime = ((int)end.tms_utime - (int)start.tms_utime); if (oldutime > utime) { printf("utime decreased, was %d, now %d!\n", oldutime, utime); } oldutime = utime; } pthread_exit(NULL); } int main() { pthread_t th[NUM_THREADS]; long i; times(&start); for (i = 0; i < NUM_THREADS; i++) { pthread_create (&th[i], NULL, pound, (void *)i); } pthread_exit(NULL); return 0; } ==== END pound_times.c ==== ==== BEGIN pound_clock_gettime.c ==== void *pound (void *threadid) { struct timespec ts; int rc, i; unsigned long prev = 0, this = 0; for (i = 0; i < 5000000 / NUM_THREADS; i++) { rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); if (rc < 0) perror("clock_gettime"); this = (ts.tv_sec * 1000000000) + ts.tv_nsec; if (0 && this < prev) printf("%lu ns timewarp at iteration %d\n", prev - this, i); prev = this; } pthread_exit(NULL); } int main() { pthread_t th[NUM_THREADS]; long rc, i; pid_t pgid; for (i = 0; i < NUM_THREADS; i++) { rc = pthread_create(&th[i], NULL, pound, (void *)i); if (rc < 0) perror("pthread_create"); } pthread_exit(NULL); return 0; } ==== END pound_clock_gettime.c ==== Suggested-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-25Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - introduce and use task_rcu_dereference()/try_get_task_struct() to fix and generalize task_struct handling (Oleg Nesterov) - do various per entity load tracking (PELT) fixes and optimizations (Peter Zijlstra) - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li) - introduce consolidated cputime output file cpuacct.usage_all and related refactorings (Zhao Lei) - ... plus misc fixes and enhancements * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show() sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums sched/fair: Rework throttle_count sync sched/core: Fix sched_getaffinity() return value kerneldoc comment sched/fair: Reorder cgroup creation code sched/fair: Apply more PELT fixes sched/fair: Fix PELT integrity for new tasks sched/cgroup: Fix cpu_cgroup_fork() handling sched/fair: Fix PELT integrity for new groups sched/fair: Fix and optimize the fork() path sched/cputime: Add steal time support to full dynticks CPU time accounting sched/cputime: Fix prev steal time accouting during CPU hotplug KVM: Fix steal clock warp during guest CPU hotplug sched/debug: Always show 'nr_migrations' sched/fair: Use task_rcu_dereference() sched/api: Introduce task_rcu_dereference() and try_get_task_struct() sched/idle: Optimize the generic idle loop sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()
2016-07-25Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The locking tree was busier in this cycle than the usual pattern - a couple of major projects happened to coincide. The main changes are: - implement the atomic_fetch_{add,sub,and,or,xor}() API natively across all SMP architectures (Peter Zijlstra) - add atomic_fetch_{inc/dec}() as well, using the generic primitives (Davidlohr Bueso) - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso, Waiman Long) - optimize smp_cond_load_acquire() on arm64 and implement LSE based atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}() on arm64 (Will Deacon) - introduce smp_acquire__after_ctrl_dep() and fix various barrier mis-uses and bugs (Peter Zijlstra) - after discovering ancient spin_unlock_wait() barrier bugs in its implementation and usage, strengthen its semantics and update/fix usage sites (Peter Zijlstra) - optimize mutex_trylock() fastpath (Peter Zijlstra) - ... misc fixes and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits) locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire() locking/static_keys: Fix non static symbol Sparse warning locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec() locking/atomic, arch/tile: Fix tilepro build locking/atomic, arch/m68k: Remove comment locking/atomic, arch/arc: Fix build locking/Documentation: Clarify limited control-dependency scope locking/atomic, arch/rwsem: Employ atomic_long_fetch_add() locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire() locking/atomic, arch/mips: Convert to _relaxed atomics locking/atomic, arch/alpha: Convert to _relaxed atomics locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions locking/atomic: Remove linux/atomic.h:atomic_fetch_or() locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}() locking/atomic: Fix atomic64_relaxed() bits locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}() locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}() ...
2016-07-13sched/core: Correct off by one bug in load migration calculationThomas Gleixner
The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into account that the function is now called from a thread running on the outgoing CPU. As a result a cpu unplug leakes a load of 1 into the global load accounting mechanism. Fix it by adjusting for the currently running thread which calls calc_load_migrate(). Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: rt@linutronix.de Cc: shreyas@linux.vnet.ibm.com Fixes: e9cd8fa4fcfd: ("sched/migration: Move calc_load_migrate() into CPU_DYING") Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607121744350.4083@nanos Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is setDaniel Bristot de Oliveira
Currently, a schedule while atomic error prints the stack trace to the kernel log and the system continue running. Although it is possible to collect the kernel log messages and analyze it, often more information are needed. Furthermore, keep the system running is not always the best choice. For example, when the preempt count underflows the system will not stop to complain about scheduling while atomic, so the kernel log can wrap around overwriting the first stack trace, tuning the analysis even more challenging. This patch uses the kernel.panic_on_warn sysctl to help out on these more complex situations. When kernel.panic_on_warn is set to 1, the kernel will panic() in the schedule while atomic detection. The default value of the sysctl is 0, maintaining the current behavior. Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e8f7b80f353aa22c63bd8557208163989af8493d.1464983675.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/core: Fix sched_getaffinity() return value kerneldoc commentZev Weiss
Previous version was probably written referencing the man page for glibc's wrapper, but the wrapper's behavior differs from that of the syscall itself in this case. Signed-off-by: Zev Weiss <zev@bewilderbeest.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1466975603-25408-1-git-send-email-zev@bewilderbeest.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Reorder cgroup creation codePeter Zijlstra
A future patch needs rq->lock held _after_ we link the task_group into the hierarchy. In order to avoid taking every rq->lock twice, reorder things a little and create online_fair_sched_group() to be called after we link the task_group. All this code is still ran from css_alloc() so css_online() isn't in fact used for this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix PELT integrity for new tasksPeter Zijlstra
Vincent and Yuyang found another few scenarios in which entity tracking goes wobbly. The scenarios are basically due to the fact that new tasks are not immediately attached and thereby differ from the normal situation -- a task is always attached to a cfs_rq load average (such that it includes its blocked contribution) and are explicitly detached/attached on migration to another cfs_rq. Scenario 1: switch to fair class p->sched_class = fair_class; if (queued) enqueue_task(p); ... enqueue_entity() enqueue_entity_load_avg() migrated = !sa->last_update_time (true) if (migrated) attach_entity_load_avg() check_class_changed() switched_from() (!fair) switched_to() (fair) switched_to_fair() attach_entity_load_avg() If @p is a new task that hasn't been fair before, it will have !last_update_time and, per the above, end up in attach_entity_load_avg() _twice_. Scenario 2: change between cgroups sched_move_group(p) if (queued) dequeue_task() task_move_group_fair() detach_task_cfs_rq() detach_entity_load_avg() set_task_rq() attach_task_cfs_rq() attach_entity_load_avg() if (queued) enqueue_task(); ... enqueue_entity() enqueue_entity_load_avg() migrated = !sa->last_update_time (true) if (migrated) attach_entity_load_avg() Similar as with scenario 1, if @p is a new task, it will have !load_update_time and we'll end up in attach_entity_load_avg() _twice_. Furthermore, notice how we do a detach_entity_load_avg() on something that wasn't attached to begin with. As stated above; the problem is that the new task isn't yet attached to the load tracking and thereby violates the invariant assumption. This patch remedies this by ensuring a new task is indeed properly attached to the load tracking on creation, through post_init_entity_util_avg(). Of course, this isn't entirely as straightforward as one might think, since the task is hashed before we call wake_up_new_task() and thus can be poked at. We avoid this by adding TASK_NEW and teaching cpu_cgroup_can_attach() to refuse such tasks. Reported-by: Yuyang Du <yuyang.du@intel.com> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/cgroup: Fix cpu_cgroup_fork() handlingVincent Guittot
A new fair task is detached and attached from/to task_group with: cgroup_post_fork() ss->fork(child) := cpu_cgroup_fork() sched_move_task() task_move_group_fair() Which is wrong, because at this point in fork() the task isn't fully initialized and it cannot 'move' to another group, because its not attached to any group as yet. In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we can just call this small part directly instead sched_move_task(). And the task doesn't really migrate because it is not yet attached so we need the following sequence: do_fork() sched_fork() __set_task_cpu() cgroup_post_fork() set_task_rq() # set task group and runqueue wake_up_new_task() select_task_rq() can select a new cpu __set_task_cpu post_init_entity_util_avg attach_task_cfs_rq() activate_task enqueue_task This patch makes that happen. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [ Added TASK_SET_GROUP to set depth properly. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27sched/fair: Fix and optimize the fork() pathPeter Zijlstra
The task_fork_fair() callback already calls __set_task_cpu() and takes rq->lock. If we move the sched_class::task_fork callback in sched_fork() under the existing p->pi_lock, right after its set_task_cpu() call, we can avoid doing two such calls and omit the IRQ disabling on the rq->lock. Change to __set_task_cpu() to skip the migration bits, this is a new task, not a migration. Similarly, make wake_up_new_task() use __set_task_cpu() for the same reason, the task hasn't actually migrated as it hasn't ever ran. This cures the problem of calling migrate_task_rq_fair(), which does remove_entity_from_load_avg() on tasks that have never been added to the load avg to begin with. This bug would result in transiently messed up load_avg values, averaged out after a few dozen milliseconds. This is probably the reason why this bug was not found for such a long time. Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-27Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-24sched/core: Allow kthreads to fall back to online && !active cpusTejun Heo
During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is online but not active. A CPU_ONLINE callback may create or bind a kthread so that its cpus_allowed mask only allows the CPU which is being brought online. The kthread may start executing before the CPU is made active and can end up in select_fallback_rq(). In such cases, the expected behavior is selecting the CPU which is coming online; however, because select_fallback_rq() only chooses from active CPUs, it determines that the task doesn't have any viable CPU in its allowed mask and ends up overriding it to cpu_possible_mask. CPU_ONLINE callbacks should be able to put kthreads on the CPU which is coming online. Update select_fallback_rq() so that it follows cpu_online() rather than cpu_active() for kthreads. Reported-by: Gautham R Shenoy <ego@linux.vnet.ibm.com> Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while ↵Andrey Ryabinin
processing sysrq-w Lengthy output of sysrq-w may take a lot of time on slow serial console. Currently we reset NMI-watchdog on the current CPU to avoid spurious lockup messages. Sometimes this doesn't work since softlockup watchdog might trigger on another CPU which is waiting for an IPI to proceed. We reset softlockup watchdogs on all CPUs, but we do this only after listing all tasks, and this may be too late on a busy system. So, reset watchdogs CPUs earlier, in for_each_process_thread() loop. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14locking/barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()Peter Zijlstra
This new form allows using hardware assisted waiting. Some hardware (ARM64 and x86) allow monitoring an address for changes, so by providing a pointer we can use this to replace the cpu_relax() with hardware optimized methods in the future. Requested-by: Will Deacon <will.deacon@arm.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14sched/cputime: Fix prev steal time accouting during CPU hotplugWanpeng Li
Commit: e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug") ... set rq->prev_* to 0 after a CPU hotplug comes back, in order to fix the case where (after CPU hotplug) steal time is smaller than rq->prev_steal_time. However, this should never happen. Steal time was only smaller because of the KVM-specific bug fixed by the previous patch. Worse, the previous patch triggers a bug on CPU hot-unplug/plug operation: because rq->prev_steal_time is cleared, all of the CPU's past steal time will be accounted again on hot-plug. Since the root cause has been fixed, we can just revert commit e9532e69b8d1. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")' Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14sched/fair: Fix post_init_entity_util_avg() serializationPeter Zijlstra
Chris Wilson reported a divide by 0 at: post_init_entity_util_avg(): > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { Which given the lack of serialization, and the code generated from update_cfs_rq_load_avg() is entirely possible: if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sa->load_avg = max_t(long, sa->load_avg - r, 0); sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); removed_load = 1; } turns into: ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax ffffffff8108706b: 48 85 c0 test %rax,%rax ffffffff8108706e: 74 40 je ffffffff810870b0 ffffffff81087070: 4c 89 f8 mov %r15,%rax ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) ffffffff8108707e: 4c 89 f9 mov %r15,%rcx ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx Which you'll note ends up with 'sa->load_avg - r' in memory at ffffffff8108707a. By calling post_init_entity_util_avg() under rq->lock we're sure to be fully serialized against PELT updates and cannot observe intermediate state like this. Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Cc: bsegall@google.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value") Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-10Merge branch 'stacking-fixes' (vfs stacking fixes from Jann)Linus Torvalds
Merge filesystem stacking fixes from Jann Horn. * emailed patches from Jann Horn <jannh@google.com>: sched: panic on corrupted stack end ecryptfs: forbid opening files without mmap handler proc: prevent stacking filesystems on top
2016-06-10sched: panic on corrupted stack endJann Horn
Until now, hitting this BUG_ON caused a recursive oops (because oops handling involves do_exit(), which calls into the scheduler, which in turn raises an oops), which caused stuff below the stack to be overwritten until a panic happened (e.g. via an oops in interrupt context, caused by the overwritten CPU index in the thread_info). Just panic directly. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-08sched/debug: Fix 'schedstats=enable' cmdline optionJosh Poimboeuf
The 'schedstats=enable' option doesn't work, and also produces the following warning during boot: WARNING: CPU: 0 PID: 0 at /home/jpoimboe/git/linux/kernel/jump_label.c:61 static_key_slow_inc+0x8c/0xa0 static_key_slow_inc used before call to jump_label_init Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc1+ #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 0000000000000086 3ae3475a4bea95d4 ffffffff81e03da8 ffffffff8143fc83 ffffffff81e03df8 0000000000000000 ffffffff81e03de8 ffffffff810b1ffb 0000003d00000096 ffffffff823514d0 ffff88007ff197c8 0000000000000000 Call Trace: [<ffffffff8143fc83>] dump_stack+0x85/0xc2 [<ffffffff810b1ffb>] __warn+0xcb/0xf0 [<ffffffff810b207f>] warn_slowpath_fmt+0x5f/0x80 [<ffffffff811e9c0c>] static_key_slow_inc+0x8c/0xa0 [<ffffffff810e07c6>] static_key_enable+0x16/0x40 [<ffffffff8216d633>] setup_schedstats+0x29/0x94 [<ffffffff82148a05>] unknown_bootoption+0x89/0x191 [<ffffffff810d8617>] parse_args+0x297/0x4b0 [<ffffffff82148d61>] start_kernel+0x1d8/0x4a9 [<ffffffff8214897c>] ? set_init_arg+0x55/0x55 [<ffffffff82148120>] ? early_idt_handler_array+0x120/0x120 [<ffffffff821482db>] x86_64_start_reservations+0x2f/0x31 [<ffffffff82148427>] x86_64_start_kernel+0x14a/0x16d The problem is that it tries to update the 'sched_schedstats' static key before jump labels have been initialized. Changing jump_label_init() to be called earlier before parse_early_param() wouldn't fix it: it would still fail trying to poke_text() because mm isn't yet initialized. Instead, just create a temporary '__sched_schedstats' variable which can be copied to the static key later during sched_init() after jump labels have been initialized. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default") Link: http://lkml.kernel.org/r/453775fe3433bed65731a583e228ccea806d18cd.1465322027.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-25sched/core: Fix remote wakeupsPeter Zijlstra
Commit: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") ... introduced a bug: Mike Galbraith found that it introduced a performance regression, while Paul E. McKenney reported lost wakeups and bisected it to this commit. The reason is that I mis-read ttwu_queue() such that I assumed any wakeup that got a remote queue must have had the task migrated. Since this is not so; we need to transfer this information between queueing the wakeup and actually doing the wakeup. Use a new task_struct::sched_flag for this, we already write to sched_contributes_to_load in the wakeup path so this is a hot and modified cacheline. Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Turner <pjt@google.com> Cc: Pavan Kondeti <pkondeti@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: byungchul.park@lge.com Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12sched/core: Provide a tsk_nr_cpus_allowed() helperThomas Gleixner
tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows us to change the representation of ->nr_cpus_allowed if required. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1462969411-17735-2-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12sched/nohz: Fix affine unpinned timers messWanpeng Li
The following commit: 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")' intended to affine unpinned timers to housekeepers: unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(houserkeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself) However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the intention to: unpinned timers(full dynaticks, idle) => any housekeepers(no mattter cpu topology) unpinned timers(full dynaticks, busy) => any housekeepers(no mattter cpu topology) unpinned timers(housekeepers, idle) => any busy cpus(otherwise, fallback to any housekeepers) This patch fixes it by checking if there are busy housekeepers nearby, otherwise falls to any housekeepers/itself. After the patch: unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(housekeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself) Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> [ Fixed the changelog. ] Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: 'commit 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")' Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12sched/core: Kill sched_class::task_waking to clean up the migration logicPeter Zijlstra
With sched_class::task_waking being called only when we do set_task_cpu(), we can make sched_class::migrate_task_rq() do the work and eliminate sched_class::task_waking entirely. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Pavan Kondeti <pkondeti@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: byungchul.park@lge.com Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12sched/fair: Prepare to fix fairness problems on migrationPeter Zijlstra
Mike reported that our recent attempt to fix migration problems: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration") broke interactivity and the signal starve test. We reverted that commit and now let's try it again more carefully, with some other underlying problems fixed first. One problem is that I assumed ENQUEUE_WAKING was only set when we do a cross-cpu wakeup (migration), which isn't true. This means we now destroy the vruntime history of tasks and wakeup-preemption suffers. Cure this by making my assumption true, only call sched_class::task_waking() when we do a cross-cpu wakeup. This avoids the indirect call in the case we do a local wakeup. Reported-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Pavan Kondeti <pkondeti@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: byungchul.park@lge.com Cc: linux-kernel@vger.kernel.org Fixes: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12Merge branch 'smp/hotplug' into sched/core, to resolve conflictsIngo Molnar
Conflicts: kernel/sched/core.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-09sched/core: Fix comment typo in wake_q_add()Davidlohr Bueso
... the comment clearly refers to wake_up_q(), and not wake_up_list(). Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1462766290-28664-1-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-09sched/core: Remove unused variableMuhammad Falak R Wani
Remove unused variable 'ret', and directly return 0. Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1462441879-10092-1-git-send-email-falakreyaz@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-06sched: Make hrtick_notifier an explicit callThomas Gleixner
No need for an extra notifier. We don't need to handle all these states. It's sufficient to kill the timer when the cpu dies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.770528462@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/fair: Make ilb_notifier an explicit callThomas Gleixner
No need for an extra notifier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.693720241@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()Thomas Gleixner
Remove the hotplug notifier and make it an explicit state. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.502222097@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/migration: Move CPU_ONLINE into scheduler stateThomas Gleixner
The alleged requirement that the migration notifier has a lower priority than perf is completely undocumented and there is no indication at all that this is true. perf does not even handle the CPU_ONLINE notification and perf really has nothing to do with migration. Move the CPU_ONLINE code into the sched_activate_cpu() state callback. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.421743581@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/migration: Move calc_load_migrate() into CPU_DYINGThomas Gleixner
It really does not matter when we fold the load for the outgoing cpu. It's almost dead anyway, so there is no harm if we fail to fold the few microseconds which are required for going fully away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.328739226@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/migration: Move prepare transition to SCHED_STARTING stateThomas Gleixner
We can piggy pack that on the SCHED_STARTING state. It's not required before the cpu actually comes online. Name the function proper as it has nothing to do with migration. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.248226511@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/hotplug: Move sync_rcu to be with set_cpu_active(false)Peter Zijlstra
The sync_rcu stuff is specificically for clearing bits in the active mask, such that everybody will observe the bit cleared and will not consider the cleared CPU for load-balancing etc. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.169219710@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched/hotplug: Convert cpu_[in]active notifiers to state machineThomas Gleixner
Now that we reduced everything into single notifiers, it's simple to move them into the hotplug state machine space. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Move sched_domains_numa_masks_clear() to DOWN_PREPAREThomas Gleixner
This is the last operation on the cpu before vanishing. No point in calling that on CPU_DEAD. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Consolidate the notifier mazeThomas Gleixner
We can maintain the ordering of the scheduler cpu hotplug functionality nicely in one notifer. Get rid of the maze. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Allow hotplug notifiers to be setup earlyThomas Gleixner
Prevent the SMP scheduler related notifiers to be executed before the smp scheduler is initialized and install them early. This is a preparatory change for further consolidation of the hotplug notifier maze. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Make set_cpu_rq_start_time() a built in hotplug stateThomas Gleixner
Start distangling the maze of hotplug notifiers in the scheduler. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-06sched: Allow per-cpu kernel threads to run on online && !activePeter Zijlstra (Intel)
In order to enable symmetric hotplug, we must mirror the online && !active state of cpu-down on the cpu-up side. However, to retain sanity, limit this state to per-cpu kthreads. Aside from the change to set_cpus_allowed_ptr(), which allow moving the per-cpu kthreads on, the other critical piece is the cpu selection for pinned tasks in select_task_rq(). This avoids dropping into select_fallback_rq(). select_fallback_rq() cannot be allowed to select !active cpus because its used to migrate user tasks away. And we do not want to move user tasks onto cpus that are in transition. Requested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160301152303.GV6356@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-05locking/lockdep, sched/core: Implement a better lock pinning schemePeter Zijlstra
The problem with the existing lock pinning is that each pin is of value 1; this mean you can simply unpin if you know its pinned, without having any extra information. This scheme generates a random (16 bit) cookie for each pin and requires this same cookie to unpin. This means you have to keep the cookie in context. No objsize difference for !LOCKDEP kernels. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/core: Introduce 'struct rq_flags'Peter Zijlstra
In order to be able to pass around more than just the IRQ flags in the future, add a rq_flags structure. No difference in code generation for the x86_64-defconfig build I tested. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05sched/core: Move task_rq_lock() out of linePeter Zijlstra
Its a rather large function, inline doesn't seems to make much sense: $ size defconfig-build/kernel/sched/core.o{.orig,} text data bss dec hex filename 56533 21037 2320 79890 13812 defconfig-build/kernel/sched/core.o.orig 55733 21037 2320 79090 134f2 defconfig-build/kernel/sched/core.o The 'perf bench sched messaging' micro-benchmark shows a visible improvement of 4-5%: $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i ; done $ perf stat --null --repeat 25 -- perf bench sched messaging -g 40 -l 5000 pre: 4.582798193 seconds time elapsed ( +- 1.41% ) 4.733374877 seconds time elapsed ( +- 2.10% ) 4.560955136 seconds time elapsed ( +- 1.43% ) 4.631062303 seconds time elapsed ( +- 1.40% ) post: 4.364765213 seconds time elapsed ( +- 0.91% ) 4.454442734 seconds time elapsed ( +- 1.18% ) 4.448893817 seconds time elapsed ( +- 1.41% ) 4.424346872 seconds time elapsed ( +- 0.97% ) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar
applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28sched/core: Add switch_mm_irqs_off() and use it in the schedulerAndy Lutomirski
By default, this is the same thing as switch_mm(). x86 will override it as an optimization. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()Peter Zijlstra
Chris Metcalf reported a that sched_can_stop_tick() sometimes fails to re-enable the tick. His observed problem is that rq->cfs.nr_running can be 1 even though there are multiple runnable CFS tasks. This happens in the cgroup case, in which case cfs.nr_running is the number of runnable entities for that level. If there is a single runnable cgroup (which can have an arbitrary number of runnable child entries itself) rq->cfs.nr_running will be 1. However, looking at that function I think there's more problems with it. It seems to assume that if there's FIFO tasks, those will run. This is incorrect. The FIFO task can have a lower prio than an RR task, in which case the RR task will run. So the whole fifo_nr_running test seems misplaced, it should go after the rr_nr_running tests. That is, only if !rr_nr_running, can we use fifo_nr_running like this. Reported-by: Chris Metcalf <cmetcalf@mellanox.com> Tested-by: Chris Metcalf <cmetcalf@mellanox.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <kernellwp@gmail.com> Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Link: http://lkml.kernel.org/r/20160421160315.GK24771@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/deadline: Fix a bug in dl_overflow()Xunlei Pang
I got a minus(very big) dl_b->total_bw during my deadline tests. # grep dl /proc/sched_debug dl_rq[0]: .dl_nr_running : 0 .dl_bw->bw : 996147 .dl_bw->total_bw : -222297900 Something unusual must have happened. After some digging, I finally noticed that when changing a deadline task to normal(cfs), and changing it back to deadline immediately, after it died, we will got the wrong dl_bw->total_bw. The root cause is in dl_overflow(), it has: if (new_bw == p->dl.dl_bw) return 0; 1) When a deadline task is changed to !deadline task, it will start dl timer in switched_from_dl(), and retain previous deadline parameter till the timer expires. 2) If we change it back to deadline with the same bandwidth parameter before the timer expires, as it keeps the old bandwidth although it is not a deadline task. dl_overflow() simply returns success without updating the right data, and got the wrong dl_bw->total_bw. The solution is simple, if @p is not deadline, don't return. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460636368-1993-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updatesFrederic Weisbecker
Some code in CPU load update only concern NO_HZ configs but it is built on all configurations. When NO_HZ isn't built, that code is harmless but just happens to take some useless ressources in CPU and memory: 1) one useless field in struct rq 2) jiffies record on every tick that is never used (cpu_load_update_periodic) 3) decay_load_missed is called two times on every tick to eventually return immediately with no action taken. And that function is dead code. For pure optimization purposes, lets conditionally build the NO_HZ related code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1461080211-16271-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23sched/fair: Gather CPU load functions under a more conventional namespaceFrederic Weisbecker
The CPU load update related functions have a weak naming convention currently, starting with update_cpu_load_*() which isn't ideal as "update" is a very generic concept. Since two of these functions are public already (and a third is to come) that's enough to introduce a more conventional naming scheme. So let's do the following rename instead: update_cpu_load_*() -> cpu_load_update_*() Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13sched/debug: Don't dump sched debug info in SysRq-WRabin Vincent
sysrq_sched_debug_show() can dump a lot of information. Don't print out all that if we're just trying to get a list of blocked tasks (SysRq-W). The information is still accessible with SysRq-T. Signed-off-by: Rabin Vincent <rabinv@axis.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1459777322-30902-1-git-send-email-rabin.vincent@axis.com Signed-off-by: Ingo Molnar <mingo@kernel.org>