summaryrefslogtreecommitdiff
path: root/kernel/sched/core.c
AgeCommit message (Collapse)Author
2015-11-05Merge branch 'for-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "The cgroup core saw several significant updates this cycle: - percpu_rwsem for threadgroup locking is reinstated. This was temporarily dropped due to down_write latency issues. Oleg's rework of percpu_rwsem which is scheduled to be merged in this merge window resolves the issue. - On the v2 hierarchy, when controllers are enabled and disabled, all operations are atomic and can fail and revert cleanly. This allows ->can_attach() failure which is necessary for cpu RT slices. - Tasks now stay associated with the original cgroups after exit until released. This allows tracking resources held by zombies (e.g. pids) and makes it easy to find out where zombies came from on the v2 hierarchy. The pids controller was broken before these changes as zombies escaped the limits; unfortunately, updating this behavior required too many invasive changes and I don't think it's a good idea to backport them, so the pids controller on 4.3, the first version which included the pids controller, will stay broken at least until I'm sure about the cgroup core changes. - Optimization of a couple common tests using static_key" * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits) cgroup: fix race condition around termination check in css_task_iter_next() blkcg: don't create "io.stat" on the root cgroup cgroup: drop cgroup__DEVEL__legacy_files_on_dfl cgroup: replace error handling in cgroup_init() with WARN_ON()s cgroup: add cgroup_subsys->free() method and use it to fix pids controller cgroup: keep zombies associated with their original cgroups cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock cgroup: don't hold css_set_rwsem across css task iteration cgroup: reorganize css_task_iter functions cgroup: factor out css_set_move_task() cgroup: keep css_set and task lists in chronological order cgroup: make cgroup_destroy_locked() test cgroup_is_populated() cgroup: make css_sets pin the associated cgroups cgroup: relocate cgroup_[try]get/put() cgroup: move check_for_release() invocation cgroup: replace cgroup_has_tasks() with cgroup_is_populated() cgroup: make cgroup->nr_populated count the number of populated css_sets cgroup: remove an unused parameter from cgroup_task_migrate() cgroup: fix too early usage of static_branch_disable() cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically ...
2015-11-03Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this cycle were: - sched/fair load tracking fixes and cleanups (Byungchul Park) - Make load tracking frequency scale invariant (Dietmar Eggemann) - sched/deadline updates (Juri Lelli) - stop machine fixes, cleanups and enhancements for bugs triggered by CPU hotplug stress testing (Oleg Nesterov) - scheduler preemption code rework: remove PREEMPT_ACTIVE and related cleanups (Peter Zijlstra) - Rework the sched_info::run_delay code to fix races (Peter Zijlstra) - Optimize per entity utilization tracking (Peter Zijlstra) - ... misc other fixes, cleanups and smaller updates" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop() sched: Start stopper early stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark() stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark() stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works() stop_machine: Ensure that a queued callback will be called before cpu_stop_park() sched/x86: Fix typo in __switch_to() comments sched/core: Remove a parameter in the migrate_task_rq() function sched/core: Drop unlikely behind BUG_ON() sched/core: Fix task and run queue sched_info::run_delay inconsistencies sched/numa: Fix task_tick_fair() from disabling numa_balancing sched/core: Add preempt_count invariant check sched/core: More notrace annotations sched/core: Kill PREEMPT_ACTIVE sched/core, sched/x86: Kill thread_info::saved_preempt_count sched/core: Simplify preempt_count tests sched/core: Robustify preemption leak checks sched/core: Stop setting PREEMPT_ACTIVE ...
2015-10-28Merge branch 'linus' into core/rcu, to fix up a semantic conflictIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-23sched/core: Add missing lockdep_unpin() annotationsPeter Zijlstra
Luca and Wanpeng reported two missing annotations that led to false lockdep complaints. Add the missing annotations. Reported-by: Luca Abeni <luca.abeni@unitn.it> Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: cbce1a686700 ("sched,lockdep: Employ lock pinning") Link: http://lkml.kernel.org/r/20151023095008.GY17308@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETSOleg Nesterov
If CONFIG_CPUSETS=n then "case cpuset" changes the state and runs the already failed for_each_cpu() loop again for no reason. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20151010185315.GA24100@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()Peter Zijlstra
The cpu_active() tests are not fundamentally part of stop_two_cpus(), move then into the scheduler where they belong. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20sched: Start stopper earlyPeter Zijlstra
Ensure the stopper thread is active 'early', because the load balancer pretty much assumes that its available. And when 'online && active' the load-balancer is fully available. Not only the numa balancing stop_two_cpus() caller relies on it, but also the self migration stuff does, and at CPU_ONLINE time the cpu really is 'free' to run anything. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20151009160054.GA10176@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20Merge branch 'sched/urgent' into sched/core, to pick up fixes and resolve ↵Ingo Molnar
conflicts Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20nohz: Revert "nohz: Set isolcpus when nohz_full is set"Frederic Weisbecker
This reverts: 8cb9764fc88b ("nohz: Set isolcpus when nohz_full is set") We assumed that full-nohz users always want scheduler isolation on full dynticks CPUs, therefore we included full-nohz CPUs on cpu_isolated_map. This means that tasks run by default on CPUs outside the nohz_full range unless their affinity is explicity overwritten. This suits pure isolation workloads but when the machine is needed to run common workloads, the available sets of CPUs to run common tasks becomes reduced. We reach an extreme case when CONFIG_NO_HZ_FULL_ALL is enabled as it leaves only CPU 0 for non-isolation tasks, which makes people think that their supercomputer regressed to 90's UP - which is true in a sense. Some full-nohz users appear to be interested in running normal workloads either before or after an isolation workload. Full-nohz isn't optimized toward normal workloads but it's still better than UP performance. We are reaching a limitation in kernel presets here. Lets revert this cpu_isolated_map inclusion and let userspace do its own scheduler isolation using cpusets or explicit affinity settings. Reported-by: Ingo Molnar <mingo@kernel.org> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/r/1444663283-30068-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Miscellaneous fixes. (Paul E. McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier) - Improvements to expedited grace periods. (Paul E. McKenney) - Performance improvements to and locktorture tests for percpu-rwsem. (Oleg Nesterov, Paul E. McKenney) - Torture-test changes. (Paul E. McKenney, Davidlohr Bueso) - Documentation updates. (Paul E. McKenney) Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-15cgroup: keep zombies associated with their original cgroupsTejun Heo
cgroup_exit() is called when a task exits and disassociates the exiting task from its cgroups and half-attach it to the root cgroup. This is unnecessary and undesirable. No controller actually needs an exiting task to be disassociated with non-root cgroups. Both cpu and perf_event controllers update the association to the root cgroup from their exit callbacks just to keep consistent with the cgroup core behavior. Also, this disassociation makes it difficult to track resources held by zombies or determine where the zombies came from. Currently, pids controller is completely broken as it uncharges on exit and zombies always escape the resource restriction. With cgroup association being reset on exit, fixing it is pretty painful. There's no reason to reset cgroup membership on exit. The zombie can be removed from its css_set so that it doesn't show up on "cgroup.procs" and thus can't be migrated or interfere with cgroup removal. It can still pin and point to the css_set so that its cgroup membership is maintained. This patch makes cgroup core keep zombies associated with their cgroups at the time of exit. * Previous patches decoupled populated_cnt tracking from css_set lifetime, so a dying task can be simply unlinked from its css_set while pinning and pointing to the css_set. This keeps css_set association from task side alive while hiding it from "cgroup.procs" and populated_cnt tracking. The css_set reference is dropped when the task_struct is freed. * ->exit() callback no longer needs the css arguments as the associated css never changes once PF_EXITING is set. Removed. * cpu and perf_events controllers no longer need ->exit() callbacks. There's no reason to explicitly switch away on exit. The final schedule out is enough. The callbacks are removed. * On traditional hierarchies, nothing changes. "/proc/PID/cgroup" still reports "/" for all zombies. On the default hierarchy, "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged to at the time of exit. If the cgroup gets removed before the task is reaped, " (deleted)" is appended. v2: Build brekage due to missing dummy cgroup_free() when !CONFIG_CGROUP fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2015-10-06sched: Export sched_setscheduler_nocheckDavidlohr Bueso
The new locktorture rtmutex_lock tests exercise priority boosting, which means that they need to set some tasks to real-time priority. To do this, they use sched_setscheduler_nocheck(). However, this is not exported to modules, which results in the following error when building locktorture as a module: ERROR: "sched_setscheduler_nocheck" [kernel/locking/locktorture.ko] undefined! This commit therefore adds an EXPORT_SYMBOL_GPL() to allow this function to be invoked from locktorture when built as a module. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06sched/core: Remove a parameter in the migrate_task_rq() functionxiaofeng.yan
The parameter "int next_cpu" in the following function is unused: migrate_task_rq(struct task_struct *p, int next_cpu) Remove it. Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Drop unlikely behind BUG_ON()Geliang Tang
(1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't care less and the change is ok. (2) PPC and MIPS, which HAVE_ARCH_BUG_ON, do not rely on branch predictions as it seems to be pointless [1] and thus callers should not be trying to push an optimization in the first place. (3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() contains an unlikely compiler flag already. Hence, we can drop unlikely behind BUG_ON(). [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.3/02289.html Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/6fa7125979f98bbeac26e268271769b6ca935c8d.1444051018.git.geliangtang@163.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Fix task and run queue sched_info::run_delay inconsistenciesPeter Zijlstra
Mike Meyer reported the following bug: > During evaluation of some performance data, it was discovered thread > and run queue run_delay accounting data was inconsistent with the other > accounting data that was collected. Further investigation found under > certain circumstances execution time was leaking into the task and > run queue accounting of run_delay. > > Consider the following sequence: > > a. thread is running. > b. thread moves beween cgroups, changes scheduling class or priority. > c. thread sleeps OR > d. thread involuntarily gives up cpu. > > a. implies: > > thread->sched_info.last_queued = 0 > > a. and b. results in the following: > > 1. dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > delta = 0 > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > 2. enqueue_task(rq, thread) > > sched_info_queued(rq, thread) > > /* thread is still on cpu at this point. */ > thread->sched_info.last_queued = task_rq(thread)->clock; > > c. results in: > > dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > > /* delta is execution time not run_delay. */ > delta = task_rq(thread)->clock - thread->sched_info.last_queued > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > Since thread was running between enqueue_task(rq, thread) and > dequeue_task(rq, thread), the delta above is really execution > time and not run_delay. > > d. results in: > > __sched_info_switch(thread, next_thread) > > sched_info_depart(rq, thread) > > sched_info_queued(rq, thread) > > /* last_queued not updated due to being non-zero */ > return > > Since thread was running between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread), the execution time > between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread) now will become > associated with run_delay due to when last_queued was last updated. > This alternative patch solves the problem by not calling sched_info_{de,}queued() in {de,en}queue_task(). Therefore the sched_info state is preserved and things work as expected. By inlining the {de,en}queue_task() functions the new condition becomes (mostly) a compile-time constant and we'll not emit any new branch instructions. It even shrinks the code (due to inlining {en,de}queue_task()): $ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig text data bss dec hex filename 64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o 64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig Reported-by: Mike Meyer <Mike.Meyer@Teradata.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Add preempt_count invariant checkPeter Zijlstra
Ingo requested I keep my debug check for the preempt_count invariant. Requested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: More notrace annotationsPeter Zijlstra
preempt_schedule_common() is marked notrace, but it does not use _notrace() preempt_count functions and __schedule() is also not marked notrace, which means that its perfectly possible to end up in the tracer from preempt_schedule_common(). Steve says: | Yep, there's some history to this. This was originally the issue that | caused function tracing to go into infinite recursion. But now we have | preempt_schedule_notrace(), which is used by the function tracer, and | that function must not be traced till preemption is disabled. | | Now if function tracing is running and we take an interrupt when | NEED_RESCHED is set, it calls | | preempt_schedule_common() (not traced) | | But then that calls preempt_disable() (traced) | | function tracer calls preempt_disable_notrace() followed by | preempt_enable_notrace() which will see NEED_RESCHED set, and it will | call preempt_schedule_notrace(), which stops the recursion, but | still calls __schedule() here, and that means when we return, we call | the __schedule() from preempt_schedule_common(). | | That said, I prefer this patch. Preemption is disabled before calling | __schedule(), and we get rid of a one round recursion with the | scheduler. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Simplify preempt_count testsPeter Zijlstra
Since we stopped setting PREEMPT_ACTIVE, there is no need to mask it out of preempt_count() tests. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Robustify preemption leak checksPeter Zijlstra
When we warn about a preempt_count leak; reset the preempt_count to the known good value such that the problem does not ripple forward. This is most important on x86 which has a per cpu preempt_count that is not saved/restored (after this series). So if you schedule with an invalid (!2*PREEMPT_DISABLE_OFFSET) preempt_count the next task is messed up too. Enforcing this invariant limits the borkage to just the one task. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Stop setting PREEMPT_ACTIVEPeter Zijlstra
Now that nothing tests for PREEMPT_ACTIVE anymore, stop setting it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Fix trace_sched_switch()Peter Zijlstra
__trace_sched_switch_state() is the last remaining PREEMPT_ACTIVE user, move trace_sched_switch() from prepare_task_switch() to __schedule() and propagate the @preempt argument. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Add preempt argument to __schedule()Peter Zijlstra
There is only a single PREEMPT_ACTIVE use in the regular __schedule() path and that is to circumvent the task->state check. Since the code setting PREEMPT_ACTIVE is the immediate caller of __schedule() we can replace this with a function argument. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Create preempt_count invariantPeter Zijlstra
Assuming units of PREEMPT_DISABLE_OFFSET for preempt_count() numbers. Now that TASK_DEAD no longer results in preempt_count() == 3 during scheduling, we will always call context_switch() with preempt_count() == 2. However, we don't always end up with preempt_count() == 2 in finish_task_switch() because new tasks get created with preempt_count() == 1. Create FORK_PREEMPT_COUNT and set it to 2 and use that in the right places. Note that we cannot use INIT_PREEMPT_COUNT as that serves another purpose (boot). After this, preempt_count() is invariant across the context switch, with exception of PREEMPT_ACTIVE. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Rework TASK_DEAD preemption exceptionPeter Zijlstra
TASK_DEAD is special in that the final schedule call from do_exit() must be done with preemption disabled. This means we end up scheduling with a preempt_count() higher than usual (3 instead of the 'expected' 2). Since future patches will want to rely on an invariant preempt_count() value during schedule, fix this up. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar
applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Fix TASK_DEAD race in finish_task_switch()Peter Zijlstra
So the problem this patch is trying to address is as follows: CPU0 CPU1 context_switch(A, B) ttwu(A) LOCK A->pi_lock A->on_cpu == 0 finish_task_switch(A) prev_state = A->state <-. WMB | A->on_cpu = 0; | UNLOCK rq0->lock | | context_switch(C, A) `-- A->state = TASK_DEAD prev_state == TASK_DEAD put_task_struct(A) context_switch(A, C) finish_task_switch(A) A->state == TASK_DEAD put_task_struct(A) The argument being that the WMB will allow the load of A->state on CPU0 to cross over and observe CPU1's store of A->state, which will then result in a double-drop and use-after-free. Now the comment states (and this was true once upon a long time ago) that we need to observe A->state while holding rq->lock because that will order us against the wakeup; however the wakeup will not in fact acquire (that) rq->lock; it takes A->pi_lock these days. We can obviously fix this by upgrading the WMB to an MB, but that is expensive, so we'd rather avoid that. The alternative this patch takes is: smp_store_release(&A->on_cpu, 0), which avoids the MB on some archs, but not important ones like ARM. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> # v3.1+ Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: manfred@colorfullife.com Cc: will.deacon@arm.com Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()") Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-27Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single bug fix for the scheduler to prevent dequeueing of the idle task when setting the cpus allowed mask" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix crash trying to dequeue/enqueue the idle thread
2015-09-23sched/core: Make 'sched_domain_topology' declaration staticJuergen Gross
The 'sched_domain_topology' variable is only used within kernel/sched/core.c. Make it static. Signed-off-by: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1442918939-9907-1-git-send-email-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "Mostly stable material, a lot of ARM fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits) sched: access local runqueue directly in single_task_running arm/arm64: KVM: Remove 'config KVM_ARM_MAX_VCPUS' arm64: KVM: Remove all traces of the ThumbEE registers arm: KVM: Disable virtual timer even if the guest is not using it arm64: KVM: Disable virtual timer even if the guest is not using it arm/arm64: KVM: vgic: Check for !irqchip_in_kernel() when mapping resources KVM: s390: Replace incorrect atomic_or with atomic_andnot arm: KVM: Fix incorrect device to IPA mapping arm64: KVM: Fix user access for debug registers KVM: vmx: fix VPID is 0000H in non-root operation KVM: add halt_attempted_poll to VCPU stats kvm: fix zero length mmio searching kvm: fix double free for fast mmio eventfd kvm: factor out core eventfd assign/deassign logic kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd KVM: make the declaration of functions within 80 characters KVM: arm64: add workaround for Cortex-A57 erratum #852523 KVM: fix polling for guest halt continued even if disable it arm/arm64: KVM: Fix PSCI affinity info return value for non valid cores arm64: KVM: set {v,}TCR_EL2 RES1 bits ...
2015-09-18sched: access local runqueue directly in single_task_runningDominik Dingel
Commit 2ee507c47293 ("sched: Add function single_task_running to let a task check if it is the only task running on a cpu") referenced the current runqueue with the smp_processor_id. When CONFIG_DEBUG_PREEMPT is enabled, that is only allowed if preemption is disabled or the currrent task is bound to the local cpu (e.g. kernel worker). With commit f78195129963 ("kvm: add halt_poll_ns module parameter") KVM calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that generates a lot of kernel messages. To avoid adding preemption in that cases, as it would limit the usefulness, we change single_task_running to access directly the cpu local runqueue. Cc: Tim Chen <tim.c.chen@linux.intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Fixes: 2ee507c472939db4b146d545352b8a7c79ef47f8 Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-18sched/core: Make policy-testing consistentHenrik Austad
Most of the policy-tests are done via the <class>_policy() helpers with the notable exception of idle. A new wrapper for valid_policy() has also been added to improve readability in set_load_weight(). This commit does not change the logical behavior of the scheduler core. Signed-off-by: Henrik Austad <henrik@austad.us> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1441810841-4756-1-git-send-email-henrik@austad.us Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18sched: Fix crash trying to dequeue/enqueue the idle threadPeter Zijlstra
Sasha reports that his virtual machine tries to schedule the idle thread since commit 6c37067e2786 ("sched: Change the sched_class::set_cpus_allowed() calling context"). Hit trace shows this happening from idle_thread_get()->init_idle(), which is the _second_ init_idle() invocation on that task_struct, the first being done through idle_init()->fork_idle(). (this code is insane...) Because we call init_idle() twice in a row, its ->sched_class == &idle_sched_class and ->on_rq = TASK_ON_RQ_QUEUED. This means do_set_cpus_allowed() think we're queued and will call dequeue_task(), which is implemented with BUG() for the idle class, seeing how dequeueing the idle task is a daft thing. Aside of the whole insanity of calling init_idle() _twice_, change the code to call set_cpus_allowed_common() instead as this is 'obviously' before the idle task gets ran etc.. Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 6c37067e2786 ("sched: Change the sched_class::set_cpus_allowed() calling context") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-17Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A migrate_tasks() locking fix, and a late-coming nohz change plus a nohz debug check" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: 'Annotate' migrate_tasks() nohz: Assert existing housekeepers when nohz full enabled nohz: Affine unpinned timers to housekeepers
2015-09-13sched/fair: Initialize task load and utilization before placing task on rqMorten Rasmussen
Task load or utilization is not currently considered in select_task_rq_fair(), but if we want that in the future we should make sure it is not zero for new tasks. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: daniel.lezcano@linaro.org Cc: mturquette@baylibre.com Cc: pang.xunlei@zte.com.cn Cc: rjw@rjwysocki.net Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1439569394-11974-7-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13sched/numa: Convert sched_numa_balancing to a static_branchSrikar Dronamraju
Variable sched_numa_balancing toggles numa_balancing feature. Hence moving from a simple read mostly variable to a more apt static_branch. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439310261-16124-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13sched/numa: Remove the NUMA sched_featureSrikar Dronamraju
Variable sched_numa_balancing is available for both CONFIG_SCHED_DEBUG and !CONFIG_SCHED_DEBUG. All code paths now check for sched_numa_balancing. Hence remove sched_feat(NUMA). Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439290813-6683-4-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13sched/numa: Disable sched_numa_balancing on UMA systemsSrikar Dronamraju
Commit 2a1ed24 ("sched/numa: Prefer NUMA hotness over cache hotness") sets sched feature NUMA to true. However this can enable NUMA hinting faults on a UMA system. This commit ensures that NUMA hinting faults occur only on a NUMA system by setting/resetting sched_numa_balancing. This commit: - Makes sched_numa_balancing common to CONFIG_SCHED_DEBUG and !CONFIG_SCHED_DEBUG. Earlier it was only in !CONFIG_SCHED_DEBUG. - Checks for sched_numa_balancing instead of sched_feat(NUMA). Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439290813-6683-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13sched/numa: Rename numabalancing_enabled to sched_numa_balancingSrikar Dronamraju
Simple rename of the 'numabalancing_enabled' variable to 'sched_numa_balancing'. No functional changes. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439290813-6683-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13sched/core: Delete PF_EXITING checks from cpu_cgroup_exit() callbackKirill Tkhai
cgroup_exit() is not called from copy_process() after commit: e8604cb43690 ("cgroup: fix spurious lockdep warning in cgroup_exit()") from do_exit(). So this check is useless and the comment is obsolete. Signed-off-by: Kirill Tkhai <ktkhai@odin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/55E444C8.3020402@odin.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13sched/core: Remove unused argument from sched_class::task_move_groupPeter Zijlstra
The previous patches made the second argument go unused, remove it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13Merge tag 'v4.3-rc1' into sched/core, to refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-11sched: 'Annotate' migrate_tasks()Wanpeng Li
Kernel testing triggered this warning: | WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80() | Modules linked in: | CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2 | Call Trace: | dump_stack+0x4b/0x75 | warn_slowpath_common+0x8b/0xc0 | warn_slowpath_null+0x22/0x30 | do_set_cpus_allowed+0x7e/0x80 | cpuset_cpus_allowed_fallback+0x7c/0x170 | select_fallback_rq+0x221/0x280 | migration_call+0xe3/0x250 | notifier_call_chain+0x53/0x70 | __raw_notifier_call_chain+0x1e/0x30 | cpu_notify+0x28/0x50 | take_cpu_down+0x22/0x40 | multi_cpu_stop+0xd5/0x140 | cpu_stopper_thread+0xbc/0x170 | smpboot_thread_fn+0x174/0x2f0 | kthread+0xc4/0xe0 | ret_from_kernel_thread+0x21/0x30 As Peterz pointed out: | So the normal rules for changing task_struct::cpus_allowed are holding | both pi_lock and rq->lock, such that holding either stabilizes the mask. | | This is so that wakeup can happen without rq->lock and load-balance | without pi_lock. | | From this we already get the relaxation that we can omit acquiring | rq->lock if the task is not on the rq, because in that case | load-balancing will not apply to it. | | ** these are the rules currently tested in do_set_cpus_allowed() ** | | Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which | unconditionally acquires both locks, we could get away with holding just | rq->lock when on_rq for modification because that'd still exclude | __set_cpus_allowed_ptr(), it would also work against | __kthread_bind_mask() because that assumes !on_rq. | | That said, this is all somewhat fragile. | | Now, I don't think dropping rq->lock is quite as disastrous as it | usually is because !cpu_active at this point, which means load-balance | will not interfere, but that too is somewhat fragile. | | So we end up with a choice of two fragile.. This patch fixes it by following the rules for changing task_struct::cpus_allowed with both pi_lock and rq->lock held. Reported-by: kernel test robot <ying.huang@intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [ Modified changelog and patch. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-03Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomic updates from Ingo Molnar: "Main changes in this cycle are: - Extend atomic primitives with coherent logic op primitives (atomic_{or,and,xor}()) and deprecate the old partial APIs (atomic_{set,clear}_mask()) The old ops were incoherent with incompatible signatures across architectures and with incomplete support. Now every architecture supports the primitives consistently (by Peter Zijlstra) - Generic support for 'relaxed atomics': - _acquire/release/relaxed() flavours of xchg(), cmpxchg() and {add,sub}_return() - atomic_read_acquire() - atomic_set_release() This came out of porting qwrlock code to arm64 (by Will Deacon) - Clean up the fragile static_key APIs that were causing repeat bugs, by introducing a new one: DEFINE_STATIC_KEY_TRUE(name); DEFINE_STATIC_KEY_FALSE(name); which define a key of different types with an initial true/false value. Then allow: static_branch_likely() static_branch_unlikely() to take a key of either type and emit the right instruction for the case. To be able to know the 'type' of the static key we encode it in the jump entry (by Peter Zijlstra) - Static key self-tests (by Jason Baron) - qrwlock optimizations (by Waiman Long) - small futex enhancements (by Davidlohr Bueso) - ... and misc other changes" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits) jump_label/x86: Work around asm build bug on older/backported GCCs locking, ARM, atomics: Define our SMP atomics in terms of _relaxed() operations locking, include/llist: Use linux/atomic.h instead of asm/cmpxchg.h locking/qrwlock: Make use of _{acquire|release|relaxed}() atomics locking/qrwlock: Implement queue_write_unlock() using smp_store_release() locking/lockref: Remove homebrew cmpxchg64_relaxed() macro definition locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t' locking, asm-generic: Rework atomic-long.h to avoid bulk code duplication locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations locking, compiler.h: Cast away attributes in the WRITE_ONCE() magic locking/static_keys: Make verify_keys() static jump label, locking/static_keys: Update docs locking/static_keys: Provide a selftest jump_label: Provide a self-test s390/uaccess, locking/static_keys: employ static_branch_likely() x86, tsc, locking/static_keys: Employ static_branch_likely() locking/static_keys: Add selftest locking/static_keys: Add a new static_key interface locking/static_keys: Rework update logic locking/static_keys: Add static_key_{en,dis}able() helpers ...
2015-09-02Merge branch 'for-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - a new PIDs controller is added. It turns out that PIDs are actually an independent resource from kmem due to the limited PID space. - more core preparations for the v2 interface. Once cpu side interface is settled, it should be ready for lifting the devel mask. for-4.3-unified-base was temporarily branched so that other trees (block) can pull cgroup core changes that blkcg changes depend on. - a non-critical idr_preload usage bug fix. * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: pids: fix invalid get/put usage cgroup: introduce cgroup_subsys->legacy_name cgroup: don't print subsystems for the default hierarchy cgroup: make cftype->private a unsigned long cgroup: export cgrp_dfl_root cgroup: define controller file conventions cgroup: fix idr_preload usage cgroup: add documentation for the PIDs controller cgroup: implement the PIDs subsystem cgroup: allow a cgroup subsystem to reject a fork
2015-09-02nohz: Affine unpinned timers to housekeepersVatika Harlalka
The problem addressed in this patch is about affining unpinned timers. Adaptive or Full Dynticks CPUs are currently disturbed by unnecessary jitter due to firing of such timers on them. This patch will affine timers to online CPUs which are not full dynticks in NOHZ_FULL configured systems. It should not introduce overhead in nohz full off case due to static keys. Signed-off-by: Vatika Harlalka <vatikaharlalka@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1441119060-2230-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-31Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull NOHZ updates from Ingo Molnar: "The main changes, mostly written by Frederic Weisbecker, include: - Fix some jiffies based cputime assumptions. (No real harm because the concerned code isn't used by full dynticks.) - Simplify jiffies <-> usecs conversions. Remove dead code. - Remove early hacks on nohz full code that avoided messing up idle nohz internals. Now nohz integrates well full and idle and such hack have become needless. - Restart nohz full tick from irq exit. (A simplification and a preparation for future optimization on scheduler kick to nohz full) - Code cleanups. - Tile driver isolation enhancement on top of nohz. (Chris Metcalf)" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz: Remove useless argument on tick_nohz_task_switch() nohz: Move tick_nohz_restart_sched_tick() above its users nohz: Restart nohz full tick from irq exit nohz: Remove idle task special case nohz: Prevent tilegx network driver interrupts alpha: Fix jiffies based cputime assumption apm32: Fix cputime == jiffies assumption jiffies: Remove HZ > USEC_PER_SEC special case
2015-08-31Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "This is a leftover scheduler fix from the v4.2 cycle" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix cpu_active_mask/cpu_online_mask race
2015-08-31Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The biggest change in this cycle is the rewrite of the main SMP load balancing metric: the CPU load/utilization. The main goal was to make the metric more precise and more representative - see the changelog of this commit for the gory details: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") It is done in a way that significantly reduces complexity of the code: 5 files changed, 249 insertions(+), 494 deletions(-) and the performance testing results are encouraging. Nevertheless we need to keep an eye on potential regressions, since this potentially affects every SMP workload in existence. This work comes from Yuyang Du. Other changes: - SCHED_DL updates. (Andrea Parri) - Simplify architecture callbacks by removing finish_arch_switch(). (Peter Zijlstra et al) - cputime accounting: guarantee stime + utime == rtime. (Peter Zijlstra) - optimize idle CPU wakeups some more - inspired by Facebook server loads. (Mike Galbraith) - stop_machine fixes and updates. (Oleg Nesterov) - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra) - sched/numa tweaks. (Srikar Dronamraju) - misc fixes and small cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) sched/deadline: Fix comment in enqueue_task_dl() sched/deadline: Fix comment in push_dl_tasks() sched: Change the sched_class::set_cpus_allowed() calling context sched: Make sched_class::set_cpus_allowed() unconditional sched: Fix a race between __kthread_bind() and sched_setaffinity() sched: Ensure a task has a non-normalized vruntime when returning back to CFS sched/numa: Fix NUMA_DIRECT topology identification tile: Reorganize _switch_to() sched, sparc32: Update scheduler comments in copy_thread() sched: Remove finish_arch_switch() sched, tile: Remove finish_arch_switch sched, sh: Fold finish_arch_switch() into switch_to() sched, score: Remove finish_arch_switch() sched, avr32: Remove finish_arch_switch() sched, MIPS: Get rid of finish_arch_switch() sched, arm: Remove finish_arch_switch() sched/fair: Clean up load average references sched/fair: Provide runnable_load_avg back to cfs_rq sched/fair: Remove task and group entity load when they are dead sched/fair: Init cfs_rq's sched_entity load average ...
2015-08-25sched: Fix cpu_active_mask/cpu_online_mask raceJan H. Schönherr
There is a race condition in SMP bootup code, which may result in WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418 workqueue_cpu_up_callback() or kernel BUG at kernel/smpboot.c:135! It can be triggered with a bit of luck in Linux guests running on busy hosts. CPU0 CPUn ==== ==== _cpu_up() __cpu_up() start_secondary() set_cpu_online() cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits)); cpu_notify(CPU_ONLINE) <do stuff, see below> cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits)); During the various CPU_ONLINE callbacks CPUn is online but not active. Several things can go wrong at that point, depending on the scheduling of tasks on CPU0. Variant 1: cpu_notify(CPU_ONLINE) workqueue_cpu_up_callback() rebind_workers() set_cpus_allowed_ptr() This call fails because it requires an active CPU; rebind_workers() ends with a warning: WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418 workqueue_cpu_up_callback() Variant 2: cpu_notify(CPU_ONLINE) smpboot_thread_call() smpboot_unpark_threads() .. __kthread_unpark() __kthread_bind() wake_up_state() .. select_task_rq() select_fallback_rq() The ->wake_cpu of the unparked thread is not allowed, making a call to select_fallback_rq() necessary. Then, select_fallback_rq() cannot find an allowed, active CPU and promptly resets the allowed CPUs, so that the task in question ends up on CPU0. When those unparked tasks are eventually executed, they run immediately into a BUG: kernel BUG at kernel/smpboot.c:135! Just changing the order in which the online/active bits are set (and adding some memory barriers), would solve the two issues above. However, it would change the order of operations back to the one before commit 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()"), thus, reintroducing that particular problem. Going further back into history, we have at least the following commits touching this topic: - commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online") - commit 5fbd036b552f ("sched: Cleanup cpu_active madness") Together, these give us the following non-working solutions: - secondary CPU sets active before online, because active is assumed to be a subset of online; - secondary CPU sets online before active, because the primary CPU assumes that an online CPU is also active; - secondary CPU sets online and waits for primary CPU to set active, because it might deadlock. Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are active & online") introduces an arch-specific solution to this arch-independent problem. Now, go for a more general solution without explicit waiting and simply set active twice: once on the secondary CPU after online was set and once on the primary CPU after online was seen. set_cpus_allowed_ptr()") Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@vger.kernel.org> Cc: Anton Blanchard <anton@samba.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Joerg Roedel <jroedel@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Wilson <msw@amazon.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()") Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12sched: Change the sched_class::set_cpus_allowed() calling contextPeter Zijlstra
Change the calling context of sched_class::set_cpus_allowed() such that we can assume the task is inactive. This allows us to easily make changes that affect accounting done by enqueue/dequeue. This does in fact completely remove set_cpus_allowed_rt() and greatly reduces set_cpus_allowed_dl(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dedekind1@gmail.com Cc: juri.lelli@arm.com Cc: mgorman@suse.de Cc: riel@redhat.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20150515154833.667516139@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>