summaryrefslogtreecommitdiff
path: root/kernel/sched/core.c
AgeCommit message (Collapse)Author
2024-10-29sched: Pass correct scheduling policy to __setscheduler_classAboorva Devarajan
Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") overlooked that __setscheduler_prio(), now __setscheduler_class() relies on p->policy for task_should_scx(), and moved the call before __setscheduler_params() updates it, causing it to be using the old p->policy value. Resolve this by changing task_should_scx() to take the policy itself instead of a task pointer, such that __sched_setscheduler() can pass in the updated policy. Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
2024-10-17Merge branch 'linus' into sched/urgent, to resolve conflictIngo Molnar
Conflicts: kernel/sched/ext.c There's a context conflict between this upstream commit: 3fdb9ebcec10 sched_ext: Start schedulers with consistent p->scx.slice values ... and this fix in sched/urgent: 98442f0ccd82 sched: Fix delayed_dequeue vs switched_from_fair() Resolve it. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-10-14sched/fair: Fix external p->on_rq usersPeter Zijlstra
Sean noted that ever since commit 152e11f6df29 ("sched/fair: Implement delayed dequeue") KVM's preemption notifiers have started mis-classifying preemption vs blocking. Notably p->on_rq is no longer sufficient to determine if a task is runnable or blocked -- the aforementioned commit introduces tasks that remain on the runqueue even through they will not run again, and should be considered blocked for many cases. Add the task_is_runnable() helper to classify things and audit all external users of the p->on_rq state. Also add a few comments. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
2024-10-14sched/psi: Fix mistaken CPU pressure indication after corrupted task state bugJohannes Weiner
Since sched_delayed tasks remain queued even after blocking, the load balancer can migrate them between runqueues while PSI considers them to be asleep. As a result, it misreads the migration requeue followed by a wakeup as a double queue: psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4 First, call psi_enqueue() after p->sched_class->enqueue_task(). A wakeup will clear p->se.sched_delayed while a migration will not, so psi can use that flag to tell them apart. Then teach psi to migrate any "sleep" state when delayed-dequeue tasks are being migrated. Delayed-dequeue tasks can be revived by ttwu_runnable(), which will call down with a new ENQUEUE_DELAYED. Instead of further complicating the wakeup conditional in enqueue_task(), identify migration contexts instead and default to wakeup handling for all other cases. It's not just the warning in dmesg, the task state corruption causes a permanent CPU pressure indication, which messes with workload/machine health monitoring. Debugged-by-and-original-fix-by: K Prateek Nayak <kprateek.nayak@amd.com> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/ Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20241010193712.GC181795@cmpxchg.org
2024-10-11sched/core: Dequeue PSI signals for blocked tasks that are delayedPeter Zijlstra
psi_dequeue() in for blocked task expects psi_sched_switch() to clear the TSK_.*RUNNING PSI flags and set the TSK_IOWAIT flags however psi_sched_switch() uses "!task_on_rq_queued(prev)" to detect if the task is blocked or still runnable which is no longer true with DELAY_DEQUEUE since a blocking task can be left queued on the runqueue. This can lead to PSI splats similar to: psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4 when the task is requeued since the TSK_RUNNING flag was not cleared when the task was blocked. Explicitly communicate that the task was blocked to psi_sched_switch() even if it was delayed and is still on the runqueue. [ prateek: Broke off the relevant part from [1], commit message ] Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/ Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/ Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/ [1]
2024-10-11sched: Fix delayed_dequeue vs switched_from_fair()Peter Zijlstra
Commit 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") and its follow up fixes try to deal with a rather unfortunate situation where is task is enqueued in a new class, even though it shouldn't have been. Mostly because the existing ->switched_to/from() hooks are in the wrong place for this case. This all led to Paul being able to trigger failures at something like once per 10k CPU hours of RCU torture. For now, do the ugly thing and move the code to the right place by ignoring the switch hooks. Note: Clean up the whole sched_class::switch*_{to,from}() thing. Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
2024-10-11sched/core: Disable page allocation in task_tick_mm_cid()Waiman Long
With KASAN and PREEMPT_RT enabled, calling task_work_add() in task_tick_mm_cid() may cause the following splat. [ 63.696416] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [ 63.696416] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 610, name: modprobe [ 63.696416] preempt_count: 10001, expected: 0 [ 63.696416] RCU nest depth: 1, expected: 1 This problem is caused by the following call trace. sched_tick() [ acquire rq->__lock ] -> task_tick_mm_cid() -> task_work_add() -> __kasan_record_aux_stack() -> kasan_save_stack() -> stack_depot_save_flags() -> alloc_pages_mpol_noprof() -> __alloc_pages_noprof() -> get_page_from_freelist() -> rmqueue() -> rmqueue_pcplist() -> __rmqueue_pcplist() -> rmqueue_bulk() -> rt_spin_lock() The rq lock is a raw_spinlock_t. We can't sleep while holding it. IOW, we can't call alloc_pages() in stack_depot_save_flags(). The task_tick_mm_cid() function with its task_work_add() call was introduced by commit 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") in v6.4 kernel. Fortunately, there is a kasan_record_aux_stack_noalloc() variant that calls stack_depot_save_flags() while not allowing it to allocate new pages. To allow task_tick_mm_cid() to use task_work without page allocation, a new TWAF_NO_ALLOC flag is added to enable calling kasan_record_aux_stack_noalloc() instead of kasan_record_aux_stack() if set. The task_tick_mm_cid() function is modified to add this new flag. The possible downside is the missing stack trace in a KASAN report due to new page allocation required when task_work_add_noallloc() is called which should be rare. Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241010014432.194742-1-longman@redhat.com
2024-10-07sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() ↵Tejun Heo
was called During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or migration is disabled. sched_ext schedulers may perform operations such as direct dispatch from ->select_task_rq() path and it is useful for them to know whether ->select_task_rq() was skipped in the ->enqueue_task() path. Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose and end up assuming incorrectly that ->select_task_rq() was called for tasks that are bound to a single CPU or migration disabled. Make select_task_rq() indicate whether ->select_task_rq() was called by setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that to ENQUEUE_RQ_SELECTED for ->enqueue_task(). This will be used by sched_ext to fix ->select_task_rq() skip detection. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-07sched/core: Make select_task_rq() take the pointer to wake_flags instead of ↵Tejun Heo
value This will be used to allow select_task_rq() to indicate whether ->select_task_rq() was called by modifying *wake_flags. This makes try_to_wake_up() call all functions that take wake_flags with WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is more consistent, and, as the flag is only tested by ->select_task_rq() implementations, it doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-09-23sched, sched_ext: Disable SM_IDLE/rq empty path when scx_enabled()Pat Somaru
Disable the rq empty path when scx is enabled. SCX must consult the BPF scheduler (via the dispatch path in balance) to determine if rq is empty. This fixes stalls when scx is enabled. Signed-off-by: Pat Somaru <patso@likewhatevs.io> Fixes: 3dcac251b066 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()") Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-21Merge tag 'sched_ext-for-6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext support from Tejun Heo: "This implements a new scheduler class called ‘ext_sched_class’, or sched_ext, which allows scheduling policies to be implemented as BPF programs. The goals of this are: - Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. - Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. - Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments" See individual commits for more documentation, but also the cover letter for the latest series: Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/ * tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits) sched: Move update_other_load_avgs() to kernel/sched/pelt.c sched_ext: Don't trigger ops.quiescent/runnable() on migrations sched_ext: Synchronize bypass state changes with rq lock scx_qmap: Implement highpri boosting sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() sched_ext: Compact struct bpf_iter_scx_dsq_kern sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() sched_ext: Move consume_local_task() upward sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() sched_ext: Reorder args for consume_local/remote_task() sched_ext: Restructure dispatch_to_local_dsq() sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON sched_ext: Refactor consume_remote_task() sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate sched_ext: Add missing static to scx_dump_data sched_ext: Add missing static to scx_has_op[] sched_ext: Temporarily work around pick_task_scx() being called without balance_scx() sched_ext: Add a cgroup scheduler which uses flattened hierarchy sched_ext: Add cgroup support ...
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
2024-09-19Merge tag 'sched-core-2024-09-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of Clean up usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousefm and Vincent Guittot) * tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched/cpufreq: Use NSEC_PER_MSEC for deadline task cpufreq/cppc: Use NSEC_PER_MSEC for deadline task sched/deadline: Clarify nanoseconds in uapi sched/deadline: Convert schedtool example to chrt sched/debug: Fix the runnable tasks output sched: Fix sched_delayed vs sched_core kernel/sched: Fix util_est accounting for DELAY_DEQUEUE kthread: Fix task state in kthread worker if being frozen sched/pelt: Use rq_clock_task() for hw_pressure sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() sched: Add put_prev_task(.next) sched: Rework dl_server sched: Combine the last put_prev_task() and the first set_next_task() sched: Rework pick_next_task() sched: Split up put_prev_task_balance() sched: Clean up DL server vs core sched sched: Fixup set_next_task() implementations sched: Use set_next_task(.first) where required sched/fair: Properly deactivate sched_delayed task upon class change ...
2024-09-18Merge tag 'rcu.release.v6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Neeraj Upadhyay: "Context tracking: - rename context tracking state related symbols and remove references to "dynticks" in various context tracking state variables and related helpers - force context_tracking_enabled_this_cpu() to be inlined to avoid leaving a noinstr section CSD lock: - enhance CSD-lock diagnostic reports - add an API to provide an indication of ongoing CSD-lock stall nocb: - update and simplify RCU nocb code to handle (de-)offloading of callbacks only for offline CPUs - fix RT throttling hrtimer being armed from offline CPU rcutorture: - remove redundant rcu_torture_ops get_gp_completed fields - add SRCU ->same_gp_state and ->get_comp_state functions - add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU polled grace periods - add CFcommon.arch for arch-specific Kconfig options - print number of update types in rcu_torture_write_types() - add rcutree.nohz_full_patience_delay testing to the TREE07 scenario - add a stall_cpu_repeat module parameter to test repeated CPU stalls - add argument to limit number of CPUs a guest OS can use in torture.sh rcustall: - abbreviate RCU CPU stall warnings during CSD-lock stalls - Allow dump_cpu_task() to be called without disabling preemption - defer printing stall-warning backtrace when holding rcu_node lock srcu: - make SRCU gp seq wrap-around faster - add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and ->reschedule_count which are used in heuristics governing auto-expediting of normal SRCU grace periods and grace-period-state-machine delays - mark idle SRCU-barrier callbacks to help identify stuck SRCU-barrier callback rcu tasks: - remove RCU Tasks Rude asynchronous APIs as they are no longer used - stop testing RCU Tasks Rude asynchronous APIs - fix access to non-existent percpu regions - check processor-ID assumptions during chosen CPU calculation for callback enqueuing - update description of rtp->tasks_gp_seq grace-period sequence number - add rcu_barrier_cb_is_done() to identify whether a given rcu_barrier callback is stuck - mark idle Tasks-RCU-barrier callbacks - add *torture_stats_print() functions to print detailed diagnostics for Tasks-RCU variants - capture start time of rcu_barrier_tasks*() operation to help distinguish a hung barrier operation from a long series of barrier operations refscale: - add a TINY scenario to support tests of Tiny RCU and Tiny SRCU - optimize process_durations() operation rcuscale: - dump stacks of stalled rcu_scale_writer() instances and grace-period statistics when rcu_scale_writer() stalls - mark idle RCU-barrier callbacks to identify stuck RCU-barrier callbacks - print detailed grace-period and barrier diagnostics on rcu_scale_writer() hangs for Tasks-RCU variants - warn if async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude - make all writer tasks report upon hang - tolerate repeated GFP_KERNEL failure in rcu_scale_writer() - use special allocator for rcu_scale_writer() - NULL out top-level pointers to heap memory to avoid double-free bugs on modprobe failures - maintain per-task instead of per-CPU callbacks count to avoid any issues with migration of either tasks or callbacks - constify struct ref_scale_ops Fixes: - use system_unbound_wq for kfree_rcu work to avoid disturbing isolated CPUs Misc: - warn on unexpected rcu_state.srs_done_tail state - better define "atomic" for list_replace_rcu() and hlist_replace_rcu() routines - annotate struct kvfree_rcu_bulk_data with __counted_by()" * tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (90 commits) rcu: Defer printing stall-warning backtrace when holding rcu_node lock rcu/nocb: Remove superfluous memory barrier after bypass enqueue rcu/nocb: Conditionally wake up rcuo if not already waiting on GP rcu/nocb: Fix RT throttling hrtimer armed from offline CPU rcu/nocb: Simplify (de-)offloading state machine context_tracking: Tag context_tracking_enabled_this_cpu() __always_inline context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching rcu: Update stray documentation references to rcu_dynticks_eqs_{enter, exit}() rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs() rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck() rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save() rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snap rcu: Rename struct rcu_data .dynticks_snap into .watching_snap rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs() rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since() rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs() rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online() context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into rcu_is_watching_curr_cpu() context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*() refscale: Constify struct ref_scale_ops ...
2024-09-11Merge branch 'tip/sched/core' into sched_ext/for-6.12Tejun Heo
Pull in tip/sched/core to resolve two merge conflicts: - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") A simple context conflict. The former added __update_blocked_others() in the same #ifdef CONFIG_SMP block that effective_cpu_util() and sched_cpu_util() are in and the latter moved those functions to fair.c. This makes __update_blocked_others() more out of place. Will follow up with a patch to relocate. - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 84d265281d6c ("sched/pelt: Use rq_clock_task() for hw_pressure") The former factored out the body of __update_blocked_others() into update_other_load_avgs(). The latter changed how update_hw_load_avg() is called in the body. Resolved by applying the change to update_other_load_avgs() instead. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10sched: Fix sched_delayed vs sched_corePeter Zijlstra
Completely analogous to commit dfa0a574cbc4 ("sched/uclamg: Handle delayed dequeue"), avoid double dequeue for the sched_core entries. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()Peter Zijlstra
Since commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an IPI without actually sending an interrupt. Even in cases where the IPI handler does not queue a task on the idle CPU, do_idle() will call __schedule() since need_resched() returns true in these cases. Introduce and use SM_IDLE to identify call to __schedule() from schedule_idle() and shorten the idle re-entry time by skipping pick_next_task() when nr_running is 0 and the previous task is the idle task. With the SM_IDLE fast-path, the time taken to complete a fixed set of IPIs using ipistorm improves noticeably. Following are the numbers from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and 3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled) running ipistorm between CPU8 and CPU16: cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1 ================================================================== Test : ipistorm (modified) Units : Normalized runtime Interpretation: Lower is better Statistic : AMean ======================= Intel Ice Lake Xeon ====================== kernel: time [pct imp] tip:sched/core 1.00 [baseline] tip:sched/core + SM_IDLE 0.80 [20.51%] ==================== 3rd Generation AMD EPYC ===================== kernel: time [pct imp] tip:sched/core 1.00 [baseline] tip:sched/core + SM_IDLE 0.90 [10.17%] ================================================================== [ kprateek: Commit message, SM_RTLOCK_WAIT fix ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
2024-09-09Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', ↵Neeraj Upadhyay
'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a
2024-09-04sched_ext: Add cgroup supportTejun Heo
Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement or implement only subset of cgroup features. The implemented features can be indicated using %SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features that are not implemented, a warning is triggered. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v7: - cgroup interface file visibility toggling is dropped in favor just warning messages. Dynamically changing interface visiblity caused more confusion than helping. v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE. - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED. v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid locking order conflict w/ cpuset. Better documentation around locking. - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations so that its invocations always match ops.cgroup_move() one-to-one. v4: - Example schedulers moved into their own patches. - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: - Make scx_example_pair switch all tasks by default. - Convert to BPF inline iterators. - scx_bpf_task_cgroup() is added to determine the current cgroup from CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. - scx_example_flatcg added. This demonstrates flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: - Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrea Righi <andrea.righi@canonical.com>
2024-09-04sched: Introduce CONFIG_GROUP_SCHED_WEIGHTTejun Heo
sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or SCX may implement the feature, put the interface code behind the new CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED. This allows either sched class to enable the itnerface code without ading more complex CONFIG tests. When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares() is added to support later CONFIG_CGROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED builds. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04sched: Make cpu_shares_read_u64() use tg_weight()Tejun Heo
Move tg_weight() upward and make cpu_shares_read_u64() use it too. This makes the weight retrieval shared between cgroup v1 and v2 paths and will be used to implement cgroup support for sched_ext. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04sched: Expose css_tg()Tejun Heo
A new BPF extensible sched_class will use css_tg() in the init and exit paths to visit all task_groups by walking cgroups. v4: __setscheduler_prio() is already exposed. Dropped from this patch. v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup mechanism. v2: Expose SCHED_CHANGE_BLOCK() too and update the description. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-09-03sched_ext: Remove sched_class->switch_class()Tejun Heo
With sched_ext converted to use put_prev_task() for class switch detection, there's no user of switch_class() left. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
2024-09-03Merge branch 'tip/sched/core' into for-6.12Tejun Heo
- Resolve trivial context conflicts from dl_server clearing being moved around. - Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to match sched/core. - Merge sched_class->switch_class() addition from sched_ext with tip/sched/core changes in __pick_next_task(). - Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous behavior where sched_class->put_prev_task() was called before sched_class->pick_next_task(). While this makes sched_ext build and function, the behavior is not in line with other sched classes. The follow-up patches will address the discrepancies and remove sched_class->switch_class(). Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03sched: Rework dl_serverPeter Zijlstra
When a task is selected through a dl_server, it will have p->dl_server set, such that it can account runtime to the dl_server, see update_curr_task(). Currently p->dl_server is set in pick*task() whenever it goes through the dl_server, clearing it is a bit of a mess though. The trivial solution is clearing it on the final put (now that we have this location). However, this gives a problem when: p = pick_task(rq); if (p) put_prev_set_next_task(rq, prev, next); picks the same task but through a different path, notably when it goes from picking through the dl_server to a direct pick or vice-versa. In that case we cannot readily determine wether we should clear or preserve p->dl_server. An additional complication is pick_*task() setting p->dl_server for a remote pick, it might still need to update runtime before it schedules the core_pick. Close all these holes and remove all the random clearing of p->dl_server by: - having pick_*task() manage rq->dl_server - having the final put_prev_task() clear p->dl_server - having the first set_next_task() set p->dl_server = rq->dl_server - complicate the core_sched code to save/restore rq->dl_server where appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
2024-09-03sched: Combine the last put_prev_task() and the first set_next_task()Peter Zijlstra
Ensure the last put_prev_task() and the first set_next_task() always go together. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
2024-09-03sched: Rework pick_next_task()Peter Zijlstra
The current rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) And many classes implement it directly as such. Change things around to make pick_next_task() optional while also changing the definition to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) The reason is that sched_ext would like to have a 'final' call that knows the next task. By placing put_prev_task() right next to set_next_task() (as it already is for sched_core) this becomes trivial. As a bonus, this is a nice cleanup on its own. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
2024-09-03sched: Split up put_prev_task_balance()Peter Zijlstra
With the goal of pushing put_prev_task() after pick_task() / into pick_next_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.943143811@infradead.org
2024-09-03sched: Use set_next_task(.first) where requiredPeter Zijlstra
Turns out the core_sched bits forgot to use the set_next_task(.first=true) variant. Notably: pick_next_task() := pick_task() + set_next_task(.first = true) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org
2024-09-01task_stack: uninline stack_not_usedPasha Tatashin
Given that stack_not_used() is not performance critical function uninline it. Link: https://lkml.kernel.org/r/20240730150158.832783-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-20Merge branch 'tip/sched/core' into for-6.12Tejun Heo
To receive 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail") which makes sched_class.dequeue_task() return bool instead of void. This leads to compile breakage and will be fixed by a follow-up patch. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-17sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestionPeter Zijlstra
Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime. The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100. Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency. For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4: {A,B,C,D}(w=1,r=8): ABCD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+*------+-------+--- ---+--*----+-------+--- t=2, V=5.5 t=3, V=7.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+----*--+-------+--- ---+------*+-------+--- Note: 4 identical tasks in FIFO order ~~~ {A,B}(w=1,r=16) C(w=2,r=16) AACCBBCC... +---+---+---+--- t=0, V=1.25 t=2, V=5.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=4, V=8.25 t=6, V=12.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+-------*-------+--- ---+-------+---*---+--- Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q. Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length. Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q ~~~ {A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |--------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.5 t=5, V=11.5 A |---------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.5 A |---------------< B |------< C |--------------< D |--------------< ---+-------+----*--+--- Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because its not the leftmost task, but is eligible with the 0,1,2,3 spread. Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q ~~~ A(w=1,r=16) B(w=1,r=8) C(w=2,r=16) BCCAABCC... +---+---+---+--- t=0, V=1.25 t=1, V=3.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.25 t=5, V=11.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.25 A |--------------< B |------< C |------< ---+-------+----*--+--- Note: 1 heavy and 1 short task -- combine them all. Note: both the short and heavy task end up with a period of 4q ~~~ A(w=1,r=16) B(w=2,r=16) C(w=1,r=8) BBCAABBC... +---+---+---+--- t=0, V=1 t=2, V=5 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=3, V=7 t=5, V=11 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=7, V=15 A |--------------< B |------< C |------< ---+-------+------*+--- Note: as before but permuted ~~~ From all this it can be deduced that, for the steady state: - the total period (P) of a schedule is: W*max(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org
2024-08-17sched: Teach dequeue_task() about special task statesPeter Zijlstra
Since special task states must not suffer spurious wakeups, and the proposed delayed dequeue can cause exactly these (under some boundary conditions), propagate this knowledge into dequeue_task() such that it can do the right thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org
2024-08-17sched/uclamg: Handle delayed dequeuePeter Zijlstra
Delayed dequeue has tasks sit around on the runqueue that are not actually runnable -- specifically, they will be dequeued the moment they get picked. One side-effect is that such a task can get migrated, which leads to a 'nested' dequeue_task() scenario that messes up uclamp if we don't take care. Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on the runqueue. This however will have removed the task from uclamp -- per uclamp_rq_dec() in dequeue_task(). So far so good. However, if at that point the task gets migrated -- or nice adjusted or any of a myriad of operations that does a dequeue-enqueue cycle -- we'll pass through dequeue_task()/enqueue_task() again. Without modification this will lead to a double decrement for uclamp, which is wrong. Reported-by: Luis Machado <luis.machado@arm.com> Reported-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.315205425@infradead.org
2024-08-17sched: Prepare generic code for delayed dequeuePeter Zijlstra
While most of the delayed dequeue code can be done inside the sched_class itself, there is one location where we do not have an appropriate hook, namely ttwu_runnable(). Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking delayed dequeue tasks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.200000445@infradead.org
2024-08-17sched: Split DEQUEUE_SLEEP from deactivate_task()Peter Zijlstra
As a preparation for dequeue_task() failing, and a second code-path needing to take care of the 'success' path, split out the DEQEUE_SLEEP path from deactivate_task(). Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING ordering fail. Fixed-by: Libo Chen <libo.chen@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.086192709@infradead.org
2024-08-17sched: Allow sched_class::dequeue_task() to failPeter Zijlstra
Change the function signature of sched_class::dequeue_task() to return a boolean, allowing future patches to 'fail' dequeue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
2024-08-15rcu: Let dump_cpu_task() be used without preemption disabledRyo Takakura
The commit 2d7f00b2f0130 ("rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()") disabled preemption around dump_cpu_task() to suppress warning on its usage within preemtible context. Calling dump_cpu_task() doesn't required to be in non-preemptible context except for suppressing the smp_processor_id() warning. As the smp_processor_id() is evaluated along with in_hardirq() to check if it's in interrupt context, this patch removes the need for its preemtion disablement by reordering the condition so that smp_processor_id() only gets evaluated when it's in interrupt context. Signed-off-by: Ryo Takakura <takakura@valinux.co.jp> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-07sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()Qais Yousef
Some find the name realtime overloaded. Use rt_or_dl() as an alternative, hopefully better, name. Suggested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
2024-08-07sched/rt: Clean up usage of rt_task()Qais Yousef
rt_task() checks if a task has RT priority. But depends on your dictionary, this could mean it belongs to RT class, or is a 'realtime' task, which includes RT and DL classes. Since this has caused some confusion already on discussion [1], it seemed a clean up is due. I define the usage of rt_task() to be tasks that belong to RT class. Make sure that it returns true only for RT class and audit the users and replace the ones required the old behavior with the new realtime_task() which returns true for RT and DL classes. Introduce similar realtime_prio() to create similar distinction to rt_prio() and update the users that required the old behavior to use the new function. Move MAX_DL_PRIO to prio.h so it can be used in the new definitions. Document the functions to make it more obvious what is the difference between them. PI-boosted tasks is a factor that must be taken into account when choosing which function to use. Rename task_is_realtime() to realtime_task_policy() as the old name is confusing against the new realtime_task(). No functional changes were intended. [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/ Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
2024-08-06sched_ext: Make task_can_run_on_remote_rq() use common task_allowed_on_cpu()Tejun Heo
task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are subtle differences. It currently open codes all the tests. This is cumbersome to understand and error-prone in case the intersecting tests need to be updated. Factor out the common part - testing whether the task is allowed on the CPU at all regardless of the CPU state - into task_allowed_on_cpu() and make both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the code is now linked between the two and each contains only the extra tests that differ between them, it's less error-prone when the conditions need to be updated. Also, improve the comment to explain why they are different. v2: Replace accidental "extern inline" with "static inline" (Peter). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-06sched_ext: Simplify UP support by enabling sched_class->balance() in UPTejun Heo
On SMP, SCX performs dispatch from sched_class->balance(). As balance() was not available in UP, it instead called the internal balance function from put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is rather nasty. Enabling sched_class->balance() on UP shouldn't cause any meaningful overhead. Enable balance() on UP and drop the ugly workaround. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-06sched_ext: Add scx_enabled() test to @start_class promotion in ↵Tejun Heo
put_prev_task_balance() SCX needs its balance() invoked even when waking up from a lower priority sched class (idle) and put_prev_task_balance() thus has the logic to promote @start_class if it's lower than ext_sched_class. This is only needed when SCX is enabled. Add scx_enabled() test to avoid unnecessary overhead when SCX is disabled. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-06sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()Tejun Heo
The way sched_can_stop_tick() used scx_can_stop_tick() was rather confusing and the behavior wasn't ideal when SCX is enabled in partial mode. Simplify it so that: - scx_can_stop_tick() can say no if scx_enabled(). - CFS tests rq->cfs.nr_running > 1 instead of rq->nr_running. This is easier to follow and leads to the correct answer whether SCX is disabled, enabled in partial mode or all tasks are switched to SCX. Peter, note that this is a bit different from your suggestion where sched_can_stop_tick() unconditionally returns scx_can_stop_tick() iff scx_switched_all(). The problem is that in partial mode, tick can be stopped when there is only one SCX task even if the BPF scheduler didn't ask and isn't ready for it. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
2024-08-04Merge branch 'sched/core' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.12 Pull tip/sched/core to resolve the following four conflicts. While 2-4 are simple context conflicts, 1 is a bit subtle and easy to resolve incorrectly. 1. 2c8d046d5d51 ("sched: Add normal_policy()") vs. faa42d29419d ("sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy") The former converts direct test on p->policy to use the helper normal_policy(). The latter moves the p->policy test to a different location. Resolve by converting the test on p->plicy in the new location to use normal_policy(). 2. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") vs. a110a81c52a9 ("sched/deadline: Deferrable dl server") Both add calls to put_prev_task_idle() and set_next_task_idle(). Simple context conflict. Resolve by taking changes from both. 3. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") vs. c245910049d0 ("sched/core: Add clearing of ->dl_server in put_prev_task_balance()") The former changes for_each_class() itertion to use for_each_active_class(). The latter moves away the adjacent dl_server handling code. Simple context conflict. Resolve by taking changes from both. 4. 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()") vs. 31b164e2e4af ("sched/smt: Introduce sched_smt_present_inc/dec() helper") 2f027354122f ("sched/core: Introduce sched_set_rq_on/offline() helper") The former adds scx_rq_deactivate() call. The latter two change code around it. Simple context conflict. Resolve by taking changes from both. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-30Merge tag 'v6.11-rc1' into for-6.12Tejun Heo
Linux 6.11-rc1
2024-07-29sched/rt: Remove default bandwidth controlPeter Zijlstra
Now that fair_server exists, we no longer need RT bandwidth control unless RT_GROUP_SCHED. Enable fair_server with parameters equivalent to RT throttling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
2024-07-29sched/core: Fix priority checking for DL server picksJoel Fernandes (Google)
In core scheduling, a DL server pick (which is CFS task) should be given higher priority than tasks in other classes. Not doing so causes CFS starvation. A kselftest is added later to demonstrate this. A CFS task that is competing with RT tasks can be completely starved without this and the DL server's boosting completely ignored. Fix these problems. Reported-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
2024-07-29sched/fair: Add trivial fair serverPeter Zijlstra
Use deadline servers to service fair tasks. This patch adds a fair_server deadline entity which acts as a container for fair entities and can be used to fix starvation when higher priority (wrt fair) tasks are monopolizing CPU(s). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
2024-07-29sched/core: Clear prev->dl_server in CFS pick fast pathYoussef Esmat
In case the previous pick was a DL server pick, ->dl_server might be set. Clear it in the fast path as well. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Youssef Esmat <youssefesmat@google.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org