Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core facilities:
- Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
optimizes fair-class preemption by delaying preemption requests to
the tick boundary, while working as full preemption for
RR/FIFO/DEADLINE classes. (Peter Zijlstra)
- x86: Enable Lazy preemption (Peter Zijlstra)
- riscv: Enable Lazy preemption (Jisheng Zhang)
- Initialize idle tasks only once (Thomas Gleixner)
- sched/ext: Remove sched_fork() hack (Thomas Gleixner)
Fair scheduler:
- Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
Idle loop:
- Optimize the generic idle loop by removing unnecessary memory
barrier (Zhongqiu Han)
RSEQ:
- Improve cache locality of RSEQ concurrency IDs for intermittent
workloads (Mathieu Desnoyers)
Waitqueues:
- Make wake_up_{bit,var} less fragile (Neil Brown)
PSI:
- Pass enqueue/dequeue flags to psi callbacks directly (Johannes
Weiner)
Preparatory patches for proxy execution:
- Add move_queued_task_locked helper (Connor O'Brien)
- Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
- Split out __schedule() deactivate task logic into a helper (John
Stultz)
- Split scheduler and execution contexts (Peter Zijlstra)
- Make mutex::wait_lock irq safe (Juri Lelli)
- Expose __mutex_owner() (Juri Lelli)
- Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
Misc fixes and cleanups:
- Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
Disseldorp)
- Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
Siewior)
- Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
- remove the DOUBLE_TICK feature (Huang Shijie)
- fix the comment for PREEMPT_SHORT (Huang Shijie)
- Fix unnused variable warning (Christian Loehle)
- No PREEMPT_RT=y for all{yes,mod}config"
* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
sched: No PREEMPT_RT=y for all{yes,mod}config
riscv: add PREEMPT_LAZY support
sched, x86: Enable Lazy preemption
sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
sched: Add Lazy preemption model
sched: Add TIF_NEED_RESCHED_LAZY infrastructure
sched/ext: Remove sched_fork() hack
sched: Initialize idle tasks only once
sched: psi: pass enqueue/dequeue flags to psi callbacks directly
sched/uclamp: Fix unnused variable warning
sched: Split scheduler and execution contexts
sched: Split out __schedule() deactivate task logic into a helper
sched: Consolidate pick_*_task to task_is_pushable helper
sched: Add move_queued_task_locked helper
locking/mutex: Expose __mutex_owner()
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
sched: idle: Optimize the generic idle loop by removing needless memory barrier
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- The fair sched class currently has a bug where its balance() returns
true telling the sched core that it has tasks to run but then NULL
from pick_task(). This makes sched core call sched_ext's pick_task()
without preceding balance() which can lead to stalls in partial mode.
For now, work around by detecting the condition and forcing the CPU
to go through another scheduling cycle.
- Add a missing newline to an error message and fix drgn introspection
tool which went out of sync.
* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
sched_ext: Add a missing newline at the end of an error message
|
|
balance_scx()
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true indicating that it has tasks to run on the CPU and
thus terminating balance() calls but fails to actually find the next task to
run when pick_task() is called. In such cases, pick_task_scx() can be called
without preceding balance_scx().
Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org
|
|
Change fair to use resched_curr_lazy(), which, when the lazy
preemption model is selected, will set TIF_NEED_RESCHED_LAZY.
This LAZY bit will be promoted to the full NEED_RESCHED bit on tick.
As such, the average delay between setting LAZY and actually
rescheduling will be TICK_NSEC/2.
In short, Lazy preemption will delay preemption for fair class but
will function as Full preemption for all the other classes, most
notably the realtime (RR/FIFO/DEADLINE) classes.
The goal is to bridge the performance gap with Voluntary, such that we
might eventually remove that option entirely.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org
|
|
Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs
switched_from_fair()") overlooked that __setscheduler_prio(), now
__setscheduler_class() relies on p->policy for task_should_scx(), and
moved the call before __setscheduler_params() updates it, causing it
to be using the old p->policy value.
Resolve this by changing task_should_scx() to take the policy itself
instead of a task pointer, such that __sched_setscheduler() can pass
in the updated policy.
Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()")
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
|
|
What psi needs to do on each enqueue and dequeue has gotten more
subtle, and the generic sched code trying to distill this into a bool
for the callbacks is awkward.
Pass the flags directly and let psi parse them. For that to work, the
#include "stats.h" (which has the psi callback implementations) needs
to be below the flag definitions in "sched.h". Move that section
further down, next to some of the other accounting stuff.
This also puts the ENQUEUE_SAVE/RESTORE branch behind the psi jump
label, slightly reducing overhead when PSI=y but runtime disabled.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241014144358.GB1021@cmpxchg.org
|
|
Syzkaller robot reported KCSAN tripping over the
ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task().
The report noted that both pick_next_task_fair() and try_to_wake_up()
were concurrently trying to write to the same p->on_rq, violating the
assertion -- even though both paths hold rq->__lock.
The logical consequence is that both code paths end up holding a
different rq->__lock. And looking through ttwu(), this is possible
when the __block_task() 'p->on_rq = 0' store is visible to the ttwu()
'p->on_rq' load, which then assumes the task is not queued and
continues to migrate it.
Rearrange things such that __block_task() releases @p with the store
and no code thereafter will use @p again.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: syzbot+0ec1e96c2cdf5c0e512a@syzkaller.appspotmail.com
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20241023093641.GE16066@noisy.programming.kicks-ass.net
|
|
Overlapping fixes solving the same bug slightly differently:
7266f0a6d3bb fs/bcachefs: Fix __wait_on_freeing_inode() definition of waitqueue entry
3b80552e7057 bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
Use the upstream version.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Conflicts:
kernel/sched/ext.c
There's a context conflict between this upstream commit:
3fdb9ebcec10 sched_ext: Start schedulers with consistent p->scx.slice values
... and this fix in sched/urgent:
98442f0ccd82 sched: Fix delayed_dequeue vs switched_from_fair()
Resolve it.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Let's define the "scheduling context" as all the scheduler state
in task_struct for the task chosen to run, which we'll call the
donor task, and the "execution context" as all state required to
actually run the task.
Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the donor task selected to be scheduled, but use
the execution context of a different task to actually be run.
To this purpose, introduce rq->donor field to point to the
task_struct chosen from the runqueue by the scheduler, and will
be used for scheduler state, and preserve rq->curr to indicate
the execution context of the task that will actually be run.
This patch introduces the donor field as a union with curr, so it
doesn't cause the contexts to be split yet, but adds the logic to
handle everything separately.
[add additional comments and update more sched_class code to use
rq::proxy]
[jstultz: Rebased and resolved minor collisions, reworked to use
accessors, tweaked update_curr_common to use rq_proxy fixing rt
scheduling issues]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-8-jstultz@google.com
|
|
This patch consolidates rt and deadline pick_*_task functions to
a task_is_pushable() helper
This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
[jstultz: split out from larger chain migration patch,
renamed helper function]
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-6-jstultz@google.com
|
|
Switch logic that deactivates, sets the task cpu,
and reactivates a task on a different rq to use a
helper that will be later extended to push entire
blocked task chains.
This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
[jstultz: split out from larger chain migration patch]
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com
|
|
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.
These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.
However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.
Introduce the following changes to improve per-mm-cid cache locality:
- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
track of which mm_cid value was last used, and use it as a hint to
attempt re-allocating the same concurrency ID the next time this
mm/cpu needs to allocate a concurrency ID,
- Add a per-mm CPUs allowed mask, which keeps track of the union of
CPUs allowed for all threads belonging to this mm. This cpumask is
only set during the lifetime of the mm, never cleared, so it
represents the union of all the CPUs allowed since the beginning of
the mm lifetime (note that the mm_cpumask() is really arch-specific
and tailored to the TLB flush needs, and is thus _not_ a viable
approach for this),
- Add a per-mm nr_cpus_allowed to keep track of the weight of the
per-mm CPUs allowed mask (for fast access),
- Add a per-mm max_nr_cid to keep track of the highest number of
concurrency IDs allocated for the mm. This is used for expanding the
concurrency ID allocation within the upper bound defined by:
min(mm->nr_cpus_allowed, mm->mm_users)
When the next unused CID value reaches this threshold, stop trying
to expand the cid allocation and use the first available cid value
instead.
Spreading allocation to use all the cid values within the range
[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
improves cache locality while preserving mm_cid compactness within the
expected user limits,
- In __mm_cid_try_get, only return cid values within the range
[ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
prevents allocating cids above the number of allowed cpus in
rare scenarios where cid allocation races with a concurrent
remote-clear of the per-mm/cpu cid. This improvement is made
possible by the addition of the per-mm CPUs allowed mask,
- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
t->nr_cpus_allowed. This criterion was really meant to compare
the number of mm->mm_users to the number of CPUs allowed for the
entire mm. Therefore, the prior comparison worked fine when all
threads shared the same CPUs allowed mask, but not so much in
scenarios where those threads have different masks (e.g. each
thread pinned to a single CPU). This improvement is made
possible by the addition of the per-mm CPUs allowed mask.
* Benchmarks
Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread run
one after the other (no threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.
Testing configurations:
8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
Intermittent workload delays between threads: 200ms, 10ms.
Hardware:
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)
Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (cache-local mm_cid) / (mm_cid).
Intermittent workload delay: 200ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 1374 19289 1336 14.4x
24-core/24-L3 2423 26721 1594 16.7x
192-core/24-L3 2291 15826 2153 7.3x
384-thread/24-L3 1874 13234 1907 6.9x
Intermittent workload delay: 10ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 662 756 686 1.1x
24-core/24-L3 1378 3648 1035 3.5x
192-core/24-L3 1439 10833 1482 7.3x
384-thread/24-L3 1503 10570 1556 6.8x
[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]
[ This patch applies on top of v6.12-rc1. ]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
|
|
Commit 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
and its follow up fixes try to deal with a rather unfortunate
situation where is task is enqueued in a new class, even though it
shouldn't have been. Mostly because the existing ->switched_to/from()
hooks are in the wrong place for this case.
This all led to Paul being able to trigger failures at something like
once per 10k CPU hours of RCU torture.
For now, do the ugly thing and move the code to the right place by
ignoring the switch hooks.
Note: Clean up the whole sched_class::switch*_{to,from}() thing.
Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
|
|
was called
During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or
migration is disabled. sched_ext schedulers may perform operations such as
direct dispatch from ->select_task_rq() path and it is useful for them to
know whether ->select_task_rq() was skipped in the ->enqueue_task() path.
Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose
and end up assuming incorrectly that ->select_task_rq() was called for tasks
that are bound to a single CPU or migration disabled.
Make select_task_rq() indicate whether ->select_task_rq() was called by
setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that
to ENQUEUE_RQ_SELECTED for ->enqueue_task().
This will be used by sched_ext to fix ->select_task_rq() skip detection.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
When build with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED,
the idle member is not defined:
kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle'
3701 | if (!tg->idle)
| ^~
Fix this by putting 'idle' under new CONFIG_GROUP_SCHED_WEIGHT.
tj: Move idle field upward to avoid breaking up CONFIG_FAIR_GROUP_SCHED block.
Fixes: e179e80c5d4f ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Fix the following error when build with CONFIG_GROUP_SCHED_WEIGHT &&
!CONFIG_FAIR_GROUP_SCHED:
kernel/sched/core.c:9634:15: error: implicit declaration of function
'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration]
9634 | ret = sched_group_set_idle(css_tg(css), idle);
| ^~~~~~~~~~~~~~~~~~~~
| scx_group_set_idle
Fixes: e179e80c5d4f ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from
__update_blocked_others()") added update_other_load_avgs() in
kernel/sched/syscalls.c right above effective_cpu_util(). This location
didn't fit that well in the first place, and with 5d871a63997f ("sched/fair:
Move effective_cpu_util() and effective_cpu_util() in fair.c") moving
effective_cpu_util() to kernel/sched/fair.c, it looks even more out of
place.
Relocate the function to kernel/sched/pelt.c where all its callees are.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
|
|
While the BPF scheduler is being unloaded, the following warning messages
trigger sometimes:
NOHZ tick-stop error: local softirq work is pending, handler #80!!!
This is caused by the CPU entering idle while there are pending softirqs.
The main culprit is the bypassing state assertion not being synchronized
with rq operations. As the BPF scheduler cannot be trusted in the disable
path, the first step is entering the bypass mode where the BPF scheduler is
ignored and scheduling becomes global FIFO.
This is implemented by turning scx_ops_bypassing() true. However, the
transition isn't synchronized against anything and it's possible for enqueue
and dispatch paths to have different ideas on whether bypass mode is on.
Make each rq track its own bypass state with SCX_RQ_BYPASSING which is
modified while rq is locked.
This removes most of the NOHZ tick-stop messages but not completely. I
believe the stragglers are from the sched core bug where pick_task_scx() can
be called without preceding balance_scx(). Once that bug is fixed, we should
verify that all occurrences of this error message are gone too.
v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that
the per-cpu states are always synchronized with the global state.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Vernet <void@manifault.com>
|
|
Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. A BPF scheduler may not implement or implement only
subset of cgroup features. The implemented features can be indicated using
%SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features
that are not implemented, a warning is triggered.
While a BPF scheduler is being enabled and disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.
v7: - cgroup interface file visibility toggling is dropped in favor just
warning messages. Dynamically changing interface visiblity caused more
confusion than helping.
v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.
- Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
!CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.
v5: - Flipped the locking order between scx_cgroup_rwsem and
cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
documentation around locking.
- sched_move_task() takes an early exit if the source and destination
are identical. This triggered the warning in scx_cgroup_can_attach()
as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
migration path so that ops.cgroup_prep_move() is skipped for identity
migrations so that its invocations always match ops.cgroup_move()
one-to-one.
v4: - Example schedulers moved into their own patches.
- Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.
v3: - Make scx_example_pair switch all tasks by default.
- Convert to BPF inline iterators.
- scx_bpf_task_cgroup() is added to determine the current cgroup from
CPU controller's POV. This allows BPF schedulers to accurately track
CPU cgroup membership.
- scx_example_flatcg added. This demonstrates flattened hierarchy
implementation of CPU cgroup control and shows significant performance
improvement when cgroups which are nested multiple levels are under
competition.
v2: - Build fixes for different CONFIG combinations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
|
|
sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code
is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or
SCX may implement the feature, put the interface code behind the new
CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED.
This allows either sched class to enable the itnerface code without ading
more complex CONFIG tests.
When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares()
is added to support later CONFIG_CGROUP_SCHED_WEIGHT &&
!CONFIG_FAIR_GROUP_SCHED builds.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
A new BPF extensible sched_class will use css_tg() in the init and exit
paths to visit all task_groups by walking cgroups.
v4: __setscheduler_prio() is already exposed. Dropped from this patch.
v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup
mechanism.
v2: Expose SCHED_CHANGE_BLOCK() too and update the description.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
|
|
With sched_ext converted to use put_prev_task() for class switch detection,
there's no user of switch_class() left. Drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
|
|
SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
keep running the current task. It's not really a task property. Replace it
with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
the usage. Also, the existing clearing rule is unnecessarily strict and
makes it difficult to use with core-sched. Just clear it on entry to
balance_one().
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
- Resolve trivial context conflicts from dl_server clearing being moved
around.
- Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to
match sched/core.
- Merge sched_class->switch_class() addition from sched_ext with
tip/sched/core changes in __pick_next_task().
- Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous
behavior where sched_class->put_prev_task() was called before
sched_class->pick_next_task().
While this makes sched_ext build and function, the behavior is not in line
with other sched classes. The follow-up patches will address the
discrepancies and remove sched_class->switch_class().
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In order to tell the previous sched_class what the next task is, add
put_prev_task(.next).
Notable SCX will use this to:
1) determine the next task will leave the SCX sched class and push
the current task to another CPU if possible.
2) statistics on how often and which other classes preempt it
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
|
|
When a task is selected through a dl_server, it will have p->dl_server
set, such that it can account runtime to the dl_server, see
update_curr_task().
Currently p->dl_server is set in pick*task() whenever it goes through
the dl_server, clearing it is a bit of a mess though. The trivial
solution is clearing it on the final put (now that we have this
location).
However, this gives a problem when:
p = pick_task(rq);
if (p)
put_prev_set_next_task(rq, prev, next);
picks the same task but through a different path, notably when it goes
from picking through the dl_server to a direct pick or vice-versa. In
that case we cannot readily determine wether we should clear or
preserve p->dl_server.
An additional complication is pick_*task() setting p->dl_server for a
remote pick, it might still need to update runtime before it schedules
the core_pick.
Close all these holes and remove all the random clearing of
p->dl_server by:
- having pick_*task() manage rq->dl_server
- having the final put_prev_task() clear p->dl_server
- having the first set_next_task() set p->dl_server = rq->dl_server
- complicate the core_sched code to save/restore rq->dl_server where
appropriate.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
|
|
Ensure the last put_prev_task() and the first set_next_task() always
go together.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
|
|
The current rule is that:
pick_next_task() := pick_task() + set_next_task(.first = true)
And many classes implement it directly as such. Change things around
to make pick_next_task() optional while also changing the definition to:
pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)
The reason is that sched_ext would like to have a 'final' call that
knows the next task. By placing put_prev_task() right next to
set_next_task() (as it already is for sched_core) this becomes
trivial.
As a bonus, this is a nice cleanup on its own.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
|
|
Abide by the simple rule:
pick_next_task() := pick_task() + set_next_task(.first = true)
This allows us to trivially get rid of server_pick_next() and things
collapse nicely.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.837303391@infradead.org
|
|
Turns out the core_sched bits forgot to use the
set_next_task(.first=true) variant. Notably:
pick_next_task() := pick_task() + set_next_task(.first = true)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org
|
|
To receive 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail")
which makes sched_class.dequeue_task() return bool instead of void. This
leads to compile breakage and will be fixed by a follow-up patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Note that tasks that are kept on the runqueue to burn off negative
lag, are not in fact runnable anymore, they'll get dequeued the moment
they get picked.
As such, don't count this time towards runnable.
Thanks to Valentin for spotting I had this backwards initially.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.514088302@infradead.org
|
|
Since special task states must not suffer spurious wakeups, and the
proposed delayed dequeue can cause exactly these (under some boundary
conditions), propagate this knowledge into dequeue_task() such that it
can do the right thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org
|
|
While most of the delayed dequeue code can be done inside the
sched_class itself, there is one location where we do not have an
appropriate hook, namely ttwu_runnable().
Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking
delayed dequeue tasks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.200000445@infradead.org
|
|
As a preparation for dequeue_task() failing, and a second code-path
needing to take care of the 'success' path, split out the DEQEUE_SLEEP
path from deactivate_task().
Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING
ordering fail.
Fixed-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.086192709@infradead.org
|
|
Change the function signature of sched_class::dequeue_task() to return
a boolean, allowing future patches to 'fail' dequeue.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
|
|
Since commit e8f331bcc270 ("sched/smp: Use lag to simplify
cross-runqueue placement") the min_vruntime_copy is no longer used.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.395297941@infradead.org
|
|
task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are
subtle differences. It currently open codes all the tests. This is
cumbersome to understand and error-prone in case the intersecting tests need
to be updated.
Factor out the common part - testing whether the task is allowed on the CPU
at all regardless of the CPU state - into task_allowed_on_cpu() and make
both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the
code is now linked between the two and each contains only the extra tests
that differ between them, it's less error-prone when the conditions need to
be updated. Also, improve the comment to explain why they are different.
v2: Replace accidental "extern inline" with "static inline" (Peter).
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
|
|
On SMP, SCX performs dispatch from sched_class->balance(). As balance() was
not available in UP, it instead called the internal balance function from
put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is
rather nasty.
Enabling sched_class->balance() on UP shouldn't cause any meaningful
overhead. Enable balance() on UP and drop the ugly workaround.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.12
Pull tip/sched/core to resolve the following four conflicts. While 2-4 are
simple context conflicts, 1 is a bit subtle and easy to resolve incorrectly.
1. 2c8d046d5d51 ("sched: Add normal_policy()")
vs.
faa42d29419d ("sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy")
The former converts direct test on p->policy to use the helper
normal_policy(). The latter moves the p->policy test to a different
location. Resolve by converting the test on p->plicy in the new location to
use normal_policy().
2. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class")
vs.
a110a81c52a9 ("sched/deadline: Deferrable dl server")
Both add calls to put_prev_task_idle() and set_next_task_idle(). Simple
context conflict. Resolve by taking changes from both.
3. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class")
vs.
c245910049d0 ("sched/core: Add clearing of ->dl_server in put_prev_task_balance()")
The former changes for_each_class() itertion to use for_each_active_class().
The latter moves away the adjacent dl_server handling code. Simple context
conflict. Resolve by taking changes from both.
4. 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
vs.
31b164e2e4af ("sched/smt: Introduce sched_smt_present_inc/dec() helper")
2f027354122f ("sched/core: Introduce sched_set_rq_on/offline() helper")
The former adds scx_rq_deactivate() call. The latter two change code around
it. Simple context conflict. Resolve by taking changes from both.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Linux 6.11-rc1
|
|
Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED.
Enable fair_server with parameters equivalent to RT throttling.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
|
|
* Use simple CFS pick_task for DL pick_task
DL server's pick_task calls CFS's pick_next_task_fair(), this is wrong
because core scheduling's pick_task only calls CFS's pick_task() for
evaluation / checking of the CFS task (comparing across CPUs), not for
actually affirmatively picking the next task. This causes RB tree
corruption issues in CFS that were found by syzbot.
* Make pick_task_fair clear DL server
A DL task pick might set ->dl_server, but it is possible the task will
never run (say the other HT has a stop task). If the CFS task is picked
in the future directly (say without DL server), ->dl_server will be
set. So clear it in pick_task_fair().
This fixes the KASAN issue reported by syzbot in set_next_entity().
(DL refactoring suggestions by Vineeth Pillai).
Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
|
|
Add an interface for fair server setup on debugfs.
Each CPU has two files under /debug/sched/fair_server/cpu{ID}:
- runtime: set runtime in ns
- period: set period in ns
This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set
bounds on admission control.
The interface also add the server to the dl bandwidth accounting.
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
|
|
Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after
running for a long period without leaving space for fair tasks.
The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.
For example, an IRQ dispatches two tasks on an idle system, a fair
and an RT. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.
By setting the defer option, the dl_server will dispatch an
SCHED_DEADLINE reservation with replenished runtime, but throttled.
The dl_timer will be set for the defer time at (period - runtime) ns
from start time. Thus boosting the fair rq at defer time.
If the fair scheduler has the opportunity to run while waiting
for defer time, the dl server runtime will be consumed. If
the runtime is completely consumed before the defer time, the
server will be replenished while still in a throttled state. Then,
the dl_timer will be reset to the new defer time
If the fair server reaches the defer time without consuming
its runtime, the server will start running, following CBS rules
(thus without breaking SCHED_DEADLINE). Then the server will
continue the running state (without deferring) until it fair
tasks are able to execute as regular fair scheduler (end of
the starvation).
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/dd175943c72533cd9f0b87767c6499204879cc38.1716811044.git.bristot@kernel.org
|
|
Use deadline servers to service fair tasks.
This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
|
|
nr_spread_over tracks the number of instances where the difference
between a scheduling entity's virtual runtime and the minimum virtual
runtime in the runqueue exceeds three times the scheduler latency,
indicating significant disparity in task scheduling.
Commit that removed its usage: 5e963f2bd: sched/fair: Commit to EEVDF
cfs_rq->exec_clock was used to account for time spent executing tasks.
Commit that removed its usage: 5d69eca542ee1 sched: Unify runtime
accounting across classes
cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore in
eevdf. Remove them from struct cfs_rq.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20240717143342.593262-1-zhouchuyi@bytedance.com
|
|
In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue().
While waking up a task, as ops.select_cpu() can pick any CPU and both
ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
do whatever it wants to do. However, while a task is being enqueued for a
different reason, e.g. after its slice expiration, only ops.enqueue() is
called and there's no way for the BPF scheduler to directly dispatch to the
local DSQ of a remote CPU. This gap in API forces schedulers into
work-arounds which are not straightforward or optimal such as skipping
direct dispatches in such cases.
Implement deferred enqueueing to allow directly dispatching to the local DSQ
of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
safely released, the tasks are taken off the list and queued on the target
local DSQs using dispatch_to_local_dsq().
v2: - Add missing return after queue_balance_callback() in
schedule_deferred(). (David).
- dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
and thus no longer takes @rf. Updated accordingly.
- UP build warning fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
|
|
SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
the rq is currently enqueueing for a wakeup. This will be used to implement
direct dispatching to local DSQ of another CPU.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|