Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Improve the default select_cpu() implementation making it topology
aware and handle WAKE_SYNC better.
- set_arg_maybe_null() was used to inform the verifier which ops args
could be NULL in a rather hackish way. Use the new __nullable CFI
stub tags instead.
- On Sapphire Rapids multi-socket systems, a BPF scheduler, by
hammering on the same queue across sockets, could live-lock the
system to the point where the system couldn't make reasonable forward
progress.
This could lead to soft-lockup triggered resets or stalling out
bypass mode switch and thus BPF scheduler ejection for tens of
minutes if not hours. After trying a number of mitigations, the
following set worked reliably:
- Injecting artificial cpu_relax() loops in two places while
sched_ext is trying to turn on the bypass mode.
- Triggering scheduler ejection when soft-lockup detection is
imminent (a quarter of threshold left).
While not the prettiest, the impact both in terms of code complexity
and overhead is minimal.
- A common complaint on the API is the overuse of the word "dispatch"
and the confusion around "consume". This is due to how the dispatch
queues became more generic over time. Rename the affected kfuncs for
clarity. Thanks to BPF's compatibility features, this change can be
made in a way that's both forward and backward compatible. The
compatibility code will be dropped in a few releases.
- Other misc changes
* tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (21 commits)
sched_ext: Replace scx_next_task_picked() with switch_class() in comment
sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
sched_ext: Clarify sched_ext_ops table for userland scheduler
sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
sched_ext: Avoid live-locking bypass mode switching
sched_ext: Fix incorrect use of bitwise AND
sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
sched_ext: Introduce NUMA awareness to the default idle selection policy
sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
sched_ext: Rename CFI stubs to names that are recognized by BPF
sched_ext: Introduce LLC awareness to the default idle selection policy
sched_ext: Clarify ops.select_cpu() for single-CPU tasks
sched_ext: improve WAKE_SYNC behavior for default idle CPU selection
sched_ext: Use btf_ids to resolve task_struct
sched/ext: Use tg_cgroup() to elieminate duplicate code
sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core facilities:
- Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
optimizes fair-class preemption by delaying preemption requests to
the tick boundary, while working as full preemption for
RR/FIFO/DEADLINE classes. (Peter Zijlstra)
- x86: Enable Lazy preemption (Peter Zijlstra)
- riscv: Enable Lazy preemption (Jisheng Zhang)
- Initialize idle tasks only once (Thomas Gleixner)
- sched/ext: Remove sched_fork() hack (Thomas Gleixner)
Fair scheduler:
- Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
Idle loop:
- Optimize the generic idle loop by removing unnecessary memory
barrier (Zhongqiu Han)
RSEQ:
- Improve cache locality of RSEQ concurrency IDs for intermittent
workloads (Mathieu Desnoyers)
Waitqueues:
- Make wake_up_{bit,var} less fragile (Neil Brown)
PSI:
- Pass enqueue/dequeue flags to psi callbacks directly (Johannes
Weiner)
Preparatory patches for proxy execution:
- Add move_queued_task_locked helper (Connor O'Brien)
- Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
- Split out __schedule() deactivate task logic into a helper (John
Stultz)
- Split scheduler and execution contexts (Peter Zijlstra)
- Make mutex::wait_lock irq safe (Juri Lelli)
- Expose __mutex_owner() (Juri Lelli)
- Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
Misc fixes and cleanups:
- Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
Disseldorp)
- Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
Siewior)
- Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
- remove the DOUBLE_TICK feature (Huang Shijie)
- fix the comment for PREEMPT_SHORT (Huang Shijie)
- Fix unnused variable warning (Christian Loehle)
- No PREEMPT_RT=y for all{yes,mod}config"
* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
sched: No PREEMPT_RT=y for all{yes,mod}config
riscv: add PREEMPT_LAZY support
sched, x86: Enable Lazy preemption
sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
sched: Add Lazy preemption model
sched: Add TIF_NEED_RESCHED_LAZY infrastructure
sched/ext: Remove sched_fork() hack
sched: Initialize idle tasks only once
sched: psi: pass enqueue/dequeue flags to psi callbacks directly
sched/uclamp: Fix unnused variable warning
sched: Split scheduler and execution contexts
sched: Split out __schedule() deactivate task logic into a helper
sched: Consolidate pick_*_task to task_is_pushable helper
sched: Add move_queued_task_locked helper
locking/mutex: Expose __mutex_owner()
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
sched: idle: Optimize the generic idle loop by removing needless memory barrier
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
"One more fix for v6.12-rc7
ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing
the operation to call kfuncs which shouldn't be allowed. Fix it by
using SCX_KF_REST instead, which is trivial and low risk"
* tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
|
|
scx_next_task_picked() has been replaced with siwtch_class(), but comment
is still referencing old one, so replace it.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
ops.cpu_acquire() is currently called with 0 kf_maks which is interpreted as
SCX_KF_UNLOCKED which allows all unlocked kfuncs, but ops.cpu_acquire() is
called from balance_one() under the rq lock and should only be allowed call
kfuncs that are safe under the rq lock. Update it to use SCX_KF_REST.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Zhao Mengmeng <zhaomzhao@126.com>
Link: http://lkml.kernel.org/r/ZzYvf2L3rlmjuKzh@slm.duckdns.org
Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- The fair sched class currently has a bug where its balance() returns
true telling the sched core that it has tasks to run but then NULL
from pick_task(). This makes sched core call sched_ext's pick_task()
without preceding balance() which can lead to stalls in partial mode.
For now, work around by detecting the condition and forcing the CPU
to go through another scheduling cycle.
- Add a missing newline to an error message and fix drgn introspection
tool which went out of sync.
* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
sched_ext: Add a missing newline at the end of an error message
|
|
scx_bpf_dsq_move[_vtime]*()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":
- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the third set of renames. Compatibility is maintained
by:
- The previous kfunc names are still provided by the kernel so that old
binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall
back to the old names when running on older kernels. They also trigger
build error if old names are used for new builds.
- scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT
macros as they were introduced during v6.12 cycle. Wrap new API in
__COMPAT macros too and trigger build errors on both __COMPAT prefixed and
naked usages of the old names.
The compat features will be dropped after v6.15.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
|
|
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":
- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the second rename. Compatibility is maintained by:
- The previous kfunc names are still provided by the kernel so that old
binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall
back to the old names when running on older kernels. They also trigger
build error if old names are used for new builds.
The compat features will be dropped after v6.15.
v2: Comment and documentation updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
|
|
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":
- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the first set of renames. Compatibility is maintained
by:
- The previous kfunc names are still provided by the kernel so that old
binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall
back to the old names when running on older kernels. They also trigger
build error if old names are used for new builds.
The compat features will be dropped after v6.15.
v2: Documentation updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
|
|
balance_scx()
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true indicating that it has tasks to run on the CPU and
thus terminating balance() calls but fails to actually find the next task to
run when pick_task() is called. In such cases, pick_task_scx() can be called
without preceding balance_scx().
Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org
|
|
4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
added four kfuncs for dispatching while iterating. They are allowed from the
dispatch and unlocked contexts but two of the kfuncs were only added in the
dispatch section. Add missing declarations in the unlocked section.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
|
|
When getting an LLC CPU mask in the default CPU selection policy,
scx_select_cpu_dfl(), a pointer to the sched_domain is dereferenced
using rcu_read_lock() without holding rcu_read_lock(). Such an unprotected
dereference often causes the following warning and can cause an invalid
memory access in the worst case.
Therefore, protect dereference of a sched_domain pointer using a pair
of rcu_read_lock() and unlock().
[ 20.996135] =============================
[ 20.996345] WARNING: suspicious RCU usage
[ 20.996563] 6.11.0-virtme #17 Tainted: G W
[ 20.996576] -----------------------------
[ 20.996576] kernel/sched/ext.c:3323 suspicious rcu_dereference_check() usage!
[ 20.996576]
[ 20.996576] other info that might help us debug this:
[ 20.996576]
[ 20.996576]
[ 20.996576] rcu_scheduler_active = 2, debug_locks = 1
[ 20.996576] 4 locks held by kworker/8:1/140:
[ 20.996576] #0: ffff8b18c00dd348 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x4a0/0x590
[ 20.996576] #1: ffffb3da01f67e58 ((work_completion)(&dev->power.work)){+.+.}-{0:0}, at: process_one_work+0x1ba/0x590
[ 20.996576] #2: ffffffffa316f9f0 (&rcu_state.gp_wq){..-.}-{2:2}, at: swake_up_one+0x15/0x60
[ 20.996576] #3: ffff8b1880398a60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x59/0x7d0
[ 20.996576]
[ 20.996576] stack backtrace:
[ 20.996576] CPU: 8 UID: 0 PID: 140 Comm: kworker/8:1 Tainted: G W 6.11.0-virtme #17
[ 20.996576] Tainted: [W]=WARN
[ 20.996576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 20.996576] Workqueue: pm pm_runtime_work
[ 20.996576] Sched_ext: simple (disabling+all), task: runnable_at=-6ms
[ 20.996576] Call Trace:
[ 20.996576] <IRQ>
[ 20.996576] dump_stack_lvl+0x6f/0xb0
[ 20.996576] lockdep_rcu_suspicious.cold+0x4e/0x96
[ 20.996576] scx_select_cpu_dfl+0x234/0x260
[ 20.996576] select_task_rq_scx+0xfb/0x190
[ 20.996576] select_task_rq+0x47/0x110
[ 20.996576] try_to_wake_up+0x110/0x7d0
[ 20.996576] swake_up_one+0x39/0x60
[ 20.996576] rcu_core+0xb08/0xe50
[ 20.996576] ? srso_alias_return_thunk+0x5/0xfbef5
[ 20.996576] ? mark_held_locks+0x40/0x70
[ 20.996576] handle_softirqs+0xd3/0x410
[ 20.996576] irq_exit_rcu+0x78/0xa0
[ 20.996576] sysvec_apic_timer_interrupt+0x73/0x80
[ 20.996576] </IRQ>
[ 20.996576] <TASK>
[ 20.996576] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 20.996576] RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x70
[ 20.996576] Code: f5 53 48 8b 74 24 10 48 89 fb 48 83 c7 18 e8 11 b4 36 ff 48 89 df e8 99 0d 37 ff f7 c5 00 02 00 00 75 17 9c 58 f6 c4 02 75 2b <65> ff 0d 5b 55 3c 5e 74 16 5b 5d e9 95 8e 28 00 e8 a5 ee 44 ff 9c
[ 20.996576] RSP: 0018:ffffb3da01f67d20 EFLAGS: 00000246
[ 20.996576] RAX: 0000000000000002 RBX: ffffffffa4640220 RCX: 0000000000000040
[ 20.996576] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa1c7b27b
[ 20.996576] RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
[ 20.996576] R10: 0000000000000001 R11: 000000000000021c R12: 0000000000000246
[ 20.996576] R13: ffff8b1881363958 R14: 0000000000000000 R15: ffff8b1881363800
[ 20.996576] ? _raw_spin_unlock_irqrestore+0x4b/0x70
[ 20.996576] serial_port_runtime_resume+0xd4/0x1a0
[ 20.996576] ? __pfx_serial_port_runtime_resume+0x10/0x10
[ 20.996576] __rpm_callback+0x44/0x170
[ 20.996576] ? __pfx_serial_port_runtime_resume+0x10/0x10
[ 20.996576] rpm_callback+0x55/0x60
[ 20.996576] ? __pfx_serial_port_runtime_resume+0x10/0x10
[ 20.996576] rpm_resume+0x582/0x7b0
[ 20.996576] pm_runtime_work+0x7c/0xb0
[ 20.996576] process_one_work+0x1fb/0x590
[ 20.996576] worker_thread+0x18e/0x350
[ 20.996576] ? __pfx_worker_thread+0x10/0x10
[ 20.996576] kthread+0xe2/0x110
[ 20.996576] ? __pfx_kthread+0x10/0x10
[ 20.996576] ret_from_fork+0x34/0x50
[ 20.996576] ? __pfx_kthread+0x10/0x10
[ 20.996576] ret_from_fork_asm+0x1a/0x30
[ 20.996576] </TASK>
[ 21.056592] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Update the comments in sched_ext_ops to clarify this table is for
a BPF scheduler and a userland scheduler should also rely on the
sched_ext_ops table through the BPF scheduler.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly
behaving BPF scheduler can live-lock the system by making multiple CPUs bang
on the same DSQ to the point where soft-lockup detection triggers before
SCX's own watchdog can take action. It also seems possible that the machine
can be live-locked enough to prevent scx_ops_helper, which is an RT task,
from running in a timely manner.
Implement scx_softlockup() which is called when three quarters of
soft-lockup threshold has passed. The function immediately enables the ops
breather and triggers an ops error to initiate ejection of the BPF
scheduler.
The previous and this patch combined enable the kernel to reliably recover
the system from live-lock conditions that can be triggered by a poorly
behaving BPF scheduler on Intel dual socket systems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
|
|
A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
banging on the same DSQ on a large NUMA system to the point where switching
to the bypass mode can take a long time. Turning on the bypass mode requires
dequeueing and re-enqueueing currently runnable tasks, if the DSQs that they
are on are live-locked, this can take tens of seconds cascading into other
failures. This was observed on 2 x Intel Sapphire Rapids machines with 224
logical CPUs.
Inject artifical delays while the bypass mode is switching to guarantee
timely completion.
While at it, move __scx_ops_bypass_lock into scx_ops_bypass() and rename it
to bypass_lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Valentin Andrei <vandrei@meta.com>
Reported-by: Patrick Lu <patlu@meta.com>
|
|
Pull sched_ext/for-6.12-fixes to receive 0e7ffff1b811 ("scx: Fix raciness in
scx_ops_bypass()"). Planned updates for scx_ops_bypass() depends on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There is no reason to use a bitwise AND when checking the conditions to
enable NUMA optimization for the built-in CPU idle selection policy, so
use a logical AND instead.
Fixes: f6ce6b949304 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/lkml/20241108181753.GA2681424@thelio-3990X/
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.
Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.
Therefore, detect overlapping domains and enable topology optimizations
only when necessary.
Moreover, rely on the online CPUs for this detection logic, instead of
using the possible CPUs.
Fixes: 860a45219bce ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Instead of solving the underlying problem of the double invocation of
__sched_fork() for idle tasks, sched-ext decided to hack around the issue
by partially clearing out the entity struct to preserve the already
enqueued node. A provided analysis and solution has been ignored for four
months.
Now that someone else has taken care of cleaning it up, remove the
disgusting hack and clear out the full structure. Remove the comment in the
structure declaration as well, as there is no requirement for @node being
the last element anymore.
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/87ldy82wkc.ffs@tglx
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
both try to write to the same task, even though both paths hold a
runqueue lock, but obviously from different runqueues.
The problem is that the store to task::on_rq in __block_task() is
visible to try_to_wake_up() which assumes that the task is not
queued. Both sides then operate on the same task.
Cure it by rearranging __block_task() so the the store to task::on_rq
is the last operation on the task.
- Prevent a potential NULL pointer dereference in task_numa_work()
task_numa_work() iterates the VMAs of a process. A concurrent unmap
of the address space can result in a NULL pointer return from
vma_next() which is unchecked.
Add the missing NULL pointer check to prevent this.
- Operate on the correct scheduler policy in task_should_scx()
task_should_scx() returns true when a task should be handled by sched
EXT. It checks the tasks scheduling policy.
This fails when the check is done before a policy has been set.
Cure it by handing the policy into task_should_scx() so it operates
on the requested value.
- Add the missing handling of sched EXT in the delayed dequeue
mechanism. This was simply forgotten.
* tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/ext: Fix scx vs sched_delayed
sched: Pass correct scheduling policy to __setscheduler_class
sched/numa: Fix the potential null pointer dereference in task_numa_work()
sched: Fix pick_next_task_fair() vs try_to_wake_up() race
|
|
Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs
switched_from_fair()") forgot about scx :/
Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20241030104934.GK14555@noisy.programming.kicks-ass.net
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Instances of scx_ops_bypass() could race each other leading to
misbehavior. Fix by protecting the operation with a spinlock.
- selftest and userspace header fixes
* tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix enq_last_no_enq_fails selftest
sched_ext: Make cast_mask() inline
scx: Fix raciness in scx_ops_bypass()
scx: Fix exit selftest to use custom DSQ
sched_ext: Fix function pointer type mismatches in BPF selftests
selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ
|
|
Similarly to commit dfa4ed29b18c ("sched_ext: Introduce LLC awareness to
the default idle selection policy"), extend the built-in idle CPU
selection policy to also prioritize CPUs within the same NUMA node.
With this change applied, the built-in CPU idle selection policy follows
this logic:
- always prioritize CPUs from fully idle SMT cores,
- select the same CPU if possible,
- select a CPU within the same LLC domain,
- select a CPU within the same NUMA node.
Both NUMA and LLC awareness features are enabled only when the system
has multiple NUMA nodes or multiple LLC domains.
In the future, we may want to improve the NUMA node selection to account
the node distance from prev_cpu. Currently, the logic only tries to keep
tasks running on the same NUMA node. If all CPUs within a node are busy,
the next NUMA node is chosen randomly.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs
switched_from_fair()") overlooked that __setscheduler_prio(), now
__setscheduler_class() relies on p->policy for task_should_scx(), and
moved the call before __setscheduler_params() updates it, causing it
to be using the old p->policy value.
Resolve this by changing task_should_scx() to take the policy itself
instead of a task pointer, such that __sched_setscheduler() can pass
in the updated policy.
Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()")
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
|
|
scx_ops_bypass() can currently race on the ops enable / disable path as
follows:
1. scx_ops_bypass(true) called on enable path, bypass depth is set to 1
2. An op on the init path exits, which schedules scx_ops_disable_workfn()
3. scx_ops_bypass(false) is called on the disable path, and bypass depth
is decremented to 0
4. kthread is scheduled to execute scx_ops_disable_workfn()
5. scx_ops_bypass(true) called, bypass depth set to 1
6. scx_ops_bypass() races when iterating over CPUs
While it's not safe to take any blocking locks on the bypass path, it is
safe to take a raw spinlock which cannot be preempted. This patch therefore
updates scx_ops_bypass() to use a raw spinlock to synchronize, and changes
scx_ops_bypass_depth to be a regular int.
Without this change, we observe the following warnings when running the
'exit' sched_ext selftest (sometimes requires a couple of runs):
.[root@virtme-ng sched_ext]# ./runner -t exit
===== START =====
TEST: exit
...
[ 14.935078] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4332 scx_ops_bypass+0x1ca/0x280
[ 14.935126] Modules linked in:
[ 14.935150] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Not tainted 6.11.0-virtme #24
[ 14.935192] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 14.935242] Sched_ext: exit (enabling+all)
[ 14.935244] RIP: 0010:scx_ops_bypass+0x1ca/0x280
[ 14.935300] Code: ff ff ff e8 48 96 10 00 fb e9 08 ff ff ff c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 0f 0b 90 41 8b 84 24 24
[ 14.935394] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010002
[ 14.935424] RAX: 0000000000000009 RBX: 0000000000000001 RCX: 00000000e3fb8b2a
[ 14.935465] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff88a4c080
[ 14.935512] RBP: 0000000000009b56 R08: 0000000000000004 R09: 00000003f12e520a
[ 14.935555] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300
[ 14.935598] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018
[ 14.935642] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000
[ 14.935684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.935721] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0
[ 14.935765] PKRU: 55555554
[ 14.935782] Call Trace:
[ 14.935802] <TASK>
[ 14.935823] ? __warn+0xce/0x220
[ 14.935850] ? scx_ops_bypass+0x1ca/0x280
[ 14.935881] ? report_bug+0xc1/0x160
[ 14.935909] ? handle_bug+0x61/0x90
[ 14.935934] ? exc_invalid_op+0x1a/0x50
[ 14.935959] ? asm_exc_invalid_op+0x1a/0x20
[ 14.935984] ? raw_spin_rq_lock_nested+0x15/0x30
[ 14.936019] ? scx_ops_bypass+0x1ca/0x280
[ 14.936046] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.936081] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[ 14.936111] scx_ops_disable_workfn+0x146/0xac0
[ 14.936142] ? finish_task_switch+0xa9/0x2c0
[ 14.936172] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.936211] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[ 14.936244] kthread_worker_fn+0x101/0x2c0
[ 14.936268] ? __pfx_kthread_worker_fn+0x10/0x10
[ 14.936299] kthread+0xec/0x110
[ 14.936327] ? __pfx_kthread+0x10/0x10
[ 14.936351] ret_from_fork+0x37/0x50
[ 14.936374] ? __pfx_kthread+0x10/0x10
[ 14.936400] ret_from_fork_asm+0x1a/0x30
[ 14.936427] </TASK>
[ 14.936443] irq event stamp: 21002
[ 14.936467] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0
[ 14.936521] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280
[ 14.936571] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.936622] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.936672] ---[ end trace 0000000000000000 ]---
[ 14.953282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 14.953352] ------------[ cut here ]------------
[ 14.953383] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4335 scx_ops_bypass+0x1d8/0x280
[ 14.953428] Modules linked in:
[ 14.953453] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Tainted: G W 6.11.0-virtme #24
[ 14.953505] Tainted: [W]=WARN
[ 14.953527] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 14.953574] RIP: 0010:scx_ops_bypass+0x1d8/0x280
[ 14.953603] Code: c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 0f 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 92 f3 0f 1e fa 49 8d 84 24 f0
[ 14.953693] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010046
[ 14.953722] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
[ 14.953763] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fc5fec31318
[ 14.953804] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 14.953845] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300
[ 14.953888] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018
[ 14.953934] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000
[ 14.953974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.954009] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0
[ 14.954052] PKRU: 55555554
[ 14.954068] Call Trace:
[ 14.954085] <TASK>
[ 14.954102] ? __warn+0xce/0x220
[ 14.954126] ? scx_ops_bypass+0x1d8/0x280
[ 14.954150] ? report_bug+0xc1/0x160
[ 14.954178] ? handle_bug+0x61/0x90
[ 14.954203] ? exc_invalid_op+0x1a/0x50
[ 14.954226] ? asm_exc_invalid_op+0x1a/0x20
[ 14.954250] ? raw_spin_rq_lock_nested+0x15/0x30
[ 14.954285] ? scx_ops_bypass+0x1d8/0x280
[ 14.954311] ? __mutex_unlock_slowpath+0x3a/0x260
[ 14.954343] scx_ops_disable_workfn+0xa3e/0xac0
[ 14.954381] ? __pfx_scx_ops_disable_workfn+0x10/0x10
[ 14.954413] kthread_worker_fn+0x101/0x2c0
[ 14.954442] ? __pfx_kthread_worker_fn+0x10/0x10
[ 14.954479] kthread+0xec/0x110
[ 14.954507] ? __pfx_kthread+0x10/0x10
[ 14.954530] ret_from_fork+0x37/0x50
[ 14.954553] ? __pfx_kthread+0x10/0x10
[ 14.954576] ret_from_fork_asm+0x1a/0x30
[ 14.954603] </TASK>
[ 14.954621] irq event stamp: 21002
[ 14.954644] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0
[ 14.954686] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280
[ 14.954735] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.954782] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[ 14.954829] ---[ end trace 0000000000000000 ]---
[ 15.022283] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 15.092282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 15.149282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
ok 1 exit #
===== END =====
And with it, the test passes without issue after 1000s of runs:
.[root@virtme-ng sched_ext]# ./runner -t exit
===== START =====
TEST: exit
DESCRIPTION: Verify we can cleanly exit a scheduler in multiple places
OUTPUT:
[ 7.412856] sched_ext: BPF scheduler "exit" enabled
[ 7.427924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.466677] sched_ext: BPF scheduler "exit" enabled
[ 7.475923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.512803] sched_ext: BPF scheduler "exit" enabled
[ 7.532924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.586809] sched_ext: BPF scheduler "exit" enabled
[ 7.595926] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.661923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[ 7.723923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
ok 1 exit #
===== END =====
=============================
RESULTS:
PASSED: 1
SKIPPED: 0
FAILED: 0
Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
ops.dispatch() and ops.yield() may be fed a NULL task_struct pointer.
set_arg_maybe_null() is used to tell the verifier that they should be NULL
checked before being dereferenced. BPF now has an a lot prettier way to
express this - tagging arguments in CFI stubs with __nullable. Replace
set_arg_maybe_null() with __nullable CFI stub tags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
CFI stubs can be used to tag arguments with __nullable (and possibly other
tags in the future) but for that to work the CFI stubs must have names that
are recognized by BPF. Rename them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
Rely on the scheduler topology information to implement basic LLC
awareness in the sched_ext build-in idle selection policy.
This allows schedulers using the built-in policy to make more informed
decisions when selecting an idle CPU in systems with multiple LLCs, such
as NUMA systems or chiplet-based architectures, and it helps keep tasks
within the same LLC domain, thereby improving cache locality.
For efficiency, LLC awareness is applied only to tasks that can run on
all the CPUs in the system for now. If a task's affinity is modified
from user space, it's the responsibility of user space to choose the
appropriate optimized scheduling domain.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Update ops.select_cpu() documentation to clarify that this method is not
called for tasks that are restricted to run on a single CPU, as these
tasks do not have the option to select a different CPU.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In the sched_ext built-in idle CPU selection logic, when handling a
WF_SYNC wakeup, we always attempt to migrate the task to the waker's
CPU, as the waker is expected to yield the CPU after waking the task.
However, it may be preferable to keep the task on its previous CPU if
the waker's CPU is cache-affine.
The same approach is also used by the fair class and in other scx
schedulers, like scx_rusty and scx_bpfland.
Therefore, apply the same logic to the built-in idle CPU selection
policy as well.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Save the searching time during bpf_scx_init.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Conflicts:
kernel/sched/ext.c
There's a context conflict between this upstream commit:
3fdb9ebcec10 sched_ext: Start schedulers with consistent p->scx.slice values
... and this fix in sched/urgent:
98442f0ccd82 sched: Fix delayed_dequeue vs switched_from_fair()
Resolve it.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
As described in commit b07996c7abac ("sched_ext: Don't hold
scx_tasks_lock for too long"), we're doing a cond_resched() every 32
calls to scx_task_iter_next() to avoid RCU and other stalls. That commit
also added a cpu_relax() to the codepath where we drop and reacquire the
lock, but as Waiman described in [0], cpu_relax() should only be
necessary in busy loops to avoid pounding on a cacheline (or to allow a
hypertwin to more fully utilize a core).
Let's remove the unnecessary cpu_relax().
[0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/
Cc: Waiman Long <llong@redhat.com>
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Commit 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
and its follow up fixes try to deal with a rather unfortunate
situation where is task is enqueued in a new class, even though it
shouldn't have been. Mostly because the existing ->switched_to/from()
hooks are in the wrong place for this case.
This all led to Paul being able to trigger failures at something like
once per 10k CPU hours of RCU torture.
For now, do the ugly thing and move the code to the right place by
ignoring the switch hooks.
Note: Clean up the whole sched_class::switch*_{to,from}() thing.
Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
|
|
While enabling and disabling a BPF scheduler, every task is iterated a
couple times by walking scx_tasks. Except for one, all iterations keep
holding scx_tasks_lock. On multi-socket systems under heavy rq lock
contention and high number of threads, this can can lead to RCU and other
stalls.
The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs)
running `stress-ng --workload 150 --workload-threads 10` with >400k idle
threads and RCU stall period reduced to 5s:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
rcu: 186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
rcu: (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
Sending NMI from CPU 80 to CPUs 91:
NMI backtrace for cpu 91
CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
Sched_ext: simple (disabling+all)
RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
FS: 0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
Call Trace:
<NMI>
</NMI>
<TASK>
do_raw_spin_lock+0x9c/0xb0
task_rq_lock+0x50/0x190
scx_task_iter_next_locked+0x157/0x170
scx_ops_disable_workfn+0x2c2/0xbf0
kthread_worker_fn+0x108/0x2a0
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30
</TASK>
Sending NMI from CPU 80 to CPUs 186:
NMI backtrace for cpu 186
CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
scx_task_iter can safely drop locks while iterating. Make
scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid
stalls.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq
lock of the task being iterated. Both locks can be released during iteration
and the iteration can be continued after re-grabbing scx_tasks_lock.
Currently, all lock handling is pushed to the caller which is a bit
cumbersome and makes it difficult to add lock-aware behaviors. Make the
scx_task_iter helpers handle scx_tasks_lock.
- scx_task_iter_init/scx_taks_iter_exit() now grabs and releases
scx_task_lock, respectively. Renamed to
scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that
there are non-trivial side-effects.
- Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function
is internal.
- Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held)
and scx_tasks_lock and the latter re-locks only scx_tasks_lock.
This doesn't cause behavior changes and will be used to implement stall
avoidance.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
Bypass mode was depending on ops.select_cpu() which can't be trusted as with
the rest of the BPF scheduler. Always enable and use scx_select_cpu_dfl() in
bypass mode.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
Move the sanity check from the inner function scx_select_cpu_dfl() to the
exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior
differences and will allow using scx_select_cpu_dfl() in bypass mode
regardless of scx_builtin_idle_enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already
being ignored at this stage during disable, the only effect this has is that
when the next BPF scheduler is loaded, it won't see unreasonable left-over
slices. Ultimately, this shouldn't matter but it's better to start in a
known state. Drop p->scx.slice capping from the disable path and instead
reset it to SCX_SLICE_DFL in the enable path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
This reverts commit 6f34d8d382d64e7d8e77f5a9ddfd06f4c04937b0.
Slice length is ignored while bypassing and tasks are switched on every tick
and thus the patch does not make any difference. The perceived difference
was from test noise.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
pick_next_task_scx() was turned into pick_task_scx() since
commit 753e2836d139 ("sched_ext: Unify regular and core-sched pick
task paths"). Update the outdated message.
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to
tell whether ops.select_cpu() was called. This is incorrect as
ops.select_cpu() can be skipped in the wakeup path and leads to e.g.
incorrectly skipping direct dispatch for tasks that are bound to a single
CPU.
sched core has been updated to specify ENQUEUE_RQ_SELECTED if
->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update
scx_qmap to test it instead of SCX_ENQ_WAKEUP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
|
|
568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations
and fix scx_tg_online()") assumed that scx_cgroup_exit() is only called
after scx_cgroup_init() finished successfully. This isn't true.
scx_cgroup_exit() can be called without scx_cgroup_init() being called at
all or after scx_cgroup_init() failed in the middle.
As init state is tracked per cgroup, scx_cgroup_exit() can be used safely to
clean up in all cases. Remove the incorrect WARN_ON_ONCE().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()")
|
|
When the BPF scheduler fails, ops.exit() allows rich error reporting through
scx_exit_info. Use scx.exit() path consistently for all failures which can
be caused by the BPF scheduler:
- scx_ops_error() is called after ops.init() and ops.cgroup_init() failure
to record error information.
- ops.init_task() failure now uses scx_ops_error() instead of pr_err().
- The err_disable path updated to automatically trigger scx_ops_error() to
cover cases that the error message hasn't already been generated and
always return 0 indicating init success so that the error is reported
through ops.exit().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
|
|
Use tg_cgroup() to eliminate duplicate code patterns
in scx_bpf_task_cgroup().
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The #endif trailing comment of CONFIG_EXT_GROUP_SCHED is unmatched, so fix
it.
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pure reorganization. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
select_rq_task() already checked that 'p->nr_cpus_allowed > 1',
'p->nr_cpus_allowed == 1' checker in scx_select_cpu_dfl() is redundant.
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems.
For example, currently, there is a possible deadlock involving
scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside
scx_fork_rwsem due to locking order existing in other subsystems. However,
there exists a dependency in the other direction during hotplug if hotplug
needs to fork a new task, which happens in some cases. This leads to the
following deadlock:
scx_ops_enable() hotplug
percpu_down_write(&cpu_hotplug_lock)
percpu_down_write(&scx_fork_rwsem)
block on cpu_hotplug_lock
kthread_create() waits for kthreadd
kthreadd blocks on scx_fork_rwsem
Note that this doesn't trigger lockdep because the hotplug side dependency
bounces through kthreadd.
With the preceding scx_cgroup_enabled change, this can be solved by
decoupling cpus_read_lock, which is needed for static_key manipulations,
from the other two locks.
- Move the first block of static_key manipulations outside of scx_fork_rwsem
and scx_cgroup_rwsem. This is now safe with the preceding
scx_cgroup_enabled change.
- Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
blocks so that __scx_ops_enabled static_key enabling is outside the two
rwsems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
|