author     Linus Torvalds <torvalds@linux-foundation.org>  2024-11-20 10:08:00 -0800
committer  Linus Torvalds <torvalds@linux-foundation.org>  2024-11-20 10:08:00 -0800
commit     8f7c8b88bda4988f44e595a760438febf51c92c8
tree       556b56f6d57b274158f860535acb616bcb5c5985 /Documentation
parent     7586d5276515a54656bc46530b32e10913c44b1f
parent     6b8950ef993bcf198d4a80cde0b2da805b75ed70
Merge tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Improve the default select_cpu() implementation, making it topology
  aware and handling WAKE_SYNC better.
- set_arg_maybe_null() was used to inform the verifier, in a rather
  hackish way, which ops args could be NULL. Use the new __nullable CFI
  stub tags instead (see the sketch after this list).
- On Sapphire Rapids multi-socket systems, a BPF scheduler hammering on
  the same queue across sockets could live-lock the system to the point
  where it couldn't make reasonable forward progress.
  This could lead to soft-lockup triggered resets, or stall out the
  bypass mode switch and thus BPF scheduler ejection for tens of
  minutes if not hours. After trying a number of mitigations, the
  following set worked reliably:
  - Injecting artificial cpu_relax() loops in two places while
    sched_ext is trying to turn on the bypass mode.
  - Triggering scheduler ejection when soft-lockup detection is
    imminent (a quarter of the threshold left).
  While not the prettiest, the impact in terms of both code complexity
  and overhead is minimal.
- A common complaint about the API is the overuse of the word "dispatch"
  and the confusion around "consume". This is due to how the dispatch
  queues became more generic over time. Rename the affected kfuncs for
  clarity (the old-to-new mapping is summarized after the shortlog
  below). Thanks to BPF's compatibility features, this change can be
  made in a way that's both forward and backward compatible. The
  compatibility code will be dropped in a few releases.
- Other misc changes
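
As a minimal sketch of the __nullable tagging mentioned above (the stub
and callback names here are illustrative, not verbatim kernel code):
suffixing a parameter name with __nullable in a struct_ops CFI stub
tells the BPF verifier that the argument may be NULL, so a loaded
scheduler must check it before dereferencing.

    /* Kernel-side CFI stub: the __nullable suffix marks the argument as
     * possibly-NULL, replacing the old set_arg_maybe_null() special case. */
    static void sched_ext_ops__dispatch(s32 cpu,
                                        struct task_struct *prev__nullable) {}

    /* The BPF-side callback must then NULL-check the argument: */
    void BPF_STRUCT_OPS(my_dispatch, s32 cpu, struct task_struct *prev)
    {
            if (prev)       /* may be NULL per the __nullable tag */
                    prev->scx.slice = SCX_SLICE_DFL;
    }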
* tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (21 commits)
sched_ext: Replace scx_next_task_picked() with switch_class() in comment
sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
sched_ext: Clarify sched_ext_ops table for userland scheduler
sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
sched_ext: Avoid live-locking bypass mode switching
sched_ext: Fix incorrect use of bitwise AND
sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
sched_ext: Introduce NUMA awareness to the default idle selection policy
sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
sched_ext: Rename CFI stubs to names that are recognized by BPF
sched_ext: Introduce LLC awareness to the default idle selection policy
sched_ext: Clarify ops.select_cpu() for single-CPU tasks
sched_ext: improve WAKE_SYNC behavior for default idle CPU selection
sched_ext: Use btf_ids to resolve task_struct
sched/ext: Use tg_cgroup() to eliminate duplicate code
sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
...
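
For quick reference, the renames listed above map the old kfunc names to
the new ones as follows (compat shims keep the old names working for a
few releases):

    scx_bpf_dispatch()                    -> scx_bpf_dsq_insert()
    scx_bpf_dispatch_vtime()              -> scx_bpf_dsq_insert_vtime()
    scx_bpf_consume()                     -> scx_bpf_dsq_move_to_local()
    scx_bpf_dispatch[_vtime]_from_dsq*()  -> scx_bpf_dsq_move[_vtime]*()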
Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/scheduler/sched-ext.rst | 71
1 file changed, 35 insertions(+), 36 deletions(-)
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 7b59bbd2e564..6cb8b676ce03 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -130,7 +130,7 @@ optional. The following modified excerpt is from
      * Decide which CPU a task should be migrated to before being
      * enqueued (either at wakeup, fork time, or exec time). If an
      * idle core is found by the default ops.select_cpu() implementation,
-     * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
+     * then insert the task directly into SCX_DSQ_LOCAL and skip the
      * ops.enqueue() callback.
      *
      * Note that this implementation has exactly the same behavior as the
@@ -148,15 +148,15 @@ optional. The following modified excerpt is from
         cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
         if (direct)
-            scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+            scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
 
         return cpu;
     }
 
     /*
-     * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
-     * callback will only be invoked if we failed to find a core to dispatch
-     * to in ops.select_cpu() above.
+     * Do a direct insertion of a task into the global DSQ. This ops.enqueue()
+     * callback will only be invoked if we failed to find a core to insert
+     * into in ops.select_cpu() above.
      *
      * Note that this implementation has exactly the same behavior as the
      * default ops.enqueue implementation, which just dispatches the task
@@ -166,7 +166,7 @@ optional. The following modified excerpt is from
      */
     void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
     {
-        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+        scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
     }
 
     s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
@@ -202,14 +202,13 @@ and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
 an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
 ``scx_bpf_destroy_dsq()``.
 
-A CPU always executes a task from its local DSQ. A task is "dispatched" to a
-DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
-local DSQ.
+A CPU always executes a task from its local DSQ. A task is "inserted" into a
+DSQ. A task in a non-local DSQ is "move"d into the target CPU's local DSQ.
 
 When a CPU is looking for the next task to run, if the local DSQ is not
-empty, the first task is picked. Otherwise, the CPU tries to consume the
-global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
-is invoked.
+empty, the first task is picked. Otherwise, the CPU tries to move a task
+from the global DSQ. If that doesn't yield a runnable task either,
+``ops.dispatch()`` is invoked.
 
 Scheduling Cycle
 ----------------
@@ -229,26 +228,26 @@ The following briefly shows how a waking task is scheduled and executed.
    scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
    using ``ops.select_cpu()`` judiciously can be simpler and more
    efficient.
 
-   A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
-   calling ``scx_bpf_dispatch()``. If the task is dispatched to
-   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
+   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
+   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
+   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
    local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
-   Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
+   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
    ``ops.enqueue()`` callback to be skipped.
 
    Note that the scheduler core will ignore an invalid CPU selection, for
   example, if it's outside the allowed cpumask of the task.
 
 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
-   task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
+   task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
    can make one of the following decisions:
 
-   * Immediately dispatch the task to either the global or local DSQ by
-     calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
+   * Immediately insert the task into either the global or local DSQ by
+     calling ``scx_bpf_dsq_insert()`` with ``SCX_DSQ_GLOBAL`` or
      ``SCX_DSQ_LOCAL``, respectively.
 
-   * Immediately dispatch the task to a custom DSQ by calling
-     ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
+   * Immediately insert the task into a custom DSQ by calling
+     ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63.
 
    * Queue the task on the BPF side.
 
@@ -257,23 +256,23 @@ The following briefly shows how a waking task is scheduled and executed.
    run, ``ops.dispatch()`` is invoked which can use the following two
    functions to populate the local DSQ.
 
-   * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
-     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
-     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
+   * ``scx_bpf_dsq_insert()`` inserts a task into a DSQ. Any target DSQ can
+     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
+     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()``
      currently can't be called with BPF locks held, this is being worked on
-     and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
+     and will be supported. ``scx_bpf_dsq_insert()`` schedules insertions
      rather than performing them immediately. There can be up to
      ``ops.dispatch_max_batch`` pending tasks.
 
-   * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
-     to the dispatching DSQ. This function cannot be called with any BPF
-     locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
-     before trying to consume the specified DSQ.
+   * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified
+     non-local DSQ to the dispatching DSQ. This function cannot be called
+     with any BPF locks held. ``scx_bpf_dsq_move_to_local()`` flushes the
+     pending insertions before trying to move from the specified DSQ.
 
 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
    the CPU runs the first one. If empty, the following steps are taken:
 
-   * Try to consume the global DSQ. If successful, run the task.
+   * Try to move a task from the global DSQ. If successful, run the task.
 
    * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
 
@@ -286,14 +285,14 @@ Note that the BPF scheduler can always choose to dispatch tasks immediately in
 ``ops.enqueue()`` as illustrated in the above simple example. If only the
 built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
 a task is never queued on the BPF scheduler and both the local and global
-DSQs are consumed automatically.
+DSQs are executed automatically.
 
-``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
-``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
+``scx_bpf_dsq_insert()`` inserts the task into the FIFO of the target DSQ. Use
+``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as
 ``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
-dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
-function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
-more information.
+dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See
+the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c``
+for more information.
 
 Where to Look
 =============
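
To make the renamed API concrete, here is a minimal sketch of a BPF
scheduler using the new kfunc names. MY_DSQ and the my_* callback names
are hypothetical; the shape follows the scheduling cycle described in the
documentation above, not any specific in-tree scheduler.

    #include <scx/common.bpf.h>

    #define MY_DSQ 0    /* hypothetical custom DSQ ID (must be < 2^63) */

    s32 BPF_STRUCT_OPS_SLEEPABLE(my_init)
    {
            /* Create the custom DSQ; -1 means any NUMA node. */
            return scx_bpf_create_dsq(MY_DSQ, -1);
    }

    void BPF_STRUCT_OPS(my_enqueue, struct task_struct *p, u64 enq_flags)
    {
            /* FIFO-insert into the custom DSQ with the default slice. */
            scx_bpf_dsq_insert(p, MY_DSQ, SCX_SLICE_DFL, enq_flags);
    }

    void BPF_STRUCT_OPS(my_dispatch, s32 cpu, struct task_struct *prev)
    {
            /* Move one task from the custom DSQ into this CPU's local
             * DSQ; if it is empty, the core falls back to the global
             * DSQ as described in the scheduling cycle above. */
            scx_bpf_dsq_move_to_local(MY_DSQ);
    }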