summaryrefslogtreecommitdiff
path: root/kernel/time/timer_migration.h
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2024-03-11 14:38:26 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2024-03-11 14:38:26 -0700
commitd08c407f715f651e7ea40b3a037be46dd2b11e4c (patch)
tree2b9e1a81b93f316156e663cc1d90b62985032783 /kernel/time/timer_migration.h
parent80a76c60e5f6361c497d464bb6da6ea07e908a0e (diff)
parent8ca1836769d758e4fbf5851bb81e181c52193f5d (diff)
Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner: "A large set of updates and features for timers and timekeeping: - The hierarchical timer pull model When timer wheel timers are armed they are placed into the timer wheel of a CPU which is likely to be busy at the time of expiry. This is done to avoid wakeups on potentially idle CPUs. This is wrong in several aspects: 1) The heuristics to select the target CPU are wrong by definition as the chance to get the prediction right is close to zero. 2) Due to #1 it is possible that timers are accumulated on a single target CPU 3) The required computation in the enqueue path is just overhead for dubious value especially under the consideration that the vast majority of timer wheel timers are either canceled or rearmed before they expire. The timer pull model avoids the above by removing the target computation on enqueue and queueing timers always on the CPU on which they get armed. This is achieved by having separate wheels for CPU pinned timers and global timers which do not care about where they expire. As long as a CPU is busy it handles both the pinned and the global timers which are queued on the CPU local timer wheels. When a CPU goes idle it evaluates its own timer wheels: - If the first expiring timer is a pinned timer, then the global timers can be ignored as the CPU will wake up before they expire. - If the first expiring timer is a global timer, then the expiry time is propagated into the timer pull hierarchy and the CPU makes sure to wake up for the first pinned timer. The timer pull hierarchy organizes CPUs in groups of eight at the lowest level and at the next levels groups of eight groups up to the point where no further aggregation of groups is required, i.e. the number of levels is log8(NR_CPUS). The magic number of eight has been established by experimention, but can be adjusted if needed. In each group one busy CPU acts as the migrator. It's only one CPU to avoid lock contention on remote timer wheels. The migrator CPU checks in its own timer wheel handling whether there are other CPUs in the group which have gone idle and have global timers to expire. If there are global timers to expire, the migrator locks the remote CPU timer wheel and handles the expiry. Depending on the group level in the hierarchy this handling can require to walk the hierarchy downwards to the CPU level. Special care is taken when the last CPU goes idle. At this point the CPU is the systemwide migrator at the top of the hierarchy and it therefore cannot delegate to the hierarchy. It needs to arm its own timer device to expire either at the first expiring timer in the hierarchy or at the first CPU local timer, which ever expires first. This completely removes the overhead from the enqueue path, which is e.g. for networking a true hotpath and trades it for a slightly more complex idle path. This has been in development for a couple of years and the final series has been extensively tested by various teams from silicon vendors and ran through extensive CI. There have been slight performance improvements observed on network centric workloads and an Intel team confirmed that this allows them to power down a die completely on a mult-die socket for the first time in a mostly idle scenario. There is only one outstanding ~1.5% regression on a specific overloaded netperf test which is currently investigated, but the rest is either positive or neutral performance wise and positive on the power management side. - Fixes for the timekeeping interpolation code for cross-timestamps: cross-timestamps are used for PTP to get snapshots from hardware timers and interpolated them back to clock MONOTONIC. The changes address a few corner cases in the interpolation code which got the math and logic wrong. - Simplifcation of the clocksource watchdog retry logic to automatically adjust to handle larger systems correctly instead of having more incomprehensible command line parameters. - Treewide consolidation of the VDSO data structures. - The usual small improvements and cleanups all over the place" * tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) timer/migration: Fix quick check reporting late expiry tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n vdso/datapage: Quick fix - use asm/page-def.h for ARM64 timers: Assert no next dyntick timer look-up while CPU is offline tick: Assume timekeeping is correctly handed over upon last offline idle call tick: Shut down low-res tick from dying CPU tick: Split nohz and highres features from nohz_mode tick: Move individual bit features to debuggable mask accesses tick: Move got_idle_tick away from common flags tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING tick: Move tick cancellation up to CPUHP_AP_TICK_DYING tick: Start centralizing tick related CPU hotplug operations tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick() tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick() tick: Use IS_ENABLED() whenever possible tick/sched: Remove useless oneshot ifdeffery tick/nohz: Remove duplicate between lowres and highres handlers tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer() hrtimer: Select housekeeping CPU during migration ...
Diffstat (limited to 'kernel/time/timer_migration.h')
-rw-r--r--kernel/time/timer_migration.h140
1 files changed, 140 insertions, 0 deletions
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
new file mode 100644
index 000000000000..6c37d94a37d9
--- /dev/null
+++ b/kernel/time/timer_migration.h
@@ -0,0 +1,140 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _KERNEL_TIME_MIGRATION_H
+#define _KERNEL_TIME_MIGRATION_H
+
+/* Per group capacity. Must be a power of 2! */
+#define TMIGR_CHILDREN_PER_GROUP 8
+
+/**
+ * struct tmigr_event - a timer event associated to a CPU
+ * @nextevt: The node to enqueue an event in the parent group queue
+ * @cpu: The CPU to which this event belongs
+ * @ignore: Hint whether the event could be ignored; it is set when
+ * CPU or group is active;
+ */
+struct tmigr_event {
+ struct timerqueue_node nextevt;
+ unsigned int cpu;
+ bool ignore;
+};
+
+/**
+ * struct tmigr_group - timer migration hierarchy group
+ * @lock: Lock protecting the event information and group hierarchy
+ * information during setup
+ * @parent: Pointer to the parent group
+ * @groupevt: Next event of the group which is only used when the
+ * group is !active. The group event is then queued into
+ * the parent timer queue.
+ * Ignore bit of @groupevt is set when the group is active.
+ * @next_expiry: Base monotonic expiry time of the next event of the
+ * group; It is used for the racy lockless check whether a
+ * remote expiry is required; it is always reliable
+ * @events: Timer queue for child events queued in the group
+ * @migr_state: State of the group (see union tmigr_state)
+ * @level: Hierarchy level of the group; Required during setup
+ * @numa_node: Required for setup only to make sure CPU and low level
+ * group information is NUMA local. It is set to NUMA node
+ * as long as the group level is per NUMA node (level <
+ * tmigr_crossnode_level); otherwise it is set to
+ * NUMA_NO_NODE
+ * @num_children: Counter of group children to make sure the group is only
+ * filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
+ * only
+ * @childmask: childmask of the group in the parent group; is set
+ * during setup and will never change; can be read
+ * lockless
+ * @list: List head that is added to the per level
+ * tmigr_level_list; is required during setup when a
+ * new group needs to be connected to the existing
+ * hierarchy groups
+ */
+struct tmigr_group {
+ raw_spinlock_t lock;
+ struct tmigr_group *parent;
+ struct tmigr_event groupevt;
+ u64 next_expiry;
+ struct timerqueue_head events;
+ atomic_t migr_state;
+ unsigned int level;
+ int numa_node;
+ unsigned int num_children;
+ u8 childmask;
+ struct list_head list;
+};
+
+/**
+ * struct tmigr_cpu - timer migration per CPU group
+ * @lock: Lock protecting the tmigr_cpu group information
+ * @online: Indicates whether the CPU is online; In deactivate path
+ * it is required to know whether the migrator in the top
+ * level group is to be set offline, while a timer is
+ * pending. Then another online CPU needs to be notified to
+ * take over the migrator role. Furthermore the information
+ * is required in CPU hotplug path as the CPU is able to go
+ * idle before the timer migration hierarchy hotplug AP is
+ * reached. During this phase, the CPU has to handle the
+ * global timers on its own and must not act as a migrator.
+ * @idle: Indicates whether the CPU is idle in the timer migration
+ * hierarchy
+ * @remote: Is set when timers of the CPU are expired remotely
+ * @tmgroup: Pointer to the parent group
+ * @childmask: childmask of tmigr_cpu in the parent group
+ * @wakeup: Stores the first timer when the timer migration
+ * hierarchy is completely idle and remote expiry was done;
+ * is returned to timer code in the idle path and is only
+ * used in idle path.
+ * @cpuevt: CPU event which could be enqueued into the parent group
+ */
+struct tmigr_cpu {
+ raw_spinlock_t lock;
+ bool online;
+ bool idle;
+ bool remote;
+ struct tmigr_group *tmgroup;
+ u8 childmask;
+ u64 wakeup;
+ struct tmigr_event cpuevt;
+};
+
+/**
+ * union tmigr_state - state of tmigr_group
+ * @state: Combined version of the state - only used for atomic
+ * read/cmpxchg function
+ * @struct: Split version of the state - only use the struct members to
+ * update information to stay independent of endianness
+ */
+union tmigr_state {
+ u32 state;
+ /**
+ * struct - split state of tmigr_group
+ * @active: Contains each childmask bit of the active children
+ * @migrator: Contains childmask of the child which is migrator
+ * @seq: Sequence counter needs to be increased when an update
+ * to the tmigr_state is done. It prevents a race when
+ * updates in the child groups are propagated in changed
+ * order. Detailed information about the scenario is
+ * given in the documentation at the begin of
+ * timer_migration.c.
+ */
+ struct {
+ u8 active;
+ u8 migrator;
+ u16 seq;
+ } __packed;
+};
+
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+extern void tmigr_handle_remote(void);
+extern bool tmigr_requires_handle_remote(void);
+extern void tmigr_cpu_activate(void);
+extern u64 tmigr_cpu_deactivate(u64 nextevt);
+extern u64 tmigr_cpu_new_timer(u64 nextevt);
+extern u64 tmigr_quick_check(u64 nextevt);
+#else
+static inline void tmigr_handle_remote(void) { }
+static inline bool tmigr_requires_handle_remote(void) { return false; }
+static inline void tmigr_cpu_activate(void) { }
+#endif
+
+#endif