Age | Commit message (Collapse) | Author |
|
A ~10% regression has been reported for UnixBench's execl throughput
test by Aaron Lu and Ye Xiaolong:
https://lkml.org/lkml/2018/10/30/765
That test is pretty simple, it does a "recursive" execve() syscall on the
same binary. Starting from the syscall, this sequence is possible:
do_execve()
do_execveat_common()
__do_execve_file()
sched_exec()
select_task_rq_fair() <==| Task already enqueued
find_idlest_cpu()
find_idlest_group()
capacity_spare_wake() <==| Functions not called from
cpu_util_wake() | the wakeup path
which means we can end up calling cpu_util_wake() not only from the
"wakeup path", as its name would suggest. Indeed, the task doing an
execve() syscall is already enqueued on the CPU we want to get the
cpu_util_wake() for.
The estimated utilization for a CPU computed in cpu_util_wake() was
written under the assumption that function can be called only from the
wakeup path. If instead the task is already enqueued, we end up with a
utilization which does not remove the current task's contribution from
the estimated utilization of the CPU.
This will wrongly assume a reduced spare capacity on the current CPU and
increase the chances to migrate the task on execve.
The regression is tracked down to:
commit d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")
because in that patch we turn on by default the UTIL_EST sched feature.
However, the real issue is introduced by:
commit f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths")
Let's fix this by ensuring to always discount the task estimated
utilization from the CPU's estimated utilization when the task is also
the current one. The same benchmark of the bug report, executed on a
dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
reports these "Execl Throughput" figures (higher the better):
mainline : 48136.5 lps
mainline+fix : 55376.5 lps
which correspond to a 15% speedup.
Moreover, since {cpu_util,capacity_spare}_wake() are not really only
used from the wakeup path, let's remove this ambiguity by using a better
matching name: {cpu_util,capacity_spare}_without().
Since we are at that, let's also improve the existing documentation.
Reported-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths)
Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Duplicated 'case it'.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Xi Xu <xu.xi8@zte.com.cn>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: zhong.weidong@zte.com.cn
Link: http://lkml.kernel.org/r/1541379013-11352-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes are:
- Migrate CPU-intense 'misfit' tasks on asymmetric capacity systems,
to better utilize (much) faster 'big core' CPUs. (Morten Rasmussen,
Valentin Schneider)
- Topology handling improvements, in particular when CPU capacity
changes and related load-balancing fixes/improvements (Morten
Rasmussen)
- ... plus misc other improvements, fixes and updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
sched/completions/Documentation: Add recommendation for dynamic and ONSTACK completions
sched/completions/Documentation: Clean up the document some more
sched/completions/Documentation: Fix a couple of punctuation nits
cpu/SMT: State SMT is disabled even with nosmt and without "=force"
sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load()
sched/fair: Remove setting task's se->runnable_weight during PELT update
sched/fair: Disable LB_BIAS by default
sched/pelt: Fix warning and clean up IRQ PELT config
sched/topology: Make local variables static
sched/debug: Use symbolic names for task state constants
sched/numa: Remove unused numa_stats::nr_running field
sched/numa: Remove unused code from update_numa_stats()
sched/debug: Explicitly cast sched_feat() to bool
sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
sched/fair: Don't move tasks to lower capacity CPUs unless necessary
sched/fair: Set rq->rd->overload when misfit
sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
sched/core: Change root_domain->overload type to int
sched/fair: Change 'prefer_sibling' type to bool
sched/fair: Kick nohz balance if rq->misfit_task_load
...
|
|
The comment and the code around the update_min_vruntime() call in
dequeue_entity() are not in agreement.
From commit:
b60205c7c558 ("sched/fair: Fix min_vruntime tracking")
I think that we want to update min_vruntime when a task is sleeping/migrating.
So, the check is inverted there - fix it.
Signed-off-by: Song Muchun <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b60205c7c558 ("sched/fair: Fix min_vruntime tracking")
Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c7048 to put throttled entries at the head of the list, later entries
on the list will starve. Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.
Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list. The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.
This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -976050
2 ffff90b56cb2cc00 -484925
3 ffff90b56cb2bc00 -658814
4 ffff90b56cb2ba00 -275365
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
1 ffff90b56cb2d200 -994147
2 ffff90b56cb2cc00 -306051
3 ffff90b56cb2bc00 -961321
4 ffff90b56cb2ba00 -24490
5 ffff90b166a45600 -135138
6 ffff90b56cb2da00 -282505
7 ffff90b56cb2e000 -148065
8 ffff90b56cb2fa00 -872591
9 ffff90b56cb2c000 -84687
10 ffff90b56cb2f000 -87237
11 ffff90b166a40a00 -164582
Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
crash> task ffff8eb765994500 sched_info
PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
sched_info = {
pcount = 8,
run_delay = 697094208,
last_arrival = 240260125039,
last_queued = 240260327513
},
Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
should migrate to a local node. This filter avoids excessive ping-ponging
if a page is shared or used by threads that migrate cross-node frequently.
Threads inherit both page tables and the preferred node ID from the
parent. This means that threads can trigger hinting faults earlier than
a new task which delays scanning for a number of seconds. As it can be
load balanced very early in its lifetime there can be an unnecessary delay
before it starts migrating thread-local data. This patch migrates private
pages faster early in the lifetime of a thread using the sequence counter
as an identifier of new tasks.
With this patch applied, STREAM performance is the same as 4.17 even though
processes are not spread cross-node prematurely. Other workloads showed
a mix of minor gains and losses. This is somewhat expected most workloads
are not very sensitive to the starting conditions of a process.
4.19.0-rc5 4.19.0-rc5 4.17.0
numab-v1r1 fastmigrate-v1r1 vanilla
MB/sec copy 43298.52 ( 0.00%) 47335.46 ( 9.32%) 47219.24 ( 9.06%)
MB/sec scale 30115.06 ( 0.00%) 32568.12 ( 8.15%) 32527.56 ( 8.01%)
MB/sec add 32825.12 ( 0.00%) 36078.94 ( 9.91%) 35928.02 ( 9.45%)
MB/sec triad 32549.52 ( 0.00%) 35935.94 ( 10.40%) 35969.88 ( 10.51%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Create a config for enabling irq load tracking in the scheduler.
irq load tracking is useful only when irq or paravirtual time is
accounted but it's only possible with SMP for now.
Also use __maybe_unused to remove the compilation warning in
update_rq_clock_task() that has been introduced by:
2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Suggested-by: Ingo Molnar <mingo@redhat.com>
Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: dou_liyang@163.com
Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
If NUMA improvement from the task migration is going to be very
minimal, then avoid task migration.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 198512 205910 3.72673
1 313559 318491 1.57291
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
8 74761.9 74935.9 0.232739
1 214874 226796 5.54837
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 180536 189780 5.12031
1 210281 205695 -2.18089
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 56511.4 60370 6.828
1 104899 108100 3.05151
1/7 cases is regressing, if we look at events migrate_pages seem
to vary the most especially in the regressing case. Also some
amount of variance is expected between different runs of
Specjbb2005.
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,818,546 13,801,554
migrations 1,149,960 1,151,541
faults 385,583 433,246
cache-misses 55,259,546,768 55,168,691,835
sched:sched_move_numa 2,257 2,551
sched:sched_stick_numa 9 24
sched:sched_swap_numa 512 904
migrate:mm_migrate_pages 2,225 1,571
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 72692 113682
numa_hint_faults_local 62270 102163
numa_hit 238762 240181
numa_huge_pte_updates 48 36
numa_interleave 75 64
numa_local 238676 240103
numa_other 86 78
numa_pages_migrated 2225 1564
numa_pte_updates 98557 134080
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,173,490 3,079,150
migrations 36,966 31,455
faults 108,776 99,081
cache-misses 12,200,075,320 11,588,126,740
sched:sched_move_numa 1,264 1
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 899 36
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 21109 430
numa_hint_faults_local 17120 77
numa_hit 72934 71277
numa_huge_pte_updates 42 0
numa_interleave 33 22
numa_local 72866 71218
numa_other 68 59
numa_pages_migrated 915 23
numa_pte_updates 42326 0
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,312,022 8,707,565
migrations 231,705 171,342
faults 310,242 310,820
cache-misses 402,324,573 136,115,400
sched:sched_move_numa 193 215
sched:sched_stick_numa 0 6
sched:sched_swap_numa 3 24
migrate:mm_migrate_pages 93 162
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11838 8985
numa_hint_faults_local 11216 8154
numa_hit 90689 93819
numa_huge_pte_updates 0 0
numa_interleave 1579 882
numa_local 89634 93496
numa_other 1055 323
numa_pages_migrated 92 169
numa_pte_updates 12109 9217
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,170,481 2,152,072
migrations 10,126 10,704
faults 160,962 164,376
cache-misses 10,834,845 3,818,437
sched:sched_move_numa 10 16
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 7
migrate:mm_migrate_pages 2 199
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 403 2248
numa_hint_faults_local 358 1666
numa_hit 25898 25704
numa_huge_pte_updates 0 0
numa_interleave 207 200
numa_local 25860 25679
numa_other 38 25
numa_pages_migrated 2 197
numa_pte_updates 400 2234
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 110,339,633 93,330,595
migrations 4,139,812 4,122,061
faults 863,622 865,979
cache-misses 231,838,045,660 225,395,083,479
sched:sched_move_numa 2,196 2,372
sched:sched_stick_numa 33 24
sched:sched_swap_numa 544 769
migrate:mm_migrate_pages 2,469 1,677
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 85748 91638
numa_hint_faults_local 66831 78096
numa_hit 242213 242225
numa_huge_pte_updates 0 0
numa_interleave 0 2
numa_local 242211 242219
numa_other 2 6
numa_pages_migrated 2376 1515
numa_pte_updates 86233 92274
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 59,331,057 51,487,271
migrations 552,019 537,170
faults 266,586 256,921
cache-misses 73,796,312,990 70,073,831,187
sched:sched_move_numa 981 576
sched:sched_stick_numa 54 24
sched:sched_swap_numa 286 327
migrate:mm_migrate_pages 713 726
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 14807 12000
numa_hint_faults_local 5738 5024
numa_hit 36230 36470
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 36228 36465
numa_other 2 5
numa_pages_migrated 703 726
numa_pte_updates 14742 11930
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
cross-node migration. In the event of excessive load balancing due to
saturation, this may result in the scan rate being pegged at maximum and
further overloading the machine.
This patch only resets the scan if NUMA balancing is active, a preferred
node has been selected and the task is being migrated from the preferred
node as these are the most harmful. For example, a migration to the preferred
node does not justify a faster scan rate. Similarly, a migration between two
nodes that are not preferred is probably bouncing due to over-saturation of
the machine. In that case, scanning faster and trapping more NUMA faults
will further overload the machine.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 203370 205332 0.964744
1 328431 319785 -2.63252
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 206070 206585 0.249915
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188386 189162 0.41192
1 201566 213760 6.04963
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 59157.4 58736.8 -0.710985
1 105495 105419 -0.0720413
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,825,492 14,285,708
migrations 1,152,509 1,180,621
faults 371,948 339,114
cache-misses 55,654,206,041 55,205,631,894
sched:sched_move_numa 1,856 843
sched:sched_stick_numa 4 6
sched:sched_swap_numa 428 219
migrate:mm_migrate_pages 898 365
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 57146 26907
numa_hint_faults_local 51612 24279
numa_hit 238164 239771
numa_huge_pte_updates 16 0
numa_interleave 63 68
numa_local 238085 239688
numa_other 79 83
numa_pages_migrated 883 363
numa_pte_updates 67540 27415
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,288,525 3,202,779
migrations 38,652 37,186
faults 111,678 106,076
cache-misses 12,111,197,376 12,024,873,744
sched:sched_move_numa 900 931
sched:sched_stick_numa 0 0
sched:sched_swap_numa 5 1
migrate:mm_migrate_pages 714 637
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 18572 17409
numa_hint_faults_local 14850 14367
numa_hit 73197 73953
numa_huge_pte_updates 11 20
numa_interleave 25 25
numa_local 73138 73892
numa_other 59 61
numa_pages_migrated 712 668
numa_pte_updates 24021 27276
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,451,543 8,474,013
migrations 202,804 254,934
faults 310,024 320,506
cache-misses 253,522,507 110,580,458
sched:sched_move_numa 213 725
sched:sched_stick_numa 0 0
sched:sched_swap_numa 2 7
migrate:mm_migrate_pages 88 145
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11830 22797
numa_hint_faults_local 11301 21539
numa_hit 90038 89308
numa_huge_pte_updates 0 0
numa_interleave 855 865
numa_local 89796 88955
numa_other 242 353
numa_pages_migrated 88 149
numa_pte_updates 12039 22930
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,049,153 2,195,628
migrations 11,405 11,179
faults 162,309 149,656
cache-misses 7,203,343 8,117,515
sched:sched_move_numa 22 49
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 1 5
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 1693 3577
numa_hint_faults_local 1669 3476
numa_hit 25177 26142
numa_huge_pte_updates 0 0
numa_interleave 194 358
numa_local 24993 26042
numa_other 184 100
numa_pages_migrated 1 5
numa_pte_updates 1577 3587
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 94,515,937 100,602,296
migrations 4,203,554 4,135,630
faults 832,697 789,256
cache-misses 226,248,698,331 226,160,621,058
sched:sched_move_numa 1,730 1,366
sched:sched_stick_numa 14 16
sched:sched_swap_numa 432 374
migrate:mm_migrate_pages 1,398 1,350
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 80079 47857
numa_hint_faults_local 68620 39768
numa_hit 241187 240165
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 241186 240165
numa_other 1 0
numa_pages_migrated 1347 1224
numa_pte_updates 80729 48354
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 63,704,961 58,515,496
migrations 573,404 564,845
faults 230,878 245,807
cache-misses 76,568,222,781 73,603,757,976
sched:sched_move_numa 509 996
sched:sched_stick_numa 31 10
sched:sched_swap_numa 182 193
migrate:mm_migrate_pages 541 646
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 8501 13422
numa_hint_faults_local 2960 5619
numa_hit 35526 36118
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 35526 36116
numa_other 0 2
numa_pages_migrated 539 616
numa_pte_updates 8433 13374
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently task scan rate is reset when NUMA balancer migrates the task
to a different node. If NUMA balancer initiates a swap, reset is only
applicable to the task that initiates the swap. Similarly no scan rate
reset is done if the task is migrated across nodes by traditional load
balancer.
Instead move the scan reset to the migrate_task_rq. This ensures the
task moved out of its preferred node, either gets back to its preferred
node quickly or finds a new preferred node. Doing so, would be fair to
all tasks migrating across nodes.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 200668 203370 1.3465
1 321791 328431 2.06345
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 204848 206070 0.59654
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188098 188386 0.153112
1 200351 201566 0.606436
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 58145.9 59157.4 1.73959
1 103798 105495 1.63491
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,912,183 13,825,492
migrations 1,155,931 1,152,509
faults 367,139 371,948
cache-misses 54,240,196,814 55,654,206,041
sched:sched_move_numa 1,571 1,856
sched:sched_stick_numa 9 4
sched:sched_swap_numa 463 428
migrate:mm_migrate_pages 703 898
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 50155 57146
numa_hint_faults_local 45264 51612
numa_hit 239652 238164
numa_huge_pte_updates 36 16
numa_interleave 68 63
numa_local 239576 238085
numa_other 76 79
numa_pages_migrated 680 883
numa_pte_updates 71146 67540
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,156,720 3,288,525
migrations 30,354 38,652
faults 97,261 111,678
cache-misses 12,400,026,826 12,111,197,376
sched:sched_move_numa 4 900
sched:sched_stick_numa 0 0
sched:sched_swap_numa 1 5
migrate:mm_migrate_pages 20 714
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 272 18572
numa_hint_faults_local 186 14850
numa_hit 71362 73197
numa_huge_pte_updates 0 11
numa_interleave 23 25
numa_local 71299 73138
numa_other 63 59
numa_pages_migrated 2 712
numa_pte_updates 0 24021
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,606,824 8,451,543
migrations 155,352 202,804
faults 301,409 310,024
cache-misses 157,759,224 253,522,507
sched:sched_move_numa 168 213
sched:sched_stick_numa 0 0
sched:sched_swap_numa 3 2
migrate:mm_migrate_pages 125 88
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 4650 11830
numa_hint_faults_local 3946 11301
numa_hit 90489 90038
numa_huge_pte_updates 0 0
numa_interleave 892 855
numa_local 90034 89796
numa_other 455 242
numa_pages_migrated 124 88
numa_pte_updates 4818 12039
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,113,167 2,049,153
migrations 10,533 11,405
faults 142,727 162,309
cache-misses 5,594,192 7,203,343
sched:sched_move_numa 10 22
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 6 1
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 744 1693
numa_hint_faults_local 584 1669
numa_hit 25551 25177
numa_huge_pte_updates 0 0
numa_interleave 263 194
numa_local 25302 24993
numa_other 249 184
numa_pages_migrated 6 1
numa_pte_updates 744 1577
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 101,227,352 94,515,937
migrations 4,151,829 4,203,554
faults 745,233 832,697
cache-misses 224,669,561,766 226,248,698,331
sched:sched_move_numa 617 1,730
sched:sched_stick_numa 2 14
sched:sched_swap_numa 187 432
migrate:mm_migrate_pages 316 1,398
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 24195 80079
numa_hint_faults_local 21639 68620
numa_hit 238331 241187
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 238331 241186
numa_other 0 1
numa_pages_migrated 204 1347
numa_pte_updates 24561 80729
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 62,738,978 63,704,961
migrations 562,702 573,404
faults 228,465 230,878
cache-misses 75,778,067,952 76,568,222,781
sched:sched_move_numa 648 509
sched:sched_stick_numa 13 31
sched:sched_swap_numa 137 182
migrate:mm_migrate_pages 733 541
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 10281 8501
numa_hint_faults_local 3242 2960
numa_hit 36338 35526
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 36338 35526
numa_other 0 0
numa_pages_migrated 706 539
numa_pte_updates 10176 8433
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This additional parameter (new_cpu) is used later for identifying if
task migration is across nodes.
No functional change.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 203353 200668 -1.32036
1 328205 321791 -1.95427
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 214384 204848 -4.44809
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188553 188098 -0.241311
1 196273 200351 2.07772
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 57581.2 58145.9 0.980702
1 103468 103798 0.318939
Brings out the variance between different specjbb2005 runs.
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,941,377 13,912,183
migrations 1,157,323 1,155,931
faults 382,175 367,139
cache-misses 54,993,823,500 54,240,196,814
sched:sched_move_numa 2,005 1,571
sched:sched_stick_numa 14 9
sched:sched_swap_numa 529 463
migrate:mm_migrate_pages 1,573 703
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 67099 50155
numa_hint_faults_local 58456 45264
numa_hit 240416 239652
numa_huge_pte_updates 18 36
numa_interleave 65 68
numa_local 240339 239576
numa_other 77 76
numa_pages_migrated 1574 680
numa_pte_updates 77182 71146
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,176,453 3,156,720
migrations 30,238 30,354
faults 87,869 97,261
cache-misses 12,544,479,391 12,400,026,826
sched:sched_move_numa 23 4
sched:sched_stick_numa 0 0
sched:sched_swap_numa 6 1
migrate:mm_migrate_pages 10 20
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 236 272
numa_hint_faults_local 201 186
numa_hit 72293 71362
numa_huge_pte_updates 0 0
numa_interleave 26 23
numa_local 72233 71299
numa_other 60 63
numa_pages_migrated 8 2
numa_pte_updates 0 0
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,478,820 8,606,824
migrations 171,323 155,352
faults 307,499 301,409
cache-misses 240,353,599 157,759,224
sched:sched_move_numa 214 168
sched:sched_stick_numa 0 0
sched:sched_swap_numa 4 3
migrate:mm_migrate_pages 89 125
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 5301 4650
numa_hint_faults_local 4745 3946
numa_hit 92943 90489
numa_huge_pte_updates 0 0
numa_interleave 899 892
numa_local 92345 90034
numa_other 598 455
numa_pages_migrated 88 124
numa_pte_updates 5505 4818
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,066,172 2,113,167
migrations 11,076 10,533
faults 149,544 142,727
cache-misses 10,398,067 5,594,192
sched:sched_move_numa 43 10
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 6 6
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 3552 744
numa_hint_faults_local 3347 584
numa_hit 25611 25551
numa_huge_pte_updates 0 0
numa_interleave 213 263
numa_local 25583 25302
numa_other 28 249
numa_pages_migrated 6 6
numa_pte_updates 3535 744
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 99,358,136 101,227,352
migrations 4,041,607 4,151,829
faults 749,653 745,233
cache-misses 225,562,543,251 224,669,561,766
sched:sched_move_numa 771 617
sched:sched_stick_numa 14 2
sched:sched_swap_numa 204 187
migrate:mm_migrate_pages 1,180 316
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 27409 24195
numa_hint_faults_local 20677 21639
numa_hit 239988 238331
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 239983 238331
numa_other 5 0
numa_pages_migrated 1016 204
numa_pte_updates 27916 24561
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 60,899,307 62,738,978
migrations 544,668 562,702
faults 270,834 228,465
cache-misses 74,543,455,635 75,778,067,952
sched:sched_move_numa 735 648
sched:sched_stick_numa 25 13
sched:sched_swap_numa 174 137
migrate:mm_migrate_pages 816 733
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 11059 10281
numa_hint_faults_local 4733 3242
numa_hit 41384 36338
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 41383 36338
numa_other 1 0
numa_pages_migrated 815 706
numa_pte_updates 11323 10176
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Task migration under NUMA balancing can happen in parallel. More than
one task might choose to migrate to the same CPU at the same time. This
can result in:
- During task swap, choosing a task that was not part of the evaluation.
- During task swap, task which just got moved into its preferred node,
moving to a completely different node.
- During task swap, task failing to move to the preferred node, will have
to wait an extra interval for the next migrate opportunity.
- During task movement, multiple task movements can cause load imbalance.
This problem is more likely if there are more cores per node or more
nodes in the system.
Use a per run-queue variable to check if NUMA-balance is active on the
run-queue.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 200194 203353 1.57797
1 311331 328205 5.41995
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 197654 214384 8.46429
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 192605 188553 -2.10379
1 213402 196273 -8.02664
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 52227.1 57581.2 10.2516
1 102529 103468 0.915838
There is a regression on power 9 box. If we look at the details,
that box has a sudden jump in cache-misses with this patch.
All other parameters seem to be pointing towards NUMA
consolidation.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,345,784 13,941,377
migrations 1,127,820 1,157,323
faults 374,736 382,175
cache-misses 55,132,054,603 54,993,823,500
sched:sched_move_numa 1,923 2,005
sched:sched_stick_numa 52 14
sched:sched_swap_numa 595 529
migrate:mm_migrate_pages 1,932 1,573
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 60605 67099
numa_hint_faults_local 51804 58456
numa_hit 239945 240416
numa_huge_pte_updates 14 18
numa_interleave 60 65
numa_local 239865 240339
numa_other 80 77
numa_pages_migrated 1931 1574
numa_pte_updates 67823 77182
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,016,467 3,176,453
migrations 37,326 30,238
faults 115,342 87,869
cache-misses 11,692,155,554 12,544,479,391
sched:sched_move_numa 965 23
sched:sched_stick_numa 8 0
sched:sched_swap_numa 35 6
migrate:mm_migrate_pages 1,168 10
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 16286 236
numa_hint_faults_local 11863 201
numa_hit 112482 72293
numa_huge_pte_updates 33 0
numa_interleave 20 26
numa_local 112419 72233
numa_other 63 60
numa_pages_migrated 1144 8
numa_pte_updates 32859 0
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,629,724 8,478,820
migrations 221,052 171,323
faults 308,661 307,499
cache-misses 135,574,913 240,353,599
sched:sched_move_numa 147 214
sched:sched_stick_numa 0 0
sched:sched_swap_numa 2 4
migrate:mm_migrate_pages 64 89
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11481 5301
numa_hint_faults_local 10968 4745
numa_hit 89773 92943
numa_huge_pte_updates 0 0
numa_interleave 1116 899
numa_local 89220 92345
numa_other 553 598
numa_pages_migrated 62 88
numa_pte_updates 11694 5505
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,272,887 2,066,172
migrations 12,206 11,076
faults 163,704 149,544
cache-misses 4,801,186 10,398,067
sched:sched_move_numa 44 43
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 17 6
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 2261 3552
numa_hint_faults_local 1993 3347
numa_hit 25726 25611
numa_huge_pte_updates 0 0
numa_interleave 239 213
numa_local 25498 25583
numa_other 228 28
numa_pages_migrated 17 6
numa_pte_updates 2266 3535
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 117,980,962 99,358,136
migrations 3,950,220 4,041,607
faults 736,979 749,653
cache-misses 224,976,072,879 225,562,543,251
sched:sched_move_numa 504 771
sched:sched_stick_numa 50 14
sched:sched_swap_numa 239 204
migrate:mm_migrate_pages 1,260 1,180
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 18293 27409
numa_hint_faults_local 11969 20677
numa_hit 240854 239988
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 240851 239983
numa_other 3 5
numa_pages_migrated 1190 1016
numa_pte_updates 18106 27916
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 61,053,158 60,899,307
migrations 551,586 544,668
faults 244,174 270,834
cache-misses 74,326,766,973 74,543,455,635
sched:sched_move_numa 344 735
sched:sched_stick_numa 24 25
sched:sched_swap_numa 140 174
migrate:mm_migrate_pages 568 816
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 6461 11059
numa_hint_faults_local 2283 4733
numa_hit 35661 41384
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 35661 41383
numa_other 0 1
numa_pages_migrated 568 815
numa_pte_updates 6518 11323
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
nr_running in struct numa_stats is not used anywhere in the code.
Remove it.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
With:
commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")
the local variables 'smt', 'cpus' and 'capacity' and their results are not used
anymore in numa_has_capacity()
Remove this unused code.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
CPU with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks _ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong
can happen because of the compiler.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
- rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
- sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity CPUs, this will be evaluated to False.
- rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure
As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see:
https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
On asymmetric CPU capacity systems load intensive tasks can end up on
CPUs that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to CPUs with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-CPU capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the CPU with the highest misfit load as the source CPU.
5. If the misfit task is alone on the source CPU, go for active
balancing.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The current sg->min_capacity tracks the lowest per-CPU compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple CPUs if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all CPUs or
asymmetric CPU capacities (e.g. big.LITTLE).
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
To maximize throughput in systems with asymmetric CPU capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and CPU utilization
as well as per-CPU compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity CPU need to be
identified and migrated to a higher capacity CPU if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: 'group_misfit_task'. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-CPU capacity. 'group_misfit_task' is only considered
if the system is not overloaded or imbalanced ('group_imbalanced' or
'group_overloaded').
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each CPU is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The existing asymmetric CPU capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Fix kernel-doc warning for missing 'flags' parameter description:
../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
It can happen that load_balance() finds a busiest group and then a
busiest rq but the calculated imbalance is in fact 0.
In such situation, detach_tasks() returns immediately and lets the
flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to
have pinned tasks and removed from the load balance mask. then, we
redo a load balance without the busiest CPU. This creates wrong load
balance situation and generates wrong task migration.
If the calculated imbalance is 0, it's useless to try to find a
busiest rq as no task will be migrated and we can return immediately.
This situation can happen with heterogeneous system or smp system when
RT tasks are decreasing the capacity of some CPUs.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: jhugo@codeaurora.org
Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Since commit:
523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
scale_rt_capacity() returns the remaining capacity and not a scale factor
to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by
scale_rt_capacity() so we must take the sched_domain argument.
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.
For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.
Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.
Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().
Based on a similar patch from John Dias <joaodias@google.com>.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Steve Muckle <smuckle@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: John Dias <joaodias@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miguel de Dios <migueldedios@google.com>
Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: kernel-team@android.com
Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
update_blocked_averages() is called to periodiccally decay the stalled load
of idle CPUs and to sync all loads before running load balance.
When cfs rq is idle, it trigs a load balance during pick_next_task_fair()
in order to potentially pull tasks and to use this newly idle CPU. This
load balance happens whereas prev task from another class has not been put
and its utilization updated yet. This may lead to wrongly account running
time as idle time for RT or DL classes.
Test that no RT or DL task is running when updating their utilization in
update_blocked_averages().
We still update RT and DL utilization instead of simply skipping them to
make sure that all metrics are synced when used during load balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking")
Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
numa_migrate_preferred() is called periodically or when task preferred
node changes. Preferred node evaluations happen once per scan sequence.
If the scan completion happens just after the periodic NUMA migration,
then we try to migrate to the preferred node and the preferred node might
change, needing another node migration.
Avoid this by checking for scan sequence completion only when checking
for periodic migration.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25862.6 26158.1 1.14258
1 74357 72725 -2.19482
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 117019 113992 -2.58
1 179095 174947 -2.31
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 449.46 770.77 615.22 101.70
numa01.sh Sys: 132.72 208.17 170.46 24.96
numa01.sh User: 39185.26 60290.89 50066.76 6807.84
numa02.sh Real: 60.85 61.79 61.28 0.37
numa02.sh Sys: 15.34 24.71 21.08 3.61
numa02.sh User: 5204.41 5249.85 5231.21 17.60
numa03.sh Real: 785.50 916.97 840.77 44.98
numa03.sh Sys: 108.08 133.60 119.43 8.82
numa03.sh User: 61422.86 70919.75 64720.87 3310.61
numa04.sh Real: 429.57 587.37 480.80 57.40
numa04.sh Sys: 240.61 321.97 290.84 33.58
numa04.sh User: 34597.65 40498.99 37079.48 2060.72
numa05.sh Real: 392.09 431.25 414.65 13.82
numa05.sh Sys: 229.41 372.48 297.54 53.14
numa05.sh User: 33390.86 34697.49 34222.43 556.42
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 424.63 566.18 498.12 59.26 23.50%
numa01.sh Sys: 160.19 256.53 208.98 37.02 -18.4%
numa01.sh User: 37320.00 46225.58 42001.57 3482.45 19.20%
numa02.sh Real: 60.17 62.47 60.91 0.85 0.607%
numa02.sh Sys: 15.30 22.82 17.04 2.90 23.70%
numa02.sh User: 5202.13 5255.51 5219.08 20.14 0.232%
numa03.sh Real: 823.91 844.89 833.86 8.46 0.828%
numa03.sh Sys: 130.69 148.29 140.47 6.21 -14.9%
numa03.sh User: 62519.15 64262.20 63613.38 620.05 1.740%
numa04.sh Real: 515.30 603.74 548.56 30.93 -12.3%
numa04.sh Sys: 459.73 525.48 489.18 21.63 -40.5%
numa04.sh User: 40561.96 44919.18 42047.87 1526.85 -11.8%
numa05.sh Real: 396.58 454.37 421.13 19.71 -1.53%
numa05.sh Sys: 208.72 422.02 348.90 73.60 -14.7%
numa05.sh User: 33124.08 36109.35 34846.47 1089.74 -1.79%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
consolidated to the closest group of nodes. In such a case, relying on
group_fault metric may not always help to consolidate. There can always
be a case where a node closer to the preferred node may have lesser
faults than a node further away from the preferred node. In such a case,
moving to node with more faults might avoid numa consolidation.
Using group_weight would help to consolidate task/memory around the
preferred_node.
While here, to be on the conservative side, don't override migrate thread
degrades locality logic for CPU_NEWLY_IDLE load balancing.
Note: Similar problems exist with should_numa_migrate_memory and will be
dealt separately.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25645.4 25960 1.22
1 72142 73550 1.95
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 110199 120071 8.958
1 176303 176249 -0.03
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 490.04 774.86 596.26 96.46
numa01.sh Sys: 151.52 242.88 184.82 31.71
numa01.sh User: 41418.41 60844.59 48776.09 6564.27
numa02.sh Real: 60.14 62.94 60.98 1.00
numa02.sh Sys: 16.11 30.77 21.20 5.28
numa02.sh User: 5184.33 5311.09 5228.50 44.24
numa03.sh Real: 790.95 856.35 826.41 24.11
numa03.sh Sys: 114.93 118.85 117.05 1.63
numa03.sh User: 60990.99 64959.28 63470.43 1415.44
numa04.sh Real: 434.37 597.92 504.87 59.70
numa04.sh Sys: 237.63 397.40 289.74 55.98
numa04.sh User: 34854.87 41121.83 38572.52 2615.84
numa05.sh Real: 386.77 448.90 417.22 22.79
numa05.sh Sys: 149.23 379.95 303.04 79.55
numa05.sh User: 32951.76 35959.58 34562.18 1034.05
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 493.19 672.88 597.51 59.38 -0.20%
numa01.sh Sys: 150.09 245.48 207.76 34.26 -11.0%
numa01.sh User: 41928.51 53779.17 48747.06 3901.39 0.059%
numa02.sh Real: 60.63 62.87 61.22 0.83 -0.39%
numa02.sh Sys: 16.64 27.97 20.25 4.06 4.691%
numa02.sh User: 5222.92 5309.60 5254.03 29.98 -0.48%
numa03.sh Real: 821.52 902.15 863.60 32.41 -4.30%
numa03.sh Sys: 112.04 130.66 118.35 7.08 -1.09%
numa03.sh User: 62245.16 69165.14 66443.04 2450.32 -4.47%
numa04.sh Real: 414.53 519.57 476.25 37.00 6.009%
numa04.sh Sys: 181.84 335.67 280.41 54.07 3.327%
numa04.sh User: 33924.50 39115.39 37343.78 1934.26 3.290%
numa05.sh Real: 408.30 441.45 417.90 12.05 -0.16%
numa05.sh Sys: 233.41 381.60 295.58 57.37 2.523%
numa05.sh User: 33301.31 35972.50 34335.19 938.94 0.661%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The metrics for updating scan periods are local or task specific.
Currently this update happens under the numa_group lock, which seems
unnecessary. Hence move this update outside the lock.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25355.9 25645.4 1.141
1 72812 72142 -0.92
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
task_numa_find_cpu() helps to find the CPU to swap/move the task to.
It's guarded by numa_has_capacity(). However node not having capacity
shouldn't deter a task swapping if it helps NUMA placement.
Further load_too_imbalanced(), which evaluates possibilities of move/swap,
provides similar checks as numa_has_capacity.
Hence remove numa_has_capacity() to enhance possibilities of task
swapping even if load is imbalanced.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25657.9 25804.1 0.569
1 74435 73413 -1.37
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There are checks in migrate_swap_stop() that check if the task/CPU
combination is as per migrate_swap_arg before migrating.
However atleast one of the two tasks to be swapped by migrate_swap() could
have migrated to a completely different CPU before updating the
migrate_swap_arg. The new CPU where the task is currently running could
be a different node too. If the task has migrated, numa balancer might
end up placing a task in a wrong node. Instead of achieving node
consolidation, it may end up spreading the load across nodes.
To avoid that pass the CPUs as additional parameters.
While here, place migrate_swap under CONFIG_NUMA_BALANCING.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25377.3 25226.6 -0.59
1 72287 73326 1.437
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The task_capacity field in 'struct numa_stats' is redundant.
Also move nr_running for better packing within the struct.
No functional changes.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25308.6 25377.3 0.271
1 72964 72287 -0.92
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When comparing two nodes at a distance of 'hoplimit', we should consider
nodes only up to 'hoplimit'. Currently we also consider nodes at 'oplimit'
distance too. Hence two nodes at a distance of 'hoplimit' will have same
groupweight. Fix this by skipping nodes at hoplimit.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25375.3 25308.6 -0.26
1 72617 72964 0.477
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 113372 108750 -4.07684
1 177403 183115 3.21979
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 478.45 565.90 515.11 30.87
numa01.sh Sys: 207.79 271.04 232.94 21.33
numa01.sh User: 39763.93 47303.12 43210.73 2644.86
numa02.sh Real: 60.00 61.46 60.78 0.49
numa02.sh Sys: 15.71 25.31 20.69 3.42
numa02.sh User: 5175.92 5265.86 5235.97 32.82
numa03.sh Real: 776.42 834.85 806.01 23.22
numa03.sh Sys: 114.43 128.75 121.65 5.49
numa03.sh User: 60773.93 64855.25 62616.91 1576.39
numa04.sh Real: 456.93 511.95 482.91 20.88
numa04.sh Sys: 178.09 460.89 356.86 94.58
numa04.sh User: 36312.09 42553.24 39623.21 2247.96
numa05.sh Real: 393.98 493.48 436.61 35.59
numa05.sh Sys: 164.49 329.15 265.87 61.78
numa05.sh User: 33182.65 36654.53 35074.51 1187.71
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 414.64 819.20 556.08 147.70 -7.36%
numa01.sh Sys: 77.52 205.04 139.40 52.05 67.10%
numa01.sh User: 37043.24 61757.88 45517.48 9290.38 -5.06%
numa02.sh Real: 60.80 63.32 61.63 0.88 -1.37%
numa02.sh Sys: 17.35 39.37 25.71 7.33 -19.5%
numa02.sh User: 5213.79 5374.73 5268.90 55.09 -0.62%
numa03.sh Real: 780.09 948.64 831.43 63.02 -3.05%
numa03.sh Sys: 104.96 136.92 116.31 11.34 4.591%
numa03.sh User: 60465.42 73339.78 64368.03 4700.14 -2.72%
numa04.sh Real: 412.60 681.92 521.29 96.64 -7.36%
numa04.sh Sys: 210.32 314.10 251.77 37.71 41.74%
numa04.sh User: 34026.38 45581.20 38534.49 4198.53 2.825%
numa05.sh Real: 394.79 439.63 411.35 16.87 6.140%
numa05.sh Sys: 238.32 330.09 292.31 38.32 -9.04%
numa05.sh User: 33456.45 34876.07 34138.62 609.45 2.741%
While there is a regression with this change, this change is needed from a
correctness perspective. Also it helps consolidation as seen from perf bench
output.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When numa_group faults are available, task_numa_placement only uses
numa_group faults to evaluate preferred node. However it still accounts
task faults and even evaluates the preferred node just based on task
faults just to discard it in favour of preferred node chosen on the
basis of numa_group.
Instead use task faults only if numa_group is not set.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25549.6 25215.7 -1.30
1 73190 72107 -1.47
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 113437 113372 -0.05
1 196130 177403 -9.54
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 506.35 794.46 599.06 104.26
numa01.sh Sys: 150.37 223.56 195.99 24.94
numa01.sh User: 43450.69 61752.04 49281.50 6635.33
numa02.sh Real: 60.33 62.40 61.31 0.90
numa02.sh Sys: 18.12 31.66 24.28 5.89
numa02.sh User: 5203.91 5325.32 5260.29 49.98
numa03.sh Real: 696.47 853.62 745.80 57.28
numa03.sh Sys: 85.68 123.71 97.89 13.48
numa03.sh User: 55978.45 66418.63 59254.94 3737.97
numa04.sh Real: 444.05 514.83 497.06 26.85
numa04.sh Sys: 230.39 375.79 316.23 48.58
numa04.sh User: 35403.12 41004.10 39720.80 2163.08
numa05.sh Real: 423.09 460.41 439.57 13.92
numa05.sh Sys: 287.38 480.15 369.37 68.52
numa05.sh User: 34732.12 38016.80 36255.85 1070.51
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 478.45 565.90 515.11 30.87 16.29%
numa01.sh Sys: 207.79 271.04 232.94 21.33 -15.8%
numa01.sh User: 39763.93 47303.12 43210.73 2644.86 14.04%
numa02.sh Real: 60.00 61.46 60.78 0.49 0.871%
numa02.sh Sys: 15.71 25.31 20.69 3.42 17.35%
numa02.sh User: 5175.92 5265.86 5235.97 32.82 0.464%
numa03.sh Real: 776.42 834.85 806.01 23.22 -7.47%
numa03.sh Sys: 114.43 128.75 121.65 5.49 -19.5%
numa03.sh User: 60773.93 64855.25 62616.91 1576.39 -5.36%
numa04.sh Real: 456.93 511.95 482.91 20.88 2.930%
numa04.sh Sys: 178.09 460.89 356.86 94.58 -11.3%
numa04.sh User: 36312.09 42553.24 39623.21 2247.96 0.246%
numa05.sh Real: 393.98 493.48 436.61 35.59 0.677%
numa05.sh Sys: 164.49 329.15 265.87 61.78 38.92%
numa05.sh User: 33182.65 36654.53 35074.51 1187.71 3.368%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently preferred node is set to dst_nid which is the last node in the
iteration whose group weight or task weight is greater than the current
node. However it doesn't guarantee that dst_nid has the numa capacity
to move. It also doesn't guarantee that dst_nid has the best_cpu which
is the CPU/node ideal for node migration.
Lets consider faults on a 4 node system with group weight numbers
in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
is running on 3 and 0 is its preferred node but its capacity is full.
Consider nodes 1, 2 and 3 have capacity. Then the task should be
migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
points to the last node whose faults were greater than current node.
Modify to set the preferred node based of best_cpu. Earlier setting
preferred node was skipped if nr_active_nodes is 1. This could result in
the task being moved out of the preferred node to a random node during
regular load balancing.
Also while modifying task_numa_migrate(), use sched_setnuma to set
preferred node. This ensures out numa accounting is correct.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25122.9 25549.6 1.698
1 73850 73190 -0.89
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 105930 113437 7.08676
1 178624 196130 9.80047
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 435.78 653.81 534.58 83.20
numa01.sh Sys: 121.93 187.18 145.90 23.47
numa01.sh User: 37082.81 51402.80 43647.60 5409.75
numa02.sh Real: 60.64 61.63 61.19 0.40
numa02.sh Sys: 14.72 25.68 19.06 4.03
numa02.sh User: 5210.95 5266.69 5233.30 20.82
numa03.sh Real: 746.51 808.24 780.36 23.88
numa03.sh Sys: 97.26 108.48 105.07 4.28
numa03.sh User: 58956.30 61397.05 60162.95 1050.82
numa04.sh Real: 465.97 519.27 484.81 19.62
numa04.sh Sys: 304.43 359.08 334.68 20.64
numa04.sh User: 37544.16 41186.15 39262.44 1314.91
numa05.sh Real: 411.57 457.20 433.29 16.58
numa05.sh Sys: 230.05 435.48 339.95 67.58
numa05.sh User: 33325.54 36896.31 35637.84 1222.64
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 506.35 794.46 599.06 104.26 -10.76%
numa01.sh Sys: 150.37 223.56 195.99 24.94 -25.55%
numa01.sh User: 43450.69 61752.04 49281.50 6635.33 -11.43%
numa02.sh Real: 60.33 62.40 61.31 0.90 -0.195%
numa02.sh Sys: 18.12 31.66 24.28 5.89 -21.49%
numa02.sh User: 5203.91 5325.32 5260.29 49.98 -0.513%
numa03.sh Real: 696.47 853.62 745.80 57.28 4.6339%
numa03.sh Sys: 85.68 123.71 97.89 13.48 7.3347%
numa03.sh User: 55978.45 66418.63 59254.94 3737.97 1.5323%
numa04.sh Real: 444.05 514.83 497.06 26.85 -2.464%
numa04.sh Sys: 230.39 375.79 316.23 48.58 5.8343%
numa04.sh User: 35403.12 41004.10 39720.80 2163.08 -1.153%
numa05.sh Real: 423.09 460.41 439.57 13.92 -1.428%
numa05.sh Sys: 287.38 480.15 369.37 68.52 -7.964%
numa05.sh User: 34732.12 38016.80 36255.85 1070.51 -1.704%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently load_too_imbalance() cares about the slope of imbalance.
It doesn't care of the direction of the imbalance.
However this may not work if nodes that are being compared have
dissimilar capacities. Few nodes might have more cores than other nodes
in the system. Also unlike traditional load balance at a NUMA sched
domain, multiple requests to migrate from the same source node to same
destination node may run in parallel. This can cause huge load
imbalance. This is specially true on a larger machines with either large
cores per node or more number of nodes in the system. Hence allow
move/swap only if the imbalance is going to reduce.
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25058.2 25122.9 0.25
1 72950 73850 1.23
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 516.14 892.41 739.84 151.32
numa01.sh Sys: 153.16 192.99 177.70 14.58
numa01.sh User: 39821.04 69528.92 57193.87 10989.48
numa02.sh Real: 60.91 62.35 61.58 0.63
numa02.sh Sys: 16.47 26.16 21.20 3.85
numa02.sh User: 5227.58 5309.61 5265.17 31.04
numa03.sh Real: 739.07 917.73 795.75 64.45
numa03.sh Sys: 94.46 136.08 109.48 14.58
numa03.sh User: 57478.56 72014.09 61764.48 5343.69
numa04.sh Real: 442.61 715.43 530.31 96.12
numa04.sh Sys: 224.90 348.63 285.61 48.83
numa04.sh User: 35836.84 47522.47 40235.41 3985.26
numa05.sh Real: 386.13 489.17 434.94 43.59
numa05.sh Sys: 144.29 438.56 278.80 105.78
numa05.sh User: 33255.86 36890.82 34879.31 1641.98
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 435.78 653.81 534.58 83.20 38.39%
numa01.sh Sys: 121.93 187.18 145.90 23.47 21.79%
numa01.sh User: 37082.81 51402.80 43647.60 5409.75 31.03%
numa02.sh Real: 60.64 61.63 61.19 0.40 0.637%
numa02.sh Sys: 14.72 25.68 19.06 4.03 11.22%
numa02.sh User: 5210.95 5266.69 5233.30 20.82 0.608%
numa03.sh Real: 746.51 808.24 780.36 23.88 1.972%
numa03.sh Sys: 97.26 108.48 105.07 4.28 4.197%
numa03.sh User: 58956.30 61397.05 60162.95 1050.82 2.661%
numa04.sh Real: 465.97 519.27 484.81 19.62 9.385%
numa04.sh Sys: 304.43 359.08 334.68 20.64 -14.6%
numa04.sh User: 37544.16 41186.15 39262.44 1314.91 2.478%
numa05.sh Real: 411.57 457.20 433.29 16.58 0.380%
numa05.sh Sys: 230.05 435.48 339.95 67.58 -17.9%
numa05.sh User: 33325.54 36896.31 35637.84 1222.64 -2.12%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
task_numa_compare() helps choose the best CPU to move or swap the
selected task. To achieve this task_numa_compare() is called for every
CPU in the node. Currently it evaluates if the task can be moved/swapped
for each of the CPUs. However the move evaluation is mostly independent
of the CPU. Evaluating the move logic once per node, provides scope for
simplifying task_numa_compare().
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
16 25705.2 25058.2 -2.51
1 74433 72950 -1.99
Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS LAST_PATCH WITH_PATCH %CHANGE
8 96589.6 105930 9.670
1 181830 178624 -1.76
(numbers from v1 based on v4.17-rc5)
Testcase Time: Min Max Avg StdDev
numa01.sh Real: 440.65 941.32 758.98 189.17
numa01.sh Sys: 183.48 320.07 258.42 50.09
numa01.sh User: 37384.65 71818.14 60302.51 13798.96
numa02.sh Real: 61.24 65.35 62.49 1.49
numa02.sh Sys: 16.83 24.18 21.40 2.60
numa02.sh User: 5219.59 5356.34 5264.03 49.07
numa03.sh Real: 822.04 912.40 873.55 37.35
numa03.sh Sys: 118.80 140.94 132.90 7.60
numa03.sh User: 62485.19 70025.01 67208.33 2967.10
numa04.sh Real: 690.66 872.12 778.49 65.44
numa04.sh Sys: 459.26 563.03 494.03 42.39
numa04.sh User: 51116.44 70527.20 58849.44 8461.28
numa05.sh Real: 418.37 562.28 525.77 54.27
numa05.sh Sys: 299.45 481.00 392.49 64.27
numa05.sh User: 34115.09 41324.02 39105.30 2627.68
Testcase Time: Min Max Avg StdDev %Change
numa01.sh Real: 516.14 892.41 739.84 151.32 2.587%
numa01.sh Sys: 153.16 192.99 177.70 14.58 45.42%
numa01.sh User: 39821.04 69528.92 57193.87 10989.48 5.435%
numa02.sh Real: 60.91 62.35 61.58 0.63 1.477%
numa02.sh Sys: 16.47 26.16 21.20 3.85 0.943%
numa02.sh User: 5227.58 5309.61 5265.17 31.04 -0.02%
numa03.sh Real: 739.07 917.73 795.75 64.45 9.776%
numa03.sh Sys: 94.46 136.08 109.48 14.58 21.39%
numa03.sh User: 57478.56 72014.09 61764.48 5343.69 8.813%
numa04.sh Real: 442.61 715.43 530.31 96.12 46.79%
numa04.sh Sys: 224.90 348.63 285.61 48.83 72.97%
numa04.sh User: 35836.84 47522.47 40235.41 3985.26 46.26%
numa05.sh Real: 386.13 489.17 434.94 43.59 20.88%
numa05.sh Sys: 144.29 438.56 278.80 105.78 40.77%
numa05.sh User: 33255.86 36890.82 34879.31 1641.98 12.11%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Reuse cpu_util_irq() that has been defined for schedutil and set irq util
to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
But the compiler is not able to optimize the sequence (at least with
aarch64 GCC 7.2.1):
free *= (max - irq);
free /= max;
when irq is fixed to 0
Add a new inline function scale_irq_capacity() that will scale utilization
when irq is accounted. Reuse this funciton in schedutil which applies
similar formula.
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
rt_avg is not used anywhere anymore, so we can remove all related code.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The utilization of the CPU by RT, DL and IRQs are now tracked with
PELT so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for CFS class.
scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for CFS instead of a scaling factor because RT, DL and
IRQ provide now absolute utilization value.
The same formula as schedutil is used:
IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg
but the implementation is different because it doesn't return the same value
and doesn't benefit of the same optimization.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.
That's also important to note that because:
rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.
The CPU utilization is:
avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Similarly to what happens with RT tasks, CFS tasks can be preempted by DL
tasks and the CFS's utilization might no longer describes the real
utilization level.
Current DL bandwidth reflects the requirements to meet deadline when tasks are
enqueued but not the current utilization of the DL sched class. We track
DL class utilization to estimate the system utilization.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS
tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks
are preempted by RT tasks and in this case util_avg reflects the remaining
capacity but not what CFS want to use. In such case, schedutil can select a
lower OPP whereas the CPU is overloaded. In order to have a more accurate
view of the utilization of the CPU, we track the utilization of RT tasks.
Only util_avg is correctly tracked but not load_avg and runnable_load_avg
which are useless for rt_rq.
rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization to cfs tasks and make them lighter than they are.
As we want to use the same load tracking mecanism for both and prevent
useless dependency between cfs and rt code, PELT code is moved in a
dedicated file.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When a new task wakes-up for the first time, its initial utilization
is set to half of the spare capacity of its CPU. The current
implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
directly as a capacity reference. As a result, on a big.LITTLE system, a
new task waking up on an idle little CPU will be given ~512 of util_avg,
even if the CPU's capacity is significantly less than that.
Fix this by computing the spare capacity with arch_scale_cpu_capacity().
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|