summaryrefslogtreecommitdiff
path: root/include/linux/sched
AgeCommit message (Collapse)Author
2023-06-27Merge tag 'sched-core-2023-06-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations - Fix task_struct::saved_state handling - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. - Remove unused code - Mark more functions __init - Fix shadow-variable warnings" * tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle() sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() sched/core: Fixed missing rq clock update before calling set_rq_offline() sched/deadline: Update GRUB description in the documentation sched/deadline: Fix bandwidth reclaim equation in GRUB sched/wait: Fix a kthread_park race with wait_woken() sched/topology: Mark set_sched_topology() __init sched/fair: Rename variable cpu_util eff_util arm64/arch_timer: Fix MMIO byteswap sched/fair, cpufreq: Introduce 'runnable boosting' sched/fair: Refactor CPU utilization functions cpuidle: Use local_clock_noinstr() sched/clock: Provide local_clock_noinstr() x86/tsc: Provide sched_clock_noinstr() clocksource: hyper-v: Provide noinstr sched_clock() clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX x86/vdso: Fix gettimeofday masking math64: Always inline u128 version of mul_u64_u64_shr() s390/time: Provide sched_clock_noinstr() loongarch: Provide noinstr sched_clock_read() ...
2023-06-26Merge tag 'timers-core-2023-06-26' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Time, timekeeping and related device driver updates: Core: - A set of fixes, cleanups and enhancements to the posix timer code: - Prevent another possible live lock scenario in the exit() path, which affects POSIX_CPU_TIMERS_TASK_WORK enabled architectures. - Fix a loop termination issue which was reported syzcaller/KSAN in the posix timer ID allocation code. That triggered a deeper look into the posix-timer code which unearthed more small issues. - Add missing READ/WRITE_ONCE() annotations - Fix or remove completely outdated comments - Document places which are subtle and completely undocumented. - Add missing hrtimer modes to the trace event decoder - Small cleanups and enhancements all over the place Drivers: - Rework the Hyper-V clocksource and sched clock setup code - Remove a deprecated clocksource driver - Small fixes and enhancements all over the place" * tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits) clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe dt-bindings: timers: Add Ralink SoCs timer clocksource/drivers/hyper-v: Rework clocksource and sched clock setup dt-bindings: timer: brcm,kona-timer: convert to YAML clocksource/drivers/imx-gpt: Fold <soc/imx/timer.h> into its only user clk: imx: Drop inclusion of unused header <soc/imx/timer.h> hrtimer: Add missing sparse annotations to hrtimer locking clocksource/drivers/imx-gpt: Use only a single name for functions clocksource/drivers/loongson1: Move PWM timer to clocksource framework dt-bindings: timer: Add Loongson-1 clocksource MIPS: Loongson32: Remove deprecated PWM timer clocksource clocksource/drivers/ingenic-timer: Use pm_sleep_ptr() macro tracing/timer: Add missing hrtimer modes to decode_hrtimer_mode(). posix-timers: Add sys_ni_posix_timers() prototype tick/rcu: Fix bogus ratelimit condition alarmtimer: Remove unnecessary (void *) cast alarmtimer: Remove unnecessary initialization of variable 'ret' posix-timers: Refer properly to CONFIG_HIGH_RES_TIMERS posix-timers: Polish coding style in a few places posix-timers: Remove pointless comments ...
2023-06-18posix-timers: Ensure timer ID search-loop limit is validThomas Gleixner
posix_timer_add() tries to allocate a posix timer ID by starting from the cached ID which was stored by the last successful allocation. This is done in a loop searching the ID space for a free slot one by one. The loop has to terminate when the search wrapped around to the starting point. But that's racy vs. establishing the starting point. That is read out lockless, which leads to the following problem: CPU0 CPU1 posix_timer_add() start = sig->posix_timer_id; lock(hash_lock); ... posix_timer_add() if (++sig->posix_timer_id < 0) start = sig->posix_timer_id; sig->posix_timer_id = 0; So CPU1 can observe a negative start value, i.e. -1, and the loop break never happens because the condition can never be true: if (sig->posix_timer_id == start) break; While this is unlikely to ever turn into an endless loop as the ID space is huge (INT_MAX), the racy read of the start value caught the attention of KCSAN and Dmitry unearthed that incorrectness. Rewrite it so that all id operations are under the hash lock. Reported-by: syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87bkhzdn6g.ffs@tglx
2023-06-16sched/topology: Mark set_sched_topology() __initMiaohe Lin
All callers of set_sched_topology() are within __init section. Mark it __init too. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com
2023-06-05sched/clock: Provide local_clock_noinstr()Peter Zijlstra
Now that all ARCH_WANTS_NO_INSTR architectures (arm64, loongarch, s390, x86) provide sched_clock_noinstr(), use this to provide local_clock_noinstr(). This local_clock_noinstr() will be safe to use from noinstr code with the assumption that any such noinstr code is non-preemptible (it had better be, entry code will have IRQs disabled while __cpuidle must have preemption disabled). Specifically, preempt_enable_notrace(), a common part of many a sched_clock() implementation calls out to schedule() -- even though, per the above, it will never trigger -- which frustrates noinstr validation. vmlinux.o: warning: objtool: local_clock+0xb5: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.978624636@infradead.org
2023-06-01fork, vhost: Use CLONE_THREAD to fix freezer/ps regressionMike Christie
When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps or ps a would not incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's didn't disable or add support for them. To fix both bugs, this switches the vhost task to be thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND which requires those 2 signals to be supported. This is a modified version of the patch written by Mike Christie <michael.christie@oracle.com> which was a modified version of patch originally written by Linus. Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER. Including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit. Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c. Making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done. The main loop has been updated to call get_signal which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending() so clear that explicitly so the schedule() sleeps. For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file. To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec. The coredump code and de_thread have been modified to ignore vhost threads. Remvoing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running. Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Co-developed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-08sched/topology: Remove SHARED_CHILD from ASYM_PACKINGRicardo Neri
Only x86 and Power7 use ASYM_PACKING. They use it differently. Power7 has cores of equal priority, but the SMT siblings of a core have different priorities. Parent scheduling domains do not need (nor have) the ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would cause the topology debug code to complain. X86 has cores of different priority, but all the SMT siblings of the core have equal priority. It needs ASYM_PACKING at the MC level, but not at the SMT level (it also needs it at upper levels if they have scheduling groups of different priority). Removing ASYM_PACKING from the SMT domain causes the topology debug code to complain. Remove SHARED_CHILD for now. We still need a topology check that satisfies both architectures. Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230406203148.19182-10-ricardo.neri-calderon@linux.intel.com
2023-04-30Merge tag 'iommu-updates-v6.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - Convert to platform remove callback returning void - Extend changing default domain to normal group - Intel VT-d updates: - Remove VT-d virtual command interface and IOASID - Allow the VT-d driver to support non-PRI IOPF - Remove PASID supervisor request support - Various small and misc cleanups - ARM SMMU updates: - Device-tree binding updates: * Allow Qualcomm GPU SMMUs to accept relevant clock properties * Document Qualcomm 8550 SoC as implementing an MMU-500 * Favour new "qcom,smmu-500" binding for Adreno SMMUs - Fix S2CR quirk detection on non-architectural Qualcomm SMMU implementations - Acknowledge SMMUv3 PRI queue overflow when consuming events - Document (in a comment) why ATS is disabled for bypass streams - AMD IOMMU updates: - 5-level page-table support - NUMA awareness for memory allocations - Unisoc driver: Support for reattaching an existing domain - Rockchip driver: Add missing set_platform_dma_ops callback - Mediatek driver: Adjust the dma-ranges - Various other small fixes and cleanups * tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (82 commits) iommu: Remove iommu_group_get_by_id() iommu: Make iommu_release_device() static iommu/vt-d: Remove BUG_ON in dmar_insert_dev_scope() iommu/vt-d: Remove a useless BUG_ON(dev->is_virtfn) iommu/vt-d: Remove BUG_ON in map/unmap() iommu/vt-d: Remove BUG_ON when domain->pgd is NULL iommu/vt-d: Remove BUG_ON in handling iotlb cache invalidation iommu/vt-d: Remove BUG_ON on checking valid pfn range iommu/vt-d: Make size of operands same in bitwise operations iommu/vt-d: Remove PASID supervisor request support iommu/vt-d: Use non-privileged mode for all PASIDs iommu/vt-d: Remove extern from function prototypes iommu/vt-d: Do not use GFP_ATOMIC when not needed iommu/vt-d: Remove unnecessary checks in iopf disabling path iommu/vt-d: Move PRI handling to IOPF feature path iommu/vt-d: Move pfsid and ats_qdep calculation to device probe path iommu/vt-d: Move iopf code from SVA to IOPF enabling path iommu/vt-d: Allow SVA with device-specific IOPF dmaengine: idxd: Add enable/disable device IOPF feature arm64: dts: mt8186: Add dma-ranges for the parent "soc" node ...
2023-04-28Merge tag 'sched-core-2023-04-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Allow unprivileged PSI poll()ing - Fix performance regression introduced by mm_cid - Improve livepatch stalls by adding livepatch task switching to cond_resched(). This resolves livepatching busy-loop stalls with certain CPU-bound kthreads - Improve sched_move_task() performance on autogroup configs - On core-scheduling CPUs, avoid selecting throttled tasks to run - Misc cleanups, fixes and improvements * tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Fix local_clock() before sched_clock_init() sched/rt: Fix bad task migration for rt tasks sched: Fix performance regression introduced by mm_cid sched/core: Make sched_dynamic_mutex static sched/psi: Allow unprivileged polling of N*2s period sched/psi: Extract update_triggers side effect sched/psi: Rename existing poll members in preparation sched/psi: Rearrange polling code in preparation sched/fair: Fix inaccurate tally of ttwu_move_affine vhost: Fix livepatch timeouts in vhost_worker() livepatch,sched: Add livepatch task switching to cond_resched() livepatch: Skip task_call_func() for current task livepatch: Convert stack entries array to percpu sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization sched/core: Reduce cost of sched_move_task when config autogroup sched/core: Avoid selecting the task that is throttled to run when core-sched enable sched/topology: Make sched_energy_mutex,update static
2023-04-28Merge tag 'objtool-core-2023-04-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code - Generate ORC data for __pfx code - Add more __noreturn annotations to various kernel startup/shutdown and panic functions - Misc improvements & fixes * tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/hyperv: Mark hv_ghcb_terminate() as noreturn scsi: message: fusion: Mark mpt_halt_firmware() __noreturn x86/cpu: Mark {hlt,resume}_play_dead() __noreturn btrfs: Mark btrfs_assertfail() __noreturn objtool: Include weak functions in global_noreturns check cpu: Mark nmi_panic_self_stop() __noreturn cpu: Mark panic_smp_self_stop() __noreturn arm64/cpu: Mark cpu_park_loop() and friends __noreturn x86/head: Mark *_start_kernel() __noreturn init: Mark start_kernel() __noreturn init: Mark [arch_call_]rest_init() __noreturn objtool: Generate ORC data for __pfx code x86/linkage: Fix padding for typed functions objtool: Separate prefix code from stack validation code objtool: Remove superfluous dead_end_function() check objtool: Add symbol iteration helpers objtool: Add WARN_INSN() scripts/objdump-func: Support multiple functions context_tracking: Fix KCSAN noinstr violation objtool: Add stackleak instrumentation to uaccess safe list ...
2023-04-28Merge tag 'x86_mm_for_6.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user()
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-21mm: add new api to enable ksm per processStefan Roesch
Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21sched: Fix performance regression introduced by mm_cidMathieu Desnoyers
Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL sysbench regression reported by Aaron Lu. Keep track of the currently allocated mm_cid for each mm/cpu rather than freeing them immediately on context switch. This eliminates most atomic operations when context switching back and forth between threads belonging to different memory spaces in multi-threaded scenarios (many processes, each with many threads). The per-mm/per-cpu mm_cid values are serialized by their respective runqueue locks. Thread migration is handled by introducing invocation to sched_mm_cid_migrate_to() (with destination runqueue lock held) in activate_task() for migrating tasks. If the destination cpu's mm_cid is unset, and if the source runqueue is not actively using its mm_cid, then the source cpu's mm_cid is moved to the destination cpu on migration. Introduce a task-work executed periodically, similarly to NUMA work, which delays reclaim of cid values when they are unused for a period of time. Keep track of the allocation time for each per-cpu cid, and let the task work clear them when they are observed to be older than SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all mm_cids which are greater or equal to the Hamming weight of the mm cidmask to keep concurrency ids compact. Because we want to ensure the mm_cid converges towards the smaller values as migrations happen, the prior optimization that was done when context switching between threads belonging to the same mm is removed, because it could delay the lazy release of the destination runqueue mm_cid after it has been replaced by a migration. Removing this prior optimization is not an issue performance-wise because the introduced per-mm/per-cpu mm_cid tracking also covers this more specific case. Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") Reported-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Aaron Lu <aaron.lu@intel.com> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
2023-04-18sched/isolation: add cpu_is_isolated() APIFrederic Weisbecker
Patch series "memcg, cpuisol: do not interfere pcp cache charges draining with cpuisol workloads". Leonardo has reported [1] that pcp memcg charge draining can interfere with cpu isolated workloads. The said draining is done from a WQ context with a pcp worker scheduled on each CPU which holds any cached charges for a specific memcg hierarchy. Operation is not really a common operation [2]. It can be triggered from the userspace though so some care is definitely due. Leonardo has tried to address the issue by allowing remote charge draining [3]. This approach requires an additional locking to synchronize pcp caches sync from a remote cpu from local pcp consumers. Even though the proposed lock was per-cpu there is still potential for contention and less predictable behavior. This patchset addresses the issue from a different angle. Rather than dealing with a potential synchronization, cpus which are isolated are simply never scheduled to be drained. This means that a small amount of charges could be laying around and waiting for a later use or they are flushed when a different memcg is charged from the same cpu. More details are in patch 2. The first patch from Frederic is implementing an abstraction to tell whether a specific cpu has been isolated and therefore require a special treatment. This patch (of 2): Provide this new API to check if a CPU has been isolated either through isolcpus= or nohz_full= kernel parameter. It aims at avoiding kernel load deemed to be safely spared on CPUs running sensitive workload that can't bear any disturbance, such as pcp cache draining. Link: https://lkml.kernel.org/r/20230317134448.11082-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20230317134448.11082-2-mhocko@kernel.org Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Leonardo Bras <leobras@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-14sched: Fix KCSAN noinstr violationJosh Poimboeuf
With KCSAN enabled, end_of_stack() can get out-of-lined. Force it inline. Fixes the following warnings: vmlinux.o: warning: objtool: check_stackleak_irqoff+0x2b: call to end_of_stack() leaves .noinstr.text section Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/cc1b4d73d3a428a00d206242a68fdf99a934ca7b.1681320026.git.jpoimboe@kernel.org
2023-03-31iommu/sva: Move PASID helpers to sva codeJacob Pan
Preparing to remove IOASID infrastructure, PASID management will be under SVA code. Decouple mm code from IOASID. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20230322200803.869130-3-jacob.jun.pan@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-03-28lazy tlb: allow lazy tlb mm refcounting to be configurableNicholas Piggin
Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, so the patch introduces no functional change. Link: https://lkml.kernel.org/r/20230203071837.1136453-4-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28lazy tlb: introduce lazy tlb mm refcount helper functionsNicholas Piggin
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23vhost_task: Allow vhost layer to use copy_processMike Christie
Qemu will create vhost devices in the kernel which perform network, SCSI, etc IO and management operations from worker threads created by the kthread API. Because the kthread API does a copy_process on the kthreadd thread, the vhost layer has to use kthread_use_mm to access the Qemu thread's memory and cgroup_attach_task_all to add itself to the Qemu thread's cgroups, and it bypasses the RLIMIT_NPROC limit which can result in VMs creating more threads than the admin expected. This patch adds a new struct vhost_task which can be used instead of kthreads. They allow the vhost layer to use copy_process and inherit the userspace process's mm and cgroups, the task is accounted for under the userspace's nproc count and can be seen in its process tree, and other features like namespaces work and are inherited by default. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-03-16iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()Kirill A. Shutemov
Kernel has few users of pasid_valid() and all but one checks if the process has PASID allocated. The helper takes ioasid_t as the input. Replace the helper with mm_valid_pasid() that takes mm_struct as the argument. The only call that checks PASID that is not tied to mm_struct is open-codded now. This is preparatory patch. It helps avoid ifdeffery: no need to dereference mm->pasid in generic code to check if the process has PASID. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230312112612.31869-11-kirill.shutemov%40linux.intel.com
2023-03-12fork: allow kernel code to call copy_processMike Christie
The next patch adds helpers like create_io_thread, but for use by the vhost layer. There are several functions, so they are in their own file instead of cluttering up fork.c. This patch allows that new file to call copy_process. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-12fork: Add kernel_clone_args flag to ignore signalsMike Christie
Since: commit 10ab825bdef8 ("change kernel threads to ignore signals instead of blocking them") kthreads have been ignoring signals by default, and the vhost layer has never had a need to change that. This patch adds an option flag, USER_WORKER_SIG_IGN, handled in copy_process() after copy_sighand() and copy_signals() so vhost_tasks added in the next patches can continue to ignore singals. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-12fork: add kernel_clone_args flag to not dup/clone filesMike Christie
Each vhost device gets a thread that is used to perform IO and management operations. Instead of a thread that is accessing a device, the thread is part of the device, so when it creates a thread using a helper based on copy_process we can't dup or clone the parent's files/FDS because it would do an extra increment on ourself. Later, when we do: Qemu process exits: do_exit -> exit_files -> put_files_struct -> close_files we would leak the device's resources because of that extra refcount on the fd or file_struct. This patch adds a no_files option so these worker threads can prevent taking an extra refcount on themselves. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-12fork/vm: Move common PF_IO_WORKER behavior to new flagMike Christie
This adds a new flag, PF_USER_WORKER, that's used for behavior common to to both PF_IO_WORKER and users like vhost which will use a new helper instead of create_io_thread because they require different behavior for operations like signal handling. The common behavior PF_USER_WORKER covers is the vm reclaim handling. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-12kernel: Make io_thread and kthread bit fieldsMike Christie
We only set args->io_thread/kthread to 0 or 1 then test if they are set, so make them bit fields. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-12kernel: Allow a kernel thread's name to be set in copy_processMike Christie
This patch allows kernel users to pass in the thread name so it can be set during creation instead of having to use set_task_comm after the thread is created. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-02mm: implement memory-deny-write-execute as a prctlJoey Gouly
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)", v2. The background to this is that systemd has a configuration option called MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim is to prevent a user task from inadvertently creating an executable mapping that is (or was) writeable. Since such BPF filter is stateless, it cannot detect mappings that were previously writeable but subsequently changed to read-only. Therefore the filter simply rejects any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support (Branch Target Identification), the dynamic loader cannot change an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For libraries, it can resort to unmapping and re-mapping but for the main executable it does not have a file descriptor. The original bug report in the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries - [4]. This series adds in-kernel support for this feature as a prctl PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE | PROT_EXEC mappings. Like the systemd BPF filter it also denies adding PROT_EXEC to mappings. However unlike the BPF filter it only denies it if the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC -> PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF filter. This patch (of 2): The aim of such policy is to prevent a user task from creating an executable mapping that is also writeable. An example of mmap() returning -EACCESS if the policy is enabled: mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0); Similarly, mprotect() would return -EACCESS below: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC); The BPF filter that systemd MDWE uses is stateless, and disallows mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to be enabled if it was already PROT_EXEC, which allows the following case: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI); where PROT_BTI enables branch tracking identification on arm64. Link: https://lkml.kernel.org/r/20230119160344.54358-1-joey.gouly@arm.com Link: https://lkml.kernel.org/r/20230119160344.54358-2-joey.gouly@arm.com Signed-off-by: Joey Gouly <joey.gouly@arm.com> Co-developed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jeremy Linton <jeremy.linton@arm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Mark Brown <broonie@kernel.org> Cc: nd <nd@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Cc: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31sched/clock: Make local_clock() noinstrPeter Zijlstra
With sched_clock() noinstr, provide a noinstr implementation of local_clock(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230126151323.760767043@infradead.org
2023-01-13cpuidle, sched: Remove instrumentation from TIF_{POLLING_NRFLAG,NEED_RESCHED}Peter Zijlstra
objtool pointed out that various idle-TIF management methods have instrumentation: vmlinux.o: warning: objtool: mwait_idle+0x5: call to current_set_polling_and_test() leaves .noinstr.text section vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0xc5: call to current_set_polling_and_test() leaves .noinstr.text section vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x73: call to test_ti_thread_flag() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle+0xbc: call to current_set_polling_and_test() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_irq+0xea: call to current_set_polling_and_test() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_s2idle+0xb4: call to current_set_polling_and_test() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle+0xa6: call to current_clr_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_irq+0xbf: call to current_clr_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_s2idle+0xa1: call to current_clr_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: mwait_idle+0xe: call to __current_set_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0xc5: call to __current_set_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x73: call to test_ti_thread_flag() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle+0xbc: call to __current_set_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_irq+0xea: call to __current_set_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_s2idle+0xb4: call to __current_set_polling() leaves .noinstr.text section vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x73: call to test_ti_thread_flag() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_s2idle+0x73: call to test_ti_thread_flag.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle_irq+0x91: call to test_ti_thread_flag.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: intel_idle+0x78: call to test_ti_thread_flag.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: acpi_safe_halt+0xf: call to test_ti_thread_flag.constprop.0() leaves .noinstr.text section Remove the instrumentation, because these methods are used in low-level cpuidle code moving between states, that should not be instrumented. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.988741683@infradead.org
2022-12-27cputime: remove cputime_to_nsecs fallbackNicholas Piggin
The archs that use cputime_to_nsecs() internally provide their own definition and don't need the fallback. cputime_to_usecs() unused except in this fallback, and is not defined anywhere. This removes the final remnant of the cputime_t code from the kernel. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20221220070705.2958959-1-npiggin@gmail.com
2022-12-17Merge tag 'x86_mm_for_6.2_v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "New Feature: - Randomize the per-cpu entry areas Cleanups: - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it - Move to "native" set_memory_rox() helper - Clean up pmd_get_atomic() and i386-PAE - Remove some unused page table size macros" * tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) x86/mm: Ensure forced page table splitting x86/kasan: Populate shadow for shared chunk of the CPU entry area x86/kasan: Add helpers to align shadow addresses up and down x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area x86/mm: Recompute physical address for every page of per-CPU CEA mapping x86/mm: Rename __change_page_attr_set_clr(.checkalias) x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias() x86/mm: Untangle __change_page_attr_set_clr(.checkalias) x86/mm: Add a few comments x86/mm: Fix CR3_ADDR_MASK x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros mm: Convert __HAVE_ARCH_P..P_GET to the new style mm: Remove pointless barrier() after pmdp_get_lockless() x86/mm/pae: Get rid of set_64bit() x86_64: Remove pointless set_64bit() usage x86/mm/pae: Be consistent with pXXp_get_and_clear() x86/mm/pae: Use WRITE_ONCE() x86/mm/pae: Don't (ab)use atomic64 mm/gup: Fix the lockless PMD access ...
2022-12-15x86/mm: Use mm_alloc() in poking_init()Peter Zijlstra
Instead of duplicating init_mm, allocate a fresh mm. The advantage is that mm_alloc() has much simpler dependencies. Additionally it makes more conceptual sense, init_mm has no (and must not have) user state to duplicate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org
2022-12-15mm: Move mm_cachep initialization to mm_init()Peter Zijlstra
In order to allow using mm_alloc() much earlier, move initializing mm_cachep into mm_init(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.751153381@infradead.org
2022-12-14Merge tag 'for-linus-iommufd' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd implementation from Jason Gunthorpe: "iommufd is the user API to control the IOMMU subsystem as it relates to managing IO page tables that point at user space memory. It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO container) which is the VFIO specific interface for a similar idea. We see a broad need for extended features, some being highly IOMMU device specific: - Binding iommu_domain's to PASID/SSID - Userspace IO page tables, for ARM, x86 and S390 - Kernel bypassed invalidation of user page tables - Re-use of the KVM page table in the IOMMU - Dirty page tracking in the IOMMU - Runtime Increase/Decrease of IOPTE size - PRI support with faults resolved in userspace Many of these HW features exist to support VM use cases - for instance the combination of PASID, PRI and Userspace IO Page Tables allows an implementation of DMA Shared Virtual Addressing (vSVA) within a guest. Dirty tracking enables VM live migration with SRIOV devices and PASID support allow creating "scalable IOV" devices, among other things. As these features are fundamental to a VM platform they need to be uniformly exposed to all the driver families that do DMA into VMs, which is currently VFIO and VDPA" For more background, see the extended explanations in Jason's pull request: https://lore.kernel.org/lkml/Y5dzTU8dlmXTbzoJ@nvidia.com/ * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (62 commits) iommufd: Change the order of MSI setup iommufd: Improve a few unclear bits of code iommufd: Fix comment typos vfio: Move vfio group specific code into group.c vfio: Refactor dma APIs for emulated devices vfio: Wrap vfio group module init/clean code into helpers vfio: Refactor vfio_device open and close vfio: Make vfio_device_open() truly device specific vfio: Swap order of vfio_device_container_register() and open_device() vfio: Set device->group in helper function vfio: Create wrappers for group register/unregister vfio: Move the sanity check of the group to vfio_create_group() vfio: Simplify vfio_create_group() iommufd: Allow iommufd to supply /dev/vfio/vfio vfio: Make vfio_container optionally compiled vfio: Move container related MODULE_ALIAS statements into container.c vfio-iommufd: Support iommufd for emulated VFIO devices vfio-iommufd: Support iommufd for physical VFIO devices vfio-iommufd: Allow iommufd to be used in place of a container fd vfio: Use IOMMU_CAP_ENFORCE_CACHE_COHERENCY for vfio_file_enforced_coherent() ...
2022-11-30kernel/user: Allow user_struct::locked_vm to be usable for iommufdJason Gunthorpe
Following the pattern of io_uring, perf, skb, and bpf, iommfd will use user->locked_vm for accounting pinned pages. Ensure the value is included in the struct and export free_uid() as iommufd is modular. user->locked_vm is the good accounting to use for ulimit because it is per-user, and the security sandboxing of locked pages is not supposed to be per-process. Other places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this is only per-process and inconsistent with the new FOLL_LONGTERM users in the kernel. Concurrent work is underway to try to put this in a cgroup, so everything can be consistent and the kernel can provide a FOLL_LONGTERM limit that actually provides security. Link: https://lore.kernel.org/r/7-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-20sched: Move numa_balancing sysctls to its own fileKefeng Wang
The sysctl_numa_balancing_promote_rate_limit and sysctl_numa_balancing are part of sched, move them to its own file. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-10-12Merge tag 'mm-nonmm-stable-2022-10-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - hfs and hfsplus kmap API modernization (Fabio Francesco) - make crash-kexec work properly when invoked from an NMI-time panic (Valentin Schneider) - ntfs bugfixes (Hawkins Jiawei) - improve IPC msg scalability by replacing atomic_t's with percpu counters (Jiebin Sun) - nilfs2 cleanups (Minghao Chi) - lots of other single patches all over the tree! * tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype proc: test how it holds up with mapping'less process mailmap: update Frank Rowand email address ia64: mca: use strscpy() is more robust and safer init/Kconfig: fix unmet direct dependencies ia64: update config files nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure fork: remove duplicate included header files init/main.c: remove unnecessary (void*) conversions proc: mark more files as permanent nilfs2: remove the unneeded result variable nilfs2: delete unnecessary checks before brelse() checkpatch: warn for non-standard fixes tag style usr/gen_init_cpio.c: remove unnecessary -1 values from int file ipc/msg: mitigate the lock contention with percpu counter percpu: add percpu_counter_add_local and percpu_counter_sub_local fs/ocfs2: fix repeated words in comments relay: use kvcalloc to alloc page array in relay_alloc_page_array proc: make config PROC_CHILDREN depend on PROC_FS fs: uninline inode_maybe_inc_iversion() ...
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-09Merge tag 'signal-for-v5.20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace update from Eric Biederman: "ptrace: Stop supporting SIGKILL for PTRACE_EVENT_EXIT Recently I had a conversation where it was pointed out to me that SIGKILL sent to a tracee stropped in PTRACE_EVENT_EXIT is quite difficult for a tracer to handle. Keeping SIGKILL working after the process has been killed is pain from an implementation point of view. So since the debuggers don't want this behavior let's see if we can remove this wart for the userspace API If a regression is detected it should only need to be the last change that is the reverted. The other two are just general cleanups that make the last patch simpler" * tag 'signal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Drop signals received after a fatal signal has been processed signal: Guarantee that SIGNAL_GROUP_EXIT is set on process exit signal: Ensure SIGNAL_GROUP_EXIT gets set in do_group_exit
2022-09-26mm: delete unused MMF_OOM_VICTIM flagSuren Baghdasaryan
With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is now unused and can be removed. [akpm@linux-foundation.org: remove comment about now-removed mm_is_oom_victim()] Link: https://lkml.kernel.org/r/20220531223100.510392-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11kernel: exit: cleanup release_thread()Kefeng Wang
Only x86 has own release_thread(), introduce a new weak release_thread() function to clean empty definitions in other ARCHs. Link: https://lkml.kernel.org/r/20220819014406.32266-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Brian Cain <bcain@quicinc.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Stafford Horne <shorne@gmail.com> [openrisc] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Huacai Chen <chenhuacai@kernel.org> [LoongArch] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> [csky] Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11memory tiering: rate limit NUMA migration throughputHuang Ying
In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patch as the page promotion rate limit mechanism. The number of the candidate pages to be promoted to the fast memory node via NUMA balancing is counted, if the count exceeds the limit specified by the users, the NUMA balancing promotion will be stopped until the next second. A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added for the users to specify the limit. Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-05Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-08-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "Quite a large pull request due to a selftest API overhaul and some patches that had come in too late for 5.19. ARM: - Unwinder implementations for both nVHE modes (classic and protected), complete with an overflow stack - Rework of the sysreg access from userspace, with a complete rewrite of the vgic-v3 view to allign with the rest of the infrastructure - Disagregation of the vcpu flags in separate sets to better track their use model. - A fix for the GICv2-on-v3 selftest - A small set of cosmetic fixes RISC-V: - Track ISA extensions used by Guest using bitmap - Added system instruction emulation framework - Added CSR emulation framework - Added gfp_custom flag in struct kvm_mmu_memory_cache - Added G-stage ioremap() and iounmap() functions - Added support for Svpbmt inside Guest s390: - add an interface to provide a hypervisor dump for secure guests - improve selftests to use TAP interface - enable interpretive execution of zPCI instructions (for PCI passthrough) - First part of deferred teardown - CPU Topology - PV attestation - Minor fixes x86: - Permit guests to ignore single-bit ECC errors - Intel IPI virtualization - Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS - PEBS virtualization - Simplify PMU emulation by just using PERF_TYPE_RAW events - More accurate event reinjection on SVM (avoid retrying instructions) - Allow getting/setting the state of the speaker port data bit - Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent - "Notify" VM exit (detect microarchitectural hangs) for Intel - Use try_cmpxchg64 instead of cmpxchg64 - Ignore benign host accesses to PMU MSRs when PMU is disabled - Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior - Allow NX huge page mitigation to be disabled on a per-vm basis - Port eager page splitting to shadow MMU as well - Enable CMCI capability by default and handle injected UCNA errors - Expose pid of vcpu threads in debugfs - x2AVIC support for AMD - cleanup PIO emulation - Fixes for LLDT/LTR emulation - Don't require refcounted "struct page" to create huge SPTEs - Miscellaneous cleanups: - MCE MSR emulation - Use separate namespaces for guest PTEs and shadow PTEs bitmasks - PIO emulation - Reorganize rmap API, mostly around rmap destruction - Do not workaround very old KVM bugs for L0 that runs with nesting enabled - new selftests API for CPUID Generic: - Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache - new selftests API using struct kvm_vcpu instead of a (vm, id) tuple" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (606 commits) selftests: kvm: set rax before vmcall selftests: KVM: Add exponent check for boolean stats selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test selftests: KVM: Check stat name before other fields KVM: x86/mmu: remove unused variable RISC-V: KVM: Add support for Svpbmt inside Guest/VM RISC-V: KVM: Use PAGE_KERNEL_IO in kvm_riscv_gstage_ioremap() RISC-V: KVM: Add G-stage ioremap() and iounmap() functions KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache RISC-V: KVM: Add extensible CSR emulation framework RISC-V: KVM: Add extensible system instruction emulation framework RISC-V: KVM: Factor-out instruction emulation into separate sources RISC-V: KVM: move preempt_disable() call in kvm_arch_vcpu_ioctl_run RISC-V: KVM: Make kvm_riscv_guest_timer_init a void function RISC-V: KVM: Fix variable spelling mistake RISC-V: KVM: Improve ISA extension by using a bitmap KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs() KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT KVM: x86: Do not block APIC write for non ICR registers ...
2022-08-01Merge tag 'sched-core-2022-08-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Load-balancing improvements: - Improve NUMA balancing on AMD Zen systems for affine workloads. - Improve the handling of reduced-capacity CPUs in load-balancing. - Energy Model improvements: fix & refine all the energy fairness metrics (PELT), and remove the conservative threshold requiring 6% energy savings to migrate a task. Doing this improves power efficiency for most workloads, and also increases the reliability of energy-efficiency scheduling. - Optimize/tweak select_idle_cpu() to spend (much) less time searching for an idle CPU on overloaded systems. There's reports of several milliseconds spent there on large systems with large workloads ... [ Since the search logic changed, there might be behavioral side effects. ] - Improve NUMA imbalance behavior. On certain systems with spare capacity, initial placement of tasks is non-deterministic, and such an artificial placement imbalance can persist for a long time, hurting (and sometimes helping) performance. The fix is to make fork-time task placement consistent with runtime NUMA balancing placement. Note that some performance regressions were reported against this, caused by workloads that are not memory bandwith limited, which benefit from the artificial locality of the placement bug(s). Mel Gorman's conclusion, with which we concur, was that consistency is better than random workload benefits from non-deterministic bugs: "Given there is no crystal ball and it's a tradeoff, I think it's better to be consistent and use similar logic at both fork time and runtime even if it doesn't have universal benefit." - Improve core scheduling by fixing a bug in sched_core_update_cookie() that caused unnecessary forced idling. - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly woken tasks. - Fix a newidle balancing bug that introduced unnecessary wakeup latencies. ABI improvements/fixes: - Do not check capabilities and do not issue capability check denial messages when a scheduler syscall doesn't require privileges. (Such as increasing niceness.) - Add forced-idle accounting to cgroups too. - Fix/improve the RSEQ ABI to not just silently accept unknown flags. (No existing tooling is known to have learned to rely on the previous behavior.) - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags. Optimizations: - Optimize & simplify leaf_cfs_rq_list() - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg(). Misc fixes & cleanups: - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems. - Fix a full-NOHZ bug that can in some cases result in the tick not being re-enabled when the last SCHED_RT task is gone from a runqueue but there's still SCHED_OTHER tasks around. - Various PREEMPT_RT related fixes. - Misc cleanups & smaller fixes" * tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) rseq: Kill process when unknown flags are encountered in ABI structures rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags sched/core: Fix the bug that task won't enqueue into core tree when update cookie nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt() sched/core: Always flush pending blk_plug sched/fair: fix case with reduced capacity CPU sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling sched/core: add forced idle accounting for cgroups sched/fair: Remove the energy margin in feec() sched/fair: Remove task_util from effective utilization in feec() sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() sched/fair: Rename select_idle_mask to select_rq_mask sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() sched/fair: Decay task PELT values during wakeup migration sched/fair: Provide u64 read for 32-bits arch helper sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg sched: only perform capability check on privileged operation sched: Remove unused function group_first_cpu() sched/fair: Remove redundant word " *" selftests/rseq: check if libc rseq support is registered ...
2022-08-01Merge remote-tracking branch 'kvm/next' into kvm-next-5.20Paolo Bonzini
KVM/s390, KVM/x86 and common infrastructure changes for 5.20 x86: * Permit guests to ignore single-bit ECC errors * Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit (detect microarchitectural hangs) for Intel * Cleanups for MCE MSR emulation s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to use TAP interface * enable interpretive execution of zPCI instructions (for PCI passthrough) * First part of deferred teardown * CPU Topology * PV attestation * Minor fixes Generic: * new selftests API using struct kvm_vcpu instead of a (vm, id) tuple x86: * Use try_cmpxchg64 instead of cmpxchg64 * Bugfixes * Ignore benign host accesses to PMU MSRs when PMU is disabled * Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior * x86/MMU: Allow NX huge pages to be disabled on a per-vm basis * Port eager page splitting to shadow MMU as well * Enable CMCI capability by default and handle injected UCNA errors * Expose pid of vcpu threads in debugfs * x2AVIC support for AMD * cleanup PIO emulation * Fixes for LLDT/LTR emulation * Don't require refcounted "struct page" to create huge SPTEs x86 cleanups: * Use separate namespaces for guest PTEs and shadow PTEs bitmasks * PIO emulation * Reorganize rmap API, mostly around rmap destruction * Do not workaround very old KVM bugs for L0 that runs with nesting enabled * new selftests API for CPUID
2022-07-20signal: Guarantee that SIGNAL_GROUP_EXIT is set on process exitEric W. Biederman
Track how many threads have not started exiting and when the last thread starts exiting set SIGNAL_GROUP_EXIT. This guarantees that SIGNAL_GROUP_EXIT will get set when a process exits. In practice this achieves nothing as glibc's implementation of _exit calls sys_group_exit then sys_exit. While glibc's implemenation of pthread_exit calls exit (which cleansup and calls _exit) if it is the last thread and sys_exit if it is the last thread. This means the only way the kernel might observe a process that does not set call exit_group is if the language runtime does not use glibc. With more cleanups I hope to move the decrement of quick_threads earlier. Link: https://lkml.kernel.org/r/87bkukd4tc.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-07-13sched/core: Always flush pending blk_plugJohn Keeping
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two normal priority tasks (SCHED_OTHER, nice level zero): INFO: task kworker/u8:0:8 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 Workqueue: writeback wb_workfn (flush-7:0) [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174) [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>] +(rt_mutex_slowlock.constprop.0+0xac/0x174) [<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54) [<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec) [<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c) [<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8) [<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4) [<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4) [<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c) [<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8) [<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178) [<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38) Exception stack(0xc10e3fb0 to 0xc10e3ff8) 3fa0: 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 INFO: task tar:2083 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000 [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24) [<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30) [<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8) [<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0) [<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144) [<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4) [<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250) [<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148) [<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98) [<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec) [<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c) Exception stack(0xc2e1bfa8 to 0xc2e1bff0) bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000 bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190 bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc Here the kworker is waiting on msdos_sb_info::s_lock which is held by tar which is in turn waiting for a buffer which is locked waiting to be flushed, but this operation is plugged in the kworker. The lock is a normal struct mutex, so tsk_is_pi_blocked() will always return false on !RT and thus the behaviour changes for RT. It seems that the intent here is to skip blk_flush_plug() in the case where a non-preemptible lock (such as a spinlock) has been converted to a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT schedule flag. But sched_submit_work() is only called from schedule() which is never called in this scenario, so the check can simply be deleted. Looking at the history of the -rt patchset, in fact this change was present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b ("sched/core: Rework the __schedule() preempt argument"). As described in [1]: The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. and with the tsk_is_pi_blocked() in the scheduler path, we hit the first issue. [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com