From 7d8faaf155454f8798ec56404faca29a82689c77 Mon Sep 17 00:00:00 2001
From: Zach O'Keefe
Date: Wed, 6 Jul 2022 16:59:27 -0700
Subject: mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse

This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a synchronous collapse of memory at their own expense. The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the THP
* Avoid unpredictable timing of khugepaged collapse

Semantics

This call is independent of the system-wide THP sysfs settings, but will fail for memory marked VM_NOHUGEPAGE. If the ranges provided span multiple VMAs, the semantics of the collapse over each VMA are independent of the others. This implies a hugepage cannot cross a VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the operation may continue to attempt collapsing the remainder of the memory specified.

The memory ranges provided must be page-aligned, but are not required to be hugepage-aligned. If the memory ranges are not hugepage-aligned, the start/end of the range will be clamped to the first/last hugepage-aligned address covered by said range. The memory ranges must span at least one hugepage-sized region.

All non-resident pages covered by the range will first be swapped/faulted-in, before being internally copied onto a freshly allocated hugepage. Unmapped pages will have their data directly initialized to 0 in the new hugepage. However, for every eligible hugepage-aligned/sized region to be collapsed, at least one page must currently be backed by memory (a PMD covering the address range must already exist).

Allocation for the new hugepage may enter direct reclaim and/or compaction, regardless of VMA flags. When the system has multiple NUMA nodes, the hugepage will be allocated from the node providing the most native pages.

This operation acts on the current state of the specified process and makes no persistent changes or guarantees about how pages will be mapped, constructed, or faulted in the future.

Return Value

If all hugepage-sized/aligned regions covered by the provided range were either successfully collapsed or were already PMD-mapped THPs, this operation is deemed successful. On success, process_madvise(2) returns the number of bytes advised, and madvise(2) returns 0. Otherwise, -1 is returned and errno is set to indicate the error for the most recently attempted hugepage collapse. Note that many failures might have occurred, since the operation may continue to collapse in the event a single hugepage-sized/aligned region fails.

ENOMEM  Memory allocation failed or VMA not found
EBUSY   Memcg charging failed
EAGAIN  Required resource temporarily unavailable; trying again may succeed
EINVAL  Other error: no PMD found, subpage doesn't have the Present bit set, "special" page not backed by struct page, VMA incorrectly sized, address not page-aligned, ...

Most notable here are ENOMEM and EBUSY (new to madvise), which are intended to provide the caller with actionable feedback so they may take an appropriate fallback measure.

Use Cases

An immediate use of this new functionality is in malloc() implementations that manage memory in hugepage-sized chunks but sometimes subrelease memory back to the system in native-sized chunks via MADV_DONTNEED, zapping the PMD. Later, when the memory is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the memory with THPs and regain hugepage coverage and dTLB performance.
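As a usage illustration only (not part of the patch), the sketch below shows the malloc-style use case above from userspace. Assumptions: the MADV_COLLAPSE fallback value of 25 matches the definition added for most architectures in this series (parisc uses 73), and the 4MiB length and 2MiB hugepage size are illustrative for x86-64.

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25        /* value added for most architectures in this series */
  #endif

  int main(void)
  {
          /* 4MiB: large enough to cover at least one 2MiB hugepage-aligned region */
          const size_t len = 4UL << 20;
          void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (addr == MAP_FAILED)
                  return 1;

          /* Touch the range so at least one page per region is resident. */
          memset(addr, 1, len);

          /* Request a synchronous collapse of the range into THPs. */
          if (madvise(addr, len, MADV_COLLAPSE) != 0)
                  fprintf(stderr, "MADV_COLLAPSE: %s\n", strerror(errno));

          munmap(addr, len);
          return 0;
  }

A malloc() implementation would typically only issue this once it considers the range hot again, and could fall back to leaving the memory PTE-mapped on ENOMEM/EBUSY.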
TCMalloc is one such malloc() implementation that could benefit from this[2]. Only privately-mapped anon memory is supported for now, but additional support for file, shmem, and HugeTLB high-granularity mappings[2] is expected. File and tmpfs/shmem support would permit:

* Backing executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevent page sharing and demand paging, both of which increase steady-state memory footprint. With MADV_COLLAPSE, we get the best of both worlds: peak upfront performance and lower RAM footprints.

* Backing guest memory by hugepages after the memory contents have been migrated in native-page-sized chunks to a new host, in a userfaultfd-based live-migration stack.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc

[jrdr.linux@gmail.com: avoid possible memory leak in failure path]
Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
[zokeefe@google.com: add missing kfree() to madvise_collapse()]
Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
[zokeefe@google.com: delay computation of hpage boundaries until use]
Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
Signed-off-by: Zach O'Keefe Signed-off-by: "Souptick Joarder (HPE)" Suggested-by: David Rientjes Cc: Alex Shi Cc: Andrea Arcangeli Cc: Arnd Bergmann Cc: Axel Rasmussen Cc: Chris Kennelly Cc: Chris Zankel Cc: David Hildenbrand Cc: Helge Deller Cc: Hugh Dickins Cc: Ivan Kokshaysky Cc: James Bottomley Cc: Jens Axboe Cc: "Kirill A.
Shutemov" Cc: Matthew Wilcox Cc: Matt Turner Cc: Max Filippov Cc: Miaohe Lin Cc: Michal Hocko Cc: Minchan Kim Cc: Pasha Tatashin Cc: Pavel Begunkov Cc: Peter Xu Cc: Rongwei Wang Cc: SeongJae Park Cc: Song Liu Cc: Thomas Bogendoerfer Cc: Vlastimil Babka Cc: Yang Shi Cc: Zi Yan Cc: Dan Carpenter Signed-off-by: Andrew Morton --- arch/alpha/include/uapi/asm/mman.h | 2 ++ arch/mips/include/uapi/asm/mman.h | 2 ++ arch/parisc/include/uapi/asm/mman.h | 2 ++ arch/xtensa/include/uapi/asm/mman.h | 2 ++ 4 files changed, 8 insertions(+) (limited to 'arch') diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index 4aa996423b0d..763929e814e9 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -76,6 +76,8 @@ #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ +#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index 1be428663c10..c6e1fc77c996 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -103,6 +103,8 @@ #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ +#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index a7ea3204a5fa..22133a6a506e 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -70,6 +70,8 @@ #define MADV_WIPEONFORK 71 /* Zero memory on fork, child only */ #define MADV_KEEPONFORK 72 /* Undo MADV_WIPEONFORK */ +#define MADV_COLLAPSE 73 /* Synchronous hugepage collapse */ + #define MADV_HWPOISON 100 /* poison a page for testing */ #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */ diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index 7966a58af472..1ff0c858544f 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -111,6 +111,8 @@ #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ +#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */ + /* compatibility flags */ #define MAP_FILE 0 -- cgit v1.2.3-70-g09d2 From 0192445cb2f7ed1cd7a95a0fc8c7645480baba25 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Mon, 15 Aug 2022 10:39:59 -0400 Subject: arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER This Kconfig option is used by individual arch to set its desired MAX_ORDER. Rename it to reflect its actual use. Link: https://lkml.kernel.org/r/20220815143959.1511278-1-zi.yan@sent.com Acked-by: Mike Rapoport Signed-off-by: Zi Yan Acked-by: Guo Ren [csky] Acked-by: Arnd Bergmann Acked-by: Catalin Marinas [arm64] Acked-by: Huacai Chen [LoongArch] Acked-by: Michael Ellerman [powerpc] Cc: Vineet Gupta Cc: Taichi Sugaya Cc: Neil Armstrong Cc: Qin Jian Cc: Guo Ren Cc: Geert Uytterhoeven Cc: Thomas Bogendoerfer Cc: Dinh Nguyen Cc: Christophe Leroy Cc: Yoshinori Sato Cc: "David S. 
Miller" Cc: Chris Zankel Cc: Ley Foon Tan Signed-off-by: Andrew Morton --- arch/arc/Kconfig | 2 +- arch/arm/Kconfig | 2 +- arch/arm/configs/imx_v6_v7_defconfig | 2 +- arch/arm/configs/milbeaut_m10v_defconfig | 2 +- arch/arm/configs/oxnas_v6_defconfig | 2 +- arch/arm/configs/pxa_defconfig | 2 +- arch/arm/configs/sama7_defconfig | 2 +- arch/arm/configs/sp7021_defconfig | 2 +- arch/arm64/Kconfig | 2 +- arch/csky/Kconfig | 2 +- arch/ia64/Kconfig | 2 +- arch/ia64/include/asm/sparsemem.h | 6 +++--- arch/loongarch/Kconfig | 2 +- arch/m68k/Kconfig.cpu | 2 +- arch/mips/Kconfig | 2 +- arch/nios2/Kconfig | 2 +- arch/powerpc/Kconfig | 2 +- arch/powerpc/configs/85xx/ge_imp3a_defconfig | 2 +- arch/powerpc/configs/fsl-emb-nonhw.config | 2 +- arch/sh/configs/ecovec24_defconfig | 2 +- arch/sh/mm/Kconfig | 2 +- arch/sparc/Kconfig | 2 +- arch/xtensa/Kconfig | 2 +- include/linux/mmzone.h | 4 ++-- 24 files changed, 27 insertions(+), 27 deletions(-) (limited to 'arch') diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 9e3653253ef2..d9a13ccf89a3 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -554,7 +554,7 @@ config ARC_BUILTIN_DTB_NAME endmenu # "ARC Architecture Configuration" -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" default "12" if ARC_HUGEPAGE_16M default "11" diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 87badeae3181..e6c8ee56ac52 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1434,7 +1434,7 @@ config ARM_MODULE_PLTS Disabling this is usually safe for small single-platform configurations. If unsure, say y. -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" default "12" if SOC_AM33XX default "9" if SA1111 diff --git a/arch/arm/configs/imx_v6_v7_defconfig b/arch/arm/configs/imx_v6_v7_defconfig index 01012537a9b9..fb283059daa0 100644 --- a/arch/arm/configs/imx_v6_v7_defconfig +++ b/arch/arm/configs/imx_v6_v7_defconfig @@ -31,7 +31,7 @@ CONFIG_SOC_VF610=y CONFIG_SMP=y CONFIG_ARM_PSCI=y CONFIG_HIGHMEM=y -CONFIG_FORCE_MAX_ZONEORDER=14 +CONFIG_ARCH_FORCE_MAX_ORDER=14 CONFIG_CMDLINE="noinitrd console=ttymxc0,115200" CONFIG_KEXEC=y CONFIG_CPU_FREQ=y diff --git a/arch/arm/configs/milbeaut_m10v_defconfig b/arch/arm/configs/milbeaut_m10v_defconfig index 58810e98de3d..8620061e19a8 100644 --- a/arch/arm/configs/milbeaut_m10v_defconfig +++ b/arch/arm/configs/milbeaut_m10v_defconfig @@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y # CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set # CONFIG_ARM_PATCH_IDIV is not set CONFIG_HIGHMEM=y -CONFIG_FORCE_MAX_ZONEORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=12 CONFIG_SECCOMP=y CONFIG_KEXEC=y CONFIG_EFI=y diff --git a/arch/arm/configs/oxnas_v6_defconfig b/arch/arm/configs/oxnas_v6_defconfig index 600f78b363dd..5c163a9d1429 100644 --- a/arch/arm/configs/oxnas_v6_defconfig +++ b/arch/arm/configs/oxnas_v6_defconfig @@ -12,7 +12,7 @@ CONFIG_ARCH_OXNAS=y CONFIG_MACH_OX820=y CONFIG_SMP=y CONFIG_NR_CPUS=16 -CONFIG_FORCE_MAX_ZONEORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=12 CONFIG_SECCOMP=y CONFIG_ARM_APPENDED_DTB=y CONFIG_ARM_ATAG_DTB_COMPAT=y diff --git a/arch/arm/configs/pxa_defconfig b/arch/arm/configs/pxa_defconfig index 104a45722799..ce3f4ed50498 100644 --- a/arch/arm/configs/pxa_defconfig +++ b/arch/arm/configs/pxa_defconfig @@ -21,7 +21,7 @@ CONFIG_MACH_AKITA=y CONFIG_MACH_BORZOI=y CONFIG_PXA_SYSTEMS_CPLDS=y CONFIG_AEABI=y -CONFIG_FORCE_MAX_ZONEORDER=9 +CONFIG_ARCH_FORCE_MAX_ORDER=9 CONFIG_CMDLINE="root=/dev/ram0 ro" CONFIG_KEXEC=y CONFIG_CPU_FREQ=y diff --git a/arch/arm/configs/sama7_defconfig 
b/arch/arm/configs/sama7_defconfig index 0384030d8b25..8b2cf6ddd568 100644 --- a/arch/arm/configs/sama7_defconfig +++ b/arch/arm/configs/sama7_defconfig @@ -19,7 +19,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y # CONFIG_CACHE_L2X0 is not set # CONFIG_ARM_PATCH_IDIV is not set # CONFIG_CPU_SW_DOMAIN_PAN is not set -CONFIG_FORCE_MAX_ZONEORDER=15 +CONFIG_ARCH_FORCE_MAX_ORDER=15 CONFIG_UACCESS_WITH_MEMCPY=y # CONFIG_ATAGS is not set CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel" diff --git a/arch/arm/configs/sp7021_defconfig b/arch/arm/configs/sp7021_defconfig index 703b9aaa40f0..151ca8c47373 100644 --- a/arch/arm/configs/sp7021_defconfig +++ b/arch/arm/configs/sp7021_defconfig @@ -18,7 +18,7 @@ CONFIG_ARCH_SUNPLUS=y # CONFIG_VDSO is not set CONFIG_SMP=y CONFIG_THUMB2_KERNEL=y -CONFIG_FORCE_MAX_ZONEORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=12 CONFIG_VFP=y CONFIG_NEON=y CONFIG_MODULES=y diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 9fb9fff08c94..c5c7d812704c 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1418,7 +1418,7 @@ config XEN help Say Y if you want to run Linux in a Virtual Machine on Xen on ARM64. -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int default "14" if ARM64_64K_PAGES default "12" if ARM64_16K_PAGES diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 3cbc2dc62baf..adee6ab36862 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -332,7 +332,7 @@ config HIGHMEM select KMAP_LOCAL default y -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" default "11" diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index 26ac8ea15a9e..c6e06cdc738f 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -200,7 +200,7 @@ config IA64_CYCLONE Say Y here to enable support for IBM EXA Cyclone time source. If you're unsure, answer N. -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "MAX_ORDER (11 - 17)" if !HUGETLB_PAGE range 11 17 if !HUGETLB_PAGE default "17" if HUGETLB_PAGE diff --git a/arch/ia64/include/asm/sparsemem.h b/arch/ia64/include/asm/sparsemem.h index 42ed5248fae9..84e8ce387b69 100644 --- a/arch/ia64/include/asm/sparsemem.h +++ b/arch/ia64/include/asm/sparsemem.h @@ -11,10 +11,10 @@ #define SECTION_SIZE_BITS (30) #define MAX_PHYSMEM_BITS (50) -#ifdef CONFIG_FORCE_MAX_ZONEORDER -#if ((CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS) +#ifdef CONFIG_ARCH_FORCE_MAX_ORDER +#if ((CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS) #undef SECTION_SIZE_BITS -#define SECTION_SIZE_BITS (CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT) +#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) #endif #endif diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 26aeb1408e56..3c7a5a54b808 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -370,7 +370,7 @@ config NODES_SHIFT default "6" depends on NUMA -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" range 14 64 if PAGE_SIZE_64KB default "14" if PAGE_SIZE_64KB diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu index e0e9e31339c1..3b2f39508524 100644 --- a/arch/m68k/Kconfig.cpu +++ b/arch/m68k/Kconfig.cpu @@ -399,7 +399,7 @@ config SINGLE_MEMORY_CHUNK order" to save memory that could be wasted for unused memory map. Say N if not sure. 
-config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" if ADVANCED depends on !SINGLE_MEMORY_CHUNK default "11" diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index ec21f8999249..70d28976a40d 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -2140,7 +2140,7 @@ config PAGE_SIZE_64KB endchoice -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" range 14 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB default "14" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig index 4167f1eb4cd8..a582f72104f3 100644 --- a/arch/nios2/Kconfig +++ b/arch/nios2/Kconfig @@ -44,7 +44,7 @@ menu "Kernel features" source "kernel/Kconfig.hz" -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" range 9 20 default "11" diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 4c466acdc70d..39d71d7701bd 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -845,7 +845,7 @@ config DATA_SHIFT in that case. If PIN_TLB is selected, it must be aligned to 8M as 8M pages will be pinned. -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" range 8 9 if PPC64 && PPC_64K_PAGES default "9" if PPC64 && PPC_64K_PAGES diff --git a/arch/powerpc/configs/85xx/ge_imp3a_defconfig b/arch/powerpc/configs/85xx/ge_imp3a_defconfig index f29c166998af..e7672c186325 100644 --- a/arch/powerpc/configs/85xx/ge_imp3a_defconfig +++ b/arch/powerpc/configs/85xx/ge_imp3a_defconfig @@ -30,7 +30,7 @@ CONFIG_PREEMPT=y # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set CONFIG_BINFMT_MISC=m CONFIG_MATH_EMULATION=y -CONFIG_FORCE_MAX_ZONEORDER=17 +CONFIG_ARCH_FORCE_MAX_ORDER=17 CONFIG_PCI=y CONFIG_PCIEPORTBUS=y CONFIG_PCI_MSI=y diff --git a/arch/powerpc/configs/fsl-emb-nonhw.config b/arch/powerpc/configs/fsl-emb-nonhw.config index f14c6dbd7346..ab8a8c4530d9 100644 --- a/arch/powerpc/configs/fsl-emb-nonhw.config +++ b/arch/powerpc/configs/fsl-emb-nonhw.config @@ -41,7 +41,7 @@ CONFIG_FIXED_PHY=y CONFIG_FONT_8x16=y CONFIG_FONT_8x8=y CONFIG_FONTS=y -CONFIG_FORCE_MAX_ZONEORDER=13 +CONFIG_ARCH_FORCE_MAX_ORDER=13 CONFIG_FRAMEBUFFER_CONSOLE=y CONFIG_FRAME_WARN=1024 CONFIG_FTL=y diff --git a/arch/sh/configs/ecovec24_defconfig b/arch/sh/configs/ecovec24_defconfig index e699e2e04128..b52e14ccb450 100644 --- a/arch/sh/configs/ecovec24_defconfig +++ b/arch/sh/configs/ecovec24_defconfig @@ -8,7 +8,7 @@ CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_BLK_DEV_BSG is not set CONFIG_CPU_SUBTYPE_SH7724=y -CONFIG_FORCE_MAX_ZONEORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=12 CONFIG_MEMORY_SIZE=0x10000000 CONFIG_FLATMEM_MANUAL=y CONFIG_SH_ECOVEC=y diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig index ba569cfb4368..411fdc0901f7 100644 --- a/arch/sh/mm/Kconfig +++ b/arch/sh/mm/Kconfig @@ -18,7 +18,7 @@ config PAGE_OFFSET default "0x80000000" if MMU default "0x00000000" -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" range 9 64 if PAGE_SIZE_16KB default "9" if PAGE_SIZE_16KB diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 1c852bb530ec..4d3d1af90d52 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -269,7 +269,7 @@ config ARCH_SPARSEMEM_ENABLE config ARCH_SPARSEMEM_DEFAULT def_bool y if SPARC64 -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" default "13" help diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index 12ac277282ba..bcb0c5d2abc2 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -771,7 
+771,7 @@ config HIGHMEM If unsure, say Y. -config FORCE_MAX_ZONEORDER +config ARCH_FORCE_MAX_ORDER int "Maximum zone order" default "11" help diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 025754b0bc09..fd61347b4b1f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -24,10 +24,10 @@ #include /* Free memory management - zoned buddy allocator. */ -#ifndef CONFIG_FORCE_MAX_ZONEORDER +#ifndef CONFIG_ARCH_FORCE_MAX_ORDER #define MAX_ORDER 11 #else -#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER +#define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER #endif #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1)) -- cgit v1.2.3-70-g09d2 From 1a6baaa0db733c0dca2e170ca1df2b09834f47ce Mon Sep 17 00:00:00 2001 From: Gerald Schaefer Date: Sun, 11 Sep 2022 20:26:01 -0700 Subject: s390/hugetlb: switch to generic version of follow_huge_pud() When pud-sized hugepages were introduced for s390, the generic version of follow_huge_pud() was using pte_page() instead of pud_page(). This would be wrong for s390, see also commit 97534127012f ("mm/hugetlb: use pmd_page() in follow_huge_pmd()"). Therefore, and probably because not all archs were supporting pud_page() at that time, a private version of follow_huge_pud() was added for s390, correctly using pud_page(). Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry"), the generic version of follow_huge_pud() is now also using pud_page(), and in general behaves similar to follow_huge_pmd(). Therefore we can now switch to the generic version and get rid of the s390-specific follow_huge_pud(). Link: https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad Signed-off-by: Gerald Schaefer Cc: Haiyue Wang Cc: Miaohe Lin Cc: "Huang, Ying" Cc: Baolin Wang Cc: David Hildenbrand Cc: Muchun Song Signed-off-by: Andrew Morton --- arch/s390/mm/hugetlbpage.c | 10 ---------- 1 file changed, 10 deletions(-) (limited to 'arch') diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c index 10e51ef9c79a..c299a18273ff 100644 --- a/arch/s390/mm/hugetlbpage.c +++ b/arch/s390/mm/hugetlbpage.c @@ -237,16 +237,6 @@ int pud_huge(pud_t pud) return pud_large(pud); } -struct page * -follow_huge_pud(struct mm_struct *mm, unsigned long address, - pud_t *pud, int flags) -{ - if (flags & FOLL_GET) - return NULL; - - return pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT); -} - bool __init arch_hugetlb_valid_size(unsigned long size) { if (MACHINE_HAS_EDAT1 && size == PMD_SIZE) -- cgit v1.2.3-70-g09d2 From 9c61d5321e94a4d0678b5eb0515afc590bdb9740 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Thu, 11 Aug 2022 12:13:25 -0400 Subject: mm/x86: use SWP_TYPE_BITS in 3-level swap macros Patch series "mm: Remember a/d bits for migration entries", v4. Problem ======= When migrating a page, right now we always mark the migrated page as old & clean. However that could lead to at least two problems: (1) We lost the real hot/cold information while we could have persisted. That information shouldn't change even if the backing page is changed after the migration, (2) There can be always extra overhead on the immediate next access to any migrated page, because hardware MMU needs cycles to set the young bit again for reads, and dirty bits for write, as long as the hardware MMU supports these bits. Many of the recent upstream works showed that (2) is not something trivial and actually very measurable. 
In my test case, reading a 1G chunk of memory - jumping in page-size intervals - could take 99ms just because of the extra work of setting the young bit on a generic x86_64 system, compared to 4ms if the young bit is already set. This issue was originally reported by Andrea Arcangeli.

Solution
========

To solve this problem, this patchset tries to remember the young/dirty bits in the migration entries and carry them over when recovering the ptes. We have the chance to do so because in many systems the swap offset is not really fully used. Migration entries use the swp offset to store the PFN only, and the PFN is normally smaller than what the swp offset can hold. That means we have some free bits in the swp offset that we can use to store things like the A/D bits, and that is how this series approaches the problem.

max_swapfile_size() is used here to detect the per-arch offset length in swp entries. We'll automatically remember the A/D bits when we find that the swp offset field is large enough to keep both the PFN and the extra bits. Since max_swapfile_size() can be slow, the last two patches cache its result, and swap_migration_ad_supported as a whole.

Known Issues / TODOs
====================

We still haven't taught madvise() to recognize the new A/D bits in migration entries, namely MADV_COLD/MADV_FREE, e.g. when MADV_COLD is applied to a migration entry. It's not clear yet whether we should clear the A bit, or just drop the entry directly.

We didn't teach idle page tracking about the new migration entries, because that would need a larger rework of the rmap pgtable walk. However, things should already be better, because before this patchset a page would always be old after migration, so the series fixes a potential false negative in idle page tracking when pages are migrated before being observed.

The other thing is that the migration A/D bits will not start working for private device swap entries. The code is there for completeness, but since private device swap entries do not yet have fields to store A/D bits, even if we persist A/D bits across a present pte switching to a migration entry, we lose them again when the migration entry is converted to a private device swap entry.

Tests
=====

With the patchset applied, the immediate read-access test [1] of the above 1G chunk after migration shrinks from 99ms to 4ms. The test is done by moving 1G of pages from node 0->1->0, then reading it in page-size jumps, on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. A similar effect can also be measured when writing the memory for the first time after migration. After applying the patchset, both the initial immediate read and write after a page is migrated perform similarly to before the migration happened.

Patch Layout
============

Patch 1-2: Cleanups, either from previous versions or on swapops.h macros.
Patch 3-4: Prepare for the introduction of migration A/D bits.
Patch 5: The core patch to remember the young/dirty bits in swap offsets.
Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast.

[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c

This patch (of 7):

Replace all the magic "5" with the macro.

Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com
Signed-off-by: Peter Xu Reviewed-by: David Hildenbrand Reviewed-by: Huang Ying Cc: Hugh Dickins Cc: "Kirill A.
Shutemov" Cc: Alistair Popple Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Andi Kleen Cc: Nadav Amit Cc: Vlastimil Babka Cc: Dave Hansen Signed-off-by: Andrew Morton --- arch/x86/include/asm/pgtable-3level.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h index e896ebef8c24..28421a887209 100644 --- a/arch/x86/include/asm/pgtable-3level.h +++ b/arch/x86/include/asm/pgtable-3level.h @@ -256,10 +256,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp) /* We always extract/encode the offset by shifting it all the way up, and then down again */ #define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS) -#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5) -#define __swp_type(x) (((x).val) & 0x1f) -#define __swp_offset(x) ((x).val >> 5) -#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5}) +#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS) +#define __swp_type(x) (((x).val) & ((1UL << SWP_TYPE_BITS) - 1)) +#define __swp_offset(x) ((x).val >> SWP_TYPE_BITS) +#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS}) /* * Normally, __swp_entry() converts from arch-independent swp_entry_t to -- cgit v1.2.3-70-g09d2 From 0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Thu, 11 Aug 2022 12:13:27 -0400 Subject: mm/swap: add swp_offset_pfn() to fetch PFN from swap entry We've got a bunch of special swap entries that stores PFN inside the swap offset fields. To fetch the PFN, normally the user just calls swp_offset() assuming that'll be the PFN. Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the max possible length of a PFN on the host, meanwhile doing proper check with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry(). One reason to do so is we never tried to sanitize whether swap offset can really fit for storing PFN. At the meantime, this patch also prepares us with the future possibility to store more information inside the swp offset field, so assuming "swp_offset(entry)" to be the PFN will not stand any more very soon. Replace many of the swp_offset() callers to use swp_offset_pfn() where proper. Note that many of the existing users are not candidates for the replacement, e.g.: (1) When the swap entry is not a pfn swap entry at all, or, (2) when we wanna keep the whole swp_offset but only change the swp type. For the latter, it can happen when fork() triggered on a write-migration swap entry pte, we may want to only change the migration type from write->read but keep the rest, so it's not "fetching PFN" but "changing swap type only". They're left aside so that when there're more information within the swp offset they'll be carried over naturally in those cases. Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what the new swp_offset_pfn() is about. Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: "Huang, Ying" Cc: Alistair Popple Cc: Andi Kleen Cc: Andrea Arcangeli Cc: David Hildenbrand Cc: Hugh Dickins Cc: "Kirill A . 
Shutemov" Cc: Minchan Kim Cc: Nadav Amit Cc: Vlastimil Babka Cc: Dave Hansen Signed-off-by: Andrew Morton --- arch/arm64/mm/hugetlbpage.c | 2 +- fs/proc/task_mmu.c | 20 +++++++++++++++++--- include/linux/swapops.h | 35 +++++++++++++++++++++++++++++------ mm/hmm.c | 2 +- mm/memory-failure.c | 2 +- mm/page_vma_mapped.c | 6 +++--- 6 files changed, 52 insertions(+), 15 deletions(-) (limited to 'arch') diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 0795028f017c..35e9a468d13e 100644 --- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -245,7 +245,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) { VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); - return page_folio(pfn_to_page(swp_offset(entry))); + return page_folio(pfn_to_page(swp_offset_pfn(entry))); } void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 482f91577f8c..db2f3a2946a0 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1418,9 +1418,19 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, if (pte_swp_uffd_wp(pte)) flags |= PM_UFFD_WP; entry = pte_to_swp_entry(pte); - if (pm->show_pfn) + if (pm->show_pfn) { + pgoff_t offset; + /* + * For PFN swap offsets, keeping the offset field + * to be PFN only to be compatible with old smaps. + */ + if (is_pfn_swap_entry(entry)) + offset = swp_offset_pfn(entry); + else + offset = swp_offset(entry); frame = swp_type(entry) | - (swp_offset(entry) << MAX_SWAPFILES_SHIFT); + (offset << MAX_SWAPFILES_SHIFT); + } flags |= PM_SWAP; migration = is_migration_entry(entry); if (is_pfn_swap_entry(entry)) @@ -1477,7 +1487,11 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, unsigned long offset; if (pm->show_pfn) { - offset = swp_offset(entry) + + if (is_pfn_swap_entry(entry)) + offset = swp_offset_pfn(entry); + else + offset = swp_offset(entry); + offset = offset + ((addr & ~PMD_MASK) >> PAGE_SHIFT); frame = swp_type(entry) | (offset << MAX_SWAPFILES_SHIFT); diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 7d1b74046520..578212fbf2be 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -23,6 +23,20 @@ #define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT) #define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1) +/* + * Definitions only for PFN swap entries (see is_pfn_swap_entry()). To + * store PFN, we only need SWP_PFN_BITS bits. Each of the pfn swap entries + * can use the extra bits to store other information besides PFN. + */ +#ifdef MAX_PHYSMEM_BITS +#define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) +#else /* MAX_PHYSMEM_BITS */ +#define SWP_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT) +#endif /* MAX_PHYSMEM_BITS */ +#define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1) + +static inline bool is_pfn_swap_entry(swp_entry_t entry); + /* Clear all flags but only keep swp_entry_t related information */ static inline pte_t pte_swp_clear_flags(pte_t pte) { @@ -64,6 +78,17 @@ static inline pgoff_t swp_offset(swp_entry_t entry) return entry.val & SWP_OFFSET_MASK; } +/* + * This should only be called upon a pfn swap entry to get the PFN stored + * in the swap entry. Please refers to is_pfn_swap_entry() for definition + * of pfn swap entry. 
+ */ +static inline unsigned long swp_offset_pfn(swp_entry_t entry) +{ + VM_BUG_ON(!is_pfn_swap_entry(entry)); + return swp_offset(entry) & SWP_PFN_MASK; +} + /* check whether a pte points to a swap entry */ static inline int is_swap_pte(pte_t pte) { @@ -369,7 +394,7 @@ static inline int pte_none_mostly(pte_t pte) static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) { - struct page *p = pfn_to_page(swp_offset(entry)); + struct page *p = pfn_to_page(swp_offset_pfn(entry)); /* * Any use of migration entries may only occur while the @@ -387,6 +412,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) */ static inline bool is_pfn_swap_entry(swp_entry_t entry) { + /* Make sure the swp offset can always store the needed fields */ + BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS); + return is_migration_entry(entry) || is_device_private_entry(entry) || is_device_exclusive_entry(entry); } @@ -475,11 +503,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry) return swp_type(entry) == SWP_HWPOISON; } -static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry) -{ - return swp_offset(entry); -} - static inline void num_poisoned_pages_inc(void) { atomic_long_inc(&num_poisoned_pages); diff --git a/mm/hmm.c b/mm/hmm.c index f2aa63b94d9b..3850fb625dda 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, cpu_flags = HMM_PFN_VALID; if (is_writable_device_private_entry(entry)) cpu_flags |= HMM_PFN_WRITE; - *hmm_pfn = swp_offset(entry) | cpu_flags; + *hmm_pfn = swp_offset_pfn(entry) | cpu_flags; return 0; } diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 59dd32b75348..e554f9f583ca 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -635,7 +635,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift, swp_entry_t swp = pte_to_swp_entry(pte); if (is_hwpoison_entry(swp)) - pfn = hwpoison_entry_to_pfn(swp); + pfn = swp_offset_pfn(swp); } if (!pfn || pfn != poisoned_pfn) diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 8e9e574d535a..93e13fc17d3c 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) !is_device_exclusive_entry(entry)) return false; - pfn = swp_offset(entry); + pfn = swp_offset_pfn(entry); } else if (is_swap_pte(*pvmw->pte)) { swp_entry_t entry; @@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) !is_device_exclusive_entry(entry)) return false; - pfn = swp_offset(entry); + pfn = swp_offset_pfn(entry); } else { if (!pte_present(*pvmw->pte)) return false; @@ -221,7 +221,7 @@ restart: return not_found(pvmw); entry = pmd_to_swp_entry(pmde); if (!is_migration_entry(entry) || - !check_pmd(swp_offset(entry), pvmw)) + !check_pmd(swp_offset_pfn(entry), pvmw)) return not_found(pvmw); return true; } -- cgit v1.2.3-70-g09d2 From be45a4902c7caa717fee6b2f671e59b396ed395c Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Thu, 11 Aug 2022 12:13:30 -0400 Subject: mm/swap: cache maximum swapfile size when init swap We used to have swapfile_maximum_size() fetching a maximum value of swapfile size per-arch. As the caller of max_swapfile_size() grows, this patch introduce a variable "swapfile_maximum_size" and cache the value of old max_swapfile_size(), so that we don't need to calculate the value every time. Caching the value in swapfile_init() is safe because when reaching the phase we should have initialized all the relevant information. 
Here the major arch to take care of is x86, which defines the max swapfile size based on L1TF mitigation. Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly when reaching swapfile_init(). As a reference, the code path looks like this for x86: - start_kernel - setup_arch - early_cpu_init - early_identify_cpu --> setup X86_BUG_L1TF - parse_early_param - l1tf_cmdline --> set l1tf_mitigation - check_bugs - l1tf_select_mitigation --> set l1tf_mitigation - arch_call_rest_init - rest_init - kernel_init - kernel_init_freeable - do_basic_setup - do_initcalls --> calls swapfile_init() (initcall level 4) The swapfile size only depends on swp pte format on non-x86 archs, so caching it is safe too. Since at it, rename max_swapfile_size() to arch_max_swapfile_size() because arch can define its own function, so it's more straightforward to have "arch_" as its prefix. At the meantime, export swapfile_maximum_size to replace the old usages of max_swapfile_size(). [peterx@redhat.com: declare arch_max_swapfile_size) in swapfile.h] Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: "Huang, Ying" Cc: Alistair Popple Cc: Andi Kleen Cc: Andrea Arcangeli Cc: David Hildenbrand Cc: Hugh Dickins Cc: "Kirill A . Shutemov" Cc: Minchan Kim Cc: Nadav Amit Cc: Vlastimil Babka Cc: Dave Hansen Signed-off-by: Andrew Morton --- arch/x86/mm/init.c | 2 +- include/linux/swapfile.h | 5 ++++- include/linux/swapops.h | 2 +- mm/swapfile.c | 7 +++++-- 4 files changed, 11 insertions(+), 5 deletions(-) (limited to 'arch') diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 82a042c03824..9121bc1b9453 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache) } #ifdef CONFIG_SWAP -unsigned long max_swapfile_size(void) +unsigned long arch_max_swapfile_size(void) { unsigned long pages; diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h index 54078542134c..e2d11ae4e73d 100644 --- a/include/linux/swapfile.h +++ b/include/linux/swapfile.h @@ -8,6 +8,9 @@ */ extern struct swap_info_struct *swap_info[]; extern unsigned long generic_max_swapfile_size(void); -extern unsigned long max_swapfile_size(void); +unsigned long arch_max_swapfile_size(void); + +/* Maximum swapfile size supported for the arch (not inclusive). */ +extern unsigned long swapfile_maximum_size; #endif /* _LINUX_SWAPFILE_H */ diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 11b874f212a2..027b4095e132 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -307,7 +307,7 @@ static inline bool migration_entry_supports_ad(void) * the offset large enough to cover all of them (PFN, A & D bits). */ #ifdef CONFIG_SWAP - return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS); + return swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS); #else /* CONFIG_SWAP */ return false; #endif /* CONFIG_SWAP */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 1fdccd2f1422..3cc64399df44 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages); /* protected with swap_lock. 
reading in vm_swap_full() doesn't need lock */ long total_swap_pages; static int least_priority = -1; +unsigned long swapfile_maximum_size; static const char Bad_file[] = "Bad swap file entry "; static const char Unused_file[] = "Unused swap file entry "; @@ -2816,7 +2817,7 @@ unsigned long generic_max_swapfile_size(void) } /* Can be overridden by an architecture for additional checks. */ -__weak unsigned long max_swapfile_size(void) +__weak unsigned long arch_max_swapfile_size(void) { return generic_max_swapfile_size(); } @@ -2856,7 +2857,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p, p->cluster_next = 1; p->cluster_nr = 0; - maxpages = max_swapfile_size(); + maxpages = swapfile_maximum_size; last_page = swap_header->info.last_page; if (!last_page) { pr_warn("Empty swap-file\n"); @@ -3677,6 +3678,8 @@ static int __init swapfile_init(void) for_each_node(nid) plist_head_init(&swap_avail_heads[nid]); + swapfile_maximum_size = arch_max_swapfile_size(); + return 0; } subsys_initcall(swapfile_init); -- cgit v1.2.3-70-g09d2 From e1fd09e3d1dd4a1a8b3b33bc1fd647eee9f4e475 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Sun, 18 Sep 2022 01:59:58 -0600 Subject: mm: x86, arm64: add arch_has_hw_pte_young() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "Multi-Gen LRU Framework", v14. What's new ========== 1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS, Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15. 2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs) machines. The old direct reclaim backoff, which tries to enforce a minimum fairness among all eligible memcgs, over-swapped by about (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which pulls the plug on swapping once the target is met, trades some fairness for curtailed latency: https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/ 3. Fixed minior build warnings and conflicts. More comments and nits. TLDR ==== The current page reclaim is too expensive in terms of CPU usage and it often makes poor choices about what to evict. This patchset offers an alternative solution that is performant, versatile and straightforward. Patchset overview ================= The design and implementation overview is in patch 14: https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/ 01. mm: x86, arm64: add arch_has_hw_pte_young() 02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Take advantage of hardware features when trying to clear the accessed bit in many PTEs. 03. mm/vmscan.c: refactor shrink_node() 04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Minor refactors to improve readability for the following patches. 05. mm: multi-gen LRU: groundwork Adds the basic data structure and the functions that insert pages to and remove pages from the multi-gen LRU (MGLRU) lists. 06. mm: multi-gen LRU: minimal implementation A minimal implementation without optimizations. 07. mm: multi-gen LRU: exploit locality in rmap Exploits spatial locality to improve efficiency when using the rmap. 08. mm: multi-gen LRU: support page table walks Further exploits spatial locality by optionally scanning page tables. 09. mm: multi-gen LRU: optimize multiple memcgs Optimizes the overall performance for multiple memcgs running mixed types of workloads. 10. mm: multi-gen LRU: kill switch Adds a kill switch to enable or disable MGLRU at runtime. 11. 
mm: multi-gen LRU: thrashing prevention 12. mm: multi-gen LRU: debugfs interface Provide userspace with features like thrashing prevention, working set estimation and proactive reclaim. 13. mm: multi-gen LRU: admin guide 14. mm: multi-gen LRU: design doc Add an admin guide and a design doc. Benchmark results ================= Independent lab results ----------------------- Based on the popularity of searches [01] and the memory usage in Google's public cloud, the most popular open-source memory-hungry applications, in alphabetical order, are: Apache Cassandra Memcached Apache Hadoop MongoDB Apache Spark PostgreSQL MariaDB (MySQL) Redis An independent lab evaluated MGLRU with the most widely used benchmark suites for the above applications. They posted 960 data points along with kernel metrics and perf profiles collected over more than 500 hours of total benchmark time. Their final reports show that, with 95% confidence intervals (CIs), the above applications all performed significantly better for at least part of their benchmark matrices. On 5.14: 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]% less wall time to sort three billion random integers, respectively, under the medium- and the high-concurrency conditions, when overcommitting memory. There were no statistically significant changes in wall time for the rest of the benchmark matrix. 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]% more transactions per minute (TPM), respectively, under the medium- and the high-concurrency conditions, when overcommitting memory. There were no statistically significant changes in TPM for the rest of the benchmark matrix. 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]% and [21.59, 30.02]% more operations per second (OPS), respectively, for sequential access, random access and Gaussian (distribution) access, when THP=always; 95% CIs [13.85, 15.97]% and [23.94, 29.92]% more OPS, respectively, for random access and Gaussian access, when THP=never. There were no statistically significant changes in OPS for the rest of the benchmark matrix. 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and [2.16, 3.55]% more operations per second (OPS), respectively, for exponential (distribution) access, random access and Zipfian (distribution) access, when underutilizing memory; 95% CIs [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS, respectively, for exponential access, random access and Zipfian access, when overcommitting memory. On 5.15: 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]% and [4.11, 7.50]% more operations per second (OPS), respectively, for exponential (distribution) access, random access and Zipfian (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%, [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for exponential access, random access and Zipfian access, when swap was on. 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]% less average wall time to finish twelve parallel TeraSort jobs, respectively, under the medium- and the high-concurrency conditions, when swap was on. There were no statistically significant changes in average wall time for the rest of the benchmark matrix. 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per minute (TPM) under the high-concurrency condition, when swap was off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM, respectively, under the medium- and the high-concurrency conditions, when swap was on. 
There were no statistically significant changes in TPM for the rest of the benchmark matrix. 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and [11.47, 19.36]% more total operations per second (OPS), respectively, for sequential access, random access and Gaussian (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%, [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively, for sequential access, random access and Gaussian access, when THP=never. Our lab results --------------- To supplement the above results, we ran the following benchmark suites on 5.16-rc7 and found no regressions [10]. fs_fio_bench_hdd_mq pft fs_lmbench pgsql-hammerdb fs_parallelio redis fs_postmark stream hackbench sysbenchthread kernbench tpcc_spark memcached unixbench multichase vm-scalability mutilate will-it-scale nginx [01] https://trends.google.com [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/ [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/ [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/ [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/ [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/ [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/ [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/ [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/ [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/ Read-world applications ======================= Third-party testimonials ------------------------ Konstantin reported [11]: I have Archlinux with 8G RAM + zswap + swap. While developing, I have lots of apps opened such as multiple LSP-servers for different langs, chats, two browsers, etc... Usually, my system gets quickly to a point of SWAP-storms, where I have to kill LSP-servers, restart browsers to free memory, etc, otherwise the system lags heavily and is barely usable. 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU patchset, and I started up by opening lots of apps to create memory pressure, and worked for a day like this. Till now I had not a single SWAP-storm, and mind you I got 3.4G in SWAP. I was never getting to the point of 3G in SWAP before without a single SWAP-storm. Vaibhav from IBM reported [12]: In a synthetic MongoDB Benchmark, seeing an average of ~19% throughput improvement on POWER10(Radix MMU + 64K Page Size) with MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across three different request distributions, namely, Exponential, Uniform and Zipfan. Shuang from U of Rochester reported [13]: With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]% and [9.26, 10.36]% higher throughput, respectively, for random access, Zipfian (distribution) access and Gaussian (distribution) access, when the average number of jobs per CPU is 1; 95% CIs [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput, respectively, for random access, Zipfian access and Gaussian access, when the average number of jobs per CPU is 2. Daniel from Michigan Tech reported [14]: With Memcached allocating ~100GB of byte-addressable Optante, performance improvement in terms of throughput (measured as queries per second) was about 10% for a series of workloads. Large-scale deployments ----------------------- We've rolled out MGLRU to tens of millions of ChromeOS users and about a million Android users. 
Google's fleetwide profiling [15] shows an overall 40% decrease in kswapd CPU usage, in addition to improvements in other UX metrics, e.g., an 85% decrease in the number of low-memory kills at the 75th percentile and an 18% decrease in app launch time at the 50th percentile. The downstream kernels that have been using MGLRU include: 1. Android [16] 2. Arch Linux Zen [17] 3. Armbian [18] 4. ChromeOS [19] 5. Liquorix [20] 6. OpenWrt [21] 7. post-factum [22] 8. XanMod [23] [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/ [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/ [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/ [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/ [15] https://dl.acm.org/doi/10.1145/2749469.2750392 [16] https://android.com [17] https://archlinux.org [18] https://armbian.com [19] https://chromium.org [20] https://liquorix.net [21] https://openwrt.org [22] https://codeberg.org/pf-kernel [23] https://xanmod.org Summary ======= The facts are: 1. The independent lab results and the real-world applications indicate substantial improvements; there are no known regressions. 2. Thrashing prevention, working set estimation and proactive reclaim work out of the box; there are no equivalent solutions. 3. There is a lot of new code; no smaller changes have been demonstrated similar effects. Our options, accordingly, are: 1. Given the amount of evidence, the reported improvements will likely materialize for a wide range of workloads. 2. Gauging the interest from the past discussions, the new features will likely be put to use for both personal computers and data centers. 3. Based on Google's track record, the new code will likely be well maintained in the long term. It'd be more difficult if not impossible to achieve similar effects with other approaches. This patch (of 14): Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that do not have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE (to emulate the accessed bit). Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to reduce bursty page faults when trying to clear the accessed bit in many PTEs. Note that theoretically this capability can be unreliable, e.g., hotplugged CPUs might be different from builtin ones. Therefore it should not be used in architecture-independent code that involves correctness, e.g., to determine whether TLB flushes are required (in combination with the accessed bit). 
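As an illustration only of the decision described above (not code from this series), a page-table walker might consult the new helper to decide how aggressively to clear accessed bits. The helper and header are the ones this patch touches; the function name and batch sizes below are made up:

  #include <linux/pgtable.h>

  /*
   * Illustrative sketch: on CPUs that set the accessed bit in hardware
   * (e.g. x86, arm64 v8.2+), clearing it in bulk is cheap; otherwise each
   * cleared PTE is repaid with a later minor fault, so age fewer PTEs per
   * pass and spread the work out over time.
   */
  static unsigned long aging_batch_nr_ptes(void)
  {
          return arch_has_hw_pte_young() ? 512 : 64;      /* made-up batch sizes */
  }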
Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com Signed-off-by: Yu Zhao Reviewed-by: Barry Song Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Acked-by: Will Deacon Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Cc: Andi Kleen Cc: Aneesh Kumar K.V Cc: Catalin Marinas Cc: Dave Hansen Cc: Hillf Danton Cc: Jens Axboe Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Linus Torvalds Cc: linux-arm-kernel@lists.infradead.org Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michael Larabel Cc: Michal Hocko Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Tejun Heo Cc: Vlastimil Babka Cc: Miaohe Lin Cc: Mike Rapoport Cc: Qi Zheng Signed-off-by: Andrew Morton --- arch/arm64/include/asm/pgtable.h | 15 ++------------- arch/x86/include/asm/pgtable.h | 6 +++--- include/linux/pgtable.h | 13 +++++++++++++ mm/memory.c | 14 +------------- 4 files changed, 19 insertions(+), 29 deletions(-) (limited to 'arch') diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index b5df82aa99e6..71a1af42f0e8 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1082,24 +1082,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, * page after fork() + CoW for pfn mappings. We don't always have a * hardware-managed access flag on arm64. */ -static inline bool arch_faults_on_old_pte(void) -{ - /* The register read below requires a stable CPU to make any sense */ - cant_migrate(); - - return !cpu_has_hw_af(); -} -#define arch_faults_on_old_pte arch_faults_on_old_pte +#define arch_has_hw_pte_young cpu_has_hw_af /* * Experimentally, it's cheap to set the access flag in hardware and we * benefit from prefaulting mappings as 'old' to start with. */ -static inline bool arch_wants_old_prefaulted_pte(void) -{ - return !arch_faults_on_old_pte(); -} -#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte +#define arch_wants_old_prefaulted_pte cpu_has_hw_af static inline bool pud_sect_supported(void) { diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 44e2d6f1dbaa..dc5f7d8ef68a 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1431,10 +1431,10 @@ static inline bool arch_has_pfn_modify_check(void) return boot_cpu_has_bug(X86_BUG_L1TF); } -#define arch_faults_on_old_pte arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) +#define arch_has_hw_pte_young arch_has_hw_pte_young +static inline bool arch_has_hw_pte_young(void) { - return false; + return true; } #ifdef CONFIG_PAGE_TABLE_CHECK diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index d13b4f7cc5be..375e8e7e64f4 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -260,6 +260,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif +#ifndef arch_has_hw_pte_young +/* + * Return whether the accessed bit is supported on the local CPU. + * + * This stub assumes accessing through an old PTE triggers a page fault. + * Architectures that automatically set the access bit should overwrite it. 
+ */ +static inline bool arch_has_hw_pte_young(void) +{ + return false; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index e38f9245470c..3a9b00c765c2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -126,18 +126,6 @@ int randomize_va_space __read_mostly = 2; #endif -#ifndef arch_faults_on_old_pte -static inline bool arch_faults_on_old_pte(void) -{ - /* - * Those arches which don't have hw access flag feature need to - * implement their own helper. By default, "true" means pagefault - * will be hit on old pte. - */ - return true; -} -#endif - #ifndef arch_wants_old_prefaulted_pte static inline bool arch_wants_old_prefaulted_pte(void) { @@ -2871,7 +2859,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src, * On architectures with software "accessed" bits, we would * take a double page fault, so mark it accessed here. */ - if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { pte_t entry; vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); -- cgit v1.2.3-70-g09d2 From eed9a328aa1ae6ac1edaa026957e6882f57de0dd Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Sun, 18 Sep 2022 01:59:59 -0600 Subject: mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this capability to reduce their search space. Note that: 1. Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros. 2. Due to the little interest in other varieties, this capability was only tested on Intel and AMD CPUs. Thanks to the following developers for their efforts [2][3]. 
Randy Dunlap Stephen Rothwell [1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8 [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/ [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/ Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.com Signed-off-by: Yu Zhao Reviewed-by: Barry Song Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Cc: Andi Kleen Cc: Aneesh Kumar K.V Cc: Catalin Marinas Cc: Dave Hansen Cc: Hillf Danton Cc: Jens Axboe Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Linus Torvalds Cc: Matthew Wilcox Cc: Mel Gorman Cc: Miaohe Lin Cc: Michael Larabel Cc: Michal Hocko Cc: Mike Rapoport Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Qi Zheng Cc: Tejun Heo Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/Kconfig | 8 ++++++++ arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 3 ++- arch/x86/mm/pgtable.c | 5 ++++- include/linux/pgtable.h | 4 ++-- 5 files changed, 17 insertions(+), 4 deletions(-) (limited to 'arch') diff --git a/arch/Kconfig b/arch/Kconfig index 5dbf11a5ba4e..1c2599618eeb 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1415,6 +1415,14 @@ config DYNAMIC_SIGFRAME config HAVE_ARCH_NODE_DEV_GROUP bool +config ARCH_HAS_NONLEAF_PMD_YOUNG + bool + help + Architectures that select this option are capable of setting the + accessed bit in non-leaf PMD entries when using them as part of linear + address translations. Page table walkers that clear the accessed bit + may use this capability to reduce their search space. 
+ source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f9920f1341c8..674d694a665e 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -85,6 +85,7 @@ config X86 select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL + select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2 select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_SET_MEMORY diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index dc5f7d8ef68a..5059799bebe3 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -815,7 +815,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd) static inline int pmd_bad(pmd_t pmd) { - return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; + return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) != + (_KERNPG_TABLE & ~_PAGE_ACCESSED); } static inline unsigned long pages_to_mb(unsigned long npg) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index a932d7712d85..8525f2876fb4 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma, return ret; } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) { @@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma, return ret; } +#endif + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE int pudp_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pud_t *pudp) { diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 375e8e7e64f4..a108b60a6962 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -213,7 +213,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) @@ -234,7 +234,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, BUILD_BUG(); return 0; } -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */ #endif #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH -- cgit v1.2.3-70-g09d2 From d4af56c5c7c6781ca6ca8075e2cf5bc119ed33d1 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 6 Sep 2022 19:48:45 +0000 Subject: mm: start tracking VMAs with maple tree Start tracking the VMAs with the new maple tree structure in parallel with the rb_tree. Add debug and trace events for maple tree operations and duplicate the rb_tree that is created on forks into the maple tree. The maple tree is added to the mm_struct including the mm_init struct, added support in required mm/mmap functions, added tracking in kernel/fork for process forking, and used to find the unmapped_area and checked against what the rbtree finds. This also moves the mmap_lock() in exit_mmap() since the oom reaper call does walk the VMAs. Otherwise lockdep will be unhappy if oom happens. When splitting a vma fails due to allocations of the maple tree nodes, the error path in __split_vma() calls new->vm_ops->close(new). 
The page accounting for hugetlb is actually in the close() operation, so it accounts for the removal of 1/2 of the VMA which was not adjusted. This results in a negative exit value. To avoid the negative charge, set vm_start = vm_end and vm_pgoff = 0. There is also a potential accounting issue in special mappings from insert_vm_struct() failing to allocate, so reverse the charge there in the failure scenario. Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Signed-off-by: Matthew Wilcox (Oracle) Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: Davidlohr Bueso Cc: SeongJae Park Cc: Sven Schnelle Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/kernel/tboot.c | 1 + drivers/firmware/efi/efi.c | 1 + include/linux/mm.h | 5 + include/linux/mm_types.h | 3 + include/trace/events/mmap.h | 73 +++++++++ kernel/fork.c | 20 ++- mm/init-mm.c | 2 + mm/mmap.c | 353 +++++++++++++++++++++++++++++++++++++++----- mm/nommu.c | 13 ++ 9 files changed, 435 insertions(+), 36 deletions(-) (limited to 'arch') diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c index 3bacd935f840..e01544202651 100644 --- a/arch/x86/kernel/tboot.c +++ b/arch/x86/kernel/tboot.c @@ -96,6 +96,7 @@ void __init tboot_probe(void) static pgd_t *tboot_pg_dir; static struct mm_struct tboot_mm = { .mm_rb = RB_ROOT, + .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, tboot_mm.mmap_lock), .pgd = swapper_pg_dir, .mm_users = ATOMIC_INIT(2), .mm_count = ATOMIC_INIT(1), diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index e4080ad96089..7b6a815b79d3 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -58,6 +58,7 @@ static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR; struct mm_struct efi_mm = { .mm_rb = RB_ROOT, + .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, efi_mm.mmap_lock), .mm_users = ATOMIC_INIT(2), .mm_count = ATOMIC_INIT(1), .write_protect_seq = SEQCNT_ZERO(efi_mm.write_protect_seq), diff --git a/include/linux/mm.h b/include/linux/mm.h index 7cc9ffc19e7f..896d04248e66 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2567,6 +2567,8 @@ extern bool arch_has_descending_max_zone_pfns(void); /* nommu.c */ extern atomic_long_t mmap_pages_allocated; extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t); +/* mmap.c */ +void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas); /* interval_tree.c */ void vma_interval_tree_insert(struct vm_area_struct *node, @@ -2630,6 +2632,9 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **, bool *need_rmap_locks); extern void exit_mmap(struct mm_struct *); +void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas); +void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas); + static inline int check_data_rlimit(unsigned long rlim, unsigned long new, unsigned long start, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e1797813cc2c..425bc5f7d477 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -486,6 +487,7 @@ struct kioctx_table; struct mm_struct { struct { struct vm_area_struct *mmap; /* list of VMAs */ + struct maple_tree mm_mt; struct rb_root mm_rb; u64 vmacache_seqnum; /* per-thread vmacache */ #ifdef CONFIG_MMU @@ -697,6 +699,7 @@ struct mm_struct { unsigned long cpu_bitmap[]; }; +#define MM_MT_FLAGS (MT_FLAGS_ALLOC_RANGE | 
MT_FLAGS_LOCK_EXTERN) extern struct mm_struct init_mm; /* Pointer magic because the dynamic array size confuses some compilers. */ diff --git a/include/trace/events/mmap.h b/include/trace/events/mmap.h index 4661f7ba07c0..216de5f03621 100644 --- a/include/trace/events/mmap.h +++ b/include/trace/events/mmap.h @@ -42,6 +42,79 @@ TRACE_EVENT(vm_unmapped_area, __entry->low_limit, __entry->high_limit, __entry->align_mask, __entry->align_offset) ); + +TRACE_EVENT(vma_mas_szero, + TP_PROTO(struct maple_tree *mt, unsigned long start, + unsigned long end), + + TP_ARGS(mt, start, end), + + TP_STRUCT__entry( + __field(struct maple_tree *, mt) + __field(unsigned long, start) + __field(unsigned long, end) + ), + + TP_fast_assign( + __entry->mt = mt; + __entry->start = start; + __entry->end = end; + ), + + TP_printk("mt_mod %p, (NULL), SNULL, %lu, %lu,", + __entry->mt, + (unsigned long) __entry->start, + (unsigned long) __entry->end + ) +); + +TRACE_EVENT(vma_store, + TP_PROTO(struct maple_tree *mt, struct vm_area_struct *vma), + + TP_ARGS(mt, vma), + + TP_STRUCT__entry( + __field(struct maple_tree *, mt) + __field(struct vm_area_struct *, vma) + __field(unsigned long, vm_start) + __field(unsigned long, vm_end) + ), + + TP_fast_assign( + __entry->mt = mt; + __entry->vma = vma; + __entry->vm_start = vma->vm_start; + __entry->vm_end = vma->vm_end - 1; + ), + + TP_printk("mt_mod %p, (%p), STORE, %lu, %lu,", + __entry->mt, __entry->vma, + (unsigned long) __entry->vm_start, + (unsigned long) __entry->vm_end + ) +); + + +TRACE_EVENT(exit_mmap, + TP_PROTO(struct mm_struct *mm), + + TP_ARGS(mm), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __field(struct maple_tree *, mt) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->mt = &mm->mm_mt; + ), + + TP_printk("mt_mod %p, DESTROY\n", + __entry->mt + ) +); + #endif /* This part must be outside protection */ diff --git a/kernel/fork.c b/kernel/fork.c index d2da065442af..273364207f17 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -585,6 +585,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, int retval; unsigned long charge; LIST_HEAD(uf); + MA_STATE(mas, &mm->mm_mt, 0, 0); uprobe_start_dup_mmap(); if (mmap_write_lock_killable(oldmm)) { @@ -614,6 +615,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, goto out; khugepaged_fork(mm, oldmm); + retval = mas_expected_entries(&mas, oldmm->map_count); + if (retval) + goto out; + prev = NULL; for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) { struct file *file; @@ -629,7 +634,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, */ if (fatal_signal_pending(current)) { retval = -EINTR; - goto out; + goto loop_out; } if (mpnt->vm_flags & VM_ACCOUNT) { unsigned long len = vma_pages(mpnt); @@ -694,6 +699,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, rb_link = &tmp->vm_rb.rb_right; rb_parent = &tmp->vm_rb; + /* Link the vma into the MT */ + mas.index = tmp->vm_start; + mas.last = tmp->vm_end - 1; + mas_store(&mas, tmp); + mm->map_count++; if (!(tmp->vm_flags & VM_WIPEONFORK)) retval = copy_page_range(tmp, mpnt); @@ -702,10 +712,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, tmp->vm_ops->open(tmp); if (retval) - goto out; + goto loop_out; } /* a new mm has just been created */ retval = arch_dup_mmap(oldmm, mm); +loop_out: + mas_destroy(&mas); out: mmap_write_unlock(mm); flush_tlb_mm(oldmm); @@ -721,7 +733,7 @@ fail_nomem_policy: fail_nomem: retval = -ENOMEM; vm_unacct_memory(charge); - goto out; + goto loop_out; } static inline int 
mm_alloc_pgd(struct mm_struct *mm) @@ -1111,6 +1123,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, { mm->mmap = NULL; mm->mm_rb = RB_ROOT; + mt_init_flags(&mm->mm_mt, MM_MT_FLAGS); + mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock); mm->vmacache_seqnum = 0; atomic_set(&mm->mm_users, 1); atomic_set(&mm->mm_count, 1); diff --git a/mm/init-mm.c b/mm/init-mm.c index fbe7844d0912..b912b0f2eced 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include #include +#include #include #include #include @@ -29,6 +30,7 @@ */ struct mm_struct init_mm = { .mm_rb = RB_ROOT, + .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, init_mm.mmap_lock), .pgd = swapper_pg_dir, .mm_users = ATOMIC_INIT(2), .mm_count = ATOMIC_INIT(1), diff --git a/mm/mmap.c b/mm/mmap.c index dd25a2aa94f7..5115eea6a0e6 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -334,7 +334,71 @@ static int browse_rb(struct mm_struct *mm) } return bug ? -1 : i; } +#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) +extern void mt_validate(struct maple_tree *mt); +extern void mt_dump(const struct maple_tree *mt); +/* Validate the maple tree */ +static void validate_mm_mt(struct mm_struct *mm) +{ + struct maple_tree *mt = &mm->mm_mt; + struct vm_area_struct *vma_mt, *vma = mm->mmap; + + MA_STATE(mas, mt, 0, 0); + + mt_validate(&mm->mm_mt); + mas_for_each(&mas, vma_mt, ULONG_MAX) { + if (xa_is_zero(vma_mt)) + continue; + + if (!vma) + break; + + if ((vma != vma_mt) || + (vma->vm_start != vma_mt->vm_start) || + (vma->vm_end != vma_mt->vm_end) || + (vma->vm_start != mas.index) || + (vma->vm_end - 1 != mas.last)) { + pr_emerg("issue in %s\n", current->comm); + dump_stack(); +#ifdef CONFIG_DEBUG_VM + dump_vma(vma_mt); + pr_emerg("and next in rb\n"); + dump_vma(vma->vm_next); +#endif + pr_emerg("mt piv: %p %lu - %lu\n", vma_mt, + mas.index, mas.last); + pr_emerg("mt vma: %p %lu - %lu\n", vma_mt, + vma_mt->vm_start, vma_mt->vm_end); + pr_emerg("rb vma: %p %lu - %lu\n", vma, + vma->vm_start, vma->vm_end); + pr_emerg("rb->next = %p %lu - %lu\n", vma->vm_next, + vma->vm_next->vm_start, vma->vm_next->vm_end); + + mt_dump(mas.tree); + if (vma_mt->vm_end != mas.last + 1) { + pr_err("vma: %p vma_mt %lu-%lu\tmt %lu-%lu\n", + mm, vma_mt->vm_start, vma_mt->vm_end, + mas.index, mas.last); + mt_dump(mas.tree); + } + VM_BUG_ON_MM(vma_mt->vm_end != mas.last + 1, mm); + if (vma_mt->vm_start != mas.index) { + pr_err("vma: %p vma_mt %p %lu - %lu doesn't match\n", + mm, vma_mt, vma_mt->vm_start, vma_mt->vm_end); + mt_dump(mas.tree); + } + VM_BUG_ON_MM(vma_mt->vm_start != mas.index, mm); + } + VM_BUG_ON(vma != vma_mt); + vma = vma->vm_next; + + } + VM_BUG_ON(vma); +} +#else +#define validate_mm_mt(root) do { } while (0) +#endif static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore) { struct rb_node *nd; @@ -389,6 +453,7 @@ static void validate_mm(struct mm_struct *mm) } #else #define validate_mm_rb(root, ignore) do { } while (0) +#define validate_mm_mt(root) do { } while (0) #define validate_mm(mm) do { } while (0) #endif @@ -633,6 +698,56 @@ static void __vma_link_file(struct vm_area_struct *vma) } } +/* + * vma_mas_store() - Store a VMA in the maple tree. + * @vma: The vm_area_struct + * @mas: The maple state + * + * Efficient way to store a VMA in the maple tree when the @mas has already + * walked to the correct location. + * + * Note: the end address is inclusive in the maple tree. 
+ */ +void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas) +{ + trace_vma_store(mas->tree, vma); + mas_set_range(mas, vma->vm_start, vma->vm_end - 1); + mas_store_prealloc(mas, vma); +} + +/* + * vma_mas_remove() - Remove a VMA from the maple tree. + * @vma: The vm_area_struct + * @mas: The maple state + * + * Efficient way to remove a VMA from the maple tree when the @mas has already + * been established and points to the correct location. + * Note: the end address is inclusive in the maple tree. + */ +void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas) +{ + trace_vma_mas_szero(mas->tree, vma->vm_start, vma->vm_end - 1); + mas->index = vma->vm_start; + mas->last = vma->vm_end - 1; + mas_store_prealloc(mas, NULL); +} + +/* + * vma_mas_szero() - Set a given range to zero. Used when modifying a + * vm_area_struct start or end. + * + * @mm: The struct_mm + * @start: The start address to zero + * @end: The end address to zero. + */ +static inline void vma_mas_szero(struct ma_state *mas, unsigned long start, + unsigned long end) +{ + trace_vma_mas_szero(mas->tree, start, end - 1); + mas_set_range(mas, start, end - 1); + mas_store_prealloc(mas, NULL); +} + static void __vma_link(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, struct rb_node **rb_link, @@ -642,17 +757,22 @@ __vma_link(struct mm_struct *mm, struct vm_area_struct *vma, __vma_link_rb(mm, vma, rb_link, rb_parent); } -static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, +static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *prev, struct rb_node **rb_link, struct rb_node *rb_parent) { + MA_STATE(mas, &mm->mm_mt, 0, 0); struct address_space *mapping = NULL; + if (mas_preallocate(&mas, vma, GFP_KERNEL)) + return -ENOMEM; + if (vma->vm_file) { mapping = vma->vm_file->f_mapping; i_mmap_lock_write(mapping); } + vma_mas_store(vma, &mas); __vma_link(mm, vma, prev, rb_link, rb_parent); __vma_link_file(vma); @@ -661,13 +781,15 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); + return 0; } /* * Helper for vma_adjust() in the split_vma insert case: insert a vma into the * mm's list and rbtree. It has already been inserted into the interval tree. 
*/ -static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) +static void __insert_vm_struct(struct mm_struct *mm, struct ma_state *mas, + struct vm_area_struct *vma) { struct vm_area_struct *prev; struct rb_node **rb_link, *rb_parent; @@ -675,7 +797,10 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) if (find_vma_links(mm, vma->vm_start, vma->vm_end, &prev, &rb_link, &rb_parent)) BUG(); - __vma_link(mm, vma, prev, rb_link, rb_parent); + + vma_mas_store(vma, mas); + __vma_link_list(mm, vma, prev); + __vma_link_rb(mm, vma, rb_link, rb_parent); mm->map_count++; } @@ -702,6 +827,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, { struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *next = vma->vm_next, *orig_vma = vma; + struct vm_area_struct *next_next; struct address_space *mapping = NULL; struct rb_root_cached *root = NULL; struct anon_vma *anon_vma = NULL; @@ -709,10 +835,13 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, bool start_changed = false, end_changed = false; long adjust_next = 0; int remove_next = 0; + MA_STATE(mas, &mm->mm_mt, 0, 0); + struct vm_area_struct *exporter = NULL, *importer = NULL; - if (next && !insert) { - struct vm_area_struct *exporter = NULL, *importer = NULL; + validate_mm(mm); + validate_mm_mt(mm); + if (next && !insert) { if (end >= next->vm_end) { /* * vma expands, overlapping all the next, and @@ -741,10 +870,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, * remove_next == 1 is case 1 or 7. */ remove_next = 1 + (end > next->vm_end); + if (remove_next == 2) + next_next = find_vma(mm, next->vm_end); + VM_WARN_ON(remove_next == 2 && end != next->vm_next->vm_end); - /* trim end to next, for case 6 first pass */ - end = next->vm_end; } exporter = next; @@ -792,9 +922,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, return error; } } -again: - vma_adjust_trans_huge(orig_vma, start, end, adjust_next); + if (mas_preallocate(&mas, vma, GFP_KERNEL)) + return -ENOMEM; + + vma_adjust_trans_huge(orig_vma, start, end, adjust_next); if (file) { mapping = file->f_mapping; root = &mapping->i_mmap; @@ -835,17 +967,28 @@ again: } if (start != vma->vm_start) { + unsigned long old_start = vma->vm_start; vma->vm_start = start; + if (old_start < start) + vma_mas_szero(&mas, old_start, start); start_changed = true; } if (end != vma->vm_end) { + unsigned long old_end = vma->vm_end; vma->vm_end = end; + if (old_end > end) + vma_mas_szero(&mas, end, old_end); end_changed = true; } + + if (end_changed || start_changed) + vma_mas_store(vma, &mas); + vma->vm_pgoff = pgoff; if (adjust_next) { next->vm_start += adjust_next; next->vm_pgoff += adjust_next >> PAGE_SHIFT; + vma_mas_store(next, &mas); } if (file) { @@ -859,10 +1002,14 @@ again: /* * vma_merge has merged next into vma, and needs * us to remove next before dropping the locks. + * Since we have expanded over this vma, the maple tree will + * have overwritten by storing the value */ - if (remove_next != 3) + if (remove_next != 3) { __vma_unlink(mm, next, next); - else + if (remove_next == 2) + __vma_unlink(mm, next_next, next_next); + } else { /* * vma is not before next if they've been * swapped. @@ -873,15 +1020,19 @@ again: * "vma"). 
*/ __vma_unlink(mm, next, vma); - if (file) + } + if (file) { __remove_shared_vm_struct(next, file, mapping); + if (remove_next == 2) + __remove_shared_vm_struct(next_next, file, mapping); + } } else if (insert) { /* * split_vma has split insert from vma, and needs * us to insert it before dropping the locks * (it may either follow vma or precede it). */ - __insert_vm_struct(mm, insert); + __insert_vm_struct(mm, &mas, insert); } else { if (start_changed) vma_gap_update(vma); @@ -909,6 +1060,7 @@ again: } if (remove_next) { +again: if (file) { uprobe_munmap(next, next->vm_start, next->vm_end); fput(file); @@ -930,7 +1082,7 @@ again: * "next->vm_prev->vm_end" changed and the * "vma->vm_next" gap must be updated. */ - next = vma->vm_next; + next = next_next; } else { /* * For the scope of the comment "next" and @@ -946,7 +1098,6 @@ again: } if (remove_next == 2) { remove_next = 1; - end = next->vm_end; goto again; } else if (next) @@ -978,6 +1129,7 @@ again: uprobe_mmap(insert); validate_mm(mm); + validate_mm_mt(mm); return 0; } @@ -1131,6 +1283,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, struct vm_area_struct *area, *next; int err; + validate_mm_mt(mm); /* * We later require that vma->vm_flags == vm_flags, * so this tests vma->vm_flags & VM_SPECIAL, too. @@ -1206,6 +1359,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, khugepaged_enter_vma(area, vm_flags); return area; } + validate_mm_mt(mm); return NULL; } @@ -1688,6 +1842,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, struct rb_node **rb_link, *rb_parent; unsigned long charged = 0; + validate_mm_mt(mm); /* Check against address space limit. */ if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) { unsigned long nr_pages; @@ -1802,7 +1957,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr, goto free_vma; } - vma_link(mm, vma, prev, rb_link, rb_parent); + if (vma_link(mm, vma, prev, rb_link, rb_parent)) { + error = -ENOMEM; + if (file) + goto unmap_and_free_vma; + else + goto free_vma; + } /* * vma_merge() calls khugepaged_enter_vma() either, the below @@ -1842,6 +2003,7 @@ out: vma_set_page_prot(vma); + validate_mm_mt(mm); return addr; unmap_and_free_vma: @@ -1857,6 +2019,7 @@ free_vma: unacct_error: if (charged) vm_unacct_memory(charged); + validate_mm_mt(mm); return error; } @@ -1873,12 +2036,19 @@ static unsigned long unmapped_area(struct vm_unmapped_area_info *info) struct mm_struct *mm = current->mm; struct vm_area_struct *vma; unsigned long length, low_limit, high_limit, gap_start, gap_end; + unsigned long gap; + MA_STATE(mas, &mm->mm_mt, 0, 0); /* Adjust search length to account for worst case alignment overhead */ length = info->length + info->align_mask; if (length < info->length) return -ENOMEM; + mas_empty_area(&mas, info->low_limit, info->high_limit - 1, + length); + gap = mas.index; + gap += (info->align_offset - gap) & info->align_mask; + /* Adjust search limits by the desired length */ if (info->high_limit < length) return -ENOMEM; @@ -1960,20 +2130,31 @@ found: VM_BUG_ON(gap_start + info->length > info->high_limit); VM_BUG_ON(gap_start + info->length > gap_end); + + VM_BUG_ON(gap != gap_start); return gap_start; } static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info) { struct mm_struct *mm = current->mm; - struct vm_area_struct *vma; + struct vm_area_struct *vma = NULL; unsigned long length, low_limit, high_limit, gap_start, gap_end; + unsigned long gap; + + MA_STATE(mas, &mm->mm_mt, 0, 0); + validate_mm_mt(mm); /* Adjust search 
length to account for worst case alignment overhead */ length = info->length + info->align_mask; if (length < info->length) return -ENOMEM; + mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1, + length); + gap = mas.last + 1 - info->length; + gap -= (gap - info->align_offset) & info->align_mask; + /* * Adjust search limits by the desired length. * See implementation comment at top of unmapped_area(). @@ -2059,6 +2240,32 @@ found_highest: VM_BUG_ON(gap_end < info->low_limit); VM_BUG_ON(gap_end < gap_start); + + if (gap != gap_end) { + pr_err("%s: %p Gap was found: mt %lu gap_end %lu\n", __func__, + mm, gap, gap_end); + pr_err("window was %lu - %lu size %lu\n", info->high_limit, + info->low_limit, length); + pr_err("mas.min %lu max %lu mas.last %lu\n", mas.min, mas.max, + mas.last); + pr_err("mas.index %lu align mask %lu offset %lu\n", mas.index, + info->align_mask, info->align_offset); + pr_err("rb_find_vma find on %lu => %p (%p)\n", mas.index, + find_vma(mm, mas.index), vma); +#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) + mt_dump(&mm->mm_mt); +#endif + { + struct vm_area_struct *dv = mm->mmap; + + while (dv) { + pr_err("vma %p %lu-%lu\n", dv, dv->vm_start, dv->vm_end); + dv = dv->vm_next; + } + } + VM_BUG_ON(gap != gap_end); + } + return gap_end; } @@ -2284,7 +2491,6 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) vmacache_update(addr, vma); return vma; } - EXPORT_SYMBOL(find_vma); /* @@ -2357,7 +2563,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) struct vm_area_struct *next; unsigned long gap_addr; int error = 0; + MA_STATE(mas, &mm->mm_mt, 0, 0); + validate_mm_mt(mm); if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; @@ -2381,9 +2589,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) /* Check that both stack segments have the same anon_vma? */ } + if (mas_preallocate(&mas, vma, GFP_KERNEL)) + return -ENOMEM; + /* We must make sure the anon_vma is allocated. */ - if (unlikely(anon_vma_prepare(vma))) + if (unlikely(anon_vma_prepare(vma))) { + mas_destroy(&mas); return -ENOMEM; + } /* * vma->vm_start/vm_end cannot change under us because the caller @@ -2420,6 +2633,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) vm_stat_account(mm, vma->vm_flags, grow); anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; + /* Overwrite old entry in mtree. */ + vma_mas_store(vma, &mas); anon_vma_interval_tree_post_update_vma(vma); if (vma->vm_next) vma_gap_update(vma->vm_next); @@ -2434,6 +2649,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) anon_vma_unlock_write(vma->anon_vma); khugepaged_enter_vma(vma, vma->vm_flags); validate_mm(mm); + validate_mm_mt(mm); + mas_destroy(&mas); return error; } #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */ @@ -2447,7 +2664,9 @@ int expand_downwards(struct vm_area_struct *vma, struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *prev; int error = 0; + MA_STATE(mas, &mm->mm_mt, 0, 0); + validate_mm(mm); address &= PAGE_MASK; if (address < mmap_min_addr) return -EPERM; @@ -2461,9 +2680,14 @@ int expand_downwards(struct vm_area_struct *vma, return -ENOMEM; } + if (mas_preallocate(&mas, vma, GFP_KERNEL)) + return -ENOMEM; + /* We must make sure the anon_vma is allocated. 
*/ - if (unlikely(anon_vma_prepare(vma))) + if (unlikely(anon_vma_prepare(vma))) { + mas_destroy(&mas); return -ENOMEM; + } /* * vma->vm_start/vm_end cannot change under us because the caller @@ -2501,6 +2725,8 @@ int expand_downwards(struct vm_area_struct *vma, anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; vma->vm_pgoff -= grow; + /* Overwrite old entry in mtree. */ + vma_mas_store(vma, &mas); anon_vma_interval_tree_post_update_vma(vma); vma_gap_update(vma); spin_unlock(&mm->page_table_lock); @@ -2512,6 +2738,7 @@ int expand_downwards(struct vm_area_struct *vma, anon_vma_unlock_write(vma->anon_vma); khugepaged_enter_vma(vma, vma->vm_flags); validate_mm(mm); + mas_destroy(&mas); return error; } @@ -2633,14 +2860,17 @@ static void unmap_region(struct mm_struct *mm, * vma list as we go.. */ static bool -detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma, - struct vm_area_struct *prev, unsigned long end) +detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas, + struct vm_area_struct *vma, struct vm_area_struct *prev, + unsigned long end) { struct vm_area_struct **insertion_point; struct vm_area_struct *tail_vma = NULL; insertion_point = (prev ? &prev->vm_next : &mm->mmap); vma->vm_prev = NULL; + mas_set_range(mas, vma->vm_start, end - 1); + mas_store_prealloc(mas, NULL); do { vma_rb_erase(vma, &mm->mm_rb); if (vma->vm_flags & VM_LOCKED) @@ -2681,6 +2911,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, { struct vm_area_struct *new; int err; + validate_mm_mt(mm); if (vma->vm_ops && vma->vm_ops->may_split) { err = vma->vm_ops->may_split(vma, addr); @@ -2723,6 +2954,9 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, if (!err) return 0; + /* Avoid vm accounting in close() operation */ + new->vm_start = new->vm_end; + new->vm_pgoff = 0; /* Clean everything up if vma_adjust failed. */ if (new->vm_ops && new->vm_ops->close) new->vm_ops->close(new); @@ -2733,6 +2967,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, mpol_put(vma_policy(new)); out_free_vma: vm_area_free(new); + validate_mm_mt(mm); return err; } @@ -2759,6 +2994,8 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, { unsigned long end; struct vm_area_struct *vma, *prev, *last; + int error = -ENOMEM; + MA_STATE(mas, &mm->mm_mt, 0, 0); if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start) return -EINVAL; @@ -2779,6 +3016,9 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, vma = find_vma_intersection(mm, start, end); if (!vma) return 0; + + if (mas_preallocate(&mas, vma, GFP_KERNEL)) + return -ENOMEM; prev = vma->vm_prev; /* @@ -2789,7 +3029,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, * places tmp vma above, and higher split_vma places tmp vma below. */ if (start > vma->vm_start) { - int error; /* * Make sure that map_count on return from munmap() will @@ -2797,20 +3036,20 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, * its limit temporarily, to help free resources as expected. */ if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) - return -ENOMEM; + goto map_count_exceeded; error = __split_vma(mm, vma, start, 0); if (error) - return error; + goto split_failed; prev = vma; } /* Does it split the last one? 
*/ last = find_vma(mm, end); if (last && end > last->vm_start) { - int error = __split_vma(mm, last, end, 1); + error = __split_vma(mm, last, end, 1); if (error) - return error; + goto split_failed; } vma = vma_next(mm, prev); @@ -2824,13 +3063,13 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, * split, despite we could. This is unlikely enough * failure that it's not worth optimizing it for. */ - int error = userfaultfd_unmap_prep(vma, start, end, uf); + error = userfaultfd_unmap_prep(vma, start, end, uf); if (error) - return error; + goto userfaultfd_error; } /* Detach vmas from rbtree */ - if (!detach_vmas_to_be_unmapped(mm, vma, prev, end)) + if (!detach_vmas_to_be_unmapped(mm, &mas, vma, prev, end)) downgrade = false; if (downgrade) @@ -2842,6 +3081,12 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, remove_vma_list(mm, vma); return downgrade ? 1 : 0; + +map_count_exceeded: +split_failed: +userfaultfd_error: + mas_destroy(&mas); + return error; } int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, @@ -2981,6 +3226,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla pgoff_t pgoff = addr >> PAGE_SHIFT; int error; unsigned long mapped_addr; + validate_mm_mt(mm); /* Until we need other flags, refuse anything except VM_EXEC. */ if ((flags & (~VM_EXEC)) != 0) @@ -3030,7 +3276,9 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla vma->vm_pgoff = pgoff; vma->vm_flags = flags; vma->vm_page_prot = vm_get_page_prot(flags); - vma_link(mm, vma, prev, rb_link, rb_parent); + if (vma_link(mm, vma, prev, rb_link, rb_parent)) + goto no_vma_link; + out: perf_event_mmap(vma); mm->total_vm += len >> PAGE_SHIFT; @@ -3038,7 +3286,12 @@ out: if (flags & VM_LOCKED) mm->locked_vm += (len >> PAGE_SHIFT); vma->vm_flags |= VM_SOFTDIRTY; + validate_mm_mt(mm); return 0; + +no_vma_link: + vm_area_free(vma); + return -ENOMEM; } int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags) @@ -3127,6 +3380,9 @@ void exit_mmap(struct mm_struct *mm) vma = remove_vma(vma); cond_resched(); } + + trace_exit_mmap(mm); + __mt_destroy(&mm->mm_mt); mm->mmap = NULL; mmap_write_unlock(mm); vm_unacct_memory(nr_accounted); @@ -3140,12 +3396,30 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) { struct vm_area_struct *prev; struct rb_node **rb_link, *rb_parent; + unsigned long start = vma->vm_start; + struct vm_area_struct *overlap = NULL; + unsigned long charged = vma_pages(vma); if (find_vma_links(mm, vma->vm_start, vma->vm_end, &prev, &rb_link, &rb_parent)) + + if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) return -ENOMEM; + + overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1); + if (overlap) { + + pr_err("Found vma ending at %lu\n", start - 1); + pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap, + overlap->vm_start, overlap->vm_end - 1); +#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) + mt_dump(&mm->mm_mt); +#endif + BUG(); + } + if ((vma->vm_flags & VM_ACCOUNT) && - security_vm_enough_memory_mm(mm, vma_pages(vma))) + security_vm_enough_memory_mm(mm, charged)) return -ENOMEM; /* @@ -3165,7 +3439,11 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; } - vma_link(mm, vma, prev, rb_link, rb_parent); + if (vma_link(mm, vma, prev, rb_link, rb_parent)) { + vm_unacct_memory(charged); + return -ENOMEM; + } + return 0; } @@ -3183,7 +3461,9 @@ struct vm_area_struct *copy_vma(struct 
vm_area_struct **vmap, struct vm_area_struct *new_vma, *prev; struct rb_node **rb_link, *rb_parent; bool faulted_in_anon_vma = true; + unsigned long index = addr; + validate_mm_mt(mm); /* * If anonymous vma has not yet been faulted, update new pgoff * to match new location, to increase its chance of merging. @@ -3195,6 +3475,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) return NULL; /* should never get here */ + if (mt_find(&mm->mm_mt, &index, addr+len - 1)) + BUG(); new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), vma->vm_userfaultfd_ctx, anon_vma_name(vma)); @@ -3238,6 +3520,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, vma_link(mm, new_vma, prev, rb_link, rb_parent); *need_rmap_locks = false; } + validate_mm_mt(mm); return new_vma; out_free_mempol: @@ -3245,6 +3528,7 @@ out_free_mempol: out_free_vma: vm_area_free(new_vma); out: + validate_mm_mt(mm); return NULL; } @@ -3381,6 +3665,7 @@ static struct vm_area_struct *__install_special_mapping( int ret; struct vm_area_struct *vma; + validate_mm_mt(mm); vma = vm_area_alloc(mm); if (unlikely(vma == NULL)) return ERR_PTR(-ENOMEM); @@ -3403,10 +3688,12 @@ static struct vm_area_struct *__install_special_mapping( perf_event_mmap(vma); + validate_mm_mt(mm); return vma; out: vm_area_free(vma); + validate_mm_mt(mm); return ERR_PTR(ret); } diff --git a/mm/nommu.c b/mm/nommu.c index e819cbc21b39..c63793c53a82 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -545,6 +545,19 @@ static void put_nommu_region(struct vm_region *region) __put_nommu_region(region); } +void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas) +{ + mas_set_range(mas, vma->vm_start, vma->vm_end - 1); + mas_store_prealloc(mas, vma); +} + +void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas) +{ + mas->index = vma->vm_start; + mas->last = vma->vm_end - 1; + mas_store_prealloc(mas, NULL); +} + /* * add a VMA into a process's mm_struct in the appropriate place in the list * and tree and add to the address space's page tree also if not an anonymous -- cgit v1.2.3-70-g09d2 From 524e00b36e8c547f5582eef3fb645a8d9fc5e3df Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 6 Sep 2022 19:48:48 +0000 Subject: mm: remove rb tree. Remove the RB tree and start using the maple tree for vm_area_struct tracking. Drop validate_mm() calls in expand_upwards() and expand_downwards() as the lock is not held. Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. 
Howlett Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: Davidlohr Bueso Cc: "Matthew Wilcox (Oracle)" Cc: SeongJae Park Cc: Sven Schnelle Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/kernel/tboot.c | 1 - drivers/firmware/efi/efi.c | 1 - include/linux/mm.h | 2 - include/linux/mm_types.h | 14 -- kernel/fork.c | 8 - mm/init-mm.c | 2 - mm/mmap.c | 506 ++++++++++----------------------------------- mm/nommu.c | 87 +++----- mm/util.c | 10 +- 9 files changed, 144 insertions(+), 487 deletions(-) (limited to 'arch') diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c index e01544202651..4c1bcb6053fc 100644 --- a/arch/x86/kernel/tboot.c +++ b/arch/x86/kernel/tboot.c @@ -95,7 +95,6 @@ void __init tboot_probe(void) static pgd_t *tboot_pg_dir; static struct mm_struct tboot_mm = { - .mm_rb = RB_ROOT, .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, tboot_mm.mmap_lock), .pgd = swapper_pg_dir, .mm_users = ATOMIC_INIT(2), diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 7b6a815b79d3..042a3ef4db1c 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -57,7 +57,6 @@ static unsigned long __initdata mem_reserve = EFI_INVALID_TABLE_ADDR; static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR; struct mm_struct efi_mm = { - .mm_rb = RB_ROOT, .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, efi_mm.mmap_lock), .mm_users = ATOMIC_INIT(2), .mm_count = ATOMIC_INIT(1), diff --git a/include/linux/mm.h b/include/linux/mm.h index 646ea4d3bd74..dfce1aaa7a64 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2654,8 +2654,6 @@ extern int __split_vma(struct mm_struct *, struct vm_area_struct *, extern int split_vma(struct mm_struct *, struct vm_area_struct *, unsigned long addr, int new_below); extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); -extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *, - struct rb_node **, struct rb_node *); extern void unlink_file_vma(struct vm_area_struct *); extern struct vm_area_struct *copy_vma(struct vm_area_struct **, unsigned long addr, unsigned long len, pgoff_t pgoff, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d0b51fbdf5d4..ac747273c4d6 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -410,19 +410,6 @@ struct vm_area_struct { /* linked list of VM areas per task, sorted by address */ struct vm_area_struct *vm_next, *vm_prev; - - struct rb_node vm_rb; - - /* - * Largest free memory gap in bytes to the left of this VMA. - * Either between this VMA and vma->vm_prev, or between one of the - * VMAs below us in the VMA rbtree and its ->vm_prev. This helps - * get_unmapped_area find a free area of the right size. - */ - unsigned long rb_subtree_gap; - - /* Second cache line starts here. */ - struct mm_struct *vm_mm; /* The address space we belong to. 
*/ /* @@ -488,7 +475,6 @@ struct mm_struct { struct { struct vm_area_struct *mmap; /* list of VMAs */ struct maple_tree mm_mt; - struct rb_root mm_rb; u64 vmacache_seqnum; /* per-thread vmacache */ #ifdef CONFIG_MMU unsigned long (*get_unmapped_area) (struct file *filp, diff --git a/kernel/fork.c b/kernel/fork.c index 16970c346b5b..5f81c009bb20 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -581,7 +581,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) { struct vm_area_struct *mpnt, *tmp, *prev, **pprev; - struct rb_node **rb_link, *rb_parent; int retval; unsigned long charge = 0; LIST_HEAD(uf); @@ -608,8 +607,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, mm->exec_vm = oldmm->exec_vm; mm->stack_vm = oldmm->stack_vm; - rb_link = &mm->mm_rb.rb_node; - rb_parent = NULL; pprev = &mm->mmap; retval = ksm_fork(mm, oldmm); if (retval) @@ -701,10 +698,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, tmp->vm_prev = prev; prev = tmp; - __vma_link_rb(mm, tmp, rb_link, rb_parent); - rb_link = &tmp->vm_rb.rb_right; - rb_parent = &tmp->vm_rb; - /* Link the vma into the MT */ mas.index = tmp->vm_start; mas.last = tmp->vm_end - 1; @@ -1133,7 +1126,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns) { mm->mmap = NULL; - mm->mm_rb = RB_ROOT; mt_init_flags(&mm->mm_mt, MM_MT_FLAGS); mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock); mm->vmacache_seqnum = 0; diff --git a/mm/init-mm.c b/mm/init-mm.c index b912b0f2eced..c9327abb771c 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -1,6 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 #include -#include #include #include #include @@ -29,7 +28,6 @@ * and size this cpu_bitmask to NR_CPUS. */ struct mm_struct init_mm = { - .mm_rb = RB_ROOT, .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, init_mm.mmap_lock), .pgd = swapper_pg_dir, .mm_users = ATOMIC_INIT(2), diff --git a/mm/mmap.c b/mm/mmap.c index 68ee2958c0be..f60d83c7f233 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -39,7 +39,6 @@ #include #include #include -#include #include #include #include @@ -247,93 +246,6 @@ out: return origbrk; } -static inline unsigned long vma_compute_gap(struct vm_area_struct *vma) -{ - unsigned long gap, prev_end; - - /* - * Note: in the rare case of a VM_GROWSDOWN above a VM_GROWSUP, we - * allow two stack_guard_gaps between them here, and when choosing - * an unmapped area; whereas when expanding we only require one. - * That's a little inconsistent, but keeps the code here simpler. 
- */ - gap = vm_start_gap(vma); - if (vma->vm_prev) { - prev_end = vm_end_gap(vma->vm_prev); - if (gap > prev_end) - gap -= prev_end; - else - gap = 0; - } - return gap; -} - -#ifdef CONFIG_DEBUG_VM_RB -static unsigned long vma_compute_subtree_gap(struct vm_area_struct *vma) -{ - unsigned long max = vma_compute_gap(vma), subtree_gap; - if (vma->vm_rb.rb_left) { - subtree_gap = rb_entry(vma->vm_rb.rb_left, - struct vm_area_struct, vm_rb)->rb_subtree_gap; - if (subtree_gap > max) - max = subtree_gap; - } - if (vma->vm_rb.rb_right) { - subtree_gap = rb_entry(vma->vm_rb.rb_right, - struct vm_area_struct, vm_rb)->rb_subtree_gap; - if (subtree_gap > max) - max = subtree_gap; - } - return max; -} - -static int browse_rb(struct mm_struct *mm) -{ - struct rb_root *root = &mm->mm_rb; - int i = 0, j, bug = 0; - struct rb_node *nd, *pn = NULL; - unsigned long prev = 0, pend = 0; - - for (nd = rb_first(root); nd; nd = rb_next(nd)) { - struct vm_area_struct *vma; - vma = rb_entry(nd, struct vm_area_struct, vm_rb); - if (vma->vm_start < prev) { - pr_emerg("vm_start %lx < prev %lx\n", - vma->vm_start, prev); - bug = 1; - } - if (vma->vm_start < pend) { - pr_emerg("vm_start %lx < pend %lx\n", - vma->vm_start, pend); - bug = 1; - } - if (vma->vm_start > vma->vm_end) { - pr_emerg("vm_start %lx > vm_end %lx\n", - vma->vm_start, vma->vm_end); - bug = 1; - } - spin_lock(&mm->page_table_lock); - if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) { - pr_emerg("free gap %lx, correct %lx\n", - vma->rb_subtree_gap, - vma_compute_subtree_gap(vma)); - bug = 1; - } - spin_unlock(&mm->page_table_lock); - i++; - pn = nd; - prev = vma->vm_start; - pend = vma->vm_end; - } - j = 0; - for (nd = pn; nd; nd = rb_prev(nd)) - j++; - if (i != j) { - pr_emerg("backwards %d, forwards %d\n", j, i); - bug = 1; - } - return bug ? 
-1 : i; -} #if defined(CONFIG_DEBUG_VM_MAPLE_TREE) extern void mt_validate(struct maple_tree *mt); extern void mt_dump(const struct maple_tree *mt); @@ -361,19 +273,25 @@ static void validate_mm_mt(struct mm_struct *mm) (vma->vm_end - 1 != mas.last)) { pr_emerg("issue in %s\n", current->comm); dump_stack(); -#ifdef CONFIG_DEBUG_VM dump_vma(vma_mt); - pr_emerg("and next in rb\n"); + pr_emerg("and vm_next\n"); dump_vma(vma->vm_next); -#endif pr_emerg("mt piv: %p %lu - %lu\n", vma_mt, mas.index, mas.last); pr_emerg("mt vma: %p %lu - %lu\n", vma_mt, vma_mt->vm_start, vma_mt->vm_end); - pr_emerg("rb vma: %p %lu - %lu\n", vma, + if (vma->vm_prev) { + pr_emerg("ll prev: %p %lu - %lu\n", + vma->vm_prev, vma->vm_prev->vm_start, + vma->vm_prev->vm_end); + } + pr_emerg("ll vma: %p %lu - %lu\n", vma, vma->vm_start, vma->vm_end); - pr_emerg("rb->next = %p %lu - %lu\n", vma->vm_next, - vma->vm_next->vm_start, vma->vm_next->vm_end); + if (vma->vm_next) { + pr_emerg("ll next: %p %lu - %lu\n", + vma->vm_next, vma->vm_next->vm_start, + vma->vm_next->vm_end); + } mt_dump(mas.tree); if (vma_mt->vm_end != mas.last + 1) { @@ -396,21 +314,6 @@ static void validate_mm_mt(struct mm_struct *mm) } VM_BUG_ON(vma); } -#else -#define validate_mm_mt(root) do { } while (0) -#endif -static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore) -{ - struct rb_node *nd; - - for (nd = rb_first(root); nd; nd = rb_next(nd)) { - struct vm_area_struct *vma; - vma = rb_entry(nd, struct vm_area_struct, vm_rb); - VM_BUG_ON_VMA(vma != ignore && - vma->rb_subtree_gap != vma_compute_subtree_gap(vma), - vma); - } -} static void validate_mm(struct mm_struct *mm) { @@ -419,7 +322,10 @@ static void validate_mm(struct mm_struct *mm) unsigned long highest_address = 0; struct vm_area_struct *vma = mm->mmap; + validate_mm_mt(mm); + while (vma) { +#ifdef CONFIG_DEBUG_VM_RB struct anon_vma *anon_vma = vma->anon_vma; struct anon_vma_chain *avc; @@ -429,6 +335,7 @@ static void validate_mm(struct mm_struct *mm) anon_vma_interval_tree_verify(avc); anon_vma_unlock_read(anon_vma); } +#endif highest_address = vm_end_gap(vma); vma = vma->vm_next; @@ -443,80 +350,13 @@ static void validate_mm(struct mm_struct *mm) mm->highest_vm_end, highest_address); bug = 1; } - i = browse_rb(mm); - if (i != mm->map_count) { - if (i != -1) - pr_emerg("map_count %d rb %d\n", mm->map_count, i); - bug = 1; - } VM_BUG_ON_MM(bug, mm); } -#else -#define validate_mm_rb(root, ignore) do { } while (0) + +#else /* !CONFIG_DEBUG_VM_MAPLE_TREE */ #define validate_mm_mt(root) do { } while (0) #define validate_mm(mm) do { } while (0) -#endif - -RB_DECLARE_CALLBACKS_MAX(static, vma_gap_callbacks, - struct vm_area_struct, vm_rb, - unsigned long, rb_subtree_gap, vma_compute_gap) - -/* - * Update augmented rbtree rb_subtree_gap values after vma->vm_start or - * vma->vm_prev->vm_end values changed, without modifying the vma's position - * in the rbtree. - */ -static void vma_gap_update(struct vm_area_struct *vma) -{ - /* - * As it turns out, RB_DECLARE_CALLBACKS_MAX() already created - * a callback function that does exactly what we want. 
- */ - vma_gap_callbacks_propagate(&vma->vm_rb, NULL); -} - -static inline void vma_rb_insert(struct vm_area_struct *vma, - struct rb_root *root) -{ - /* All rb_subtree_gap values must be consistent prior to insertion */ - validate_mm_rb(root, NULL); - - rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks); -} - -static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root) -{ - /* - * Note rb_erase_augmented is a fairly large inline function, - * so make sure we instantiate it only once with our desired - * augmented rbtree callbacks. - */ - rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks); -} - -static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma, - struct rb_root *root, - struct vm_area_struct *ignore) -{ - /* - * All rb_subtree_gap values must be consistent prior to erase, - * with the possible exception of - * - * a. the "next" vma being erased if next->vm_start was reduced in - * __vma_adjust() -> __vma_unlink() - * b. the vma being erased in detach_vmas_to_be_unmapped() -> - * vma_rb_erase() - */ - validate_mm_rb(root, ignore); - - __vma_rb_erase(vma, root); -} - -static __always_inline void vma_rb_erase(struct vm_area_struct *vma, - struct rb_root *root) -{ - vma_rb_erase_ignore(vma, root, vma); -} +#endif /* CONFIG_DEBUG_VM_MAPLE_TREE */ /* * vma has some anon_vma assigned, and is already inserted on that @@ -550,39 +390,26 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma) anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root); } -static int find_vma_links(struct mm_struct *mm, unsigned long addr, - unsigned long end, struct vm_area_struct **pprev, - struct rb_node ***rb_link, struct rb_node **rb_parent) +/* + * range_has_overlap() - Check the @start - @end range for overlapping VMAs and + * sets up a pointer to the previous VMA + * @mm: the mm struct + * @start: the start address of the range + * @end: the end address of the range + * @pprev: the pointer to the pointer of the previous VMA + * + * Returns: True if there is an overlapping VMA, false otherwise + */ +static inline +bool range_has_overlap(struct mm_struct *mm, unsigned long start, + unsigned long end, struct vm_area_struct **pprev) { - struct rb_node **__rb_link, *__rb_parent, *rb_prev; - - mmap_assert_locked(mm); - __rb_link = &mm->mm_rb.rb_node; - rb_prev = __rb_parent = NULL; - - while (*__rb_link) { - struct vm_area_struct *vma_tmp; - - __rb_parent = *__rb_link; - vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb); + struct vm_area_struct *existing; - if (vma_tmp->vm_end > addr) { - /* Fail if an existing vma overlaps the area */ - if (vma_tmp->vm_start < end) - return -ENOMEM; - __rb_link = &__rb_parent->rb_left; - } else { - rb_prev = __rb_parent; - __rb_link = &__rb_parent->rb_right; - } - } - - *pprev = NULL; - if (rb_prev) - *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb); - *rb_link = __rb_link; - *rb_parent = __rb_parent; - return 0; + MA_STATE(mas, &mm->mm_mt, start, start); + existing = mas_find(&mas, end - 1); + *pprev = mas_prev(&mas, 0); + return existing ? true : false; } /* @@ -609,8 +436,6 @@ static inline struct vm_area_struct *__vma_next(struct mm_struct *mm, * @start: The start of the range. * @len: The length of the range. * @pprev: pointer to the pointer that will be set to previous vm_area_struct - * @rb_link: the rb_node - * @rb_parent: the parent rb_node * * Find all the vm_area_struct that overlap from @start to * @end and munmap them. Set @pprev to the previous vm_area_struct. 
@@ -619,14 +444,11 @@ static inline struct vm_area_struct *__vma_next(struct mm_struct *mm, */ static inline int munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len, - struct vm_area_struct **pprev, struct rb_node ***link, - struct rb_node **parent, struct list_head *uf) + struct vm_area_struct **pprev, struct list_head *uf) { - - while (find_vma_links(mm, start, start + len, pprev, link, parent)) + while (range_has_overlap(mm, start, start + len, pprev)) if (do_munmap(mm, start, len, uf)) return -ENOMEM; - return 0; } @@ -647,30 +469,6 @@ static unsigned long count_vma_pages_range(struct mm_struct *mm, return nr_pages; } -void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma, - struct rb_node **rb_link, struct rb_node *rb_parent) -{ - /* Update tracking information for the gap following the new vma. */ - if (vma->vm_next) - vma_gap_update(vma->vm_next); - else - mm->highest_vm_end = vm_end_gap(vma); - - /* - * vma->vm_prev wasn't known when we followed the rbtree to find the - * correct insertion point for that vma. As a result, we could not - * update the vma vm_rb parents rb_subtree_gap values on the way down. - * So, we first insert the vma with a zero rb_subtree_gap value - * (to be consistent with what we did on the way down), and then - * immediately update the gap to the correct value. Finally we - * rebalance the rbtree after all augmented values have been set. - */ - rb_link_node(&vma->vm_rb, rb_parent, rb_link); - vma->rb_subtree_gap = 0; - vma_gap_update(vma); - vma_rb_insert(vma, &mm->mm_rb); -} - static void __vma_link_file(struct vm_area_struct *vma) { struct file *file; @@ -738,18 +536,8 @@ static inline void vma_mas_szero(struct ma_state *mas, unsigned long start, mas_store_prealloc(mas, NULL); } -static void -__vma_link(struct mm_struct *mm, struct vm_area_struct *vma, - struct vm_area_struct *prev, struct rb_node **rb_link, - struct rb_node *rb_parent) -{ - __vma_link_list(mm, vma, prev); - __vma_link_rb(mm, vma, rb_link, rb_parent); -} - static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma, - struct vm_area_struct *prev, struct rb_node **rb_link, - struct rb_node *rb_parent) + struct vm_area_struct *prev) { MA_STATE(mas, &mm->mm_mt, 0, 0); struct address_space *mapping = NULL; @@ -763,7 +551,7 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma, } vma_mas_store(vma, &mas); - __vma_link(mm, vma, prev, rb_link, rb_parent); + __vma_link_list(mm, vma, prev); __vma_link_file(vma); if (mapping) @@ -776,34 +564,20 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma, /* * Helper for vma_adjust() in the split_vma insert case: insert a vma into the - * mm's list and rbtree. It has already been inserted into the interval tree. + * mm's list and the mm tree. It has already been inserted into the interval tree. 
*/ static void __insert_vm_struct(struct mm_struct *mm, struct ma_state *mas, struct vm_area_struct *vma) { struct vm_area_struct *prev; - struct rb_node **rb_link, *rb_parent; - - if (find_vma_links(mm, vma->vm_start, vma->vm_end, - &prev, &rb_link, &rb_parent)) - BUG(); + mas_set(mas, vma->vm_start); + prev = mas_prev(mas, 0); vma_mas_store(vma, mas); __vma_link_list(mm, vma, prev); - __vma_link_rb(mm, vma, rb_link, rb_parent); mm->map_count++; } -static __always_inline void __vma_unlink(struct mm_struct *mm, - struct vm_area_struct *vma, - struct vm_area_struct *ignore) -{ - vma_rb_erase_ignore(vma, &mm->mm_rb, ignore); - __vma_unlink_list(mm, vma); - /* Kill the cache */ - vmacache_invalidate(mm); -} - /* * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that * is already present in an i_mmap tree without adjusting the tree. @@ -816,21 +590,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, struct vm_area_struct *expand) { struct mm_struct *mm = vma->vm_mm; - struct vm_area_struct *next = vma->vm_next, *orig_vma = vma; - struct vm_area_struct *next_next; + struct vm_area_struct *next_next, *next = find_vma(mm, vma->vm_end); + struct vm_area_struct *orig_vma = vma; struct address_space *mapping = NULL; struct rb_root_cached *root = NULL; struct anon_vma *anon_vma = NULL; struct file *file = vma->vm_file; - bool start_changed = false, end_changed = false; + bool vma_changed = false; long adjust_next = 0; int remove_next = 0; MA_STATE(mas, &mm->mm_mt, 0, 0); struct vm_area_struct *exporter = NULL, *importer = NULL; - validate_mm(mm); - validate_mm_mt(mm); - if (next && !insert) { if (end >= next->vm_end) { /* @@ -957,21 +728,21 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, } if (start != vma->vm_start) { - unsigned long old_start = vma->vm_start; + if (vma->vm_start < start) + vma_mas_szero(&mas, vma->vm_start, start); + vma_changed = true; vma->vm_start = start; - if (old_start < start) - vma_mas_szero(&mas, old_start, start); - start_changed = true; } if (end != vma->vm_end) { - unsigned long old_end = vma->vm_end; + if (vma->vm_end > end) + vma_mas_szero(&mas, end, vma->vm_end); + vma_changed = true; vma->vm_end = end; - if (old_end > end) - vma_mas_szero(&mas, end, old_end); - end_changed = true; + if (!next) + mm->highest_vm_end = vm_end_gap(vma); } - if (end_changed || start_changed) + if (vma_changed) vma_mas_store(vma, &mas); vma->vm_pgoff = pgoff; @@ -995,22 +766,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, * Since we have expanded over this vma, the maple tree will * have overwritten by storing the value */ - if (remove_next != 3) { - __vma_unlink(mm, next, next); - if (remove_next == 2) - __vma_unlink(mm, next_next, next_next); - } else { - /* - * vma is not before next if they've been - * swapped. - * - * pre-swap() next->vm_start was reduced so - * tell validate_mm_rb to ignore pre-swap() - * "next" (which is stored in post-swap() - * "vma"). - */ - __vma_unlink(mm, next, vma); - } + __vma_unlink_list(mm, next); + if (remove_next == 2) + __vma_unlink_list(mm, next_next); + /* Kill the cache */ + vmacache_invalidate(mm); + if (file) { __remove_shared_vm_struct(next, file, mapping); if (remove_next == 2) @@ -1023,15 +784,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, * (it may either follow vma or precede it). 
*/ __insert_vm_struct(mm, &mas, insert); - } else { - if (start_changed) - vma_gap_update(vma); - if (end_changed) { - if (!next) - mm->highest_vm_end = vm_end_gap(vma); - else if (!adjust_next) - vma_gap_update(next); - } } if (anon_vma) { @@ -1059,7 +811,10 @@ again: anon_vma_merge(vma, next); mm->map_count--; mpol_put(vma_policy(next)); + if (remove_next != 2) + BUG_ON(vma->vm_end < next->vm_end); vm_area_free(next); + /* * In mprotect's case 6 (see comments on vma_merge), * we must remove another next too. It would clutter @@ -1089,10 +844,7 @@ again: if (remove_next == 2) { remove_next = 1; goto again; - } - else if (next) - vma_gap_update(next); - else { + } else if (!next) { /* * If remove_next == 2 we obviously can't * reach this path. @@ -1119,8 +871,6 @@ again: uprobe_mmap(insert); validate_mm(mm); - validate_mm_mt(mm); - return 0; } @@ -1273,7 +1023,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, struct vm_area_struct *area, *next; int err; - validate_mm_mt(mm); /* * We later require that vma->vm_flags == vm_flags, * so this tests vma->vm_flags & VM_SPECIAL, too. @@ -1349,7 +1098,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, khugepaged_enter_vma(area, vm_flags); return area; } - validate_mm_mt(mm); return NULL; } @@ -1519,6 +1267,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags_t vm_flags; int pkey = 0; + validate_mm(mm); *populate = 0; if (!len) @@ -1829,10 +1578,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev, *merge; int error; - struct rb_node **rb_link, *rb_parent; unsigned long charged = 0; - validate_mm_mt(mm); /* Check against address space limit. */ if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) { unsigned long nr_pages; @@ -1848,8 +1595,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, return -ENOMEM; } - /* Clear old maps, set up prev, rb_link, rb_parent, and uf */ - if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf)) + /* Clear old maps, set up prev and uf */ + if (munmap_vma_range(mm, addr, len, &prev, uf)) return -ENOMEM; /* * Private writable mapping: check memory availability @@ -1947,7 +1694,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, goto free_vma; } - if (vma_link(mm, vma, prev, rb_link, rb_parent)) { + if (vma_link(mm, vma, prev)) { error = -ENOMEM; if (file) goto unmap_and_free_vma; @@ -1993,7 +1740,6 @@ out: vma_set_page_prot(vma); - validate_mm_mt(mm); return addr; unmap_and_free_vma: @@ -2009,7 +1755,6 @@ free_vma: unacct_error: if (charged) vm_unacct_memory(charged); - validate_mm_mt(mm); return error; } @@ -2367,7 +2112,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) int error = 0; MA_STATE(mas, &mm->mm_mt, 0, 0); - validate_mm_mt(mm); if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; @@ -2419,15 +2163,13 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) error = acct_stack_growth(vma, size, grow); if (!error) { /* - * vma_gap_update() doesn't support concurrent - * updates, but we only hold a shared mmap_lock - * lock here, so we need to protect against - * concurrent vma expansions. - * anon_vma_lock_write() doesn't help here, as - * we don't guarantee that all growable vmas - * in a mm share the same root anon vma. - * So, we reuse mm->page_table_lock to guard - * against concurrent vma expansions. 
+ * We only hold a shared mmap_lock lock here, so + * we need to protect against concurrent vma + * expansions. anon_vma_lock_write() doesn't + * help here, as we don't guarantee that all + * growable vmas in a mm share the same root + * anon vma. So, we reuse mm->page_table_lock + * to guard against concurrent vma expansions. */ spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) @@ -2438,9 +2180,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) /* Overwrite old entry in mtree. */ vma_mas_store(vma, &mas); anon_vma_interval_tree_post_update_vma(vma); - if (vma->vm_next) - vma_gap_update(vma->vm_next); - else + if (!vma->vm_next) mm->highest_vm_end = vm_end_gap(vma); spin_unlock(&mm->page_table_lock); @@ -2450,8 +2190,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) } anon_vma_unlock_write(vma->anon_vma); khugepaged_enter_vma(vma, vma->vm_flags); - validate_mm(mm); - validate_mm_mt(mm); mas_destroy(&mas); return error; } @@ -2460,15 +2198,13 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) /* * vma is the first one with address < vma->vm_start. Have to extend vma. */ -int expand_downwards(struct vm_area_struct *vma, - unsigned long address) +int expand_downwards(struct vm_area_struct *vma, unsigned long address) { struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *prev; int error = 0; MA_STATE(mas, &mm->mm_mt, 0, 0); - validate_mm(mm); address &= PAGE_MASK; if (address < mmap_min_addr) return -EPERM; @@ -2510,15 +2246,13 @@ int expand_downwards(struct vm_area_struct *vma, error = acct_stack_growth(vma, size, grow); if (!error) { /* - * vma_gap_update() doesn't support concurrent - * updates, but we only hold a shared mmap_lock - * lock here, so we need to protect against - * concurrent vma expansions. - * anon_vma_lock_write() doesn't help here, as - * we don't guarantee that all growable vmas - * in a mm share the same root anon vma. - * So, we reuse mm->page_table_lock to guard - * against concurrent vma expansions. + * We only hold a shared mmap_lock lock here, so + * we need to protect against concurrent vma + * expansions. anon_vma_lock_write() doesn't + * help here, as we don't guarantee that all + * growable vmas in a mm share the same root + * anon vma. So, we reuse mm->page_table_lock + * to guard against concurrent vma expansions. */ spin_lock(&mm->page_table_lock); if (vma->vm_flags & VM_LOCKED) @@ -2530,7 +2264,6 @@ int expand_downwards(struct vm_area_struct *vma, /* Overwrite old entry in mtree. */ vma_mas_store(vma, &mas); anon_vma_interval_tree_post_update_vma(vma); - vma_gap_update(vma); spin_unlock(&mm->page_table_lock); perf_event_mmap(vma); @@ -2539,7 +2272,6 @@ int expand_downwards(struct vm_area_struct *vma, } anon_vma_unlock_write(vma->anon_vma); khugepaged_enter_vma(vma, vma->vm_flags); - validate_mm(mm); mas_destroy(&mas); return error; } @@ -2671,10 +2403,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas, insertion_point = (prev ? 
&prev->vm_next : &mm->mmap); vma->vm_prev = NULL; - mas_set_range(mas, vma->vm_start, end - 1); - mas_store_prealloc(mas, NULL); + vma_mas_szero(mas, vma->vm_start, end); do { - vma_rb_erase(vma, &mm->mm_rb); if (vma->vm_flags & VM_LOCKED) mm->locked_vm -= vma_pages(vma); mm->map_count--; @@ -2682,10 +2412,9 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas, vma = vma->vm_next; } while (vma && vma->vm_start < end); *insertion_point = vma; - if (vma) { + if (vma) vma->vm_prev = prev; - vma_gap_update(vma); - } else + else mm->highest_vm_end = prev ? vm_end_gap(prev) : 0; tail_vma->vm_next = NULL; @@ -2807,11 +2536,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, if (len == 0) return -EINVAL; - /* - * arch_unmap() might do unmaps itself. It must be called - * and finish any rbtree manipulation before this code - * runs and also starts to manipulate the rbtree. - */ + /* arch_unmap() might do unmaps itself. */ arch_unmap(mm, start, end); /* Find the first overlapping VMA where start < vma->vm_end */ @@ -2822,6 +2547,11 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, if (mas_preallocate(&mas, vma, GFP_KERNEL)) return -ENOMEM; prev = vma->vm_prev; + /* we have start < vma->vm_end */ + + /* if it doesn't overlap, we have nothing.. */ + if (vma->vm_start >= end) + return 0; /* * If we need to split any vma, do it now to save pain later. @@ -2882,6 +2612,8 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* Fix up all other VM information */ remove_vma_list(mm, vma); + + validate_mm(mm); return downgrade ? 1 : 0; map_count_exceeded: @@ -3020,11 +2752,11 @@ out: * anonymous maps. eventually we may be able to do some * brk-specific accounting here. */ -static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long flags, struct list_head *uf) +static int do_brk_flags(unsigned long addr, unsigned long len, + unsigned long flags, struct list_head *uf) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; - struct rb_node **rb_link, *rb_parent; pgoff_t pgoff = addr >> PAGE_SHIFT; int error; unsigned long mapped_addr; @@ -3043,8 +2775,8 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla if (error) return error; - /* Clear old maps, set up prev, rb_link, rb_parent, and uf */ - if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf)) + /* Clear old maps, set up prev and uf */ + if (munmap_vma_range(mm, addr, len, &prev, uf)) return -ENOMEM; /* Check against address space limits *after* clearing old maps... 
*/ @@ -3078,7 +2810,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla vma->vm_pgoff = pgoff; vma->vm_flags = flags; vma->vm_page_prot = vm_get_page_prot(flags); - if (vma_link(mm, vma, prev, rb_link, rb_parent)) + if (vma_link(mm, vma, prev)) goto no_vma_link; out: @@ -3197,29 +2929,12 @@ void exit_mmap(struct mm_struct *mm) int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) { struct vm_area_struct *prev; - struct rb_node **rb_link, *rb_parent; - unsigned long start = vma->vm_start; - struct vm_area_struct *overlap = NULL; unsigned long charged = vma_pages(vma); - if (find_vma_links(mm, vma->vm_start, vma->vm_end, - &prev, &rb_link, &rb_parent)) - if (find_vma_intersection(mm, vma->vm_start, vma->vm_end)) + if (range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev)) return -ENOMEM; - overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1); - if (overlap) { - - pr_err("Found vma ending at %lu\n", start - 1); - pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap, - overlap->vm_start, overlap->vm_end - 1); -#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) - mt_dump(&mm->mm_mt); -#endif - BUG(); - } - if ((vma->vm_flags & VM_ACCOUNT) && security_vm_enough_memory_mm(mm, charged)) return -ENOMEM; @@ -3241,7 +2956,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; } - if (vma_link(mm, vma, prev, rb_link, rb_parent)) { + if (vma_link(mm, vma, prev)) { vm_unacct_memory(charged); return -ENOMEM; } @@ -3261,9 +2976,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, unsigned long vma_start = vma->vm_start; struct mm_struct *mm = vma->vm_mm; struct vm_area_struct *new_vma, *prev; - struct rb_node **rb_link, *rb_parent; bool faulted_in_anon_vma = true; - unsigned long index = addr; validate_mm_mt(mm); /* @@ -3275,10 +2988,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, faulted_in_anon_vma = false; } - if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) + if (range_has_overlap(mm, addr, addr + len, &prev)) return NULL; /* should never get here */ - if (mt_find(&mm->mm_mt, &index, addr+len - 1)) - BUG(); + new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), vma->vm_userfaultfd_ctx, anon_vma_name(vma)); @@ -3319,12 +3031,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, get_file(new_vma->vm_file); if (new_vma->vm_ops && new_vma->vm_ops->open) new_vma->vm_ops->open(new_vma); - vma_link(mm, new_vma, prev, rb_link, rb_parent); + if (vma_link(mm, new_vma, prev)) + goto out_vma_link; *need_rmap_locks = false; } validate_mm_mt(mm); return new_vma; +out_vma_link: + if (new_vma->vm_ops && new_vma->vm_ops->close) + new_vma->vm_ops->close(new_vma); out_free_mempol: mpol_put(vma_policy(new_vma)); out_free_vma: diff --git a/mm/nommu.c b/mm/nommu.c index c63793c53a82..321c7e6718a8 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -566,9 +566,9 @@ void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas) */ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) { - struct vm_area_struct *pvma, *prev; struct address_space *mapping; - struct rb_node **p, *parent, *rb_prev; + struct vm_area_struct *prev; + MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end); BUG_ON(!vma->vm_region); @@ -586,42 +586,10 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) i_mmap_unlock_write(mapping); } + prev = mas_prev(&mas, 0); + 
mas_reset(&mas); /* add the VMA to the tree */ - parent = rb_prev = NULL; - p = &mm->mm_rb.rb_node; - while (*p) { - parent = *p; - pvma = rb_entry(parent, struct vm_area_struct, vm_rb); - - /* sort by: start addr, end addr, VMA struct addr in that order - * (the latter is necessary as we may get identical VMAs) */ - if (vma->vm_start < pvma->vm_start) - p = &(*p)->rb_left; - else if (vma->vm_start > pvma->vm_start) { - rb_prev = parent; - p = &(*p)->rb_right; - } else if (vma->vm_end < pvma->vm_end) - p = &(*p)->rb_left; - else if (vma->vm_end > pvma->vm_end) { - rb_prev = parent; - p = &(*p)->rb_right; - } else if (vma < pvma) - p = &(*p)->rb_left; - else if (vma > pvma) { - rb_prev = parent; - p = &(*p)->rb_right; - } else - BUG(); - } - - rb_link_node(&vma->vm_rb, parent, p); - rb_insert_color(&vma->vm_rb, &mm->mm_rb); - - /* add VMA to the VMA list also */ - prev = NULL; - if (rb_prev) - prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb); - + vma_mas_store(vma, &mas); __vma_link_list(mm, vma, prev); } @@ -634,6 +602,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma) struct address_space *mapping; struct mm_struct *mm = vma->vm_mm; struct task_struct *curr = current; + MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0); mm->map_count--; for (i = 0; i < VMACACHE_SIZE; i++) { @@ -656,8 +625,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma) } /* remove from the MM's tree and list */ - rb_erase(&vma->vm_rb, &mm->mm_rb); - + vma_mas_remove(vma, &mas); __vma_unlink_list(mm, vma); } @@ -681,24 +649,19 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma) struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) { struct vm_area_struct *vma; + MA_STATE(mas, &mm->mm_mt, addr, addr); /* check the cache first */ vma = vmacache_find(mm, addr); if (likely(vma)) return vma; - /* trawl the list (there may be multiple mappings in which addr - * resides) */ - for (vma = mm->mmap; vma; vma = vma->vm_next) { - if (vma->vm_start > addr) - return NULL; - if (vma->vm_end > addr) { - vmacache_update(addr, vma); - return vma; - } - } + vma = mas_walk(&mas); - return NULL; + if (vma) + vmacache_update(addr, vma); + + return vma; } EXPORT_SYMBOL(find_vma); @@ -730,26 +693,23 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm, { struct vm_area_struct *vma; unsigned long end = addr + len; + MA_STATE(mas, &mm->mm_mt, addr, addr); /* check the cache first */ vma = vmacache_find_exact(mm, addr, end); if (vma) return vma; - /* trawl the list (there may be multiple mappings in which addr - * resides) */ - for (vma = mm->mmap; vma; vma = vma->vm_next) { - if (vma->vm_start < addr) - continue; - if (vma->vm_start > addr) - return NULL; - if (vma->vm_end == end) { - vmacache_update(addr, vma); - return vma; - } - } + vma = mas_walk(&mas); + if (!vma) + return NULL; + if (vma->vm_start != addr) + return NULL; + if (vma->vm_end != end) + return NULL; - return NULL; + vmacache_update(addr, vma); + return vma; } /* @@ -1546,6 +1506,7 @@ void exit_mmap(struct mm_struct *mm) delete_vma(mm, vma); cond_resched(); } + __mt_destroy(&mm->mm_mt); } int vm_brk(unsigned long addr, unsigned long len) diff --git a/mm/util.c b/mm/util.c index 8d944ce71e94..10effe256dfa 100644 --- a/mm/util.c +++ b/mm/util.c @@ -288,6 +288,8 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, vma->vm_next = next; if (next) next->vm_prev = vma; + else + mm->highest_vm_end = vm_end_gap(vma); } void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct 
*vma) @@ -300,8 +302,14 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma) prev->vm_next = next; else mm->mmap = next; - if (next) + if (next) { next->vm_prev = prev; + } else { + if (prev) + mm->highest_vm_end = vm_end_gap(prev); + else + mm->highest_vm_end = 0; + } } /* Check if the vma is being used as a stack by this task */ -- cgit v1.2.3-70-g09d2 From de2b84d24b87172913754bca6db85d5c5998213b Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 Sep 2022 19:48:53 +0000 Subject: arm64: remove mmap linked list from vdso Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-31-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: SeongJae Park Cc: Sven Schnelle Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/arm64/kernel/vdso.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c index a61fc4f989b3..a8388af62b99 100644 --- a/arch/arm64/kernel/vdso.c +++ b/arch/arm64/kernel/vdso.c @@ -136,10 +136,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns) { struct mm_struct *mm = task->mm; struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); mmap_read_lock(mm); - for (vma = mm->mmap; vma; vma = vma->vm_next) { + for_each_vma(vmi, vma) { unsigned long size = vma->vm_end - vma->vm_start; if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm)) -- cgit v1.2.3-70-g09d2 From ef770d180ebae967b19a3964bc1cc026f3082f9a Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 6 Sep 2022 19:48:53 +0000 Subject: arm64: Change elfcore for_each_mte_vma() to use VMA iterator Rework for_each_mte_vma() to use a VMA iterator instead of an explicit linked-list. Link: https://lkml.kernel.org/r/20220906194824.2110408-32-Liam.Howlett@oracle.com Signed-off-by: Liam R. 
Howlett Acked-by: Catalin Marinas Link: https://lore.kernel.org/r/20220218023650.672072-1-Liam.Howlett@oracle.com Signed-off-by: Will Deacon Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: David Hildenbrand Cc: David Howells Cc: "Matthew Wilcox (Oracle)" Cc: SeongJae Park Cc: Sven Schnelle Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/arm64/kernel/elfcore.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'arch') diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c index 98d67444a5b6..27ef7ad3ffd2 100644 --- a/arch/arm64/kernel/elfcore.c +++ b/arch/arm64/kernel/elfcore.c @@ -8,9 +8,9 @@ #include #include -#define for_each_mte_vma(tsk, vma) \ +#define for_each_mte_vma(vmi, vma) \ if (system_supports_mte()) \ - for (vma = tsk->mm->mmap; vma; vma = vma->vm_next) \ + for_each_vma(vmi, vma) \ if (vma->vm_flags & VM_MTE) static unsigned long mte_vma_tag_dump_size(struct vm_area_struct *vma) @@ -81,8 +81,9 @@ Elf_Half elf_core_extra_phdrs(void) { struct vm_area_struct *vma; int vma_count = 0; + VMA_ITERATOR(vmi, current->mm, 0); - for_each_mte_vma(current, vma) + for_each_mte_vma(vmi, vma) vma_count++; return vma_count; @@ -91,8 +92,9 @@ Elf_Half elf_core_extra_phdrs(void) int elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset) { struct vm_area_struct *vma; + VMA_ITERATOR(vmi, current->mm, 0); - for_each_mte_vma(current, vma) { + for_each_mte_vma(vmi, vma) { struct elf_phdr phdr; phdr.p_type = PT_AARCH64_MEMTAG_MTE; @@ -116,8 +118,9 @@ size_t elf_core_extra_data_size(void) { struct vm_area_struct *vma; size_t data_size = 0; + VMA_ITERATOR(vmi, current->mm, 0); - for_each_mte_vma(current, vma) + for_each_mte_vma(vmi, vma) data_size += mte_vma_tag_dump_size(vma); return data_size; @@ -126,8 +129,9 @@ size_t elf_core_extra_data_size(void) int elf_core_write_extra_data(struct coredump_params *cprm) { struct vm_area_struct *vma; + VMA_ITERATOR(vmi, current->mm, 0); - for_each_mte_vma(current, vma) { + for_each_mte_vma(vmi, vma) { if (vma->vm_flags & VM_DONTDUMP) continue; -- cgit v1.2.3-70-g09d2 From 70fa203165d96ae03abb83cf60d30c44e6b81a12 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 Sep 2022 19:48:53 +0000 Subject: parisc: remove mmap linked list from cache handling Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-33-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. 
Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: SeongJae Park Cc: Sven Schnelle Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/parisc/kernel/cache.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c index 3feb7694e0ca..1d3b8bc8a623 100644 --- a/arch/parisc/kernel/cache.c +++ b/arch/parisc/kernel/cache.c @@ -657,15 +657,20 @@ static inline unsigned long mm_total_size(struct mm_struct *mm) { struct vm_area_struct *vma; unsigned long usize = 0; + VMA_ITERATOR(vmi, mm, 0); - for (vma = mm->mmap; vma && usize < parisc_cache_flush_threshold; vma = vma->vm_next) + for_each_vma(vmi, vma) { + if (usize >= parisc_cache_flush_threshold) + break; usize += vma->vm_end - vma->vm_start; + } return usize; } void flush_cache_mm(struct mm_struct *mm) { struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); /* * Flushing the whole cache on each cpu takes forever on @@ -685,7 +690,7 @@ void flush_cache_mm(struct mm_struct *mm) } /* Flush mm */ - for (vma = mm->mmap; vma; vma = vma->vm_next) + for_each_vma(vmi, vma) flush_cache_pages(vma, vma->vm_start, vma->vm_end); } -- cgit v1.2.3-70-g09d2 From 405e669172e20be9a42cecf8be0fbed089fab045 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 Sep 2022 19:48:53 +0000 Subject: powerpc: remove mmap linked list walks Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-34-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: SeongJae Park Cc: Sven Schnelle Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/powerpc/kernel/vdso.c | 6 +++--- arch/powerpc/mm/book3s32/tlb.c | 11 ++++++----- arch/powerpc/mm/book3s64/subpage_prot.c | 13 ++----------- 3 files changed, 11 insertions(+), 19 deletions(-) (limited to 'arch') diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index 0da287544054..94a8fa5017c3 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -113,18 +113,18 @@ struct vdso_data *arch_get_vdso_data(void *vvar_page) int vdso_join_timens(struct task_struct *task, struct time_namespace *ns) { struct mm_struct *mm = task->mm; + VMA_ITERATOR(vmi, mm, 0); struct vm_area_struct *vma; mmap_read_lock(mm); - - for (vma = mm->mmap; vma; vma = vma->vm_next) { + for_each_vma(vmi, vma) { unsigned long size = vma->vm_end - vma->vm_start; if (vma_is_special_mapping(vma, &vvar_spec)) zap_page_range(vma, vma->vm_start, size); } - mmap_read_unlock(mm); + return 0; } diff --git a/arch/powerpc/mm/book3s32/tlb.c b/arch/powerpc/mm/book3s32/tlb.c index 19f0ef950d77..9ad6b56bfec9 100644 --- a/arch/powerpc/mm/book3s32/tlb.c +++ b/arch/powerpc/mm/book3s32/tlb.c @@ -81,14 +81,15 @@ EXPORT_SYMBOL(hash__flush_range); void hash__flush_tlb_mm(struct mm_struct *mm) { struct vm_area_struct *mp; + VMA_ITERATOR(vmi, mm, 0); /* - * It is safe to go down the mm's list of vmas when called - * from dup_mmap, holding mmap_lock. It would also be safe from - * unmap_region or exit_mmap, but not from vmtruncate on SMP - - * but it seems dup_mmap is the only SMP case which gets here. + * It is safe to iterate the vmas when called from dup_mmap, + * holding mmap_lock. 
It would also be safe from unmap_region + * or exit_mmap, but not from vmtruncate on SMP - but it seems + * dup_mmap is the only SMP case which gets here. */ - for (mp = mm->mmap; mp != NULL; mp = mp->vm_next) + for_each_vma(vmi, mp) hash__flush_range(mp->vm_mm, mp->vm_start, mp->vm_end); } EXPORT_SYMBOL(hash__flush_tlb_mm); diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c index 60c6ea16a972..d73b3b4176e8 100644 --- a/arch/powerpc/mm/book3s64/subpage_prot.c +++ b/arch/powerpc/mm/book3s64/subpage_prot.c @@ -149,24 +149,15 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, unsigned long len) { struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, addr); /* * We don't try too hard, we just mark all the vma in that range * VM_NOHUGEPAGE and split them. */ - vma = find_vma(mm, addr); - /* - * If the range is in unmapped range, just return - */ - if (vma && ((addr + len) <= vma->vm_start)) - return; - - while (vma) { - if (vma->vm_start >= (addr + len)) - break; + for_each_vma_range(vmi, vma, addr + len) { vma->vm_flags |= VM_NOHUGEPAGE; walk_page_vma(vma, &subpage_walk_ops, NULL); - vma = vma->vm_next; } } #else -- cgit v1.2.3-70-g09d2 From e7b6b990e524f60994da70cf5a22159b1e88ce57 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 Sep 2022 19:48:54 +0000 Subject: s390: remove vma linked list walks Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-35-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: SeongJae Park Cc: Sven Schnelle Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/s390/kernel/vdso.c | 3 ++- arch/s390/mm/gmap.c | 6 ++++-- 2 files changed, 6 insertions(+), 3 deletions(-) (limited to 'arch') diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c index 5075cde77b29..535099f2736d 100644 --- a/arch/s390/kernel/vdso.c +++ b/arch/s390/kernel/vdso.c @@ -69,10 +69,11 @@ static struct page *find_timens_vvar_page(struct vm_area_struct *vma) int vdso_join_timens(struct task_struct *task, struct time_namespace *ns) { struct mm_struct *mm = task->mm; + VMA_ITERATOR(vmi, mm, 0); struct vm_area_struct *vma; mmap_read_lock(mm); - for (vma = mm->mmap; vma; vma = vma->vm_next) { + for_each_vma(vmi, vma) { unsigned long size = vma->vm_end - vma->vm_start; if (!vma_is_special_mapping(vma, &vvar_mapping)) diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c index 62758cb5872f..02d15c8dc92e 100644 --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -2515,8 +2515,9 @@ static const struct mm_walk_ops thp_split_walk_ops = { static inline void thp_split_mm(struct mm_struct *mm) { struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); - for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) { + for_each_vma(vmi, vma) { vma->vm_flags &= ~VM_HUGEPAGE; vma->vm_flags |= VM_NOHUGEPAGE; walk_page_vma(vma, &thp_split_walk_ops, NULL); @@ -2584,8 +2585,9 @@ int gmap_mark_unmergeable(void) struct mm_struct *mm = current->mm; struct vm_area_struct *vma; int ret; + VMA_ITERATOR(vmi, mm, 0); - for (vma = mm->mmap; vma; vma = vma->vm_next) { + for_each_vma(vmi, vma) { ret = ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_UNMERGEABLE, &vma->vm_flags); if (ret) -- cgit v1.2.3-70-g09d2 From a3884621163b7d7fab89b44461b2a48a29c5cc9a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 
Sep 2022 19:48:54 +0000 Subject: x86: remove vma linked list walks Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-36-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: SeongJae Park Cc: Sven Schnelle Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/entry/vdso/vma.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'arch') diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 1000d457c332..6292b960037b 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -127,17 +127,17 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns) { struct mm_struct *mm = task->mm; struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); mmap_read_lock(mm); - - for (vma = mm->mmap; vma; vma = vma->vm_next) { + for_each_vma(vmi, vma) { unsigned long size = vma->vm_end - vma->vm_start; if (vma_is_special_mapping(vma, &vvar_mapping)) zap_page_range(vma, vma->vm_start, size); } - mmap_read_unlock(mm); + return 0; } #else @@ -354,6 +354,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); mmap_write_lock(mm); /* @@ -363,7 +364,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr) * We could search vma near context.vdso, but it's a slowpath, * so let's explicitly check all VMAs to be completely sure. */ - for (vma = mm->mmap; vma; vma = vma->vm_next) { + for_each_vma(vmi, vma) { if (vma_is_special_mapping(vma, &vdso_mapping) || vma_is_special_mapping(vma, &vvar_mapping)) { mmap_write_unlock(mm); -- cgit v1.2.3-70-g09d2 From 49c40fb4b826c90036f04abf583bb4cb5ba3d203 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 Sep 2022 19:48:55 +0000 Subject: xtensa: remove vma linked list walks Use the VMA iterator instead. Since VMA can no longer be NULL in the loop, then deal with out-of-memory outside the loop. This means a slightly longer run time in the failure case (-ENOMEM) - it will run to the end of the VMAs before erroring instead of in the middle of the loop. Link: https://lkml.kernel.org/r/20220906194824.2110408-37-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: SeongJae Park Cc: Sven Schnelle Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/xtensa/kernel/syscall.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) (limited to 'arch') diff --git a/arch/xtensa/kernel/syscall.c b/arch/xtensa/kernel/syscall.c index 201356faa7e6..b3c2450d6f23 100644 --- a/arch/xtensa/kernel/syscall.c +++ b/arch/xtensa/kernel/syscall.c @@ -58,6 +58,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { struct vm_area_struct *vmm; + struct vma_iterator vmi; if (flags & MAP_FIXED) { /* We do not accept a shared mapping if it would violate @@ -79,15 +80,20 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, else addr = PAGE_ALIGN(addr); - for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) { - /* At this point: (!vmm || addr < vmm->vm_end). 
*/ - if (TASK_SIZE - len < addr) - return -ENOMEM; - if (!vmm || addr + len <= vm_start_gap(vmm)) - return addr; + vma_iter_init(&vmi, current->mm, addr); + for_each_vma(vmi, vmm) { + /* At this point: (addr < vmm->vm_end). */ + if (addr + len <= vm_start_gap(vmm)) + break; + addr = vmm->vm_end; if (flags & MAP_SHARED) addr = COLOUR_ALIGN(addr, pgoff); } + + if (TASK_SIZE - len < addr) + return -ENOMEM; + + return addr; } #endif -- cgit v1.2.3-70-g09d2 From cbd43755ad15687cf8c925793a0b6c60c6181615 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 6 Sep 2022 19:48:56 +0000 Subject: um: remove vma linked list walk Use the VMA iterator instead. Link: https://lkml.kernel.org/r/20220906194824.2110408-40-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: SeongJae Park Cc: Sven Schnelle Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/um/kernel/tlb.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) (limited to 'arch') diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c index bc38f79ca3a3..ad449173a1a1 100644 --- a/arch/um/kernel/tlb.c +++ b/arch/um/kernel/tlb.c @@ -584,21 +584,19 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, void flush_tlb_mm(struct mm_struct *mm) { - struct vm_area_struct *vma = mm->mmap; + struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); - while (vma != NULL) { + for_each_vma(vmi, vma) fix_range(mm, vma->vm_start, vma->vm_end, 0); - vma = vma->vm_next; - } } void force_flush_all(void) { struct mm_struct *mm = current->mm; - struct vm_area_struct *vma = mm->mmap; + struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); - while (vma != NULL) { + for_each_vma(vmi, vma) fix_range(mm, vma->vm_start, vma->vm_end, 1); - vma = vma->vm_next; - } } -- cgit v1.2.3-70-g09d2 From 9b580a1d60de0e1c8957e63859aa989196952e63 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 6 Sep 2022 19:49:05 +0000 Subject: riscv: use vma iterator for vdso Remove the linked list use in favour of the vma iterator. Link: https://lkml.kernel.org/r/20220906194824.2110408-67-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Davidlohr Bueso Tested-by: Yu Zhao Cc: Catalin Marinas Cc: David Hildenbrand Cc: David Howells Cc: "Matthew Wilcox (Oracle)" Cc: SeongJae Park Cc: Sven Schnelle Cc: Vlastimil Babka Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/riscv/kernel/vdso.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c index 69b05b6c181b..692e7ae3dcb8 100644 --- a/arch/riscv/kernel/vdso.c +++ b/arch/riscv/kernel/vdso.c @@ -114,11 +114,12 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns) { struct mm_struct *mm = task->mm; struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); struct __vdso_info *vdso_info = mm->context.vdso_info; mmap_read_lock(mm); - for (vma = mm->mmap; vma; vma = vma->vm_next) { + for_each_vma(vmi, vma) { unsigned long size = vma->vm_end - vma->vm_start; if (vma_is_special_mapping(vma, vdso_info->dm)) -- cgit v1.2.3-70-g09d2 From e41e614f6a3e3d0d21874a785d3a67d353e282da Mon Sep 17 00:00:00 2001 From: Dmitry Vyukov Date: Thu, 15 Sep 2022 17:03:35 +0200 Subject: x86: add missing include to sparsemem.h Patch series "Add KernelMemorySanitizer infrastructure", v7. 
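The per-arch conversions above (arm64, parisc, powerpc, s390, x86, xtensa, um, riscv) all land on the same iteration idiom. As a condensed, purely illustrative sketch of that idiom (this helper is not part of any patch in the series; it only assumes the VMA_ITERATOR()/for_each_vma() API used in the hunks above):

	/* Illustrative only: walk every VMA of an mm with the maple-tree iterator. */
	static unsigned long count_locked_vmas(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;
		unsigned long nr = 0;
		VMA_ITERATOR(vmi, mm, 0);		/* iterate from address 0 */

		mmap_read_lock(mm);
		for_each_vma(vmi, vma)			/* replaces: for (vma = mm->mmap; vma; vma = vma->vm_next) */
			if (vma->vm_flags & VM_LOCKED)
				nr++;
		mmap_read_unlock(mm);

		return nr;
	}

The iterator keeps its own position in the maple tree, so callers no longer depend on vma->vm_next being maintained.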
KernelMemorySanitizer (KMSAN) is a detector of errors related to uses of uninitialized memory. It relies on compile-time Clang instrumentation (similar to MSan in the userspace [1]) and tracks the state of every bit of kernel memory, being able to report an error if uninitialized value is used in a condition, dereferenced, or escapes to userspace, USB or DMA. KMSAN has reported more than 300 bugs in the past few years (recently fixed bugs: [2]), most of them with the help of syzkaller. Such bugs keep getting introduced into the kernel despite new compiler warnings and other analyses (the 6.0 cycle already resulted in several KMSAN-reported bugs, e.g. [3]). Mitigations like total stack and heap initialization are unfortunately very far from being deployable. The proposed patchset contains KMSAN runtime implementation together with small changes to other subsystems needed to make KMSAN work. The latter changes fall into several categories: 1. Changes and refactorings of existing code required to add KMSAN: - [01/43] x86: add missing include to sparsemem.h - [02/43] stackdepot: reserve 5 extra bits in depot_stack_handle_t - [03/43] instrumented.h: allow instrumenting both sides of copy_from_user() - [04/43] x86: asm: instrument usercopy in get_user() and __put_user_size() - [05/43] asm-generic: instrument usercopy in cacheflush.h - [10/43] libnvdimm/pfn_dev: increase MAX_STRUCT_PAGE_SIZE 2. KMSAN-related declarations in generic code, KMSAN runtime library, docs and configs: - [06/43] kmsan: add ReST documentation - [07/43] kmsan: introduce __no_sanitize_memory and __no_kmsan_checks - [09/43] x86: kmsan: pgtable: reduce vmalloc space - [11/43] kmsan: add KMSAN runtime core - [13/43] MAINTAINERS: add entry for KMSAN - [24/43] kmsan: add tests for KMSAN - [31/43] objtool: kmsan: list KMSAN API functions as uaccess-safe - [35/43] x86: kmsan: use __msan_ string functions where possible - [43/43] x86: kmsan: enable KMSAN builds for x86 3. Adding hooks from different subsystems to notify KMSAN about memory state changes: - [14/43] mm: kmsan: maintain KMSAN metadata for page - [15/43] mm: kmsan: call KMSAN hooks from SLUB code - [16/43] kmsan: handle task creation and exiting - [17/43] init: kmsan: call KMSAN initialization routines - [18/43] instrumented.h: add KMSAN support - [19/43] kmsan: add iomap support - [20/43] Input: libps2: mark data received in __ps2_command() as initialized - [21/43] dma: kmsan: unpoison DMA mappings - [34/43] x86: kmsan: handle open-coded assembly in lib/iomem.c - [36/43] x86: kmsan: sync metadata pages on page fault 4. 
Changes that prevent false reports by explicitly initializing memory, disabling optimized code that may trick KMSAN, selectively skipping instrumentation: - [08/43] kmsan: mark noinstr as __no_sanitize_memory - [12/43] kmsan: disable instrumentation of unsupported common kernel code - [22/43] virtio: kmsan: check/unpoison scatterlist in vring_map_one_sg() - [23/43] kmsan: handle memory sent to/from USB - [25/43] kmsan: disable strscpy() optimization under KMSAN - [26/43] crypto: kmsan: disable accelerated configs under KMSAN - [27/43] kmsan: disable physical page merging in biovec - [28/43] block: kmsan: skip bio block merging logic for KMSAN - [29/43] kcov: kmsan: unpoison area->list in kcov_remote_area_put() - [30/43] security: kmsan: fix interoperability with auto-initialization - [32/43] x86: kmsan: disable instrumentation of unsupported code - [33/43] x86: kmsan: skip shadow checks in __switch_to() - [37/43] x86: kasan: kmsan: support CONFIG_GENERIC_CSUM on x86, enable it for KASAN/KMSAN - [38/43] x86: fs: kmsan: disable CONFIG_DCACHE_WORD_ACCESS - [39/43] x86: kmsan: don't instrument stack walking functions - [40/43] entry: kmsan: introduce kmsan_unpoison_entry_regs() 5. Fixes for bugs detected with CONFIG_KMSAN_CHECK_PARAM_RETVAL: - [41/43] bpf: kmsan: initialize BPF registers with zeroes - [42/43] mm: fs: initialize fsdata passed to write_begin/write_end interface This patchset allows one to boot and run a defconfig+KMSAN kernel on a QEMU without known false positives. It however doesn't guarantee there are no false positives in drivers of certain devices or less tested subsystems, although KMSAN is actively tested on syzbot with a large config. By default, KMSAN enforces conservative checks of most kernel function parameters passed by value (via CONFIG_KMSAN_CHECK_PARAM_RETVAL, which maps to the -fsanitize-memory-param-retval compiler flag). As discussed in [4] and [5], passing uninitialized values as function parameters is considered undefined behavior, therefore KMSAN now reports such cases as errors. Several newly added patches fix known manifestations of these errors. This patch (of 43): Including sparsemem.h from other files (e.g. transitively via asm/pgtable_64_types.h) results in compilation errors due to unknown types: sparsemem.h:34:32: error: unknown type name 'phys_addr_t' extern int phys_to_target_node(phys_addr_t start); ^ sparsemem.h:36:39: error: unknown type name 'u64' extern int memory_add_physaddr_to_nid(u64 start); ^ Fix these errors by including linux/types.h from sparsemem.h This is required for the upcoming KMSAN patches. Link: https://lkml.kernel.org/r/20220915150417.722975-1-glider@google.com Link: https://lkml.kernel.org/r/20220915150417.722975-2-glider@google.com Signed-off-by: Dmitry Vyukov Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. 
Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Cc: Andrey Konovalov Cc: Eric Biggers Signed-off-by: Andrew Morton --- arch/x86/include/asm/sparsemem.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'arch') diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h index 6a9ccc1b2be5..64df897c0ee3 100644 --- a/arch/x86/include/asm/sparsemem.h +++ b/arch/x86/include/asm/sparsemem.h @@ -2,6 +2,8 @@ #ifndef _ASM_X86_SPARSEMEM_H #define _ASM_X86_SPARSEMEM_H +#include + #ifdef CONFIG_SPARSEMEM /* * generic non-linear memory support: -- cgit v1.2.3-70-g09d2 From 33b75c1d884e81ec97525e0a6fdcb187adf273f4 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:03:37 +0200 Subject: instrumented.h: allow instrumenting both sides of copy_from_user() Introduce instrument_copy_from_user_before() and instrument_copy_from_user_after() hooks to be invoked before and after the call to copy_from_user(). KASAN and KCSAN will be only using instrument_copy_from_user_before(), but for KMSAN we'll need to insert code after copy_from_user(). Link: https://lkml.kernel.org/r/20220915150417.722975-4-glider@google.com Signed-off-by: Alexander Potapenko Reviewed-by: Marco Elver Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/s390/lib/uaccess.c | 3 ++- include/linux/instrumented.h | 21 +++++++++++++++++++-- include/linux/uaccess.h | 19 ++++++++++++++----- lib/iov_iter.c | 9 ++++++--- lib/usercopy.c | 3 ++- 5 files changed, 43 insertions(+), 12 deletions(-) (limited to 'arch') diff --git a/arch/s390/lib/uaccess.c b/arch/s390/lib/uaccess.c index d7b3b193d108..58033dfcb6d4 100644 --- a/arch/s390/lib/uaccess.c +++ b/arch/s390/lib/uaccess.c @@ -81,8 +81,9 @@ unsigned long _copy_from_user_key(void *to, const void __user *from, might_fault(); if (!should_fail_usercopy()) { - instrument_copy_from_user(to, from, n); + instrument_copy_from_user_before(to, from, n); res = raw_copy_from_user_key(to, from, n, key); + instrument_copy_from_user_after(to, from, n, res); } if (unlikely(res)) memset(to + (n - res), 0, res); diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h index 42faebbaa202..ee8f7d17d34f 100644 --- a/include/linux/instrumented.h +++ b/include/linux/instrumented.h @@ -120,7 +120,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n) } /** - * instrument_copy_from_user - instrument writes of copy_from_user + * instrument_copy_from_user_before - add instrumentation before copy_from_user * * Instrument writes to kernel memory, that are due to copy_from_user (and * variants). The instrumentation should be inserted before the accesses. 
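Callers are thus expected to bracket the raw copy with the two hooks and hand the number of uncopied bytes to the _after() variant, exactly as the usercopy hunks below do. A minimal out-of-tree sketch of that calling pattern (the wrapper name is invented for illustration; it assumes <linux/uaccess.h> and <linux/instrumented.h>):

	/* Sketch only: the before/after hook pair around a raw user copy. */
	static unsigned long copy_from_user_checked(void *to, const void __user *from,
						    unsigned long n)
	{
		unsigned long left;

		instrument_copy_from_user_before(to, from, n);	/* KASAN/KCSAN check the destination */
		left = raw_copy_from_user(to, from, n);
		instrument_copy_from_user_after(to, from, n, left); /* hook for tools (KMSAN) that must run after the copy */

		return left;	/* bytes that could not be copied */
	}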
@@ -130,10 +130,27 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n) * @n number of bytes to copy */ static __always_inline void -instrument_copy_from_user(const void *to, const void __user *from, unsigned long n) +instrument_copy_from_user_before(const void *to, const void __user *from, unsigned long n) { kasan_check_write(to, n); kcsan_check_write(to, n); } +/** + * instrument_copy_from_user_after - add instrumentation after copy_from_user + * + * Instrument writes to kernel memory, that are due to copy_from_user (and + * variants). The instrumentation should be inserted after the accesses. + * + * @to destination address + * @from source address + * @n number of bytes to copy + * @left number of bytes not copied (as returned by copy_from_user) + */ +static __always_inline void +instrument_copy_from_user_after(const void *to, const void __user *from, + unsigned long n, unsigned long left) +{ +} + #endif /* _LINUX_INSTRUMENTED_H */ diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 47e5d374c7eb..afb18f198843 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -58,20 +58,28 @@ static __always_inline __must_check unsigned long __copy_from_user_inatomic(void *to, const void __user *from, unsigned long n) { - instrument_copy_from_user(to, from, n); + unsigned long res; + + instrument_copy_from_user_before(to, from, n); check_object_size(to, n, false); - return raw_copy_from_user(to, from, n); + res = raw_copy_from_user(to, from, n); + instrument_copy_from_user_after(to, from, n, res); + return res; } static __always_inline __must_check unsigned long __copy_from_user(void *to, const void __user *from, unsigned long n) { + unsigned long res; + might_fault(); + instrument_copy_from_user_before(to, from, n); if (should_fail_usercopy()) return n; - instrument_copy_from_user(to, from, n); check_object_size(to, n, false); - return raw_copy_from_user(to, from, n); + res = raw_copy_from_user(to, from, n); + instrument_copy_from_user_after(to, from, n, res); + return res; } /** @@ -115,8 +123,9 @@ _copy_from_user(void *to, const void __user *from, unsigned long n) unsigned long res = n; might_fault(); if (!should_fail_usercopy() && likely(access_ok(from, n))) { - instrument_copy_from_user(to, from, n); + instrument_copy_from_user_before(to, from, n); res = raw_copy_from_user(to, from, n); + instrument_copy_from_user_after(to, from, n, res); } if (unlikely(res)) memset(to + (n - res), 0, res); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 4b7fce72e3e5..c3ca28ca68a6 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -174,13 +174,16 @@ static int copyout(void __user *to, const void *from, size_t n) static int copyin(void *to, const void __user *from, size_t n) { + size_t res = n; + if (should_fail_usercopy()) return n; if (access_ok(from, n)) { - instrument_copy_from_user(to, from, n); - n = raw_copy_from_user(to, from, n); + instrument_copy_from_user_before(to, from, n); + res = raw_copy_from_user(to, from, n); + instrument_copy_from_user_after(to, from, n, res); } - return n; + return res; } static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe, diff --git a/lib/usercopy.c b/lib/usercopy.c index 7413dd300516..1505a52f23a0 100644 --- a/lib/usercopy.c +++ b/lib/usercopy.c @@ -12,8 +12,9 @@ unsigned long _copy_from_user(void *to, const void __user *from, unsigned long n unsigned long res = n; might_fault(); if (!should_fail_usercopy() && likely(access_ok(from, n))) { - instrument_copy_from_user(to, from, n); + 
instrument_copy_from_user_before(to, from, n); res = raw_copy_from_user(to, from, n); + instrument_copy_from_user_after(to, from, n, res); } if (unlikely(res)) memset(to + (n - res), 0, res); -- cgit v1.2.3-70-g09d2 From 888f84a6da4d17e453058169fa7b235fff34f5bf Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:03:38 +0200 Subject: x86: asm: instrument usercopy in get_user() and put_user() Use hooks from instrumented.h to notify bug detection tools about usercopy events in variations of get_user() and put_user(). Link: https://lkml.kernel.org/r/20220915150417.722975-5-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/include/asm/uaccess.h | 22 +++++++++++++++------- include/linux/instrumented.h | 28 ++++++++++++++++++++++++++++ 2 files changed, 43 insertions(+), 7 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 913e593a3b45..c1b8982899ec 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -5,6 +5,7 @@ * User space memory access functions */ #include +#include #include #include #include @@ -103,6 +104,7 @@ extern int __get_user_bad(void); : "=a" (__ret_gu), "=r" (__val_gu), \ ASM_CALL_CONSTRAINT \ : "0" (ptr), "i" (sizeof(*(ptr)))); \ + instrument_get_user(__val_gu); \ (x) = (__force __typeof__(*(ptr))) __val_gu; \ __builtin_expect(__ret_gu, 0); \ }) @@ -192,9 +194,11 @@ extern void __put_user_nocheck_8(void); int __ret_pu; \ void __user *__ptr_pu; \ register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX); \ - __chk_user_ptr(ptr); \ - __ptr_pu = (ptr); \ - __val_pu = (x); \ + __typeof__(*(ptr)) __x = (x); /* eval x once */ \ + __typeof__(ptr) __ptr = (ptr); /* eval ptr once */ \ + __chk_user_ptr(__ptr); \ + __ptr_pu = __ptr; \ + __val_pu = __x; \ asm volatile("call __" #fn "_%P[size]" \ : "=c" (__ret_pu), \ ASM_CALL_CONSTRAINT \ @@ -202,6 +206,7 @@ extern void __put_user_nocheck_8(void); "r" (__val_pu), \ [size] "i" (sizeof(*(ptr))) \ :"ebx"); \ + instrument_put_user(__x, __ptr, sizeof(*(ptr))); \ __builtin_expect(__ret_pu, 0); \ }) @@ -248,23 +253,25 @@ extern void __put_user_nocheck_8(void); #define __put_user_size(x, ptr, size, label) \ do { \ + __typeof__(*(ptr)) __x = (x); /* eval x once */ \ __chk_user_ptr(ptr); \ switch (size) { \ case 1: \ - __put_user_goto(x, ptr, "b", "iq", label); \ + __put_user_goto(__x, ptr, "b", "iq", label); \ break; \ case 2: \ - __put_user_goto(x, ptr, "w", "ir", label); \ + __put_user_goto(__x, ptr, "w", "ir", label); \ break; \ case 4: \ - __put_user_goto(x, ptr, "l", "ir", label); \ + __put_user_goto(__x, ptr, "l", "ir", label); \ break; \ case 8: \ - __put_user_goto_u64(x, ptr, label); \ + __put_user_goto_u64(__x, ptr, label); \ break; \ default: \ __put_user_bad(); \ } \ + instrument_put_user(__x, ptr, size); \ } while (0) #ifdef CONFIG_CC_HAS_ASM_GOTO_OUTPUT @@ 
-305,6 +312,7 @@ do { \ default: \ (x) = __get_user_bad(); \ } \ + instrument_get_user(x); \ } while (0) #define __get_user_asm(x, addr, itype, ltype, label) \ diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h index ee8f7d17d34f..9f1dba8f717b 100644 --- a/include/linux/instrumented.h +++ b/include/linux/instrumented.h @@ -153,4 +153,32 @@ instrument_copy_from_user_after(const void *to, const void __user *from, { } +/** + * instrument_get_user() - add instrumentation to get_user()-like macros + * + * get_user() and friends are fragile, so it may depend on the implementation + * whether the instrumentation happens before or after the data is copied from + * the userspace. + * + * @to destination variable, may not be address-taken + */ +#define instrument_get_user(to) \ +({ \ +}) + +/** + * instrument_put_user() - add instrumentation to put_user()-like macros + * + * put_user() and friends are fragile, so it may depend on the implementation + * whether the instrumentation happens before or after the data is copied from + * the userspace. + * + * @from source address + * @ptr userspace pointer to copy to + * @size number of bytes to copy + */ +#define instrument_put_user(from, ptr, size) \ +({ \ +}) + #endif /* _LINUX_INSTRUMENTED_H */ -- cgit v1.2.3-70-g09d2 From 1a167ddd3c561b21a76187a81530a167e3522261 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:03:43 +0200 Subject: x86: kmsan: pgtable: reduce vmalloc space KMSAN is going to use 3/4 of existing vmalloc space to hold the metadata, therefore we lower VMALLOC_END to make sure vmalloc() doesn't allocate past the first 1/4. Link: https://lkml.kernel.org/r/20220915150417.722975-10-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/include/asm/pgtable_64_types.h | 47 ++++++++++++++++++++++++++++++++- arch/x86/mm/init_64.c | 2 +- 2 files changed, 47 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index 70e360a2e5fb..04f36063ad54 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -139,7 +139,52 @@ extern unsigned int ptrs_per_p4d; # define VMEMMAP_START __VMEMMAP_BASE_L4 #endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */ -#define VMALLOC_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1) +/* + * End of the region for which vmalloc page tables are pre-allocated. + * For non-KMSAN builds, this is the same as VMALLOC_END. + * For KMSAN builds, VMALLOC_START..VMEMORY_END is 4 times bigger than + * VMALLOC_START..VMALLOC_END (see below). 
+ */ +#define VMEMORY_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1) + +#ifndef CONFIG_KMSAN +#define VMALLOC_END VMEMORY_END +#else +/* + * In KMSAN builds vmalloc area is four times smaller, and the remaining 3/4 + * are used to keep the metadata for virtual pages. The memory formerly + * belonging to vmalloc area is now laid out as follows: + * + * 1st quarter: VMALLOC_START to VMALLOC_END - new vmalloc area + * 2nd quarter: KMSAN_VMALLOC_SHADOW_START to + * VMALLOC_END+KMSAN_VMALLOC_SHADOW_OFFSET - vmalloc area shadow + * 3rd quarter: KMSAN_VMALLOC_ORIGIN_START to + * VMALLOC_END+KMSAN_VMALLOC_ORIGIN_OFFSET - vmalloc area origins + * 4th quarter: KMSAN_MODULES_SHADOW_START to KMSAN_MODULES_ORIGIN_START + * - shadow for modules, + * KMSAN_MODULES_ORIGIN_START to + * KMSAN_MODULES_ORIGIN_START + MODULES_LEN - origins for modules. + */ +#define VMALLOC_QUARTER_SIZE ((VMALLOC_SIZE_TB << 40) >> 2) +#define VMALLOC_END (VMALLOC_START + VMALLOC_QUARTER_SIZE - 1) + +/* + * vmalloc metadata addresses are calculated by adding shadow/origin offsets + * to vmalloc address. + */ +#define KMSAN_VMALLOC_SHADOW_OFFSET VMALLOC_QUARTER_SIZE +#define KMSAN_VMALLOC_ORIGIN_OFFSET (VMALLOC_QUARTER_SIZE << 1) + +#define KMSAN_VMALLOC_SHADOW_START (VMALLOC_START + KMSAN_VMALLOC_SHADOW_OFFSET) +#define KMSAN_VMALLOC_ORIGIN_START (VMALLOC_START + KMSAN_VMALLOC_ORIGIN_OFFSET) + +/* + * The shadow/origin for modules are placed one by one in the last 1/4 of + * vmalloc space. + */ +#define KMSAN_MODULES_SHADOW_START (VMALLOC_END + KMSAN_VMALLOC_ORIGIN_OFFSET + 1) +#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN) +#endif /* CONFIG_KMSAN */ #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE) /* The module sections ends with the start of the fixmap */ diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 0fe690ebc269..39b6bfcaa0ed 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1287,7 +1287,7 @@ static void __init preallocate_vmalloc_pages(void) unsigned long addr; const char *lvl; - for (addr = VMALLOC_START; addr <= VMALLOC_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) { + for (addr = VMALLOC_START; addr <= VMEMORY_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) { pgd_t *pgd = pgd_offset_k(addr); p4d_t *p4d; pud_t *pud; -- cgit v1.2.3-70-g09d2 From b073d7f8aee4ebf05d10e3380df377b73120cf16 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:03:48 +0200 Subject: mm: kmsan: maintain KMSAN metadata for page operations Insert KMSAN hooks that make the necessary bookkeeping changes: - poison page shadow and origins in alloc_pages()/free_page(); - clear page shadow and origins in clear_page(), copy_user_highpage(); - copy page metadata in copy_highpage(), wp_page_copy(); - handle vmap()/vunmap()/iounmap(); Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. 
Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/include/asm/page_64.h | 7 ++ arch/x86/mm/ioremap.c | 3 + include/linux/highmem.h | 3 + include/linux/kmsan.h | 145 +++++++++++++++++++++++++++++++++++++++++ mm/internal.h | 6 ++ mm/kmsan/hooks.c | 86 ++++++++++++++++++++++++ mm/kmsan/shadow.c | 113 ++++++++++++++++++++++++++++++++ mm/memory.c | 2 + mm/page_alloc.c | 11 ++++ mm/vmalloc.c | 20 +++++- 10 files changed, 394 insertions(+), 2 deletions(-) create mode 100644 include/linux/kmsan.h (limited to 'arch') diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h index baa70451b8df..198e03e59ca1 100644 --- a/arch/x86/include/asm/page_64.h +++ b/arch/x86/include/asm/page_64.h @@ -8,6 +8,8 @@ #include #include +#include + /* duplicated to the one in bootmem.h */ extern unsigned long max_pfn; extern unsigned long phys_base; @@ -47,6 +49,11 @@ void clear_page_erms(void *page); static inline void clear_page(void *page) { + /* + * Clean up KMSAN metadata for the page being cleared. The assembly call + * below clobbers @page, so we perform unpoisoning before it. + */ + kmsan_unpoison_memory(page, PAGE_SIZE); alternative_call_2(clear_page_orig, clear_page_rep, X86_FEATURE_REP_GOOD, clear_page_erms, X86_FEATURE_ERMS, diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 1ad0228f8ceb..78c5bc654cff 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include @@ -479,6 +480,8 @@ void iounmap(volatile void __iomem *addr) return; } + kmsan_iounmap_page_range((unsigned long)addr, + (unsigned long)addr + get_vm_area_size(p)); memtype_free(p->phys_addr, p->phys_addr + get_vm_area_size(p)); /* Finally remove it */ diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 25679035ca28..e9912da5441b 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -311,6 +312,7 @@ static inline void copy_user_highpage(struct page *to, struct page *from, vfrom = kmap_local_page(from); vto = kmap_local_page(to); copy_user_page(vto, vfrom, vaddr, to); + kmsan_unpoison_memory(page_address(to), PAGE_SIZE); kunmap_local(vto); kunmap_local(vfrom); } @@ -326,6 +328,7 @@ static inline void copy_highpage(struct page *to, struct page *from) vfrom = kmap_local_page(from); vto = kmap_local_page(to); copy_page(vto, vfrom); + kmsan_copy_page_meta(to, from); kunmap_local(vto); kunmap_local(vfrom); } diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h new file mode 100644 index 000000000000..b36bf3db835e --- /dev/null +++ b/include/linux/kmsan.h @@ -0,0 +1,145 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * KMSAN API for subsystems. + * + * Copyright (C) 2017-2022 Google LLC + * Author: Alexander Potapenko + * + */ +#ifndef _LINUX_KMSAN_H +#define _LINUX_KMSAN_H + +#include +#include +#include + +struct page; + +#ifdef CONFIG_KMSAN + +/** + * kmsan_alloc_page() - Notify KMSAN about an alloc_pages() call. + * @page: struct page pointer returned by alloc_pages(). + * @order: order of allocated struct page. + * @flags: GFP flags used by alloc_pages() + * + * KMSAN marks 1<<@order pages starting at @page as uninitialized, unless + * @flags contain __GFP_ZERO. 
+ */ +void kmsan_alloc_page(struct page *page, unsigned int order, gfp_t flags); + +/** + * kmsan_free_page() - Notify KMSAN about a free_pages() call. + * @page: struct page pointer passed to free_pages(). + * @order: order of deallocated struct page. + * + * KMSAN marks freed memory as uninitialized. + */ +void kmsan_free_page(struct page *page, unsigned int order); + +/** + * kmsan_copy_page_meta() - Copy KMSAN metadata between two pages. + * @dst: destination page. + * @src: source page. + * + * KMSAN copies the contents of metadata pages for @src into the metadata pages + * for @dst. If @dst has no associated metadata pages, nothing happens. + * If @src has no associated metadata pages, @dst metadata pages are unpoisoned. + */ +void kmsan_copy_page_meta(struct page *dst, struct page *src); + +/** + * kmsan_map_kernel_range_noflush() - Notify KMSAN about a vmap. + * @start: start of vmapped range. + * @end: end of vmapped range. + * @prot: page protection flags used for vmap. + * @pages: array of pages. + * @page_shift: page_shift passed to vmap_range_noflush(). + * + * KMSAN maps shadow and origin pages of @pages into contiguous ranges in + * vmalloc metadata address range. + */ +void kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end, + pgprot_t prot, struct page **pages, + unsigned int page_shift); + +/** + * kmsan_vunmap_kernel_range_noflush() - Notify KMSAN about a vunmap. + * @start: start of vunmapped range. + * @end: end of vunmapped range. + * + * KMSAN unmaps the contiguous metadata ranges created by + * kmsan_map_kernel_range_noflush(). + */ +void kmsan_vunmap_range_noflush(unsigned long start, unsigned long end); + +/** + * kmsan_ioremap_page_range() - Notify KMSAN about a ioremap_page_range() call. + * @addr: range start. + * @end: range end. + * @phys_addr: physical range start. + * @prot: page protection flags used for ioremap_page_range(). + * @page_shift: page_shift argument passed to vmap_range_noflush(). + * + * KMSAN creates new metadata pages for the physical pages mapped into the + * virtual memory. + */ +void kmsan_ioremap_page_range(unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, + unsigned int page_shift); + +/** + * kmsan_iounmap_page_range() - Notify KMSAN about a iounmap_page_range() call. + * @start: range start. + * @end: range end. + * + * KMSAN unmaps the metadata pages for the given range and, unlike for + * vunmap_page_range(), also deallocates them. 
+ */ +void kmsan_iounmap_page_range(unsigned long start, unsigned long end); + +#else + +static inline int kmsan_alloc_page(struct page *page, unsigned int order, + gfp_t flags) +{ + return 0; +} + +static inline void kmsan_free_page(struct page *page, unsigned int order) +{ +} + +static inline void kmsan_copy_page_meta(struct page *dst, struct page *src) +{ +} + +static inline void kmsan_vmap_pages_range_noflush(unsigned long start, + unsigned long end, + pgprot_t prot, + struct page **pages, + unsigned int page_shift) +{ +} + +static inline void kmsan_vunmap_range_noflush(unsigned long start, + unsigned long end) +{ +} + +static inline void kmsan_ioremap_page_range(unsigned long start, + unsigned long end, + phys_addr_t phys_addr, + pgprot_t prot, + unsigned int page_shift) +{ +} + +static inline void kmsan_iounmap_page_range(unsigned long start, + unsigned long end) +{ +} + +#endif + +#endif /* _LINUX_KMSAN_H */ diff --git a/mm/internal.h b/mm/internal.h index e497ab14c984..fea3cba15484 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -818,8 +818,14 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end, } #endif +int __vmap_pages_range_noflush(unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, + unsigned int page_shift); + void vunmap_range_noflush(unsigned long start, unsigned long end); +void __vunmap_range_noflush(unsigned long start, unsigned long end); + int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, unsigned long addr, int page_nid, int *flags); diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 4ac62fa67a02..040111bb9f6a 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -26,6 +27,91 @@ * skipping effects of functions like memset() inside instrumented code. */ +static unsigned long vmalloc_shadow(unsigned long addr) +{ + return (unsigned long)kmsan_get_metadata((void *)addr, + KMSAN_META_SHADOW); +} + +static unsigned long vmalloc_origin(unsigned long addr) +{ + return (unsigned long)kmsan_get_metadata((void *)addr, + KMSAN_META_ORIGIN); +} + +void kmsan_vunmap_range_noflush(unsigned long start, unsigned long end) +{ + __vunmap_range_noflush(vmalloc_shadow(start), vmalloc_shadow(end)); + __vunmap_range_noflush(vmalloc_origin(start), vmalloc_origin(end)); + flush_cache_vmap(vmalloc_shadow(start), vmalloc_shadow(end)); + flush_cache_vmap(vmalloc_origin(start), vmalloc_origin(end)); +} + +/* + * This function creates new shadow/origin pages for the physical pages mapped + * into the virtual memory. If those physical pages already had shadow/origin, + * those are ignored. 
+ */ +void kmsan_ioremap_page_range(unsigned long start, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, + unsigned int page_shift) +{ + gfp_t gfp_mask = GFP_KERNEL | __GFP_ZERO; + struct page *shadow, *origin; + unsigned long off = 0; + int nr; + + if (!kmsan_enabled || kmsan_in_runtime()) + return; + + nr = (end - start) / PAGE_SIZE; + kmsan_enter_runtime(); + for (int i = 0; i < nr; i++, off += PAGE_SIZE) { + shadow = alloc_pages(gfp_mask, 1); + origin = alloc_pages(gfp_mask, 1); + __vmap_pages_range_noflush( + vmalloc_shadow(start + off), + vmalloc_shadow(start + off + PAGE_SIZE), prot, &shadow, + PAGE_SHIFT); + __vmap_pages_range_noflush( + vmalloc_origin(start + off), + vmalloc_origin(start + off + PAGE_SIZE), prot, &origin, + PAGE_SHIFT); + } + flush_cache_vmap(vmalloc_shadow(start), vmalloc_shadow(end)); + flush_cache_vmap(vmalloc_origin(start), vmalloc_origin(end)); + kmsan_leave_runtime(); +} + +void kmsan_iounmap_page_range(unsigned long start, unsigned long end) +{ + unsigned long v_shadow, v_origin; + struct page *shadow, *origin; + int nr; + + if (!kmsan_enabled || kmsan_in_runtime()) + return; + + nr = (end - start) / PAGE_SIZE; + kmsan_enter_runtime(); + v_shadow = (unsigned long)vmalloc_shadow(start); + v_origin = (unsigned long)vmalloc_origin(start); + for (int i = 0; i < nr; + i++, v_shadow += PAGE_SIZE, v_origin += PAGE_SIZE) { + shadow = kmsan_vmalloc_to_page_or_null((void *)v_shadow); + origin = kmsan_vmalloc_to_page_or_null((void *)v_origin); + __vunmap_range_noflush(v_shadow, vmalloc_shadow(end)); + __vunmap_range_noflush(v_origin, vmalloc_origin(end)); + if (shadow) + __free_pages(shadow, 1); + if (origin) + __free_pages(origin, 1); + } + flush_cache_vmap(vmalloc_shadow(start), vmalloc_shadow(end)); + flush_cache_vmap(vmalloc_origin(start), vmalloc_origin(end)); + kmsan_leave_runtime(); +} + /* Functions from kmsan-checks.h follow. */ void kmsan_poison_memory(const void *address, size_t size, gfp_t flags) { diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c index acc5279acc3b..8c81a059beea 100644 --- a/mm/kmsan/shadow.c +++ b/mm/kmsan/shadow.c @@ -145,3 +145,116 @@ void *kmsan_get_metadata(void *address, bool is_origin) return (is_origin ? origin_ptr_for(page) : shadow_ptr_for(page)) + off; } + +void kmsan_copy_page_meta(struct page *dst, struct page *src) +{ + if (!kmsan_enabled || kmsan_in_runtime()) + return; + if (!dst || !page_has_metadata(dst)) + return; + if (!src || !page_has_metadata(src)) { + kmsan_internal_unpoison_memory(page_address(dst), PAGE_SIZE, + /*checked*/ false); + return; + } + + kmsan_enter_runtime(); + __memcpy(shadow_ptr_for(dst), shadow_ptr_for(src), PAGE_SIZE); + __memcpy(origin_ptr_for(dst), origin_ptr_for(src), PAGE_SIZE); + kmsan_leave_runtime(); +} + +void kmsan_alloc_page(struct page *page, unsigned int order, gfp_t flags) +{ + bool initialized = (flags & __GFP_ZERO) || !kmsan_enabled; + struct page *shadow, *origin; + depot_stack_handle_t handle; + int pages = 1 << order; + + if (!page) + return; + + shadow = shadow_page_for(page); + origin = origin_page_for(page); + + if (initialized) { + __memset(page_address(shadow), 0, PAGE_SIZE * pages); + __memset(page_address(origin), 0, PAGE_SIZE * pages); + return; + } + + /* Zero pages allocated by the runtime should also be initialized. 
*/ + if (kmsan_in_runtime()) + return; + + __memset(page_address(shadow), -1, PAGE_SIZE * pages); + kmsan_enter_runtime(); + handle = kmsan_save_stack_with_flags(flags, /*extra_bits*/ 0); + kmsan_leave_runtime(); + /* + * Addresses are page-aligned, pages are contiguous, so it's ok + * to just fill the origin pages with @handle. + */ + for (int i = 0; i < PAGE_SIZE * pages / sizeof(handle); i++) + ((depot_stack_handle_t *)page_address(origin))[i] = handle; +} + +void kmsan_free_page(struct page *page, unsigned int order) +{ + if (!kmsan_enabled || kmsan_in_runtime()) + return; + kmsan_enter_runtime(); + kmsan_internal_poison_memory(page_address(page), + PAGE_SIZE << compound_order(page), + GFP_KERNEL, + KMSAN_POISON_CHECK | KMSAN_POISON_FREE); + kmsan_leave_runtime(); +} + +void kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end, + pgprot_t prot, struct page **pages, + unsigned int page_shift) +{ + unsigned long shadow_start, origin_start, shadow_end, origin_end; + struct page **s_pages, **o_pages; + int nr, mapped; + + if (!kmsan_enabled) + return; + + shadow_start = vmalloc_meta((void *)start, KMSAN_META_SHADOW); + shadow_end = vmalloc_meta((void *)end, KMSAN_META_SHADOW); + if (!shadow_start) + return; + + nr = (end - start) / PAGE_SIZE; + s_pages = kcalloc(nr, sizeof(*s_pages), GFP_KERNEL); + o_pages = kcalloc(nr, sizeof(*o_pages), GFP_KERNEL); + if (!s_pages || !o_pages) + goto ret; + for (int i = 0; i < nr; i++) { + s_pages[i] = shadow_page_for(pages[i]); + o_pages[i] = origin_page_for(pages[i]); + } + prot = __pgprot(pgprot_val(prot) | _PAGE_NX); + prot = PAGE_KERNEL; + + origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN); + origin_end = vmalloc_meta((void *)end, KMSAN_META_ORIGIN); + kmsan_enter_runtime(); + mapped = __vmap_pages_range_noflush(shadow_start, shadow_end, prot, + s_pages, page_shift); + KMSAN_WARN_ON(mapped); + mapped = __vmap_pages_range_noflush(origin_start, origin_end, prot, + o_pages, page_shift); + KMSAN_WARN_ON(mapped); + kmsan_leave_runtime(); + flush_tlb_kernel_range(shadow_start, shadow_end); + flush_tlb_kernel_range(origin_start, origin_end); + flush_cache_vmap(shadow_start, shadow_end); + flush_cache_vmap(origin_start, origin_end); + +ret: + kfree(s_pages); + kfree(o_pages); +} diff --git a/mm/memory.c b/mm/memory.c index b3ed17219d77..118e5f023597 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include #include @@ -3136,6 +3137,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) delayacct_wpcopy_end(); return 0; } + kmsan_copy_page_meta(new_page, old_page); } if (mem_cgroup_charge(page_folio(new_page), mm, GFP_KERNEL)) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4e8ea824e765..1db1ac74ef14 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -1400,6 +1401,7 @@ static __always_inline bool free_pages_prepare(struct page *page, VM_BUG_ON_PAGE(PageTail(page), page); trace_mm_page_free(page, order); + kmsan_free_page(page, order); if (unlikely(PageHWPoison(page)) && !order) { /* @@ -3808,6 +3810,14 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, /* * Allocate a page from the given zone. Use pcplists for order-0 allocations. */ + +/* + * Do not instrument rmqueue() with KMSAN. This function may call + * __msan_poison_alloca() through a call to set_pfnblock_flags_mask(). 
+ * If __msan_poison_alloca() attempts to allocate pages for the stack depot, it + * may call rmqueue() again, which will result in a deadlock. + */ +__no_sanitize_memory static inline struct page *rmqueue(struct zone *preferred_zone, struct zone *zone, unsigned int order, @@ -5560,6 +5570,7 @@ out: } trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype); + kmsan_alloc_page(page, order, alloc_gfp); return page; } diff --git a/mm/vmalloc.c b/mm/vmalloc.c index a991b909866f..ccaa461998f3 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -320,6 +320,9 @@ int ioremap_page_range(unsigned long addr, unsigned long end, err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot), ioremap_max_page_shift); flush_cache_vmap(addr, end); + if (!err) + kmsan_ioremap_page_range(addr, end, phys_addr, prot, + ioremap_max_page_shift); return err; } @@ -416,7 +419,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, * * This is an internal function only. Do not use outside mm/. */ -void vunmap_range_noflush(unsigned long start, unsigned long end) +void __vunmap_range_noflush(unsigned long start, unsigned long end) { unsigned long next; pgd_t *pgd; @@ -438,6 +441,12 @@ void vunmap_range_noflush(unsigned long start, unsigned long end) arch_sync_kernel_mappings(start, end); } +void vunmap_range_noflush(unsigned long start, unsigned long end) +{ + kmsan_vunmap_range_noflush(start, end); + __vunmap_range_noflush(start, end); +} + /** * vunmap_range - unmap kernel virtual addresses * @addr: start of the VM area to unmap @@ -575,7 +584,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, * * This is an internal function only. Do not use outside mm/. */ -int vmap_pages_range_noflush(unsigned long addr, unsigned long end, +int __vmap_pages_range_noflush(unsigned long addr, unsigned long end, pgprot_t prot, struct page **pages, unsigned int page_shift) { unsigned int i, nr = (end - addr) >> PAGE_SHIFT; @@ -601,6 +610,13 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end, return 0; } +int vmap_pages_range_noflush(unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, unsigned int page_shift) +{ + kmsan_vmap_pages_range_noflush(addr, end, prot, pages, page_shift); + return __vmap_pages_range_noflush(addr, end, prot, pages, page_shift); +} + /** * vmap_pages_range - map pages to a kernel virtual address * @addr: start of the VM area to map -- cgit v1.2.3-70-g09d2 From 93324e6842148cfdb44d1437fb586b957ace1f8c Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:06 +0200 Subject: x86: kmsan: disable instrumentation of unsupported code Instrumenting some files with KMSAN will result in kernel being unable to link, boot or crashing at runtime for various reasons (e.g. infinite recursion caused by instrumentation hooks calling instrumented code again). Completely omit KMSAN instrumentation in the following places: - arch/x86/boot and arch/x86/realmode/rm, as KMSAN doesn't work for i386; - arch/x86/entry/vdso, which isn't linked with KMSAN runtime; - three files in arch/x86/kernel - boot problems; - arch/x86/mm/cpu_entry_area.c - recursion. 
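Alongside the per-file Makefile knobs added below, the series also relies on per-function annotations when only a single function is problematic (later patches use them for rmqueue() and __switch_to()). A minimal, illustrative sketch of the two flavours follows, assuming the attribute macros provided by the kernel's compiler headers; the function names and bodies are made up:

  /*
   * Keep KMSAN instrumentation, but skip checks/reports in this function
   * and treat the values it produces as initialized.
   */
  __no_kmsan_checks
  static bool walk_one_frame(unsigned long *bp)
  {
  	return bp != NULL;
  }

  /* Emit no KMSAN instrumentation for this function at all. */
  __no_sanitize_memory
  static unsigned long read_raw_counter(const unsigned long *p)
  {
  	return *p;
  }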
Link: https://lkml.kernel.org/r/20220915150417.722975-33-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/boot/Makefile | 1 + arch/x86/boot/compressed/Makefile | 1 + arch/x86/entry/vdso/Makefile | 3 +++ arch/x86/kernel/Makefile | 2 ++ arch/x86/kernel/cpu/Makefile | 1 + arch/x86/mm/Makefile | 2 ++ arch/x86/realmode/rm/Makefile | 1 + 7 files changed, 11 insertions(+) (limited to 'arch') diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile index ffec8bb01ba8..9860ca5979f8 100644 --- a/arch/x86/boot/Makefile +++ b/arch/x86/boot/Makefile @@ -12,6 +12,7 @@ # Sanitizer runtimes are unavailable and cannot be linked for early boot code. KASAN_SANITIZE := n KCSAN_SANITIZE := n +KMSAN_SANITIZE := n OBJECT_FILES_NON_STANDARD := y # Kernel does not boot with kcov instrumentation here. diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile index 35ce1a64068b..3a261abb6d15 100644 --- a/arch/x86/boot/compressed/Makefile +++ b/arch/x86/boot/compressed/Makefile @@ -20,6 +20,7 @@ # Sanitizer runtimes are unavailable and cannot be linked for early boot code. KASAN_SANITIZE := n KCSAN_SANITIZE := n +KMSAN_SANITIZE := n OBJECT_FILES_NON_STANDARD := y # Prevents link failures: __sanitizer_cov_trace_pc() is not linked in. diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index 12f6c4d714cd..ce4eb7e44e5b 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -11,6 +11,9 @@ include $(srctree)/lib/vdso/Makefile # Sanitizer runtimes are unavailable and cannot be linked here. KASAN_SANITIZE := n +KMSAN_SANITIZE_vclock_gettime.o := n +KMSAN_SANITIZE_vgetcpu.o := n + UBSAN_SANITIZE := n KCSAN_SANITIZE := n OBJECT_FILES_NON_STANDARD := y diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index a20a5ebfacd7..ac564c5d7b1f 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -33,6 +33,8 @@ KASAN_SANITIZE_sev.o := n # With some compiler versions the generated code results in boot hangs, caused # by several compilation units. To be safe, disable all instrumentation. KCSAN_SANITIZE := n +KMSAN_SANITIZE_head$(BITS).o := n +KMSAN_SANITIZE_nmi.o := n # If instrumentation of this dir is enabled, boot hangs during first second. # Probably could be more selective here, but note that files related to irqs, diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index 9661e3e802be..f10a921ee756 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -12,6 +12,7 @@ endif # If these files are instrumented, boot hangs during the first second. KCOV_INSTRUMENT_common.o := n KCOV_INSTRUMENT_perf_event.o := n +KMSAN_SANITIZE_common.o := n # As above, instrumenting secondary CPU boot code causes boot hangs. 
KCSAN_SANITIZE_common.o := n diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 829c1409ffbd..afb6f7187dad 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -14,6 +14,8 @@ KASAN_SANITIZE_pgprot.o := n # Disable KCSAN entirely, because otherwise we get warnings that some functions # reference __initdata sections. KCSAN_SANITIZE := n +# Avoid recursion by not calling KMSAN hooks for CEA code. +KMSAN_SANITIZE_cpu_entry_area.o := n ifdef CONFIG_FUNCTION_TRACER CFLAGS_REMOVE_mem_encrypt.o = -pg diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile index 83f1b6a56449..f614009d3e4e 100644 --- a/arch/x86/realmode/rm/Makefile +++ b/arch/x86/realmode/rm/Makefile @@ -10,6 +10,7 @@ # Sanitizer runtimes are unavailable and cannot be linked here. KASAN_SANITIZE := n KCSAN_SANITIZE := n +KMSAN_SANITIZE := n OBJECT_FILES_NON_STANDARD := y # Prevents link failures: __sanitizer_cov_trace_pc() is not linked in. -- cgit v1.2.3-70-g09d2 From b11671b37f8f4761ff5a3d344553d65238309954 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:07 +0200 Subject: x86: kmsan: skip shadow checks in __switch_to() When instrumenting functions, KMSAN obtains the per-task state (mostly pointers to metadata for function arguments and return values) once per function at its beginning, using the `current` pointer. Every time the instrumented function calls another function, this state (`struct kmsan_context_state`) is updated with shadow/origin data of the passed and returned values. When `current` changes in the low-level arch code, instrumented code can not notice that, and will still refer to the old state, possibly corrupting it or using stale data. This may result in false positive reports. To deal with that, we need to apply __no_kmsan_checks to the functions performing context switching - this will result in skipping all KMSAN shadow checks and marking newly created values as initialized, preventing all false positive reports in those functions. False negatives are still possible, but we expect them to be rare and impersistent. Link: https://lkml.kernel.org/r/20220915150417.722975-34-glider@google.com Suggested-by: Marco Elver Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/kernel/process_64.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch') diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 1962008fe743..6b3418bff326 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -553,6 +553,7 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp, bool x32) * Kprobes not supported here. Set the probe on schedule instead. * Function graph tracer not supported too. 
*/ +__no_kmsan_checks __visible __notrace_funcgraph struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { -- cgit v1.2.3-70-g09d2 From 9245ec01ce848eb5147e2e5030cf33ffbf1befff Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:08 +0200 Subject: x86: kmsan: handle open-coded assembly in lib/iomem.c KMSAN cannot intercept memory accesses within asm() statements. That's why we add kmsan_unpoison_memory() and kmsan_check_memory() to hint it how to handle memory copied from/to I/O memory. Link: https://lkml.kernel.org/r/20220915150417.722975-35-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/lib/iomem.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'arch') diff --git a/arch/x86/lib/iomem.c b/arch/x86/lib/iomem.c index 3e2f33fc33de..e0411a3774d4 100644 --- a/arch/x86/lib/iomem.c +++ b/arch/x86/lib/iomem.c @@ -1,6 +1,7 @@ #include #include #include +#include #define movs(type,to,from) \ asm volatile("movs" type:"=&D" (to), "=&S" (from):"0" (to), "1" (from):"memory") @@ -37,6 +38,8 @@ static void string_memcpy_fromio(void *to, const volatile void __iomem *from, si n-=2; } rep_movs(to, (const void *)from, n); + /* KMSAN must treat values read from devices as initialized. */ + kmsan_unpoison_memory(to, n); } static void string_memcpy_toio(volatile void __iomem *to, const void *from, size_t n) @@ -44,6 +47,8 @@ static void string_memcpy_toio(volatile void __iomem *to, const void *from, size if (unlikely(!n)) return; + /* Make sure uninitialized memory isn't copied to devices. */ + kmsan_check_memory(from, n); /* Align any unaligned destination IO */ if (unlikely(1 & (unsigned long)to)) { movs("b", to, from); -- cgit v1.2.3-70-g09d2 From ff901d80fff6d65ada6f2a60a1f7d180ee2e0416 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:09 +0200 Subject: x86: kmsan: use __msan_ string functions where possible. Unless stated otherwise (by explicitly calling __memcpy(), __memset() or __memmove()) we want all string functions to call their __msan_ versions (e.g. __msan_memcpy() instead of memcpy()), so that shadow and origin values are updated accordingly. Bootloader must still use the default string functions to avoid crashes. Link: https://lkml.kernel.org/r/20220915150417.722975-36-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. 
Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/include/asm/string_64.h | 23 +++++++++++++++++++++-- include/linux/fortify-string.h | 2 ++ 2 files changed, 23 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index 6e450827f677..3b87d889b6e1 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -11,11 +11,23 @@ function. */ #define __HAVE_ARCH_MEMCPY 1 +#if defined(__SANITIZE_MEMORY__) +#undef memcpy +void *__msan_memcpy(void *dst, const void *src, size_t size); +#define memcpy __msan_memcpy +#else extern void *memcpy(void *to, const void *from, size_t len); +#endif extern void *__memcpy(void *to, const void *from, size_t len); #define __HAVE_ARCH_MEMSET +#if defined(__SANITIZE_MEMORY__) +extern void *__msan_memset(void *s, int c, size_t n); +#undef memset +#define memset __msan_memset +#else void *memset(void *s, int c, size_t n); +#endif void *__memset(void *s, int c, size_t n); #define __HAVE_ARCH_MEMSET16 @@ -55,7 +67,13 @@ static inline void *memset64(uint64_t *s, uint64_t v, size_t n) } #define __HAVE_ARCH_MEMMOVE +#if defined(__SANITIZE_MEMORY__) +#undef memmove +void *__msan_memmove(void *dest, const void *src, size_t len); +#define memmove __msan_memmove +#else void *memmove(void *dest, const void *src, size_t count); +#endif void *__memmove(void *dest, const void *src, size_t count); int memcmp(const void *cs, const void *ct, size_t count); @@ -64,8 +82,7 @@ char *strcpy(char *dest, const char *src); char *strcat(char *dest, const char *src); int strcmp(const char *cs, const char *ct); -#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__) - +#if (defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)) /* * For files that not instrumented (e.g. mm/slub.c) we * should use not instrumented version of mem* functions. @@ -73,7 +90,9 @@ int strcmp(const char *cs, const char *ct); #undef memcpy #define memcpy(dst, src, len) __memcpy(dst, src, len) +#undef memmove #define memmove(dst, src, len) __memmove(dst, src, len) +#undef memset #define memset(s, c, n) __memset(s, c, n) #ifndef __NO_FORTIFY diff --git a/include/linux/fortify-string.h b/include/linux/fortify-string.h index 3b401fa0f374..6c8a1a29d0b6 100644 --- a/include/linux/fortify-string.h +++ b/include/linux/fortify-string.h @@ -285,8 +285,10 @@ __FORTIFY_INLINE void fortify_memset_chk(__kernel_size_t size, * __builtin_object_size() must be captured here to avoid evaluating argument * side-effects further into the macro layers. */ +#ifndef CONFIG_KMSAN #define memset(p, c, s) __fortify_memset_chk(p, c, s, \ __builtin_object_size(p, 0), __builtin_object_size(p, 1)) +#endif /* * To make sure the compiler can enforce protection against buffer overflows, -- cgit v1.2.3-70-g09d2 From 3f1e2c7a9099c1ed32c67f12cdf432ba782cf51f Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:10 +0200 Subject: x86: kmsan: sync metadata pages on page fault KMSAN assumes shadow and origin pages for every allocated page are accessible. For pages between [VMALLOC_START, VMALLOC_END] those metadata pages start at KMSAN_VMALLOC_SHADOW_START and KMSAN_VMALLOC_ORIGIN_START, therefore we must sync a bigger memory region. 
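The address arithmetic this patch performs can be summarized in a short sketch (not part of the patch; the helper names are illustrative), assuming the KMSAN vmalloc metadata layout introduced by the earlier "reduce vmalloc space" patch in this series:

  /* Translate a vmalloc address into its KMSAN metadata addresses. */
  static inline unsigned long vmalloc_shadow_addr(unsigned long addr)
  {
  	return addr - VMALLOC_START + KMSAN_VMALLOC_SHADOW_START;
  }

  static inline unsigned long vmalloc_origin_addr(unsigned long addr)
  {
  	return addr - VMALLOC_START + KMSAN_VMALLOC_ORIGIN_START;
  }

arch_sync_kernel_mappings() below applies this same offset to both ends of the faulting range, so the shadow and origin page tables are synced together with the vmalloc mapping itself.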
Link: https://lkml.kernel.org/r/20220915150417.722975-37-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/mm/fault.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index fa71a5d12e87..d728791be8ac 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -260,7 +260,7 @@ static noinline int vmalloc_fault(unsigned long address) } NOKPROBE_SYMBOL(vmalloc_fault); -void arch_sync_kernel_mappings(unsigned long start, unsigned long end) +static void __arch_sync_kernel_mappings(unsigned long start, unsigned long end) { unsigned long addr; @@ -284,6 +284,27 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end) } } +void arch_sync_kernel_mappings(unsigned long start, unsigned long end) +{ + __arch_sync_kernel_mappings(start, end); +#ifdef CONFIG_KMSAN + /* + * KMSAN maintains two additional metadata page mappings for the + * [VMALLOC_START, VMALLOC_END) range. These mappings start at + * KMSAN_VMALLOC_SHADOW_START and KMSAN_VMALLOC_ORIGIN_START and + * have to be synced together with the vmalloc memory mapping. + */ + if (start >= VMALLOC_START && end < VMALLOC_END) { + __arch_sync_kernel_mappings( + start - VMALLOC_START + KMSAN_VMALLOC_SHADOW_START, + end - VMALLOC_START + KMSAN_VMALLOC_SHADOW_START); + __arch_sync_kernel_mappings( + start - VMALLOC_START + KMSAN_VMALLOC_ORIGIN_START, + end - VMALLOC_START + KMSAN_VMALLOC_ORIGIN_START); + } +#endif +} + static bool low_pfn(unsigned long pfn) { return pfn < max_low_pfn; -- cgit v1.2.3-70-g09d2 From d911c67e10b47eb1ace08dcf95ce98fe4d408c88 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:11 +0200 Subject: x86: kasan: kmsan: support CONFIG_GENERIC_CSUM on x86, enable it for KASAN/KMSAN This is needed to allow memory tools like KASAN and KMSAN to see the memory accesses from the checksum code. Without CONFIG_GENERIC_CSUM the tools can't see memory accesses originating from handwritten assembly code. For KASAN it's a question of detecting more bugs; for KMSAN, using the C implementation also helps avoid false positives originating from seemingly uninitialized checksum values. Link: https://lkml.kernel.org/r/20220915150417.722975-38-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S.
Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/Kconfig | 4 ++++ arch/x86/include/asm/checksum.h | 16 ++++++++++------ arch/x86/lib/Makefile | 2 ++ 3 files changed, 16 insertions(+), 6 deletions(-) (limited to 'arch') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 674d694a665e..eb158eb8a985 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -325,6 +325,10 @@ config GENERIC_ISA_DMA def_bool y depends on ISA_DMA_API +config GENERIC_CSUM + bool + default y if KMSAN || KASAN + config GENERIC_BUG def_bool y depends on BUG diff --git a/arch/x86/include/asm/checksum.h b/arch/x86/include/asm/checksum.h index bca625a60186..6df6ece8a28e 100644 --- a/arch/x86/include/asm/checksum.h +++ b/arch/x86/include/asm/checksum.h @@ -1,9 +1,13 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1 -#define HAVE_CSUM_COPY_USER -#define _HAVE_ARCH_CSUM_AND_COPY -#ifdef CONFIG_X86_32 -# include +#ifdef CONFIG_GENERIC_CSUM +# include #else -# include +# define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1 +# define HAVE_CSUM_COPY_USER +# define _HAVE_ARCH_CSUM_AND_COPY +# ifdef CONFIG_X86_32 +# include +# else +# include +# endif #endif diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index f76747862bd2..7ba5f61d7273 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -65,7 +65,9 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y) endif else obj-y += iomap_copy_64.o +ifneq ($(CONFIG_GENERIC_CSUM),y) lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o +endif lib-y += clear_page_64.o copy_page_64.o lib-y += memmove_64.o memset_64.o lib-y += copy_user_64.o -- cgit v1.2.3-70-g09d2 From 7cf8f44a5a1c0cb10a594996797e5a988cf0589d Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:12 +0200 Subject: x86: fs: kmsan: disable CONFIG_DCACHE_WORD_ACCESS dentry_string_cmp() calls read_word_at_a_time(), which might read uninitialized bytes to optimize string comparisons. Disabling CONFIG_DCACHE_WORD_ACCESS should prohibit this optimization, as well as (probably) similar ones. Link: https://lkml.kernel.org/r/20220915150417.722975-39-glider@google.com Signed-off-by: Alexander Potapenko Suggested-by: Andrey Konovalov Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/Kconfig | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index eb158eb8a985..edf7f12935d7 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -129,7 +129,9 @@ config X86 select CLKEVT_I8253 select CLOCKSOURCE_VALIDATE_LAST_CYCLE select CLOCKSOURCE_WATCHDOG - select DCACHE_WORD_ACCESS + # Word-size accesses may read uninitialized data past the trailing \0 + # in strings and cause false KMSAN reports. 
+ select DCACHE_WORD_ACCESS if !KMSAN select DYNAMIC_SIGFRAME select EDAC_ATOMIC_SCRUB select EDAC_SUPPORT -- cgit v1.2.3-70-g09d2 From 37ad4ee8364255c73026a3c343403b5977fa7e79 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:13 +0200 Subject: x86: kmsan: don't instrument stack walking functions Upon function exit, KMSAN marks local variables as uninitialized. Further function calls may result in the compiler creating the stack frame where these local variables resided. This results in frame pointers being marked as uninitialized data, which is normally correct, because they are not stack-allocated. However stack unwinding functions are supposed to read and dereference the frame pointers, in which case KMSAN might be reporting uses of uninitialized values. To work around that, we mark update_stack_state(), unwind_next_frame() and show_trace_log_lvl() with __no_kmsan_checks, preventing all KMSAN reports inside those functions and making them return initialized values. Link: https://lkml.kernel.org/r/20220915150417.722975-40-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/kernel/dumpstack.c | 6 ++++++ arch/x86/kernel/unwind_frame.c | 11 +++++++++++ 2 files changed, 17 insertions(+) (limited to 'arch') diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c index afae4dd77495..476eb504084e 100644 --- a/arch/x86/kernel/dumpstack.c +++ b/arch/x86/kernel/dumpstack.c @@ -177,6 +177,12 @@ static void show_regs_if_on_stack(struct stack_info *info, struct pt_regs *regs, } } +/* + * This function reads pointers from the stack and dereferences them. The + * pointers may not have their KMSAN shadow set up properly, which may result + * in false positive reports. Disable instrumentation to avoid those. + */ +__no_kmsan_checks static void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs, unsigned long *stack, const char *log_lvl) { diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c index 8e1c50c86e5d..d8ba93778ae3 100644 --- a/arch/x86/kernel/unwind_frame.c +++ b/arch/x86/kernel/unwind_frame.c @@ -183,6 +183,16 @@ static struct pt_regs *decode_frame_pointer(unsigned long *bp) } #endif +/* + * While walking the stack, KMSAN may stomp on stale locals from other + * functions that were marked as uninitialized upon function exit, and + * now hold the call frame information for the current function (e.g. the frame + * pointer). Because KMSAN does not specifically mark call frames as + * initialized, false positive reports are possible. To prevent such reports, + * we mark the functions scanning the stack (here and below) with + * __no_kmsan_checks. 
+ */ +__no_kmsan_checks static bool update_stack_state(struct unwind_state *state, unsigned long *next_bp) { @@ -250,6 +260,7 @@ static bool update_stack_state(struct unwind_state *state, return true; } +__no_kmsan_checks bool unwind_next_frame(struct unwind_state *state) { struct pt_regs *regs; -- cgit v1.2.3-70-g09d2 From 4ca8cc8d1bbe582bfc7a4d80bd72cfd8d3d0e2e8 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 15 Sep 2022 17:04:17 +0200 Subject: x86: kmsan: enable KMSAN builds for x86 Make KMSAN usable by adding the necessary Kconfig bits. Also declare x86-specific functions checking address validity in arch/x86/include/asm/kmsan.h. Link: https://lkml.kernel.org/r/20220915150417.722975-44-glider@google.com Signed-off-by: Alexander Potapenko Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/Kconfig | 1 + arch/x86/include/asm/kmsan.h | 55 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+) create mode 100644 arch/x86/include/asm/kmsan.h (limited to 'arch') diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index edf7f12935d7..dce94e84199d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -169,6 +169,7 @@ config X86 select HAVE_ARCH_KASAN if X86_64 select HAVE_ARCH_KASAN_VMALLOC if X86_64 select HAVE_ARCH_KFENCE + select HAVE_ARCH_KMSAN if X86_64 select HAVE_ARCH_KGDB select HAVE_ARCH_MMAP_RND_BITS if MMU select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h new file mode 100644 index 000000000000..a790b865d0a6 --- /dev/null +++ b/arch/x86/include/asm/kmsan.h @@ -0,0 +1,55 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * x86 KMSAN support. + * + * Copyright (C) 2022, Google LLC + * Author: Alexander Potapenko + */ + +#ifndef _ASM_X86_KMSAN_H +#define _ASM_X86_KMSAN_H + +#ifndef MODULE + +#include +#include + +/* + * Taken from arch/x86/mm/physaddr.h to avoid using an instrumented version. + */ +static inline bool kmsan_phys_addr_valid(unsigned long addr) +{ + if (IS_ENABLED(CONFIG_PHYS_ADDR_T_64BIT)) + return !(addr >> boot_cpu_data.x86_phys_bits); + else + return true; +} + +/* + * Taken from arch/x86/mm/physaddr.c to avoid using an instrumented version. 
+ */ +static inline bool kmsan_virt_addr_valid(void *addr) +{ + unsigned long x = (unsigned long)addr; + unsigned long y = x - __START_KERNEL_map; + + /* use the carry flag to determine if x was < __START_KERNEL_map */ + if (unlikely(x > y)) { + x = y + phys_base; + + if (y >= KERNEL_IMAGE_SIZE) + return false; + } else { + x = y + (__START_KERNEL_map - PAGE_OFFSET); + + /* carry flag will be set if starting x was >= PAGE_OFFSET */ + if ((x > y) || !kmsan_phys_addr_valid(x)) + return false; + } + + return pfn_valid(x >> PAGE_SHIFT); +} + +#endif /* !MODULE */ + +#endif /* _ASM_X86_KMSAN_H */ -- cgit v1.2.3-70-g09d2 From ce732a7520b093091c345cba1b84542d1abd83ed Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Wed, 28 Sep 2022 14:32:19 +0200 Subject: x86: kmsan: handle CPU entry area Among other data, CPU entry area holds exception stacks, so addresses from this area can be passed to kmsan_get_metadata(). This previously led to kmsan_get_metadata() returning NULL, which in turn resulted in a warning that triggered further attempts to call kmsan_get_metadata() in the exception context, which quickly exhausted the exception stack. This patch allocates shadow and origin for the CPU entry area on x86 and introduces arch_kmsan_get_meta_or_null(), which performs arch-specific metadata mapping. Link: https://lkml.kernel.org/r/20220928123219.1101883-1-glider@google.com Signed-off-by: Alexander Potapenko Fixes: 21d723a7c1409 ("kmsan: add KMSAN runtime core") Cc: Alexander Viro Cc: Alexei Starovoitov Cc: Andrey Konovalov Cc: Andrey Konovalov Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Dmitry Vyukov Cc: Eric Biggers Cc: Eric Biggers Cc: Eric Dumazet Cc: Greg Kroah-Hartman Cc: Herbert Xu Cc: Ilya Leoshkevich Cc: Ingo Molnar Cc: Jens Axboe Cc: Joonsoo Kim Cc: Kees Cook Cc: Marco Elver Cc: Mark Rutland Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Petr Mladek Cc: Stephen Rothwell Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vegard Nossum Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + arch/x86/include/asm/kmsan.h | 32 ++++++++++++++++++++++++++++++++ arch/x86/mm/Makefile | 3 +++ arch/x86/mm/kmsan_shadow.c | 20 ++++++++++++++++++++ mm/kmsan/shadow.c | 6 +++++- 5 files changed, 61 insertions(+), 1 deletion(-) create mode 100644 arch/x86/mm/kmsan_shadow.c (limited to 'arch') diff --git a/MAINTAINERS b/MAINTAINERS index 3c7dfe9bb712..456b07f02803 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11379,6 +11379,7 @@ L: kasan-dev@googlegroups.com S: Maintained F: Documentation/dev-tools/kmsan.rst F: arch/*/include/asm/kmsan.h +F: arch/*/mm/kmsan_* F: include/linux/kmsan*.h F: lib/Kconfig.kmsan F: mm/kmsan/ diff --git a/arch/x86/include/asm/kmsan.h b/arch/x86/include/asm/kmsan.h index a790b865d0a6..8fa6ac0e2d76 100644 --- a/arch/x86/include/asm/kmsan.h +++ b/arch/x86/include/asm/kmsan.h @@ -11,9 +11,41 @@ #ifndef MODULE +#include #include #include +DECLARE_PER_CPU(char[CPU_ENTRY_AREA_SIZE], cpu_entry_area_shadow); +DECLARE_PER_CPU(char[CPU_ENTRY_AREA_SIZE], cpu_entry_area_origin); + +/* + * Functions below are declared in the header to make sure they are inlined. + * They all are called from kmsan_get_metadata() for every memory access in + * the kernel, so speed is important here. + */ + +/* + * Compute metadata addresses for the CPU entry area on x86. 
+ */ +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin) +{ + unsigned long addr64 = (unsigned long)addr; + char *metadata_array; + unsigned long off; + int cpu; + + if ((addr64 < CPU_ENTRY_AREA_BASE) || + (addr64 >= (CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE))) + return NULL; + cpu = (addr64 - CPU_ENTRY_AREA_BASE) / CPU_ENTRY_AREA_SIZE; + off = addr64 - (unsigned long)get_cpu_entry_area(cpu); + if ((off < 0) || (off >= CPU_ENTRY_AREA_SIZE)) + return NULL; + metadata_array = is_origin ? cpu_entry_area_origin : + cpu_entry_area_shadow; + return &per_cpu(metadata_array[off], cpu); +} + /* * Taken from arch/x86/mm/physaddr.h to avoid using an instrumented version. */ diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index afb6f7187dad..c80febc44cd2 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -46,6 +46,9 @@ obj-$(CONFIG_HIGHMEM) += highmem_32.o KASAN_SANITIZE_kasan_init_$(BITS).o := n obj-$(CONFIG_KASAN) += kasan_init_$(BITS).o +KMSAN_SANITIZE_kmsan_shadow.o := n +obj-$(CONFIG_KMSAN) += kmsan_shadow.o + obj-$(CONFIG_MMIOTRACE) += mmiotrace.o mmiotrace-y := kmmio.o pf_in.o mmio-mod.o obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o diff --git a/arch/x86/mm/kmsan_shadow.c b/arch/x86/mm/kmsan_shadow.c new file mode 100644 index 000000000000..bee2ec4a3bfa --- /dev/null +++ b/arch/x86/mm/kmsan_shadow.c @@ -0,0 +1,20 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * x86-specific bits of KMSAN shadow implementation. + * + * Copyright (C) 2022 Google LLC + * Author: Alexander Potapenko + */ + +#include +#include + +/* + * Addresses within the CPU entry area (including e.g. exception stacks) do not + * have struct page entries corresponding to them, so they need separate + * handling. + * arch_kmsan_get_meta_or_null() (declared in the header) maps the addresses in + * CPU entry area to addresses in cpu_entry_area_shadow/cpu_entry_area_origin. + */ +DEFINE_PER_CPU(char[CPU_ENTRY_AREA_SIZE], cpu_entry_area_shadow); +DEFINE_PER_CPU(char[CPU_ENTRY_AREA_SIZE], cpu_entry_area_origin); diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c index 6e90a806a704..21e3e196ec3c 100644 --- a/mm/kmsan/shadow.c +++ b/mm/kmsan/shadow.c @@ -12,7 +12,6 @@ #include #include #include -#include #include #include #include @@ -126,6 +125,7 @@ void *kmsan_get_metadata(void *address, bool is_origin) { u64 addr = (u64)address, pad, off; struct page *page; + void *ret; if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) { pad = addr % KMSAN_ORIGIN_SIZE; @@ -136,6 +136,10 @@ void *kmsan_get_metadata(void *address, bool is_origin) kmsan_internal_is_module_addr(address)) return (void *)vmalloc_meta(address, is_origin); + ret = arch_kmsan_get_meta_or_null(address, is_origin); + if (ret) + return ret; + page = virt_to_page_or_null(address); if (!page) return NULL; -- cgit v1.2.3-70-g09d2 From e55b9f96860f6c6026cff97966a740576285e07b Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 26 Sep 2022 09:57:04 -0400 Subject: mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol Since 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control"), CONFIG_MEMCG_SWAP hasn't been a user-visible config option anymore, it just means CONFIG_MEMCG && CONFIG_SWAP. Update the sites accordingly and drop the symbol. [ While touching the docs, remove two references to CONFIG_MEMCG_KMEM, which hasn't been a user-visible symbol for over half a decade. 
] Link: https://lkml.kernel.org/r/20220926135704.400818-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Shakeel Butt Cc: Hugh Dickins Cc: Michal Hocko Cc: Roman Gushchin Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 4 +--- arch/mips/configs/db1xxx_defconfig | 1 - arch/mips/configs/generic_defconfig | 1 - arch/powerpc/configs/powernv_defconfig | 1 - arch/powerpc/configs/pseries_defconfig | 1 - arch/sh/configs/sdk7786_defconfig | 1 - arch/sh/configs/urquell_defconfig | 1 - include/linux/swap.h | 2 +- include/linux/swap_cgroup.h | 4 ++-- init/Kconfig | 5 ----- mm/Makefile | 4 +++- mm/memcontrol.c | 6 +++--- tools/testing/selftests/cgroup/config | 1 - 13 files changed, 10 insertions(+), 22 deletions(-) (limited to 'arch') diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 2cc502a75ef6..5b86245450bd 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -299,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; PG_lru bit of page->flags is cleared before isolating a page from its LRU under lruvec->lru_lock. -2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) +2.7 Kernel Memory Extension ----------------------------------------------- With the Kernel memory extension, the Memory Controller is able to limit @@ -386,8 +386,6 @@ U != 0, K >= U: a. Enable CONFIG_CGROUPS b. Enable CONFIG_MEMCG -c. Enable CONFIG_MEMCG_SWAP (to use swap extension) -d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) ------------------------------------------------------------------- diff --git a/arch/mips/configs/db1xxx_defconfig b/arch/mips/configs/db1xxx_defconfig index b8bd66300996..83cbdecb27e6 100644 --- a/arch/mips/configs/db1xxx_defconfig +++ b/arch/mips/configs/db1xxx_defconfig @@ -9,7 +9,6 @@ CONFIG_HIGH_RES_TIMERS=y CONFIG_LOG_BUF_SHIFT=16 CONFIG_CGROUPS=y CONFIG_MEMCG=y -CONFIG_MEMCG_SWAP=y CONFIG_BLK_CGROUP=y CONFIG_CGROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y diff --git a/arch/mips/configs/generic_defconfig b/arch/mips/configs/generic_defconfig index 714169e411cf..48e4e251779b 100644 --- a/arch/mips/configs/generic_defconfig +++ b/arch/mips/configs/generic_defconfig @@ -3,7 +3,6 @@ CONFIG_NO_HZ_IDLE=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_MEMCG=y -CONFIG_MEMCG_SWAP=y CONFIG_BLK_CGROUP=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig index 49f49c263935..4acca5263404 100644 --- a/arch/powerpc/configs/powernv_defconfig +++ b/arch/powerpc/configs/powernv_defconfig @@ -17,7 +17,6 @@ CONFIG_LOG_CPU_MAX_BUF_SHIFT=13 CONFIG_NUMA_BALANCING=y CONFIG_CGROUPS=y CONFIG_MEMCG=y -CONFIG_MEMCG_SWAP=y CONFIG_CGROUP_SCHED=y CONFIG_CGROUP_FREEZER=y CONFIG_CPUSETS=y diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig index b571d084c148..fead14ebb1fc 100644 --- a/arch/powerpc/configs/pseries_defconfig +++ b/arch/powerpc/configs/pseries_defconfig @@ -16,7 +16,6 @@ CONFIG_LOG_CPU_MAX_BUF_SHIFT=13 CONFIG_NUMA_BALANCING=y CONFIG_CGROUPS=y CONFIG_MEMCG=y -CONFIG_MEMCG_SWAP=y CONFIG_CGROUP_SCHED=y CONFIG_CGROUP_FREEZER=y CONFIG_CPUSETS=y diff --git a/arch/sh/configs/sdk7786_defconfig b/arch/sh/configs/sdk7786_defconfig index a8662b6927ec..97b7356639ed 100644 --- a/arch/sh/configs/sdk7786_defconfig +++ 
b/arch/sh/configs/sdk7786_defconfig @@ -16,7 +16,6 @@ CONFIG_CPUSETS=y # CONFIG_PROC_PID_CPUSET is not set CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_MEMCG=y -CONFIG_CGROUP_MEMCG_SWAP=y CONFIG_CGROUP_SCHED=y CONFIG_RT_GROUP_SCHED=y CONFIG_BLK_CGROUP=y diff --git a/arch/sh/configs/urquell_defconfig b/arch/sh/configs/urquell_defconfig index cb2f56468fe0..be478f3148f2 100644 --- a/arch/sh/configs/urquell_defconfig +++ b/arch/sh/configs/urquell_defconfig @@ -14,7 +14,6 @@ CONFIG_CPUSETS=y # CONFIG_PROC_PID_CPUSET is not set CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_MEMCG=y -CONFIG_CGROUP_MEMCG_SWAP=y CONFIG_CGROUP_SCHED=y CONFIG_RT_GROUP_SCHED=y CONFIG_BLK_DEV_INITRD=y diff --git a/include/linux/swap.h b/include/linux/swap.h index fc8d98660326..a18cf4b7c724 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -666,7 +666,7 @@ static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp) cgroup_throttle_swaprate(&folio->page, gfp); } -#ifdef CONFIG_MEMCG_SWAP +#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry); int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry); static inline int mem_cgroup_try_charge_swap(struct folio *folio, diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h index a12dd1c3966c..ae73a87775b3 100644 --- a/include/linux/swap_cgroup.h +++ b/include/linux/swap_cgroup.h @@ -4,7 +4,7 @@ #include -#ifdef CONFIG_MEMCG_SWAP +#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, unsigned short old, unsigned short new); @@ -40,6 +40,6 @@ static inline void swap_cgroup_swapoff(int type) return; } -#endif /* CONFIG_MEMCG_SWAP */ +#endif #endif /* __LINUX_SWAP_CGROUP_H */ diff --git a/init/Kconfig b/init/Kconfig index 532362fcfe31..7d86cf6b3012 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -958,11 +958,6 @@ config MEMCG help Provides control over the memory footprint of tasks in a cgroup. -config MEMCG_SWAP - bool - depends on MEMCG && SWAP - default y - config MEMCG_KMEM bool depends on MEMCG && !SLOB diff --git a/mm/Makefile b/mm/Makefile index cc23b0052584..8e105e5b3e29 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -98,7 +98,9 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o -obj-$(CONFIG_MEMCG_SWAP) += swap_cgroup.o +ifdef CONFIG_SWAP +obj-$(CONFIG_MEMCG) += swap_cgroup.o +endif obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o obj-$(CONFIG_GUP_TEST) += gup_test.o obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 76bb0a18a2f3..61e05fc281fb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3423,7 +3423,7 @@ void split_page_memcg(struct page *head, unsigned int nr) css_get_many(&memcg->css, nr - 1); } -#ifdef CONFIG_MEMCG_SWAP +#ifdef CONFIG_SWAP /** * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record. 
* @entry: swap entry to be moved @@ -7296,7 +7296,7 @@ static int __init mem_cgroup_init(void) } subsys_initcall(mem_cgroup_init); -#ifdef CONFIG_MEMCG_SWAP +#ifdef CONFIG_SWAP static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg) { while (!refcount_inc_not_zero(&memcg->id.ref)) { @@ -7788,4 +7788,4 @@ static int __init mem_cgroup_swap_init(void) } subsys_initcall(mem_cgroup_swap_init); -#endif /* CONFIG_MEMCG_SWAP */ +#endif /* CONFIG_SWAP */ diff --git a/tools/testing/selftests/cgroup/config b/tools/testing/selftests/cgroup/config index 84fe884fad86..97d549ee894f 100644 --- a/tools/testing/selftests/cgroup/config +++ b/tools/testing/selftests/cgroup/config @@ -4,5 +4,4 @@ CONFIG_CGROUP_FREEZER=y CONFIG_CGROUP_SCHED=y CONFIG_MEMCG=y CONFIG_MEMCG_KMEM=y -CONFIG_MEMCG_SWAP=y CONFIG_PAGE_COUNTER=y -- cgit v1.2.3-70-g09d2
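For out-of-tree or backported code that still tests the dropped symbol, the equivalent guard after this patch is the combination used in the hunks above; a minimal sketch (the declaration is made up for illustration):

  #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
  /* Formerly guarded by CONFIG_MEMCG_SWAP. */
  void example_account_swap(swp_entry_t entry);
  #endif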