From 32d32ef140de3cc3f6817999415a72f7b0cb52f5 Mon Sep 17 00:00:00 2001 From: "T.J. Alumbaugh" Date: Tue, 14 Feb 2023 03:54:45 +0000 Subject: mm: multi-gen LRU: improve design doc This patch improves the design doc. Specifically, 1. add a section for the per-memcg mm_struct list, and 2. add a section for the PID controller. Link: https://lkml.kernel.org/r/20230214035445.1250139-2-talumbau@google.com Signed-off-by: T.J. Alumbaugh Cc: Yu Zhao Signed-off-by: Andrew Morton --- Documentation/mm/multigen_lru.rst | 44 ++++++++++++++++++++++++++++++++++----- 1 file changed, 39 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst index 5f1f6ecbb79b..52ed5092022f 100644 --- a/Documentation/mm/multigen_lru.rst +++ b/Documentation/mm/multigen_lru.rst @@ -103,7 +103,8 @@ moving across tiers only involves atomic operations on ``folio->flags`` and therefore has a negligible cost. A feedback loop modeled after the PID controller monitors refaults over all the tiers from anon and file types and decides which tiers from which types to -evict or protect. +evict or protect. The desired effect is to balance refault percentages +between anon and file types proportional to the swappiness level. There are two conceptually independent procedures: the aging and the eviction. They form a closed-loop system, i.e., the page reclaim. @@ -156,6 +157,27 @@ This time-based approach has the following advantages: and memory sizes. 2. It is more reliable because it is directly wired to the OOM killer. +``mm_struct`` list +------------------ +An ``mm_struct`` list is maintained for each memcg, and an +``mm_struct`` follows its owner task to the new memcg when this task +is migrated. + +A page table walker iterates ``lruvec_memcg()->mm_list`` and calls +``walk_page_range()`` with each ``mm_struct`` on this list to scan +PTEs. When multiple page table walkers iterate the same list, each of +them gets a unique ``mm_struct``, and therefore they can run in +parallel. + +Page table walkers ignore any misplaced pages, e.g., if an +``mm_struct`` was migrated, pages left in the previous memcg will be +ignored when the current memcg is under reclaim. Similarly, page table +walkers will ignore pages from nodes other than the one under reclaim. + +This infrastructure also tracks the usage of ``mm_struct`` between +context switches so that page table walkers can skip processes that +have been sleeping since the last iteration. + Rmap/PT walk feedback --------------------- Searching the rmap for PTEs mapping each page on an LRU list (to test @@ -170,7 +192,7 @@ promotes hot pages. If the scan was done cacheline efficiently, it adds the PMD entry pointing to the PTE table to the Bloom filter. This forms a feedback loop between the eviction and the aging. -Bloom Filters +Bloom filters ------------- Bloom filters are a space and memory efficient data structure for set membership test, i.e., test if an element is not in the set or may be @@ -186,6 +208,18 @@ is false positive, the cost is an additional scan of a range of PTEs, which may yield hot pages anyway. Parameters of the filter itself can control the false positive rate in the limit. +PID controller +-------------- +A feedback loop modeled after the Proportional-Integral-Derivative +(PID) controller monitors refaults over anon and file types and +decides which type to evict when both types are available from the +same generation. + +The PID controller uses generations rather than the wall clock as the +time domain because a CPU can scan pages at different rates under +varying memory pressure. It calculates a moving average for each new +generation to avoid being permanently locked in a suboptimal state. + Memcg LRU --------- An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, @@ -223,9 +257,9 @@ parts: * Generations * Rmap walks -* Page table walks -* Bloom filters -* PID controller +* Page table walks via ``mm_struct`` list +* Bloom filters for rmap/PT walk feedback +* PID controller for refault feedback The aging and the eviction form a producer-consumer model; specifically, the latter drives the former by the sliding window over -- cgit v1.2.3-70-g09d2 From 88e3009b5283bbd41447f0352d0b9df16cf6f183 Mon Sep 17 00:00:00 2001 From: Nicholas Piggin Date: Fri, 3 Feb 2023 17:18:35 +1000 Subject: lazy tlb: allow lazy tlb mm refcounting to be configurable Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, so the patch introduces no functional change. Link: https://lkml.kernel.org/r/20230203071837.1136453-4-npiggin@gmail.com Signed-off-by: Nicholas Piggin Acked-by: Linus Torvalds Cc: Andy Lutomirski Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Michael Ellerman Cc: Nadav Amit Cc: Peter Zijlstra Cc: Rik van Riel Cc: Will Deacon Signed-off-by: Andrew Morton --- Documentation/mm/active_mm.rst | 6 ++++++ arch/Kconfig | 17 +++++++++++++++++ include/linux/sched/mm.h | 18 +++++++++++++++--- 3 files changed, 38 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst index 45d89f8fb3a8..d096fc091e23 100644 --- a/Documentation/mm/active_mm.rst +++ b/Documentation/mm/active_mm.rst @@ -2,6 +2,12 @@ Active MM ========= +Note, the mm_count refcount may no longer include the "lazy" users +(running tasks with ->active_mm == mm && ->mm == NULL) on kernels +with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy +references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb() +helpers, which abstract this config option. + :: List: linux-kernel diff --git a/arch/Kconfig b/arch/Kconfig index e3511afbb7f2..643f8a04141e 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM irqs disabled over activate_mm. Architectures that do IPI based TLB shootdowns should enable this. +# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references. +# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching +# to/from kernel threads when the same mm is running on a lot of CPUs (a large +# multi-threaded application), by reducing contention on the mm refcount. +# +# This can be disabled if the architecture ensures no CPUs are using an mm as a +# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm +# or its kernel page tables). This could be arranged by arch_exit_mmap(), or +# final exit(2) TLB flush, for example. +# +# To implement this, an arch *must*: +# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating +# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been +# converted already). +config MMU_LAZY_TLB_REFCOUNT + def_bool y + config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 5376caf6fcf3..689dbe812563 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm) /* Helpers for lazy TLB mm refcounting */ static inline void mmgrab_lazy_tlb(struct mm_struct *mm) { - mmgrab(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) + mmgrab(mm); } static inline void mmdrop_lazy_tlb(struct mm_struct *mm) { - mmdrop(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) { + mmdrop(mm); + } else { + /* + * mmdrop_lazy_tlb must provide a full memory barrier, see the + * membarrier comment finish_task_switch which relies on this. + */ + smp_mb(); + } } static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm) { - mmdrop_sched(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) + mmdrop_sched(mm); + else + smp_mb(); /* see mmdrop_lazy_tlb() above */ } /** -- cgit v1.2.3-70-g09d2 From 4c85c0be3d7a9a7ffe48bfe0954eacc0ba9d3c75 Mon Sep 17 00:00:00 2001 From: Hyeonggon Yoo <42.hyeyoo@gmail.com> Date: Mon, 30 Jan 2023 13:25:13 +0900 Subject: mm, printk: introduce new format %pGt for page_type %pGp format is used to display 'flags' field of a struct page. However, some page flags (i.e. PG_buddy, see page-flags.h for more details) are stored in page_type field. To display human-readable output of page_type, introduce %pGt format. It is important to note the meaning of bits are different in page_type. if page_type is 0xffffffff, no flags are set. Setting PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit actually means setting a flag. Bits in page_type are inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values. if it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Petr Mladek [vsprintf part] Cc: Andy Shevchenko Cc: David Hildenbrand Cc: Joe Perches Cc: John Ogness Cc: Matthew Wilcox Cc: Sergey Senozhatsky Cc: Steven Rostedt (Google) Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- Documentation/core-api/printk-formats.rst | 16 +++++++++++----- include/linux/page-flags.h | 7 ++++++- include/trace/events/mmflags.h | 8 ++++++++ lib/test_printf.c | 26 ++++++++++++++++++++++++++ lib/vsprintf.c | 21 +++++++++++++++++++++ mm/debug.c | 5 +++++ mm/internal.h | 1 + 7 files changed, 78 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index dbe1aacc79d0..dfe7e75a71de 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -575,20 +575,26 @@ The field width is passed by value, the bitmap is passed by reference. Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease printing cpumask and nodemask. -Flags bitfields such as page flags, gfp_flags ---------------------------------------------- +Flags bitfields such as page flags, page_type, gfp_flags +-------------------------------------------------------- :: %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff) + %pGt 0xffffff7f(buddy) %pGg GFP_USER|GFP_DMA32|GFP_NOWARN %pGv read|exec|mayread|maywrite|mayexec|denywrite For printing flags bitfields as a collection of symbolic constants that would construct the value. The type of flags is given by the third -character. Currently supported are [p]age flags, [v]ma_flags (both -expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag -names and print order depends on the particular type. +character. Currently supported are: + + - p - [p]age flags, expects value of type (``unsigned long *``) + - t - page [t]ype, expects value of type (``unsigned int *``) + - v - [v]ma_flags, expects value of type (``unsigned long *``) + - g - [g]fp_flags, expects value of type (``gfp_t *``) + +The flag names and print order depends on the particular type. Note that this format should not be used directly in the :c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags() diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a7e3a3405520..57287102c5bd 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -926,9 +926,14 @@ static inline bool is_page_hwpoison(struct page *page) #define PageType(page, flag) \ ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) +static inline int page_type_has_type(unsigned int page_type) +{ + return (int)page_type < PAGE_MAPCOUNT_RESERVE; +} + static inline int page_has_type(struct page *page) { - return (int)page->page_type < PAGE_MAPCOUNT_RESERVE; + return page_type_has_type(page->page_type); } #define PAGE_TYPE_OPS(uname, lname) \ diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index f453babe29b3..b28218b7998e 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -141,6 +141,14 @@ IF_HAVE_PG_SKIP_KASAN_POISON(skip_kasan_poison) __def_pageflag_names \ ) : "none" +#define DEF_PAGETYPE_NAME(_name) { PG_##_name, __stringify(_name) } + +#define __def_pagetype_names \ + DEF_PAGETYPE_NAME(offline), \ + DEF_PAGETYPE_NAME(guard), \ + DEF_PAGETYPE_NAME(table), \ + DEF_PAGETYPE_NAME(buddy) + #if defined(CONFIG_X86) #define __VM_ARCH_SPECIFIC_1 {VM_PAT, "pat" } #elif defined(CONFIG_PPC) diff --git a/lib/test_printf.c b/lib/test_printf.c index 46b4e6c414a3..7677ebccf3c3 100644 --- a/lib/test_printf.c +++ b/lib/test_printf.c @@ -642,12 +642,26 @@ page_flags_test(int section, int node, int zone, int last_cpupid, test(cmp_buf, "%pGp", &flags); } +static void __init page_type_test(unsigned int page_type, const char *name, + char *cmp_buf) +{ + unsigned long size; + + size = scnprintf(cmp_buf, BUF_SIZE, "%#x(", page_type); + if (page_type_has_type(page_type)) + size += scnprintf(cmp_buf + size, BUF_SIZE - size, "%s", name); + + snprintf(cmp_buf + size, BUF_SIZE - size, ")"); + test(cmp_buf, "%pGt", &page_type); +} + static void __init flags(void) { unsigned long flags; char *cmp_buffer; gfp_t gfp; + unsigned int page_type; cmp_buffer = kmalloc(BUF_SIZE, GFP_KERNEL); if (!cmp_buffer) @@ -687,6 +701,18 @@ flags(void) gfp |= __GFP_HIGH; test(cmp_buffer, "%pGg", &gfp); + page_type = ~0; + page_type_test(page_type, "", cmp_buffer); + + page_type = 10; + page_type_test(page_type, "", cmp_buffer); + + page_type = ~PG_buddy; + page_type_test(page_type, "buddy", cmp_buffer); + + page_type = ~(PG_table | PG_buddy); + page_type_test(page_type, "table|buddy", cmp_buffer); + kfree(cmp_buffer); } diff --git a/lib/vsprintf.c b/lib/vsprintf.c index be71a03c936a..fbe320b5e89f 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -2052,6 +2052,25 @@ char *format_page_flags(char *buf, char *end, unsigned long flags) return buf; } +static +char *format_page_type(char *buf, char *end, unsigned int page_type) +{ + buf = number(buf, end, page_type, default_flag_spec); + + if (buf < end) + *buf = '('; + buf++; + + if (page_type_has_type(page_type)) + buf = format_flags(buf, end, ~page_type, pagetype_names); + + if (buf < end) + *buf = ')'; + buf++; + + return buf; +} + static noinline_for_stack char *flags_string(char *buf, char *end, void *flags_ptr, struct printf_spec spec, const char *fmt) @@ -2065,6 +2084,8 @@ char *flags_string(char *buf, char *end, void *flags_ptr, switch (fmt[1]) { case 'p': return format_page_flags(buf, end, *(unsigned long *)flags_ptr); + case 't': + return format_page_type(buf, end, *(unsigned int *)flags_ptr); case 'v': flags = *(unsigned long *)flags_ptr; names = vmaflag_names; diff --git a/mm/debug.c b/mm/debug.c index 96d594e16292..01cf0435723b 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -36,6 +36,11 @@ const struct trace_print_flags pageflag_names[] = { {0, NULL} }; +const struct trace_print_flags pagetype_names[] = { + __def_pagetype_names, + {0, NULL} +}; + const struct trace_print_flags gfpflag_names[] = { __def_gfpflag_names, {0, NULL} diff --git a/mm/internal.h b/mm/internal.h index 7920a8b7982e..2a7ffd9962c4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -802,6 +802,7 @@ static inline void flush_tlb_batched_pending(struct mm_struct *mm) #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; +extern const struct trace_print_flags pagetype_names[]; extern const struct trace_print_flags vmaflag_names[]; extern const struct trace_print_flags gfpflag_names[]; -- cgit v1.2.3-70-g09d2 From 9dabf6e1374519f89d9fc326a129b5cc35088479 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Thu, 2 Mar 2023 17:18:45 +0530 Subject: mm/debug_vm_pgtable: replace pte_mkhuge() with arch_make_huge_pte() Since the following commit arch_make_huge_pte() should be used directly in generic memory subsystem as a platform provided page table helper, instead of pte_mkhuge(). Change hugetlb_basic_tests() to call arch_make_huge_pte() directly, and update its relevant documentation entry as required. 'commit 16785bd77431 ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")' Link: https://lkml.kernel.org/r/20230302114845.421674-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Reported-by: Christophe Leroy Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/ Cc: Jonathan Corbet Cc: David Hildenbrand Cc: Mike Kravetz Cc: Mike Rapoport Signed-off-by: Andrew Morton --- Documentation/mm/arch_pgtable_helpers.rst | 2 +- mm/debug_vm_pgtable.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index 30d9a09f01f4..af3891f895b0 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -214,7 +214,7 @@ HugeTLB Page Table Helpers +---------------------------+--------------------------------------------------+ | pte_huge | Tests a HugeTLB | +---------------------------+--------------------------------------------------+ -| pte_mkhuge | Creates a HugeTLB | +| arch_make_huge_pte | Creates a HugeTLB | +---------------------------+--------------------------------------------------+ | huge_pte_dirty | Tests a dirty HugeTLB | +---------------------------+--------------------------------------------------+ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index af59cc7bd307..7887cc2b75bf 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -934,7 +934,7 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB pte = pfn_pte(args->fixed_pmd_pfn, args->page_prot); - WARN_ON(!pte_huge(pte_mkhuge(pte))); + WARN_ON(!pte_huge(arch_make_huge_pte(pte, PMD_SHIFT, VM_ACCESS_FLAGS))); #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ } #else /* !CONFIG_HUGETLB_PAGE */ -- cgit v1.2.3-70-g09d2 From d0f5a85442d1a0552eae681b2e8cdc86ac08aba2 Mon Sep 17 00:00:00 2001 From: Luis Chamberlain Date: Thu, 9 Mar 2023 15:05:44 -0800 Subject: shmem: update documentation Update the docs to reflect a bit better why some folks prefer tmpfs over ramfs and clarify a bit more about the difference between brd ramdisks. While at it, add THP docs for tmpfs, both the mount options and the sysfs file. Link: https://lkml.kernel.org/r/20230309230545.2930737-6-mcgrof@kernel.org Signed-off-by: Luis Chamberlain Reviewed-by: Christian Brauner Reviewed-by: David Hildenbrand Tested-by: Xin Hao Reviewed-by: Davidlohr Bueso Cc: Adam Manzanares Cc: Davidlohr Bueso Cc: Hugh Dickins Cc: Kees Cook Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- Documentation/filesystems/tmpfs.rst | 57 +++++++++++++++++++++++++++++++------ 1 file changed, 49 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst index 0408c245785e..1ec9a9f8196b 100644 --- a/Documentation/filesystems/tmpfs.rst +++ b/Documentation/filesystems/tmpfs.rst @@ -13,14 +13,25 @@ everything stored therein is lost. tmpfs puts everything into the kernel internal caches and grows and shrinks to accommodate the files it contains and is able to swap -unneeded pages out to swap space. It has maximum size limits which can -be adjusted on the fly via 'mount -o remount ...' - -If you compare it to ramfs (which was the template to create tmpfs) -you gain swapping and limit checking. Another similar thing is the RAM -disk (/dev/ram*), which simulates a fixed size hard disk in physical -RAM, where you have to create an ordinary filesystem on top. Ramdisks -cannot swap and you do not have the possibility to resize them. +unneeded pages out to swap space, and supports THP. + +tmpfs extends ramfs with a few userspace configurable options listed and +explained further below, some of which can be reconfigured dynamically on the +fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs +filesystem can be resized but it cannot be resized to a size below its current +usage. tmpfs also supports POSIX ACLs, and extended attributes for the +trusted.* and security.* namespaces. ramfs does not use swap and you cannot +modify any parameter for a ramfs filesystem. The size limit of a ramfs +filesystem is how much memory you have available, and so care must be taken if +used so to not run out of memory. + +An alternative to tmpfs and ramfs is to use brd to create RAM disks +(/dev/ram*), which allows you to simulate a block device disk in physical RAM. +To write data you would just then need to create an regular filesystem on top +this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also +configured in size at initialization and you cannot dynamically resize them. +Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the +block layer at all. Since tmpfs lives completely in the page cache and on swap, all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in @@ -85,6 +96,36 @@ mount with such options, since it allows any user with write access to use up all the memory on the machine; but enhances the scalability of that instance in a system with many CPUs making intensive use of it. +tmpfs also supports Transparent Huge Pages which requires a kernel +configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for +your system (has_transparent_hugepage(), which is architecture specific). +The mount options for this are: + +====== ============================================================ +huge=0 never: disables huge pages for the mount +huge=1 always: enables huge pages for the mount +huge=2 within_size: only allocate huge pages if the page will be + fully within i_size, also respect fadvise()/madvise() hints. +huge=3 advise: only allocate huge pages if requested with + fadvise()/madvise() +====== ============================================================ + +There is a sysfs file which you can also use to control system wide THP +configuration for all tmpfs mounts, the file is: + +/sys/kernel/mm/transparent_hugepage/shmem_enabled + +This sysfs file is placed on top of THP sysfs directory and so is registered +by THP code. It is however only used to control all tmpfs mounts with one +single knob. Since it controls all tmpfs mounts it should only be used either +for emergency or testing purposes. The values you can set for shmem_enabled are: + +== ============================================================ +-1 deny: disables huge on shm_mnt and all mounts, for + emergency use +-2 force: enables huge on shm_mnt and all mounts, w/o needing + option, for testing +== ============================================================ tmpfs has a mount option to set the NUMA memory allocation policy for all files in that instance (if CONFIG_NUMA is enabled) - which can be -- cgit v1.2.3-70-g09d2 From 2c6efe9cf2d7841b75fe38ed1adbd41a90f51ba0 Mon Sep 17 00:00:00 2001 From: Luis Chamberlain Date: Thu, 9 Mar 2023 15:05:45 -0800 Subject: shmem: add support to ignore swap In doing experimentations with shmem having the option to avoid swap becomes a useful mechanism. One of the *raves* about brd over shmem is you can avoid swap, but that's not really a good reason to use brd if we can instead use shmem. Using brd has its own good reasons to exist, but just because "tmpfs" doesn't let you do that is not a great reason to avoid it if we can easily add support for it. I don't add support for reconfiguring incompatible options, but if we really wanted to we can add support for that. To avoid swap we use mapping_set_unevictable() upon inode creation, and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim. Link: https://lkml.kernel.org/r/20230309230545.2930737-7-mcgrof@kernel.org Signed-off-by: Luis Chamberlain Acked-by: Christian Brauner Tested-by: Xin Hao Reviewed-by: Davidlohr Bueso Cc: Adam Manzanares Cc: David Hildenbrand Cc: Davidlohr Bueso Cc: Hugh Dickins Cc: Kees Cook Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- Documentation/filesystems/tmpfs.rst | 9 ++++++--- Documentation/mm/unevictable-lru.rst | 2 ++ include/linux/shmem_fs.h | 1 + mm/shmem.c | 28 +++++++++++++++++++++++++++- 4 files changed, 36 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst index 1ec9a9f8196b..f18f46be5c0c 100644 --- a/Documentation/filesystems/tmpfs.rst +++ b/Documentation/filesystems/tmpfs.rst @@ -13,7 +13,8 @@ everything stored therein is lost. tmpfs puts everything into the kernel internal caches and grows and shrinks to accommodate the files it contains and is able to swap -unneeded pages out to swap space, and supports THP. +unneeded pages out to swap space, if swap was enabled for the tmpfs +mount. tmpfs also supports THP. tmpfs extends ramfs with a few userspace configurable options listed and explained further below, some of which can be reconfigured dynamically on the @@ -33,8 +34,8 @@ configured in size at initialization and you cannot dynamically resize them. Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the block layer at all. -Since tmpfs lives completely in the page cache and on swap, all tmpfs -pages will be shown as "Shmem" in /proc/meminfo and "Shared" in +Since tmpfs lives completely in the page cache and optionally on swap, +all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in free(1). Notice that these counters also include shared memory (shmem, see ipcs(1)). The most reliable way to get the count is using df(1) and du(1). @@ -83,6 +84,8 @@ nr_inodes The maximum number of inodes for this instance. The default is half of the number of your physical RAM pages, or (on a machine with highmem) the number of lowmem RAM pages, whichever is the lower. +noswap Disables swap. Remounts must respect the original settings. + By default swap is enabled. ========= ============================================================ These parameters accept a suffix k, m or g for kilo, mega and giga and diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 92ac5dca420c..d5ac8511eb67 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages: * Those owned by ramfs. + * Those owned by tmpfs with the noswap mount option. + * Those mapped into SHM_LOCK'd shared memory regions. * Those mapped into VM_LOCKED [mlock()ed] VMAs. diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 103d1000a5a2..50bf82b36995 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -45,6 +45,7 @@ struct shmem_sb_info { kuid_t uid; /* Mount uid for root directory */ kgid_t gid; /* Mount gid for root directory */ bool full_inums; /* If i_ino should be uint or ino_t */ + bool noswap; /* ignores VM reclaim / swap requests */ ino_t next_ino; /* The next per-sb inode number to use */ ino_t __percpu *ino_batch; /* The next per-cpu inode number to use */ struct mempolicy *mpol; /* default memory policy for mappings */ diff --git a/mm/shmem.c b/mm/shmem.c index 7751c136eadb..787e83791eb5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -116,10 +116,12 @@ struct shmem_options { bool full_inums; int huge; int seen; + bool noswap; #define SHMEM_SEEN_BLOCKS 1 #define SHMEM_SEEN_INODES 2 #define SHMEM_SEEN_HUGE 4 #define SHMEM_SEEN_INUMS 8 +#define SHMEM_SEEN_NOSWAP 16 }; #ifdef CONFIG_TMPFS @@ -1334,6 +1336,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) struct address_space *mapping = folio->mapping; struct inode *inode = mapping->host; struct shmem_inode_info *info = SHMEM_I(inode); + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); swp_entry_t swap; pgoff_t index; @@ -1347,7 +1350,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) if (WARN_ON_ONCE(!wbc->for_reclaim)) goto redirty; - if (WARN_ON_ONCE(info->flags & VM_LOCKED)) + if (WARN_ON_ONCE((info->flags & VM_LOCKED) || sbinfo->noswap)) goto redirty; if (!total_swap_pages) @@ -2372,6 +2375,8 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block shmem_set_inode_flags(inode, info->fsflags); INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); + if (sbinfo->noswap) + mapping_set_unevictable(inode->i_mapping); simple_xattrs_init(&info->xattrs); cache_no_acl(inode); mapping_set_large_folios(inode->i_mapping); @@ -3459,6 +3464,7 @@ enum shmem_param { Opt_uid, Opt_inode32, Opt_inode64, + Opt_noswap, }; static const struct constant_table shmem_param_enums_huge[] = { @@ -3480,6 +3486,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = { fsparam_u32 ("uid", Opt_uid), fsparam_flag ("inode32", Opt_inode32), fsparam_flag ("inode64", Opt_inode64), + fsparam_flag ("noswap", Opt_noswap), {} }; @@ -3563,6 +3570,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param) ctx->full_inums = true; ctx->seen |= SHMEM_SEEN_INUMS; break; + case Opt_noswap: + ctx->noswap = true; + ctx->seen |= SHMEM_SEEN_NOSWAP; + break; } return 0; @@ -3661,6 +3672,14 @@ static int shmem_reconfigure(struct fs_context *fc) err = "Current inum too high to switch to 32-bit inums"; goto out; } + if ((ctx->seen & SHMEM_SEEN_NOSWAP) && ctx->noswap && !sbinfo->noswap) { + err = "Cannot disable swap on remount"; + goto out; + } + if (!(ctx->seen & SHMEM_SEEN_NOSWAP) && !ctx->noswap && sbinfo->noswap) { + err = "Cannot enable swap on remount if it was disabled on first mount"; + goto out; + } if (ctx->seen & SHMEM_SEEN_HUGE) sbinfo->huge = ctx->huge; @@ -3681,6 +3700,10 @@ static int shmem_reconfigure(struct fs_context *fc) sbinfo->mpol = ctx->mpol; /* transfers initial ref */ ctx->mpol = NULL; } + + if (ctx->noswap) + sbinfo->noswap = true; + raw_spin_unlock(&sbinfo->stat_lock); mpol_put(mpol); return 0; @@ -3735,6 +3758,8 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root) seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge)); #endif shmem_show_mpol(seq, sbinfo->mpol); + if (sbinfo->noswap) + seq_printf(seq, ",noswap"); return 0; } @@ -3778,6 +3803,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) ctx->inodes = shmem_default_max_inodes(); if (!(ctx->seen & SHMEM_SEEN_INUMS)) ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64); + sbinfo->noswap = ctx->noswap; } else { sb->s_flags |= SB_NOUSER; } -- cgit v1.2.3-70-g09d2 From 2bad466cc9d9b4c3b4b16eb9c03c919b59561316 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Thu, 9 Mar 2023 17:37:10 -0500 Subject: mm/uffd: UFFD_FEATURE_WP_UNPOPULATED Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4. The new feature bit makes anonymous memory acts the same as file memory on userfaultfd-wp in that it'll also wr-protect none ptes. It can be useful in two cases: (1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot, so pre-fault can be replaced by enabling this flag and speed up protections (2) It helps to implement async uffd-wp mode that Muhammad is working on [1] It's debatable whether this is the most ideal solution because with the new feature bit set, wr-protect none pte needs to pre-populate the pgtables to the last level (PAGE_SIZE). But it seems fine so far to service either purpose above, so we can leave optimizations for later. The series brings pte markers to anonymous memory too. There's some change in the common mm code path in the 1st patch, great to have some eye looking at it, but hopefully they're still relatively straightforward. This patch (of 2): This is a new feature that controls how uffd-wp handles none ptes. When it's set, the kernel will handle anonymous memory the same way as file memory, by allowing the user to wr-protect unpopulated ptes. File memories handles none ptes consistently by allowing wr-protecting of none ptes because of the unawareness of page cache being exist or not. For anonymous it was not as persistent because we used to assume that we don't need protections on none ptes or known zero pages. One use case of such a feature bit was VM live snapshot, where if without wr-protecting empty ptes the snapshot can contain random rubbish in the holes of the anonymous memory, which can cause misbehave of the guest when the guest OS assumes the pages should be all zeros. QEMU worked it around by pre-populate the section with reads to fill in zero page entries before starting the whole snapshot process [1]. Recently there's another need raised on using userfaultfd wr-protect for detecting dirty pages (to replace soft-dirty in some cases) [2]. In that case if without being able to wr-protect none ptes by default, the dirty info can get lost, since we cannot treat every none pte to be dirty (the current design is identify a page dirty based on uffd-wp bit being cleared). In general, we want to be able to wr-protect empty ptes too even for anonymous. This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make uffd-wp handling on none ptes being consistent no matter what the memory type is underneath. It doesn't have any impact on file memories so far because we already have pte markers taking care of that. So it only affects anonymous. The feature bit is by default off, so the old behavior will be maintained. Sometimes it may be wanted because the wr-protect of none ptes will contain overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to anonymous), but also on creating the pgtables to store the pte markers. So there's potentially less chance of using thp on the first fault for a none pmd or larger than a pmd. The major implementation part is teaching the whole kernel to understand pte markers even for anonymously mapped ranges, meanwhile allowing the UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the new feature bit is set. Note that even if the patch subject starts with mm/uffd, there're a few small refactors to major mm path of handling anonymous page faults. But they should be straightforward. With WP_UNPOPUATED, application like QEMU can avoid pre-read faults all the memory before wr-protect during taking a live snapshot. Quotting from Muhammad's test result here [3] based on a simple program [4]: (1) With huge page disabled echo madvise > /sys/kernel/mm/transparent_hugepage/enabled ./uffd_wp_perf Test DEFAULT: 4 Test PRE-READ: 1111453 (pre-fault 1101011) Test MADVISE: 278276 (pre-fault 266378) Test WP-UNPOPULATE: 11712 (2) With Huge page enabled echo always > /sys/kernel/mm/transparent_hugepage/enabled ./uffd_wp_perf Test DEFAULT: 4 Test PRE-READ: 22521 (pre-fault 22348) Test MADVISE: 4909 (pre-fault 4743) Test WP-UNPOPULATE: 14448 There'll be a great perf boost for no-thp case, while for thp enabled with extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE, but that's low possibility in reality, also the overhead was not reduced but postponed until a follow up write on any huge zero thp, so potentially it is faster by making the follow up writes slower. [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/ [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/ [3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/ [4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c [peterx@redhat.com: comment changes, oneliner fix to khugepaged] Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com Signed-off-by: Peter Xu Acked-by: David Hildenbrand Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Mike Rapoport Cc: Muhammad Usama Anjum Cc: Nadav Amit Cc: Paul Gofman Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/userfaultfd.rst | 17 +++++++++ fs/userfaultfd.c | 16 ++++++++ include/linux/mm_inline.h | 6 +++ include/linux/userfaultfd_k.h | 23 ++++++++++++ include/uapi/linux/userfaultfd.h | 10 ++++- mm/khugepaged.c | 2 +- mm/memory.c | 56 +++++++++++++++++++++------- mm/mprotect.c | 51 ++++++++++++++++++++----- 8 files changed, 155 insertions(+), 26 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 7dc823b56ca4..bd2226299583 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -219,6 +219,23 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was used. +Userfaultfd write-protect mode currently behave differently on none ptes +(when e.g. page is missing) over different types of memories. + +For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes +(e.g. when pages are missing and not populated). For file-backed memories +like shmem and hugetlbfs, none ptes will be write protected just like a +present pte. In other words, there will be a userfaultfd write fault +message generated when writing to a missing page on file typed memories, +as long as the page range was write-protected before. Such a message will +not be generated on anonymous memories by default. + +If the application wants to be able to write protect none ptes on anonymous +memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On +newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED +and set the feature bit in advance to make sure none ptes will also be +write protected even upon anonymous memory. + QEMU/KVM ======== diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 44d1ee429eb0..881e9c82b9d1 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -108,6 +108,21 @@ static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx) return ctx->features & UFFD_FEATURE_INITIALIZED; } +/* + * Whether WP_UNPOPULATED is enabled on the uffd context. It is only + * meaningful when userfaultfd_wp()==true on the vma and when it's + * anonymous. + */ +bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma) +{ + struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx; + + if (!ctx) + return false; + + return ctx->features & UFFD_FEATURE_WP_UNPOPULATED; +} + static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, vm_flags_t flags) { @@ -1971,6 +1986,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, #endif #ifndef CONFIG_PTE_MARKER_UFFD_WP uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM; + uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED; #endif uffdio_api.ioctls = UFFD_API_IOCTLS; ret = -EFAULT; diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index de1e622dd366..0e1d239a882c 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -557,6 +557,12 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, /* The current status of the pte should be "cleared" before calling */ WARN_ON_ONCE(!pte_none(*pte)); + /* + * NOTE: userfaultfd_wp_unpopulated() doesn't need this whole + * thing, because when zapping either it means it's dropping the + * page, or in TTU where the present pte will be quickly replaced + * with a swap pte. There's no way of leaking the bit. + */ if (vma_is_anonymous(vma) || !userfaultfd_wp(vma)) return; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 3767f18114ef..0cf8880219da 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start, unsigned long end, struct list_head *uf); extern void userfaultfd_unmap_complete(struct mm_struct *mm, struct list_head *uf); +extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma); #else /* CONFIG_USERFAULTFD */ @@ -274,8 +275,30 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma) return false; } +static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma) +{ + return false; +} + #endif /* CONFIG_USERFAULTFD */ +static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma) +{ + /* Only wr-protect mode uses pte markers */ + if (!userfaultfd_wp(vma)) + return false; + + /* File-based uffd-wp always need markers */ + if (!vma_is_anonymous(vma)) + return true; + + /* + * Anonymous uffd-wp only needs the markers if WP_UNPOPULATED + * enabled (to apply markers on zero pages). + */ + return userfaultfd_wp_unpopulated(vma); +} + static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry) { #ifdef CONFIG_PTE_MARKER_UFFD_WP diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 005e5e306266..90c958952bfc 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -38,7 +38,8 @@ UFFD_FEATURE_MINOR_HUGETLBFS | \ UFFD_FEATURE_MINOR_SHMEM | \ UFFD_FEATURE_EXACT_ADDRESS | \ - UFFD_FEATURE_WP_HUGETLBFS_SHMEM) + UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \ + UFFD_FEATURE_WP_UNPOPULATED) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -203,6 +204,12 @@ struct uffdio_api { * * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd * write-protection mode is supported on both shmem and hugetlbfs. + * + * UFFD_FEATURE_WP_UNPOPULATED indicates that userfaultfd + * write-protection mode will always apply to unpopulated pages + * (i.e. empty ptes). This will be the default behavior for shmem + * & hugetlbfs, so this flag only affects anonymous memory behavior + * when userfault write-protection mode is registered. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -217,6 +224,7 @@ struct uffdio_api { #define UFFD_FEATURE_MINOR_SHMEM (1<<10) #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) +#define UFFD_FEATURE_WP_UNPOPULATED (1<<13) __u64 features; __u64 ioctls; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 074ea534f786..c7317678cb10 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1177,7 +1177,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, * enabled swap entries. Please see * comment below for pte_uffd_wp(). */ - if (pte_swp_uffd_wp(pteval)) { + if (pte_swp_uffd_wp_any(pteval)) { result = SCAN_PTE_UFFD_WP; goto out_unmap; } diff --git a/mm/memory.c b/mm/memory.c index 6285cad1f4fb..a890b2951b53 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -104,6 +104,20 @@ EXPORT_SYMBOL(mem_map); #endif static vm_fault_t do_fault(struct vm_fault *vmf); +static vm_fault_t do_anonymous_page(struct vm_fault *vmf); +static bool vmf_pte_changed(struct vm_fault *vmf); + +/* + * Return true if the original pte was a uffd-wp pte marker (so the pte was + * wr-protected). + */ +static bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf) +{ + if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)) + return false; + + return pte_marker_uffd_wp(vmf->orig_pte); +} /* * A number of key systems in x86 including ioremap() rely on the assumption @@ -1346,6 +1360,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, pte_t *pte, struct zap_details *details, pte_t pteval) { + /* Zap on anonymous always means dropping everything */ + if (vma_is_anonymous(vma)) + return; + if (zap_drop_file_uffd_wp(details)) return; @@ -1452,8 +1470,12 @@ again: continue; rss[mm_counter(page)]--; } else if (pte_marker_entry_uffd_wp(entry)) { - /* Only drop the uffd-wp marker if explicitly requested */ - if (!zap_drop_file_uffd_wp(details)) + /* + * For anon: always drop the marker; for file: only + * drop the marker if explicitly requested. + */ + if (!vma_is_anonymous(vma) && + !zap_drop_file_uffd_wp(details)) continue; } else if (is_hwpoison_entry(entry) || is_swapin_error_entry(entry)) { @@ -3620,6 +3642,14 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf) return 0; } +static vm_fault_t do_pte_missing(struct vm_fault *vmf) +{ + if (vma_is_anonymous(vmf->vma)) + return do_anonymous_page(vmf); + else + return do_fault(vmf); +} + /* * This is actually a page-missing access, but with uffd-wp special pte * installed. It means this pte was wr-protected before being unmapped. @@ -3630,11 +3660,10 @@ static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf) * Just in case there're leftover special ptes even after the region * got unregistered - we can simply clear them. */ - if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma))) + if (unlikely(!userfaultfd_wp(vmf->vma))) return pte_marker_clear(vmf); - /* do_fault() can handle pte markers too like none pte */ - return do_fault(vmf); + return do_pte_missing(vmf); } static vm_fault_t handle_pte_marker(struct vm_fault *vmf) @@ -3999,6 +4028,7 @@ out_release: */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; vm_fault_t ret = 0; @@ -4032,7 +4062,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) vma->vm_page_prot)); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - if (!pte_none(*vmf->pte)) { + if (vmf_pte_changed(vmf)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto unlock; } @@ -4072,7 +4102,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - if (!pte_none(*vmf->pte)) { + if (vmf_pte_changed(vmf)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto release; } @@ -4092,6 +4122,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) folio_add_new_anon_rmap(folio, vma, vmf->address); folio_add_lru_vma(folio, vma); setpte: + if (uffd_wp) + entry = pte_mkuffd_wp(entry); set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); /* No need to invalidate - it was non-present before */ @@ -4259,7 +4291,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) { struct vm_area_struct *vma = vmf->vma; - bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte); + bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); bool write = vmf->flags & FAULT_FLAG_WRITE; bool prefault = vmf->address != addr; pte_t entry; @@ -4903,12 +4935,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) } } - if (!vmf->pte) { - if (vma_is_anonymous(vmf->vma)) - return do_anonymous_page(vmf); - else - return do_fault(vmf); - } + if (!vmf->pte) + return do_pte_missing(vmf); if (!pte_present(vmf->orig_pte)) return do_swap_page(vmf); diff --git a/mm/mprotect.c b/mm/mprotect.c index 13e84d8c0797..b9da9a5f87fe 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -276,7 +276,15 @@ static long change_pte_range(struct mmu_gather *tlb, } else { /* It must be an none page, or what else?.. */ WARN_ON_ONCE(!pte_none(oldpte)); - if (unlikely(uffd_wp && !vma_is_anonymous(vma))) { + + /* + * Nobody plays with any none ptes besides + * userfaultfd when applying the protections. + */ + if (likely(!uffd_wp)) + continue; + + if (userfaultfd_wp_use_markers(vma)) { /* * For file-backed mem, we need to be able to * wr-protect a none pte, because even if the @@ -320,23 +328,46 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd) return 0; } -/* Return true if we're uffd wr-protecting file-backed memory, or false */ +/* + * Return true if we want to split THPs into PTE mappings in change + * protection procedure, false otherwise. + */ static inline bool -uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags) +pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags) { + /* + * pte markers only resides in pte level, if we need pte markers, + * we need to split. We cannot wr-protect shmem thp because file + * thp is handled differently when split by erasing the pmd so far. + */ return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma); } /* - * If wr-protecting the range for file-backed, populate pgtable for the case - * when pgtable is empty but page cache exists. When {pte|pmd|...}_alloc() - * failed we treat it the same way as pgtable allocation failures during - * page faults by kicking OOM and returning error. + * Return true if we want to populate pgtables in change protection + * procedure, false otherwise + */ +static inline bool +pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags) +{ + /* If not within ioctl(UFFDIO_WRITEPROTECT), then don't bother */ + if (!(cp_flags & MM_CP_UFFD_WP)) + return false; + + /* Populate if the userfaultfd mode requires pte markers */ + return userfaultfd_wp_use_markers(vma); +} + +/* + * Populate the pgtable underneath for whatever reason if requested. + * When {pte|pmd|...}_alloc() failed we treat it the same way as pgtable + * allocation failures during page faults by kicking OOM and returning + * error. */ #define change_pmd_prepare(vma, pmd, cp_flags) \ ({ \ long err = 0; \ - if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ + if (unlikely(pgtable_populate_needed(vma, cp_flags))) { \ if (pte_alloc(vma->vm_mm, pmd)) \ err = -ENOMEM; \ } \ @@ -351,7 +382,7 @@ uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags) #define change_prepare(vma, high, low, addr, cp_flags) \ ({ \ long err = 0; \ - if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ + if (unlikely(pgtable_populate_needed(vma, cp_flags))) { \ low##_t *p = low##_alloc(vma->vm_mm, high, addr); \ if (p == NULL) \ err = -ENOMEM; \ @@ -404,7 +435,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb, if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { if ((next - addr != HPAGE_PMD_SIZE) || - uffd_wp_protect_file(vma, cp_flags)) { + pgtable_split_needed(vma, cp_flags)) { __split_huge_pmd(vma, pmd, addr, false, NULL); /* * For file-backed, the pmd could have been -- cgit v1.2.3-70-g09d2 From 23baf831a32c04f9a968812511540b1b3e648bf5 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 15 Mar 2023 14:31:33 +0300 Subject: mm, treewide: redefine MAX_ORDER sanely MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Reviewed-by: Michael Ellerman [powerpc] Cc: "Kirill A. Shutemov" Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/admin-guide/kdump/vmcoreinfo.rst | 6 ++-- Documentation/admin-guide/kernel-parameters.txt | 2 +- arch/arc/Kconfig | 4 +-- arch/arm/Kconfig | 9 ++---- arch/arm/configs/imx_v6_v7_defconfig | 2 +- arch/arm/configs/milbeaut_m10v_defconfig | 2 +- arch/arm/configs/oxnas_v6_defconfig | 2 +- arch/arm/configs/pxa_defconfig | 2 +- arch/arm/configs/sama7_defconfig | 2 +- arch/arm/configs/sp7021_defconfig | 2 +- arch/arm64/Kconfig | 27 ++++++++---------- arch/arm64/include/asm/sparsemem.h | 2 +- arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +- arch/arm64/kvm/hyp/nvhe/page_alloc.c | 10 +++---- arch/csky/Kconfig | 2 +- arch/ia64/Kconfig | 8 +++--- arch/ia64/include/asm/sparsemem.h | 4 +-- arch/ia64/mm/hugetlbpage.c | 2 +- arch/loongarch/Kconfig | 15 ++++------ arch/m68k/Kconfig.cpu | 5 +--- arch/mips/Kconfig | 19 ++++++------- arch/nios2/Kconfig | 7 ++--- arch/powerpc/Kconfig | 27 ++++++++---------- arch/powerpc/configs/85xx/ge_imp3a_defconfig | 2 +- arch/powerpc/configs/fsl-emb-nonhw.config | 2 +- arch/powerpc/mm/book3s64/iommu_api.c | 2 +- arch/powerpc/mm/hugetlbpage.c | 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 2 +- arch/sh/configs/ecovec24_defconfig | 2 +- arch/sh/mm/Kconfig | 17 +++++------ arch/sparc/Kconfig | 5 +--- arch/sparc/kernel/pci_sun4v.c | 2 +- arch/sparc/kernel/traps_64.c | 2 +- arch/sparc/mm/tsb.c | 4 +-- arch/um/kernel/um_arch.c | 4 +-- arch/xtensa/Kconfig | 5 +--- drivers/base/regmap/regmap-debugfs.c | 8 +++--- drivers/block/floppy.c | 2 +- drivers/crypto/ccp/sev-dev.c | 2 +- drivers/crypto/hisilicon/sgl.c | 6 ++-- drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +- drivers/gpu/drm/i915/gem/selftests/huge_pages.c | 2 +- drivers/gpu/drm/ttm/ttm_pool.c | 22 +++++++------- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +- drivers/iommu/dma-iommu.c | 2 +- drivers/irqchip/irq-gic-v3-its.c | 4 +-- drivers/md/dm-bufio.c | 2 +- drivers/misc/genwqe/card_dev.c | 2 +- drivers/misc/genwqe/card_utils.c | 4 +-- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 2 +- drivers/net/ethernet/ibm/ibmvnic.h | 2 +- drivers/video/fbdev/hyperv_fb.c | 4 +-- drivers/video/fbdev/vermilion/vermilion.c | 2 +- drivers/virtio/virtio_balloon.c | 2 +- drivers/virtio/virtio_mem.c | 12 ++++---- fs/ramfs/file-nommu.c | 2 +- include/drm/ttm/ttm_pool.h | 2 +- include/linux/hugetlb.h | 2 +- include/linux/mmzone.h | 10 +++---- include/linux/pageblock-flags.h | 4 +-- include/linux/slab.h | 6 ++-- kernel/crash_core.c | 2 +- kernel/dma/pool.c | 6 ++-- kernel/events/ring_buffer.c | 6 ++-- mm/Kconfig | 10 +++---- mm/compaction.c | 8 +++--- mm/debug_vm_pgtable.c | 4 +-- mm/huge_memory.c | 2 +- mm/hugetlb.c | 4 +-- mm/kmsan/init.c | 6 ++-- mm/memblock.c | 2 +- mm/memory_hotplug.c | 4 +-- mm/page_alloc.c | 38 ++++++++++++------------- mm/page_isolation.c | 12 ++++---- mm/page_owner.c | 6 ++-- mm/page_reporting.c | 6 ++-- mm/shuffle.h | 2 +- mm/slab.c | 2 +- mm/slub.c | 6 ++-- mm/vmscan.c | 2 +- mm/vmstat.c | 14 ++++----- net/smc/smc_ib.c | 2 +- security/integrity/ima/ima_crypto.c | 2 +- tools/testing/memblock/linux/mmzone.h | 6 ++-- 84 files changed, 223 insertions(+), 253 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 86fd88492870..c18d94fa6470 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -172,7 +172,7 @@ variables. Offset of the free_list's member. This value is used to compute the number of free pages. -Each zone has a free_area structure array called free_area[MAX_ORDER]. +Each zone has a free_area structure array called free_area[MAX_ORDER + 1]. The free_list represents a linked list of free page blocks. (list_head, next|prev) @@ -189,8 +189,8 @@ Offsets of the vmap_area's members. They carry vmalloc-specific information. Makedumpfile gets the start address of the vmalloc region from this. -(zone.free_area, MAX_ORDER) ---------------------------- +(zone.free_area, MAX_ORDER + 1) +------------------------------- Free areas descriptor. User-space tools use this value to iterate the free_area ranges. MAX_ORDER is used by the zone buddy allocator. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6221a1d057dd..50da4f26fad5 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3969,7 +3969,7 @@ [KNL] Minimal page reporting order Format: Adjust the minimal page reporting order. The page - reporting is disabled when it exceeds (MAX_ORDER-1). + reporting is disabled when it exceeds MAX_ORDER. panic= [KNL] Kernel behaviour on panic: delay timeout > 0: seconds before rebooting diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index d9a13ccf89a3..ab6d701365bb 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -556,7 +556,7 @@ endmenu # "ARC Architecture Configuration" config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - default "12" if ARC_HUGEPAGE_16M - default "11" + default "11" if ARC_HUGEPAGE_16M + default "10" source "kernel/power/Kconfig" diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index e24a9820e12f..929e646e84b9 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1355,9 +1355,9 @@ config ARM_MODULE_PLTS config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - default "12" if SOC_AM33XX - default "9" if SA1111 - default "11" + default "11" if SOC_AM33XX + default "8" if SA1111 + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -1366,9 +1366,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - config ALIGNMENT_TRAP def_bool CPU_CP15_MMU select HAVE_PROC_CPU if PROC_FS diff --git a/arch/arm/configs/imx_v6_v7_defconfig b/arch/arm/configs/imx_v6_v7_defconfig index 6dc6fed12af8..345a67e67dbd 100644 --- a/arch/arm/configs/imx_v6_v7_defconfig +++ b/arch/arm/configs/imx_v6_v7_defconfig @@ -31,7 +31,7 @@ CONFIG_SOC_VF610=y CONFIG_SMP=y CONFIG_ARM_PSCI=y CONFIG_HIGHMEM=y -CONFIG_ARCH_FORCE_MAX_ORDER=14 +CONFIG_ARCH_FORCE_MAX_ORDER=13 CONFIG_CMDLINE="noinitrd console=ttymxc0,115200" CONFIG_KEXEC=y CONFIG_CPU_FREQ=y diff --git a/arch/arm/configs/milbeaut_m10v_defconfig b/arch/arm/configs/milbeaut_m10v_defconfig index bd29e5012cb0..385ad0f391a8 100644 --- a/arch/arm/configs/milbeaut_m10v_defconfig +++ b/arch/arm/configs/milbeaut_m10v_defconfig @@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y # CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set # CONFIG_ARM_PATCH_IDIV is not set CONFIG_HIGHMEM=y -CONFIG_ARCH_FORCE_MAX_ORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=11 CONFIG_SECCOMP=y CONFIG_KEXEC=y CONFIG_EFI=y diff --git a/arch/arm/configs/oxnas_v6_defconfig b/arch/arm/configs/oxnas_v6_defconfig index 70a67b3fc91b..90779812c6dd 100644 --- a/arch/arm/configs/oxnas_v6_defconfig +++ b/arch/arm/configs/oxnas_v6_defconfig @@ -12,7 +12,7 @@ CONFIG_ARCH_OXNAS=y CONFIG_MACH_OX820=y CONFIG_SMP=y CONFIG_NR_CPUS=16 -CONFIG_ARCH_FORCE_MAX_ORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=11 CONFIG_SECCOMP=y CONFIG_ARM_APPENDED_DTB=y CONFIG_ARM_ATAG_DTB_COMPAT=y diff --git a/arch/arm/configs/pxa_defconfig b/arch/arm/configs/pxa_defconfig index e656d3af2266..b46e39369dbb 100644 --- a/arch/arm/configs/pxa_defconfig +++ b/arch/arm/configs/pxa_defconfig @@ -20,7 +20,7 @@ CONFIG_PXA_SHARPSL=y CONFIG_MACH_AKITA=y CONFIG_MACH_BORZOI=y CONFIG_AEABI=y -CONFIG_ARCH_FORCE_MAX_ORDER=9 +CONFIG_ARCH_FORCE_MAX_ORDER=8 CONFIG_CMDLINE="root=/dev/ram0 ro" CONFIG_KEXEC=y CONFIG_CPU_FREQ=y diff --git a/arch/arm/configs/sama7_defconfig b/arch/arm/configs/sama7_defconfig index 0d964c613d71..954112041403 100644 --- a/arch/arm/configs/sama7_defconfig +++ b/arch/arm/configs/sama7_defconfig @@ -19,7 +19,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y # CONFIG_CACHE_L2X0 is not set # CONFIG_ARM_PATCH_IDIV is not set # CONFIG_CPU_SW_DOMAIN_PAN is not set -CONFIG_ARCH_FORCE_MAX_ORDER=15 +CONFIG_ARCH_FORCE_MAX_ORDER=14 CONFIG_UACCESS_WITH_MEMCPY=y # CONFIG_ATAGS is not set CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel" diff --git a/arch/arm/configs/sp7021_defconfig b/arch/arm/configs/sp7021_defconfig index 5bca2eb59b86..c6448ac860b6 100644 --- a/arch/arm/configs/sp7021_defconfig +++ b/arch/arm/configs/sp7021_defconfig @@ -17,7 +17,7 @@ CONFIG_ARCH_SUNPLUS=y # CONFIG_VDSO is not set CONFIG_SMP=y CONFIG_THUMB2_KERNEL=y -CONFIG_ARCH_FORCE_MAX_ORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=11 CONFIG_VFP=y CONFIG_NEON=y CONFIG_MODULES=y diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 1023e896d46b..cb5c6aa3254e 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1476,22 +1476,22 @@ config XEN # include/linux/mmzone.h requires the following to be true: # -# MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS +# MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS # -# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS + 1 - PAGE_SHIFT: +# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT: # # | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER | # ----+-------------------+--------------+-----------------+--------------------+ -# 4K | 27 | 12 | 16 | 11 | -# 16K | 27 | 14 | 14 | 12 | -# 64K | 29 | 16 | 14 | 14 | +# 4K | 27 | 12 | 15 | 10 | +# 16K | 27 | 14 | 13 | 11 | +# 64K | 29 | 16 | 13 | 13 | config ARCH_FORCE_MAX_ORDER int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES - default "14" if ARM64_64K_PAGES - range 12 14 if ARM64_16K_PAGES - default "12" if ARM64_16K_PAGES - range 11 16 if ARM64_4K_PAGES - default "11" + default "13" if ARM64_64K_PAGES + range 11 13 if ARM64_16K_PAGES + default "11" if ARM64_16K_PAGES + range 10 15 if ARM64_4K_PAGES + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -1500,14 +1500,11 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - We make sure that we can allocate up to a HugePage size for each configuration. Hence we have : - MAX_ORDER = (PMD_SHIFT - PAGE_SHIFT) + 1 => PAGE_SHIFT - 2 + MAX_ORDER = PMD_SHIFT - PAGE_SHIFT => PAGE_SHIFT - 3 - However for 4K, we choose a higher default value, 11 as opposed to 10, giving us + However for 4K, we choose a higher default value, 10 as opposed to 9, giving us 4M allocations matching the default size used by generic code. config UNMAP_KERNEL_AT_EL0 diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h index 4b73463423c3..5f5437621029 100644 --- a/arch/arm64/include/asm/sparsemem.h +++ b/arch/arm64/include/asm/sparsemem.h @@ -10,7 +10,7 @@ /* * Section size must be at least 512MB for 64K base * page size config. Otherwise it will be less than - * (MAX_ORDER - 1) and the build process will fail. + * MAX_ORDER and the build process will fail. */ #ifdef CONFIG_ARM64_64K_PAGES #define SECTION_SIZE_BITS 29 diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h index 0a048dc06a7d..fe5472a184a3 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h @@ -16,7 +16,7 @@ struct hyp_pool { * API at EL2. */ hyp_spinlock_t lock; - struct list_head free_area[MAX_ORDER]; + struct list_head free_area[MAX_ORDER + 1]; phys_addr_t range_start; phys_addr_t range_end; unsigned short max_order; diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c index 803ba3222e75..b1e392186a0f 100644 --- a/arch/arm64/kvm/hyp/nvhe/page_alloc.c +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c @@ -110,7 +110,7 @@ static void __hyp_attach_page(struct hyp_pool *pool, * after coalescing, so make sure to mark it HYP_NO_ORDER proactively. */ p->order = HYP_NO_ORDER; - for (; (order + 1) < pool->max_order; order++) { + for (; (order + 1) <= pool->max_order; order++) { buddy = __find_buddy_avail(pool, p, order); if (!buddy) break; @@ -203,9 +203,9 @@ void *hyp_alloc_pages(struct hyp_pool *pool, unsigned short order) hyp_spin_lock(&pool->lock); /* Look for a high-enough-order page */ - while (i < pool->max_order && list_empty(&pool->free_area[i])) + while (i <= pool->max_order && list_empty(&pool->free_area[i])) i++; - if (i >= pool->max_order) { + if (i > pool->max_order) { hyp_spin_unlock(&pool->lock); return NULL; } @@ -228,8 +228,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages, int i; hyp_spin_lock_init(&pool->lock); - pool->max_order = min(MAX_ORDER, get_order((nr_pages + 1) << PAGE_SHIFT)); - for (i = 0; i < pool->max_order; i++) + pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT)); + for (i = 0; i <= pool->max_order; i++) INIT_LIST_HEAD(&pool->free_area[i]); pool->range_start = phys; pool->range_end = phys + (nr_pages << PAGE_SHIFT); diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index dba02da6fa34..c694fac43bed 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -334,7 +334,7 @@ config HIGHMEM config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - default "11" + default "10" config DRAM_BASE hex "DRAM start addr (the same with memory-section in dts)" diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index d7e4a24e8644..0d2f41fa56ee 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -202,10 +202,10 @@ config IA64_CYCLONE If you're unsure, answer N. config ARCH_FORCE_MAX_ORDER - int "MAX_ORDER (11 - 17)" if !HUGETLB_PAGE - range 11 17 if !HUGETLB_PAGE - default "17" if HUGETLB_PAGE - default "11" + int "MAX_ORDER (10 - 16)" if !HUGETLB_PAGE + range 10 16 if !HUGETLB_PAGE + default "16" if HUGETLB_PAGE + default "10" config SMP bool "Symmetric multi-processing support" diff --git a/arch/ia64/include/asm/sparsemem.h b/arch/ia64/include/asm/sparsemem.h index 84e8ce387b69..a58f8b466d96 100644 --- a/arch/ia64/include/asm/sparsemem.h +++ b/arch/ia64/include/asm/sparsemem.h @@ -12,9 +12,9 @@ #define SECTION_SIZE_BITS (30) #define MAX_PHYSMEM_BITS (50) #ifdef CONFIG_ARCH_FORCE_MAX_ORDER -#if ((CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS) +#if (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT > SECTION_SIZE_BITS) #undef SECTION_SIZE_BITS -#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) +#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT) #endif #endif diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c index 380d2f3966c9..e8dd4323fb86 100644 --- a/arch/ia64/mm/hugetlbpage.c +++ b/arch/ia64/mm/hugetlbpage.c @@ -170,7 +170,7 @@ static int __init hugetlb_setup_sz(char *str) size = memparse(str, &str); if (*str || !is_power_of_2(size) || !(tr_pages & size) || size <= PAGE_SIZE || - size >= (1UL << PAGE_SHIFT << MAX_ORDER)) { + size > (1UL << PAGE_SHIFT << MAX_ORDER)) { printk(KERN_WARNING "Invalid huge page size specified\n"); return 1; } diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 7fd51257e0ed..272a3a12c98d 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -420,12 +420,12 @@ config NODES_SHIFT config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - range 14 64 if PAGE_SIZE_64KB - default "14" if PAGE_SIZE_64KB - range 12 64 if PAGE_SIZE_16KB - default "12" if PAGE_SIZE_16KB - range 11 64 - default "11" + range 13 63 if PAGE_SIZE_64KB + default "13" if PAGE_SIZE_64KB + range 11 63 if PAGE_SIZE_16KB + default "11" if PAGE_SIZE_16KB + range 10 63 + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -434,9 +434,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - The page size is not necessarily 4KB. Keep this in mind when choosing a value for this option. diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu index 9380f6e3bb66..c9df6572133f 100644 --- a/arch/m68k/Kconfig.cpu +++ b/arch/m68k/Kconfig.cpu @@ -400,7 +400,7 @@ config SINGLE_MEMORY_CHUNK config ARCH_FORCE_MAX_ORDER int "Maximum zone order" if ADVANCED depends on !SINGLE_MEMORY_CHUNK - default "11" + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -413,9 +413,6 @@ config ARCH_FORCE_MAX_ORDER value also defines the minimal size of the hole that allows freeing unused memory map. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - config 060_WRITETHROUGH bool "Use write-through caching for 68060 supervisor accesses" depends on ADVANCED && M68060 diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index e2f3ca73f40d..3e8b765b8c7b 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -2137,14 +2137,14 @@ endchoice config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - range 14 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB - default "14" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB - range 13 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB - default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB - range 12 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB - default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB - range 0 64 - default "11" + range 13 63 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB + default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB + range 12 63 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB + default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB + range 11 63 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB + default "11" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB + range 0 63 + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -2153,9 +2153,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - The page size is not necessarily 4KB. Keep this in mind when choosing a value for this option. diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig index a582f72104f3..89708b95978c 100644 --- a/arch/nios2/Kconfig +++ b/arch/nios2/Kconfig @@ -46,8 +46,8 @@ source "kernel/Kconfig.hz" config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - range 9 20 - default "11" + range 8 19 + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -56,9 +56,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - endmenu source "arch/nios2/platform/Kconfig.platform" diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 49c6d36b2b3e..24d56536b269 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -897,18 +897,18 @@ config DATA_SHIFT config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - range 8 9 if PPC64 && PPC_64K_PAGES - default "9" if PPC64 && PPC_64K_PAGES - range 13 13 if PPC64 && !PPC_64K_PAGES - default "13" if PPC64 && !PPC_64K_PAGES - range 9 64 if PPC32 && PPC_16K_PAGES - default "9" if PPC32 && PPC_16K_PAGES - range 7 64 if PPC32 && PPC_64K_PAGES - default "7" if PPC32 && PPC_64K_PAGES - range 5 64 if PPC32 && PPC_256K_PAGES - default "5" if PPC32 && PPC_256K_PAGES - range 11 64 - default "11" + range 7 8 if PPC64 && PPC_64K_PAGES + default "8" if PPC64 && PPC_64K_PAGES + range 12 12 if PPC64 && !PPC_64K_PAGES + default "12" if PPC64 && !PPC_64K_PAGES + range 8 63 if PPC32 && PPC_16K_PAGES + default "8" if PPC32 && PPC_16K_PAGES + range 6 63 if PPC32 && PPC_64K_PAGES + default "6" if PPC32 && PPC_64K_PAGES + range 4 63 if PPC32 && PPC_256K_PAGES + default "4" if PPC32 && PPC_256K_PAGES + range 10 63 + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -917,9 +917,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - The page size is not necessarily 4KB. For example, on 64-bit systems, 64KB pages can be enabled via CONFIG_PPC_64K_PAGES. Keep this in mind when choosing a value for this option. diff --git a/arch/powerpc/configs/85xx/ge_imp3a_defconfig b/arch/powerpc/configs/85xx/ge_imp3a_defconfig index ea719898b581..6cb7e90d52c1 100644 --- a/arch/powerpc/configs/85xx/ge_imp3a_defconfig +++ b/arch/powerpc/configs/85xx/ge_imp3a_defconfig @@ -30,7 +30,7 @@ CONFIG_PREEMPT=y # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set CONFIG_BINFMT_MISC=m CONFIG_MATH_EMULATION=y -CONFIG_ARCH_FORCE_MAX_ORDER=17 +CONFIG_ARCH_FORCE_MAX_ORDER=16 CONFIG_PCI=y CONFIG_PCIEPORTBUS=y CONFIG_PCI_MSI=y diff --git a/arch/powerpc/configs/fsl-emb-nonhw.config b/arch/powerpc/configs/fsl-emb-nonhw.config index ab8a8c4530d9..3009b0efaf34 100644 --- a/arch/powerpc/configs/fsl-emb-nonhw.config +++ b/arch/powerpc/configs/fsl-emb-nonhw.config @@ -41,7 +41,7 @@ CONFIG_FIXED_PHY=y CONFIG_FONT_8x16=y CONFIG_FONT_8x8=y CONFIG_FONTS=y -CONFIG_ARCH_FORCE_MAX_ORDER=13 +CONFIG_ARCH_FORCE_MAX_ORDER=12 CONFIG_FRAMEBUFFER_CONSOLE=y CONFIG_FRAME_WARN=1024 CONFIG_FTL=y diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c index 7fcfba162e0d..81d7185e2ae8 100644 --- a/arch/powerpc/mm/book3s64/iommu_api.c +++ b/arch/powerpc/mm/book3s64/iommu_api.c @@ -97,7 +97,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, } mmap_read_lock(mm); - chunk = (1UL << (PAGE_SHIFT + MAX_ORDER - 1)) / + chunk = (1UL << (PAGE_SHIFT + MAX_ORDER)) / sizeof(struct vm_area_struct *); chunk = min(chunk, entries); for (entry = 0; entry < entries; entry += chunk) { diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index f1ba8d1e8c1a..b900933507da 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -615,7 +615,7 @@ void __init gigantic_hugetlb_cma_reserve(void) order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT; if (order) { - VM_WARN_ON(order < MAX_ORDER); + VM_WARN_ON(order <= MAX_ORDER); hugetlb_cma_reserve(order); } } diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 4f6e20a35aa1..5a81f106068e 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1740,7 +1740,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) * DMA window can be larger than available memory, which will * cause errors later. */ - const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1); + const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER); /* * We create the default window as big as we can. The constraint is diff --git a/arch/sh/configs/ecovec24_defconfig b/arch/sh/configs/ecovec24_defconfig index b52e14ccb450..4d655e8d4d74 100644 --- a/arch/sh/configs/ecovec24_defconfig +++ b/arch/sh/configs/ecovec24_defconfig @@ -8,7 +8,7 @@ CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_BLK_DEV_BSG is not set CONFIG_CPU_SUBTYPE_SH7724=y -CONFIG_ARCH_FORCE_MAX_ORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=11 CONFIG_MEMORY_SIZE=0x10000000 CONFIG_FLATMEM_MANUAL=y CONFIG_SH_ECOVEC=y diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig index 411fdc0901f7..40271090bd7d 100644 --- a/arch/sh/mm/Kconfig +++ b/arch/sh/mm/Kconfig @@ -20,13 +20,13 @@ config PAGE_OFFSET config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - range 9 64 if PAGE_SIZE_16KB - default "9" if PAGE_SIZE_16KB - range 7 64 if PAGE_SIZE_64KB - default "7" if PAGE_SIZE_64KB - range 11 64 - default "14" if !MMU - default "11" + range 8 63 if PAGE_SIZE_16KB + default "8" if PAGE_SIZE_16KB + range 6 63 if PAGE_SIZE_64KB + default "6" if PAGE_SIZE_64KB + range 10 63 + default "13" if !MMU + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -35,9 +35,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - The page size is not necessarily 4KB. Keep this in mind when choosing a value for this option. diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index 84437a4c6545..e3242bf5a8df 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -271,7 +271,7 @@ config ARCH_SPARSEMEM_DEFAULT config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - default "13" + default "12" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -280,9 +280,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 13 means that the largest free memory block is 2^12 pages. - if SPARC64 || COMPILE_TEST source "kernel/power/Kconfig" endif diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c index 384480971805..7d91ca6aa675 100644 --- a/arch/sparc/kernel/pci_sun4v.c +++ b/arch/sparc/kernel/pci_sun4v.c @@ -193,7 +193,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size, size = IO_PAGE_ALIGN(size); order = get_order(size); - if (unlikely(order >= MAX_ORDER)) + if (unlikely(order > MAX_ORDER)) return NULL; npages = size >> IO_PAGE_SHIFT; diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c index 5b4de4a89dec..08ffd17d5ec3 100644 --- a/arch/sparc/kernel/traps_64.c +++ b/arch/sparc/kernel/traps_64.c @@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void) /* Now allocate error trap reporting scoreboard. */ sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info)); - for (order = 0; order < MAX_ORDER; order++) { + for (order = 0; order <= MAX_ORDER; order++) { if ((PAGE_SIZE << order) >= sz) break; } diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c index dba8dffe2113..5e2931a18409 100644 --- a/arch/sparc/mm/tsb.c +++ b/arch/sparc/mm/tsb.c @@ -402,8 +402,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss) unsigned long new_rss_limit; gfp_t gfp_flags; - if (max_tsb_size > (PAGE_SIZE << (MAX_ORDER - 1))) - max_tsb_size = (PAGE_SIZE << (MAX_ORDER - 1)); + if (max_tsb_size > PAGE_SIZE << MAX_ORDER) + max_tsb_size = PAGE_SIZE << MAX_ORDER; new_cache_index = 0; for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) { diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c index 5e5a9c8e0e5d..8dcda617b8bf 100644 --- a/arch/um/kernel/um_arch.c +++ b/arch/um/kernel/um_arch.c @@ -368,10 +368,10 @@ int __init linux_main(int argc, char **argv) max_physmem = TASK_SIZE - uml_physmem - iomem_size - MIN_VMALLOC; /* - * Zones have to begin on a 1 << MAX_ORDER-1 page boundary, + * Zones have to begin on a 1 << MAX_ORDER page boundary, * so this makes sure that's true for highmem */ - max_physmem &= ~((1 << (PAGE_SHIFT + MAX_ORDER - 1)) - 1); + max_physmem &= ~((1 << (PAGE_SHIFT + MAX_ORDER)) - 1); if (physmem_size + iomem_size > max_physmem) { highmem = physmem_size + iomem_size - max_physmem; physmem_size -= highmem; diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index bcb0c5d2abc2..3eee334ba873 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -773,7 +773,7 @@ config HIGHMEM config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - default "11" + default "10" help The kernel memory allocator divides physically contiguous memory blocks into "zones", where each zone is a power of two number of @@ -782,9 +782,6 @@ config ARCH_FORCE_MAX_ORDER blocks of physically contiguous memory, then you may need to increase this value. - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - endmenu menu "Power management options" diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c index 817eda2075aa..c491fabe3617 100644 --- a/drivers/base/regmap/regmap-debugfs.c +++ b/drivers/base/regmap/regmap-debugfs.c @@ -226,8 +226,8 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from, if (*ppos < 0 || !count) return -EINVAL; - if (count > (PAGE_SIZE << (MAX_ORDER - 1))) - count = PAGE_SIZE << (MAX_ORDER - 1); + if (count > (PAGE_SIZE << MAX_ORDER)) + count = PAGE_SIZE << MAX_ORDER; buf = kmalloc(count, GFP_KERNEL); if (!buf) @@ -373,8 +373,8 @@ static ssize_t regmap_reg_ranges_read_file(struct file *file, if (*ppos < 0 || !count) return -EINVAL; - if (count > (PAGE_SIZE << (MAX_ORDER - 1))) - count = PAGE_SIZE << (MAX_ORDER - 1); + if (count > (PAGE_SIZE << MAX_ORDER)) + count = PAGE_SIZE << MAX_ORDER; buf = kmalloc(count, GFP_KERNEL); if (!buf) diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 90d2dfb6448e..cec2c20f5e59 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -3079,7 +3079,7 @@ static void raw_cmd_free(struct floppy_raw_cmd **ptr) } } -#define MAX_LEN (1UL << (MAX_ORDER - 1) << PAGE_SHIFT) +#define MAX_LEN (1UL << MAX_ORDER << PAGE_SHIFT) static int raw_cmd_copyin(int cmd, void __user *param, struct floppy_raw_cmd **rcmd) diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c index e2f25926eb51..bf095baca244 100644 --- a/drivers/crypto/ccp/sev-dev.c +++ b/drivers/crypto/ccp/sev-dev.c @@ -886,7 +886,7 @@ static int sev_ioctl_do_get_id2(struct sev_issue_cmd *argp) /* * The length of the ID shouldn't be assumed by software since * it may change in the future. The allocation size is limited - * to 1 << (PAGE_SHIFT + MAX_ORDER - 1) by the page allocator. + * to 1 << (PAGE_SHIFT + MAX_ORDER) by the page allocator. * If the allocation fails, simply return ENOMEM rather than * warning in the kernel log. */ diff --git a/drivers/crypto/hisilicon/sgl.c b/drivers/crypto/hisilicon/sgl.c index 09586a837b1e..3df7a256e919 100644 --- a/drivers/crypto/hisilicon/sgl.c +++ b/drivers/crypto/hisilicon/sgl.c @@ -70,11 +70,11 @@ struct hisi_acc_sgl_pool *hisi_acc_create_sgl_pool(struct device *dev, HISI_ACC_SGL_ALIGN_SIZE); /* - * the pool may allocate a block of memory of size PAGE_SIZE * 2^(MAX_ORDER - 1), + * the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_ORDER, * block size may exceed 2^31 on ia64, so the max of block size is 2^31 */ - block_size = 1 << (PAGE_SHIFT + MAX_ORDER <= 32 ? - PAGE_SHIFT + MAX_ORDER - 1 : 31); + block_size = 1 << (PAGE_SHIFT + MAX_ORDER < 32 ? + PAGE_SHIFT + MAX_ORDER : 31); sgl_num_per_block = block_size / sgl_size; block_num = count / sgl_num_per_block; remain_sgl = count % sgl_num_per_block; diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c index eae9e9f6d3bf..6bc26b4b06b8 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c @@ -36,7 +36,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj) struct sg_table *st; struct scatterlist *sg; unsigned int npages; /* restricted by sg_alloc_table */ - int max_order = MAX_ORDER - 1; + int max_order = MAX_ORDER; unsigned int max_segment; gfp_t gfp; diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c index defece0bcb81..99f39a5feca1 100644 --- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c +++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c @@ -115,7 +115,7 @@ static int get_huge_pages(struct drm_i915_gem_object *obj) do { struct page *page; - GEM_BUG_ON(order >= MAX_ORDER); + GEM_BUG_ON(order > MAX_ORDER); page = alloc_pages(GFP | __GFP_ZERO, order); if (!page) goto err; diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index aa116a7bbae3..6c8585abe08d 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -65,11 +65,11 @@ module_param(page_pool_size, ulong, 0644); static atomic_long_t allocated_pages; -static struct ttm_pool_type global_write_combined[MAX_ORDER]; -static struct ttm_pool_type global_uncached[MAX_ORDER]; +static struct ttm_pool_type global_write_combined[MAX_ORDER + 1]; +static struct ttm_pool_type global_uncached[MAX_ORDER + 1]; -static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER]; -static struct ttm_pool_type global_dma32_uncached[MAX_ORDER]; +static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1]; +static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1]; static spinlock_t shrinker_lock; static struct list_head shrinker_list; @@ -405,7 +405,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt, else gfp_flags |= GFP_HIGHUSER; - for (order = min_t(unsigned int, MAX_ORDER - 1, __fls(num_pages)); + for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages)); num_pages; order = min_t(unsigned int, order, __fls(num_pages))) { struct ttm_pool_type *pt; @@ -542,7 +542,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev, if (use_dma_alloc) { for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) - for (j = 0; j < MAX_ORDER; ++j) + for (j = 0; j <= MAX_ORDER; ++j) ttm_pool_type_init(&pool->caching[i].orders[j], pool, i, j); } @@ -562,7 +562,7 @@ void ttm_pool_fini(struct ttm_pool *pool) if (pool->use_dma_alloc) { for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) - for (j = 0; j < MAX_ORDER; ++j) + for (j = 0; j <= MAX_ORDER; ++j) ttm_pool_type_fini(&pool->caching[i].orders[j]); } @@ -616,7 +616,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m) unsigned int i; seq_puts(m, "\t "); - for (i = 0; i < MAX_ORDER; ++i) + for (i = 0; i <= MAX_ORDER; ++i) seq_printf(m, " ---%2u---", i); seq_puts(m, "\n"); } @@ -627,7 +627,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt, { unsigned int i; - for (i = 0; i < MAX_ORDER; ++i) + for (i = 0; i <= MAX_ORDER; ++i) seq_printf(m, " %8u", ttm_pool_type_count(&pt[i])); seq_puts(m, "\n"); } @@ -736,7 +736,7 @@ int ttm_pool_mgr_init(unsigned long num_pages) spin_lock_init(&shrinker_lock); INIT_LIST_HEAD(&shrinker_list); - for (i = 0; i < MAX_ORDER; ++i) { + for (i = 0; i <= MAX_ORDER; ++i) { ttm_pool_type_init(&global_write_combined[i], NULL, ttm_write_combined, i); ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i); @@ -769,7 +769,7 @@ void ttm_pool_mgr_fini(void) { unsigned int i; - for (i = 0; i < MAX_ORDER; ++i) { + for (i = 0; i <= MAX_ORDER; ++i) { ttm_pool_type_fini(&global_write_combined[i]); ttm_pool_type_fini(&global_uncached[i]); diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index 8d772ea8a583..b574c58a3487 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -182,7 +182,7 @@ #ifdef CONFIG_CMA_ALIGNMENT #define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT) #else -#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER - 1) +#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER) #endif /* diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index ac996fd6bd9c..7a9f0b0bddbd 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -736,7 +736,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev, struct page **pages; unsigned int i = 0, nid = dev_to_node(dev); - order_mask &= GENMASK(MAX_ORDER - 1, 0); + order_mask &= GENMASK(MAX_ORDER, 0); if (!order_mask) return NULL; diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index 586271b8aa39..85790b870877 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -2440,8 +2440,8 @@ static bool its_parse_indirect_baser(struct its_node *its, * feature is not supported by hardware. */ new_order = max_t(u32, get_order(esz << ids), new_order); - if (new_order >= MAX_ORDER) { - new_order = MAX_ORDER - 1; + if (new_order > MAX_ORDER) { + new_order = MAX_ORDER; ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz); pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n", &its->phys_base, its_base_type_string[type], diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index cf077f9b30c3..733053c2eaa0 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -408,7 +408,7 @@ static void __cache_size_refresh(void) * If the allocation may fail we use __get_free_pages. Memory fragmentation * won't have a fatal effect here, but it just causes flushes of some other * buffers and more I/O will be performed. Don't use __get_free_pages if it - * always fails (i.e. order >= MAX_ORDER). + * always fails (i.e. order > MAX_ORDER). * * If the allocation shouldn't fail we use __vmalloc. This is only for the * initial reserve allocation, so there's no risk of wasting all vmalloc diff --git a/drivers/misc/genwqe/card_dev.c b/drivers/misc/genwqe/card_dev.c index d0e27438a73c..55fc5b80e649 100644 --- a/drivers/misc/genwqe/card_dev.c +++ b/drivers/misc/genwqe/card_dev.c @@ -443,7 +443,7 @@ static int genwqe_mmap(struct file *filp, struct vm_area_struct *vma) if (vsize == 0) return -EINVAL; - if (get_order(vsize) >= MAX_ORDER) + if (get_order(vsize) > MAX_ORDER) return -ENOMEM; dma_map = kzalloc(sizeof(struct dma_mapping), GFP_KERNEL); diff --git a/drivers/misc/genwqe/card_utils.c b/drivers/misc/genwqe/card_utils.c index ac29698d085a..1c798d6b2dfb 100644 --- a/drivers/misc/genwqe/card_utils.c +++ b/drivers/misc/genwqe/card_utils.c @@ -210,7 +210,7 @@ u32 genwqe_crc32(u8 *buff, size_t len, u32 init) void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size, dma_addr_t *dma_handle) { - if (get_order(size) >= MAX_ORDER) + if (get_order(size) > MAX_ORDER) return NULL; return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle, @@ -308,7 +308,7 @@ int genwqe_alloc_sync_sgl(struct genwqe_dev *cd, struct genwqe_sgl *sgl, sgl->write = write; sgl->sgl_size = genwqe_sgl_size(sgl->nr_pages); - if (get_order(sgl->sgl_size) >= MAX_ORDER) { + if (get_order(sgl->sgl_size) > MAX_ORDER) { dev_err(&pci_dev->dev, "[%s] err: too much memory requested!\n", __func__); return ret; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 25be7f8ac7cd..3973ca6adf4c 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -1041,7 +1041,7 @@ static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring) return; order = get_order(alloc_size); - if (order >= MAX_ORDER) { + if (order > MAX_ORDER) { if (net_ratelimit()) dev_warn(ring_to_dev(ring), "failed to allocate tx spare buffer, exceed to max order\n"); return; diff --git a/drivers/net/ethernet/ibm/ibmvnic.h b/drivers/net/ethernet/ibm/ibmvnic.h index b35c9b6f913b..4e18b4cefa97 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.h +++ b/drivers/net/ethernet/ibm/ibmvnic.h @@ -75,7 +75,7 @@ * pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160 * plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC. */ -#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << (MAX_ORDER - 1)) * PAGE_SIZE)) +#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_ORDER) * PAGE_SIZE)) #define IBMVNIC_ONE_LTB_SIZE min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX) #define IBMVNIC_LTB_SET_SIZE (38 << 20) diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c index ec3f6cf05f8c..34781dec3856 100644 --- a/drivers/video/fbdev/hyperv_fb.c +++ b/drivers/video/fbdev/hyperv_fb.c @@ -946,7 +946,7 @@ static phys_addr_t hvfb_get_phymem(struct hv_device *hdev, if (request_size == 0) return -1; - if (order < MAX_ORDER) { + if (order <= MAX_ORDER) { /* Call alloc_pages if the size is less than 2^MAX_ORDER */ page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order); if (!page) @@ -977,7 +977,7 @@ static void hvfb_release_phymem(struct hv_device *hdev, { unsigned int order = get_order(size); - if (order < MAX_ORDER) + if (order <= MAX_ORDER) __free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order); else dma_free_coherent(&hdev->device, diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c index 0374ee6b6d03..32e74e02a02f 100644 --- a/drivers/video/fbdev/vermilion/vermilion.c +++ b/drivers/video/fbdev/vermilion/vermilion.c @@ -197,7 +197,7 @@ static int vmlfb_alloc_vram(struct vml_info *vinfo, va = &vinfo->vram[i]; order = 0; - while (requested > (PAGE_SIZE << order) && order < MAX_ORDER) + while (requested > (PAGE_SIZE << order) && order <= MAX_ORDER) order++; err = vmlfb_alloc_vram_area(va, order, 0); diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 3f78a3a1eb75..5b15936a5214 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -33,7 +33,7 @@ #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \ __GFP_NOMEMALLOC) /* The order of free page blocks to report to host */ -#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (MAX_ORDER - 1) +#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER /* The size of a free page block in bytes */ #define VIRTIO_BALLOON_HINT_BLOCK_BYTES \ (1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT)) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 0c2892ec6817..835f6cc2fb66 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -1120,13 +1120,13 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn, */ static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages) { - unsigned long order = MAX_ORDER - 1; + unsigned long order = MAX_ORDER; unsigned long i; /* * We might get called for ranges that don't cover properly aligned - * MAX_ORDER - 1 pages; however, we can only online properly aligned - * pages with an order of MAX_ORDER - 1 at maximum. + * MAX_ORDER pages; however, we can only online properly aligned + * pages with an order of MAX_ORDER at maximum. */ while (!IS_ALIGNED(pfn | nr_pages, 1 << order)) order--; @@ -1237,9 +1237,9 @@ static void virtio_mem_online_page(struct virtio_mem *vm, bool do_online; /* - * We can get called with any order up to MAX_ORDER - 1. If our - * subblock size is smaller than that and we have a mixture of plugged - * and unplugged subblocks within such a page, we have to process in + * We can get called with any order up to MAX_ORDER. If our subblock + * size is smaller than that and we have a mixture of plugged and + * unplugged subblocks within such a page, we have to process in * smaller granularity. In that case we'll adjust the order exactly once * within the loop. */ diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c index 2f67516bb9bf..9fbb9b5256f7 100644 --- a/fs/ramfs/file-nommu.c +++ b/fs/ramfs/file-nommu.c @@ -70,7 +70,7 @@ int ramfs_nommu_expand_for_mapping(struct inode *inode, size_t newsize) /* make various checks */ order = get_order(newsize); - if (unlikely(order >= MAX_ORDER)) + if (unlikely(order > MAX_ORDER)) return -EFBIG; ret = inode_newsize_ok(inode, newsize); diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h index ef09b23d29e3..8ce14f9d202a 100644 --- a/include/drm/ttm/ttm_pool.h +++ b/include/drm/ttm/ttm_pool.h @@ -72,7 +72,7 @@ struct ttm_pool { bool use_dma32; struct { - struct ttm_pool_type orders[MAX_ORDER]; + struct ttm_pool_type orders[MAX_ORDER + 1]; } caching[TTM_NUM_CACHING_TYPES]; }; diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 7c977d234aba..8fb7d91cd0b1 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -818,7 +818,7 @@ static inline unsigned huge_page_shift(struct hstate *h) static inline bool hstate_is_gigantic(struct hstate *h) { - return huge_page_order(h) >= MAX_ORDER; + return huge_page_order(h) > MAX_ORDER; } static inline unsigned int pages_per_huge_page(const struct hstate *h) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index bf8786d45b31..35b11cc210d9 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -26,11 +26,11 @@ /* Free memory management - zoned buddy allocator. */ #ifndef CONFIG_ARCH_FORCE_MAX_ORDER -#define MAX_ORDER 11 +#define MAX_ORDER 10 #else #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER #endif -#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1)) +#define MAX_ORDER_NR_PAGES (1 << MAX_ORDER) /* * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed @@ -93,7 +93,7 @@ static inline bool migratetype_is_mergeable(int mt) } #define for_each_migratetype_order(order, type) \ - for (order = 0; order < MAX_ORDER; order++) \ + for (order = 0; order <= MAX_ORDER; order++) \ for (type = 0; type < MIGRATE_TYPES; type++) extern int page_group_by_mobility_disabled; @@ -922,7 +922,7 @@ struct zone { CACHELINE_PADDING(_pad1_); /* free areas of different sizes */ - struct free_area free_area[MAX_ORDER]; + struct free_area free_area[MAX_ORDER + 1]; /* zone flags, see below */ unsigned long flags; @@ -1745,7 +1745,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes) #define SECTION_BLOCKFLAGS_BITS \ ((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS) -#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS +#if (MAX_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS #error Allocator MAX_ORDER exceeds SECTION_SIZE #endif diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 5f1ae07d724b..e83c4c095041 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -41,14 +41,14 @@ extern unsigned int pageblock_order; * Huge pages are a constant size, but don't exceed the maximum allocation * granularity. */ -#define pageblock_order min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1) +#define pageblock_order min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER) #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ #else /* CONFIG_HUGETLB_PAGE */ /* If huge pages are not used, group by MAX_ORDER_NR_PAGES */ -#define pageblock_order (MAX_ORDER-1) +#define pageblock_order MAX_ORDER #endif /* CONFIG_HUGETLB_PAGE */ diff --git a/include/linux/slab.h b/include/linux/slab.h index 45af70315a94..aa4575ef2965 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -284,7 +284,7 @@ static inline unsigned int arch_slab_minalign(void) * (PAGE_SIZE*2). Larger requests are passed to the page allocator. */ #define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1) -#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1) +#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT) #ifndef KMALLOC_SHIFT_LOW #define KMALLOC_SHIFT_LOW 5 #endif @@ -292,7 +292,7 @@ static inline unsigned int arch_slab_minalign(void) #ifdef CONFIG_SLUB #define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1) -#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1) +#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT) #ifndef KMALLOC_SHIFT_LOW #define KMALLOC_SHIFT_LOW 3 #endif @@ -305,7 +305,7 @@ static inline unsigned int arch_slab_minalign(void) * be allocated from the same page. */ #define KMALLOC_SHIFT_HIGH PAGE_SHIFT -#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1) +#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT) #ifndef KMALLOC_SHIFT_LOW #define KMALLOC_SHIFT_LOW 3 #endif diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 755f5f08ab38..90ce1dfd591c 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -474,7 +474,7 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_OFFSET(list_head, prev); VMCOREINFO_OFFSET(vmap_area, va_start); VMCOREINFO_OFFSET(vmap_area, list); - VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER); + VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1); log_buf_vmcoreinfo_setup(); VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES); VMCOREINFO_NUMBER(NR_FREE_PAGES); diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c index 4d40dcce7604..1acec2e22827 100644 --- a/kernel/dma/pool.c +++ b/kernel/dma/pool.c @@ -84,8 +84,8 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size, void *addr; int ret = -ENOMEM; - /* Cannot allocate larger than MAX_ORDER-1 */ - order = min(get_order(pool_size), MAX_ORDER-1); + /* Cannot allocate larger than MAX_ORDER */ + order = min(get_order(pool_size), MAX_ORDER); do { pool_size = 1 << (PAGE_SHIFT + order); @@ -190,7 +190,7 @@ static int __init dma_atomic_pool_init(void) /* * If coherent_pool was not used on the command line, default the pool - * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER-1. + * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER. */ if (!atomic_pool_size) { unsigned long pages = totalram_pages() / (SZ_1G / SZ_128K); diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index d6bbdb7830b2..a0433f37b024 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -609,8 +609,8 @@ static struct page *rb_alloc_aux_page(int node, int order) { struct page *page; - if (order >= MAX_ORDER) - order = MAX_ORDER - 1; + if (order > MAX_ORDER) + order = MAX_ORDER; do { page = alloc_pages_node(node, PERF_AUX_GFP, order); @@ -814,7 +814,7 @@ struct perf_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags) size = sizeof(struct perf_buffer); size += nr_pages * sizeof(void *); - if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER) + if (order_base_2(size) > PAGE_SHIFT+MAX_ORDER) goto fail; node = (cpu == -1) ? cpu : cpu_to_node(cpu); diff --git a/mm/Kconfig b/mm/Kconfig index ca98b2072df5..969286ab14a1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -346,9 +346,9 @@ config SHUFFLE_PAGE_ALLOCATOR the presence of a memory-side-cache. There are also incidental security benefits as it reduces the predictability of page allocations to compliment SLAB_FREELIST_RANDOM, but the - default granularity of shuffling on the "MAX_ORDER - 1" i.e, - 10th order of pages is selected based on cache utilization - benefits on x86. + default granularity of shuffling on the MAX_ORDER i.e, 10th + order of pages is selected based on cache utilization benefits + on x86. While the randomization improves cache utilization it may negatively impact workloads on platforms without a cache. For @@ -666,8 +666,8 @@ config HUGETLB_PAGE_SIZE_VARIABLE HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available on a platform. - Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be - clamped down to MAX_ORDER - 1. + Note that the pageblock_order cannot exceed MAX_ORDER and will be + clamped down to MAX_ORDER. config CONTIG_ALLOC def_bool (MEMORY_ISOLATION && COMPACTION) || CMA diff --git a/mm/compaction.c b/mm/compaction.c index 5a9501e0ae01..709136556b9e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -583,7 +583,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc, if (PageCompound(page)) { const unsigned int order = compound_order(page); - if (likely(order < MAX_ORDER)) { + if (likely(order <= MAX_ORDER)) { blockpfn += (1UL << order) - 1; cursor += (1UL << order) - 1; } @@ -938,7 +938,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, * a valid page order. Consider only values in the * valid order range to prevent low_pfn overflow. */ - if (freepage_order > 0 && freepage_order < MAX_ORDER) + if (freepage_order > 0 && freepage_order <= MAX_ORDER) low_pfn += (1UL << freepage_order) - 1; continue; } @@ -954,7 +954,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (PageCompound(page) && !cc->alloc_contig) { const unsigned int order = compound_order(page); - if (likely(order < MAX_ORDER)) + if (likely(order <= MAX_ORDER)) low_pfn += (1UL << order) - 1; goto isolate_fail; } @@ -2124,7 +2124,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) /* Direct compactor: Is a suitable page free? */ ret = COMPACT_NO_SUITABLE_PAGE; - for (order = cc->order; order < MAX_ORDER; order++) { + for (order = cc->order; order <= MAX_ORDER; order++) { struct free_area *area = &cc->zone->free_area[order]; bool can_steal; diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 4362021b1ce7..c54177aabebd 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -1086,7 +1086,7 @@ debug_vm_pgtable_alloc_huge_page(struct pgtable_debug_args *args, int order) struct page *page = NULL; #ifdef CONFIG_CONTIG_ALLOC - if (order >= MAX_ORDER) { + if (order > MAX_ORDER) { page = alloc_contig_pages((1 << order), GFP_KERNEL, first_online_node, NULL); if (page) { @@ -1096,7 +1096,7 @@ debug_vm_pgtable_alloc_huge_page(struct pgtable_debug_args *args, int order) } #endif - if (order < MAX_ORDER) + if (order <= MAX_ORDER) page = alloc_pages(GFP_KERNEL, order); return page; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2d860e70fe88..1df386d13d24 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -467,7 +467,7 @@ static int __init hugepage_init(void) /* * hugepages can't be allocated by the buddy allocator */ - MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER); + MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_ORDER); /* * we use page->mapping and page->index in second tail page * as list_head: assuming THP order >= 2 diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 712e32b38295..9122e50ae02a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2090,7 +2090,7 @@ pgoff_t hugetlb_basepage_index(struct page *page) pgoff_t index = page_index(page_head); unsigned long compound_idx; - if (compound_order(page_head) >= MAX_ORDER) + if (compound_order(page_head) > MAX_ORDER) compound_idx = page_to_pfn(page) - page_to_pfn(page_head); else compound_idx = page - page_head; @@ -4497,7 +4497,7 @@ static int __init default_hugepagesz_setup(char *s) * The number of default huge pages (for this size) could have been * specified as the first hugetlb parameter: hugepages=X. If so, * then default_hstate_max_huge_pages is set. If the default huge - * page size is gigantic (>= MAX_ORDER), then the pages must be + * page size is gigantic (> MAX_ORDER), then the pages must be * allocated here from bootmem allocator. */ if (default_hstate_max_huge_pages) { diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c index 7fb794242fad..ffedf4dbc49d 100644 --- a/mm/kmsan/init.c +++ b/mm/kmsan/init.c @@ -96,7 +96,7 @@ void __init kmsan_init_shadow(void) struct metadata_page_pair { struct page *shadow, *origin; }; -static struct metadata_page_pair held_back[MAX_ORDER] __initdata; +static struct metadata_page_pair held_back[MAX_ORDER + 1] __initdata; /* * Eager metadata allocation. When the memblock allocator is freeing pages to @@ -211,8 +211,8 @@ static void kmsan_memblock_discard(void) * order=N-1, * - repeat. */ - collect.order = MAX_ORDER - 1; - for (int i = MAX_ORDER - 1; i >= 0; i--) { + collect.order = MAX_ORDER; + for (int i = MAX_ORDER; i >= 0; i--) { if (held_back[i].shadow) smallstack_push(&collect, held_back[i].shadow); if (held_back[i].origin) diff --git a/mm/memblock.c b/mm/memblock.c index 25fd0626a9e7..7911224b1ed3 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2043,7 +2043,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end) int order; while (start < end) { - order = min(MAX_ORDER - 1UL, __ffs(start)); + order = min_t(int, MAX_ORDER, __ffs(start)); while (start + (1UL << order) > end) order--; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index db3b270254f1..c8f0a8c2d049 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -596,7 +596,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages) unsigned long pfn; /* - * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might + * Online the pages in MAX_ORDER aligned chunks. The callback might * decide to not expose all pages to the buddy (e.g., expose them * later). We account all pages as being online and belonging to this * zone ("present"). @@ -605,7 +605,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages) * this and the first chunk to online will be pageblock_nr_pages. */ for (pfn = start_pfn; pfn < end_pfn;) { - int order = min(MAX_ORDER - 1UL, __ffs(pfn)); + int order = min_t(int, MAX_ORDER, __ffs(pfn)); (*online_page_callback)(pfn_to_page(pfn), order); pfn += (1UL << order); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0936bde1d486..c3e49d028a7a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1063,7 +1063,7 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, unsigned long higher_page_pfn; struct page *higher_page; - if (order >= MAX_ORDER - 2) + if (order >= MAX_ORDER - 1) return false; higher_page_pfn = buddy_pfn & pfn; @@ -1118,7 +1118,7 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page); - while (order < MAX_ORDER - 1) { + while (order < MAX_ORDER) { if (compaction_capture(capc, page, order, migratetype)) { __mod_zone_freepage_state(zone, -(1 << order), migratetype); @@ -2499,7 +2499,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, struct page *page; /* Find a page of the appropriate size in the preferred list */ - for (current_order = order; current_order < MAX_ORDER; ++current_order) { + for (current_order = order; current_order <= MAX_ORDER; ++current_order) { area = &(zone->free_area[current_order]); page = get_page_from_free_area(area, migratetype); if (!page) @@ -2871,7 +2871,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, continue; spin_lock_irqsave(&zone->lock, flags); - for (order = 0; order < MAX_ORDER; order++) { + for (order = 0; order <= MAX_ORDER; order++) { struct free_area *area = &(zone->free_area[order]); page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC); @@ -2955,7 +2955,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, * approximates finding the pageblock with the most free pages, which * would be too costly to do exactly. */ - for (current_order = MAX_ORDER - 1; current_order >= min_order; + for (current_order = MAX_ORDER; current_order >= min_order; --current_order) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, @@ -2981,7 +2981,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, return false; find_smallest: - for (current_order = order; current_order < MAX_ORDER; + for (current_order = order; current_order <= MAX_ORDER; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, @@ -2994,7 +2994,7 @@ find_smallest: * This should not happen - we already found a suitable fallback * when looking for the largest page. */ - VM_BUG_ON(current_order == MAX_ORDER); + VM_BUG_ON(current_order > MAX_ORDER); do_steal: page = get_page_from_free_area(area, fallback_mt); @@ -3955,7 +3955,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, return true; /* For a high-order request, check at least one suitable page is free */ - for (o = order; o < MAX_ORDER; o++) { + for (o = order; o <= MAX_ORDER; o++) { struct free_area *area = &z->free_area[o]; int mt; @@ -5475,7 +5475,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, * There are several places where we assume that the order value is sane * so bail out early if the request is out of bound. */ - if (WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp)) + if (WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp)) return NULL; gfp &= gfp_allowed_mask; @@ -6205,8 +6205,8 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i for_each_populated_zone(zone) { unsigned int order; - unsigned long nr[MAX_ORDER], flags, total = 0; - unsigned char types[MAX_ORDER]; + unsigned long nr[MAX_ORDER + 1], flags, total = 0; + unsigned char types[MAX_ORDER + 1]; if (zone_idx(zone) > max_zone_idx) continue; @@ -6216,7 +6216,7 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i printk(KERN_CONT "%s: ", zone->name); spin_lock_irqsave(&zone->lock, flags); - for (order = 0; order < MAX_ORDER; order++) { + for (order = 0; order <= MAX_ORDER; order++) { struct free_area *area = &zone->free_area[order]; int type; @@ -6230,7 +6230,7 @@ void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_i } } spin_unlock_irqrestore(&zone->lock, flags); - for (order = 0; order < MAX_ORDER; order++) { + for (order = 0; order <= MAX_ORDER; order++) { printk(KERN_CONT "%lu*%lukB ", nr[order], K(1UL) << order); if (nr[order]) @@ -7581,7 +7581,7 @@ static inline void setup_usemap(struct zone *zone) {} /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */ void __init set_pageblock_order(void) { - unsigned int order = MAX_ORDER - 1; + unsigned int order = MAX_ORDER; /* Check that pageblock_nr_pages has not already been setup */ if (pageblock_order) @@ -9076,7 +9076,7 @@ void *__init alloc_large_system_hash(const char *tablename, else table = memblock_alloc_raw(size, SMP_CACHE_BYTES); - } else if (get_order(size) >= MAX_ORDER || hashdist) { + } else if (get_order(size) > MAX_ORDER || hashdist) { table = vmalloc_huge(size, gfp_flags); virt = true; if (table) @@ -9290,7 +9290,7 @@ int alloc_contig_range(unsigned long start, unsigned long end, order = 0; outer_start = start; while (!PageBuddy(pfn_to_page(outer_start))) { - if (++order >= MAX_ORDER) { + if (++order > MAX_ORDER) { outer_start = start; break; } @@ -9540,7 +9540,7 @@ bool is_free_buddy_page(struct page *page) unsigned long pfn = page_to_pfn(page); unsigned int order; - for (order = 0; order < MAX_ORDER; order++) { + for (order = 0; order <= MAX_ORDER; order++) { struct page *page_head = page - (pfn & ((1 << order) - 1)); if (PageBuddy(page_head) && @@ -9548,7 +9548,7 @@ bool is_free_buddy_page(struct page *page) break; } - return order < MAX_ORDER; + return order <= MAX_ORDER; } EXPORT_SYMBOL(is_free_buddy_page); @@ -9599,7 +9599,7 @@ bool take_page_off_buddy(struct page *page) bool ret = false; spin_lock_irqsave(&zone->lock, flags); - for (order = 0; order < MAX_ORDER; order++) { + for (order = 0; order <= MAX_ORDER; order++) { struct page *page_head = page - (pfn & ((1 << order) - 1)); int page_order = buddy_order(page_head); diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 47fbc1696466..c6f3605e37ab 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -226,7 +226,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) */ if (PageBuddy(page)) { order = buddy_order(page); - if (order >= pageblock_order && order < MAX_ORDER - 1) { + if (order >= pageblock_order && order < MAX_ORDER) { buddy = find_buddy_page_pfn(page, page_to_pfn(page), order, NULL); if (buddy && !is_migrate_isolate_page(buddy)) { @@ -290,11 +290,11 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * isolate_single_pageblock() * @migratetype: migrate type to set in error recovery. * - * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one + * Free and in-use pages can be as big as MAX_ORDER and contain more than one * pageblock. When not all pageblocks within a page are isolated at the same * time, free page accounting can go wrong. For example, in the case of - * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pagelbocks. - * [ MAX_ORDER-1 ] + * MAX_ORDER = pageblock_order + 1, a MAX_ORDER page has two pagelbocks. + * [ MAX_ORDER ] * [ pageblock0 | pageblock1 ] * When either pageblock is isolated, if it is a free page, the page is not * split into separate migratetype lists, which is supposed to; if it is an @@ -451,7 +451,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, * the free page to the right migratetype list. * * head_pfn is not used here as a hugetlb page order - * can be bigger than MAX_ORDER-1, but after it is + * can be bigger than MAX_ORDER, but after it is * freed, the free page order is not. Use pfn within * the range to find the head of the free page. */ @@ -459,7 +459,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, outer_pfn = pfn; while (!PageBuddy(pfn_to_page(outer_pfn))) { /* stop if we cannot find the free page */ - if (++order >= MAX_ORDER) + if (++order > MAX_ORDER) goto failed; outer_pfn &= ~0UL << order; } diff --git a/mm/page_owner.c b/mm/page_owner.c index 220cdeddc295..31169b3e7f06 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -315,7 +315,7 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, unsigned long freepage_order; freepage_order = buddy_order_unsafe(page); - if (freepage_order < MAX_ORDER) + if (freepage_order <= MAX_ORDER) pfn += (1UL << freepage_order) - 1; continue; } @@ -549,7 +549,7 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos) if (PageBuddy(page)) { unsigned long freepage_order = buddy_order_unsafe(page); - if (freepage_order < MAX_ORDER) + if (freepage_order <= MAX_ORDER) pfn += (1UL << freepage_order) - 1; continue; } @@ -657,7 +657,7 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone) if (PageBuddy(page)) { unsigned long order = buddy_order_unsafe(page); - if (order > 0 && order < MAX_ORDER) + if (order > 0 && order <= MAX_ORDER) pfn += (1UL << order) - 1; continue; } diff --git a/mm/page_reporting.c b/mm/page_reporting.c index 275b466de37b..b021f482a4cb 100644 --- a/mm/page_reporting.c +++ b/mm/page_reporting.c @@ -20,7 +20,7 @@ static int page_order_update_notify(const char *val, const struct kernel_param * * If param is set beyond this limit, order is set to default * pageblock_order value */ - return param_set_uint_minmax(val, kp, 0, MAX_ORDER-1); + return param_set_uint_minmax(val, kp, 0, MAX_ORDER); } static const struct kernel_param_ops page_reporting_param_ops = { @@ -276,7 +276,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev, return err; /* Process each free list starting from lowest order/mt */ - for (order = page_reporting_order; order < MAX_ORDER; order++) { + for (order = page_reporting_order; order <= MAX_ORDER; order++) { for (mt = 0; mt < MIGRATE_TYPES; mt++) { /* We do not pull pages from the isolate free list */ if (is_migrate_isolate(mt)) @@ -370,7 +370,7 @@ int page_reporting_register(struct page_reporting_dev_info *prdev) */ if (page_reporting_order == -1) { - if (prdev->order > 0 && prdev->order < MAX_ORDER) + if (prdev->order > 0 && prdev->order <= MAX_ORDER) page_reporting_order = prdev->order; else page_reporting_order = pageblock_order; diff --git a/mm/shuffle.h b/mm/shuffle.h index cec62984f7d3..a6bdf54f96f1 100644 --- a/mm/shuffle.h +++ b/mm/shuffle.h @@ -4,7 +4,7 @@ #define _MM_SHUFFLE_H #include -#define SHUFFLE_ORDER (MAX_ORDER-1) +#define SHUFFLE_ORDER MAX_ORDER #ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key); diff --git a/mm/slab.c b/mm/slab.c index edbe722fb906..6b7c172158e5 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -465,7 +465,7 @@ static int __init slab_max_order_setup(char *str) { get_option(&str, &slab_max_order); slab_max_order = slab_max_order < 0 ? 0 : - min(slab_max_order, MAX_ORDER - 1); + min(slab_max_order, MAX_ORDER); slab_max_order_set = true; return 1; diff --git a/mm/slub.c b/mm/slub.c index 32eb6b50fe18..f49d669ff604 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4171,8 +4171,8 @@ static inline int calculate_order(unsigned int size) /* * Doh this slab cannot be placed using slub_max_order. */ - order = calc_slab_order(size, 1, MAX_ORDER - 1, 1); - if (order < MAX_ORDER) + order = calc_slab_order(size, 1, MAX_ORDER, 1); + if (order <= MAX_ORDER) return order; return -ENOSYS; } @@ -4697,7 +4697,7 @@ __setup("slub_min_order=", setup_slub_min_order); static int __init setup_slub_max_order(char *str) { get_option(&str, (int *)&slub_max_order); - slub_max_order = min(slub_max_order, (unsigned int)MAX_ORDER - 1); + slub_max_order = min_t(unsigned int, slub_max_order, MAX_ORDER); return 1; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 8faac4310cb5..98719e72b5e2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -7002,7 +7002,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, * scan_control uses s8 fields for order, priority, and reclaim_idx. * Confirm they are large enough for max values. */ - BUILD_BUG_ON(MAX_ORDER > S8_MAX); + BUILD_BUG_ON(MAX_ORDER >= S8_MAX); BUILD_BUG_ON(DEF_PRIORITY > S8_MAX); BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX); diff --git a/mm/vmstat.c b/mm/vmstat.c index 1ea6a5ce1c41..b7307627772d 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1055,7 +1055,7 @@ static void fill_contig_page_info(struct zone *zone, info->free_blocks_total = 0; info->free_blocks_suitable = 0; - for (order = 0; order < MAX_ORDER; order++) { + for (order = 0; order <= MAX_ORDER; order++) { unsigned long blocks; /* @@ -1088,7 +1088,7 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in { unsigned long requested = 1UL << order; - if (WARN_ON_ONCE(order >= MAX_ORDER)) + if (WARN_ON_ONCE(order > MAX_ORDER)) return 0; if (!info->free_blocks_total) @@ -1462,7 +1462,7 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, int order; seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); - for (order = 0; order < MAX_ORDER; ++order) + for (order = 0; order <= MAX_ORDER; ++order) /* * Access to nr_free is lockless as nr_free is used only for * printing purposes. Use data_race to avoid KCSAN warning. @@ -1491,7 +1491,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m, pgdat->node_id, zone->name, migratetype_names[mtype]); - for (order = 0; order < MAX_ORDER; ++order) { + for (order = 0; order <= MAX_ORDER; ++order) { unsigned long freecount = 0; struct free_area *area; struct list_head *curr; @@ -1531,7 +1531,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg) /* Print header */ seq_printf(m, "%-43s ", "Free pages count per migrate type at order"); - for (order = 0; order < MAX_ORDER; ++order) + for (order = 0; order <= MAX_ORDER; ++order) seq_printf(m, "%6d ", order); seq_putc(m, '\n'); @@ -2153,7 +2153,7 @@ static void unusable_show_print(struct seq_file *m, seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); - for (order = 0; order < MAX_ORDER; ++order) { + for (order = 0; order <= MAX_ORDER; ++order) { fill_contig_page_info(zone, order, &info); index = unusable_free_index(order, &info); seq_printf(m, "%d.%03d ", index / 1000, index % 1000); @@ -2205,7 +2205,7 @@ static void extfrag_show_print(struct seq_file *m, seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); - for (order = 0; order < MAX_ORDER; ++order) { + for (order = 0; order <= MAX_ORDER; ++order) { fill_contig_page_info(zone, order, &info); index = __fragmentation_index(order, &info); seq_printf(m, "%2d.%03d ", index / 1000, index % 1000); diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c index 854772dd52fd..9b66d6aeeb1a 100644 --- a/net/smc/smc_ib.c +++ b/net/smc/smc_ib.c @@ -843,7 +843,7 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev) goto out; /* the calculated number of cq entries fits to mlx5 cq allocation */ cqe_size_order = cache_line_size() == 128 ? 7 : 6; - smc_order = MAX_ORDER - cqe_size_order - 1; + smc_order = MAX_ORDER - cqe_size_order; if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE) cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2; smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev, diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c index 64499056648a..51ad29940f05 100644 --- a/security/integrity/ima/ima_crypto.c +++ b/security/integrity/ima/ima_crypto.c @@ -38,7 +38,7 @@ static int param_set_bufsize(const char *val, const struct kernel_param *kp) size = memparse(val, NULL); order = get_order(size); - if (order >= MAX_ORDER) + if (order > MAX_ORDER) return -EINVAL; ima_maxorder = order; ima_bufsize = PAGE_SIZE << order; diff --git a/tools/testing/memblock/linux/mmzone.h b/tools/testing/memblock/linux/mmzone.h index e65f89b12f1c..134f8eab0768 100644 --- a/tools/testing/memblock/linux/mmzone.h +++ b/tools/testing/memblock/linux/mmzone.h @@ -17,10 +17,10 @@ enum zone_type { }; #define MAX_NR_ZONES __MAX_NR_ZONES -#define MAX_ORDER 11 -#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1)) +#define MAX_ORDER 10 +#define MAX_ORDER_NR_PAGES (1 << MAX_ORDER) -#define pageblock_order (MAX_ORDER - 1) +#define pageblock_order MAX_ORDER #define pageblock_nr_pages BIT(pageblock_order) #define pageblock_align(pfn) ALIGN((pfn), pageblock_nr_pages) #define pageblock_start_pfn(pfn) ALIGN_DOWN((pfn), pageblock_nr_pages) -- cgit v1.2.3-70-g09d2 From 0289184476c845968ad6ac9083c96cc0f75ca505 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Tue, 14 Mar 2023 15:12:50 -0700 Subject: mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs UFFDIO_COPY already has UFFDIO_COPY_MODE_WP, so when installing a new PTE to resolve a missing fault, one can install a write-protected one. This is useful when using UFFDIO_REGISTER_MODE_{MISSING,WP} in combination. This was motivated by testing HugeTLB HGM [1], and in particular its interaction with userfaultfd features. Existing userfaultfd code supports using WP and MINOR modes together (i.e. you can register an area with both enabled), but without this CONTINUE flag the combination is in practice unusable. So, add an analogous UFFDIO_CONTINUE_MODE_WP, which does the same thing as UFFDIO_COPY_MODE_WP, but for *minor* faults. Update the selftest to do some very basic exercising of the new flag. Update Documentation/ to describe how these flags are used (neither the COPY nor the new CONTINUE versions of this mode flag were described there before). [1]: https://patchwork.kernel.org/project/linux-mm/cover/20230218002819.1486479-1-jthoughton@google.com/ Link: https://lkml.kernel.org/r/20230314221250.682452-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Acked-by: Peter Xu Acked-by: Mike Rapoport (IBM) Cc: Al Viro Cc: Hugh Dickins Cc: Jan Kara Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Mike Kravetz Cc: Muchun Song Cc: Nadav Amit Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/userfaultfd.rst | 8 ++++++++ fs/userfaultfd.c | 8 ++++++-- include/linux/userfaultfd_k.h | 3 ++- include/uapi/linux/userfaultfd.h | 7 +++++++ mm/userfaultfd.c | 5 +++-- tools/testing/selftests/mm/userfaultfd.c | 4 ++++ 6 files changed, 30 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index bd2226299583..7c304e432205 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -236,6 +236,14 @@ newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED and set the feature bit in advance to make sure none ptes will also be write protected even upon anonymous memory. +When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either +``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when +resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE`` +respectively, it may be desirable for the new page / mapping to be +write-protected (so future writes will also result in a WP fault). These ioctls +support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP`` +respectively) to configure the mapping this way. + QEMU/KVM ======== diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 8971c3613cc6..8395605790f6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1893,6 +1893,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) struct uffdio_continue uffdio_continue; struct uffdio_continue __user *user_uffdio_continue; struct userfaultfd_wake_range range; + uffd_flags_t flags = 0; user_uffdio_continue = (struct uffdio_continue __user *)arg; @@ -1917,13 +1918,16 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) uffdio_continue.range.start) { goto out; } - if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE) + if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE | + UFFDIO_CONTINUE_MODE_WP)) goto out; + if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_WP) + flags |= MFILL_ATOMIC_WP; if (mmget_not_zero(ctx->mm)) { ret = mfill_atomic_continue(ctx->mm, uffdio_continue.range.start, uffdio_continue.range.len, - &ctx->mmap_changing); + &ctx->mmap_changing, flags); mmput(ctx->mm); } else { return -ESRCH; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 4c477dece540..a2c53e98dfd6 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -83,7 +83,8 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long len, atomic_t *mmap_changing); extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start, - unsigned long len, atomic_t *mmap_changing); + unsigned long len, atomic_t *mmap_changing, + uffd_flags_t flags); extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool enable_wp, atomic_t *mmap_changing); diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 90c958952bfc..66dd4cd277bd 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -305,6 +305,13 @@ struct uffdio_writeprotect { struct uffdio_continue { struct uffdio_range range; #define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0) + /* + * UFFDIO_CONTINUE_MODE_WP will map the page write protected on + * the fly. UFFDIO_CONTINUE_MODE_WP is available only if the + * write protected ioctl is implemented for the range + * according to the uffdio_register.ioctls. + */ +#define UFFDIO_CONTINUE_MODE_WP ((__u64)1<<1) __u64 mode; /* diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index a9b19b39413d..7f1b5f8b712c 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -693,10 +693,11 @@ ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long start, } ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start, - unsigned long len, atomic_t *mmap_changing) + unsigned long len, atomic_t *mmap_changing, + uffd_flags_t flags) { return mfill_atomic(dst_mm, start, 0, len, mmap_changing, - uffd_flags_set_mode(0, MFILL_ATOMIC_CONTINUE)); + uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE)); } long uffd_wp_range(struct vm_area_struct *dst_vma, diff --git a/tools/testing/selftests/mm/userfaultfd.c b/tools/testing/selftests/mm/userfaultfd.c index e030d63c031a..a96d126cb40e 100644 --- a/tools/testing/selftests/mm/userfaultfd.c +++ b/tools/testing/selftests/mm/userfaultfd.c @@ -585,6 +585,8 @@ static void continue_range(int ufd, __u64 start, __u64 len) req.range.start = start; req.range.len = len; req.mode = 0; + if (test_uffdio_wp) + req.mode |= UFFDIO_CONTINUE_MODE_WP; if (ioctl(ufd, UFFDIO_CONTINUE, &req)) err("UFFDIO_CONTINUE failed for address 0x%" PRIx64, @@ -1332,6 +1334,8 @@ static int userfaultfd_minor_test(void) uffdio_register.range.start = (unsigned long)area_dst_alias; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR; + if (test_uffdio_wp) + uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) err("register failure"); -- cgit v1.2.3-70-g09d2 From bd23024b9774e681cbe6cc3afcb24244dfcb2390 Mon Sep 17 00:00:00 2001 From: Tomas Mudrunka Date: Tue, 21 Mar 2023 11:34:30 +0100 Subject: mm/memtest: add results of early memtest to /proc/meminfo Currently the memtest results were only presented in dmesg. When running a large fleet of devices without ECC RAM it's currently not easy to do bulk monitoring for memory corruption. You have to parse dmesg, but that's a ring buffer so the error might disappear after some time. In general I do not consider dmesg to be a great API to query RAM status. In several companies I've seen such errors remain undetected and cause issues for way too long. So I think it makes sense to provide a monitoring API, so that we can safely detect and act upon them. This adds /proc/meminfo entry which can be easily used by scripts. Link: https://lkml.kernel.org/r/20230321103430.7130-1-tomas.mudrunka@gmail.com Signed-off-by: Tomas Mudrunka Cc: Jonathan Corbet Cc: Mike Rapoport (IBM) Signed-off-by: Andrew Morton --- Documentation/filesystems/proc.rst | 8 ++++++++ fs/proc/meminfo.c | 13 +++++++++++++ include/linux/memblock.h | 2 ++ mm/memtest.c | 6 ++++++ 4 files changed, 29 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 9d5fd9424e8b..8740362f31c6 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -996,6 +996,7 @@ Example output. You may not have all of these fields. VmallocUsed: 40444 kB VmallocChunk: 0 kB Percpu: 29312 kB + EarlyMemtestBad: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 4149248 kB ShmemHugePages: 0 kB @@ -1146,6 +1147,13 @@ VmallocChunk Percpu Memory allocated to the percpu allocator used to back percpu allocations. This stat excludes the cost of metadata. +EarlyMemtestBad + The amount of RAM/memory in kB, that was identified as corrupted + by early memtest. If memtest was not run, this field will not + be displayed at all. Size is never rounded down to 0 kB. + That means if 0 kB is reported, you can safely assume + there was at least one pass of memtest and none of the passes + found a single faulty byte of RAM. HardwareCorrupted The amount of RAM/memory in KB, the kernel identifies as corrupted. diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 440960110a42..b43d0bd42762 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -131,6 +132,18 @@ static int meminfo_proc_show(struct seq_file *m, void *v) show_val_kb(m, "VmallocChunk: ", 0ul); show_val_kb(m, "Percpu: ", pcpu_nr_pages()); +#ifdef CONFIG_MEMTEST + if (early_memtest_done) { + unsigned long early_memtest_bad_size_kb; + + early_memtest_bad_size_kb = early_memtest_bad_size>>10; + if (early_memtest_bad_size && !early_memtest_bad_size_kb) + early_memtest_bad_size_kb = 1; + /* When 0 is reported, it means there actually was a successful test */ + seq_printf(m, "EarlyMemtestBad: %5lu kB\n", early_memtest_bad_size_kb); + } +#endif + #ifdef CONFIG_MEMORY_FAILURE seq_printf(m, "HardwareCorrupted: %5lu kB\n", atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)); diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 50ad19662a32..f82ee3fac1cd 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -597,6 +597,8 @@ extern int hashdist; /* Distribute hashes across NUMA nodes? */ #endif #ifdef CONFIG_MEMTEST +extern phys_addr_t early_memtest_bad_size; /* Size of faulty ram found by memtest */ +extern bool early_memtest_done; /* Was early memtest done? */ extern void early_memtest(phys_addr_t start, phys_addr_t end); #else static inline void early_memtest(phys_addr_t start, phys_addr_t end) diff --git a/mm/memtest.c b/mm/memtest.c index f53ace709ccd..57149dfee438 100644 --- a/mm/memtest.c +++ b/mm/memtest.c @@ -4,6 +4,9 @@ #include #include +bool early_memtest_done; +phys_addr_t early_memtest_bad_size; + static u64 patterns[] __initdata = { /* The first entry has to be 0 to leave memtest with zeroed memory */ 0, @@ -30,6 +33,7 @@ static void __init reserve_bad_mem(u64 pattern, phys_addr_t start_bad, phys_addr pr_info(" %016llx bad mem addr %pa - %pa reserved\n", cpu_to_be64(pattern), &start_bad, &end_bad); memblock_reserve(start_bad, end_bad - start_bad); + early_memtest_bad_size += (end_bad - start_bad); } static void __init memtest(u64 pattern, phys_addr_t start_phys, phys_addr_t size) @@ -61,6 +65,8 @@ static void __init memtest(u64 pattern, phys_addr_t start_phys, phys_addr_t size } if (start_bad) reserve_bad_mem(pattern, start_bad, last_bad + incr); + + early_memtest_done = true; } static void __init do_one_pass(u64 pattern, phys_addr_t start, phys_addr_t end) -- cgit v1.2.3-70-g09d2 From 58ef47ef7db9dfc2730dc039498cc76130ea3c3d Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 27 Mar 2023 18:45:15 +0100 Subject: mm: hold the RCU read lock over calls to ->map_pages Prevent filesystems from doing things which sleep in their map_pages method. This is in preparation for a pagefault path protected only by RCU. Link: https://lkml.kernel.org/r/20230327174515.1811532-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Darrick J. Wong Cc: Dave Chinner Cc: David Howells Signed-off-by: Andrew Morton --- Documentation/filesystems/locking.rst | 4 ++-- mm/memory.c | 11 ++++++++--- 2 files changed, 10 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 7de7a7272a5e..aa1a233b0fa8 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -645,7 +645,7 @@ ops mmap_lock PageLocked(page) open: yes close: yes fault: yes can return with page locked -map_pages: yes +map_pages: read page_mkwrite: yes can return with page locked pfn_mkwrite: yes access: yes @@ -661,7 +661,7 @@ locked. The VM will unlock the page. ->map_pages() is called when VM asks to map easy accessible pages. Filesystem should find and map pages associated with offsets from "start_pgoff" -till "end_pgoff". ->map_pages() is called with page table locked and must +till "end_pgoff". ->map_pages() is called with the RCU lock held and must not block. If it's not possible to reach a page without blocking, filesystem should skip it. Filesystem should use do_set_pte() to setup page table entry. Pointer to entry associated with the page is passed in diff --git a/mm/memory.c b/mm/memory.c index 7716f2cc4b68..ec7e89cc0532 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4450,6 +4450,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) /* The page offset of vmf->address within the VMA. */ pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff; pgoff_t from_pte, to_pte; + vm_fault_t ret; /* The PTE offset of the start address, clamped to the VMA. */ from_pte = max(ALIGN_DOWN(pte_off, nr_pages), @@ -4465,9 +4466,13 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) return VM_FAULT_OOM; } - return vmf->vma->vm_ops->map_pages(vmf, - vmf->pgoff + from_pte - pte_off, - vmf->pgoff + to_pte - pte_off); + rcu_read_lock(); + ret = vmf->vma->vm_ops->map_pages(vmf, + vmf->pgoff + from_pte - pte_off, + vmf->pgoff + to_pte - pte_off); + rcu_read_unlock(); + + return ret; } /* Return true if we should do read fault-around, false otherwise */ -- cgit v1.2.3-70-g09d2 From d21077fbc2fc987c2e593c34dc3b4d84e546dc9f Mon Sep 17 00:00:00 2001 From: Stefan Roesch Date: Mon, 17 Apr 2023 22:13:41 -0700 Subject: mm: add new KSM process and sysfs knobs This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc//ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io Signed-off-by: Stefan Roesch Reviewed-by: Bagas Sanjaya Acked-by: David Hildenbrand Cc: David Hildenbrand Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-kernel-mm-ksm | 8 ++++++++ Documentation/admin-guide/mm/ksm.rst | 5 ++++- fs/proc/base.c | 3 +++ include/linux/ksm.h | 5 +++++ mm/ksm.c | 21 +++++++++++++++++++++ 5 files changed, 41 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-ksm b/Documentation/ABI/testing/sysfs-kernel-mm-ksm index d244674a9480..6041a025b65a 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-ksm +++ b/Documentation/ABI/testing/sysfs-kernel-mm-ksm @@ -51,3 +51,11 @@ Description: Control merging pages across different NUMA nodes. When it is set to 0 only pages from the same node are merged, otherwise pages from all nodes can be merged together (default). + +What: /sys/kernel/mm/ksm/general_profit +Date: April 2023 +KernelVersion: 6.4 +Contact: Linux memory management mailing list +Description: Measure how effective KSM is. + general_profit: how effective is KSM. The formula for the + calculation is in Documentation/admin-guide/mm/ksm.rst. diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index eed51a910c94..551083a396fb 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -157,6 +157,8 @@ stable_node_chains_prune_millisecs The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: +general_profit + how effective is KSM. The calculation is explained below. pages_shared how many shared pages are being used pages_sharing @@ -207,7 +209,8 @@ several times, which are unprofitable memory consumed. ksm_rmap_items * sizeof(rmap_item). where ksm_merging_pages is shown under the directory ``/proc//``, - and ksm_rmap_items is shown in ``/proc//ksm_stat``. + and ksm_rmap_items is shown in ``/proc//ksm_stat``. The process profit + is also shown in ``/proc//ksm_stat`` as ksm_process_profit. From the perspective of application, a high ratio of ``ksm_rmap_items`` to ``ksm_merging_pages`` means a bad madvise-applied policy, so developers or diff --git a/fs/proc/base.c b/fs/proc/base.c index 5e0e0ccd47aa..96a6a08c8235 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -96,6 +96,7 @@ #include #include #include +#include #include #include "internal.h" #include "fd.h" @@ -3207,6 +3208,8 @@ static int proc_pid_ksm_stat(struct seq_file *m, struct pid_namespace *ns, mm = get_task_mm(task); if (mm) { seq_printf(m, "ksm_rmap_items %lu\n", mm->ksm_rmap_items); + seq_printf(m, "ksm_merging_pages %lu\n", mm->ksm_merging_pages); + seq_printf(m, "ksm_process_profit %ld\n", ksm_process_profit(mm)); mmput(mm); } diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 4647b0c70c12..7a9b76fb6c3f 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -68,6 +68,11 @@ void folio_migrate_ksm(struct folio *newfolio, struct folio *folio); void collect_procs_ksm(struct page *page, struct list_head *to_kill, int force_early); #endif + +#ifdef CONFIG_PROC_FS +long ksm_process_profit(struct mm_struct *); +#endif /* CONFIG_PROC_FS */ + #else /* !CONFIG_KSM */ static inline void ksm_add_vma(struct vm_area_struct *vma) diff --git a/mm/ksm.c b/mm/ksm.c index 35ac6c741572..9e48258985d2 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -3007,6 +3007,14 @@ static void wait_while_offlining(void) } #endif /* CONFIG_MEMORY_HOTREMOVE */ +#ifdef CONFIG_PROC_FS +long ksm_process_profit(struct mm_struct *mm) +{ + return mm->ksm_merging_pages * PAGE_SIZE - + mm->ksm_rmap_items * sizeof(struct ksm_rmap_item); +} +#endif /* CONFIG_PROC_FS */ + #ifdef CONFIG_SYSFS /* * This all compiles without CONFIG_SYSFS, but is a waste of space. @@ -3271,6 +3279,18 @@ static ssize_t pages_volatile_show(struct kobject *kobj, } KSM_ATTR_RO(pages_volatile); +static ssize_t general_profit_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + long general_profit; + + general_profit = ksm_pages_sharing * PAGE_SIZE - + ksm_rmap_items * sizeof(struct ksm_rmap_item); + + return sysfs_emit(buf, "%ld\n", general_profit); +} +KSM_ATTR_RO(general_profit); + static ssize_t stable_node_dups_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -3335,6 +3355,7 @@ static struct attribute *ksm_attrs[] = { &stable_node_dups_attr.attr, &stable_node_chains_prune_millisecs_attr.attr, &use_zero_pages_attr.attr, + &general_profit_attr.attr, NULL, }; -- cgit v1.2.3-70-g09d2