summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2024-07-21 17:15:46 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2024-07-21 17:15:46 -0700
commitfbc90c042cd1dc7258ebfebe6d226017e5b5ac8c (patch)
tree45513ac12ade12a80ca6b306722f201802b0a190 /Documentation/admin-guide
parent7846b618e0a4c3e08888099d1d4512722b39ca99 (diff)
parent30d77b7eef019fa4422980806e8b7cdc8674493e (diff)
Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton: - In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targetted at -stable kernels. - Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad. - More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()" - Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback" - Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index". - In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainablity thing. - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch". - The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code". - Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection". - Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull. - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying. - Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions". - In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out. - In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code. - In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in vis sysfs controls. Dramatic improvements in pagefault latency are realized. - David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h". - David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually". - Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"". - Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them". - Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to aother CPU but I guess that's a win unless all CPUs are pegged. - hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes". - Miaohe Lin's series "Some cleanups for memory-failure" does just that thing. - Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM. - DAMON user API centralization and simplificatio work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function". - In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields. - Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()". - More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks. - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin. - Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying. - The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park. - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that. - David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". - Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks". - SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code. - Gavin Shan has found a userspace-triggerable WARN in the pagecache code. The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable. - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM. - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1" - Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file. - The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors. In order to permit userspace to monitor and handle this situation. - Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicing. - SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things. - In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization. - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments. So these paes can first be moved aside if they reside in the movable zone or a CMA block. - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps". - In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting. - Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes. - In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection. * tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits) mm/mglru: fix ineffective protection calculation mm/zswap: fix a white space issue mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio mm/hugetlb: fix possible recursive locking detected warning mm/gup: clear the LRU flag of a page before adding to LRU batch mm/numa_balancing: teach mpol_to_str about the balancing mode mm: memcg1: convert charge move flags to unsigned long long alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting lib: reuse page_ext_data() to obtain codetag_ref lib: add missing newline character in the warning message mm/mglru: fix overshooting shrinker memory mm/mglru: fix div-by-zero in vmpressure_calc_level() mm/kmemleak: replace strncpy() with strscpy() mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB mm: ignore data-race in __swap_writepage hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr mm: shmem: rename mTHP shmem counters mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async() mm/migrate: putback split folios when numa hint migration fails ...
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst18
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt9
-rw-r--r--Documentation/admin-guide/mm/damon/start.rst46
-rw-r--r--Documentation/admin-guide/mm/damon/usage.rst10
-rw-r--r--Documentation/admin-guide/mm/pagemap.rst25
-rw-r--r--Documentation/admin-guide/mm/transhuge.rst85
-rw-r--r--Documentation/admin-guide/sysctl/vm.rst38
7 files changed, 174 insertions, 57 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 05862f06ed26..86311c2907cd 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1306,17 +1306,10 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.
- This file accepts a single key, the number of bytes to reclaim.
- No nested keys are currently supported.
-
Example::
echo "1G" > memory.reclaim
- The interface can be later extended with nested keys to
- configure the reclaim behavior. For example, specify the
- type of memory to reclaim from (anon, file, ..).
-
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1328,6 +1321,17 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.
+The following nested keys are defined.
+
+ ========== ================================
+ swappiness Swappiness value to reclaim with
+ ========== ================================
+
+ Specifying a swappiness value instructs the kernel to perform
+ the reclaim with that swappiness value. Note that this has the
+ same semantics as vm.swappiness applied to memcg reclaim with
+ all the existing limitations and potential future extensions.
+
memory.peak
A read-only single value file which exists on non-root
cgroups.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 530ebe9b11cd..c1134ad5f06d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7239,9 +7239,12 @@
vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an
exact size of <nn>. This can be used to increase
- the minimum size (128MB on x86). It can also be
- used to decrease the size and leave more room
- for directly mapped kernel RAM.
+ the minimum size (128MB on x86, arm32 platforms).
+ It can also be used to decrease the size and leave more room
+ for directly mapped kernel RAM. Note that this parameter does
+ not exist on many other platforms (including arm64, alpha,
+ loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc,
+ parisc, m64k, powerpc, riscv, sh, um, xtensa, s390, sparc).
vmcp_cma=nn[MG] [KNL,S390,EARLY]
Sets the memory size reserved for contiguous memory
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
index 7aa0071ff1c3..054010a7f3d8 100644
--- a/Documentation/admin-guide/mm/damon/start.rst
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -34,18 +34,56 @@ detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
mounted.
+Snapshot Data Access Patterns
+=============================
+
+The commands below show the memory access pattern of a program at the moment of
+the execution. ::
+
+ $ git clone https://github.com/sjp38/masim; cd masim; make
+ $ sudo damo start "./masim ./configs/stairs.cfg --quiet"
+ $ sudo ./damo show
+ 0 addr [85.541 TiB , 85.541 TiB ) (57.707 MiB ) access 0 % age 10.400 s
+ 1 addr [85.541 TiB , 85.542 TiB ) (413.285 MiB) access 0 % age 11.400 s
+ 2 addr [127.649 TiB , 127.649 TiB) (57.500 MiB ) access 0 % age 1.600 s
+ 3 addr [127.649 TiB , 127.649 TiB) (32.500 MiB ) access 0 % age 500 ms
+ 4 addr [127.649 TiB , 127.649 TiB) (9.535 MiB ) access 100 % age 300 ms
+ 5 addr [127.649 TiB , 127.649 TiB) (8.000 KiB ) access 60 % age 0 ns
+ 6 addr [127.649 TiB , 127.649 TiB) (6.926 MiB ) access 0 % age 1 s
+ 7 addr [127.998 TiB , 127.998 TiB) (120.000 KiB) access 0 % age 11.100 s
+ 8 addr [127.998 TiB , 127.998 TiB) (8.000 KiB ) access 40 % age 100 ms
+ 9 addr [127.998 TiB , 127.998 TiB) (4.000 KiB ) access 0 % age 11 s
+ total size: 577.590 MiB
+ $ sudo ./damo stop
+
+The first command of the above example downloads and builds an artificial
+memory access generator program called ``masim``. The second command asks DAMO
+to execute the artificial generator process start via the given command and
+make DAMON monitors the generator process. The third command retrieves the
+current snapshot of the monitored access pattern of the process from DAMON and
+shows the pattern in a human readable format.
+
+Each line of the output shows which virtual address range (``addr [XX, XX)``)
+of the process is how frequently (``access XX %``) accessed for how long time
+(``age XX``). For example, the fifth region of ~9 MiB size is being most
+frequently accessed for last 300 milliseconds. Finally, the fourth command
+stops DAMON.
+
+Note that DAMON can monitor not only virtual address spaces but multiple types
+of address spaces including the physical address space.
+
+
Recording Data Access Patterns
==============================
The commands below record the memory access patterns of a program and save the
monitoring results to a file. ::
- $ git clone https://github.com/sjp38/masim
- $ cd masim; make; ./masim ./configs/zigzag.cfg &
+ $ ./masim ./configs/zigzag.cfg &
$ sudo damo record -o damon.data $(pidof masim)
-The first two lines of the commands download an artificial memory access
-generator program and run it in the background. The generator will repeatedly
+The line of the commands run the artificial memory access
+generator program again. The generator will repeatedly
access two 100 MiB sized memory regions one by one. You can substitute this
with your real workload. The last line asks ``damo`` to record the access
pattern in the ``damon.data`` file.
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index e58ceb89ea2a..26df6cfa4441 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -78,7 +78,7 @@ comma (",").
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
- │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
+ │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,target_nid,apply_interval_us
│ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
@@ -289,14 +289,18 @@ schemes/<N>/
------------
In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
-(``action`` and ``apply_interval``) exist.
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
+(``action``, ``target_nid`` and ``apply_interval``) exist.
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
from the file and their meaning are same to those of the list on
:ref:`design doc <damon_design_damos_action>`.
+The ``target_nid`` file is for setting the migration target node, which is
+only meaningful when the ``action`` is either ``migrate_hot`` or
+``migrate_cold``.
+
The ``apply_interval_us`` file is for setting and getting the scheme's
:ref:`apply_interval <damon_design_damos>` in microseconds.
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index f5f065c67615..caba0f52dd36 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -118,7 +118,7 @@ Short descriptions to the page flags
21 - KSM
Identical memory pages dynamically shared between one or more processes.
22 - THP
- Contiguous pages which construct transparent hugepages.
+ Contiguous pages which construct THP of any size and mapped by any granularity.
23 - OFFLINE
The page is logically offline.
24 - ZERO_PAGE
@@ -173,27 +173,6 @@ LRU related page flags
The page-types tool in the tools/mm directory can be used to query the
above flags.
-Using pagemap to do something useful
-====================================
-
-The general procedure for using pagemap to find out about a process' memory
-usage goes like this:
-
- 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
- mapped to what.
- 2. Select the maps you are interested in -- all of them, or a particular
- library, or the stack or the heap, etc.
- 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
- 4. Read a u64 for each page from pagemap.
- 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
- just read, seek to that entry in the file, and read the data you want.
-
-For example, to find the "unique set size" (USS), which is the amount of
-memory that a process is using that is not shared with any other process,
-you can go through every map in the process, find the PFNs, look those up
-in kpagecount, and tally up the number of pages that are only referenced
-once.
-
Exceptions for Shared Memory
============================
@@ -252,7 +231,7 @@ Following flags about pages are currently supported:
- ``PAGE_IS_PRESENT`` - Page is present in the memory
- ``PAGE_IS_SWAPPED`` - Page is in swapped
- ``PAGE_IS_PFNZERO`` - Page has zero PFN
-- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index d414d3f5592a..058485daf186 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -202,12 +202,11 @@ PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
-khugepaged will be automatically started when one or more hugepage
-sizes are enabled (either by directly setting "always" or "madvise",
-or by setting "inherit" while the top-level enabled is set to "always"
-or "madvise"), and it'll be automatically shutdown when the last
-hugepage size is disabled (either by directly setting "never", or by
-setting "inherit" while the top-level enabled is set to "never").
+khugepaged will be automatically started when PMD-sized THP is enabled
+(either of the per-size anon control or the top-level control are set
+to "always" or "madvise"), and it'll be automatically shutdown when
+PMD-sized THP is disabled (when both the per-size anon control and the
+top-level control are "never")
Khugepaged controls
-------------------
@@ -332,6 +331,31 @@ deny
force
Force the huge option on for all - very useful for testing;
+Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
+control mTHP allocation:
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
+and its value for each mTHP is essentially consistent with the global
+setting. An 'inherit' option is added to ensure compatibility with these
+global settings. Conversely, the options 'force' and 'deny' are dropped,
+which are rather testing artifacts from the old ages.
+
+always
+ Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+ Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+ have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+ Do not allocate <size> huge pages;
+
+within_size
+ Only allocate <size> huge page if it will be fully within i_size.
+ Also respect fadvise()/madvise() hints;
+
+advise
+ Only allocate <size> huge pages if requested with fadvise()/madvise();
+
Need of application restart
===========================
@@ -344,10 +368,6 @@ also applies to the regions registered in khugepaged.
Monitoring usage
================
-.. note::
- Currently the below counters only record events relating to
- PMD-sized THP. Events relating to other THP sizes are not included.
-
The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using PMD-sized anonymous transparent huge
@@ -392,20 +412,23 @@ thp_collapse_alloc_failed
the allocation.
thp_file_alloc
- is incremented every time a file huge page is successfully
- allocated.
+ is incremented every time a shmem huge page is successfully
+ allocated (Note that despite being named after "file", the counter
+ measures only shmem).
thp_file_fallback
- is incremented if a file huge page is attempted to be allocated
- but fails and instead falls back to using small pages.
+ is incremented if a shmem huge page is attempted to be allocated
+ but fails and instead falls back to using small pages. (Note that
+ despite being named after "file", the counter measures only shmem).
thp_file_fallback_charge
- is incremented if a file huge page cannot be charged and instead
+ is incremented if a shmem huge page cannot be charged and instead
falls back to using small pages even though the allocation was
- successful.
+ successful. (Note that despite being named after "file", the
+ counter measures only shmem).
thp_file_mapped
- is incremented every time a file huge page is mapped into
+ is incremented every time a file or shmem huge page is mapped into
user address space.
thp_split_page
@@ -476,6 +499,34 @@ swpout_fallback
Usually because failed to allocate some continuous swap space
for the huge page.
+shmem_alloc
+ is incremented every time a shmem huge page is successfully
+ allocated.
+
+shmem_fallback
+ is incremented if a shmem huge page is attempted to be allocated
+ but fails and instead falls back to using small pages.
+
+shmem_fallback_charge
+ is incremented if a shmem huge page cannot be charged and instead
+ falls back to using small pages even though the allocation was
+ successful.
+
+split
+ is incremented every time a huge page is successfully split into
+ smaller orders. This can happen for a variety of reasons but a
+ common reason is that a huge page is old and is being reclaimed.
+
+split_failed
+ is incremented if kernel fails to split huge
+ page. This can happen if the page was pinned by somebody.
+
+split_deferred
+ is incremented when a huge page is put onto split queue.
+ This happens when a huge page is partially unmapped and splitting
+ it would free up some memory. Pages on split queue are going to
+ be split under memory pressure, if splitting is possible.
+
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..f48eaa98d22d 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/vm:
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
+- enable_soft_offline
- extfrag_threshold
- highmem_is_dirtyable
- hugetlb_shm_group
@@ -267,6 +268,43 @@ used::
These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.
+enable_soft_offline
+===================
+Correctable memory errors are very common on servers. Soft-offline is kernel's
+solution for memory pages having (excessive) corrected memory errors.
+
+For different types of page, soft-offline has different behaviors / costs.
+
+- For a raw error page, soft-offline migrates the in-use page's content to
+ a new raw page.
+
+- For a page that is part of a transparent hugepage, soft-offline splits the
+ transparent hugepage into raw pages, then migrates only the raw error page.
+ As a result, user is transparently backed by 1 less hugepage, impacting
+ memory access performance.
+
+- For a page that is part of a HugeTLB hugepage, soft-offline first migrates
+ the entire HugeTLB hugepage, during which a free hugepage will be consumed
+ as migration target. Then the original hugepage is dissolved into raw
+ pages without compensation, reducing the capacity of the HugeTLB pool by 1.
+
+It is user's call to choose between reliability (staying away from fragile
+physical memory) vs performance / capacity implications in transparent and
+HugeTLB cases.
+
+For all architectures, enable_soft_offline controls whether to soft offline
+memory pages. When set to 1, kernel attempts to soft offline the pages
+whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to
+the request to soft offline the pages. Its default value is 1.
+
+It is worth mentioning that after setting enable_soft_offline to 0, the
+following requests to soft offline pages will not be performed:
+
+- Request to soft offline pages from RAS Correctable Errors Collector.
+
+- On ARM, the request to soft offline pages from GHES driver.
+
+- On PARISC, the request to soft offline pages from Page Deallocation Table.
extfrag_threshold
=================