diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/sysfs-kernel-mm-damon | 6 | ||||
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 18 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 9 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/start.rst | 46 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/usage.rst | 10 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/pagemap.rst | 25 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/transhuge.rst | 85 | ||||
-rw-r--r-- | Documentation/admin-guide/sysctl/vm.rst | 38 | ||||
-rw-r--r-- | Documentation/core-api/pin_user_pages.rst | 18 | ||||
-rw-r--r-- | Documentation/dev-tools/kmsan.rst | 11 | ||||
-rw-r--r-- | Documentation/filesystems/proc.rst | 9 | ||||
-rw-r--r-- | Documentation/mm/arch_pgtable_helpers.rst | 4 | ||||
-rw-r--r-- | Documentation/mm/damon/design.rst | 149 | ||||
-rw-r--r-- | Documentation/mm/damon/index.rst | 22 | ||||
-rw-r--r-- | Documentation/mm/damon/maintainer-profile.rst | 36 | ||||
-rw-r--r-- | Documentation/mm/unevictable-lru.rst | 10 |
16 files changed, 362 insertions, 134 deletions
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index cef6e1d20b18..f1b90cf1249b 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -155,6 +155,12 @@ Contact: SeongJae Park <sj@kernel.org> Description: Writing to and reading from this file sets and gets the action of the scheme. +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/target_nid +Date: Jun 2024 +Contact: SeongJae Park <sj@kernel.org> +Description: Action's target NUMA node id. Supported by only relevant + actions. + What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/apply_interval_us Date: Sep 2023 Contact: SeongJae Park <sj@kernel.org> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 05862f06ed26..86311c2907cd 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1306,17 +1306,10 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. - Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1328,6 +1321,17 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. +The following nested keys are defined. + + ========== ================================ + swappiness Swappiness value to reclaim with + ========== ================================ + + Specifying a swappiness value instructs the kernel to perform + the reclaim with that swappiness value. Note that this has the + same semantics as vm.swappiness applied to memcg reclaim with + all the existing limitations and potential future extensions. + memory.peak A read-only single value file which exists on non-root cgroups. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 530ebe9b11cd..c1134ad5f06d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -7239,9 +7239,12 @@ vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an exact size of <nn>. This can be used to increase - the minimum size (128MB on x86). It can also be - used to decrease the size and leave more room - for directly mapped kernel RAM. + the minimum size (128MB on x86, arm32 platforms). + It can also be used to decrease the size and leave more room + for directly mapped kernel RAM. Note that this parameter does + not exist on many other platforms (including arm64, alpha, + loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc, + parisc, m64k, powerpc, riscv, sh, um, xtensa, s390, sparc). vmcp_cma=nn[MG] [KNL,S390,EARLY] Sets the memory size reserved for contiguous memory diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 7aa0071ff1c3..054010a7f3d8 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -34,18 +34,56 @@ detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is mounted. +Snapshot Data Access Patterns +============================= + +The commands below show the memory access pattern of a program at the moment of +the execution. :: + + $ git clone https://github.com/sjp38/masim; cd masim; make + $ sudo damo start "./masim ./configs/stairs.cfg --quiet" + $ sudo ./damo show + 0 addr [85.541 TiB , 85.541 TiB ) (57.707 MiB ) access 0 % age 10.400 s + 1 addr [85.541 TiB , 85.542 TiB ) (413.285 MiB) access 0 % age 11.400 s + 2 addr [127.649 TiB , 127.649 TiB) (57.500 MiB ) access 0 % age 1.600 s + 3 addr [127.649 TiB , 127.649 TiB) (32.500 MiB ) access 0 % age 500 ms + 4 addr [127.649 TiB , 127.649 TiB) (9.535 MiB ) access 100 % age 300 ms + 5 addr [127.649 TiB , 127.649 TiB) (8.000 KiB ) access 60 % age 0 ns + 6 addr [127.649 TiB , 127.649 TiB) (6.926 MiB ) access 0 % age 1 s + 7 addr [127.998 TiB , 127.998 TiB) (120.000 KiB) access 0 % age 11.100 s + 8 addr [127.998 TiB , 127.998 TiB) (8.000 KiB ) access 40 % age 100 ms + 9 addr [127.998 TiB , 127.998 TiB) (4.000 KiB ) access 0 % age 11 s + total size: 577.590 MiB + $ sudo ./damo stop + +The first command of the above example downloads and builds an artificial +memory access generator program called ``masim``. The second command asks DAMO +to execute the artificial generator process start via the given command and +make DAMON monitors the generator process. The third command retrieves the +current snapshot of the monitored access pattern of the process from DAMON and +shows the pattern in a human readable format. + +Each line of the output shows which virtual address range (``addr [XX, XX)``) +of the process is how frequently (``access XX %``) accessed for how long time +(``age XX``). For example, the fifth region of ~9 MiB size is being most +frequently accessed for last 300 milliseconds. Finally, the fourth command +stops DAMON. + +Note that DAMON can monitor not only virtual address spaces but multiple types +of address spaces including the physical address space. + + Recording Data Access Patterns ============================== The commands below record the memory access patterns of a program and save the monitoring results to a file. :: - $ git clone https://github.com/sjp38/masim - $ cd masim; make; ./masim ./configs/zigzag.cfg & + $ ./masim ./configs/zigzag.cfg & $ sudo damo record -o damon.data $(pidof masim) -The first two lines of the commands download an artificial memory access -generator program and run it in the background. The generator will repeatedly +The line of the commands run the artificial memory access +generator program again. The generator will repeatedly access two 100 MiB sized memory regions one by one. You can substitute this with your real workload. The last line asks ``damo`` to record the access pattern in the ``damon.data`` file. diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index e58ceb89ea2a..26df6cfa4441 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -78,7 +78,7 @@ comma (","). │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes - │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us + │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,target_nid,apply_interval_us │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/ │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max @@ -289,14 +289,18 @@ schemes/<N>/ ------------ In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files -(``action`` and ``apply_interval``) exist. +``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files +(``action``, ``target_nid`` and ``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action <damon_design_damos_action>`. The keywords that can be written to and read from the file and their meaning are same to those of the list on :ref:`design doc <damon_design_damos_action>`. +The ``target_nid`` file is for setting the migration target node, which is +only meaningful when the ``action`` is either ``migrate_hot`` or +``migrate_cold``. + The ``apply_interval_us`` file is for setting and getting the scheme's :ref:`apply_interval <damon_design_damos>` in microseconds. diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index f5f065c67615..caba0f52dd36 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -118,7 +118,7 @@ Short descriptions to the page flags 21 - KSM Identical memory pages dynamically shared between one or more processes. 22 - THP - Contiguous pages which construct transparent hugepages. + Contiguous pages which construct THP of any size and mapped by any granularity. 23 - OFFLINE The page is logically offline. 24 - ZERO_PAGE @@ -173,27 +173,6 @@ LRU related page flags The page-types tool in the tools/mm directory can be used to query the above flags. -Using pagemap to do something useful -==================================== - -The general procedure for using pagemap to find out about a process' memory -usage goes like this: - - 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are - mapped to what. - 2. Select the maps you are interested in -- all of them, or a particular - library, or the stack or the heap, etc. - 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. - 4. Read a u64 for each page from pagemap. - 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you - just read, seek to that entry in the file, and read the data you want. - -For example, to find the "unique set size" (USS), which is the amount of -memory that a process is using that is not shared with any other process, -you can go through every map in the process, find the PFNs, look those up -in kpagecount, and tally up the number of pages that are only referenced -once. - Exceptions for Shared Memory ============================ @@ -252,7 +231,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PRESENT`` - Page is present in the memory - ``PAGE_IS_SWAPPED`` - Page is in swapped - ``PAGE_IS_PFNZERO`` - Page has zero PFN -- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed +- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index d414d3f5592a..058485daf186 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -202,12 +202,11 @@ PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size -khugepaged will be automatically started when one or more hugepage -sizes are enabled (either by directly setting "always" or "madvise", -or by setting "inherit" while the top-level enabled is set to "always" -or "madvise"), and it'll be automatically shutdown when the last -hugepage size is disabled (either by directly setting "never", or by -setting "inherit" while the top-level enabled is set to "never"). +khugepaged will be automatically started when PMD-sized THP is enabled +(either of the per-size anon control or the top-level control are set +to "always" or "madvise"), and it'll be automatically shutdown when +PMD-sized THP is disabled (when both the per-size anon control and the +top-level control are "never") Khugepaged controls ------------------- @@ -332,6 +331,31 @@ deny force Force the huge option on for all - very useful for testing; +Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to +control mTHP allocation: +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled', +and its value for each mTHP is essentially consistent with the global +setting. An 'inherit' option is added to ensure compatibility with these +global settings. Conversely, the options 'force' and 'deny' are dropped, +which are rather testing artifacts from the old ages. + +always + Attempt to allocate <size> huge pages every time we need a new page; + +inherit + Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages + have enabled="inherit" and all other hugepage sizes have enabled="never"; + +never + Do not allocate <size> huge pages; + +within_size + Only allocate <size> huge page if it will be fully within i_size. + Also respect fadvise()/madvise() hints; + +advise + Only allocate <size> huge pages if requested with fadvise()/madvise(); + Need of application restart =========================== @@ -344,10 +368,6 @@ also applies to the regions registered in khugepaged. Monitoring usage ================ -.. note:: - Currently the below counters only record events relating to - PMD-sized THP. Events relating to other THP sizes are not included. - The number of PMD-sized anonymous transparent huge pages currently used by the system is available by reading the AnonHugePages field in ``/proc/meminfo``. To identify what applications are using PMD-sized anonymous transparent huge @@ -392,20 +412,23 @@ thp_collapse_alloc_failed the allocation. thp_file_alloc - is incremented every time a file huge page is successfully - allocated. + is incremented every time a shmem huge page is successfully + allocated (Note that despite being named after "file", the counter + measures only shmem). thp_file_fallback - is incremented if a file huge page is attempted to be allocated - but fails and instead falls back to using small pages. + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. (Note that + despite being named after "file", the counter measures only shmem). thp_file_fallback_charge - is incremented if a file huge page cannot be charged and instead + is incremented if a shmem huge page cannot be charged and instead falls back to using small pages even though the allocation was - successful. + successful. (Note that despite being named after "file", the + counter measures only shmem). thp_file_mapped - is incremented every time a file huge page is mapped into + is incremented every time a file or shmem huge page is mapped into user address space. thp_split_page @@ -476,6 +499,34 @@ swpout_fallback Usually because failed to allocate some continuous swap space for the huge page. +shmem_alloc + is incremented every time a shmem huge page is successfully + allocated. + +shmem_fallback + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. + +shmem_fallback_charge + is incremented if a shmem huge page cannot be charged and instead + falls back to using small pages even though the allocation was + successful. + +split + is incremented every time a huge page is successfully split into + smaller orders. This can happen for a variety of reasons but a + common reason is that a huge page is old and is being reclaimed. + +split_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. + +split_deferred + is incremented when a huge page is put onto split queue. + This happens when a huge page is partially unmapped and splitting + it would free up some memory. Pages on split queue are going to + be split under memory pressure, if splitting is possible. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index e86c968a7a0e..f48eaa98d22d 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/vm: - dirtytime_expire_seconds - dirty_writeback_centisecs - drop_caches +- enable_soft_offline - extfrag_threshold - highmem_is_dirtyable - hugetlb_shm_group @@ -267,6 +268,43 @@ used:: These are informational only. They do not mean that anything is wrong with your system. To disable them, echo 4 (bit 2) into drop_caches. +enable_soft_offline +=================== +Correctable memory errors are very common on servers. Soft-offline is kernel's +solution for memory pages having (excessive) corrected memory errors. + +For different types of page, soft-offline has different behaviors / costs. + +- For a raw error page, soft-offline migrates the in-use page's content to + a new raw page. + +- For a page that is part of a transparent hugepage, soft-offline splits the + transparent hugepage into raw pages, then migrates only the raw error page. + As a result, user is transparently backed by 1 less hugepage, impacting + memory access performance. + +- For a page that is part of a HugeTLB hugepage, soft-offline first migrates + the entire HugeTLB hugepage, during which a free hugepage will be consumed + as migration target. Then the original hugepage is dissolved into raw + pages without compensation, reducing the capacity of the HugeTLB pool by 1. + +It is user's call to choose between reliability (staying away from fragile +physical memory) vs performance / capacity implications in transparent and +HugeTLB cases. + +For all architectures, enable_soft_offline controls whether to soft offline +memory pages. When set to 1, kernel attempts to soft offline the pages +whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to +the request to soft offline the pages. Its default value is 1. + +It is worth mentioning that after setting enable_soft_offline to 0, the +following requests to soft offline pages will not be performed: + +- Request to soft offline pages from RAS Correctable Errors Collector. + +- On ARM, the request to soft offline pages from GHES driver. + +- On PARISC, the request to soft offline pages from Page Deallocation Table. extfrag_threshold ================= diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 6b5f7e6e7155..c16ca163b55e 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -132,7 +132,7 @@ CASE 1: Direct IO (DIO) ----------------------- There are GUP references to pages that are serving as DIO buffers. These buffers are needed for a relatively short time (so they -are not "long term"). No special synchronization with page_mkclean() or +are not "long term"). No special synchronization with folio_mkclean() or munmap() is provided. Therefore, flags to set at the call site are: :: FOLL_PIN @@ -144,7 +144,7 @@ CASE 2: RDMA ------------ There are GUP references to pages that are serving as DMA buffers. These buffers are needed for a long time ("long term"). No special -synchronization with page_mkclean() or munmap() is provided. Therefore, flags +synchronization with folio_mkclean() or munmap() is provided. Therefore, flags to set at the call site are: :: FOLL_PIN | FOLL_LONGTERM @@ -170,7 +170,7 @@ callback, simply remove the range from the device's page tables. Either way, as long as the driver unpins the pages upon mmu notifier callback, then there is proper synchronization with both filesystem and mm -(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. +(folio_mkclean(), munmap(), etc). Therefore, neither flag needs to be set. CASE 4: Pinning for struct page manipulation only ------------------------------------------------- @@ -196,20 +196,20 @@ INCORRECT (uses FOLL_GET calls): write to the data within the pages put_page() -page_maybe_dma_pinned(): the whole point of pinning -=================================================== +folio_maybe_dma_pinned(): the whole point of pinning +==================================================== -The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able -to query, "is this page DMA-pinned?" That allows code such as page_mkclean() +The whole point of marking folios as "DMA-pinned" or "gup-pinned" is to be able +to query, "is this folio DMA-pinned?" That allows code such as folio_mkclean() (and file system writeback code in general) to make informed decisions about -what to do when a page cannot be unmapped due to such pins. +what to do when a folio cannot be unmapped due to such pins. What to do in those cases is the subject of a years-long series of discussions and debates (see the References at the end of this document). It's a TODO item here: fill in the details once that's worked out. Meanwhile, it's safe to say that having this available: :: - static inline bool page_maybe_dma_pinned(struct page *page) + static inline bool folio_maybe_dma_pinned(struct folio *folio) ...is a prerequisite to solving the long-running gup+DMA problem. diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst index 323eedad53cd..6a48d96c5c85 100644 --- a/Documentation/dev-tools/kmsan.rst +++ b/Documentation/dev-tools/kmsan.rst @@ -110,6 +110,13 @@ in the Makefile. Think of this as applying ``__no_sanitize_memory`` to every function in the file or directory. Most users won't need KMSAN_SANITIZE, unless their code gets broken by KMSAN (e.g. runs at early boot time). +KMSAN checks can also be temporarily disabled for the current task using +``kmsan_disable_current()`` and ``kmsan_enable_current()`` calls. Each +``kmsan_enable_current()`` call must be preceded by a +``kmsan_disable_current()`` call; these call pairs may be nested. One needs to +be careful with these calls, keeping the regions short and preferring other +ways to disable instrumentation, where possible. + Support ======= @@ -338,11 +345,11 @@ Per-task KMSAN state ~~~~~~~~~~~~~~~~~~~~ Every task_struct has an associated KMSAN task state that holds the KMSAN -context (see above) and a per-task flag disallowing KMSAN reports:: +context (see above) and a per-task counter disallowing KMSAN reports:: struct kmsan_context { ... - bool allow_reporting; + unsigned int depth; struct kmsan_context_state cstate; ... } diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 82d142de3461..e834779d9611 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -443,6 +443,15 @@ is not associated with a file: or if empty, the mapping is anonymous. +Starting with 6.11 kernel, /proc/PID/maps provides an alternative +ioctl()-based API that gives ability to flexibly and efficiently query and +filter individual VMAs. This interface is binary and is meant for more +efficient and easy programmatic use. `struct procmap_query`, defined in +linux/fs.h UAPI header, serves as an input/output argument to the +`PROCMAP_QUERY` ioctl() command. See comments in linus/fs.h UAPI header for +details on query semantics, supported flags, data returned, and general API +usage information. + The /proc/PID/smaps is an extension based on maps, showing the memory consumption for each of the process's mappings. For each mapping (aka Virtual Memory Area, or VMA) there is a series of lines such as the following:: diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index ad50ca6f495e..af245161d8e7 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -90,8 +90,6 @@ PMD Page Table Helpers +---------------------------+--------------------------------------------------+ | pmd_leaf | Tests a leaf mapped PMD | +---------------------------+--------------------------------------------------+ -| pmd_huge | Tests a HugeTLB mapped PMD | -+---------------------------+--------------------------------------------------+ | pmd_trans_huge | Tests a Transparent Huge Page (THP) at PMD | +---------------------------+--------------------------------------------------+ | pmd_present | Tests whether pmd_page() points to valid memory | @@ -169,8 +167,6 @@ PUD Page Table Helpers +---------------------------+--------------------------------------------------+ | pud_leaf | Tests a leaf mapped PUD | +---------------------------+--------------------------------------------------+ -| pud_huge | Tests a HugeTLB mapped PUD | -+---------------------------+--------------------------------------------------+ | pud_trans_huge | Tests a Transparent Huge Page (THP) at PUD | +---------------------------+--------------------------------------------------+ | pud_present | Tests a valid mapped PUD | diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 3df387249937..8730c246ceaa 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -16,53 +16,24 @@ called DAMON ``context``. DAMON executes each context with a kernel thread called ``kdamond``. Multiple kdamonds could run in parallel, for different types of monitoring. +To know how user-space can do the configurations and start/stop DAMON, refer to +:ref:`DAMON sysfs interface <sysfs_interface>` documentation. + Overall Architecture ==================== DAMON subsystem is configured with three layers including -- Operations Set: Implements fundamental operations for DAMON that depends on - the given monitoring target address-space and available set of - software/hardware primitives, -- Core: Implements core logics including monitoring overhead/accurach control - and access-aware system operations on top of the operations set layer, and -- Modules: Implements kernel modules for various purposes that provides - interfaces for the user space, on top of the core layer. - - -.. _damon_design_configurable_operations_set: - -Configurable Operations Set ---------------------------- - -For data access monitoring and additional low level work, DAMON needs a set of -implementations for specific operations that are dependent on and optimized for -the given target address space. On the other hand, the accuracy and overhead -tradeoff mechanism, which is the core logic of DAMON, is in the pure logic -space. DAMON separates the two parts in different layers, namely DAMON -Operations Set and DAMON Core Logics Layers, respectively. It further defines -the interface between the layers to allow various operations sets to be -configured with the core logic. - -Due to this design, users can extend DAMON for any address space by configuring -the core logic to use the appropriate operations set. If any appropriate set -is unavailable, users can implement one on their own. - -For example, physical memory, virtual memory, swap space, those for specific -processes, NUMA nodes, files, and backing memory devices would be supportable. -Also, if some architectures or devices supporting special optimized access -check primitives, those will be easily configurable. - - -Programmable Modules --------------------- - -Core layer of DAMON is implemented as a framework, and exposes its application -programming interface to all kernel space components such as subsystems and -modules. For common use cases of DAMON, DAMON subsystem provides kernel -modules that built on top of the core layer using the API, which can be easily -used by the user space end users. +- :ref:`Operations Set <damon_operations_set>`: Implements fundamental + operations for DAMON that depends on the given monitoring target + address-space and available set of software/hardware primitives, +- :ref:`Core <damon_core_logic>`: Implements core logics including monitoring + overhead/accuracy control and access-aware system operations on top of the + operations set layer, and +- :ref:`Modules <damon_modules>`: Implements kernel modules for various + purposes that provides interfaces for the user space, on top of the core + layer. .. _damon_operations_set: @@ -70,11 +41,32 @@ used by the user space end users. Operations Set Layer ==================== -The monitoring operations are defined in two parts: +.. _damon_design_configurable_operations_set: + +For data access monitoring and additional low level work, DAMON needs a set of +implementations for specific operations that are dependent on and optimized for +the given target address space. For example, below two operations for access +monitoring are address-space dependent. 1. Identification of the monitoring target address range for the address space. 2. Access check of specific address range in the target space. +DAMON consolidates these implementations in a layer called DAMON Operations +Set, and defines the interface between it and the upper layer. The upper layer +is dedicated for DAMON's core logics including the mechanism for control of the +monitoring accruracy and the overhead. + +Hence, DAMON can easily be extended for any address space and/or available +hardware features by configuring the core logic to use the appropriate +operations set. If there is no available operations set for a given purpose, a +new operations set can be implemented following the interface between the +layers. + +For example, physical memory, virtual memory, swap space, those for specific +processes, NUMA nodes, files, and backing memory devices would be supportable. +Also, if some architectures or devices support special optimized access check +features, those will be easily configurable. + DAMON currently provides below three operation sets. Below two subsections describe how those work. @@ -82,6 +74,10 @@ describe how those work. - fvaddr: Monitor fixed virtual address ranges - paddr: Monitor the physical address space of the system +To know how user-space can do the configuration via :ref:`DAMON sysfs interface +<sysfs_interface>`, refer to :ref:`operations <sysfs_context>` file part of the +documentation. + .. _damon_design_vaddr_target_regions_construction: @@ -140,9 +136,12 @@ conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags, as Idle page tracking does. +.. _damon_core_logic: + Core Logics =========== +.. _damon_design_monitoring: Monitoring ---------- @@ -152,6 +151,10 @@ monitoring attributes, ``sampling interval``, ``aggregation interval``, ``update interval``, ``minimum number of regions``, and ``maximum number of regions``. +To know how user-space can set the attributes via :ref:`DAMON sysfs interface +<sysfs_interface>`, refer to :ref:`monitoring_attrs <sysfs_monitoring_attrs>` +part of the documentation. + Access Frequency Monitoring ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -192,7 +195,7 @@ one page in the region is required to be checked. Thus, for each ``sampling interval``, DAMON randomly picks one page in each region, waits for one ``sampling interval``, checks whether the page is accessed meanwhile, and increases the access frequency counter of the region if so. The counter is -called ``nr_regions`` of the region. Therefore, the monitoring overhead is +called ``nr_accesses`` of the region. Therefore, the monitoring overhead is controllable by setting the number of regions. DAMON allows users to set the minimum and the maximum number of regions for the trade-off. @@ -209,11 +212,18 @@ the data access pattern can be dynamically changed. This will result in low monitoring quality. To keep the assumption as much as possible, DAMON adaptively merges and splits each region based on their access frequency. -For each ``aggregation interval``, it compares the access frequencies of -adjacent regions and merges those if the frequency difference is small. Then, -after it reports and clears the aggregated access frequency of each region, it -splits each region into two or three regions if the total number of regions -will not exceed the user-specified maximum number of regions after the split. +For each ``aggregation interval``, it compares the access frequencies +(``nr_accesses``) of adjacent regions. If the difference is small, and if the +sum of the two regions' sizes is smaller than the size of total regions divided +by the ``minimum number of regions``, DAMON merges the two regions. If the +resulting number of total regions is still higher than ``maximum number of +regions``, it repeats the merging with increasing access frequenceis difference +threshold until the upper-limit of the number of regions is met, or the +threshold becomes higher than possible maximum value (``aggregation interval`` +divided by ``sampling interval``). Then, after it reports and clears the +aggregated access frequency of each region, it splits each region into two or +three regions if the total number of regions will not exceed the user-specified +maximum number of regions after the split. In this way, DAMON provides its best-effort quality and minimal overhead while keeping the bounds users set for their trade-off. @@ -248,6 +258,11 @@ and applies it to monitoring operations-related data structures such as the abstracted monitoring target memory area only for each of a user-specified time interval (``update interval``). +User-space can get the monitoring results via DAMON sysfs interface and/or +tracepoints. For more details, please refer to the documentations for +:ref:`DAMOS tried regions <sysfs_schemes_tried_regions>` and :ref:`tracepoint`, +respectively. + .. _damon_design_damos: @@ -288,6 +303,10 @@ the access pattern of interest, and applies the user-desired operation actions to the regions, for every user-specified time interval called ``apply_interval``. +To know how user-space can set ``apply_interval`` via :ref:`DAMON sysfs +interface <sysfs_interface>`, refer to :ref:`apply_interval_us <sysfs_scheme>` +part of the documentation. + .. _damon_design_damos_action: @@ -325,6 +344,10 @@ that supports each action are as below. Supported by ``paddr`` operations set. - ``lru_deprio``: Deprioritize the region on its LRU lists. Supported by ``paddr`` operations set. + - ``migrate_hot``: Migrate the regions prioritizing warmer regions. + Supported by ``paddr`` operations set. + - ``migrate_cold``: Migrate the regions prioritizing colder regions. + Supported by ``paddr`` operations set. - ``stat``: Do nothing but count the statistics. Supported by all operations sets. @@ -332,6 +355,10 @@ Applying the actions except ``stat`` to a region is considered as changing the region's characteristics. Hence, DAMOS resets the age of regions when any such actions are applied to those. +To know how user-space can set the action via :ref:`DAMON sysfs interface +<sysfs_interface>`, refer to :ref:`action <sysfs_scheme>` part of the +documentation. + .. _damon_design_damos_access_pattern: @@ -345,6 +372,10 @@ interest by setting minimum and maximum values of the three properties. If a region's three properties are in the ranges, DAMOS classifies it as one of the regions that the scheme is having an interest in. +To know how user-space can set the access pattern via :ref:`DAMON sysfs +interface <sysfs_interface>`, refer to :ref:`access_pattern +<sysfs_access_pattern>` part of the documentation. + .. _damon_design_damos_quotas: @@ -364,6 +395,10 @@ feature called quotas. It lets users specify an upper limit of time that DAMOS can use for applying the action, and/or a maximum bytes of memory regions that the action can be applied within a user-specified time duration. +To know how user-space can set the basic quotas via :ref:`DAMON sysfs interface +<sysfs_interface>`, refer to :ref:`quotas <sysfs_quotas>` part of the +documentation. + .. _damon_design_damos_quotas_prioritization: @@ -391,6 +426,10 @@ information to the underlying mechanism. Nevertheless, how and even whether the weight will be respected are up to the underlying prioritization mechanism implementation. +To know how user-space can set the prioritization weights via :ref:`DAMON sysfs +interface <sysfs_interface>`, refer to :ref:`weights <sysfs_quotas>` part of +the documentation. + .. _damon_design_damos_quotas_auto_tuning: @@ -420,6 +459,10 @@ Currently, two ``target_metric`` are provided. DAMOS does the measurement on its own, so only ``target_value`` need to be set by users at the initial time. In other words, DAMOS does self-feedback. +To know how user-space can set the tuning goal metric, the target value, and/or +the current value via :ref:`DAMON sysfs interface <sysfs_interface>`, refer to +:ref:`quota goals <sysfs_schemes_quota_goals>` part of the documentation. + .. _damon_design_damos_watermarks: @@ -442,6 +485,10 @@ is activated. If all schemes are deactivated by the watermarks, the monitoring is also deactivated. In this case, the DAMON worker thread only periodically checks the watermarks and therefore incurs nearly zero overhead. +To know how user-space can set the watermarks via :ref:`DAMON sysfs interface +<sysfs_interface>`, refer to :ref:`watermarks <sysfs_watermarks>` part of the +documentation. + .. _damon_design_damos_filters: @@ -488,6 +535,10 @@ Below types of filters are currently supported. - Applied to pages that belonging to a given DAMON monitoring target. - Handled by the core logic. +To know how user-space can set the watermarks via :ref:`DAMON sysfs interface +<sysfs_interface>`, refer to :ref:`filters <sysfs_filters>` part of the +documentation. + Application Programming Interface --------------------------------- @@ -501,6 +552,8 @@ interface, namely ``include/linux/damon.h``. Please refer to the API :doc:`document </mm/damon/api>` for details of the interface. +.. _damon_modules: + Modules ======= diff --git a/Documentation/mm/damon/index.rst b/Documentation/mm/damon/index.rst index 5e0a50583500..dafd6d028924 100644 --- a/Documentation/mm/damon/index.rst +++ b/Documentation/mm/damon/index.rst @@ -6,7 +6,7 @@ DAMON: Data Access MONitor DAMON is a Linux kernel subsystem that provides a framework for data access monitoring and the monitoring results based system operations. The core -monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it +monitoring :ref:`mechanisms <damon_design_monitoring>` of DAMON make it - *accurate* (the monitoring output is useful enough for DRAM level memory management; It might not appropriate for CPU Cache levels, though), @@ -16,15 +16,16 @@ monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it of the size of target workloads). Using this framework, therefore, the kernel can operate system in an -access-aware fashion. Because the features are also exposed to the user space, -users who have special information about their workloads can write personalized -applications for better understanding and optimizations of their workloads and -systems. +access-aware fashion. Because the features are also exposed to the :doc:`user +space </admin-guide/mm/damon/index>`, users who have special information about +their workloads can write personalized applications for better understanding +and optimizations of their workloads and systems. -For easier development of such systems, DAMON provides a feature called DAMOS -(DAMon-based Operation Schemes) in addition to the monitoring. Using the -feature, DAMON users in both kernel and user spaces can do access-aware system -operations with no code but simple configurations. +For easier development of such systems, DAMON provides a feature called +:ref:`DAMOS <damon_design_damos>` (DAMon-based Operation Schemes) in addition +to the monitoring. Using the feature, DAMON users in both kernel and :doc:`user +spaces </admin-guide/mm/damon/index>` can do access-aware system operations +with no code but simple configurations. .. toctree:: :maxdepth: 2 @@ -33,3 +34,6 @@ operations with no code but simple configurations. design api maintainer-profile + +To utilize and control DAMON from the user-space, please refer to the +administration :doc:`guide </admin-guide/mm/damon/index>`. diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index 8213cf61d38a..feccf6a0f6c3 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -53,6 +53,40 @@ Mon-Fri) in PT (Pacific Time). The response to patches will occasionally be slow. Do not hesitate to send a ping if you have not heard back within a week of sending a patch. +Mailing tool +------------ + +Like many other Linux kernel subsystems, DAMON uses the mailing lists +(damon@lists.linux.dev and linux-mm@kvack.org) as the major communication +channel. There is a simple tool called HacKerMaiL (``hkml``) [8]_ , which is +for people who are not very familiar with the mailing lists based +communication. The tool could be particularly helpful for DAMON community +members since it is developed and maintained by DAMON maintainer. The tool is +also officially announced to support DAMON and general Linux kernel development +workflow. + +In other words, ``hkml`` [8]_ is a mailing tool for DAMON community, which +DAMON maintainer is committed to support. Please feel free to try and report +issues or feature requests for the tool to the maintainer. + +Community meetup +---------------- + +DAMON community is maintaining two bi-weekly meetup series for community +members who prefer synchronous conversations over mails. + +The first one is for any discussion between every community member. No +reservation is needed. + +The seconds one is for discussions on specific topics between restricted +members including the maintainer. The maintainer shares the available time +slots, and attendees should reserve one of those at least 24 hours before the +time slot, by reaching out to the maintainer. + +Schedules and available reservation time slots are available at the Google doc +[9]_ . DAMON maintainer will also provide periodic reminder to the mailing +list (damon@lists.linux.dev). + .. [1] https://git.kernel.org/akpm/mm/h/mm-unstable .. [2] https://git.kernel.org/sj/h/damon/next @@ -61,3 +95,5 @@ of sending a patch. .. [5] https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh .. [6] https://github.com/awslabs/damon-tests/tree/master/corr .. [7] https://github.com/awslabs/damon-tests/tree/master/perf +.. [8] https://github.com/damonitor/hackermail +.. [9] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index b6a07a26b10d..2feb2ed51ae2 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -191,13 +191,13 @@ have become evictable again (via munlock() for example) and have been "rescued" from the unevictable list. However, there may be situations where we decide, for the sake of expediency, to leave an unevictable folio on one of the regular active/inactive LRU lists for vmscan to deal with. vmscan checks for such -folios in all of the shrink_{active|inactive|page}_list() functions and will +folios in all of the shrink_{active|inactive|folio}_list() functions and will "cull" such folios that it encounters: that is, it diverts those folios to the unevictable list for the memory cgroup and node being scanned. There may be situations where a folio is mapped into a VM_LOCKED VMA, but the folio does not have the mlocked flag set. Such folios will make -it all the way to shrink_active_list() or shrink_page_list() where they +it all the way to shrink_active_list() or shrink_folio_list() where they will be detected when vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The folio is culled to the unevictable list when it is released by the shrinker. @@ -269,7 +269,7 @@ the LRU. Such pages can be "noticed" by memory management in several places: (4) in the fault path and when a VM_LOCKED stack segment is expanded; or - (5) as mentioned above, in vmscan:shrink_page_list() when attempting to + (5) as mentioned above, in vmscan:shrink_folio_list() when attempting to reclaim a page in a VM_LOCKED VMA by folio_referenced() or try_to_unmap(). mlocked pages become unlocked and rescued from the unevictable list when: @@ -548,12 +548,12 @@ Some examples of these unevictable pages on the LRU lists are: (3) pages still mapped into VM_LOCKED VMAs, which should be marked mlocked, but events left mlock_count too low, so they were munlocked too early. -vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously +vmscan's shrink_inactive_list() and shrink_folio_list() also divert obviously unevictable pages found on the inactive lists to the appropriate memory cgroup and node unevictable list. rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or -shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(), +shrink_folio_list(), and rmap's try_to_unmap_one() called via shrink_folio_list(), check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio() to correct them. Such pages are culled to the unevictable list when released by the shrinker. |