-rw-r--r-- | Documentation/admin-guide/cgroup-v1/cgroups.rst | 2
-rw-r--r-- | Documentation/admin-guide/cgroup-v1/memory.rst | 284
-rw-r--r-- | kernel/cgroup/cpuset.c | 15
3 files changed, 162 insertions, 139 deletions
diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index b0688011ed06..9343148ee993 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -80,6 +80,8 @@ access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rs you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup. +.. _cgroups-why-needed: + 1.2 Why are cgroups needed ? ---------------------------- diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 60370f2c67b9..27d89495ac88 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -2,18 +2,18 @@ Memory Resource Controller ========================== -NOTE: +.. caution:: This document is hopelessly outdated and it asks for a complete rewrite. It still contains a useful information so we are keeping it here but make sure to check the current code if you need a deeper understanding. -NOTE: +.. note:: The Memory Resource Controller has generically been referred to as the memory controller in this document. Do not confuse memory controller used here with the memory controller that is used in hardware. -(For editors) In this document: +.. hint:: When we mention a cgroup (cgroupfs's directory) with memory controller, we call it "memory cgroup". When you see git-log and source code, you'll see patch's title and function names tend to use "memcg". @@ -23,7 +23,7 @@ Benefits and Purpose of the memory controller ============================================= The memory controller isolates the memory behaviour of a group of tasks -from the rest of the system. The article on LWN [12] mentions some probable +from the rest of the system. The article on LWN [12]_ mentions some probable uses of the memory controller. The memory controller can be used to a. 
Isolate an application or a group of applications @@ -55,7 +55,8 @@ Features: - Root cgroup has no limit controls. Kernel memory support is a work in progress, and the current version provides - basically functionality. (See Section 2.7) + basically functionality. (See :ref:`section 2.7 + <cgroup-v1-memory-kernel-extension>`) Brief summary of control files. @@ -107,16 +108,16 @@ Brief summary of control files. ========== The memory controller has a long history. A request for comments for the memory -controller was posted by Balbir Singh [1]. At the time the RFC was posted +controller was posted by Balbir Singh [1]_. At the time the RFC was posted there were several implementations for memory control. The goal of the RFC was to build consensus and agreement for the minimal features required -for memory control. The first RSS controller was posted by Balbir Singh[2] -in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the -RSS controller. At OLS, at the resource management BoF, everyone suggested -that we handle both page cache and RSS together. Another request was raised -to allow user space handling of OOM. The current memory controller is +for memory control. The first RSS controller was posted by Balbir Singh [2]_ +in Feb 2007. Pavel Emelianov [3]_ [4]_ [5]_ has since posted three versions +of the RSS controller. At OLS, at the resource management BoF, everyone +suggested that we handle both page cache and RSS together. Another request was +raised to allow user space handling of OOM. The current memory controller is at version 6; it combines both mapped (RSS) and unmapped Page -Cache Control [11]. +Cache Control [11]_. 2. Memory Control ================= @@ -147,7 +148,8 @@ specific data structure (mem_cgroup) associated with it. 2.2. Accounting --------------- -:: +.. code-block:: + :caption: Figure 1: Hierarchy of Accounting +--------------------+ | mem_cgroup | @@ -167,7 +169,6 @@ specific data structure (mem_cgroup) associated with it. 
| | | | +---------------+ +---------------+ - (Figure 1: Hierarchy of Accounting) Figure 1 shows the important aspects of the controller @@ -221,8 +222,9 @@ behind this approach is that a cgroup that aggressively uses a shared page will eventually get charged for it (once it is uncharged from the cgroup that brought it in -- this will happen on memory pressure). -But see section 8.2: when moving a task to another cgroup, its pages may -be recharged to the new cgroup, if move_charge_at_immigrate has been chosen. +But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>` when moving a +task to another cgroup, its pages may be recharged to the new cgroup, if +move_charge_at_immigrate has been chosen. 2.4 Swap Extension -------------------------------------- @@ -244,7 +246,8 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. By using the memsw limit, you can avoid system OOM which can be caused by swap shortage. -**why 'memory+swap' rather than swap** +2.4.1 why 'memory+swap' rather than swap +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The global LRU(kswapd) can swap out arbitrary pages. Swap-out means to move account from memory to swap...there is no change in usage of @@ -252,7 +255,8 @@ memory+swap. In other words, when we want to limit the usage of swap without affecting global LRU, memory+swap limit is better than just limiting swap from an OS point of view. -**What happens when a cgroup hits memory.memsw.limit_in_bytes** +2.4.2. What happens when a cgroup hits memory.memsw.limit_in_bytes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out in this cgroup. Then, swap-out will not be done by cgroup routine and file @@ -268,26 +272,26 @@ global VM. When a cgroup goes over its limit, we first try to reclaim memory from the cgroup so as to make space for the new pages that the cgroup has touched. 
If the reclaim is unsuccessful, an OOM routine is invoked to select and kill the bulkiest task in the -cgroup. (See 10. OOM Control below.) +cgroup. (See :ref:`10. OOM Control <cgroup-v1-memory-oom-control>` below.) The reclaim algorithm has not been modified for cgroups, except that pages that are selected for reclaiming come from the per-cgroup LRU list. -NOTE: - Reclaim does not work for the root cgroup, since we cannot set any - limits on the root cgroup. +.. note:: + Reclaim does not work for the root cgroup, since we cannot set any + limits on the root cgroup. -Note2: - When panic_on_oom is set to "2", the whole system will panic. +.. note:: + When panic_on_oom is set to "2", the whole system will panic. When oom event notifier is registered, event will be delivered. -(See oom_control section) +(See :ref:`oom_control <cgroup-v1-memory-oom-control>` section) 2.6 Locking ----------- -Lock order is as follows: +Lock order is as follows:: Page lock (PG_locked bit of page->flags) mm->page_table_lock or split pte_lock @@ -299,6 +303,8 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; PG_lru bit of page->flags is cleared before isolating a page from its LRU under lruvec->lru_lock. +.. _cgroup-v1-memory-kernel-extension: + 2.7 Kernel Memory Extension ----------------------------------------------- @@ -367,10 +373,10 @@ U != 0, K < U: never greater than the total memory, and freely set U at the cost of his QoS. -WARNING: - In the current implementation, memory reclaim will NOT be - triggered for a cgroup when it hits K while staying below U, which makes - this setup impractical. + .. warning:: + In the current implementation, memory reclaim will NOT be triggered for + a cgroup when it hits K while staying below U, which makes this setup + impractical. U != 0, K >= U: Since kmem charges will also be fed to the user counter and reclaim will be @@ -381,45 +387,41 @@ U != 0, K >= U: 3. User Interface ================= -3.0. 
Configuration ------------------- - -a. Enable CONFIG_CGROUPS -b. Enable CONFIG_MEMCG - -3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) -------------------------------------------------------------------- +To use the user interface: -:: +1. Enable CONFIG_CGROUPS and CONFIG_MEMCG options +2. Prepare the cgroups (see :ref:`Why are cgroups needed? + <cgroups-why-needed>` for the background information):: # mount -t tmpfs none /sys/fs/cgroup # mkdir /sys/fs/cgroup/memory # mount -t cgroup none /sys/fs/cgroup/memory -o memory -3.2. Make the new group and move bash into it:: +3. Make the new group and move bash into it:: # mkdir /sys/fs/cgroup/memory/0 # echo $$ > /sys/fs/cgroup/memory/0/tasks -Since now we're in the 0 cgroup, we can alter the memory limit:: +4. Since now we're in the 0 cgroup, we can alter the memory limit:: # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes -NOTE: - We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, - mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, - Gibibytes.) + The limit can now be queried:: -NOTE: - We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``. + # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes + 4194304 -NOTE: - We cannot set limits on the root cgroup any more. +.. note:: + We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, + mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, + Gibibytes.) -:: +.. note:: + We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``. + +.. note:: + We cannot set limits on the root cgroup any more. - # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes - 4194304 We can check the usage:: @@ -458,6 +460,8 @@ test because it has noise of shared objects/status. But the above two are testing extreme situations. Trying usual test under memory controller is always helpful. +.. 
_cgroup-v1-memory-test-troubleshoot: + 4.1 Troubleshooting ------------------- @@ -470,8 +474,11 @@ terminated by the OOM killer. There are several causes for this: A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of some of the pages cached in the cgroup (page cache pages). -To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and -seeing what happens will be helpful. +To know what happens, disabling OOM_Kill as per :ref:`"10. OOM Control" +<cgroup-v1-memory-oom-control>` (below) and seeing what happens will be +helpful. + +.. _cgroup-v1-memory-test-task-migration: 4.2 Task migration ------------------ @@ -482,15 +489,16 @@ remain charged to it, the charge is dropped when the page is freed or reclaimed. You can move charges of a task along with task migration. -See 8. "Move charges at task migration" +See :ref:`8. "Move charges at task migration" <cgroup-v1-memory-move-charges>` 4.3 Removing a cgroup --------------------- -A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a -cgroup might have some charge associated with it, even though all -tasks have migrated away from it. (because we charge against pages, not -against tasks.) +A cgroup can be removed by rmdir, but as discussed in :ref:`sections 4.1 +<cgroup-v1-memory-test-troubleshoot>` and :ref:`4.2 +<cgroup-v1-memory-test-task-migration>`, a cgroup might have some charge +associated with it, even though all tasks have migrated away from it. (because +we charge against pages, not against tasks.) We move the stats to parent, and no change on the charge except uncharging from the child. @@ -519,67 +527,66 @@ will be charged as a new owner of it. 5.2 stat file ------------- -memory.stat file includes following statistics - -per-memory cgroup local status -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -=============== =============================================================== -cache # of bytes of page cache memory. 
-rss # of bytes of anonymous and swap cache memory (includes - transparent hugepages). -rss_huge # of bytes of anonymous transparent hugepages. -mapped_file # of bytes of mapped file (includes tmpfs/shmem) -pgpgin # of charging events to the memory cgroup. The charging - event happens each time a page is accounted as either mapped - anon page(RSS) or cache page(Page Cache) to the cgroup. -pgpgout # of uncharging events to the memory cgroup. The uncharging - event happens each time a page is unaccounted from the cgroup. -swap # of bytes of swap usage -dirty # of bytes that are waiting to get written back to the disk. -writeback # of bytes of file/anon cache that are queued for syncing to - disk. -inactive_anon # of bytes of anonymous and swap cache memory on inactive - LRU list. -active_anon # of bytes of anonymous and swap cache memory on active - LRU list. -inactive_file # of bytes of file-backed memory and MADV_FREE anonymous memory( - LazyFree pages) on inactive LRU list. -active_file # of bytes of file-backed memory on active LRU list. -unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). -=============== =============================================================== - -status considering hierarchy (see memory.use_hierarchy settings) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -========================= =================================================== -hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy - under which the memory cgroup is -hierarchical_memsw_limit # of bytes of memory+swap limit with regard to - hierarchy under which memory cgroup is. - -total_<counter> # hierarchical version of <counter>, which in - addition to the cgroup's own value includes the - sum of all hierarchical children's values of - <counter>, i.e. 
total_cache -========================= =================================================== - -The following additional stats are dependent on CONFIG_DEBUG_VM -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -========================= ======================================== -recent_rotated_anon VM internal parameter. (see mm/vmscan.c) -recent_rotated_file VM internal parameter. (see mm/vmscan.c) -recent_scanned_anon VM internal parameter. (see mm/vmscan.c) -recent_scanned_file VM internal parameter. (see mm/vmscan.c) -========================= ======================================== - -Memo: +memory.stat file includes following statistics: + + * per-memory cgroup local status + + =============== =============================================================== + cache # of bytes of page cache memory. + rss # of bytes of anonymous and swap cache memory (includes + transparent hugepages). + rss_huge # of bytes of anonymous transparent hugepages. + mapped_file # of bytes of mapped file (includes tmpfs/shmem) + pgpgin # of charging events to the memory cgroup. The charging + event happens each time a page is accounted as either mapped + anon page(RSS) or cache page(Page Cache) to the cgroup. + pgpgout # of uncharging events to the memory cgroup. The uncharging + event happens each time a page is unaccounted from the + cgroup. + swap # of bytes of swap usage + dirty # of bytes that are waiting to get written back to the disk. + writeback # of bytes of file/anon cache that are queued for syncing to + disk. + inactive_anon # of bytes of anonymous and swap cache memory on inactive + LRU list. + active_anon # of bytes of anonymous and swap cache memory on active + LRU list. + inactive_file # of bytes of file-backed memory and MADV_FREE anonymous + memory (LazyFree pages) on inactive LRU list. + active_file # of bytes of file-backed memory on active LRU list. + unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). 
+ =============== =============================================================== + + * status considering hierarchy (see memory.use_hierarchy settings): + + ========================= =================================================== + hierarchical_memory_limit # of bytes of memory limit with regard to + hierarchy + under which the memory cgroup is + hierarchical_memsw_limit # of bytes of memory+swap limit with regard to + hierarchy under which memory cgroup is. + + total_<counter> # hierarchical version of <counter>, which in + addition to the cgroup's own value includes the + sum of all hierarchical children's values of + <counter>, i.e. total_cache + ========================= =================================================== + + * additional vm parameters (depends on CONFIG_DEBUG_VM): + + ========================= ======================================== + recent_rotated_anon VM internal parameter. (see mm/vmscan.c) + recent_rotated_file VM internal parameter. (see mm/vmscan.c) + recent_scanned_anon VM internal parameter. (see mm/vmscan.c) + recent_scanned_file VM internal parameter. (see mm/vmscan.c) + ========================= ======================================== + +.. hint:: recent_rotated means recent frequency of LRU rotation. recent_scanned means recent # of scans to LRU. showing for better debug please see the code for meanings. -Note: +.. note:: Only anonymous and swap cache memory is listed as part of 'rss' stat. This should not be confused with the true 'resident set size' or the amount of physical memory used by the cgroup. @@ -710,13 +717,16 @@ If we want to change this to 1G, we can at any time use:: # echo 1G > memory.soft_limit_in_bytes -NOTE1: +.. note:: Soft limits take effect over a long period of time, since they involve reclaiming memory for balancing between memory cgroups -NOTE2: + +.. note:: It is recommended to set the soft limit always below the hard limit, otherwise the hard limit will take precedence. +.. 
_cgroup-v1-memory-move-charges: + 8. Move charges at task migration ================================= @@ -735,23 +745,29 @@ If you want to enable it:: # echo (some positive value) > memory.move_charge_at_immigrate -Note: +.. note:: Each bits of move_charge_at_immigrate has its own meaning about what type - of charges should be moved. See 8.2 for details. -Note: + of charges should be moved. See :ref:`section 8.2 + <cgroup-v1-memory-movable-charges>` for details. + +.. note:: Charges are moved only when you move mm->owner, in other words, a leader of a thread group. -Note: + +.. note:: If we cannot find enough space for the task in the destination cgroup, we try to make space by reclaiming memory. Task migration may fail if we cannot make enough space. -Note: + +.. note:: It can take several seconds if you move charges much. And if you want disable it again:: # echo 0 > memory.move_charge_at_immigrate +.. _cgroup-v1-memory-movable-charges: + 8.2 Type of charges which can be moved -------------------------------------- @@ -801,6 +817,8 @@ threshold in any direction. It's applicable for root and non-root cgroup. +.. _cgroup-v1-memory-oom-control: + 10. OOM Control =============== @@ -956,15 +974,16 @@ commented and discussed quite extensively in the community. References ========== -1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ -2. Singh, Balbir. Memory Controller (RSS Control), +.. [1] Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ +.. [2] Singh, Balbir. Memory Controller (RSS Control), http://lwn.net/Articles/222762/ -3. Emelianov, Pavel. Resource controllers based on process cgroups +.. [3] Emelianov, Pavel. Resource controllers based on process cgroups https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru -4. Emelianov, Pavel. RSS controller based on process cgroups (v2) +.. [4] Emelianov, Pavel. RSS controller based on process cgroups (v2) https://lore.kernel.org/r/461A3010.90403@sw.ru -5. Emelianov, Pavel. 
RSS controller based on process cgroups (v3) +.. [5] Emelianov, Pavel. RSS controller based on process cgroups (v3) https://lore.kernel.org/r/465D9739.8070209@openvz.org + 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ 7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control subsystem (v3), http://lwn.net/Articles/235534/ @@ -974,7 +993,8 @@ References https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com 10. Singh, Balbir. Memory controller v6 test results, https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop -11. Singh, Balbir. Memory controller introduction (v6), - https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop -12. Corbet, Jonathan, Controlling memory use in cgroups, - http://lwn.net/Articles/243795/ + +.. [11] Singh, Balbir. Memory controller introduction (v6), + https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop +.. [12] Corbet, Jonathan, Controlling memory use in cgroups, + http://lwn.net/Articles/243795/ diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index ca826bd1eba3..636f1c682ac0 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1271,7 +1271,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on); /** * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset - * @cpuset: The cpuset that requests change in partition root state + * @cs: The cpuset that requests change in partition root state * @cmd: Partition root state change command * @newmask: Optional new cpumask for partcmd_update * @tmp: Temporary addmask and delmask @@ -3286,8 +3286,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = { int __init cpuset_init(void) { - BUG_ON(percpu_init_rwsem(&cpuset_rwsem)); - BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL)); BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL)); 
 	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
@@ -3907,8 +3905,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 }
 
 /**
- * cpuset_mem_spread_node() - On which node to begin search for a file page
- * cpuset_slab_spread_node() - On which node to begin search for a slab page
+ * cpuset_spread_node() - On which node to begin search for a page
  *
  * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
  * tasks in a cpuset with is_spread_page or is_spread_slab set),
@@ -3932,12 +3929,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
  * is passed an offline node, it will fall back to the local node.
  * See kmem_cache_alloc_node().
  */
-
 static int cpuset_spread_node(int *rotor)
 {
 	return *rotor = next_node_in(*rotor, current->mems_allowed);
 }
 
+/**
+ * cpuset_mem_spread_node() - On which node to begin search for a file page
+ */
 int cpuset_mem_spread_node(void)
 {
 	if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
@@ -3947,6 +3946,9 @@ int cpuset_mem_spread_node(void)
 	return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
 }
 
+/**
+ * cpuset_slab_spread_node() - On which node to begin search for a slab page
+ */
 int cpuset_slab_spread_node(void)
 {
 	if (current->cpuset_slab_spread_rotor == NUMA_NO_NODE)
@@ -3955,7 +3957,6 @@ int cpuset_slab_spread_node(void)
 	return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
 }
 
-
 EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
 
 /**