From 8bad1cd0e1edd124c0f05f925762ef84e6047586 Mon Sep 17 00:00:00 2001 From: "Du, Changbin" Date: Thu, 19 May 2016 17:09:38 -0700 Subject: Documentation: update debugobjects doc Update documentation corresponding to change (debugobjects: make fixup functions return bool instead of int). Signed-off-by: Du, Changbin Cc: Jonathan Corbet Cc: Josh Triplett Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Tejun Heo Cc: Christian Borntraeger Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/DocBook/debugobjects.tmpl | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/debugobjects.tmpl b/Documentation/DocBook/debugobjects.tmpl index 24979f691e3e..7e4f34fde697 100644 --- a/Documentation/DocBook/debugobjects.tmpl +++ b/Documentation/DocBook/debugobjects.tmpl @@ -316,8 +316,8 @@ - The function returns 1 when the fixup was successful, - otherwise 0. The return value is used to update the + The function returns true when the fixup was successful, + otherwise false. The return value is used to update the statistics. @@ -341,8 +341,8 @@ - The function returns 1 when the fixup was successful, - otherwise 0. The return value is used to update the + The function returns true when the fixup was successful, + otherwise false. The return value is used to update the statistics. @@ -359,7 +359,8 @@ statically initialized object or not. In case it is it calls debug_object_init() and debug_object_activate() to make the object known to the tracker and marked active. In this case - the function should return 0 because this is not a real fixup. + the function should return false because this is not a real + fixup. @@ -376,8 +377,8 @@ - The function returns 1 when the fixup was successful, - otherwise 0. The return value is used to update the + The function returns true when the fixup was successful, + otherwise false. The return value is used to update the statistics.
@@ -397,8 +398,8 @@ - The function returns 1 when the fixup was successful, - otherwise 0. The return value is used to update the + The function returns true when the fixup was successful, + otherwise false. The return value is used to update the statistics. @@ -414,8 +415,8 @@ debug bucket. - The function returns 1 when the fixup was successful, - otherwise 0. The return value is used to update the + The function returns true when the fixup was successful, + otherwise false. The return value is used to update the statistics. @@ -427,7 +428,8 @@ case. The fixup function should check if this is a legitimate case of a statically initialized object or not. In this case only debug_object_init() should be called to make the object known to - the tracker. Then the function should return 0 because this is not + the tracker. Then the function should return false because this + is not a real fixup. -- cgit v1.2.3-70-g09d2 From 0139aa7b7fa12ceef095d99dc36606a5b10ab83a Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Thu, 19 May 2016 17:10:49 -0700 Subject: mm: rename _count, field of the struct page, to _refcount Many developers already know that field for reference count of the struct page is _count and atomic type. They would try to handle it directly and this could break the purpose of page reference count tracepoint. To prevent direct _count modification, this patch rename it to _refcount and add warning message on the code. After that, developer who need to handle reference count will find that field should not be accessed directly. [akpm@linux-foundation.org: fix comments, per Vlastimil] [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too] [sfr@canb.auug.org.au: sync ethernet driver changes] Signed-off-by: Joonsoo Kim Signed-off-by: Stephen Rothwell Cc: Vlastimil Babka Cc: Hugh Dickins Cc: Johannes Berg Cc: "David S. 
Miller" Cc: Sunil Goutham Cc: Chris Metcalf Cc: Manish Chopra Cc: Yuval Mintz Cc: Tariq Toukan Cc: Saeed Mahameed Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/transhuge.txt | 10 +++++----- arch/tile/mm/init.c | 2 +- drivers/block/aoe/aoecmd.c | 2 +- drivers/hwtracing/intel_th/msu.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 20 +++++++++---------- drivers/net/ethernet/qlogic/qede/qede_main.c | 4 ++-- fs/proc/page.c | 2 +- include/linux/mm.h | 2 +- include/linux/mm_types.h | 14 ++++++++----- include/linux/page_ref.h | 26 ++++++++++++------------- include/linux/pagemap.h | 8 ++++---- kernel/kexec_core.c | 2 +- mm/huge_memory.c | 4 ++-- mm/internal.h | 2 +- mm/page_alloc.c | 4 ++-- mm/slub.c | 4 ++-- mm/vmscan.c | 4 ++-- 17 files changed, 58 insertions(+), 54 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index d9cb65cf5cfd..fb0e1f2a19cc 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -394,9 +394,9 @@ hugepage natively. Once finished you can drop the page table lock. Refcounting on THP is mostly consistent with refcounting on other compound pages: - - get_page()/put_page() and GUP operate in head page's ->_count. + - get_page()/put_page() and GUP operate in head page's ->_refcount. - - ->_count in tail pages is always zero: get_page_unless_zero() never + - ->_refcount in tail pages is always zero: get_page_unless_zero() never succeed on tail pages. - map/unmap of the pages with PTE entry increment/decrement ->_mapcount @@ -426,15 +426,15 @@ requests to split pinned huge page: it expects page count to be equal to sum of mapcount of all sub-pages plus one (split_huge_page caller must have reference for head page). -split_huge_page uses migration entries to stabilize page->_count and +split_huge_page uses migration entries to stabilize page->_refcount and page->_mapcount. 
We safe against physical memory scanners too: the only legitimate way scanner can get reference to a page is get_page_unless_zero(). -All tail pages has zero ->_count until atomic_add(). It prevent scanner +All tail pages has zero ->_refcount until atomic_add(). It prevent scanner from geting reference to tail page up to the point. After the atomic_add() -we don't care about ->_count value. We already known how many references +we don't care about ->_refcount value. We already known how many references with should uncharge from head page. For head page get_page_unless_zero() will succeed and we don't mind. It's diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c index a0582b7f41d3..adce25462b0d 100644 --- a/arch/tile/mm/init.c +++ b/arch/tile/mm/init.c @@ -679,7 +679,7 @@ static void __init init_free_pfn_range(unsigned long start, unsigned long end) * Hacky direct set to avoid unnecessary * lock take/release for EVERY page here. */ - p->_count.counter = 0; + p->_refcount.counter = 0; p->_mapcount.counter = -1; } init_page_count(page); diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c index 437b3a822f44..d597e432e195 100644 --- a/drivers/block/aoe/aoecmd.c +++ b/drivers/block/aoe/aoecmd.c @@ -861,7 +861,7 @@ rqbiocnt(struct request *r) * discussion. * * We cannot use get_page in the workaround, because it insists on a - * positive page count as a precondition. So we use _count directly. + * positive page count as a precondition. So we use _refcount directly. 
*/ static void bio_pageinc(struct bio *bio) diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c index d9d6022c5aca..d2209147dc89 100644 --- a/drivers/hwtracing/intel_th/msu.c +++ b/drivers/hwtracing/intel_th/msu.c @@ -1164,7 +1164,7 @@ static void msc_mmap_close(struct vm_area_struct *vma) if (!atomic_dec_and_mutex_lock(&msc->mmap_count, &msc->buf_mutex)) return; - /* drop page _counts */ + /* drop page _refcounts */ for (pg = 0; pg < msc->nr_pages; pg++) { struct page *page = msc_buffer_get_page(msc, pg); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index f3456798c596..bd947704b59c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -433,8 +433,8 @@ static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq, for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) { if (unlikely(mlx5e_alloc_and_map_page(rq, wi, i))) goto err_unmap; - atomic_add(mlx5e_mpwqe_strides_per_page(rq), - &wi->umr.dma_info[i].page->_count); + page_ref_add(wi->umr.dma_info[i].page, + mlx5e_mpwqe_strides_per_page(rq)); wi->skbs_frags[i] = 0; } @@ -452,8 +452,8 @@ err_unmap: while (--i >= 0) { dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE, PCI_DMA_FROMDEVICE); - atomic_sub(mlx5e_mpwqe_strides_per_page(rq), - &wi->umr.dma_info[i].page->_count); + page_ref_sub(wi->umr.dma_info[i].page, + mlx5e_mpwqe_strides_per_page(rq)); put_page(wi->umr.dma_info[i].page); } dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE); @@ -477,8 +477,8 @@ void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq, for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) { dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE, PCI_DMA_FROMDEVICE); - atomic_sub(mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i], - &wi->umr.dma_info[i].page->_count); + page_ref_sub(wi->umr.dma_info[i].page, + mlx5e_mpwqe_strides_per_page(rq) - 
wi->skbs_frags[i]); put_page(wi->umr.dma_info[i].page); } dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE); @@ -527,8 +527,8 @@ static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq, */ split_page(wi->dma_info.page, MLX5_MPWRQ_WQE_PAGE_ORDER); for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) { - atomic_add(mlx5e_mpwqe_strides_per_page(rq), - &wi->dma_info.page[i]._count); + page_ref_add(&wi->dma_info.page[i], + mlx5e_mpwqe_strides_per_page(rq)); wi->skbs_frags[i] = 0; } @@ -551,8 +551,8 @@ void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq, dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz, PCI_DMA_FROMDEVICE); for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) { - atomic_sub(mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i], - &wi->dma_info.page[i]._count); + page_ref_sub(&wi->dma_info.page[i], + mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]); put_page(&wi->dma_info.page[i]); } } diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c index 3aabfc0adefe..73dd525fbf08 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_main.c +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c @@ -1036,7 +1036,7 @@ static int qede_fill_frag_skb(struct qede_dev *edev, /* Incr page ref count to reuse on allocation failure * so that it doesn't get freed while freeing SKB. */ - atomic_inc(&current_bd->data->_count); + page_ref_inc(current_bd->data); goto out; } @@ -1487,7 +1487,7 @@ static int qede_rx_int(struct qede_fastpath *fp, int budget) * freeing SKB.
*/ - atomic_inc(&sw_rx_data->data->_count); + page_ref_inc(sw_rx_data->data); rxq->rx_alloc_errors++; qede_recycle_rx_bd_ring(rxq, edev, fp_cqe->bd_num); diff --git a/fs/proc/page.c b/fs/proc/page.c index 712f1b9992cc..3ecd445e830d 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -142,7 +142,7 @@ u64 stable_page_flags(struct page *page) /* - * Caveats on high order pages: page->_count will only be set + * Caveats on high order pages: page->_refcount will only be set * -1 on the head page; SLUB/SLQB do the same for PG_slab; * SLOB won't set PG_slab at all on compound pages. */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 727f799757ab..1193a54ea2b3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -734,7 +734,7 @@ static inline void get_page(struct page *page) page = compound_head(page); /* * Getting a normal page or the head of a compound page - * requires to already have an elevated page->_count. + * requires to already have an elevated page->_refcount. */ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); page_ref_inc(page); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c2d75b4fa86c..1fda9c99ef95 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -73,9 +73,9 @@ struct page { unsigned long counters; #else /* - * Keep _count separate from slub cmpxchg_double data. - * As the rest of the double word is protected by - * slab_lock but _count is not. + * Keep _refcount separate from slub cmpxchg_double + * data. As the rest of the double word is protected by + * slab_lock but _refcount is not. */ unsigned counters; #endif @@ -97,7 +97,11 @@ struct page { }; int units; /* SLOB */ }; - atomic_t _count; /* Usage count, see below. */ + /* + * Usage count, *USE WRAPPER FUNCTION* + * when manual accounting. 
See page_ref.h + */ + atomic_t _refcount; }; unsigned int active; /* SLAB */ }; @@ -248,7 +252,7 @@ struct page_frag_cache { __u32 offset; #endif /* we maintain a pagecount bias, so that we dont dirty cache line - * containing page->_count every time we allocate a fragment. + * containing page->_refcount every time we allocate a fragment. */ unsigned int pagecnt_bias; bool pfmemalloc; diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index e596d5d9540e..8b5e0a9f2431 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -63,17 +63,17 @@ static inline void __page_ref_unfreeze(struct page *page, int v) static inline int page_ref_count(struct page *page) { - return atomic_read(&page->_count); + return atomic_read(&page->_refcount); } static inline int page_count(struct page *page) { - return atomic_read(&compound_head(page)->_count); + return atomic_read(&compound_head(page)->_refcount); } static inline void set_page_count(struct page *page, int v) { - atomic_set(&page->_count, v); + atomic_set(&page->_refcount, v); if (page_ref_tracepoint_active(__tracepoint_page_ref_set)) __page_ref_set(page, v); } @@ -89,35 +89,35 @@ static inline void init_page_count(struct page *page) static inline void page_ref_add(struct page *page, int nr) { - atomic_add(nr, &page->_count); + atomic_add(nr, &page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, nr); } static inline void page_ref_sub(struct page *page, int nr) { - atomic_sub(nr, &page->_count); + atomic_sub(nr, &page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, -nr); } static inline void page_ref_inc(struct page *page) { - atomic_inc(&page->_count); + atomic_inc(&page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, 1); } static inline void page_ref_dec(struct page *page) { - atomic_dec(&page->_count); + atomic_dec(&page->_refcount); if 
(page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, -1); } static inline int page_ref_sub_and_test(struct page *page, int nr) { - int ret = atomic_sub_and_test(nr, &page->_count); + int ret = atomic_sub_and_test(nr, &page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_test)) __page_ref_mod_and_test(page, -nr, ret); @@ -126,7 +126,7 @@ static inline int page_ref_sub_and_test(struct page *page, int nr) static inline int page_ref_dec_and_test(struct page *page) { - int ret = atomic_dec_and_test(&page->_count); + int ret = atomic_dec_and_test(&page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_test)) __page_ref_mod_and_test(page, -1, ret); @@ -135,7 +135,7 @@ static inline int page_ref_dec_and_test(struct page *page) static inline int page_ref_dec_return(struct page *page) { - int ret = atomic_dec_return(&page->_count); + int ret = atomic_dec_return(&page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return)) __page_ref_mod_and_return(page, -1, ret); @@ -144,7 +144,7 @@ static inline int page_ref_dec_return(struct page *page) static inline int page_ref_add_unless(struct page *page, int nr, int u) { - int ret = atomic_add_unless(&page->_count, nr, u); + int ret = atomic_add_unless(&page->_refcount, nr, u); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_unless)) __page_ref_mod_unless(page, nr, ret); @@ -153,7 +153,7 @@ static inline int page_ref_add_unless(struct page *page, int nr, int u) static inline int page_ref_freeze(struct page *page, int count) { - int ret = likely(atomic_cmpxchg(&page->_count, count, 0) == count); + int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count); if (page_ref_tracepoint_active(__tracepoint_page_ref_freeze)) __page_ref_freeze(page, count, ret); @@ -165,7 +165,7 @@ static inline void page_ref_unfreeze(struct page *page, int count) VM_BUG_ON_PAGE(page_count(page) != 0, page); VM_BUG_ON(count == 0); - 
atomic_set(&page->_count, count); + atomic_set(&page->_refcount, count); if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze)) __page_ref_unfreeze(page, count); } diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 7e1ab155c67c..fe1513ffb7bf 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -90,12 +90,12 @@ void release_pages(struct page **pages, int nr, bool cold); /* * speculatively take a reference to a page. - * If the page is free (_count == 0), then _count is untouched, and 0 - * is returned. Otherwise, _count is incremented by 1 and 1 is returned. + * If the page is free (_refcount == 0), then _refcount is untouched, and 0 + * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned. * * This function must be called inside the same rcu_read_lock() section as has * been used to lookup the page in the pagecache radix-tree (or page table): - * this allows allocators to use a synchronize_rcu() to stabilize _count. + * this allows allocators to use a synchronize_rcu() to stabilize _refcount. * * Unless an RCU grace period has passed, the count of all pages coming out * of the allocator must be considered unstable. page_count may return higher @@ -111,7 +111,7 @@ void release_pages(struct page **pages, int nr, bool cold); * 2. conditionally increment refcount * 3. check the page is still in pagecache (if no, goto 1) * - * Remove-side that cares about stability of _count (eg. reclaim) has the + * Remove-side that cares about stability of _refcount (eg. reclaim) has the * following (with tree_lock held for write): * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg) * B. 
remove page from pagecache diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 1391d3ee3b86..1c03dfb4abfd 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1410,7 +1410,7 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_STRUCT_SIZE(list_head); VMCOREINFO_SIZE(nodemask_t); VMCOREINFO_OFFSET(page, flags); - VMCOREINFO_OFFSET(page, _count); + VMCOREINFO_OFFSET(page, _refcount); VMCOREINFO_OFFSET(page, mapping); VMCOREINFO_OFFSET(page, lru); VMCOREINFO_OFFSET(page, _mapcount); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b49ee126d4d1..f8ac8f582fd8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3113,7 +3113,7 @@ static void __split_huge_page_tail(struct page *head, int tail, VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail); /* - * tail_page->_count is zero and not changing from under us. But + * tail_page->_refcount is zero and not changing from under us. But * get_page_unless_zero() may be running from under us on the * tail_page. If we used atomic_set() below instead of atomic_inc(), we * would then run atomic_set() concurrently with @@ -3340,7 +3340,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (mlocked) lru_add_drain(); - /* Prevent deferred_split_scan() touching ->_count */ + /* Prevent deferred_split_scan() touching ->_refcount */ spin_lock_irqsave(&pgdata->split_queue_lock, flags); count = page_count(head); mapcount = total_mapcount(head); diff --git a/mm/internal.h b/mm/internal.h index b79abb6721cf..098a89e3b97c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -58,7 +58,7 @@ static inline unsigned long ra_submit(struct file_ra_state *ra, } /* - * Turn a non-refcounted page (->_count == 0) into refcounted with + * Turn a non-refcounted page (->_refcount == 0) into refcounted with * a count of one. 
*/ static inline void set_page_refcounted(struct page *page) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c1069efcc4d7..4ce57f938b7f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -794,7 +794,7 @@ static inline int free_pages_check(struct page *page) if (unlikely(page->mapping != NULL)) bad_reason = "non-NULL mapping"; if (unlikely(page_ref_count(page) != 0)) - bad_reason = "nonzero _count"; + bad_reason = "nonzero _refcount"; if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_FREE)) { bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"; bad_flags = PAGE_FLAGS_CHECK_AT_FREE; @@ -6864,7 +6864,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, * We can't use page_count without pin a page * because another CPU can free compound page. * This check already skips compound tails of THP - * because their page->_count is zero at all time. + * because their page->_refcount is zero at all time. */ if (!page_ref_count(page)) { if (PageBuddy(page)) diff --git a/mm/slub.c b/mm/slub.c index 8671de2e5b12..cf1faa4d3992 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -329,8 +329,8 @@ static inline void set_page_slub_counters(struct page *page, unsigned long count tmp.counters = counters_new; /* * page->counters can cover frozen/inuse/objects as well - * as page->_count. If we assign to ->counters directly - * we run the risk of losing updates to page->_count, so + * as page->_refcount. If we assign to ->counters directly + * we run the risk of losing updates to page->_refcount, so * be careful and only assign to the fields we need. */ page->frozen = tmp.frozen; diff --git a/mm/vmscan.c b/mm/vmscan.c index 142cb61f4822..d3a02ac3eed7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -633,7 +633,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, * * Reversing the order of the tests ensures such a situation cannot * escape unnoticed. 
The smp_rmb is needed to ensure the page->flags - * load is not satisfied before that of page->_count. + * load is not satisfied before that of page->_refcount. * * Note that if SetPageDirty is always performed via set_page_dirty, * and thus under tree_lock, then this ordering is not required. @@ -1720,7 +1720,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * It is safe to rely on PG_active against the non-LRU pages in here because * nobody will play with that bit on a non-LRU page. * - * The downside is that we have to touch page->_count against each page. + * The downside is that we have to touch page->_refcount against each page. * But we had to alter page->flags anyway. */ -- cgit v1.2.3-70-g09d2 From 52b6f46bc163eef17ecba4cd552beeafe2b24453 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Thu, 19 May 2016 17:12:50 -0700 Subject: mm: /proc/sys/vm/stat_refresh to force vmstat update Provide /proc/sys/vm/stat_refresh to force an immediate update of per-cpu into global vmstats: useful to avoid a sleep(2) or whatever before checking counts when testing. Originally added to work around a bug which left counts stranded indefinitely on a cpu going idle (an inaccuracy magnified when small below-batch numbers represent "huge" amounts of memory), but I believe that bug is now fixed: nonetheless, this is still a useful knob. Its schedule_on_each_cpu() is probably too expensive just to fold into reading /proc/meminfo itself: give this mode 0600 to prevent abuse. Allow a write or a read to do the same: nothing to read, but "grep -h Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and since global_page_state() itself is careful to disguise any underflow as 0, hack in an "Invalid argument" and pr_warn() if a counter is negative after the refresh - this helped to fix a misaccounting of NR_ISOLATED_FILE in my migration code. But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED often go negative some of the time. 
I have not yet worked out why, but have no evidence that it's actually harmful. Punt for the moment by just ignoring the anomaly on those. Signed-off-by: Hugh Dickins Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Andres Lagar-Cavilla Cc: Yang Shi Cc: Ning Qu Cc: Mel Gorman Cc: Andres Lagar-Cavilla Cc: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 14 +++++++++++ include/linux/vmstat.h | 4 +++ kernel/sysctl.c | 7 ++++++ mm/vmstat.c | 60 +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 85 insertions(+) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 34a5fece3121..720355cbdf45 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -57,6 +57,7 @@ Currently, these files are in /proc/sys/vm: - panic_on_oom - percpu_pagelist_fraction - stat_interval +- stat_refresh - swappiness - user_reserve_kbytes - vfs_cache_pressure @@ -755,6 +756,19 @@ is 1 second. ============================================================== +stat_refresh + +Any read or write (by root only) flushes all the per-cpu vm statistics +into their global totals, for more accurate reports when testing +e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo + +As a side-effect, it also checks for negative totals (elsewhere reported +as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. +(At time of writing, a few stats are known sometimes to be found negative, +with no ill effects: errors and warnings on these stats are suppressed.) 
+ +============================================================== + swappiness This control is used to define how aggressive the kernel will swap diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 73fae8c4a5fb..02fce415b3d9 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -193,6 +193,10 @@ void quiet_vmstat(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); +struct ctl_table; +int vmstat_refresh(struct ctl_table *, int write, + void __user *buffer, size_t *lenp, loff_t *ppos); + void drain_zonestat(struct zone *zone, struct per_cpu_pageset *); int calculate_pressure_threshold(struct zone *zone); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c8b318663525..2effd84d83e3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1521,6 +1521,13 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, + { + .procname = "stat_refresh", + .data = NULL, + .maxlen = 0, + .mode = 0600, + .proc_handler = vmstat_refresh, + }, #endif #ifdef CONFIG_MMU { diff --git a/mm/vmstat.c b/mm/vmstat.c index a7de9adacbd9..c831be32a1a3 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1379,6 +1379,66 @@ static DEFINE_PER_CPU(struct delayed_work, vmstat_work); int sysctl_stat_interval __read_mostly = HZ; static cpumask_var_t cpu_stat_off; +#ifdef CONFIG_PROC_FS +static void refresh_vm_stats(struct work_struct *work) +{ + refresh_cpu_vm_stats(true); +} + +int vmstat_refresh(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + long val; + int err; + int i; + + /* + * The regular update, every sysctl_stat_interval, may come later + * than expected: leaving a significant amount in per_cpu buckets. + * This is particularly misleading when checking a quantity of HUGE + * pages, immediately after running a test. 
/proc/sys/vm/stat_refresh, + * which can equally be echo'ed to or cat'ted from (by root), + * can be used to update the stats just before reading them. + * + * Oh, and since global_page_state() etc. are so careful to hide + * transiently negative values, report an error here if any of + * the stats is negative, so we know to go looking for imbalance. + */ + err = schedule_on_each_cpu(refresh_vm_stats); + if (err) + return err; + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { + val = atomic_long_read(&vm_stat[i]); + if (val < 0) { + switch (i) { + case NR_ALLOC_BATCH: + case NR_PAGES_SCANNED: + /* + * These are often seen to go negative in + * recent kernels, but not to go permanently + * negative. Whilst it would be nicer not to + * have exceptions, rooting them out would be + * another task, of rather low priority. + */ + break; + default: + pr_warn("%s: %s %ld\n", + __func__, vmstat_text[i], val); + err = -EINVAL; + break; + } + } + } + if (err) + return err; + if (write) + *ppos += *lenp; + else + *lenp = 0; + return 0; +} +#endif /* CONFIG_PROC_FS */ + static void vmstat_update(struct work_struct *w) { if (refresh_cpu_vm_stats(true)) { -- cgit v1.2.3-70-g09d2 From 8604d9e534a3e662600e288bcfd1a5acd2763d28 Mon Sep 17 00:00:00 2001 From: Vitaly Kuznetsov Date: Thu, 19 May 2016 17:13:03 -0700 Subject: memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE This patchset continues the work I started with commit 31bc3858ea3e ("memory-hotplug: add automatic onlining policy for the newly added memory"). Initially I was going to stop there and bring the policy setting logic to userspace. I met two issues on this way: 1) It is possible to have memory hotplugged at boot (e.g. with QEMU). These blocks stay offlined if we turn the onlining policy on by userspace. 2) My attempt to bring this policy setting to systemd failed, systemd maintainers suggest to change the default in kernel or ... 
to use tmpfiles.d to alter the policy (which looks like a hack to me): https://github.com/systemd/systemd/pull/2938 Here I suggest to add a config option to set the default value for the policy and a kernel command line parameter to make the override. This patch (of 2): Introduce config option to set the default value for memory hotplug onlining policy (/sys/devices/system/memory/auto_online_blocks). The reason one would want to turn this option on are to have early onlining for hotpluggable memory available at boot and to not require any userspace actions to make memory hotplug work. [akpm@linux-foundation.org: tweak Kconfig text] Signed-off-by: Vitaly Kuznetsov Cc: Jonathan Corbet Cc: Dan Williams Cc: "Kirill A. Shutemov" Cc: Mel Gorman Cc: David Vrabel Cc: David Rientjes Cc: Igor Mammedov Cc: Lennart Poettering Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/memory-hotplug.txt | 9 +++++---- mm/Kconfig | 16 ++++++++++++++++ mm/memory_hotplug.c | 4 ++++ 3 files changed, 25 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 443f4b44ad97..0d7cb955aa01 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -261,10 +261,11 @@ it according to the policy which can be read from "auto_online_blocks" file: % cat /sys/devices/system/memory/auto_online_blocks -The default is "offline" which means the newly added memory is not in a -ready-to-use state and you have to "online" the newly added memory blocks -manually. Automatic onlining can be requested by writing "online" to -"auto_online_blocks" file: +The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config +option. If it is disabled the default is "offline" which means the newly added +memory is not in a ready-to-use state and you have to "online" the newly added +memory blocks manually. 
Automatic onlining can be requested by writing "online" +to "auto_online_blocks" file: % echo online > /sys/devices/system/memory/auto_online_blocks diff --git a/mm/Kconfig b/mm/Kconfig index d6e9042b99e0..b0432b71137d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -192,6 +192,22 @@ config MEMORY_HOTPLUG_SPARSE def_bool y depends on SPARSEMEM && MEMORY_HOTPLUG +config MEMORY_HOTPLUG_DEFAULT_ONLINE + bool "Online the newly added memory blocks by default" + default n + depends on MEMORY_HOTPLUG + help + This option sets the default policy setting for memory hotplug + onlining policy (/sys/devices/system/memory/auto_online_blocks) which + determines what happens to newly added memory regions. Policy setting + can always be changed at runtime. + See Documentation/memory-hotplug.txt for more information. + + Say Y here if you want all hot-plugged memory blocks to appear in + 'online' state by default. + Say N here if you want the default policy to keep all hot-plugged + memory blocks in 'offline' state. + config MEMORY_HOTREMOVE bool "Allow for memory hot remove" select MEMORY_ISOLATION diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b21d8895ea41..fcafbfcff044 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -78,7 +78,11 @@ static struct { #define memhp_lock_acquire() lock_map_acquire(&mem_hotplug.dep_map) #define memhp_lock_release() lock_map_release(&mem_hotplug.dep_map) +#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE bool memhp_auto_online; +#else +bool memhp_auto_online = true; +#endif EXPORT_SYMBOL_GPL(memhp_auto_online); void get_online_mems(void) -- cgit v1.2.3-70-g09d2 From 86dd995d63241039e0ad9123f9b424013c611510 Mon Sep 17 00:00:00 2001 From: Vitaly Kuznetsov Date: Thu, 19 May 2016 17:13:06 -0700 Subject: memory_hotplug: introduce memhp_default_state= command line parameter CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the memory hotplug onlining policy. 
Add a command line parameter to make it possible to override the default. It may come handy for debug and testing purposes. Signed-off-by: Vitaly Kuznetsov Cc: Jonathan Corbet Cc: Dan Williams Cc: "Kirill A. Shutemov" Cc: Mel Gorman Cc: David Vrabel Cc: David Rientjes Cc: Igor Mammedov Cc: Lennart Poettering Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 8 ++++++++ mm/memory_hotplug.c | 11 +++++++++++ 2 files changed, 19 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 794403166a8f..2edb27bdc680 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2161,6 +2161,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted. [KNL,SH] Allow user to override the default size for per-device physically contiguous DMA buffers. + memhp_default_state=online/offline + [KNL] Set the initial state for the memory hotplug + onlining policy. If not specified, the default value is + set according to the + CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config + option. + See Documentation/memory-hotplug.txt. + memmap=exactmap [KNL,X86] Enable setting of an exact E820 memory map, as specified by the user. Such memmap=exactmap lines can be constructed based on diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index fcafbfcff044..caf2a14c37ad 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -85,6 +85,17 @@ bool memhp_auto_online = true; #endif EXPORT_SYMBOL_GPL(memhp_auto_online); +static int __init setup_memhp_default_state(char *str) +{ + if (!strcmp(str, "online")) + memhp_auto_online = true; + else if (!strcmp(str, "offline")) + memhp_auto_online = false; + + return 1; +} +__setup("memhp_default_state=", setup_memhp_default_state); + void get_online_mems(void) { might_sleep(); -- cgit v1.2.3-70-g09d2
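
The interaction between CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE and memhp_default_state= in the last two patches can be illustrated with a small userspace sketch. This is not kernel code: parse_memhp_default_state() and memhp_selftest() are hypothetical names introduced here, and the initial value of the flag merely stands in for the Kconfig default. It assumes only the behaviour visible in the diff above: "online" and "offline" set the policy, any other string leaves the compiled-in default untouched, and the handler always returns 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Userspace stand-in for the kernel's memhp_auto_online flag.
 * Its initial value plays the role of CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE.
 */
static bool memhp_auto_online;

/*
 * Hypothetical helper mirroring setup_memhp_default_state() from the patch:
 * "online" and "offline" set the policy, anything else leaves it untouched.
 * Returns 1 unconditionally, as the __setup() handler in the patch does.
 */
static int parse_memhp_default_state(const char *str)
{
	if (!strcmp(str, "online"))
		memhp_auto_online = true;
	else if (!strcmp(str, "offline"))
		memhp_auto_online = false;

	return 1;
}

/* Hypothetical self-check exercising the three cases. */
static void memhp_selftest(void)
{
	memhp_auto_online = false;           /* as if the config option were off */
	parse_memhp_default_state("online");
	assert(memhp_auto_online);

	parse_memhp_default_state("offline");
	assert(!memhp_auto_online);

	memhp_auto_online = true;            /* as if the config option were on */
	parse_memhp_default_state("bogus");  /* unrecognised: default is kept */
	assert(memhp_auto_online);
}
```

Calling memhp_selftest() from a main() exercises all three cases. One consequence worth noting is that a misspelled value on the command line is silently ignored, so the Kconfig default stays in effect.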