From 73444bc4d8f92e46a20cb6bd3342fc2ea75c6787 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 8 Jan 2019 15:23:39 -0800 Subject: mm, page_alloc: do not wake kswapd with zone lock held syzbot reported the following regression in the latest merge window and it was confirmed by Qian Cai that a similar bug was visible from a different context. ====================================================== WARNING: possible circular locking dependency detected 4.20.0+ #297 Not tainted ------------------------------------------------------ syz-executor0/8529 is trying to acquire lock: 000000005e7fb829 (&pgdat->kswapd_wait){....}, at: __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120 but task is already holding lock: 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock include/linux/spinlock.h:329 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk mm/page_alloc.c:2548 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist mm/page_alloc.c:3021 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist mm/page_alloc.c:3050 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue mm/page_alloc.c:3072 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491 It appears to be a false positive in that the only way the lock ordering should be inverted is if kswapd is waking itself and the wakeup allocates debugging objects which should already be allocated if it's kswapd doing the waking. Nevertheless, the possibility exists and so it's best to avoid the problem. This patch flags a zone as needing a kswapd using the, surprisingly, unused zone flag field. The flag is read without the lock held to do the wakeup. It's possible that the flag setting context is not the same as the flag clearing context or for small races to occur. However, each race possibility is harmless and there is no visible degredation in fragmentation treatment. While zone->flag could have continued to be unused, there is potential for moving some existing fields into the flags field instead. Particularly read-mostly ones like zone->initialized and zone->contiguous. Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Tested-by: Qian Cai Cc: Dmitry Vyukov Cc: Vlastimil Babka Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cde5dac6229a..d295c9bc01a8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2214,7 +2214,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, */ boost_watermark(zone); if (alloc_flags & ALLOC_KSWAPD) - wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags); /* We are not allowed to try stealing from the whole block */ if (!whole_block) @@ -3102,6 +3102,12 @@ struct page *rmqueue(struct zone *preferred_zone, local_irq_restore(flags); out: + /* Separate test+clear to avoid unnecessary atomics */ + if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) { + clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags); + wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + } + VM_BUG_ON_PAGE(page && bad_range(zone, page), page); return page; -- cgit v1.2.3-70-g09d2 From 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 25 Jan 2019 19:08:58 +0100 Subject: Revert "mm, memory_hotplug: initialize struct pages for the full memory section" This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684. The underlying assumption that one sparse section belongs into a single numa node doesn't hold really. Robert Shteynfeld has reported a boot failure. The boot log was not captured but his memory layout is as follows: Early memory node ranges node 1: [mem 0x0000000000001000-0x0000000000090fff] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff] node 1: [mem 0x0000000100000000-0x0000001423ffffff] node 0: [mem 0x0000001424000000-0x0000002023ffffff] This means that node0 starts in the middle of a memory section which is also in node1. memmap_init_zone tries to initialize padding of a section even when it is outside of the given pfn range because there are code paths (e.g. memory hotplug) which assume that the full worth of memory section is always initialized. In this particular case, though, such a range is already intialized and most likely already managed by the page allocator. Scribbling over those pages corrupts the internal state and likely blows up when any of those pages gets used. Reported-by: Robert Shteynfeld Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") Cc: stable@kernel.org Signed-off-by: Michal Hocko Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 12 ------------ 1 file changed, 12 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d295c9bc01a8..35fdde041f5c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5701,18 +5701,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, cond_resched(); } } -#ifdef CONFIG_SPARSEMEM - /* - * If the zone does not span the rest of the section then - * we should at least initialize those pages. Otherwise we - * could blow up on a poisoned page in some paths which depend - * on full sections being initialized (e.g. memory hotplug). - */ - while (end_pfn % PAGES_PER_SECTION) { - __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid); - end_pfn++; - } -#endif } #ifdef CONFIG_ZONE_DEVICE -- cgit v1.2.3-70-g09d2 From 2c2ade81741c66082f8211f0b96cf509cc4c0218 Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Wed, 13 Feb 2019 22:45:59 +0100 Subject: mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum number of references that we might need to create in the fastpath later, the bump-allocation fastpath only has to modify the non-atomic bias value that tracks the number of extra references we hold instead of the atomic refcount. The maximum number of allocations we can serve (under the assumption that no allocation is made with size 0) is nc->size, so that's the bias used. However, even when all memory in the allocation has been given away, a reference to the page is still held; and in the `offset < 0` slowpath, the page may be reused if everyone else has dropped their references. This means that the necessary number of references is actually `nc->size+1`. Luckily, from a quick grep, it looks like the only path that can call page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which requires CAP_NET_ADMIN in the init namespace and is only intended to be used for kernel testing and fuzzing. To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the `offset < 0` path, below the virt_to_page() call, and then repeatedly call writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI, with a vector consisting of 15 elements containing 1 byte each. Signed-off-by: Jann Horn Signed-off-by: David S. Miller --- mm/page_alloc.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'mm/page_alloc.c') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 35fdde041f5c..46285d28e43b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4675,11 +4675,11 @@ refill: /* Even if we own the page, we do not use atomic_set(). * This would break get_page_unless_zero() users. */ - page_ref_add(page, size - 1); + page_ref_add(page, size); /* reset page count bias and offset to start of new frag */ nc->pfmemalloc = page_is_pfmemalloc(page); - nc->pagecnt_bias = size; + nc->pagecnt_bias = size + 1; nc->offset = size; } @@ -4695,10 +4695,10 @@ refill: size = nc->size; #endif /* OK, page count is 0, we can safely set it */ - set_page_count(page, size); + set_page_count(page, size + 1); /* reset page count bias and offset to start of new frag */ - nc->pagecnt_bias = size; + nc->pagecnt_bias = size + 1; offset = size - fragsz; } -- cgit v1.2.3-70-g09d2