From 63341ab03706e11a31e3dd8ccc0fbc9beaf723f0 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 11 Dec 2019 12:11:52 +0100 Subject: virtio-balloon: fix managed page counts when migrating pages between zones In case we have to migrate a ballon page to a newpage of another zone, the managed page count of both zones is wrong. Paired with memory offlining (which will adjust the managed page count), we can trigger kernel crashes and all kinds of different symptoms. One way to reproduce: 1. Start a QEMU guest with 4GB, no NUMA 2. Hotplug a 1GB DIMM and online the memory to ZONE_NORMAL 3. Inflate the balloon to 1GB 4. Unplug the DIMM (be quick, otherwise unmovable data ends up on it) 5. Observe /proc/zoneinfo Node 0, zone Normal pages free 16810 min 24848885473806 low 18471592959183339 high 36918337032892872 spanned 262144 present 262144 managed 18446744073709533486 6. Do anything that requires some memory (e.g., inflate the balloon some more). The OOM goes crazy and the system crashes [ 238.324946] Out of memory: Killed process 537 (login) total-vm:27584kB, anon-rss:860kB, file-rss:0kB, shmem-rss:00 [ 238.338585] systemd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [ 238.339420] CPU: 0 PID: 1 Comm: systemd Tainted: G D W 5.4.0-next-20191204+ #75 [ 238.340139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 238.341121] Call Trace: [ 238.341337] dump_stack+0x8f/0xd0 [ 238.341630] dump_header+0x61/0x5ea [ 238.341942] oom_kill_process.cold+0xb/0x10 [ 238.342299] out_of_memory+0x24d/0x5a0 [ 238.342625] __alloc_pages_slowpath+0xd12/0x1020 [ 238.343024] __alloc_pages_nodemask+0x391/0x410 [ 238.343407] pagecache_get_page+0xc3/0x3a0 [ 238.343757] filemap_fault+0x804/0xc30 [ 238.344083] ? ext4_filemap_fault+0x28/0x42 [ 238.344444] ext4_filemap_fault+0x30/0x42 [ 238.344789] __do_fault+0x37/0x1a0 [ 238.345087] __handle_mm_fault+0x104d/0x1ab0 [ 238.345450] handle_mm_fault+0x169/0x360 [ 238.345790] do_user_addr_fault+0x20d/0x490 [ 238.346154] do_page_fault+0x31/0x210 [ 238.346468] async_page_fault+0x43/0x50 [ 238.346797] RIP: 0033:0x7f47eba4197e [ 238.347110] Code: Bad RIP value. [ 238.347387] RSP: 002b:00007ffd7c0c1890 EFLAGS: 00010293 [ 238.347834] RAX: 0000000000000002 RBX: 000055d196a20a20 RCX: 00007f47eba4197e [ 238.348437] RDX: 0000000000000033 RSI: 00007ffd7c0c18c0 RDI: 0000000000000004 [ 238.349047] RBP: 00007ffd7c0c1c20 R08: 0000000000000000 R09: 0000000000000033 [ 238.349660] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001 [ 238.350261] R13: ffffffffffffffff R14: 0000000000000000 R15: 00007ffd7c0c18c0 [ 238.350878] Mem-Info: [ 238.351085] active_anon:3121 inactive_anon:51 isolated_anon:0 [ 238.351085] active_file:12 inactive_file:7 isolated_file:0 [ 238.351085] unevictable:0 dirty:0 writeback:0 unstable:0 [ 238.351085] slab_reclaimable:5565 slab_unreclaimable:10170 [ 238.351085] mapped:3 shmem:111 pagetables:155 bounce:0 [ 238.351085] free:720717 free_pcp:2 free_cma:0 [ 238.353757] Node 0 active_anon:12484kB inactive_anon:204kB active_file:48kB inactive_file:28kB unevictable:0kB iss [ 238.355979] Node 0 DMA free:11556kB min:36kB low:48kB high:60kB reserved_highatomic:0KB active_anon:152kB inactivB [ 238.358345] lowmem_reserve[]: 0 2955 2884 2884 2884 [ 238.358761] Node 0 DMA32 free:2677864kB min:7004kB low:10028kB high:13052kB reserved_highatomic:0KB active_anon:0B [ 238.361202] lowmem_reserve[]: 0 0 72057594037927865 72057594037927865 72057594037927865 [ 238.361888] Node 0 Normal free:193448kB min:99395541895224kB low:73886371836733356kB high:147673348131571488kB reB [ 238.364765] lowmem_reserve[]: 0 0 0 0 0 [ 238.365101] Node 0 DMA: 7*4kB (U) 5*8kB (UE) 6*16kB (UME) 2*32kB (UM) 1*64kB (U) 2*128kB (UE) 3*256kB (UME) 2*512B [ 238.366379] Node 0 DMA32: 0*4kB 1*8kB (U) 2*16kB (UM) 2*32kB (UM) 2*64kB (UM) 1*128kB (U) 1*256kB (U) 1*512kB (U)B [ 238.367654] Node 0 Normal: 1985*4kB (UME) 1321*8kB (UME) 844*16kB (UME) 524*32kB (UME) 300*64kB (UME) 138*128kB (B [ 238.369184] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 238.369915] 130 total pagecache pages [ 238.370241] 0 pages in swap cache [ 238.370533] Swap cache stats: add 0, delete 0, find 0/0 [ 238.370981] Free swap = 0kB [ 238.371239] Total swap = 0kB [ 238.371488] 1048445 pages RAM [ 238.371756] 0 pages HighMem/MovableOnly [ 238.372090] 306992 pages reserved [ 238.372376] 0 pages cma reserved [ 238.372661] 0 pages hwpoisoned In another instance (older kernel), I was able to observe this (negative page count :/): [ 180.896971] Offlined Pages 32768 [ 182.667462] Offlined Pages 32768 [ 184.408117] Offlined Pages 32768 [ 186.026321] Offlined Pages 32768 [ 187.684861] Offlined Pages 32768 [ 189.227013] Offlined Pages 32768 [ 190.830303] Offlined Pages 32768 [ 190.833071] Built 1 zonelists, mobility grouping on. Total pages: -36920272750453009 In another instance (older kernel), I was no longer able to start any process: [root@vm ~]# [ 214.348068] Offlined Pages 32768 [ 215.973009] Offlined Pages 32768 cat /proc/meminfo -bash: fork: Cannot allocate memory [root@vm ~]# cat /proc/meminfo -bash: fork: Cannot allocate memory Fix it by properly adjusting the managed page count when migrating if the zone changed. The managed page count of the zones now looks after unplug of the DIMM (and after deflating the balloon) just like before inflating the balloon (and plugging+onlining the DIMM). We'll temporarily modify the totalram page count. If this ever becomes a problem, we can fine tune by providing helpers that don't touch the totalram pages (e.g., adjust_zone_managed_page_count()). Please note that fixing up the managed page count is only necessary when we adjusted the managed page count when inflating - only if we don't have VIRTIO_BALLOON_F_DEFLATE_ON_OOM. With that feature, the managed page count is not touched when inflating/deflating. Reported-by: Yumei Huang Fixes: 3dcc0571cd64 ("mm: correctly update zone->managed_pages") Cc: # v3.11+ Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Jiang Liu Cc: Andrew Morton Cc: Igor Mammedov Cc: virtualization@lists.linux-foundation.org Signed-off-by: David Hildenbrand Signed-off-by: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index e05679c478e2..9f4117766bb1 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -721,6 +721,17 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, get_page(newpage); /* balloon reference */ + /* + * When we migrate a page to a different zone and adjusted the + * managed page count when inflating, we have to fixup the count of + * both involved zones. + */ + if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM) && + page_zone(page) != page_zone(newpage)) { + adjust_managed_page_count(page, 1); + adjust_managed_page_count(newpage, -1); + } + /* balloon's page migration 1st step -- inflate "newpage" */ spin_lock_irqsave(&vb_dev_info->pages_lock, flags); balloon_page_insert(vb_dev_info, newpage); -- cgit v1.2.3-70-g09d2 From 2a946fa1c8bc26df03bacebb6811b656d5e0cf1c Mon Sep 17 00:00:00 2001 From: "Michael S. Tsirkin" Date: Tue, 19 Nov 2019 05:21:47 -0500 Subject: virtio_balloon: name cleanups free_page_order is a confusing name. It's not a page order actually, it's the order of the block of memory we are hinting. Rename to hint_block_order. Also, rename SIZE to BYTES to make it clear it's the block size in bytes. Signed-off-by: Michael S. Tsirkin Reviewed-by: Wei Wang Reviewed-by: David Hildenbrand --- drivers/virtio/virtio_balloon.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 9f4117766bb1..23119b3a71bd 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -32,10 +32,10 @@ #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \ __GFP_NOMEMALLOC) /* The order of free page blocks to report to host */ -#define VIRTIO_BALLOON_FREE_PAGE_ORDER (MAX_ORDER - 1) +#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (MAX_ORDER - 1) /* The size of a free page block in bytes */ -#define VIRTIO_BALLOON_FREE_PAGE_SIZE \ - (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT)) +#define VIRTIO_BALLOON_HINT_BLOCK_BYTES \ + (1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT)) #ifdef CONFIG_BALLOON_COMPACTION static struct vfsmount *balloon_mnt; @@ -380,7 +380,7 @@ static unsigned long return_free_pages_to_mm(struct virtio_balloon *vb, if (!page) break; free_pages((unsigned long)page_address(page), - VIRTIO_BALLOON_FREE_PAGE_ORDER); + VIRTIO_BALLOON_HINT_BLOCK_ORDER); } vb->num_free_page_blocks -= num_returned; spin_unlock_irq(&vb->free_page_list_lock); @@ -582,7 +582,7 @@ static int get_free_page_and_send(struct virtio_balloon *vb) ; page = alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG, - VIRTIO_BALLOON_FREE_PAGE_ORDER); + VIRTIO_BALLOON_HINT_BLOCK_ORDER); /* * When the allocation returns NULL, it indicates that we have got all * the possible free pages, so return -EINTR to stop. @@ -591,13 +591,13 @@ static int get_free_page_and_send(struct virtio_balloon *vb) return -EINTR; p = page_address(page); - sg_init_one(&sg, p, VIRTIO_BALLOON_FREE_PAGE_SIZE); + sg_init_one(&sg, p, VIRTIO_BALLOON_HINT_BLOCK_BYTES); /* There is always 1 entry reserved for the cmd id to use. */ if (vq->num_free > 1) { err = virtqueue_add_inbuf(vq, &sg, 1, p, GFP_KERNEL); if (unlikely(err)) { free_pages((unsigned long)p, - VIRTIO_BALLOON_FREE_PAGE_ORDER); + VIRTIO_BALLOON_HINT_BLOCK_ORDER); return err; } virtqueue_kick(vq); @@ -610,7 +610,7 @@ static int get_free_page_and_send(struct virtio_balloon *vb) * The vq has no available entry to add this page block, so * just free it. */ - free_pages((unsigned long)p, VIRTIO_BALLOON_FREE_PAGE_ORDER); + free_pages((unsigned long)p, VIRTIO_BALLOON_HINT_BLOCK_ORDER); } return 0; @@ -776,11 +776,11 @@ static unsigned long shrink_free_pages(struct virtio_balloon *vb, unsigned long blocks_to_free, blocks_freed; pages_to_free = round_up(pages_to_free, - 1 << VIRTIO_BALLOON_FREE_PAGE_ORDER); - blocks_to_free = pages_to_free >> VIRTIO_BALLOON_FREE_PAGE_ORDER; + 1 << VIRTIO_BALLOON_HINT_BLOCK_ORDER); + blocks_to_free = pages_to_free >> VIRTIO_BALLOON_HINT_BLOCK_ORDER; blocks_freed = return_free_pages_to_mm(vb, blocks_to_free); - return blocks_freed << VIRTIO_BALLOON_FREE_PAGE_ORDER; + return blocks_freed << VIRTIO_BALLOON_HINT_BLOCK_ORDER; } static unsigned long leak_balloon_pages(struct virtio_balloon *vb, @@ -837,7 +837,7 @@ static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker, unsigned long count; count = vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; - count += vb->num_free_page_blocks << VIRTIO_BALLOON_FREE_PAGE_ORDER; + count += vb->num_free_page_blocks << VIRTIO_BALLOON_HINT_BLOCK_ORDER; return count; } -- cgit v1.2.3-70-g09d2 From 63b9b80e9f5b2c463d98d6e550e0d0e3ace66033 Mon Sep 17 00:00:00 2001 From: "Michael S. Tsirkin" Date: Tue, 19 Nov 2019 05:25:24 -0500 Subject: virtio_balloon: divide/multiply instead of shifts We managed to get confused about the shift direction at least once. Let's switch to division/multiplcation instead. Add a number of pages macro for this purpose. We still keep the order macro around too since this is what alloc/free pages want. Signed-off-by: Michael S. Tsirkin Reviewed-by: Wei Wang Reviewed-by: David Hildenbrand --- drivers/virtio/virtio_balloon.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 23119b3a71bd..93f995f6cf36 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -36,6 +36,7 @@ /* The size of a free page block in bytes */ #define VIRTIO_BALLOON_HINT_BLOCK_BYTES \ (1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT)) +#define VIRTIO_BALLOON_HINT_BLOCK_PAGES (1 << VIRTIO_BALLOON_HINT_BLOCK_ORDER) #ifdef CONFIG_BALLOON_COMPACTION static struct vfsmount *balloon_mnt; @@ -776,11 +777,11 @@ static unsigned long shrink_free_pages(struct virtio_balloon *vb, unsigned long blocks_to_free, blocks_freed; pages_to_free = round_up(pages_to_free, - 1 << VIRTIO_BALLOON_HINT_BLOCK_ORDER); - blocks_to_free = pages_to_free >> VIRTIO_BALLOON_HINT_BLOCK_ORDER; + VIRTIO_BALLOON_HINT_BLOCK_PAGES); + blocks_to_free = pages_to_free / VIRTIO_BALLOON_HINT_BLOCK_PAGES; blocks_freed = return_free_pages_to_mm(vb, blocks_to_free); - return blocks_freed << VIRTIO_BALLOON_HINT_BLOCK_ORDER; + return blocks_freed * VIRTIO_BALLOON_HINT_BLOCK_PAGES; } static unsigned long leak_balloon_pages(struct virtio_balloon *vb, @@ -837,7 +838,7 @@ static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker, unsigned long count; count = vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; - count += vb->num_free_page_blocks << VIRTIO_BALLOON_HINT_BLOCK_ORDER; + count += vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES; return count; } -- cgit v1.2.3-70-g09d2