summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-10-16fs: do not update nr_thps for mappings which support THPsMatthew Wilcox (Oracle)
The nr_thps counter is to support THPs in the page cache when the filesystem doesn't understand THPs. Eventually it will be removed, but we should still support filesystems which do not understand THPs yet. Move the nr_thp manipulation functions to filemap.h since they're page-cache specific. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20200916032717.22917-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16fs: add a filesystem flag for THPsMatthew Wilcox (Oracle)
The page cache needs to know whether the filesystem supports THPs so that it doesn't send THPs to filesystems which can't handle them. Dave Chinner points out that getting from the page mapping to the filesystem type is too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that information in the address space flags. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/vmscan: allow arbitrary sized pages to be paged outMatthew Wilcox (Oracle)
Remove the assumption that a compound page has HPAGE_PMD_NR pins from the page cache. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-12-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/page-writeback: support tail pages in wait_for_stable_pageMatthew Wilcox (Oracle)
page->mapping is undefined for tail pages, so operate exclusively on the head page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-11-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/truncate: fix truncation for pages of arbitrary sizeMatthew Wilcox (Oracle)
Remove the assumption that a compound page is HPAGE_PMD_SIZE, and the assumption that any page is PAGE_SIZE. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-10-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/rmap: fix assumptions of THP sizeMatthew Wilcox (Oracle)
Ask the page what size it is instead of assuming it's PMD size. Do this for anon pages as well as file pages for when someone decides to support that. Leave the assumption alone for pages which are PMD mapped; we don't currently grow THPs beyond PMD size, so we don't need to change this code yet. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-9-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/huge_memory: fix can_split_huge_page assumption of THP sizeMatthew Wilcox (Oracle)
Ask the page how many subpages it has instead of assuming it's PMD size. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-8-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/huge_memory: fix page_trans_huge_mapcount assumption of THP sizeMatthew Wilcox (Oracle)
Ask the page what size it is instead of assuming it's PMD size. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-7-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/huge_memory: fix split assumption of page sizeKirill A. Shutemov
File THPs may now be of arbitrary size, and we can't rely on that size after doing the split so remember the number of pages before we start the split. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/huge_memory: fix total_mapcount assumption of page sizeKirill A. Shutemov
File THPs may now be of arbitrary order. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-5-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/page_owner: change split_page_owner to take a countMatthew Wilcox (Oracle)
The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/memory: remove page fault assumption of compound page sizeMatthew Wilcox (Oracle)
A compound page in the page cache will not necessarily be of PMD size, so check explicitly. [willy@infradead.org: fix remove page fault assumption of compound page size] Link: https://lkml.kernel.org/r/20201001152259.14932-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-3-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/filemap: fix page cache removal for arbitrary sized THPsMatthew Wilcox (Oracle)
Patch series "Remove assumptions of THP size". There are a number of places in the VM which assume that a THP is a PMD in size. That's true today, and remains true after this patch series, but this is a prerequisite for switching to arbitrary-sized THPs. thp_nr_pages() still returns either HPAGE_PMD_NR or 1, but will be changed later. This patch (of 11): page_cache_free_page() assumes THPs are PMD_SIZE; fix that assumption. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-1-willy@infradead.org Link: https://lkml.kernel.org/r/20200908195539.25896-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/filemap: fix storing to a THP shadow entryMatthew Wilcox (Oracle)
When a THP is removed from the page cache by reclaim, we replace it with a shadow entry that occupies all slots of the XArray previously occupied by the THP. If the user then accesses that page again, we only allocate a single page, but storing it into the shadow entry replaces all entries with that one page. That leads to bugs like page dumped because: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset) ------------[ cut here ]------------ kernel BUG at mm/filemap.c:2529! https://bugzilla.kernel.org/show_bug.cgi?id=206569 This is hard to reproduce with mainline, but happens regularly with the THP patchset (as so many more THPs are created). This solution is take from the THP patchset. It splits the shadow entry into order-0 pieces at the time that we bring a new page into cache. Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Song Liu <songliubraving@fb.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Qian Cai <cai@lca.pw> Link: https://lkml.kernel.org/r/20200903183029.14930-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16XArray: add xas_splitMatthew Wilcox (Oracle)
In order to use multi-index entries for huge pages in the page cache, we need to be able to split a multi-index entry (eg if a file is truncated in the middle of a huge page entry). This version does not support splitting more than one level of the tree at a time. This is an acceptable limitation for the page cache as we do not expect to support order-12 pages in the near future. [akpm@linux-foundation.org: export xas_split_alloc() to modules] [willy@infradead.org: fix xarray split] Link: https://lkml.kernel.org/r/20200910175450.GV6583@casper.infradead.org [willy@infradead.org: fix xarray] Link: https://lkml.kernel.org/r/20201001233943.GW20115@casper.infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Qian Cai <cai@lca.pw> Cc: Song Liu <songliubraving@fb.com> Link: https://lkml.kernel.org/r/20200903183029.14930-3-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16XArray: add xa_get_orderMatthew Wilcox (Oracle)
Patch series "Fix read-only THP for non-tmpfs filesystems". As described more verbosely in the [3/3] changelog, we can inadvertently put an order-0 page in the page cache which occupies 512 consecutive entries. Users are running into this if they enable the READ_ONLY_THP_FOR_FS config option; see https://bugzilla.kernel.org/show_bug.cgi?id=206569 and Qian Cai has also reported it here: https://lore.kernel.org/lkml/20200616013309.GB815@lca.pw/ This is a rather intrusive way of fixing the problem, but has the advantage that I've actually been testing it with the THP patches, which means that it sees far more use than it does upstream -- indeed, Song has been entirely unable to reproduce it. It also has the advantage that it removes a few patches from my gargantuan backlog of THP patches. This patch (of 3): This function returns the order of the entry at the index. We need this because there isn't space in the shadow entry to encode its order. [akpm@linux-foundation.org: export xa_get_order to modules] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Qian Cai <cai@lca.pw> Cc: Song Liu <songliubraving@fb.com> Link: https://lkml.kernel.org/r/20200903183029.14930-1-willy@infradead.org Link: https://lkml.kernel.org/r/20200903183029.14930-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable: avoid doing memory allocation with pgtable_t mapped.Aneesh Kumar K.V
With highmem, pte_alloc_map() keep the level4 page table mapped using kmap_atomic(). Avoid doing new memory allocation with page table mapped like above. [ 9.409233] BUG: sleeping function called from invalid context at mm/page_alloc.c:4822 [ 9.410557] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper [ 9.411932] no locks held by swapper/1. [ 9.412595] CPU: 0 PID: 1 Comm: swapper Not tainted 5.9.0-rc3-00323-gc50eb1ed654b5 #2 [ 9.413824] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 9.415207] Call Trace: [ 9.415651] ? ___might_sleep.cold+0xa7/0xcc [ 9.416367] ? __alloc_pages_nodemask+0x14c/0x5b0 [ 9.417055] ? swap_migration_tests+0x50/0x293 [ 9.417704] ? debug_vm_pgtable+0x4bc/0x708 [ 9.418287] ? swap_migration_tests+0x293/0x293 [ 9.418911] ? do_one_initcall+0x82/0x3cb [ 9.419465] ? parse_args+0x1bd/0x280 [ 9.419983] ? rcu_read_lock_sched_held+0x36/0x60 [ 9.420673] ? trace_initcall_level+0x1f/0xf3 [ 9.421279] ? trace_initcall_level+0xbd/0xf3 [ 9.421881] ? do_basic_setup+0x9d/0xdd [ 9.422410] ? do_basic_setup+0xc3/0xdd [ 9.422938] ? kernel_init_freeable+0x72/0xa3 [ 9.423539] ? rest_init+0x134/0x134 [ 9.424055] ? kernel_init+0x5/0x12c [ 9.424574] ? ret_from_fork+0x19/0x30 Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200913110327.645310-1-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable: avoid none pte in pte_clear_testAneesh Kumar K.V
pte_clear_tests operate on an existing pte entry. Make sure that is not a none pte entry. [aneesh.kumar@linux.ibm.com: avoid kernel crash with riscv] Link: https://lkml.kernel.org/r/20201015033206.140550-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Link: https://lkml.kernel.org/r/20200902114222.181353-14-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/hugetlb: disable hugetlb test on ppc64Aneesh Kumar K.V
The seems to be missing quite a lot of details w.r.t allocating the correct pgtable_t page (huge_pte_alloc()), holding the right lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA. ppc64 do have runtime checks within CONFIG_DEBUG_VM for most of these. Hence disable the test on ppc64. [anshuman.khandual@arm.com: drop hugetlb_advanced_tests()] Link: https://lore.kernel.org/lkml/289c3fdb-1394-c1af-bdc4-5542907089dc@linux.ibm.com/#t Link: https://lkml.kernel.org/r/1600914446-21890-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-13-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/pmd_clear: don't use pmd/pud_clear on pte entriesAneesh Kumar K.V
pmd_clear() should not be used to clear pmd level pte entries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-12-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/thp: use page table depost/withdraw with THPAneesh Kumar K.V
Architectures like ppc64 use deposited page table while updating the huge pte entries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-11-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/locks: take correct page table lockAneesh Kumar K.V
Make sure we call pte accessors with correct lock held. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-10-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/locks: move non page table modifying test togetherAneesh Kumar K.V
This will help in adding proper locks in a later patch Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-9-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/set_pte/pmd/pud: don't use set_*_at to update an ↵Aneesh Kumar K.V
existing pte entry set_pte_at() should not be used to set a pte entry at locations that already holds a valid pte entry. Architectures like ppc64 don't do TLB invalidate in set_pte_at() and hence expect it to be used to set locations that are not a valid PTE. Link: https://lkml.kernel.org/r/20200902114222.181353-8-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/savedwrite: enable savedwrite test with ↵Aneesh Kumar K.V
CONFIG_NUMA_BALANCING Saved write support was added to track the write bit of a pte after marking the pte protnone. This was done so that AUTONUMA can convert a write pte to protnone and still track the old write bit. When converting it back we set the pte write bit correctly thereby avoiding a write fault again. Hence enable the test only when CONFIG_NUMA_BALANCING is enabled and use protnone protflags. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-6-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtables/hugevmap: use the arch helper to identify huge vmap ↵Aneesh Kumar K.V
support. ppc64 supports huge vmap only with radix translation. Hence use arch helper to determine the huge vmap support. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-5-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16mm/debug_vm_pgtable/ppc64: avoid setting top bits in radom valueAneesh Kumar K.V
ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in random value. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-4-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16powerpc/mm: move setting pte specific flags to pfn_pteAneesh Kumar K.V
powerpc used to set the pte specific flags in set_pte_at(). This is different from other architectures. To be consistent with other architecture update pfn_pte to set _PAGE_PTE on ppc64. Also, drop now unused pte_mkpte. We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without setting _PAGE_PTE bit. We will remove that after a few releases. With respect to huge pmd entries, pmd_mkhuge() takes care of adding the _PAGE_PTE bit. [akpm@linux-foundation.org: whitespace fix, per Christophe] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Link: https://lkml.kernel.org/r/20200902114222.181353-3-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16powerpc/mm: add DEBUG_VM WARN for pmd_clearAneesh Kumar K.V
Patch series "mm/debug_vm_pgtable fixes", v4. This patch series includes fixes for debug_vm_pgtable test code so that they follow page table updates rules correctly. The first two patches introduce changes w.r.t ppc64. Hugetlb test is disabled on ppc64 because that needs larger change to satisfy page table update rules. These tests are broken w.r.t page table update rules and results in kernel crash as below. [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 lr: c0000000005eeeec: pte_update+0x11c/0x190 sp: c000000c6d1e7950 msr: 8000000002029033 current = 0xc000000c6d172c80 paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 pid = 1, comm = swapper/0 kernel BUG at arch/powerpc/mm/pgtable.c:304! [link register ] c0000000005eeeec pte_update+0x11c/0x190 [c000000c6d1e7950] 0000000000000001 (unreliable) [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c With DEBUG_VM disabled [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 [ 20.530183] Faulting instruction address: 0xc0000000000df330 cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] pc: c0000000000df330: memset+0x68/0x104 lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 sp: c000000c6d19f990 msr: 8000000002009033 dar: 0 current = 0xc000000c6d177480 paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 pid = 1, comm = swapper/0 [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c This patch (of 13): With the hash page table, the kernel should not use pmd_clear for clearing huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lkml.kernel.org/r/20200902114222.181353-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20200902114222.181353-2-aneesh.kumar@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16device-dax/kmem: fix resource releaseDan Williams
The conversion to request_mem_region() is broken because it assumes that the range is marked busy prior to release. However, due to the way that the kmem driver manipulates the IORESOURCE_BUSY flag (clears it to let {add,remove}_memory() handle busy) it requires a manual release_resource() to perform cleanup. Given that the actual 'struct resource *' needs to be recalled, not just the range, add that tracking to the kmem driver-data. Fixes: 0513bd5bb114 ("device-dax/kmem: replace release_resource() with release_mem_region()") Reported-by: David Hildenbrand <david@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jia He <justin.he@arm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lkml.kernel.org/r/160272252925.3136502.17220638073995895400.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16RDMA/ucma: Fix use after free in destroy id flowMaor Gottlieb
ucma_free_ctx() should call to __destroy_id() on all the connection requests that have not been delivered to user space. Currently it calls on the context itself and cause to use after free. Fixes the trace: BUG: Unable to handle kernel data access on write at 0x5deadbeef0000108 Faulting instruction address: 0xc0080000002428f4 Oops: Kernel access of bad area, sig: 11 [#1] Call Trace: [c000000207f2b680] [c00800000024280c] .__destroy_id+0x28c/0x610 [rdma_ucm] (unreliable) [c000000207f2b750] [c0080000002429c4] .__destroy_id+0x444/0x610 [rdma_ucm] [c000000207f2b820] [c008000000242c24] .ucma_close+0x94/0xf0 [rdma_ucm] [c000000207f2b8c0] [c00000000046fbdc] .__fput+0xac/0x330 [c000000207f2b960] [c00000000015d48c] .task_work_run+0xbc/0x110 [c000000207f2b9f0] [c00000000012fb00] .do_exit+0x430/0xc50 [c000000207f2bae0] [c0000000001303ec] .do_group_exit+0x5c/0xd0 [c000000207f2bb70] [c000000000144a34] .get_signal+0x194/0xe30 [c000000207f2bc60] [c00000000001f6b4] .do_notify_resume+0x124/0x470 [c000000207f2bd60] [c000000000032484] .interrupt_exit_user_prepare+0x1b4/0x240 [c000000207f2be20] [c000000000010034] interrupt_return+0x14/0x1c0 Rename listen_ctx to conn_req_ctx as the poor name was the cause of this bug. Fixes: a1d33b70dbbc ("RDMA/ucma: Rework how new connections are passed through event delivery") Link: https://lore.kernel.org/r/20201012045600.418271-4-leon@kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16RDMA/rxe: Handle skb_clone() failure in rxe_recv.cBob Pearson
If skb_clone() is unable to allocate memory for a new sk_buff this is not detected by the current code. Check for a NULL return and continue. This is similar to other errors in this loop over QPs attached to the multicast address and consistent with the unreliable UD transport. Fixes: e7ec96fc7932f ("RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()") Addresses-Coverity-ID: 1497804: Null pointer dereferences (NULL_RETURNS) Link: https://lore.kernel.org/r/20201013184236.5231-1-rpearson@hpe.com Signed-off-by: Bob Pearson <rpearson@hpe.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16RDMA/rxe: Move the definitions for rxe_av.network_type to uAPIJason Gunthorpe
RXE was wrongly using an internal kernel enum as part of its uAPI, split this out into a dedicated uAPI enum just for RXE. It only uses the IPv4 and IPv6 values. This was exposed by changing the internal kernel enum definition which broke RXE. Fixes: 1c15b4f2a42f ("RDMA/core: Modify enum ib_gid_type and enum rdma_network_type") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16RDMA: Explicitly pass in the dma_device to ib_register_deviceJason Gunthorpe
The code in setup_dma_device has become rather convoluted, move all of this to the drivers. Drives now pass in a DMA capable struct device which will be used to setup DMA, or drivers must fully configure the ibdev for DMA and pass in NULL. Other than setting the masks in rvt all drivers were doing this already anyhow. mthca, mlx4 and mlx5 were already setting up maximum DMA segment size for DMA based on their hardweare limits in: __mthca_init_one() dma_set_max_seg_size (1G) __mlx4_init_one() dma_set_max_seg_size (1G) mlx5_pci_init() set_dma_caps() dma_set_max_seg_size (2G) Other non software drivers (except usnic) were extended to UINT_MAX [1, 2] instead of 2G as was before. [1] https://lore.kernel.org/linux-rdma/20200924114940.GE9475@nvidia.com/ [2] https://lore.kernel.org/linux-rdma/20200924114940.GE9475@nvidia.com/ Link: https://lore.kernel.org/r/20201008082752.275846-1-leon@kernel.org Link: https://lore.kernel.org/r/6b2ed339933d066622d5715903870676d8cc523a.1602590106.git.mchehab+huawei@kernel.org Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16PCI/ASPM: Remove struct pcie_link_state.l1ssSaheed O. Bolarinwa
Previously we computed L1.2 parameters in the enumeration path, saved them in struct pcie_link_state.l1ss, and programmed them into the devices whenever we enabled or disabled L1.2 on the link. But these parameters are constant and don't need to be updated when enabling/disabling L1.2. Compute and program the L1.2 parameters once during enumeration and remove the struct pcie_link_state.l1ss member. No functional change intended. [bhelgaas: rework to program L1.2 parameters during enumeration] Link: https://lore.kernel.org/r/20201015193039.12585-13-helgaas@kernel.org Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Remove struct aspm_register_info.l1ss_capSaheed O. Bolarinwa
Previously we stored the L1SS Capabilities value in the struct aspm_register_info. We only need this information in one place, so read it there and remove struct aspm_register_info completely, since it's now empty. No functional change intended. [bhelgaas: split up, don't cache l1ss_cap in pci_dev] Link: https://lore.kernel.org/r/20201015193039.12585-12-helgaas@kernel.org Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Pass L1SS Capabilities value, not struct aspm_register_infoBjorn Helgaas
aspm_calc_l1ss_info() needs only the L1SS Capabilities. It doesn't need anything else from struct aspm_register_info, so pass only the Capabilities value. No functional change intended. Link: https://lore.kernel.org/r/20201015193039.12585-11-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Remove struct aspm_register_info.l1ss_ctl1Saheed O. Bolarinwa
Previously we stored the L1SS Control 1 register in the struct aspm_register_info. We only need this information in one place, so read it there and remove it from struct aspm_register_info. No functional change intended. [bhelgaas: split ctl1/ctl2] Link: https://lore.kernel.org/r/20201015193039.12585-10-helgaas@kernel.org Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Remove struct aspm_register_info.l1ss_ctl2 (unused)Bjorn Helgaas
We never use the aspm_register_info.l1ss_ctl2 value, so remove it. No functional change intended. Link: https://lore.kernel.org/r/20201015193039.12585-9-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Remove struct aspm_register_info.l1ss_cap_ptrSaheed O. Bolarinwa
Save the L1 Substates Capability pointer in struct pci_dev. Then we don't have to keep track of it in the struct aspm_register_info and struct pcie_link_state, which makes the code easier to read. No functional change intended. [bhelgaas: split to a separate patch] Link: https://lore.kernel.org/r/20201015193039.12585-8-helgaas@kernel.org Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Remove struct aspm_register_info.latency_encodingSaheed O. Bolarinwa
Previously we stored L0s and L1 Exit Latency information from the Link Capabilities register in the struct aspm_register_info. We only need these latencies when we already have the Link Capabilities values, so use those directly and remove the latencies from struct aspm_register_info. No functional change intended. Link: https://lore.kernel.org/r/20201015193039.12585-7-helgaas@kernel.org Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Remove struct aspm_register_info.enabledSaheed O. Bolarinwa
Previously we stored the "ASPM Control" bits from the Link Control register in the struct aspm_register_info. Read PCI_EXP_LNKCTL directly when needed. This means we can use the PCI_EXP_LNKCTL_ASPM_* bits directly instead of the similar but different PCIE_LINK_STATE_* bits. No functional change intended. [bhelgaas: drop get_aspm_enable() and read LNKCTL once directly] Link: https://lore.kernel.org/r/20201015193039.12585-6-helgaas@kernel.org Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Remove struct aspm_register_info.supportSaheed O. Bolarinwa
Previously we stored the "ASPM Support" field from the Link Capabilities register in the struct aspm_register_info. Read the Link Capabilities directly when needed and remove it from the struct aspm_register_info. No functional change intended. [bhelgaas: remove pci_dev cached copy since LNKCAP isn't truly read-only, add PCI_EXP_LNKCAP_ASPM_L0S & PCI_EXP_LNKCAP_ASPM_L1, check them directly instead of adding aspm_support()] Link: https://lore.kernel.org/r/20201015193039.12585-5-helgaas@kernel.org Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Use 'parent' and 'child' for readabilityBjorn Helgaas
Other users of link->pdev and link->downstream, e.g., pcie_aspm_cap_init(), pcie_config_aspm_l1ss(), and pcie_config_aspm_link(), use "parent" and "child" as local names. Do the same in aspm_calc_l1ss_info() for readability. No functional change intended. Link: https://lore.kernel.org/r/20201015193039.12585-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Move LTR path check to where it's usedBjorn Helgaas
pcie_get_aspm_reg() mostly reads ASPM-related registers, but in some cases it also updates the value read from PCI_L1SS_CAP based on LTR properties. Move this update to the point where the value is used to make the code more readable. No functional change intended, although previously we could clear PCI_L1SS_CAP_ASPM_L1_2 for both ends of the link, and now we'll only do it for the downstream end of a link. This shouldn't matter because we always test that bit by ANDing l1ss_cap for the upstream and downstream ends. Link: https://lore.kernel.org/r/20201015193039.12585-3-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16PCI/ASPM: Move pci_clear_and_set_dword() earlierBjorn Helgaas
Move pci_clear_and_set_dword() earlier in file to prepare for future patch. No functional change intended. Link: https://lore.kernel.org/r/20201015193039.12585-2-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-10-16lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED valuesJason Gunthorpe
The main intention of the max_segment argument to __sg_alloc_table_from_pages() is to match the DMA layer segment size set by dma_set_max_seg_size(). Restricting the input to be page aligned makes it impossible to just connect the DMA layer to this API. The only reason for a page alignment here is because the algorithm will overshoot the max_segment if it is not a multiple of PAGE_SIZE. Simply fix the alignment before starting and don't expose this implementation detail to the callers. A future patch will completely remove SCATTERLIST_MAX_SEGMENT. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16Merge branch 'dynamic_sg' into rdma.git for-nextJason Gunthorpe
From Maor Gottlieb says: ==================== This series extends __sg_alloc_table_from_pages to allow chaining of new pages to an already initialized SG table. This allows for drivers to utilize the optimization of merging contiguous pages without a need to pre allocate all the pages and hold them in a very large temporary buffer prior to the call to SG table initialization. The last patch changes the Infiniband core to use the new API. It removes duplicate functionality from the code and benefits from the optimization of allocating dynamic SG table from pages. In huge pages system of 2MB page size, without this change, the SG table would contain x512 SG entries. ==================== * branch 'dynamic_sg': RDMA/umem: Move to allocate SG table from pages lib/scatterlist: Add support in dynamic allocation of SG table from pages tools/testing/scatterlist: Show errors in human readable form tools/testing/scatterlist: Rejuvenate bit-rotten test
2020-10-16tracing: Remove __init from __trace_early_add_new_event()Masami Hiramatsu
The commit 720dee53ad8d ("tracing/boot: Initialize per-instance event list in early boot") removes __init from __trace_early_add_events() but __trace_early_add_new_event() still has __init and will cause a section mismatch. Remove __init from __trace_early_add_new_event() as same as __trace_early_add_events(). Link: https://lore.kernel.org/lkml/CAHk-=wjU86UhovK4XuwvCqTOfc+nvtpAuaN2PJBz15z=w=u0Xg@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-16afs: Don't assert on unpurgeable server recordsDavid Howells
Don't give an assertion failure on unpurgeable afs_server records - which kills the thread - but rather emit a trace line when we are purging a record (which only happens during network namespace removal or rmmod) and print a notice of the problem. Signed-off-by: David Howells <dhowells@redhat.com>