From 2d337b7158f8c38dacf8394028ab87b8a25ed707 Mon Sep 17 00:00:00 2001
From: ZhangPeng
Date: Wed, 1 Mar 2023 10:06:27 +0000
Subject: userfaultfd: move unprivileged_userfaultfd sysctl to its own file

The sysctl_unprivileged_userfaultfd is part of userfaultfd, move it to
its own file.

Signed-off-by: ZhangPeng
Signed-off-by: Luis Chamberlain
---
 include/linux/userfaultfd_k.h | 2 --
 1 file changed, 2 deletions(-)

(limited to 'include/linux/userfaultfd_k.h')

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3767f18114ef..fff49fec0258 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,8 +36,6 @@
 #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
-extern int sysctl_unprivileged_userfaultfd;
-
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 /*
--
cgit v1.2.3-70-g09d2

From 2bad466cc9d9b4c3b4b16eb9c03c919b59561316 Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Thu, 9 Mar 2023 17:37:10 -0500
Subject: mm/uffd: UFFD_FEATURE_WP_UNPOPULATED

Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.

The new feature bit makes anonymous memory act the same as file memory on
userfaultfd-wp in that it'll also wr-protect none ptes.

It can be useful in two cases:

(1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot,
    so pre-fault can be replaced by enabling this flag and speed up
    protections

(2) It helps to implement async uffd-wp mode that Muhammad is working on
    [1]

It's debatable whether this is the most ideal solution because with the
new feature bit set, wr-protecting a none pte needs to pre-populate the
pgtables to the last level (PAGE_SIZE). But it seems fine so far to
service either purpose above, so we can leave optimizations for later.

The series brings pte markers to anonymous memory too.
There are some changes to the common mm code path in the 1st patch. It
would be great to have some eyes on them, but hopefully they're still
relatively straightforward.

This patch (of 2):

This is a new feature that controls how uffd-wp handles none ptes. When
it's set, the kernel will handle anonymous memory the same way as file
memory, by allowing the user to wr-protect unpopulated ptes.

File memory handles none ptes consistently by allowing them to be
wr-protected, because a none pte does not tell us whether the page cache
exists or not. For anonymous memory it was not as consistent, because we
used to assume that we don't need protection on none ptes or known zero
pages.

One use case of such a feature bit is VM live snapshot: without
wr-protecting empty ptes, the snapshot can contain random rubbish in the
holes of the anonymous memory, which can cause the guest to misbehave when
the guest OS assumes the pages should be all zeros. QEMU worked around it
by pre-populating the section with reads to fill in zero page entries
before starting the whole snapshot process [1].

Recently another need was raised: using userfaultfd wr-protect for
detecting dirty pages (to replace soft-dirty in some cases) [2]. In that
case, without being able to wr-protect none ptes by default, the dirty
info can get lost, since we cannot treat every none pte as dirty (the
current design identifies a page as dirty based on the uffd-wp bit being
cleared).

In general, we want to be able to wr-protect empty ptes too, even for
anonymous memory.

This patch implements UFFD_FEATURE_WP_UNPOPULATED so that uffd-wp handling
of none ptes is consistent no matter what the memory type is underneath.
It doesn't have any impact on file memory so far because we already have
pte markers taking care of that. So it only affects anonymous memory.

The feature bit is off by default, so the old behavior will be maintained.
Sometimes the old behavior may be wanted, because wr-protecting none ptes
adds overhead not only during UFFDIO_WRITEPROTECT (applying pte markers to
anonymous memory), but also when creating the pgtables that store the pte
markers. So there's potentially less chance of using thp on the first
fault for a none pmd or larger than a pmd.

The major implementation part is teaching the whole kernel to understand
pte markers even for anonymously mapped ranges, while allowing the
UFFDIO_WRITEPROTECT ioctl to apply pte markers to anonymous memory too
when the new feature bit is set.

Note that even though the patch subject starts with mm/uffd, there are a
few small refactors to the major mm path that handles anonymous page
faults. But they should be straightforward.

With WP_UNPOPULATED, an application like QEMU can avoid pre-read faulting
all the memory before wr-protecting it while taking a live snapshot.
Quoting from Muhammad's test result here [3], based on a simple program
[4]:

(1) With huge page disabled

    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    ./uffd_wp_perf
    Test DEFAULT: 4
    Test PRE-READ: 1111453 (pre-fault 1101011)
    Test MADVISE: 278276 (pre-fault 266378)
    Test WP-UNPOPULATE: 11712

(2) With huge page enabled

    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    ./uffd_wp_perf
    Test DEFAULT: 4
    Test PRE-READ: 22521 (pre-fault 22348)
    Test MADVISE: 4909 (pre-fault 4743)
    Test WP-UNPOPULATE: 14448

There's a great perf boost for the no-thp case. With thp enabled, in the
extreme case where everything is a huge zero page, WP_UNPOPULATED can be
slower than MADVISE, but that's unlikely in reality. Also, the overhead
there is not removed but postponed until a follow-up write on any huge
zero thp, so it is potentially faster only by making the follow-up writes
slower.
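As a rough illustration of how a userspace snapshotter might choose between
the two approaches described above, here is a minimal sketch. The names
wp_strategy, pick_wp_strategy(), uffd_probe_features() and uffd_open_for_wp()
are hypothetical and not part of any kernel or QEMU API. The sketch assumes
that UFFDIO_API with features == 0 reports the supported feature mask, and it
probes on a throwaway fd so that it does not depend on calling UFFDIO_API
twice on the same descriptor. The userfaultfd syscall may fail (for example
with EPERM when unprivileged userfaultfd is disabled), so callers should be
prepared for -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Older uapi headers lack the bit; the value is the one this patch adds. */
#ifndef UFFD_FEATURE_WP_UNPOPULATED
#define UFFD_FEATURE_WP_UNPOPULATED (1 << 13)
#endif

/* Hypothetical names for illustration, not a kernel or QEMU API. */
enum wp_strategy {
	WP_PREPOPULATE,	/* pre-fault the range first, e.g. MADV_POPULATE_READ */
	WP_UNPOPULATED,	/* let the kernel wr-protect none ptes directly */
};

/* Pure decision helper: pick a strategy from the advertised feature mask. */
enum wp_strategy pick_wp_strategy(uint64_t supported)
{
	return (supported & UFFD_FEATURE_WP_UNPOPULATED) ?
		WP_UNPOPULATED : WP_PREPOPULATE;
}

/*
 * Probe on a throwaway fd. Returns 0 and fills *supported on success,
 * -1 with errno set on failure.
 */
int uffd_probe_features(uint64_t *supported)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	int fd, ret, saved_errno;

	assert(supported != NULL);
	fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, UFFDIO_API, &api);
	saved_errno = errno;
	close(fd);
	errno = saved_errno;	/* close() must not hide the ioctl error */
	if (ret < 0)
		return -1;
	*supported = api.features;
	return 0;
}

/*
 * Open the fd that will actually be used, requesting WP_UNPOPULATED only
 * when the probe saw it. Returns the fd, or -1 on failure.
 */
int uffd_open_for_wp(enum wp_strategy *strategy)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	uint64_t supported = 0;
	int fd;

	assert(strategy != NULL);
	if (uffd_probe_features(&supported) < 0)
		return -1;
	*strategy = pick_wp_strategy(supported);
	if (*strategy == WP_UNPOPULATED)
		api.features = UFFD_FEATURE_WP_UNPOPULATED;

	fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;
	if (ioctl(fd, UFFDIO_API, &api) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would then register its range with UFFDIO_REGISTER_MODE_WP and, if
the returned strategy is WP_PREPOPULATE, pre-fault the range before issuing
UFFDIO_WRITEPROTECT. Keeping the decision in a pure helper makes the fallback
path easy to exercise on kernels without the feature.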
[1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
[2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
[3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
[4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c

[peterx@redhat.com: comment changes, oneliner fix to khugepaged]
Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n
Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com
Signed-off-by: Peter Xu
Acked-by: David Hildenbrand
Cc: Andrea Arcangeli
Cc: Axel Rasmussen
Cc: Mike Rapoport
Cc: Muhammad Usama Anjum
Cc: Nadav Amit
Cc: Paul Gofman
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/userfaultfd.rst | 17 +++++++++
 fs/userfaultfd.c                             | 16 ++++++++
 include/linux/mm_inline.h                    |  6 +++
 include/linux/userfaultfd_k.h                | 23 ++++++++++++
 include/uapi/linux/userfaultfd.h             | 10 ++++-
 mm/khugepaged.c                              |  2 +-
 mm/memory.c                                  | 56 +++++++++++++++++++++-------
 mm/mprotect.c                                | 51 ++++++++++++++++++++-----
 8 files changed, 155 insertions(+), 26 deletions(-)

(limited to 'include/linux/userfaultfd_k.h')

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 7dc823b56ca4..bd2226299583 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -219,6 +219,23 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
 you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
 used.
 
+Userfaultfd write-protect mode currently behave differently on none ptes
+(when e.g. page is missing) over different types of memories.
+
+For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
+(e.g. when pages are missing and not populated). For file-backed memories
+like shmem and hugetlbfs, none ptes will be write protected just like a
+present pte.
In other words, there will be a userfaultfd write fault +message generated when writing to a missing page on file typed memories, +as long as the page range was write-protected before. Such a message will +not be generated on anonymous memories by default. + +If the application wants to be able to write protect none ptes on anonymous +memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On +newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED +and set the feature bit in advance to make sure none ptes will also be +write protected even upon anonymous memory. + QEMU/KVM ======== diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 44d1ee429eb0..881e9c82b9d1 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -108,6 +108,21 @@ static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx) return ctx->features & UFFD_FEATURE_INITIALIZED; } +/* + * Whether WP_UNPOPULATED is enabled on the uffd context. It is only + * meaningful when userfaultfd_wp()==true on the vma and when it's + * anonymous. 
+ */ +bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma) +{ + struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx; + + if (!ctx) + return false; + + return ctx->features & UFFD_FEATURE_WP_UNPOPULATED; +} + static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, vm_flags_t flags) { @@ -1971,6 +1986,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, #endif #ifndef CONFIG_PTE_MARKER_UFFD_WP uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM; + uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED; #endif uffdio_api.ioctls = UFFD_API_IOCTLS; ret = -EFAULT; diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index de1e622dd366..0e1d239a882c 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -557,6 +557,12 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, /* The current status of the pte should be "cleared" before calling */ WARN_ON_ONCE(!pte_none(*pte)); + /* + * NOTE: userfaultfd_wp_unpopulated() doesn't need this whole + * thing, because when zapping either it means it's dropping the + * page, or in TTU where the present pte will be quickly replaced + * with a swap pte. There's no way of leaking the bit. 
+ */ if (vma_is_anonymous(vma) || !userfaultfd_wp(vma)) return; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 3767f18114ef..0cf8880219da 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start, unsigned long end, struct list_head *uf); extern void userfaultfd_unmap_complete(struct mm_struct *mm, struct list_head *uf); +extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma); #else /* CONFIG_USERFAULTFD */ @@ -274,8 +275,30 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma) return false; } +static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma) +{ + return false; +} + #endif /* CONFIG_USERFAULTFD */ +static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma) +{ + /* Only wr-protect mode uses pte markers */ + if (!userfaultfd_wp(vma)) + return false; + + /* File-based uffd-wp always need markers */ + if (!vma_is_anonymous(vma)) + return true; + + /* + * Anonymous uffd-wp only needs the markers if WP_UNPOPULATED + * enabled (to apply markers on zero pages). 
+ */ + return userfaultfd_wp_unpopulated(vma); +} + static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry) { #ifdef CONFIG_PTE_MARKER_UFFD_WP diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 005e5e306266..90c958952bfc 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -38,7 +38,8 @@ UFFD_FEATURE_MINOR_HUGETLBFS | \ UFFD_FEATURE_MINOR_SHMEM | \ UFFD_FEATURE_EXACT_ADDRESS | \ - UFFD_FEATURE_WP_HUGETLBFS_SHMEM) + UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \ + UFFD_FEATURE_WP_UNPOPULATED) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -203,6 +204,12 @@ struct uffdio_api { * * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd * write-protection mode is supported on both shmem and hugetlbfs. + * + * UFFD_FEATURE_WP_UNPOPULATED indicates that userfaultfd + * write-protection mode will always apply to unpopulated pages + * (i.e. empty ptes). This will be the default behavior for shmem + * & hugetlbfs, so this flag only affects anonymous memory behavior + * when userfault write-protection mode is registered. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -217,6 +224,7 @@ struct uffdio_api { #define UFFD_FEATURE_MINOR_SHMEM (1<<10) #define UFFD_FEATURE_EXACT_ADDRESS (1<<11) #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12) +#define UFFD_FEATURE_WP_UNPOPULATED (1<<13) __u64 features; __u64 ioctls; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 074ea534f786..c7317678cb10 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1177,7 +1177,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, * enabled swap entries. Please see * comment below for pte_uffd_wp(). 
*/ - if (pte_swp_uffd_wp(pteval)) { + if (pte_swp_uffd_wp_any(pteval)) { result = SCAN_PTE_UFFD_WP; goto out_unmap; } diff --git a/mm/memory.c b/mm/memory.c index 6285cad1f4fb..a890b2951b53 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -104,6 +104,20 @@ EXPORT_SYMBOL(mem_map); #endif static vm_fault_t do_fault(struct vm_fault *vmf); +static vm_fault_t do_anonymous_page(struct vm_fault *vmf); +static bool vmf_pte_changed(struct vm_fault *vmf); + +/* + * Return true if the original pte was a uffd-wp pte marker (so the pte was + * wr-protected). + */ +static bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf) +{ + if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)) + return false; + + return pte_marker_uffd_wp(vmf->orig_pte); +} /* * A number of key systems in x86 including ioremap() rely on the assumption @@ -1346,6 +1360,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr, pte_t *pte, struct zap_details *details, pte_t pteval) { + /* Zap on anonymous always means dropping everything */ + if (vma_is_anonymous(vma)) + return; + if (zap_drop_file_uffd_wp(details)) return; @@ -1452,8 +1470,12 @@ again: continue; rss[mm_counter(page)]--; } else if (pte_marker_entry_uffd_wp(entry)) { - /* Only drop the uffd-wp marker if explicitly requested */ - if (!zap_drop_file_uffd_wp(details)) + /* + * For anon: always drop the marker; for file: only + * drop the marker if explicitly requested. + */ + if (!vma_is_anonymous(vma) && + !zap_drop_file_uffd_wp(details)) continue; } else if (is_hwpoison_entry(entry) || is_swapin_error_entry(entry)) { @@ -3620,6 +3642,14 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf) return 0; } +static vm_fault_t do_pte_missing(struct vm_fault *vmf) +{ + if (vma_is_anonymous(vmf->vma)) + return do_anonymous_page(vmf); + else + return do_fault(vmf); +} + /* * This is actually a page-missing access, but with uffd-wp special pte * installed. It means this pte was wr-protected before being unmapped. 
@@ -3630,11 +3660,10 @@ static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf) * Just in case there're leftover special ptes even after the region * got unregistered - we can simply clear them. */ - if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma))) + if (unlikely(!userfaultfd_wp(vmf->vma))) return pte_marker_clear(vmf); - /* do_fault() can handle pte markers too like none pte */ - return do_fault(vmf); + return do_pte_missing(vmf); } static vm_fault_t handle_pte_marker(struct vm_fault *vmf) @@ -3999,6 +4028,7 @@ out_release: */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; vm_fault_t ret = 0; @@ -4032,7 +4062,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) vma->vm_page_prot)); vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - if (!pte_none(*vmf->pte)) { + if (vmf_pte_changed(vmf)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto unlock; } @@ -4072,7 +4102,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - if (!pte_none(*vmf->pte)) { + if (vmf_pte_changed(vmf)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto release; } @@ -4092,6 +4122,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) folio_add_new_anon_rmap(folio, vma, vmf->address); folio_add_lru_vma(folio, vma); setpte: + if (uffd_wp) + entry = pte_mkuffd_wp(entry); set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); /* No need to invalidate - it was non-present before */ @@ -4259,7 +4291,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) { struct vm_area_struct *vma = vmf->vma; - bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte); + bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); bool write = vmf->flags & 
FAULT_FLAG_WRITE; bool prefault = vmf->address != addr; pte_t entry; @@ -4903,12 +4935,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) } } - if (!vmf->pte) { - if (vma_is_anonymous(vmf->vma)) - return do_anonymous_page(vmf); - else - return do_fault(vmf); - } + if (!vmf->pte) + return do_pte_missing(vmf); if (!pte_present(vmf->orig_pte)) return do_swap_page(vmf); diff --git a/mm/mprotect.c b/mm/mprotect.c index 13e84d8c0797..b9da9a5f87fe 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -276,7 +276,15 @@ static long change_pte_range(struct mmu_gather *tlb, } else { /* It must be an none page, or what else?.. */ WARN_ON_ONCE(!pte_none(oldpte)); - if (unlikely(uffd_wp && !vma_is_anonymous(vma))) { + + /* + * Nobody plays with any none ptes besides + * userfaultfd when applying the protections. + */ + if (likely(!uffd_wp)) + continue; + + if (userfaultfd_wp_use_markers(vma)) { /* * For file-backed mem, we need to be able to * wr-protect a none pte, because even if the @@ -320,23 +328,46 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd) return 0; } -/* Return true if we're uffd wr-protecting file-backed memory, or false */ +/* + * Return true if we want to split THPs into PTE mappings in change + * protection procedure, false otherwise. + */ static inline bool -uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags) +pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags) { + /* + * pte markers only resides in pte level, if we need pte markers, + * we need to split. We cannot wr-protect shmem thp because file + * thp is handled differently when split by erasing the pmd so far. + */ return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma); } /* - * If wr-protecting the range for file-backed, populate pgtable for the case - * when pgtable is empty but page cache exists. 
When {pte|pmd|...}_alloc() - * failed we treat it the same way as pgtable allocation failures during - * page faults by kicking OOM and returning error. + * Return true if we want to populate pgtables in change protection + * procedure, false otherwise + */ +static inline bool +pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags) +{ + /* If not within ioctl(UFFDIO_WRITEPROTECT), then don't bother */ + if (!(cp_flags & MM_CP_UFFD_WP)) + return false; + + /* Populate if the userfaultfd mode requires pte markers */ + return userfaultfd_wp_use_markers(vma); +} + +/* + * Populate the pgtable underneath for whatever reason if requested. + * When {pte|pmd|...}_alloc() failed we treat it the same way as pgtable + * allocation failures during page faults by kicking OOM and returning + * error. */ #define change_pmd_prepare(vma, pmd, cp_flags) \ ({ \ long err = 0; \ - if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ + if (unlikely(pgtable_populate_needed(vma, cp_flags))) { \ if (pte_alloc(vma->vm_mm, pmd)) \ err = -ENOMEM; \ } \ @@ -351,7 +382,7 @@ uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags) #define change_prepare(vma, high, low, addr, cp_flags) \ ({ \ long err = 0; \ - if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ + if (unlikely(pgtable_populate_needed(vma, cp_flags))) { \ low##_t *p = low##_alloc(vma->vm_mm, high, addr); \ if (p == NULL) \ err = -ENOMEM; \ @@ -404,7 +435,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb, if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { if ((next - addr != HPAGE_PMD_SIZE) || - uffd_wp_protect_file(vma, cp_flags)) { + pgtable_split_needed(vma, cp_flags)) { __split_huge_pmd(vma, pmd, addr, false, NULL); /* * For file-backed, the pmd could have been -- cgit v1.2.3-70-g09d2 From a734991ccaec1985fff42fb26bb6d789d35defb4 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Tue, 14 Mar 2023 15:12:47 -0700 Subject: mm: userfaultfd: rename functions 
for clarity + consistency

Patch series "mm: userfaultfd: refactor and add UFFDIO_CONTINUE_MODE_WP",
v5.

- Commits 1-3 refactor userfaultfd ioctl code without behavior changes,
  with the main goal of improving consistency and reducing the number of
  function args.

- Commit 4 adds UFFDIO_CONTINUE_MODE_WP.

This patch (of 4):

The basic problem is, over time we've added new userfaultfd ioctls, and
we've refactored the code so functions which used to handle only one case
are now re-used to deal with several cases. While this happened, we didn't
bother to rename the functions.

Similarly, as we added new functions, we cargo-culted pieces of the
now-inconsistent naming scheme, so those functions too ended up with names
that don't make a lot of sense.

A key point here is, "copy" in most userfaultfd code refers specifically
to UFFDIO_COPY, where we allocate a new page and copy its contents from
userspace. There are many functions with "copy" in the name that don't
actually do this (at least in some cases).

So, rename things into a consistent scheme. The high level idea is that
the call stack for userfaultfd ioctls becomes:

    userfaultfd_ioctl
    -> userfaultfd_(particular ioctl)
      -> mfill_atomic_(particular kind of fill operation)
        -> mfill_atomic    /* loops over pages in range */
          -> mfill_atomic_pte    /* deals with single pages */
            -> mfill_atomic_pte_(particular kind of fill operation)
              -> mfill_atomic_install_pte

There are of course some special cases (shmem, hugetlb), but this is the
general structure which all function names now adhere to.

Link: https://lkml.kernel.org/r/20230314221250.682452-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230314221250.682452-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen
Acked-by: Peter Xu
Acked-by: Mike Rapoport (IBM)
Cc: Al Viro
Cc: Hugh Dickins
Cc: James Houghton
Cc: Jan Kara
Cc: Liam R.
Howlett Cc: Matthew Wilcox (Oracle) Cc: Mike Kravetz Cc: Muchun Song Cc: Nadav Amit Cc: Shuah Khan Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 18 ++++---- include/linux/hugetlb.h | 30 ++++++------- include/linux/userfaultfd_k.h | 18 ++++---- mm/hugetlb.c | 20 ++++----- mm/userfaultfd.c | 98 +++++++++++++++++++++---------------------- 5 files changed, 92 insertions(+), 92 deletions(-) (limited to 'include/linux/userfaultfd_k.h') diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 881e9c82b9d1..4aedfd98e3f5 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1756,9 +1756,9 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx, if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP)) goto out; if (mmget_not_zero(ctx->mm)) { - ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src, - uffdio_copy.len, &ctx->mmap_changing, - uffdio_copy.mode); + ret = mfill_atomic_copy(ctx->mm, uffdio_copy.dst, uffdio_copy.src, + uffdio_copy.len, &ctx->mmap_changing, + uffdio_copy.mode); mmput(ctx->mm); } else { return -ESRCH; @@ -1808,9 +1808,9 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx, goto out; if (mmget_not_zero(ctx->mm)) { - ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start, - uffdio_zeropage.range.len, - &ctx->mmap_changing); + ret = mfill_atomic_zeropage(ctx->mm, uffdio_zeropage.range.start, + uffdio_zeropage.range.len, + &ctx->mmap_changing); mmput(ctx->mm); } else { return -ESRCH; @@ -1918,9 +1918,9 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg) goto out; if (mmget_not_zero(ctx->mm)) { - ret = mcopy_continue(ctx->mm, uffdio_continue.range.start, - uffdio_continue.range.len, - &ctx->mmap_changing); + ret = mfill_atomic_continue(ctx->mm, uffdio_continue.range.start, + uffdio_continue.range.len, + &ctx->mmap_changing); mmput(ctx->mm); } else { return -ESRCH; diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 8fb7d91cd0b1..152434396c48 100644 --- 
a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -158,13 +158,13 @@ unsigned long hugetlb_total_pages(void); vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); #ifdef CONFIG_USERFAULTFD -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - enum mcopy_atomic_mode mode, - struct page **pagep, - bool wp_copy); +int hugetlb_mfill_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + unsigned long src_addr, + enum mcopy_atomic_mode mode, + struct page **pagep, + bool wp_copy); #endif /* CONFIG_USERFAULTFD */ bool hugetlb_reserve_pages(struct inode *inode, long from, long to, struct vm_area_struct *vma, @@ -393,14 +393,14 @@ static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb, } #ifdef CONFIG_USERFAULTFD -static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, - pte_t *dst_pte, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - enum mcopy_atomic_mode mode, - struct page **pagep, - bool wp_copy) +static inline int hugetlb_mfill_atomic_pte(struct mm_struct *dst_mm, + pte_t *dst_pte, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + unsigned long src_addr, + enum mcopy_atomic_mode mode, + struct page **pagep, + bool wp_copy) { BUG(); return 0; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 0cf8880219da..ac178d810dc7 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -61,15 +61,15 @@ extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, unsigned long dst_addr, struct page *page, bool newly_allocated, bool wp_copy); -extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, - unsigned long src_start, unsigned long len, - atomic_t *mmap_changing, __u64 
mode); -extern ssize_t mfill_zeropage(struct mm_struct *dst_mm, - unsigned long dst_start, - unsigned long len, - atomic_t *mmap_changing); -extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start, - unsigned long len, atomic_t *mmap_changing); +extern ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start, + unsigned long src_start, unsigned long len, + atomic_t *mmap_changing, __u64 mode); +extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, + unsigned long dst_start, + unsigned long len, + atomic_t *mmap_changing); +extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start, + unsigned long len, atomic_t *mmap_changing); extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool enable_wp, atomic_t *mmap_changing); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 9122e50ae02a..b1e474aa2fc5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6156,17 +6156,17 @@ out_mutex: #ifdef CONFIG_USERFAULTFD /* - * Used by userfaultfd UFFDIO_COPY. Based on mcopy_atomic_pte with - * modifications for huge pages. + * Used by userfaultfd UFFDIO_* ioctls. Based on userfaultfd's mfill_atomic_pte + * with modifications for hugetlb pages. 
*/ -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, - pte_t *dst_pte, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - enum mcopy_atomic_mode mode, - struct page **pagep, - bool wp_copy) +int hugetlb_mfill_atomic_pte(struct mm_struct *dst_mm, + pte_t *dst_pte, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + unsigned long src_addr, + enum mcopy_atomic_mode mode, + struct page **pagep, + bool wp_copy) { bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); struct hstate *h = hstate_vma(dst_vma); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 23cabd02ac52..874379ce271f 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -127,13 +127,13 @@ out_unlock: return ret; } -static int mcopy_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - struct page **pagep, - bool wp_copy) +static int mfill_atomic_pte_copy(struct mm_struct *dst_mm, + pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + unsigned long src_addr, + struct page **pagep, + bool wp_copy) { void *page_kaddr; int ret; @@ -204,10 +204,10 @@ out_release: goto out; } -static int mfill_zeropage_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr) +static int mfill_atomic_pte_zeropage(struct mm_struct *dst_mm, + pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr) { pte_t _dst_pte, *dst_pte; spinlock_t *ptl; @@ -240,11 +240,11 @@ out_unlock: } /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). 
*/ -static int mcontinue_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - bool wp_copy) +static int mfill_atomic_pte_continue(struct mm_struct *dst_mm, + pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + bool wp_copy) { struct inode *inode = file_inode(dst_vma->vm_file); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); @@ -307,10 +307,10 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) #ifdef CONFIG_HUGETLB_PAGE /* - * __mcopy_atomic processing for HUGETLB vmas. Note that this routine is + * mfill_atomic processing for HUGETLB vmas. Note that this routine is * called with mmap_lock held, it will release mmap_lock before returning. */ -static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, +static __always_inline ssize_t mfill_atomic_hugetlb(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, unsigned long dst_start, unsigned long src_start, @@ -411,7 +411,7 @@ retry: goto out_unlock; } - err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, + err = hugetlb_mfill_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, src_addr, mode, &page, wp_copy); @@ -463,7 +463,7 @@ out: } #else /* !CONFIG_HUGETLB_PAGE */ /* fail at build time if gcc attempts to use this */ -extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, +extern ssize_t mfill_atomic_hugetlb(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, unsigned long dst_start, unsigned long src_start, @@ -484,8 +484,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, ssize_t err; if (mode == MCOPY_ATOMIC_CONTINUE) { - return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, - wp_copy); + return mfill_atomic_pte_continue(dst_mm, dst_pmd, dst_vma, + dst_addr, wp_copy); } /* @@ -500,11 +500,11 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, */ if (!(dst_vma->vm_flags & VM_SHARED)) { if (mode 
== MCOPY_ATOMIC_NORMAL) - err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, src_addr, page, - wp_copy); + err = mfill_atomic_pte_copy(dst_mm, dst_pmd, dst_vma, + dst_addr, src_addr, page, + wp_copy); else - err = mfill_zeropage_pte(dst_mm, dst_pmd, + err = mfill_atomic_pte_zeropage(dst_mm, dst_pmd, dst_vma, dst_addr); } else { err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, @@ -516,13 +516,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, return err; } -static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, - unsigned long dst_start, - unsigned long src_start, - unsigned long len, - enum mcopy_atomic_mode mcopy_mode, - atomic_t *mmap_changing, - __u64 mode) +static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm, + unsigned long dst_start, + unsigned long src_start, + unsigned long len, + enum mcopy_atomic_mode mcopy_mode, + atomic_t *mmap_changing, + __u64 mode) { struct vm_area_struct *dst_vma; ssize_t err; @@ -588,9 +588,9 @@ retry: * If this is a HUGETLB vma, pass off to appropriate routine */ if (is_vm_hugetlb_page(dst_vma)) - return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start, - src_start, len, mcopy_mode, - wp_copy); + return mfill_atomic_hugetlb(dst_mm, dst_vma, dst_start, + src_start, len, mcopy_mode, + wp_copy); if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) goto out_unlock; @@ -688,26 +688,26 @@ out: return copied ? 
copied : err; } -ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, - unsigned long src_start, unsigned long len, - atomic_t *mmap_changing, __u64 mode) +ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start, + unsigned long src_start, unsigned long len, + atomic_t *mmap_changing, __u64 mode) { - return __mcopy_atomic(dst_mm, dst_start, src_start, len, - MCOPY_ATOMIC_NORMAL, mmap_changing, mode); + return mfill_atomic(dst_mm, dst_start, src_start, len, + MCOPY_ATOMIC_NORMAL, mmap_changing, mode); } -ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start, - unsigned long len, atomic_t *mmap_changing) +ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long start, + unsigned long len, atomic_t *mmap_changing) { - return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE, - mmap_changing, 0); + return mfill_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE, + mmap_changing, 0); } -ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start, - unsigned long len, atomic_t *mmap_changing) +ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start, + unsigned long len, atomic_t *mmap_changing) { - return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE, - mmap_changing, 0); + return mfill_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE, + mmap_changing, 0); } long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, -- cgit v1.2.3-70-g09d2 From 61c5004022f56c443b86800e8985d8803f3a22aa Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Tue, 14 Mar 2023 15:12:48 -0700 Subject: mm: userfaultfd: don't pass around both mm and vma Quite a few userfaultfd functions took both mm and vma pointers as arguments. Since the mm is trivially accessible via vma->vm_mm, there's no reason to pass both; it just needlessly extends the already long argument list. Get rid of the mm pointer, where possible, to shorten the argument list. 
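A minimal userspace sketch of the idea, not kernel code: the two structs are toy stand-ins that carry only the field the example needs, and the sketch_* helper names are hypothetical. It shows why the separate mm argument is redundant once the vma is available.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures (assumption: only vm_mm matters here). */
struct mm_struct {
	int id;
};

struct vm_area_struct {
	struct mm_struct *vm_mm;	/* back-pointer to the owning address space */
};

/* Old shape: the caller hands over both pointers, even though one implies the other. */
static inline struct mm_struct *sketch_mm_old(struct mm_struct *dst_mm,
					      struct vm_area_struct *dst_vma)
{
	(void)dst_vma;
	return dst_mm;
}

/* New shape: the mm is derived from the vma, so the argument list is one shorter. */
static inline struct mm_struct *sketch_mm_new(struct vm_area_struct *dst_vma)
{
	return dst_vma->vm_mm;
}

/* Both shapes resolve to the same mm when the caller was consistent. */
void sketch_mm_selfcheck(void)
{
	struct mm_struct mm = { .id = 1 };
	struct vm_area_struct vma = { .vm_mm = &mm };

	assert(sketch_mm_old(&mm, &vma) == sketch_mm_new(&vma));
}
```

Functions that use the mm more than once can cache it in a local, as the diff does with "struct mm_struct *dst_mm = dst_vma->vm_mm;".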
Link: https://lkml.kernel.org/r/20230314221250.682452-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Acked-by: Peter Xu Acked-by: Mike Rapoport (IBM) Cc: Al Viro Cc: Hugh Dickins Cc: James Houghton Cc: Jan Kara Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Mike Kravetz Cc: Muchun Song Cc: Nadav Amit Cc: Shuah Khan Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 2 +- include/linux/hugetlb.h | 5 ++-- include/linux/shmem_fs.h | 4 +-- include/linux/userfaultfd_k.h | 4 +-- mm/hugetlb.c | 4 +-- mm/shmem.c | 7 +++-- mm/userfaultfd.c | 61 ++++++++++++++++++++----------------------- 7 files changed, 41 insertions(+), 46 deletions(-) (limited to 'include/linux/userfaultfd_k.h') diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 4aedfd98e3f5..d8d432ca81e6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1644,7 +1644,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, /* Reset ptes for the whole vma range if wr-protected */ if (userfaultfd_wp(vma)) - uffd_wp_range(mm, vma, start, vma_end - start, false); + uffd_wp_range(vma, start, vma_end - start, false); new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS; prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 152434396c48..3cb7cd853fa8 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -158,7 +158,7 @@ unsigned long hugetlb_total_pages(void); vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); #ifdef CONFIG_USERFAULTFD -int hugetlb_mfill_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, +int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, @@ -393,8 +393,7 @@ static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb, } #ifdef CONFIG_USERFAULTFD -static inline int hugetlb_mfill_atomic_pte(struct mm_struct *dst_mm, - pte_t *dst_pte, +static inline 
int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 50bf82b36995..922a2b45fe6f 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -152,14 +152,14 @@ extern void shmem_uncharge(struct inode *inode, long pages); #ifdef CONFIG_USERFAULTFD #ifdef CONFIG_SHMEM -extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, +extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, bool zeropage, bool wp_copy, struct page **pagep); #else /* !CONFIG_SHMEM */ -#define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \ +#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \ src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; }) #endif /* CONFIG_SHMEM */ #endif /* CONFIG_USERFAULTFD */ diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index ac178d810dc7..9458cd94a508 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -56,7 +56,7 @@ enum mcopy_atomic_mode { MCOPY_ATOMIC_CONTINUE, }; -extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, +extern int mfill_atomic_install_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, struct page *page, bool newly_allocated, bool wp_copy); @@ -73,7 +73,7 @@ extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool enable_wp, atomic_t *mmap_changing); -extern long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma, +extern long uffd_wp_range(struct vm_area_struct *vma, unsigned long start, unsigned long len, bool enable_wp); /* mm helpers */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b1e474aa2fc5..6dc32cccbd9b 100644 --- a/mm/hugetlb.c +++ 
b/mm/hugetlb.c @@ -6159,8 +6159,7 @@ out_mutex: * Used by userfaultfd UFFDIO_* ioctls. Based on userfaultfd's mfill_atomic_pte * with modifications for hugetlb pages. */ -int hugetlb_mfill_atomic_pte(struct mm_struct *dst_mm, - pte_t *dst_pte, +int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, @@ -6168,6 +6167,7 @@ int hugetlb_mfill_atomic_pte(struct mm_struct *dst_mm, struct page **pagep, bool wp_copy) { + struct mm_struct *dst_mm = dst_vma->vm_mm; bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); struct hstate *h = hstate_vma(dst_vma); struct address_space *mapping = dst_vma->vm_file->f_mapping; diff --git a/mm/shmem.c b/mm/shmem.c index fa6e38f2f55f..9d13b9a64144 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2429,8 +2429,7 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block } #ifdef CONFIG_USERFAULTFD -int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, +int shmem_mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, @@ -2520,11 +2519,11 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, goto out_release; ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, - gfp & GFP_RECLAIM_MASK, dst_mm); + gfp & GFP_RECLAIM_MASK, dst_vma->vm_mm); if (ret) goto out_release; - ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, &folio->page, true, wp_copy); if (ret) goto out_delete_from_cache; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 874379ce271f..c3cc6cb04548 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -55,12 +55,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem * and anon, and for both shared and private VMAs. 
*/ -int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, +int mfill_atomic_install_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, struct page *page, bool newly_allocated, bool wp_copy) { int ret; + struct mm_struct *dst_mm = dst_vma->vm_mm; pte_t _dst_pte, *dst_pte; bool writable = dst_vma->vm_flags & VM_WRITE; bool vm_shared = dst_vma->vm_flags & VM_SHARED; @@ -127,8 +128,7 @@ out_unlock: return ret; } -static int mfill_atomic_pte_copy(struct mm_struct *dst_mm, - pmd_t *dst_pmd, +static int mfill_atomic_pte_copy(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, @@ -190,10 +190,10 @@ static int mfill_atomic_pte_copy(struct mm_struct *dst_mm, __SetPageUptodate(page); ret = -ENOMEM; - if (mem_cgroup_charge(page_folio(page), dst_mm, GFP_KERNEL)) + if (mem_cgroup_charge(page_folio(page), dst_vma->vm_mm, GFP_KERNEL)) goto out_release; - ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, page, true, wp_copy); if (ret) goto out_release; @@ -204,8 +204,7 @@ out_release: goto out; } -static int mfill_atomic_pte_zeropage(struct mm_struct *dst_mm, - pmd_t *dst_pmd, +static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr) { @@ -217,7 +216,7 @@ static int mfill_atomic_pte_zeropage(struct mm_struct *dst_mm, _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr), dst_vma->vm_page_prot)); - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl); if (dst_vma->vm_file) { /* the shmem MAP_PRIVATE case requires checking the i_size */ inode = dst_vma->vm_file->f_inode; @@ -230,7 +229,7 @@ static int mfill_atomic_pte_zeropage(struct mm_struct *dst_mm, ret = -EEXIST; if (!pte_none(*dst_pte)) goto out_unlock; - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + set_pte_at(dst_vma->vm_mm, 
dst_addr, dst_pte, _dst_pte); /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); ret = 0; @@ -240,8 +239,7 @@ out_unlock: } /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ -static int mfill_atomic_pte_continue(struct mm_struct *dst_mm, - pmd_t *dst_pmd, +static int mfill_atomic_pte_continue(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, bool wp_copy) @@ -269,7 +267,7 @@ static int mfill_atomic_pte_continue(struct mm_struct *dst_mm, goto out_release; } - ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, page, false, wp_copy); if (ret) goto out_release; @@ -310,7 +308,7 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) * mfill_atomic processing for HUGETLB vmas. Note that this routine is * called with mmap_lock held, it will release mmap_lock before returning. */ -static __always_inline ssize_t mfill_atomic_hugetlb(struct mm_struct *dst_mm, +static __always_inline ssize_t mfill_atomic_hugetlb( struct vm_area_struct *dst_vma, unsigned long dst_start, unsigned long src_start, @@ -318,6 +316,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(struct mm_struct *dst_mm, enum mcopy_atomic_mode mode, bool wp_copy) { + struct mm_struct *dst_mm = dst_vma->vm_mm; int vm_shared = dst_vma->vm_flags & VM_SHARED; ssize_t err; pte_t *dst_pte; @@ -411,7 +410,7 @@ retry: goto out_unlock; } - err = hugetlb_mfill_atomic_pte(dst_mm, dst_pte, dst_vma, + err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, dst_addr, src_addr, mode, &page, wp_copy); @@ -463,17 +462,15 @@ out: } #else /* !CONFIG_HUGETLB_PAGE */ /* fail at build time if gcc attempts to use this */ -extern ssize_t mfill_atomic_hugetlb(struct mm_struct *dst_mm, - struct vm_area_struct *dst_vma, - unsigned long dst_start, - unsigned long src_start, - unsigned long len, - enum mcopy_atomic_mode mode, - bool wp_copy); +extern 
ssize_t mfill_atomic_hugetlb(struct vm_area_struct *dst_vma, + unsigned long dst_start, + unsigned long src_start, + unsigned long len, + enum mcopy_atomic_mode mode, + bool wp_copy); #endif /* CONFIG_HUGETLB_PAGE */ -static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, +static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, @@ -484,7 +481,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, ssize_t err; if (mode == MCOPY_ATOMIC_CONTINUE) { - return mfill_atomic_pte_continue(dst_mm, dst_pmd, dst_vma, + return mfill_atomic_pte_continue(dst_pmd, dst_vma, dst_addr, wp_copy); } @@ -500,14 +497,14 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, */ if (!(dst_vma->vm_flags & VM_SHARED)) { if (mode == MCOPY_ATOMIC_NORMAL) - err = mfill_atomic_pte_copy(dst_mm, dst_pmd, dst_vma, + err = mfill_atomic_pte_copy(dst_pmd, dst_vma, dst_addr, src_addr, page, wp_copy); else - err = mfill_atomic_pte_zeropage(dst_mm, dst_pmd, + err = mfill_atomic_pte_zeropage(dst_pmd, dst_vma, dst_addr); } else { - err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, + err = shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, src_addr, mode != MCOPY_ATOMIC_NORMAL, wp_copy, page); @@ -588,7 +585,7 @@ retry: * If this is a HUGETLB vma, pass off to appropriate routine */ if (is_vm_hugetlb_page(dst_vma)) - return mfill_atomic_hugetlb(dst_mm, dst_vma, dst_start, + return mfill_atomic_hugetlb(dst_vma, dst_start, src_start, len, mcopy_mode, wp_copy); @@ -641,7 +638,7 @@ retry: BUG_ON(pmd_none(*dst_pmd)); BUG_ON(pmd_trans_huge(*dst_pmd)); - err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, src_addr, &page, mcopy_mode, wp_copy); cond_resched(); @@ -710,7 +707,7 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start, mmap_changing, 0); } -long 
uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, +long uffd_wp_range(struct vm_area_struct *dst_vma, unsigned long start, unsigned long len, bool enable_wp) { unsigned int mm_cp_flags; @@ -732,7 +729,7 @@ long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, */ if (!enable_wp && vma_wants_manual_pte_write_upgrade(dst_vma)) mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE; - tlb_gather_mmu(&tlb, dst_mm); + tlb_gather_mmu(&tlb, dst_vma->vm_mm); ret = change_protection(&tlb, dst_vma, start, start + len, mm_cp_flags); tlb_finish_mmu(&tlb); @@ -788,7 +785,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, _start = max(dst_vma->vm_start, start); _end = min(dst_vma->vm_end, end); - err = uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp); + err = uffd_wp_range(dst_vma, _start, _end - _start, enable_wp); /* Return 0 on success, <0 on failures */ if (err < 0) -- cgit v1.2.3-70-g09d2 From d9712937037e0ce887920f321429826e9dbfd960 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Tue, 14 Mar 2023 15:12:49 -0700 Subject: mm: userfaultfd: combine 'mode' and 'wp_copy' arguments Many userfaultfd ioctl functions take both a 'mode' and a 'wp_copy' argument. In future commits we plan to plumb the flags through to more places, so we'd be proliferating the very long argument list even further. Let's take the time to simplify the argument list. Combine the two arguments into one - and generalize, so when we add more flags in the future, it doesn't imply more function arguments. Since the modes (copy, zeropage, continue) are mutually exclusive, store them as an integer value (0, 1, 2) in the low bits. Place combine-able flag bits in the high bits. This is quite similar to an earlier patch proposed by Nadav Amit ("userfaultfd: introduce uffd_flags" [1]). The main difference is that patch only handled flags, whereas this patch *also* combines the "mode" argument into the same type to shorten the argument list. 
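The packing described above can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel code: the SKETCH_* macros and sketch_* helpers are hypothetical names, the mode width is hard-coded to 2 bits (the patch derives it with const_ilog2()), and the __bitwise/__force sparse annotations are dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for uffd_flags_t (assumption: a plain unsigned int). */
typedef unsigned int sketch_flags_t;

/* Mutually exclusive modes, stored as an integer in the low bits. */
enum sketch_mode {
	SKETCH_COPY,		/* 0 */
	SKETCH_ZEROPAGE,	/* 1 */
	SKETCH_CONTINUE,	/* 2 */
	SKETCH_NR_MODES,
};

/* Two bits are enough to hold the values 0..2. */
#define SKETCH_MODE_BITS	2u
#define SKETCH_MODE_MASK	((1u << SKETCH_MODE_BITS) - 1u)

/* Combinable behavior flags start right above the mode bits. */
#define SKETCH_FLAG(nr)		(1u << (SKETCH_MODE_BITS + (nr)))
#define SKETCH_FLAG_WP		SKETCH_FLAG(0)

/* Compare only the low mode bits; any flag bits are ignored. */
static inline bool sketch_mode_is(sketch_flags_t flags, enum sketch_mode expected)
{
	return (flags & SKETCH_MODE_MASK) == (sketch_flags_t)expected;
}

/* Replace the mode bits while leaving the flag bits untouched. */
static inline sketch_flags_t sketch_set_mode(sketch_flags_t flags, enum sketch_mode mode)
{
	flags &= ~SKETCH_MODE_MASK;
	return flags | (sketch_flags_t)mode;
}

/* A WP flag survives a mode change, and the mode is still readable afterwards. */
void sketch_flags_selfcheck(void)
{
	sketch_flags_t flags = sketch_set_mode(SKETCH_FLAG_WP, SKETCH_CONTINUE);

	assert(sketch_mode_is(flags, SKETCH_CONTINUE));
	assert(flags & SKETCH_FLAG_WP);
}
```

With this layout a single argument carries both pieces of information, so adding another behavior flag later means defining SKETCH_FLAG(1) rather than adding a parameter to every function in the call chain.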
[1]: https://lore.kernel.org/all/20220619233449.181323-2-namit@vmware.com/ Link: https://lkml.kernel.org/r/20230314221250.682452-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Acked-by: James Houghton Acked-by: Peter Xu Acked-by: Mike Rapoport (IBM) Cc: Al Viro Cc: Hugh Dickins Cc: Jan Kara Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Mike Kravetz Cc: Muchun Song Cc: Shuah Khan Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 5 ++- include/linux/hugetlb.h | 10 +++--- include/linux/shmem_fs.h | 5 +-- include/linux/userfaultfd_k.h | 46 +++++++++++++++++--------- mm/hugetlb.c | 12 +++---- mm/shmem.c | 7 ++-- mm/userfaultfd.c | 76 +++++++++++++++++++------------------------ 7 files changed, 84 insertions(+), 77 deletions(-) (limited to 'include/linux/userfaultfd_k.h') diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index d8d432ca81e6..8971c3613cc6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1729,6 +1729,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx, struct uffdio_copy uffdio_copy; struct uffdio_copy __user *user_uffdio_copy; struct userfaultfd_wake_range range; + uffd_flags_t flags = 0; user_uffdio_copy = (struct uffdio_copy __user *) arg; @@ -1755,10 +1756,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx, goto out; if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP)) goto out; + if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP) + flags |= MFILL_ATOMIC_WP; if (mmget_not_zero(ctx->mm)) { ret = mfill_atomic_copy(ctx->mm, uffdio_copy.dst, uffdio_copy.src, uffdio_copy.len, &ctx->mmap_changing, - uffdio_copy.mode); + flags); mmput(ctx->mm); } else { return -ESRCH; diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3cb7cd853fa8..2a758bcd6719 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -162,9 +162,8 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - enum mcopy_atomic_mode mode, - 
struct page **pagep, - bool wp_copy); + uffd_flags_t flags, + struct page **pagep); #endif /* CONFIG_USERFAULTFD */ bool hugetlb_reserve_pages(struct inode *inode, long from, long to, struct vm_area_struct *vma, @@ -397,9 +396,8 @@ static inline int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - enum mcopy_atomic_mode mode, - struct page **pagep, - bool wp_copy) + uffd_flags_t flags, + struct page **pagep) { BUG(); return 0; diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 922a2b45fe6f..3bb8d21edbb3 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,7 @@ #include #include #include +#include /* inode in-kernel data */ @@ -156,11 +157,11 @@ extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - bool zeropage, bool wp_copy, + uffd_flags_t flags, struct page **pagep); #else /* !CONFIG_SHMEM */ #define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \ - src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; }) + src_addr, flags, pagep) ({ BUG(); 0; }) #endif /* CONFIG_SHMEM */ #endif /* CONFIG_USERFAULTFD */ diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 9458cd94a508..4c477dece540 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -40,30 +40,44 @@ extern int sysctl_unprivileged_userfaultfd; extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason); -/* - * The mode of operation for __mcopy_atomic and its helpers. - * - * This is almost an implementation detail (mcopy_atomic below doesn't take this - * as a parameter), but it's exposed here because memory-kind-specific - * implementations (e.g. hugetlbfs) need to know the mode of operation. - */ -enum mcopy_atomic_mode { - /* A normal copy_from_user into the destination range. 
*/ - MCOPY_ATOMIC_NORMAL, - /* Don't copy; map the destination range to the zero page. */ - MCOPY_ATOMIC_ZEROPAGE, - /* Just install pte(s) with the existing page(s) in the page cache. */ - MCOPY_ATOMIC_CONTINUE, +/* A combined operation mode + behavior flags. */ +typedef unsigned int __bitwise uffd_flags_t; + +/* Mutually exclusive modes of operation. */ +enum mfill_atomic_mode { + MFILL_ATOMIC_COPY, + MFILL_ATOMIC_ZEROPAGE, + MFILL_ATOMIC_CONTINUE, + NR_MFILL_ATOMIC_MODES, }; +#define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1) +#define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr)) +#define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr)) +#define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1)) + +static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected) +{ + return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected); +} + +static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode) +{ + flags &= ~MFILL_ATOMIC_MODE_MASK; + return flags | ((__force uffd_flags_t) mode); +} + +/* Flags controlling behavior. These behavior changes are mode-independent. 
*/ +#define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0) + extern int mfill_atomic_install_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, struct page *page, - bool newly_allocated, bool wp_copy); + bool newly_allocated, uffd_flags_t flags); extern ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, - atomic_t *mmap_changing, __u64 mode); + atomic_t *mmap_changing, uffd_flags_t flags); extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long len, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6dc32cccbd9b..8bfd07f4c143 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6163,12 +6163,12 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - enum mcopy_atomic_mode mode, - struct page **pagep, - bool wp_copy) + uffd_flags_t flags, + struct page **pagep) { struct mm_struct *dst_mm = dst_vma->vm_mm; - bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); + bool is_continue = uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE); + bool wp_enabled = (flags & MFILL_ATOMIC_WP); struct hstate *h = hstate_vma(dst_vma); struct address_space *mapping = dst_vma->vm_file->f_mapping; pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); @@ -6303,7 +6303,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY * with wp flag set, don't set pte write bit. 
*/ - if (wp_copy || (is_continue && !vm_shared)) + if (wp_enabled || (is_continue && !vm_shared)) writable = 0; else writable = dst_vma->vm_flags & VM_WRITE; @@ -6318,7 +6318,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, _dst_pte = huge_pte_mkdirty(_dst_pte); _dst_pte = pte_mkyoung(_dst_pte); - if (wp_copy) + if (wp_enabled) _dst_pte = huge_pte_mkuffd_wp(_dst_pte); set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); diff --git a/mm/shmem.c b/mm/shmem.c index 9d13b9a64144..b185c1db3009 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -76,7 +76,6 @@ static struct vfsmount *shm_mnt; #include #include #include -#include #include #include @@ -2433,7 +2432,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - bool zeropage, bool wp_copy, + uffd_flags_t flags, struct page **pagep) { struct inode *inode = file_inode(dst_vma->vm_file); @@ -2465,7 +2464,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, if (!folio) goto out_unacct_blocks; - if (!zeropage) { /* COPY */ + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { page_kaddr = kmap_local_folio(folio, 0); /* * The read mmap_lock is held here. 
Despite the @@ -2524,7 +2523,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, goto out_release; ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - &folio->page, true, wp_copy); + &folio->page, true, flags); if (ret) goto out_delete_from_cache; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index c3cc6cb04548..a9b19b39413d 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -58,7 +58,7 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, int mfill_atomic_install_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, struct page *page, - bool newly_allocated, bool wp_copy) + bool newly_allocated, uffd_flags_t flags) { int ret; struct mm_struct *dst_mm = dst_vma->vm_mm; @@ -77,7 +77,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, writable = false; if (writable) _dst_pte = pte_mkwrite(_dst_pte); - if (wp_copy) + if (flags & MFILL_ATOMIC_WP) _dst_pte = pte_mkuffd_wp(_dst_pte); dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); @@ -132,8 +132,8 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - struct page **pagep, - bool wp_copy) + uffd_flags_t flags, + struct page **pagep) { void *page_kaddr; int ret; @@ -194,7 +194,7 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd, goto out_release; ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - page, true, wp_copy); + page, true, flags); if (ret) goto out_release; out: @@ -242,7 +242,7 @@ out_unlock: static int mfill_atomic_pte_continue(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, - bool wp_copy) + uffd_flags_t flags) { struct inode *inode = file_inode(dst_vma->vm_file); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); @@ -268,7 +268,7 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, } ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - page, false, wp_copy); + page, false, flags); if (ret) goto out_release; @@ -313,8 +313,7 
@@ static __always_inline ssize_t mfill_atomic_hugetlb( unsigned long dst_start, unsigned long src_start, unsigned long len, - enum mcopy_atomic_mode mode, - bool wp_copy) + uffd_flags_t flags) { struct mm_struct *dst_mm = dst_vma->vm_mm; int vm_shared = dst_vma->vm_flags & VM_SHARED; @@ -334,7 +333,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb( * by THP. Since we can not reliably insert a zero page, this * feature is not supported. */ - if (mode == MCOPY_ATOMIC_ZEROPAGE) { + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) { mmap_read_unlock(dst_mm); return -EINVAL; } @@ -402,7 +401,7 @@ retry: goto out_unlock; } - if (mode != MCOPY_ATOMIC_CONTINUE && + if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && !huge_pte_none_mostly(huge_ptep_get(dst_pte))) { err = -EEXIST; hugetlb_vma_unlock_read(dst_vma); @@ -410,9 +409,8 @@ retry: goto out_unlock; } - err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, - dst_addr, src_addr, mode, &page, - wp_copy); + err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, dst_addr, + src_addr, flags, &page); hugetlb_vma_unlock_read(dst_vma); mutex_unlock(&hugetlb_fault_mutex_table[hash]); @@ -466,23 +464,21 @@ extern ssize_t mfill_atomic_hugetlb(struct vm_area_struct *dst_vma, unsigned long dst_start, unsigned long src_start, unsigned long len, - enum mcopy_atomic_mode mode, - bool wp_copy); + uffd_flags_t flags); #endif /* CONFIG_HUGETLB_PAGE */ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - struct page **page, - enum mcopy_atomic_mode mode, - bool wp_copy) + uffd_flags_t flags, + struct page **pagep) { ssize_t err; - if (mode == MCOPY_ATOMIC_CONTINUE) { + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { return mfill_atomic_pte_continue(dst_pmd, dst_vma, - dst_addr, wp_copy); + dst_addr, flags); } /* @@ -496,18 +492,17 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, * and not in the radix tree. 
*/ if (!(dst_vma->vm_flags & VM_SHARED)) { - if (mode == MCOPY_ATOMIC_NORMAL) + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) err = mfill_atomic_pte_copy(dst_pmd, dst_vma, - dst_addr, src_addr, page, - wp_copy); + dst_addr, src_addr, + flags, pagep); else err = mfill_atomic_pte_zeropage(dst_pmd, dst_vma, dst_addr); } else { err = shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, src_addr, - mode != MCOPY_ATOMIC_NORMAL, - wp_copy, page); + flags, pagep); } return err; @@ -517,9 +512,8 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, - enum mcopy_atomic_mode mcopy_mode, atomic_t *mmap_changing, - __u64 mode) + uffd_flags_t flags) { struct vm_area_struct *dst_vma; ssize_t err; @@ -527,7 +521,6 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm, unsigned long src_addr, dst_addr; long copied; struct page *page; - bool wp_copy; /* * Sanitize the command parameters: @@ -577,8 +570,7 @@ retry: * validate 'mode' now that we know the dst_vma: don't allow * a wrprotect copy if the userfaultfd didn't register as WP. 
 	 */
-	wp_copy = mode & UFFDIO_COPY_MODE_WP;
-	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
+	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
 		goto out_unlock;
 
 	/*
@@ -586,12 +578,12 @@ retry:
 	 */
 	if (is_vm_hugetlb_page(dst_vma))
 		return mfill_atomic_hugetlb(dst_vma, dst_start,
-					     src_start, len, mcopy_mode,
-					     wp_copy);
+					     src_start, len, flags);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
+	if (!vma_is_shmem(dst_vma) &&
+	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
 		goto out_unlock;
 
 	/*
@@ -639,7 +631,7 @@ retry:
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, mcopy_mode, wp_copy);
+				       src_addr, flags, &page);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
@@ -687,24 +679,24 @@ out:
 
 ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start,
 			  unsigned long src_start, unsigned long len,
-			  atomic_t *mmap_changing, __u64 mode)
+			  atomic_t *mmap_changing, uffd_flags_t flags)
 {
-	return mfill_atomic(dst_mm, dst_start, src_start, len,
-			    MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
+	return mfill_atomic(dst_mm, dst_start, src_start, len, mmap_changing,
+			    uffd_flags_set_mode(flags, MFILL_ATOMIC_COPY));
 }
 
 ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long start,
 			      unsigned long len, atomic_t *mmap_changing)
 {
-	return mfill_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
-			    mmap_changing, 0);
+	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+			    uffd_flags_set_mode(0, MFILL_ATOMIC_ZEROPAGE));
 }
 
 ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
 			      unsigned long len, atomic_t *mmap_changing)
 {
-	return mfill_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
-			    mmap_changing, 0);
+	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+			    uffd_flags_set_mode(0, MFILL_ATOMIC_CONTINUE));
 }
 
 long uffd_wp_range(struct vm_area_struct *dst_vma,
-- 
cgit v1.2.3-70-g09d2


From 0289184476c845968ad6ac9083c96cc0f75ca505 Mon Sep 17 00:00:00 2001
From: Axel Rasmussen
Date: Tue, 14 Mar 2023 15:12:50 -0700
Subject: mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs

UFFDIO_COPY already has UFFDIO_COPY_MODE_WP, so when installing a new PTE
to resolve a missing fault, one can install a write-protected one. This
is useful when using UFFDIO_REGISTER_MODE_{MISSING,WP} in combination.

This was motivated by testing HugeTLB HGM [1], and in particular its
interaction with userfaultfd features. Existing userfaultfd code supports
using WP and MINOR modes together (i.e. you can register an area with
both enabled), but without this CONTINUE flag the combination is in
practice unusable.

So, add an analogous UFFDIO_CONTINUE_MODE_WP, which does the same thing as
UFFDIO_COPY_MODE_WP, but for *minor* faults.

Update the selftest to do some very basic exercising of the new flag.

Update Documentation/ to describe how these flags are used (neither the
COPY nor the new CONTINUE versions of this mode flag were described there
before).

[1]: https://patchwork.kernel.org/project/linux-mm/cover/20230218002819.1486479-1-jthoughton@google.com/

Link: https://lkml.kernel.org/r/20230314221250.682452-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen
Acked-by: Peter Xu
Acked-by: Mike Rapoport (IBM)
Cc: Al Viro
Cc: Hugh Dickins
Cc: Jan Kara
Cc: Liam R. Howlett
Cc: Matthew Wilcox (Oracle)
Cc: Mike Kravetz
Cc: Muchun Song
Cc: Nadav Amit
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/userfaultfd.rst | 8 ++++++++
 fs/userfaultfd.c                             | 8 ++++++--
 include/linux/userfaultfd_k.h                | 3 ++-
 include/uapi/linux/userfaultfd.h             | 7 +++++++
 mm/userfaultfd.c                             | 5 +++--
 tools/testing/selftests/mm/userfaultfd.c     | 4 ++++
 6 files changed, 30 insertions(+), 5 deletions(-)

(limited to 'include/linux/userfaultfd_k.h')

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index bd2226299583..7c304e432205 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -236,6 +236,14 @@ newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
 and set the feature bit in advance to make sure none ptes will also be
 write protected even upon anonymous memory.
 
+When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
+``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
+resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
+respectively, it may be desirable for the new page / mapping to be
+write-protected (so future writes will also result in a WP fault). These ioctls
+support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
+respectively) to configure the mapping this way.
+
 QEMU/KVM
 ========
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 8971c3613cc6..8395605790f6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1893,6 +1893,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	struct uffdio_continue uffdio_continue;
 	struct uffdio_continue __user *user_uffdio_continue;
 	struct userfaultfd_wake_range range;
+	uffd_flags_t flags = 0;
 
 	user_uffdio_continue = (struct uffdio_continue __user *)arg;
 
@@ -1917,13 +1918,16 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	    uffdio_continue.range.start) {
 		goto out;
 	}
-	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE |
+				     UFFDIO_CONTINUE_MODE_WP))
 		goto out;
+	if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_WP)
+		flags |= MFILL_ATOMIC_WP;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_atomic_continue(ctx->mm, uffdio_continue.range.start,
 					    uffdio_continue.range.len,
-					    &ctx->mmap_changing);
+					    &ctx->mmap_changing, flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 4c477dece540..a2c53e98dfd6 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -83,7 +83,8 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
 				     unsigned long len,
 				     atomic_t *mmap_changing);
 extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
-				     unsigned long len, atomic_t *mmap_changing);
+				     unsigned long len, atomic_t *mmap_changing,
+				     uffd_flags_t flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 90c958952bfc..66dd4cd277bd 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -305,6 +305,13 @@ struct uffdio_writeprotect {
 struct uffdio_continue {
 	struct uffdio_range range;
 #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
+	/*
+	 * UFFDIO_CONTINUE_MODE_WP will map the page write protected on
+	 * the fly. UFFDIO_CONTINUE_MODE_WP is available only if the
+	 * write protected ioctl is implemented for the range
+	 * according to the uffdio_register.ioctls.
+	 */
+#define UFFDIO_CONTINUE_MODE_WP			((__u64)1<<1)
 	__u64 mode;
 
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a9b19b39413d..7f1b5f8b712c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -693,10 +693,11 @@ ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long start,
 }
 
 ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
-			      unsigned long len, atomic_t *mmap_changing)
+			      unsigned long len, atomic_t *mmap_changing,
+			      uffd_flags_t flags)
 {
 	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
-			    uffd_flags_set_mode(0, MFILL_ATOMIC_CONTINUE));
+			    uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
 }
 
 long uffd_wp_range(struct vm_area_struct *dst_vma,
diff --git a/tools/testing/selftests/mm/userfaultfd.c b/tools/testing/selftests/mm/userfaultfd.c
index e030d63c031a..a96d126cb40e 100644
--- a/tools/testing/selftests/mm/userfaultfd.c
+++ b/tools/testing/selftests/mm/userfaultfd.c
@@ -585,6 +585,8 @@ static void continue_range(int ufd, __u64 start, __u64 len)
 	req.range.start = start;
 	req.range.len = len;
 	req.mode = 0;
+	if (test_uffdio_wp)
+		req.mode |= UFFDIO_CONTINUE_MODE_WP;
 
 	if (ioctl(ufd, UFFDIO_CONTINUE, &req))
 		err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
@@ -1332,6 +1334,8 @@ static int userfaultfd_minor_test(void)
 	uffdio_register.range.start = (unsigned long)area_dst_alias;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR;
+	if (test_uffdio_wp)
+		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
 
-- 
cgit v1.2.3-70-g09d2
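
Illustration (not part of the patch): the flow this patch adds to userfaultfd_continue() can be modelled in plain userspace C. The sketch below validates the ioctl mode bits, translates UFFDIO_CONTINUE_MODE_WP into MFILL_ATOMIC_WP, and then stamps the CONTINUE mode into the same flags word, as mfill_atomic_continue() now does with the caller's flags instead of 0. Several things here are assumptions rather than kernel code: the bit layout of uffd_flags_t (mode in the low two bits, flag bits above it), the helper name continue_mode_to_flags (hypothetical), and the -EINVAL return for unknown mode bits (the patch only shows "goto out").

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* UAPI mode bits for UFFDIO_CONTINUE, values as shown in the patch. */
#define UFFDIO_CONTINUE_MODE_DONTWAKE	((uint64_t)1 << 0)
#define UFFDIO_CONTINUE_MODE_WP		((uint64_t)1 << 1)

/* ASSUMPTION: illustrative layout, low bits hold the mode, flags sit above. */
typedef unsigned int uffd_flags_t;

enum mfill_atomic_mode {
	MFILL_ATOMIC_COPY,
	MFILL_ATOMIC_ZEROPAGE,
	MFILL_ATOMIC_CONTINUE,
	NR_MFILL_ATOMIC_MODES,
};

#define MFILL_ATOMIC_MODE_BITS	2u
#define MFILL_ATOMIC_MODE_MASK	((uffd_flags_t)((1u << MFILL_ATOMIC_MODE_BITS) - 1))
#define MFILL_ATOMIC_WP		((uffd_flags_t)(1u << MFILL_ATOMIC_MODE_BITS))

/* True if the mode field of flags equals expected; flag bits are ignored. */
static inline bool uffd_flags_mode_is(uffd_flags_t flags,
				      enum mfill_atomic_mode expected)
{
	return (flags & MFILL_ATOMIC_MODE_MASK) == (uffd_flags_t)expected;
}

/* Replace the mode field while preserving flag bits such as MFILL_ATOMIC_WP. */
static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags,
					       enum mfill_atomic_mode mode)
{
	assert(mode < NR_MFILL_ATOMIC_MODES);
	flags &= ~MFILL_ATOMIC_MODE_MASK;
	return flags | (uffd_flags_t)mode;
}

/*
 * Hypothetical helper modelling the checks in userfaultfd_continue():
 * reject unknown mode bits, map MODE_WP to MFILL_ATOMIC_WP, then set
 * the CONTINUE mode. DONTWAKE is accepted but does not affect flags.
 */
static int continue_mode_to_flags(uint64_t mode, uffd_flags_t *out)
{
	uffd_flags_t flags = 0;

	if (mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE | UFFDIO_CONTINUE_MODE_WP))
		return -EINVAL;	/* ASSUMPTION: error code not shown in the patch */
	if (mode & UFFDIO_CONTINUE_MODE_WP)
		flags |= MFILL_ATOMIC_WP;

	*out = uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE);
	return 0;
}
```

Calling continue_mode_to_flags() with UFFDIO_CONTINUE_MODE_WP set yields a flags word for which uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) holds and MFILL_ATOMIC_WP is set. That combination is what the check at the top of this diff, (flags & MFILL_ATOMIC_WP) against VM_UFFD_WP, then inspects.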