diff options
author | Vivek Goyal <vgoyal@redhat.com> | 2020-08-19 18:19:54 -0400 |
---|---|---|
committer | Miklos Szeredi <mszeredi@redhat.com> | 2020-09-10 11:39:23 +0200 |
commit | 6ae330cad6ef22ab8347ea9e0707dc56a7c7363f (patch) | |
tree | 200aaccefa5b50b7155fea5a0e97d1805c9c634a /fs/fuse/dax.c | |
parent | 9483e7d5809ab41890298a6a1f5d23c4e10a2cfd (diff) |
virtiofs: serialize truncate/punch_hole and dax fault path
Currently in fuse we don't seem have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of
fault and want to serialize with truncate/punch_hole and they explicitly
mention it.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating file on host first and then removing mappings in
guest lateter (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am adding taking this semaphore only in dax fault path and
not regular fault path because existing code does not have one. May
be existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focussing
only on DAX path which is new path.
Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exlusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Diffstat (limited to 'fs/fuse/dax.c')
-rw-r--r-- | fs/fuse/dax.c | 49 |
1 files changed, 49 insertions, 0 deletions
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index c21305409191..efc1a8ef5639 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -495,6 +495,47 @@ static const struct iomap_ops fuse_iomap_ops = { .iomap_end = fuse_iomap_end, }; +static void fuse_wait_dax_page(struct inode *inode) +{ + struct fuse_inode *fi = get_fuse_inode(inode); + + up_write(&fi->i_mmap_sem); + schedule(); + down_write(&fi->i_mmap_sem); +} + +/* Should be called with fi->i_mmap_sem lock held exclusively */ +static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, + loff_t start, loff_t end) +{ + struct page *page; + + page = dax_layout_busy_page_range(inode->i_mapping, start, end); + if (!page) + return 0; + + *retry = true; + return ___wait_var_event(&page->_refcount, + atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, + 0, 0, fuse_wait_dax_page(inode)); +} + +/* dmap_end == 0 leads to unmapping of whole file */ +int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start, + u64 dmap_end) +{ + bool retry; + int ret; + + do { + retry = false; + ret = __fuse_dax_break_layouts(inode, &retry, dmap_start, + dmap_end); + } while (ret == 0 && retry); + + return ret; +} + ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to) { struct inode *inode = file_inode(iocb->ki_filp); @@ -596,10 +637,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf, if (write) sb_start_pagefault(sb); + /* + * We need to serialize against not only truncate but also against + * fuse dax memory range reclaim. While a range is being reclaimed, + * we do not want any read/write/mmap to make progress and try + * to populate page cache or access memory we are trying to free. + */ + down_read(&get_fuse_inode(inode)->i_mmap_sem); ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops); if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(vmf, pe_size, pfn); + up_read(&get_fuse_inode(inode)->i_mmap_sem); if (write) sb_end_pagefault(sb); |