From 95ec8daba310b44302d2977dd54b16886527b681 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 16 Feb 2015 15:59:09 -0800 Subject: dax: replace XIP documentation with DAX documentation Based on the original XIP documentation, this documents the current state of affairs, and includes instructions on how users can enable DAX if their devices and kernel support it. Signed-off-by: Matthew Wilcox Reviewed-by: Randy Dunlap Cc: Andreas Dilger Cc: Boaz Harrosh Cc: Christoph Hellwig Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Cc: Kirill A. Shutemov Cc: Mathieu Desnoyers Cc: Ross Zwisler Cc: Theodore Ts'o Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/00-INDEX | 5 ++- Documentation/filesystems/dax.txt | 89 ++++++++++++++++++++++++++++++++++++++ Documentation/filesystems/xip.txt | 71 ------------------------------ 3 files changed, 92 insertions(+), 73 deletions(-) create mode 100644 Documentation/filesystems/dax.txt delete mode 100644 Documentation/filesystems/xip.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index ac28149aede4..9922939e7d99 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -34,6 +34,9 @@ configfs/ - directory containing configfs documentation and example code. cramfs.txt - info on the cram filesystem for small storage (ROMs etc). +dax.txt + - info on avoiding the page cache for files stored on CPU-addressable + storage devices. debugfs.txt - info on the debugfs filesystem. devpts.txt @@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt - info on XFS Self Describing Metadata. xfs.txt - info and mount options for the XFS filesystem. -xip.txt - - info on execute-in-place for file mappings. diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt new file mode 100644 index 000000000000..635adaa1e425 --- /dev/null +++ b/Documentation/filesystems/dax.txt @@ -0,0 +1,89 @@ +Direct Access for files +----------------------- + +Motivation +---------- + +The page cache is usually used to buffer reads and writes to files. +It is also used to provide the pages which are mapped into userspace +by a call to mmap. + +For block devices that are memory-like, the page cache pages would be +unnecessary copies of the original storage. The DAX code removes the +extra copy by performing reads and writes directly to the storage device. +For file mappings, the storage device is mapped directly into userspace. + + +Usage +----- + +If you have a block device which supports DAX, you can make a filesystem +on it as usual. When mounting it, use the -o dax option manually +or add 'dax' to the options in /etc/fstab. + + +Implementation Tips for Block Driver Writers +-------------------------------------------- + +To support DAX in your block driver, implement the 'direct_access' +block device operation. It is used to translate the sector number +(expressed in units of 512-byte sectors) to a page frame number (pfn) +that identifies the physical page for the memory. It also returns a +kernel virtual address that can be used to access the memory. + +The direct_access method takes a 'size' parameter that indicates the +number of bytes being requested. The function should return the number +of bytes that can be contiguously accessed at that offset. It may also +return a negative errno if an error occurs. + +In order to support this method, the storage must be byte-accessible by +the CPU at all times. If your device uses paging techniques to expose +a large amount of memory through a smaller window, then you cannot +implement direct_access. Equally, if your device can occasionally +stall the CPU for an extended period, you should also not attempt to +implement direct_access. + +These block devices may be used for inspiration: +- axonram: Axon DDR2 device driver +- brd: RAM backed block device driver +- dcssblk: s390 dcss block device driver + + +Implementation Tips for Filesystem Writers +------------------------------------------ + +Filesystem support consists of +- adding support to mark inodes as being DAX by setting the S_DAX flag in + i_flags +- implementing the direct_IO address space operation, and calling + dax_do_io() instead of blockdev_direct_IO() if S_DAX is set +- implementing an mmap file operation for DAX files which sets the + VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers + for fault and page_mkwrite (which should probably call dax_fault() and + dax_mkwrite(), passing the appropriate get_block() callback) +- calling dax_truncate_page() instead of block_truncate_page() for DAX files +- ensuring that there is sufficient locking between reads, writes, + truncates and page faults + +The get_block() callback passed to the DAX functions may return +uninitialised extents. If it does, it must ensure that simultaneous +calls to get_block() (for example by a page-fault racing with a read() +or a write()) work correctly. + +These filesystems may be used for inspiration: +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt + + +Shortcomings +------------ + +Even if the kernel or its modules are stored on a filesystem that supports +DAX on a block device that supports DAX, they will still be copied into RAM. + +Calling get_user_pages() on a range of user memory that has been mmaped +from a DAX file will fail as there are no 'struct page' to describe +those pages. This problem is being worked on. That means that O_DIRECT +reads/writes to those memory ranges from a non-DAX file will fail (note +that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory +that is being accessed that is key here). Other things that will not +work include RDMA, sendfile() and splice(). diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt deleted file mode 100644 index b77472949ede..000000000000 --- a/Documentation/filesystems/xip.txt +++ /dev/null @@ -1,71 +0,0 @@ -Execute-in-place for file mappings ----------------------------------- - -Motivation ----------- -File mappings are performed by mapping page cache pages to userspace. In -addition, read&write type file operations also transfer data from/to the page -cache. - -For memory backed storage devices that use the block device interface, the page -cache pages are in fact copies of the original storage. Various approaches -exist to work around the need for an extra copy. The ramdisk driver for example -does read the data into the page cache, keeps a reference, and discards the -original data behind later on. - -Execute-in-place solves this issue the other way around: instead of keeping -data in the page cache, the need to have a page cache copy is eliminated -completely. With execute-in-place, read&write type operations are performed -directly from/to the memory backed storage device. For file mappings, the -storage device itself is mapped directly into userspace. - -This implementation was initially written for shared memory segments between -different virtual machines on s390 hardware to allow multiple machines to -share the same binaries and libraries. - -Implementation --------------- -Execute-in-place is implemented in three steps: block device operation, -address space operation, and file operations. - -A block device operation named direct_access is used to translate the -block device sector number to a page frame number (pfn) that identifies -the physical page for the memory. It also returns a kernel virtual -address that can be used to access the memory. - -The direct_access method takes a 'size' parameter that indicates the -number of bytes being requested. The function should return the number -of bytes that can be contiguously accessed at that offset. It may also -return a negative errno if an error occurs. - -The block device operation is optional, these block devices support it as of -today: -- dcssblk: s390 dcss block device driver - -An address space operation named get_xip_mem is used to retrieve references -to a page frame number and a kernel address. To obtain these values a reference -to an address_space is provided. This function assigns values to the kmem and -pfn parameters. The third argument indicates whether the function should allocate -blocks if needed. - -This address space operation is mutually exclusive with readpage&writepage that -do page cache read/write operations. -The following filesystems support it as of today: -- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt - -A set of file operations that do utilize get_xip_page can be found in -mm/filemap_xip.c . The following file operation implementations are provided: -- aio_read/aio_write -- readv/writev -- sendfile - -The generic file operations do_sync_read/do_sync_write can be used to implement -classic synchronous IO calls. - -Shortcomings ------------- -This implementation is limited to storage devices that are cpu addressable at -all times (no highmem or such). It works well on rom/ram, but enhancements are -needed to make it work with flash in read+write mode. -Putting the Linux kernel and/or its modules on a xip filesystem does not mean -they are not copied. -- cgit v1.2.3-70-g09d2 From e748dcd095ddee50e7a7deda2e26247715318a2e Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 16 Feb 2015 15:59:12 -0800 Subject: vfs: remove get_xip_mem All callers of get_xip_mem() are now gone. Remove checks for it, initialisers of it, documentation of it and the only implementation of it. Also remove mm/filemap_xip.c as it is now empty. Also remove documentation of the long-gone get_xip_page(). Signed-off-by: Matthew Wilcox Cc: Andreas Dilger Cc: Boaz Harrosh Cc: Christoph Hellwig Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Cc: Kirill A. Shutemov Cc: Mathieu Desnoyers Cc: Randy Dunlap Cc: Ross Zwisler Cc: Theodore Ts'o Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/Locking | 3 --- Documentation/filesystems/vfs.txt | 7 ------ fs/exofs/inode.c | 1 - fs/ext2/inode.c | 1 - fs/ext2/xip.c | 45 --------------------------------------- fs/ext2/xip.h | 3 --- fs/open.c | 5 +---- include/linux/fs.h | 2 -- include/linux/rmap.h | 2 +- mm/Makefile | 1 - mm/fadvise.c | 6 ++++-- mm/filemap_xip.c | 24 --------------------- mm/madvise.c | 2 +- 13 files changed, 7 insertions(+), 95 deletions(-) delete mode 100644 mm/filemap_xip.c (limited to 'Documentation') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index b30753cbf431..2ca3d17eee56 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -199,8 +199,6 @@ prototypes: int (*releasepage) (struct page *, int); void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, - unsigned long *); int (*migratepage)(struct address_space *, struct page *, struct page *); int (*launder_page)(struct page *); int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long); @@ -225,7 +223,6 @@ invalidatepage: yes releasepage: yes freepage: yes direct_IO: -get_xip_mem: maybe migratepage: yes (both) launder_page: yes is_partially_uptodate: yes diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 43ce0507ee25..966b22829f3b 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -591,8 +591,6 @@ struct address_space_operations { int (*releasepage) (struct page *, int); void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); - struct page* (*get_xip_page)(struct address_space *, sector_t, - int); /* migrate the contents of a page to the specified target */ int (*migratepage) (struct page *, struct page *); int (*launder_page) (struct page *); @@ -748,11 +746,6 @@ struct address_space_operations { and transfer data directly between the storage and the application's address space. - get_xip_page: called by the VM to translate a block number to a page. - The page is valid until the corresponding filesystem is unmounted. - Filesystems that want to use execute-in-place (XIP) need to implement - it. An example implementation can be found in fs/ext2/xip.c. - migrate_page: This is used to compact the physical memory usage. If the VM wants to relocate a page (maybe off a memory card that is signalling imminent failure) it will pass a new page diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c index 6fc91df99ff8..a198e94813fe 100644 --- a/fs/exofs/inode.c +++ b/fs/exofs/inode.c @@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = { .direct_IO = exofs_direct_IO, /* With these NULL has special meaning or default is not exported */ - .get_xip_mem = NULL, .migratepage = NULL, .launder_page = NULL, .is_partially_uptodate = NULL, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 5ac0a3461b3c..59d6c7d43740 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = { const struct address_space_operations ext2_aops_xip = { .bmap = ext2_bmap, - .get_xip_mem = ext2_get_xip_mem, .direct_IO = ext2_direct_IO, }; diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index 8cfca3a4cd58..132d4daf98c4 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,35 +13,6 @@ #include "ext2.h" #include "xip.h" -static inline long __inode_direct_access(struct inode *inode, sector_t block, - void **kaddr, unsigned long *pfn, long size) -{ - struct block_device *bdev = inode->i_sb->s_bdev; - sector_t sector = block * (PAGE_SIZE / 512); - return bdev_direct_access(bdev, sector, kaddr, pfn, size); -} - -static inline int -__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create, - sector_t *result) -{ - struct buffer_head tmp; - int rc; - - memset(&tmp, 0, sizeof(struct buffer_head)); - tmp.b_size = 1 << inode->i_blkbits; - rc = ext2_get_block(inode, pgoff, &tmp, create); - *result = tmp.b_blocknr; - - /* did we get a sparse block (hole in the file)? */ - if (!tmp.b_blocknr && !rc) { - BUG_ON(create); - rc = -ENODATA; - } - - return rc; -} - void ext2_xip_verify_sb(struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); @@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb) "not supported by bdev"); } } - -int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, - void **kmem, unsigned long *pfn) -{ - long rc; - sector_t block; - - /* first, retrieve the sector number */ - rc = __ext2_get_block(mapping->host, pgoff, create, &block); - if (rc) - return rc; - - /* retrieve address of the target data */ - rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); - return (rc < 0) ? rc : 0; -} diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index b2592f2f3c9d..e7b9f0a2cc54 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb) struct ext2_sb_info *sbi = EXT2_SB(sb); return (sbi->s_mount_opt & EXT2_MOUNT_XIP); } -int ext2_get_xip_mem(struct address_space *, pgoff_t, int, - void **, unsigned long *); #else #define ext2_xip_verify_sb(sb) do { } while (0) #define ext2_use_xip(sb) 0 -#define ext2_get_xip_mem NULL #endif diff --git a/fs/open.c b/fs/open.c index 813be037b412..a293c2020676 100644 --- a/fs/open.c +++ b/fs/open.c @@ -667,11 +667,8 @@ int open_check_o_direct(struct file *f) { /* NB: we're sure to have correct a_ops only after f_op->open */ if (f->f_flags & O_DIRECT) { - if (!f->f_mapping->a_ops || - ((!f->f_mapping->a_ops->direct_IO) && - (!f->f_mapping->a_ops->get_xip_mem))) { + if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) return -EINVAL; - } } return 0; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 2c8f9055af38..9772d655f444 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -362,8 +362,6 @@ struct address_space_operations { int (*releasepage) (struct page *, gfp_t); void (*freepage)(struct page *); ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, - void **, unsigned long *); /* * migrate the contents of a page to the specified target. If * migrate_mode is MIGRATE_ASYNC, it must not block. diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b38f559130d5..c4c559a45dc8 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -198,7 +198,7 @@ int page_referenced(struct page *, int is_locked, int try_to_unmap(struct page *, enum ttu_flags flags); /* - * Called from mm/filemap_xip.c to unmap empty zero page + * Used by uprobes to replace a userspace page safely */ pte_t *__page_check_address(struct page *, struct mm_struct *, unsigned long, spinlock_t **, int); diff --git a/mm/Makefile b/mm/Makefile index 088c68e9ec35..3c1caa2693bd 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -55,7 +55,6 @@ obj-$(CONFIG_KMEMCHECK) += kmemcheck.o obj-$(CONFIG_KASAN) += kasan/ obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o -obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o diff --git a/mm/fadvise.c b/mm/fadvise.c index fac23ecf8d72..4a3907cf79f8 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -28,6 +28,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) { struct fd f = fdget(fd); + struct inode *inode; struct address_space *mapping; struct backing_dev_info *bdi; loff_t endbyte; /* inclusive */ @@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) if (!f.file) return -EBADF; - if (S_ISFIFO(file_inode(f.file)->i_mode)) { + inode = file_inode(f.file); + if (S_ISFIFO(inode->i_mode)) { ret = -ESPIPE; goto out; } @@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) goto out; } - if (mapping->a_ops->get_xip_mem) { + if (IS_DAX(inode)) { switch (advice) { case POSIX_FADV_NORMAL: case POSIX_FADV_RANDOM: diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c deleted file mode 100644 index 8e3f99b61959..000000000000 --- a/mm/filemap_xip.c +++ /dev/null @@ -1,24 +0,0 @@ -/* - * linux/mm/filemap_xip.c - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte - * - * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds - * - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - diff --git a/mm/madvise.c b/mm/madvise.c index 1077cbdc8b52..d551475517bf 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -239,7 +239,7 @@ static long madvise_willneed(struct vm_area_struct *vma, return -EBADF; #endif - if (file->f_mapping->a_ops->get_xip_mem) { + if (IS_DAX(file_inode(file))) { /* no bad return value, but ignore advice */ return 0; } -- cgit v1.2.3-70-g09d2 From 9c3ce9ec58716733232b97771b10f31901caf62e Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 16 Feb 2015 15:59:31 -0800 Subject: ext2: get rid of most mentions of XIP in ext2 To help people transition, accept the 'xip' mount option (and report it in /proc/mounts), but print a message encouraging people to switch over to the 'dax' option. Signed-off-by: Matthew Wilcox Reviewed-by: Mathieu Desnoyers Cc: Andreas Dilger Cc: Boaz Harrosh Cc: Christoph Hellwig Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Cc: Kirill A. Shutemov Cc: Randy Dunlap Cc: Ross Zwisler Cc: Theodore Ts'o Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/ext2.txt | 5 +++-- fs/ext2/ext2.h | 13 +++++++------ fs/ext2/file.c | 2 +- fs/ext2/inode.c | 6 +++--- fs/ext2/namei.c | 8 ++++---- fs/ext2/super.c | 25 ++++++++++++++++--------- 6 files changed, 34 insertions(+), 25 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 67639f905f10..b9714569e472 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -20,6 +20,9 @@ minixdf Makes `df' act like Minix. check=none, nocheck (*) Don't do extra checking of bitmaps on mount (check=normal and check=strict options removed) +dax Use direct access (no page cache). See + Documentation/filesystems/dax.txt. + debug Extra debugging information is sent to the kernel syslog. Useful for developers. @@ -56,8 +59,6 @@ noacl Don't support POSIX ACLs. nobh Do not attach buffer_heads to file pagecache. -xip Use execute in place (no caching) if possible - grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2. diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index ab9b3ec3bac9..678f9ab08c48 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,14 +380,15 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ -#ifdef CONFIG_FS_DAX -#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */ -#else -#define EXT2_MOUNT_XIP 0 -#endif +#define EXT2_MOUNT_XIP 0x010000 /* Obsolete, use DAX */ #define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */ #define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */ #define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */ +#ifdef CONFIG_FS_DAX +#define EXT2_MOUNT_DAX 0x100000 /* Direct Access */ +#else +#define EXT2_MOUNT_DAX 0 +#endif #define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt @@ -792,7 +793,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync); extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; -extern const struct file_operations ext2_xip_file_operations; +extern const struct file_operations ext2_dax_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index de8174d1e973..e31701713516 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = { }; #ifdef CONFIG_FS_DAX -const struct file_operations ext2_xip_file_operations = { +const struct file_operations ext2_dax_file_operations = { .llseek = generic_file_llseek, .read = new_sync_read, .write = new_sync_write, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 034fd42eade0..6434bc000125 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & EXT2_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; - if (test_opt(inode->i_sb, XIP)) + if (test_opt(inode->i_sb, DAX)) inode->i_flags |= S_DAX; } @@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_xip_file_operations; + inode->i_fop = &ext2_dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 0db888c91bec..148f6e3789ea 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_xip_file_operations; + inode->i_fop = &ext2_dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; @@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return PTR_ERR(inode); inode->i_op = &ext2_file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_xip_file_operations; + inode->i_fop = &ext2_dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; inode->i_fop = &ext2_file_operations; diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 5f029d8c3a02..d0e746e96511 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -294,6 +294,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) #ifdef CONFIG_FS_DAX if (sbi->s_mount_opt & EXT2_MOUNT_XIP) seq_puts(seq, ",xip"); + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) + seq_puts(seq, ",dax"); #endif if (!test_opt(sb, RESERVATION)) @@ -402,7 +404,7 @@ enum { Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr, - Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota, + Opt_acl, Opt_noacl, Opt_xip, Opt_dax, Opt_ignore, Opt_err, Opt_quota, Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation }; @@ -431,6 +433,7 @@ static const match_table_t tokens = { {Opt_acl, "acl"}, {Opt_noacl, "noacl"}, {Opt_xip, "xip"}, + {Opt_dax, "dax"}, {Opt_grpquota, "grpquota"}, {Opt_ignore, "noquota"}, {Opt_quota, "quota"}, @@ -558,10 +561,14 @@ static int parse_options(char *options, struct super_block *sb) break; #endif case Opt_xip: + ext2_msg(sb, KERN_INFO, "use dax instead of xip"); + set_opt(sbi->s_mount_opt, XIP); + /* Fall through */ + case Opt_dax: #ifdef CONFIG_FS_DAX - set_opt (sbi->s_mount_opt, XIP); + set_opt(sbi->s_mount_opt, DAX); #else - ext2_msg(sb, KERN_INFO, "xip option not supported"); + ext2_msg(sb, KERN_INFO, "dax option not supported"); #endif break; @@ -905,15 +912,15 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); - if (sbi->s_mount_opt & EXT2_MOUNT_XIP) { + if (sbi->s_mount_opt & EXT2_MOUNT_DAX) { if (blocksize != PAGE_SIZE) { ext2_msg(sb, KERN_ERR, - "error: unsupported blocksize for xip"); + "error: unsupported blocksize for dax"); goto failed_mount; } if (!sb->s_bdev->bd_disk->fops->direct_access) { ext2_msg(sb, KERN_ERR, - "error: device does not support xip"); + "error: device does not support dax"); goto failed_mount; } } @@ -1286,10 +1293,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data) ((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0); es = sbi->s_es; - if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) { + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) { ext2_msg(sb, KERN_WARNING, "warning: refusing change of " - "xip flag with busy inodes while remounting"); - sbi->s_mount_opt ^= EXT2_MOUNT_XIP; + "dax flag with busy inodes while remounting"); + sbi->s_mount_opt ^= EXT2_MOUNT_DAX; } if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) { spin_unlock(&sbi->s_lock); -- cgit v1.2.3-70-g09d2 From 25726bc15731d42112b579cf73f30edbc43d3973 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 16 Feb 2015 15:59:35 -0800 Subject: dax: add dax_zero_page_range This new function allows us to support hole-punch for DAX files by zeroing a partial page, as opposed to the dax_truncate_page() function which can only truncate to the end of the page. Reimplement dax_truncate_page() to call dax_zero_page_range(). [ross.zwisler@linux.intel.com: ported to 3.13-rc2] [akpm@linux-foundation.org: fix typos in comments] Signed-off-by: Matthew Wilcox Signed-off-by: Ross Zwisler Cc: Andreas Dilger Cc: Boaz Harrosh Cc: Christoph Hellwig Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Cc: Kirill A. Shutemov Cc: Mathieu Desnoyers Cc: Randy Dunlap Cc: Theodore Ts'o Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/dax.txt | 1 + fs/dax.c | 38 ++++++++++++++++++++++++++++++++------ include/linux/fs.h | 1 + 3 files changed, 34 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index 635adaa1e425..ebcd97f03654 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -62,6 +62,7 @@ Filesystem support consists of for fault and page_mkwrite (which should probably call dax_fault() and dax_mkwrite(), passing the appropriate get_block() callback) - calling dax_truncate_page() instead of block_truncate_page() for DAX files +- calling dax_zero_page_range() instead of zero_user() for DAX files - ensuring that there is sufficient locking between reads, writes, truncates and page faults diff --git a/fs/dax.c b/fs/dax.c index ebf9b2231b0f..ed1619ec6537 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -464,31 +464,35 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, EXPORT_SYMBOL_GPL(dax_fault); /** - * dax_truncate_page - handle a partial page being truncated in a DAX file + * dax_zero_page_range - zero a range within a page of a DAX file * @inode: The file being truncated * @from: The file offset that is being truncated to + * @length: The number of bytes to zero * @get_block: The filesystem method used to translate file offsets to blocks * - * Similar to block_truncate_page(), this function can be called by a - * filesystem when it is truncating an DAX file to handle the partial page. + * This function can be called by a filesystem when it is zeroing part of a + * page in a DAX file. This is intended for hole-punch operations. If + * you are truncating a file, the helper function dax_truncate_page() may be + * more convenient. * * We work in terms of PAGE_CACHE_SIZE here for commonality with * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem * took care of disposing of the unnecessary blocks. Even if the filesystem * block size is smaller than PAGE_SIZE, we have to zero the rest of the page - * since the file might be mmaped. + * since the file might be mmapped. */ -int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) +int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length, + get_block_t get_block) { struct buffer_head bh; pgoff_t index = from >> PAGE_CACHE_SHIFT; unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned length = PAGE_CACHE_ALIGN(from) - from; int err; /* Block boundary? Nothing to do */ if (!length) return 0; + BUG_ON((offset + length) > PAGE_CACHE_SIZE); memset(&bh, 0, sizeof(bh)); bh.b_size = PAGE_CACHE_SIZE; @@ -505,4 +509,26 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) return 0; } +EXPORT_SYMBOL_GPL(dax_zero_page_range); + +/** + * dax_truncate_page - handle a partial page being truncated in a DAX file + * @inode: The file being truncated + * @from: The file offset that is being truncated to + * @get_block: The filesystem method used to translate file offsets to blocks + * + * Similar to block_truncate_page(), this function can be called by a + * filesystem when it is truncating a DAX file to handle the partial page. + * + * We work in terms of PAGE_CACHE_SIZE here for commonality with + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem + * took care of disposing of the unnecessary blocks. Even if the filesystem + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page + * since the file might be mmapped. + */ +int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) +{ + unsigned length = PAGE_CACHE_ALIGN(from) - from; + return dax_zero_page_range(inode, from, length, get_block); +} EXPORT_SYMBOL_GPL(dax_truncate_page); diff --git a/include/linux/fs.h b/include/linux/fs.h index d46f8fe6a0ea..ed5a0900b94d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2589,6 +2589,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); int dax_clear_blocks(struct inode *, sector_t block, long size); +int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) -- cgit v1.2.3-70-g09d2 From 923ae0ff9250430133b3310fe62c47538cf1cbc1 Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Mon, 16 Feb 2015 15:59:38 -0800 Subject: ext4: add DAX functionality This is a port of the DAX functionality found in the current version of ext2. [matthew.r.wilcox@intel.com: heavily tweaked] [akpm@linux-foundation.org: remap_pages went away] Signed-off-by: Ross Zwisler Reviewed-by: Andreas Dilger Signed-off-by: Matthew Wilcox Cc: Boaz Harrosh Cc: Christoph Hellwig Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Cc: Kirill A. Shutemov Cc: Mathieu Desnoyers Cc: Randy Dunlap Cc: Theodore Ts'o Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/dax.txt | 1 + Documentation/filesystems/ext4.txt | 4 ++ fs/ext4/ext4.h | 6 +++ fs/ext4/file.c | 49 ++++++++++++++++++++- fs/ext4/indirect.c | 18 +++++--- fs/ext4/inode.c | 89 ++++++++++++++++++++++++++------------ fs/ext4/namei.c | 10 ++++- fs/ext4/super.c | 39 ++++++++++++++++- 8 files changed, 179 insertions(+), 37 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index ebcd97f03654..be376d91d058 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -73,6 +73,7 @@ or a write()) work correctly. These filesystems may be used for inspiration: - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt +- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt Shortcomings diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 919a3293aaa4..6c0108eb0137 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -386,6 +386,10 @@ max_dir_size_kb=n This limits the size of directories so that any i_version Enable 64-bit inode version support. This option is off by default. +dax Use direct access (no page cache). See + Documentation/filesystems/dax.txt. Note that + this option is incompatible with data=journal. + Data Mode ========= There are 3 different data modes: diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index a75fba67bb1f..982d934fd9ac 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -965,6 +965,11 @@ struct ext4_inode_info { #define EXT4_MOUNT_ERRORS_MASK 0x00070 #define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */ #define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ +#ifdef CONFIG_FS_DAX +#define EXT4_MOUNT_DAX 0x00200 /* Direct Access */ +#else +#define EXT4_MOUNT_DAX 0 +#endif #define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ #define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */ #define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */ @@ -2578,6 +2583,7 @@ extern const struct file_operations ext4_dir_operations; /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; +extern const struct file_operations ext4_dax_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); /* inline.c */ diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 7cb592386121..33a09da16c9c 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) struct inode *inode = file_inode(iocb->ki_filp); struct mutex *aio_mutex = NULL; struct blk_plug plug; - int o_direct = file->f_flags & O_DIRECT; + int o_direct = io_is_direct(file); int overwrite = 0; size_t length = iov_iter_count(from); ssize_t ret; @@ -191,6 +191,26 @@ errout: return ret; } +#ifdef CONFIG_FS_DAX +static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_fault(vma, vmf, ext4_get_block); + /* Is this the right get_block? */ +} + +static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_mkwrite(vma, vmf, ext4_get_block); +} + +static const struct vm_operations_struct ext4_dax_vm_ops = { + .fault = ext4_dax_fault, + .page_mkwrite = ext4_dax_mkwrite, +}; +#else +#define ext4_dax_vm_ops ext4_file_vm_ops +#endif + static const struct vm_operations_struct ext4_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, @@ -200,7 +220,12 @@ static const struct vm_operations_struct ext4_file_vm_ops = { static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) { file_accessed(file); - vma->vm_ops = &ext4_file_vm_ops; + if (IS_DAX(file_inode(file))) { + vma->vm_ops = &ext4_dax_vm_ops; + vma->vm_flags |= VM_MIXEDMAP; + } else { + vma->vm_ops = &ext4_file_vm_ops; + } return 0; } @@ -599,6 +624,26 @@ const struct file_operations ext4_file_operations = { .fallocate = ext4_fallocate, }; +#ifdef CONFIG_FS_DAX +const struct file_operations ext4_dax_file_operations = { + .llseek = ext4_llseek, + .read = new_sync_read, + .write = new_sync_write, + .read_iter = generic_file_read_iter, + .write_iter = ext4_file_write_iter, + .unlocked_ioctl = ext4_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ext4_compat_ioctl, +#endif + .mmap = ext4_file_mmap, + .open = ext4_file_open, + .release = ext4_release_file, + .fsync = ext4_sync_file, + /* Splice not yet supported with DAX */ + .fallocate = ext4_fallocate, +}; +#endif + const struct inode_operations ext4_file_inode_operations = { .setattr = ext4_setattr, .getattr = ext4_getattr, diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 36b369697a13..6b9878a24182 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -689,14 +689,22 @@ retry: inode_dio_done(inode); goto locked; } - ret = __blockdev_direct_IO(rw, iocb, inode, - inode->i_sb->s_bdev, iter, offset, - ext4_get_block, NULL, NULL, 0); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iter, offset, + ext4_get_block, NULL, 0); + else + ret = __blockdev_direct_IO(rw, iocb, inode, + inode->i_sb->s_bdev, iter, offset, + ext4_get_block, NULL, NULL, 0); inode_dio_done(inode); } else { locked: - ret = blockdev_direct_IO(rw, iocb, inode, iter, - offset, ext4_get_block); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iter, offset, + ext4_get_block, NULL, DIO_LOCKING); + else + ret = blockdev_direct_IO(rw, iocb, inode, iter, + offset, ext4_get_block); if (unlikely((rw & WRITE) && ret < 0)) { loff_t isize = i_size_read(inode); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5653fa42930b..28555f191b62 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -657,6 +657,18 @@ has_zeroout: return retval; } +static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate) +{ + struct inode *inode = bh->b_assoc_map->host; + /* XXX: breaks on 32-bit > 16GB. Is that even supported? */ + loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits; + int err; + if (!uptodate) + return; + WARN_ON(!buffer_unwritten(bh)); + err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size); +} + /* Maximum number of blocks we map for direct IO at once. */ #define DIO_MAX_BLOCKS 4096 @@ -694,6 +706,11 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock, map_bh(bh, inode->i_sb, map.m_pblk); bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags; + if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) { + bh->b_assoc_map = inode->i_mapping; + bh->b_private = (void *)(unsigned long)iblock; + bh->b_end_io = ext4_end_io_unwritten; + } if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN) set_buffer_defer_completion(bh); bh->b_size = inode->i_sb->s_blocksize * map.m_len; @@ -3010,13 +3027,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb, get_block_func = ext4_get_block_write; dio_flags = DIO_LOCKING; } - ret = __blockdev_direct_IO(rw, iocb, inode, - inode->i_sb->s_bdev, iter, - offset, - get_block_func, - ext4_end_io_dio, - NULL, - dio_flags); + if (IS_DAX(inode)) + ret = dax_do_io(rw, iocb, inode, iter, offset, get_block_func, + ext4_end_io_dio, dio_flags); + else + ret = __blockdev_direct_IO(rw, iocb, inode, + inode->i_sb->s_bdev, iter, offset, + get_block_func, + ext4_end_io_dio, NULL, dio_flags); /* * Put our reference to io_end. This can free the io_end structure e.g. @@ -3180,19 +3198,12 @@ void ext4_set_aops(struct inode *inode) inode->i_mapping->a_ops = &ext4_aops; } -/* - * ext4_block_zero_page_range() zeros out a mapping of length 'length' - * starting from file offset 'from'. The range to be zero'd must - * be contained with in one block. If the specified range exceeds - * the end of the block it will be shortened to end of the block - * that cooresponds to 'from' - */ -static int ext4_block_zero_page_range(handle_t *handle, +static int __ext4_block_zero_page_range(handle_t *handle, struct address_space *mapping, loff_t from, loff_t length) { ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT; unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned blocksize, max, pos; + unsigned blocksize, pos; ext4_lblk_t iblock; struct inode *inode = mapping->host; struct buffer_head *bh; @@ -3205,14 +3216,6 @@ static int ext4_block_zero_page_range(handle_t *handle, return -ENOMEM; blocksize = inode->i_sb->s_blocksize; - max = blocksize - (offset & (blocksize - 1)); - - /* - * correct length if it does not fall between - * 'from' and the end of the block - */ - if (length > max || length < 0) - length = max; iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits); @@ -3277,6 +3280,33 @@ unlock: return err; } +/* + * ext4_block_zero_page_range() zeros out a mapping of length 'length' + * starting from file offset 'from'. The range to be zero'd must + * be contained with in one block. If the specified range exceeds + * the end of the block it will be shortened to end of the block + * that cooresponds to 'from' + */ +static int ext4_block_zero_page_range(handle_t *handle, + struct address_space *mapping, loff_t from, loff_t length) +{ + struct inode *inode = mapping->host; + unsigned offset = from & (PAGE_CACHE_SIZE-1); + unsigned blocksize = inode->i_sb->s_blocksize; + unsigned max = blocksize - (offset & (blocksize - 1)); + + /* + * correct length if it does not fall between + * 'from' and the end of the block + */ + if (length > max || length < 0) + length = max; + + if (IS_DAX(inode)) + return dax_zero_page_range(inode, from, length, ext4_get_block); + return __ext4_block_zero_page_range(handle, mapping, from, length); +} + /* * ext4_block_truncate_page() zeroes out a mapping from file offset `from' * up to the end of the block which corresponds to `from'. @@ -3798,8 +3828,10 @@ void ext4_set_inode_flags(struct inode *inode) new_fl |= S_NOATIME; if (flags & EXT4_DIRSYNC_FL) new_fl |= S_DIRSYNC; + if (test_opt(inode->i_sb, DAX)) + new_fl |= S_DAX; inode_set_flags(inode, new_fl, - S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); + S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX); } /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */ @@ -4052,7 +4084,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &ext4_dir_inode_operations; @@ -4534,7 +4569,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) * Truncate pagecache after we've waited for commit * in data=journal mode to make pages freeable. */ - truncate_pagecache(inode, inode->i_size); + truncate_pagecache(inode, inode->i_size); } /* * We want to call ext4_truncate() even if attr->ia_size == diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 2291923dae4e..28fe71a2904c 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2235,7 +2235,10 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); err = ext4_add_nondir(handle, dentry, inode); if (!err && IS_DIRSYNC(dir)) @@ -2299,7 +2302,10 @@ retry: err = PTR_ERR(inode); if (!IS_ERR(inode)) { inode->i_op = &ext4_file_inode_operations; - inode->i_fop = &ext4_file_operations; + if (test_opt(inode->i_sb, DAX)) + inode->i_fop = &ext4_dax_file_operations; + else + inode->i_fop = &ext4_file_operations; ext4_set_aops(inode); d_tmpfile(dentry, inode); err = ext4_orphan_add(handle, inode); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 64c39c7c594f..10e8c6b7ca08 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1124,7 +1124,7 @@ enum { Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota, Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err, - Opt_usrquota, Opt_grpquota, Opt_i_version, + Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax, Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit, Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity, Opt_inode_readahead_blks, Opt_journal_ioprio, @@ -1187,6 +1187,7 @@ static const match_table_t tokens = { {Opt_barrier, "barrier"}, {Opt_nobarrier, "nobarrier"}, {Opt_i_version, "i_version"}, + {Opt_dax, "dax"}, {Opt_stripe, "stripe=%u"}, {Opt_delalloc, "delalloc"}, {Opt_nodelalloc, "nodelalloc"}, @@ -1371,6 +1372,7 @@ static const struct mount_opts { {Opt_min_batch_time, 0, MOPT_GTE0}, {Opt_inode_readahead_blks, 0, MOPT_GTE0}, {Opt_init_itable, 0, MOPT_GTE0}, + {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET}, {Opt_stripe, 0, MOPT_GTE0}, {Opt_resuid, 0, MOPT_GTE0}, {Opt_resgid, 0, MOPT_GTE0}, @@ -1606,6 +1608,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token, return -1; } sbi->s_jquota_fmt = m->mount_opt; +#endif +#ifndef CONFIG_FS_DAX + } else if (token == Opt_dax) { + ext4_msg(sb, KERN_INFO, "dax option not supported"); + return -1; #endif } else { if (!args->from) @@ -3589,6 +3596,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) "both data=journal and dioread_nolock"); goto failed_mount; } + if (test_opt(sb, DAX)) { + ext4_msg(sb, KERN_ERR, "can't mount with " + "both data=journal and dax"); + goto failed_mount; + } if (test_opt(sb, DELALLOC)) clear_opt(sb, DELALLOC); } @@ -3652,6 +3664,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) goto failed_mount; } + if (sbi->s_mount_opt & EXT4_MOUNT_DAX) { + if (blocksize != PAGE_SIZE) { + ext4_msg(sb, KERN_ERR, + "error: unsupported blocksize for dax"); + goto failed_mount; + } + if (!sb->s_bdev->bd_disk->fops->direct_access) { + ext4_msg(sb, KERN_ERR, + "error: device does not support dax"); + goto failed_mount; + } + } + if (sb->s_blocksize != blocksize) { /* Validate the filesystem blocksize */ if (!sb_set_blocksize(sb, blocksize)) { @@ -4869,6 +4894,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) err = -EINVAL; goto restore_opts; } + if (test_opt(sb, DAX)) { + ext4_msg(sb, KERN_ERR, "can't mount with " + "both data=journal and dax"); + err = -EINVAL; + goto restore_opts; + } + } + + if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) { + ext4_msg(sb, KERN_WARNING, "warning: refusing change of " + "dax flag with busy inodes while remounting"); + sbi->s_mount_opt ^= EXT4_MOUNT_DAX; } if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED) -- cgit v1.2.3-70-g09d2 From d92576f1167cacf7844e5993f343eed4a6d8a147 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 16 Feb 2015 15:59:44 -0800 Subject: dax: does not work correctly with virtual aliasing caches The DAX code accesses the underlying storage through the kernel's linear mapping, which may not be cache-coherent with user mappings on ARM, MIPS or SPARC. Temporarily disable the DAX code until this problem is resolved. The original XIP code also had this problem, but it was never noticed. Signed-off-by: Matthew Wilcox Cc: Andreas Dilger Cc: Boaz Harrosh Cc: Christoph Hellwig Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Cc: Kirill A. Shutemov Cc: Mathieu Desnoyers Cc: Randy Dunlap Cc: Ross Zwisler Cc: Theodore Ts'o Cc: Ralf Baechle Cc: Russell King Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/dax.txt | 3 +++ fs/Kconfig | 1 + 2 files changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index be376d91d058..baf41118660d 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -82,6 +82,9 @@ Shortcomings Even if the kernel or its modules are stored on a filesystem that supports DAX on a block device that supports DAX, they will still be copied into RAM. +The DAX code does not work correctly on architectures which have virtually +mapped caches such as ARM, MIPS and SPARC. + Calling get_user_pages() on a range of user memory that has been mmaped from a DAX file will fail as there are no 'struct page' to describe those pages. This problem is being worked on. That means that O_DIRECT diff --git a/fs/Kconfig b/fs/Kconfig index 5331497d5b25..ec35851e5b71 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -36,6 +36,7 @@ source "fs/nilfs2/Kconfig" config FS_DAX bool "Direct Access (DAX) support" depends on MMU + depends on !(ARM || MIPS || SPARC) help Direct Access (DAX) can be used on memory-backed block devices. If the block device supports DAX and the filesystem supports DAX, -- cgit v1.2.3-70-g09d2