summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-12-28io_uring: kill poll linking optimisationPavel Begunkov
With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll requests doesn't make much sense, and in any case re-adding it shouldn't be a problem considering batching in tctx_task_work(). We can remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/15699682bf81610ec901d4e79d6da64baa9f70be.1639605189.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28io_uring: move common poll bitsPavel Begunkov
Move some poll helpers/etc up, we'll need them there shortly Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6c5c3dba24c86aad5cd389a54a8c7412e6a0621d.1639605189.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28io_uring: refactor poll updatePavel Begunkov
Clean up io_poll_update() and unify cancellation paths for remove and update. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5937138b6265a1285220e2fab1b28132c1d73ce3.1639605189.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28io_uring: remove double poll on poll updatePavel Begunkov
Before updating a poll request we should remove it from poll queues, including the double poll entry. Fixes: b69de288e913 ("io_uring: allow events and user_data update of running poll requests") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ac39e7f80152613603b8a6cc29a2b6063ac2434f.1639605189.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28kobject: remove kset from struct kset_uevent_ops callbacksGreg Kroah-Hartman
There is no need to pass the pointer to the kset in the struct kset_uevent_ops callbacks as no one uses it, so just remove that pointer entirely. Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Wedson Almeida Filho <wedsonaf@google.com> Link: https://lore.kernel.org/r/20211227163924.3970661-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-23Merge tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull ksmbd fixes from Steve French: "Three ksmbd fixes, all for stable as well. Two fix potential unitialized memory and one fixes a security problem where encryption is unitentionally disabled from some clients" * tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1 ksmbd: fix uninitialized symbol 'pntsd_size' ksmbd: fix error code in ndr_read_int32()
2021-12-23Merge tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fix from Jens Axboe: "Single fix for not clearing kiocb->ki_pos back to 0 for a stream, destined for stable as well" * tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-block: io_uring: zero iocb->ki_pos for stream file types
2021-12-23ext4: update fast commit TODOsHarshad Shirwadkar
This series takes care of a couple of TODOs and adds new ones. Update the TODOs section to reflect current state and future work that needs to happen. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-5-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23ext4: simplify updating of fast commit statsHarshad Shirwadkar
Move fast commit stats updating logic to a separate function from ext4_fc_commit(). This significantly improves readability of ext4_fc_commit(). Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-4-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23ext4: drop ineligible txn start stop APIsHarshad Shirwadkar
This patch drops ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() APIs. Fast commit ineligible transactions should simply call ext4_fc_mark_ineligible() after starting the trasaction. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-3-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23ext4: use ext4_journal_start/stop for fast commit transactionsHarshad Shirwadkar
This patch drops all calls to ext4_fc_start_update() and ext4_fc_stop_update(). To ensure that there are no ongoing journal updates during fast commit, we also make jbd2_fc_begin_commit() lock journal for updates. This way we don't have to maintain two different transaction start stop APIs for fast commit and full commit. This patch doesn't remove the functions altogether since in future we want to have inode level locking for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-2-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23iomap: Fix error handling in iomap_zero_iter()Matthew Wilcox (Oracle)
iomap_write_end() does not return a negative errno to indicate an error, but the number of bytes successfully copied. It cannot return an error today, so include a debugging assertion like the one in iomap_unshare_iter(). Fixes: c6f40468657d ("fsdax: decouple zeroing from the iomap buffered I/O code") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211221044450.517558-1-willy@infradead.org Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-23jffs2: GC deadlock reading a page that is used in jffs2_write_begin()Kyeong Yoo
GC task can deadlock in read_cache_page() because it may attempt to release a page that is actually allocated by another task in jffs2_write_begin(). The reason is that in jffs2_write_begin() there is a small window a cache page is allocated for use but not set Uptodate yet. This ends up with a deadlock between two tasks: 1) A task (e.g. file copy) - jffs2_write_begin() locks a cache page - jffs2_write_end() tries to lock "alloc_sem" from jffs2_reserve_space() <-- STUCK 2) GC task (jffs2_gcd_mtd3) - jffs2_garbage_collect_pass() locks "alloc_sem" - try to lock the same cache page in read_cache_page() <-- STUCK So to avoid this deadlock, hold "alloc_sem" in jffs2_write_begin() while reading data in a cache page. Signed-off-by: Kyeong Yoo <kyeong.yoo@alliedtelesis.co.nz> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: read-only if LEB may always be taken in ubifs_garbage_collectBaokun Li
If ubifs_garbage_collect_leb() returns -EAGAIN and ubifs_return_leb returns error, a LEB will always has a "taken" flag. In this case, set the ubifs to read-only to prevent a worse situation. Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: fix double return leb in ubifs_garbage_collectBaokun Li
If ubifs_garbage_collect_leb() returns -EAGAIN and enters the "out" branch, ubifs_return_leb will execute twice on the same lnum. This can cause data loss in concurrency situations. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: fix slab-out-of-bounds in ubifs_change_lpBaokun Li
Hulk Robot reported a KASAN report about slab-out-of-bounds: ================================================================== BUG: KASAN: slab-out-of-bounds in ubifs_change_lp+0x3a9/0x1390 [ubifs] Read of size 8 at addr ffff888101c961f8 by task fsstress/1068 [...] Call Trace: check_memory_region+0x1c1/0x1e0 ubifs_change_lp+0x3a9/0x1390 [ubifs] ubifs_change_one_lp+0x170/0x220 [ubifs] ubifs_garbage_collect+0x7f9/0xda0 [ubifs] ubifs_budget_space+0xfe4/0x1bd0 [ubifs] ubifs_write_begin+0x528/0x10c0 [ubifs] Allocated by task 1068: kmemdup+0x25/0x50 ubifs_lpt_lookup_dirty+0x372/0xb00 [ubifs] ubifs_update_one_lp+0x46/0x260 [ubifs] ubifs_tnc_end_commit+0x98b/0x1720 [ubifs] do_commit+0x6cb/0x1950 [ubifs] ubifs_run_commit+0x15a/0x2b0 [ubifs] ubifs_budget_space+0x1061/0x1bd0 [ubifs] ubifs_write_begin+0x528/0x10c0 [ubifs] [...] ================================================================== In ubifs_garbage_collect(), if ubifs_find_dirty_leb returns an error, lp is an uninitialized variable. But lp.num might be used in the out branch, which is a random value. If the value is -1 or another value that can pass the check, soob may occur in the ubifs_change_lp() in the following procedure. To solve this problem, we initialize lp.lnum to -1, and then initialize it correctly in ubifs_find_dirty_leb, which is not equal to -1, and ubifs_return_leb is executed only when lp.lnum != -1. if find a retained or indexing LEB and continue to next loop, but break before find another LEB, the "taken" flag of this LEB will be cleaned in ubi_return_lebi(). This bug has also been fixed in this patch. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: fix snprintf() length checkDan Carpenter
The snprintf() function returns the number of bytes (not including the NUL terminator) which would have been printed if there were enough space. So it can be greater than UBIFS_DFS_DIR_LEN. And actually if it equals UBIFS_DFS_DIR_LEN then that's okay so this check is too strict. Fixes: 9a620291fc01 ("ubifs: Export filesystem error counters") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Export filesystem error countersStefan Schaeckeler
Not all ubifs filesystem errors are propagated to userspace. Export bad magic, bad node and crc errors via sysfs. This allows userspace to notice filesystem errors: /sys/fs/ubifs/ubiX_Y/errors_magic /sys/fs/ubifs/ubiX_Y/errors_node /sys/fs/ubifs/ubiX_Y/errors_crc The counters are reset to 0 with a remount. Signed-off-by: Stefan Schaeckeler <sschaeck@cisco.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffersPetr Cvachoucek
it seems freeing the write buffers in the error path of the ubifs_remount_rw() is wrong. It leads later to a kernel oops like this: [10016.431274] UBIFS (ubi0:0): start fixing up free space [10090.810042] UBIFS (ubi0:0): free space fixup complete [10090.814623] UBIFS error (ubi0:0 pid 512): ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4 [10101.915108] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 517 [10105.275498] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030 [10105.284352] Mem abort info: [10105.287160] ESR = 0x96000006 [10105.290252] EC = 0x25: DABT (current EL), IL = 32 bits [10105.295592] SET = 0, FnV = 0 [10105.298652] EA = 0, S1PTW = 0 [10105.301848] Data abort info: [10105.304723] ISV = 0, ISS = 0x00000006 [10105.308573] CM = 0, WnR = 0 [10105.311564] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f03d1000 [10105.318034] [0000000000000030] pgd=00000000f6cee003, pud=00000000f4884003, pmd=0000000000000000 [10105.326783] Internal error: Oops: 96000006 [#1] PREEMPT SMP [10105.332355] Modules linked in: ath10k_pci ath10k_core ath mac80211 libarc4 cfg80211 nvme nvme_core cryptodev(O) [10105.342468] CPU: 3 PID: 518 Comm: touch Tainted: G O 5.4.3 #1 [10105.349517] Hardware name: HYPEX CPU (DT) [10105.353525] pstate: 40000005 (nZcv daif -PAN -UAO) [10105.358324] pc : atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34 [10105.364596] lr : mutex_lock+0x1c/0x34 [10105.368253] sp : ffff000075633aa0 [10105.371563] x29: ffff000075633aa0 x28: 0000000000000001 [10105.376874] x27: ffff000076fa80c8 x26: 0000000000000004 [10105.382185] x25: 0000000000000030 x24: 0000000000000000 [10105.387495] x23: 0000000000000000 x22: 0000000000000038 [10105.392807] x21: 000000000000000c x20: ffff000076fa80c8 [10105.398119] x19: ffff000076fa8000 x18: 0000000000000000 [10105.403429] x17: 0000000000000000 x16: 0000000000000000 [10105.408741] x15: 0000000000000000 x14: fefefefefefefeff [10105.414052] x13: 0000000000000000 x12: 0000000000000fe0 [10105.419364] x11: 0000000000000fe0 x10: ffff000076709020 [10105.424675] x9 : 0000000000000000 x8 : 00000000000000a0 [10105.429986] x7 : ffff000076fa80f4 x6 : 0000000000000030 [10105.435297] x5 : 0000000000000000 x4 : 0000000000000000 [10105.440609] x3 : 0000000000000000 x2 : ffff00006f276040 [10105.445920] x1 : ffff000075633ab8 x0 : 0000000000000030 [10105.451232] Call trace: [10105.453676] atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34 [10105.459600] ubifs_garbage_collect+0xb4/0x334 [10105.463956] ubifs_budget_space+0x398/0x458 [10105.468139] ubifs_create+0x50/0x180 [10105.471712] path_openat+0x6a0/0x9b0 [10105.475284] do_filp_open+0x34/0x7c [10105.478771] do_sys_open+0x78/0xe4 [10105.482170] __arm64_sys_openat+0x1c/0x24 [10105.486180] el0_svc_handler+0x84/0xc8 [10105.489928] el0_svc+0x8/0xc [10105.492808] Code: 52800013 17fffffb d2800003 f9800011 (c85ffc05) [10105.498903] ---[ end trace 46b721d93267a586 ]--- To reproduce the problem: 1. Filesystem initially mounted read-only, free space fixup flag set. 2. mount -o remount,rw <mountpoint> 3. it takes some time (free space fixup running) ... try to terminate running mount by CTRL-C ... does not respond, only after free space fixup is complete ... then "ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4" 4. mount -o remount,rw <mountpoint> ... now finished instantly (fixup already done). 5. Create file or just unmount the filesystem and we get the oops. Cc: <stable@vger.kernel.org> Fixes: b50b9f408502 ("UBIFS: do not free write-buffers when in R/O mode") Signed-off-by: Petr Cvachoucek <cvachoucek@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Make use of the helper macro kthread_run()Cai Huoqing
Repalce kthread_create/wake_up_process() with kthread_run() to simplify the code. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Fix spelling mistakesAlexander Dahl
Found with `codespell -i 3 -w fs/ubifs/**` and proof reading that parts. Signed-off-by: Alexander Dahl <ada@thorsis.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ext4: fix i_version handling on remountLukas Czerner
i_version mount option is getting lost on remount. This is because the 'i_version' mount option differs from the util-linux mount option 'iversion', but it has exactly the same functionality. We have to specifically notify the vfs that this is what we want by setting appropriate flag in fc->sb_flags. Fix it and as a result we can remove *flags argument from __ext4_remount(); do the same for __ext4_fill_super(). In addition set out to deprecate ext4 specific 'i_version' mount option in favor or 'iversion' by kernel version 5.20. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Fixes: cebe85d570cf ("ext4: switch to the new mount api") Link: https://lore.kernel.org/r/20211222104517.11187-2-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23ext4: remove lazytime/nolazytime mount options handled by MS_LAZYTIMELukas Czerner
The lazytime and nolazytime mount options were added temporarily back in 2015 with commit a26f49926da9 ("ext4: add optimization for the lazytime mount option"). It think it has been enough time for the util-linux with lazytime support to get widely used. Remove the mount options. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20211222104517.11187-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-23ext4: don't fail remount if journalling mode didn't changeLukas Czerner
Switching to the new mount api introduced inconsistency in how the journalling mode mount option (data=) is handled during a remount. Ext4 always prevented changing the journalling mode during the remount, however the new code always fails the remount when the journalling mode is specified, even if it remains unchanged. Fix it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Heiner Kallweit <hkallweit1@gmail.com> Fixes: cebe85d570cf ("ext4: switch to the new mount api") Link: https://lore.kernel.org/r/20211220152657.101599-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-22io_uring: zero iocb->ki_pos for stream file typesJens Axboe
io_uring supports using offset == -1 for using the current file position, and we read that in as part of read/write command setup. For the non-iter read/write types we pass in NULL for the position pointer, but for the iter types we should not be passing any anything but 0 for the position for a stream. Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we do, then the request will error with -ESPIPE. Fixes: ba04291eb66e ("io_uring: allow use of offset == -1 to mean file position") Link: https://github.com/axboe/liburing/discussions/501 Reported-by: Samuel Williams <samuel.williams@oriontransfer.co.nz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-22xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocateDarrick J. Wong
The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at the end of files, just like fallocate and RESVSP. Make the behavior consistent with the other ioctls. Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2021-12-22xfs: prevent UAF in xfs_log_item_in_current_chkptDarrick J. Wong
While I was running with KASAN and lockdep enabled, I stumbled upon an KASAN report about a UAF to a freed CIL checkpoint. Looking at the comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me that the original patch to xfs_defer_finish_noroll should have done something to lock the CIL to prevent it from switching the CIL contexts while the predicate runs. For upper level code that needs to know if a given log item is new enough not to need relogging, add a new wrapper that takes the CIL context lock long enough to sample the current CIL context. This is kind of racy in that the CIL can switch the contexts immediately after sampling, but that's ok because the consequence is that the defer ops code is a little slow to relog items. ================================================================== BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs] Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999 CPU: 1 PID: 527999 Comm: fsstress Tainted: G D 5.16.0-rc4-xfsx #rc4 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x140 kasan_report.cold+0x83/0xdf xfs_log_item_in_current_chkpt+0x139/0x160 xfs_defer_finish_noroll+0x3bb/0x1e30 __xfs_trans_commit+0x6c8/0xcf0 xfs_reflink_remap_extent+0x66f/0x10e0 xfs_reflink_remap_blocks+0x2dd/0xa90 xfs_file_remap_range+0x27b/0xc30 vfs_dedupe_file_range_one+0x368/0x420 vfs_dedupe_file_range+0x37c/0x5d0 do_vfs_ioctl+0x308/0x1260 __x64_sys_ioctl+0xa1/0x170 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2c71a2950b Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004 RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20 </TASK> Allocated by task 464064: kasan_save_stack+0x1e/0x50 __kasan_kmalloc+0x81/0xa0 kmem_alloc+0xcd/0x2c0 [xfs] xlog_cil_ctx_alloc+0x17/0x1e0 [xfs] xlog_cil_push_work+0x141/0x13d0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Freed by task 51: kasan_save_stack+0x1e/0x50 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xed/0x130 slab_free_freelist_hook+0x7f/0x160 kfree+0xde/0x340 xlog_cil_committed+0xbfd/0xfe0 [xfs] xlog_cil_process_committed+0x103/0x1c0 [xfs] xlog_state_do_callback+0x45d/0xbd0 [xfs] xlog_ioend_work+0x116/0x1c0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Last potentially related work creation: kasan_save_stack+0x1e/0x50 __kasan_record_aux_stack+0xb7/0xc0 insert_work+0x48/0x2e0 __queue_work+0x4e7/0xda0 queue_work_on+0x69/0x80 xlog_cil_push_now.isra.0+0x16b/0x210 [xfs] xlog_cil_force_seq+0x1b7/0x850 [xfs] xfs_log_force_seq+0x1c7/0x670 [xfs] xfs_file_fsync+0x7c1/0xa60 [xfs] __x64_sys_fsync+0x52/0x80 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88804ea5f600 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 8 bytes inside of 256-byte region [ffff88804ea5f600, ffff88804ea5f700) The buggy address belongs to the page: page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e head:ffffea00013a9780 order:1 compound_mapcount:0 flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff) raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40 raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: 4e919af7827a ("xfs: periodically relog deferred intent items") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21hostfs: Fix writeback of dirty pagesSjoerd Simons
Hostfs was not setting up the backing device information, which means it uses the noop bdi. The noop bdi does not have the writeback capability enabled, which in turns means dirty pages never got written back to storage. In other words programs using mmap to write to files on hostfs never actually got their data written out... Fix this by simply setting up the bdi with default settings as all the required code for writeback is already in place. Signed-off-by: Sjoerd Simons <sjoerd@collabora.com> Reviewed-by: Christopher Obbard <chris.obbard@collabora.com> Tested-by: Ritesh Raj Sarraf <ritesh@collabora.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-21Merge tag 'nfsd-5.16-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: "Address a buffer overrun reported by Anatoly Trosinenko" * tag 'nfsd-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix READDIR buffer overflow
2021-12-21iomap: Inline __iomap_zero_iter into its callerMatthew Wilcox (Oracle)
To make the merge easier, replicate the inlining of __iomap_zero_iter() into iomap_zero_iter() that is currently in the nvdimm tree. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()Dan Carpenter
The "bufsize" comes from the root user. If "bufsize" is negative then, because of type promotion, neither of the validation checks at the start of the function are able to catch it: if (bufsize < sizeof(struct xfs_attrlist) || bufsize > XFS_XATTR_LIST_MAX) return -EINVAL; This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in kvmalloc_node(). Fix this by changing the type from int to size_t. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: Fix comments mentioning xfs_iallocYang Xu
Since kernel commit 1abcf261016e ("xfs: move on-disk inode allocation out of xfs_ialloc()"), xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments. Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: check sb_meta_uuid for dabuf buffer recoveryDave Chinner
Got a report that a repeated crash test of a container host would eventually fail with a log recovery error preventing the system from mounting the root filesystem. It manifested as a directory leaf node corruption on writeback like so: XFS (loop0): Mounting V5 Filesystem XFS (loop0): Starting recovery (logdev: internal) XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b ........=....... 00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc .......X...).... 00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23 ..x..~J}.S...G.# 00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00 .........C...... 00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a ................ 00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50 .5y....0.......P 00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4 .@.......A...... 00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c .b.......P!A.... XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed Tracing indicated that we were recovering changes from a transaction at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57. That is, log recovery was overwriting a buffer with newer changes on disk than was in the transaction. Tracing indicated that we were hitting the "recovery immediately" case in xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the buffer. The code was extracting the LSN correctly, then ignoring it because the UUID in the buffer did not match the superblock UUID. The problem arises because the UUID check uses the wrong UUID - it should be checking the sb_meta_uuid, not sb_uuid. This filesystem has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the correct matching sb_meta_uuid in it, it's just the code checked it against the wrong superblock uuid. The is no corruption in the filesystem, and failing to recover the buffer due to a write verifier failure means the recovery bug did not propagate the corruption to disk. Hence there is no corruption before or after this bug has manifested, the impact is limited simply to an unmountable filesystem.... This was missed back in 2015 during an audit of incorrect sb_uuid usage that resulted in commit fcfbe2c4ef42 ("xfs: log recovery needs to validate against sb_meta_uuid") that fixed the magic32 buffers to validate against sb_meta_uuid instead of sb_uuid. It missed the magicda buffers.... Fixes: ce748eaa65f2 ("xfs: create new metadata UUID field and incompat flag") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: fix a bug in the online fsck directory leaf1 bestcount checkDarrick J. Wong
When xfs_scrub encounters a directory with a leaf1 block, it tries to validate that the leaf1 block's bestcount (aka the best free count of each directory data block) is the correct size. Previously, this author believed that comparing bestcount to the directory isize (since directory data blocks are under isize, and leaf/bestfree blocks are above it) was sufficient. Unfortunately during testing of online repair, it was discovered that it is possible to create a directory with a hole between the last directory block and isize. The directory code seems to handle this situation just fine and xfs_repair doesn't complain, which effectively makes this quirk part of the disk format. Fix the check to work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: only run COW extent recovery when there are no live extentsDarrick J. Wong
As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount <filesystem> # (0) fsstress <run only readonly ops> & # (1) while true; do fsstress <run all ops> mount -o remount,ro # (2) fsstress <run only readonly ops> mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. In the first patch, we fixed the race condition in (2) so that (A) will always flush the COW fork. In this second patch, we move the _recover_cow call to the initial mount call in (0) for safety. As mentioned previously, xfs_reflink_recover_cow walks the refcount btree looking for COW staging extents, and frees them. This was intended to be run at mount time (when we know there are no live inodes) to clean up any leftover staging events that may have been left behind during an unclean shutdown. As a time "optimization" for readonly mounts, we deferred this to the ro->rw transition, not realizing that any failure to clean all COW forks during a rw->ro transition would result in catastrophic corruption. Therefore, remove this optimization and only run the recovery routine when we're guaranteed not to have any COW staging extents anywhere, which means we always run this at mount time. While we're at it, move the callsite to xfs_log_mount_finish because any refcount btree expansion (however unlikely given that we're removing records from the right side of the index) must be fed by a per-AG reservation, which doesn't exist in its current location. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: don't expose internal symlink metadata buffers to the vfsDarrick J. Wong
Ian Kent reported that for inline symlinks, it's possible for vfs_readlink to hang on to the target buffer returned by _vn_get_link_inline long after it's been freed by xfs inode reclaim. This is a layering violation -- we should never expose XFS internals to the VFS. When the symlink has a remote target, we allocate a separate buffer, copy the internal information, and let the VFS manage the new buffer's lifetime. Let's adapt the inline code paths to do this too. It's less efficient, but fixes the layering violation and avoids the need to adapt the if_data lifetime to rcu rules. Clearly I don't care about readlink benchmarks. As a side note, this fixes the minor locking violation where we can access the inode data fork without taking any locks; proper locking (and eliminating the possibility of having to switch inode_operations on a live inode) is essential to online repair coordinating repairs correctly. Reported-by: Ian Kent <raven@themaw.net> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: fix quotaoff mutex usage now that we don't support disabling itDarrick J. Wong
Prior to commit 40b52225e58c ("xfs: remove support for disabling quota accounting on a mounted file system"), we used the quotaoff mutex to protect dquot operations against quotaoff trying to pull down dquots as part of disabling quota. Now that we only support turning off quota enforcement, the quotaoff mutex only protects changes in m_qflags/sb_qflags. We don't need it to protect dquots, which means we can remove it from setqlimits and the dquot scrub code. While we're at it, fix the function that forces quotacheck, since it should have been taking the quotaoff mutex. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: shut down filesystem if we xfs_trans_cancel with deferred work itemsDarrick J. Wong
While debugging some very strange rmap corruption reports in connection with the online directory repair code. I root-caused the error to the following incorrect sequence: <start repair transaction> <expand directory, causing a deferred rmap to be queued> <roll transaction> <cancel transaction> Obviously, we should have committed the transaction instead of cancelling it. Thinking more broadly, however, xfs_trans_cancel should have warned us that we were throwing away work item that we already committed to performing. This is not correct, and we need to shut down the filesystem. Change xfs_trans_cancel to complain in the loudest manner if we're cancelling any transaction with deferred work items attached. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-19Merge tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Two cifs/smb3 fixes, one fscache related, and one mount parsing related for stable" * tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: sanitize multiple delimiters in prepath cifs: ignore resource_id while getting fscache super cookie
2021-12-18NFSD: Fix READDIR buffer overflowChuck Lever
If a client sends a READDIR count argument that is too small (say, zero), then the buffer size calculation in the new init_dirlist helper functions results in an underflow, allowing the XDR stream functions to write beyond the actual buffer. This calculation has always been suspect. NFSD has never sanity- checked the READDIR count argument, but the old entry encoders managed the problem correctly. With the commits below, entry encoding changed, exposing the underflow to the pointer arithmetic in xdr_reserve_space(). Modern NFS clients attempt to retrieve as much data as possible for each READDIR request. Also, we have no unit tests that exercise the behavior of READDIR at the lower bound of @count values. Thus this case was missed during testing. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Fixes: f5dcccd647da ("NFSD: Update the NFSv2 READDIR entry encoder to use struct xdr_stream") Fixes: 7f87fc2d34d4 ("NFSD: Update NFSv3 READDIR entry encoders to use struct xdr_stream") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-18dax: remove the copy_from_iter and copy_to_iter methodsChristoph Hellwig
These methods indirect the actual DAX read/write path. In the end pmem uses magic flush and mc safe variants and fuse and dcssblk use plain ones while device mapper picks redirects to the underlying device. Add set_dax_nocache() and set_dax_nomc() APIs to control which copy routines are used to remove indirect call from the read/write fast path as well as a lot of boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs] Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-18dax: remove the DAXDEV_F_SYNC flagChristoph Hellwig
Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and just let the drivers call set_dax_synchronous directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-18fs: 9p: remove unneeded variableChangcheng Deng
Fix the following coccicheck review: ./fs/9p/vfs_file.c: 117: 5-8: Unneeded variable Remove unneeded variable used to store return value. Link: https://lkml.kernel.org/r/20211109114343.132844-1-deng.changcheng@zte.com.cn Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2021-12-18xfs: Support large foliosMatthew Wilcox (Oracle)
Now that iomap has been converted, XFS is large folio safe. Indicate to the VFS that it can now create large folios for XFS. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18iomap: Support large folios in invalidatepageMatthew Wilcox (Oracle)
If we're punching a hole in a large folio, we need to remove the per-folio iomap data as the folio is about to be split and each page will need its own. If a dirty folio is only partially-uptodate, the iomap data contains the information about which blocks cannot be written back, so assert that a dirty folio is fully uptodate. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18iomap: Convert iomap_migrate_page() to use foliosMatthew Wilcox (Oracle)
The arguments are still pages for now, but we can use folios internally and cut out a lot of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18iomap: Convert iomap_add_to_ioend() to take a folioMatthew Wilcox (Oracle)
We still iterate one block at a time, but now we call compound_head() less often. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18iomap: Simplify iomap_do_writepage()Matthew Wilcox (Oracle)
Rename end_offset to end_pos and offset_into_page to poff to match the rest of the file. Simplify the handling of the last page straddling i_size by doing the EOF check based on the byte granularity i_size instead of converting to a pgoff prematurely. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18iomap: Simplify iomap_writepage_map()Matthew Wilcox (Oracle)
Rename end_offset to end_pos and file_offset to pos to match the rest of the file. Simplify the loop by calculating nblocks up front instead of each time around the loop. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18iomap,xfs: Convert ->discard_page to ->discard_folioMatthew Wilcox (Oracle)
XFS has the only implementation of ->discard_page today, so convert it to use folios in the same patch as converting the API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>