summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-11-16Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff
2024-11-15Merge tag 'for-6.12-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix that seems urgent and good to have in 6.12 final. It could potentially lead to unexpected transaction aborts, due to wrong comparison and order of processing of delayed refs" * tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix incorrect comparison for delayed refs
2024-11-14ocfs2: uncache inode which has failed entering the groupDmitry Antipov
Syzbot has reported the following BUG: kernel BUG at fs/ocfs2/uptodate.c:509! ... Call Trace: <TASK> ? __die_body+0x5f/0xb0 ? die+0x9e/0xc0 ? do_trap+0x15a/0x3a0 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ? do_error_trap+0x1dc/0x2c0 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ? __pfx_do_error_trap+0x10/0x10 ? handle_invalid_op+0x34/0x40 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ? exc_invalid_op+0x38/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? ocfs2_set_new_buffer_uptodate+0x2e/0x160 ? ocfs2_set_new_buffer_uptodate+0x144/0x160 ? ocfs2_set_new_buffer_uptodate+0x145/0x160 ocfs2_group_add+0x39f/0x15a0 ? __pfx_ocfs2_group_add+0x10/0x10 ? __pfx_lock_acquire+0x10/0x10 ? mnt_get_write_access+0x68/0x2b0 ? __pfx_lock_release+0x10/0x10 ? rcu_read_lock_any_held+0xb7/0x160 ? __pfx_rcu_read_lock_any_held+0x10/0x10 ? smack_log+0x123/0x540 ? mnt_get_write_access+0x68/0x2b0 ? mnt_get_write_access+0x68/0x2b0 ? mnt_get_write_access+0x226/0x2b0 ocfs2_ioctl+0x65e/0x7d0 ? __pfx_ocfs2_ioctl+0x10/0x10 ? smack_file_ioctl+0x29e/0x3a0 ? __pfx_smack_file_ioctl+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x43d/0x780 ? __pfx_lockdep_hardirqs_on_prepare+0x10/0x10 ? __pfx_ocfs2_ioctl+0x10/0x10 __se_sys_ioctl+0xfb/0x170 do_syscall_64+0xf3/0x230 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... </TASK> When 'ioctl(OCFS2_IOC_GROUP_ADD, ...)' has failed for the particular inode in 'ocfs2_verify_group_and_input()', corresponding buffer head remains cached and subsequent call to the same 'ioctl()' for the same inode issues the BUG() in 'ocfs2_set_new_buffer_uptodate()' (trying to cache the same buffer head of that inode). Fix this by uncaching the buffer head with 'ocfs2_remove_from_cache()' on error path in 'ocfs2_group_add()'. Link: https://lkml.kernel.org/r/20241114043844.111847-1-dmantipov@yandex.ru Fixes: 7909f2bf8353 ("[PATCH 2/2] ocfs2: Implement group add for online resize") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+453873f1588c2d75b447@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=453873f1588c2d75b447 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Dmitry Antipov <dmantipov@yandex.ru> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()Dan Carpenter
The "arg->vec_len" variable is a u64 that comes from the user at the start of the function. The "arg->vec_len * sizeof(struct page_region))" multiplication can lead to integer wrapping. Use size_mul() to avoid that. Also the size_add/mul() functions work on unsigned long so for 32bit systems we need to ensure that "arg->vec_len" fits in an unsigned long. Link: https://lkml.kernel.org/r/39d41335-dd4d-48ed-8a7f-402c57d8ea84@stanley.mountain Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Andrei Vagin <avagin@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14Merge tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "This fixes one minor regression from the btree cache fixes (in the scan_for_btree_nodes repair path) - and the shutdown path fix is the big one here, in terms of bugs closed: - Assorted tiny syzbot fixes - Shutdown path fix: "bch2_btree_write_buffer_flush_going_ro()" The shutdown path wasn't flushing the btree write buffer, leading to shutting down while we still had operations in flight. This fixes a whole slew of syzbot bugs, and undoubtedly other strange heisenbugs. * tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs: bcachefs: Fix assertion pop in bch2_ptr_swab() bcachefs: Fix journal_entry_dev_usage_to_text() overrun bcachefs: Allow for unknown key types in backpointers fsck bcachefs: Fix assertion pop in topology repair bcachefs: Fix hidden btree errors when reading roots bcachefs: Fix validate_bset() repair path bcachefs: Fix missing validation for bch_backpointer.level bcachefs: Fix bch_member.btree_bitmap_shift validation bcachefs: bch2_btree_write_buffer_flush_going_ro()
2024-11-14btrfs: fix incorrect comparison for delayed refsJosef Bacik
When I reworked delayed ref comparison in cf4f04325b2b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node"), I made a mistake and returned -1 for the case where ref1->ref_root was > than ref2->ref_root. This is a subtle bug that can result in improper delayed ref running order, which can result in transaction aborts. Fixes: cf4f04325b2b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node") CC: stable@vger.kernel.org # 6.10+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-13Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All singletons" * tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: swapfile: fix cluster reclaim work crash on rotational devices selftests: hugetlb_dio: fixup check for initial conditions to skip in the start mm/thp: fix deferred split queue not partially_mapped: fix mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases nommu: pass NULL argument to vma_iter_prealloc() ocfs2: fix UBSAN warning in ocfs2_verify_volume() nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint mm: page_alloc: move mlocked flag clearance into free_pages_prepare() mm: count zeromap read and set for swapout and swapin
2024-11-12bcachefs: Fix assertion pop in bch2_ptr_swab()Kent Overstreet
This runs on extents that haven't yet been validated, so we don't want to assert that we have a valid entry type. Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-12bcachefs: Fix journal_entry_dev_usage_to_text() overrunKent Overstreet
If the jset_entry_dev_usage is malformed, and too small, our nr_entries calculation will be incorrect - just bail out. Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11ocfs2: fix UBSAN warning in ocfs2_verify_volume()Dmitry Antipov
Syzbot has reported the following splat triggered by UBSAN: UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10 shift exponent 32768 is too large for 32-bit type 'int' CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted 6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x241/0x360 ? __pfx_dump_stack_lvl+0x10/0x10 ? __pfx__printk+0x10/0x10 ? __asan_memset+0x23/0x50 ? lockdep_init_map_type+0xa1/0x910 __ubsan_handle_shift_out_of_bounds+0x3c8/0x420 ocfs2_fill_super+0xf9c/0x5750 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? __pfx_validate_chain+0x10/0x10 ? validate_chain+0x11e/0x5920 ? __lock_acquire+0x1384/0x2050 ? __pfx_validate_chain+0x10/0x10 ? string+0x26a/0x2b0 ? widen_string+0x3a/0x310 ? string+0x26a/0x2b0 ? bdev_name+0x2b1/0x3c0 ? pointer+0x703/0x1210 ? __pfx_pointer+0x10/0x10 ? __pfx_format_decode+0x10/0x10 ? __lock_acquire+0x1384/0x2050 ? vsnprintf+0x1ccd/0x1da0 ? snprintf+0xda/0x120 ? __pfx_lock_release+0x10/0x10 ? do_raw_spin_lock+0x14f/0x370 ? __pfx_snprintf+0x10/0x10 ? set_blocksize+0x1f9/0x360 ? sb_set_blocksize+0x98/0xf0 ? setup_bdev_super+0x4e6/0x5d0 mount_bdev+0x20c/0x2d0 ? __pfx_ocfs2_fill_super+0x10/0x10 ? __pfx_mount_bdev+0x10/0x10 ? vfs_parse_fs_string+0x190/0x230 ? __pfx_vfs_parse_fs_string+0x10/0x10 legacy_get_tree+0xf0/0x190 ? __pfx_ocfs2_mount+0x10/0x10 vfs_get_tree+0x92/0x2b0 do_new_mount+0x2be/0xb40 ? __pfx_do_new_mount+0x10/0x10 __se_sys_mount+0x2d6/0x3c0 ? __pfx___se_sys_mount+0x10/0x10 ? do_syscall_64+0x100/0x230 ? __x64_sys_mount+0x20/0xc0 do_syscall_64+0xf3/0x230 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f37cae96fda Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240 RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000 R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0 R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000 </TASK> For a really damaged superblock, the value of 'i_super.s_blocksize_bits' may exceed the maximum possible shift for an underlying 'int'. So add an extra check whether the aforementioned field represents the valid block size, which is 512 bytes, 1K, 2K, or 4K. Link: https://lkml.kernel.org/r/20241106092100.2661330-1-dmantipov@yandex.ru Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+56f7cd1abe4b8e475180@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11nilfs2: fix null-ptr-deref in block_dirty_buffer tracepointRyusuke Konishi
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty() may cause a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because, since the tracepoint was added in mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, nilfs_grab_buffer(), which grabs a buffer to read (or create) a block of metadata, including b-tree node blocks, does not set the block device, but instead does so only if the buffer is not in the "uptodate" state for each of its caller block reading functions. However, if the uptodate flag is set on a folio/page, and the buffer heads are detached from it by try_to_free_buffers(), and new buffer heads are then attached by create_empty_buffers(), the uptodate flag may be restored to each buffer without the block device being set to bh->b_bdev, and mark_buffer_dirty() may be called later in that state, resulting in the bug mentioned above. Fix this issue by making nilfs_grab_buffer() always set the block device of the super block structure to the buffer head, regardless of the state of the buffer's uptodate flag. Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ubisectech Sirius <bugreport@valiantsec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11nilfs2: fix null-ptr-deref in block_touch_buffer tracepointRyusuke Konishi
Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints". This series fixes null pointer dereference bugs that occur when using nilfs2 and two block-related tracepoints. This patch (of 2): It has been reported that when using "block:block_touch_buffer" tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a NULL pointer dereference, or a general protection fault when KASAN is enabled. This happens because since the tracepoint was added in touch_buffer(), it references the dev_t member bh->b_bdev->bd_dev regardless of whether the buffer head has a pointer to a block_device structure. In the current implementation, the block_device structure is set after the function returns to the caller. Here, touch_buffer() is used to mark the folio/page that owns the buffer head as accessed, but the common search helper for folio/page used by the caller function was optimized to mark the folio/page as accessed when it was reimplemented a long time ago, eliminating the need to call touch_buffer() here in the first place. So this solves the issue by eliminating the touch_buffer() call itself. Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: Ubisectech Sirius <bugreport@valiantsec.com> Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@valiantsec.com Reported-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2 Tested-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11bcachefs: Allow for unknown key types in backpointers fsckKent Overstreet
We can't assume that btrees only contain keys of a given type - even if they only have a single key type listed in the allowed key types for that btree; this is a forwards compatibility issue. Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11bcachefs: Fix assertion pop in topology repairKent Overstreet
Fixes: baefd3f849ed ("bcachefs: btree_cache.freeable list fixes") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-10Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes, 14 of which are cc:stable. Three affect DAMON. Lorenzo's five-patch series to address the mmap_region error handling is here also. Apart from that, various singletons" * tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Thorsten Blum ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() signal: restore the override_rlimit logic fs/proc: fix compile warning about variable 'vmcore_mmap_ops' ucounts: fix counter leak in inc_rlimit_get_ucounts() selftests: hugetlb_dio: check for initial conditions to skip in the start mm: fix docs for the kernel parameter ``thp_anon=`` mm/damon/core: avoid overflow in damon_feed_loop_next_input() mm/damon/core: handle zero schemes apply interval mm/damon/core: handle zero {aggregation,ops_update} intervals mm/mlock: set the correct prev on failure objpool: fix to make percpu slot allocation more robust mm/page_alloc: keep track of free highatomic mm: resolve faulty mmap_region() error path behaviour mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling mm: refactor map_deny_write_exec() mm: unconditionally close VMAs on error mm: avoid unsafe VMA hook invocation when error arises on mmap hook mm/thp: fix deferred split unqueue naming and locking mm/thp: fix deferred split queue not partially_mapped
2024-11-09Merge tag 'nfsd-6.12-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD * tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
2024-11-09Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fix from Steve French: "Fix net namespace refcount use after free issue" * tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: smb: client: Fix use-after-free of network namespace.
2024-11-08Merge tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: "Four fixes, all also marked for stable: - fix two potential use after free issues - fix OOM issue with many simultaneous requests - fix missing error check in RPC pipe handling" * tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: check outstanding simultaneous SMB operations ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create ksmbd: Fix the missing xa_store error check
2024-11-08bcachefs: Fix hidden btree errors when reading rootsKent Overstreet
We silence btree errors in btree_node_scan, since it's probing and errors are expected: add a fake pass so that btree_node_scan is no longer recovery pass 0, and we don't think we're in btree node scan when reading btree roots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-08bcachefs: Fix validate_bset() repair pathKent Overstreet
When we truncate a bset (due to it extending past the end of the btree node), we can't skip the rest of the validation for e.g. the packed format (if it's the first bset in the node). Reported-by: syzbot+4d722d3c539d77c7bc82@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-08Merge tag 'for-6.12-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more one-liners that fix some user visible problems: - use correct range when clearing qgroup reservations after COW - properly reset freed delayed ref list head - fix ro/rw subvolume mounts to be backward compatible with old and new mount API" * tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix the length of reserved qgroup to free btrfs: reinitialize delayed ref list after deleting it from the list btrfs: fix per-subvolume RO/RW flags with new mount API
2024-11-08Merge tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "Some trivial syzbot fixes, two more serious btree fixes found by looping single_devices.ktest small_nodes: - Topology error on split after merge, where we accidentaly picked the node being deleted for the pivot, resulting in an assertion pop - New nodes being preallocated were left on the freedlist, unlocked, resulting in them sometimes being accidentally freed: this dated from pre-cycle detector, when we could leave them locked. This should have resulted in more explosions and fireworks, but turned out to be surprisingly hard to hit because the preallocated nodes were being used right away. The fix for this is bigger than we'd like - reworking btree list handling was a bit invasive - but we've now got more assertions and it's well tested. - Also another mishandled transaction restart fix (in btree_node_prefetch) - we're almost done with those" * tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs: bcachefs: Fix UAF in __promote_alloc() error path bcachefs: Change OPT_STR max to be 1 less than the size of choices array bcachefs: btree_cache.freeable list fixes bcachefs: check the invalid parameter for perf test bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery bcachefs: Fix topology errors on split after merge bcachefs: Ancient versions with bad bkey_formats are no longer supported bcachefs: Fix error handling in bch2_btree_node_prefetch() bcachefs: Fix null ptr deref in bucket_gen_get()
2024-11-08bcachefs: Fix missing validation for bch_backpointer.levelKent Overstreet
This fixes an assertion pop where we try to navigate to the target of the backpointer, and the path level isn't what we expect. Reported-by: syzbot+b17df21b4d370f2dc330@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Fix bch_member.btree_bitmap_shift validationKent Overstreet
Needs to match the assert later when we resize... Reported-by: syzbot+e8eff054face85d7ea41@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: bch2_btree_write_buffer_flush_going_ro()Kent Overstreet
The write buffer needs to be specifically flushed when going RO: keys in the journal that haven't yet been moved to the write buffer don't have a journal pin yet. This fixes numerous syzbot bugs, all with symptoms of still doing writes after we've got RO. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()Andrew Kanner
Syzkaller is able to provoke null-ptr-dereference in ocfs2_xa_remove(): [ 57.319872] (a.out,1161,7):ocfs2_xa_remove:2028 ERROR: status = -12 [ 57.320420] (a.out,1161,7):ocfs2_xa_cleanup_value_truncate:1999 ERROR: Partial truncate while removing xattr overlay.upper. Leaking 1 clusters and removing the entry [ 57.321727] BUG: kernel NULL pointer dereference, address: 0000000000000004 [...] [ 57.325727] RIP: 0010:ocfs2_xa_block_wipe_namevalue+0x2a/0xc0 [...] [ 57.331328] Call Trace: [ 57.331477] <TASK> [...] [ 57.333511] ? do_user_addr_fault+0x3e5/0x740 [ 57.333778] ? exc_page_fault+0x70/0x170 [ 57.334016] ? asm_exc_page_fault+0x2b/0x30 [ 57.334263] ? __pfx_ocfs2_xa_block_wipe_namevalue+0x10/0x10 [ 57.334596] ? ocfs2_xa_block_wipe_namevalue+0x2a/0xc0 [ 57.334913] ocfs2_xa_remove_entry+0x23/0xc0 [ 57.335164] ocfs2_xa_set+0x704/0xcf0 [ 57.335381] ? _raw_spin_unlock+0x1a/0x40 [ 57.335620] ? ocfs2_inode_cache_unlock+0x16/0x20 [ 57.335915] ? trace_preempt_on+0x1e/0x70 [ 57.336153] ? start_this_handle+0x16c/0x500 [ 57.336410] ? preempt_count_sub+0x50/0x80 [ 57.336656] ? _raw_read_unlock+0x20/0x40 [ 57.336906] ? start_this_handle+0x16c/0x500 [ 57.337162] ocfs2_xattr_block_set+0xa6/0x1e0 [ 57.337424] __ocfs2_xattr_set_handle+0x1fd/0x5d0 [ 57.337706] ? ocfs2_start_trans+0x13d/0x290 [ 57.337971] ocfs2_xattr_set+0xb13/0xfb0 [ 57.338207] ? dput+0x46/0x1c0 [ 57.338393] ocfs2_xattr_trusted_set+0x28/0x30 [ 57.338665] ? ocfs2_xattr_trusted_set+0x28/0x30 [ 57.338948] __vfs_removexattr+0x92/0xc0 [ 57.339182] __vfs_removexattr_locked+0xd5/0x190 [ 57.339456] ? preempt_count_sub+0x50/0x80 [ 57.339705] vfs_removexattr+0x5f/0x100 [...] Reproducer uses faultinject facility to fail ocfs2_xa_remove() -> ocfs2_xa_value_truncate() with -ENOMEM. In this case the comment mentions that we can return 0 if ocfs2_xa_cleanup_value_truncate() is going to wipe the entry anyway. But the following 'rc' check is wrong and execution flow do 'ocfs2_xa_remove_entry(loc);' twice: * 1st: in ocfs2_xa_cleanup_value_truncate(); * 2nd: returning back to ocfs2_xa_remove() instead of going to 'out'. Fix this by skipping the 2nd removal of the same entry and making syzkaller repro happy. Link: https://lkml.kernel.org/r/20241103193845.2940988-1-andrew.kanner@gmail.com Fixes: 399ff3a748cf ("ocfs2: Handle errors while setting external xattr values.") Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com> Reported-by: syzbot+386ce9e60fa1b18aac5b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/671e13ab.050a0220.2b8c0f.01d0.GAE@google.com/T/ Tested-by: syzbot+386ce9e60fa1b18aac5b@syzkaller.appspotmail.com Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07fs/proc: fix compile warning about variable 'vmcore_mmap_ops'Qi Xi
When build with !CONFIG_MMU, the variable 'vmcore_mmap_ops' is defined but not used: >> fs/proc/vmcore.c:458:42: warning: unused variable 'vmcore_mmap_ops' 458 | static const struct vm_operations_struct vmcore_mmap_ops = { Fix this by only defining it when CONFIG_MMU is enabled. Link: https://lkml.kernel.org/r/20241101034803.9298-1-xiqi2@huawei.com Fixes: 9cb218131de1 ("vmcore: introduce remap_oldmem_pfn_range()") Signed-off-by: Qi Xi <xiqi2@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/lkml/202410301936.GcE8yUos-lkp@intel.com/ Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07bcachefs: Fix UAF in __promote_alloc() error pathKent Overstreet
If we error in data_update_init() after adding to the rhashtable of outstanding promotes, kfree_rcu() is required. Reported-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Change OPT_STR max to be 1 less than the size of choices arrayPiotr Zalewski
Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices" array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text. The "_choices" array is a null-terminated array, so computing the maximum using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since bch2_opt_validate don't subtract 1, as bch2_opt_to_text does, values bigger than the actual maximum would pass through option validation. Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6 Fixes: 63c4b2545382 ("bcachefs: Better superblock opt validation") Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: btree_cache.freeable list fixesKent Overstreet
When allocating new btree nodes, we were leaving them on the freeable list - unlocked - allowing them to be reclaimed: ouch. Additionally, bch2_btree_node_free_never_used() -> bch2_btree_node_hash_remove was putting it on the freelist, while bch2_btree_node_free_never_used() was putting it back on the btree update reserve list - ouch. Originally, the code was written to always keep btree nodes on a list - live or freeable - and this worked when new nodes were kept locked. But now with the cycle detector, we can't keep nodes locked that aren't tracked by the cycle detector; and this is fine as long as they're not reachable. We also have better and more robust leak detection now, with memory allocation profiling, so the original justification no longer applies. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: check the invalid parameter for perf testHongbo Li
The perf_test does not check the number of iterations and threads when it is zero. If nr_thread is 0, the perf test will keep waiting for wakekup. If iteration is 0, it will cause exception of division by zero. This can be reproduced by: echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test or echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test Fixes: 1c6fdbd8f246 ("bcachefs: Initial commit") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: add check NULL return of bio_kmalloc in journal_read_bucketPei Xiao
bio_kmalloc may return NULL, will cause NULL pointer dereference. Add check NULL return for bio_kmalloc in journal_read_bucket. Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn> Fixes: ac10a9611d87 ("bcachefs: Some fixes for building in userspace") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recoveryKent Overstreet
If BCH_FS_may_go_rw is not yet set, it indicates to the transaction commit path that updates should be done via the list of journal replay keys. This must be set before multithreaded use commences. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Fix topology errors on split after mergeKent Overstreet
If a btree split picks a pivot that's being deleted by a btree node merge, we're going to have problems. Fix this by checking if the pivot is being deleted, the same as we check for deletions in journal replay keys. Found by single_devic.ktest small_nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Ancient versions with bad bkey_formats are no longer supportedKent Overstreet
Syzbot found an assertion pop, by generating an ancient filesystem version with an invalid bkey_format (with fields that can overflow) as well as packed keys that aren't representable unpacked. This breaks key comparisons in all sorts of painful ways. Filesystems have been automatically rewriting nodes with such invalid formats for years; we can safely drop support for them. Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Fix error handling in bch2_btree_node_prefetch()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Fix null ptr deref in bucket_gen_get()Kent Overstreet
bucket_gen() checks if we're lookup up a valid bucket and returns NULL otherwise, but bucket_gen_get() was failing to check; other callers were correct. Also do a bit of cleanup on callers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07proc/softirqs: replace seq_printf with seq_put_decimal_ull_widthDavid Wang
seq_printf is costy, on a system with n CPUs, reading /proc/softirqs would yield 10*n decimal values, and the extra cost parsing format string grows linearly with number of cpus. Replace seq_printf with seq_put_decimal_ull_width have significant performance improvement. On an 8CPUs system, reading /proc/softirqs show ~40% performance gain with this patch. Signed-off-by: David Wang <00107082@163.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-07NFSD: Fix READDIR on NFSv3 mounts of ext4 exportsChuck Lever
I noticed that recently, simple operations like "make" started failing on NFSv3 mounts of ext4 exports. Network capture shows that READDIRPLUS operated correctly but READDIR failed with NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is passed a non-zero starting directory cookie. I bisected to commit c689bdd3bffa ("nfsd: further centralize protocol version checks."). Turns out that nfsd3_proc_readdir() does not call fh_verify() before it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when the directory is opened. For ext4, this causes the wrong "max file size" value to be used when sanity checking the incoming directory cookie (which is a seek offset value). The fhp->fh_64bit_cookies boolean is /always/ properly initialized after nfsd_open() returns. There doesn't seem to be a reason for the generic NFSD open helper to handle the f_mode fix-up for directories, so just move that to the one caller that tries to open an S_IFDIR with NFSD_MAY_64BIT_COOKIE. Suggested-by: NeilBrown <neilb@suse.de> Fixes: c689bdd3bffa ("nfsd: further centralize protocol version checks.") Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-07btrfs: fix the length of reserved qgroup to freeHaisu Wang
The dealloc flag may be cleared and the extent won't reach the disk in cow_file_range when errors path. The reserved qgroup space is freed in commit 30479f31d44d ("btrfs: fix qgroup reserve leaks in cow_file_range"). However, the length of untouched region to free needs to be adjusted with the correct remaining region size. Fixes: 30479f31d44d ("btrfs: fix qgroup reserve leaks in cow_file_range") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Haisu Wang <haisuwang@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07btrfs: reinitialize delayed ref list after deleting it from the listFilipe Manana
At insert_delayed_ref() if we need to update the action of an existing ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's ref_add_list using list_del(), which leaves the ref's add_list member not reinitialized, as list_del() sets the next and prev members of the list to LIST_POISON1 and LIST_POISON2, respectively. If later we end up calling drop_delayed_ref() against the ref, which can happen during merging or when destroying delayed refs due to a transaction abort, we can trigger a crash since at drop_delayed_ref() we call list_empty() against the ref's add_list, which returns false since the list was not reinitialized after the list_del() and as a consequence we call list_del() again at drop_delayed_ref(). This results in an invalid list access since the next and prev members are set to poison pointers, resulting in a splat if CONFIG_LIST_HARDENED and CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences otherwise. So fix this by deleting from the list with list_del_init() instead. Fixes: 1d57ee941692 ("btrfs: improve delayed refs iterations") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07btrfs: fix per-subvolume RO/RW flags with new mount APIQu Wenruo
[BUG] With util-linux 2.40.2, the 'mount' utility is already utilizing the new mount API. e.g: # strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/ ... fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 But this leads to a new problem, that per-subvolume RO/RW mount no longer works, if the initial mount is RO: # mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test # mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch # mount | grep mnt /dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1) /dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2) # touch /mnt/scratch/foobar touch: cannot touch '/mnt/scratch/foobar': Read-only file system This is a common use cases on distros. [CAUSE] We have a workaround for remount to handle the RO->RW change, but if the mount is using the new mount API, we do not do that, and rely on the mount tool NOT to set the ro flag. But that's not how the mount tool is doing for the new API: fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0 fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0 fsmount(3, FSMOUNT_CLOEXEC, 0) = 4 mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0 move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0 This means we will set the super block RO at the first mount. Later RW mount will not try to reconfigure the fs to RW because the mount tool is already using the new API. This totally breaks the per-subvolume RO/RW mount behavior. [FIX] Do not skip the reconfiguration even if using the new API. The old comments are just expecting any mount tool to properly skip the RO flag set even if we specify "ro", which is not the reality. Update the comments regarding the backward compatibility on the kernel level so it works with old and new mount utilities. CC: stable@vger.kernel.org # 6.8+ Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-06Merge tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "These are mostly fixes that came up during the nfs bakeathon the other week. Stable Fixes: - Fix KMSAN warning in decode_getfattr_attrs() Other Bugfixes: - Handle -ENOTCONN in xs_tcp_setup_socked() - NFSv3: only use NFS timeout for MOUNT when protocols are compatible - Fix attribute delegation behavior on exclusive create and a/mtime changes - Fix localio to cope with racing nfs_local_probe() - Avoid i_lock contention in fs_clear_invalid_mapping()" * tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: nfs: avoid i_lock contention in nfs_clear_invalid_mapping nfs_common: fix localio to cope with racing nfs_local_probe() NFS: Further fixes to attribute delegation a/mtime changes NFS: Fix attribute delegation behaviour on exclusive create nfs: Fix KMSAN warning in decode_getfattr_attrs() NFSv3: only use NFS timeout for MOUNT when protocols are compatible sunrpc: handle -ENOTCONN in xs_tcp_setup_socket()
2024-11-06Merge tag 'tracefs-v6.12-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracefs fixes from Steven Rostedt: "Fix tracefs mount options. Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API") broke the gid setting when set by fstab or other mount utility. It is ignored when it is set. Fix the code so that it recognises the option again and will honor the settings on mount at boot up. Update the internal documentation and create a selftest to make sure it doesn't break again in the future" * tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/selftests: Add tracefs mount options test tracing: Document tracefs gid mount option tracing: Fix tracefs mount options
2024-11-05ksmbd: check outstanding simultaneous SMB operationsNamjae Jeon
If Client send simultaneous SMB operations to ksmbd, It exhausts too much memory through the "ksmbd_work_cache”. It will cause OOM issue. ksmbd has a credit mechanism but it can't handle this problem. This patch add the check if it exceeds max credits to prevent this problem by assuming that one smb request consumes at least one credit. Cc: stable@vger.kernel.org # v5.15+ Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-05ksmbd: fix slab-use-after-free in smb3_preauth_hash_rspNamjae Jeon
ksmbd_user_session_put should be called under smb3_preauth_hash_rsp(). It will avoid freeing session before calling smb3_preauth_hash_rsp(). Cc: stable@vger.kernel.org # v5.15+ Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-05ksmbd: fix slab-use-after-free in ksmbd_smb2_session_createNamjae Jeon
There is a race condition between ksmbd_smb2_session_create and ksmbd_expire_session. This patch add missing sessions_table_lock while adding/deleting session from global session table. Cc: stable@vger.kernel.org # v5.15+ Reported-by: Norbert Szetei <norbert@doyensec.com> Tested-by: Norbert Szetei <norbert@doyensec.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-04nfs: avoid i_lock contention in nfs_clear_invalid_mappingMike Snitzer
Multi-threaded buffered reads to the same file exposed significant inode spinlock contention in nfs_clear_invalid_mapping(). Eliminate this spinlock contention by checking flags without locking, instead using smp_rmb and smp_load_acquire accordingly, but then take spinlock and double-check these inode flags. Also refactor nfs_set_cache_invalid() slightly to use smp_store_release() to pair with nfs_clear_invalid_mapping()'s smp_load_acquire(). While this fix is beneficial for all multi-threaded buffered reads issued by an NFS client, this issue was identified in the context of surprisingly low LOCALIO performance with 4K multi-threaded buffered read IO. This fix dramatically speeds up LOCALIO performance: before: read: IOPS=1583k, BW=6182MiB/s (6482MB/s)(121GiB/20002msec) after: read: IOPS=3046k, BW=11.6GiB/s (12.5GB/s)(232GiB/20001msec) Fixes: 17dfeb911339 ("NFS: Fix races in nfs_revalidate_mapping") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04nfs_common: fix localio to cope with racing nfs_local_probe()Mike Snitzer
Fix the possibility of racing nfs_local_probe() resulting in: list_add double add: new=ffff8b99707f9f58, prev=ffff8b99707f9f58, next=ffffffffc0f30000. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:35! Add nfs_uuid_init() to properly initialize all nfs_uuid_t members (particularly its list_head). Switch to returning bool from nfs_uuid_begin(), returns false if nfs_uuid_t is already in-use (its list_head is on a list). Update nfs_local_probe() to return early if the nfs_client's cl_uuid (nfs_uuid_t) is in-use. Also, switch nfs_uuid_begin() from using list_add_tail_rcu() to list_add_tail() -- rculist was used in an earlier version of the localio code that had a lockless nfs_uuid_lookup interface. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04NFS: Further fixes to attribute delegation a/mtime changesTrond Myklebust
When asked to set both an atime and an mtime to the current system time, ensure that the setting is atomic by calling inode_update_timestamps() only once with the appropriate flags. Fixes: e12912d94137 ("NFSv4: Add support for delegated atime and mtime attributes") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>