summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2024-09-23Merge tag 'for-6.12-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix dangling pointer to rb-tree of defragmented inodes after cleanup - a followup fix to handle concurrent lseek on the same fd that could leak memory under some conditions - fix wrong root id reported in tree checker when verifying dref * tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag btrfs: tree-checker: fix the wrong output of data backref objectid btrfs: fix race setting file private on concurrent lseek using same fd
2024-09-23Merge tag 'pull-stable-struct_fd' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
2024-09-17btrfs: fix use-after-free on rbtree that tracks inodes for auto defragFilipe Manana
When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called during remount and unmount, we are freeing every node from the rbtree that tracks inodes for auto defrag using rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree itself. So once we unlock the lock that protects the rbtree, we have a tree pointing to a root that was freed (and a root pointing to freed nodes, and their children pointing to other freed nodes, and so on). This makes further access to the tree result in a use-after-free with unpredictable results. Fix this by initializing the rbtree to an empty root after the call to rbtree_postorder_for_each_entry_safe() and before unlocking. Fixes: 276940915f23 ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()") Reported-by: syzbot+ad7966ca1f5dd8b001b3@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000f9aad406223eabff@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-17btrfs: tree-checker: fix the wrong output of data backref objectidQu Wenruo
[BUG] There are some reports about invalid data backref objectids, the report looks like this: BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543 The data ref objectid is the inode number inside the subvolume. But in above case, the value is completely sane, not really showing the problem. [CAUSE] The root cause of the problem is the deprecated feature, inode cache. This feature results a special inode number, -12ULL, and it's no longer recognized by tree-checker, triggering the error. The direct problem here is the output of data ref objectid. The value shown is in fact the dref_root (subvolume id), not the dref_objectid (inode number). [FIX] Fix the output to use dref_objectid instead. Reported-by: Neil Parton <njparton@gmail.com> Reported-by: Archange <archange@archlinux.org> Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/9541deea-9056-406e-be16-a996b549614d@archlinux.org/ Fixes: f333a3c7e832 ("btrfs: tree-checker: validate dref root and objectid") CC: stable@vger.kernel.org # 6.11 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-17btrfs: fix race setting file private on concurrent lseek using same fdFilipe Manana
When doing concurrent lseek(2) system calls against the same file descriptor, using multiple threads belonging to the same process, we have a short time window where a race happens and can result in a memory leak. The race happens like this: 1) A program opens a file descriptor for a file and then spawns two threads (with the pthreads library for example), lets call them task A and task B; 2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at file.c:find_desired_extent() while holding a read lock on the inode; 3) At the start of find_desired_extent(), it extracts the file's private_data pointer into a local variable named 'private', which has a value of NULL; 4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode in shared mode and enters file.c:find_desired_extent(), where it also extracts file->private_data into its local variable 'private', which has a NULL value; 5) Because it saw a NULL file private, task A allocates a private structure and assigns to the file structure; 6) Task B also saw a NULL file private so it also allocates its own file private and then assigns it to the same file structure, since both tasks are using the same file descriptor. At this point we leak the private structure allocated by task A. Besides the memory leak, there's also the detail that both tasks end up using the same cached state record in the private structure (struct btrfs_file_private::llseek_cached_state), which can result in a use-after-free problem since one task can free it while the other is still using it (only one task took a reference count on it). Also, sharing the cached state is not a good idea since it could result in incorrect results in the future - right now it should not be a problem because it end ups being used only in extent-io-tree.c:count_range_bits() where we do range validation before using the cached state. Fix this by protecting the private assignment and check of a file while holding the inode's spinlock and keep track of the task that allocated the private, so that it's used only by that task in order to prevent user-after-free issues with the cached state record as well as potentially using it incorrectly in the future. Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-16Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - MD changes via Song: - md-bitmap refactoring (Yu Kuai) - raid5 performance optimization (Artur Paszkiewicz) - Other small fixes (Yu Kuai, Chen Ni) - Add a sysfs entry 'new_level' (Xiao Ni) - Improve information reported in /proc/mdstat (Mateusz Kusiak) - NVMe changes via Keith: - Asynchronous namespace scanning (Stuart) - TCP TLS updates (Hannes) - RDMA queue controller validation (Niklas) - Align field names to the spec (Anuj) - Metadata support validation (Puranjay) - A syntax cleanup (Shen) - Fix a Kconfig linking error (Arnd) - New queue-depth quirk (Keith) - Add missing unplug trace event (Keith) - blk-iocost fixes (Colin, Konstantin) - t10-pi modular removal and fixes (Alexey) - Fix for potential BLKSECDISCARD overflow (Alexey) - bio splitting cleanups and fixes (Christoph) - Deal with folios rather than rather than pages, speeding up how the block layer handles bigger IOs (Kundan) - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike) - Reduce zoned device overhead in ublk (Ming) - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir) - Fix regression in partition error pointer checking (Riyan) - Add support for write zeroes and rotational status in nbd (Wouter) - Add Yu Kuai as new BFQ maintainer. The scheduler has been unmaintained for quite a while. - Various sets of fixes for BFQ (Yu Kuai) - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail, Yang) * tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits) nvme-pci: qdepth 1 quirk block: fix potential invalid pointer dereference in blk_add_partition blk_iocost: make read-only static array vrate_adj_pct const block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation nvme-tcp: fix link failure for TCP auth blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() ...
2024-09-10btrfs: only unlock the to-be-submitted ranges inside a folioQu Wenruo
[SUBPAGE COMPRESSION LIMITS] Currently inside writepage_delalloc(), if a delalloc range is going to be submitted asynchronously (inline or compression, the page dirty/writeback/unlock are all handled in at different time, not at the submission time), then we return 1 and extent_writepage() will skip the submission. This is fine if every sector matches page size, but if a sector is smaller than page size (aka, subpage case), then it can be very problematic, for example for the following 64K page: 0 16K 32K 48K 64K |/| |///////| |/| | | 4K 52K Where |/| is the dirty range we need to submit. In the above case, we need the following different handling for the 3 ranges: - [0, 4K) needs to be submitted for regular write A single sector cannot be compressed. - [16K, 32K) needs to be submitted for compressed write - [48K, 52K) needs to be submitted for regular write. Above, if we try to submit [16K, 32K) for compressed write, we will return 1 and immediately, and without submitting the remaining [48K, 52K) range. Furthermore, since extent_writepage() will exit without unlocking any sectors, the submitted range [0, 4K) will not have sector unlocked. That's the reason why for now subpage is only allowed for full page range. [ENHANCEMENT] - Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap This records which sectors will be submitted by extent_writepage_io(). This allows us to track which sectors needs to be submitted thus later to be properly unlocked. For asynchronously submitted range (inline/compression), the corresponding bits will be cleared from that bitmap. - Only return 1 if no sector needs to be submitted in writepage_delalloc() - Only submit sectors marked by submission bitmap inside extent_writepage_io() So we won't touch the asynchronously submitted part. - Introduce btrfs_folio_end_writer_lock_bitmap() helper This will only unlock the involved sectors specified by @bitmap parameter, to avoid touching the range asynchronously submitted. Please note that, since subpage compression is still limited to page aligned range, this change is only a preparation for future sector perfect compression support for subpage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock()Qu Wenruo
The function btrfs_folio_unlock_writer() is already calling btrfs_folio_end_writer_lock() to do the heavy lifting work, the only missing 0 writer check. Thus there is no need to keep two different functions, move the 0 writer check into btrfs_folio_end_writer_lock(), and remove btrfs_folio_unlock_writer(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: BTRFS_PATH_AUTO_FREE in orphan.cLeo Martins
All cleanup paths lead to btrfs_path_free so path can be defined with the automatic freeing callback in the following functions: - btrfs_insert_orphan_item() - btrfs_del_orphan_item() Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: use btrfs_path auto free in zoned.cLeo Martins
All cleanup paths lead to btrfs_path_free so path can be defined with the automatic freeing callback in the following functions: - calculate_emulated_zone_size() - calculate_alloc_pointer() Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: DEFINE_FREE for struct btrfs_pathLeo Martins
Add a DEFINE_FREE for struct btrfs_path. This defines a function that can be called using the __free attribute. Define a macro BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path very clear. The intended use is to define the auto free of path in cases where the path is allocated somewhere at the beginning and freed either on all error paths or at the end of the function. int func() { BTRFS_PATH_AUTO_FREE(path); if (...) return -ERROR; path = alloc_path(); ... if (...) return -ERROR; ... return 0; } Signed-off-by: Leo Martins <loemra.dev@gmail.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: remove btrfs_folio_end_all_writers()Qu Wenruo
The function btrfs_folio_end_all_writers() is only utilized in extent_writepage() as a way to unlock all subpage range (for both successful submission and error handling). Meanwhile we have a similar function, btrfs_folio_end_writer_lock(). The difference is, btrfs_folio_end_writer_lock() expects a range that is a subset of the already locked range. This limit on btrfs_folio_end_writer_lock() is a little overkilled, preventing it from being utilized for error paths. So here we enhance btrfs_folio_end_writer_lock() to accept a superset of the locked range, and only end the locked subset. This means we can replace btrfs_folio_end_all_writers() with btrfs_folio_end_writer_lock() instead. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: constify more pointer parametersDavid Sterba
Continue adding const to parameters. This is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rework BTRFS_I as macro to preserve parameter constDavid Sterba
Currently BTRFS_I is a static inline function that takes a const inode and returns btrfs inode, dropping the 'const' qualifier. This can break assumptions of compiler though it seems there's no real case. To make the parameter and return type consistent regardint const we can use the container_of_const() that preserves it. However this would not check the parameter type. To fix that use the same _Generic construct but implement only the two expected types. Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: add and use helper to verify the calling task has locked the inodeFilipe Manana
We have a few places that check if we have the inode locked by doing: ASSERT(inode_is_locked(vfs_inode)); This actually proved to be useful several times as if assertions are enabled (and by default they are in many distros) it immediately triggers a crash which is impossible for users to miss. However that doesn't check if the lock is held by the calling task, so the check passes if some other task locked the inode. Using one of the lockdep functions to check the lock is held, like lockdep_assert_held() for example, does check that the calling task holds the lock, and if that's not the case it produces a warning and stack trace in dmesg. However, despite the misleading "assert" in the name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just a warning and splat in dmesg, which is easy to get unnoticed by users who may have lockdep enabled. So add a helper that does the ASSERT() and calls lockdep_assert_held() immediately after and use it every where we check the inode is locked. Like this if the lock is held by some other task we get the warning in dmesg which is caught by fstests, very helpful during development, and may also be occassionaly noticed by users with lockdep enabled. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: always update fstrim_range on failure in FITRIM ioctlLuca Stefani
Even in case of failure we could've discarded some data and userspace should be made aware of it, so copy fstrim_range to userspace regardless. Also make sure to update the trimmed bytes amount even if btrfs_trim_free_extents fails. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert copy_inline_to_page() to use folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover find_or_create_page() is compatible API, and it can replaced with __filemap_get_folio(). Some interfaces have been converted to use folio before, so the conversion operation from page can be eliminated here. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert btrfs_decompress() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Based on the previous patch, the compression path can be directly used in folio without converting to page. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert zstd_decompress() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert lzo_decompress() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert zlib_decompress() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert try_release_extent_mapping() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And page_to_inode() can be replaced with folio_to_inode() now. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert try_release_extent_state() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover, use folio_pos() instead of page_offset(), which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert submit_eb_page() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert submit_eb_subpage() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover, use folio_pos() instead of page_offset(), which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert read_key_bytes() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(), which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert try_release_extent_buffer() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert try_release_subpage_extent_buffer() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And use folio_pos instead of page_offset, which is more consistent with folio usage. At the same time, folio_test_private() can handle folio directly without converting from page to folio first. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert get_next_extent_buffer() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Use folio_pos instead of page_offset, which is more consistent with folio usage. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert clear_page_extent_mapped() to take a folioLi Zetao
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Now clear_page_extent_mapped() can deal with a folio directly, so change its name to clear_folio_extent_mapped(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: make compression path to be subpage compatibleQu Wenruo
Currently btrfs compression path is not really subpage compatible, every thing is still done in page unit. That's fine for regular sector size and subpage routine. As even for subpage routine compression is only enabled if the whole range is page aligned, so reading the page cache in page unit is totally fine. However in preparation for the future subpage perfect compression support, we need to change the compression routine to properly handle a subpage range. This patch would prepare both zlib and zstd to only read the subpage range for compression. Lzo is already doing subpage aware read, as lzo's on-disk format is already sectorsize dependent. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io()Qu Wenruo
There are only two differences between the two functions: - btrfs_orig_bbio_end_io() does extra error propagation This is mostly to allow tolerance for write errors. - btrfs_orig_bbio_end_io() does extra pending_ios check This check can handle both the original bio, or the cloned one. (All accounting happens in the original one). This makes btrfs_orig_bbio_end_io() a much safer call. In fact we already had a double freeing error due to usage of btrfs_bio_end_io() in the error path of btrfs_submit_chunk(). So just move the whole content of btrfs_orig_bbio_end_io() into btrfs_bio_end_io(). For normal paths this brings no change, because they are already calling btrfs_orig_bbio_end_io() in the first place. For error paths (not only inside bio.c but also external callers), this change will introduce extra checks, especially for external callers, as they will error out without submitting the btrfs bio. But considering it's already in the error path, such slower but much safer checks are still an overall win. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: do not hold the extent lock for entire readJosef Bacik
Historically we've held the extent lock throughout the entire read. There's been a few reasons for this, but it's mostly just caused us problems. For example, this prevents us from allowing page faults during direct io reads, because we could deadlock. This has forced us to only allow 4k reads at a time for io_uring NOWAIT requests because we have no idea if we'll be forced to page fault and thus have to do a whole lot of work. On the buffered side we are protected by the page lock, as long as we're reading things like buffered writes, punch hole, and even direct IO to a certain degree will get hung up on the page lock while the page is in flight. On the direct side we have the dio extent lock, which acts much like the way the extent lock worked previously to this patch, however just for direct reads. This protects direct reads from concurrent direct writes, while we're protected from buffered writes via the inode lock. Now that we're protected in all cases, narrow the extent lock to the part where we're getting the extent map to submit the reads, no longer holding the extent lock for the entire read operation. Push the extent lock down into do_readpage() so that we're only grabbing it when looking up the extent map. This portion was contributed by Goldwyn. Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: take the dio extent lock during O_DIRECT operationsJosef Bacik
Currently we hold the extent lock for the entire duration of a read. This isn't really necessary in the buffered case, we're protected by the page lock, however it's necessary for O_DIRECT. For O_DIRECT reads, if we only locked the extent for the part where we get the extent, we could potentially race with an O_DIRECT write in the same region. This isn't really a problem, unless the read is delayed so much that the write does the COW, unpins the old extent, and some other application re-allocates the extent before the read is actually able to be submitted. At that point at best we'd have a checksum mismatch, but at worse we could read data that doesn't belong to us. To address this potential race we need to make sure we don't have overlapping, concurrent direct io reads and writes. To accomplish this use the new EXTENT_DIO_LOCKED bit in the direct IO case in the same spot as the current extent lock. The writes will take this while they're creating the ordered extent, which is also used to make sure concurrent buffered reads or concurrent direct reads are not allowed to occur, and drop it after the ordered extent is taken. For reads it will act as the current read behavior for the EXTENT_LOCKED bit, we set it when we're starting the read, we clear it in the end_io to allow other direct writes to continue. This still has the drawback of disallowing concurrent overlapping direct reads from occurring, but that exists with the current extent locking. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: introduce EXTENT_DIO_LOCKEDJosef Bacik
In order to support dropping the extent lock during a read we need a way to make sure that direct reads and direct writes for overlapping ranges are protected from each other. To accomplish this introduce another lock bit specifically for direct io. Subsequent patches will utilize this to protect direct IO operations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: always pass readahead state to defragDavid Sterba
Defrag ioctl passes readahead from the file, but autodefrag does not have a file so the readahead state is allocated when needed. The autodefrag loop in cleaner thread iterates over inodes so we can simply provide an on-stack readahead state and will not need to allocate it in btrfs_defrag_file(). The size is 32 bytes which is acceptable. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: drop transaction parameter from btrfs_add_inode_defrag()David Sterba
There's only one caller inode_should_defrag() that passes NULL to btrfs_add_inode_defrag() so we can drop it an simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: return void from btrfs_add_inode_defrag()David Sterba
The potential memory allocation failure is not a fatal error, skipping autodefrag is fine and the caller inode_should_defrag() does not care about the errors. Further writes can attempt to add the inode back to the defragmentation list again. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: clear defragmented inodes using postorder in ↵David Sterba
btrfs_cleanup_defrag_inodes() btrfs_cleanup_defrag_inodes() is not called frequently, only in remount or unmount, but the way it frees the inodes in fs_info->defrag_inodes is inefficient. Each time it needs to locate first node, remove it, potentially rebalance tree until it's done. This allows to do a conditional reschedule. For cleanups the rbtree_postorder_for_each_entry_safe() iterator is convenient but we can't reschedule and restart iteration because some of the tree nodes would be already freed. The cleanup operation is kmem_cache_free() which will likely take the fast path for most objects so rescheduling should not be necessary. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename __btrfs_run_defrag_inode() and drop double underscoresDavid Sterba
The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename __btrfs_add_inode_defrag() and drop double underscoresDavid Sterba
The function does not follow the pattern where the underscores would be justified, so rename it. Also update the misleading comment, the passed item is not freed, that's what the caller does. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename __need_auto_defrag() and drop double underscoresDavid Sterba
The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: constify arguments of compare_inode_defrag()David Sterba
A comparator function does not change its parameters, make them const. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename __compare_inode_defrag() and drop double underscoresDavid Sterba
The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename __extent_writepage() and drop double underscoresDavid Sterba
The function does not follow the pattern where the underscores would be justified, so rename it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename __btrfs_submit_bio() and drop double underscoresDavid Sterba
Previous patch freed the function name btrfs_submit_bio() so we can use it for a helper that submits struct bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()David Sterba
The function name is a bit misleading as it submits the btrfs_bio (bbio), rename it so we can use btrfs_submit_bio() when an actual bio is submitted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: subpage: remove btrfs_fs_info::subpage_info memberQu Wenruo
The member btrfs_fs_info::subpage_info stores the cached bitmap start position inside the merged bitmap. However in reality there is only one thing depending on the sectorsize, bitmap_nr_bits, which records the number of sectors that fit inside a page. The sequence of sub-bitmaps have fixed order, thus it's just a quick multiplication to calculate the start position of each sub-bitmaps. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: remove the nr_ret parameter from __extent_writepage_io()Qu Wenruo
The parameter @nr_ret is used to tell the caller how many sectors have been submitted for IO. Then callers check @nr_ret value to determine if we need to manually clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all sectors are beyond i_size) there is no folio_start_writeback() called thus PAGECACHE_TAG_DIRTY tag will not be cleared. Remove this parameter by: - Moving the btrfs_folio_clear_writeback() call into __extent_writepage_io() So that if we didn't submit any IO, then manually call btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when the page is no longer dirty. - Use a bool to record if we have submitted any sector Instead of an int. - Use subpage compatible helpers to end folio writeback. This brings no change to the behavior, just for the sake of consistency. As for the call site inside __extent_writepage(), we're always called for the whole page, so the existing full page helper folio_(start|end)_writeback() is totally fine. For the call site inside extent_write_locked_range(), although we can have subpage range, folio_start_writeback() will only clear PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio will still be dirty if there is any subpage dirty range. Only when the last dirty subpage sector is cleared, the folio_start_writeback() will clear PAGECACHE_TAG_DIRTY. So no matter if we call the full page or subpage helper, the result is still the same, then just use the subpage helpers for consistency. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: send: fix grammar in commentsThorsten Blum
Fix a few obvious grammar mistakes: a -> an, then -> than. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>