summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2024-07-11btrfs: export the expected file extent through can_nocow_extent()Qu Wenruo
Currently function can_nocow_extent() only returns members needed for extent_map. However since we will soon change the extent_map structure to be more like btrfs_file_extent_item, we want to expose the expected file extent caused by the NOCOW write for future usage. This introduces a new structure, btrfs_file_extent, to be a more memory access friendly representation of btrfs_file_extent_item. And use that structure to expose the expected file extent caused by the NOCOW write. For now there is no user of the new structure yet. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: rename extent_map::orig_block_len to disk_num_bytesQu Wenruo
This would make it very obvious that the member just matches btrfs_file_extent_item::disk_num_bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: move fiemap code into its own fileFilipe Manana
Currently the core of the fiemap code lives in extent_io.c, which does not make any sense because it's not related to extent IO at all (and it was not as well before the big rewrite of fiemap I did some time ago). The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's an inode operation. Since there's a significant amount of fiemap code, move all of it into a dedicated file, including its entry point inode.c:btrfs_fiemap(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate()Filipe Manana
Now that there is a helper to commit the current transaction and we are using it, there's no need for the label and goto statements at ensure_commit_roots_uptodate(). So replace them with direct return statements that call btrfs_commit_current_transaction(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: add and use helper to commit the current transactionFilipe Manana
We have several places that attach to the current transaction with btrfs_attach_transaction_barrier() and then commit the transaction if there is one. Add a helper and use it to deduplicate this pattern. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: scrub: avoid create/commit empty transaction at ↵Filipe Manana
finish_extent_writes_for_zoned() At finish_extent_writes_for_zoned() we use btrfs_join_transaction() to catch any running transaction and then commit it. This will however create a new and empty transaction in case there's no running transaction anymore (got committed by the transaction kthread or other task for example) or there's a running transaction finishing its commit and with a state >= TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything while in the second case we just need to wait for the transaction to complete its commit. So improve this by using btrfs_attach_transaction_barrier() instead, which does not create a new transaction if there's none running, and if there's a current transaction that is committing, it will wait for it to fully commit and not create a new transaction. This helps avoiding creating and committing empty transactions, saving IO, time and unnecessary rotation of the backup roots in the super block. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: send: avoid create/commit empty transaction at ↵Filipe Manana
ensure_commit_roots_uptodate() At ensure_commit_roots_uptodate() we use btrfs_join_transaction() to catch any running transaction and then commit it. This will however create a new and empty transaction in case there's no running transaction anymore (got committed by the transaction kthread or other task for example) or there's a running transaction finishing its commit and with a state >= TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything while in the second case we just need to wait for the transaction to complete its commit. So improve this by using btrfs_attach_transaction_barrier() instead, which does not create a new transaction if there's none running, and if there's a current transaction that is committing, it will wait for it to fully commit and not create a new transaction. This helps avoiding creating and committing empty transactions, saving IO, time and unnecessary rotation of the backup roots in the super block. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficientFilipe Manana
Before starting a send operation we have to make sure that every root has its commit root matching the regular root, to that send doesn't find stale inodes in the commit root (inodes that were deleted in the regular root) and fails the inode lookups with -ESTALE. Currently we keep looking for roots used by the send operation and as soon as we find one we commit the current transaction (or a new one since btrfs_join_transaction() creates one if there isn't any running or the running one is in a state >= TRANS_STATE_UNBLOCKED). It's pointless to keep looking until we don't find any, because after the first transaction commit all the other roots are updated too, as they were already tagged in the fs_info->fs_roots_radix radix tree when they were modified in order to have a commit root different from the regular root. Currently we are also always passing the main send root into btrfs_join_transaction(), which despite not having any functional issue, it is not optimal because in case the root wasn't modified we end up adding it to fs_info->fs_roots_radix and then update its root item in the root tree when committing the transaction, causing unnecessary work. So simplify and make this more efficient by removing the looping and by passing the first root we found that is modified as the argument to btrfs_join_transaction(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: avoid create and commit empty transaction when committing superFilipe Manana
At btrfs_commit_super(), called in a few contexts such as when unmounting a filesystem, we use btrfs_join_transaction() to catch any running transaction and then commit it. This will however create a new and empty transaction in case there's no running transaction or there's a running transaction with a state >= TRANS_STATE_UNBLOCKED. As we just want to be sure that any existing transaction is fully committed, we can use btrfs_attach_transaction_barrier() instead of btrfs_join_transaction(), therefore avoiding the creation and commit of empty transactions, which only waste IO and causes rotation of the precious backup roots. Example where we create and commit a pointless empty transaction: $ mkfs.btrfs -f /dev/sdj $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation' generation 6 $ mount /dev/sdj /mnt/sdj $ touch /mnt/sdj/foo # Commit the currently open transaction. Just 'sync' or wait ~30 # seconds for the transaction kthread to commit it. $ sync $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation' generation 7 $ umount /mnt/sdj $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation' generation 8 The transaction with id 8 was pointless, an empty transaction that did not achieve anything. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: qgroup: avoid start/commit empty transaction when flushing reservationsFilipe Manana
When flushing reservations we are using btrfs_join_transaction() to get a handle for the current transaction and then commit it to try to release space. However btrfs_join_transaction() has some undesirable consequences: 1) If there's no running transaction, it will create one, and we will commit it right after. This is unnecessary because it will not release any space, and it will result in unnecessary IO and rotation of backup roots in the superblock; 2) If there's a current transaction and that transaction is committing (its state is >= TRANS_STATE_COMMIT_DOING), it will wait for that transaction to almost finish its commit (for its state to be >= TRANS_STATE_UNBLOCKED) and then start and return a new transaction. We will then commit that new transaction, which is pointless because all we wanted was to wait for the current (previous) transaction to fully finish its commit (state == TRANS_STATE_COMPLETED), and by starting and committing a new transaction we are wasting IO too and causing unnecessary rotation of backup roots in the superblock. So improve this by using btrfs_attach_transaction_barrier() instead, which does not create a new transaction if there's none running, and if there's a current transaction that is committing, it will wait for it to fully commit and not create a new transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: simplify range parameters of btrfs_wait_ordered_roots()David Sterba
The range is specified only in two ways, we can simplify the case for the whole filesystem range as a NULL block group parameter. Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: automatically remove the subvolume qgroupQu Wenruo
Currently if we fully clean a subvolume (not only delete its directory, but fully clean all it's related data and root item), the associated qgroup would not be removed. We have "btrfs qgroup clear-stale" to handle such 0 level qgroups. Change the behavior to automatically removie the qgroup of a fully cleaned subvolume when possible: - Full qgroup but still consistent We can and should remove the qgroup. The qgroup numbers should be 0, without any rsv. - Full qgroup but inconsistent Can happen with drop_subtree_threshold feature (skip accounting and mark qgroup inconsistent). We can and should remove the qgroup. Higher level qgroup numbers will be incorrect, but since qgroup is already inconsistent, it should not be a problem. - Squota mode This is the special case, we can only drop the qgroup if its numbers are all 0. This would be handled by can_delete_qgroup(), so we only need to check the return value and ignore the -EBUSY error. Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: slightly loosen the requirement for qgroup removalQu Wenruo
[BUG] Currently if one is utilizing "qgroups/drop_subtree_threshold" sysfs, and a snapshot with level higher than that value is dropped, we will not be able to delete the qgroup until next qgroup rescan: uuid=ffffffff-eeee-dddd-cccc-000000000000 wipefs -fa $dev mkfs.btrfs -f $dev -O quota -s 4k -n 4k -U $uuid mount $dev $mnt btrfs subvolume create $mnt/subv1/ for (( i = 0; i < 1024; i++ )); do xfs_io -f -c "pwrite 0 2k" $mnt/subv1/file_$i > /dev/null done sync btrfs subvolume snapshot $mnt/subv1 $mnt/snapshot btrfs quota enable $mnt btrfs quota rescan -w $mnt sync echo 1 > /sys/fs/btrfs/$uuid/qgroups/drop_subtree_threshold btrfs subvolume delete $mnt/snapshot btrfs subvolume sync $mnt btrfs qgroup show -prce --sync $mnt btrfs qgroup destroy 0/257 $mnt umount $mnt The final qgroup removal would fail with the following error: ERROR: unable to destroy quota group: Device or resource busy [CAUSE] The above script would generate a subvolume of level 2, then snapshot it, enable qgroup, set the drop_subtree_threshold, then drop the snapshot. Since the subvolume drop would meet the threshold, qgroup would be marked inconsistent and skip accounting to avoid hanging the system at transaction commit. But currently we do not allow a qgroup with any rfer/excl numbers to be dropped, and this is not really compatible with the new drop_subtree_threshold behavior. [FIX] Only require the strict zero rfer/excl/rfer_cmpr/excl_cmpr for squota mode. This is due to the fact that squota can never go inconsistent, and it can have dropped subvolume but with non-zero qgroup numbers for future accounting. For full qgroup mode, we only check if there is a subvolume for it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: constify parameters of write_eb_member() and its usersDavid Sterba
Reported by 'gcc -Wcast-qual', the argument from which write_extent_buffer() reads data to write to the eb should be const. In addition the const needs to be also added to __write_extent_buffer() local buffers. All callers of write_eb_member() can now be updated to use const for the input buffer structure or type. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: keep const when returning value from get_unaligned_le8()David Sterba
This was reported by 'gcc -Wcast-qual', the get_unaligned_le8() simply returns the argument and there's no reason to drop the cast. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove unused define EXTENT_SIZE_PER_ITEMDavid Sterba
This was added in c61a16a701a126 ("Btrfs: fix the confusion between delalloc bytes and metadata bytes") and removed in 03fe78cc2942c5 ("btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc") where the calculation was reworked to use a non-constant numbers. This was found by 'make W=2'. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use for-local variables that shadow function variablesDavid Sterba
We've started to use for-loop local variables and in a few places this shadows a function variable. Convert a few cases reported by 'make W=2'. If applicable also change the style to post-increment, that's the preferred one. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: rename macro local variables that clash with other variablesDavid Sterba
Fix variable names in two macros where there's a local function variable of the same name. In subpage_calc_start_bit() it's in several callers, in btrfs_abort_transaction() it's only in replace_file_extents(). Found by 'make W=2'. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove duplicate name variable declarationsDavid Sterba
When running 'make W=2' there are a few reports where a variable of the same name is declared in a nested block. In all the cases we can use the one declared in the parent block, no problematic cases were found. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use a btrfs_inode local variable at btrfs_sync_file()Filipe Manana
Instead of using a VFS inode local pointer and then doing many BTRFS_I() calls inside btrfs_sync_file(), use a btrfs_inode pointer instead. This makes everything a bit easier to read and less confusing, allowing to make some statements shorter. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()Filipe Manana
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later allow to help to remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()Filipe Manana
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later allow to help to remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx)Filipe Manana
Instead of using a inode pointer, use a btrfs_inode pointer in the log context structure, as this is generally what we need and allows for some internal APIs to take a btrfs_inode instead, making them more consistent with most of the code base. This will later allow to help to remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: make btrfs_finish_ordered_extent() return voidFilipe Manana
Currently btrfs_finish_ordered_extent() returns a boolean indicating if the ordered extent was added to the work queue for completion, but none of its callers cares about it, so make it return void. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: move btrfs_block_group_root() to block-group.cAnand Jain
The function btrfs_block_group_root() is declared in disk-io.c; however, all its callers are in block-group.c. Move it to the latter file and declare it static. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: drop bytenr_orig and fix comment in btrfs_scan_one_device()Anand Jain
Drop the single-use variable bytenr_orig and instead use btrfs_sb_offset() in the function argument passing. Fix a stale comment about not automatically fixing a bad primary superblock from the backup mirror copies. Also, move the comment closer to where the primary superblock read occurs. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use a regular rb_root instead of cached rb_root for extent_map_treeFilipe Manana
We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: rename rb_root member of extent_map_tree from map to rootFilipe Manana
Currently we name the rb_root member of struct extent_map_tree as 'map', which is odd and confusing. Since it's a root node, rename it to 'root'. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove objectid from struct btrfs_inode on 64 bits platformsFilipe Manana
On 64 bits platforms we don't really need to have a dedicated member (the objectid field) for the inode's number since we store in the VFS inode's i_ino member, which is an unsigned long and this type is 64 bits wide on 64 bits platforms. We only need that field in case we are on a 32 bits platform because the unsigned long type is 32 bits wide on such platforms See commit 33345d01522f ("Btrfs: Always use 64bit inode number") regarding this 64/32 bits detail. The objectid field of struct btrfs_inode is also used to store the ID of a root for directories that are stubs for unreferenced roots. In such cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime flag set. So in order to reduce the size of btrfs_inode structure on 64 bits platforms we can remove the objectid member and use the VFS inode's i_ino member instead whenever we need to get the inode number. In case the inode is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member last_reflink_trans to store the ID of the unreferenced root, since such inode is a directory and reflinks can't be done against directories. So remove the objectid fields for 64 bits platforms and alias the last_reflink_trans field with a name of ref_root_id in a union. On a release kernel config, this reduces the size of struct btrfs_inode from 1040 bytes down to 1032 bytes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove location key from struct btrfs_inodeFilipe Manana
Currently struct btrfs_inode has a key member, named "location", that is either: 1) The key of the inode's item. In this case the objectid is the number of the inode; 2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for the case where we have a root that is a snapshot of a subvolume that points to other subvolumes. In this case the objectid is the ID of a subvolume inside the snapshotted parent subvolume. The key is only used to lookup the inode item for the first case, while for the second it's never used since it corresponds to directory stubs created with new_simple_dir() and which are marked as dummy, so there's no actual inode item to ever update. In the second case we only check the key type at btrfs_ino() for 32 bits platforms and its objectid is only needed for unlink. Instead of using a key we can do fine with just the objectid, since we can generate the key whenever we need it having only the objectid, as in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset is always 0. So use only an objectid instead of a full key. This reduces the size of struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: don't allocate file extent tree for non regular filesFilipe Manana
When not using the NO_HOLES feature we always allocate an io tree for an inode's file_extent_tree. This is wasteful because that io tree is only used for regular files, so we allocate more memory than needed for inodes that represent directories or symlinks for example, or for inodes that correspond to free space inodes. So improve on this by allocating the io tree only for inodes of regular files that are not free space inodes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: unify index_cnt and csum_bytes from struct btrfs_inodeFilipe Manana
The index_cnt field of struct btrfs_inode is used only for two purposes: 1) To store the index for the next entry added to a directory; 2) For the data relocation inode to track the logical start address of the block group currently being relocated. For the relocation case we use index_cnt because it's not used for anything else in the relocation use case - we could have used other fields that are not used by relocation such as defrag_bytes, last_unlink_trans or last_reflink_trans for example (among others). Since the csum_bytes field is not used for directories, do the following changes: 1) Put index_cnt and csum_bytes in a union, and index_cnt is only initialized when the inode is a directory. The csum_bytes is only accessed in IO paths for regular files, so we're fine here; 2) Use the defrag_bytes field for relocation, since the data relocation inode is never used for defrag purposes. And to make the naming better, alias it to reloc_block_group_start by using a union. This reduces the size of struct btrfs_inode by 8 bytes in a release kernel, from 1056 bytes down to 1048 bytes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove inode_lock from struct btrfs_root and use xarray locksFilipe Manana
Currently we use the spinlock inode_lock from struct btrfs_root to serialize access to two different data structures: 1) The delayed inodes xarray (struct btrfs_root::delayed_nodes); 2) The inodes xarray (struct btrfs_root::inodes). Instead of using our own lock, we can use the spinlock that is part of the xarray implementation, by using the xa_lock() and xa_unlock() APIs and using the xarray APIs with the double underscore prefix that don't take the xarray locks and assume the caller is using xa_lock() and xa_unlock(). So remove the spinlock inode_lock from struct btrfs_root and use the corresponding xarray locks. This brings 2 benefits: 1) We reduce the size of struct btrfs_root, from 1336 bytes down to 1328 bytes on a 64 bits release kernel config; 2) We reduce lock contention by not using anymore the same lock for changing two different and unrelated xarrays. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()Filipe Manana
Make btrfs_iget_path() simpler and easier to read by avoiding nesting of if-then-else statements and having an error label to do all the error handling instead of repeating it a couple times. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: preallocate inodes xarray entry to avoid transaction abortFilipe Manana
When creating a new inode, at btrfs_create_new_inode(), one of the very last steps is to add the inode to the root's inodes xarray. This often requires allocating memory which may fail (even though xarrays have a dedicated kmem_cache which make it less likely to fail), and at that point we are forced to abort the current transaction (as some, but not all, of the inode metadata was added to its subvolume btree). To avoid a transaction abort, preallocate memory for the xarray early at btrfs_create_new_inode(), so that if we fail we don't need to abort the transaction and the insertion into the xarray is guaranteed to succeed. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use an xarray to track open inodes in a rootFilipe Manana
Currently we use a red black tree (rb-tree) to track the currently open inodes of a root (in struct btrfs_root::inode_tree). This however is not very efficient when the number of inodes is large since rb-trees are binary trees. For example for 100K open inodes, the tree has a depth of 17. Besides that, inserting into the tree requires navigating through it and pulling useless cache lines in the process since the red black tree nodes are embedded within the btrfs inode - on the other hand, by being embedded, it requires no extra memory allocations. We can improve this by using an xarray instead, which is efficient when indices are densely clustered (such as inode numbers), is more cache friendly and behaves like a resizable array, with a much better search and insertion complexity than a red black tree. This only has one small disadvantage which is that insertion will sometimes require allocating memory for the xarray - which may fail (not that often since it uses a kmem_cache) - but on the other hand we can reduce the btrfs inode structure size by 24 bytes (from 1080 down to 1056 bytes) after removing the embedded red black tree node, which after the next patches will allow to reduce the size of the structure to 1024 bytes, meaning we will be able to store 4 inodes per 4K page instead of 3 inodes. This change does a straightforward change to use an xarray, and results in a transaction abort if we can't allocate memory for the xarray when creating an inode - but the next patch changes things so that we don't need to abort. Running the following fs_mark test showed some improvements: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 MOUNT_OPTIONS="-o ssd" FILES=100000 THREADS=$(nproc --all) echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this patch: FSUse% Count Size Files/sec App Overhead 10 1200000 0 92081.6 12505547 16 2400000 0 138222.6 13067072 23 3600000 0 148833.1 13290336 43 4800000 0 97864.7 13931248 53 6000000 0 85597.3 14384313 After this patch: FSUse% Count Size Files/sec App Overhead 10 1200000 0 93225.1 12571078 16 2400000 0 146720.3 12805007 23 3600000 0 160626.4 13073835 46 4800000 0 116286.2 13802927 53 6000000 0 90087.9 14754892 The test was run with a release kernel config (Debian's default config). Also capturing the insertion times into the rb tree and into the xarray, that is measuring the duration of the old function inode_tree_add() and the duration of the new btrfs_add_inode_to_root() function, gave the following results (in nanoseconds): Before this patch, inode_tree_add() execution times: Count: 5000000 Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961 Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000 0.000 - 7.816: 40 | 7.816 - 37.858: 209 | 37.858 - 170.278: 6059 | 170.278 - 753.961: 2754890 ##################################################### 753.961 - 3326.728: 2232312 ########################################### 3326.728 - 14667.018: 4366 | 14667.018 - 64652.943: 852 | 64652.943 - 284981.761: 550 | 284981.761 - 1256150.914: 221 | 1256150.914 - 5536887.000: 7 | After this patch, btrfs_add_inode_to_root() execution times: Count: 5000000 Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369 Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000 0.000 - 7.264: 104 | 7.264 - 33.145: 352 | 33.145 - 140.081: 109606 # 140.081 - 581.930: 4840090 ##################################################### 581.930 - 2407.590: 43532 | 2407.590 - 9950.979: 2245 | 9950.979 - 41119.278: 514 | 41119.278 - 169902.616: 155 | 169902.616 - 702018.539: 47 | 702018.539 - 2900652.000: 9 | Average, percentiles, standard deviation, etc, are all much better. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: raid56: do extra dumping for CONFIG_BTRFS_ASSERTQu Wenruo
There are several hard-to-hit ASSERT()s hit inside raid56. Unfortunately the ASSERT() expression is a little complex, and except the ASSERT(), there is nothing to provide any clue. Considering if race is involved, it's pretty hard to reproduce. Meanwhile sometimes the dump of the rbio structure can provide some pretty good clues, it's worth to do the extra multi-line dump for btrfs raid56 related code. The dump looks like this: BTRFS critical (device dm-3): bioc logical=4598530048 full_stripe=4598530048 size=0 map_type=0x81 mirror=0 replace_nr_stripes=0 replace_stripe_src=-1 num_stripes=5 BTRFS critical (device dm-3): nr=0 devid=1 physical=1166147584 BTRFS critical (device dm-3): nr=1 devid=2 physical=1145176064 BTRFS critical (device dm-3): nr=2 devid=4 physical=1145176064 BTRFS critical (device dm-3): nr=3 devid=5 physical=1145176064 BTRFS critical (device dm-3): nr=4 devid=3 physical=1145176064 BTRFS critical (device dm-3): rbio flags=0x0 nr_sectors=80 nr_data=4 real_stripes=5 stripe_nsectors=16 scrubp=0 dbitmap=0x0 BTRFS critical (device dm-3): logical=4598530048 assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1702 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: fix function name in comment for btrfs_remove_ordered_extent()Filipe Manana
Due to a refactoring introduced by commit 53d9981ca20e ("btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpers"), the function btrfs_alloc_ordered_extent() was renamed to alloc_ordered_extent(), so the comment at btrfs_remove_ordered_extent() is no longer very accurate. Update the comment to refer to the new name "alloc_ordered_extent()". Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: fix misspelled end IO compression callbacksFilipe Manana
Fix typo in the end IO compression callbacks, from "comprssed" to "compressed". Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove no longer used btrfs_migrate_to_delayed_refs_rsv()Filipe Manana
The function btrfs_migrate_to_delayed_refs_rsv() is no longer used. Its last use was removed in commit 2f6397e448e6 ("btrfs: don't refill whole delayed refs block reserve when starting transaction"). So remove the function. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: zoned: make btrfs_get_dev_zone() staticFilipe Manana
It's not used outside zoned.c, so make it static. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass struct btrfs_io_geometry into handle_ops_on_dev_replace()Johannes Thumshirn
Passing in a 'struct btrfs_io_geometry into handle_ops_on_dev_replace can reduce the number of arguments by two. No functional changes otherwise. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: qgroup: do quick checks if quotas are enabled before starting ioctlsDavid Sterba
The ioctls that add relations, create qgroups or set limits start/join transaction. When quotas are not enabled this is not necessary, there will be errors reported back anyway but this could be also misleading and we should really report that quotas are not enabled. For that use -ENOTCONN. The helper is meant to do a quick check before any other standard ioctl checks are done. If quota is disabled meanwhile we still rely on proper locking inside any active operation changing the qgroup structures. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-04Merge tag 'for-6.10-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix folio refcounting when releasing them (encoded write, dummy extent buffer) - fix out of bounds read when checking qgroup inherit data - fix how configurable chunk size is handled in zoned mode - in the ref-verify tool, fix uninitialized return value when checking extent owner ref and simple quota are not enabled * tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix folio refcount in __alloc_dummy_extent_buffer() btrfs: fix folio refcount in btrfs_do_encoded_write() btrfs: fix uninitialized return value in the ref-verify tool btrfs: always do the basic checks for btrfs_qgroup_inherit structure btrfs: zoned: fix calc_available_free_space() for zoned mode
2024-07-03readahead: drop index argument of page_cache_async_readahead()Jan Kara
The index argument of page_cache_async_readahead() is just folio->index so there's no point in passing is separately. Drop it. Link: https://lkml.kernel.org/r/20240625101909.12234-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Zhang Peng <zhangpengpeng0808@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04btrfs: fix folio refcount in __alloc_dummy_extent_buffer()Boris Burkov
Another improper use of __folio_put() in an error path after freshly allocating pages/folios which returns them with the refcount initialized to 1. The refactor from __free_pages() -> __folio_put() (instead of folio_put) removed a refcount decrement found in __free_pages() and folio_put but absent from __folio_put(). Fixes: 13df3775efca ("btrfs: cleanup metadata page pointer usage") CC: stable@vger.kernel.org # 6.8+ Tested-by: Ed Tomlinson <edtoml@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-04btrfs: fix folio refcount in btrfs_do_encoded_write()Boris Burkov
The conversion to folios switched __free_page() to __folio_put() in the error path in btrfs_do_encoded_write(). However, this gets the page refcounting wrong. If we do hit that error path (I reproduced by modifying btrfs_do_encoded_write to pretend to always fail in a way that jumps to out_folios and running the fstests case btrfs/281), then we always hit the following BUG freeing the folio: BUG: Bad page state in process btrfs pfn:40ab0b page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff) raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: nonzero _refcount Call Trace: <TASK> dump_stack_lvl+0x3d/0xe0 bad_page+0xea/0xf0 free_unref_page+0x8e1/0x900 ? __mem_cgroup_uncharge+0x69/0x90 __folio_put+0xe6/0x190 btrfs_do_encoded_write+0x445/0x780 ? current_time+0x25/0xd0 btrfs_do_write_iter+0x2cc/0x4b0 btrfs_ioctl_encoded_write+0x2b6/0x340 It turns out __free_page() decreases the page reference count while __folio_put() does not. Switch __folio_put() to folio_put() which decreases the folio reference count first. Fixes: 400b172b8cdc ("btrfs: compression: migrate compression/decompression paths to folios") Tested-by: Ed Tomlinson <edtoml@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02btrfs: fix uninitialized return value in the ref-verify toolFilipe Manana
In the ref-verify tool, when processing the inline references of an extent item, we may end up returning with uninitialized return value, because: 1) The 'ret' variable is not initialized if there are no inline extent references ('ptr' == 'end' before the while loop starts); 2) If we find an extent owner inline reference we don't initialize 'ret'. So fix these cases by initializing 'ret' to 0 when declaring the variable and set it to -EINVAL if we find an extent owner inline references and simple quotas are not enabled (as well as print an error message). Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com> Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02btrfs: always do the basic checks for btrfs_qgroup_inherit structureQu Wenruo
[BUG] Syzbot reports the following regression detected by KASAN: BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171 CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922 btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382 create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875 btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340 btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422 btrfs_ioctl+0x99e/0xc60 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fcbf1992509 RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509 RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660 R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e </TASK> And it also pinned it down to commit b5357cb268c4 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled"). [CAUSE] That offending commit skips the whole qgroup inherit check if qgroup is not enabled. But that also skips the very basic checks like num_ref_copies/num_excl_copies and the structure size checks. Meaning if a qgroup enable/disable race is happening at the background, and we pass a btrfs_qgroup_inherit structure when the qgroup is disabled, the check would be completely skipped. Then at the time of transaction commitment, qgroup is re-enabled and btrfs_qgroup_inherit() is going to use the incorrect structure and causing the above KASAN error. [FIX] Make btrfs_qgroup_check_inherit() only skip the source qgroup checks. So that even if invalid btrfs_qgroup_inherit structure is passed in, we can still reject invalid ones no matter if qgroup is enabled or not. Furthermore we do already have an extra safety inside btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources, so even if we only skip the qgroup source check we're still safe. Reported-by: syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com Fixes: b5357cb268c4 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02btrfs: zoned: fix calc_available_free_space() for zoned modeNaohiro Aota
calc_available_free_space() returns the total size of metadata (or system) block groups, which can be allocated from unallocated disk space. The logic is wrong on zoned mode in two places. First, the calculation of data_chunk_size is wrong. We always allocate one zone as one chunk, and no partial allocation of a zone. So, we should use zone_size (= data_sinfo->chunk_size) as it is. Second, the result "avail" may not be zone aligned. Since we always allocate one zone as one chunk on zoned mode, returning non-zone size aligned bytes will result in less pressure on the async metadata reclaim process. This is serious for the nearly full state with a large zone size device. Allowing over-commit too much will result in less async reclaim work and end up in ENOSPC. We can align down to the zone size to avoid that. Fixes: cb6cbab79055 ("btrfs: adjust overcommit logic when very close to full") CC: stable@vger.kernel.org # 6.9 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>