path: root/fs
Age  Commit message  Author
2022-01-07  btrfs: zoned: unset dedicated block group on allocation failure  (Naohiro Aota)
Allocating an extent from a block group can fail for various reasons. When an allocation from a dedicated block group (for tree-log or relocation data) fails, we need to unregister it as a dedicated one so that we can allocate a new block group to serve as the dedicated one. However, we return early when the block group is read-only, fully used, or its zone cannot be activated. As a result, we keep the non-usable block group as a dedicated one, leading to further allocation failures. With many block groups, the allocator will iterate a hopeless loop trying to find a free extent, resulting in a hung task. Fix the issue by delaying the return and doing the proper cleanups. CC: stable@vger.kernel.org # 5.16 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned  (Johannes Thumshirn)
REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the bio's bio_op. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: zoned: sink zone check into btrfs_repair_one_zone  (Johannes Thumshirn)
Sink zone check into btrfs_repair_one_zone() so we don't need to do it in all callers. Also as btrfs_repair_one_zone() doesn't return a sensible error, make it a boolean function and return false in case it got called on a non-zoned filesystem and true on a zoned filesystem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
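A minimal sketch of the resulting shape, assuming the existing btrfs_is_zoned() helper (illustrative only, not the verbatim patch):

    /* Sketch: the zoned check lives inside the helper, callers just test the result. */
    static bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical)
    {
            if (!btrfs_is_zoned(fs_info))
                    return false;

            /* ... look up the block group containing @logical and reset its zone ... */
            return true;
    }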
2022-01-07  btrfs: zoned: simplify btrfs_check_meta_write_pointer  (Johannes Thumshirn)
btrfs_check_meta_write_pointer() will always be called with a NULL 'cache_ret' argument. As there's no need to check if we have a valid block_group passed in, remove these checks. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: zoned: encapsulate inode locking for zoned relocation  (Johannes Thumshirn)
Encapsulate the inode lock needed for serializing the data relocation writes on a zoned filesystem into a helper. This streamlines the code reading flow and hides special casing for zoned filesystems. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device  (Anand Jain)
In the case of the seed device, the fsid can be different from the mounted sprout fsid. Userland has to read the device superblock to know the fsid, but that fails if the device is missing. So add a sysfs interface devinfo/<devid>/fsid to show the fsid of the device. For example:

$ cd /sys/fs/btrfs/b10b02a5-f9de-4276-b9e8-2bfd09a578a8
$ cat devinfo/1/fsid
c44d771f-639d-4df3-99ec-5bc7ad2af93b
$ cat devinfo/3/fsid
b10b02a5-f9de-4276-b9e8-2bfd09a578a8

Though it's related to seeding, the name of the sysfs file is plain fsid as it matches what blkid says. A path to the device's fsid will aid scripting. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: reserve extra space for the free space tree  (Josef Bacik)
Filipe reported a problem where sometimes he'd get an ENOSPC abort when running delayed refs with generic/619 and the free space tree enabled. This is partly because we do not reserve space for modifying the free space tree, nor do we have a block rsv associated with that tree. The delayed_refs_rsv tracks the amount of space required to run delayed refs. This means 1 modification means 1 change to the extent root. With the free space tree this turns into 2 changes, because modifying 1 extent means updating the extent tree and potentially updating the free space tree to either remove that entry or add the free space. Thus if we have the FST enabled, simply double the reservation size for our modification. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
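A minimal sketch of the reservation math, assuming the existing btrfs_calc_insert_metadata_size() and btrfs_fs_compat_ro() helpers (illustrative; not the exact hunk from the patch):

    /* Space for one delayed ref update; with the free space tree enabled,
     * each extent change may also need a free space tree modification,
     * so double the reservation. */
    u64 num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_refs);

    if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
            num_bytes *= 2;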
2022-01-07  btrfs: include the free space tree in the global rsv minimum calculation  (Josef Bacik)
Filipe reported a problem where generic/619 was failing with an ENOSPC abort while running delayed refs, like the following:

BTRFS: Transaction aborted (error -28)
WARNING: CPU: 3 PID: 522920 at fs/btrfs/free-space-tree.c:1049 add_to_free_space_tree+0xe5/0x110 [btrfs]
CPU: 3 PID: 522920 Comm: kworker/u16:19 Tainted: G W 5.16.0-rc2-btrfs-next-106 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
RIP: 0010:add_to_free_space_tree+0xe5/0x110 [btrfs]
RSP: 0000:ffffa65087fb7b20 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff9131eeaa RDI: 00000000ffffffff
RBP: ffff8d62e26481b8 R08: ffffffff9ad97ce0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffe4
R13: ffff8d61c25fe688 R14: ffff8d61ebd88800 R15: ffff8d61ebd88a90
FS: 0000000000000000(0000) GS:ffff8d64ed400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa46a8b1000 CR3: 0000000148d18003 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __btrfs_free_extent+0x516/0x950 [btrfs]
 __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
 btrfs_run_delayed_refs+0x86/0x210 [btrfs]
 flush_space+0x403/0x630 [btrfs]
 ? call_rcu_tasks_generic+0x50/0x80
 ? lock_release+0x223/0x4a0
 ? btrfs_get_alloc_profile+0xb5/0x290 [btrfs]
 ? do_raw_spin_unlock+0x4b/0xa0
 btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
 process_one_work+0x24c/0x5b0
 worker_thread+0x55/0x3c0
 ? process_one_work+0x5b0/0x5b0
 kthread+0x17c/0x1a0
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30

There's a couple of reasons for this, but in generic/619's case the largest reason is that it is a very small file system, and we do not reserve enough space for the global reserve. With the free space tree enabled we now also have the free space tree that we need to modify when running delayed refs. This means the global reserve needs to take it into account when it calculates the minimum size it needs to be. This is especially important for very small file systems. Fix this by adjusting the minimum global block rsv size math to include the size of the free space tree when calculating the size. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: scrub: merge SCRUB_PAGES_PER_RD_BIO and SCRUB_PAGES_PER_WR_BIO  (Qu Wenruo)
These two values were introduced in commit ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk") as an optimization. But the truth is, the block layer scheduler can do whatever it wants to merge/split bios to improve performance. Such an "optimization" is not really going to have much effect, especially considering how good the current block layer optimizations already are. Remove this old and immature optimization from our code. Since we're here, also change the BUG_ON()s using these two macros to ASSERT()s. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: update SCRUB_MAX_PAGES_PER_BLOCK  (Qu Wenruo)
Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to calculate this value. Also remove one stale comment on the value: in fact, with recent subpage support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond BTRFS_STRIPE_LEN, we just don't use the full page. Also, since we're here, update the BUG_ON() related to SCRUB_MAX_PAGES_PER_BLOCK to ASSERT(), as those ASSERT()s are really only for developers to catch obvious bugs early, not to make end users suffer. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: do not check -EAGAIN when truncating inodes in the log root  (Josef Bacik)
We only throttle the btrfs_truncate_inode_items if the root is SHAREABLE, which isn't set on the log root, which means this loop is unnecessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: make should_throttle loop local in btrfs_truncate_inode_items  (Josef Bacik)
We reset this bool on every loop through the truncate loop, make this variable local to the loop. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: combine extra if statements in btrfs_truncate_inode_items  (Josef Bacik)
We have

    if (del_item)
        // do something
    else
        // something else
    if (del_item)
        // do yet another thing
    else
        // something else entirely

back to back in btrfs_truncate_inode_items, collapse these two sets of if statements into one. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
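The collapsed form described above then looks roughly like this (keeping the same placeholder comments):

    if (del_item) {
        // do something
        // do yet another thing
    } else {
        // something else
        // something else entirely
    }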
2022-01-07  btrfs: convert BUG() for pending_del_nr into an ASSERT  (Josef Bacik)
This is a logic correctness check, convert it into an ASSERT() instead of a BUG(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: convert BUG_ON() in btrfs_truncate_inode_items to ASSERT  (Josef Bacik)
We have a correctness BUG_ON() in btrfs_truncate_inode_items to make sure that we're always using min_type == BTRFS_EXTENT_DATA_KEY if new_size is > 0. Convert this to an ASSERT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: add inode to truncate control  (Josef Bacik)
In the future we're going to want to use btrfs_truncate_inode_items without looking up the associated inode. In order to accommodate this add the inode to btrfs_truncate_control and handle the case where control->inode is NULL appropriately. This is fairly straightforward, we simply need to add a helper for the trace points, as the file extent map update is controlled by a flag on btrfs_truncate_control. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: pass the ino via truncate control  (Josef Bacik)
In the future we are going to want to truncate inode items without needing to have a btrfs_inode to pass in, so add ino to the btrfs_truncate_control and use that to look up the inode items to truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: use a flag to control when to clear the file extent range  (Josef Bacik)
We only care about updating the file extent range when we are doing a normal truncation. We skip this for tree logging currently, but we can also skip this for eviction as well. Using a flag makes it more explicit when we want to do this work. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: control extent reference updates with a control flag for truncate  (Josef Bacik)
We've had weird bugs in the past where we forgot to adjust the truncate path to deal with the fact that we can be called by the tree log path. Instead of checking if our root is a LOG_ROOT use a flag on the btrfs_truncate_control to indicate that we don't want to do extent reference updates during this truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: only call inode_sub_bytes in truncate paths that care  (Josef Bacik)
We currently have a bunch of awkward checks to make sure we only update the inode i_bytes if we're truncating the real inode. Instead keep track of the number of bytes we need to sub in the btrfs_truncate_control, and then do the appropriate adjustment in the truncate paths that care. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: only update i_size in truncate paths that care  (Josef Bacik)
We currently will update the i_size of the inode as we truncate it down, however we skip this if we're calling btrfs_truncate_inode_items from the tree log code. However we also don't care about this in the case of evict. Instead keep track of this value in the btrfs_truncate_control and then have btrfs_truncate() and the free space cache truncate path both do the i_size update themselves. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: add truncate control struct  (Josef Bacik)
I'm going to be adding more arguments and counters to btrfs_truncate_inode_items, so add a control struct to handle all of the extra arguments to make it easier to follow. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: remove found_extent from btrfs_truncate_inode_items  (Josef Bacik)
We only set this if we find a normal file extent, del_item == 1, and the file extent points to a real extent and isn't a hole extent. We can use del_item == 1 && extent_start != 0 to get the same information that found_extent provides, so remove this variable and use the other variables instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: move btrfs_kill_delayed_inode_items into evict  (Josef Bacik)
We have a special case in btrfs_truncate_inode_items() to call btrfs_kill_delayed_inode_items() if min_type == 0, which is only called during evict. Instead move this out into evict proper, and add some comments because I erroneously attempted to remove this code altogether without understanding what we were doing. Evict is updating the inode only because we only care about making sure the i_nlink count has hit disk. If we had pending deletions we don't want to process those via the delayed inode updates, we simply want to drop all of them and reclaim the reserved metadata space. Then from there the btrfs_truncate_inode_items() will do the work to remove all of the items as appropriate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: remove free space cache inode check in btrfs_truncate_inode_items  (Josef Bacik)
We no longer have the inode cache feature, so this check is extraneous as the only inode cache is in the tree_root, which is not marked as SHAREABLE. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: move extent locking outside of btrfs_truncate_inode_items  (Josef Bacik)
Currently we are locking the extent and dropping the extent cache for any inodes we truncate, unless they're in the tree log. We call this helper from: - truncate - evict - tree log - free space cache truncation For evict we've already dropped all of the extent cache for this inode once we've gotten here, and we're the only one accessing this inode, so this step is unnecessary. For the tree log code we already skip this part. Pull this work into the truncate path and the free space cache truncation path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: move btrfs_truncate_inode_items to inode-item.c  (Josef Bacik)
This is an inode item related manipulation with a few vfs related adjustments. I'm going to remove the vfs related code from this helper and simplify it a lot, but I want those changes to be easily seen via git blame, so move this function now and then the simplification work can be done. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: add an inode-item.h  (Josef Bacik)
We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: remove stale comment about locking at btrfs_search_slot()  (Filipe Manana)
The comment refers to the old extent buffer locking code, where we used to have custom locks that had blocking and spinning behaviour modes. That is not the case anymore, since we have transitioned to rw semaphores, so the comment does not offer any value anymore. Remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: remove BUG_ON() after splitting leaf  (Filipe Manana)
After calling split_leaf() we BUG_ON() if the returned value is greater than zero. However split_leaf() only returns 0, in case of success, or a negative value in case of an error. The reason for the BUG_ON() is that if we ever get a positive return value from split_leaf(), we can not simply propagate it to the callers of btrfs_search_slot(), as that would be interpreted as "key not found" and not as an error. That means it could result in callers ending up causing some potential silent corruption. So change the BUG_ON() to an ASSERT(), and in case assertions are disabled, produce a warning and set the return value to an error, to make it not possible to get into a silent corruption and having the error not noticed. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
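The resulting pattern is roughly the following sketch (the specific errno used here is an assumption; the commit only requires turning an unexpected positive value into an error):

    err = split_leaf(trans, root, key, p, ins_len, ret == 0);
    ASSERT(err <= 0);
    if (WARN_ON(err > 0))
            err = -EUCLEAN;  /* assumed errno; any negative error avoids the "key not found" confusion */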
2022-01-07  btrfs: move leaf search logic out of btrfs_search_slot()  (Filipe Manana)
There's quite a significant amount of code for doing the key search for a leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus btrfs_search_slot() is already big enough. So move the logic that does the key search on a leaf into a new helper function. This makes it better organized, removing the need for the labels and the gotos, as well as reducing the indentation level and the size of btrfs_search_slot(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: remove useless condition check before splitting leaf  (Filipe Manana)
When inserting a key, we check if the write_lock_level is less than 1, and if so we set it to 1, release the path and retry the tree traversal. However that is unnecessary, because when ins_len is greater than 0, we know that write_lock_level can never be less than 1. The logic to retry is also buggy, because in case ins_len was decremented, due to an exact key match and the search is not meant for item extension (path->search_for_extension is 0), we retry without incrementing ins_len, which would make the next retry decrement it again by the same amount. So remove the check for write_lock_level being less than 1 and add an assertion to assert it's always >= 1. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: try to unlock parent nodes earlier when inserting a key  (Filipe Manana)
When inserting a new key, we release the write lock on the leaf's parent only after doing the binary search on the leaf. This is because if the key ends up at slot 0, we will have to update the key at slot 0 of the parent node. The same reasoning applies to any other upper level nodes when their slot is 0. We also need to keep the parent locked in case the leaf does not have enough free space to insert the new key/item, because in that case we will split the leaf and we will need to add a new key to the parent due to a new leaf resulting from the split operation. However if the leaf has enough space for the new key and the key does not end up at slot 0 of the leaf, we could release our write lock on the parent before doing the binary search on the leaf to figure out the destination slot. That leads to reducing the amount of time other tasks are blocked waiting to lock the parent, therefore increasing parallelism when there are other tasks that are trying to access other leaves accessible through the same parent. This also applies to other upper nodes besides the immediate parent, when their slot is 0, since we keep locks on them until we figure out if the leaf slot is slot 0 or not. In fact, having the key end up at slot 0 is rare. Typically it only happens when the key is less than or equal to the smallest, the "left most", key of the entire btree, during a split attempt when we try to push to the right sibling leaf, or when the caller just wants to update the item of an existing key. It's also very common that a leaf has enough space to insert a new key, since after a split we move about half of the keys from one leaf into the new leaf. So unlock the parent, and any other upper level nodes, when during a key insertion we notice the key is greater than the first key in the leaf and the leaf has enough free space. After unlocking the upper level nodes, do the binary search using a low boundary of slot 1 and not slot 0, to figure out the slot where the key will be inserted (or where the key already is in case it exists and the caller wants to modify its item data). This extra comparison, with the first key, is cheap and the key is very likely already in a cache line because it immediately follows the header of the extent buffer and we have recently read the level field of the header (which in fact is the last field of the header).
The following fs_mark test was run on a non-debug kernel (debian's default kernel config), with a 12 cores intel CPU, and using a NVMe device:

$ cat run-fsmark.sh
#!/bin/bash

DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
FILES=100000
THREADS=$(nproc --all)
FILE_SIZE=0

echo "performance" | \
    tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
    OPTS="$OPTS -d $MNT/d$i"
done

fs_mark $OPTS

umount $MNT

Before this change:

FSUse%        Count         Size    Files/sec     App Overhead
     0      1200000            0     165273.6          5958381
     0      2400000            0     190938.3          6284477
     0      3600000            0     181429.1          6044059
     0      4800000            0     173979.2          6223418
     0      6000000            0     139288.0          6384560
     0      7200000            0     163000.4          6520083
     1      8400000            0      57799.2          5388544
     1      9600000            0      66461.6          5552969
     2     10800000            0      49593.5          5163675
     2     12000000            0      57672.1          4889398

After this change:

FSUse%        Count         Size    Files/sec         App Overhead
     0      1200000            0     167987.3 (+1.6%)      6272730
     0      2400000            0     198563.9 (+4.0%)      6048847
     0      3600000            0     197436.6 (+8.8%)      6163637
     0      4800000            0     202880.7 (+16.6%)     6371771
     1      6000000            0     167275.9 (+20.1%)     6556733
     1      7200000            0     204051.2 (+25.2%)     6817091
     1      8400000            0      69622.8 (+20.5%)     5525675
     1      9600000            0      69384.5 (+4.4%)      5700723
     1     10800000            0      61454.1 (+23.9%)     5363754
     3     12000000            0      61908.7 (+7.3%)      5370196

Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: allow generic_bin_search() to take low boundary as an argument  (Filipe Manana)
Right now generic_bin_search() always uses a low boundary slot of 0, but in the next patch we'll want to often skip slot 0 when searching for a key. So make generic_bin_search() have the low boundary slot specified as an argument, and move the check for the extent buffer level from btrfs_bin_search() to generic_bin_search() to avoid adding another wrapper around generic_bin_search(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
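For illustration, a binary search that takes the low boundary as an argument looks like the following standalone sketch (the my_key type and comparator are made up here; this is not the btrfs implementation):

    /* Search keys[low..nritems) for 'search'. Returns 0 and sets *slot on an
     * exact match, or 1 with *slot set to the insertion position otherwise.
     * Passing low == 1 skips slot 0, which is what the next patch wants. */
    static int bin_search(const struct my_key *keys, int nritems, int low,
                          const struct my_key *search, int *slot)
    {
            int high = nritems;

            while (low < high) {
                    int mid = low + (high - low) / 2;
                    int cmp = my_key_cmp(&keys[mid], search);

                    if (cmp < 0) {
                            low = mid + 1;
                    } else if (cmp > 0) {
                            high = mid;
                    } else {
                            *slot = mid;
                            return 0;
                    }
            }
            *slot = low;
            return 1;
    }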
2022-01-07  btrfs: check the root node for uptodate before returning it  (Josef Bacik)
Now that we clear the extent buffer uptodate if we fail to write it out we need to check to see if our root node is uptodate before we search down it. Otherwise we could return stale data (or potentially corrupt data that was caught by the write verification step) and think that the path is OK to search down. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: allow device add if balance is paused  (Nikolay Borisov)
Currently paused balance precludes adding a device since they are both considered exclusive ops and we can have at most one running at a time. This is problematic in case a filesystem encounters an ENOSPC situation while balance is running, in this case the only thing the user can do is mount the fs with "skip_balance" which pauses balance and delete some data to free up space for balance. However, it should be possible to add a new device when balance is paused. Fix this by allowing device add to proceed when balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: make device add compatible with paused balance in btrfs_exclop_start_try_lock  (Nikolay Borisov)
This is needed to enable device add to work in cases when a file system has been mounted with 'skip_balance' mount option. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  btrfs: introduce exclusive operation BALANCE_PAUSED state  (Nikolay Borisov)
The current set of exclusive operation states is not sufficient to handle all practical use cases. In particular there is a need to be able to add a device to a filesystem that has a paused balance. Currently there is no way to distinguish between a running and a paused balance. Fix this by introducing BTRFS_EXCLOP_BALANCE_PAUSED which is going to be set on 2 occasions: 1. When a filesystem is mounted with skip_balance and there is an unfinished balance, it will now be put into the BALANCE_PAUSED state instead of simply the BALANCE state. 2. When a running balance is paused. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
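A sketch of how the new state slots into the exclusive operation enum (member order and the remaining members are abbreviated here, not quoted from the header):

    enum btrfs_exclusive_operation {
            BTRFS_EXCLOP_NONE,
            BTRFS_EXCLOP_BALANCE,
            BTRFS_EXCLOP_BALANCE_PAUSED,  /* new: paused balance, or skip_balance mount with an unfinished balance */
            BTRFS_EXCLOP_DEV_ADD,
            /* ... other existing states ... */
    };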
2022-01-07  btrfs: make send work with concurrent block group relocation  (Filipe Manana)
We don't allow send and balance/relocation to run in parallel in order to prevent send failing or silently producing some bad stream. This is because while send is using an extent (specially metadata) or about to read a metadata extent and expecting it belongs to a specific parent node, relocation can run, the transaction used for the relocation is committed and the extent gets reallocated while send is still using the extent, so it ends up with a different content than expected. This can result in just failing to read a metadata extent due to failure of the validation checks (parent transid, level, etc), failure to find a backreference for a data extent, and other unexpected failures. Besides reallocation, there's also a similar problem of an extent getting discarded when it's unpinned after the transaction used for block group relocation is committed. The restriction between balance and send was added in commit 9e967495e0e0 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), kernel 5.3, while the more general restriction between send and relocation was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs while we have send operations running"), kernel 5.14. Both send and relocation can be very long running operations. Relocation because it has to do a lot of IO and expensive backreference lookups in case there are many snapshots, and send due to read IO when operating on very large trees. This makes it inconvenient for users and tools to deal with scheduling both operations. For zoned filesystem we also have automatic block group relocation, so send can fail with -EAGAIN when users least expect it or send can end up delaying the block group relocation for too long. In the future we might also get the automatic block group relocation for non zoned filesystems. This change makes it possible for send and relocation to run in parallel. This is achieved the following way: 1) For all tree searches, send acquires a read lock on the commit root semaphore; 2) After each tree search, and before releasing the commit root semaphore, the leaf is cloned and placed in the search path (struct btrfs_path); 3) After releasing the commit root semaphore, the changed_cb() callback is invoked, which operates on the leaf and writes commands to the pipe (or file in case send/receive is not used with a pipe). It's important here to not hold a lock on the commit root semaphore, because if we did we could deadlock when sending and receiving to the same filesystem using a pipe - the send task blocks on the pipe because it's full, the receive task, which is the only consumer of the pipe, triggers a transaction commit when attempting to create a subvolume or reserve space for a write operation for example, but the transaction commit blocks trying to write lock the commit root semaphore, resulting in a deadlock; 4) Before moving to the next key, or advancing to the next change in case of an incremental send, check if a transaction used for relocation was committed (or is about to finish its commit). If so, release the search path(s) and restart the search, to where we were before, so that we don't operate on stale extent buffers. 
The search restarts are always possible because both the send and parent roots are RO, and no one can add, remove or update keys (change their offset) in RO trees - the only exception is deduplication, but that is still not allowed to run in parallel with send; 5) Periodically check if there is contention on the commit root semaphore, which means there is a transaction commit trying to write lock it, and release the semaphore and reschedule if there is contention, so as to avoid causing any significant delays to transaction commits. This leaves some room for optimizations for send to have fewer path releases and less re-searching of the trees when there's relocation running, but for now it's kept simple as it performs quite well (on very large trees with resulting send streams in the order of a few hundred gigabytes). Test case btrfs/187, from fstests, stresses relocation, send and deduplication attempting to run in parallel, but without verifying if send succeeds and if it produces correct streams. A new test case will be added that exercises relocation happening in parallel with send and then checks that send succeeds and the resulting streams are correct. A final note is that for now this still leaves the mutual exclusion between send operations and deduplication on files belonging to a root used by send operations. A solution for that will be slightly more complex but it will eventually be built on top of this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07  vfs, fscache: Implement pinning of cache usage for writeback  (David Howells)
Cachefiles has a problem in that it needs to keep the backing file for a cookie open whilst there are local modifications pending that need to be written to it. However, we don't want to keep the file open indefinitely, as that causes EMFILE/ENFILE/ENOMEM problems. Reopening the cache file, however, is a problem if this is being done due to writeback triggered by exit(). Some filesystems will oops if we try to open a file in that context because they want to access current->fs or other resources that have already been dismantled. To get around this, I added the following: (1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem inode to indicate that we have a usage count on the cookie caching that inode. (2) A flag in struct writeback_control, unpinned_fscache_wb, that is set when __writeback_single_inode() clears the last dirty page from i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this flag. This has to be done here so that clearing I_PINNING_FSCACHE_WB can be done atomically with the check of PAGECACHE_TAG_DIRTY that clears I_DIRTY_PAGES. (3) A function, fscache_set_page_dirty(), which if it is not set, sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache resources. (4) A function, fscache_unpin_writeback(), to be called by ->write_inode() to unuse the cookie. (5) A function, fscache_clear_inode_writeback(), to be called when the inode is evicted, before clear_inode() is called. This cleans up any lingering I_PINNING_FSCACHE_WB. The network filesystem can then use these tools to make sure that fscache_write_to_cache() can write locally modified data to the cache as well as to the server. For the future, I'm working on write helpers for netfs lib that should allow this facility to be removed by keeping track of the dirty regions separately - but that's incomplete at the moment and is also going to be affected by folios, one way or another, since it deals with pages Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4
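As a rough usage sketch (the mynetfs_* helpers are hypothetical and the exact signatures are assumed from the description above), a network filesystem would wire these in roughly as follows:

    /* ->write_inode(): drop the cookie use that was pinned when the inode
     * first gained dirty pages (I_PINNING_FSCACHE_WB). */
    static int mynetfs_write_inode(struct inode *inode, struct writeback_control *wbc)
    {
            fscache_unpin_writeback(wbc, mynetfs_inode_cookie(inode));
            return 0;
    }

    /* ->evict_inode(): clean up any lingering pin before clear_inode(). */
    static void mynetfs_evict_inode(struct inode *inode)
    {
            truncate_inode_pages_final(&inode->i_data);
            fscache_clear_inode_writeback(mynetfs_inode_cookie(inode), inode, NULL);
            clear_inode(inode);
    }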
2022-01-07  fscache: Implement higher-level write I/O interface  (David Howells)
Provide a higher-level function than fscache_write() to perform a write from an inode's pagecache to the cache, whilst fending off concurrent writes by means of the PG_fscache mark on a page: void fscache_write_to_cache(struct fscache_cookie *cookie, struct address_space *mapping, loff_t start, size_t len, loff_t i_size, netfs_io_terminated_t term_func, void *term_func_priv, bool caching); If caching is false, this function does nothing except call (*term_func)() if given. It assumes that, in such a case, PG_fscache will not have been set on the pages. Otherwise, if caching is true, this function requires the source pages to have had PG_fscache set on them before calling. start and len define the region of the file to be modified and i_size indicates the new file size. The source pages are extracted from the mapping. term_func and term_func_priv work as for fscache_write(). The PG_fscache marks will be cleared at the end of the operation, before term_func is called or the function otherwise returns. There is an additional helper function to clear the PG_fscache bits from a range of pages: void fscache_clear_page_bits(struct fscache_cookie *cookie, struct address_space *mapping, loff_t start, size_t len, bool caching); If caching is true, the pages to be managed are expected to be located on mapping in the range defined by start and len. If caching is false, it does nothing. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819614155.215744.5528123235123721230.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906916346.143852.15632773570362489926.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967123599.1823006.12946816026724657428.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021522672.640689.4381958316198807813.stgit@warthog.procyon.org.uk/ # v4
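Using the prototype quoted above, a caller might look roughly like this (the request context and the mynetfs_write_done callback are hypothetical):

    /* Copy the PG_fscache-marked region [start, start + len) from the
     * pagecache into the cache; PG_fscache is cleared before the callback runs. */
    fscache_write_to_cache(cookie, inode->i_mapping, start, len,
                           i_size_read(inode),
                           mynetfs_write_done, request,
                           caching);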
2022-01-07  netfs: Pass more information on how to deal with a hole in the cache  (David Howells)
Pass more information to the cache on how to deal with a hole if it encounters one when trying to read from the cache. Three options are provided: (1) NETFS_READ_HOLE_IGNORE. Read the hole along with the data, assuming it to be a punched-out extent by the backing filesystem. (2) NETFS_READ_HOLE_CLEAR. If there's a hole, erase the requested region of the cache and clear the read buffer. (3) NETFS_READ_HOLE_FAIL. Fail the read if a hole is detected. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819612321.215744.9738308885948264476.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906914460.143852.6284247083607910189.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967119923.1823006.15637375885194297582.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021519762.640689.16994364383313159319.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache: Provide read/write stat counters for the cache  (David Howells)
Provide read/write stat counters for the cache backend to use. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819609532.215744.10821082637727410554.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906912598.143852.12960327989649429069.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967113830.1823006.3222957649202368162.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021517502.640689.6077928311710357342.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache: Count data storage objects in a cache  (David Howells)
Count the data storage objects that are currently allocated in a cache. This is used to pin certain cache structures until cache withdrawal is complete. Three helpers are provided to manage and make use of the count: (1) void fscache_count_object(struct fscache_cache *cache); This should be called by the cache backend to note that an object has been allocated and attached to the cache. (2) void fscache_uncount_object(struct fscache_cache *cache); This should be called by the backend to note that an object has been destroyed. This sends a wakeup event that allows cache withdrawal to proceed if it was waiting for that object. (3) void fscache_wait_for_objects(struct fscache_cache *cache); This can be used by the backend to wait for all outstanding cache objects to be destroyed. Each cache's counter is displayed as part of /proc/fs/fscache/caches. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819608594.215744.1812706538117388252.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906911646.143852.168184059935530127.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967111846.1823006.9868154941573671255.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021516219.640689.4934796654308958158.stgit@warthog.procyon.org.uk/ # v4
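Usage follows directly from the prototypes above (backend context omitted):

    /* Backend allocates and attaches a data storage object: */
    fscache_count_object(cache);

    /* ... and later destroys it, waking any pending withdrawal: */
    fscache_uncount_object(cache);

    /* Cache withdrawal waits for every outstanding object to go away: */
    fscache_wait_for_objects(cache);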
2022-01-07  fscache: Provide a means to begin an operation  (David Howells)
Provide a function to begin a read operation: int fscache_begin_read_operation( struct netfs_cache_resources *cres, struct fscache_cookie *cookie) This is primarily intended to be called by network filesystems on behalf of netfslib, but may also be called to use the I/O access functions directly. It attaches the resources required by the cache to cres struct from the supplied cookie. This holds access to the cache behind the cookie for the duration of the operation and forces cache withdrawal and cookie invalidation to perform synchronisation on the operation. cres->inval_counter is set from the cookie at this point so that it can be compared at the end of the operation. Note that this does not guarantee that the cache state is fully set up and able to perform I/O immediately; looking up and creation may be left in progress in the background. The operations intended to be called by the network filesystem, such as reading and writing, are expected to wait for the cookie to move to the correct state. This will, however, potentially sleep, waiting for a certain minimum state to be set or for operations such as invalidate to advance far enough that I/O can resume. Also provide a function for the cache to call to wait for the cache object to get to a state where it can be used for certain things: bool fscache_wait_for_operation(struct netfs_cache_resources *cres, enum fscache_want_stage stage); This looks at the cache resources provided by the begin function and waits for them to get to an appropriate stage. There's a choice of wanting just some parameters (FSCACHE_WANT_PARAM) or the ability to do I/O (FSCACHE_WANT_READ or FSCACHE_WANT_WRITE). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819603692.215744.146724961588817028.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906910672.143852.13856103384424986357.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967110245.1823006.2239170567540431836.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021513617.640689.16627329360866150606.stgit@warthog.procyon.org.uk/ # v4
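Put together, a read path might begin roughly like this sketch (error handling and the server fallback path abbreviated):

    struct netfs_cache_resources cres = {};
    int ret;

    ret = fscache_begin_read_operation(&cres, cookie);
    if (ret < 0)
            return ret;             /* no cache available; read from the server */

    /* Wait until the cache object is far enough along to service reads. */
    if (!fscache_wait_for_operation(&cres, FSCACHE_WANT_READ))
            goto read_from_server;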
2022-01-07  fscache: Implement cookie invalidation  (David Howells)
Add a function to invalidate the cache behind a cookie: void fscache_invalidate(struct fscache_cookie *cookie, const void *aux_data, loff_t size, unsigned int flags) This causes any cached data for the specified cookie to be discarded. If the cookie is marked as being in use, a new cache object will be created if possible and future I/O will use that instead. In-flight I/O should be abandoned (writes) or reconsidered (reads). Each time it is called cookie->inval_counter is incremented and this can be used to detect invalidation at the end of an I/O operation. The coherency data attached to the cookie can be updated and the cookie size should be reset. One flag is available, FSCACHE_INVAL_DIO_WRITE, which should be used to indicate invalidation due to a DIO write on a file. This will temporarily disable caching for this cookie. Changes ======= ver #2: - Should only change to inval state if can get access to cache. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819602231.215744.11206598147269491575.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906909707.143852.18056070560477964891.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967107447.1823006.5945029409592119962.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021512640.640689.11418616313147754172.stgit@warthog.procyon.org.uk/ # v4
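For example (a sketch; the aux buffer and the new size come from the network filesystem's own inode state):

    /* A local DIO write made the cached data stale: drop it and temporarily
     * disable caching for this cookie. */
    fscache_invalidate(cookie, &aux_data, i_size_read(inode),
                       FSCACHE_INVAL_DIO_WRITE);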
2022-01-07  fscache: Implement cookie user counting and resource pinning  (David Howells)
Provide a pair of functions to count the number of users of a cookie (open files, writeback, invalidation, resizing, reads, writes), to obtain and pin resources for the cookie and to prevent culling whilst there are users. The first function marks a cookie as being in use: void fscache_use_cookie(struct fscache_cookie *cookie, bool will_modify); The caller should indicate the cookie to use and whether or not the caller is in a context that may modify the cookie (e.g. a file open O_RDWR). If the cookie is not already resourced, fscache will ask the cache backend in the background to do whatever it needs to look up, create or otherwise obtain the resources necessary to access data. This is pinned to the cookie and may not be culled, though it may be withdrawn if the cache as a whole is withdrawn. The second function removes the in-use mark from a cookie and, optionally, updates the coherency data: void fscache_unuse_cookie(struct fscache_cookie *cookie, const void *aux_data, const loff_t *object_size); If non-NULL, the aux_data buffer and/or the object_size will be saved into the cookie and will be set on the backing store when the object is committed. If this removes the last usage on a cookie, the cookie is placed onto an LRU list from which it will be removed and closed after a couple of seconds if it doesn't get reused. This prevents resource overload in the cache - in particular it prevents it from holding too many files open. Changes ======= ver #2: - Fix fscache_unuse_cookie() to use atomic_dec_and_lock() to avoid a potential race if the cookie gets reused before it completes the unusement. - Added missing transition to LRU_DISCARDING state. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819600612.215744.13678350304176542741.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906907567.143852.16979631199380722019.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967106467.1823006.6790864931048582667.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021511674.640689.10084988363699111860.stgit@warthog.procyon.org.uk/ # v4
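A sketch of the intended pairing over the life of an open file (the aux buffer and object size are the netfs's own coherency state):

    /* On open: mark the cookie in use; pass true if the file may be modified. */
    fscache_use_cookie(cookie, file->f_mode & FMODE_WRITE);

    /* ... reads and writes ... */

    /* On release: drop the use, handing back updated coherency data and size. */
    fscache_unuse_cookie(cookie, &aux_data, &object_size);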
2022-01-07  fscache: Implement simple cookie state machine  (David Howells)
Implement a very simple cookie state machine to handle lookup, invalidation, withdrawal, relinquishment and, to be added later, commit on LRU discard. Three cache methods are provided: ->lookup_cookie() to look up and, if necessary, create a data storage object; ->withdraw_cookie() to free the resources associated with that object and potentially delete it; and ->prepare_to_write(), to prepare for changes to the cached data to be modified locally. Changes ======= ver #3: - Fix a race between LRU discard and relinquishment whereby the former would override the latter and thus the latter would never happen[1]. ver #2: - Don't hold n_accesses elevated whilst cache is bound to a cookie, but rather add a flag that prevents the state machine from being queued when n_accesses reaches 0. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/599331.1639410068@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/163819599657.215744.15799615296912341745.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906903925.143852.1805855338154353867.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967105456.1823006.14730395299835841776.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021510706.640689.7961423370243272583.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache: Add a function for a cache backend to note an I/O error  (David Howells)
Add a function to the backend API to note an I/O error in a cache. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819598741.215744.891281275151382095.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906901316.143852.15225412215771586528.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967100721.1823006.16435671567428949398.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021508840.640689.11902836226570620424.stgit@warthog.procyon.org.uk/ # v4
2022-01-07  fscache: Provide and use cache methods to lookup/create/free a volume  (David Howells)
Add cache methods to lookup, create and remove a volume. Looking up or creating the volume requires the cache pinning for access; freeing the volume requires the volume pinning for access. The ->acquire_volume() method is used to ask the cache backend to lookup and, if necessary, create a volume; the ->free_volume() method is used to free the resources for a volume. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/163819597821.215744.5225318658134989949.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906898645.143852.8537799955945956818.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967099771.1823006.1455197910571061835.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021507345.640689.4073511598838843040.stgit@warthog.procyon.org.uk/ # v4