summaryrefslogtreecommitdiff
path: root/fs/btrfs/compression.c
AgeCommit message (Collapse)Author
2022-08-03Merge tag 'for-5.20-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This brings some long awaited changes, the send protocol bump, otherwise lots of small improvements and fixes. The main core part is reworking bio handling, cleaning up the submission and endio and improving error handling. There are some changes outside of btrfs adding helpers or updating API, listed at the end of the changelog. Features: - sysfs: - export chunk size, in debug mode add tunable for setting its size - show zoned among features (was only in debug mode) - show commit stats (number, last/max/total duration) - send protocol updated to 2 - new commands: - ability write larger data chunks than 64K - send raw compressed extents (uses the encoded data ioctls), ie. no decompression on send side, no compression needed on receive side if supported - send 'otime' (inode creation time) among other timestamps - send file attributes (a.k.a file flags and xflags) - this is first version bump, backward compatibility on send and receive side is provided - there are still some known and wanted commands that will be implemented in the near future, another version bump will be needed, however we want to minimize that to avoid causing usability issues - print checksum type and implementation at mount time - don't print some messages at mount (mentioned as people asked about it), we want to print messages namely for new features so let's make some space for that - big metadata - this has been supported for a long time and is not a feature that's worth mentioning - skinny metadata - same reason, set by default by mkfs Performance improvements: - reduced amount of reserved metadata for delayed items - when inserted items can be batched into one leaf - when deleting batched directory index items - when deleting delayed items used for deletion - overall improved count of files/sec, decreased subvolume lock contention - metadata item access bounds checker micro-optimized, with a few percent of improved runtime for metadata-heavy operations - increase direct io limit for read to 256 sectors, improved throughput by 3x on sample workload Notable fixes: - raid56 - reduce parity writes, skip sectors of stripe when there are no data updates - restore reading from on-disk data instead of using stripe cache, this reduces chances to damage correct data due to RMW cycle - refuse to replay log with unknown incompat read-only feature bit set - zoned - fix page locking when COW fails in the middle of allocation - improved tracking of active zones, ZNS drives may limit the number and there are ENOSPC errors due to that limit and not actual lack of space - adjust maximum extent size for zone append so it does not cause late ENOSPC due to underreservation - mirror reading error messages show the mirror number - don't fallback to buffered IO for NOWAIT direct IO writes, we don't have the NOWAIT semantics for buffered io yet - send, fix sending link commands for existing file paths when there are deleted and created hardlinks for same files - repair all mirrors for profiles with more than 1 copy (raid1c34) - fix repair of compressed extents, unify where error detection and repair happen Core changes: - bio completion cleanups - don't double defer compression bios - simplify endio workqueues - add more data to btrfs_bio to avoid allocation for read requests - rework bio error handling so it's same what block layer does, the submission works and errors are consumed in endio - when asynchronous bio offload fails fall back to synchronous checksum calculation to avoid errors under writeback or memory pressure - new trace points - raid56 events - ordered extent operations - super block log_root_transid deprecated (never used) - mixed_backref and big_metadata sysfs feature files removed, they've been default for sufficiently long time, there are no known users and mixed_backref could be confused with mixed_groups Non-btrfs changes, API updates: - minor highmem API update to cover const arguments - switch all kmap/kmap_atomic to kmap_local - remove redundant flush_dcache_page() - address_space_operations::writepage callback removed - add bdev_max_segments() helper" * tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits) btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read btrfs: fix repair of compressed extents btrfs: remove the start argument to check_data_csum and export btrfs: pass a btrfs_bio to btrfs_repair_one_sector btrfs: simplify the pending I/O counting in struct compressed_bio btrfs: repair all known bad mirrors btrfs: merge btrfs_dev_stat_print_on_error with its only caller btrfs: join running log transaction when logging new name btrfs: simplify error handling in btrfs_lookup_dentry btrfs: send: always use the rbtree based inode ref management infrastructure btrfs: send: fix sending link commands for existing file paths btrfs: send: introduce recorded_ref_alloc and recorded_ref_free btrfs: zoned: wait until zone is finished when allocation didn't progress btrfs: zoned: write out partially allocated region btrfs: zoned: activate necessary block group btrfs: zoned: activate metadata block group on flush_space btrfs: zoned: disable metadata overcommit for zoned btrfs: zoned: introduce space_info->active_total_bytes btrfs: zoned: finish least available block group on data bg allocation btrfs: let can_allocate_chunk return error ...
2022-07-25btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_readChristoph Hellwig
This flag was used to communicate that the low-level compression code already did verify the checksum to the high-level I/O completion code. But it has been unused for a long time as the upper btrfs_bio for the decompressed data had a NULL csum pointer basically since that pointer existed and the code already checks for that a little later. Note that this does not affect the other use of the checked flag, which is only used for the COW fixup worker. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: fix repair of compressed extentsChristoph Hellwig
Currently the checksum of compressed extents is verified based on the compressed data and the lower btrfs_bio, but the actual repair process is driven by end_bio_extent_readpage on the upper btrfs_bio for the decompressed data. This has a bunch of issues, including not being able to properly communicate the failed mirror up in case that the I/O submission got preempted, a general loss of if an error was an I/O error or a checksum verification failure, but most importantly that this design causes btrfs_clean_io_failure to eventually write back the uncompressed good data onto the disk sectors that are supposed to contain compressed data. Fix this by moving the repair to the lower btrfs_bio. To do so, a fair amount of code has to be reshuffled: a) the lower btrfs_bio now needs a valid csum pointer. The easiest way to achieve that is to pass NULL btrfs_lookup_bio_sums and just use the btrfs_bio management of csums. For a compressed_bio that is split into multiple btrfs_bios this means additional memory allocations, but the code becomes a lot more regular. b) checksum verification now runs directly on the lower btrfs_bio instead of the compressed_bio. This actually nicely simplifies the end I/O processing. c) btrfs_repair_one_sector can't just look up the logical address for the file offset any more, as there is no corresponding relative offsets that apply to the file offset and the logic address for compressed extents. Instead require that the saved bvec_iter in the btrfs_bio is filled out for all read bios and use that, which again removes a fair amount of code. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: simplify the pending I/O counting in struct compressed_bioChristoph Hellwig
Instead of counting the sectors just count the bios, with an extra reference held during submission. This significantly simplifies the submission side error handling. This slightly changes completion and error handling of btrfs_submit_compressed_{read,write} because with the old code the compressed_bio could have been completed in submit_compressed_{read,write} only if there was an error during submission for one of the lower bio, whilst with the new code there is a chance for this to happen even for successful submission if the all the lower bios complete before the end of the function is reached. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: do not return errors from btrfs_map_bioChristoph Hellwig
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. As this requires touching all the callers, rename the function to btrfs_submit_bio, which describes the functionality much better. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove btrfs_end_io_wqChristoph Hellwig
All reads bio that go through btrfs_map_bio need to be completed in user context. And read I/Os are the most common and timing critical in almost any file system workloads. Embed a work_struct into struct btrfs_bio and use it to complete all read bios submitted through btrfs_map, using the REQ_META flag to decide which workqueue they are placed on. This removes the need for a separate 128 byte allocation (typically rounded up to 192 bytes by slab) for all reads with a size increase of 24 bytes for struct btrfs_bio. Future patches will reorganize struct btrfs_bio to make use of this extra space for writes as well. (All sizes are based a on typical 64-bit non-debug build) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: don't use btrfs_bio_wq_end_io for compressed writesChristoph Hellwig
Compressed write bio completion is the only user of btrfs_bio_wq_end_io for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal here as we only real need user context for the final completion of a compressed_bio structure, and not every single bio completion. Add a work_struct to struct compressed_bio instead and use that to call finish_compressed_bio_write. This allows to remove all handling of write bios in the btrfs_bio_wq_end_io infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove redundant calls to flush_dcache_pageDavid Sterba
Both memzero_page and memcpy_to_page already call flush_dcache_page so we can remove the calls from btrfs code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: introduce a data checksum checking helperQu Wenruo
Although we have several data csum verification code, we never have a function really just to verify checksum for one sector. Function check_data_csum() do extra work for error reporting, thus it requires a lot of extra things like file offset, bio_offset etc. Function btrfs_verify_data_csum() is even worse, it will utilize page checked flag, which means it can not be utilized for direct IO pages. Here we introduce a new helper, btrfs_check_sector_csum(), which really only accept a sector in page, and expected checksum pointer. We use this function to implement check_data_csum(), and export it for incoming patch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: keep passing the csum array as an arguments, as the callers want to print it, rename per request] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-14fs/btrfs: Use the enum req_op and blk_opf_t typesBart Van Assche
Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Acked-by: David Sterba <dsterba@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16btrfs: derive compression type from extent map during readsGoldwyn Rodrigues
Derive the compression type from extent map as opposed to the bio flags passed. This makes it more precise and not reliant on function parameters. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: do not return errors from btrfs_submit_compressed_readChristoph Hellwig
btrfs_submit_compressed_read already calls ->bi_end_io on error and the caller must ignore the return value, so remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: do not pass compressed_bio to submit_compressed_bio()Goldwyn Rodrigues
Parameter struct compressed_bio is not used by the function submit_compressed_bio(). Remove it. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: factor out allocating an array of pagesSweet Tea Dorminy
Several functions currently populate an array of page pointers one allocated page at a time. Factor out the common code so as to allow improvements to all of the sites at once. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06btrfs: fix btrfs_submit_compressed_write cgroup attributionDennis Zhou
This restores the logic from commit 46bcff2bfc5e ("btrfs: fix compressed write bio blkcg attribution") which added cgroup attribution to btrfs writeback. It also adds back the REQ_CGROUP_PUNT flag for these ios. Fixes: 91507240482e ("btrfs: determine stripe boundary at bio allocation time in btrfs_submit_compressed_write") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: do not double complete bio on errors during compressed readsJosef Bacik
I hit some weird panics while fixing up the error handling from btrfs_lookup_bio_sums(). Turns out the compression path will complete the bio we use if we set up any of the compression bios and then return an error, and then btrfs_submit_data_bio() will also call bio_endio() on the bio. Fix this by making btrfs_submit_compressed_read() responsible for calling bio_endio() on the bio if there are any errors. Currently it was only doing it if we created the compression bios, otherwise it was depending on btrfs_submit_data_bio() to do the right thing. This creates the above problem, so fix up btrfs_submit_compressed_read() to always call bio_endio() in case of an error, and then simply return from btrfs_submit_data_bio() if we had to call btrfs_submit_compressed_read(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: track compressed bio errors as blk_status_tJosef Bacik
Right now we just have a binary "errors" flag, so any error we get on the compressed bio's gets translated to EIO. This isn't necessarily a bad thing, but if we get an ENOMEM it may be nice to know that's what happened instead of an EIO. Track our errors as a blk_status_t, and do the appropriate setting of the errors accordingly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: remove the bio argument from finish_compressed_bio_readJosef Bacik
This bio is usually one of the compressed bio's, and we don't actually need it in this function, so remove the argument and stop passing it around. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: check correct bio in finish_compressed_bio_readJosef Bacik
Commit c09abff87f90 ("btrfs: cloned bios must not be iterated by bio_for_each_segment_all") added ASSERT()'s to make sure we weren't calling bio_for_each_segment_all() on a RAID5/6 bio. However it was checking the bio that the compression code passed in, not the cb->orig_bio that we actually iterate over, so adjust this ASSERT() to check the correct bio. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: add BTRFS_IOC_ENCODED_WRITEOmar Sandoval
The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. This also creates inline extents when possible. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()Omar Sandoval
btrfs_csum_one_bio() loops over each filesystem block in the bio while keeping a cursor of its current logical position in the file in order to look up the ordered extent to add the checksums to. However, this doesn't make much sense for compressed extents, as a sector on disk does not correspond to a sector of decompressed file data. It happens to work because: 1) the compressed bio always covers one ordered extent 2) the size of the bio is always less than the size of the ordered extent However, the second point will not always be true for encoded writes. Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that it can assume that the bio only covers one ordered extent. Since we're already changing the signature, let's get rid of the contig parameter and make it implied by the offset parameter, similar to the change we recently made to btrfs_lookup_bio_sums(). Additionally, let's rename nr_sectors to blockcount to make it clear that it's the number of filesystem blocks, not the number of 512-byte sectors. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove unnecessary parameter type from compression_decompress_bioSu Yue
btrfs_decompress_bio, the only caller of compression_decompress_bio gets type from @cb and passes it to compression_decompress_bio. However, compression_decompress_bio can get compression type directly from @cb. So remove the parameter and access it through @cb. No functional change. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: set BTRFS_FS_STATE_NO_CSUMS if we fail to load the csum rootJosef Bacik
We have a few places where we skip doing csums if we mounted with one of the rescue options that ignores bad csum roots. In the future when there are multiple csum roots it'll be costly to check and see if there are any missing csum roots, so simply add a flag to indicate the fs should skip loading csums in case of errors. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-01Merge tag 'for-5.16-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The updates this time are more under the hood and enhancing existing features (subpage with compression and zoned namespaces). Performance related: - misc small inode logging improvements (+3% throughput, -11% latency on sample dbench workload) - more efficient directory logging: bulk item insertion, less tree searches and locking - speed up bulk insertion of items into a b-tree, which is used when logging directories, when running delayed items for directories (fsync and transaction commits) and when running the slow path (full sync) of an fsync (bulk creation run time -4%, deletion -12%) Core: - continued subpage support - make defragmentation work - make compression write work - zoned mode - support ZNS (zoned namespaces), zone capacity is number of usable blocks in each zone - add dedicated block group (zoned) for relocation, to prevent out of order writes in some cases - greedy block group reclaim, pick the ones with least usable space first - preparatory work for send protocol updates - error handling improvements - cleanups and refactoring Fixes: - lockdep warnings - in show_devname callback, on seeding device - device delete on loop device due to conversions to workqueues - fix deadlock between chunk allocation and chunk btree modifications - fix tracking of missing device count and status" * tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (140 commits) btrfs: remove root argument from check_item_in_log() btrfs: remove root argument from add_link() btrfs: remove root argument from btrfs_unlink_inode() btrfs: remove root argument from drop_one_dir_item() btrfs: clear MISSING device status bit in btrfs_close_one_device btrfs: call btrfs_check_rw_degradable only if there is a missing device btrfs: send: prepare for v2 protocol btrfs: fix comment about sector sizes supported in 64K systems btrfs: update device path inode time instead of bd_inode fs: export an inode_update_time helper btrfs: fix deadlock when defragging transparent huge pages btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE btrfs: update comments for chunk allocation -ENOSPC cases btrfs: fix deadlock between chunk allocation and chunk btree modifications btrfs: zoned: use greedy gc for auto reclaim btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls btrfs: add a btrfs_get_dev_args_from_path helper btrfs: handle device lookup with btrfs_dev_lookup_args ...
2021-11-01Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - mq-deadline accounting improvements (Bart) - blk-wbt timer fix (Andrea) - Untangle the block layer includes (Christoph) - Rework the poll support to be bio based, which will enable adding support for polling for bio based drivers (Christoph) - Block layer core support for multi-actuator drives (Damien) - blk-crypto improvements (Eric) - Batched tag allocation support (me) - Request completion batching support (me) - Plugging improvements (me) - Shared tag set improvements (John) - Concurrent queue quiesce support (Ming) - Cache bdev in ->private_data for block devices (Pavel) - bdev dio improvements (Pavel) - Block device invalidation and block size improvements (Xie) - Various cleanups, fixes, and improvements (Christoph, Jackie, Masahira, Tejun, Yu, Pavel, Zheng, me) * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits) blk-mq-debugfs: Show active requests per queue for shared tags block: improve readability of blk_mq_end_request_batch() virtio-blk: Use blk_validate_block_size() to validate block size loop: Use blk_validate_block_size() to validate block size nbd: Use blk_validate_block_size() to validate block size block: Add a helper to validate the block size block: re-flow blk_mq_rq_ctx_init() block: prefetch request to be initialized block: pass in blk_mq_tags to blk_mq_rq_ctx_init() block: add rq_flags to struct blk_mq_alloc_data block: add async version of bio_set_polled block: kill DIO_MULTI_BIO block: kill unused polling bits in __blkdev_direct_IO() block: avoid extra iter advance with async iocb block: Add independent access ranges support blk-mq: don't issue request directly in case that current is to be blocked sbitmap: silence data race warning blk-cgroup: synchronize blkg creation against policy deactivation block: refactor bio_iov_bvec_set() block: add single bio async direct IO helper ...
2021-10-27Revert "btrfs: compression: drop kmap/kunmap from generic helpers"David Sterba
This reverts commit 4c2bf276b56d8d27ddbafcdf056ef3fc60ae50b0. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: make end_compressed_bio_writeback() compatibleQu Wenruo
In end_compressed_writeback() we just clear the full page writeback. For subpage case, if there are two delalloc ranges in the same page, the 2nd range will trigger a BUG_ON() as the page writeback is already cleared by previous range. Fix it by using btrfs_page_clamp_clear_writeback() helper. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: make btrfs_submit_compressed_write() compatibleQu Wenruo
There is a WARN_ON() checking if @start is aligned to PAGE_SIZE, not sectorsize, which will cause false alert for subpage. Fix it to check against sectorsize. Furthermore: - Use ASSERT() to do the check So that in the future we may skip the check for production build - Also check alignment for @len Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: determine stripe boundary at bio allocation time in ↵Qu Wenruo
btrfs_submit_compressed_write Currently btrfs_submit_compressed_write() will check btrfs_bio_fits_in_stripe() each time a new page is going to be added. Even if compressed extent is small, we don't really need to do that for every page. Align the behavior to extent_io.c, by determining the stripe boundary when allocating a bio. Unlike extent_io.c, in compressed.c we don't need to bother things like different bio flags, thus no need to re-use bio_ctrl. Here we just manually introduce new local variable, next_stripe_start, and use that value returned from alloc_compressed_bio() to calculate the stripe boundary. Then each time we add some page range into the bio, we check if we reached the boundary. And if reached, submit it. Also, since we have @cur_disk_bytenr to determine whether we're the last bio, we don't need a explicit last_bio: tag for error handling any more. And since we use @cur_disk_bytenr to wait, there is no need for pending_bios, also remove it to save some memory of compressed_bio. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: determine stripe boundary at bio allocation time in ↵Qu Wenruo
btrfs_submit_compressed_read Currently btrfs_submit_compressed_read() will check btrfs_bio_fits_in_stripe() each time a new page is going to be added. Even if compressed extent is small, we don't really need to do that for every page. This patch will align the behavior to extent_io.c, by determining the stripe boundary when allocating a bio. Unlike extent_io.c, in compressed.c we don't need to bother things like different bio flags, thus no need to re-use bio_ctrl. Here we just manually introduce new local variable, next_stripe_start, and teach alloc_compressed_bio() to calculate the stripe boundary. Then each time we add some page range into the bio, we check if we reached the boundary. And if reached, submit it. Also, since we have @cur_disk_byte to determine whether we're the last bio, we don't need a explicit last_bio: tag for error handling any more. And we can use @cur_disk_byte to track which range has been added to bio, we can also use @cur_disk_byte to calculate the wait condition, no need for @pending_bios. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: introduce alloc_compressed_bio() for compressionQu Wenruo
Just aggregate the bio allocation code into one helper, so that we can replace 4 call sites. There is one special note for zoned write. Currently btrfs_submit_compressed_write() will only allocate the first bio using ZONE_APPEND. If we have to submit current bio due to stripe boundary, the new bio allocated will not use ZONE_APPEND. In theory this should be a bug, but considering zoned mode currently only support SINGLE profile, which doesn't have any stripe boundary limit, it should never be a problem and we have assertions in place. This function will provide a good entrance for any work which needs to be done at bio allocation time. Like determining the stripe boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: introduce submit_compressed_bio() for compressionQu Wenruo
The new helper, submit_compressed_bio(), will aggregate the following work: - Increase compressed_bio::pending_bios - Remap the endio function - Map and submit the bio This slightly reorders calls to btrfs_csum_one_bio or btrfs_lookup_bio_sums but but none of them does anything regarding IO submission so this is effectively no change. We mainly care about order of - atomic_inc - btrfs_bio_wq_end_io - btrfs_map_bio Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: handle errors properly inside btrfs_submit_compressed_write()Qu Wenruo
Just like btrfs_submit_compressed_read(), there are quite some BUG_ON()s inside btrfs_submit_compressed_write() for the bio submission path. Fix them using the same method: - For last bio, just endio the bio As in that case, one of the endio function of all these submitted bio will be able to free the compressed_bio - For half-submitted bio, wait and finish the compressed_bio manually In this case, as long as all other bio finish, we're the only one referring the compressed bio, and can manually finish it. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: handle errors properly inside btrfs_submit_compressed_read()Qu Wenruo
There are quite some BUG_ON()s inside btrfs_submit_compressed_read(), namely all errors inside the for() loop relies on BUG_ON() to handle -ENOMEM. Handle these errors properly by: - Wait for submitted bios to finish first Using wake_var_event() APIs to wait without introducing extra memory overhead inside compressed_bio. This allows us to wait for any submitted bio to finish, while still keeps the compressed_bio from being freed. - Introduce finish_compressed_bio_read() to finish the compressed_bio - Properly end the bio and finish compressed_bio when error happens Now in btrfs_submit_compressed_read() even when the bio submission failed, we can properly handle the error without triggering BUG_ON(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: add bitmap for PageChecked flagQu Wenruo
Although in btrfs we have very limited usage of PageChecked flag, it's still some page flag not yet subpage compatible. Fix it by introducing btrfs_subpage::checked_offset to do the convert. For most call sites, especially for free-space cache, COW fixup and btrfs_invalidatepage(), they all work in full page mode anyway. For other call sites, they work as subpage compatible mode. Some call sites need extra modification: - btrfs_drop_pages() Needs extra parameter to get the real range we need to clear checked flag. Also since btrfs_drop_pages() will accept pages beyond the dirtied range, update btrfs_subpage_clamp_range() to handle such case by setting @len to 0 if the page is beyond target range. - btrfs_invalidatepage() We need to call subpage helper before calling __btrfs_releasepage(), or it will trigger ASSERT() as page->private will be cleared. - btrfs_verify_data_csum() In theory we don't need the io_bio->csum check anymore, but it's won't hurt. Just change the comment. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: introduce compressed_bio::pending_sectors to trace compressed bioQu Wenruo
For btrfs_submit_compressed_read() and btrfs_submit_compressed_write(), we have a pretty weird dance around compressed_bio::pending_bios: btrfs_submit_compressed_read/write() { cb = kmalloc() refcount_set(&cb->pending_bios, 0); bio = btrfs_alloc_bio(); /* NOTE here, we haven't yet submitted any bio */ refcount_set(&cb->pending_bios, 1); for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { if (submit) { /* Here we submit bio, but we always have one * extra pending_bios */ refcount_inc(&cb->pending_bios); ret = btrfs_map_bio(); } } /* Submit the last bio */ ret = btrfs_map_bio(); } There are two reasons why we do this: - compressed_bio::pending_bios is a refcount Thus if it's reduced to 0, it can not be increased again. - To ensure the compressed_bio is not freed by some submitted bios If the submitted bio is finished before the next bio submitted, we can free the compressed_bio completely. But the above code is sometimes confusing, and we can do it better by introducing a new member, compressed_bio::pending_sectors. Now we use compressed_bio::pending_sectors to indicate whether we have any pending sectors under IO or not yet submitted. If pending_sectors == 0, we're definitely the last bio of compressed_bio, and is OK to release the compressed bio. Now the workflow looks like this: btrfs_submit_compressed_read/write() { cb = kmalloc() atomic_set(&cb->pending_bios, 0); refcount_set(&cb->pending_sectors, compressed_len >> sectorsize_bits); bio = btrfs_alloc_bio(); for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) { if (submit) { refcount_inc(&cb->pending_bios); ret = btrfs_map_bio(); } } /* Submit the last bio */ refcount_inc(&cb->pending_bios); ret = btrfs_map_bio(); } For now we still need pending_bios for later error handling, but will remove pending_bios eventually after properly handling the errors. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: make add_ra_bio_pages() compatibleQu Wenruo
[BUG] If we remove the subpage limitation in add_ra_bio_pages(), then read a compressed extent which has part of its range in next page, like the following inode layout: 0 32K 64K 96K 128K |<--------------|-------------->| Btrfs will trigger ASSERT() in endio function: assertion failed: atomic_read(&subpage->readers) >= nbits ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3431! Internal error: Oops - BUG: 0 [#1] SMP Workqueue: btrfs-endio btrfs_work_helper [btrfs] Call trace: assertfail.constprop.0+0x28/0x2c [btrfs] btrfs_subpage_end_reader+0x148/0x14c [btrfs] end_page_read+0x8c/0x100 [btrfs] end_bio_extent_readpage+0x320/0x6b0 [btrfs] bio_endio+0x15c/0x1dc end_workqueue_fn+0x44/0x64 [btrfs] btrfs_work_helper+0x74/0x250 [btrfs] process_one_work+0x1d4/0x47c worker_thread+0x180/0x400 kthread+0x11c/0x120 ret_from_fork+0x10/0x30 ---[ end trace c8b7b552d3bb408c ]--- [CAUSE] When we read the page range [0, 64K), we find it's a compressed extent, and we will try to add extra pages in add_ra_bio_pages() to avoid reading the same compressed extent. But when we add such page into the read bio, it doesn't follow the behavior of btrfs_do_readpage() to properly set subpage::readers. This means, for page [64K, 128K), its subpage::readers is still 0. And when endio is executed on both pages, since page [64K, 128K) has 0 subpage::readers, it triggers above ASSERT() [FIX] Function add_ra_bio_pages() is far from subpage compatible, it always assume PAGE_SIZE == sectorsize, thus when it skip to next range it always just skip PAGE_SIZE. Make it subpage compatible by: - Skip to next page properly when needed If we find there is already a page cache, we need to skip to next page. For that case, we shouldn't just skip PAGE_SIZE bytes, but use @pg_index to calculate the next bytenr and continue. - Only add the page range covered by current extent map We need to calculate which range is covered by current extent map and only add that part into the read bio. - Update subpage::readers before submitting the bio - Use proper cursor other than confusing @last_offset - Calculate the missed threshold based on sector size It's no longer using missed pages, as for 64K page size, we have at most 3 pages to skip. (If aligned only 2 pages) - Add ASSERT() to make sure our bytenr is always aligned - Add comment for the function Add a special note for subpage case, as the function won't really work well for subpage cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove unused parameter nr_pages in add_ra_bio_pages()Qu Wenruo
Variable @nr_pages only gets increased but never used. Remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename struct btrfs_io_bio to btrfs_bioQu Wenruo
Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio. With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now. The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name. This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members so a build would break in case of incorrect backport. We haven't had many backports to bio code anyway so this is more of a theoretical cause of bugs and a matter of precaution but we'll need to keep the semantic change in mind. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove btrfs_bio_alloc() helperQu Wenruo
The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(), except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes bio->bi_iter.bi_sector. However the naming itself is not using "btrfs_io_bio" to indicate its parameter is "strcut btrfs_io_bio" and can be easily confused with "struct btrfs_bio". Considering assigned bio->bi_iter.bi_sector is such a simple work and there are already tons of call sites doing that manually, there is no need to do that in a helper. Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc() function to provide a fail-safe value for its @nr_iovecs. And then replace all btrfs_bio_alloc() callers with btrfs_io_bio_alloc(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-18mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>Christoph Hellwig
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so break the include chain. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23btrfs: rework btrfs_decompress_buf2page()Qu Wenruo
There are several bugs inside the function btrfs_decompress_buf2page() - @start_byte doesn't take bvec.bv_offset into consideration Thus it can't handle case where the target range is not page aligned. - Too many helper variables There are tons of helper variables, @buf_offset, @current_buf_start, @start_byte, @prev_start_byte, @working_bytes, @bytes. This hurts anyone who wants to read the function. - No obvious main cursor for the iteartion A new problem caused by previous problem. - Comments for parameter list makes no sense Like @buf_start is the offset to @buf, or offset inside the full decompressed extent? (Spoiler alert, the later case) And @total_out acts more like @buf_start + @size_of_buf. The worst is @disk_start. The real meaning of it is the file offset of the full decompressed extent. This patch will rework the whole function by: - Add a proper comment with ASCII art to explain the parameter list - Rework parameter list The old @buf_start is renamed to @decompressed, to show how many bytes are already decompressed inside the full decompressed extent. The old @total_out is replaced by @buf_len, which is the decompressed data size. For old @disk_start and @bio, just pass @compressed_bio in. - Use single main cursor The main cursor will be @cur_file_offset, to show what's the current file offset. Other helper variables will be declared inside the main loop, and only minimal amount of helper variables: * offset_inside_decompressed_buf: The only real helper * copy_start_file_offset: File offset we start memcpy * bvec_file_offset: File offset of current bvec Even with all these extensive comments, the final function is still smaller than the original function, which is definitely a win. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: grab correct extent map for subpage compressed extent readQu Wenruo
[BUG] When subpage compressed read write support is enabled, btrfs/038 always fails with EIO. A simplified script can easily trigger the problem: mkfs.btrfs -f -s 4k $dev mount $dev $mnt -o compress=lzo xfs_io -f -c "truncate 118811" $mnt/foo xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null sync btrfs subvolume snapshot -r $mnt $mnt/mysnap1 xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null sync xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null sync btrfs subvolume snapshot -r $mnt $mnt/mysnap2 cat $mnt/mysnap2/foo # Above cat will fail due to EIO [CAUSE] The problem is in btrfs_submit_compressed_read(). When it tries to grab the extent map of the read range, it uses the following call: em = lookup_extent_mapping(em_tree, page_offset(bio_first_page_all(bio)), fs_info->sectorsize); The problem is in the page_offset(bio_first_page_all(bio)) part. The offending inode has the following file extent layout item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13680640 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53 generation 8 type 1 (regular) extent data disk byte 0 nr 0 item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13676544 nr 4096 extent data offset 0 nr 53248 ram 86016 extent compression 2 (lzo) And the bio passed in has the following parameters: page_offset(bio_first_page_all(bio)) = 131072 bio_first_bvec_all(bio)->bv_offset = 65536 If we use page_offset(bio_first_page_all(bio) without adding bv_offset, we will get an extent map for file offset 131072, not 196608. This means we read uncompressed data from disk, and later decompression will definitely fail. [FIX] Take bv_offset into consideration when trying to grab an extent map. And add an ASSERT() to ensure we're really getting a compressed extent. Thankfully this won't affect anything but subpage, thus we only need to ensure this patch get merged before we enabled basic subpage support. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: disable compressed readahead for subpageQu Wenruo
For current subpage support, we only support 64K page size with 4K sector size. This makes compressed readahead less effective, as maximum compressed extent size is only 128K, 2x the page size. On the other hand, the function add_ra_bio_pages() is still assuming sectorsize == PAGE_SIZE, and code change may affect 4K page size systems. So for now, let's disable subpage compressed readahead for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: compression: drop kmap/kunmap from generic helpersDavid Sterba
The pages in compressed_pages are not from highmem anymore so we can drop the mapping for checksum calculation and inline extent. Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: drop from __GFP_HIGHMEM all allocationsDavid Sterba
The highmem flag is used for allocating pages for compression and for raid56 pages. The high memory makes sense on 32bit systems but is not without problems. On 64bit system's it's just another layer of wrappers. The time the pages are allocated for compression or raid56 is relatively short (about a transaction commit), so the pages are not blocked indefinitely. As the number of pages depends on the amount of data being written/read, there's a theoretical problem. A fast device on a 32bit system could use most of the low memory pool, while with the highmem allocation that would not happen. This was possibly the original idea long time ago, but nowadays we optimize for 64bit systems. This patch removes all usage of the __GFP_HIGHMEM flag for page allocation, the kmap/kunmap are still in place and will be removed in followup patches. Remaining is masking out the bit in alloc_extent_state and __lookup_free_space_inode, that can safely stay. Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28btrfs: mark compressed range uptodate only if all bio succeedGoldwyn Rodrigues
In compression write endio sequence, the range which the compressed_bio writes is marked as uptodate if the last bio of the compressed (sub)bios is completed successfully. There could be previous bio which may have failed which is recorded in cb->errors. Set the writeback range as uptodate only if cb->errors is zero, as opposed to checking only the last bio's status. Backporting notes: in all versions up to 4.4 the last argument is always replaced by "!cb->errors". CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: remove a stale comment for btrfs_decompress_bio()Qu Wenruo
Since commit 8140dc30a432 ("btrfs: btrfs_decompress_bio() could accept compressed_bio instead"), btrfs_decompress_bio() accepts "struct compressed_bio" other than open-coded parameter list. Thus the comments for the parameter list is no longer needed. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered()Qu Wenruo
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in end_compressed_bio_write(). It passes compressed pages to btrfs_writepage_endio_finish_ordered(), which is only supposed to accept inode pages. Thankfully the important info here is the inode, so let's pass btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and make @page parameter optional. By this, end_compressed_bio_write() can happily pass page=NULL while still getting everything done properly. Also, to cooperate with such modification, replace @page parameter for trace_btrfs_writepage_end_io_hook() with btrfs_inode. Although this removes page_index info, the existing start/len should be enough for most usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: fix comment about max_out in btrfs_compress_pagesAnand Jain
Commit e5d74902362f ("btrfs: derive maximum output size in the compression implementation") removed @max_out argument in btrfs_compress_pages() but its comment remained, remove it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>