Age | Commit message (Collapse) | Author |
|
Instead of using pages for everything, find a folio and use that. This
makes things a bit cleaner as a lot of the functions calls here all take
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
__extent_writepage_io uses page everywhere, but a lot of these functions
take a folio. Convert it to use the folio based helpers, and then
change it to take a folio as an argument and update its callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Willy is wanting to get rid of page->index, convert the writepage
tracepoint to take a folio so we can do folio->index instead of
page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that the callers and helpers mostly use folio, convert
btrfs_do_readpage to take a folio, and rename it to btrfs_do_read_folio.
Update all of the page stuff to use the folio based helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The callers of this helper are going to be converted to using a folio,
so adjust submit_extent_page to become submit_extent_folio and update it
to use all the relevant folio helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This already uses a folio internally, change it to take a folio as an
argument instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have this helper function to set the page range uptodate once we're
done reading it, as well as run fsverity against it. Half of these
functions already take a folio, just rename this to end_folio_read and
then rework it to take a folio instead, and update everything
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we're using the page for everything here. Convert this to use
the folio helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We're the only user of readahead_page_batch(). Convert
btrfs_readahead() to use the folio based helpers to do readahead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In __extent_writepage_io(), we call btrfs_set_range_writeback() ->
folio_start_writeback(), which clears PAGECACHE_TAG_DIRTY mark from the
mapping xarray if the folio is not dirty. This worked fine before commit
97713b1a2ced ("btrfs: do not clear page dirty inside
extent_write_locked_range()").
After the commit, however, the folio is still dirty at this point, so the
mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io()
calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That
results in the page being unlocked with a "strange" state. The page is not
PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY.
This strange state looks like causing a hang with a call trace below when
running fstests generic/091 on a null_blk device. It is waiting for a folio
lock.
While I don't have an exact relation between this hang and the strange
state, fixing the state also fixes the hang. And, that state is worth
fixing anyway.
This commit reorders btrfs_folio_clear_dirty() and
btrfs_set_range_writeback() in __extent_writepage_io(), so that the
PAGECACHE_TAG_DIRTY tag is properly removed from the xarray.
[464.274] task:fsx state:D stack:0 pid:3034 tgid:3034 ppid:2853 flags:0x00004002
[464.286] Call Trace:
[464.291] <TASK>
[464.295] __schedule+0x10ed/0x6260
[464.301] ? __pfx___blk_flush_plug+0x10/0x10
[464.308] ? __submit_bio+0x37c/0x450
[464.314] ? __pfx___schedule+0x10/0x10
[464.321] ? lock_release+0x567/0x790
[464.327] ? __pfx_lock_acquire+0x10/0x10
[464.334] ? __pfx_lock_release+0x10/0x10
[464.340] ? __pfx_lock_acquire+0x10/0x10
[464.347] ? __pfx_lock_release+0x10/0x10
[464.353] ? do_raw_spin_lock+0x12e/0x270
[464.360] schedule+0xdf/0x3b0
[464.365] io_schedule+0x8f/0xf0
[464.371] folio_wait_bit_common+0x2ca/0x6d0
[464.378] ? folio_wait_bit_common+0x1cc/0x6d0
[464.385] ? __pfx_folio_wait_bit_common+0x10/0x10
[464.392] ? __pfx_filemap_get_folios_tag+0x10/0x10
[464.400] ? __pfx_wake_page_function+0x10/0x10
[464.407] ? __pfx___might_resched+0x10/0x10
[464.414] ? do_raw_spin_unlock+0x58/0x1f0
[464.420] extent_write_cache_pages+0xe49/0x1620 [btrfs]
[464.428] ? lock_acquire+0x435/0x500
[464.435] ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs]
[464.443] ? btrfs_do_write_iter+0x493/0x640 [btrfs]
[464.451] ? orc_find.part.0+0x1d4/0x380
[464.457] ? __pfx_lock_release+0x10/0x10
[464.464] ? __pfx_lock_release+0x10/0x10
[464.471] ? btrfs_do_write_iter+0x493/0x640 [btrfs]
[464.478] btrfs_writepages+0x1cc/0x460 [btrfs]
[464.485] ? __pfx_btrfs_writepages+0x10/0x10 [btrfs]
[464.493] ? is_bpf_text_address+0x6e/0x100
[464.500] ? kernel_text_address+0x145/0x160
[464.507] ? unwind_get_return_address+0x5e/0xa0
[464.514] ? arch_stack_walk+0xac/0x100
[464.521] do_writepages+0x176/0x780
[464.527] ? lock_release+0x567/0x790
[464.533] ? __pfx_do_writepages+0x10/0x10
[464.540] ? __pfx_lock_acquire+0x10/0x10
[464.546] ? __pfx_stack_trace_save+0x10/0x10
[464.553] ? do_raw_spin_lock+0x12e/0x270
[464.560] ? do_raw_spin_unlock+0x58/0x1f0
[464.566] ? _raw_spin_unlock+0x23/0x40
[464.573] ? wbc_attach_and_unlock_inode+0x3da/0x7d0
[464.580] filemap_fdatawrite_wbc+0x113/0x180
[464.587] ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
[464.596] __filemap_fdatawrite_range+0xaf/0xf0
[464.603] ? __pfx___filemap_fdatawrite_range+0x10/0x10
[464.611] ? trace_irq_enable.constprop.0+0xce/0x110
[464.618] ? kasan_quarantine_put+0xd7/0x1e0
[464.625] btrfs_start_ordered_extent+0x46f/0x570 [btrfs]
[464.633] ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs]
[464.642] ? __clear_extent_bit+0x2c0/0x9d0 [btrfs]
[464.650] btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs]
[464.659] ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs]
[464.669] btrfs_read_folio+0x12a/0x1d0 [btrfs]
[464.676] ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs]
[464.684] ? __pfx_filemap_add_folio+0x10/0x10
[464.691] ? __pfx___might_resched+0x10/0x10
[464.698] ? __filemap_get_folio+0x1c5/0x450
[464.705] prepare_uptodate_page+0x12e/0x4d0 [btrfs]
[464.713] prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
[464.721] ? fault_in_iov_iter_readable+0xd2/0x240
[464.729] btrfs_buffered_write+0x5bd/0x12f0 [btrfs]
[464.737] ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs]
[464.745] ? __pfx_lock_release+0x10/0x10
[464.752] ? generic_write_checks+0x275/0x400
[464.759] ? down_write+0x118/0x1f0
[464.765] ? up_write+0x19b/0x500
[464.770] btrfs_direct_write+0x731/0xba0 [btrfs]
[464.778] ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs]
[464.785] ? __pfx___might_resched+0x10/0x10
[464.792] ? lock_acquire+0x435/0x500
[464.798] ? lock_acquire+0x435/0x500
[464.804] btrfs_do_write_iter+0x494/0x640 [btrfs]
[464.811] ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs]
[464.819] ? __pfx___might_resched+0x10/0x10
[464.825] ? rw_verify_area+0x6d/0x590
[464.831] vfs_write+0x5d7/0xf50
[464.837] ? __might_fault+0x9d/0x120
[464.843] ? __pfx_vfs_write+0x10/0x10
[464.849] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
[464.856] ? lock_release+0x567/0x790
[464.862] ksys_write+0xfb/0x1d0
[464.867] ? __pfx_ksys_write+0x10/0x10
[464.873] ? _raw_spin_unlock+0x23/0x40
[464.879] ? btrfs_getattr+0x4af/0x670 [btrfs]
[464.886] ? vfs_getattr_nosec+0x79/0x340
[464.892] do_syscall_64+0x95/0x180
[464.898] ? __do_sys_newfstat+0xde/0xf0
[464.904] ? __pfx___do_sys_newfstat+0x10/0x10
[464.911] ? trace_irq_enable.constprop.0+0xce/0x110
[464.918] ? syscall_exit_to_user_mode+0xac/0x2a0
[464.925] ? do_syscall_64+0xa1/0x180
[464.931] ? trace_irq_enable.constprop.0+0xce/0x110
[464.939] ? trace_irq_enable.constprop.0+0xce/0x110
[464.946] ? syscall_exit_to_user_mode+0xac/0x2a0
[464.953] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
[464.960] ? do_syscall_64+0xa1/0x180
[464.966] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
[464.973] ? trace_irq_enable.constprop.0+0xce/0x110
[464.980] ? syscall_exit_to_user_mode+0xac/0x2a0
[464.987] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
[464.995] ? trace_irq_enable.constprop.0+0xce/0x110
[465.002] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
[465.010] ? do_syscall_64+0xa1/0x180
[465.016] ? lock_release+0x567/0x790
[465.022] ? __pfx_lock_acquire+0x10/0x10
[465.028] ? __pfx_lock_release+0x10/0x10
[465.034] ? trace_irq_enable.constprop.0+0xce/0x110
[465.042] ? syscall_exit_to_user_mode+0xac/0x2a0
[465.049] ? do_syscall_64+0xa1/0x180
[465.055] ? syscall_exit_to_user_mode+0xac/0x2a0
[465.062] ? do_syscall_64+0xa1/0x180
[465.068] ? syscall_exit_to_user_mode+0xac/0x2a0
[465.075] ? do_syscall_64+0xa1/0x180
[465.081] ? clear_bhb_loop+0x25/0x80
[465.087] ? clear_bhb_loop+0x25/0x80
[465.093] ? clear_bhb_loop+0x25/0x80
[465.099] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[465.106] RIP: 0033:0x7f093b8ee784
[465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784
[465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003
[465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000
[465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000
[465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001
[465.170] </TASK>
[465.174] INFO: lockdep is turned off.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 97713b1a2ced ("btrfs: do not clear page dirty inside extent_write_locked_range()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array(). And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.
Rename the @extra_gfp parameter to @nofail to indicate that.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function btrfs_alloc_folio_array() is only utilized in
btrfs_submit_compressed_read() and no other location, and the only
caller is not utilizing the @extra_gfp parameter.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pass a struct btrfs_inode to is_data_inode() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can add const to many parameters, this is for clarity and minor
addition to safety. There are some minor effects, in the assembly
code and .ko measured on release config. This patch does not cover all
possible conversions.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When extent_write_locked_range() generated an inline extent, it would
set and finish the writeback for the whole page.
Although currently it's safe since subpage disables inline creation,
for the sake of consistency, let it go with subpage helpers to set and
clear the writeback flags.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
For subpage + zoned case, the following workload can lead to rsv data
leak at unmount time:
# mkfs.btrfs -f -s 4k $dev
# mount $dev $mnt
# fsstress -w -n 8 -d $mnt -s 1709539240
0/0: fiemap - no filename
0/1: copyrange read - no filename
0/2: write - no filename
0/3: rename - no source filename
0/4: creat f0 x:0 0 0
0/4: creat add id=0,parent=-1
0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
# umount $mnt
The dmesg includes the following rsv leak detection warning (all call
trace skipped):
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info DATA has 268218368 free, is not full
BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
append size of 64K, and the system has 64K page size.
[CAUSE]
I have added several trace_printk() to show the events (header skipped):
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The above lines show our buffered write has dirtied 3 pages of inode
259 of root 5:
704K 768K 832K 896K
I |////I/////////////////I///////////| I
756K 868K
|///| is the dirtied range using subpage bitmaps. and 'I' is the page
boundary.
Meanwhile all three pages (704K, 768K, 832K) have their PageDirty
flag set.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
Then direct IO write starts, since the range [680K, 780K) covers the
beginning part of the above dirty range, we need to writeback the
two pages at 704K and 768K.
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
Now the above 2 lines show that we're writing back for dirty range
[756K, 756K + 64K).
We only writeback 64K because the zoned device has max zone append size
as 64K.
> extent_write_locked_range: r/i=5/259 clear dirty for page=786432
!!! The above line shows the root cause. !!!
We're calling clear_page_dirty_for_io() inside extent_write_locked_range(),
for the page 768K.
This is because extent_write_locked_range() can go beyond the current
locked page, here we hit the page at 768K and clear its page dirt.
In fact this would lead to the desync between subpage dirty and page
dirty flags. We have the page dirty flag cleared, but the subpage range
[820K, 832K) is still dirty.
After the writeback of range [756K, 820K), the dirty flags look like
this, as page 768K no longer has dirty flag set.
704K 768K 832K 896K
I I | I/////////////| I
820K 868K
This means we will no longer writeback range [820K, 832K), thus the
reserved data/metadata space would never be properly released.
> extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432
Now even though we try to start writeback for page 768K, since the
page is not dirty, we completely skip it at extent_write_cache_pages()
time.
> btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0
Now the direct IO finished.
> cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
> extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864
Now we writeback the remaining dirty range, which is [832K, 868K).
Causing the range [820K, 832K) never to be submitted, thus leaking the
reserved space.
This bug only affects subpage and zoned case. For non-subpage and zoned
case, we have exactly one sector for each page, thus no such partial dirty
cases.
For subpage and non-zoned case, we never go into run_delalloc_cow(), and
normally all the dirty subpage ranges would be properly submitted inside
__extent_writepage_io().
[FIX]
Just do not clear the page dirty at all inside extent_write_locked_range().
As __extent_writepage_io() would do a more accurate, subpage compatible
clear for page and subpage dirty flags anyway.
Now the correct trace would look like this:
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The page dirty part is still the same 3 pages.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
And the writeback for the first 64K is still correct.
> cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
> extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152
Now with the fix, we can properly writeback the range [820K, 832K), and
properly release the reserved data/metadata space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Function __extent_writepage_io() is designed to find all dirty ranges of
a page, and add the dirty ranges to the bio_ctrl for submission.
It requires all the dirtied ranges to be covered by an ordered extent.
It gets called in two locations, but one call site is not subpage aware:
- __extent_writepage()
It gets called when writepage_delalloc() returned 0, which means
writepage_delalloc() has handled delalloc for all subpage sectors
inside the page.
So this call site is OK.
- extent_write_locked_range()
This call site is utilized by zoned support, and in this case, we may
only run delalloc range for a subset of the page, like this: (64K page
size)
0 16K 32K 48K 64K
|/////| |///////| |
In the above case, if extent_write_locked_range() is only triggered for
range [0, 16K), __extent_writepage_io() would still try to submit
the dirty range of [32K, 48K), then it would not find any ordered
extent for it and triggers various ASSERT()s.
Fix this problem by:
- Introducing @start and @len parameters to specify the range
For the first call site, we just pass the whole page, and the behavior
is not touched, since run_delalloc_range() for the page should have
created all ordered extents for the page.
For the second call site, we avoid touching anything beyond the
range, thus avoiding the dirty range which is not yet covered by any
delalloc range.
- Making btrfs_folio_assert_not_dirty() subpage aware
The only caller is inside __extent_writepage_io(), and since that
caller now accepts a subpage range, we should also check the subpage
range other than the whole page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not as well before the big rewrite of fiemap I did some time ago).
The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
an inode operation.
Since there's a significant amount of fiemap code, move all of it into a
dedicated file, including its entry point inode.c:btrfs_fiemap().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Reported by 'gcc -Wcast-qual', the argument from which write_extent_buffer()
reads data to write to the eb should be const. In addition the const
needs to be also added to __write_extent_buffer() local buffers.
All callers of write_eb_member() can now be updated to use const for the
input buffer structure or type.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When running 'make W=2' there are a few reports where a variable of the
same name is declared in a nested block. In all the cases we can use the
one declared in the parent block, no problematic cases were found.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Another improper use of __folio_put() in an error path after freshly
allocating pages/folios which returns them with the refcount initialized
to 1. The refactor from __free_pages() -> __folio_put() (instead of
folio_put) removed a refcount decrement found in __free_pages() and
folio_put but absent from __folio_put().
Fixes: 13df3775efca ("btrfs: cleanup metadata page pointer usage")
CC: stable@vger.kernel.org # 6.8+
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
error state
Currently the error status of super block write is tracked in page/folio
status bit Error. For that we need to keep the reference for the whole
duration of write and wait.
Count the number of superblock writeback errors in the btrfs_device.
That means we don't need the folio to stay around until it's waited for,
and can avoid the extra call to folio_get/put.
Also remove a mention of PageError in a comment as it's the last mention
of the page Error state.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that we have the lock_extent tightly coupled with
extent_clear_unlock_delalloc we can add a cached state to
extent_clear_unlock_delalloc and benefit from skipping the extra lookup
when we're doing cow.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We want to limit the scope of the extent lock to be around operations
that can change in flight. Currently we hold the extent lock through
the entire writepage operation, which isn't really necessary.
We want to protect to make sure nobody has updated DELALLOC. In
find_lock_delalloc_range we must lock the range in order to validate the
contents of our io_tree. However once we've done that we're safe to
unlock the range and continue, as we have the page lock already held for
the range.
We are protected from all operations at this point.
* mmap() - we're holding the page lock, thus are protected.
* buffered writes - again, we're protected because we take the page lock
for the first and last page in our range for buffered writes so we
won't create new delalloc ranges in this area.
* direct IO - we invalidate pagecache before attempting to write a new
area, which requires the page lock, so again are protected once we're
holding the page lock on this range.
Additionally this behavior actually already exists for compressed, we
unlock the range as soon as we start to process the async extents, and
re-lock it during compression. So this is completely safe, and makes
the locking more consistent.
Make this simple by just pushing the extent lock into
btrfs_run_delalloc_range. From there followup patches will push the
lock further down into its users.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently try_release_extent_mapping() as an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean which corresponds to try_release_extent_mapping()'s
return value. So change its return value type to bool as well as its
helper try_release_extent_state().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At try_release_extent_mapping(), called during the release folio callback
(btrfs_release_folio() callchain), we don't release any extent maps in the
range if the GFP flags don't allow blocking. This behaviour is exaggerated
because:
1) Both searching for extent maps and removing them are not blocking
operations. The only thing that it is the cond_resched() call at the
end of the loop that searches for and removes extent maps;
2) We currently only operate on a single page, so for the case where
block size matches the page size, we can only have one extent map,
and for the case where the block size is smaller than the page size,
we can have at most 16 extent maps.
So it's very unlikely the cond_resched() call will ever block even in the
block size smaller than page size scenario.
So instead of not removing any extent maps at all in case the GFP glags
don't allow blocking, keep removing extent maps while we don't need to
reschedule. This makes it safe for the subpage case and for a future
where we can process folios with a size larger than a page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we don't attempt to release extent maps if the inode has an
i_size that is not greater than 16M. This condition was added way back
in 2008 by commit 70dec8079d78 ("Btrfs: extent_io and extent_state
optimizations"), without any explanation about it. A quick chat with
Chris on slack revealed that the goal was probably to release the extent
maps for small files only when closing the inode. This however can be
harmful in case we have tons of such files being kept open for very long
periods of time, since we will consume more and more pages for extent
maps.
So remove the condition.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Nowadays we have the btrfs_get_fs_generation() to get the current
generation of the filesystem, so there's no need anymore to lock the
transaction spinlock to read it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename the following variables:
1) "btrfs_inode" to "inode", because it's shorter to type and clear, and
we don't have a VFS inode here as well, so there's no confusion;
2) "tree" to "io_tree", to be clear which tree we are dealing with, since
we use 2 different trees in the function;
3) "map" to "extent_tree" since "map" gives the idea we are dealing with
an extent map for example, but we are dealing with the inode's extent
tree (the tree which stores extent maps).
These also make the next patches simpler.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
remove_extent_mapping().
In order to facilitate an upcoming change that adds a shrinker for extent
maps, change remove_extent_mapping() to receive the inode instead of its
extent map tree.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Our subpage testing started hanging on generic/560 and I bisected it
down to 1cab1375ba6d ("btrfs: reuse cloned extent buffer during
fiemap to avoid re-allocations"). This is subtle because we use
eb->start to figure out where in the folio we're copying to when we're
subpage, as our ->start may refer to an area inside of the folio.
For example, assume a 16K page size machine with a 4K node size, and
assume that we already have a cloned extent buffer when we cloned the
previous search.
copy_extent_buffer_full() will do the following when copying the extent
buffer path->nodes[0] (src) into cloned (dest):
src->start = 8k; // this is the new leaf we're cloning
cloned->start = 4k; // this is left over from the previous clone
src_addr = folio_address(src->folios[0]);
dest_addr = folio_address(dest->folios[0]);
memcpy(dest_addr + get_eb_offset_in_folio(dst, 0),
src_addr + get_eb_offset_in_folio(src, 0), src->len);
Now get_eb_offset_in_folio() is where the problems occur, because for
sub-pagesize blocksize we can have multiple eb's per folio, the code for
this is as follows
size_t get_eb_offset_in_folio(eb, offset) {
return (eb->start + offset & (folio_size(eb->folio[0]) - 1));
}
So in the above example we are copying into offset 4K inside the folio.
However once we update cloned->start to 8K to match the src the math for
get_eb_offset_in_folio() changes, and any subsequent reads (i.e.
btrfs_item_key_to_cpu()) will start reading from the offset 8K instead
of 4K where we copied to, giving us garbage.
Fix this by setting start before we co copy_extent_buffer_full() to make
sure that we're copying into the same offset inside of the folio that we
will read from later.
All other sites of copy_extent_buffer_full() are correct because we
either set ->start beforehand or we simply don't change it in the case
of the tree-log usage.
With this fix we now pass generic/560 on our subpage tests.
Fixes: 1cab1375ba6d ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The new helper will do the same thing as btrfs_alloc_page_array(), but
with folios.
One extra difference is, there is no extra helper for bulk allocation,
thus it may not be as efficient as the page version.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We recently tracked down a race condition that triggered a read for an
extent buffer with EXTENT_BUFFER_UPTODATE already set. While this read
was in progress, other concurrent readers would see the UPTODATE bit and
return early as if the read was already complete, making accesses to the
extent buffer conflict with the read operation that was overwriting it.
Add a WARN_ON() to end_bbio_meta_read() for this situation to make
similar races easier to spot in the future.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are clearing the bit and waking up any waiters in two different
places. Factor that code out into a static helper function.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no point in having a static writepages callback in inode.c that
does nothing besides calling extent_writepages from extent_io.c.
So just remove the callback at inode.c and rename extent_writepages()
to btrfs_writepages().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no point in having a static readahead callback in inode.c that
does nothing besides calling extent_readahead() from extent_io.c.
So just remove the callback at inode.c and rename extent_readahead()
to btrfs_readahead().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array() for compressed read/write.
[CAUSE]
For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and
even if the bulk allocation failed (fell back to single page
allocation) we still retry but with extra memalloc_retry_wait().
If the bulk alloc only returned one page a time, we would spend a lot of
time on the retry wait.
The behavior was introduced in commit 395cb57e8560 ("btrfs: wait between
incomplete batch memory allocations").
[FIX]
Although the commit mentioned that other filesystems do the wait, it's
not the case at least nowadays.
All the mainlined filesystems only call memalloc_retry_wait() if they
failed to allocate any page (not only for bulk allocation).
If there is any progress, they won't call memalloc_retry_wait() at all.
For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait()
if there is no allocation progress at all, and the call is not for
metadata readahead.
So I don't believe we should call memalloc_retry_wait() unconditionally
for short allocation.
Call memalloc_retry_wait() if it fails to allocate any page for tree
block allocation (which goes with __GFP_NOFAIL and may not need the
special handling anyway), and reduce the latency for
btrfs_alloc_page_array().
Reported-by: Julian Taylor <julian.taylor@1und1.de>
Tested-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Fixes: 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add an ASSERT to catch a faulty delayed reference item resulting from
prematurely cleared extent buffer.
Also, add a WARN to detect if we try to dirty a ZEROOUT buffer again, which
is suspicious as its update will be lost.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is
introduced to prevent a write hole of an extent buffer, which is once
allocated, marked dirty, but turns out unnecessary and cleaned up within
one transaction operation.
Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call
happens while the buffer is under IO (with the WRITEBACK flag set,
without the DIRTY flag), we can add the ZEROOUT flag and clear the
buffer's content just before a bio submission. As a result:
1) it can lead to adding faulty delayed reference item which leads to a
FS corrupted (EUCLEAN) error, and
2) it writes out cleared tree node on disk
The former issue is previously discussed in [1]. The corruption happens
when it runs a delayed reference update. So, on-disk data is safe.
[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/
The latter one can reach on-disk data. But, as that node is already
processed by btrfs_clear_buffer_dirty(), that will be invalidated in the
next transaction commit anyway. So, the chance of hitting the corruption
is relatively small.
Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to
keep the content under IO intact.
Fixes: aa6313e6ff2b ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:
/* (1) */
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;
/* (2) */
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
goto done;
At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does
/* (3) */
set_extent_buffer_uptodate(eb);
/* (4) */
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racey interleaving:
Thread A | Thread B | Thread C
---------+----------+---------
(1) | |
| (1) |
(2) | |
(3) | |
(4) | |
| (2) |
| | (1)
When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.
Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During fiemap we may have to visit multiple leaves of the subvolume's
inode tree, and each time we are freeing and allocating an extent buffer
to use as a clone of each visited leaf. Optimize this by reusing cloned
extent buffers, to avoid the freeing and re-allocation both of the extent
buffer structure itself and more importantly of the pages attached to the
extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).
This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.
The race happens like this:
1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
has delalloc in the file range [64M, 65M[, which is currently a hole;
2) Fiemap locks the inode in shared mode, then starts iterating the
inode's subvolume tree searching for file extent items, without having
the whole fiemap target range locked in the inode's io tree - the
change introduced recently by commit b0ad381fa769 ("btrfs: fix
deadlock with fiemap and extent locking"). It only locks ranges in
the io tree when it finds a hole or prealloc extent since that
commit;
3) Note that fiemap clones each leaf before using it, and this is to
avoid deadlocks when locking a file range in the inode's io tree and
the fiemap buffer is memory mapped to some file, because writing
to the page with btrfs_page_mkwrite() will wait on any ordered extent
for the page's range and the ordered extent needs to lock the range
and may need to modify the same leaf, therefore leading to a deadlock
on the leaf;
4) While iterating the file extent items in the cloned leaf before
finding the hole in the range [64M, 65M[, the delalloc in that range
is flushed and its ordered extent completes - meaning the corresponding
file extent item is in the inode's subvolume tree, but not present in
the cloned leaf that fiemap is iterating over;
5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
the cloned leaf (or a file extent item with disk_bytenr == 0 in case
the NO_HOLES feature is not enabled), it will lock that file range in
the inode's io tree and then search for delalloc by checking for the
EXTENT_DELALLOC bit in the io tree for that range and ordered extents
(with btrfs_find_delalloc_in_range()). But it finds nothing since the
delalloc in that range was already flushed and the ordered extent
completed and is gone - as a result fiemap will not report that there's
delalloc or an extent for the range [64M, 65M[, so user space will be
mislead into thinking that there's a hole in that range.
This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:
generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
--- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
@@ -1,3 +1,9 @@
QA output created by 094
fiemap run with sync
fiemap run without sync
+ERROR: couldn't find extent at 7
+map is 'HHDDHPPDPHPH'
+logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
+logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)
So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:
1) Always lock the whole range in the inode's io tree before starting to
iterate the inode's subvolume tree searching for file extent items,
just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with
fiemap and extent locking");
2) Now instead of writing to the fiemap buffer every time we have an extent
to report, write instead to a temporary buffer (1 page), and when that
buffer becomes full, stop iterating the file extent items, unlock the
range in the io tree, release the search path, submit all the entries
kept in that buffer to the fiemap buffer, and then resume the search
for file extent items after locking again the remainder of the range in
the io tree.
The buffer having a size of a page, allows for 146 entries in a system
with 4K pages. This is a large enough value to have a good performance
by avoiding too many restarts of the search for file extent items.
In other words this preserves the huge performance gains made in the
last two years to fiemap, while avoiding the deadlocks in case the
fiemap buffer is memory mapped to the same file (useless in practice,
but possible and exercised by fuzz testing and syzbot).
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can pass a valid em cache pointer down to __get_extent_map() and
drop the validity check. This avoids the special case, the call stacks
are simple:
btrfs_read_folio
btrfs_do_readpage
__get_extent_map
extent_readahead
contiguous_readpages
btrfs_do_readpage
__get_extent_map
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The fs_info and sectorsize remain the same during the loops, no need to
set them on each iteration.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|