Age | Commit message (Collapse) | Author |
|
Pull bcachefs fixes from Kent Overstreet:
"Some more mostly boring fixes, but some not
User reported ones:
- the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
performance bug; user reported an untar initially taking two
seconds and then ~2 minutes
- kill a __GFP_NOFAIL in the buffered read path; this was a leftover
from the trickier fix to kill __GFP_NOFAIL in readahead, where we
can't return errors (and have to silently truncate the read
ourselves).
bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
filesystems because our folio state is just barely too big, 2MB
hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.
additionally, the flags argument was just buggy, we weren't
supplying GFP_KERNEL previously (!)"
* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
bcachefs: fix bch2_save_backtrace()
bcachefs: Fix check_snapshot() memcpy
bcachefs: Fix bch2_journal_flush_device_pins()
bcachefs: fix iov_iter count underflow on sub-block dio read
bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
bcachefs: Kill __GFP_NOFAIL in buffered read path
bcachefs: fix backpointer_to_text() when dev does not exist
|
|
Missed a call in the previous fix.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
- Fix page refcount leak when looking up specific inodes
introduced by metabuf reworking
* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix refcount on the metabuf used for inode lookup
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull RCU pathwalk fixes from Al Viro:
"We still have some races in filesystem methods when exposed to RCU
pathwalk. This series is a result of code audit (the second round of
it) and it should deal with most of that stuff.
Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
Up to maintainers (a note for NTFS folks - when documentation says
that a method may not block, it *does* imply that blocking allocations
are to be avoided. Really)"
[ More explanations for people who aren't familiar with the vagaries of
RCU path walking: most of it is hidden from filesystems, but if a
filesystem actively participates in the low-level path walking it
needs to make sure the fields involved in that walk are RCU-safe.
That "actively participate in low-level path walking" includes things
like having its own ->d_hash()/->d_compare() routines, or by having
its own directory permission function that doesn't just use the common
helpers. Having a ->d_revalidate() function will also have this issue.
Note that instead of making everything RCU safe you can also choose to
abort the RCU pathwalk if your operation cannot be done safely under
RCU, but that obviously comes with a performance penalty. One common
pattern is to allow the simple cases under RCU, and abort only if you
need to do something more complicated.
So not everything needs to be RCU-safe, and things like the inode etc
that the VFS itself maintains obviously already are. But these fixes
tend to be about properly RCU-delaying things like ->s_fs_info that
are maintained by the filesystem and that got potentially released too
early. - Linus ]
* tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ext4_get_link(): fix breakage in RCU mode
cifs_get_link(): bail out in unsafe case
fuse: fix UAF in rcu pathwalks
procfs: make freeing proc_fs_info rcu-delayed
procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
nfs: fix UAF on pathwalk running into umount
nfs: make nfs_set_verifier() safe for use in RCU pathwalk
afs: fix __afs_break_callback() / afs_drop_open_mmap() race
hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
affs: free affs_sb_info with kfree_rcu()
rcu pathwalk: prevent bogus hard errors from may_lookup()
fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
|
|
Pull vfs fixes from Al Viro:
"A couple of fixes - revert of regression from this cycle and a fix for
erofs failure exit breakage (had been there since way back)"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
erofs: fix handling kern_mount() failure
Revert "get rid of DCACHE_GENOCIDE"
|
|
1) errors from ext4_getblk() should not be propagated to caller
unless we are really sure that we would've gotten the same error
in non-RCU pathwalk.
2) we leak buffer_heads if ext4_getblk() is successful, but bh is
not uptodate.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
->d_revalidate() bails out there, anyway. It's not enough
to prevent getting into ->get_link() in RCU mode, but that
could happen only in a very contrieved setup. Not worth
trying to do anything fancy here unless ->d_revalidate()
stops kicking out of RCU mode at least in some cases.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
->permission(), ->get_link() and ->inode_get_acl() might dereference
->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
as well) when called from rcu pathwalk.
Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
and dropping ->user_ns rcu-delayed too.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
is still synchronous, but that's not a problem - it does
rcu-delay everything that needs to be)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
that keeps both around until struct inode is freed, making access
to them safe from rcu-pathwalk
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
NFS ->d_revalidate(), ->permission() and ->get_link() need to access
some parts of nfs_server when called in RCU mode:
server->flags
server->caps
*(server->io_stats)
and, worst of all, call
server->nfs_client->rpc_ops->have_delegation
(the last one - as NFS_PROTO(inode)->have_delegation()). We really
don't want to RCU-delay the entire nfs_free_server() (it would have
to be done with schedule_work() from RCU callback, since it can't
be made to run from interrupt context), but actual freeing of
nfs_server and ->io_stats can be done via call_rcu() just fine.
nfs_client part is handled simply by making nfs_free_client() use
kfree_rcu().
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry. For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked(). It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.
That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time. If that happens, back-to-back attempts to kill dentry and
its parent are quite normal. Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero
do queue_work(&vnode->cb_work). In afs_drop_open_mmap() we decrement
->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero.
The trouble is, there's nothing to prevent __afs_break_callback() from
seeing ->cb_nr_mmap before the decrement and do queue_work() after both
the decrement and flush_work(). If that happens, we might be in trouble -
vnode might get freed before the queued work runs.
__afs_break_callback() is always done under ->cb_lock, so let's make
sure that ->cb_nr_mmap can change from non-zero to zero while holding
->cb_lock (the spinlock component of it - it's a seqlock and we don't
need to mess with the counter).
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
->d_hash() and ->d_compare() use those, so we need to delay freeing
them.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
That stuff can be accessed by ->d_hash()/->d_compare(); as it is, we have
a hard-to-hit UAF if rcu pathwalk manages to get into ->d_hash() on a filesystem
that is in process of getting shut down.
Besides, having nls and upcase table cleanup moved from ->put_super() towards
the place where sbi is freed makes for simpler failure exits.
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
one of the flags in it is used by ->d_hash()/->d_compare()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
If lazy call of ->permission() returns a hard error, check that
try_to_unlazy() succeeds before returning it. That both makes
life easier for ->permission() instances and closes the race
in ENOTDIR handling - it is possible that positive d_can_lookup()
seen in link_path_walk() applies to the state *after* unlink() +
mkdir(), while nd->inode matches the state prior to that.
Normally seeing e.g. EACCES from permission check in rcu pathwalk
means that with some timings non-rcu pathwalk would've run into
the same; however, running into a non-executable regular file
in the middle of a pathname would not get to permission check -
it would fail with ENOTDIR instead.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Avoids fun races in RCU pathwalk... Same goes for freeing LSM shite
hanging off super_block's arse.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
check_snapshot() copies the bch_snapshot to a temporary to easily handle
older versions that don't have all the fields of the current version,
but it lacked a min() to correctly handle keys newer and larger than the
current version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If a journal write errored, the list of devices it was written to could
be empty - we're not supposed to mark an empty replicas list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_direct_IO_read() checks the request offset and size for sector
alignment and then falls through to a couple calculations to shrink
the size of the request based on the inode size. The problem is that
these checks round up to the fs block size, which runs the risk of
underflowing iter->count if the block size happens to be large
enough. This is triggered by fstest generic/361 with a 4k block
size, which subsequently leads to a crash. To avoid this crash,
check that the shorten length doesn't exceed the overall length of
the iter.
Fixes:
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
keyspace where no keys are visible in the current snapshot, we have a
problem - we'll scan for a very long time before scanning terminates.
Awhile back, this was fixed for most cases with peek_upto() (and
assertions that enforce that it's being used).
But the fix missed the fact that the inodes btree is different - every
key offset is in a different snapshot tree, not just the inode field.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Recently, we fixed our __GFP_NOFAIL usage in the readahead path, but the
easy one in read_single_folio() (where wa can return an error) was
missed - oops.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix a memory leak in cachefiles
- Restrict aio cancellations to I/O submitted through the aio
interfaces as this is otherwise causing issues for I/O submitted
via io_uring
- Increase buffer for afs volume status to avoid overflow
- Fix a missing zero-length check in unbuffered writes in the
netfs library. If generic_write_checks() returns zero make
netfs_unbuffered_write_iter() return right away
- Prevent a leak in i_dio_count caused by netfs_begin_read() operating
past i_size. It will return early and leave i_dio_count incremented
- Account for ipv4 addresses as well as ipv6 addresses when processing
incoming callbacks in afs
* tag 'vfs-6.8-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio
afs: Increase buffer size in afs_update_volume_status()
afs: Fix ignored callbacks over ipv4
cachefiles: fix memory leak in cachefiles_add_cache()
netfs: Fix missing zero-length check in unbuffered write
netfs: Fix i_dio_count leak on DIO read past i_size
|
|
In erofs_find_target_block() when erofs_dirnamecmp() returns 0,
we do not assign the target metabuf. This causes the caller
erofs_namei()'s erofs_put_metabuf() at the end to be not effective
leaving the refcount on the page.
As the page from metabuf (buf->page) is never put, such page cannot be
migrated or reclaimed. Fix it now by putting the metabuf from
previous loop and assigning the current metabuf to target before
returning so caller erofs_namei() can do the final put as it was
intended.
Fixes: 500edd095648 ("erofs: use meta buffers for inode lookup")
Cc: <stable@vger.kernel.org> # 5.18+
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240221210348.3667795-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- Fix a deadlock in fiemap.
There was a big lock around the whole operation that can interfere
with a page fault and mkwrite.
Reducing the lock scope can also speed up fiemap
- Fix range condition for extent defragmentation which could lead to
worse layout in some cases
* tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix deadlock with fiemap and extent locking
btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size
|
|
If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the
following kernel warning appears:
WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8
Call trace:
kiocb_set_cancel_fn+0x9c/0xa8
ffs_epfile_read_iter+0x144/0x1d0
io_read+0x19c/0x498
io_issue_sqe+0x118/0x27c
io_submit_sqes+0x25c/0x5fc
__arm64_sys_io_uring_enter+0x104/0xab0
invoke_syscall+0x58/0x11c
el0_svc_common+0xb4/0xf4
do_el0_svc+0x2c/0xb0
el0_svc+0x2c/0xa4
el0t_64_sync_handler+0x68/0xb4
el0t_64_sync+0x1a4/0x1a8
Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is
submitted by libaio.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The max length of volume->vid value is 20 characters.
So increase idbuf[] size up to 24 to avoid overflow.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
[DH: Actually, it's 20 + NUL, so increase it to 24 and use snprintf()]
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240211150442.3416-1-d.dulov@aladdin.ru/ # v1
Link: https://lore.kernel.org/r/20240212083347.10742-1-d.dulov@aladdin.ru/ # v2
Link: https://lore.kernel.org/r/20240219143906.138346-3-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When searching for a matching peer, all addresses need to be searched,
not just the ipv6 ones in the fs_addresses6 list.
Given that the lists no longer contain addresses, there is little
reason to splitting things between separate lists, so unify them
into a single list.
When processing an incoming callback from an ipv4 address, this would
lead to a failure to set call->server, resulting in the callback being
ignored and the client seeing stale contents.
Fixes: 72904d7b9bfb ("rxrpc, afs: Allow afs to pin rxrpc_peer objects")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://lists.infradead.org/pipermail/linux-afs/2024-February/008035.html
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lists.infradead.org/pipermail/linux-afs/2024-February/008037.html # v1
Link: https://lists.infradead.org/pipermail/linux-afs/2024-February/008066.html # v2
Link: https://lore.kernel.org/r/20240219143906.138346-2-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The following memory leak was reported after unbinding /dev/cachefiles:
==================================================================
unreferenced object 0xffff9b674176e3c0 (size 192):
comm "cachefilesd2", pid 680, jiffies 4294881224
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc ea38a44b):
[<ffffffff8eb8a1a5>] kmem_cache_alloc+0x2d5/0x370
[<ffffffff8e917f86>] prepare_creds+0x26/0x2e0
[<ffffffffc002eeef>] cachefiles_determine_cache_security+0x1f/0x120
[<ffffffffc00243ec>] cachefiles_add_cache+0x13c/0x3a0
[<ffffffffc0025216>] cachefiles_daemon_write+0x146/0x1c0
[<ffffffff8ebc4a3b>] vfs_write+0xcb/0x520
[<ffffffff8ebc5069>] ksys_write+0x69/0xf0
[<ffffffff8f6d4662>] do_syscall_64+0x72/0x140
[<ffffffff8f8000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
==================================================================
Put the reference count of cache_cred in cachefiles_daemon_unbind() to
fix the problem. And also put cache_cred in cachefiles_add_cache() error
branch to avoid memory leaks.
Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240217081431.796809-1-libaokun1@huawei.com
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
if you have a variable that holds NULL or a pointer to live struct mount,
do not shove ERR_PTR() into it - not if you later treat "not NULL" as
"holds a pointer to object".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.
This deadlock exists with our normal code, we just don't have lockdep
annotations with the extent locking so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With the following file extent layout, defrag would do unnecessary IO
and result more on-disk space usage.
# mkfs.btrfs -f $dev
# mount $dev $mnt
# xfs_io -f -c "pwrite 0 40m" $mnt/foobar
# sync
# xfs_io -f -c "pwrite 40m 16k" $mnt/foobar
# sync
Above command would lead to the following file extent layout:
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 298844160 nr 41943040
extent data offset 0 nr 41943040 ram 41943040
extent compression 0 (none)
item 7 key (257 EXTENT_DATA 41943040) itemoff 15763 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13631488 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0 (none)
Which is mostly fine. We can allow the final 16K to be merged with the
previous 40M, but it's upon the end users' preference.
But if we defrag the file using the default parameters, it would result
worse file layout:
# btrfs filesystem defrag $mnt/foobar
# sync
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 298844160 nr 41943040
extent data offset 0 nr 8650752 ram 41943040
extent compression 0 (none)
item 7 key (257 EXTENT_DATA 8650752) itemoff 15763 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 340787200 nr 33292288
extent data offset 0 nr 33292288 ram 33292288
extent compression 0 (none)
item 8 key (257 EXTENT_DATA 41943040) itemoff 15710 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13631488 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0 (none)
Note the original 40M extent is still there, but a new 32M extent is
created for no benefit at all.
[CAUSE]
There is an existing check to make sure we won't defrag a large enough
extent (the threshold is by default 32M).
But the check is using the length to the end of the extent:
range_len = em->len - (cur - em->start);
/* Skip too large extent */
if (range_len >= extent_thresh)
goto next;
This means, for the first 8MiB of the extent, the range_len is always
smaller than the default threshold, and would not be defragged.
But after the first 8MiB, the remaining part would fit the requirement,
and be defragged.
Such different behavior inside the same extent caused the above problem,
and we should avoid different defrag decision inside the same extent.
[FIX]
Instead of using @range_len, just use @em->len, so that we have a
consistent decision among the same file extent.
Now with this fix, we won't touch the extent, thus not making it any
worse.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 0cb5950f3f3b ("btrfs: fix deadlock when reserving space during defrag")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull bcachefs fixes from Kent Overstreet:
"Mostly pretty trivial, the user visible ones are:
- don't barf when replicas_required > replicas
- fix check_version_upgrade() so it doesn't do something nonsensical
when we're downgrading"
* tag 'bcachefs-2024-02-17' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix missing va_end()
bcachefs: Fix check_version_upgrade()
bcachefs: Clamp replicas_required to replicas
bcachefs: fix missing endiannes conversion in sb_members
bcachefs: fix kmemleak in __bch2_read_super error handling path
bcachefs: Fix missing bch2_err_class() calls
|
|
Pull smb client fixes from Steve French:
"Five smb3 client fixes, most also for stable:
- Two multichannel fixes (one to fix potential handle leak on retry)
- Work around possible serious data corruption (due to change in
folios in 6.3, for cases when non standard maximum write size
negotiated)
- Symlink creation fix
- Multiuser automount fix"
* tag '6.8-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: Fix regression in writes when non-standard maximum write size negotiated
smb: client: handle path separator of created SMB symlinks
smb: client: set correct id, uid and cruid for multiuser automounts
cifs: update the same create_guid on replay
cifs: fix underflow in parse_server_interfaces()
|
|
Pull ceph fixes from Ilya Dryomov:
"Additional cap handling fixes from Xiubo to avoid "client isn't
responding to mclientcaps(revoke)" stalls on the MDS side"
* tag 'ceph-for-6.8-rc5' of https://github.com/ceph/ceph-client:
ceph: add ceph_cap_unlink_work to fire check_caps() immediately
ceph: always queue a writeback when revoking the Fb caps
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
- Fix direct write error handling to avoid a race between failed IO
completion and the submission path itself which can result in an
invalid file size exposed to the user after the failed IO.
* tag 'zonefs-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Improve error handling
|
|
The conversion to netfs in the 6.3 kernel caused a regression when
maximum write size is set by the server to an unexpected value which is
not a multiple of 4096 (similarly if the user overrides the maximum
write size by setting mount parm "wsize", but sets it to a value that
is not a multiple of 4096). When negotiated write size is not a
multiple of 4096 the netfs code can skip the end of the final
page when doing large sequential writes, causing data corruption.
This section of code is being rewritten/removed due to a large
netfs change, but until that point (ie for the 6.3 kernel until now)
we can not support non-standard maximum write sizes.
Add a warning if a user specifies a wsize on mount that is not
a multiple of 4096 (and round down), also add a change where we
round down the maximum write size if the server negotiates a value
that is not a multiple of 4096 (we also have to check to make sure that
we do not round it down to zero).
Reported-by: R. Diez" <rdiez-2006@rd10.de>
Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Suggested-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Acked-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org # v6.3+
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value which results in garbage data being readable at the end
of a file. There are 2 problems:
1) zonefs_file_dio_write() updates a zone file write pointer offset
after issuing a direct IO with iomap_dio_rw(). This update is done
only if the IO succeed for synchronous direct writes. However, for
asynchronous direct writes, the update is done without waiting for
the IO completion so that the next asynchronous IO can be
immediately issued. However, if an asynchronous IO completes with a
failure right before the i_truncate_mutex lock protecting the update,
the update may change the value of the inode write pointer offset
that was corrected by the error path (zonefs_io_error() function).
2) zonefs_io_error() is called when a read or write error occurs. This
function executes a report zone operation using the callback function
zonefs_io_error_cb(), which does all the error recovery handling
based on the current zone condition, write pointer position and
according to the mount options being used. However, depending on the
zoned device being used, a report zone callback may be executed in a
context that is different from the context of __zonefs_io_error(). As
a result, zonefs_io_error_cb() may be executed without the inode
truncate mutex lock held, which can lead to invalid error processing.
Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
direct write is issued with iomap_dio_rw(). This is safe to do as
partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
set) and any failed IO will trigger the execution of zonefs_io_error()
which will correct the inode write pointer offset to reflect the
current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
and call this function directly from __zonefs_io_error() after
obtaining the zone information using blkdev_report_zones() with a
simple callback function that copies to a local stack variable the
struct blk_zone obtained from the device. This ensures that error
handling is performed holding the inode truncate mutex.
This change also simplifies error handling for conventional zone files
by bypassing the execution of report zones entirely. This is safe to
do because the condition of conventional zones cannot be read-only or
offline and conventional zone files are always fully mapped with a
constant file size.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regular fixes and one fix for space reservation regression since
6.7 that users have been reporting:
- fix over-reservation of metadata chunks due to not keeping proper
balance between global block reserve and delayed refs reserve; in
practice this leaves behind empty metadata block groups, the
workaround is to reclaim them by using the '-musage=1' balance
filter
- other space reservation fixes:
- do not delete unused block group if it may be used soon
- do not reserve space for checksums for NOCOW files
- fix extent map assertion failure when writing out free space inode
- reject encoded write if inode has nodatasum flag set
- fix chunk map leak when loading block group zone info"
* tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't refill whole delayed refs block reserve when starting transaction
btrfs: zoned: fix chunk map leak when loading block group zone info
btrfs: reject encoded write if inode has nodatasum flag set
btrfs: don't reserve space for checksums when writing to nocow files
btrfs: add new unused block groups to the list of unused block groups
btrfs: do not delete unused block group if it may be used soon
btrfs: add and use helper to check if block group is used
btrfs: don't drop extent_map for free space inode on write error
|
|
Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u
Reported-by: coverity scan
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When also downgrading, check_version_upgrade() could pick a new version
greater than the latest supported version.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This prevents going emergency read only when the user has specified
replicas_required > replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs
when starting transaction") we started not only to reserve metadata space
for the delayed refs a caller of btrfs_start_transaction() might generate
but also to try to fully refill the delayed refs block reserve, because
there are several case where we generate delayed refs and haven't reserved
space for them, relying on the global block reserve. Relying too much on
the global block reserve is not always safe, and can result in hitting
-ENOSPC during transaction commits or worst, in rare cases, being unable
to mount a filesystem that needs to do orphan cleanup or anything that
requires modifying the filesystem during mount, and has no more
unallocated space and the metadata space is nearly full. This was
explained in detail in that commit's change log.
However the gap between the reserved amount and the size of the delayed
refs block reserve can be huge, so attempting to reserve space for such
a gap can result in allocating many metadata block groups that end up
not being used. After a recent patch, with the subject:
"btrfs: add new unused block groups to the list of unused block groups"
We started to add new block groups that are unused to the list of unused
block groups, to avoid having them around for a very long time in case
they are never used, because a block group is only added to the list of
unused block groups when we deallocate the last extent or when mounting
the filesystem and the block group has 0 bytes used. This is not a problem
introduced by the commit mentioned earlier, it always existed as our
metadata space reservations are, most of the time, pessimistic and end up
not using all the space they reserved, so we can occasionally end up with
one or two unused metadata block groups for a long period. However after
that commit mentioned earlier, we are just more pessimistic in the
metadata space reservations when starting a transaction and therefore the
issue is more likely to happen.
This however is not always enough because we might create unused metadata
block groups when reserving metadata space at a high rate if there's
always a gap in the delayed refs block reserve and the cleaner kthread
isn't triggered often enough or is busy with other work (running delayed
iputs, cleaning deleted roots, etc), not to mention the block group's
allocated space is only usable for a new block group after the transaction
used to remove it is committed.
A user reported that he's getting a lot of allocated metadata block groups
but the usage percentage of metadata space was very low compared to the
total allocated space, specially after running a series of block group
relocations.
So for now stop trying to refill the gap in the delayed refs block reserve
and reserve space only for the delayed refs we are expected to generate
when starting a transaction.
CC: stable@vger.kernel.org # 6.7+
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Reported-by: Heddxh <g311571057@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_load_block_group_zone_info() we never drop a reference on the
chunk map we have looked up, therefore leaking a reference on it. So
add the missing btrfs_free_chunk_map() at the end of the function.
Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps")
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.
For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
root 256 inode 257 errors 1040, bad file extent, some csum missing
root 256 inode 258 errors 1040, bad file extent, some csum missing
ERROR: errors found in fs roots
(...)
So reject encoded writes if the target inode has NODATASUM set.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently when doing a write to a file we always reserve metadata space
for inserting data checksums. However we don't need to do it if we have
a nodatacow file (-o nodatacow mount option or chattr +C) or if checksums
are disabled (-o nodatasum mount option), as in that case we are only
adding unnecessary pressure to metadata reservations.
For example on x86_64, with the default node size of 16K, a 4K buffered
write into a nodatacow file is reserving 655360 bytes of metadata space,
as it's accounting for checksums. After this change, which stops reserving
space for checksums if we have a nodatacow file or checksums are disabled,
we only need to reserve 393216 bytes of metadata.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When unlinking a file the check caps could be delayed for more than
5 seconds, but in MDS side it maybe waiting for the clients to
release caps.
This will use the cap_wq work queue and a dedicated list to help
fire the check_caps() and dirty buffer flushing immediately.
Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In case there is 'Fw' dirty caps and 'CHECK_CAPS_FLUSH' is set we
will always ignore queue a writeback. Queue a writeback is very
important because it will block kclient flushing the snapcaps to
MDS and which will block MDS waiting for revoking the 'Fb' caps.
Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|