Age | Commit message (Collapse) | Author |
|
With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll
requests doesn't make much sense, and in any case re-adding it
shouldn't be a problem considering batching in tctx_task_work(). We can
remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/15699682bf81610ec901d4e79d6da64baa9f70be.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move some poll helpers/etc up, we'll need them there shortly
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c5c3dba24c86aad5cd389a54a8c7412e6a0621d.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Clean up io_poll_update() and unify cancellation paths for remove and
update.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5937138b6265a1285220e2fab1b28132c1d73ce3.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Before updating a poll request we should remove it from poll queues,
including the double poll entry.
Fixes: b69de288e913 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ac39e7f80152613603b8a6cc29a2b6063ac2434f.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is no need to pass the pointer to the kset in the struct
kset_uevent_ops callbacks as no one uses it, so just remove that pointer
entirely.
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Wedson Almeida Filho <wedsonaf@google.com>
Link: https://lore.kernel.org/r/20211227163924.3970661-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull ksmbd fixes from Steve French:
"Three ksmbd fixes, all for stable as well.
Two fix potential unitialized memory and one fixes a security problem
where encryption is unitentionally disabled from some clients"
* tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
ksmbd: fix uninitialized symbol 'pntsd_size'
ksmbd: fix error code in ndr_read_int32()
|
|
Pull io_uring fix from Jens Axboe:
"Single fix for not clearing kiocb->ki_pos back to 0 for a stream,
destined for stable as well"
* tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-block:
io_uring: zero iocb->ki_pos for stream file types
|
|
This series takes care of a couple of TODOs and adds new ones. Update
the TODOs section to reflect current state and future work that needs
to happen.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-5-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Move fast commit stats updating logic to a separate function from
ext4_fc_commit(). This significantly improves readability of
ext4_fc_commit().
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-4-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch drops ext4_fc_start_ineligible() and
ext4_fc_stop_ineligible() APIs. Fast commit ineligible transactions
should simply call ext4_fc_mark_ineligible() after starting the
trasaction.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-3-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch drops all calls to ext4_fc_start_update() and
ext4_fc_stop_update(). To ensure that there are no ongoing journal
updates during fast commit, we also make jbd2_fc_begin_commit() lock
journal for updates. This way we don't have to maintain two different
transaction start stop APIs for fast commit and full commit. This
patch doesn't remove the functions altogether since in future we want
to have inode level locking for fast commits.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-2-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
iomap_write_end() does not return a negative errno to indicate an
error, but the number of bytes successfully copied. It cannot return
an error today, so include a debugging assertion like the one in
iomap_unshare_iter().
Fixes: c6f40468657d ("fsdax: decouple zeroing from the iomap buffered I/O code")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211221044450.517558-1-willy@infradead.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
GC task can deadlock in read_cache_page() because it may attempt
to release a page that is actually allocated by another task in
jffs2_write_begin().
The reason is that in jffs2_write_begin() there is a small window
a cache page is allocated for use but not set Uptodate yet.
This ends up with a deadlock between two tasks:
1) A task (e.g. file copy)
- jffs2_write_begin() locks a cache page
- jffs2_write_end() tries to lock "alloc_sem" from
jffs2_reserve_space() <-- STUCK
2) GC task (jffs2_gcd_mtd3)
- jffs2_garbage_collect_pass() locks "alloc_sem"
- try to lock the same cache page in read_cache_page() <-- STUCK
So to avoid this deadlock, hold "alloc_sem" in jffs2_write_begin()
while reading data in a cache page.
Signed-off-by: Kyeong Yoo <kyeong.yoo@alliedtelesis.co.nz>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
If ubifs_garbage_collect_leb() returns -EAGAIN and ubifs_return_leb
returns error, a LEB will always has a "taken" flag. In this case,
set the ubifs to read-only to prevent a worse situation.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
If ubifs_garbage_collect_leb() returns -EAGAIN and enters the "out"
branch, ubifs_return_leb will execute twice on the same lnum. This
can cause data loss in concurrency situations.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Hulk Robot reported a KASAN report about slab-out-of-bounds:
==================================================================
BUG: KASAN: slab-out-of-bounds in ubifs_change_lp+0x3a9/0x1390 [ubifs]
Read of size 8 at addr ffff888101c961f8 by task fsstress/1068
[...]
Call Trace:
check_memory_region+0x1c1/0x1e0
ubifs_change_lp+0x3a9/0x1390 [ubifs]
ubifs_change_one_lp+0x170/0x220 [ubifs]
ubifs_garbage_collect+0x7f9/0xda0 [ubifs]
ubifs_budget_space+0xfe4/0x1bd0 [ubifs]
ubifs_write_begin+0x528/0x10c0 [ubifs]
Allocated by task 1068:
kmemdup+0x25/0x50
ubifs_lpt_lookup_dirty+0x372/0xb00 [ubifs]
ubifs_update_one_lp+0x46/0x260 [ubifs]
ubifs_tnc_end_commit+0x98b/0x1720 [ubifs]
do_commit+0x6cb/0x1950 [ubifs]
ubifs_run_commit+0x15a/0x2b0 [ubifs]
ubifs_budget_space+0x1061/0x1bd0 [ubifs]
ubifs_write_begin+0x528/0x10c0 [ubifs]
[...]
==================================================================
In ubifs_garbage_collect(), if ubifs_find_dirty_leb returns an error,
lp is an uninitialized variable. But lp.num might be used in the out
branch, which is a random value. If the value is -1 or another value
that can pass the check, soob may occur in the ubifs_change_lp() in
the following procedure.
To solve this problem, we initialize lp.lnum to -1, and then initialize
it correctly in ubifs_find_dirty_leb, which is not equal to -1, and
ubifs_return_leb is executed only when lp.lnum != -1.
if find a retained or indexing LEB and continue to next loop, but break
before find another LEB, the "taken" flag of this LEB will be cleaned
in ubi_return_lebi(). This bug has also been fixed in this patch.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
The snprintf() function returns the number of bytes (not including the
NUL terminator) which would have been printed if there were enough
space. So it can be greater than UBIFS_DFS_DIR_LEN. And actually if
it equals UBIFS_DFS_DIR_LEN then that's okay so this check is too
strict.
Fixes: 9a620291fc01 ("ubifs: Export filesystem error counters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Not all ubifs filesystem errors are propagated to userspace.
Export bad magic, bad node and crc errors via sysfs. This allows userspace
to notice filesystem errors:
/sys/fs/ubifs/ubiX_Y/errors_magic
/sys/fs/ubifs/ubiX_Y/errors_node
/sys/fs/ubifs/ubiX_Y/errors_crc
The counters are reset to 0 with a remount.
Signed-off-by: Stefan Schaeckeler <sschaeck@cisco.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
it seems freeing the write buffers in the error path of the
ubifs_remount_rw() is wrong. It leads later to a kernel oops like this:
[10016.431274] UBIFS (ubi0:0): start fixing up free space
[10090.810042] UBIFS (ubi0:0): free space fixup complete
[10090.814623] UBIFS error (ubi0:0 pid 512): ubifs_remount_fs: cannot
spawn "ubifs_bgt0_0", error -4
[10101.915108] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started,
PID 517
[10105.275498] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000030
[10105.284352] Mem abort info:
[10105.287160] ESR = 0x96000006
[10105.290252] EC = 0x25: DABT (current EL), IL = 32 bits
[10105.295592] SET = 0, FnV = 0
[10105.298652] EA = 0, S1PTW = 0
[10105.301848] Data abort info:
[10105.304723] ISV = 0, ISS = 0x00000006
[10105.308573] CM = 0, WnR = 0
[10105.311564] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f03d1000
[10105.318034] [0000000000000030] pgd=00000000f6cee003,
pud=00000000f4884003, pmd=0000000000000000
[10105.326783] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[10105.332355] Modules linked in: ath10k_pci ath10k_core ath mac80211
libarc4 cfg80211 nvme nvme_core cryptodev(O)
[10105.342468] CPU: 3 PID: 518 Comm: touch Tainted: G O
5.4.3 #1
[10105.349517] Hardware name: HYPEX CPU (DT)
[10105.353525] pstate: 40000005 (nZcv daif -PAN -UAO)
[10105.358324] pc : atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34
[10105.364596] lr : mutex_lock+0x1c/0x34
[10105.368253] sp : ffff000075633aa0
[10105.371563] x29: ffff000075633aa0 x28: 0000000000000001
[10105.376874] x27: ffff000076fa80c8 x26: 0000000000000004
[10105.382185] x25: 0000000000000030 x24: 0000000000000000
[10105.387495] x23: 0000000000000000 x22: 0000000000000038
[10105.392807] x21: 000000000000000c x20: ffff000076fa80c8
[10105.398119] x19: ffff000076fa8000 x18: 0000000000000000
[10105.403429] x17: 0000000000000000 x16: 0000000000000000
[10105.408741] x15: 0000000000000000 x14: fefefefefefefeff
[10105.414052] x13: 0000000000000000 x12: 0000000000000fe0
[10105.419364] x11: 0000000000000fe0 x10: ffff000076709020
[10105.424675] x9 : 0000000000000000 x8 : 00000000000000a0
[10105.429986] x7 : ffff000076fa80f4 x6 : 0000000000000030
[10105.435297] x5 : 0000000000000000 x4 : 0000000000000000
[10105.440609] x3 : 0000000000000000 x2 : ffff00006f276040
[10105.445920] x1 : ffff000075633ab8 x0 : 0000000000000030
[10105.451232] Call trace:
[10105.453676] atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34
[10105.459600] ubifs_garbage_collect+0xb4/0x334
[10105.463956] ubifs_budget_space+0x398/0x458
[10105.468139] ubifs_create+0x50/0x180
[10105.471712] path_openat+0x6a0/0x9b0
[10105.475284] do_filp_open+0x34/0x7c
[10105.478771] do_sys_open+0x78/0xe4
[10105.482170] __arm64_sys_openat+0x1c/0x24
[10105.486180] el0_svc_handler+0x84/0xc8
[10105.489928] el0_svc+0x8/0xc
[10105.492808] Code: 52800013 17fffffb d2800003 f9800011 (c85ffc05)
[10105.498903] ---[ end trace 46b721d93267a586 ]---
To reproduce the problem:
1. Filesystem initially mounted read-only, free space fixup flag set.
2. mount -o remount,rw <mountpoint>
3. it takes some time (free space fixup running)
... try to terminate running mount by CTRL-C
... does not respond, only after free space fixup is complete
... then "ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4"
4. mount -o remount,rw <mountpoint>
... now finished instantly (fixup already done).
5. Create file or just unmount the filesystem and we get the oops.
Cc: <stable@vger.kernel.org>
Fixes: b50b9f408502 ("UBIFS: do not free write-buffers when in R/O mode")
Signed-off-by: Petr Cvachoucek <cvachoucek@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Repalce kthread_create/wake_up_process() with kthread_run()
to simplify the code.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Found with `codespell -i 3 -w fs/ubifs/**` and proof reading that parts.
Signed-off-by: Alexander Dahl <ada@thorsis.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
i_version mount option is getting lost on remount. This is because the
'i_version' mount option differs from the util-linux mount option
'iversion', but it has exactly the same functionality. We have to
specifically notify the vfs that this is what we want by setting
appropriate flag in fc->sb_flags. Fix it and as a result we can remove
*flags argument from __ext4_remount(); do the same for
__ext4_fill_super().
In addition set out to deprecate ext4 specific 'i_version' mount option
in favor or 'iversion' by kernel version 5.20.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Fixes: cebe85d570cf ("ext4: switch to the new mount api")
Link: https://lore.kernel.org/r/20211222104517.11187-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The lazytime and nolazytime mount options were added temporarily back in
2015 with commit a26f49926da9 ("ext4: add optimization for the lazytime
mount option"). It think it has been enough time for the util-linux with
lazytime support to get widely used. Remove the mount options.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20211222104517.11187-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Switching to the new mount api introduced inconsistency in how the
journalling mode mount option (data=) is handled during a remount.
Ext4 always prevented changing the journalling mode during the remount,
however the new code always fails the remount when the journalling mode
is specified, even if it remains unchanged. Fix it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Fixes: cebe85d570cf ("ext4: switch to the new mount api")
Link: https://lore.kernel.org/r/20211220152657.101599-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
io_uring supports using offset == -1 for using the current file position,
and we read that in as part of read/write command setup. For the non-iter
read/write types we pass in NULL for the position pointer, but for the
iter types we should not be passing any anything but 0 for the position
for a stream.
Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we
do, then the request will error with -ESPIPE.
Fixes: ba04291eb66e ("io_uring: allow use of offset == -1 to mean file position")
Link: https://github.com/axboe/liburing/discussions/501
Reported-by: Samuel Williams <samuel.williams@oriontransfer.co.nz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at
the end of files, just like fallocate and RESVSP. Make the behavior
consistent with the other ioctls.
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
|
While I was running with KASAN and lockdep enabled, I stumbled upon an
KASAN report about a UAF to a freed CIL checkpoint. Looking at the
comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
that the original patch to xfs_defer_finish_noroll should have done
something to lock the CIL to prevent it from switching the CIL contexts
while the predicate runs.
For upper level code that needs to know if a given log item is new
enough not to need relogging, add a new wrapper that takes the CIL
context lock long enough to sample the current CIL context. This is
kind of racy in that the CIL can switch the contexts immediately after
sampling, but that's ok because the consequence is that the defer ops
code is a little slow to relog items.
==================================================================
BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999
CPU: 1 PID: 527999 Comm: fsstress Tainted: G D 5.16.0-rc4-xfsx #rc4
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x1f/0x140
kasan_report.cold+0x83/0xdf
xfs_log_item_in_current_chkpt+0x139/0x160
xfs_defer_finish_noroll+0x3bb/0x1e30
__xfs_trans_commit+0x6c8/0xcf0
xfs_reflink_remap_extent+0x66f/0x10e0
xfs_reflink_remap_blocks+0x2dd/0xa90
xfs_file_remap_range+0x27b/0xc30
vfs_dedupe_file_range_one+0x368/0x420
vfs_dedupe_file_range+0x37c/0x5d0
do_vfs_ioctl+0x308/0x1260
__x64_sys_ioctl+0xa1/0x170
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2c71a2950b
Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
</TASK>
Allocated by task 464064:
kasan_save_stack+0x1e/0x50
__kasan_kmalloc+0x81/0xa0
kmem_alloc+0xcd/0x2c0 [xfs]
xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
xlog_cil_push_work+0x141/0x13d0 [xfs]
process_one_work+0x7f6/0x1380
worker_thread+0x59d/0x1040
kthread+0x3b0/0x490
ret_from_fork+0x1f/0x30
Freed by task 51:
kasan_save_stack+0x1e/0x50
kasan_set_track+0x21/0x30
kasan_set_free_info+0x20/0x30
__kasan_slab_free+0xed/0x130
slab_free_freelist_hook+0x7f/0x160
kfree+0xde/0x340
xlog_cil_committed+0xbfd/0xfe0 [xfs]
xlog_cil_process_committed+0x103/0x1c0 [xfs]
xlog_state_do_callback+0x45d/0xbd0 [xfs]
xlog_ioend_work+0x116/0x1c0 [xfs]
process_one_work+0x7f6/0x1380
worker_thread+0x59d/0x1040
kthread+0x3b0/0x490
ret_from_fork+0x1f/0x30
Last potentially related work creation:
kasan_save_stack+0x1e/0x50
__kasan_record_aux_stack+0xb7/0xc0
insert_work+0x48/0x2e0
__queue_work+0x4e7/0xda0
queue_work_on+0x69/0x80
xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
xlog_cil_force_seq+0x1b7/0x850 [xfs]
xfs_log_force_seq+0x1c7/0x670 [xfs]
xfs_file_fsync+0x7c1/0xa60 [xfs]
__x64_sys_fsync+0x52/0x80
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88804ea5f600
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 8 bytes inside of
256-byte region [ffff88804ea5f600, ffff88804ea5f700)
The buggy address belongs to the page:
page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
head:ffffea00013a9780 order:1 compound_mapcount:0
flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Fixes: 4e919af7827a ("xfs: periodically relog deferred intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Hostfs was not setting up the backing device information, which means it
uses the noop bdi. The noop bdi does not have the writeback capability
enabled, which in turns means dirty pages never got written back to
storage.
In other words programs using mmap to write to files on hostfs never
actually got their data written out...
Fix this by simply setting up the bdi with default settings as all the
required code for writeback is already in place.
Signed-off-by: Sjoerd Simons <sjoerd@collabora.com>
Reviewed-by: Christopher Obbard <chris.obbard@collabora.com>
Tested-by: Ritesh Raj Sarraf <ritesh@collabora.com>
Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
"Address a buffer overrun reported by Anatoly Trosinenko"
* tag 'nfsd-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix READDIR buffer overflow
|
|
To make the merge easier, replicate the inlining of __iomap_zero_iter()
into iomap_zero_iter() that is currently in the nvdimm tree.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
The "bufsize" comes from the root user. If "bufsize" is negative then,
because of type promotion, neither of the validation checks at the start
of the function are able to catch it:
if (bufsize < sizeof(struct xfs_attrlist) ||
bufsize > XFS_XATTR_LIST_MAX)
return -EINVAL;
This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in
kvmalloc_node(). Fix this by changing the type from int to size_t.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Since kernel commit 1abcf261016e ("xfs: move on-disk inode allocation out of xfs_ialloc()"),
xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments.
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Got a report that a repeated crash test of a container host would
eventually fail with a log recovery error preventing the system from
mounting the root filesystem. It manifested as a directory leaf node
corruption on writeback like so:
XFS (loop0): Mounting V5 Filesystem
XFS (loop0): Starting recovery (logdev: internal)
XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b ........=.......
00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc .......X...)....
00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23 ..x..~J}.S...G.#
00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00 .........C......
00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a ................
00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50 .5y....0.......P
00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4 .@.......A......
00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c .b.......P!A....
XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed
Tracing indicated that we were recovering changes from a transaction
at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
That is, log recovery was overwriting a buffer with newer changes on
disk than was in the transaction. Tracing indicated that we were
hitting the "recovery immediately" case in
xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
buffer.
The code was extracting the LSN correctly, then ignoring it because
the UUID in the buffer did not match the superblock UUID. The
problem arises because the UUID check uses the wrong UUID - it
should be checking the sb_meta_uuid, not sb_uuid. This filesystem
has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the
correct matching sb_meta_uuid in it, it's just the code checked it
against the wrong superblock uuid.
The is no corruption in the filesystem, and failing to recover the
buffer due to a write verifier failure means the recovery bug did
not propagate the corruption to disk. Hence there is no corruption
before or after this bug has manifested, the impact is limited
simply to an unmountable filesystem....
This was missed back in 2015 during an audit of incorrect sb_uuid
usage that resulted in commit fcfbe2c4ef42 ("xfs: log recovery needs
to validate against sb_meta_uuid") that fixed the magic32 buffers to
validate against sb_meta_uuid instead of sb_uuid. It missed the
magicda buffers....
Fixes: ce748eaa65f2 ("xfs: create new metadata UUID field and incompat flag")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When xfs_scrub encounters a directory with a leaf1 block, it tries to
validate that the leaf1 block's bestcount (aka the best free count of
each directory data block) is the correct size. Previously, this author
believed that comparing bestcount to the directory isize (since
directory data blocks are under isize, and leaf/bestfree blocks are
above it) was sufficient.
Unfortunately during testing of online repair, it was discovered that it
is possible to create a directory with a hole between the last directory
block and isize. The directory code seems to handle this situation just
fine and xfs_repair doesn't complain, which effectively makes this quirk
part of the disk format.
Fix the check to work properly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose. Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:
mount <filesystem> # (0)
fsstress <run only readonly ops> & # (1)
while true; do
fsstress <run all ops>
mount -o remount,ro # (2)
fsstress <run only readonly ops>
mount -o remount,rw # (3)
done
When (2) happens, notice that (1) is still running. xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).
When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata. If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.
The results are catastrophic -- file (B) and the refcount btree are now
corrupt. In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork. In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.
As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them. This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging events that may have been left behind
during an unclean shutdown. As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.
Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time. While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.
Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Ian Kent reported that for inline symlinks, it's possible for
vfs_readlink to hang on to the target buffer returned by
_vn_get_link_inline long after it's been freed by xfs inode reclaim.
This is a layering violation -- we should never expose XFS internals to
the VFS.
When the symlink has a remote target, we allocate a separate buffer,
copy the internal information, and let the VFS manage the new buffer's
lifetime. Let's adapt the inline code paths to do this too. It's
less efficient, but fixes the layering violation and avoids the need to
adapt the if_data lifetime to rcu rules. Clearly I don't care about
readlink benchmarks.
As a side note, this fixes the minor locking violation where we can
access the inode data fork without taking any locks; proper locking (and
eliminating the possibility of having to switch inode_operations on a
live inode) is essential to online repair coordinating repairs
correctly.
Reported-by: Ian Kent <raven@themaw.net>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Prior to commit 40b52225e58c ("xfs: remove support for disabling quota
accounting on a mounted file system"), we used the quotaoff mutex to
protect dquot operations against quotaoff trying to pull down dquots as
part of disabling quota.
Now that we only support turning off quota enforcement, the quotaoff
mutex only protects changes in m_qflags/sb_qflags. We don't need it to
protect dquots, which means we can remove it from setqlimits and the
dquot scrub code. While we're at it, fix the function that forces
quotacheck, since it should have been taking the quotaoff mutex.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
While debugging some very strange rmap corruption reports in connection
with the online directory repair code. I root-caused the error to the
following incorrect sequence:
<start repair transaction>
<expand directory, causing a deferred rmap to be queued>
<roll transaction>
<cancel transaction>
Obviously, we should have committed the transaction instead of
cancelling it. Thinking more broadly, however, xfs_trans_cancel should
have warned us that we were throwing away work item that we already
committed to performing. This is not correct, and we need to shut down
the filesystem.
Change xfs_trans_cancel to complain in the loudest manner if we're
cancelling any transaction with deferred work items attached.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Pull cifs fixes from Steve French:
"Two cifs/smb3 fixes, one fscache related, and one mount parsing
related for stable"
* tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: sanitize multiple delimiters in prepath
cifs: ignore resource_id while getting fscache super cookie
|
|
If a client sends a READDIR count argument that is too small (say,
zero), then the buffer size calculation in the new init_dirlist
helper functions results in an underflow, allowing the XDR stream
functions to write beyond the actual buffer.
This calculation has always been suspect. NFSD has never sanity-
checked the READDIR count argument, but the old entry encoders
managed the problem correctly.
With the commits below, entry encoding changed, exposing the
underflow to the pointer arithmetic in xdr_reserve_space().
Modern NFS clients attempt to retrieve as much data as possible
for each READDIR request. Also, we have no unit tests that
exercise the behavior of READDIR at the lower bound of @count
values. Thus this case was missed during testing.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: f5dcccd647da ("NFSD: Update the NFSv2 READDIR entry encoder to use struct xdr_stream")
Fixes: 7f87fc2d34d4 ("NFSD: Update NFSv3 READDIR entry encoders to use struct xdr_stream")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
These methods indirect the actual DAX read/write path. In the end pmem
uses magic flush and mc safe variants and fuse and dcssblk use plain ones
while device mapper picks redirects to the underlying device.
Add set_dax_nocache() and set_dax_nomc() APIs to control which copy
routines are used to remove indirect call from the read/write fast path
as well as a lot of boilerplate code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs]
Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and
just let the drivers call set_dax_synchronous directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Fix the following coccicheck review:
./fs/9p/vfs_file.c: 117: 5-8: Unneeded variable
Remove unneeded variable used to store return value.
Link: https://lkml.kernel.org/r/20211109114343.132844-1-deng.changcheng@zte.com.cn
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Now that iomap has been converted, XFS is large folio safe.
Indicate to the VFS that it can now create large folios for XFS.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
If we're punching a hole in a large folio, we need to remove the
per-folio iomap data as the folio is about to be split and each page will
need its own. If a dirty folio is only partially-uptodate, the iomap
data contains the information about which blocks cannot be written back,
so assert that a dirty folio is fully uptodate.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
The arguments are still pages for now, but we can use folios internally
and cut out a lot of calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
We still iterate one block at a time, but now we call compound_head()
less often.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rename end_offset to end_pos and offset_into_page to poff to match the
rest of the file. Simplify the handling of the last page straddling
i_size by doing the EOF check based on the byte granularity i_size
instead of converting to a pgoff prematurely.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rename end_offset to end_pos and file_offset to pos to match the rest
of the file. Simplify the loop by calculating nblocks up front instead
of each time around the loop.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
XFS has the only implementation of ->discard_page today, so convert it
to use folios in the same patch as converting the API.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|