Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull close_range fix from Christian Brauner:
"syzbot reported a bug when asking close_range() to unshare the file
descriptor table and making all fds close-on-exec.
If CLOSE_RANGE_UNSHARE the caller will receive a private file
descriptor table in case their file descriptor table is currently
shared before operating on the requested file descriptor range.
For the case where the caller has requested all file descriptors to be
actually closed via e.g. close_range(3, ~0U, CLOSE_RANGE_UNSHARE) the
kernel knows that the caller does not need any of the file descriptors
anymore and will optimize the close operation by only copying all
files in the range from 0 to 3 and no others.
However, if the caller requested CLOSE_RANGE_CLOEXEC together with
CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
descriptors so the kernel needs to copy all of them and can't
optimize.
The original patch didn't account for this and thus could cause oopses
as evidenced by the syzbot report because it assumed that all fds had
been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case and
copying all fds if the two flags are specified together.
This should've been caught in the selftests but the original patch
didn't cover this case and I didn't catch it during review. So in
addition to the bugfix I'm also adding selftests. They will reliably
reproduce the bug on a non-fixed kernel and allows us to catch
regressions and verify correct behavior.
Note, the kernel selftest tree contained a bunch of changes that made
the original selftest fail to compile so there are small fixups in
here make them compile without warnings"
* tag 'close-range-cloexec-unshare-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests/core: add regression test for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
selftests/core: add test for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
selftests/core: handle missing syscall number for close_range
selftests/core: fix close_range_test build after XFAIL removal
close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
|
|
Merge still more updates from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (memcg and cleanups) and
epoll"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/Kconfig: fix spelling mistake "whats" -> "what's"
selftests/filesystems: expand epoll with epoll_pwait2
epoll: wire up syscall epoll_pwait2
epoll: add syscall epoll_pwait2
epoll: convert internal api to timespec64
epoll: eliminate unnecessary lock for zero timeout
epoll: replace gotos with a proper loop
epoll: pull all code between fetch_events and send_event into the loop
epoll: simplify and optimize busy loop logic
epoll: move eavail next to the list_empty_careful check
epoll: pull fatal signal checks into ep_send_events()
epoll: simplify signal handling
epoll: check for events when removing a timed out thread from the wait queue
mm/memcontrol:rewrite mem_cgroup_page_lruvec()
mm, kvm: account kvm_vcpu_mmap to kmemcg
mm/memcg: remove unused definitions
mm/memcg: warning on !memcg after readahead page charged
mm/memcg: bail early from swap accounting if memcg disabled
|
|
Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that
replaces int timeout with struct timespec. It is equivalent otherwise.
int epoll_pwait2(int fd, struct epoll_event *events,
int maxevents,
const struct timespec *timeout,
const sigset_t *sigset);
The underlying hrtimer is already programmed with nsec resolution.
pselect and ppoll also set nsec resolution timeout with timespec.
The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs
the same.
For timespec, only support this new interface on 2038 aware platforms
that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME.
Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "add epoll_pwait2 syscall", v4.
Enable nanosecond timeouts for epoll.
Analogous to pselect and ppoll, introduce an epoll_wait syscall
variant that takes a struct timespec instead of int timeout.
This patch (of 4):
Make epoll more consistent with select/poll: pass along the timeout as
timespec64 pointer.
In anticipation of additional changes affecting all three polling
mechanisms:
- add epoll_pwait2 syscall with timespec semantics,
and share poll_select_set_timeout implementation.
- compute slack before conversion to absolute time,
to save one ktime_get_ts64 call.
Link: https://lkml.kernel.org/r/20201121144401.3727659-1-willemdebruijn.kernel@gmail.com
Link: https://lkml.kernel.org/r/20201121144401.3727659-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We call ep_events_available() under lock when timeout is 0, and then call
it without locks in the loop for the other cases.
Instead, call ep_events_available() without lock for all cases. For
non-zero timeouts, we will recheck after adding the thread to the wait
queue. For zero timeout cases, by definition, user is opportunistically
polling and will have to call epoll_wait again in the future.
Note that this lock was kept in c5a282e9635e9 because the whole loop was
historically under lock.
This patch results in a 1% CPU/RPC reduction in RPC benchmarks.
Link: https://lkml.kernel.org/r/20201106231635.3528496-9-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The existing loop is pointless, and the labels make it really hard to
follow the structure.
Replace that control structure with a simple loop that returns when there
are new events, there is a signal, or the thread has timed out.
Link: https://lkml.kernel.org/r/20201106231635.3528496-8-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is a no-op change which simplifies the follow up patches.
Link: https://lkml.kernel.org/r/20201106231635.3528496-7-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ep_events_available() is called multiple times around the busy loop logic,
even though the logic is generally not used. ep_reset_busy_poll_napi_id()
is similarly always called, even when busy loop is not used.
Eliminate ep_reset_busy_poll_napi_id() and inline it inside
ep_busy_loop(). Make ep_busy_loop() return whether there are any events
available after the busy loop. This will eliminate unnecessary loads and
branches, and simplifies the loop.
Link: https://lkml.kernel.org/r/20201106231635.3528496-6-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is a no-op change and simply to make the code more coherent.
Link: https://lkml.kernel.org/r/20201106231635.3528496-5-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
To simplify the code, pull in checking the fatal signals into
ep_send_events(). ep_send_events() is called only from ep_poll().
Note that, previously, we were always checking fatal events, but it is
checked only if eavail is true. This should be fine because the goal of
that check is to quickly return from epoll_wait() when there is a pending
fatal signal.
Link: https://lkml.kernel.org/r/20201106231635.3528496-4-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Check signals before locking ep->lock, and immediately return -EINTR if
there is any signal pending.
This saves a few loads, stores, and branches from the hot path and
simplifies the loop structure for follow up patches.
Link: https://lkml.kernel.org/r/20201106231635.3528496-3-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "simplify ep_poll".
This patch series is a followup based on the suggestions and feedback by
Linus:
https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
The first patch in the series is a fix for the epoll race in presence of
timeouts, so that it can be cleanly backported to all affected stable
kernels.
The rest of the patch series simplify the ep_poll() implementation. Some
of these simplifications result in minor performance enhancements as well.
We have kept these changes under self tests and internal benchmarks for a
few days, and there are minor (1-2%) performance enhancements as a result.
This patch (of 8):
After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2)
timeout"), we break out of the ep_poll loop upon timeout, without checking
whether there is any new events available. Prior to that patch-series we
always called ep_events_available() after exiting the loop.
This can cause races and missed wakeups. For example, consider the
following scenario reported by Guantao Liu:
Suppose we have an eventfd added using EPOLLET to an epollfd.
Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
event of the eventfd, it will write back on that fd.
Thread 3: Calls epoll_wait with a negative timeout.
Prior to abc610e01c66, it is guaranteed that Thread 3 will wake up either
by Thread 1 or Thread 2. After abc610e01c66, Thread 3 can be blocked
indefinitely if Thread 2 sees a timeout right before the write to the
eventfd by Thread 1. Thread 2 will be woken up from
schedule_hrtimeout_range and, with evail 0, it will not call
ep_send_events().
To fix this issue:
1) Simplify the timed_out case as suggested by Linus.
2) while holding the lock, recheck whether the thread was woken up
after its time out has reached.
Note that (2) is different from Linus' original suggestion: It do not set
"eavail = ep_events_available(ep)" to avoid unnecessary contention (when
there are too many timed-out threads and a small number of events), as
well as races mentioned in the discussion thread.
This is the first patch in the series so that the backport to stable
releases is straightforward.
Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com
Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com
Fixes: abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Guantao Liu <guantaol@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Guantao Liu <guantaol@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when
CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE.
When CLOSE_RANGE_UNSHARE is specified the caller will receive a private
file descriptor table in case their file descriptor table is currently
shared.
For the case where the caller has requested all file descriptors to be
actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that
the caller does not need any of the file descriptors anymore and will
optimize the close operation by only copying all files in the range from
0 to 3 and no others.
However, if the caller requested CLOSE_RANGE_CLOEXEC together with
CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
descriptors so the kernel needs to copy all of them and can't optimize.
The original patch didn't account for this and thus could cause oopses
as evidenced by the syzbot report because it assumed that all fds had
been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.
syzbot reported
==================================================================
BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
Read of size 8 at addr 0000000000000077 by task syz-executor511/8522
CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
__kasan_report mm/kasan/report.c:549 [inline]
kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
instrument_atomic_read include/linux/instrumented.h:71 [inline]
atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
filp_close+0x22/0x170 fs/open.c:1274
close_files fs/file.c:402 [inline]
put_files_struct fs/file.c:417 [inline]
put_files_struct+0x1cc/0x350 fs/file.c:414
exit_files+0x12a/0x170 fs/file.c:435
do_exit+0xb4f/0x2a00 kernel/exit.c:818
do_group_exit+0x125/0x310 kernel/exit.c:920
get_signal+0x428/0x2100 kernel/signal.c:2792
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:147 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x447039
Code: Unable to access opcode bytes at RIP 0x44700f.
RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
==================================================================
syzbot has tested the proposed patch and the reproducer did not trigger any issue:
Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
Tested on:
commit: 10f7cddd selftests/core: add regression test for CLOSE_RAN..
git tree: git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
kernel config: https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
compiler: gcc (GCC) 10.1.0-syz 20200507
Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
Doing vectored buf-select read with 0 iovec passed is meaningless and
utterly broken, forbid it.
Cc: <stable@vger.kernel.org> # 5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fix passing of the additional security info via version
operations. Force new open when getting SACL and avoid
reuse of files that were previously open without
sufficient privileges to access SACLs.
Signed-off-by: Boris Protopopov <pboris@amazon.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull xfs updates from Darrick Wong:
"In this release we add the ability to set a 'needsrepair' flag
indicating that we /know/ the filesystem requires xfs_repair, but
other than that, it's the usual strengthening of metadata validation
and miscellaneous cleanups.
Summary:
- Introduce a "needsrepair" "feature" to flag a filesystem as needing
a pass through xfs_repair. This is key to enabling filesystem
upgrades (in xfs_db) that require xfs_repair to make minor
adjustments to metadata.
- Refactor parameter checking of recovered log intent items so that
we actually use the same validation code as them that generate the
intent items.
- Various fixes to online scrub not reacting correctly to directory
entries pointing to inodes that cannot be igetted.
- Refactor validation helpers for data and rt volume extents.
- Refactor XFS_TRANS_DQ_DIRTY out of existence.
- Fix a longstanding bug where mounting with "uqnoenforce" would
start user quotas in non-enforcing mode but /proc/mounts would
display "usrquota", implying that they are being enforced.
- Don't flag dax+reflink inodes as corruption since that is a valid
(but not fully functional) combination right now.
- Clean up raid stripe validation functions.
- Refactor the inode allocation code to be more straightforward.
- Small prep cleanup for idmapping support.
- Get rid of the xfs_buf_t typedef"
* tag 'xfs-5.11-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (40 commits)
xfs: remove xfs_buf_t typedef
fs/xfs: convert comma to semicolon
xfs: open code updating i_mode in xfs_set_acl
xfs: remove xfs_vn_setattr_nonsize
xfs: kill ialloced in xfs_dialloc()
xfs: spilt xfs_dialloc() into 2 functions
xfs: move xfs_dialloc_roll() into xfs_dialloc()
xfs: move on-disk inode allocation out of xfs_ialloc()
xfs: introduce xfs_dialloc_roll()
xfs: convert noroom, okalloc in xfs_dialloc() to bool
xfs: don't catch dax+reflink inodes as corruption in verifier
xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks
xfs: remove unneeded return value check for *init_cursor()
xfs: introduce xfs_validate_stripe_geometry()
xfs: show the proper user quota options
xfs: remove the unused XFS_B_FSB_OFFSET macro
xfs: remove unnecessary null check in xfs_generic_create
xfs: directly return if the delta equal to zero
xfs: check tp->t_dqinfo value instead of the XFS_TRANS_DQ_DIRTY flag
xfs: delete duplicated tp->t_dqinfo null check and allocation
...
|
|
Add SYSTEM_SECURITY access flag and use with smb2 when opening
files for getting/setting SACLs. Add "system.cifs_ntsd_full"
extended attribute to allow user-space access to the functionality.
Avoid multiple server calls when setting owner, DACL, and SACL.
Signed-off-by: Boris Protopopov <pboris@amazon.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The purpose of io_uring_cancel_files() is to wait for all requests
matching ->files to go/be cancelled. We should first drop files of a
request in io_req_drop_files() and only then make it undiscoverable for
io_uring_cancel_files.
First drop, then delete from list. It's ok to leave req->id->files
dangling, because it's not dereferenced by cancellation code, only
compared against. It would potentially go to sleep and be awaken by
following in io_req_drop_files() wake_up().
Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the first time a req punted to io-wq, we'll initialize io_wq_work's
list to be NULL, then insert req to io_wqe->work_list. If this req is not
inserted into tail of io_wqe->work_list, this req's io_wq_work list will
point to another req's io_wq_work. For splitted bio case, this req maybe
inserted to io_wqe->work_list repeatedly, once we insert it to tail of
io_wqe->work_list for the second time, now io_wq_work->list->next will be
invalid pointer, which then result in many strang error, panic, kernel
soft-lockup, rcu stall, etc.
In my vm, kernel doest not have commit cc29e1bf0d63f7 ("block: disable
iopoll for split bio"), below fio job can reproduce this bug steadily:
[global]
name=iouring-sqpoll-iopoll-1
ioengine=io_uring
iodepth=128
numjobs=1
thread
rw=randread
direct=1
registerfiles=1
hipri=1
bs=4m
size=100M
runtime=120
time_based
group_reporting
randrepeat=0
[device]
directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point
If we have commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"),
there will no splitted bio case for polled io, but I think we still to need
to fix this list corruption, it also should maybe go to stable branchs.
To fix this corruption, if a req is inserted into tail of io_wqe->work_list,
initialize req->io_wq_work->list->next to bu NULL.
Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The patch 7d6535b72042: "cifs: Simplify reconnect code when dfs
upcall is enabled" leads to the following static checker warning:
fs/cifs/connect.c:160 reconn_set_next_dfs_target()
error: 'server->hostname' dereferencing possible ERR_PTR()
Avoid dereferencing the error pointer by early returning on error
condition.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This code is slightly nicer if we flip the cifs_sockaddr_equal()
around and pull all the code in one tab.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
There are three error paths which need to unlock before returning.
Fixes: 121d947d4fe1 ("cifs: Handle witness client move notification")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The unlock is done in the caller, this is a stray which leads to a
double unlock bug.
Fixes: bf80e5d4259a ("cifs: Send witness register and unregister commands to userspace daemon")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull jffs2, ubi and ubifs updates from Richard Weinberger:
"JFFS2:
- Fix for a remount regression
- Fix for an abnormal GC exit
- Fix for a possible NULL pointer issue while mounting
UBI:
- Add support ECC-ed NOR flash
- Removal of dead code
UBIFS:
- Make node dumping debug code more reliable
- Various cleanups: less ifdefs, less typos
- Fix for an info leak"
* tag 'for-linus-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: ubifs_dump_node: Dump all branches of the index node
ubifs: ubifs_dump_sleb: Remove unused function
ubifs: Pass node length in all node dumping callers
Revert "ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len"
ubifs: Limit dumping length by size of memory which is allocated for the node
ubifs: Remove the redundant return in dbg_check_nondata_nodes_order
jffs2: Fix NULL pointer dereference in rp_size fs option parsing
ubifs: Fixed print foramt mismatch in ubifs
ubi: Do not zero out EC and VID on ECC-ed NOR flashes
jffs2: remove trailing semicolon in macro definition
ubifs: Fix error return code in ubifs_init_authentication()
ubifs: wbuf: Don't leak kernel memory to flash
ubi: Remove useless code in bytes_str_to_int
ubifs: Fix the printing type of c->big_lpt
jffs2: Allow setting rp_size to zero during remounting
jffs2: Fix ignoring mounting options problem during remounting
jffs2: Fix GC exit abnormally
ubifs: Code cleanup by removing ifdef macro surrounding
jffs2: Fix if/else empty body warnings
ubifs: Delete duplicated words + other fixes
|
|
Pull cifs updates from Steve French:
"The largest part are for support of the newer mount API which has been
needed for cifs/smb3 mounts for a long time due to the new API's
better handling of remount, and better error reporting. There are
three additional small cleanup patches for this being tested, that are
not included yet.
This series also includes addition of support for the SMB3 witness
protocol which can provide important notifications from the server to
client on server address or export or network changes. This can be
useful for example in order to be notified before the failure - when a
server's IP address changes (in the future it will allow us to support
server notifications of when a share is moved).
It also includes three patches for stable e.g. some that better handle
some confusing error messages during session establishment"
* tag '5.11-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6: (55 commits)
cifs: update internal module version number
cifs: Fix support for remount when not changing rsize/wsize
cifs: handle "guest" mount parameter
cifs: correct four aliased mount parms to allow use of previous names
cifs: Tracepoints and logs for tracing credit changes.
cifs: fix use after free in cifs_smb3_do_mount()
cifs: fix rsize/wsize to be negotiated values
cifs: Fix some error pointers handling detected by static checker
smb3: remind users that witness protocol is experimental
cifs: update super_operations to show_devname
cifs: fix uninitialized variable in smb3_fs_context_parse_param
cifs: update mnt_cifs_flags during reconfigure
cifs: move update of flags into a separate function
cifs: remove ctx argument from cifs_setup_cifs_sb
cifs: do not allow changing posix_paths during remount
cifs: uncomplicate printing the iocharset parameter
cifs: don't create a temp nls in cifs_setup_ipc
cifs: simplify handling of cifs_sb/ctx->local_nls
cifs: we do not allow changing username/password/unc/... during remount
cifs: add initial reconfigure support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The major update to this release is that there's a new arch config
option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS.
Currently, only x86_64 enables it. All the ftrace callbacks now take a
struct ftrace_regs instead of a struct pt_regs. If the architecture
has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will
have enough information to read the arguments of the function being
traced, as well as access to the stack pointer.
This way, if a user (like live kernel patching) only cares about the
arguments, then it can avoid using the heavier weight "regs" callback,
that puts in enough information in the struct ftrace_regs to simulate
a breakpoint exception (needed for kprobes).
A new config option that audits the timestamps of the ftrace ring
buffer at most every event recorded.
Ftrace recursion protection has been cleaned up to move the protection
to the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace
to do it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that
lists all the places that triggered the recursion protection of the
function tracer. This will show where things need to be fixed as
recursion slows down the function tracer.
The eval enum mapping updates done at boot up are now offloaded to a
work queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes"
* tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
tracing: Offload eval map updates to a work queue
Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
ring-buffer: Add rb_check_bpage in __rb_allocate_pages
ring-buffer: Fix two typos in comments
tracing: Drop unneeded assignment in ring_buffer_resize()
tracing: Disable ftrace selftests when any tracer is running
seq_buf: Avoid type mismatch for seq_buf_init
ring-buffer: Fix a typo in function description
ring-buffer: Remove obsolete rb_event_is_commit()
ring-buffer: Add test to validate the time stamp deltas
ftrace/documentation: Fix RST C code blocks
tracing: Clean up after filter logic rewriting
tracing: Remove the useless value assignment in test_create_synth_event()
livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
MAINTAINERS: assign ./fs/tracefs to TRACING
tracing: Fix some typos in comments
ftrace: Remove unused varible 'ret'
ring-buffer: Add recording of ring buffer recursion into recursed_functions
...
|
|
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support
- A series of patches to improve readdir performance, particularly
with large directories
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers
- Micro-optimisations for RDMA
- RDMA tracing improvements
Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving
partial pages (Dan Aloni)
- Various fixes for getxattr and listxattr, when used over non-TCP
transports
- Fixes for containerised NFS from Sargun Dhillon
- switch nfsiod to be an UNBOUND workqueue (Neil Brown)
- READDIR should not ask for security label information if there is
no LSM policy (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum
Mackay)
- A series of RPC and NFS layer fixes to support the NFSv4.2
READ_PLUS code
- A couple of fixes for pnfs/flexfiles read failover
Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes"
* tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
NFSv4/pnfs: Add tracing for the deviceid cache
fs/lockd: convert comma to semicolon
NFSv4.2: fix error return on memory allocation failure
NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
NFSv4.2: decode_read_plus_hole() needs to check the extent offset
NFSv4.2: decode_read_plus_data() must skip padding after data segment
NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
SUNRPC: When expanding the buffer, we may need grow the sparse pages
SUNRPC: Cleanup - constify a number of xdr_buf helpers
SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
SUNRPC: _copy_to/from_pages() now check for zero length
SUNRPC: Cleanup xdr_shrink_bufhead()
SUNRPC: Fix xdr_expand_hole()
SUNRPC: Fixes for xdr_align_data()
SUNRPC: _shift_data_left/right_pages should check the shift length
...
|
|
Pull ceph updates from Ilya Dryomov:
"The big ticket item here is support for msgr2 on-wire protocol, which
adds the option of full in-transit encryption using AES-GCM algorithm
(myself).
On top of that we have a series to avoid intermittent errors during
recovery with recover_session=clean and some MDS request encoding work
from Jeff, a cap handling fix and assorted observability improvements
from Luis and Xiubo and a good number of cleanups.
Luis also ran into a corner case with quotas which sadly means that we
are back to denying cross-quota-realm renames"
* tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
libceph: drop ceph_auth_{create,update}_authorizer()
libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
libceph: introduce connection modes and ms_mode option
libceph, rbd: ignore addr->type while comparing in some cases
libceph, ceph: get and handle cluster maps with addrvecs
libceph: factor out finish_auth()
libceph: drop ac->ops->name field
libceph: amend cephx init_protocol() and build_request()
libceph, ceph: incorporate nautilus cephx changes
libceph: safer en/decoding of cephx requests and replies
libceph: more insight into ticket expiry and invalidation
libceph: move msgr1 protocol specific fields to its own struct
libceph: move msgr1 protocol implementation to its own file
libceph: separate msgr1 protocol implementation
libceph: export remaining protocol independent infrastructure
libceph: export zero_page
libceph: rename and export con->flags bits
libceph: rename and export con->state states
libceph: make con->state an int
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
- Allow unprivileged mounting in a user namespace.
For quite some time the security model of overlayfs has been that
operations on underlying layers shall be performed with the
privileges of the mounting task.
This way an unprvileged user cannot gain privileges by the act of
mounting an overlayfs instance. A full audit of all function calls
made by the overlayfs code has been performed to see whether they
conform to this model, and this branch contains some fixes in this
regard.
- Support running on copied filesystem images by optionally disabling
UUID verification.
- Bug fixes as well as documentation updates.
* tag 'ovl-update-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: unprivieged mounts
ovl: do not get metacopy for userxattr
ovl: do not fail because of O_NOATIME
ovl: do not fail when setting origin xattr
ovl: user xattr
ovl: simplify file splice
ovl: make ioctl() safe
ovl: check privs before decoding file handle
vfs: verify source area in vfs_dedupe_file_range_one()
vfs: move cap_convert_nscap() call into vfs_setxattr()
ovl: fix incorrect extent info in metacopy case
ovl: expand warning in ovl_d_real()
ovl: document lower modification caveats
ovl: warn about orphan metacopy
ovl: doc clarification
ovl: introduce new "uuid=off" option for inodes index feature
ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Improve performance of virtio-fs in mixed read/write workloads
- Try to revalidate cache before returning EEXIST on exclusive create
- Add a couple of miscellaneous bug fixes as well as some code cleanups
* tag 'fuse-update-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix bad inode
fuse: support SB_NOSEC flag to improve write performance
fuse: add a flag FUSE_OPEN_KILL_SUIDGID for open() request
fuse: don't send ATTR_MODE to kill suid/sgid for handle_killpriv_v2
fuse: setattr should set FATTR_KILL_SUIDGID
fuse: set FUSE_WRITE_KILL_SUIDGID in cached write path
fuse: rename FUSE_WRITE_KILL_PRIV to FUSE_WRITE_KILL_SUIDGID
fuse: introduce the notion of FUSE_HANDLE_KILLPRIV_V2
fuse: always revalidate if exclusive create
virtiofs: clean up error handling in virtio_fs_get_tree()
fuse: add fuse_sb_destroy() helper
fuse: simplify get_fuse_conn*()
fuse: get rid of fuse_mount refcount
virtiofs: simplify sb setup
virtiofs fix leak in setup
fuse: launder page should wait for page writeback
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've made more work into per-file compression support.
For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to
change the algorithm or cluster size per file. F2FS_IOC_COMPRESS |
DECOMPRESS_FILE provides a way to compress and decompress the existing
normal files manually.
There is also a new mount option, compress_mode=fs|user, which can
control who compresses the data.
Chao also added a checksum feature with a mount option so that
we are able to detect any corrupted cluster.
In addition, Daniel contributed casefolding with encryption patch,
which will be used for Android devices.
Summary:
Enhancements:
- add ioctls and mount option to manage per-file compression feature
- support casefolding with encryption
- support checksum for compressed cluster
- avoid IO starvation by replacing mutex with rwsem
- add sysfs, max_io_bytes, to control max bio size
Bug fixes:
- fix use-after-free issue when compression and fsverity are enabled
- fix consistency corruption during fault injection test
- fix data offset for lseek
- get rid of buffer_head which has 32bits limit in fiemap
- fix some bugs in multi-partitions support
- fix nat entry count calculation in shrinker
- fix some stat information
And, we've refactored some logics and fix minor bugs as well"
* tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
f2fs: compress: fix compression chksum
f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
f2fs: fix race of pending_pages in decompression
f2fs: fix to account inline xattr correctly during recovery
f2fs: inline: fix wrong inline inode stat
f2fs: inline: correct comment in f2fs_recover_inline_data
f2fs: don't check PAGE_SIZE again in sanity_check_raw_super()
f2fs: convert to F2FS_*_INO macro
f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size
f2fs: don't allow any writes on readonly mount
f2fs: avoid race condition for shrinker count
f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
f2fs: add compress_mode mount option
f2fs: Remove unnecessary unlikely()
f2fs: init dirty_secmap incorrectly
f2fs: remove buffer_head which has 32bits limit
f2fs: fix wrong block count instead of bytes
f2fs: use new conversion functions between blks and bytes
f2fs: rename logical_to_blk and blk_to_logical
f2fs: fix kbytes written stat for multi-device case
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, reiserfs, quota and writeback updates from Jan Kara:
- a couple of quota fixes (mostly for problems found by syzbot)
- several ext2 cleanups
- one fix for reiserfs crash on corrupted image
- a fix for spurious warning in writeback code
* tag 'for_v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
writeback: don't warn on an unregistered BDI in __mark_inode_dirty
fs: quota: fix array-index-out-of-bounds bug by passing correct argument to vfs_cleanup_quota_inode()
reiserfs: add check for an invalid ih_entry_count
ext2: Fix fall-through warnings for Clang
fs/ext2: Use ext2_put_page
docs: filesystems: Reduce ext2.rst to one top-level heading
quota: Sanity-check quota file headers on load
quota: Don't overflow quota file offsets
ext2: Remove unnecessary blank
fs/quota: update quota state flags scheme with project quota flags
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"A few fsnotify fixes from Amir fixing fallout from big fsnotify
overhaul a few months back and an improvement of defaults limiting
maximum number of inotify watches from Waiman"
* tag 'fsnotify_for_v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix events reported to watching parent and child
inotify: convert to handle_inode_event() interface
fsnotify: generalize handle_inode_event()
inotify: Increase default inotify.max_user_watches limit to 1048576
|
|
We convert errno's to ext4 on-disk format error codes in
save_error_info(). Add a function and a bit of macro magic to make this
simpler.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Just move error info related functions in super.c close to
ext4_handle_error(). We'll want to combine save_error_info() with
ext4_handle_error() and this makes change more obvious and saves a
forward declaration as well. No functional change.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The only difference between __ext4_abort() and __ext4_error() is that
the former one ignores errors=continue mount option. Unify the code to
reduce duplication.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We use __ext4_error() when ext4_protect_reserved_inode() finds
filesystem corruption. However EXT4_ERROR_INODE_ERR() is perfectly
capable of reporting all the needed information. So just use that.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Superblock is written out either through ext4_commit_super() or through
ext4_handle_dirty_super(). In both cases we recompute the checksum so it
is not necessary to recompute it after updating superblock free inodes &
blocks counters.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_handle_error() with errors=continue mount option can accidentally
remount the filesystem read-only when the system is rebooting. Fix that.
Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:
CPU1 CPU2
removexattr() freeze_super()
vfs_removexattr()
ext4_xattr_set()
handle = ext4_journal_start()
...
ext4_xattr_set_entry()
iput(old_ea_inode)
ext4_evict_inode(old_ea_inode)
sb->s_writers.frozen = SB_FREEZE_FS;
sb_wait_write(sb, SB_FREEZE_FS);
ext4_freeze()
jbd2_journal_lock_updates()
-> blocks waiting for all
handles to stop
sb_start_intwrite()
-> blocks as sb is already in SB_FREEZE_FS state
Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits however in this case it would be
quite clumsy and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have transaction already started and thus
it is not really needed anyway.
Cc: stable@vger.kernel.org
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add a helper to read number of fast commit blocks from jbd2 superblock
and also rename the JBD2_MIN_FC_BLKS to
JBD2_DEFAULT_FAST_COMMIT_BLOCKS since this constant is just the
default number of fast commit blocks to use in case number of fast
commit blocks isn't set in jbd2 superblock.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201120202232.2240293-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch makes fast_commit.h byte by byte identical with
e2fsprogs/fast_commit.h. This will help us ensure that there are no
on-disk format inconsistencies between e2fsck and kernel ext4.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201120202232.2240293-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.
Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/03497331f088a938d7a728e7a689bd7953139429.1605896059.git.gustavoars@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fast commit on-disk format is designed such that the replay of these
tags can be idempotent. This patch adds documentation in the code in
form of comments and in form kernel docs that describes these
characteristics. This patch also adds a TODO item needed to ensure
kernel fast commit replay idempotence.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201119232822.1860882-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
There are no callers of the EXT4_CURRENT_REV macro, so remove it.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1605164202-31120-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The ext4_find_extent() function never returns NULL, it returns error
pointers.
Fixes: 44059e503b03 ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201023112232.GB282278@mwanda
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Check for valid block size directly by validating s_log_block_size; we
were doing this in two places. First, by calculating blocksize via
BLOCK_SIZE << s_log_block_size, and then checking that the blocksize
was valid. And then secondly, by checking s_log_block_size directly.
The first check is not reliable, and can trigger an UBSAN warning if
s_log_block_size on a maliciously corrupted superblock is greater than
22. This is harmless, since the second test will correctly reject the
maliciously fuzzed file system, but to make syzbot shut up, and
because the two checks are duplicative in any case, delete the
blocksize check, and move the s_log_block_size earlier in
ext4_fill_super().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: syzbot+345b75652b1d24227443@syzkaller.appspotmail.com
|
|
When freeing metadata, we will create an ext4_free_data and
insert it into the pending free list. After the current
transaction is committed, the object will be freed.
ext4_mb_free_metadata() will check whether the area to be freed
overlaps with the pending free list. If true, return directly. At this
time, ext4_free_data is leaked. Fortunately, the probability of this
problem is small, since it only occurs if the file system is corrupted
such that a block is claimed by more one inode and those inodes are
deleted within a single jbd2 transaction.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
We don't need to take uring_lock for SQPOLL|IOPOLL to do
io_cqring_overflow_flush() when cq_overflow_list is empty, remove it
from the hot path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is only one user of it and the name is misleading, get rid of it
by inlining. By the way make overflow_flush's return value deduction
simpler.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|