summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-12-19Merge tag 'close-range-cloexec-unshare-v5.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull close_range fix from Christian Brauner: "syzbot reported a bug when asking close_range() to unshare the file descriptor table and making all fds close-on-exec. If CLOSE_RANGE_UNSHARE the caller will receive a private file descriptor table in case their file descriptor table is currently shared before operating on the requested file descriptor range. For the case where the caller has requested all file descriptors to be actually closed via e.g. close_range(3, ~0U, CLOSE_RANGE_UNSHARE) the kernel knows that the caller does not need any of the file descriptors anymore and will optimize the close operation by only copying all files in the range from 0 to 3 and no others. However, if the caller requested CLOSE_RANGE_CLOEXEC together with CLOSE_RANGE_UNSHARE the caller wants to still make use of the file descriptors so the kernel needs to copy all of them and can't optimize. The original patch didn't account for this and thus could cause oopses as evidenced by the syzbot report because it assumed that all fds had been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case and copying all fds if the two flags are specified together. This should've been caught in the selftests but the original patch didn't cover this case and I didn't catch it during review. So in addition to the bugfix I'm also adding selftests. They will reliably reproduce the bug on a non-fixed kernel and allows us to catch regressions and verify correct behavior. Note, the kernel selftest tree contained a bunch of changes that made the original selftest fail to compile so there are small fixups in here make them compile without warnings" * tag 'close-range-cloexec-unshare-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: selftests/core: add regression test for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC selftests/core: add test for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC selftests/core: handle missing syscall number for close_range selftests/core: fix close_range_test build after XFAIL removal close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
2020-12-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge still more updates from Andrew Morton: "18 patches. Subsystems affected by this patch series: mm (memcg and cleanups) and epoll" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/Kconfig: fix spelling mistake "whats" -> "what's" selftests/filesystems: expand epoll with epoll_pwait2 epoll: wire up syscall epoll_pwait2 epoll: add syscall epoll_pwait2 epoll: convert internal api to timespec64 epoll: eliminate unnecessary lock for zero timeout epoll: replace gotos with a proper loop epoll: pull all code between fetch_events and send_event into the loop epoll: simplify and optimize busy loop logic epoll: move eavail next to the list_empty_careful check epoll: pull fatal signal checks into ep_send_events() epoll: simplify signal handling epoll: check for events when removing a timed out thread from the wait queue mm/memcontrol:rewrite mem_cgroup_page_lruvec() mm, kvm: account kvm_vcpu_mmap to kmemcg mm/memcg: remove unused definitions mm/memcg: warning on !memcg after readahead page charged mm/memcg: bail early from swap accounting if memcg disabled
2020-12-19epoll: add syscall epoll_pwait2Willem de Bruijn
Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that replaces int timeout with struct timespec. It is equivalent otherwise. int epoll_pwait2(int fd, struct epoll_event *events, int maxevents, const struct timespec *timeout, const sigset_t *sigset); The underlying hrtimer is already programmed with nsec resolution. pselect and ppoll also set nsec resolution timeout with timespec. The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs the same. For timespec, only support this new interface on 2038 aware platforms that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME. Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: convert internal api to timespec64Willem de Bruijn
Patch series "add epoll_pwait2 syscall", v4. Enable nanosecond timeouts for epoll. Analogous to pselect and ppoll, introduce an epoll_wait syscall variant that takes a struct timespec instead of int timeout. This patch (of 4): Make epoll more consistent with select/poll: pass along the timeout as timespec64 pointer. In anticipation of additional changes affecting all three polling mechanisms: - add epoll_pwait2 syscall with timespec semantics, and share poll_select_set_timeout implementation. - compute slack before conversion to absolute time, to save one ktime_get_ts64 call. Link: https://lkml.kernel.org/r/20201121144401.3727659-1-willemdebruijn.kernel@gmail.com Link: https://lkml.kernel.org/r/20201121144401.3727659-2-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: eliminate unnecessary lock for zero timeoutSoheil Hassas Yeganeh
We call ep_events_available() under lock when timeout is 0, and then call it without locks in the loop for the other cases. Instead, call ep_events_available() without lock for all cases. For non-zero timeouts, we will recheck after adding the thread to the wait queue. For zero timeout cases, by definition, user is opportunistically polling and will have to call epoll_wait again in the future. Note that this lock was kept in c5a282e9635e9 because the whole loop was historically under lock. This patch results in a 1% CPU/RPC reduction in RPC benchmarks. Link: https://lkml.kernel.org/r/20201106231635.3528496-9-soheil.kdev@gmail.com Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Cc: Guantao Liu <guantaol@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: replace gotos with a proper loopSoheil Hassas Yeganeh
The existing loop is pointless, and the labels make it really hard to follow the structure. Replace that control structure with a simple loop that returns when there are new events, there is a signal, or the thread has timed out. Link: https://lkml.kernel.org/r/20201106231635.3528496-8-soheil.kdev@gmail.com Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Cc: Guantao Liu <guantaol@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: pull all code between fetch_events and send_event into the loopSoheil Hassas Yeganeh
This is a no-op change which simplifies the follow up patches. Link: https://lkml.kernel.org/r/20201106231635.3528496-7-soheil.kdev@gmail.com Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Cc: Guantao Liu <guantaol@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: simplify and optimize busy loop logicSoheil Hassas Yeganeh
ep_events_available() is called multiple times around the busy loop logic, even though the logic is generally not used. ep_reset_busy_poll_napi_id() is similarly always called, even when busy loop is not used. Eliminate ep_reset_busy_poll_napi_id() and inline it inside ep_busy_loop(). Make ep_busy_loop() return whether there are any events available after the busy loop. This will eliminate unnecessary loads and branches, and simplifies the loop. Link: https://lkml.kernel.org/r/20201106231635.3528496-6-soheil.kdev@gmail.com Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Cc: Guantao Liu <guantaol@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: move eavail next to the list_empty_careful checkSoheil Hassas Yeganeh
This is a no-op change and simply to make the code more coherent. Link: https://lkml.kernel.org/r/20201106231635.3528496-5-soheil.kdev@gmail.com Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Cc: Guantao Liu <guantaol@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: pull fatal signal checks into ep_send_events()Soheil Hassas Yeganeh
To simplify the code, pull in checking the fatal signals into ep_send_events(). ep_send_events() is called only from ep_poll(). Note that, previously, we were always checking fatal events, but it is checked only if eavail is true. This should be fine because the goal of that check is to quickly return from epoll_wait() when there is a pending fatal signal. Link: https://lkml.kernel.org/r/20201106231635.3528496-4-soheil.kdev@gmail.com Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Cc: Guantao Liu <guantaol@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: simplify signal handlingSoheil Hassas Yeganeh
Check signals before locking ep->lock, and immediately return -EINTR if there is any signal pending. This saves a few loads, stores, and branches from the hot path and simplifies the loop structure for follow up patches. Link: https://lkml.kernel.org/r/20201106231635.3528496-3-soheil.kdev@gmail.com Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Cc: Guantao Liu <guantaol@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19epoll: check for events when removing a timed out thread from the wait queueSoheil Hassas Yeganeh
Patch series "simplify ep_poll". This patch series is a followup based on the suggestions and feedback by Linus: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com The first patch in the series is a fix for the epoll race in presence of timeouts, so that it can be cleanly backported to all affected stable kernels. The rest of the patch series simplify the ep_poll() implementation. Some of these simplifications result in minor performance enhancements as well. We have kept these changes under self tests and internal benchmarks for a few days, and there are minor (1-2%) performance enhancements as a result. This patch (of 8): After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout"), we break out of the ep_poll loop upon timeout, without checking whether there is any new events available. Prior to that patch-series we always called ep_events_available() after exiting the loop. This can cause races and missed wakeups. For example, consider the following scenario reported by Guantao Liu: Suppose we have an eventfd added using EPOLLET to an epollfd. Thread 1: Sleeps for just below 5ms and then writes to an eventfd. Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an event of the eventfd, it will write back on that fd. Thread 3: Calls epoll_wait with a negative timeout. Prior to abc610e01c66, it is guaranteed that Thread 3 will wake up either by Thread 1 or Thread 2. After abc610e01c66, Thread 3 can be blocked indefinitely if Thread 2 sees a timeout right before the write to the eventfd by Thread 1. Thread 2 will be woken up from schedule_hrtimeout_range and, with evail 0, it will not call ep_send_events(). To fix this issue: 1) Simplify the timed_out case as suggested by Linus. 2) while holding the lock, recheck whether the thread was woken up after its time out has reached. Note that (2) is different from Linus' original suggestion: It do not set "eavail = ep_events_available(ep)" to avoid unnecessary contention (when there are too many timed-out threads and a small number of events), as well as races mentioned in the discussion thread. This is the first patch in the series so that the backport to stable releases is straightforward. Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com Fixes: abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Guantao Liu <guantaol@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Guantao Liu <guantaol@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-19close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXECChristian Brauner
After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE. When CLOSE_RANGE_UNSHARE is specified the caller will receive a private file descriptor table in case their file descriptor table is currently shared. For the case where the caller has requested all file descriptors to be actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that the caller does not need any of the file descriptors anymore and will optimize the close operation by only copying all files in the range from 0 to 3 and no others. However, if the caller requested CLOSE_RANGE_CLOEXEC together with CLOSE_RANGE_UNSHARE the caller wants to still make use of the file descriptors so the kernel needs to copy all of them and can't optimize. The original patch didn't account for this and thus could cause oopses as evidenced by the syzbot report because it assumed that all fds had been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case. syzbot reported ================================================================== BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline] BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline] BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline] BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274 Read of size 8 at addr 0000000000000077 by task syz-executor511/8522 CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 __kasan_report mm/kasan/report.c:549 [inline] kasan_report.cold+0x5/0x37 mm/kasan/report.c:562 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:192 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline] atomic_long_read include/asm-generic/atomic-long.h:29 [inline] filp_close+0x22/0x170 fs/open.c:1274 close_files fs/file.c:402 [inline] put_files_struct fs/file.c:417 [inline] put_files_struct+0x1cc/0x350 fs/file.c:414 exit_files+0x12a/0x170 fs/file.c:435 do_exit+0xb4f/0x2a00 kernel/exit.c:818 do_group_exit+0x125/0x310 kernel/exit.c:920 get_signal+0x428/0x2100 kernel/signal.c:2792 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x447039 Code: Unable to access opcode bytes at RIP 0x44700f. RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039 RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c ================================================================== syzbot has tested the proposed patch and the reproducer did not trigger any issue: Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com Tested on: commit: 10f7cddd selftests/core: add regression test for CLOSE_RAN.. git tree: git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs kernel config: https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3 dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93 compiler: gcc (GCC) 10.1.0-syz 20200507 Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC") Cc: Giuseppe Scrivano <gscrivan@redhat.com> Cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-19io_uring: fix 0-iov read buffer selectPavel Begunkov
Doing vectored buf-select read with 0 iovec passed is meaningless and utterly broken, forbid it. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-18Add SMB 2 support for getting and setting SACLsBoris Protopopov
Fix passing of the additional security info via version operations. Force new open when getting SACL and avoid reuse of files that were previously open without sufficient privileges to access SACLs. Signed-off-by: Boris Protopopov <pboris@amazon.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-18Merge tag 'xfs-5.11-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "In this release we add the ability to set a 'needsrepair' flag indicating that we /know/ the filesystem requires xfs_repair, but other than that, it's the usual strengthening of metadata validation and miscellaneous cleanups. Summary: - Introduce a "needsrepair" "feature" to flag a filesystem as needing a pass through xfs_repair. This is key to enabling filesystem upgrades (in xfs_db) that require xfs_repair to make minor adjustments to metadata. - Refactor parameter checking of recovered log intent items so that we actually use the same validation code as them that generate the intent items. - Various fixes to online scrub not reacting correctly to directory entries pointing to inodes that cannot be igetted. - Refactor validation helpers for data and rt volume extents. - Refactor XFS_TRANS_DQ_DIRTY out of existence. - Fix a longstanding bug where mounting with "uqnoenforce" would start user quotas in non-enforcing mode but /proc/mounts would display "usrquota", implying that they are being enforced. - Don't flag dax+reflink inodes as corruption since that is a valid (but not fully functional) combination right now. - Clean up raid stripe validation functions. - Refactor the inode allocation code to be more straightforward. - Small prep cleanup for idmapping support. - Get rid of the xfs_buf_t typedef" * tag 'xfs-5.11-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (40 commits) xfs: remove xfs_buf_t typedef fs/xfs: convert comma to semicolon xfs: open code updating i_mode in xfs_set_acl xfs: remove xfs_vn_setattr_nonsize xfs: kill ialloced in xfs_dialloc() xfs: spilt xfs_dialloc() into 2 functions xfs: move xfs_dialloc_roll() into xfs_dialloc() xfs: move on-disk inode allocation out of xfs_ialloc() xfs: introduce xfs_dialloc_roll() xfs: convert noroom, okalloc in xfs_dialloc() to bool xfs: don't catch dax+reflink inodes as corruption in verifier xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks xfs: remove unneeded return value check for *init_cursor() xfs: introduce xfs_validate_stripe_geometry() xfs: show the proper user quota options xfs: remove the unused XFS_B_FSB_OFFSET macro xfs: remove unnecessary null check in xfs_generic_create xfs: directly return if the delta equal to zero xfs: check tp->t_dqinfo value instead of the XFS_TRANS_DQ_DIRTY flag xfs: delete duplicated tp->t_dqinfo null check and allocation ...
2020-12-18SMB3: Add support for getting and setting SACLsBoris Protopopov
Add SYSTEM_SECURITY access flag and use with smb2 when opening files for getting/setting SACLs. Add "system.cifs_ntsd_full" extended attribute to allow user-space access to the functionality. Avoid multiple server calls when setting owner, DACL, and SACL. Signed-off-by: Boris Protopopov <pboris@amazon.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-18io_uring: close a small race gap for files cancelPavel Begunkov
The purpose of io_uring_cancel_files() is to wait for all requests matching ->files to go/be cancelled. We should first drop files of a request in io_req_drop_files() and only then make it undiscoverable for io_uring_cancel_files. First drop, then delete from list. It's ok to leave req->id->files dangling, because it's not dereferenced by cancellation code, only compared against. It would potentially go to sleep and be awaken by following in io_req_drop_files() wake_up(). Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references") Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-18io_uring: fix io_wqe->work_list corruptionXiaoguang Wang
For the first time a req punted to io-wq, we'll initialize io_wq_work's list to be NULL, then insert req to io_wqe->work_list. If this req is not inserted into tail of io_wqe->work_list, this req's io_wq_work list will point to another req's io_wq_work. For splitted bio case, this req maybe inserted to io_wqe->work_list repeatedly, once we insert it to tail of io_wqe->work_list for the second time, now io_wq_work->list->next will be invalid pointer, which then result in many strang error, panic, kernel soft-lockup, rcu stall, etc. In my vm, kernel doest not have commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"), below fio job can reproduce this bug steadily: [global] name=iouring-sqpoll-iopoll-1 ioengine=io_uring iodepth=128 numjobs=1 thread rw=randread direct=1 registerfiles=1 hipri=1 bs=4m size=100M runtime=120 time_based group_reporting randrepeat=0 [device] directory=/home/feiman.wxg/mntpoint/ # an ext4 mount point If we have commit cc29e1bf0d63f7 ("block: disable iopoll for split bio"), there will no splitted bio case for polled io, but I think we still to need to fix this list corruption, it also should maybe go to stable branchs. To fix this corruption, if a req is inserted into tail of io_wqe->work_list, initialize req->io_wq_work->list->next to bu NULL. Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-18cifs: Avoid error pointer dereferenceSamuel Cabrero
The patch 7d6535b72042: "cifs: Simplify reconnect code when dfs upcall is enabled" leads to the following static checker warning: fs/cifs/connect.c:160 reconn_set_next_dfs_target() error: 'server->hostname' dereferencing possible ERR_PTR() Avoid dereferencing the error pointer by early returning on error condition. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Samuel Cabrero <scabrero@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-18cifs: Re-indent cifs_swn_reconnect()Dan Carpenter
This code is slightly nicer if we flip the cifs_sockaddr_equal() around and pull all the code in one tab. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Samuel Cabrero <scabrero@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-18cifs: Unlock on errors in cifs_swn_reconnect()Dan Carpenter
There are three error paths which need to unlock before returning. Fixes: 121d947d4fe1 ("cifs: Handle witness client move notification") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Samuel Cabrero <scabrero@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-18cifs: Delete a stray unlock in cifs_swn_reconnect()Dan Carpenter
The unlock is done in the caller, this is a stray which leads to a double unlock bug. Fixes: bf80e5d4259a ("cifs: Send witness register and unregister commands to userspace daemon") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Samuel Cabrero <scabrero@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-17Merge tag 'for-linus-5.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull jffs2, ubi and ubifs updates from Richard Weinberger: "JFFS2: - Fix for a remount regression - Fix for an abnormal GC exit - Fix for a possible NULL pointer issue while mounting UBI: - Add support ECC-ed NOR flash - Removal of dead code UBIFS: - Make node dumping debug code more reliable - Various cleanups: less ifdefs, less typos - Fix for an info leak" * tag 'for-linus-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: ubifs_dump_node: Dump all branches of the index node ubifs: ubifs_dump_sleb: Remove unused function ubifs: Pass node length in all node dumping callers Revert "ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len" ubifs: Limit dumping length by size of memory which is allocated for the node ubifs: Remove the redundant return in dbg_check_nondata_nodes_order jffs2: Fix NULL pointer dereference in rp_size fs option parsing ubifs: Fixed print foramt mismatch in ubifs ubi: Do not zero out EC and VID on ECC-ed NOR flashes jffs2: remove trailing semicolon in macro definition ubifs: Fix error return code in ubifs_init_authentication() ubifs: wbuf: Don't leak kernel memory to flash ubi: Remove useless code in bytes_str_to_int ubifs: Fix the printing type of c->big_lpt jffs2: Allow setting rp_size to zero during remounting jffs2: Fix ignoring mounting options problem during remounting jffs2: Fix GC exit abnormally ubifs: Code cleanup by removing ifdef macro surrounding jffs2: Fix if/else empty body warnings ubifs: Delete duplicated words + other fixes
2020-12-17Merge tag '5.11-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs updates from Steve French: "The largest part are for support of the newer mount API which has been needed for cifs/smb3 mounts for a long time due to the new API's better handling of remount, and better error reporting. There are three additional small cleanup patches for this being tested, that are not included yet. This series also includes addition of support for the SMB3 witness protocol which can provide important notifications from the server to client on server address or export or network changes. This can be useful for example in order to be notified before the failure - when a server's IP address changes (in the future it will allow us to support server notifications of when a share is moved). It also includes three patches for stable e.g. some that better handle some confusing error messages during session establishment" * tag '5.11-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6: (55 commits) cifs: update internal module version number cifs: Fix support for remount when not changing rsize/wsize cifs: handle "guest" mount parameter cifs: correct four aliased mount parms to allow use of previous names cifs: Tracepoints and logs for tracing credit changes. cifs: fix use after free in cifs_smb3_do_mount() cifs: fix rsize/wsize to be negotiated values cifs: Fix some error pointers handling detected by static checker smb3: remind users that witness protocol is experimental cifs: update super_operations to show_devname cifs: fix uninitialized variable in smb3_fs_context_parse_param cifs: update mnt_cifs_flags during reconfigure cifs: move update of flags into a separate function cifs: remove ctx argument from cifs_setup_cifs_sb cifs: do not allow changing posix_paths during remount cifs: uncomplicate printing the iocharset parameter cifs: don't create a temp nls in cifs_setup_ipc cifs: simplify handling of cifs_sb/ctx->local_nls cifs: we do not allow changing username/password/unc/... during remount cifs: add initial reconfigure support ...
2020-12-17Merge tag 'trace-v5.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "The major update to this release is that there's a new arch config option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS. Currently, only x86_64 enables it. All the ftrace callbacks now take a struct ftrace_regs instead of a struct pt_regs. If the architecture has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will have enough information to read the arguments of the function being traced, as well as access to the stack pointer. This way, if a user (like live kernel patching) only cares about the arguments, then it can avoid using the heavier weight "regs" callback, that puts in enough information in the struct ftrace_regs to simulate a breakpoint exception (needed for kprobes). A new config option that audits the timestamps of the ftrace ring buffer at most every event recorded. Ftrace recursion protection has been cleaned up to move the protection to the callback itself (this saves on an extra function call for those callbacks). Perf now handles its own RCU protection and does not depend on ftrace to do it for it (saving on that extra function call). New debug option to add "recursed_functions" file to tracefs that lists all the places that triggered the recursion protection of the function tracer. This will show where things need to be fixed as recursion slows down the function tracer. The eval enum mapping updates done at boot up are now offloaded to a work queue, as it caused a noticeable pause on slow embedded boards. Various clean ups and last minute fixes" * tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits) tracing: Offload eval map updates to a work queue Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS" ring-buffer: Add rb_check_bpage in __rb_allocate_pages ring-buffer: Fix two typos in comments tracing: Drop unneeded assignment in ring_buffer_resize() tracing: Disable ftrace selftests when any tracer is running seq_buf: Avoid type mismatch for seq_buf_init ring-buffer: Fix a typo in function description ring-buffer: Remove obsolete rb_event_is_commit() ring-buffer: Add test to validate the time stamp deltas ftrace/documentation: Fix RST C code blocks tracing: Clean up after filter logic rewriting tracing: Remove the useless value assignment in test_create_synth_event() livepatch: Use the default ftrace_ops instead of REGS when ARGS is available ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs MAINTAINERS: assign ./fs/tracefs to TRACING tracing: Fix some typos in comments ftrace: Remove unused varible 'ret' ring-buffer: Add recording of ring buffer recursion into recursed_functions ...
2020-12-17Merge tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - NFSv3: Add emulation of lookupp() to improve open_by_filehandle() support - A series of patches to improve readdir performance, particularly with large directories - Basic support for using NFS/RDMA with the pNFS files and flexfiles drivers - Micro-optimisations for RDMA - RDMA tracing improvements Bugfixes: - Fix a long standing bug with xs_read_xdr_buf() when receiving partial pages (Dan Aloni) - Various fixes for getxattr and listxattr, when used over non-TCP transports - Fixes for containerised NFS from Sargun Dhillon - switch nfsiod to be an UNBOUND workqueue (Neil Brown) - READDIR should not ask for security label information if there is no LSM policy (Olga Kornievskaia) - Avoid using interval-based rebinding with TCP in lockd (Calum Mackay) - A series of RPC and NFS layer fixes to support the NFSv4.2 READ_PLUS code - A couple of fixes for pnfs/flexfiles read failover Cleanups: - Various cleanups for the SUNRPC xdr code in conjunction with the READ_PLUS fixes" * tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits) NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read() pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read NFSv4/pnfs: Add tracing for the deviceid cache fs/lockd: convert comma to semicolon NFSv4.2: fix error return on memory allocation failure NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer NFSv4.2: decode_read_plus_hole() needs to check the extent offset NFSv4.2: decode_read_plus_data() must skip padding after data segment NFSv4.2: Ensure we always reset the result->count in decode_read_plus() SUNRPC: When expanding the buffer, we may need grow the sparse pages SUNRPC: Cleanup - constify a number of xdr_buf helpers SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field SUNRPC: _copy_to/from_pages() now check for zero length SUNRPC: Cleanup xdr_shrink_bufhead() SUNRPC: Fix xdr_expand_hole() SUNRPC: Fixes for xdr_align_data() SUNRPC: _shift_data_left/right_pages should check the shift length ...
2020-12-17Merge tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "The big ticket item here is support for msgr2 on-wire protocol, which adds the option of full in-transit encryption using AES-GCM algorithm (myself). On top of that we have a series to avoid intermittent errors during recovery with recover_session=clean and some MDS request encoding work from Jeff, a cap handling fix and assorted observability improvements from Luis and Xiubo and a good number of cleanups. Luis also ran into a corner case with quotas which sadly means that we are back to denying cross-quota-realm renames" * tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits) libceph: drop ceph_auth_{create,update}_authorizer() libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1 libceph, ceph: implement msgr2.1 protocol (crc and secure modes) libceph: introduce connection modes and ms_mode option libceph, rbd: ignore addr->type while comparing in some cases libceph, ceph: get and handle cluster maps with addrvecs libceph: factor out finish_auth() libceph: drop ac->ops->name field libceph: amend cephx init_protocol() and build_request() libceph, ceph: incorporate nautilus cephx changes libceph: safer en/decoding of cephx requests and replies libceph: more insight into ticket expiry and invalidation libceph: move msgr1 protocol specific fields to its own struct libceph: move msgr1 protocol implementation to its own file libceph: separate msgr1 protocol implementation libceph: export remaining protocol independent infrastructure libceph: export zero_page libceph: rename and export con->flags bits libceph: rename and export con->state states libceph: make con->state an int ...
2020-12-17Merge tag 'ovl-update-5.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: - Allow unprivileged mounting in a user namespace. For quite some time the security model of overlayfs has been that operations on underlying layers shall be performed with the privileges of the mounting task. This way an unprvileged user cannot gain privileges by the act of mounting an overlayfs instance. A full audit of all function calls made by the overlayfs code has been performed to see whether they conform to this model, and this branch contains some fixes in this regard. - Support running on copied filesystem images by optionally disabling UUID verification. - Bug fixes as well as documentation updates. * tag 'ovl-update-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: unprivieged mounts ovl: do not get metacopy for userxattr ovl: do not fail because of O_NOATIME ovl: do not fail when setting origin xattr ovl: user xattr ovl: simplify file splice ovl: make ioctl() safe ovl: check privs before decoding file handle vfs: verify source area in vfs_dedupe_file_range_one() vfs: move cap_convert_nscap() call into vfs_setxattr() ovl: fix incorrect extent info in metacopy case ovl: expand warning in ovl_d_real() ovl: document lower modification caveats ovl: warn about orphan metacopy ovl: doc clarification ovl: introduce new "uuid=off" option for inodes index feature ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fh
2020-12-17Merge tag 'fuse-update-5.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Improve performance of virtio-fs in mixed read/write workloads - Try to revalidate cache before returning EEXIST on exclusive create - Add a couple of miscellaneous bug fixes as well as some code cleanups * tag 'fuse-update-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix bad inode fuse: support SB_NOSEC flag to improve write performance fuse: add a flag FUSE_OPEN_KILL_SUIDGID for open() request fuse: don't send ATTR_MODE to kill suid/sgid for handle_killpriv_v2 fuse: setattr should set FATTR_KILL_SUIDGID fuse: set FUSE_WRITE_KILL_SUIDGID in cached write path fuse: rename FUSE_WRITE_KILL_PRIV to FUSE_WRITE_KILL_SUIDGID fuse: introduce the notion of FUSE_HANDLE_KILLPRIV_V2 fuse: always revalidate if exclusive create virtiofs: clean up error handling in virtio_fs_get_tree() fuse: add fuse_sb_destroy() helper fuse: simplify get_fuse_conn*() fuse: get rid of fuse_mount refcount virtiofs: simplify sb setup virtiofs fix leak in setup fuse: launder page should wait for page writeback
2020-12-17Merge tag 'f2fs-for-5.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've made more work into per-file compression support. For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to change the algorithm or cluster size per file. F2FS_IOC_COMPRESS | DECOMPRESS_FILE provides a way to compress and decompress the existing normal files manually. There is also a new mount option, compress_mode=fs|user, which can control who compresses the data. Chao also added a checksum feature with a mount option so that we are able to detect any corrupted cluster. In addition, Daniel contributed casefolding with encryption patch, which will be used for Android devices. Summary: Enhancements: - add ioctls and mount option to manage per-file compression feature - support casefolding with encryption - support checksum for compressed cluster - avoid IO starvation by replacing mutex with rwsem - add sysfs, max_io_bytes, to control max bio size Bug fixes: - fix use-after-free issue when compression and fsverity are enabled - fix consistency corruption during fault injection test - fix data offset for lseek - get rid of buffer_head which has 32bits limit in fiemap - fix some bugs in multi-partitions support - fix nat entry count calculation in shrinker - fix some stat information And, we've refactored some logics and fix minor bugs as well" * tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits) f2fs: compress: fix compression chksum f2fs: fix shift-out-of-bounds in sanity_check_raw_super() f2fs: fix race of pending_pages in decompression f2fs: fix to account inline xattr correctly during recovery f2fs: inline: fix wrong inline inode stat f2fs: inline: correct comment in f2fs_recover_inline_data f2fs: don't check PAGE_SIZE again in sanity_check_raw_super() f2fs: convert to F2FS_*_INO macro f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size f2fs: don't allow any writes on readonly mount f2fs: avoid race condition for shrinker count f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE f2fs: add compress_mode mount option f2fs: Remove unnecessary unlikely() f2fs: init dirty_secmap incorrectly f2fs: remove buffer_head which has 32bits limit f2fs: fix wrong block count instead of bytes f2fs: use new conversion functions between blks and bytes f2fs: rename logical_to_blk and blk_to_logical f2fs: fix kbytes written stat for multi-device case ...
2020-12-17Merge tag 'for_v5.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, reiserfs, quota and writeback updates from Jan Kara: - a couple of quota fixes (mostly for problems found by syzbot) - several ext2 cleanups - one fix for reiserfs crash on corrupted image - a fix for spurious warning in writeback code * tag 'for_v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: writeback: don't warn on an unregistered BDI in __mark_inode_dirty fs: quota: fix array-index-out-of-bounds bug by passing correct argument to vfs_cleanup_quota_inode() reiserfs: add check for an invalid ih_entry_count ext2: Fix fall-through warnings for Clang fs/ext2: Use ext2_put_page docs: filesystems: Reduce ext2.rst to one top-level heading quota: Sanity-check quota file headers on load quota: Don't overflow quota file offsets ext2: Remove unnecessary blank fs/quota: update quota state flags scheme with project quota flags
2020-12-17Merge tag 'fsnotify_for_v5.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "A few fsnotify fixes from Amir fixing fallout from big fsnotify overhaul a few months back and an improvement of defaults limiting maximum number of inotify watches from Waiman" * tag 'fsnotify_for_v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix events reported to watching parent and child inotify: convert to handle_inode_event() interface fsnotify: generalize handle_inode_event() inotify: Increase default inotify.max_user_watches limit to 1048576
2020-12-17ext4: simplify ext4 error translationJan Kara
We convert errno's to ext4 on-disk format error codes in save_error_info(). Add a function and a bit of macro magic to make this simpler. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: move functions in super.cJan Kara
Just move error info related functions in super.c close to ext4_handle_error(). We'll want to combine save_error_info() with ext4_handle_error() and this makes change more obvious and saves a forward declaration as well. No functional change. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: make ext4_abort() use __ext4_error()Jan Kara
The only difference between __ext4_abort() and __ext4_error() is that the former one ignores errors=continue mount option. Unify the code to reduce duplication. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: standardize error message in ext4_protect_reserved_inode()Jan Kara
We use __ext4_error() when ext4_protect_reserved_inode() finds filesystem corruption. However EXT4_ERROR_INODE_ERR() is perfectly capable of reporting all the needed information. So just use that. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: remove redundant sb checksum recomputationJan Kara
Superblock is written out either through ext4_commit_super() or through ext4_handle_dirty_super(). In both cases we recompute the checksum so it is not necessary to recompute it after updating superblock free inodes & blocks counters. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: don't remount read-only with errors=continue on rebootJan Kara
ext4_handle_error() with errors=continue mount option can accidentally remount the filesystem read-only when the system is rebooting. Fix that. Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: fix deadlock with fs freezing and EA inodesJan Kara
Xattr code using inodes with large xattr data can end up dropping last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with transaction started and so ext4_evict_inode() can deadlock against fs freezing like: CPU1 CPU2 removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state Generally it is advisable to delete inodes from a separate transaction as it can consume quite some credits however in this case it would be quite clumsy and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have transaction already started and thus it is not really needed anyway. Cc: stable@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17jbd2: add a helper to find out number of fast commit blocksHarshad Shirwadkar
Add a helper to read number of fast commit blocks from jbd2 superblock and also rename the JBD2_MIN_FC_BLKS to JBD2_DEFAULT_FAST_COMMIT_BLOCKS since this constant is just the default number of fast commit blocks to use in case number of fast commit blocks isn't set in jbd2 superblock. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201120202232.2240293-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.hHarshad Shirwadkar
This patch makes fast_commit.h byte by byte identical with e2fsprogs/fast_commit.h. This will help us ensure that there are no on-disk format inconsistencies between e2fsck and kernel ext4. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201120202232.2240293-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of just letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/03497331f088a938d7a728e7a689bd7953139429.1605896059.git.gustavoars@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: add docs about fast commit idempotenceHarshad Shirwadkar
Fast commit on-disk format is designed such that the replay of these tags can be idempotent. This patch adds documentation in the code in form of comments and in form kernel docs that describes these characteristics. This patch also adds a TODO item needed to ensure kernel fast commit replay idempotence. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201119232822.1860882-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: remove the unused EXT4_CURRENT_REV macroKaixu Xia
There are no callers of the EXT4_CURRENT_REV macro, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1605164202-31120-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: fix an IS_ERR() vs NULL checkDan Carpenter
The ext4_find_extent() function never returns NULL, it returns error pointers. Fixes: 44059e503b03 ("ext4: fast commit recovery path") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201023112232.GB282278@mwanda Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-12-17ext4: check for invalid block size early when mounting a file systemTheodore Ts'o
Check for valid block size directly by validating s_log_block_size; we were doing this in two places. First, by calculating blocksize via BLOCK_SIZE << s_log_block_size, and then checking that the blocksize was valid. And then secondly, by checking s_log_block_size directly. The first check is not reliable, and can trigger an UBSAN warning if s_log_block_size on a maliciously corrupted superblock is greater than 22. This is harmless, since the second test will correctly reject the maliciously fuzzed file system, but to make syzbot shut up, and because the two checks are duplicative in any case, delete the blocksize check, and move the s_log_block_size earlier in ext4_fill_super(). Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+345b75652b1d24227443@syzkaller.appspotmail.com
2020-12-17ext4: fix a memory leak of ext4_free_dataChunguang Xu
When freeing metadata, we will create an ext4_free_data and insert it into the pending free list. After the current transaction is committed, the object will be freed. ext4_mb_free_metadata() will check whether the area to be freed overlaps with the pending free list. If true, return directly. At this time, ext4_free_data is leaked. Fortunately, the probability of this problem is small, since it only occurs if the file system is corrupted such that a block is claimed by more one inode and those inodes are deleted within a single jbd2 transaction. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-12-17io_uring: limit {io|sq}poll submit locking scopePavel Begunkov
We don't need to take uring_lock for SQPOLL|IOPOLL to do io_cqring_overflow_flush() when cq_overflow_list is empty, remove it from the hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17io_uring: inline io_cqring_mark_overflow()Pavel Begunkov
There is only one user of it and the name is misleading, get rid of it by inlining. By the way make overflow_flush's return value deduction simpler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>