summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2024-08-25io_uring/napi: refactor __io_napi_busy_loop()Pavel Begunkov
we don't need to set ->napi_prefer_busy_poll if we're not going to poll, do the checks first and all polling preparation after. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2ad7ede8cc7905328fc62e8c3805fdb11635ae0b.1723039801.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/kbuf: turn io_buffer_list booleans into flagsJens Axboe
We could just move these two and save some space, but in preparation for adding another flag, turn them into flags first. This saves 8 bytes in struct io_buffer_list, making it exactly half a cacheline on 64-bit archs now rather than 40 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/net: use ITER_UBUF for single segment send mapsJens Axboe
Just like what is being done on the recv side, if we only map a single segment, then use ITER_UBUF for mapping it. That's more efficient than using an ITER_IOVEC. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/kbuf: use 'bl' directly rather than req->buf_listJens Axboe
req->buf_list is assigned higher up and is safe to use as we remain within a locked region, as is the 'bl' variable itself from which it was assigned. To improve readability, use 'bl' directly rather than get it from the io_kiocb, if we need to increment the head directly in the buffer selection path. This makes it readily apparent that it's the same io_buffer_list being used. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: micro optimization of __io_sq_thread() conditionOlivier Langlois
reverse the order of the element evaluation in an if statement. for many users that are not using iopoll, the iopoll_list will always evaluate to false after having made a memory access whereas to_submit is very likely already loaded in a register. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/052ca60b5c49e7439e4b8bd33bfab4a09d36d3d6.1722374371.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/rsrc: enable multi-hugepage buffer coalescingChenliang Li
Add support for checking and coalescing multi-hugepage-backed fixed buffers. The coalescing optimizes both time and space consumption caused by mapping and storing multi-hugepage fixed buffers. A coalescable multi-hugepage buffer should fully cover its folios (except potentially the first and last one), and these folios should have the same size. These requirements are for easier processing later, also we need same size'd chunks in io_import_fixed for fast iov_iter adjust. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-3-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/rsrc: store folio shift and mask into imuChenliang Li
Store the folio shift and folio mask into imu struct and use it in iov_iter adjust, as we will have non PAGE_SIZE'd chunks if a multi-hugepage buffer get coalesced. Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240731090133.4106-2-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring: add napi busy settings to the fdinfo outputOlivier Langlois
This info may be useful when attempting to debug a problem involving a ring using the NAPI feature. Here is an example of the output: ip-172-31-39-89 /proc/772/fdinfo # cat 14 pos: 0 flags: 02000002 mnt_id: 16 ino: 10243 SqMask: 0xff SqHead: 633 SqTail: 633 CachedSqHead: 633 CqMask: 0x3fff CqHead: 430250 CqTail: 430250 CachedCqTail: 430250 SQEs: 0 CQEs: 0 SqThread: 885 SqThreadCpu: 0 SqTotalTime: 52793826 SqWorkTime: 3590465 UserFiles: 0 UserBufs: 0 PollList: op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=6, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=6, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 op=10, task_works=0 CqOverflowList: NAPI: enabled napi_busy_poll_to: 1 napi_prefer_busy_poll: true Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb184f8b62703ddd3e6e19eae7ab6c67b97e1e10.1722293317.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-21io_uring/kbuf: sanitize peek buffer setupJens Axboe
Harden the buffer peeking a bit, by adding a sanity check for it having a valid size. Outside of that, arg->max_len is a size_t, though it's only ever set to a 32-bit value (as it's governed by MAX_RW_COUNT). Bump our needed check to a size_t so we know it fits. Finally, cap the calculated needed iov value to the PEEK_MAX_IMPORT, which is the maximum number of segments that should be peeked. Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-13io_uring/sqpoll: annotate debug task == current with data_race()Jens Axboe
There's a debug check in io_sq_thread_park() checking if it's the SQPOLL thread itself calling park. KCSAN warns about this, as we should not be reading sqd->thread outside of sqd->lock. Just silence this with data_race(). The pointer isn't used for anything but this debug check. Reported-by: syzbot+2b946a3fd80caf971b21@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-12introduce fd_file(), convert all accessors to it.Al Viro
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12io_uring/napi: remove duplicate io_napi_entry timeout assignationOlivier Langlois
io_napi_entry() has 2 calling sites. One of them is unlikely to find an entry and if it does, the timeout should arguable not be updated. The other io_napi_entry() calling site is overwriting the update made by io_napi_entry() so the io_napi_entry() timeout value update has no or little value and therefore is removed. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/145b54ff179f87609e20dffaf5563c07cdbcad1a.1723423275.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-12io_uring/napi: check napi_enabled in io_napi_add() before proceedingOlivier Langlois
doing so avoids the overhead of adding napi ids to all the rings that do not enable napi. if no id is added to napi_list because napi is disabled, __io_napi_busy_loop() will not be called. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Fixes: b4ccc4dd1330 ("io_uring/napi: enable even with a timeout of 0") Link: https://lore.kernel.org/r/bd989ccef5fda14f5fd9888faf4fefcf66bd0369.1723400131.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-07io_uring/net: don't pick multiple buffers for non-bundle sendJens Axboe
If a send is issued marked with IOSQE_BUFFER_SELECT for selecting a buffer, unless it's a bundle, it should not select multiple buffers. Cc: stable@vger.kernel.org Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-07io_uring/net: ensure expanded bundle send gets marked for cleanupJens Axboe
If the iovec inside the kmsg isn't already allocated AND one gets expanded beyond the fixed size, then the request may not already have been marked for cleanup. Ensure that it is. Cc: stable@vger.kernel.org Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-07io_uring/net: ensure expanded bundle recv gets marked for cleanupJens Axboe
If the iovec inside the kmsg isn't already allocated AND one gets expanded beyond the fixed size, then the request may not already have been marked for cleanup. Ensure that it is. Cc: stable@vger.kernel.org Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-30io_uring: remove unused local list heads in NAPI functionsOlivier Langlois
These lists are unused, remove them. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0a0ae3e955aed0f3e3d29882fb3d3cb575e0009b.1722294947.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-30io_uring: keep multishot request NAPI timeout currentOlivier Langlois
This refresh statement was originally present in the original patch: https://lore.kernel.org/netdev/20221121191437.996297-2-shr@devkernel.io/ It has been removed with no explanation in v6: https://lore.kernel.org/netdev/20230201222254.744422-2-shr@devkernel.io/ It is important to make the refresh for multishot requests, because if no new requests using the same NAPI device are added to the ring, the entry will become stale and be removed silently. The unsuspecting user will not know that their ring had busy polling for only 60 seconds before being pruned. Signed-off-by: Olivier Langlois <olivier@trillion01.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 8d0c12a80cdeb ("io-uring: add napi busy poll support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/0fe61a019ec61e5708cd117cb42ed0dab95e1617.1722294646.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-26io_uring/napi: pass ktime to io_napi_adjust_timeoutPavel Begunkov
Pass the waiting time for __io_napi_adjust_timeout as ktime and get rid of all timespec64 conversions. It's especially simpler since the caller already have a ktime. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f5b8e8eed4f53a1879e031a6712b25381adc23d.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-26io_uring/napi: use ktime in busy pollingPavel Begunkov
It's more natural to use ktime/ns instead of keeping around usec, especially since we're comparing it against user provided timers, so convert napi busy poll internal handling to ktime. It's also nicer since the type (ktime_t vs unsigned long) now tells the unit of measure. Keep everything as ktime, which we convert to/from micro seconds for IORING_[UN]REGISTER_NAPI. The net/ busy polling works seems to work with usec, however it's not real usec as shift by 10 is used to get it from nsecs, see busy_loop_current_time(), so it's easy to get truncated nsec back and we get back better precision. Note, we can further improve it later by removing the truncation and maybe convincing net/ to use ktime/ns instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95e7ec8d095069a3ed5d40a4bc6f8b586698bc7e.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-25io_uring/msg_ring: fix uninitialized use of target_req->flagsJens Axboe
syzbot reports that KMSAN complains that 'nr_tw' is an uninit-value with the following report: BUG: KMSAN: uninit-value in io_req_local_work_add io_uring/io_uring.c:1192 [inline] BUG: KMSAN: uninit-value in io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240 io_req_local_work_add io_uring/io_uring.c:1192 [inline] io_req_task_work_add_remote+0x588/0x5d0 io_uring/io_uring.c:1240 io_msg_remote_post io_uring/msg_ring.c:102 [inline] io_msg_data_remote io_uring/msg_ring.c:133 [inline] io_msg_ring_data io_uring/msg_ring.c:152 [inline] io_msg_ring+0x1c38/0x1ef0 io_uring/msg_ring.c:305 io_issue_sqe+0x383/0x22c0 io_uring/io_uring.c:1710 io_queue_sqe io_uring/io_uring.c:1924 [inline] io_submit_sqe io_uring/io_uring.c:2180 [inline] io_submit_sqes+0x1259/0x2f20 io_uring/io_uring.c:2295 __do_sys_io_uring_enter io_uring/io_uring.c:3205 [inline] __se_sys_io_uring_enter+0x40c/0x3ca0 io_uring/io_uring.c:3142 __x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3142 x64_sys_call+0x2d82/0x3c10 arch/x86/include/generated/asm/syscalls_64.h:427 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f which is the following check: if (nr_tw < nr_wait) return; in io_req_local_work_add(). While nr_tw itself cannot be uninitialized, it does depend on req->flags, which off the msg ring issue path can indeed be uninitialized. Fix this by always clearing the allocated 'req' fully if we can't grab one from the cache itself. Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Reported-by: syzbot+82609b8937a4458106ca@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/000000000000fd3d8d061dfc0e4a@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24io_uring: align iowq and task request error handlingPavel Begunkov
There is a difference in how io_queue_sqe and io_wq_submit_work treat error codes they get from io_issue_sqe. The first one fails anything unknown but latter only fails when the code is negative. It doesn't make sense to have this discrepancy, align them to the io_queue_sqe behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c550e152bf4a290187f91a4322ddcb5d6d1f2c73.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24io_uring: simplify io_uring_cmd returnPavel Begunkov
We don't have to return error code from an op handler back to core io_uring, so once io_uring_cmd() sets the results and handles errors we can juts return IOU_OK and simplify the code. Note, only valid with e0b23d9953b0c ("io_uring: optimise ltimeout for inline execution"), there was a problem with iopoll before. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8eae2be5b2a49236cd5f1dadbd1aa5730e9e2d4f.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24io_uring: fix io_match_task must_holdPavel Begunkov
The __must_hold annotation in io_match_task() uses a non existing parameter "req", fix it. Fixes: 6af3f48bf6156 ("io_uring: fix link traversal locking") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3e65ee7709e96507cef3d93291746f2c489f2307.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24io_uring: don't allow netpolling with SETUP_IOPOLLPavel Begunkov
IORING_SETUP_IOPOLL rings don't have any netpoll handling, let's fail attempts to register netpolling in this case, there might be people who will mix up IOPOLL and netpoll. Cc: stable@vger.kernel.org Fixes: ef1186c1a875b ("io_uring: add register/unregister napi function") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e7553aee0a8ae4edec6742cd6dd0c1e6914fba8.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24io_uring: tighten task exit cancellationsPavel Begunkov
io_uring_cancel_generic() should retry if any state changes like a request is completed, however in case of a task exit it only goes for another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT) request got completed. Let's assume we have a non-tracked request executing in iowq and a tracked request linked to it. Let's also assume io_uring_cancel_generic() fails to find and cancel the request, i.e. via io_run_local_work(), which may happen as io-wq has gaps. Next, the request logically completes, io-wq still hold a ref but queues it for completion via tw, which happens in io_uring_try_cancel_requests(). After, right before prepare_to_wait() io-wq puts the request, grabs the linked one and tries executes it, e.g. arms polling. Finally the cancellation loop calls prepare_to_wait(), there are no tw to run, no tracked request was completed, so the tctx_inflight() check passes and the task is put to indefinite sleep. Cc: stable@vger.kernel.org Fixes: 3f48cf18f886c ("io_uring: unify files and task cancel") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-20io_uring: fix error pbuf checkingPavel Begunkov
Syz reports a problem, which boils down to NULL vs IS_ERR inconsistent error handling in io_alloc_pbuf_ring(). KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:__io_remove_buffers+0xac/0x700 io_uring/kbuf.c:341 Call Trace: <TASK> io_put_bl io_uring/kbuf.c:378 [inline] io_destroy_buffers+0x14e/0x490 io_uring/kbuf.c:392 io_ring_ctx_free+0xa00/0x1070 io_uring/io_uring.c:2613 io_ring_exit_work+0x80f/0x8a0 io_uring/io_uring.c:2844 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd40 kernel/workqueue.c:3390 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Cc: stable@vger.kernel.org Reported-by: syzbot+2074b1a3d447915c6f1c@syzkaller.appspotmail.com Fixes: 87585b05757dc ("io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c5f9df20560bd9830401e8e48abc029e7cfd9f5e.1721329239.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-20io_uring: fix lost getsockopt completionsPavel Begunkov
There is a report that iowq executed getsockopt never completes. The reason being that io_uring_cmd_sock() can return a positive result, and io_uring_cmd() propagates it back to core io_uring, instead of IOU_OK. In case of io_wq_submit_work(), the request will be dropped without completing it. The offending code was introduced by a hack in a9c3eda7eada9 ("io_uring: fix submission-failure handling for uring-cmd"), however it was fine until getsockopt was introduced and started returning positive results. The right solution is to always return IOU_OK, since e0b23d9953b0c ("io_uring: optimise ltimeout for inline execution"), we should be able to do it without problems, however for the sake of backporting and minimising side effects, let's keep returning negative return codes and otherwise do IOU_OK. Link: https://github.com/axboe/liburing/issues/1181 Cc: stable@vger.kernel.org Fixes: 8e9fad0e70b7b ("io_uring: Add io_uring command support for sockets") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/ff349cf0654018189b6077e85feed935f0f8839e.1721149870.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-16Merge tag 'net-next-6.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Not much excitement - a handful of large patchsets (devmem among them) did not make it in time. Core & protocols: - Use local_lock in addition to local_bh_disable() to protect per-CPU resources in networking, a step closer for local_bh_disable() not to act as a big lock on PREEMPT_RT - Use flex array for netdevice priv area, ensure its cache alignment - Add a sysctl knob to allow user to specify a default rto_min at socket init time. Bit of a big hammer but multiple companies were independently carrying such patch downstream so clearly it's useful - Support scheduling transmission of packets based on CLOCK_TAI - Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned off using cpusets - Support multiple L2TPv3 UDP tunnels using the same 5-tuple address - Allow configuration of multipath hash seed, to both allow synchronizing hashing of two routers, and preventing partial accidental sync - Improve TCP compliance with RFC 9293 for simultaneous connect() - Support sending NAT keepalives in IPsec ESP in UDP states. Userspace IKE daemon had to do this before, but the kernel can better keep track of it - Support sending supervision HSR frames with MAC addresses stored in ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled - Introduce IPPROTO_SMC for selecting SMC when socket is created - Allow UDP GSO transmit from devices with no checksum offload - openvswitch: add packet sampling via psample, separating the sampled traffic from "upcall" packets sent to user space for forwarding - nf_tables: shrink memory consumption for transaction objects Things we sprinkled into general kernel code: - Power Sequencing subsystem (used by Qualcomm Bluetooth driver for QCA6390) [ Already merged separately - Linus ] - Add IRQ information in sysfs for auxiliary bus - Introduce guard definition for local_lock - Add aligned flavor of __cacheline_group_{begin, end}() markings for grouping fields in structures BPF: - Notify user space (via epoll) when a struct_ops object is getting detached/unregistered - Add new kfuncs for a generic, open-coded bits iterator - Enable BPF programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head - Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible WRT BTF from modules - Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs - riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter - Add the capability to offload the netfilter flowtable in XDP layer through kfuncs Driver API: - Allow users to configure IRQ tresholds between which automatic IRQ moderation can choose - Expand Power Sourcing (PoE) status with power, class and failure reason. Support setting power limits - Track additional RSS contexts in the core, make sure configuration changes don't break them - Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated ESP data paths - Support updating firmware on SFP modules Tests and tooling: - mptcp: use net/lib.sh to manage netns - TCP-AO and TCP-MD5: replace debug prints used by tests with tracepoints - openvswitch: make test self-contained (don't depend on OvS CLI tools) Drivers: - Ethernet high-speed NICs: - Broadcom (bnxt): - increase the max total outstanding PTP TX packets to 4 - add timestamping statistics support - implement netdev_queue_mgmt_ops - support new RSS context API - Intel (100G, ice, idpf): - implement FEC statistics and dumping signal quality indicators - support E825C products (with 56Gbps PHYs) - nVidia/Mellanox: - support HW-GRO - mlx4/mlx5: support per-queue statistics via netlink - obey the max number of EQs setting in sub-functions - AMD/Solarflare: - support new RSS context API - AMD/Pensando: - ionic: rework fix for doorbell miss to lower overhead and skip it on new HW - Wangxun: - txgbe: support Flow Director perfect filters - Ethernet NICs consumer, embedded and virtual: - Add driver for Tehuti Networks TN40xx chips - Add driver for Meta's internal NIC chips - Add driver for Ethernet MAC on Airoha EN7581 SoCs - Add driver for Renesas Ethernet-TSN devices - Google cloud vNIC: - flow steering support - Microsoft vNIC: - support page sizes other than 4KB on ARM64 - vmware vNIC: - support latency measurement (update to version 9) - VirtIO net: - support for Byte Queue Limits - support configuring thresholds for automatic IRQ moderation - support for AF_XDP Rx zero-copy - Synopsys (stmmac): - support for STM32MP13 SoC - let platforms select the right PCS implementation - TI: - icssg-prueth: add multicast filtering support - icssg-prueth: enable PTP timestamping and PPS - Renesas: - ravb: improve Rx performance 30-400% by using page pool, theaded NAPI and timer-based IRQ coalescing - ravb: add MII support for R-Car V4M - Cadence (macb): - macb: add ARP support to Wake-On-LAN - Cortina: - use phylib for RX and TX pause configuration - Ethernet switches: - nVidia/Mellanox: - support configuration of multipath hash seed - report more accurate max MTU - use page_pool to improve Rx performance - MediaTek: - mt7530: add support for bridge port isolation - Qualcomm: - qca8k: add support for bridge port isolation - Microchip: - lan9371/2: add 100BaseTX PHY support - NXP: - vsc73xx: implement VLAN operations - Ethernet PHYs: - aquantia: enable support for aqr115c - aquantia: add support for PHY LEDs - realtek: add support for rtl8224 2.5Gbps PHY - xpcs: add memory-mapped device support - add BroadR-Reach link mode and support in Broadcom's PHY driver - CAN: - add document for ISO 15765-2 protocol support - mcp251xfd: workaround for erratum DS80000789E, use timestamps to catch when device returns incorrect FIFO status - WiFi: - mac80211/cfg80211: - parse Transmit Power Envelope (TPE) data in mac80211 instead of in drivers - improvements for 6 GHz regulatory flexibility - multi-link improvements - support multiple radios per wiphy - remove DEAUTH_NEED_MGD_TX_PREP flag - Intel (iwlwifi): - bump FW API to 91 for BZ/SC devices - report 64-bit radiotap timestamp - enable P2P low latency by default - handle Transmit Power Envelope (TPE) advertised by AP - remove support for older FW for new devices - fast resume (keeping the device configured) - mvm: re-enable Multi-Link Operation (MLO) - aggregation (A-MSDU) optimizations - MediaTek (mt76): - mt7925 Multi-Link Operation (MLO) support - Qualcomm (ath10k): - LED support for various chipsets - Qualcomm (ath12k): - remove unsupported Tx monitor handling - support channel 2 in 6 GHz band - support Spatial Multiplexing Power Save (SMPS) in 6 GHz band - supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA) - support dynamic VLAN - add panic handler for resetting the firmware state - DebugFS support for datapath statistics - WCN7850: support for Wake on WLAN - Microchip (wilc1000): - read MAC address during probe to make it visible to user space - suspend/resume improvements - TI (wl18xx): - support newer firmware versions - RealTek (rtw89): - preparation for RTL8852BE-VT support - Wake on WLAN support for WiFi 6 chips - 36-bit PCI DMA support - RealTek (rtlwifi): - RTL8192DU support - Broadcom (brcmfmac): - Management Frame Protection support (to enable WPA3) - Bluetooth: - qualcomm: use the power sequencer for QCA6390 - btusb: mediatek: add ISO data transmission functions - hci_bcm4377: add BCM4388 support - btintel: add support for BlazarU core - btintel: add support for Whale Peak2 - btnxpuart: add support for AW693 A1 chipset - btnxpuart: add support for IW615 chipset - btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591" * tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits) eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering" tcp: Replace strncpy() with strscpy() wifi: ath12k: fix build vs old compiler tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child(). eth: fbnic: Write the TCAM tables used for RSS control and Rx to host eth: fbnic: Add L2 address programming eth: fbnic: Add basic Rx handling eth: fbnic: Add basic Tx handling eth: fbnic: Add link detection eth: fbnic: Add initial messaging to notify FW of our presence eth: fbnic: Implement Rx queue alloc/start/stop/free eth: fbnic: Implement Tx queue alloc/start/stop/free eth: fbnic: Allocate a netdevice and napi vectors with queues eth: fbnic: Add FW communication mechanism eth: fbnic: Add message parsing for FW messages eth: fbnic: Add register init to set PCIe/Ethernet device config eth: fbnic: Allocate core device specific structures and devlink interface eth: fbnic: Add scaffolding for Meta's NIC driver PCI: Add Meta Platforms vendor ID net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK ...
2024-07-15Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng) - MD updates via Song - sync_action fix and refactoring (Yu Kuai) - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li) - Fix loop detach/open race (Gulam) - Fix lower control limit for blk-throttle (Yu) - Add module descriptions to various drivers (Jeff) - Add support for atomic writes for block devices, and statx reporting for same. Includes SCSI and NVMe (John, Prasad, Alan) - Add IO priority information to block trace points (Dongliang) - Various zone improvements and tweaks (Damien) - mq-deadline tag reservation improvements (Bart) - Ignore direct reclaim swap writes in writeback throttling (Baokun) - Block integrity improvements and fixes (Anuj) - Add basic support for rust based block drivers. Has a dummy null_blk variant for now (Andreas) - Series converting driver settings to queue limits, and cleanups and fixes related to that (Christoph) - Cleanup for poking too deeply into the bvec internals, in preparation for DMA mapping API changes (Christoph) - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas, Ming, Zhu, Damien, Christophe, Chaitanya) * tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits) floppy: add missing MODULE_DESCRIPTION() macro loop: add missing MODULE_DESCRIPTION() macro ublk_drv: add missing MODULE_DESCRIPTION() macro xen/blkback: add missing MODULE_DESCRIPTION() macro block/rnbd: Constify struct kobj_type block: take offset into account in blk_bvec_map_sg again block: fix get_max_segment_size() warning loop: Don't bother validating blocksize virtio_blk: Don't bother validating blocksize null_blk: Don't bother validating blocksize block: Validate logical block size in blk_validate_limits() virtio_blk: Fix default logical block size fallback nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id block: pass a phys_addr_t to get_max_segment_size block: add a bvec_phys helper blk-lib: check for kill signal in ioctl BLKZEROOUT block: limit the Write Zeroes to manually writing zeroes fallback block: refacto blkdev_issue_zeroout block: move read-only and supported checks into (__)blkdev_issue_zeroout ...
2024-07-15Merge tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "Here are the io_uring updates queued up for 6.11. Nothing major this time around, various minor improvements and cleanups/fixes. This contains: - Add bind/listen opcodes. Main motivation is to support direct descriptors, to avoid needing a regular fd just for doing these two operations (Gabriel) - Probe fixes (Gabriel) - Treat io-wq work flags as atomics. Not fixing a real issue, but may as well and it silences a KCSAN warning (me) - Cleanup of rsrc __set_current_state() usage (me) - Add 64-bit for {m,f}advise operations (me) - Improve performance of data ring messages (me) - Fix for ring message overflow posting (Pavel) - Fix for freezer interaction with TWA_NOTIFY_SIGNAL. Not strictly an io_uring thing, but since TWA_NOTIFY_SIGNAL was originally added for faster task_work signaling for io_uring, bundling it with this pull (Pavel) - Add Pavel as a co-maintainer - Various cleanups (me, Thorsten)" * tag 'for-6.11/io_uring-20240714' of git://git.kernel.dk/linux: (28 commits) io_uring/net: check socket is valid in io_bind()/io_listen() kernel: rerun task_work while freezing in get_signal() io_uring/io-wq: limit retrying worker initialisation io_uring/napi: Remove unnecessary s64 cast io_uring/net: cleanup io_recv_finish() bundle handling io_uring/msg_ring: fix overflow posting MAINTAINERS: change Pavel Begunkov from io_uring reviewer to maintainer io_uring/msg_ring: use kmem_cache_free() to free request io_uring/msg_ring: check for dead submitter task io_uring/msg_ring: add an alloc cache for io_kiocb entries io_uring/msg_ring: improve handling of target CQE posting io_uring: add io_add_aux_cqe() helper io_uring: add remote task_work execution helper io_uring/msg_ring: tighten requirement for remote posting io_uring: Allocate only necessary memory in io_probe io_uring: Fix probe of disabled operations io_uring: Introduce IORING_OP_LISTEN io_uring: Introduce IORING_OP_BIND net: Split a __sys_listen helper for io_uring net: Split a __sys_bind helper for io_uring ...
2024-07-15Merge tag 'vfs-6.11.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support passing NULL along AT_EMPTY_PATH for statx(). NULL paths with any flag value other than AT_EMPTY_PATH go the usual route and end up with -EFAULT to retain compatibility (Rust is abusing calls of the sort to detect availability of statx) This avoids path lookup code, lockref management, memory allocation and in case of NULL path userspace memory access (which can be quite expensive with SMAP on x86_64) - Don't block i_writecount during exec. Remove the deny_write_access() mechanism for executables - Relax open_by_handle_at() permissions in specific cases where we can prove that the caller had sufficient privileges to open a file - Switch timespec64 fields in struct inode to discrete integers freeing up 4 bytes Fixes: - Fix false positive circular locking warning in hfsplus - Initialize hfs_inode_info after hfs_alloc_inode() in hfs - Avoid accidental overflows in vfs_fallocate() - Don't interrupt fallocate with EINTR in tmpfs to avoid constantly restarting shmem_fallocate() - Add missing quote in comment in fs/readdir Cleanups: - Don't assign and test in an if statement in mqueue. Move the assignment out of the if statement - Reflow the logic in may_create_in_sticky() - Remove the usage of the deprecated ida_simple_xx() API from procfs - Reject FSCONFIG_CMD_CREATE_EXCL requets that depend on the new mount api early - Rename variables in copy_tree() to make it easier to understand - Replace WARN(down_read_trylock, ...) abuse with proper asserts in various places in the VFS - Get rid of user_path_at_empty() and drop the empty argument from getname_flags() - Check for error while copying and no path in one branch in getname_flags() - Avoid redundant smp_mb() for THP handling in do_dentry_open() - Rename parent_ino to d_parent_ino and make it use RCU - Remove unused header include in fs/readdir - Export in_group_capable() helper and switch f2fs and fuse over to it instead of open-coding the logic in both places" * tag 'vfs-6.11.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) ipc: mqueue: remove assignment from IS_ERR argument vfs: rename parent_ino to d_parent_ino and make it use RCU vfs: support statx(..., NULL, AT_EMPTY_PATH, ...) stat: use vfs_empty_path() helper fs: new helper vfs_empty_path() fs: reflow may_create_in_sticky() vfs: remove redundant smp_mb for thp handling in do_dentry_open fuse: Use in_group_or_capable() helper f2fs: Use in_group_or_capable() helper fs: Export in_group_or_capable() vfs: reorder checks in may_create_in_sticky hfs: fix to initialize fields of hfs_inode_info after hfs_alloc_inode() proc: Remove usage of the deprecated ida_simple_xx() API hfsplus: fix to avoid false alarm of circular locking Improve readability of copy_tree vfs: shave a branch in getname_flags vfs: retire user_path_at_empty and drop empty arg from getname_flags vfs: stop using user_path_at_empty in do_readlinkat tmpfs: don't interrupt fallocate with EINTR fs: don't block i_writecount during exec ...
2024-07-13io_uring/net: check socket is valid in io_bind()/io_listen()Tetsuo Handa
We need to check that sock_from_file(req->file) != NULL. Reported-by: syzbot <syzbot+1e811482aa2c70afa9a0@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=1e811482aa2c70afa9a0 Fixes: 7481fd93fa0a ("io_uring: Introduce IORING_OP_BIND") Fixes: ff140cc8628a ("io_uring: Introduce IORING_OP_LISTEN") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/903da529-eaa3-43ef-ae41-d30f376c60cc@I-love.SAKURA.ne.jp [axboe: move assignment of sock to where the NULL check is] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-11io_uring/io-wq: limit retrying worker initialisationPavel Begunkov
If io-wq worker creation fails, we retry it by queueing up a task_work. tasK_work is needed because it should be done from the user process context. The problem is that retries are not limited, and if queueing a task_work is the reason for the failure, we might get into an infinite loop. It doesn't seem to happen now but it would with the following patch executing task_work in the freezer's loop. For now, arbitrarily limit the number of attempts to create a worker. Cc: stable@vger.kernel.org Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") Reported-by: Julian Orth <ju.orth@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8280436925db88448c7c85c6656edee1a43029ea.1720634146.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10io_uring/napi: Remove unnecessary s64 castThorsten Blum
Since the do_div() macro casts the divisor to u32 anyway, remove the unnecessary s64 cast and fix the following Coccinelle/coccicheck warning reported by do_div.cocci: WARNING: do_div() does a 64-by-32 division, please consider using div64_s64 instead Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://lore.kernel.org/r/20240710010520.384009-2-thorsten.blum@toblux.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h 219343755eae ("net: phy: aquantia: add missing include guards") 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx") b501d261a5b3 ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h 048a403648fc ("net/mlx5: IFC updates for changing max EQs") 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c 4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") 3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h 816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") 5a009b42e041 ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-02io_uring/net: cleanup io_recv_finish() bundle handlingJens Axboe
Combine the two cases that check for whether or not this is a bundle, rather than having them as separate checks. This is easier to reduce, and it reduces the text associated with it as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02io_uring/net: don't clear msg_inq before io_recv_buf_select() needs itJens Axboe
For bundle receives to function properly, the previous iteration msg_inq value is needed to make a judgement call on how much data there is to receive. A previous fix ended up clearing it earlier as an error case would potentially errantly set IORING_CQE_F_SOCK_NONEMPTY if the request got failed. Move the assignment to post assigning buffers for the receive, but ensure that it's cleared for the buffer selection error case. With that, buffer selection has the right msg_inq value and can correctly bundle receives as designed. Noticed while testing where it was apparent than more than 1 buffer was never received. After fix was in place, multiple buffers are correctly picked for receive. This provides a 10x speedup for the test case, as the buffer size used was 64b. Fixes: 18414a4a2eab ("io_uring/net: assign kmsg inq/flags before buffer selection") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02io_uring/msg_ring: fix overflow postingPavel Begunkov
The caller of io_cqring_event_overflow() should be holding the completion_lock, which is violated by io_msg_tw_complete. There is only one caller of io_add_aux_cqe(), so just add locking there for now. WARNING: CPU: 0 PID: 5145 at io_uring/io_uring.c:703 io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 RIP: 0010:io_cqring_event_overflow+0x442/0x660 io_uring/io_uring.c:703 <TASK> __io_post_aux_cqe io_uring/io_uring.c:816 [inline] io_add_aux_cqe+0x27c/0x320 io_uring/io_uring.c:837 io_msg_tw_complete+0x9d/0x4d0 io_uring/msg_ring.c:78 io_fallback_req_func+0xce/0x1c0 io_uring/io_uring.c:256 process_one_work kernel/workqueue.c:3224 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3305 worker_thread+0x86d/0xd40 kernel/workqueue.c:3383 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:144 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fixes: f33096a3c99c0 ("io_uring: add io_add_aux_cqe() helper") Reported-by: syzbot+f7f9c893345c5c615d34@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c7350d07fefe8cce32b50f57665edbb6355ea8c1.1719927398.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02io_uring/net: move charging socket out of zc io_uringPavel Begunkov
Currently, io_uring's io_sg_from_iter() duplicates the part of __zerocopy_sg_from_iter() charging pages to the socket. It'd be too easy to miss while changing it in net/, the chunk is not the most straightforward for outside users and full of internal implementation details. io_uring is not a good place to keep it, deduplicate it by moving out of the callback into __zerocopy_sg_from_iter(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-01io_uring/msg_ring: use kmem_cache_free() to free requestJens Axboe
The change adding caching around the request allocated and freed for data messages changed a kmem_cache_free() to a kfree(), which isn't correct as the request came from slab in the first place. Fix that up and use the right freeing function if the cache is already at its limit. Note that the current mixing of kmem_cache_alloc and kfree is fine, but consistent alloc/free functions should be used as it's otherwise somewhat confusing. Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01io_uring/msg_ring: check for dead submitter taskJens Axboe
The change for improving the handling of the target CQE posting inadvertently dropped the NULL check for the submitter task on the target ring, reinstate that. Fixes: 0617bb500bfa ("io_uring/msg_ring: improve handling of target CQE posting") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring: signal SQPOLL task_work with TWA_SIGNAL_NO_IPIJens Axboe
Before SQPOLL was transitioned to managing its own task_work, the core used TWA_SIGNAL_NO_IPI to ensure that task_work was processed. If not, we can't be sure that all task_work is processed at SQPOLL thread exit time. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring/msg_ring: add an alloc cache for io_kiocb entriesJens Axboe
With slab accounting, allocating and freeing memory has considerable overhead. Add a basic alloc cache for the io_kiocb allocations that msg_ring needs to do. Unlike other caches, this one is used by the sender, grabbing it from the remote ring. When the remote ring gets the posted completion, it'll free it locally. Hence it is separately locked, using ctx->msg_lock. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring/msg_ring: improve handling of target CQE postingJens Axboe
Use the exported helper for queueing task_work for message passing, rather than rolling our own. Note that this is only done for strict data messages for now, file descriptor passing messages still rely on the kernel task_work. It could get converted at some point if it's performance critical. This improves peak performance of message passing by about 5x in some basic testing, with 2 threads just sending messages to each other. Before this change, it was capped at around 700K/sec, with the change it's at over 4M/sec. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring: add io_add_aux_cqe() helperJens Axboe
This helper will post a CQE, and can be called from task_work where we now that the ctx is already properly locked and that deferred completions will get flushed later on. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring: add remote task_work execution helperJens Axboe
All our task_work handling is targeted at the state in the io_kiocb itself, which is what it is being used for. However, MSG_RING rolls its own task_work handling, ignoring how that is usually done. In preparation for switching MSG_RING to be able to use the normal task_work handling, add io_req_task_work_add_remote() which allows the caller to pass in the target io_ring_ctx. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24io_uring/msg_ring: tighten requirement for remote postingJens Axboe
Currently this is gated on whether or not the target ring needs a local completion - and if so, whether or not we're running on the right task. The use case for same thread cross posting is probably a lot less relevant than remote posting. And since we're going to improve this situation anyway, just gate it on local posting and ignore what task we're currently running on. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20fs: Initial atomic write supportPrasad Singamsetty
An atomic write is a write issued with torn-write protection, meaning that for a power failure or any other hardware failure, all or none of the data from the write will be stored, but never a mix of old and new data. Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the write is to be issued with torn-write prevention, according to special alignment and length rules. For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for iocb->ki_flags field to indicate the same. A call to statx will give the relevant atomic write info for a file: - atomic_write_unit_min - atomic_write_unit_max - atomic_write_segments_max Both min and max values must be a power-of-2. Applications can avail of atomic write feature by ensuring that the total length of a write is a power-of-2 in size and also sized between atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications must ensure that the write is at a naturally-aligned offset in the file wrt the total write length. The value in atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt. Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the flag set will have RWF_ATOMIC rejected and not just ignored. Add a type argument to kiocb_set_rw_flags() to allows reads which have RWF_ATOMIC set to be rejected. Helper function generic_atomic_write_valid() can be used by FSes to verify compliant writes. There we check for iov_iter type is for ubuf, which implies iovcnt==1 for pwritev2(), which is an initial restriction for atomic_write_segments_max. Initially the only user will be bdev file operations write handler. We will rely on the block BIO submission path to ensure write sizes are compliant for the bdev, so we don't need to check atomic writes sizes yet. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: merge into single patch and much rewrite Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20io_uring/rsrc: fix incorrect assignment of iter->nr_segs in io_import_fixedChenliang Li
In io_import_fixed when advancing the iter within the first bvec, the iter->nr_segs is set to bvec->bv_len. nr_segs should be the number of bvecs, plus we don't need to adjust it here, so just remove it. Fixes: b000ae0ec2d7 ("io_uring/rsrc: optimise single entry advance") Signed-off-by: Chenliang Li <cliang01.li@samsung.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20240619063819.2445-1-cliang01.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>