path: root/io_uring
Age  Commit message  (Author)
2025-05-16  io_uring/fdinfo: only compile if CONFIG_PROC_FS is set  (Jens Axboe)
Rather than wrap fdinfo.c in one big if, handle it on the Makefile side instead. io_uring.c already conditionally sets fops->fdinfo() anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
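For reference, the resulting shape is roughly the following (a sketch, not the literal diff): the Makefile only builds fdinfo.o when CONFIG_PROC_FS is enabled (the exact rule is an assumption here), while io_uring.c keeps the fops hook conditional as it already was:

    /* io_uring/io_uring.c (abridged sketch) */
    static const struct file_operations io_uring_fops = {
            .release        = io_uring_release,
            .mmap           = io_uring_mmap,
    #ifdef CONFIG_PROC_FS
            .show_fdinfo    = io_uring_show_fdinfo,
    #endif
    };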
2025-05-16  Merge branch 'io_uring-6.15' into for-6.16/io_uring  (Jens Axboe)
Merge in 6.15 io_uring fixes, mostly so that the fdinfo changes can get easily extended without causing merge conflicts.

* io_uring-6.15:
  io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
  io_uring/memmap: don't use page_address() on a highmem page
  io_uring/uring_cmd: fix hybrid polling initialization issue
  io_uring/sqpoll: Increase task_work submission batch size
  io_uring: ensure deferred completions are flushed for multishot
  io_uring: always arm linked timeouts prior to issue
  io_uring/fdinfo: annotate racy sq/cq head/tail reads
  io_uring: fix 'sync' handling of io_fallback_tw()
  io_uring: don't duplicate flushing in io_req_post_cqe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.15-rc7).

Conflicts:
  tools/testing/selftests/drivers/net/hw/ncdevmem.c
    97c4e094a4b2 ("tests/ncdevmem: Fix double-free of queue array")
    2f1a805f32ba ("selftests: ncdevmem: Implement devmem TCP TX")
  https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au

Adjacent changes:
  net/core/devmem.c
  net/core/devmem.h
    0afc44d8cdf6 ("net: devmem: fix kernel panic when netlink socket close after module unload")
    bd61848900bf ("net: devmem: Implement TX path")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14  io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()  (Jens Axboe)
Not everything requires locking in there, which is why the 'has_lock' variable exists. But enough does that it's a bit unwieldy to manage. Wrap the whole thing in a ->uring_lock trylock, and just return with no output if we fail to grab it. The existing trylock() will already have greatly diminished utility/output for the failure case. This fixes an issue with reading the SQE fields, if the ring is being actively resized at the same time. Reported-by: Jann Horn <jannh@google.com> Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Signed-off-by: Jens Axboe <axboe@kernel.dk>
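A minimal sketch of the pattern described above (simplified, not the exact upstream code):

    static void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
    {
            struct io_ring_ctx *ctx = file->private_data;

            /* Ring is busy, e.g. mid-resize: emit no fdinfo output at all */
            if (!mutex_trylock(&ctx->uring_lock))
                    return;
            /* ... dump SQ/CQ heads, SQE fields, registered resources ... */
            mutex_unlock(&ctx->uring_lock);
    }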
2025-05-13  io_uring/kbuf: unify legacy buf provision and removal  (Pavel Begunkov)
Combine IORING_OP_PROVIDE_BUFFERS and IORING_OP_REMOVE_BUFFERS ->issue(), so that we can deduplicate ring locking and list lookups. This way we further reduce code for legacy provided buffers. Locking is also separated from buffer related handling, which makes it a bit simpler with label jumps. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f61af131622ad4337c2fb9f7c453d5b0102c7b90.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13  io_uring/kbuf: refactor __io_remove_buffers  (Pavel Begunkov)
__io_remove_buffers() is used for two purposes. The first is removing buffers for non ring based lists, which implies that it can be called multiple times for the same list. The second is destroying lists, which is not perfectly reentrant for ring based lists. It's confusing, so just have a helper for the legacy pbuf buffer removal, make sure it's not called for ring pbuf, and open code all ring pbuf destruction into io_put_bl(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ae416b099d311ad23f285cea02f2c94c8ae9a6c.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13  io_uring/kbuf: don't compute size twice on prep  (Pavel Begunkov)
The size in prep is calculated by io_provide_buffers_prep(), so remove the recomputation a few lines after. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7c97206561b74fce245cb22449c6082d2e066844.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13  io_uring/kbuf: drop extra vars in io_register_pbuf_ring  (Pavel Begunkov)
The bl and free_bl variables in io_register_pbuf_ring() always point to the same list since we started reallocating the pre-existing list. Drop free_bl. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d45c3342d74c9030f99376c777a4b3d59089074d.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13  io_uring/kbuf: use mem_is_zero()  (Pavel Begunkov)
Make use of mem_is_zero() for reserved fields checking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/11fe27b7a831329bcdb4ea087317ef123ba7c171.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
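mem_is_zero() (from <linux/string.h>) replaces open-coded reserved-field checks with something like this (a sketch; the field name is illustrative, not the actual uapi layout):

    /* Reject the registration if any reserved bytes are set */
    if (!mem_is_zero(reg.resv, sizeof(reg.resv)))
            return -EINVAL;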
2025-05-13  io_uring/kbuf: account ring io_buffer_list memory  (Pavel Begunkov)
Follow the non-ringed pbuf struct io_buffer_list allocations and account the ring ones against the memcg as well. There is a low chance of that being an actual problem, as a ring provided buffer should either pin user memory or allocate it, which is already accounted. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3985218b50d341273cafff7234e1a7e6d0db9808.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
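Accounting these allocations against the memcg essentially means allocating with GFP_KERNEL_ACCOUNT; a simplified sketch of what the list allocation ends up looking like:

    struct io_buffer_list *bl;

    /* __GFP_ACCOUNT (part of GFP_KERNEL_ACCOUNT) charges the task's memcg */
    bl = kzalloc(sizeof(*bl), GFP_KERNEL_ACCOUNT);
    if (!bl)
            return -ENOMEM;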
2025-05-13  net: devmem: Implement TX path  (Mina Almasry)
Augment dmabuf binding to be able to handle TX. In addition to all the RX binding, we also create the tx_vec needed for the TX path.

Provide an API for sendmsg to be able to send dmabufs bound to this device:
  - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
  - MSG_ZEROCOPY with the SCM_DEVMEM_DMABUF cmsg indicates a send from a dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem, which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages.

The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf. Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd.

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13  netmem: add niov->type attribute to distinguish different net_iov types  (Mina Almasry)
Later patches in the series add TX net_iovs where there is no pp associated, so we can't rely on niov->pp->mp_ops to tell what the type of the net_iov is. Add a type enum to the net_iov which tells us the net_iov type. Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-12  io_uring/memmap: don't use page_address() on a highmem page  (Jens Axboe)
For older/32-bit systems with highmem, don't assume that the pages in a mapped region are always going to be mapped. If io_region_init_ptr() finds that the pages are coalescable, also check if the first page is a HighMem page or not. If it is, fall through to the usual vmap() mapping rather than attempt to get the unmapped page address. Cc: stable@vger.kernel.org Fixes: c4d0ac1c1567 ("io_uring/memmap: optimise single folio regions") Link: https://lore.kernel.org/all/681fe2fb.050a0220.f2294.001a.GAE@google.com/ Reported-by: syzbot+5b8c4abafcb1d791ccfc@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/681fed0a.050a0220.f2294.001c.GAE@google.com/ Reported-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com Tested-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
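Ignoring the coalescing details, the added check boils down to something like this sketch: only take the direct page_address() shortcut for lowmem pages, and vmap() otherwise:

    void *ptr;

    /* Highmem pages have no permanent kernel mapping, so don't assume one */
    if (!PageHighMem(pages[0]))
            ptr = page_address(pages[0]);
    else
            ptr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
    if (!ptr)
            return -ENOMEM;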
2025-05-12  io_uring: drain based on allocated reqs  (Pavel Begunkov)
Don't rely on CQ sequence numbers for draining, as it has become messy and needs cq_extra adjustments. Instead, base it on the number of allocated requests and only allow flushing when all requests are in the drain list. As a result, cq_extra is gone, no overhead for its accounting in aux cqe posting, less bloating as it was inlined before, and it's in general simpler than trying to track where we should bump it and where it should be put back like in cases of overflow. Also, it'll likely help with cleaning and unifying some of the CQ posting helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com [axboe: fold in fix from link2] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12  io_uring/uring_cmd: fix hybrid polling initialization issue  (hexue)
Modify the check for whether the timer is initialized during IO transfer when passthrough is used with hybrid polling, to ensure that it's always setup correctly. Cc: stable@vger.kernel.org Fixes: 01ee194d1aba ("io_uring: add support for hybrid IOPOLL") Signed-off-by: hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20250512052025.293031-1-xue01.he@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: count allocated requests  (Pavel Begunkov)
Keep track of the number of requests a ring currently has allocated (and not freed); it'll be needed in the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c8f8308294dc2a1cb8925d984d937d4fc14ab5d4.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: open code io_account_cq_overflow()  (Pavel Begunkov)
io_account_cq_overflow() doesn't help explain what's going on in there, and it'll become even smaller with the following patches, so open code it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: consolidate drain seq checking  (Pavel Begunkov)
We check sequences when queuing drained requests as well as when flushing them. Instead, always queue and immediately try to flush, so that all seq handling can be kept contained in the flushing code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: remove drain prealloc checks  (Pavel Begunkov)
Currently io_drain_req() has two steps. The first is fast path checking sequence numbers. The second is allocations, rechecking and actual queuing. Further simplify it by removing the first step. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: simplify drain ret passing  (Pavel Begunkov)
"ret" in io_drain_req() is only used in one place, remove it and pass -ENOMEM directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: fix spurious drain flushing  (Pavel Begunkov)
io_queue_deferred() is not tolerant of spurious calls, i.e. calls that don't correspond to some request completing. You can have an inflight drain-marked request and another request that came after and got queued into the drain list. Now, if io_queue_deferred() is called before the first request completes, it'll check the 2nd req with req_need_defer(), find that there is no drain flag set, and queue it for execution. To make io_queue_deferred() work, it should at least check sequences for the first request, and we also need to check if there is another drain request creating another bubble. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: account drain memory to cgroup  (Pavel Begunkov)
Account drain allocations against memcg. It's not a big problem as each such allocation is paired with a request, which is accounted, but it's nicer to follow the limits more closely. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: add lockdep asserts to io_add_aux_cqe  (Pavel Begunkov)
io_add_aux_cqe() can only be called for rings with uring_lock protected completion queues, add a couple of assertions in regards to that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
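The assertions are presumably along these lines (a sketch):

    /* io_add_aux_cqe() may only run with the CQ protected by ->uring_lock */
    lockdep_assert_held(&ctx->uring_lock);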
2025-05-09  io_uring/net: move CONFIG_NET guards to Makefile  (Pavel Begunkov)
Instruct Makefile to never try to compile net.c without CONFIG_NET and kill ifdefs in the file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f466400e20c3f536191bfd559b1f3cd2a2ab5a1e.1746788579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring: update parameter name in io_pin_pages function declaration  (Long Li)
Rename first parameter in io_pin_pages from ubuf to uaddr for consistency between declaration and implementation. Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://lore.kernel.org/r/20250509063015.3799255-1-leo.lilong@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09  io_uring/sqpoll: Increase task_work submission batch size  (Gabriel Krisman Bertazi)
Our QA team reported a 10%-23% throughput reduction on an io_uring sqpoll testcase doing IO to a null_blk, which I traced back to a reduction of the device submission queue depth utilization. It turns out that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work privately"), we capped the number of task_work entries that can be completed from a single spin of sqpoll to only 8 entries, before the sqpoll goes around to (potentially) sleep.

While this cap doesn't drive the submission side directly, it impacts the completion behavior, which affects the number of IOs queued by fio per sqpoll cycle on the submission side, and io_uring ends up seeing fewer IOs per sqpoll cycle. As a result, block layer plugging is less effective, and we see more time spent inside the block layer in profiling charts, and increased submission latency measured by fio.

There are other places that have increased overhead once sqpoll sleeps more often, such as the sqpoll utilization calculation. But, in this microbenchmark, those were not representative enough in perf charts, and their removal didn't yield measurable changes in throughput. The major overhead comes from the fact we plug less, and less often, when submitting to the block layer.

My benchmark is:

  fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

In one machine, tested on top of Linux 6.15-rc1, we have the following baseline:

  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:

  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth. The average submission latency is noticeably lowered too, as measured by fio:

  Baseline: lat (usec): min=20, max=241, avg=99.81, stdev=3.38
  Patched:  lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is improved. In the baseline, we end up limited to plugging 8 requests in the block layer regardless of the device queue depth size, while after patching we can drive more IO, and we manage to utilize the full device queue.

In the baseline, after a stabilization phase, an ordinary submission looks like:

  254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug:

  254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32

Ideally, the cap size would at least be deep enough to fill the device queue, but we can't predict that behavior, or assume all IO goes to a single device, and thus can't guess the ideal batch size. We also don't want to let the tw run unbounded, though I'm not sure it would really be a problem. Instead, let's just give it a more sensible value that will allow for more efficient batching. I've tested with different cap values, and initially proposed to increase the cap to 1024. Jens argued it is too big of a bump, and I observed that, with 32, I'm no longer able to observe this bottleneck in any of my machines.

Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  io_uring: ensure deferred completions are flushed for multishot  (Jens Axboe)
Multishot normally uses io_req_post_cqe() to post completions, but when stopping it, it may finish up with a deferred completion. This is fine, except if another multishot event triggers before the deferred completions get flushed. If this occurs, then CQEs may get reordered in the CQ ring, as new multishot completions get posted before the deferred ones are flushed. This can cause confusion on the application side, if strict ordering is required for the use case. When multishot posting via io_req_post_cqe(), flush any pending deferred completions first, if any. Cc: stable@vger.kernel.org # 6.1+ Reported-by: Norman Maurer <norman_maurer@apple.com> Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
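Conceptually the fix looks like the following sketch inside io_req_post_cqe(), using names from the surrounding io_uring code (not the literal patch):

    /* Flush any batched/deferred completions first so the new multishot
     * CQE cannot be ordered ahead of them in the CQ ring. */
    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
            __io_submit_flush_completions(ctx);
    /* ... then post the new CQE as before ... */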
2025-05-06  io_uring: move io_req_put_rsrc_nodes()  (Pavel Begunkov)
It'd be nice to hide the details of how rsrc nodes are used by a request from rsrc.c, specifically which request fields store them and what bits signify whether there is a node in a request. It rather belongs to generic request handling, so move the helper to io_uring.c. While doing so, remove the clearing of ->buf_node as it's controlled by REQ_F_BUF_NODE and doesn't require zeroing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb73fb42baf825edb39344365aff48cdfdd4c692.1746533789.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06  io_uring: remove io_preinit_req()  (Pavel Begunkov)
Apart from setting ->ctx, io_preinit_req() zeroes a bunch of fields of a request, of which only ->file_node is mandatory. Remove the function and zero the entire request on first allocation. With that, we also need to initialise ->ctx every time, which might be a good thing for performance as now we're likely overwriting the entire cache line, and so it can be write combined and avoid RMW. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ba5485dc913f1e275862ce88f5169d4ac4a33836.1746533807.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06  io_uring/timeout: don't export link t-out disarm helper  (Pavel Begunkov)
[__]io_disarm_linked_timeout() are only used inside timeout.c, so confine them to the file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1eb200911255e643bf252a8e65fb2c787340cf18.1746533800.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06  io_uring/zcrx: dmabuf backed zerocopy receive  (Pavel Begunkov)
Add support for dmabuf backed zcrx areas. To use it, the user should pass IORING_ZCRX_AREA_DMABUF in the struct io_uring_zcrx_area_reg flags field and pass a dmabuf fd in the dmabuf_fd field. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20bb1890e60a82ec945ab36370d1fd54be414ab6.1746097431.git.asml.silence@gmail.com Link: https://lore.kernel.org/io-uring/6e37db97303212bbd8955f9501cf99b579f8aece.1746547722.git.asml.silence@gmail.com [axboe: fold in fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk>
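From the userspace side, registering such an area presumably looks roughly like this (the flags and dmabuf_fd fields come from the description above; other fields and the surrounding ifq registration are omitted):

    struct io_uring_zcrx_area_reg area = {
            .flags     = IORING_ZCRX_AREA_DMABUF,   /* dmabuf backed area */
            .dmabuf_fd = dmabuf_fd,                 /* fd of the dma-buf */
            /* remaining fields set up as for a normal area registration */
    };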
2025-05-06  io_uring: enable per-io write streams  (Keith Busch)
Allow userspace to pass a per-I/O write stream in the SQE: __u8 write_stream; The __u8 type matches the size the filesystems and block layer support. Applications can query the supported values from the block device's max_write_streams sysfs attribute. Unsupported values are ignored by file operations that do not support write streams or rejected with an error by those that support them. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
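In userspace terms the new field is simply set on the SQE before submission; a hypothetical liburing-based sketch (it assumes a liburing build whose io_uring_sqe definition already carries write_stream, and that stream id 1 was learned from the device's max_write_streams attribute):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_write(sqe, fd, buf, len, offset);
    sqe->write_stream = 1;      /* per-I/O write stream hint; 0 means none */
    io_uring_submit(&ring);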
2025-05-04  io_uring: always arm linked timeouts prior to issue  (Jens Axboe)
There are a few spots where linked timeouts are armed, and not all of them adhere to the pre-arm, attempt issue, post-arm pattern. This can be problematic if the linked request returns that it will trigger a callback later, and does so before the linked timeout is fully armed. Consolidate all the linked timeout handling into __io_issue_sqe(), rather than have it spread throughout the various issue entry points. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/1390 Reported-by: Chase Hiltz <chase@path.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-03  futex: Move futex_queue() into futex_wait_setup()  (Peter Zijlstra)
futex_wait_setup() has a weird calling convention in order to return hb to use as an argument to futex_queue(). Mostly such that requeue can have an extra test in between. Reorder code a little to get rid of this and keep the hb usage inside futex_wait_setup(). [bigeasy: fixes] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
2025-05-02  io_uring/zcrx: split common area map/unmap parts  (Pavel Begunkov)
Extract the area type dependent parts of io_zcrx_[un]map_area from the generic path. It'll be helpful once there are more area memory types and not only user pages. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/50f6e893e2d20f937e628196cbf528d15f81c289.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02  io_uring/zcrx: split out memory holders from area  (Pavel Begunkov)
In the data path, users of struct io_zcrx_area don't need to know what kind of memory it's backed by. Only keep the generic bits in there and split out the memory type dependent fields into a new structure. It also logically separates the step that actually imports the memory, e.g. pinning user pages, from the generic area initialisation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b60fc09c76921bf69e77eb17e07eb4decedb3bf4.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02  io_uring/zcrx: resolve netdev before area creation  (Pavel Begunkov)
Some area types will require a valid struct device to be created, so resolve netdev and struct device before creating an area. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ac8c1482be22acfe9ca788d2c3ce31b7451ce488.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02  io_uring/zcrx: improve area validation  (Pavel Begunkov)
dmabuf backed area will be taking an offset instead of addresses, and io_buffer_validate() is not flexible enough to facilitate it. It also takes an iovec, which may truncate the u64 length zcrx takes. Add a new helper function for validation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0b3b735391a0a8f8971bf0121c19765131fddd3b.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-30  io_uring/fdinfo: annotate racy sq/cq head/tail reads  (Jens Axboe)
syzbot complains about the cached sq head read, and it's totally right. But we don't need to care, it's just reading fdinfo, and reading the CQ or SQ tail/head entries are known racy in that they are just a view into that very instant and may of course be outdated by the time they are reported. Annotate both the SQ head and CQ tail read with data_race() to avoid this syzbot complaint. Link: https://lore.kernel.org/io-uring/6811f6dc.050a0220.39e3a1.0d0e.GAE@google.com/ Reported-by: syzbot+3e77fd302e99f5af9394@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
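The annotation itself just wraps the racy loads, roughly like this sketch:

    /* fdinfo reports an instantaneous, possibly stale snapshot; tell KCSAN
     * that racing with the ring here is intentional. */
    unsigned int sq_head = data_race(ctx->cached_sq_head);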
2025-04-28  io_uring/cmd: move net cmd into a separate file  (Pavel Begunkov)
We keep socket io_uring command implementation in io_uring/uring_cmd.c. Separate it from generic command code into a separate file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/747d0519a2255bd055ae76b691d38d2b4c311001.1745843119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28  io_uring: delete misleading comment in io_fill_cqe_aux()  (Pavel Begunkov)
io_fill_cqe_aux() doesn't overflow completions, however it might fail them and lets the caller handle it. Remove the comment, which doesn't make any sense. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/021aa8c1d8f20ef2b66da6aeabb6b511938fd2c5.1745843119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24  io_uring: fix 'sync' handling of io_fallback_tw()  (Jens Axboe)
A previous commit added a 'sync' parameter to io_fallback_tw(), which if true, means the caller wants to wait on the fallback thread handling it. But the logic is somewhat messed up, ensure that ctxs are swapped and flushed appropriately. Cc: stable@vger.kernel.org Fixes: dfbe5561ae93 ("io_uring: flush offloaded and delayed task_work on exit") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24  io_uring/eventfd: open code io_eventfd_grab()  (Pavel Begunkov)
io_eventfd_grab() doesn't help with understanding the path; it'll be simpler to keep the helper open coded. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5cb53ce3876c2819db9e8055cf41dca4398521db.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24  io_uring/eventfd: clean up rcu locking  (Pavel Begunkov)
Conditional locking is never welcome if there are better options. Move rcu locking into io_eventfd_signal(), make it unconditional and use guards. It also helps with sparse warnings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/91a925e708ca8a5aa7fee61f96d29b24ea9adeaf.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
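The guards in question are the scope-based helpers from <linux/cleanup.h>; an unconditional RCU read-side section then looks roughly like this (a sketch, the real function signature may differ):

    static void io_eventfd_signal(struct io_ring_ctx *ctx)
    {
            guard(rcu)();   /* rcu_read_lock() held until end of scope */
            /* ... look up ctx->io_ev_fd under RCU and signal it ... */
    }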
2025-04-24  io_uring/eventfd: dedup signalling helpers  (Pavel Begunkov)
Consolidate io_eventfd_flush_signal() and io_eventfd_signal(). Not much of a difference for now, but it prepares it for following changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5beecd4da65d8d2d83df499196f84b329387f6a2.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24  io_uring: don't duplicate flushing in io_req_post_cqe  (Pavel Begunkov)
io_req_post_cqe() sets submit_state.cq_flush so that *flush_completions() can take care of batch committing CQEs. Don't commit it twice by using __io_cq_unlock_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23  io_uring/zcrx: add support for multiple ifqs  (Pavel Begunkov)
Allow the user to register multiple ifqs / zcrx contexts. With that we can use multiple interfaces / interface queues in a single io_uring instance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/668b03bee03b5216564482edcfefbc2ee337dd30.1745141261.git.asml.silence@gmail.com [axboe: fold in fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21  io_uring/zcrx: move zcrx region to struct io_zcrx_ifq  (Pavel Begunkov)
Refill queue region is a part of zcrx and should stay in struct io_zcrx_ifq. We can't have multiple queues without it, so move it there. As a result there is no context global zcrx region anymore, and the region is looked up together with its ifq. To protect a concurrent mmap from seeing an inconsistent region, we were protecting changes to ->zcrx_region with mmap_lock, but now it protects the publishing of the ifq. Reviewed-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/24f1a728fc03d0166f16d099575457e10d9d90f2.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21  io_uring/zcrx: let zcrx choose region for mmaping  (Pavel Begunkov)
In preparation for adding multiple ifqs, add a helper returning a region for mmaping zcrx refill queue. For now it's trivial and returns the same ctx global ->zcrx_region. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d9006e2ef8cd5e5b337c2ba2228ede8409eb60f2.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21  io_uring/zcrx: remove sqe->file_index check  (Pavel Begunkov)
sqe->file_index and sqe->zcrx_ifq_idx are aliased, so even though io_recvzc_prep() has all the necessary handling and checking for ->zcrx_ifq_idx, it's already rejected a couple of lines above. It's not a real problem, however, as we can only have one ifq at the moment, whose index is always zero, and it returns correct error codes to the user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b51acedd326d1a87bf53e33303e6fc87661a3e3f.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>