summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2025-05-06io_uring/zcrx: dmabuf backed zerocopy receivePavel Begunkov
Add support for dmabuf backed zcrx areas. To use it, the user should pass IORING_ZCRX_AREA_DMABUF in the struct io_uring_zcrx_area_reg flags field and pass a dmabuf fd in the dmabuf_fd field. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20bb1890e60a82ec945ab36370d1fd54be414ab6.1746097431.git.asml.silence@gmail.com Link: https://lore.kernel.org/io-uring/6e37db97303212bbd8955f9501cf99b579f8aece.1746547722.git.asml.silence@gmail.com [axboe: fold in fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06io_uring: enable per-io write streamsKeith Busch
Allow userspace to pass a per-I/O write stream in the SQE: __u8 write_stream; The __u8 type matches the size the filesystems and block layer support. Application can query the supported values from the block devices max_write_streams sysfs attribute. Unsupported values are ignored by file operations that do not support write streams or rejected with an error by those that support them. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-04io_uring: always arm linked timeouts prior to issueJens Axboe
There are a few spots where linked timeouts are armed, and not all of them adhere to the pre-arm, attempt issue, post-arm pattern. This can be problematic if the linked request returns that it will trigger a callback later, and does so before the linked timeout is fully armed. Consolidate all the linked timeout handling into __io_issue_sqe(), rather than have it spread throughout the various issue entry points. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/1390 Reported-by: Chase Hiltz <chase@path.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-03futex: Move futex_queue() into futex_wait_setup()Peter Zijlstra
futex_wait_setup() has a weird calling convention in order to return hb to use as an argument to futex_queue(). Mostly such that requeue can have an extra test in between. Reorder code a little to get rid of this and keep the hb usage inside futex_wait_setup(). [bigeasy: fixes] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
2025-05-02io_uring/zcrx: split common area map/unmap partsPavel Begunkov
Extract area type depedent parts of io_zcrx_[un]map_area from the generic path. It'll be helpful once there are more area memory types and not only user pages. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/50f6e893e2d20f937e628196cbf528d15f81c289.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02io_uring/zcrx: split out memory holders from areaPavel Begunkov
In the data path users of struct io_zcrx_area don't need to know what kind of memory it's backed by. Only keep there generic bits in there and and split out memory type dependent fields into a new structure. It also logically separates the step that actually imports the memory, e.g. pinning user pages, from the generic area initialisation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b60fc09c76921bf69e77eb17e07eb4decedb3bf4.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02io_uring/zcrx: resolve netdev before area creationPavel Begunkov
Some area types will require a valid struct device to be created, so resolve netdev and struct device before creating an area. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ac8c1482be22acfe9ca788d2c3ce31b7451ce488.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02io_uring/zcrx: improve area validationPavel Begunkov
dmabuf backed area will be taking an offset instead of addresses, and io_buffer_validate() is not flexible enough to facilitate it. It also takes an iovec, which may truncate the u64 length zcrx takes. Add a new helper function for validation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0b3b735391a0a8f8971bf0121c19765131fddd3b.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-30io_uring/fdinfo: annotate racy sq/cq head/tail readsJens Axboe
syzbot complains about the cached sq head read, and it's totally right. But we don't need to care, it's just reading fdinfo, and reading the CQ or SQ tail/head entries are known racy in that they are just a view into that very instant and may of course be outdated by the time they are reported. Annotate both the SQ head and CQ tail read with data_race() to avoid this syzbot complaint. Link: https://lore.kernel.org/io-uring/6811f6dc.050a0220.39e3a1.0d0e.GAE@google.com/ Reported-by: syzbot+3e77fd302e99f5af9394@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28io_uring/cmd: move net cmd into a separate filePavel Begunkov
We keep socket io_uring command implementation in io_uring/uring_cmd.c. Separate it from generic command code into a separate file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/747d0519a2255bd055ae76b691d38d2b4c311001.1745843119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28io_uring: delete misleading comment in io_fill_cqe_aux()Pavel Begunkov
io_fill_cqe_aux() doesn't overflow completions, however it might fail them and lets the caller handle it. Remove the comment, which doesn't make any sense. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/021aa8c1d8f20ef2b66da6aeabb6b511938fd2c5.1745843119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24io_uring: fix 'sync' handling of io_fallback_tw()Jens Axboe
A previous commit added a 'sync' parameter to io_fallback_tw(), which if true, means the caller wants to wait on the fallback thread handling it. But the logic is somewhat messed up, ensure that ctxs are swapped and flushed appropriately. Cc: stable@vger.kernel.org Fixes: dfbe5561ae93 ("io_uring: flush offloaded and delayed task_work on exit") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24io_uring/eventfd: open code io_eventfd_grab()Pavel Begunkov
io_eventfd_grab() doesn't help wit understanding the path, it'll be simpler to keep the helper open coded. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5cb53ce3876c2819db9e8055cf41dca4398521db.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24io_uring/eventfd: clean up rcu lockingPavel Begunkov
Conditional locking is never welcome if there are better options. Move rcu locking into io_eventfd_signal(), make it unconditional and use guards. It also helps with sparse warnings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/91a925e708ca8a5aa7fee61f96d29b24ea9adeaf.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24io_uring/eventfd: dedup signalling helpersPavel Begunkov
Consolidate io_eventfd_flush_signal() and io_eventfd_signal(). Not much of a difference for now, but it prepares it for following changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5beecd4da65d8d2d83df499196f84b329387f6a2.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24io_uring: don't duplicate flushing in io_req_post_cqePavel Begunkov
io_req_post_cqe() sets submit_state.cq_flush so that *flush_completions() can take care of batch commiting CQEs. Don't commit it twice by using __io_cq_unlock_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23io_uring/zcrx: add support for multiple ifqsPavel Begunkov
Allow the user to register multiple ifqs / zcrx contexts. With that we can use multiple interfaces / interface queues in a single io_uring instance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/668b03bee03b5216564482edcfefbc2ee337dd30.1745141261.git.asml.silence@gmail.com [axboe: fold in fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/zcrx: move zcrx region to struct io_zcrx_ifqPavel Begunkov
Refill queue region is a part of zcrx and should stay in struct io_zcrx_ifq. We can't have multiple queues without it, so move it there. As a result there is no context global zcrx region anymore, and the region is looked up together with its ifq. To protect a concurrent mmap from seeing an inconsistent region we were protecting changes to ->zcrx_region with mmap_lock, but now it protect the publishing of the ifq. Reviewed-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/24f1a728fc03d0166f16d099575457e10d9d90f2.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/zcrx: let zcrx choose region for mmapingPavel Begunkov
In preparation for adding multiple ifqs, add a helper returning a region for mmaping zcrx refill queue. For now it's trivial and returns the same ctx global ->zcrx_region. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d9006e2ef8cd5e5b337c2ba2228ede8409eb60f2.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/zcrx: remove sqe->file_index checkPavel Begunkov
sqe->file_index and sqe->zcrx_ifq_idx are aliased, so even though io_recvzc_prep() has all necessary handling and checking for ->zcrx_ifq_idx, it's already rejected a couple of lines above. It's not a real problem, however, as we can only have one ifq at the moment, which index is always zero, and it returns correct error codes to the user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b51acedd326d1a87bf53e33303e6fc87661a3e3f.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/zcrx: move io_zcrx_iov_pagePavel Begunkov
We'll need io_zcrx_iov_page at the top to keep offset calculations closer together, move it there. Reviewed-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/575617033a8b84a5985c7eb760b7121efdbe7e56.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/zcrx: remove duplicated freelist initPavel Begunkov
Several lines below we already initialise the freelist, don't do it twice. Reviewed-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/71b07c4e00d1db65899d1ed8d603d961fe7d1c28.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/rsrc: remove null check on importPavel Begunkov
WARN_ON_ONCE() checking imu for NULL in io_import_fixed() is in the hot path and too protective, it's time to get rid of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3782c01e311dbd474bca45aefaf79a2f2822fafb.1745083025.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/rsrc: clean up io_coalesce_buffer()Pavel Begunkov
We don't need special handling for the first page in io_coalesce_buffer(), move it inside the loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/ad698cddc1eadb3d92a7515e95bb13f79420323d.1745083025.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/rsrc: use unpin_user_folioPavel Begunkov
We want to have a full folio to be left pinned but with only one reference, for that we "unpin" all but the first page with unpin_user_pages(), which can be confusing. There is a new helper to achieve that called unpin_user_folio(), so use that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/e0b2be8f9ea68f6b351ec3bb046f04f437f68491.1745083025.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/rsrc: remove node assignment helpersJens Axboe
There are two helpers here, one assigns and increments the node ref count, and the other is simply a wrapper around that for the buffer node handling. The buffer node assignment benefits from checking and setting REQ_F_BUF_NODE together, otherwise stalls have been observed on setting that flag later in the process. Hence re-do it so that it's set when checked, and cleared in case of (unlikely) failure. With that, the buffer node helper can go, and then drop the generic io_req_assign_rsrc_node() helper as well as there's only a single user of it left at that point. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring: add support for IORING_OP_PIPEJens Axboe
This works just like pipe2(2), except it also supports fixed file descriptors. Used in a similar fashion as for other fd instantiating opcodes (like accept, socket, open, etc), where sqe->file_slot is set appropriately if two direct descriptors are desired rather than a set of normal file descriptors. sqe->addr must be set to a pointer to an array of 2 integers, which is where the fixed/normal file descriptors are copied to. sqe->pipe_flags contains flags, same as what is allowed for pipe2(2). Future expansion of per-op private flags can go in sqe->ioprio, like we do for other opcodes that take both a "syscall" flag set and an io_uring opcode specific flag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring: don't store bgid in req->buf_indexPavel Begunkov
Pass buffer group id into the rest of helpers via struct buf_sel_arg and remove all reassignments of req->buf_index back to bgid. Now, it only stores buffer indexes, and the group is provided by callers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3ea9fa08113ecb4d9224b943e7806e80a324bdf9.1743437358.git.asml.silence@gmail.com Link: https://lore.kernel.org/io-uring/0c01d76ff12986c2f48614db8610caff8f78c869.1743500909.git.asml.silence@gmail.com/ [axboe: fold in patch from second link] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/kbuf: pass bgid to io_buffer_select()Pavel Begunkov
The current situation with buffer group id juggling is not ideal. req->buf_index first stores the bgid, then it's overwritten by a buffer id, and then it can get restored back no recycling / etc. It's not so easy to control, and it's not handled consistently across request types with receive requests saving and restoring the bgid it by hand. It's a prep patch that adds a buffer group id argument to io_buffer_select(). The caller will be responsible for stashing a copy somewhere and passing it into the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a210d6427cc3f4f42271a6853274cd5a50e56820.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring: set IMPORT_BUFFER in generic send setupPavel Begunkov
Move REQ_F_IMPORT_BUFFER to the common send setup. Currently, the only user is send zc, but we'll want for normal sends to support that in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/18b74edfc61d8a1b6c9fed3b78a9276fe80f8ced.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/net: don't use io_do_buffer_select at prepPavel Begunkov
Prep code is interested whether it's a selected buffer request, not whether a buffer has already been selected like what io_do_buffer_select() returns. Check for REQ_F_BUFFER_SELECT directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4488a029ac698554bebf732263fe19d7734affa6.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21io_uring/wq: avoid indirect do_work/free_work callsCaleb Sander Mateos
struct io_wq stores do_work and free_work function pointers which are called on each work item. But these function pointers are always set to io_wq_submit_work and io_wq_free_work, respectively. So remove these function pointers and just call the functions directly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250329161527.3281314-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-18io_uring/zcrx: fix late dma unmap for a dead devPavel Begunkov
There is a problem with page pools not dma-unmapping immediately when the device is going down, and delaying it until the page pool is destroyed, which is not allowed (see links). That just got fixed for normal page pools, and we need to address memory providers as well. Unmap pages in the memory provider uninstall callback, and protect it with a new lock. There is also a gap between when a dma mapping is created and the mp is installed, so if the device is killed in between, io_uring would be holding on to dma mappings to a dead device with no one to call ->uninstall. Move it to page pool init and rely on ->is_mapped to make sure it's only done once. Link: https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/ Link: https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/ Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ef9b7db249b14f6e0b570a1bb77ff177389f881c.1744965853.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17io_uring/rsrc: ensure segments counts are correct on kbuf buffersJens Axboe
kbuf imports have the front offset adjusted and segments removed, but the tail segments are still included in the segment count that gets passed in the iov_iter. As the segments aren't necessarily all the same size, move importing to a separate helper and iterate the mapped length to get an exact count. Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17io_uring/rsrc: send exact nr_segs for fixed bufferNitesh Shetty
Sending exact nr_segs, avoids bio split check and processing in block layer, which takes around 5%[1] of overall CPU utilization. In our setup, we see overall improvement of IOPS from 7.15M to 7.65M [2] and 5% less CPU utilization. [1] 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split [2] sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1 Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> [Pavel: fixed for kbuf, rebased and reworked on top of cleanups] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7a1a49a8d053bd617c244291d63dbfbc07afde36.1744882081.git.asml.silence@gmail.com [axboe: fold in fix factoring in buf reg offset] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17io_uring/rsrc: refactor io_import_fixedPavel Begunkov
io_import_fixed is a mess. Even though we know the final len of the iterator, we still assign offset + len and do some magic after to correct for that. Do offset calculation first and finalise it with iov_iter_bvec at the end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2d5107fed24f8b23245ef2ede9a5a7f7c426df61.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17io_uring/rsrc: separate kbuf offset adjustmentsPavel Begunkov
Kernel registered buffers are special because segments are not uniform in size, and we have a bunch of optimisations based on that uniformity for normal buffers. Handle kbuf separately, it'll be cleaner this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4e9e5990b0ab5aee723c0be5cd9b5bcf810375f9.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17io_uring/rsrc: don't skip offset calculationPavel Begunkov
Don't optimise for requests with offset=0. Large registered buffers are the preference and hence the user is likely to pass an offset, and the adjustments are not expensive and will be made even cheaper in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1c2beb20470ee3c886a363d4d8340d3790db19f3.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15io_uring/zcrx: add pp to ifq conversion helperPavel Begunkov
It'll likely change how page pools store memory providers, so in preparation for that, keep accesses in one place in io_uring by introducing a helper. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3522eb8fa9b4e21bcf32e7e9ae656c616b282210.1744722526.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15io_uring/zcrx: return ifq id to the userPavel Begunkov
IORING_OP_RECV_ZC requests take a zcrx object id via sqe::zcrx_ifq_idx, which binds it to the corresponding if / queue. However, we don't return that id back to the user. It's fine as currently there can be only one zcrx and the user assumes that its id should be 0, but as we'll need multiple zcrx objects in the future let's explicitly pass it back on registration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8714667d370651962f7d1a169032e5f02682a73e.1744722517.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07io_uring/kbuf: reject zero sized provided buffersJens Axboe
This isn't fixing a real issue, but there's also zero point in going through group and buffer setup, when the buffers are going to be rejected once attempted to get used. Cc: stable@vger.kernel.org Reported-by: syzbot+58928048fd1416f1457c@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07io_uring/zcrx: separate niov number from pagesPavel Begunkov
A preparation patch that separates the number of pages / folios from the number of niovs. They will not match in the future to support huge pages, improved dma mapping and/or larger chunk sizes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0780ac966ee84200385737f45bb0f2ada052392b.1743848231.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07io_uring/zcrx: put refill data into separate cache linePavel Begunkov
Refill queue lock and other bits are only used from the allocation path on the rx softirq side, but it shares the cache line with other fields like ctx that are used also in the "syscall" path, which causes cache bouncing when softirq runs on a different CPU. Separate them into different cache lines. The first one now contains constant fields used by both contextx, followed by a line responsible for refill queue data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6d1f598e27d623c07fc49d6baee13089a9b1216c.1743848241.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-04io_uring: don't post tag CQEs on file/buffer registration failurePavel Begunkov
Buffer / file table registration is all or nothing, if it fails all resources we might have partially registered are dropped and the table is killed. If that happens, it doesn't make sense to post any rsrc tag CQEs. That would be confusing to the application, which should not need to handle that case. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 7029acd8a9503 ("io_uring/rsrc: get rid of per-ring io_rsrc_node list") Link: https://lore.kernel.org/r/c514446a8dcb0197cddd5d4ba8f6511da081cf1f.1743777957.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-03io_uring: always do atomic put from iowqPavel Begunkov
io_uring always switches requests to atomic refcounting for iowq execution before there is any parallilism by setting REQ_F_REFCOUNT, and the flag is not cleared until the request completes. That should be fine as long as the compiler doesn't make up a non existing value for the flags, however KCSAN still complains when the request owner changes oter flag bits: BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work ... read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0: req_ref_put_and_test io_uring/refs.h:22 [inline] Skip REQ_F_REFCOUNT checks for iowq, we know it's set. Reported-by: syzbot+903a2ad71fb3f1e47cf5@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d880bc27fb8c3209b54641be4ff6ac02b0e5789a.1743679736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-02io_uring: support vectored kernel fixed bufferMing Lei
io_uring has supported fixed kernel buffer via io_buffer_register_bvec() and io_buffer_unregister_bvec(). The vectored fixed buffer has been ready, so it is natural to support fixed kernel buffer, one use case is ublk. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250325135155.935398-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-02io_uring: add validate_fixed_range() for validate fixed bufferMing Lei
Add helper of validate_fixed_range() for validating fixed buffer range. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250325135155.935398-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0David Wei
When readlen is set for a recvzc request, tcp_read_sock() will call io_zcrx_recv_skb() one final time with len == desc->count == 0. This is caused by the !desc->count check happening too late. The offset + 1 != skb->len happens earlier and causes the while loop to continue. Fix this in io_zcrx_recv_skb() instead of tcp_read_sock(). Return early if len is 0 i.e. the read is done. Fixes: 6699ec9a23f8 ("io_uring/zcrx: add a read limit to recvzc requests") Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20250401195355.1613813-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31io_uring/net: avoid import_ubuf for regvec sendPavel Begunkov
With registered buffers we set up iterators in helpers like io_import_fixed(), and there is no need for a import_ubuf() before that. It was fine as we used real pointers for offset calculation, but that's not the case anymore since introduction of ublk kernel buffers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b2de1a50844f848f62c8de609b494971033a6b9.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31io_uring/rsrc: check size when importing reg bufferPavel Begunkov
We're relying on callers to verify the IO size, do it inside of io_import_fixed() instead. It's safer, easier to deal with, and more consistent as now it's done close to the iter init site. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f9c2c75ec4d356a0c61289073f68d98e8a9db190.1743446271.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>