path: root/io_uring/io_uring.c
4 days  io_uring: ensure deferred completions are flushed for multishot  (Jens Axboe)

Multishot normally uses io_req_post_cqe() to post completions, but when stopping it, it may finish up with a deferred completion. This is fine, except if another multishot event triggers before the deferred completions get flushed. If this occurs, then CQEs may get reordered in the CQ ring, as new multishot completions get posted before the deferred ones are flushed. This can cause confusion on the application side, if strict ordering is required for the use case.

When posting a multishot completion via io_req_post_cqe(), flush any pending deferred completions first.

Cc: stable@vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer@apple.com>
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

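A minimal sketch of the shape of the fix, assuming io_uring-internal names (submit_state.compl_reqs, __io_submit_flush_completions()) that may not match the actual patch:

    /* Illustrative only, not the actual diff: before a multishot request
     * posts a CQE directly, flush any batched (deferred) completions so
     * CQEs land in the CQ ring in order. */
    bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
    {
        struct io_ring_ctx *ctx = req->ctx;

        /* deferred completions must hit the CQ ring first */
        if (!wq_list_empty(&ctx->submit_state.compl_reqs))
            __io_submit_flush_completions(ctx);

        /* ... then post this multishot CQE as before ... */
    }
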
7 days  io_uring: always arm linked timeouts prior to issue  (Jens Axboe)

There are a few spots where linked timeouts are armed, and not all of them adhere to the pre-arm, attempt issue, post-arm pattern. This can be problematic if the linked request returns that it will trigger a callback later, and does so before the linked timeout is fully armed.

Consolidate all the linked timeout handling into __io_issue_sqe(), rather than have it spread throughout the various issue entry points.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/1390
Reported-by: Chase Hiltz <chase@path.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

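The consolidated pattern might look roughly like this (the helper and flag names are illustrative, not taken from the patch):

    static int __io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags,
                              const struct io_issue_def *def)
    {
        /* pre-arm: the linked timeout must be live before issue, so a
         * request that completes via callback can never race against a
         * half-armed timeout */
        if (unlikely(req->flags & REQ_F_ARM_LTIMEOUT))
            io_req_arm_ltimeout(req);

        return def->issue(req, issue_flags);
    }
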
2025-04-24  io_uring: fix 'sync' handling of io_fallback_tw()  (Jens Axboe)

A previous commit added a 'sync' parameter to io_fallback_tw(), which if true, means the caller wants to wait on the fallback thread handling it. But the logic is somewhat messed up; ensure that ctxs are swapped and flushed appropriately.

Cc: stable@vger.kernel.org
Fixes: dfbe5561ae93 ("io_uring: flush offloaded and delayed task_work on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-04-24  io_uring: don't duplicate flushing in io_req_post_cqe  (Pavel Begunkov)

io_req_post_cqe() sets submit_state.cq_flush so that *flush_completions() can take care of batch committing CQEs. Don't commit it twice by using __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-04-03  io_uring: always do atomic put from iowq  (Pavel Begunkov)

io_uring always switches requests to atomic refcounting for iowq execution before there is any parallelism, by setting REQ_F_REFCOUNT, and the flag is not cleared until the request completes. That should be fine as long as the compiler doesn't make up a nonexistent value for the flags; however, KCSAN still complains when the request owner changes other flag bits:

BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work
...
read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0:
 req_ref_put_and_test io_uring/refs.h:22 [inline]

Skip REQ_F_REFCOUNT checks for iowq, we know it's set.

Reported-by: syzbot+903a2ad71fb3f1e47cf5@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d880bc27fb8c3209b54641be4ff6ac02b0e5789a.1743679736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

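The iowq-side put then plausibly reduces to something like this (the helper name is an assumption):

    /* Sketch: iowq knows REQ_F_REFCOUNT is set, so skip the flag check
     * and drop the reference atomically; the data_race() read keeps
     * KCSAN quiet while still asserting the invariant. */
    static inline bool req_ref_put_and_test_atomic(struct io_kiocb *req)
    {
        WARN_ON_ONCE(!(data_race(req->flags) & REQ_F_REFCOUNT));
        return atomic_dec_and_test(&req->refs);
    }
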
2025-03-31  io_uring: add req flag invariant build assertion  (Pavel Begunkov)

We're caching some of the file-related request flags in a tricky way; put in a build check to make sure the flags don't get reshuffled.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9877577b83c25dd78224a8274f799187e7ec7639.1743407551.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-28  io_uring: don't pass ctx to tw add remote helper  (Pavel Begunkov)

Unlike earlier versions, io_msg_remote_post() creates a valid request with a proper context, so don't pass a context to io_req_task_work_add_remote() explicitly but derive it from the request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/721f51cf34996d98b48f0bfd24ad40aa2730167e.1743190078.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-28  Merge tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull more io_uring updates from Jens Axboe:
 "Final separate updates for io_uring. This started out as a series of cleanups and improvements for registered buffers, but as the last series of the io_uring changes for 6.15, it also collected a few fixes for the other branches on top:

  - Add support for vectored fixed/registered buffers. Previously only single segments had been supported for commands; now vectored variants are supported as well. This series includes networking and file read/write support.

  - Small series unifying return codes across multi and single shot.

  - Small series cleaning up registered buffer importing.

  - Adding support for vectored registered buffers for uring_cmd.

  - Fix for io-wq handling of command reissue.

  - Various little fixes and tweaks"

* tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux: (25 commits)
  io_uring/net: fix io_req_post_cqe abuse by send bundle
  io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc
  io_uring: move min_events sanitisation
  io_uring: rename "min" arg in io_iopoll_check()
  io_uring: open code __io_post_aux_cqe()
  io_uring: defer iowq cqe overflow via task_work
  io_uring: fix retry handling off iowq
  io_uring/net: only import send_zc buffer once
  io_uring/cmd: introduce io_uring_cmd_import_fixed_vec
  io_uring/cmd: add iovec cache for commands
  io_uring/cmd: don't expose entire cmd async data
  io_uring: rename the data cmd cache
  io_uring: rely on io_prep_reg_vec for iovec placement
  io_uring: introduce io_prep_reg_iovec()
  io_uring: unify STOP_MULTISHOT with IOU_OK
  io_uring: return -EAGAIN to continue multishot
  io_uring: cap cached iovec/bvec size
  io_uring/net: implement vectored reg bufs for zctx
  io_uring/net: convert to struct iou_vec
  io_uring/net: pull vec alloc out of msghdr import
  ...

2025-03-28  Merge tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull io_uring zero-copy receive support from Jens Axboe:
 "This adds support for zero-copy receive with io_uring, enabling fast bulk receive of data directly into application memory, rather than needing to copy the data out of kernel memory. While this version only supports host memory as that was the initial target, other memory types are planned as well, with notably GPU memory coming next.

  This work depends on some networking components which were queued up on the networking side, but have now landed in your tree.

  This is the work of Pavel Begunkov and David Wei. From the v14 posting:

   'We configure a page pool that a driver uses to fill a hw rx queue to hand out user pages instead of kernel pages. Any data that ends up hitting this hw rx queue will thus be dma'd into userspace memory directly, without needing to be bounced through kernel memory. 'Reading' data out of a socket instead becomes a _notification_ mechanism, where the kernel tells userspace where the data is. The overall approach is similar to the devmem TCP proposal.

    This relies on hw header/data split, flow steering and RSS to ensure packet headers remain in kernel memory and only desired flows hit a hw rx queue configured for zero copy. Configuring this is outside of the scope of this patchset.

    We share netdev core infra with devmem TCP. The main difference is that io_uring is used for the uAPI and the lifetime of all objects is bound to an io_uring instance. Data is 'read' using a new io_uring request type. When done, data is returned via a new shared refill queue. A zero copy page pool refills a hw rx queue from this refill queue directly. Of course, the lifetime of these data buffers is managed by io_uring rather than the networking stack, with different refcounting rules.

    This patchset is the first step adding basic zero copy support. We will extend this iteratively with new features e.g. dynamically allocated zero copy areas, THP support, dmabuf support, improved copy fallback, general optimisations and more'

  In a local setup, I was able to saturate a 200G link with a single CPU core, and at netdev conf 0x19 earlier this month, Jamal reported 188Gbit of bandwidth using a single core (no HT, including soft-irq). Safe to say the efficiency is there, as bigger links would be needed to find the per-core limit, and it's considerably more efficient and faster than the existing devmem solution"

* tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux:
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue

2025-03-27  io_uring/net: fix io_req_post_cqe abuse by send bundle  (Pavel Begunkov)

[  114.987980][ T5313] WARNING: CPU: 6 PID: 5313 at io_uring/io_uring.c:872 io_req_post_cqe+0x12e/0x4f0
[  114.991597][ T5313] RIP: 0010:io_req_post_cqe+0x12e/0x4f0
[  115.001880][ T5313] Call Trace:
[  115.002222][ T5313]  <TASK>
[  115.007813][ T5313]  io_send+0x4fe/0x10f0
[  115.009317][ T5313]  io_issue_sqe+0x1a6/0x1740
[  115.012094][ T5313]  io_wq_submit_work+0x38b/0xed0
[  115.013223][ T5313]  io_worker_handle_work+0x62a/0x1600
[  115.013876][ T5313]  io_wq_worker+0x34f/0xdf0

As the comment states, io_req_post_cqe() should only be used by multishot requests, i.e. REQ_F_APOLL_MULTISHOT, which bundled sends are not. Add a flag signifying whether a request wants to post multiple CQEs. Eventually REQ_F_APOLL_MULTISHOT should imply the new flag, but that's left out for simplicity.

Cc: stable@vger.kernel.org
Fixes: a05d1f625c7aa ("io_uring/net: support bundles for send")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b611dbb54d1cd47a88681f5d38c84d0c02bc563.1743067183.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-26  Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull io_uring updates from Jens Axboe:
 "This is the first of the io_uring pull requests for the 6.15 merge window, there will be others once the net tree has gone in. This contains:

  - Cleanup and unification of cancelation handling across various request types.

  - Improvement for bundles, supporting them both for incrementally consumed buffers, and for non-multishot requests.

  - Enable toggling of using iowait while waiting on io_uring events or not. Unfortunately this is still tied with CPU frequency boosting on short waits, as the scheduler side has not been very receptive to splitting the (useless) iowait stat from the cpufreq implied boost.

  - Add support for kbuf nodes, enabling zero-copy support for the ublk block driver.

  - Various cleanups for resource node handling.

  - Series greatly cleaning up the legacy provided (non-ring based) buffers. For years, we've been pushing the ring provided buffers as the way to go, and that is what people have been using. Reduce the complexity and code associated with legacy provided buffers.

  - Series cleaning up the compat handling.

  - Series improving and cleaning up the recvmsg/sendmsg iovec and msg handling.

  - Series of cleanups for io-wq.

  - Start adding a bunch of selftests. The liburing repository generally carries feature and regression tests for everything, but at least for ublk initially, we'll try and go the route of having it in selftests as well. We'll see how this goes, might decide to migrate more tests this way in the future.

  - Various little cleanups and fixes"

* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
  selftests: ublk: add stripe target
  selftests: ublk: simplify loop io completion
  selftests: ublk: enable zero copy for null target
  selftests: ublk: prepare for supporting stripe target
  selftests: ublk: move common code into common.c
  selftests: ublk: increase max buffer size to 1MB
  selftests: ublk: add single sqe allocator helper
  selftests: ublk: add generic_01 for verifying sequential IO order
  selftests: ublk: fix starting ublk device
  io_uring: enable toggle of iowait usage when waiting on CQEs
  selftests: ublk: fix write cache implementation
  selftests: ublk: add variable for user to not show test result
  selftests: ublk: don't show `modprobe` failure
  selftests: ublk: add one dependency header
  io_uring/kbuf: enable bundles for incrementally consumed buffers
  Revert "io_uring/rsrc: simplify the bvec iter count calculation"
  selftests: ublk: improve test usability
  selftests: ublk: add stress test for covering IO vs. killing ublk server
  selftests: ublk: add one stress test for covering IO vs. removing device
  selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
  ...

2025-03-25  Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm  (Linus Torvalds)

Pull lsm updates from Paul Moore:

 - Various minor updates to the LSM Rust bindings. Changes include marking trivial Rust bindings as inlines and comment tweaks to better reflect the LSM hooks.

 - Add LSM/SELinux access controls to io_uring_allowed(). Similar to the io_uring_disabled sysctl, add an LSM hook to io_uring_allowed() to give LSMs a simple way to enforce security policy on the use of io_uring. This pull request includes SELinux support for this new control using the io_uring/allowed permission.

 - Remove an unused parameter from the security_perf_event_open() hook. The perf_event_attr struct parameter was not used by any currently supported LSMs; remove it from the hook.

 - Add an explicit MAINTAINERS entry for the credentials code. We've seen problems in the past where patches to the credentials code sent by non-maintainers would often languish on the lists for multiple months, as there was no one explicitly tasked with the responsibility of reviewing and/or merging credentials related code. Considering that most of the code under security/ has a vested interest in ensuring that the credentials code is well maintained, I'm volunteering to look after the credentials code and Serge Hallyn has also volunteered to step up as an official reviewer. I posted the MAINTAINERS update as a RFC to LKML in hopes that someone else would jump up with an "I'll do it!", but beyond Serge it was all crickets.

 - Update Stephen Smalley's old email address to prevent confusion. This includes a corresponding update to the mailmap file.

* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  mailmap: map Stephen Smalley's old email addresses
  lsm: remove old email address for Stephen Smalley
  MAINTAINERS: add Serge Hallyn as a credentials reviewer
  MAINTAINERS: add an explicit credentials entry
  cred,rust: mark Credential methods inline
  lsm,rust: reword "destroy" -> "release" in SecurityCtx
  lsm,rust: mark SecurityCtx methods inline
  perf: Remove unnecessary parameter of security check
  lsm: fix a missing security_uring_allowed() prototype
  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
  io_uring: refactor io_uring_allowed()

2025-03-25  io_uring: move min_events sanitisation  (Pavel Begunkov)

iopoll and normal waiting already duplicate min_completion truncation, so move them inside the corresponding routines.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/254adb289cc04638f25d746a7499260fa89a179e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-25  io_uring: rename "min" arg in io_iopoll_check()  (Pavel Begunkov)

Don't name arguments "min", it shadows the namesake function. min_events is also more consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52ce9d88d3bca5732a218b0da14924aa6968909.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-25  io_uring: open code __io_post_aux_cqe()  (Pavel Begunkov)

There is no reason to keep __io_post_aux_cqe() separately from io_post_aux_cqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c4c1f68d694deea25a212fc09bbb11f330cd82e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-25  io_uring: defer iowq cqe overflow via task_work  (Pavel Begunkov)

Don't handle CQE overflows in io_req_complete_post() and defer it to flush_completions. It cuts some duplication, and I also want to limit the number of places directly overflowing completions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9046410ac27e18f2baa6f7cdb363ec921cbc3b79.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-25  io_uring: fix retry handling off iowq  (Pavel Begunkov)

io_req_complete_post() doesn't handle reissue and if called with a REQ_F_REISSUE request it might post extra unexpected completions. Fix it by pushing into flush_completion via task work.

Fixes: d803d123948fe ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/badb3d7e462881e7edbfcc2be6301090b07dbe53.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-25  Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...

2025-03-21  io_uring/cmd: add iovec cache for commands  (Pavel Begunkov)

Add iou_vec to commands and wire caching for it, but don't expose it to users just yet. We need the vec cleared on initial alloc, but since we can't place it at the beginning at the moment, zero the entire async_data. It's cached, so the cost affects only the initial allocation, and it might not be a bad idea anyway, since we're exposing those bits to outside drivers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c0f2145b75791bc6106eb4e72add2cf6a2c72a7a.1742579999.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-20  io_uring: enable toggle of iowait usage when waiting on CQEs  (Jens Axboe)

By default, io_uring marks a waiting task as being in iowait, if it's sleeping waiting on events and there are pending requests. This isn't necessarily always useful, and may be confusing on non-storage setups where iowait isn't expected. It can also cause extra power usage, by preventing the CPU from entering lower sleep states.

This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then io_uring will not account the sleeping task as being in iowait. If the kernel supports this feature, then it will be marked by having the IORING_FEAT_NO_IOWAIT feature flag set.

As the kernel currently does not support separating the iowait accounting and CPU frequency boosting, IORING_ENTER_NO_IOWAIT controls both of these at the same time. In the future, if those do end up being split, then it'd be possible to control them separately. However, it seems more likely that the kernel will decouple iowait and CPU frequency boosting anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

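Userspace usage could look like the sketch below; the flag and feature names come from the commit message itself, while the raw-syscall wrapper and surrounding details are illustrative:

    #include <linux/io_uring.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Sketch: opt out of iowait accounting while waiting for CQEs, but
     * only if the kernel advertises support via the feature flag. */
    static int wait_cqe_no_iowait(int ring_fd, const struct io_uring_params *p)
    {
        unsigned int flags = IORING_ENTER_GETEVENTS;

        if (p->features & IORING_FEAT_NO_IOWAIT)
            flags |= IORING_ENTER_NO_IOWAIT;

        /* wait for at least one completion */
        return syscall(__NR_io_uring_enter, ring_fd, 0, 1, flags, NULL, 0);
    }
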
2025-03-19  io_uring/cmd: don't expose entire cmd async data  (Pavel Begunkov)

io_uring needs private bits in cmd's ->async_data, and they should never be exposed to drivers as it'd certainly be abused. Leave struct io_uring_cmd_data for the drivers but wrap it into a structure. It's a prep patch and doesn't do anything useful yet.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-3-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-19  io_uring: rename the data cmd cache  (Pavel Begunkov)

Pick a more descriptive name for the cmd async data cache.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-2-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-10  io_uring: unify STOP_MULTISHOT with IOU_OK  (Pavel Begunkov)

IOU_OK means that the request ownership is now handed back to core io_uring and it has to complete it using the result provided in req->cqe. The same is true for multishot and IOU_STOP_MULTISHOT. Rename it to IOU_COMPLETE to avoid confusion and use it for both modes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6a5b2edb0eb9558acb1c8f1db38ac45fee95491.1741453534.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-03-10  io_uring: return -EAGAIN to continue multishot  (Pavel Begunkov)

Multishot errors can be mapped 1:1 to normal errors, but they are not identical. It leads to a peculiar situation where all multishot requests have to check in what context they're run and return different codes.

Unify them, starting with the EAGAIN / IOU_ISSUE_SKIP_COMPLETE (EIOCBQUEUED) pair, which means that core io_uring still owns the request and it should be retried. In the multishot case it naturally just continues to poll; otherwise it might poll, use iowq or do any other kind of allowed blocking. Introduce IOU_RETRY, aliased to -EAGAIN, for that.

Apart from the obvious upsides, multishot can now also check for misuse of IOU_ISSUE_SKIP_COMPLETE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/da117b79ce72ecc3ab488c744e29fae9ba54e23b.1741453534.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

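Taken together with the IOU_COMPLETE rename above, the issue return codes plausibly end up along these lines (a sketch; the exact definitions may differ):

    /* Sketch of the unified issue return codes (kernel context, where
     * EAGAIN/EIOCBQUEUED come from linux/errno.h) */
    enum {
        /* ownership handed back to core io_uring; complete via req->cqe */
        IOU_COMPLETE            = 0,
        /* core io_uring retains ownership; poll or iowq will retry */
        IOU_RETRY               = -EAGAIN,
        /* request queued elsewhere; don't post a completion here */
        IOU_ISSUE_SKIP_COMPLETE = -EIOCBQUEUED,
    };
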
2025-03-07  Merge branch 'for-6.15/io_uring-rx-zc' into for-6.15/io_uring-reg-vec  (Jens Axboe)

* for-6.15/io_uring-rx-zc: (80 commits)
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
  net: add helpers for setting a memory provider on an rx queue
  net: page_pool: add memory provider helpers
  net: prepare for non devmem TCP memory providers
  ...

2025-03-05  io_uring: introduce io_cache_free() helper  (Caleb Sander Mateos)

Add a helper function io_cache_free() that returns an allocation to an io_alloc_cache, falling back on kfree() if the io_alloc_cache is full. This is the inverse of io_cache_alloc(), which takes an allocation from an io_alloc_cache and falls back on kmalloc() if the cache is empty.

Convert 4 callers to use the helper.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250304194814.2346705-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

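From that description, the helper is essentially the inverse of io_cache_alloc(); a sketch, assuming io_alloc_cache_put() reports whether the object was accepted:

    /* Return obj to the cache, or kfree() it if the cache is full */
    static inline void io_cache_free(struct io_alloc_cache *cache, void *obj)
    {
        if (!io_alloc_cache_put(cache, obj))
            kfree(obj);
    }
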
2025-02-28  io_uring: cache nodes and mapped buffers  (Keith Busch)

Frequent alloc/free cycles on these are pretty costly. Use an io cache to more efficiently reuse these buffers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-7-kbusch@meta.com
[axboe: fix imu leak]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-28  io_uring: add support for kernel registered bvecs  (Keith Busch)

Provide an interface for the kernel to leverage the existing pre-registered buffers that io_uring provides. User space can reference these later to achieve zero-copy IO. User space must register an empty fixed buffer table with io_uring in order for the kernel to make use of it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-5-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-27  Merge branch 'io_uring-6.14' into for-6.15/io_uring  (Jens Axboe)

Merge mainline fixes into 6.15 branch, as upcoming patches depend on fixes that went into the 6.14 mainline branch.

* io_uring-6.14:
  io_uring/net: save msg_control for compat
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation

2025-02-24  io_uring: make io_poll_issue() sturdier  (Pavel Begunkov)

io_poll_issue() forwards the call to io_issue_sqe() and thus inherits some of the handling. That's not particularly failure resistant, as for example returning an innocent-looking IOU_OK from a multishot issue will lead to severe bugs.

Reimplement io_poll_issue() without io_issue_sqe()'s request completion logic. Remove extra checks as we know that req->file is already set, linked timeouts are armed, and iopoll is not supported. Also cover it with warnings for now.

The patch should be useful by itself, but it's also preparing the codebase for other future clean ups.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3096d7b1026d9a52426a598bdfc8d9d324555545.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-21  Merge tag 'io_uring-6.14-20250221' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull io_uring fixes from Jens Axboe:

 - Series fixing an issue with multishot read on pollable files that may return -EIOCBQUEUED from ->read_iter(). Four small patches for that, the first one deliberately done in such a way that it'd be easy to backport

 - Remove some dead constant definitions

 - Use array_index_nospec() for opcode indexing

 - Work-around for worker creation retries in the presence of signals

* tag 'io_uring-6.14-20250221' of git://git.kernel.dk/linux:
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation

2025-02-18  io_uring: use lockless_cq flag in io_req_complete_post()  (Caleb Sander Mateos)

io_uring_create() computes ctx->lockless_cq as:

  ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)

So use it to simplify that expression in io_req_complete_post().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250212005119.3433005-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

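The simplification, as described (surrounding code elided):

    /* before: open-coded condition in io_req_complete_post() */
    if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) {
        /* ... */
    }

    /* after: io_uring_create() already computed this as lockless_cq */
    if (ctx->lockless_cq) {
        /* ... */
    }
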
2025-02-18  io_uring: Use helper function hrtimer_update_function()  (Nam Cao)

The field 'function' of struct hrtimer should not be changed directly, as the write is lockless and a concurrent timer expiry might end up using the wrong function pointer. Switch to use hrtimer_update_function(), which also performs runtime checks that it is safe to modify the callback.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/9b33f490fb1d207d3918ef5e116dc3412ae35c1e.1738746927.git.namcao@linutronix.de

2025-02-17  io_uring/zcrx: add interface queue and refill queue  (David Wei)

Add a new object called an interface queue (ifq) that represents a net rx queue that has been configured for zero copy. Each ifq is registered using a new registration opcode IORING_REGISTER_ZCRX_IFQ.

The refill queue is allocated by the kernel and mapped by userspace using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used by userspace to return buffers that it is done with, which will then be reused by the netdev.

The main CQ ring is used to notify userspace of received data by using the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry contains the offset + len to the data.

For now, each io_uring instance only has a single ifq.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring: pass struct io_tw_state by value  (Caleb Sander Mateos)

8e5b3b89ecaf ("io_uring: remove struct io_tw_state::locked") removed the only field of io_tw_state but kept it as a task work callback argument to "forc[e] users not to invoke them carelessly out of a wrong context". Passing the struct io_tw_state * argument adds a few instructions to all callers that can't inline the functions and see the argument is unused.

So pass struct io_tw_state by value instead. Since it's a 0-sized value, it can be passed without any instructions needed to initialize it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring: introduce type alias for io_tw_state  (Caleb Sander Mateos)

In preparation for changing how io_tw_state is passed, introduce a type alias io_tw_token_t for struct io_tw_state *. This allows for changing the representation in one place, without having to update the many functions that just forward their struct io_tw_state * argument.

Also add a comment to struct io_tw_state to explain its purpose.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

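Combined with the by-value change one entry up, the end state is roughly (a sketch, not the exact code):

    /* Zero-sized marker type: its only job is to prove the callee runs
     * in task-work context. Passed by value, it costs nothing. */
    struct io_tw_state {
    };
    typedef struct io_tw_state io_tw_token_t;

    /* callbacks take the token by value after both changes */
    void io_req_task_complete(struct io_kiocb *req, io_tw_token_t tw);
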
2025-02-17  io_uring: pass ctx instead of req to io_init_req_drain()  (Caleb Sander Mateos)

io_init_req_drain() takes a struct io_kiocb *req argument but only uses it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so pass it instead. Drop "req" from the function name since it operates on the ctx rather than a specific req.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250212164807.3681036-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring: use IO_REQ_LINK_FLAGS more  (Caleb Sander Mateos)

Replace the 2 instances of REQ_F_LINK | REQ_F_HARDLINK with the more commonly used IO_REQ_LINK_FLAGS.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250211202002.3316324-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring/kbuf: introduce io_kbuf_drop_legacy()  (Pavel Begunkov)

io_kbuf_drop() is only used for legacy provided buffers, and so __io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the dead branch out of __io_put_kbuf_list(), rename it into io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8cc73e2272f09a86ecbdad9ebdd8304f8e583c0.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring/kbuf: remove legacy kbuf caching  (Pavel Begunkov)

Remove all struct io_buffer caches. It makes it a fair bit simpler. Apart from killing a bunch of lines and juggling between lists, __io_put_kbuf_list() doesn't need ->completion_lock locking now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/18287217466ee2576ea0b1e72daccf7b22c7e856.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring/kbuf: move locking into io_kbuf_drop()  (Pavel Begunkov)

Move the burden of locking out of the caller into io_kbuf_drop(); that will help with further refactoring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/530f0cf1f06963029399f819a9a58b1a34bebef3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring/kbuf: remove legacy kbuf kmem cache  (Pavel Begunkov)

Remove the kmem cache used by legacy provided buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8195c207d8524d94e972c0c82de99282289f7f5c.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring: sanitise ring params earlier  (Pavel Begunkov)

Do all struct io_uring_params validation early on before allocating the context. That makes initialisation easier, especially by having fewer places where we need to care about partial de-initialisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/363ba90b83ff78eefdc88b60e1b2c4a39d182247.1738344646.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-17  io_uring: check for iowq alloc_workqueue failure  (Pavel Begunkov)

alloc_workqueue() can fail even during init in io_uring_init(), check the result and panic if anything went wrong.

Fixes: 73eaa2b583493 ("io_uring: use private workqueue for exit work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a046063902f888f66151f89fa42f84063b9727b.1738343083.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

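A sketch of the check, per "panic if anything went wrong" (the workqueue name and flags are assumptions based on the commit being fixed):

    static int __init io_uring_init(void)
    {
        /* ... */
        iou_wq = alloc_workqueue("iou_exit", WQ_UNBOUND, 64);
        /* init-time allocation failure is unrecoverable: panic */
        BUG_ON(!iou_wq);
        /* ... */
        return 0;
    }
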
2025-02-17  io_uring: deduplicate caches deallocation  (Pavel Begunkov)

Add a function that frees all ring caches, since we already have two spots repeating the same thing and it's easy to miss one and change only the other.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b6b0125677c58bdff99eda91ab320137406e8562.1738342562.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-02-15  io_uring: prevent opcode speculation  (Pavel Begunkov)

sqe->opcode is used for different tables; make sure we sanitise it against speculation.

Cc: stable@vger.kernel.org
Fixes: d3656344fea03 ("io_uring: add lookup table for various opcode needs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/7eddbf31c8ca0a3947f8ed98271acc2b4349c016.1739568408.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

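This is the standard array_index_nospec() pattern (see also "Use array_index_nospec() for opcode indexing" in the 2025-02-21 pull above); a sketch of where it lands in request init:

    #include <linux/nospec.h>

    /* clamp the opcode under speculation before it is used to index
     * the opcode tables */
    req->opcode = array_index_nospec(READ_ONCE(sqe->opcode), IORING_OP_LAST);
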
2025-02-07  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()  (Hamza Mahfooz)

It is desirable to allow LSMs to configure accessibility to io_uring because it is a coarse yet very simple way to restrict access to it. So, add an LSM hook to io_uring_allowed() to guard access to io_uring.

Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
[PM: merge fuzz due to changes in preceding patches, subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>

2025-02-07  io_uring: refactor io_uring_allowed()  (Hamza Mahfooz)

Have io_uring_allowed() return an error code directly instead of true/false. This is needed for follow-up work to guard io_uring_setup() with LSM.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
[PM: goto-to-return conversion as discussed on-list]
Signed-off-by: Paul Moore <paul@paul-moore.com>

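The shape of the refactor, per the description (a sketch; the sysctl checks are elided and the first helper is hypothetical):

    /* before: returned true/false; after: an error code, so the
     * follow-up LSM hook can simply be chained on the success path */
    static int io_uring_allowed(void)
    {
        if (io_uring_setup_disabled())      /* hypothetical helper */
            return -EPERM;

        return security_uring_allowed();    /* added by the LSM patch above */
    }
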
2025-01-31  Merge tag 'io_uring-6.14-20250131' of git://git.kernel.dk/linux  (Linus Torvalds)

Pull more io_uring updates from Jens Axboe:

 - Series cleaning up the alloc cache changes from this merge window, and then another series on top making it better yet. This also solves an issue with KASAN_EXTRA_INFO, by making io_uring resilient to KASAN using parts of the freed struct for storage

 - Cleanups and simplifications to buffer cloning and io resource node management

 - Fix an issue introduced in this merge window where READ/WRITE_ONCE was used on an atomic_t, which made some archs complain

 - Fix for an errant connect retry when the socket has been shut down

 - Fix for multishot and provided buffers

* tag 'io_uring-6.14-20250131' of git://git.kernel.dk/linux:
  io_uring/net: don't retry connect operation on EPOLLERR
  io_uring/rw: simplify io_rw_recycle()
  io_uring: remove !KASAN guards from cache free
  io_uring/net: extract io_send_select_buffer()
  io_uring/net: clean io_msg_copy_hdr()
  io_uring/net: make io_net_vec_assign() return void
  io_uring: add alloc_cache.c
  io_uring: dont ifdef io_alloc_cache_kasan()
  io_uring: include all deps for alloc_cache.h
  io_uring: fix multishots with selected buffers
  io_uring/register: use atomic_read/write for sq_flags migration
  io_uring/alloc_cache: get rid of _nocache() helper
  io_uring: get rid of alloc cache init_once handling
  io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout
  io_uring/uring_cmd: use cached cmd_op in io_uring_cmd_sock()
  io_uring/msg_ring: don't leave potentially dangling ->tctx pointer
  io_uring/rsrc: Move lockdep assert from io_free_rsrc_node() to caller
  io_uring/rsrc: remove unused parameter ctx for io_rsrc_node_alloc()
  io_uring: clean up io_uring_register_get_file()
  io_uring/rsrc: Simplify buffer cloning by locking both rings

2025-01-28  treewide: const qualify ctl_tables where applicable  (Joel Granados)

Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/infiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers.

Created this by running an spatch followed by a sed command:

Spatch:

    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const struct ctl_table table_name [] = { ... };

sed:

    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>

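For this file, the treewide change amounts to adding the qualifier to its sysctl table; a sketch (the table name is an assumption):

    /* before */
    static struct ctl_table kernel_io_uring_disabled_table[] = { /* ... */ };

    /* after: the array can live in .rodata, so its proc_handler
     * pointers can no longer be modified at runtime */
    static const struct ctl_table kernel_io_uring_disabled_table[] = { /* ... */ };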