summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2025-02-20io_uring/epoll: remove CONFIG_EPOLL guardsJens Axboe
Just have the Makefile add the object if epoll is enabled, then it's not necessary to guard the entire epoll.c file inside an CONFIG_EPOLL ifdef. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-20Merge branch 'for-6.15/io_uring-rx-zc' into for-6.15/io_uring-epoll-waitJens Axboe
* for-6.15/io_uring-rx-zc: (77 commits) io_uring: Rename KConfig to Kconfig io_uring/zcrx: fix leaks on failed registration io_uring/zcrx: recheck ifq on shutdown io_uring/zcrx: add selftest net: add documentation for io_uring zcrx io_uring/zcrx: add copy fallback io_uring/zcrx: throttle receive requests io_uring/zcrx: set pp memory provider for an rx queue io_uring/zcrx: add io_recvzc request io_uring/zcrx: dma-map area for the device io_uring/zcrx: implement zerocopy receive pp memory provider io_uring/zcrx: grab a net device io_uring/zcrx: add io_zcrx_area io_uring/zcrx: add interface queue and refill queue net: add helpers for setting a memory provider on an rx queue net: page_pool: add memory provider helpers net: prepare for non devmem TCP memory providers net: page_pool: add a mp hook to unregister_netdevice* net: page_pool: add callback for mp info printing netdev: add io_uring memory provider info ...
2025-02-19io_uring: Rename KConfig to KconfigGeert Uytterhoeven
Kconfig files are traditionally named "Kconfig". Fixes: 6f377873cb239050 ("io_uring/zcrx: add interface queue and refill queue") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/5ae387c1465f54768b51a5a1ca87be7934c4b2ad.1739976387.git.geert+renesas@glider.be Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19io_uring/zcrx: fix leaks on failed registrationPavel Begunkov
If we try to register a device-less interface like veth, io_register_zcrx_ifq() will leak struct io_zcrx_ifq with a bunch of resources attached to it. Fix that. Fixes: 035af94b39fd ("io_uring/zcrx: grab a net device") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202502190532.W7NnmyiP-lkp@intel.com/ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fbf16279dd73fa4c6df048168728355636ba5f53.1739959771.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19io_uring/rw: clean up mshot forced sync modePavel Begunkov
Move code forcing synchronous execution of multishot read requests out a more generic __io_read(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4ad7b928c776d1ad59addb9fff64ef2d1fc474d5.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19io_uring/rw: move ki_complete init into prepPavel Begunkov
Initialise ki_complete during request prep stage, we'll depend on it not being reset during issue in the following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/817624086bd5f0448b08c80623399919fda82f34.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19io_uring/rw: don't directly use ki_completePavel Begunkov
We want to avoid checking ->ki_complete directly in the io_uring completion path. Fortunately we have only two callback the selection of which depend on the ring constant flags, i.e. IOPOLL, so use that to infer the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4eb4bdab8cbcf5bc87083f7047edc81e920ab83c.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19io_uring/rw: forbid multishot async readsPavel Begunkov
At the moment we can't sanely handle queuing an async request from a multishot context, so disable them. It shouldn't matter as pollable files / socekts don't normally do async. Patching it in __io_read() is not the cleanest way, but it's simpler than other options, so let's fix it there and clean up on top. Cc: stable@vger.kernel.org Reported-by: chase xd <sl1589472800@gmail.com> Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7d51732c125159d17db4fe16f51ec41b936973f8.1739919038.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19io_uring/zcrx: recheck ifq on shutdownPavel Begunkov
io_ring_exit_work() checks ifq before shutting it down and guarantees that the pointer is stable, but instead of relying on rather complicated synchronisation recheck the ifq pointer inside. Reported-by: Kees Bakker <kees@ijzerbout.nl> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/905e55c47235ab26377a735294f939f31d00ae53.1739934175.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19io_uring/rsrc: remove unused constantsCaleb Sander Mateos
IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86 ("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the rsrc_node_cache. IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring io_rsrc_node list") removed the separate tag table for registered nodes. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/20250219033444.2020136-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18io_uring: use lockless_cq flag in io_req_complete_post()Caleb Sander Mateos
io_uring_create() computes ctx->lockless_cq as: ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) So use it to simplify that expression in io_req_complete_post(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/20250212005119.3433005-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18io_uring: Use helper function hrtimer_update_function()Nam Cao
The field 'function' of struct hrtimer should not be changed directly, as the write is lockless and a concurrent timer expiry might end up using the wrong function pointer. Switch to use hrtimer_update_function() which also performs runtime checks that it is safe to modify the callback. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/9b33f490fb1d207d3918ef5e116dc3412ae35c1e.1738746927.git.namcao@linutronix.de
2025-02-18io_uring/timeout: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/80ca8d959f2cc67c75f6d61008e3bebfe7fbc30a.1738746821.git.namcao@linutronix.de
2025-02-17net: use napi_id_valid helperStefano Jordhani
In commit 6597e8d35851 ("netdev-genl: Elide napi_id when not present"), napi_id_valid function was added. Use the helper to refactor open-coded checks in the source. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Stefano Jordhani <sjordhani@gmail.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> # for iouring Link: https://patch.msgid.link/20250214181801.931-1-sjordhani@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17io_uring/zcrx: add copy fallbackPavel Begunkov
There are scenarios in which the zerocopy path can get a kernel buffer instead of a net_iov and needs to copy it to the user, whether it is because of mis-steering or simply getting an skb with the linear part. In this case, grab a net_iov, copy into it and return it to the user as normally. At the moment the user doesn't get any indication whether there was a copy or not, which is left for follow up work. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-10-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: throttle receive requestsPavel Begunkov
io_zc_rx_tcp_recvmsg() continues until it fails or there is nothing to receive. If the other side sends fast enough, we might get stuck in io_zc_rx_tcp_recvmsg() producing more and more CQEs but not letting the user to handle them leading to unbound latencies. Break out of it based on an arbitrarily chosen limit, the upper layer will either return to userspace or requeue the request. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-9-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: set pp memory provider for an rx queueDavid Wei
Set the page pool memory provider for the rx queue configured for zero copy to io_uring. Then the rx queue is reset using netdev_rx_queue_restart() and netdev core + page pool will take care of filling the rx queue from the io_uring zero copy memory provider. For now, there is only one ifq so its destruction happens implicitly during io_uring cleanup. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-8-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add io_recvzc requestDavid Wei
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a socket. Only the connection should be land on the specific rx queue set up for zero copy, and the socket must be handled by the io_uring instance that the rx queue was registered for zero copy with. That's because neither net_iovs / buffers from our queue can be read by outside applications, nor zero copy is possible if traffic for the zero copy connection goes to another queue. This coordination is outside of the scope of this patch series. Also, any traffic directed to the zero copy enabled queue is immediately visible to the application, which is why CAP_NET_ADMIN is required at the registration step. Of course, no data is actually read out of the socket, it has already been copied by the netdev into userspace memory via DMA. OP_RECV_ZC reads skbs out of the socket and checks that its frags are indeed net_iovs that belong to io_uring. A cqe is queued for each one of these frags. Recall that each cqe is a big cqe, with the top half being an io_uring_zcrx_cqe. The cqe res field contains the len or error. The lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off field contain the offset relative to the start of the zero copy area. The upper part of the off field is trivially zero, and will be used to carry the area id. For now, there is no limit as to how much work each OP_RECV_ZC request does. It will attempt to drain a socket of all available data. This request always operates in multishot mode. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-7-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: dma-map area for the devicePavel Begunkov
Setup DMA mappings for the area into which we intend to receive data later on. We know the device we want to attach to even before we get a page pool and can pre-map in advance. All net_iov are synchronised for device when allocated, see page_pool_mp_return_in_cache(). Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-6-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: implement zerocopy receive pp memory providerPavel Begunkov
Implement a page pool memory provider for io_uring to receieve in a zero copy fashion. For that, the provider allocates user pages wrapped around into struct net_iovs, that are stored in a previously registered struct net_iov_area. Unlike the traditional receive, that frees pages and returns them back to the page pool right after data was copied to the user, e.g. inside recv(2), we extend the lifetime until the user space confirms that it's done processing the data. That's done by taking a net_iov reference. When the user is done with the buffer, it must return it back to the kernel by posting an entry into the refill ring, which is usually polled off the io_uring memory provider callback in the page pool's netmem allocation path. There is also a separate set of per net_iov "user" references accounting whether a buffer is currently given to the user (including possible fragmentation). Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-5-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: grab a net devicePavel Begunkov
Zerocopy receive needs a net device to bind to its rx queue and dma map buffers. As a preparation to following patches, resolve a net device from the if_idx parameter with no functional changes otherwise. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-4-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add io_zcrx_areaDavid Wei
Add io_zcrx_area that represents a region of userspace memory that is used for zero copy. During ifq registration, userspace passes in the uaddr and len of userspace memory, which is then pinned by the kernel. Each net_iov is mapped to one of these pages. The freelist is a spinlock protected list that keeps track of all the net_iovs/pages that aren't used. For now, there is only one area per ifq and area registration happens implicitly as part of ifq registration. There is no API for adding/removing areas yet. The struct for area registration is there for future extensibility once we support multiple areas and TCP devmem. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-3-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add interface queue and refill queueDavid Wei
Add a new object called an interface queue (ifq) that represents a net rx queue that has been configured for zero copy. Each ifq is registered using a new registration opcode IORING_REGISTER_ZCRX_IFQ. The refill queue is allocated by the kernel and mapped by userspace using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used by userspace to return buffers that it is done with, which will then be re-used by the netdev again. The main CQ ring is used to notify userspace of received data by using the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry contains the offset + len to the data. For now, each io_uring instance only has a single ifq. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring: pass struct io_tw_state by valueCaleb Sander Mateos
8e5b3b89ecaf ("io_uring: remove struct io_tw_state::locked") removed the only field of io_tw_state but kept it as a task work callback argument to "forc[e] users not to invoke them carelessly out of a wrong context". Passing the struct io_tw_state * argument adds a few instructions to all callers that can't inline the functions and see the argument is unused. So pass struct io_tw_state by value instead. Since it's a 0-sized value, it can be passed without any instructions needed to initialize it. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250217022511.1150145-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring: introduce type alias for io_tw_stateCaleb Sander Mateos
In preparation for changing how io_tw_state is passed, introduce a type alias io_tw_token_t for struct io_tw_state *. This allows for changing the representation in one place, without having to update the many functions that just forward their struct io_tw_state * argument. Also add a comment to struct io_tw_state to explain its purpose. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/rsrc: avoid NULL check in io_put_rsrc_node()Caleb Sander Mateos
Most callers of io_put_rsrc_node() already check that node is non-NULL: - io_rsrc_data_free() - io_sqe_buffer_register() - io_reset_rsrc_node() - io_req_put_rsrc_nodes() (REQ_F_BUF_NODE indicates non-NULL buf_node) Only io_splice_cleanup() can call io_put_rsrc_node() with a NULL node. So move the NULL check there. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250216225900.1075446-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring: pass ctx instead of req to io_init_req_drain()Caleb Sander Mateos
io_init_req_drain() takes a struct io_kiocb *req argument but only uses it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so pass it instead. Drop "req" from the function name since it operates on the ctx rather than a specific req. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250212164807.3681036-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring: use IO_REQ_LINK_FLAGS moreCaleb Sander Mateos
Replace the 2 instances of REQ_F_LINK | REQ_F_HARDLINK with the more commonly used IO_REQ_LINK_FLAGS. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250211202002.3316324-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/net: improve recv bundlesJens Axboe
Current recv bundles are only supported for multishot receives, and additionally they also always post at least 2 CQEs if more data is available than what a buffer will hold. This happens because the initial bundle recv will do a single buffer, and then do the rest of what is in the socket as a followup receive. As shown in a test program, if 1k buffers are available and 32k is available to receive in the socket, you'd get the following completions: bundle=1, mshot=0 cqe res 1024 cqe res 1024 [...] cqe res 1024 bundle=1, mshot=1 cqe res 1024 cqe res 31744 where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 && mshot=1 will post a 1k completion and then a 31k completion. To support bundle recv without multishot, it's possible to simply retry the recv immediately and post a single completion, rather than split it into two completions. With the below patch, the same test looks as follows: bundle=1, mshot=0 cqe res 32768 bundle=1, mshot=1 cqe res 32768 where mshot=0 works fine for bundles, and both of them post just a single 32k completion rather than split it into separate completions. Posting fewer completions is always a nice win, and not needing multishot for proper bundle efficiency is nice for cases that can't necessarily use multishot. Reported-by: Norman Maurer <norman_maurer@apple.com> Link: https://lore.kernel.org/r/184f9f92-a682-4205-a15d-89e18f664502@kernel.dk Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/waitid: use generic io_cancel_remove() helperJens Axboe
Don't implement our own loop rolling and checking, just use the generic helper to find and cancel requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/futex: use generic io_cancel_remove() helperJens Axboe
Don't implement our own loop rolling and checking, just use the generic helper to find and cancel requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/cancel: add generic cancel helperJens Axboe
Any opcode that is cancelable ends up defining its own cancel helper for finding and canceling a specific request. Add a generic helper that can be used for this purpose. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/waitid: convert to io_cancel_remove_all()Jens Axboe
Use the generic helper for cancelations. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/futex: convert to io_cancel_remove_all()Jens Axboe
Use the generic helper for cancelations. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/cancel: add generic remove_all helperJens Axboe
Any opcode that is cancelable ends up defining its own remove all helper, which iterates the pending list and cancels matches. Add a generic helper for it, which can be used by them. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: uninline __io_put_kbufsPavel Begunkov
__io_put_kbufs() and other helper functions are too large to be inlined, compilers would normally refuse to do so. Uninline it and move together with io_kbuf_commit into kbuf.c. io_kbuf_commitSigned-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dade7f55ad590e811aff83b1ec55c9c04e17b2b.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: introduce io_kbuf_drop_legacy()Pavel Begunkov
io_kbuf_drop() is only used for legacy provided buffers, and so __io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the dead branch out of __io_put_kbuf_list(), rename it into io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c8cc73e2272f09a86ecbdad9ebdd8304f8e583c0.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: open code __io_put_kbuf()Pavel Begunkov
__io_put_kbuf() is a trivial wrapper, open code it into __io_put_kbufs(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9dc17380272b48d56c95992c6f9eaacd5546e1d3.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: remove legacy kbuf cachingPavel Begunkov
Remove all struct io_buffer caches. It makes it a fair bit simpler. Apart from from killing a bunch of lines and juggling between lists, __io_put_kbuf_list() doesn't need ->completion_lock locking now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/18287217466ee2576ea0b1e72daccf7b22c7e856.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: simplify __io_put_kbufPavel Begunkov
As a preparation step remove an optimisation from __io_put_kbuf() trying to use the locked cache. With that __io_put_kbuf_list() is only used with ->io_buffers_comp, and we remove the explicit list argument. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b7f1394ec4afc7f96b35a61f5992e27c49fd067.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: move locking into io_kbuf_drop()Pavel Begunkov
Move the burden of locking out of the caller into io_kbuf_drop(), that will help with furher refactoring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/530f0cf1f06963029399f819a9a58b1a34bebef3.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: remove legacy kbuf kmem cachePavel Begunkov
Remove the kmem cache used by legacy provided buffers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8195c207d8524d94e972c0c82de99282289f7f5c.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: remove legacy kbuf bulk allocationPavel Begunkov
Legacy provided buffers are slow and discouraged in favour of the ring variant. Remove the bulk allocation to keep it simpler as we don't care about performance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a064d70370e590efed8076e9501ae4cfc20fe0ca.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring: sanitise ring params earlierPavel Begunkov
Do all struct io_uring_params validation early on before allocating the context. That makes initialisation easier, especially by having fewer places where we need to care about partial de-initialisation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/363ba90b83ff78eefdc88b60e1b2c4a39d182247.1738344646.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring: check for iowq alloc_workqueue failurePavel Begunkov
alloc_workqueue() can fail even during init in io_uring_init(), check the result and panic if anything went wrong. Fixes: 73eaa2b583493 ("io_uring: use private workqueue for exit work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3a046063902f888f66151f89fa42f84063b9727b.1738343083.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring: deduplicate caches deallocationPavel Begunkov
Add a function that frees all ring caches since we already have two spots repeating the same thing and it's easy to miss it and change only one of them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b6b0125677c58bdff99eda91ab320137406e8562.1738342562.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/io-wq: pass io_wq to io_get_next_work()Max Kellermann
The only caller has already determined this pointer, so let's skip the redundant dereference. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/20250128133927.3989681-7-max.kellermann@ionos.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/io-wq: do not use bogus hash valueMax Kellermann
Previously, the `hash` variable was initialized with `-1` and only updated by io_get_next_work() if the current work was hashed. Commit 60cf46ae6054 ("io-wq: hash dependent work") changed this to always call io_get_work_hash() even if the work was not hashed. This caused the `hash != -1U` check to always be true, adding some overhead for the `hash->wait` code. This patch fixes the regression by checking the `IO_WQ_WORK_HASHED` flag. Perf diff for a flood of `IORING_OP_NOP` with `IOSQE_ASYNC`: 38.55% -1.57% [kernel.kallsyms] [k] queued_spin_lock_slowpath 6.86% -0.72% [kernel.kallsyms] [k] io_worker_handle_work 0.10% +0.67% [kernel.kallsyms] [k] put_prev_entity 1.96% +0.59% [kernel.kallsyms] [k] io_nop_prep 3.31% -0.51% [kernel.kallsyms] [k] try_to_wake_up 7.18% -0.47% [kernel.kallsyms] [k] io_wq_free_work Fixes: 60cf46ae6054 ("io-wq: hash dependent work") Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/20250128133927.3989681-6-max.kellermann@ionos.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/io-wq: cache work->flags in variableMax Kellermann
This eliminates several redundant atomic reads and therefore reduces the duration the surrounding spinlocks are held. In several io_uring benchmarks, this reduced the CPU time spent in queued_spin_lock_slowpath() considerably: io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`: 38.86% -1.49% [kernel.kallsyms] [k] queued_spin_lock_slowpath 6.75% +0.36% [kernel.kallsyms] [k] io_worker_handle_work 2.60% +0.19% [kernel.kallsyms] [k] io_nop 3.92% +0.18% [kernel.kallsyms] [k] io_req_task_complete 6.34% -0.18% [kernel.kallsyms] [k] io_wq_submit_work HTTP server, static file: 42.79% -2.77% [kernel.kallsyms] [k] queued_spin_lock_slowpath 2.08% +0.23% [kernel.kallsyms] [k] io_wq_submit_work 1.19% +0.20% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map 1.46% +0.15% [kernel.kallsyms] [k] ep_poll_callback 1.80% +0.15% [kernel.kallsyms] [k] io_worker_handle_work HTTP server, PHP: 35.03% -1.80% [kernel.kallsyms] [k] queued_spin_lock_slowpath 0.84% +0.21% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map 1.39% +0.12% [kernel.kallsyms] [k] _copy_to_iter 0.21% +0.10% [kernel.kallsyms] [k] update_sd_lb_stats Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/20250128133927.3989681-5-max.kellermann@ionos.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/io-wq: move worker lists to struct io_wq_acctMax Kellermann
Have separate linked lists for bounded and unbounded workers. This way, io_acct_activate_free_worker() sees only workers relevant to it and doesn't need to skip irrelevant ones. This speeds up the linked list traversal (under acct->lock). The `io_wq.lock` field is moved to `io_wq_acct.workers_lock`. It did not actually protect "access to elements below", that is, not all of them; it only protected access to the worker lists. By having two locks instead of one, contention on this lock is reduced. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/20250128133927.3989681-4-max.kellermann@ionos.com Signed-off-by: Jens Axboe <axboe@kernel.dk>