summaryrefslogtreecommitdiff
path: root/io_uring/kbuf.c
AgeCommit message (Collapse)Author
2025-04-07io_uring/kbuf: reject zero sized provided buffersJens Axboe
This isn't fixing a real issue, but there's also zero point in going through group and buffer setup, when the buffers are going to be rejected once attempted to get used. Cc: stable@vger.kernel.org Reported-by: syzbot+58928048fd1416f1457c@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10io_uring/kbuf: enable bundles for incrementally consumed buffersJens Axboe
The original support for incrementally consumed buffers didn't allow it to be used with bundles, with the assumption being that incremental buffers are generally larger, and hence there's less of a nedd to support it. But that assumption may not be correct - it's perfectly viable to use smaller buffers with incremental consumption, and there may be valid reasons for an application or framework to do so. As there's really no need to explicitly disable bundles with incrementally consumed buffers, allow it. This actually makes the peek side cheaper and simpler, with the completion side basically the same, just needing to iterate for the consumed length. Reported-by: Norman Maurer <norman_maurer@apple.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: uninline __io_put_kbufsPavel Begunkov
__io_put_kbufs() and other helper functions are too large to be inlined, compilers would normally refuse to do so. Uninline it and move together with io_kbuf_commit into kbuf.c. io_kbuf_commitSigned-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dade7f55ad590e811aff83b1ec55c9c04e17b2b.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: introduce io_kbuf_drop_legacy()Pavel Begunkov
io_kbuf_drop() is only used for legacy provided buffers, and so __io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the dead branch out of __io_put_kbuf_list(), rename it into io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c8cc73e2272f09a86ecbdad9ebdd8304f8e583c0.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: open code __io_put_kbuf()Pavel Begunkov
__io_put_kbuf() is a trivial wrapper, open code it into __io_put_kbufs(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9dc17380272b48d56c95992c6f9eaacd5546e1d3.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: remove legacy kbuf cachingPavel Begunkov
Remove all struct io_buffer caches. It makes it a fair bit simpler. Apart from from killing a bunch of lines and juggling between lists, __io_put_kbuf_list() doesn't need ->completion_lock locking now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/18287217466ee2576ea0b1e72daccf7b22c7e856.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: simplify __io_put_kbufPavel Begunkov
As a preparation step remove an optimisation from __io_put_kbuf() trying to use the locked cache. With that __io_put_kbuf_list() is only used with ->io_buffers_comp, and we remove the explicit list argument. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b7f1394ec4afc7f96b35a61f5992e27c49fd067.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: remove legacy kbuf kmem cachePavel Begunkov
Remove the kmem cache used by legacy provided buffers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8195c207d8524d94e972c0c82de99282289f7f5c.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/kbuf: remove legacy kbuf bulk allocationPavel Begunkov
Legacy provided buffers are slow and discouraged in favour of the ring variant. Remove the bulk allocation to keep it simpler as we don't care about performance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a064d70370e590efed8076e9501ae4cfc20fe0ca.1738724373.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-12io_uring/kbuf: reallocate buf lists on upgradePavel Begunkov
IORING_REGISTER_PBUF_RING can reuse an old struct io_buffer_list if it was created for legacy selected buffer and has been emptied. It violates the requirement that most of the field should stay stable after publish. Always reallocate it instead. Cc: stable@vger.kernel.org Reported-by: Pumpkin Chang <pumpkin@devco.re> Fixes: 2fcabce2d7d34 ("io_uring: disallow mixed provided buffer group registrations") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-20Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "Not a lot in terms of features this time around, mostly just cleanups and code consolidation: - Support for PI meta data read/write via io_uring, with NVMe and SCSI covered - Cleanup the per-op structure caching, making it consistent across various command types - Consolidate the various user mapped features into a concept called regions, making the various users of that consistent - Various cleanups and fixes" * tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits) io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname io_uring: reuse io_should_terminate_tw() for cmds io_uring: Factor out a function to parse restrictions io_uring/rsrc: require cloned buffers to share accounting contexts io_uring: simplify the SQPOLL thread check when cancelling requests io_uring: expose read/write attribute capability io_uring/rw: don't gate retry on completion context io_uring/rw: handle -EAGAIN retry at IO completion time io_uring/rw: use io_rw_recycle() from cleanup path io_uring/rsrc: simplify the bvec iter count calculation io_uring: ensure io_queue_deferred() is out-of-line io_uring/rw: always clear ->bytes_done on io_async_rw setup io_uring/rw: use NULL for rw->free_iovec assigment io_uring/rw: don't mask in f_iocb_flags io_uring/msg_ring: Drop custom destructor io_uring: Move old async data allocation helper to header io_uring/rw: Allocate async data through helper io_uring/net: Allocate msghdr async data through helper io_uring/uring_cmd: Allocate async data through generic helper io_uring/poll: Allocate apoll with generic alloc_cache helper ...
2025-01-03io_uring/kbuf: use pre-committed buffer address for non-pollable fileJens Axboe
For non-pollable files, buffer ring consumption will commit upfront. This is fine, but io_ring_buffer_select() will return the address of the buffer after having committed it. For incrementally consumed buffers, this is incorrect as it will modify the buffer address. Store the pre-committed value and return that. If that isn't done, then the initial part of the buffer is not used and the application will correctly assume the content arrived at the start of the userspace buffer, but the kernel will have put it later in the buffer. Or it can cause a spurious -EFAULT returned in the CQE, depending on the buffer size. As bounds are suitably checked for doing the actual IO, no adverse side effects are possible - it's just a data misplacement within the existing buffer. Reported-by: Gwendal Fernet <gwendalfernet@gmail.com> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: fix unintentional sign extension on shift of reg.bgidColin Ian King
Shifting reg.bgid << IORING_OFF_PBUF_SHIFT results in a promotion from __u16 to a 32 bit signed integer, this is then sign extended to a 64 bit unsigned long on 64 bit architectures. If reg.bgid is greater than 0x7fff then this leads to a sign extended result where all the upper 32 bits of mmap_offset are set to 1. Fix this by casting reg.bgid to the same type as mmap_offset before performing the shift. Fixes: ef62de3c4ad5 ("io_uring/kbuf: use region api for pbuf rings") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20241204153923.401674-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/memmap: unify io_uring mmap'ing codePavel Begunkov
All mapped memory is now backed by regions and we can unify and clean up io_region_validate_mmap() and io_uring_mmap(). Extract a function looking up a region, the rest of the handling should be generic and just needs the region. There is one more ring type specific code, i.e. the mmaping size truncation quirk for IORING_OFF_[S,C]Q_RING, which is left as is. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f5e1eda1562bfd34276de07465525ae5f10e1e84.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: use region api for pbuf ringsPavel Begunkov
Convert internal parts of the provided buffer ring managment to the region API. It's the last non-region mapped ring we have, so it also kills a bunch of now unused memmap.c helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6c40cf7beaa648558acd4d84bc0fb3279a35d74b.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: remove pbuf ring refcountingPavel Begunkov
struct io_buffer_list refcounting was needed for RCU based sync with mmap, now we can kill it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a9cc54bf0077bb2bf2f3daf917549ddd41080da.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23io_uring/kbuf: use mmap_lock to sync with mmapPavel Begunkov
A preparation / cleanup patch simplifying the buf ring - mmap synchronisation. Instead of relying on RCU, which is trickier, do it by grabbing the mmap_lock when when anyone tries to publish or remove a registered buffer to / from ->io_bl_xa. Modifications of the xarray should always be protected by both ->uring_lock and ->mmap_lock, while lookups should hold either of them. While a struct io_buffer_list is in the xarray, the mmap related fields like ->flags and ->buf_pages should stay stable. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/af13bde56ee1a26bcaefaa9aad37a9ea318a590e.1732886067.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-16Merge tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - NAPI fixes and cleanups (Pavel, Olivier) - Add support for absolute timeouts (Pavel) - Fixes for io-wq/sqpoll affinities (Felix) - Efficiency improvements for dealing with huge pages (Chenliang) - Support for a minwait mode, where the application essentially has two timouts - one smaller one that defines the batch timeout, and the overall large one similar to what we had before. This enables efficient use of batching based on count + timeout, while still working well with periods of less intensive workloads - Use ITER_UBUF for single segment sends - Add support for incremental buffer consumption. Right now each operation will always consume a full buffer. With incremental consumption, a recv/read operation only consumes the part of the buffer that it needs to satisfy the operation - Add support for GCOV for io_uring, to help retain a high coverage of test to code ratio - Fix regression with ocfs2, where an odd -EOPNOTSUPP wasn't correctly converted to a blocking retry - Add support for cloning registered buffers from one ring to another - Misc cleanups (Anuj, me) * tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linux: (35 commits) io_uring: add IORING_REGISTER_COPY_BUFFERS method io_uring/register: provide helper to get io_ring_ctx from 'fd' io_uring/rsrc: add reference count to struct io_mapped_ubuf io_uring/rsrc: clear 'slot' entry upfront io_uring/io-wq: inherit cpuset of cgroup in io worker io_uring/io-wq: do not allow pinning outside of cpuset io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN io_uring/sqpoll: do not allow pinning outside of cpuset io_uring/eventfd: move refs to refcount_t io_uring: remove unused rsrc_put_fn io_uring: add new line after variable declaration io_uring: add GCOV_PROFILE_URING Kconfig option io_uring/kbuf: add support for incremental buffer consumption io_uring/kbuf: pass in 'len' argument for buffer commit Revert "io_uring: Require zeroed sqe->len on provided-buffers send" io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h io_uring/kbuf: add io_kbuf_commit() helper io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg io_uring: wire up min batch wake timeout ...
2024-08-30io_uring/kbuf: return correct iovec count from classic buffer peekJens Axboe
io_provided_buffers_select() returns 0 to indicate success, but it should be returning 1 to indicate that 1 vec was mapped. This causes peeking to fail with classic provided buffers, and while that's not a use case that anyone should use, it should still work correctly. The end result is that no buffer will be selected, and hence a completion with '0' as the result will be posted, without a buffer attached. Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29io_uring/kbuf: add support for incremental buffer consumptionJens Axboe
By default, any recv/read operation that uses provided buffers will consume at least 1 buffer fully (and maybe more, in case of bundles). This adds support for incremental consumption, meaning that an application may add large buffers, and each read/recv will just consume the part of the buffer that it needs. For example, let's say an application registers 1MB buffers in a provided buffer ring, for streaming receives. If it gets a short recv, then the full 1MB buffer will be consumed and passed back to the application. With incremental consumption, only the part that was actually used is consumed, and the buffer remains the current one. This means that both the application and the kernel needs to keep track of what the current receive point is. Each recv will still pass back a buffer ID and the size consumed, the only difference is that before the next receive would always be the next buffer in the ring. Now the same buffer ID may return multiple receives, each at an offset into that buffer from where the previous receive left off. Example: Application registers a provided buffer ring, and adds two 32K buffers to the ring. Buffer1 address: 0x1000000 (buffer ID 0) Buffer2 address: 0x2000000 (buffer ID 1) A recv completion is received with the following values: cqe->res 0x1000 (4k bytes received) cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0) and the application now knows that 4096b of data is available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in: cqe->res 0x2010 (8k bytes received) cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0) which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is: cqe->res 0x5000 (20k bytes received) cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0) and the application now knows that 20k of data is available at 0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE isn't set, as no more data is available in this buffer ID. The next completion is then: cqe->res 0x1000 (4k bytes received) cqe->flags 0x10001 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1) which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next receive point for this buffer ID. When a buffer will be reused by future CQE completions, IORING_CQE_BUF_MORE will be set in cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that it should expect more completions for this buffer ID. Will only be set by provided buffer rings setup with IOU_PBUF_RING INC, as that's the only type of buffer that will see multiple consecutive completions for the same buffer ID. For any other provided buffer type, any completion that passes back a buffer to the application is final. Once a buffer has been fully consumed, the buffer ring head is incremented and the next receive will indicate the next buffer ID in the CQE cflags. On the send side, the application can manage how much data is sent from an existing buffer by setting sqe->len to the desired send length. An application can request incremental consumption by setting IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of that, any provided buffer ring setup and buffer additions is done like before, no changes there. The only change is in how an application may see multiple completions for the same buffer ID, hence needing to know where the next receive will happen. Note that like existing provided buffer rings, this should not be used with IOSQE_ASYNC, as both really require the ring to remain locked over the duration of the buffer selection and the operation completion. It will consume a buffer otherwise regardless of the size of the IO done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29io_uring/kbuf: pass in 'len' argument for buffer commitJens Axboe
In preparation for needing the consumed length, pass in the length being completed. Unused right now, but will be used when it is possible to partially consume a buffer. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29io_uring/kbuf: move io_ring_head_to_buf() to kbuf.hJens Axboe
In preparation for using this helper in kbuf.h as well, move it there and turn it into a macro. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29io_uring/kbuf: add io_kbuf_commit() helperJens Axboe
Committing the selected ring buffer is currently done in three different spots, combine it into a helper and just call that. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/kbuf: turn io_buffer_list booleans into flagsJens Axboe
We could just move these two and save some space, but in preparation for adding another flag, turn them into flags first. This saves 8 bytes in struct io_buffer_list, making it exactly half a cacheline on 64-bit archs now rather than 40 bytes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-25io_uring/kbuf: use 'bl' directly rather than req->buf_listJens Axboe
req->buf_list is assigned higher up and is safe to use as we remain within a locked region, as is the 'bl' variable itself from which it was assigned. To improve readability, use 'bl' directly rather than get it from the io_kiocb, if we need to increment the head directly in the buffer selection path. This makes it readily apparent that it's the same io_buffer_list being used. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-21io_uring/kbuf: sanitize peek buffer setupJens Axboe
Harden the buffer peeking a bit, by adding a sanity check for it having a valid size. Outside of that, arg->max_len is a size_t, though it's only ever set to a 32-bit value (as it's governed by MAX_RW_COUNT). Bump our needed check to a size_t so we know it fits. Finally, cap the calculated needed iov value to the PEEK_MAX_IMPORT, which is the maximum number of segments that should be peeked. Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-20io_uring: fix error pbuf checkingPavel Begunkov
Syz reports a problem, which boils down to NULL vs IS_ERR inconsistent error handling in io_alloc_pbuf_ring(). KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:__io_remove_buffers+0xac/0x700 io_uring/kbuf.c:341 Call Trace: <TASK> io_put_bl io_uring/kbuf.c:378 [inline] io_destroy_buffers+0x14e/0x490 io_uring/kbuf.c:392 io_ring_ctx_free+0xa00/0x1070 io_uring/io_uring.c:2613 io_ring_exit_work+0x80f/0x8a0 io_uring/io_uring.c:2844 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd40 kernel/workqueue.c:3390 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Cc: stable@vger.kernel.org Reported-by: syzbot+2074b1a3d447915c6f1c@syzkaller.appspotmail.com Fixes: 87585b05757dc ("io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c5f9df20560bd9830401e8e48abc029e7cfd9f5e.1721329239.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-22io_uring/kbuf: add helpers for getting/peeking multiple buffersJens Axboe
Our provided buffer interface only allows selection of a single buffer. Add an API that allows getting/peeking multiple buffers at the same time. This is only implemented for the ring provided buffers. It could be added for the legacy provided buffers as well, but since it's strongly encouraged to use the new interface, let's keep it simpler and just provide it for the new API. The legacy interface will always just select a single buffer. There are two new main functions: io_buffers_select(), which selects up as many buffers as it can. The caller supplies the iovec array, and io_buffers_select() may allocate a bigger array if the 'out_len' being passed in is non-zero and bigger than what fits in the provided iovec. Buffers grabbed with this helper are permanently assigned. io_buffers_peek(), which works like io_buffers_select(), except they can be recycled, if needed. Callers using either of these functions should call io_put_kbufs() rather than io_put_kbuf() at completion time. The peek interface must be called with the ctx locked from peek to completion. This add a bit state for the request: - REQ_F_BUFFERS_COMMIT, which means that the the buffers have been peeked and should be committed to the buffer ring head when they are put as part of completion. Prior to this, req->buf_list was cleared to NULL when committed. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/kbuf: remove dead defineJens Axboe
We no longer use IO_BUFFER_LIST_BUF_PER_PAGE, kill it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: move mapping/allocation helpers to a separate fileJens Axboe
Move the related code from io_uring.c into memmap.c. No functional changes in this patch, just cleaning it up a bit now that the full transition is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: use unpin_user_pages() where appropriateJens Axboe
There are a few cases of open-rolled loops around unpin_user_page(), use the generic helper instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ringJens Axboe
Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_page() and have it Just Work. This requires a bit of effort on the mmap lookup side, as the ctx uring_lock isn't held, which otherwise protects buffer_lists from being torn down, and it's not safe to grab from mmap context that would introduce an ABBA deadlock between the mmap lock and the ctx uring_lock. Instead, lookup the buffer_list under RCU, as the the list is RCU freed already. Use the existing reference count to determine whether it's possible to safely grab a reference to it (eg if it's not zero already), and drop that reference when done with the mapping. If the mmap reference is the last one, the buffer_list and the associated memory can go away, since the vma insertion has references to the inserted pages at that point. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/kbuf: vmap pinned buffer ringJens Axboe
This avoids needing to care about HIGHMEM, and it makes the buffer indexing easier as both ring provided buffer methods are now virtually mapped in a contigious fashion. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: hold io_buffer_list reference over mmapJens Axboe
If we look up the kbuf, ensure that it doesn't get unregistered until after we're done with it. Since we're inside mmap, we cannot safely use the io_uring lock. Rely on the fact that we can lookup the buffer list under RCU now and grab a reference to it, preventing it from being unregistered until we're done with it. The lookup returns the io_buffer_list directly with it referenced. Cc: stable@vger.kernel.org # v6.4+ Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: protect io_buffer_list teardown with a referenceJens Axboe
No functional changes in this patch, just in preparation for being able to keep the buffer list alive outside of the ctx->uring_lock. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: get rid of bl->is_readyJens Axboe
Now that xarray is being exclusively used for the buffer_list lookup, this check is no longer needed. Get rid of it and the is_ready member. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: get rid of lower BGID listsJens Axboe
Just rely on the xarray for any kind of bgid. This simplifies things, and it really doesn't bring us much, if anything. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring/kbuf: rename is_mappedPavel Begunkov
In buffer lists we have ->is_mapped as well as ->is_mmap, it's pretty hard to stay sane double checking which one means what, and in the long run there is a high chance of an eventual bug. Rename ->is_mapped into ->is_buf_ring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4838f4d8ad506ad6373f1c305aee2d2c1a89786.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLEJens Axboe
We only use the flag for this purpose, so rename it accordingly. This further prevents various other use cases of it, keeping it clean and consistent. Then we can also check it in one spot, when it's being attempted recycled, and remove some dead code in io_kbuf_recycle_ring(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/kbuf: flag request if buffer pool is empty after buffer pickJens Axboe
Normally we do an extra roundtrip for retries even if the buffer pool has depleted, as we don't check that upfront. Rather than add this check, have the buffer selection methods mark the request with REQ_F_BL_EMPTY if the used buffer group is out of buffers after this selection. This is very cheap to do once we're all the way inside there anyway, and it gives the caller a chance to make better decisions on how to proceed. For example, recv/recvmsg multishot could check this flag when it decides whether to keep receiving or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/kbuf: cleanup passing back cflagsJens Axboe
We have various functions calculating the CQE cflags we need to pass back, but it's all the same everywhere. Make a number of the putting functions void, and just have the two main helps for this, io_put_kbuf() and io_put_kbuf_comp() calculate the actual mask and pass it back. While at it, cleanup how we put REQ_F_BUFFER_RING buffers. Before this change, we would call into __io_put_kbuf() only to go right back in to the header defined functions. As clearing this type of buffer is just re-assigning the buf_index and incrementing the head, this is very wasteful. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: add io_file_can_poll() helperJens Axboe
This adds a flag to avoid dipping dereferencing file and then f_op to figure out if the file has a poll handler defined or not. We generally call this at least twice for networked workloads, and if using ring provided buffers, we do it on every buffer selection. Particularly the latter is troublesome, as it's otherwise a very fast operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-21io_uring/kbuf: add method for returning provided buffer ring headJens Axboe
The tail of the provided ring buffer is shared between the kernel and the application, but the head is private to the kernel as the application doesn't need to see it. However, this also prevents the application from knowing how many buffers the kernel has consumed. Usually this is fine, as the information is inherently racy in that the kernel could be consuming buffers continually, but for cleanup purposes it may be relevant to know how many buffers are still left in the ring. Add IORING_REGISTER_PBUF_STATUS which will return status for a given provided buffer ring. Right now it just returns the head, but space is reserved for more information later in, if needed. Link: https://github.com/axboe/liburing/discussions/1020 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-05io_uring/kbuf: check for buffer list readiness after NULL checkJens Axboe
Move the buffer list 'is_ready' check below the validity check for the buffer list for a given group. Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-05io_uring/kbuf: Fix an NULL vs IS_ERR() bug in io_alloc_pbuf_ring()Dan Carpenter
The io_mem_alloc() function returns error pointers, not NULL. Update the check accordingly. Fixes: b10b73c102a2 ("io_uring/kbuf: recycle freed mapped buffer ring entries") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/5ed268d3-a997-4f64-bd71-47faa92101ab@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring: free io_buffer_list entries via RCUJens Axboe
mmap_lock nests under uring_lock out of necessity, as we may be doing user copies with uring_lock held. However, for mmap of provided buffer rings, we attempt to grab uring_lock with mmap_lock already held from do_mmap(). This makes lockdep, rightfully, complain: WARNING: possible circular locking dependency detected 6.7.0-rc1-00009-gff3337ebaf94-dirty #4438 Not tainted ------------------------------------------------------ buf-ring.t/442 is trying to acquire lock: ffff00020e1480a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_uring_validate_mmap_request.isra.0+0x4c/0x140 but task is already holding lock: ffff0000dc226190 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x124/0x264 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_lock){++++}-{3:3}: __might_fault+0x90/0xbc io_register_pbuf_ring+0x94/0x488 __arm64_sys_io_uring_register+0x8dc/0x1318 invoke_syscall+0x5c/0x17c el0_svc_common.constprop.0+0x108/0x130 do_el0_svc+0x2c/0x38 el0_svc+0x4c/0x94 el0t_64_sync_handler+0x118/0x124 el0t_64_sync+0x168/0x16c -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x19a0/0x2d14 lock_acquire+0x2e0/0x44c __mutex_lock+0x118/0x564 mutex_lock_nested+0x20/0x28 io_uring_validate_mmap_request.isra.0+0x4c/0x140 io_uring_mmu_get_unmapped_area+0x3c/0x98 get_unmapped_area+0xa4/0x158 do_mmap+0xec/0x5b4 vm_mmap_pgoff+0x158/0x264 ksys_mmap_pgoff+0x1d4/0x254 __arm64_sys_mmap+0x80/0x9c invoke_syscall+0x5c/0x17c el0_svc_common.constprop.0+0x108/0x130 do_el0_svc+0x2c/0x38 el0_svc+0x4c/0x94 el0t_64_sync_handler+0x118/0x124 el0t_64_sync+0x168/0x16c From that mmap(2) path, we really just need to ensure that the buffer list doesn't go away from underneath us. For the lower indexed entries, they never go away until the ring is freed and we can always sanely reference those as long as the caller has a file reference. For the higher indexed ones in our xarray, we just need to ensure that the buffer list remains valid while we return the address of it. Free the higher indexed io_buffer_list entries via RCU. With that we can avoid needing ->uring_lock inside mmap(2), and simply hold the RCU read lock around the buffer list lookup and address check. To ensure that the arrayed lookup either returns a valid fully formulated entry via RCU lookup, add an 'is_ready' flag that we access with store and release memory ordering. This isn't needed for the xarray lookups, but doesn't hurt either. Since this isn't a fast path, retain it across both types. Similarly, for the allocated array inside the ctx, ensure we use the proper load/acquire as setup could in theory be running in parallel with mmap. While in there, add a few lockdep checks for documentation purposes. Cc: stable@vger.kernel.org Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: prune deferred locked cache when tearing downJens Axboe
We used to just use our page list for final teardown, which would ensure that we got all the buffers, even the ones that were not on the normal cached list. But while moving to slab for the io_buffers, we know only prune this list, not the deferred locked list that we have. This can cause a leak of memory, if the workload ends up using the intermediate locked list. Fix this by always pruning both lists when tearing down. Fixes: b3a4dbc89d40 ("io_uring/kbuf: Use slab for struct io_buffer objects") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: recycle freed mapped buffer ring entriesJens Axboe
Right now we stash any potentially mmap'ed provided ring buffer range for freeing at release time, regardless of when they get unregistered. Since we're keeping track of these ranges anyway, keep track of their registration state as well, and use that to recycle ranges when appropriate rather than always allocate new ones. The lookup is a basic scan of entries, checking for the best matching free entry. Fixes: c392cbecd8ec ("io_uring/kbuf: defer release of mapped buffer rings") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: defer release of mapped buffer ringsJens Axboe
If a provided buffer ring is setup with IOU_PBUF_RING_MMAP, then the kernel allocates the memory for it and the application is expected to mmap(2) this memory. However, io_uring uses remap_pfn_range() for this operation, so we cannot rely on normal munmap/release on freeing them for us. Stash an io_buf_free entry away for each of these, if any, and provide a helper to free them post ->release(). Cc: stable@vger.kernel.org Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring: indicate if io_kbuf_recycle did recycle anythingDylan Yudaken
It can be useful to know if io_kbuf_recycle did actually recycle the buffer on the request, or if it left the request alone. Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-2-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>