path: root/io_uring
Age  Commit message  Author
2024-10-29  io_uring: specify freeptr usage for SLAB_TYPESAFE_BY_RCU io_kiocb cache  (Jens Axboe)
Doesn't matter right now as there are still some bytes left for it, but let's prepare for the io_kiocb potentially growing and add a specific freeptr offset for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/rsrc: move struct io_fixed_file to rsrc.h header  (Jens Axboe)
There's no need for this internal structure to be visible, move it to the private rsrc.h header instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/nop: add support for testing registered files and buffers  (Jens Axboe)
Useful for testing performance/efficiency impact of registered files and buffers, vs (particularly) non-registered files. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring: add support for fixed wait regions  (Jens Axboe)
Generally applications have one or a few types of waits, yet they pass in a struct io_uring_getevents_arg every time. This needs to get copied and, in turn, the timeout value needs to get copied. Rather than do this for every invocation, allow the application to register a fixed set of wait regions that can simply be indexed when asking the kernel to wait on events. At ring setup time, the application can register a number of these wait regions and initialize region/index 0 upfront: struct io_uring_reg_wait *reg; reg = io_uring_setup_reg_wait(ring, nr_regions, &ret); /* set timeout and mark as set, sigmask/sigmask_sz as needed */ reg->ts.tv_sec = 0; reg->ts.tv_nsec = 100000; reg->flags = IORING_REG_WAIT_TS; where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(*reg). The above initializes index 0, but 63 other regions can be initialized, if needed. Now, instead of doing: struct __kernel_timespec timeout = { .tv_nsec = 100000, }; io_uring_submit_and_wait_timeout(ring, &cqe, nr, &timeout, NULL); to wait for events for each submit_and_wait, or just wait, operation, it can just reference the above region at offset 0 and do: io_uring_submit_and_wait_reg(ring, &cqe, nr, 0); to achieve the same goal of waiting 100usec without needing to copy both struct io_uring_getevents_arg (24b) and struct __kernel_timespec (16b) for each invocation. Struct io_uring_reg_wait looks as follows: struct io_uring_reg_wait { struct __kernel_timespec ts; __u32 min_wait_usec; __u32 flags; __u64 sigmask; __u32 sigmask_sz; __u32 pad[3]; __u64 pad2[2]; }; embedding the timeout itself in the region, rather than passing it as a pointer as well. Note that the signal mask is still passed as a pointer, both for compatibility reasons, but also because there doesn't seem to be a lot of high frequency wait scenarios that involve setting and resetting the signal mask for each wait. The application is free to modify any region before a wait call, or it can keep multiple regions with different settings to avoid needing to modify the same one for each wait call. Up to a page size of regions is mapped by default, allowing PAGE_SIZE / 64 available regions for use. The registered region must fit within a page. On a 4kb page size system, that allows for 64 wait regions if a full page is used, as the size of struct io_uring_reg_wait is 64b. The registered region must be a multiple of struct io_uring_reg_wait in size. It's valid to register fewer than 64 entries. In network performance testing with zero-copy, this reduced the time spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4% to 0.3%. Wait regions are fixed for the lifetime of the ring - once registered, they are persistent until the ring is torn down. The regions support minimum wait timeout as well as the regular waits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
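A minimal usage sketch based on the helpers named above (io_uring_setup_reg_wait() and io_uring_submit_and_wait_reg(); exact liburing signatures and error handling may differ):

    struct io_uring_reg_wait *reg;
    struct io_uring_cqe *cqe;
    int ret;

    /* map a page worth of wait regions (64 on a 4k page size system) */
    reg = io_uring_setup_reg_wait(&ring, 64, &ret);
    if (!reg)
            return ret;

    /* region 0: wait for up to 100 usec */
    reg[0].ts.tv_sec = 0;
    reg[0].ts.tv_nsec = 100000;
    reg[0].flags = IORING_REG_WAIT_TS;

    /* submit pending SQEs and wait for one CQE, using region index 0 */
    ret = io_uring_submit_and_wait_reg(&ring, &cqe, 1, 0);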
2024-10-29  io_uring: change io_get_ext_arg() to use uaccess begin + end  (Jens Axboe)
In scenarios where a high frequency of wait events are seen, the copy of the struct io_uring_getevents_arg is quite noticeable in the profiles in terms of time spent. It can be seen as up to 3.5-4.5%. Rewrite the copy-in logic, saving about 0.5% of the time. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring: switch struct ext_arg from __kernel_timespec to timespec64  (Jens Axboe)
This avoids intermediate storage for turning a __kernel_timespec user pointer into an on-stack struct timespec64, only then to turn it into a ktime_t. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/sqpoll: wait on sqd->wait for thread parking  (Jens Axboe)
io_sqd_handle_event() just does a mutex unlock/lock dance when it's supposed to park, somewhat relying on full ordering with the thread trying to park it, which does a similar unlock/lock dance on sqd->lock. However, with adaptive spinning on mutexes, this can waste an awful lot of time. Normally this isn't very noticeable, as parking and unparking the thread isn't a common (or fast path) occurrence. However, the ring resizing test exercises exactly that, as each resize requires the SQPOLL thread to safely park and unpark. Have io_sq_thread_park() explicitly wait on sqd->park_pending being zero before attempting to grab the sqd->lock again. In a resize test, this brings the runtime of SQPOLL down from about 60 seconds to a few seconds, just like the !SQPOLL tests. And it saves a ton of spinning time on the mutex, on both sides. Signed-off-by: Jens Axboe <axboe@kernel.dk>
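A hedged sketch of the waiting side (the exact placement in the source differs; sqd->park_pending is an atomic_t and sqd->wait is a waitqueue):

    /* instead of immediately contending on sqd->lock again, let the
     * parking side finish its park/unpark cycle first */
    mutex_unlock(&sqd->lock);
    wait_event(sqd->wait, !atomic_read(&sqd->park_pending));
    mutex_lock(&sqd->lock);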
2024-10-29  io_uring/register: add IORING_REGISTER_RESIZE_RINGS  (Jens Axboe)
Once a ring has been created, the sizes of the CQ and SQ rings are fixed. Usually this isn't a problem on the SQ ring side, as it merely controls the available number of requests that can be submitted in a single system call, and there's rarely a need to change that. For the CQ ring, it's a different story. For most efficient use of io_uring, it's important that the CQ ring never overflows. This means that applications must size it for the worst case scenario, which can be wasteful. Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize the existing rings. It takes a struct io_uring_params argument, the same one which is used to setup the ring initially, and resizes rings according to the sizes given. Certain properties are always inherited from the original ring setup, like SQE128/CQE32 and other setup options. The implementation only allows flags associated with how the CQ ring is sized and clamped. Existing unconsumed SQE and CQE entries are copied as part of the process. If either the SQ or CQ resized destination ring cannot hold the entries already present in the source rings, then the operation is failed with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new submissions, and the internal mapping holds the completion lock as well across moving CQ ring state. To prevent races between mmap and ring resizing, add a mutex that's solely used to serialize ring resize and mmap. mmap_sem can't be used here, as a forked process may be doing mmaps on the ring as well. The ctx->resize_lock is held across mmap operations, and the resize will grab it before swapping out the already mapped new data. Signed-off-by: Jens Axboe <axboe@kernel.dk>
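A rough userspace sketch (using io_uring_register(2) directly, in the style of the MSG_RING example further down; the entry counts are illustrative):

    struct io_uring_params p = { };

    p.flags = IORING_SETUP_CQSIZE;   /* only CQ sizing/clamping flags apply */
    p.sq_entries = 32;
    p.cq_entries = 4096;             /* grow the CQ ring to avoid overflow */

    ret = io_uring_register(ring_fd, IORING_REGISTER_RESIZE_RINGS, &p, 1);
    /* -EOVERFLOW: unconsumed entries don't fit in the resized rings */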
2024-10-29  io_uring/memmap: explicitly return -EFAULT for mmap on NULL rings  (Jens Axboe)
The later mapping will actually check this too, but in terms of code clarity, explicitly check whether or not the rings and sqes are valid during validation. That makes it explicit that if they are non-NULL, they are valid and can get mapped. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring: abstract out a bit of the ring filling logic  (Jens Axboe)
Abstract out an io_uring_fill_params() helper, which fills out the necessary bits of struct io_uring_params. Add it to io_uring.h as well, in preparation for having another internal user of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring: move max entry definition and ring sizing into header  (Jens Axboe)
In preparation for needing this somewhere else, move the definitions for the maximum CQ and SQ ring size into io_uring.h. Make the rings_size() helper available as well, and have it take just the setup flags argument rather than the fill ring pointer. That's all that is needed. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/net: clean up io_msg_copy_hdr  (Pavel Begunkov)
Put sr->umsg into a local variable, so it doesn't repeat "sr->umsg->" for every field. It looks nicer, and without the patch it likely compiles into a bunch of umsg memory reads. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/26c2f30b491ea7998bfdb5bb290662572a61064d.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/net: don't alias send user pointer reads  (Pavel Begunkov)
We keep user pointers in a union, which could be a user buffer or a user pointer to msghdr. What is confusing is that it potentially reads and assigns sqe->addr as one type but then uses it as another via the union. Even more, it's not even consistent across the copy and zerocopy versions. Make the send and sendmsg setup helpers read sqe->addr and treat it as the right type from the beginning. The end goal would be to get rid of the use of struct io_sr_msg::umsg for send requests, as we only need it at the prep side. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/685d788605f5d78af18802fcabf61ba65cfd8002.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/net: don't store send address ptr  (Pavel Begunkov)
For non "msg" requests we copy the address at the prep stage and there is no need to store the address user pointer long term. Pass the SQE into io_send_setup(), let it parse it, and remove struct io_sr_msg addr addr_len fields. It saves some space and also less confusing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/db3dce544e17ca9d4b17d2506fbbac1da8a87824.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/net: split send and sendmsg prep helpers  (Pavel Begunkov)
A preparation patch splitting io_sendmsg_prep_setup into two separate helpers for send and sendmsg variants. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1a2319471ba040e053b7f1d22f4af510d1118eca.1729607201.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/net: move send zc fixed buffer import to issue path  (Jens Axboe)
Let's keep it close with the actual import, there's no reason to do this on the prep side. With that, we can drop one of the branches checking for whether or not IORING_RECVSEND_FIXED_BUF is set. As a side-effect, get rid of req->imu usage. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring: remove 'issue_flags' argument for io_req_set_rsrc_node()  (Jens Axboe)
All callers already hold the ring lock and hence are passing '0', remove the argument and the conditional locking that it controlled. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/rw: get rid of using req->imu  (Jens Axboe)
It's assigned in the same function in which it's used, so get rid of it. A local variable will do just fine. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/uring_cmd: get rid of using req->imu  (Jens Axboe)
It's pretty pointless to use io_kiocb as intermediate storage for this, so split the validity check and the actual usage. The resource node is assigned upfront at prep time, to prevent it from going away. The actual import is never called with the ctx->uring_lock held, so grab it for the import. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/rsrc: don't assign bvec twice in io_import_fixed()  (Jens Axboe)
iter->bvec is already set to imu->bvec - remove the one dead assignment and turn the other one into an addition instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
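Roughly, in io_import_fixed() (a sketch; surrounding code and variable names elided):

    /* iov_iter_bvec() already pointed iter->bvec at imu->bvec, so just
     * advance it rather than re-deriving it from imu->bvec */
    /* before: iter->bvec = imu->bvec + seg_skip; */
    iter->bvec += seg_skip;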
2024-10-29  io_uring: clean up cqe trace points  (Pavel Begunkov)
We have too many helpers posting CQEs, instead of tracing completion events before filling in a CQE and thus having to pass all the data, set the CQE first, pass it to the tracing helper and let it extract everything it needs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b83c1ca9ee5aed2df0f3bb743bf5ed699cce4c86.1729267437.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring: static_key for !IORING_SETUP_NO_SQARRAY  (Pavel Begunkov)
IORING_SETUP_NO_SQARRAY should be preferred and used by default by liburing, optimise flag checking in io_get_sqe() with a static key. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c164a48542fbb080115e2377ecf160c758562742.1729264988.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
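A sketch of the pattern (the key name and exact placement are illustrative, not necessarily what the patch uses):

    DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);

    /* ring setup: count rings that still use the SQ indirection array */
    if (!(p->flags & IORING_SETUP_NO_SQARRAY))
            static_branch_inc(&io_key_has_sqarray);

    /* io_get_sqe() hot path: the branch is patched out entirely while no
     * ring on the system uses the SQ array */
    if (static_branch_unlikely(&io_key_has_sqarray) &&
        !(ctx->flags & IORING_SETUP_NO_SQARRAY)) {
            /* translate the SQE index through the SQ array */
    }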
2024-10-29  io_uring: kill io_llist_xchg  (Pavel Begunkov)
io_llist_xchg is only used to set the list to NULL, which can also be done with llist_del_all(). Use the latter and kill io_llist_xchg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6765112680d2e86a58b76166b7513391ff4e5d7.1729264960.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
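The replacement is essentially a one-liner:

    /* before: node = io_llist_xchg(&ctx->work_llist, NULL); */
    struct llist_node *node = llist_del_all(&ctx->work_llist);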
2024-10-29  io_uring: move cancel hash tables to kvmalloc/kvfree  (Jens Axboe)
Convert to using kvmalloc/kvfree() for the hash tables, and while at it, make it handle low memory situations better. Signed-off-by: Jens Axboe <axboe@kernel.dk>
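A hedged sketch of the allocation shape (bucket count illustrative):

    /* kvmalloc falls back to vmalloc when contiguous memory is tight;
     * the teardown side then pairs this with kvfree(table->hbs) */
    table->hbs = kvmalloc_array(nr_buckets, sizeof(table->hbs[0]), GFP_KERNEL);
    if (!table->hbs)
            return -ENOMEM;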
2024-10-29  io_uring/cancel: get rid of init_hash_table() helper  (Jens Axboe)
All it does is initialize the lists, just move the INIT_HLIST_HEAD() into the one caller. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/poll: get rid of per-hashtable bucket locks  (Jens Axboe)
Any access to the table is protected by ctx->uring_lock now anyway, the per-bucket locking doesn't buy us anything. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/poll: get rid of io_poll_tw_hash_eject()  (Jens Axboe)
It serves no purpose anymore, all it does is delete the hash list entry. task_work always has the ring locked. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/poll: get rid of unlocked cancel hash  (Jens Axboe)
io_uring maintains two hash lists of inflight requests: 1) ctx->cancel_table_locked. This is used when the caller has the ctx->uring_lock held already. This is only an issue side parameter, as removal or task_work will always have it held. 2) ctx->cancel_table. This is used when the issuer does NOT have the ctx->uring_lock held, and relies on the table spinlocks for access. However, it's pretty trivial to simply grab the lock in the one spot where we care about it, for insertion. With that, we can kill the unlocked table (and get rid of the _locked postfix for the other one). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/poll: remove 'ctx' argument from io_poll_req_delete()  (Jens Axboe)
It's always req->ctx being used anyway, having this as a separate argument (that is then not even used) just makes it more confusing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/msg_ring: add support for sending a sync message  (Jens Axboe)
Normally MSG_RING requires both a source and a destination ring. But some users don't always have a ring available to send a message from, yet they still need to notify a target ring. Add support for using io_uring_register(2) without having a source ring, using a file descriptor of -1 for that. Internally those are called blind registration opcodes. Implement IORING_REGISTER_SEND_MSG_RING as a blind opcode, which simply takes an sqe that the application can put on the stack and use the normal liburing helpers to initialize it. Then the app can call: io_uring_register(-1, IORING_REGISTER_SEND_MSG_RING, &sqe, 1); and get the same behavior in terms of the target, where a CQE is posted with the details given in the sqe. For now this takes a single sqe pointer argument, and hence arg must be set to that, and nr_args must be 1. Could easily be extended to take an array of sqes, but for now let's keep it simple. Link: https://lore.kernel.org/r/20240924115932.116167-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
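A slightly fuller sketch of the above, filling the on-stack SQE with the existing liburing prep helper (error handling elided; the user_data value is illustrative):

    struct io_uring_sqe sqe;

    memset(&sqe, 0, sizeof(sqe));
    /* post a CQE with user_data 0x1234 to the target ring */
    io_uring_prep_msg_ring(&sqe, target_ring_fd, 0, 0x1234, 0);

    /* fd of -1: no source ring, this is a "blind" registration opcode */
    ret = io_uring_register(-1, IORING_REGISTER_SEND_MSG_RING, &sqe, 1);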
2024-10-29  io_uring/msg_ring: refactor a few helper functions  (Jens Axboe)
Mostly just to avoid them taking an io_kiocb; instead, pass in the ctx and io_msg directly. In preparation for being able to issue a MSG_RING request without having an io_kiocb. No functional changes in this patch. Link: https://lore.kernel.org/r/20240924115932.116167-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/eventfd: move ctx->evfd_last_cq_tail into io_ev_fd  (Jens Axboe)
Everything else about the io_uring eventfd support is nicely kept private to that code, except the cached_cq_tail tracking. With everything else in place, move io_eventfd_flush_signal() to using the ev_fd grab+release helpers, which then enables the direct use of io_ev_fd for this tracking too. Link: https://lore.kernel.org/r/20240921080307.185186-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/eventfd: abstract out ev_fd grab + release helpers  (Jens Axboe)
In preparation for needing the ev_fd grabbing (and releasing) from another path, abstract out two helpers for that. Link: https://lore.kernel.org/r/20240921080307.185186-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/eventfd: move trigger check into a helper  (Jens Axboe)
It's a bit hard to read what guards the triggering, move it into a helper and add a comment explaining it too. This additionally moves the ev_fd == NULL check in there as well. Link: https://lore.kernel.org/r/20240921080307.185186-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/eventfd: move actual signaling part into separate helper  (Jens Axboe)
In preparation for using this from multiple spots, move the signaling into a helper. Link: https://lore.kernel.org/r/20240921080307.185186-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/eventfd: check for the need to async notifier earlier  (Jens Axboe)
It's not necessary to do this post grabbing a reference. With that, we can drop the out goto path as well. Link: https://lore.kernel.org/r/20240921080307.185186-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/eventfd: abstract out ev_fd put helper  (Jens Axboe)
We call this in two spots, so add a helper for it. In preparation for extending this part. Link: https://lore.kernel.org/r/20240921080307.185186-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19  io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE  (Jens Axboe)
Rejection of IOSQE_FIXED_FILE combined with IORING_OP_[GS]ETXATTR is fine - these do not take a file descriptor, so such a combination makes no sense. The checks are misplaced, though - as it is, they trigger on IORING_OP_F[GS]ETXATTR as well, and those do take a file reference, no matter the origin. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-19  io_uring/rw: fix wrong NOWAIT check in io_rw_init_file()  (Jens Axboe)
A previous commit improved how !FMODE_NOWAIT is dealt with, but inadvertently negated a check whilst doing so. This caused -EAGAIN to be returned from reading files with O_NONBLOCK set. Fix up the check for REQ_F_SUPPORT_NOWAIT. Reported-by: Julian Orth <ju.orth@gmail.com> Link: https://github.com/axboe/liburing/issues/1270 Fixes: f7c913438533 ("io_uring/rw: allow pollable non-blocking attempts for !FMODE_NOWAIT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-17  io_uring/sqpoll: ensure task state is TASK_RUNNING when running task_work  (Jens Axboe)
When the sqpoll is exiting and cancels pending work items, it may need to run task_work. If this happens from within io_uring_cancel_generic(), then it may be waiting on the io_uring_task waitqueue. This results in the below splat from the scheduler, as an attempt may be made to grab the ring mutex while in a TASK_INTERRUPTIBLE state. Ensure that the task state is set appropriately for that, just like what is done for the other cases in io_run_task_work().

do not call blocking ops when !TASK_RUNNING; state=1 set at [<0000000029387fd2>] prepare_to_wait+0x88/0x2fc
WARNING: CPU: 6 PID: 59939 at kernel/sched/core.c:8561 __might_sleep+0xf4/0x140
Modules linked in:
CPU: 6 UID: 0 PID: 59939 Comm: iou-sqp-59938 Not tainted 6.12.0-rc3-00113-g8d020023b155 #7456
Hardware name: linux,dummy-virt (DT)
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : __might_sleep+0xf4/0x140
lr : __might_sleep+0xf4/0x140
sp : ffff80008c5e7830
x29: ffff80008c5e7830 x28: ffff0000d93088c0 x27: ffff60001c2d7230
x26: dfff800000000000 x25: ffff0000e16b9180 x24: ffff80008c5e7a50
x23: 1ffff000118bcf4a x22: ffff0000e16b9180 x21: ffff0000e16b9180
x20: 000000000000011b x19: ffff80008310fac0 x18: 1ffff000118bcd90
x17: 30303c5b20746120 x16: 74657320313d6574 x15: 0720072007200720
x14: 0720072007200720 x13: 0720072007200720 x12: ffff600036c64f0b
x11: 1fffe00036c64f0a x10: ffff600036c64f0a x9 : dfff800000000000
x8 : 00009fffc939b0f6 x7 : ffff0001b6327853 x6 : 0000000000000001
x5 : ffff0001b6327850 x4 : ffff600036c64f0b x3 : ffff8000803c35bc
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000e16b9180
Call trace:
 __might_sleep+0xf4/0x140
 mutex_lock+0x84/0x124
 io_handle_tw_list+0xf4/0x260
 tctx_task_work_run+0x94/0x340
 io_run_task_work+0x1ec/0x3c0
 io_uring_cancel_generic+0x364/0x524
 io_sq_thread+0x820/0x124c
 ret_from_fork+0x10/0x20

Cc: stable@vger.kernel.org Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-16  io_uring/rsrc: ignore dummy_ubuf for buffer cloning  (Jens Axboe)
For placeholder buffers, &dummy_ubuf is assigned which is a static value. When buffers are attempted cloned, don't attempt to grab a reference to it, as we both don't need it and it'll actively fail as dummy_ubuf doesn't have a valid reference count setup. Link: https://lore.kernel.org/io-uring/Zw8dkUzsxQ5LgAJL@ly-workstation/ Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-15  io_uring/sqpoll: close race on waiting for sqring entries  (Jens Axboe)
When an application uses SQPOLL, it must wait for the SQPOLL thread to consume SQE entries, if it fails to get an sqe when calling io_uring_get_sqe(). It can do so by calling io_uring_enter(2) with the flag value of IORING_ENTER_SQ_WAIT. In liburing, this is generally done with io_uring_sqring_wait(). There's a natural expectation that once this call returns, a new SQE entry can be retrieved, filled out, and submitted. However, the kernel uses the cached sq head to determine if the SQRING is full or not. If the SQPOLL thread is currently in the process of submitting SQE entries, it may have updated the cached sq head, but not yet committed it to the SQ ring. Hence the kernel may find that there are SQE entries ready to be consumed, and return successfully to the application. If the SQPOLL thread hasn't yet committed the SQ ring entries by the time the application returns to userspace and attempts to get a new SQE, it will fail getting a new SQE. Fix this by having io_sqring_full() always use the user visible SQ ring head entry, rather than the internally cached one. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/discussions/1267 Reported-by: Benedek Thaler <thaler@thaler.hu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
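A hedged sketch of the resulting check (the exact kernel code may differ slightly):

    static inline bool io_sqring_full(struct io_ring_ctx *ctx)
    {
            struct io_rings *r = ctx->rings;

            /* before: compared against ctx->cached_sq_head, which SQPOLL may
             * have advanced before committing the entries to the SQ ring */
            return READ_ONCE(r->sq.tail) - READ_ONCE(r->sq.head) == ctx->sq_entries;
    }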
2024-10-07  remove pointless includes of <linux/fdtable.h>  (Al Viro)
some of those used to be needed, some had been cargo-culted for no reason... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-06  io_uring/rw: allow pollable non-blocking attempts for !FMODE_NOWAIT  (Jens Axboe)
The checking for whether or not io_uring can do a non-blocking read or write attempt is gated on FMODE_NOWAIT. However, if the file is pollable, it's feasible to just check if it's currently in a state in which it can sanely receive or send _some_ data. This avoids unnecessary io-wq punts, and repeated worthless retries before doing that punt, by assuming that some data can get delivered or received if poll tells us that is true. It also allows multishot reads to properly work with these types of files, enabling a bit of a cleanup of the logic that: c9d952b9103b ("io_uring/rw: fix cflags posting for single issue multishot read") had to put in place. Signed-off-by: Jens Axboe <axboe@kernel.dk>
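A hedged, simplified sketch of the idea only (not the literal kernel change; poll_mask stands for EPOLLIN on reads or EPOLLOUT on writes):

    /* FMODE_NOWAIT is absent, but the file is pollable: only do the
     * non-blocking attempt when poll says some data can currently move */
    if (!(file->f_mode & FMODE_NOWAIT) && file_can_poll(file)) {
            if (!(vfs_poll(file, NULL) & poll_mask))
                    return -EAGAIN;   /* let poll arming handle the retry */
    }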
2024-10-06  io_uring/rw: fix cflags posting for single issue multishot read  (Jens Axboe)
If multishot gets disabled, and hence the request will get terminated rather than persist for more iterations, then posting the CQE with the right cflags is still important. Most notably, the buffer reference needs to be included. Refactor the return of __io_read() a bit, so that the provided buffer is always put correctly, and hence returned to the application. Reported-by: Sharon Rosner <Sharon Rosner> Link: https://github.com/axboe/liburing/issues/1257 Cc: stable@vger.kernel.org Fixes: 2a975d426c82 ("io_uring/rw: don't allow multishot reads without NOWAIT support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-30  io_uring/net: harden multishot termination case for recv  (Jens Axboe)
If the recv returns zero, or an error, then it doesn't matter if more data has already been received for this buffer. A condition like that should terminate the multishot receive. Rather than pass in the collected return value, pass in whether to terminate or keep the recv going separately. Note that this isn't a bug right now, as the only way to get there is via setting MSG_WAITALL with multishot receive. And if an application does that, then -EINVAL is returned anyway. But it seems like an easy bug to introduce, so let's make it a bit more explicit. Link: https://github.com/axboe/liburing/issues/1246 Cc: stable@vger.kernel.org Fixes: b3fdea6ecb55 ("io_uring: multishot recv") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-24  io_uring: fix casts to io_req_flags_t  (Min-Hua Chen)
Apply __force cast to restricted io_req_flags_t type to fix the following sparse warning: io_uring/io_uring.c:2026:23: sparse: warning: cast to restricted io_req_flags_t No functional changes intended. Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Link: https://lore.kernel.org/r/20240922104132.157055-1-minhuadotchen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
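For illustration, the sparse-clean form of such a conversion (variable names illustrative):

    /* io_req_flags_t is a __bitwise type, so converting a plain integer
     * needs an explicit __force to keep sparse quiet */
    io_req_flags_t flags = (__force io_req_flags_t) sqe_flags;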
2024-09-24  io_uring: fix memory leak when cache init fail  (Guixin Liu)
Exit the percpu ref when cache init fails, to free the data memory within struct percpu_ref. Fixes: 206aefde4f88 ("io_uring: reduce/pack size of io_ring_ctx") Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240923100512.64638-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
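A hedged sketch of the error-path shape (the failing-init condition is illustrative):

    /* percpu_ref_init() allocated per-cpu data; if a later init step in
     * ctx setup fails, that data must be released explicitly */
    if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL))
            goto err;
    if (cache_init_failed)          /* illustrative condition */
            goto err_ref;
    return ctx;
err_ref:
    percpu_ref_exit(&ctx->refs);
err:
    kfree(ctx);
    return NULL;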
2024-09-24  Merge tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull more io_uring updates from Jens Axboe: "Mostly just a set of fixes in here, or little changes that didn't get included in the initial pull request. This contains:
 - Move the SQPOLL napi polling outside the submission lock (Olivier)
 - Rename of the "copy buffers" API that got added in the 6.12 merge window. There's really no copying going on, it's just referencing the buffers. After a bit of consideration, decided that it was better to simply rename this to avoid potential confusion (me)
 - Shrink struct io_mapped_ubuf from 48 to 32 bytes, by changing it to start + len tracking rather than having start / end in there, and by removing the caching of folio_mask when we can just calculate it from folio_shift when we need it (me)
 - Fixes for the SQPOLL affinity checking (me, Felix)
 - Fix for how cqring waiting checks for the presence of task_work. Just check it directly rather than check for a specific notification mechanism (me)
 - Tweak to how request linking is represented in tracing (me)
 - Fix a syzbot report that deliberately sets up a huge list of overflow entries, and then hits rcu stalls when flushing this list. Just check for the need to preempt, and drop/reacquire locks in the loop. There's no state maintained over the loop itself, and each entry is yanked from head-of-list (me)"

* tag 'for-6.12/io_uring-20240922' of git://git.kernel.dk/linux:
 io_uring: check if we need to reschedule during overflow flush
 io_uring: improve request linking trace
 io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL
 io_uring/sqpoll: do the napi busy poll outside the submission block
 io_uring: clean up a type in io_uring_register_get_file()
 io_uring/sqpoll: do not put cpumask on stack
 io_uring/sqpoll: retain test for whether the CPU is valid
 io_uring/rsrc: change ubuf->ubuf_end to length tracking
 io_uring/rsrc: get rid of io_mapped_ubuf->folio_mask
 io_uring: rename "copy buffers" to "clone buffers"
2024-09-23  Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)
Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
 add struct fd constructors, get rid of __to_fd()
 struct fd: representation change
 introduce fd_file(), convert all accessors to it.