summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2024-04-15io_uring/net: get rid of ->prep_async() for send sideJens Axboe
Move the io_async_msghdr out of the issue path and into prep handling, e it's now done unconditionally and hence does not need to be part of the issue path. This means any usage of io_sendrecv_prep_async() and io_sendmsg_prep_async(), and hence the forced async setup path is now unified with the normal prep setup. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/net: get rid of ->prep_async() for receive sideJens Axboe
Move the io_async_msghdr out of the issue path and into prep handling, since it's now done unconditionally and hence does not need to be part of the issue path. This reduces the footprint of the multishot fast path of multiple invocations of ->issue() per prep, and also means that using ->prep_async() can be dropped for recvmsg asthis is now done via setup on the prep side. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/net: always set kmsg->msg.msg_control_user before issueJens Axboe
We currently set this separately for async/sync entry, but let's just move it to a generic pre-issue spot and eliminate the difference between the two. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/net: always setup an io_async_msghdrJens Axboe
Rather than use an on-stack one and then need to allocate and copy if async execution is required, always grab one upfront. This should be very cheap, and potentially even have cache hotness benefits for back-to-back send/recv requests. For any recv type of request, this is probably a good choice in general, as it's expected that no data is available initially. For send this is not necessarily the case, as space in the socket buffer is expected to be available. However, getting a cached io_async_msghdr is very cheap, and as it should be cache hot, probably the difference here is neglible, if any. A nice side benefit is that io_setup_async_msg can get killed completely, which has some nasty iovec manipulation code. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/net: unify cleanup handlingJens Axboe
Now that recv/recvmsg both do the same cleanup, put it in the retry and finish handlers. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/net: switch io_recv() to using io_async_msghdrJens Axboe
No functional changes in this patch, just in preparation for carrying more state than what is available now, if necessary. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdrJens Axboe
No functional changes in this patch, just in preparation for carrying more state then what is being done now, if necessary. While unifying some of this code, add a generic send setup prep handler that they can both use. This gets rid of some manual msghdr and sockaddr on the stack, and makes it look a bit more like the sendmsg/recvmsg variants. Going forward, more can get unified on top. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/alloc_cache: shrink default max entries from 512 to 128Jens Axboe
In practice, we just need to recycle a few elements for (by far) most use cases. Shrink the total size down from 512 to 128, which should be more than plenty. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: remove timeout/poll specific cancelationsJens Axboe
For historical reasons these were special cased, as they were the only ones that needed cancelation. But now we handle cancelations generally, and hence there's no need to check for these in io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles both these and the rest as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: flush delayed fallback task_work in cancelationJens Axboe
Just like we run the inline task_work, ensure we also factor in and run the fallback task_work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: clean up io_lockdep_assert_cq_lockedPavel Begunkov
Move CONFIG_PROVE_LOCKING checks inside of io_lockdep_assert_cq_locked() and kill the else branch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/bbf33c429c9f6d7207a8fe66d1a5866ec2c99850.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: refactor io_req_complete_post()Pavel Begunkov
Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests to task_work, it's much cleaner and should normally happen. We couldn't do it before because there was a possibility of looping in complete_post() -> tw -> complete_post() -> ... Also, unexport the function and inline __io_req_complete_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: remove current check from complete_postPavel Begunkov
task_work execution is now always locked, and we shouldn't get into io_req_complete_post() from them. That means that complete_post() is always called out of the original task context and we don't even need to check current. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: get rid of intermediate aux cqe cachesPavel Begunkov
io_post_aux_cqe(), which is used for multishot requests, delays completions by putting CQEs into a temporary array for the purpose completion lock/flush batching. DEFER_TASKRUN doesn't need any locking, so for it we can put completions directly into the CQ and defer post completion handling with a flag. That leaves !DEFER_TASKRUN, which is not that interesting / hot for multishot requests, so have conditional locking with deferred flush for them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: refactor io_fill_cqe_req_auxPavel Begunkov
The restriction on multishot execution context disallowing io-wq is driven by rules of io_fill_cqe_req_aux(), it should only be called in the master task context, either from the syscall path or in task_work. Since task_work now always takes the ctx lock implying IO_URING_F_COMPLETE_DEFER, we can just assume that the function is always called with its defer argument set to true. Kill the argument. Also rename the function for more consistency as "fill" in CQE related functions was usually meant for raw interfaces only copying data into the CQ without any locking, waking the user and other accounting "post" functions take care of. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: remove struct io_tw_state::lockedPavel Begunkov
ctx is always locked for task_work now, so get rid of struct io_tw_state::locked. Note I'm stopping one step before removing io_tw_state altogether, which is not empty, because it still serves the purpose of indicating which function is a tw callback and forcing users not to invoke them carelessly out of a wrong context. The removal can always be done later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring: force tw ctx lockingPavel Begunkov
We can run normal task_work without locking the ctx, however we try to lock anyway and most handlers prefer or require it locked. It might have been interesting to multi-submitter ring with high contention completing async read/write requests via task_work, however that will still need to go through io_req_complete_post() and potentially take the lock for rsrc node putting or some other case. In other words, it's hard to care about it, so alawys force the locking. The case described would also because of various io_uring caches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/6ae858f2ef562e6ed9f13c60978c0d48926954ba.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/rw: avoid punting to io-wq directlyPavel Begunkov
kiocb_done() should care to specifically redirecting requests to io-wq. Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let the core code io_uring handle offloading. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/cmd: fix tw <-> issue_flags conversionPavel Begunkov
!IO_URING_F_UNLOCKED does not translate to availability of the deferred completion infra, IO_URING_F_COMPLETE_DEFER does, that what we should pass and look for to use io_req_complete_defer() and other variants. Luckily, it's not a real problem as two wrongs actually made it right, at least as far as io_uring_cmd_work() goes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/aef76d34fe9410df8ecc42a14544fd76cd9d8b9e.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/cmd: kill one issue_flags to tw conversionPavel Begunkov
io_uring cmd converts struct io_tw_state to issue_flags and later back to io_tw_state, it's awfully ill-fated, not to mention that intermediate issue_flags state is not correct. Get rid of the last conversion, drag through tw everything that came with IO_URING_F_UNLOCKED, and replace io_req_complete_defer() with a direct call to io_req_complete_defer(), at least for the time being. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/c53fa3df749752bd058cf6f824a90704822d6bcc.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15io_uring/cmd: move io_uring_try_cancel_uring_cmd()Pavel Begunkov
io_uring_try_cancel_uring_cmd() is a part of the cmd handling so let's move it closer to all cmd bits into uring_cmd.c Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/43a3937af4933655f0fd9362c381802f804f43de.1710799188.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-08io_uring/net: restore msg_control on sendzc retryPavel Begunkov
cac9e4418f4cb ("io_uring/net: save msghdr->msg_control for retries") reinstatiates msg_control before every __sys_sendmsg_sock(), since the function can overwrite the value in msghdr. We need to do same for zerocopy sendmsg. Cc: stable@vger.kernel.org Fixes: 493108d95f146 ("io_uring/net: zerocopy sendmsg") Link: https://github.com/axboe/liburing/issues/1067 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cc1d5d9df0576fa66ddad4420d240a98a020b267.1712596179.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-07fs: claw back a few FMODE_* bitsChristian Brauner
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-05io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failureAlexey Izbyshev
This bug was introduced in commit 950e79dd7313 ("io_uring: minor io_cqring_wait() optimization"), which was made in preparation for adc8682ec690 ("io_uring: Add support for napi_busy_poll"). The latter got reverted in cb3182167325 ("Revert "io_uring: Add support for napi_busy_poll""), so simply undo the former as well. Cc: stable@vger.kernel.org Fixes: 950e79dd7313 ("io_uring: minor io_cqring_wait() optimization") Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru> Link: https://lore.kernel.org/r/20240405125551.237142-1-izbyshev@ispras.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: hold io_buffer_list reference over mmapJens Axboe
If we look up the kbuf, ensure that it doesn't get unregistered until after we're done with it. Since we're inside mmap, we cannot safely use the io_uring lock. Rely on the fact that we can lookup the buffer list under RCU now and grab a reference to it, preventing it from being unregistered until we're done with it. The lookup returns the io_buffer_list directly with it referenced. Cc: stable@vger.kernel.org # v6.4+ Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: protect io_buffer_list teardown with a referenceJens Axboe
No functional changes in this patch, just in preparation for being able to keep the buffer list alive outside of the ctx->uring_lock. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: get rid of bl->is_readyJens Axboe
Now that xarray is being exclusively used for the buffer_list lookup, this check is no longer needed. Get rid of it and the is_ready member. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: get rid of lower BGID listsJens Axboe
Just rely on the xarray for any kind of bgid. This simplifies things, and it really doesn't bring us much, if anything. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring: use private workqueue for exit workJens Axboe
Rather than use the system unbound event workqueue, use an io_uring specific one. This avoids dependencies with the tty, which also uses the system_unbound_wq, and issues flushes of said workqueue from inside its poll handling. Cc: stable@vger.kernel.org Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com> Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com> Tested-by: Iskren Chernev <me@iskren.info> Link: https://github.com/axboe/liburing/issues/1113 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01io_uring: disable io-wq execution of multishot NOWAIT requestsJens Axboe
Do the same check for direct io-wq execution for multishot requests that commit 2a975d426c82 did for the inline execution, and disable multishot mode (and revert to single shot) if the file type doesn't support NOWAIT, and isn't opened in O_NONBLOCK mode. For multishot to work properly, it's a requirement that nonblocking read attempts can be done. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01io_uring/rw: don't allow multishot reads without NOWAIT supportJens Axboe
Supporting multishot reads requires support for NOWAIT, as the alternative would be always having io-wq execute the work item whenever the poll readiness triggered. Any fast file type will have NOWAIT support (eg it understands both O_NONBLOCK and IOCB_NOWAIT). If the given file type does not, then simply resort to single shot execution. Cc: stable@vger.kernel.org Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-18io_uring/sqpoll: early exit thread if task_context wasn't allocatedJens Axboe
Ideally we'd want to simply kill the task rather than wake it, but for now let's just add a startup check that causes the thread to exit. This can only happen if io_uring_alloc_task_context() fails, which generally requires fault injection. Reported-by: Ubisectech Sirius <bugreport@ubisectech.com> Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16io_uring: clear opcode specific data for an early failureJens Axboe
If failure happens before the opcode prep handler is called, ensure that we clear the opcode specific area of the request, which holds data specific to that request type. This prevents errors where opcode handlers either don't get to clear per-request private data since prep isn't even called. Reported-and-tested-by: syzbot+f8e9a371388aa62ecab4@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16io_uring/net: ensure async prep handlers always initialize ->done_ioJens Axboe
If we get a request with IOSQE_ASYNC set, then we first run the prep async handlers. But if we then fail setting it up and want to post a CQE with -EINVAL, we use ->done_io. This was previously guarded with REQ_F_PARTIAL_IO, and the normal setup handlers do set it up before any potential errors, but we need to cover the async setup too. Fixes: 9817ad85899f ("io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring/waitid: always remove waitid entry for cancel allJens Axboe
We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our waitid list upfront. This is important for remove all conditions, as we otherwise will find it multiple times and prevent cancelation progress. Remove the dead check in cancelation as well for the hash_node being empty or not. We already have a waitid reference check for ownership, so we don't need to check the list too. Cc: stable@vger.kernel.org Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring/futex: always remove futex entry for cancel allJens Axboe
We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our futex list upfront. This is important for remove all conditions, as we otherwise will find it multiple times and prevent cancelation progress. Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Fixes: 8f350194d5cf ("io_uring: add support for vectored futex waits") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring: fix poll_remove stalled req completionPavel Begunkov
Taking the ctx lock is not enough to use the deferred request completion infrastructure, it'll get queued into the list but no one would expect it there, so it will sit there until next io_submit_flush_completions(). It's hard to care about the cancellation path, so complete it via tw. Fixes: ef7dfac51d8ed ("io_uring/poll: serialize poll linked timer start with poll removal") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c446740bc16858f8a2a8dcdce899812f21d15f23.1710514702.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring: Fix release of pinned pages when __io_uaddr_map failsGabriel Krisman Bertazi
Looking at the error path of __io_uaddr_map, if we fail after pinning the pages for any reasons, ret will be set to -EINVAL and the error handler won't properly release the pinned pages. I didn't manage to trigger it without forcing a failure, but it can happen in real life when memory is heavily fragmented. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Fixes: 223ef4743164 ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages") Link: https://lore.kernel.org/r/20240313213912.1920-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring/kbuf: rename is_mappedPavel Begunkov
In buffer lists we have ->is_mapped as well as ->is_mmap, it's pretty hard to stay sane double checking which one means what, and in the long run there is a high chance of an eventual bug. Rename ->is_mapped into ->is_buf_ring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4838f4d8ad506ad6373f1c305aee2d2c1a89786.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring: simplify io_pages_freePavel Begunkov
We never pass a null (top-level) pointer, remove the check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e1a46f9a5cd38e6876905e8030bdff9b0845e96.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12io_uring: clean rings on NO_MMAP alloc failPavel Begunkov
We make a few cancellation judgements based on ctx->rings, so let's zero it afer deallocation for IORING_SETUP_NO_MMAP just like it's done with the mmap case. Likely, it's not a real problem, but zeroing is safer and better tested. Cc: stable@vger.kernel.org Fixes: 03d89a2de25bbc ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9ff6cdf91429b8a51699c210e1f6af6ea3f8bdcf.1710255382.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retryJens Axboe
If read multishot is being invoked from the poll retry handler, then we should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then a CQE will be posted with -EAGAIN rather than triggering the retry when the file is flagged as readable again. Cc: stable@vger.kernel.org Reported-by: Sargun Dhillon <sargun@meta.com> Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11io_uring: don't save/restore iowait stateJens Axboe
This kind of state is per-syscall, and since we're doing the waiting off entering the io_uring_enter(2) syscall, there's no way that iowait can already be set for this case. Simplify it by setting it if we need to, and always clearing it to 0 when done. Fixes: 7b72d661f1f2 ("io_uring: gate iowait schedule on having pending requests") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Make running of task_work internal loops more fair, and unify how the different methods deal with them (me) - Support for per-ring NAPI. The two minor networking patches are in a shared branch with netdev (Stefan) - Add support for truncate (Tony) - Export SQPOLL utilization stats (Xiaobing) - Multishot fixes (Pavel) - Fix for a race in manipulating the request flags via poll (Pavel) - Cleanup the multishot checking by making it generic, moving it out of opcode handlers (Pavel) - Various tweaks and cleanups (me, Kunwu, Alexander) * tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits) io_uring: Fix sqpoll utilization check racing with dying sqpoll io_uring/net: dedup io_recv_finish req completion io_uring: refactor DEFER_TASKRUN multishot checks io_uring: fix mshot io-wq checks io_uring/net: add io_req_msg_cleanup() helper io_uring/net: simplify msghd->msg_inq checking io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io io_uring/net: correctly handle multishot recvmsg retry setup io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler io_uring: fix io_queue_proc modifying req->flags io_uring: fix mshot read defer taskrun cqe posting io_uring/net: fix overflow check in io_recvmsg_mshot_prep() io_uring/net: correct the type of variable io_uring/sqpoll: statistics of the true utilization of sq threads io_uring/net: move recv/recvmsg flags out of retry loop io_uring/kbuf: flag request if buffer pool is empty after buffer pick io_uring/net: improve the usercopy for sendmsg/recvmsg io_uring/net: move receive multishot out of the generic msghdr path io_uring/net: unify how recvmsg and sendmsg copy in the msghdr ...
2024-03-09io_uring: Fix sqpoll utilization check racing with dying sqpollGabriel Krisman Bertazi
Commit 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads"), currently in Jens for-next branch, peeks at io_sq_data->thread to report utilization statistics. But, If io_uring_show_fdinfo races with sqpoll terminating, even though we hold the ctx lock, sqd->thread might be NULL and we hit the Oops below. Note that we could technically just protect the getrusage() call and the sq total/work time calculations. But showing some sq information (pid/cpu) and not other information (utilization) is more confusing than not reporting anything, IMO. So let's hide it all if we happen to race with a dying sqpoll. This can be triggered consistently in my vm setup running sqpoll-cancel-hang.t in a loop. BUG: kernel NULL pointer dereference, address: 00000000000007b0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17206e-dirty #69 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_user_addr_fault+0x174/0x7c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? getrusage+0x21/0x3e0 ? seq_printf+0x4e/0x70 io_uring_show_fdinfo+0x9db/0xa10 ? srso_alias_return_thunk+0x5/0xfbef5 ? vsnprintf+0x101/0x4d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_vprintf+0x34/0x50 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_printf+0x4e/0x70 ? seq_show+0x16b/0x1d0 ? __pfx_io_uring_show_fdinfo+0x10/0x10 seq_show+0x16b/0x1d0 seq_read_iter+0xd7/0x440 seq_read+0x102/0x140 vfs_read+0xae/0x320 ? srso_alias_return_thunk+0x5/0xfbef5 ? __do_sys_newfstat+0x35/0x60 ksys_read+0xa5/0xe0 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7f786ec1db4d Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006 RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60 R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628 </TASK> Modules linked in: CR2: 00000000000007b0 ---[ end trace 0000000000000000 ]--- RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: dedup io_recv_finish req completionPavel Begunkov
There are two block in io_recv_finish() completing the request, which we can combine and remove jumping. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e338dcb33c88de83809fda021cba9e7c9681620.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring: refactor DEFER_TASKRUN multishot checksPavel Begunkov
We disallow DEFER_TASKRUN multishots from running by io-wq, which is checked by individual opcodes in the issue path. We can consolidate all it in io_wq_submit_work() at the same time moving the checks out of the hot path. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring: fix mshot io-wq checksPavel Begunkov
When checking for concurrent CQE posting, we're not only interested in requests running from the poll handler but also strayed requests ended up in normal io-wq execution. We're disallowing multishots in general from io-wq, not only when they came in a certain way. Cc: stable@vger.kernel.org Fixes: 17add5cea2bba ("io_uring: force multishot CQEs into task context") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: add io_req_msg_cleanup() helperJens Axboe
For the fast inline path, we manually recycle the io_async_msghdr and free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid that needing doing in the slower path. We already do that in 2 spots, and in preparation for adding more, add a helper and use it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: simplify msghd->msg_inq checkingJens Axboe
Just check for larger than zero rather than check for non-zero and not -1. This is easier to read, and also protects against any errants < 0 values that aren't -1. Signed-off-by: Jens Axboe <axboe@kernel.dk>