io_uring_cmd_import_fixed() will need to know the io_uring execution
state in the following commits; for now, just pass issue_flags into it
without actually using it.
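For reference, the resulting declaration would look roughly like this;
the surrounding parameter list is assumed from
include/linux/io_uring/cmd.h and may differ in detail:

    int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
                                  struct iov_iter *iter,
                                  struct io_uring_cmd *ioucmd,
                                  unsigned int issue_flags);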
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is already a field in io_kiocb that can store a registered buffer
index; use that instead of stashing the value into struct io_sr_msg.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is already a field in io_kiocb that can store a registered buffer
index; use that instead of stashing the value into struct io_nop.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The only caller of io_buffer_unmap() already checks that the node's buf
is non-NULL, so there is no need to check again.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Open code io_prep_rw_setup() into its only caller; it doesn't provide
any meaningful abstraction anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/61ba72e2d46119db71f27ab908018e6a6cd6c064.1740425922.git.asml.silence@gmail.com
[axboe: fold in 'ret' being unused fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Match the compat part of io_sendmsg_copy_hdr() with its counterpart and
save msg_control.
Fixes: c55978024d123 ("io_uring/net: move receive multishot out of the generic msghdr path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8418821fe83d3b64350ad2b3c0303e9b732bbd.1740498502.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split out a helper from __io_import_rw_buffer() that handles vectored
buffers. I'll need it for registered vectored buffers, but it also looks
cleaner, especially with parameters being properly named.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/075470cfb24be38709d946815f35ec846d966f41.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_import_iovec() is not limited to iovecs but also imports buffers for
normal reads and selected buffers, so rename it for clarity.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91cea59340b61a8f52dc7b8e720274577a25188c.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
rw always allocates async_data, so instead of doing that deeper in the
prep calls inside of io_prep_rw_setup(), be a bit more explicit and do
it early on in io_prep_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5ead621051bc3374d1e8d96f816454906a6afd71.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently multishot recvzc requests have no read limit and will remain
active for as long as the socket remains open. But there is sometimes a
need to do a fixed-length read, e.g. peeking at some data in the socket.
Add a length limit `len` to recvzc requests. A value of 0 means no
limit, which is the previous behaviour. A positive value N specifies how
many bytes to read from the socket.
Data will still be posted in aux completions, as before. This could be
split across multiple frags. But the primary recvzc request will now
complete once N bytes have been read. The completion of the recvzc
request will have res and cflags both set to 0.
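As an illustration, a minimal sketch of a capped request via the raw
SQE; everything except len follows the zcrx uapi described elsewhere in
this series and is an assumption here:

    /* read at most 4096 bytes, then post the final completion;
     * len = 0 would keep the old unlimited multishot behaviour */
    sqe->opcode = IORING_OP_RECV_ZC;
    sqe->fd = sockfd;
    sqe->ioprio = IORING_RECV_MULTISHOT;
    sqe->len = 4096;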
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250224041319.2389785-2-dw@davidwei.uk
[axboe: fixup io_zcrx_recv() for !CONFIG_NET]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_poll_issue() forwards the call to io_issue_sqe() and thus inherits
some of the handling. That's not particularly failure resistant, as for
example returning an innocent-looking IOU_OK from a multishot issue
will lead to severe bugs.
Reimplement io_poll_issue() without io_issue_sqe()'s request completion
logic. Remove extra checks as we know that req->file is already set,
linked timeouts are armed, and iopoll is not supported. Also cover it
with warnings for now.
The patch should be useful by itself, but it's also preparing the
codebase for future cleanups.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3096d7b1026d9a52426a598bdfc8d9d324555545.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use a more recognisable pattern for mshot accept: first try to post an
mshot cqe if needed, and only then do the terminating handling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/daf5c0df7e2966deb0a115021c065fc6161a52d7.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
REQ_F_APOLL_MULTISHOT doesn't guarantee it's executed from the multishot
context, so a multishot accept may get executed inline, fail
io_req_post_cqe(), and ask the core code to kill the request with
-ECANCELED by returning IOU_STOP_MULTISHOT even when a socket has been
accepted and installed.
Cc: stable@vger.kernel.org
Fixes: 390ed29b5e425 ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51c6deb01feaa78b08565ca8f24843c017f5bc80.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use io_is_compat() for consistency.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/fff93d9d08243284c5db5d546be766a82e85c130.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use io_is_compat() for consistency.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/28c5b5f1f1bf7f4d18869dafe6e4147ce1bbf0f5.1740400452.git.asml.silence@gmail.com
Link: https://lore.kernel.org/r/20250224172337.2009871-1-csander@purestorage.com
[axboe: fold in improvement from Caleb, see link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Compat performance is not important and simplicity is more appreciated.
Let's not be smart about it and use the simpler copy_from_user() instead
of the access_ok() + __get_user() pair.
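Roughly, the pattern being replaced (variable names are illustrative):

    /* before: range check plus per-field fetches */
    if (!access_ok(uiov, sizeof(*uiov)))
        return -EFAULT;
    if (__get_user(clen, &uiov->iov_len))
        return -EFAULT;

    /* after: one checked copy of the whole struct */
    struct compat_iovec iov;

    if (copy_from_user(&iov, uiov, sizeof(iov)))
        return -EFAULT;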
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b334a3a5040efa424ded58e4d8a6ef2554324266.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Even when COMPAT is compiled out, we still have to pass ctx->compat to
__import_iovec(). Replace the indirect read with a constant when the
kernel doesn't support compat.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/2819df9c8533c36b46d7baccbb317a0ec89da6cd.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use io_is_compat() to avoid extra overhead in io_uring_cmd() for flag
setting when compat is compiled out.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/f4d74c62d7cbddc386c0a9138ecd2b2ed6d3f146.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A preparation patch adding a simple helper for gauging the compat state.
It'll help us to optimise and compile out more code in the following
commits.
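A minimal sketch of such a helper, assuming it wraps the existing
ctx->compat flag:

    static inline bool io_is_compat(struct io_ring_ctx *ctx)
    {
        return IS_ENABLED(CONFIG_COMPAT) && unlikely(ctx->compat);
    }

With COMPAT compiled out, IS_ENABLED(CONFIG_COMPAT) folds to a constant
false and the compiler can drop every branch guarded by the helper.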
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/1a87a640265196a67bc38300128e0bfd7839ab1f.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull io_uring fixes from Jens Axboe:
- Series fixing an issue with multishot read on pollable files that may
return -EIOCBQUEUED from ->read_iter(). Four small patches for that,
the first one deliberately done in such a way that it'd be easy to
backport
- Remove some dead constant definitions
- Use array_index_nospec() for opcode indexing
- Work-around for worker creation retries in the presence of signals
* tag 'io_uring-6.14-20250221' of git://git.kernel.dk/linux:
io_uring/rw: clean up mshot forced sync mode
io_uring/rw: move ki_complete init into prep
io_uring/rw: don't directly use ki_complete
io_uring/rw: forbid multishot async reads
io_uring/rsrc: remove unused constants
io_uring: fix spelling error in uapi io_uring.h
io_uring: prevent opcode speculation
io-wq: backoff when retrying worker creation
|
|
Allow the user to mmap the kernel-allocated zerocopy-rx refill queue.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250221085933.26034-1-minhquangbui99@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Cross-merge networking fixes after downstream PR (net-6.14-rc4).
No conflicts or adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For existing epoll event loops that can't fully convert to io_uring,
the usual approach is to add the io_uring fd to the epoll instance and
use epoll_wait() to wait on both "legacy" and io_uring events. While
this works, it isn't optimal as:
1) epoll_wait() is pretty limited in what it can do. It does not support
partial reaping of events, or waiting on a batch of events.
2) When an io_uring ring is added to an epoll instance, it activates the
io_uring "I'm being polled" logic which slows things down.
Rather than use this approach, with EPOLL_WAIT support added to io_uring,
event loops can use the normal io_uring wait logic for everything, as
long as an epoll wait request has been armed with io_uring.
Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this
is an async request. Waiting on io_uring events in general has various
timeout parameters, and those are the ones that should be used when
waiting on any kind of request. If events are immediately available for
reaping, then this opcode will return those immediately. If none are
available, then it will post an async completion when they become
available.
cqe->res will contain either an error code (a negative value) for a
malformed request, an invalid epoll instance, etc., or a positive result
indicating how many events were reaped.
IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring
cancelation infrastructure.
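A userspace sketch against the raw SQE layout; the generic
io_uring_prep_rw() mapping of fd/addr/len is an assumption about how
the opcode is wired up:

    struct epoll_event events[16];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    /* fd = epoll instance, addr = event array, len = maxevents */
    io_uring_prep_rw(IORING_OP_EPOLL_WAIT, sqe, epfd, events, 16, 0);
    io_uring_submit(&ring);
    /* later: cqe->res = number of reaped events, or -errno */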
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just have the Makefile add the object if epoll is enabled, then it's
not necessary to guard the entire epoll.c file inside a CONFIG_EPOLL
ifdef.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
* for-6.15/io_uring-rx-zc: (77 commits)
io_uring: Rename KConfig to Kconfig
io_uring/zcrx: fix leaks on failed registration
io_uring/zcrx: recheck ifq on shutdown
io_uring/zcrx: add selftest
net: add documentation for io_uring zcrx
io_uring/zcrx: add copy fallback
io_uring/zcrx: throttle receive requests
io_uring/zcrx: set pp memory provider for an rx queue
io_uring/zcrx: add io_recvzc request
io_uring/zcrx: dma-map area for the device
io_uring/zcrx: implement zerocopy receive pp memory provider
io_uring/zcrx: grab a net device
io_uring/zcrx: add io_zcrx_area
io_uring/zcrx: add interface queue and refill queue
net: add helpers for setting a memory provider on an rx queue
net: page_pool: add memory provider helpers
net: prepare for non devmem TCP memory providers
net: page_pool: add a mp hook to unregister_netdevice*
net: page_pool: add callback for mp info printing
netdev: add io_uring memory provider info
...
|
|
Kconfig files are traditionally named "Kconfig".
Fixes: 6f377873cb239050 ("io_uring/zcrx: add interface queue and refill queue")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/5ae387c1465f54768b51a5a1ca87be7934c4b2ad.1739976387.git.geert+renesas@glider.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we try to register a device-less interface like veth,
io_register_zcrx_ifq() will leak struct io_zcrx_ifq with a bunch of
resources attached to it. Fix that.
Fixes: 035af94b39fd ("io_uring/zcrx: grab a net device")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202502190532.W7NnmyiP-lkp@intel.com/
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fbf16279dd73fa4c6df048168728355636ba5f53.1739959771.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the code forcing synchronous execution of multishot read requests
out of the more generic __io_read().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ad7b928c776d1ad59addb9fff64ef2d1fc474d5.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Initialise ki_complete during the request prep stage; we'll depend on
it not being reset during issue in the following patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/817624086bd5f0448b08c80623399919fda82f34.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We want to avoid checking ->ki_complete directly in the io_uring
completion path. Fortunately we have only two callbacks, the selection
of which depends on the constant ring flags, i.e. IOPOLL, so use that
to infer the function.
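Sketched, with the callback names borrowed from io_uring/rw.c as an
assumption:

    /* IORING_SETUP_IOPOLL is fixed at ring creation, so it fully
     * determines which completion callback was installed at prep */
    if (req->ctx->flags & IORING_SETUP_IOPOLL)
        io_complete_rw_iopoll(&rw->kiocb, res);
    else
        io_complete_rw(&rw->kiocb, res);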
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4eb4bdab8cbcf5bc87083f7047edc81e920ab83c.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
At the moment we can't sanely handle queuing an async request from a
multishot context, so disable them. It shouldn't matter as pollable
files / sockets don't normally do async.
Patching it in __io_read() is not the cleanest way, but it's simpler
than other options, so let's fix it there and clean up on top.
Cc: stable@vger.kernel.org
Reported-by: chase xd <sl1589472800@gmail.com>
Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d51732c125159d17db4fe16f51ec41b936973f8.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_ring_exit_work() checks ifq before shutting it down and guarantees
that the pointer is stable, but instead of relying on rather complicated
synchronisation, recheck the ifq pointer inside.
Reported-by: Kees Bakker <kees@ijzerbout.nl>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/905e55c47235ab26377a735294f939f31d00ae53.1739934175.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86
("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the
rsrc_node_cache.
IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused
since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring
io_rsrc_node list") removed the separate tag table for registered nodes.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250219033444.2020136-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_uring_create() computes ctx->lockless_cq as:
ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)
So use it to simplify that expression in io_req_complete_post().
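The simplification, roughly (surrounding context elided):

    /* before */
    if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL))
        /* lockless CQE posting */;

    /* after: same condition, precomputed at ring setup */
    if (ctx->lockless_cq)
        /* lockless CQE posting */;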
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250212005119.3433005-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The field 'function' of struct hrtimer should not be changed directly, as
the write is lockless and a concurrent timer expiry might end up using the
wrong function pointer.
Switch to use hrtimer_update_function() which also performs runtime checks
that it is safe to modify the callback.
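The change amounts to the following pattern (timer and callback names
are illustrative):

    /* before: plain store, racy against a concurrent expiry */
    timer->function = new_callback;

    /* after: update with runtime sanity checks */
    hrtimer_update_function(timer, new_callback);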
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/9b33f490fb1d207d3918ef5e116dc3412ae35c1e.1738746927.git.namcao@linutronix.de
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
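An illustrative before/after for a plain monotonic relative timer:

    /* before: two-step init plus open-coded callback assignment */
    hrtimer_init(&t, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    t.function = my_callback;

    /* after: one call initialises the timer completely */
    hrtimer_setup(&t, my_callback, CLOCK_MONOTONIC, HRTIMER_MODE_REL);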
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/80ca8d959f2cc67c75f6d61008e3bebfe7fbc30a.1738746821.git.namcao@linutronix.de
|
|
In commit 6597e8d35851 ("netdev-genl: Elide napi_id when not present"),
the napi_id_valid() function was added. Use the helper to refactor
open-coded checks in the source.
checks in the source.
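That is, open-coded comparisons such as

    if (napi_id < MIN_NAPI_ID)
        return;

become

    if (!napi_id_valid(napi_id))
        return;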
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Stefano Jordhani <sjordhani@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk> # for iouring
Link: https://patch.msgid.link/20250214181801.931-1-sjordhani@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are scenarios in which the zerocopy path can get a kernel buffer
instead of a net_iov and needs to copy it to the user, whether it is
because of mis-steering or simply getting an skb with the linear part.
In this case, grab a net_iov, copy into it and return it to the user as
normal.
At the moment the user doesn't get any indication whether there was a
copy or not, which is left for follow up work.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-10-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_zc_rx_tcp_recvmsg() continues until it fails or there is nothing to
receive. If the other side sends fast enough, we might get stuck in
io_zc_rx_tcp_recvmsg() producing more and more CQEs but not letting the
user handle them, leading to unbounded latencies.
Break out of it based on an arbitrarily chosen limit, the upper layer
will either return to userspace or requeue the request.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-9-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set the page pool memory provider for the rx queue configured for zero
copy to io_uring. Then the rx queue is reset using
netdev_rx_queue_restart() and netdev core + page pool will take care of
filling the rx queue from the io_uring zero copy memory provider.
For now, there is only one ifq so its destruction happens implicitly
during io_uring cleanup.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-8-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a
socket. Only the connection should land on the specific rx queue set
up for zero copy, and the socket must be handled by the io_uring
instance that the rx queue was registered for zero copy with. That's
because net_iovs / buffers from our queue can't be read by outside
applications, and zero copy isn't possible if traffic for the zero copy
connection goes to another queue. This coordination is outside of the
scope of this patch series. Also, any traffic directed to the zero copy
enabled queue is immediately visible to the application, which is why
CAP_NET_ADMIN is required at the registration step.
Of course, no data is actually read out of the socket; it has already
been copied by the netdev into userspace memory via DMA. OP_RECV_ZC
reads skbs out of the socket and checks that its frags are indeed
net_iovs that belong to io_uring. A cqe is queued for each one of these
frags.
Recall that each cqe is a big cqe, with the top half being an
io_uring_zcrx_cqe. The cqe res field contains the len or error. The
lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off
field contain the offset relative to the start of the zero copy area.
The upper part of the off field is trivially zero, and will be used
to carry the area id.
For now, there is no limit as to how much work each OP_RECV_ZC request
does. It will attempt to drain a socket of all available data. This
request always operates in multishot mode.
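A consumer-side decoding sketch; the mask derivation from
IORING_ZCRX_AREA_SHIFT follows the description above, the rest is
assumed:

    struct io_uring_zcrx_cqe *zcqe =
        (struct io_uring_zcrx_cqe *)(cqe + 1);
    __u64 off_mask = ((__u64)1 << IORING_ZCRX_AREA_SHIFT) - 1;

    /* cqe->res holds the fragment length, or an error */
    char *data = area_base + (zcqe->off & off_mask);
    int len = cqe->res;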
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-7-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Setup DMA mappings for the area into which we intend to receive data
later on. We know the device we want to attach to even before we get a
page pool and can pre-map in advance. All net_iov are synchronised for
device when allocated, see page_pool_mp_return_in_cache().
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-6-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Implement a page pool memory provider for io_uring to receive in a
zero copy fashion. For that, the provider allocates user pages wrapped
around into struct net_iovs, that are stored in a previously registered
struct net_iov_area.
Unlike traditional receive, which frees pages and returns them back to
the page pool right after data has been copied to the user, e.g. inside
recv(2), we extend the lifetime until userspace confirms that it's done
processing the data. That's done by taking a net_iov reference.
When the user is done with the buffer, it must return it back to the
kernel by posting an entry into the refill ring, which is usually polled
off the io_uring memory provider callback in the page pool's netmem
allocation path.
There is also a separate set of per net_iov "user" references accounting
whether a buffer is currently given to the user (including possible
fragmentation).
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-5-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Zerocopy receive needs a net device to bind to its rx queue and dma map
buffers. As preparation for the following patches, resolve a net device
from the if_idx parameter, with no functional changes otherwise.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-4-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add io_zcrx_area that represents a region of userspace memory that is
used for zero copy. During ifq registration, userspace passes in the
uaddr and len of userspace memory, which is then pinned by the kernel.
Each net_iov is mapped to one of these pages.
The freelist is a spinlock protected list that keeps track of all the
net_iovs/pages that aren't used.
For now, there is only one area per ifq and area registration happens
implicitly as part of ifq registration. There is no API for
adding/removing areas yet. The struct for area registration is there for
future extensibility once we support multiple areas and TCP devmem.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-3-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new object called an interface queue (ifq) that represents a net
rx queue that has been configured for zero copy. Each ifq is registered
using a new registration opcode IORING_REGISTER_ZCRX_IFQ.
The refill queue is allocated by the kernel and mapped by userspace
using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main
SQ/CQ. It is used by userspace to return buffers that it is done with,
which will then be re-used by the netdev again.
The main CQ ring is used to notify userspace of received data by using
the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each
entry contains the offset + len to the data.
For now, each io_uring instance only has a single ifq.
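A registration sketch; beyond the opcode and the fields named above,
the struct layout is an assumption:

    struct io_uring_zcrx_ifq_reg reg = {
        .if_idx     = ifindex,   /* net device */
        .if_rxq     = rxq_id,    /* rx queue configured for zero copy */
        .rq_entries = 4096,      /* refill ring size */
    };

    syscall(__NR_io_uring_register, ring_fd,
            IORING_REGISTER_ZCRX_IFQ, &reg, 1);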
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
8e5b3b89ecaf ("io_uring: remove struct io_tw_state::locked") removed the
only field of io_tw_state but kept it as a task work callback argument
to "forc[e] users not to invoke them carelessly out of a wrong context".
Passing the struct io_tw_state * argument adds a few instructions to all
callers that can't inline the functions and see the argument is unused.
So pass struct io_tw_state by value instead. Since it's a 0-sized value,
it can be passed without any instructions needed to initialize it.
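Sketched, with the io_tw_token_t alias from the companion patch:

    /* zero-sized: passing it by value emits no instructions, while
     * the type still restricts callers to task-work context */
    struct io_tw_state {
    };
    typedef struct io_tw_state io_tw_token_t;

    void io_req_task_complete(struct io_kiocb *req, io_tw_token_t tw);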
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for changing how io_tw_state is passed, introduce a type
alias io_tw_token_t for struct io_tw_state *. This allows for changing
the representation in one place, without having to update the many
functions that just forward their struct io_tw_state * argument.
Also add a comment to struct io_tw_state to explain its purpose.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Most callers of io_put_rsrc_node() already check that node is non-NULL:
- io_rsrc_data_free()
- io_sqe_buffer_register()
- io_reset_rsrc_node()
- io_req_put_rsrc_nodes() (REQ_F_BUF_NODE indicates non-NULL buf_node)
Only io_splice_cleanup() can call io_put_rsrc_node() with a NULL node.
So move the NULL check there.
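Sketched (the io_splice field name is assumed):

    static void io_splice_cleanup(struct io_kiocb *req)
    {
        struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice);

        if (sp->rsrc_node)
            io_put_rsrc_node(req->ctx, sp->rsrc_node);
    }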
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250216225900.1075446-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_init_req_drain() takes a struct io_kiocb *req argument but only uses
it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so
pass it instead.
Drop "req" from the function name since it operates on the ctx rather
than a specific req.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250212164807.3681036-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|