summaryrefslogtreecommitdiff
path: root/io_uring/zcrx.h
AgeCommit message (Collapse)Author
2025-04-18io_uring/zcrx: fix late dma unmap for a dead devPavel Begunkov
There is a problem with page pools not dma-unmapping immediately when the device is going down, and delaying it until the page pool is destroyed, which is not allowed (see links). That just got fixed for normal page pools, and we need to address memory providers as well. Unmap pages in the memory provider uninstall callback, and protect it with a new lock. There is also a gap between when a dma mapping is created and the mp is installed, so if the device is killed in between, io_uring would be holding on to dma mappings to a dead device with no one to call ->uninstall. Move it to page pool init and rely on ->is_mapped to make sure it's only done once. Link: https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/ Link: https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/ Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ef9b7db249b14f6e0b570a1bb77ff177389f881c.1744965853.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07io_uring/zcrx: separate niov number from pagesPavel Begunkov
A preparation patch that separates the number of pages / folios from the number of niovs. They will not match in the future to support huge pages, improved dma mapping and/or larger chunk sizes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0780ac966ee84200385737f45bb0f2ada052392b.1743848231.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07io_uring/zcrx: put refill data into separate cache linePavel Begunkov
Refill queue lock and other bits are only used from the allocation path on the rx softirq side, but it shares the cache line with other fields like ctx that are used also in the "syscall" path, which causes cache bouncing when softirq runs on a different CPU. Separate them into different cache lines. The first one now contains constant fields used by both contextx, followed by a line responsible for refill queue data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6d1f598e27d623c07fc49d6baee13089a9b1216c.1743848241.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24io_uring/zcrx: add a read limit to recvzc requestsDavid Wei
Currently multishot recvzc requests have no read limit and will remain active so as long as the socket remains open. But, there are sometimes a need to do a fixed length read e.g. peeking at some data in the socket. Add a length limit to recvzc requests `len`. A value of 0 means no limit which is the previous behaviour. A positive value N specifies how many bytes to read from the socket. Data will still be posted in aux completions, as before. This could be split across multiple frags. But the primary recvzc request will now complete once N bytes have been read. The completion of the recvzc request will have res and cflags both set to 0. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20250224041319.2389785-2-dw@davidwei.uk [axboe: fixup io_zcrx_recv() for !CONFIG_NET] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add io_recvzc requestDavid Wei
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a socket. Only the connection should be land on the specific rx queue set up for zero copy, and the socket must be handled by the io_uring instance that the rx queue was registered for zero copy with. That's because neither net_iovs / buffers from our queue can be read by outside applications, nor zero copy is possible if traffic for the zero copy connection goes to another queue. This coordination is outside of the scope of this patch series. Also, any traffic directed to the zero copy enabled queue is immediately visible to the application, which is why CAP_NET_ADMIN is required at the registration step. Of course, no data is actually read out of the socket, it has already been copied by the netdev into userspace memory via DMA. OP_RECV_ZC reads skbs out of the socket and checks that its frags are indeed net_iovs that belong to io_uring. A cqe is queued for each one of these frags. Recall that each cqe is a big cqe, with the top half being an io_uring_zcrx_cqe. The cqe res field contains the len or error. The lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off field contain the offset relative to the start of the zero copy area. The upper part of the off field is trivially zero, and will be used to carry the area id. For now, there is no limit as to how much work each OP_RECV_ZC request does. It will attempt to drain a socket of all available data. This request always operates in multishot mode. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-7-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: dma-map area for the devicePavel Begunkov
Setup DMA mappings for the area into which we intend to receive data later on. We know the device we want to attach to even before we get a page pool and can pre-map in advance. All net_iov are synchronised for device when allocated, see page_pool_mp_return_in_cache(). Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-6-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: implement zerocopy receive pp memory providerPavel Begunkov
Implement a page pool memory provider for io_uring to receieve in a zero copy fashion. For that, the provider allocates user pages wrapped around into struct net_iovs, that are stored in a previously registered struct net_iov_area. Unlike the traditional receive, that frees pages and returns them back to the page pool right after data was copied to the user, e.g. inside recv(2), we extend the lifetime until the user space confirms that it's done processing the data. That's done by taking a net_iov reference. When the user is done with the buffer, it must return it back to the kernel by posting an entry into the refill ring, which is usually polled off the io_uring memory provider callback in the page pool's netmem allocation path. There is also a separate set of per net_iov "user" references accounting whether a buffer is currently given to the user (including possible fragmentation). Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-5-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: grab a net devicePavel Begunkov
Zerocopy receive needs a net device to bind to its rx queue and dma map buffers. As a preparation to following patches, resolve a net device from the if_idx parameter with no functional changes otherwise. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-4-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add io_zcrx_areaDavid Wei
Add io_zcrx_area that represents a region of userspace memory that is used for zero copy. During ifq registration, userspace passes in the uaddr and len of userspace memory, which is then pinned by the kernel. Each net_iov is mapped to one of these pages. The freelist is a spinlock protected list that keeps track of all the net_iovs/pages that aren't used. For now, there is only one area per ifq and area registration happens implicitly as part of ifq registration. There is no API for adding/removing areas yet. The struct for area registration is there for future extensibility once we support multiple areas and TCP devmem. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-3-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17io_uring/zcrx: add interface queue and refill queueDavid Wei
Add a new object called an interface queue (ifq) that represents a net rx queue that has been configured for zero copy. Each ifq is registered using a new registration opcode IORING_REGISTER_ZCRX_IFQ. The refill queue is allocated by the kernel and mapped by userspace using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used by userspace to return buffers that it is done with, which will then be re-used by the netdev again. The main CQ ring is used to notify userspace of received data by using the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry contains the offset + len to the data. For now, each io_uring instance only has a single ifq. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>