summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2025-02-18page_pool: avoid infinite loop to schedule delayed workerJason Xing
We noticed the kworker in page_pool_release_retry() was waken up repeatedly and infinitely in production because of the buggy driver causing the inflight less than 0 and warning us in page_pool_inflight()[1]. Since the inflight value goes negative, it means we should not expect the whole page_pool to get back to work normally. This patch mitigates the adverse effect by not rescheduling the kworker when detecting the inflight negative in page_pool_release_retry(). [1] [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------ [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages ... [Mon Feb 10 20:36:11 2025] Call Trace: [Mon Feb 10 20:36:11 2025] page_pool_release_retry+0x23/0x70 [Mon Feb 10 20:36:11 2025] process_one_work+0x1b1/0x370 [Mon Feb 10 20:36:11 2025] worker_thread+0x37/0x3a0 [Mon Feb 10 20:36:11 2025] kthread+0x11a/0x140 [Mon Feb 10 20:36:11 2025] ? process_one_work+0x370/0x370 [Mon Feb 10 20:36:11 2025] ? __kthread_cancel_work+0x40/0x40 [Mon Feb 10 20:36:11 2025] ret_from_fork+0x35/0x40 [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]--- Note: before this patch, the above calltrace would flood the dmesg due to repeated reschedule of release_dw kworker. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250214064250.85987-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-18sockmap, vsock: For connectible sockets allow only connectedMichal Luczaj
sockmap expects all vsocks to have a transport assigned, which is expressed in vsock_proto::psock_update_sk_prot(). However, there is an edge case where an unconnected (connectible) socket may lose its previously assigned transport. This is handled with a NULL check in the vsock/BPF recv path. Another design detail is that listening vsocks are not supposed to have any transport assigned at all. Which implies they are not supported by the sockmap. But this is complicated by the fact that a socket, before switching to TCP_LISTEN, may have had some transport assigned during a failed connect() attempt. Hence, we may end up with a listening vsock in a sockmap, which blows up quickly: KASAN: null-ptr-deref in range [0x0000000000000120-0x0000000000000127] CPU: 7 UID: 0 PID: 56 Comm: kworker/7:0 Not tainted 6.14.0-rc1+ Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:vsock_read_skb+0x4b/0x90 Call Trace: sk_psock_verdict_data_ready+0xa4/0x2e0 virtio_transport_recv_pkt+0x1ca8/0x2acc vsock_loopback_work+0x27d/0x3f0 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x35a/0x700 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 For connectible sockets, instead of relying solely on the state of vsk->transport, tell sockmap to only allow those representing established connections. This aligns with the behaviour for AF_INET and AF_UNIX. Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <mhal@rbox.co> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-18netdev: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/861d9b287f2b5206b853dbcc0f302127f2a7f8ac.1738746872.git.namcao@linutronix.de
2025-02-17netdev-genl: Add an XSK attribute to queuesJoe Damato
Expose a new per-queue nest attribute, xsk, which will be present for queues that are being used for AF_XDP. If the queue is not being used for AF_XDP, the nest will not be present. In the future, this attribute can be extended to include more data about XSK as it is needed. Signed-off-by: Joe Damato <jdamato@fastly.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250214211255.14194-3-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17net: use napi_id_valid helperStefano Jordhani
In commit 6597e8d35851 ("netdev-genl: Elide napi_id when not present"), napi_id_valid function was added. Use the helper to refactor open-coded checks in the source. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Stefano Jordhani <sjordhani@gmail.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> # for iouring Link: https://patch.msgid.link/20250214181801.931-1-sjordhani@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17drop_monitor: fix incorrect initialization orderGavrilov Ilia
Syzkaller reports the following bug: BUG: spinlock bad magic on CPU#1, syz-executor.0/7995 lock: 0xffff88805303f3e0, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 CPU: 1 PID: 7995 Comm: syz-executor.0 Tainted: G E 5.10.209+ #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x119/0x179 lib/dump_stack.c:118 debug_spin_lock_before kernel/locking/spinlock_debug.c:83 [inline] do_raw_spin_lock+0x1f6/0x270 kernel/locking/spinlock_debug.c:112 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:117 [inline] _raw_spin_lock_irqsave+0x50/0x70 kernel/locking/spinlock.c:159 reset_per_cpu_data+0xe6/0x240 [drop_monitor] net_dm_cmd_trace+0x43d/0x17a0 [drop_monitor] genl_family_rcv_msg_doit+0x22f/0x330 net/netlink/genetlink.c:739 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] genl_rcv_msg+0x341/0x5a0 net/netlink/genetlink.c:800 netlink_rcv_skb+0x14d/0x440 net/netlink/af_netlink.c:2497 genl_rcv+0x29/0x40 net/netlink/genetlink.c:811 netlink_unicast_kernel net/netlink/af_netlink.c:1322 [inline] netlink_unicast+0x54b/0x800 net/netlink/af_netlink.c:1348 netlink_sendmsg+0x914/0xe00 net/netlink/af_netlink.c:1916 sock_sendmsg_nosec net/socket.c:651 [inline] __sock_sendmsg+0x157/0x190 net/socket.c:663 ____sys_sendmsg+0x712/0x870 net/socket.c:2378 ___sys_sendmsg+0xf8/0x170 net/socket.c:2432 __sys_sendmsg+0xea/0x1b0 net/socket.c:2461 do_syscall_64+0x30/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x62/0xc7 RIP: 0033:0x7f3f9815aee9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f3f972bf0c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f3f9826d050 RCX: 00007f3f9815aee9 RDX: 0000000020000000 RSI: 0000000020001300 RDI: 0000000000000007 RBP: 00007f3f981b63bd R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f3f9826d050 R15: 00007ffe01ee6768 If drop_monitor is built as a kernel module, syzkaller may have time to send a netlink NET_DM_CMD_START message during the module loading. This will call the net_dm_monitor_start() function that uses a spinlock that has not yet been initialized. To fix this, let's place resource initialization above the registration of a generic netlink family. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 9a8afc8d3962 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol") Cc: stable@vger.kernel.org Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250213152054.2785669-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14tcp: use EXPORT_IPV6_MOD[_GPL]()Eric Dumazet
Use EXPORT_IPV6_MOD[_GPL]() for symbols that don't need to be exported unless CONFIG_IPV6=m tcp_hashinfo and tcp_openreq_init_rwin() are no longer used from any module anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Link: https://patch.msgid.link/20250212132418.1524422-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Reapply "net: skb: introduce and use a single page frag cache"Jakub Kicinski
This reverts commit 011b0335903832facca86cd8ed05d7d8d94c9c76. Sabrina reports that the revert may trigger warnings due to intervening changes, especially the ability to rise MAX_SKB_FRAGS. Let's drop it and revisit once that part is also ironed out. Fixes: 011b03359038 ("Revert "net: skb: introduce and use a single page frag cache"") Reported-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/6bf54579233038bc0e76056c5ea459872ce362ab.1739375933.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12net: avoid unconditionally touching sk_tsflags on RXPaolo Abeni
After commit 5d4cc87414c5 ("net: reorganize "struct sock" fields"), the sk_tsflags field shares the same cacheline with sk_forward_alloc. The UDP protocol does not acquire the sock lock in the RX path; forward allocations are protected via the receive queue spinlock; additionally udp_recvmsg() calls sock_recv_cmsgs() unconditionally touching sk_tsflags on each packet reception. Due to the above, under high packet rate traffic, when the BH and the user-space process run on different CPUs, UDP packet reception experiences a cache miss while accessing sk_tsflags. The receive path doesn't strictly need to access the problematic field; change sock_set_timestamping() to maintain the relevant information in a newly allocated sk_flags bit, so that sock_recv_cmsgs() can take decisions accessing the latter field only. With this patch applied, on an AMD epic server with i40e NICs, I measured a 10% performance improvement for small packets UDP flood performance tests - possibly a larger delta could be observed with more recent H/W. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/dbd18c8a1171549f8249ac5a8b30b1b5ec88a425.1739294057.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12net: report csum_complete via qstatsJakub Kicinski
Commit 13c7c941e729 ("netdev: add qstat for csum complete") reserved the entry for csum complete in the qstats uAPI. Start reporting this value now that we have a driver which needs it. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20250211181356.580800-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Convert RTM_DELRULE to per-netns RTNL.Kuniyuki Iwashima
fib_nl_delrule() is the doit() handler for RTM_DELRULE but also called from vrf_newlink() in case something fails in vrf_add_fib_rules(). In the latter case, RTNL is already held and the 4th arg is true. Let's hold per-netns RTNL in fib_delrule() if rtnl_held is false. Now we can place ASSERT_RTNL_NET() in call_fib_rule_notifiers(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Add error_free label in fib_delrule().Kuniyuki Iwashima
We will hold RTNL just before calling fib_nl2rule_rtnl() in fib_delrule() and release it before kfree(nlrule). Let's add a new rule to make the following change cleaner. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Convert RTM_NEWRULE to per-netns RTNL.Kuniyuki Iwashima
fib_nl_newrule() is the doit() handler for RTM_NEWRULE but also called from vrf_newlink(). In the latter case, RTNL is already held and the 4th arg is true. Let's hold per-netns RTNL in fib_newrule() if rtnl_held is false. Note that we call fib_rule_get() before releasing per-netns RTNL to call notify_rule_change() without RTNL and prevent freeing the new rule. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Factorise fib_newrule() and fib_delrule().Kuniyuki Iwashima
fib_nl_newrule() / fib_nl_delrule() is the doit() handler for RTM_NEWRULE / RTM_DELRULE but also called from vrf_newlink(). Currently, we hold RTNL on both paths but will not on the former. Also, we set dev_net(dev)->rtnl to skb->sk in vrf_fib_rule() because fib_nl_newrule() / fib_nl_delrule() fetch net as sock_net(skb->sk). Let's Factorise the two functions and pass net and rtnl_held flag. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Split fib_nl2rule().Kuniyuki Iwashima
We will move RTNL down to fib_nl_newrule() and fib_nl_delrule(). Some operations in fib_nl2rule() require RTNL: fib_default_rule_pref() and __dev_get_by_name(). Let's split the RTNL parts as fib_nl2rule_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Pass net to fib_nl2rule() instead of skb.Kuniyuki Iwashima
skb is not used in fib_nl2rule() other than sock_net(skb->sk), which is already available in callers, fib_nl_newrule() and fib_nl_delrule(). Let's pass net directly to fib_nl2rule(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10net: fib_rules: Don't check net in rule_exists() and rule_find().Kuniyuki Iwashima
fib_nl_newrule() / fib_nl_delrule() looks up struct fib_rules_ops in sock_net(skb->sk) and calls rule_exists() / rule_find() respectively. fib_nl_newrule() creates a new rule and links it to the found ops, so struct fib_rule never belongs to a different netns's ops->rules_list. Let's remove redundant netns check in rule_exists() and rule_find(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207072502.87775-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10neighbour: use RCU protection in __neigh_notify()Eric Dumazet
__neigh_notify() can be called without RTNL or RCU protection. Use RCU protection to avoid potential UAF. Fixes: 426b5303eb43 ("[NETNS]: Modify the neighbour table code so it handles multiple network namespaces") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-07net: page_pool: avoid false positive warning if NAPI was never addedJakub Kicinski
We expect NAPI to be in disabled state when page pool is torn down. But it is also legal if the NAPI is completely uninitialized. Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250206225638.1387810-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-07net: devmem: don't call queue stop / start when the interface is downJakub Kicinski
We seem to be missing a netif_running() check from the devmem installation path. Starting a queue on a stopped device makes no sense. We still want to be able to allocate the memory, just to test that the device is indeed setting up the page pools in a memory provider compatible way. This is not a bug fix, because existing drivers check if the interface is down as part of the ops. But new drivers shouldn't have to do this, as long as they can correctly alloc/free while down. Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250206225638.1387810-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-07net: refactor netdev_rx_queue_restart() to use local qopsJakub Kicinski
Shorten the lines by storing dev->queue_mgmt_ops in a temp variable. Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250206225638.1387810-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-07Revert "net: skb: introduce and use a single page frag cache"Paolo Abeni
This reverts commit dbae2b062824 ("net: skb: introduce and use a single page frag cache"). The intended goal of such change was to counter a performance regression introduced by commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs"). Unfortunately, the blamed commit introduces another regression for the virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny size, so that the whole head frag could fit a 512-byte block. The single page frag cache uses a 1K fragment for such allocation, and the additional overhead, under small UDP packets flood, makes the page allocator a bottleneck. Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head"), this revert does not re-introduce the original regression. Actually, in the relevant test on top of this revert, I measure a small but noticeable positive delta, just above noise level. The revert itself required some additional mangling due to the introduction of the SKB_HEAD_ALIGN() helper and local lock infra in the affected code. Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/e649212fde9f0fdee23909ca0d14158d32bb7425.1738877290.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-07net: fib_rules: annotate data-races around rule->[io]ifindexEric Dumazet
rule->iifindex and rule->oifindex can be read without holding RTNL. Add READ_ONCE()/WRITE_ONCE() annotations where needed. Fixes: 32affa5578f0 ("fib: rules: no longer hold RTNL in fib_nl_dumprule()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250206083051.2494877-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06rtnetlink: fix netns leak with rtnl_setlink()Nicolas Dichtel
A call to rtnl_nets_destroy() is needed to release references taken on netns put in rtnl_nets. CC: stable@vger.kernel.org Fixes: 636af13f213b ("rtnetlink: Register rtnl_dellink() and rtnl_setlink() with RTNL_FLAG_DOIT_PERNET_WIP.") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250205221037.2474426-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06netdev-genl: Elide napi_id when not presentJoe Damato
There are at least two cases where napi_id may not present and the napi_id should be elided: 1. Queues could be created, but napi_enable may not have been called yet. In this case, there may be a NAPI but it may not have an ID and output of a napi_id should be elided. 2. TX-only NAPIs currently do not have NAPI IDs. If a TX queue happens to be linked with a TX-only NAPI, elide the NAPI ID from the netlink output as a NAPI ID of 0 is not useful for users. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250205193751.297211-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06Merge branch 'io_uring-zero-copy-rx'Jakub Kicinski
David Wei says: ==================== io_uring zero copy rx This patchset contains net/ patches needed by a new io_uring request implementing zero copy rx into userspace pages, eliminating a kernel to user copy. We configure a page pool that a driver uses to fill a hw rx queue to hand out user pages instead of kernel pages. Any data that ends up hitting this hw rx queue will thus be dma'd into userspace memory directly, without needing to be bounced through kernel memory. 'Reading' data out of a socket instead becomes a _notification_ mechanism, where the kernel tells userspace where the data is. The overall approach is similar to the devmem TCP proposal. This relies on hw header/data split, flow steering and RSS to ensure packet headers remain in kernel memory and only desired flows hit a hw rx queue configured for zero copy. Configuring this is outside of the scope of this patchset. We share netdev core infra with devmem TCP. The main difference is that io_uring is used for the uAPI and the lifetime of all objects are bound to an io_uring instance. Data is 'read' using a new io_uring request type. When done, data is returned via a new shared refill queue. A zero copy page pool refills a hw rx queue from this refill queue directly. Of course, the lifetime of these data buffers are managed by io_uring rather than the networking stack, with different refcounting rules. This patchset is the first step adding basic zero copy support. We will extend this iteratively with new features e.g. dynamically allocated zero copy areas, THP support, dmabuf support, improved copy fallback, general optimisations and more. In terms of netdev support, we're first targeting Broadcom bnxt. Patches aren't included since Taehee Yoo has already sent a more comprehensive patchset adding support in [1]. Google gve should already support this, and Mellanox mlx5 support is WIP pending driver changes. =========== Performance =========== Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet. Test setup: * AMD EPYC 9454 * Broadcom BCM957508 200G * Kernel v6.11 base [2] * liburing fork [3] * kperf fork [4] * 4K MTU * Single TCP flow With application thread + net rx softirq pinned to _different_ cores: +-------------------------------+ | epoll | io_uring | |-----------|-------------------| | 82.2 Gbps | 116.2 Gbps (+41%) | +-------------------------------+ Pinned to _same_ core: +-------------------------------+ | epoll | io_uring | |-----------|-------------------| | 62.6 Gbps | 80.9 Gbps (+29%) | +-------------------------------+ ===== Links ===== Broadcom bnxt support: [1]: https://lore.kernel.org/20241003160620.1521626-8-ap420073@gmail.com Linux kernel branch including io_uring bits: [2]: https://github.com/isilence/linux.git zcrx/v13 liburing for testing: [3]: https://github.com/isilence/liburing.git zcrx/next kperf for testing: [4]: https://git.kernel.dk/kperf.git ==================== Link: https://patch.msgid.link/20250204215622.695511-1-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: add helpers for setting a memory provider on an rx queueDavid Wei
Add helpers that properly prep or remove a memory provider for an rx queue then restart the queue. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-11-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: page_pool: add memory provider helpersPavel Begunkov
Add helpers for memory providers to interact with page pools. net_mp_niov_{set,clear}_page_pool() serve to [dis]associate a net_iov with a page pool. If used, the memory provider is responsible to match "set" calls with "clear" once a net_iov is not going to be used by a page pool anymore, changing a page pool, etc. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-10-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: prepare for non devmem TCP memory providersPavel Begunkov
There is a good bunch of places in generic paths assuming that the only page pool memory provider is devmem TCP. As we want to reuse the net_iov and provider infrastructure, we need to patch it up and explicitly check the provider type when we branch into devmem TCP code. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-9-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: page_pool: add a mp hook to unregister_netdevice*Pavel Begunkov
Devmem TCP needs a hook in unregister_netdevice_many_notify() to upkeep the set tracking queues it's bound to, i.e. ->bound_rxqs. Instead of devmem sticking directly out of the genetic path, add a mp function. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-8-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: page_pool: add callback for mp info printingPavel Begunkov
Add a mandatory callback that prints information about the memory provider to netlink. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-7-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: page_pool: create hooks for custom memory providersPavel Begunkov
A spin off from the original page pool memory providers patch by Jakub, which allows extending page pools with custom allocators. One of such providers is devmem TCP, and the other is io_uring zerocopy added in following patches. Link: https://lore.kernel.org/netdev/20230707183935.997267-7-kuba@kernel.org/ Co-developed-by: Jakub Kicinski <kuba@kernel.org> # initial mp proposal Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-5-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: generalise net_iov chunk ownersPavel Begunkov
Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner, which serves as a useful abstraction to share data and provide a context. However, it's too devmem specific, and we want to reuse it for other memory providers, and for that we need to decouple net_iov from devmem. Make net_iov to point to a new base structure called net_iov_area, which dmabuf_genpool_chunk_owner extends. Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-4-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: prefix devmem specific helpersPavel Begunkov
Add prefixes to all helpers that are specific to devmem TCP, i.e. net_iov_binding[_id]. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-3-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net: page_pool: don't cast mp param to devmemPavel Begunkov
page_pool_check_memory_provider() is a generic path and shouldn't assume anything about the actual type of the memory provider argument. It's fine while devmem is the only provider, but cast away the devmem specific binding types to avoid confusion. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-2-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06flow_dissector: use RCU protection to fetch dev_net()Eric Dumazet
__skb_flow_dissect() can be called from arbitrary contexts. It must extend its RCU protection section to include the call to dev_net(), which can become dev_net_rcu(). This makes sure the net structure can not disappear under us. Fixes: 9b52e3f267a6 ("flow_dissector: handle no-skb use case") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250205155120.1676781-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05net: flush_backlog() small changesEric Dumazet
Add READ_ONCE() around reads of skb->dev->reg_state, because this field can be changed from other threads/cpus. Instead of calling dev_kfree_skb_irq() and kfree_skb() while interrupts are masked and locks held, use a temporary list and use __skb_queue_purge_reason() Use SKB_DROP_REASON_DEV_READY drop reason to better describe why these skbs are dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250204144825.316785-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05net-sysfs: remove rtnl_trylock from queue attributesAntoine Tenart
Similar to the commit removing remove rtnl_trylock from device attributes we here apply the same technique to networking queues. Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-5-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05net-sysfs: prevent uncleared queues from being re-addedAntoine Tenart
With the (upcoming) removal of the rtnl_trylock/restart_syscall logic and because of how Tx/Rx queues are implemented (and their requirements), it might happen that a queue is re-added before having the chance to be cleared. In such rare case, do not complete the queue addition operation. Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-4-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05net-sysfs: move queue attribute groups outside the default groupsAntoine Tenart
Rx/tx queues embed their own kobject for registering their per-queue sysfs files. The issue is they're using the kobject default groups for this and entirely rely on the kobject refcounting for releasing their sysfs paths. In order to remove rtnl_trylock calls we need sysfs files not to rely on their associated kobject refcounting for their release. Thus we here move queues sysfs files from the kobject default groups to their own groups which can be removed separately. Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-3-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05net-sysfs: remove rtnl_trylock from device attributesAntoine Tenart
There is an ABBA deadlock between net device unregistration and sysfs files being accessed[1][2]. To prevent this from happening all paths taking the rtnl lock after the sysfs one (actually kn->active refcount) use rtnl_trylock and return early (using restart_syscall)[3], which can make syscalls to spin for a long time when there is contention on the rtnl lock[4]. There are not many possibilities to improve the above: - Rework the entire net/ locking logic. - Invert two locks in one of the paths — not possible. But here it's actually possible to drop one of the locks safely: the kernfs_node refcount. More details in the code itself, which comes with lots of comments. Note that we check the device is alive in the added sysfs_rtnl_lock helper to disallow sysfs operations to run after device dismantle has started. This also help keeping the same behavior as before. Because of this calls to dev_isalive in sysfs ops were removed. [1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/ [2] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/ [3] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/ [4] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/ Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-2-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04net: warn if NAPI instance wasn't shut downJakub Kicinski
Drivers should always disable a NAPI instance before removing it. If they don't the instance may be queued for polling. Since commit 86e25f40aa1e ("net: napi: Add napi_config") we also remove the NAPI from the busy polling hash table in napi_disable(), so not disabling would leave a stale entry there. Use of busy polling is relatively uncommon so bugs may be lurking in the drivers. Add an explicit warning. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250203215816.1294081-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04neighbour: remove neigh_parms_destroy()Eric Dumazet
neigh_parms_destroy() is a simple kfree(), no need for a forward declaration. neigh_parms_put() can instead call kfree() directly. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203151152.3163876-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03net: harmonize tstats and dstatsPaolo Abeni
After the blamed commits below, some UDP tunnel use dstats for accounting. On the xmit path, all the UDP-base tunnels ends up using iptunnel_xmit_stats() for stats accounting, and the latter assumes the relevant (tunnel) network device uses tstats. The end result is some 'funny' stat report for the mentioned UDP tunnel, e.g. when no packet is actually dropped and a bunch of packets are transmitted: gnv2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue \ state UNKNOWN mode DEFAULT group default qlen 1000 link/ether ee:7d:09:87:90:ea brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped missed mcast 14916 23 0 15 0 0 TX: bytes packets errors dropped carrier collsns 0 1566 0 0 0 0 Address the issue ensuring the same binary layout for the overlapping fields of dstats and tstats. While this solution is a bit hackish, is smaller and with no performance pitfall compared to other alternatives i.e. supporting both dstat and tstat in iptunnel_xmit_stats() or reverting the blamed commit. With time we should possibly move all the IP-based tunnel (and virtual devices) to dstats. Fixes: c77200c07491 ("bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Fixes: 6fa6de302246 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Fixes: be226352e8dc ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/2e1c444cf0f63ae472baff29862c4c869be17031.1738432804.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30Merge tag 'net-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from IPSec, netfilter and Bluetooth. Nothing really stands out, but as usual there's a slight concentration of fixes for issues added in the last two weeks before the merge window, and driver bugs from 6.13 which tend to get discovered upon wider distribution. Current release - regressions: - net: revert RTNL changes in unregister_netdevice_many_notify() - Bluetooth: fix possible infinite recursion of btusb_reset - eth: adjust locking in some old drivers which protect their state with spinlocks to avoid sleeping in atomic; core protects netdev state with a mutex now Previous releases - regressions: - eth: - mlx5e: make sure we pass node ID, not CPU ID to kvzalloc_node() - bgmac: reduce max frame size to support just 1500 bytes; the jumbo frame support would previously cause OOB writes, but now fails outright - mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted, avoid false detection of MPTCP blackholing Previous releases - always broken: - mptcp: handle fastopen disconnect correctly - xfrm: - make sure skb->sk is a full sock before accessing its fields - fix taking a lock with preempt disabled for RT kernels - usb: ipheth: improve safety of packet metadata parsing; prevent potential OOB accesses - eth: renesas: fix missing rtnl lock in suspend/resume path" * tag 'net-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) MAINTAINERS: add Neal to TCP maintainers net: revert RTNL changes in unregister_netdevice_many_notify() net: hsr: fix fill_frame_info() regression vs VLAN packets doc: mptcp: sysctl: blackhole_timeout is per-netns mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted netfilter: nf_tables: reject mismatching sum of field_len with set key length net: sh_eth: Fix missing rtnl lock in suspend/resume path net: ravb: Fix missing rtnl lock in suspend/resume path selftests/net: Add test for loading devbound XDP program in generic mode net: xdp: Disallow attaching device-bound programs in generic mode tcp: correct handling of extreme memory squeeze bgmac: reduce max frame size to support just MTU 1500 vsock/test: Add test for connect() retries vsock/test: Add test for UAF due to socket unbinding vsock/test: Introduce vsock_connect_fd() vsock/test: Introduce vsock_bind() vsock: Allow retrying on connect() failure vsock: Keep the binding until socket destruction Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming ...
2025-01-30net: revert RTNL changes in unregister_netdevice_many_notify()Eric Dumazet
This patch reverts following changes: 83419b61d187 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2) ae646f1a0bb9 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1) cfa579f66656 net: no longer hold RTNL while calling flush_all_backlogs() This caused issues in layers holding a private mutex: cleanup_net() rtnl_lock(); mutex_lock(subsystem_mutex); unregister_netdevice(); rtnl_unlock(); // LOCKDEP violation rtnl_lock(); I will revisit this in next cycle, opt-in for the new behavior from safe contexts only. Fixes: cfa579f66656 ("net: no longer hold RTNL while calling flush_all_backlogs()") Fixes: ae646f1a0bb9 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)") Fixes: 83419b61d187 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)") Reported-by: syzbot+5b9196ecf74447172a9a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6789d55f.050a0220.20d369.004e.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250129142726.747726-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29net: xdp: Disallow attaching device-bound programs in generic modeToke Høiland-Jørgensen
Device-bound programs are used to support RX metadata kfuncs. These kfuncs are driver-specific and rely on the driver context to read the metadata. This means they can't work in generic XDP mode. However, there is no check to disallow such programs from being attached in generic mode, in which case the metadata kfuncs will be called in an invalid context, leading to crashes. Fix this by adding a check to disallow attaching device-bound programs in generic mode. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> Closes: https://lore.kernel.org/r/dae862ec-43b5-41a0-8edf-46c59071cdda@hetzner-cloud.de Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250127131344.238147-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29bpf: Disable non stream socket for strparserJiayuan Chen
Currently, only TCP supports strparser, but sockmap doesn't intercept non-TCP connections to attach strparser. For example, with UDP, although the read/write handlers are replaced, strparser is not executed due to the lack of a read_sock operation. Furthermore, in udp_bpf_recvmsg(), it checks whether the psock has data, and if not, it falls back to the native UDP read interface, making UDP + strparser appear to read correctly. According to its commit history, this behavior is unexpected. Moreover, since UDP lacks the concept of streams, we intercept it directly. Fixes: 1fa1fe8ff161 ("bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0") Signed-off-by: Jiayuan Chen <mrpre@163.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://patch.msgid.link/20250122100917.49845-4-mrpre@163.com