summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-03-25tcp: support TCP_DELACK_MAX_US for set/getsockopt useJason Xing
Support adjusting/reading delayed ack max for socket level by using set/getsockopt(). This option aligns with TCP_BPF_DELACK_MAX usage. Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. Add WRITE_ONCE/READ_ONCE() to prevent data-race if setsockopt() happens to write one value to icsk_delack_max while icsk_delack_max is being read. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250317120314.41404-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25tcp: support TCP_RTO_MIN_US for set/getsockopt useJason Xing
Support adjusting/reading RTO MIN for socket level by using set/getsockopt(). This new option has the same effect as TCP_BPF_RTO_MIN, which means it doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. When the socket is created, its icsk_rto_min is set to the default value that is controlled by sysctl_tcp_rto_min_us. Then if application calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then icsk_rto_min will be overridden in jiffies unit. This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around icsk_rto_min. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net: introduce per netns packet chainsPaolo Abeni
Currently network taps unbound to any interface are linked in the global ptype_all list, affecting the performance in all the network namespaces. Add per netns ptypes chains, so that in the mentioned case only the netns owning the packet socket(s) is affected. While at that drop the global ptype_all list: no in kernel user registers a tap on "any" type without specifying either the target device or the target namespace (and IMHO doing that would not make any sense). Note that this adds a conditional in the fast path (to check for per netns ptype_specific list) and increases the dataset size by a cacheline (owing the per netns lists). Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumaze@google.com> Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24netpoll: Eliminate redundant assignmentBreno Leitao
The assignment of zero to udph->check is unnecessary as it is immediately overwritten in the subsequent line. Remove the redundant assignment. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250319-netpoll_nit-v1-1-a7faac5cbd92@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24Merge tag 'vfs-6.15-rc1.afs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs afs updates from Christian Brauner: "This contains the work for afs for this cycle: - Fix an occasional hang that's only really encountered when rmmod'ing the kafs module - Remove the "-o autocell" mount option. This is obsolete with the dynamic root and removing it makes the next patch slightly easier - Change how the dynamic root mount is constructed. Currently, the root directory is (de)populated when it is (un)mounted if there are cells already configured and, further, pairs of automount points have to be created/removed each time a cell is added/deleted This is changed so that readdir on the root dir lists all the known cell automount pairs plus the @cell symlinks and the inodes and dentries are constructed by lookup on demand. This simplifies the cell management code - A few improvements to the afs_volume and afs_server tracepoints - Pass trace info into the afs_lookup_cell() function to allow the trace log to indicate the purpose of the lookup - Remove the 'net' parameter from afs_unuse_cell() as it's superfluous - In rxrpc, allow a kernel app (such as kafs) to store a word of information on rxrpc_peer records - Use the information stored on the rxrpc_peer record to point to the afs_server record. This allows the server address lookup to be done away with - Simplify the afs_server ref/activity accounting to make each one self-contained and not garbage collected from the cell management work item - Simplify the afs_cell ref/activity accounting to make each one of these also self-contained and not driven by a central management work item The current code was intended to make it such that a single timer for the namespace and one work item per cell could do all the work required to maintain these records. This, however, made for some sequencing problems when cleaning up these records. Further, the attempt to pass refs along with timers and work items made getting it right rather tricky when the timer or work item already had a ref attached and now a ref had to be got rid of" * tag 'vfs-6.15-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Simplify cell record handling afs: Fix afs_server ref accounting afs: Use the per-peer app data provided by rxrpc rxrpc: Allow the app to store private data on peer structs afs: Drop the net parameter from afs_unuse_cell() afs: Make afs_lookup_cell() take a trace note afs: Improve server refcount/active count tracing afs: Improve afs_volume tracing to display a debug ID afs: Change dynroot to create contents on demand afs: Remove the "autocell" mount option
2025-03-24tcp/dccp: Remove inet_connection_sock_af_ops.addr2sockaddr().Kuniyuki Iwashima
inet_connection_sock_af_ops.addr2sockaddr() hasn't been used at all in the git era. $ git grep addr2sockaddr $(git rev-list HEAD | tail -n 1) Let's remove it. Note that there was a 4 bytes hole after sockaddr_len and now it's 6 bytes, so the binary layout is not changed. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250318060112.3729-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net: pktgen: add strict buffer parsing index checkPeter Seiderer
Add strict buffer parsing index check to avoid the following Smatch warning: net/core/pktgen.c:877 get_imix_entries() warn: check that incremented offset 'i' is capped Checking the buffer index i after every get_user/i++ step and returning with error code immediately avoids the current indirect (but correct) error handling. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/netdev/36cf3ee2-38b1-47e5-a42a-363efeb0ace3@stanley.mountain/ Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250317090401.1240704-1-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24tcp: move icsk_clean_acked to a better locationEric Dumazet
As a followup of my presentation in Zagreb for netdev 0x19: icsk_clean_acked is only used by TCP when/if CONFIG_TLS_DEVICE is enabled from tcp_ack(). Rename it to tcp_clean_acked, move it to tcp_sock structure in the tcp_sock_read_rx for better cache locality in TCP fast path. Define this field only when CONFIG_TLS_DEVICE is enabled saving 8 bytes on configs not using it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250317085313.2023214-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net: openvswitch: fix kernel-doc warnings in internal headersIlya Maximets
Some field descriptions were missing, some were not very accurate. Not touching the uAPI header or .c files for now. Formatting of those comments isn't great in general, but at least they are not missing anything now. Before: $ ./scripts/kernel-doc -none -Wall net/openvswitch/*.h 2>&1 | wc -l 16 After: $ ./scripts/kernel-doc -none -Wall net/openvswitch/*.h 2>&1 | wc -l 0 Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20250320224431.252489-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24ax25: Remove broken autobindMurad Masimov
Binding AX25 socket by using the autobind feature leads to memory leaks in ax25_connect() and also refcount leaks in ax25_release(). Memory leak was detected with kmemleak: ================================================================ unreferenced object 0xffff8880253cd680 (size 96): backtrace: __kmalloc_node_track_caller_noprof (./include/linux/kmemleak.h:43) kmemdup_noprof (mm/util.c:136) ax25_rt_autobind (net/ax25/ax25_route.c:428) ax25_connect (net/ax25/af_ax25.c:1282) __sys_connect_file (net/socket.c:2045) __sys_connect (net/socket.c:2064) __x64_sys_connect (net/socket.c:2067) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) ================================================================ When socket is bound, refcounts must be incremented the way it is done in ax25_bind() and ax25_setsockopt() (SO_BINDTODEVICE). In case of autobind, the refcounts are not incremented. This bug leads to the following issue reported by Syzkaller: ================================================================ ax25_connect(): syz-executor318 uses autobind, please contact jreuter@yaina.de ------------[ cut here ]------------ refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 5317 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 Modules linked in: CPU: 0 UID: 0 PID: 5317 Comm: syz-executor318 Not tainted 6.14.0-rc4-syzkaller-00278-gece144f151ac #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 ... Call Trace: <TASK> __refcount_dec include/linux/refcount.h:336 [inline] refcount_dec include/linux/refcount.h:351 [inline] ref_tracker_free+0x6af/0x7e0 lib/ref_tracker.c:236 netdev_tracker_free include/linux/netdevice.h:4302 [inline] netdev_put include/linux/netdevice.h:4319 [inline] ax25_release+0x368/0x960 net/ax25/af_ax25.c:1080 __sock_release net/socket.c:647 [inline] sock_close+0xbc/0x240 net/socket.c:1398 __fput+0x3e9/0x9f0 fs/file_table.c:464 __do_sys_close fs/open.c:1580 [inline] __se_sys_close fs/open.c:1565 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1565 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... </TASK> ================================================================ Considering the issues above and the comments left in the code that say: "check if we can remove this feature. It is broken."; "autobinding in this may or may not work"; - it is better to completely remove this feature than to fix it because it is broken and leads to various kinds of memory bugs. Now calling connect() without first binding socket will result in an error (-EINVAL). Userspace software that relies on the autobind feature might get broken. However, this feature does not seem widely used with this specific driver as it was not reliable at any point of time, and it is already broken anyway. E.g. ax25-tools and ax25-apps packages for popular distributions do not use the autobind feature for AF_AX25. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+33841dc6aa3e1d86b78a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33841dc6aa3e1d86b78a Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-23netfilter: nf_tables: Only use nf_skip_indirect_calls() when ↵WangYuli
MITIGATION_RETPOLINE 1. MITIGATION_RETPOLINE is x86-only (defined in arch/x86/Kconfig), so no need to AND with CONFIG_X86 when checking if enabled. 2. Remove unused declaration of nf_skip_indirect_calls() when MITIGATION_RETPOLINE is disabled to avoid warnings. 3. Declare nf_skip_indirect_calls() and nf_skip_indirect_calls_enable() as inline when MITIGATION_RETPOLINE is enabled, as they are called only once and have simple logic. Fix follow error with clang-21 when W=1e: net/netfilter/nf_tables_core.c:39:20: error: unused function 'nf_skip_indirect_calls' [-Werror,-Wunused-function] 39 | static inline bool nf_skip_indirect_calls(void) { return false; } | ^~~~~~~~~~~~~~~~~~~~~~ 1 error generated. make[4]: *** [scripts/Makefile.build:207: net/netfilter/nf_tables_core.o] Error 1 make[3]: *** [scripts/Makefile.build:465: net/netfilter] Error 2 make[3]: *** Waiting for unfinished jobs.... Fixes: d8d760627855 ("netfilter: nf_tables: add static key to skip retpoline workarounds") Co-developed-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: WangYuli <wangyuli@uniontech.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-23netfilter: socket: Lookup orig tuple for IPv6 SNATMaxim Mikityanskiy
nf_sk_lookup_slow_v4 does the conntrack lookup for IPv4 packets to restore the original 5-tuple in case of SNAT, to be able to find the right socket (if any). Then socket_match() can correctly check whether the socket was transparent. However, the IPv6 counterpart (nf_sk_lookup_slow_v6) lacks this conntrack lookup, making xt_socket fail to match on the socket when the packet was SNATed. Add the same logic to nf_sk_lookup_slow_v6. IPv6 SNAT is used in Kubernetes clusters for pod-to-world packets, as pods' addresses are in the fd00::/8 ULA subnet and need to be replaced with the node's external address. Cilium leverages Envoy to enforce L7 policies, and Envoy uses transparent sockets. Cilium inserts an iptables prerouting rule that matches on `-m socket --transparent` and redirects the packets to localhost, but it fails to match SNATed IPv6 packets due to that missing conntrack lookup. Closes: https://github.com/cilium/cilium/issues/37932 Fixes: eb31628e37a0 ("netfilter: nf_tables: Add support for IPv6 NAT") Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-23netfilter: xtables: Use strscpy() instead of strscpy_pad()Thorsten Blum
kzalloc() already zero-initializes the destination buffer, making strscpy() sufficient for safely copying the name. The additional NUL- padding performed by strscpy_pad() is unnecessary. The size parameter is optional, and strscpy() automatically determines the size of the destination buffer using sizeof() if the argument is omitted. This makes the explicit sizeof() call unnecessary; remove it. No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-23netfilter: nfnetlink_queue: Initialize ctx to avoid memory allocation errorChenyuan Yang
It is possible that ctx in nfqnl_build_packet_message() could be used before it is properly initialize, which is only initialized by nfqnl_get_sk_secctx(). This patch corrects this problem by initializing the lsmctx to a safe value when it is declared. This is similar to the commit 35fcac7a7c25 ("audit: Initialize lsmctx to avoid memory allocation error"). Fixes: 2d470c778120 ("lsm: replace context+len with lsm_context") Signed-off-by: Chenyuan Yang <chenyuan0y@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-239p: Use hashtable.h for hash_errmapSasha Levin
Convert hash_errmap in error.c to use the generic hashtable implementation from hashtable.h instead of the manual hlist_head array implementation. This simplifies the code and makes it more maintainable by using the standard hashtable API and removes the need for manual hash table management. Signed-off-by: Sasha Levin <sashal@kernel.org> Message-ID: <20250320145200.3124863-1-sashal@kernel.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-03-21net: Remove RTNL dance for SIOCBRADDIF and SIOCBRDELIF.Kuniyuki Iwashima
SIOCBRDELIF is passed to dev_ioctl() first and later forwarded to br_ioctl_call(), which causes unnecessary RTNL dance and the splat below [0] under RTNL pressure. Let's say Thread A is trying to detach a device from a bridge and Thread B is trying to remove the bridge. In dev_ioctl(), Thread A bumps the bridge device's refcnt by netdev_hold() and releases RTNL because the following br_ioctl_call() also re-acquires RTNL. In the race window, Thread B could acquire RTNL and try to remove the bridge device. Then, rtnl_unlock() by Thread B will release RTNL and wait for netdev_put() by Thread A. Thread A, however, must hold RTNL after the unlock in dev_ifsioc(), which may take long under RTNL pressure, resulting in the splat by Thread B. Thread A (SIOCBRDELIF) Thread B (SIOCBRDELBR) ---------------------- ---------------------- sock_ioctl sock_ioctl `- sock_do_ioctl `- br_ioctl_call `- dev_ioctl `- br_ioctl_stub |- rtnl_lock | |- dev_ifsioc ' ' |- dev = __dev_get_by_name(...) |- netdev_hold(dev, ...) . / |- rtnl_unlock ------. | | |- br_ioctl_call `---> |- rtnl_lock Race | | `- br_ioctl_stub |- br_del_bridge Window | | | |- dev = __dev_get_by_name(...) | | | May take long | `- br_dev_delete(dev, ...) | | | under RTNL pressure | `- unregister_netdevice_queue(dev, ...) | | | | `- rtnl_unlock \ | |- rtnl_lock <-' `- netdev_run_todo | |- ... `- netdev_run_todo | `- rtnl_unlock |- __rtnl_unlock | |- netdev_wait_allrefs_any |- netdev_put(dev, ...) <----------------' Wait refcnt decrement and log splat below To avoid blocking SIOCBRDELBR unnecessarily, let's not call dev_ioctl() for SIOCBRADDIF and SIOCBRDELIF. In the dev_ioctl() path, we do the following: 1. Copy struct ifreq by get_user_ifreq in sock_do_ioctl() 2. Check CAP_NET_ADMIN in dev_ioctl() 3. Call dev_load() in dev_ioctl() 4. Fetch the master dev from ifr.ifr_name in dev_ifsioc() 3. can be done by request_module() in br_ioctl_call(), so we move 1., 2., and 4. to br_ioctl_stub(). Note that 2. is also checked later in add_del_if(), but it's better performed before RTNL. SIOCBRADDIF and SIOCBRDELIF have been processed in dev_ioctl() since the pre-git era, and there seems to be no specific reason to process them there. [0]: unregister_netdevice: waiting for wpan3 to become free. Usage count = 2 ref_tracker: wpan3@ffff8880662d8608 has 1/1 users at __netdev_tracker_alloc include/linux/netdevice.h:4282 [inline] netdev_hold include/linux/netdevice.h:4311 [inline] dev_ifsioc+0xc6a/0x1160 net/core/dev_ioctl.c:624 dev_ioctl+0x255/0x10c0 net/core/dev_ioctl.c:826 sock_do_ioctl+0x1ca/0x260 net/socket.c:1213 sock_ioctl+0x23a/0x6c0 net/socket.c:1318 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl fs/ioctl.c:892 [inline] __x64_sys_ioctl+0x1a4/0x210 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 893b19587534 ("net: bridge: fix ioctl locking") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: yan kang <kangyan91@outlook.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/netdev/SY8P300MB0421225D54EB92762AE8F0F2A1D32@SY8P300MB0421.AUSP300.PROD.OUTLOOK.COM/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250316192851.19781-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-21mptcp: sockopt: fix getting freebind & transparentMatthieu Baerts (NGI0)
When adding a socket option support in MPTCP, both the get and set parts are supposed to be implemented. IP(V6)_FREEBIND and IP(V6)_TRANSPARENT support for the setsockopt part has been added a while ago, but it looks like the get part got forgotten. It should have been present as a way to verify a setting has been set as expected, and not to act differently from TCP or any other socket types. Everything was in place to expose it, just the last step was missing. Only new code is added to cover these specific getsockopt(), that seems safe. Fixes: c9406a23c116 ("mptcp: sockopt: add SOL_IP freebind & transparent options") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250314-net-mptcp-fix-data-stream-corr-sockopt-v1-3-122dbb249db3@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-21mptcp: sockopt: fix getting IPV6_V6ONLYMatthieu Baerts (NGI0)
When adding a socket option support in MPTCP, both the get and set parts are supposed to be implemented. IPV6_V6ONLY support for the setsockopt part has been added a while ago, but it looks like the get part got forgotten. It should have been present as a way to verify a setting has been set as expected, and not to act differently from TCP or any other socket types. Not supporting this getsockopt(IPV6_V6ONLY) blocks some apps which want to check the default value, before doing extra actions. On Linux, the default value is 0, but this can be changed with the net.ipv6.bindv6only sysctl knob. On Windows, it is set to 1 by default. So supporting the get part, like for all other socket options, is important. Everything was in place to expose it, just the last step was missing. Only new code is added to cover this specific getsockopt(), that seems safe. Fixes: c9b95a135987 ("mptcp: support IPV6_V6ONLY setsockopt") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/550 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250314-net-mptcp-fix-data-stream-corr-sockopt-v1-2-122dbb249db3@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-21NFS: Treat ENETUNREACH errors as fatal in containersTrond Myklebust
Propagate the NFS_MOUNT_NETUNREACH_FATAL flag to work with the generic NFS client. If the flag is set, the client will receive ENETDOWN and ENETUNREACH errors from the RPC layer, and is expected to treat them as being fatal. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-21sunrpc: Add a sysfs file for one-step xprt deletionAnna Schumaker
Previously, the admin would need to set the xprt state to "offline" before attempting to remove. This patch adds a new sysfs attr that does both these steps in a single call. Suggested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Link: https://lore.kernel.org/r/20250207204225.594002-6-anna@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-21sunrpc: Add a sysfs file for adding a new xprtAnna Schumaker
Writing to this file will clone the 'main' xprt of an xprt_switch and add it to be used as an additional connection. -- Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> v3: Replace call to xprt_iter_get_xprt() with xprt_iter_get_next() Link: https://lore.kernel.org/r/20250207204225.594002-5-anna@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-21sunrpc: Add a sysfs files for rpc_clnt informationAnna Schumaker
These files display useful information about the RPC client, such as the rpc version number, program name, and maximum number of connections allowed. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Link: https://lore.kernel.org/r/20250207204225.594002-4-anna@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-21sunrpc: Add a sysfs attr for xprtsecAnna Schumaker
This allows the admin to check the TLS configuration for each xprt. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Link: https://lore.kernel.org/r/20250207204225.594002-3-anna@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-03-21xfrm: ipcomp: Use crypto_acomp interfaceHerbert Xu
Replace the legacy comperssion interface with the new acomp interface. This is the first user to make full user of the asynchronous nature of acomp by plugging into the existing xfrm resume interface. As a result of SG support by acomp, the linear scratch buffer in ipcomp can be removed. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-21xfrm: ipcomp: Call pskb_may_pull in ipcomp_inputHerbert Xu
If a malformed packet is received there may not be enough data to pull. This isn't a problem in practice because the caller has already done xfrm_parse_spi which in effect does the same thing. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-03-21netfilter: fib: avoid lookup if socket is availableFlorian Westphal
In case the fib match is used from the input hook we can avoid the fib lookup if early demux assigned a socket for us: check that the input interface matches sk-cached one. Rework the existing 'lo bypass' logic to first check sk, then for loopback interface type to elide the fib lookup. This speeds up fib matching a little, before: 93.08 GBit/s (no rules at all) 75.1 GBit/s ("fib saddr . iif oif missing drop" in prerouting) 75.62 GBit/s ("fib saddr . iif oif missing drop" in input) After: 92.48 GBit/s (no rules at all) 75.62 GBit/s (fib rule in prerouting) 90.37 GBit/s (fib rule in input). Numbers for the 'no rules' and 'prerouting' are expected to closely match in-between runs, the 3rd/input test case exercises the the 'avoid lookup if cached ifindex in sk matches' case. Test used iperf3 via veth interface, lo can't be used due to existing loopback test. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-20Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-03-13 The following pull-request contains BPF updates for your *net-next* tree. We've added 4 non-merge commits during the last 3 day(s) which contain a total of 2 files changed, 35 insertions(+), 12 deletions(-). The main changes are: 1) bpf_getsockopt support for TCP_BPF_RTO_MIN and TCP_BPF_DELACK_MAX, from Jason Xing bpf-next-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add bpf_getsockopt() for TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN tcp: bpf: Support bpf_getsockopt for TCP_BPF_DELACK_MAX tcp: bpf: Support bpf_getsockopt for TCP_BPF_RTO_MIN tcp: bpf: Introduce bpf_sol_tcp_getsockopt to support TCP_BPF flags ==================== Link: https://patch.msgid.link/20250313221620.2512684-1-martin.lau@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.14-rc8). Conflict: tools/testing/selftests/net/Makefile 03544faad761 ("selftest: net: add proc_net_pktgen") 3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops") tools/testing/selftests/net/config: 85cb3711acb8 ("selftests: net: Add test cases for link and peer netns") 3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops") Adjacent commits: tools/testing/selftests/net/Makefile c935af429ec2 ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY") 355d940f4d5a ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20Revert "gre: Fix IPv6 link-local address generation."Guillaume Nault
This reverts commit 183185a18ff96751db52a46ccf93fff3a1f42815. This patch broke net/forwarding/ip6gre_custom_multipath_hash.sh in some circumstances (https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/). Let's revert it while the problem is being investigated. Fixes: 183185a18ff9 ("gre: Fix IPv6 link-local address generation.") Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/8b1ce738eb15dd841aab9ef888640cab4f6ccfea.1742418408.git.gnault@redhat.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20Merge tag 'ipsec-2025-03-19' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-03-19 1) Fix tunnel mode TX datapath in packet offload mode by directly putting it to the xmit path. From Alexandre Cassen. 2) Force software GSO only in tunnel mode in favor of potential HW GSO. From Cosmin Ratiu. ipsec-2025-03-19 * tag 'ipsec-2025-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm_output: Force software GSO only in tunnel mode xfrm: fix tunnel mode TX datapath in packet offload mode ==================== Link: https://patch.msgid.link/20250319065513.987135-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20Merge tag 'batadv-net-pullrequest-20250318' of ↵Paolo Abeni
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is batman-adv bugfix: - Ignore own maximum aggregation size during RX, Sven Eckelmann * tag 'batadv-net-pullrequest-20250318' of git://git.open-mesh.org/linux-merge: batman-adv: Ignore own maximum aggregation size during RX ==================== Link: https://patch.msgid.link/20250318150035.35356-1-sw@simonwunderlich.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20net/neighbor: add missing policy for NDTPA_QUEUE_LENBYTESLin Ma
Previous commit 8b5c171bb3dc ("neigh: new unresolved queue limits") introduces new netlink attribute NDTPA_QUEUE_LENBYTES to represent approximative value for deprecated QUEUE_LEN. However, it forgot to add the associated nla_policy in nl_ntbl_parm_policy array. Fix it with one simple NLA_U32 type policy. Fixes: 8b5c171bb3dc ("neigh: new unresolved queue limits") Signed-off-by: Lin Ma <linma@zju.edu.cn> Link: https://patch.msgid.link/20250315165113.37600-1-linma@zju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: Fix data stream corruption in the address announcementArthur Mongodin
Because of the size restriction in the TCP options space, the MPTCP ADD_ADDR option is exclusive and cannot be sent with other MPTCP ones. For this reason, in the linked mptcp_out_options structure, group of fields linked to different options are part of the same union. There is a case where the mptcp_pm_add_addr_signal() function can modify opts->addr, but not ended up sending an ADD_ADDR. Later on, back in mptcp_established_options, other options will be sent, but with unexpected data written in other fields due to the union, e.g. in opts->ext_copy. This could lead to a data stream corruption in the next packet. Using an intermediate variable, prevents from corrupting previously established DSS option. The assignment of the ADD_ADDR option parameters is now done once we are sure this ADD_ADDR option can be set in the packet, e.g. after having dropped other suboptions. Fixes: 1bff1e43a30e ("mptcp: optimize out option generation") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Arthur Mongodin <amongodin@randorisec.fr> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> [ Matt: the commit message has been updated: long lines splits and some clarifications. ] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250314-net-mptcp-fix-data-stream-corr-sockopt-v1-1-122dbb249db3@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20net: ipv6: ioam6: fix lwtunnel_output() loopJustin Iurman
Fix the lwtunnel_output() reentry loop in ioam6_iptunnel when the destination is the same after transformation. Note that a check on the destination address was already performed, but it was not enough. This is the example of a lwtunnel user taking care of loops without relying only on the last resort detection offered by lwtunnel. Fixes: 8cb3bf8bff3c ("ipv6: ioam: Add support for the ip6ip6 encapsulation") Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250314120048.12569-3-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20net: lwtunnel: fix recursion loopsJustin Iurman
This patch acts as a parachute, catch all solution, by detecting recursion loops in lwtunnel users and taking care of them (e.g., a loop between routes, a loop within the same route, etc). In general, such loops are the consequence of pathological configurations. Each lwtunnel user is still free to catch such loops early and do whatever they want with them. It will be the case in a separate patch for, e.g., seg6 and seg6_local, in order to provide drop reasons and update statistics. Another example of a lwtunnel user taking care of loops is ioam6, which has valid use cases that include loops (e.g., inline mode), and which is addressed by the next patch in this series. Overall, this patch acts as a last resort to catch loops and drop packets, since we don't want to leak something unintentionally because of a pathological configuration in lwtunnels. The solution in this patch reuses dev_xmit_recursion(), dev_xmit_recursion_inc(), and dev_xmit_recursion_dec(), which seems fine considering the context. Closes: https://lore.kernel.org/netdev/2bc9e2079e864a9290561894d2a602d6@akamai.com/ Closes: https://lore.kernel.org/netdev/Z7NKYMY7fJT5cYWu@shredder/ Fixes: ffce41962ef6 ("lwtunnel: support dst output redirect function") Fixes: 2536862311d2 ("lwt: Add support to redirect dst.input") Fixes: 14972cbd34ff ("net: lwtunnel: Handle fragmentation") Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250314120048.12569-2-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20net: atm: fix use after free in lec_send()Dan Carpenter
The ->send() operation frees skb so save the length before calling ->send() to avoid a use after free. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/c751531d-4af4-42fe-affe-6104b34b791d@stanley.mountain Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: sysctl: add available_path_managersGeliang Tang
Similarly to net.mptcp.available_schedulers, this patch adds a new one net.mptcp.available_path_managers to list the available path managers. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-11-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: sysctl: map pm_type to path_managerGeliang Tang
This patch adds a new proc_handler "proc_pm_type" for "pm_type" to map old path manager sysctl "pm_type" to the newly added "path_manager". path_manager pm_type MPTCP_PM_TYPE_KERNEL -> "kernel" MPTCP_PM_TYPE_USERSPACE -> "userspace" It is important to add this to keep a compatibility with the now deprecated pm_type sysctl knob. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-10-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: sysctl: map path_manager to pm_typeGeliang Tang
This patch maps the newly added path manager sysctl "path_manager" to the old one "pm_type". path_manager pm_type "kernel" -> MPTCP_PM_TYPE_KERNEL "userspace" -> MPTCP_PM_TYPE_USERSPACE others -> __MPTCP_PM_TYPE_NR It is important to add this to keep a compatibility with the now deprecated pm_type sysctl knob. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-9-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: sysctl: set path manager by nameGeliang Tang
Similar to net.mptcp.scheduler, a new net.mptcp.path_manager sysctl knob is added to determine which path manager will be used by each newly created MPTCP socket by setting the name of it. Dealing with an explicit name is easier than with a number, especially when more PMs will be introduced. This sysctl knob makes the old one "pm_type" deprecated. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-8-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: pm: register in-kernel and userspace PMGeliang Tang
This patch defines the original in-kernel netlink path manager as a new struct mptcp_pm_ops named "mptcp_pm_kernel", and register it in mptcp_pm_kernel_register(). And define the userspace path manager as a new struct mptcp_pm_ops named "mptcp_pm_userspace", and register it in mptcp_pm_init(). To ensure that there's always a valid path manager available, the default path manager "mptcp_pm_kernel" will be skipped in mptcp_pm_unregister(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-7-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: pm: define struct mptcp_pm_opsGeliang Tang
In order to allow users to develop their own BPF-based path manager, this patch defines a struct ops "mptcp_pm_ops" for an MPTCP path manager, which contains a set of interfaces. Currently only init() and release() interfaces are included, subsequent patches will add others step by step. Add a set of functions to register, unregister, find and validate a given path manager struct ops. "list" is used to add this path manager to mptcp_pm_list list when it is registered. "name" is used to identify this path manager. mptcp_pm_find() uses "name" to find a path manager on the list. mptcp_pm_unregister is not used in this set, but will be invoked in .unreg of struct bpf_struct_ops. mptcp_pm_validate() will be invoked in .validate of struct bpf_struct_ops. That's why they are exported. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-6-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: pm: add struct_group in mptcp_pm_dataGeliang Tang
This patch adds a "struct_group(reset, ...)" in struct mptcp_pm_data to simplify the reset, and make sure we don't miss any. Suggested-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-5-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: pm: only fill id_avail_bitmap for in-kernel pmGeliang Tang
id_avail_bitmap of struct mptcp_pm_data is currently only used by the in-kernel PM, so this patch moves its initialization operation under the "if (pm_type == MPTCP_PM_TYPE_KERNEL)" condition. Suggested-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-4-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: pm: use pm variable instead of msk->pmGeliang Tang
The variable "pm" has been defined in mptcp_pm_fully_established() and mptcp_pm_data_reset() as "msk->pm", so use "pm" directly instead of using "msk->pm". Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-3-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: pm: in-kernel: use kmemdup helperGeliang Tang
Instead of using kmalloc() or kzalloc() to allocate an entry and then immediately duplicate another entry to the newly allocated one, kmemdup() helper can be used to simplify the code. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-2-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20mptcp: pm: split netlink and in-kernel initMatthieu Baerts (NGI0)
The registration of mptcp_genl_family is useful for both the in-kernel and the userspace PM. It should then be done in pm_netlink.c. On the other hand, the registration of the in-kernel pernet subsystem is specific to the in-kernel PM, and should stay there in pm_kernel.c. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313-net-next-mptcp-pm-ops-intro-v1-1-f4e4a88efc50@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20net: vlan: don't propagate flags on openStanislav Fomichev
With the device instance lock, there is now a possibility of a deadlock: [ 1.211455] ============================================ [ 1.211571] WARNING: possible recursive locking detected [ 1.211687] 6.14.0-rc5-01215-g032756b4ca7a-dirty #5 Not tainted [ 1.211823] -------------------------------------------- [ 1.211936] ip/184 is trying to acquire lock: [ 1.212032] ffff8881024a4c30 (&dev->lock){+.+.}-{4:4}, at: dev_set_allmulti+0x4e/0xb0 [ 1.212207] [ 1.212207] but task is already holding lock: [ 1.212332] ffff8881024a4c30 (&dev->lock){+.+.}-{4:4}, at: dev_open+0x50/0xb0 [ 1.212487] [ 1.212487] other info that might help us debug this: [ 1.212626] Possible unsafe locking scenario: [ 1.212626] [ 1.212751] CPU0 [ 1.212815] ---- [ 1.212871] lock(&dev->lock); [ 1.212944] lock(&dev->lock); [ 1.213016] [ 1.213016] *** DEADLOCK *** [ 1.213016] [ 1.213143] May be due to missing lock nesting notation [ 1.213143] [ 1.213294] 3 locks held by ip/184: [ 1.213371] #0: ffffffff838b53e0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock+0x1b/0xa0 [ 1.213543] #1: ffffffff84e5fc70 (&net->rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock+0x37/0xa0 [ 1.213727] #2: ffff8881024a4c30 (&dev->lock){+.+.}-{4:4}, at: dev_open+0x50/0xb0 [ 1.213895] [ 1.213895] stack backtrace: [ 1.213991] CPU: 0 UID: 0 PID: 184 Comm: ip Not tainted 6.14.0-rc5-01215-g032756b4ca7a-dirty #5 [ 1.213993] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 1.213994] Call Trace: [ 1.213995] <TASK> [ 1.213996] dump_stack_lvl+0x8e/0xd0 [ 1.214000] print_deadlock_bug+0x28b/0x2a0 [ 1.214020] lock_acquire+0xea/0x2a0 [ 1.214027] __mutex_lock+0xbf/0xd40 [ 1.214038] dev_set_allmulti+0x4e/0xb0 # real_dev->flags & IFF_ALLMULTI [ 1.214040] vlan_dev_open+0xa5/0x170 # ndo_open on vlandev [ 1.214042] __dev_open+0x145/0x270 [ 1.214046] __dev_change_flags+0xb0/0x1e0 [ 1.214051] netif_change_flags+0x22/0x60 # IFF_UP vlandev [ 1.214053] dev_change_flags+0x61/0xb0 # for each device in group from dev->vlan_info [ 1.214055] vlan_device_event+0x766/0x7c0 # on netdevsim0 [ 1.214058] notifier_call_chain+0x78/0x120 [ 1.214062] netif_open+0x6d/0x90 [ 1.214064] dev_open+0x5b/0xb0 # locks netdevsim0 [ 1.214066] bond_enslave+0x64c/0x1230 [ 1.214075] do_set_master+0x175/0x1e0 # on netdevsim0 [ 1.214077] do_setlink+0x516/0x13b0 [ 1.214094] rtnl_newlink+0xaba/0xb80 [ 1.214132] rtnetlink_rcv_msg+0x440/0x490 [ 1.214144] netlink_rcv_skb+0xeb/0x120 [ 1.214150] netlink_unicast+0x1f9/0x320 [ 1.214153] netlink_sendmsg+0x346/0x3f0 [ 1.214157] __sock_sendmsg+0x86/0xb0 [ 1.214160] ____sys_sendmsg+0x1c8/0x220 [ 1.214164] ___sys_sendmsg+0x28f/0x2d0 [ 1.214179] __x64_sys_sendmsg+0xef/0x140 [ 1.214184] do_syscall_64+0xec/0x1d0 [ 1.214190] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 1.214191] RIP: 0033:0x7f2d1b4a7e56 Device setup: netdevsim0 (down) ^ ^ bond netdevsim1.100@netdevsim1 allmulticast=on (down) When we enslave the lower device (netdevsim0) which has a vlan, we propagate vlan's allmuti/promisc flags during ndo_open. This causes (re)locking on of the real_dev. Propagate allmulti/promisc on flags change, not on the open. There is a slight semantics change that vlans that are down now propagate the flags, but this seems unlikely to result in the real issues. Reproducer: echo 0 1 > /sys/bus/netdevsim/new_device dev_path=$(ls -d /sys/bus/netdevsim/devices/netdevsim0/net/*) dev=$(echo $dev_path | rev | cut -d/ -f1 | rev) ip link set dev $dev name netdevsim0 ip link set dev netdevsim0 up ip link add link netdevsim0 name netdevsim0.100 type vlan id 100 ip link set dev netdevsim0.100 allmulticast on down ip link add name bond1 type bond mode 802.3ad ip link set dev netdevsim0 down ip link set dev netdevsim0 master bond1 ip link set dev bond1 up ip link show Reported-by: syzbot+b0c03d76056ef6cd12a6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/Z9CfXjLMKn6VLG5d@mini-arch/T/#m15ba130f53227c883e79fb969687d69d670337a0 Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313100657.2287455-1-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20net: phy: Support speed selection for PHY loopbackGerhard Engleder
phy_loopback() leaves it to the PHY driver to select the speed of the loopback mode. Thus, the speed of the loopback mode depends on the PHY driver in use. Add support for speed selection to phy_loopback() to enable loopback with defined speeds. Ensure that link up is signaled if speed changes as speed is not allowed to change during link up. Link down and up is necessary for a new speed. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/20250312203010.47429-3-gerhard@engleder-embedded.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19xsk: fix an integer overflow in xp_create_and_assign_umem()Gavrilov Ilia
Since the i and pool->chunk_size variables are of type 'u32', their product can wrap around and then be cast to 'u64'. This can lead to two different XDP buffers pointing to the same memory area. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 94033cd8e73b ("xsk: Optimize for aligned case") Cc: stable@vger.kernel.org Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru> Link: https://patch.msgid.link/20250313085007.3116044-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>