summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-04-23net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functionsYong Wang
When a bridge port STP state is changed from BLOCKING/DISABLED to FORWARDING, the port's igmp query timer will NOT re-arm itself if the bridge has been configured as per-VLAN multicast snooping. Solve this by choosing the correct multicast context(s) to enable/disable port multicast based on whether per-VLAN multicast snooping is enabled or not, i.e. using per-{port, VLAN} context in case of per-VLAN multicast snooping by re-implementing br_multicast_enable_port() and br_multicast_disable_port() functions. Before the patch, the IGMP query does not happen in the last step of the following test sequence, i.e. no growth for tx counter: # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0 # ip link add name swp1 up master br1 type dummy # bridge link set dev swp1 state 0 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # sleep 1 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # bridge link set dev swp1 state 3 # sleep 2 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 After the patch, the IGMP query happens in the last step of the test: # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0 # ip link add name swp1 up master br1 type dummy # bridge link set dev swp1 state 0 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # sleep 1 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # bridge link set dev swp1 state 3 # sleep 2 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 3 Signed-off-by: Yong Wang <yongwang@nvidia.com> Reviewed-by: Andy Roulin <aroulin@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-04-22xdp: create locked/unlocked instances of xdp redirect target settersJoshua Washington
Commit 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") introduces the netdev lock to xdp_set_features_flag(). The change includes a _locked version of the method, as it is possible for a driver to have already acquired the netdev lock before calling this helper. However, the same applies to xdp_features_(set|clear)_redirect_flags(), which ends up calling the unlocked version of xdp_set_features_flags() leading to deadlocks in GVE, which grabs the netdev lock as part of its suspend, reset, and shutdown processes: [ 833.265543] WARNING: possible recursive locking detected [ 833.270949] 6.15.0-rc1 #6 Tainted: G E [ 833.276271] -------------------------------------------- [ 833.281681] systemd-shutdow/1 is trying to acquire lock: [ 833.287090] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: xdp_set_features_flag+0x29/0x90 [ 833.295470] [ 833.295470] but task is already holding lock: [ 833.301400] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] [ 833.309508] [ 833.309508] other info that might help us debug this: [ 833.316130] Possible unsafe locking scenario: [ 833.316130] [ 833.322142] CPU0 [ 833.324681] ---- [ 833.327220] lock(&dev->lock); [ 833.330455] lock(&dev->lock); [ 833.333689] [ 833.333689] *** DEADLOCK *** [ 833.333689] [ 833.339701] May be due to missing lock nesting notation [ 833.339701] [ 833.346582] 5 locks held by systemd-shutdow/1: [ 833.351205] #0: ffffffffa9c89130 (system_transition_mutex){+.+.}-{4:4}, at: __se_sys_reboot+0xe6/0x210 [ 833.360695] #1: ffff93b399e5c1b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xb4/0x1f0 [ 833.369144] #2: ffff949d19a471b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xc2/0x1f0 [ 833.377603] #3: ffffffffa9eca050 (rtnl_mutex){+.+.}-{4:4}, at: gve_shutdown+0x33/0x90 [gve] [ 833.386138] #4: ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] Introduce xdp_features_(set|clear)_redirect_target_locked() versions which assume that the netdev lock has already been acquired before setting the XDP feature flag and update GVE to use the locked version. Fixes: 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") Tested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250422011643.3509287-1-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: Drop hold_rtnl arg from ops_undo_list().Kuniyuki Iwashima
ops_undo_list() first iterates over ops_list for ->pre_exit(). Let's check if any of the ops has ->exit_rtnl() there and drop the hold_rtnl argument. Note that nexthop uses ->exit_rtnl() and is built-in, so hold_rtnl is always true for setup_net() and cleanup_net() for now. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20250414170148.21f3523c@kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418003259.48017-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22tipc: fix NULL pointer dereference in tipc_mon_reinit_self()Tung Nguyen
syzbot reported: tipc: Node number set to 1055423674 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 3 UID: 0 PID: 6017 Comm: kworker/3:5 Not tainted 6.15.0-rc1-syzkaller-00246-g900241a5cc15 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Workqueue: events tipc_net_finalize_work RIP: 0010:tipc_mon_reinit_self+0x11c/0x210 net/tipc/monitor.c:719 ... RSP: 0018:ffffc9000356fb68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003ee87cba RDX: 0000000000000000 RSI: ffffffff8dbc56a7 RDI: ffff88804c2cc010 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007 R13: fffffbfff2111097 R14: ffff88804ead8000 R15: ffff88804ead9010 FS: 0000000000000000(0000) GS:ffff888097ab9000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f720eb00 CR3: 000000000e182000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tipc_net_finalize+0x10b/0x180 net/tipc/net.c:140 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400 kthread+0x3c2/0x780 kernel/kthread.c:464 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> ... RIP: 0010:tipc_mon_reinit_self+0x11c/0x210 net/tipc/monitor.c:719 ... RSP: 0018:ffffc9000356fb68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003ee87cba RDX: 0000000000000000 RSI: ffffffff8dbc56a7 RDI: ffff88804c2cc010 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007 R13: fffffbfff2111097 R14: ffff88804ead8000 R15: ffff88804ead9010 FS: 0000000000000000(0000) GS:ffff888097ab9000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f720eb00 CR3: 000000000e182000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 There is a racing condition between workqueue created when enabling bearer and another thread created when disabling bearer right after that as follow: enabling_bearer | disabling_bearer --------------- | ---------------- tipc_disc_timeout() | { | bearer_disable() ... | { schedule_work(&tn->work); | tipc_mon_delete() ... | { } | ... | write_lock_bh(&mon->lock); | mon->self = NULL; | write_unlock_bh(&mon->lock); | ... | } tipc_net_finalize_work() | } { | ... | tipc_net_finalize() | { | ... | tipc_mon_reinit_self() | { | ... | write_lock_bh(&mon->lock); | mon->self->addr = tipc_own_addr(net); | write_unlock_bh(&mon->lock); | ... | } | ... | } | ... | } | 'mon->self' is set to NULL in disabling_bearer thread and dereferenced later in enabling_bearer thread. This commit fixes this issue by validating 'mon->self' before assigning node address to it. Reported-by: syzbot+ed60da8d686dc709164c@syzkaller.appspotmail.com Fixes: 46cb01eeeb86 ("tipc: update mon's self addr when node addr generated") Signed-off-by: Tung Nguyen <tung.quang.nguyen@est.tech> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250417074826.578115-1-tung.quang.nguyen@est.tech Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: 802: Remove unused p8022 codeDr. David Alan Gilbert
p8022.c defines two external functions, register_8022_client() and unregister_8022_client(), the last use of which was removed in 2018 by commit 7a2e838d28cf ("staging: ipx: delete it from the tree") Remove the p8022.c file, it's corresponding header, and glue surrounding it. There was one place the header was included in vlan.c but it didn't use the functions it declared. There was a comment in net/802/Makefile about checking against net/core/Makefile, but that's at least 20 years old and there's no sign of net/core/Makefile mentioning it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20250418011519.145320-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: lwtunnel: disable BHs when requiredJustin Iurman
In lwtunnel_{output|xmit}(), dev_xmit_recursion() may be called in preemptible scope for PREEMPT kernels. This patch disables BHs before calling dev_xmit_recursion(). BHs are re-enabled only at the end, since we must ensure the same CPU is used for both dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other recursion levels in some cases) in order to maintain valid per-cpu counters. Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/netdev/CAADnVQJFWn3dBFJtY+ci6oN1pDFL=TzCmNbRgey7MdYxt_AP2g@mail.gmail.com/ Reported-by: Eduard Zingerman <eddyz87@gmail.com> Closes: https://lore.kernel.org/netdev/m2h62qwf34.fsf@gmail.com/ Fixes: 986ffb3a57c5 ("net: lwtunnel: fix recursion loops") Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416160716.8823-1-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22net: selftests: initialize TCP header and skb payload with zeroOleksij Rempel
Zero-initialize TCP header via memset() to avoid garbage values that may affect checksum or behavior during test transmission. Also zero-fill allocated payload and padding regions using memset() after skb_put(), ensuring deterministic content for all outgoing test packets. Fixes: 3e1e58d64c3d ("net: add generic selftest support") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416160125.2914724-1-o.rempel@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22rxrpc: rxgk: Set error code in rxgk_yfs_decode_ticket()Dan Carpenter
Propagate the error code if key_alloc() fails. Don't return success. Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/Z_-P_1iLDWksH1ik@stanley.mountain Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-04-17 We've added 12 non-merge commits during the last 9 day(s) which contain a total of 18 files changed, 1748 insertions(+), 19 deletions(-). The main changes are: 1) bpf qdisc support, from Amery Hung. A qdisc can be implemented in bpf struct_ops programs and can be used the same as other existing qdiscs in the "tc qdisc" command. 2) Add xsk tail adjustment tests, from Tushar Vyavahare. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Test attaching bpf qdisc to mq and non root selftests/bpf: Add a bpf fq qdisc to selftest selftests/bpf: Add a basic fifo qdisc test libbpf: Support creating and destroying qdisc bpf: net_sched: Disable attaching bpf qdisc to non root bpf: net_sched: Support updating bstats bpf: net_sched: Add a qdisc watchdog timer bpf: net_sched: Add basic bpf qdisc kfuncs bpf: net_sched: Support implementation of Qdisc_ops in bpf bpf: Prepare to reuse get_ctx_arg_idx selftests/xsk: Add tail adjustment tests and support check selftests/xsk: Add packet stream replacement function ==================== Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: Use nlmsg_payload in rtnetlink fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-2-9b09d9d7e61d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: Use nlmsg_payload in neighbour fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-1-9b09d9d7e61d@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: fix the missing unlock for detached devicesJakub Kicinski
The combined condition was left as is when we converted from __dev_get_by_index() to netdev_get_by_index_lock(). There was no need to undo anything with the former, for the latter we need an unlock. Fixes: 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations") Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250418015317.1954107-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3Alexei Starovoitov
Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-18net: ethtool: mm: reset verification status when link is downFaizal Rahim
When the link partner goes down, "ethtool --show-mm" still displays "Verification status: SUCCEEDED," reflecting a previous state that is no longer valid. Reset the verification status to ensure it reflects the current state. Reviewed-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18net: ethtool: mm: extract stmmac verification logic into common libraryVladimir Oltean
It appears that stmmac is not the only hardware which requires a software-driven verification state machine for the MAC Merge layer. While on the one hand it's good to encourage hardware implementations, on the other hand it's quite difficult to tolerate multiple drivers implementing independently fairly non-trivial logic. Extract the hardware-independent logic from stmmac into library code and put it in ethtool. Name the state structure "mmsv" for MAC Merge Software Verification. Let this expose an operations structure for executing the hardware stuff: sync hardware with the tx_active boolean (result of verification process), enable/disable the pMAC, send mPackets, notify library of external events (reception of mPackets), as well as link state changes. Note that it is assumed that the external events are received in hardirq context. If they are not, it is probably a good idea to disable hardirqs when calling ethtool_mmsv_event_handle(), because the library does not do so. Also, the MM software verification process has no business with the tx_min_frag_size, that is all the driver's to handle. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Co-developed-by: Choong Yong Liang <yong.liang.choong@linux.intel.com> Signed-off-by: Choong Yong Liang <yong.liang.choong@linux.intel.com> Tested-by: Choong Yong Liang <yong.liang.choong@linux.intel.com> Tested-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-17netdev: fix the locking for netdev notificationsJakub Kicinski
Kuniyuki reports that the assert for netdev lock fires when there are netdev event listeners (otherwise we skip the netlink event generation). Correct the locking when coming from the notifier. The NETDEV_XDP_FEAT_CHANGE notifier is already fully locked, it's the documentation that's incorrect. Fixes: 99e44f39a8f7 ("netdev: depend on netdev->lock for xdp features") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/20250410171019.62128-1-kuniyu@amazon.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250416030447.1077551-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py 4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place") 7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17Merge tag 'net-6.15-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Bluetooth, CAN and Netfilter. Current release - regressions: - two fixes for the netdev per-instance locking - batman-adv: fix double-hold of meshif when getting enabled Current release - new code bugs: - Bluetooth: increment TX timestamping tskey always for stream sockets - wifi: static analysis and build fixes for the new Intel sub-driver Previous releases - regressions: - net: fib_rules: fix iif / oif matching on L3 master (VRF) device - ipv6: add exception routes to GC list in rt6_insert_exception() - netfilter: conntrack: fix erroneous removal of offload bit - Bluetooth: - fix sending MGMT_EV_DEVICE_FOUND for invalid address - l2cap: process valid commands in too long frame - btnxpuart: Revert baudrate change in nxp_shutdown Previous releases - always broken: - ethtool: fix memory corruption during SFP FW flashing - eth: - hibmcge: fixes for link and MTU handling, pause frames etc - igc: fixes for PTM (PCIe timestamping) - dsa: b53: enable BPDU reception for management port Misc: - fixes for Netlink protocol schemas" * tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps net: ethernet: mtk_eth_soc: reapply mdc divider on reset net: ti: icss-iep: Fix possible NULL pointer dereference for perout request net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame() net: ti: icssg-prueth: Fix kernel warning while bringing down network interface netfilter: conntrack: fix erronous removal of offload bit net: don't try to ops lock uninitialized devs ptp: ocp: fix start time alignment in ptp_ocp_signal_set net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails net: dsa: free routing table on probe failure net: dsa: clean up FDB, MDB, VLAN entries on unbind net: dsa: mv88e6xxx: fix -ENOENT when deleting VLANs and MST is unsupported net: dsa: mv88e6xxx: avoid unregistering devlink regions which were never registered net: txgbe: fix memory leak in txgbe_probe() error path net: bridge: switchdev: do not notify new brentries as changed net: b53: enable BPDU reception for management port netlink: specs: rt-neigh: prefix struct nfmsg members with ndm netlink: specs: rt-link: adjust mctp attribute naming netlink: specs: rtnetlink: attribute naming corrections ...
2025-04-17bpf: net_sched: Disable attaching bpf qdisc to non rootAmery Hung
Do not allow users to attach bpf qdiscs to classful qdiscs. This is to prevent accidentally breaking existings classful qdiscs if they rely on some data in the child qdisc. This restriction can potentially be lifted in the future. Note that, we still allow bpf qdisc to be attached to mq. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-7-ameryhung@gmail.com
2025-04-17bpf: net_sched: Support updating bstatsAmery Hung
Add a kfunc to update Qdisc bstats when an skb is dequeued. The kfunc is only available in .dequeue programs. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-6-ameryhung@gmail.com
2025-04-17bpf: net_sched: Add a qdisc watchdog timerAmery Hung
Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule the execution of qdisc through kfunc, bpf_qdisc_schedule(). It can be useful for building traffic shaping scheduling algorithm, where the time the next packet will be dequeued is known. The implementation relies on struct_ops gen_prologue/epilogue to patch bpf programs provided by users. Operator specific prologue/epilogue kfuncs are introduced instead of watchdog kfuncs so that it is easier to extend prologue/epilogue in the future (writing C vs BPF bytecode). Signed-off-by: Amery Hung <amery.hung@bytedance.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-5-ameryhung@gmail.com
2025-04-17bpf: net_sched: Add basic bpf qdisc kfuncsAmery Hung
Add basic kfuncs for working on skb in qdisc. Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release a reference to an skb. However, bpf_qdisc_skb_drop() can only be called in .enqueue where a to_free skb list is available from kernel to defer the release. bpf_kfree_skb() should be used elsewhere. It is also used in bpf_obj_free_fields() when cleaning up skb in maps and collections. bpf_skb_get_hash() returns the flow hash of an skb, which can be used to build flow-based queueing algorithms. Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb(). Signed-off-by: Amery Hung <amery.hung@bytedance.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-4-ameryhung@gmail.com
2025-04-17bpf: net_sched: Support implementation of Qdisc_ops in bpfAmery Hung
The recent advancement in bpf such as allocated objects, bpf list and bpf rbtree has provided powerful and flexible building blocks to realize sophisticated packet scheduling algorithms. As struct_ops now supports core operators in Qdisc_ops, start allowing qdisc to be implemented using bpf struct_ops with this patch. Users can implement Qdisc_ops.{enqueue, dequeue, init, reset, destroy} in bpf and register the qdisc dynamically into the kernel. Co-developed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Amery Hung <amery.hung@bytedance.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-3-ameryhung@gmail.com
2025-04-17Merge tag 'nf-25-04-17' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fix for net The following batch contains one Netfilter fix for net: 1) conntrack offload bit is erroneously unset in a race scenario, from Florian Westphal. netfilter pull request 25-04-17 * tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: conntrack: fix erronous removal of offload bit ==================== Link: https://patch.msgid.link/20250417102847.16640-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17Merge tag 'for-net-2025-04-16' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - l2cap: Process valid commands in too long frame - vhci: Avoid needless snprintf() calls * tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: vhci: Avoid needless snprintf() calls Bluetooth: l2cap: Process valid commands in too long frame ==================== Link: https://patch.msgid.link/20250416210126.2034212-1-luiz.dentz@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17net: pktgen: fix code style (WARNING: Prefer strscpy over strcpy)Peter Seiderer
Fix checkpatch code style warnings: WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1423: FILE: net/core/pktgen.c:1423: + strcpy(pkt_dev->dst_min, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1444: FILE: net/core/pktgen.c:1444: + strcpy(pkt_dev->dst_max, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1554: FILE: net/core/pktgen.c:1554: + strcpy(pkt_dev->src_min, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #1575: FILE: net/core/pktgen.c:1575: + strcpy(pkt_dev->src_max, buf); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3231: FILE: net/core/pktgen.c:3231: + strcpy(pkt_dev->result, "Starting"); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3235: FILE: net/core/pktgen.c:3235: + strcpy(pkt_dev->result, "Error starting"); WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88 #3849: FILE: net/core/pktgen.c:3849: + strcpy(pkt_dev->odevname, ifname); While at it squash memset/strcpy pattern into single strscpy_pad call. Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-4-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17net: pktgen: fix code style (WARNING: please, no space before tabs)Peter Seiderer
Fix checkpatch code style warnings: WARNING: please, no space before tabs #230: FILE: net/core/pktgen.c:230: +#define M_NETIF_RECEIVE ^I1^I/* Inject packets into stack */$ Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-3-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17net: pktgen: fix code style (ERROR: else should follow close brace '}')Peter Seiderer
Fix checkpatch code style errors: ERROR: else should follow close brace '}' #1317: FILE: net/core/pktgen.c:1317: + } + else And checkpatch follow up code style check: CHECK: Unbalanced braces around else statement #1316: FILE: net/core/pktgen.c:1316: + } else Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415112916.113455-2-ps.report@gmx.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17net: ipv6: ioam6: fix double reallocationJustin Iurman
If the dst_entry is the same post transformation (which is a valid use case for IOAM), we don't add it to the cache to avoid a reference loop. Instead, we use a "fake" dst_entry and add it to the cache as a signal. When we read the cache, we compare it with our "fake" dst_entry and therefore detect if we're in the special case. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250415112554.23823-3-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17net: ipv6: ioam6: use consistent dst namesJustin Iurman
Be consistent and use the same terminology as other lwt users: orig_dst is the dst_entry before the transformation, while dst is either the dst_entry in the cache or the dst_entry after the transformation Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250415112554.23823-2-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17skb: implement skb_send_sock_locked_with_flags()Antonio Quartulli
When sending an skb over a socket using skb_send_sock_locked(), it is currently not possible to specify any flag to be set in msghdr->msg_flags. However, we may want to pass flags the user may have specified, like MSG_NOSIGNAL. Extend __skb_send_sock() with a new argument 'flags' and add a new interface named skb_send_sock_locked_with_flags(). Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-12-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17ovpn: implement TCP transportAntonio Quartulli
With this change ovpn is allowed to communicate to peers also via TCP. Parsing of incoming messages is implemented through the strparser API. Note that ovpn redefines sk_prot and sk_socket->ops for the TCP socket used to communicate with the peer. For this reason it needs to access inet6_stream_ops, which is declared as extern in the IPv6 module, but it is not fully exported. Therefore this patch is also adding EXPORT_SYMBOL_GPL(inet6_stream_ops) to net/ipv6/af_inet6.c. Cc: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-11-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17netfilter: conntrack: fix erronous removal of offload bitFlorian Westphal
The blamed commit exposes a possible issue with flow_offload_teardown(): We might remove the offload bit of a conntrack entry that has been offloaded again. 1. conntrack entry c1 is offloaded via flow f1 (f1->ct == c1). 2. f1 times out and is pushed back to slowpath, c1 offload bit is removed. Due to bug, f1 is not unlinked from rhashtable right away. 3. a new packet arrives for the flow and re-offload is triggered, i.e. f2->ct == c1. This is because lookup in flowtable skip entries with teardown bit set. 4. Next flowtable gc cycle finds f1 again 5. flow_offload_teardown() is called again for f1 and c1 offload bit is removed again, even though we have f2 referencing the same entry. This is harmless, but clearly not correct. Fix the bug that exposes this: set 'teardown = true' to have the gc callback unlink the flowtable entry from the table right away instead of the unintentional defer to the next round. Also prevent flow_offload_teardown() from fixing up the ct state more than once: We could also be called from the data path or a notifier, not only from the flowtable gc callback. NF_FLOW_TEARDOWN can never be unset, so we can use it as synchronization point: if we observe did not see a 0 -> 1 transition, then another CPU is already doing the ct state fixups for us. Fixes: 03428ca5cee9 ("netfilter: conntrack: rework offload nf_conn timeout extension logic") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-17xfrm: Fix UDP GRO handling for some corner casesTobias Brunner
This fixes an issue that's caused if there is a mismatch between the data offset in the GRO header and the length fields in the regular sk_buff due to the pskb_pull()/skb_push() calls. That's because the UDP GRO layer stripped off the UDP header via skb_gro_pull() already while the UDP header was explicitly not pulled/pushed in this function. For example, an IKE packet that triggered this had len=data_len=1268 and the data_offset in the GRO header was 28 (IPv4 + UDP). So pskb_pull() was called with an offset of 28-8=20, which reduced len to 1248 and via pskb_may_pull() and __pskb_pull_tail() it also set data_len to 1248. As the ESP offload module was not loaded, the function bailed out and called skb_push(), which restored len to 1268, however, data_len remained at 1248. So while skb_headlen() was 0 before, it was now 20. The latter caused a difference of 8 instead of 28 (or 0 if pskb_pull()/skb_push() was called with the complete GRO data_offset) in gro_try_pull_from_frag0() that triggered a call to gro_pull_from_frag0() that corrupted the packet. This change uses a more GRO-like approach seen in other GRO receivers via skb_gro_header() to just read the actual data we are interested in and does not try to "restore" the UDP header at this point to call the existing function. If the offload module is not loaded, it immediately bails out, otherwise, it only does a quick check to see if the packet is an IKE or keepalive packet instead of calling the existing function. Fixes: 172bf009c18d ("xfrm: Support GRO for IPv4 ESP in UDP encapsulation") Fixes: 221ddb723d90 ("xfrm: Support GRO for IPv6 ESP in UDP encapsulation") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17xfrm: Refactor migration setup during the cloning processChiachang Wang
Previously, migration related setup, such as updating family, destination address, and source address, was performed after the clone was created in `xfrm_state_migrate`. This change moves this setup into the cloning function itself, improving code locality and reducing redundancy. The `xfrm_state_clone_and_setup` function now conditionally applies the migration parameters from struct xfrm_migrate if it is provided. This allows the function to be used both for simple cloning and for cloning with migration setup. Test: Tested with kernel test in the Android tree located in https://android.googlesource.com/kernel/tests/ The xfrm_tunnel_test.py under the tests folder in particular. Signed-off-by: Chiachang Wang <chiachangwang@google.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17xfrm: Migrate offload configurationChiachang Wang
Add hardware offload configuration to XFRM_MSG_MIGRATE using an option netlink attribute XFRMA_OFFLOAD_DEV. In the existing xfrm_state_migrate(), the xfrm_init_state() is called assuming no hardware offload by default. Even the original xfrm_state is configured with offload, the setting will be reset. If the device is configured with hardware offload, it's reasonable to allow the device to maintain its hardware offload mode. But the device will end up with offload disabled after receiving a migration event when the device migrates the connection from one netdev to another one. The devices that support migration may work with different underlying networks, such as mobile devices. The hardware setting should be forwarded to the different netdev based on the migration configuration. This change provides the capability for user space to migrate from one netdev to another. Test: Tested with kernel test in the Android tree located in https://android.googlesource.com/kernel/tests/ The xfrm_tunnel_test.py under the tests folder in particular. Signed-off-by: Chiachang Wang <chiachangwang@google.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17net, pidfd: report EINVAL for ESRCHChristian Brauner
dbus-broker relies -EINVAL being returned to indicate ESRCH in [1]. This causes issues for some workloads as reported in [2]. Paper over it until this is fixed in userspace. Link: https://lore.kernel.org/20250416-gegriffen-tiefbau-70cfecb80ac8@brauner Link: https://github.com/bus1/dbus-broker/blob/5d34d91b138fc802a016aa68c093eb81ea31139c/src/util/sockopt.c#L241 [1] Link: https://lore.kernel.org/20250415223454.GA1852104@ax162 [2] Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-16Merge branch '10GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ixgbe: Add basic devlink support Jedrzej Jagielski says: Create devlink specific directory for more convenient future feature development. Flashing and reloading are supported only by E610 devices. Introduce basic FW/NVM validation since devlink reload introduces possibility of runtime NVM update. Check FW API version, FW recovery mode and FW rollback mode. Introduce minimal recovery probe to let user to reload the faulty FW when recovery mode is detected. * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ixgbe: add support for FW rollback mode ixgbe: add E610 implementation of FW recovery mode ixgbe: add FW API version check ixgbe: add support for devlink reload ixgbe: add device flash update via devlink ixgbe: extend .info_get() with stored versions ixgbe: add E610 functions getting PBA and FW ver info ixgbe: add .info_get extension specific for E610 devices ixgbe: read the netlist version information ixgbe: read the OROM version information ixgbe: add E610 functions for acquiring flash data ixgbe: add handler for devlink .info_get() ixgbe: add initial devlink support ixgbe: wrap netdev_priv() usage devlink: add value check to devlink_info_version_put() ==================== Link: https://patch.msgid.link/20250415221301.1633933-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv4: Use nlmsg_payload in ipmr fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-7-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv4: Use nlmsg_payload in route fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-6-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv4: Use nlmsg_payload in fib_frontend fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-5-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv4: Use nlmsg_payload in devinet fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-4-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv6: Use nlmsg_payload in route fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-3-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv6: Use nlmsg_payload in addrconf fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-2-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv6: Use nlmsg_payload in addrlabel fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. This changes function ip6addrlbl_valid_get_req() and ip6addrlbl_valid_dump_req(). Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-1-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: don't try to ops lock uninitialized devsJakub Kicinski
We need to be careful when operating on dev while in rtnl_create_link(). Some devices (vxlan) initialize netdev_ops in ->newlink, so later on. Avoid using netdev_lock_ops(), the device isn't registered so we cannot legally call its ops or generate any notifications for it. netdev_ops_assert_locked_or_invisible() is safe to use, it checks registration status first. Reported-by: syzbot+de1c7d68a10e3f123bdd@syzkaller.appspotmail.com Fixes: 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415151552.768373-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() failsVladimir Oltean
This is very similar to the problem and solution from commit 232deb3f9567 ("net: dsa: avoid refcount warnings when ->port_{fdb,mdb}_del returns error"), except for the dsa_port_do_tag_8021q_vlan_del() operation. Fixes: c64b9c05045a ("net: dsa: tag_8021q: add proper cross-chip notifier support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250414213020.2959021-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: dsa: free routing table on probe failureVladimir Oltean
If complete = true in dsa_tree_setup(), it means that we are the last switch of the tree which is successfully probing, and we should be setting up all switches from our probe path. After "complete" becomes true, dsa_tree_setup_cpu_ports() or any subsequent function may fail. If that happens, the entire tree setup is in limbo: the first N-1 switches have successfully finished probing (doing nothing but having allocated persistent memory in the tree's dst->ports, and maybe dst->rtable), and switch N failed to probe, ending the tree setup process before anything is tangible from the user's PoV. If switch N fails to probe, its memory (ports) will be freed and removed from dst->ports. However, the dst->rtable elements pointing to its ports, as created by dsa_link_touch(), will remain there, and will lead to use-after-free if dereferenced. If dsa_tree_setup_switches() returns -EPROBE_DEFER, which is entirely possible because that is where ds->ops->setup() is, we get a kasan report like this: ================================================================== BUG: KASAN: slab-use-after-free in mv88e6xxx_setup_upstream_port+0x240/0x568 Read of size 8 at addr ffff000004f56020 by task kworker/u8:3/42 Call trace: __asan_report_load8_noabort+0x20/0x30 mv88e6xxx_setup_upstream_port+0x240/0x568 mv88e6xxx_setup+0xebc/0x1eb0 dsa_register_switch+0x1af4/0x2ae0 mv88e6xxx_register_switch+0x1b8/0x2a8 mv88e6xxx_probe+0xc4c/0xf60 mdio_probe+0x78/0xb8 really_probe+0x2b8/0x5a8 __driver_probe_device+0x164/0x298 driver_probe_device+0x78/0x258 __device_attach_driver+0x274/0x350 Allocated by task 42: __kasan_kmalloc+0x84/0xa0 __kmalloc_cache_noprof+0x298/0x490 dsa_switch_touch_ports+0x174/0x3d8 dsa_register_switch+0x800/0x2ae0 mv88e6xxx_register_switch+0x1b8/0x2a8 mv88e6xxx_probe+0xc4c/0xf60 mdio_probe+0x78/0xb8 really_probe+0x2b8/0x5a8 __driver_probe_device+0x164/0x298 driver_probe_device+0x78/0x258 __device_attach_driver+0x274/0x350 Freed by task 42: __kasan_slab_free+0x48/0x68 kfree+0x138/0x418 dsa_register_switch+0x2694/0x2ae0 mv88e6xxx_register_switch+0x1b8/0x2a8 mv88e6xxx_probe+0xc4c/0xf60 mdio_probe+0x78/0xb8 really_probe+0x2b8/0x5a8 __driver_probe_device+0x164/0x298 driver_probe_device+0x78/0x258 __device_attach_driver+0x274/0x350 The simplest way to fix the bug is to delete the routing table in its entirety. dsa_tree_setup_routing_table() has no problem in regenerating it even if we deleted links between ports other than those of switch N, because dsa_link_touch() first checks whether the port pair already exists in dst->rtable, allocating if not. The deletion of the routing table in its entirety already exists in dsa_tree_teardown(), so refactor that into a function that can also be called from the tree setup error path. In my analysis of the commit to blame, it is the one which added dsa_link elements to dst->rtable. Prior to that, each switch had its own ds->rtable which is freed when the switch fails to probe. But the tree is potentially persistent memory. Fixes: c5f51765a1f6 ("net: dsa: list DSA links in the fabric") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250414213001.2957964-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: dsa: clean up FDB, MDB, VLAN entries on unbindVladimir Oltean
As explained in many places such as commit b117e1e8a86d ("net: dsa: delete dsa_legacy_fdb_add and dsa_legacy_fdb_del"), DSA is written given the assumption that higher layers have balanced additions/deletions. As such, it only makes sense to be extremely vocal when those assumptions are violated and the driver unbinds with entries still present. But Ido Schimmel points out a very simple situation where that is wrong: https://lore.kernel.org/netdev/ZDazSM5UsPPjQuKr@shredder/ (also briefly discussed by me in the aforementioned commit). Basically, while the bridge bypass operations are not something that DSA explicitly documents, and for the majority of DSA drivers this API simply causes them to go to promiscuous mode, that isn't the case for all drivers. Some have the necessary requirements for bridge bypass operations to do something useful - see dsa_switch_supports_uc_filtering(). Although in tools/testing/selftests/net/forwarding/local_termination.sh, we made an effort to popularize better mechanisms to manage address filters on DSA interfaces from user space - namely macvlan for unicast, and setsockopt(IP_ADD_MEMBERSHIP) - through mtools - for multicast, the fact is that 'bridge fdb add ... self static local' also exists as kernel UAPI, and might be useful to someone, even if only for a quick hack. It seems counter-productive to block that path by implementing shim .ndo_fdb_add and .ndo_fdb_del operations which just return -EOPNOTSUPP in order to prevent the ndo_dflt_fdb_add() and ndo_dflt_fdb_del() from running, although we could do that. Accepting that cleanup is necessary seems to be the only option. Especially since we appear to be coming back at this from a different angle as well. Russell King is noticing that the WARN_ON() triggers even for VLANs: https://lore.kernel.org/netdev/Z_li8Bj8bD4-BYKQ@shell.armlinux.org.uk/ What happens in the bug report above is that dsa_port_do_vlan_del() fails, then the VLAN entry lingers on, and then we warn on unbind and leak it. This is not a straight revert of the blamed commit, but we now add an informational print to the kernel log (to still have a way to see that bugs exist), and some extra comments gathered from past years' experience, to justify the logic. Fixes: 0832cd9f1f02 ("net: dsa: warn if port lists aren't empty in dsa_port_teardown") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250414212930.2956310-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: bridge: switchdev: do not notify new brentries as changedJonas Gorski
When adding a bridge vlan that is pvid or untagged after the vlan has already been added to any other switchdev backed port, the vlan change will be propagated as changed, since the flags change. This causes the vlan to not be added to the hardware for DSA switches, since the DSA handler ignores any vlans for the CPU or DSA ports that are changed. E.g. the following order of operations would work: $ ip link add swbridge type bridge vlan_filtering 1 vlan_default_pvid 0 $ ip link set lan1 master swbridge $ bridge vlan add dev swbridge vid 1 pvid untagged self $ bridge vlan add dev lan1 vid 1 pvid untagged but this order would break: $ ip link add swbridge type bridge vlan_filtering 1 vlan_default_pvid 0 $ ip link set lan1 master swbridge $ bridge vlan add dev lan1 vid 1 pvid untagged $ bridge vlan add dev swbridge vid 1 pvid untagged self Additionally, the vlan on the bridge itself would become undeletable: $ bridge vlan port vlan-id lan1 1 PVID Egress Untagged swbridge 1 PVID Egress Untagged $ bridge vlan del dev swbridge vid 1 self $ bridge vlan port vlan-id lan1 1 PVID Egress Untagged swbridge 1 Egress Untagged since the vlan was never added to DSA's vlan list, so deleting it will cause an error, causing the bridge code to not remove it. Fix this by checking if flags changed only for vlans that are already brentry and pass changed as false for those that become brentries, as these are a new vlan (member) from the switchdev point of view. Since *changed is set to true for becomes_brentry = true regardless of would_change's value, this will not change any rtnetlink notification delivery, just the value passed on to switchdev in vlan->changed. Fixes: 8d23a54f5bee ("net: bridge: switchdev: differentiate new VLANs from changed ones") Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250414200020.192715-1-jonas.gorski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>