summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-02-24net-sysfs: restore behavior for not running devicesEric Dumazet
modprobe dummy dumdummies=1 Old behavior : $ cat /sys/class/net/dummy0/carrier cat: /sys/class/net/dummy0/carrier: Invalid argument After blamed commit, an empty string is reported. $ cat /sys/class/net/dummy0/carrier $ In this commit, I restore the old behavior for carrier, speed and duplex attributes. Fixes: 79c61899b5ee ("net-sysfs: remove rtnl_trylock from device attributes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Marco Leogrande <leogrande@google.com> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250221051223.576726-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net: ethtool: fix ioctl confusing drivers about desired HDS user configJakub Kicinski
The legacy ioctl path does not have support for extended attributes. So we issue a GET to fetch the current settings from the driver, in an attempt to keep them unchanged. HDS is a bit "special" as the GET only returns on/off while the SET takes a "ternary" argument (on/off/default). If the driver was in the "default" setting - executing the ioctl path binds it to on or off, even tho the user did not intend to change HDS config. Factor the relevant logic out of the netlink code and reuse it. Fixes: 87c8f8496a05 ("bnxt_en: add support for tcp-data-split ethtool command") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Tested-by: Daniel Xu <dxu@dxuuu.xyz> Tested-by: Taehee Yoo <ap420073@gmail.com> Link: https://patch.msgid.link/20250221025141.1132944-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-23batman-adv: add missing newlines for log macrosSven Eckelmann
Missing trailing newlines can lead to incomplete log lines that do not appear properly in dmesg or in console output. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Limit aggregation size to outgoing MTUSven Eckelmann
If a B.A.T.M.A.N. IV aggregated OGM was prepared, it was always assumed that 512 bytes (BATADV_MAX_AGGREGATION_BYTES) can be transmitted. But the outgoing MTU might be too small for these 512 bytes and the aggregation size must be adjusted in this case. Otherwise, the aggregates will cause unnecessary packet loss. For now, the non-aggregated packet length is not touched. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Use actual packet count for aggregated packetsSven Eckelmann
The batadv_forw_packet->num_packets didn't store the number of packets but the the number of packets - 1. This didn't had any effects on the actual handling of aggregates but can easily be a source of confusion when reading the code. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Switch to bitmap helper for aggregation handlingSven Eckelmann
The aggregation code duplicates code which already exists in the the bitops and bitmap helper. By switching to the bitmap helpers, operating on larger aggregations becomes possible without touching the different portions of the code which read/modify direct_link_flags. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Limit number of aggregated packets directlySven Eckelmann
The currently selected size in BATADV_MAX_AGGREGATION_BYTES (512) is chosen such that the number of possible aggregated packets is lower than 32. This number must be limited so that the type of batadv_forw_packet->direct_link_flags has enough bits to represent each packet (with the size of at least 24 bytes). This requirement is better implemented in code instead of having it inside a comment. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Use consistent name for mesh interfaceSven Eckelmann
The way how the virtual interface is called inside the batman-adv source code is not consistent. The genl headers call it meshif and the rest of the code calls is (mostly) softif. The genl definitions cannot be touched because they are part of the UAPI. But the rest of the batman-adv code can be touched to have a consistent name again. The bulk of the renaming was done using sed -i -e 's/soft\(-\|\_\| \|\)i\([nf]\)/mesh\1i\2/g' \ -e 's/SOFT\(-\|\_\| \|\)I\([NF]\)/MESH\1I\2/g' and then it was adjusted slightly when proofreading the changes. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Add support for jumbo framesSven Eckelmann
Since batman-adv is not actually depending on hardware capabilities, it has no limit on the MTU. Only the lower hard interfaces can limit it. In case these have an high enough MTU or fragmentation is enabled, a higher MTU than 1500 can be enabled. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: adopt netdev_hold() / netdev_put()Eric Dumazet
Add a device tracker to struct batadv_hard_iface to help debugging of network device refcount imbalances. Signed-off-by: Eric Dumazet <edumazet@google.com> [sven@narfation.org: fix kernel-doc, adopt for softif reference] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Drop batadv_priv_debug_log structSven Eckelmann
The support for the batman-adv ring buffer for debug logs was dropped with the removal of the debugfs filesystem. The structure storing this ring buffer is therefore no longer needed since commit aff6f5a68b92 ("batman-adv: Drop deprecated debugfs support") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-22batman-adv: Start new development cycleSimon Wunderlich
This version will contain all the (major or even only minor) changes for Linux 6.15. The version number isn't a semantic version number with major and minor information. It is just encoding the year of the expected publishing as Linux -rc1 and the number of published versions this year (starting at 0). Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-02-21net: set the minimum for net_hotdata.netdev_budget_usecsJiri Slaby (SUSE)
Commit 7acf8a1e8a28 ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning") added a possibility to set net_hotdata.netdev_budget_usecs, but added no lower bound checking. Commit a4837980fd9f ("net: revert default NAPI poll timeout to 2 jiffies") made the *initial* value HZ-dependent, so the initial value is at least 2 jiffies even for lower HZ values (2 ms for 1000 Hz, 8ms for 250 Hz, 20 ms for 100 Hz). But a user still can set improper values by a sysctl. Set .extra1 (the lower bound) for net_hotdata.netdev_budget_usecs to the same value as in the latter commit. That is to 2 jiffies. Fixes: a4837980fd9f ("net: revert default NAPI poll timeout to 2 jiffies") Fixes: 7acf8a1e8a28 ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning") Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Link: https://patch.msgid.link/20250220110752.137639-1-jirislaby@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: fib_rules: Enable DSCP mask usageIdo Schimmel
Allow user space to configure FIB rules that match on DSCP with a mask, now that support has been added to the IPv4 and IPv6 address families. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21ipv6: fib_rules: Add DSCP mask matchingIdo Schimmel
Extend IPv6 FIB rules to match on DSCP using a mask. Unlike IPv4, also initialize the DSCP mask when a non-zero 'tos' is specified as there is no difference in matching between 'tos' and 'dscp'. As a side effect, this makes it possible to match on 'dscp 0', like in IPv4. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21ipv4: fib_rules: Add DSCP mask matchingIdo Schimmel
Extend IPv4 FIB rules to match on DSCP using a mask. The mask is only set in rules that match on DSCP (not TOS) and initialized to cover the entire DSCP field if the mask attribute is not specified. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: fib_rules: Add DSCP mask attributeIdo Schimmel
Add an attribute that allows matching on DSCP with a mask. Matching on DSCP with a mask is needed in deployments where users encode path information into certain bits of the DSCP field. Temporarily set the type of the attribute to 'NLA_REJECT' while support is being added. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: better track kernel sockets lifetimeEric Dumazet
While kernel sockets are dismantled during pernet_operations->exit(), their freeing can be delayed by any tx packets still held in qdisc or device queues, due to skb_set_owner_w() prior calls. This then trigger the following warning from ref_tracker_dir_exit() [1] To fix this, make sure that kernel sockets own a reference on net->passive. Add sk_net_refcnt_upgrade() helper, used whenever a kernel socket is converted to a refcounted one. [1] [ 136.263918][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at [ 136.263918][ T35] sk_alloc+0x2b3/0x370 [ 136.263918][ T35] inet6_create+0x6ce/0x10f0 [ 136.263918][ T35] __sock_create+0x4c0/0xa30 [ 136.263918][ T35] inet_ctl_sock_create+0xc2/0x250 [ 136.263918][ T35] igmp6_net_init+0x39/0x390 [ 136.263918][ T35] ops_init+0x31e/0x590 [ 136.263918][ T35] setup_net+0x287/0x9e0 [ 136.263918][ T35] copy_net_ns+0x33f/0x570 [ 136.263918][ T35] create_new_namespaces+0x425/0x7b0 [ 136.263918][ T35] unshare_nsproxy_namespaces+0x124/0x180 [ 136.263918][ T35] ksys_unshare+0x57d/0xa70 [ 136.263918][ T35] __x64_sys_unshare+0x38/0x40 [ 136.263918][ T35] do_syscall_64+0xf3/0x230 [ 136.263918][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 136.263918][ T35] [ 136.343488][ T35] ref_tracker: net notrefcnt@ffff8880638f01e0 has 1/2 users at [ 136.343488][ T35] sk_alloc+0x2b3/0x370 [ 136.343488][ T35] inet6_create+0x6ce/0x10f0 [ 136.343488][ T35] __sock_create+0x4c0/0xa30 [ 136.343488][ T35] inet_ctl_sock_create+0xc2/0x250 [ 136.343488][ T35] ndisc_net_init+0xa7/0x2b0 [ 136.343488][ T35] ops_init+0x31e/0x590 [ 136.343488][ T35] setup_net+0x287/0x9e0 [ 136.343488][ T35] copy_net_ns+0x33f/0x570 [ 136.343488][ T35] create_new_namespaces+0x425/0x7b0 [ 136.343488][ T35] unshare_nsproxy_namespaces+0x124/0x180 [ 136.343488][ T35] ksys_unshare+0x57d/0xa70 [ 136.343488][ T35] __x64_sys_unshare+0x38/0x40 [ 136.343488][ T35] do_syscall_64+0xf3/0x230 [ 136.343488][ T35] entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 0cafd77dcd03 ("net: add a refcount tracker for kernel sockets") Reported-by: syzbot+30a19e01a97420719891@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/67b72aeb.050a0220.14d86d.0283.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250220131854.4048077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-02-20 We've added 19 non-merge commits during the last 8 day(s) which contain a total of 35 files changed, 1126 insertions(+), 53 deletions(-). The main changes are: 1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing 2) Add network TX timestamping support to BPF sock_ops, from Jason Xing 3) Add TX metadata Launch Time support, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: igc: Add launch time support to XDP ZC igc: Refactor empty frame insertion for launch time support net: stmmac: Add launch time support to XDP ZC selftests/bpf: Add launch time request to xdp_hw_metadata xsk: Add launch time hardware offload support to XDP Tx metadata selftests/bpf: Add simple bpf tests in the tx path for timestamping feature bpf: Support selective sampling for bpf timestamping bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING bpf: Disable unsafe helpers in TX timestamping callbacks bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping bpf: Add networking timestamping support to bpf_get/setsockopt() selftests/bpf: Add rto max for bpf_setsockopt test bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt ==================== Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21Merge tag 'for-net-2025-02-21' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btusb: Always allow SCO packets for user channel - L2CAP: Fix L2CAP_ECRED_CONN_RSP response * tag 'for-net-2025-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: L2CAP: Fix L2CAP_ECRED_CONN_RSP response Bluetooth: Always allow SCO packets for user channel ==================== Link: https://patch.msgid.link/20250221154941.2139043-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net/rds: Replace deprecated strncpy() with strscpy_pad()Thorsten Blum
strncpy() is deprecated for NUL-terminated destination buffers. Use strscpy_pad() instead and remove the manual NUL-termination. Compile-tested only. Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Tested-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250219224730.73093-2-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Create link directly in target net namespaceXiao Liang
Make rtnl_newlink_create() create device in target namespace directly. Avoid extra netns change when link netns is provided. Device drivers has been converted to be aware of link netns, that is not assuming device netns is and link netns is the same when ops->newlink() is called. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-12-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Remove "net" from newlink paramsXiao Liang
Now that devices have been converted to use the specific netns instead of ambiguous "net", let's remove it from newlink parameters. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-11-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: xfrm: Use link netns in newlink() of rtnl_link_opsXiao Liang
When link_net is set, use it as link netns instead of dev_net(). This prepares for rtnetlink core to create device in target netns directly, in which case the two namespaces may be different. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-10-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ipv6: Use link netns in newlink() of rtnl_link_opsXiao Liang
When link_net is set, use it as link netns instead of dev_net(). This prepares for rtnetlink core to create device in target netns directly, in which case the two namespaces may be different. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-9-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ipv6: Init tunnel link-netns before registering devXiao Liang
Currently some IPv6 tunnel drivers set tnl->net to dev_net(dev) in ndo_init(), which is called in register_netdevice(). However, it lacks the context of link-netns when we enable cross-net tunnels at device registration time. Let's move the init of tunnel link-netns before register_netdevice(). ip6_gre has already initialized netns, so just remove the redundant assignment. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-8-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ip_tunnel: Use link netns in newlink() of rtnl_link_opsXiao Liang
When link_net is set, use it as link netns instead of dev_net(). This prepares for rtnetlink core to create device in target netns directly, in which case the two namespaces may be different. Convert common ip_tunnel_newlink() to accept an extra link netns argument. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-7-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ip_tunnel: Don't set tunnel->net in ip_tunnel_init()Xiao Liang
ip_tunnel_init() is called from register_netdevice(). In all code paths reaching here, tunnel->net should already have been set (either in ip_tunnel_newlink() or __ip_tunnel_create()). So don't set it again. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-6-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21ieee802154: 6lowpan: Validate link netns in newlink() of rtnl_link_opsXiao Liang
Device denoted by IFLA_LINK is in link_net (IFLA_LINK_NETNSID) or source netns by design, but 6lowpan uses dev_net. Note dev->netns_local is set to true and currently link_net is implemented via a netns change. These together effectively reject IFLA_LINK_NETNSID. This patch adds a validation to ensure link_net is either NULL or identical to dev_net. Thus it would be fine to continue using dev_net when rtnetlink core begins to create devices directly in target netns. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-5-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: Use link/peer netns in newlink() of rtnl_link_opsXiao Liang
Add two helper functions - rtnl_newlink_link_net() and rtnl_newlink_peer_net() for netns fallback logic. Peer netns falls back to link netns, and link netns falls back to source netns. Convert the use of params->net in netdevice drivers to one of the helper functions for clarity. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-4-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Pack newlink() params into structXiao Liang
There are 4 net namespaces involved when creating links: - source netns - where the netlink socket resides, - target netns - where to put the device being created, - link netns - netns associated with the device (backend), - peer netns - netns of peer device. Currently, two nets are passed to newlink() callback - "src_net" parameter and "dev_net" (implicitly in net_device). They are set as follows, depending on netlink attributes in the request. +------------+-------------------+---------+---------+ | peer netns | IFLA_LINK_NETNSID | src_net | dev_net | +------------+-------------------+---------+---------+ | | absent | source | target | | absent +-------------------+---------+---------+ | | present | link | link | +------------+-------------------+---------+---------+ | | absent | peer | target | | present +-------------------+---------+---------+ | | present | peer | link | +------------+-------------------+---------+---------+ When IFLA_LINK_NETNSID is present, the device is created in link netns first and then moved to target netns. This has some side effects, including extra ifindex allocation, ifname validation and link events. These could be avoided if we create it in target netns from the beginning. On the other hand, the meaning of src_net parameter is ambiguous. It varies depending on how parameters are passed. It is the effective link (or peer netns) by design, but some drivers ignore it and use dev_net instead. To provide more netns context for drivers, this patch packs existing newlink() parameters, along with the source netns, link netns and peer netns, into a struct. The old "src_net" is renamed to "net" to avoid confusion with real source netns, and will be deprecated later. The use of src_net are converted to params->net trivially. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Lookup device in target netns when creating linkXiao Liang
When creating link, lookup for existing device in target net namespace instead of current one. For example, two links created by: # ip link add dummy1 type dummy # ip link add netns ns1 dummy1 type dummy should have no conflict since they are in different namespaces. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-2-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rxrpc: Fix locking issues with the peer record hashDavid Howells
rxrpc_new_incoming_peer() can't use spin_lock_bh() whilst its caller has interrupts disabled. WARNING: CPU: 0 PID: 1550 at kernel/softirq.c:369 __local_bh_enable_ip+0x46/0xd0 ... Call Trace: rxrpc_alloc_incoming_call+0x1b0/0x400 rxrpc_new_incoming_call+0x1dd/0x5e0 rxrpc_input_packet+0x84a/0x920 rxrpc_io_thread+0x40d/0xb40 kthread+0x2ec/0x300 ret_from_fork+0x24/0x40 ret_from_fork_asm+0x1a/0x30 </TASK> irq event stamp: 1811 hardirqs last enabled at (1809): _raw_spin_unlock_irq+0x24/0x50 hardirqs last disabled at (1810): _raw_read_lock_irq+0x17/0x70 softirqs last enabled at (1182): handle_softirqs+0x3ee/0x430 softirqs last disabled at (1811): rxrpc_new_incoming_peer+0x56/0x120 Fix this by using a plain spin_lock() instead. IRQs are held, so softirqs can't happen. Fixes: a2ea9a907260 ("rxrpc: Use irq-disabling spinlocks between app and I/O thread") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250218192250.296870-4-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rxrpc: peer->mtu_lock is redundantDavid Howells
The peer->mtu_lock is only used to lock around writes to peer->max_data - and nothing else; further, all such writes take place in the I/O thread and the lock is only ever write-locked and never read-locked. In a couple of places, the write_seqcount_begin() is wrapped in preempt_disable/enable(), but not in all places. This can cause lockdep to complain: WARNING: CPU: 0 PID: 1549 at include/linux/seqlock.h:221 rxrpc_input_ack_trailer+0x305/0x430 ... RIP: 0010:rxrpc_input_ack_trailer+0x305/0x430 Fix this by just getting rid of the lock. Fixes: eeaedc5449d9 ("rxrpc: Implement path-MTU probing using padded PING ACKs (RFC8899)") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250218192250.296870-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rxrpc: rxperf: Fix missing decoding of terminal magic cookieDavid Howells
The rxperf RPCs seem to have a magic cookie at the end of the request that was failing to be taken account of by the unmarshalling of the request. Fix the rxperf code to expect this. Fixes: 75bfdbf2fca3 ("rxrpc: Implement an in-kernel rxperf server for testing purposes") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250218192250.296870-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21xfrm_output: Force software GSO only in tunnel modeCosmin Ratiu
The cited commit fixed a software GSO bug with VXLAN + IPSec in tunnel mode. Unfortunately, it is slightly broader than necessary, as it also severely affects performance for Geneve + IPSec transport mode over a device capable of both HW GSO and IPSec crypto offload. In this case, xfrm_output unnecessarily triggers software GSO instead of letting the HW do it. In simple iperf3 tests over Geneve + IPSec transport mode over a back-2-back pair of NICs with MTU 1500, the performance was observed to be up to 6x worse when doing software GSO compared to leaving it to the hardware. This commit makes xfrm_output only trigger software GSO in crypto offload cases for already encapsulated packets in tunnel mode, as not doing so would then cause the inner tunnel skb->inner_networking_header to be overwritten and break software GSO for that packet later if the device turns out to not be capable of HW GSO. Taking a closer look at the conditions for the original bug, to better understand the reasons for this change: - vxlan_build_skb -> iptunnel_handle_offloads sets inner_protocol and inner network header. - then, udp_tunnel_xmit_skb -> ip_tunnel_xmit adds outer transport and network headers. - later in the xmit path, xfrm_output -> xfrm_outer_mode_output -> xfrm4_prepare_output -> xfrm4_tunnel_encap_add overwrites the inner network header with the one set in ip_tunnel_xmit before adding the second outer header. - __dev_queue_xmit -> validate_xmit_skb checks whether GSO segmentation needs to happen based on dev features. In the original bug, the hw couldn't segment the packets, so skb_gso_segment was invoked. - deep in the .gso_segment callback machinery, __skb_udp_tunnel_segment tries to use the wrong inner network header, expecting the one set in iptunnel_handle_offloads but getting the one set by xfrm instead. - a bit later, ipv6_gso_segment accesses the wrong memory based on that wrong inner network header. With the new change, the original bug (or similar ones) cannot happen again, as xfrm will now trigger software GSO before applying a tunnel. This concern doesn't exist in packet offload mode, when the HW adds encapsulation headers. For the non-offloaded packets (crypto in SW), software GSO is still done unconditionally in the else branch. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Fixes: a204aef9fd77 ("xfrm: call xfrm_output_gso when inner_protocol is set in xfrm_output") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21xfrm: fix tunnel mode TX datapath in packet offload modeAlexandre Cassen
Packets that match the output xfrm policy are delivered to the netstack. In IPsec packet mode for tunnel mode, the HW is responsible for building the hard header and outer IP header. In such a situation, the inner header may refer to a network that is not directly reachable by the host, resulting in a failed neighbor resolution. The packet is then dropped. xfrm policy defines the netdevice to use for xmit so we can send packets directly to it. Makes direct xmit exclusive to tunnel mode, since some rules may apply in transport mode. Fixes: f8a70afafc17 ("xfrm: add TX datapath support for IPsec packet offload mode") Signed-off-by: Alexandre Cassen <acassen@corp.free.fr> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21xfrm: check for PMTU in tunnel mode for packet offloadLeon Romanovsky
In tunnel mode, for the packet offload, there were no PMTU signaling to the upper level about need to fragment the packet. As a solution, call to already existing xfrm[4|6]_tunnel_check_size() to perform that. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21xfrm: provide common xdo_dev_offload_ok callback implementationLeon Romanovsky
Almost all drivers except bond and nsim had same check if device can perform XFRM offload on that specific packet. The check was that packet doesn't have IPv4 options and IPv6 extensions. In NIC drivers, the IPv4 HELEN comparison was slightly different, but the intent was to check for the same conditions. So let's chose more strict variant as a common base. Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21xfrm: rely on XFRM offloadLeon Romanovsky
After change of initialization of x->type_offload pointer to be valid only for offloaded SAs. There is no need to rely on both x->type_offload and x->xso.type to determine if SA is offloaded or not. Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21xfrm: simplify SA initialization routineLeon Romanovsky
SA replay mode is initialized differently for user-space and kernel-space users, but the call to xfrm_init_replay() existed in common path with boolean protection. That caused to situation where we have two different function orders. So let's rewrite the SA initialization flow to have same order for both in-kernel and user-space callers. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21xfrm: delay initialization of offload path till its actually requestedLeon Romanovsky
XFRM offload path is probed even if offload isn't needed at all. Let's make sure that x->type_offload pointer stays NULL for such path to reduce ambiguity. Fixes: 9d389d7f84bb ("xfrm: Add a xfrm type offload.") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-20neighbour: Replace kvzalloc() with kzalloc() when GFP_ATOMIC is specifiedKohei Enju
kzalloc() uses page allocator when size is larger than KMALLOC_MAX_CACHE_SIZE, so the intention of commit ab101c553bc1 ("neighbour: use kvzalloc()/kvfree()") can be achieved by using kzalloc(). When using GFP_ATOMIC, kvzalloc() only tries the kmalloc path, since the vmalloc path does not support the flag. In this case, kvzalloc() is equivalent to kzalloc() in that neither try the vmalloc path, so this replacement brings no functional change. This is primarily a cleanup change, as the original code functions correctly. This patch replaces kvzalloc() introduced by commit 41b3caa7c076 ("neighbour: Add hlist_node to struct neighbour"), which is called in the same context and with the same gfp flag as the aforementioned commit ab101c553bc1 ("neighbour: use kvzalloc()/kvfree()"). Signed-off-by: Kohei Enju <enjuk@amazon.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Link: https://patch.msgid.link/20250219102227.72488-1-enjuk@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix access outside of user given buffer in pktgen_thread_write()Peter Seiderer
Honour the user given buffer size for the strn_len() calls (otherwise strn_len() will access memory outside of the user given buffer). Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-8-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix ctrl interface command parsingPeter Seiderer
Enable command writing without trailing '\n': - the good case $ echo "reset" > /proc/net/pktgen/pgctrl - the bad case (before the patch) $ echo -n "reset" > /proc/net/pktgen/pgctrl -bash: echo: write error: Invalid argument - with patch applied $ echo -n "reset" > /proc/net/pktgen/pgctrl Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-7-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix 'ratep 0' error handling (return -EINVAL)Peter Seiderer
Given an invalid 'ratep' command e.g. 'ratep 0' the return value is '1', leading to the following misleading output: - the good case $ echo "ratep 100" > /proc/net/pktgen/lo\@0 $ grep "Result:" /proc/net/pktgen/lo\@0 Result: OK: ratep=100 - the bad case (before the patch) $ echo "ratep 0" > /proc/net/pktgen/lo\@0" -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: No such parameter "atep" - with patch applied $ echo "ratep 0" > /proc/net/pktgen/lo\@0 -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: Idle Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-6-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix 'rate 0' error handling (return -EINVAL)Peter Seiderer
Given an invalid 'rate' command e.g. 'rate 0' the return value is '1', leading to the following misleading output: - the good case $ echo "rate 100" > /proc/net/pktgen/lo\@0 $ grep "Result:" /proc/net/pktgen/lo\@0 Result: OK: rate=100 - the bad case (before the patch) $ echo "rate 0" > /proc/net/pktgen/lo\@0" -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: No such parameter "ate" - with patch applied $ echo "rate 0" > /proc/net/pktgen/lo\@0 -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: Idle Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-5-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix hex32_arg parsing for short readsPeter Seiderer
Fix hex32_arg parsing for short reads (here 7 hex digits instead of the expected 8), shift result only on successful input parsing. - before the patch $ echo "mpls 0000123" > /proc/net/pktgen/lo\@0 $ grep mpls /proc/net/pktgen/lo\@0 mpls: 00001230 Result: OK: mpls=00001230 - with patch applied $ echo "mpls 0000123" > /proc/net/pktgen/lo\@0 $ grep mpls /proc/net/pktgen/lo\@0 mpls: 00000123 Result: OK: mpls=00000123 Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-4-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: enable 'param=value' parsingPeter Seiderer
Enable more flexible parameters syntax, allowing 'param=value' in addition to the already supported 'param value' pattern (additional this gives the skipping '=' in count_trail_chars() a purpose). Tested with: $ echo "min_pkt_size 999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size=999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size =999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size= 999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size = 999" > /proc/net/pktgen/lo\@0 Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-3-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: replace ENOTSUPP with EOPNOTSUPPPeter Seiderer
Replace ENOTSUPP with EOPNOTSUPP, fixes checkpatch hint WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP and e.g. $ echo "clone_skb 1" > /proc/net/pktgen/lo\@0 -bash: echo: write error: Unknown error 524 Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-2-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>