summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-12-18net: dsa: restore dsa_software_vlan_untag() ability to operate on ↵Vladimir Oltean
VLAN-untagged traffic Robert Hodaszi reports that locally terminated traffic towards VLAN-unaware bridge ports is broken with ocelot-8021q. He is describing the same symptoms as for commit 1f9fc48fd302 ("net: dsa: sja1105: fix reception from VLAN-unaware bridges"). For context, the set merged as "VLAN fixes for Ocelot driver": https://lore.kernel.org/netdev/20240815000707.2006121-1-vladimir.oltean@nxp.com/ was developed in a slightly different form earlier this year, in January. Initially, the switch was unconditionally configured to set OCELOT_ES0_TAG when using ocelot-8021q, regardless of port operating mode. This led to the situation where VLAN-unaware bridge ports would always push their PVID - see ocelot_vlan_unaware_pvid() - a negligible value anyway - into RX packets. To strip this in software, we would have needed DSA to know what private VID the switch chose for VLAN-unaware bridge ports, and pushed into the packets. This was implemented downstream, and a remnant of it remains in the form of a comment mentioning ds->ops->get_private_vid(), as something which would maybe need to be considered in the future. However, for upstream, it was deemed inappropriate, because it would mean introducing yet another behavior for stripping VLAN tags from VLAN-unaware bridge ports, when one already existed (ds->untag_bridge_pvid). The latter has been marked as obsolete along with an explanation why it is logically broken, but still, it would have been confusing. So, for upstream, felix_update_tag_8021q_rx_rule() was developed, which essentially changed the state of affairs from "Felix with ocelot-8021q delivers all packets as VLAN-tagged towards the CPU" into "Felix with ocelot-8021q delivers all packets from VLAN-aware bridge ports towards the CPU". This was done on the premise that in VLAN-unaware mode, there's nothing useful in the VLAN tags, and we can avoid introducing ds->ops->get_private_vid() in the DSA receive path if we configure the switch to not push those VLAN tags into packets in the first place. Unfortunately, and this is when the trainwreck started, the selftests developed initially and posted with the series were not re-ran. dsa_software_vlan_untag() was initially written given the assumption that users of this feature would send _all_ traffic as VLAN-tagged. It was only partially adapted to the new scheme, by removing ds->ops->get_private_vid(), which also used to be necessary in standalone ports mode. Where the trainwreck became even worse is that I had a second opportunity to think about this, when the dsa_software_vlan_untag() logic change initially broke sja1105, in commit 1f9fc48fd302 ("net: dsa: sja1105: fix reception from VLAN-unaware bridges"). I did not connect the dots that it also breaks ocelot-8021q, for pretty much the same reason that not all received packets will be VLAN-tagged. To be compatible with the optimized Felix control path which runs felix_update_tag_8021q_rx_rule() to only push VLAN tags when useful (in VLAN-aware mode), we need to restore the old dsa_software_vlan_untag() logic. The blamed commit introduced the assumption that dsa_software_vlan_untag() will see only VLAN-tagged packets, assumption which is false. What corrupts RX traffic is the fact that we call skb_vlan_untag() on packets which are not VLAN-tagged in the first place. Fixes: 93e4649efa96 ("net: dsa: provide a software untagging function on RX for VLAN-aware bridges") Reported-by: Robert Hodaszi <robert.hodaszi@digi.com> Closes: https://lore.kernel.org/netdev/20241215163334.615427-1-robert.hodaszi@digi.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20241216135059.1258266-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-18Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice: add support for devlink health events Przemek Kitszel says: Reports for two kinds of events are implemented, Malicious Driver Detection (MDD) and Tx hang. Patches 1, 2, 3: core improvements (checkpatch.pl, devlink extension) Patch 4: rename current ice devlink/ files Patches 5, 6, 7: ice devlink health infra + reporters Mateusz did good job caring for this series, and hardening the code. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: Add MDD logging via devlink health ice: add Tx hang devlink health reporter ice: rename devlink_port.[ch] to port.[ch] devlink: add devlink_fmsg_dump_skb() function devlink: add devlink_fmsg_put() macro checkpatch: don't complain on _Generic() use ==================== Link: https://patch.msgid.link/20241217210835.3702003-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-18ptr_ring: do not block hard interrupts in ptr_ring_resize_multiple()Eric Dumazet
Jakub added a lockdep_assert_no_hardirq() check in __page_pool_put_page() to increase test coverage. syzbot found a splat caused by hard irq blocking in ptr_ring_resize_multiple() [1] As current users of ptr_ring_resize_multiple() do not require hard irqs being masked, replace it to only block BH. Rename helpers to better reflect they are safe against BH only. - ptr_ring_resize_multiple() to ptr_ring_resize_multiple_bh() - skb_array_resize_multiple() to skb_array_resize_multiple_bh() [1] WARNING: CPU: 1 PID: 9150 at net/core/page_pool.c:709 __page_pool_put_page net/core/page_pool.c:709 [inline] WARNING: CPU: 1 PID: 9150 at net/core/page_pool.c:709 page_pool_put_unrefed_netmem+0x157/0xa40 net/core/page_pool.c:780 Modules linked in: CPU: 1 UID: 0 PID: 9150 Comm: syz.1.1052 Not tainted 6.11.0-rc3-syzkaller-00202-gf8669d7b5f5d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 RIP: 0010:__page_pool_put_page net/core/page_pool.c:709 [inline] RIP: 0010:page_pool_put_unrefed_netmem+0x157/0xa40 net/core/page_pool.c:780 Code: 74 0e e8 7c aa fb f7 eb 43 e8 75 aa fb f7 eb 3c 65 8b 1d 38 a8 6a 76 31 ff 89 de e8 a3 ae fb f7 85 db 74 0b e8 5a aa fb f7 90 <0f> 0b 90 eb 1d 65 8b 1d 15 a8 6a 76 31 ff 89 de e8 84 ae fb f7 85 RSP: 0018:ffffc9000bda6b58 EFLAGS: 00010083 RAX: ffffffff8997e523 RBX: 0000000000000000 RCX: 0000000000040000 RDX: ffffc9000fbd0000 RSI: 0000000000001842 RDI: 0000000000001843 RBP: 0000000000000000 R08: ffffffff8997df2c R09: 1ffffd40003a000d R10: dffffc0000000000 R11: fffff940003a000e R12: ffffea0001d00040 R13: ffff88802e8a4000 R14: dffffc0000000000 R15: 00000000ffffffff FS: 00007fb7aaf716c0(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa15a0d4b72 CR3: 00000000561b0000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tun_ptr_free drivers/net/tun.c:617 [inline] __ptr_ring_swap_queue include/linux/ptr_ring.h:571 [inline] ptr_ring_resize_multiple_noprof include/linux/ptr_ring.h:643 [inline] tun_queue_resize drivers/net/tun.c:3694 [inline] tun_device_event+0xaaf/0x1080 drivers/net/tun.c:3714 notifier_call_chain+0x19f/0x3e0 kernel/notifier.c:93 call_netdevice_notifiers_extack net/core/dev.c:2032 [inline] call_netdevice_notifiers net/core/dev.c:2046 [inline] dev_change_tx_queue_len+0x158/0x2a0 net/core/dev.c:9024 do_setlink+0xff6/0x41f0 net/core/rtnetlink.c:2923 rtnl_setlink+0x40d/0x5a0 net/core/rtnetlink.c:3201 rtnetlink_rcv_msg+0x73f/0xcf0 net/core/rtnetlink.c:6647 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2550 Fixes: ff4e538c8c3e ("page_pool: add a lockdep check for recycling in hardirq") Reported-by: syzbot+f56a5c5eac2b28439810@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/671e10df.050a0220.2b8c0f.01cf.GAE@google.com/T/ Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20241217135121.326370-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19netfilter: ipset: Fix for recursive locking warningPhil Sutter
With CONFIG_PROVE_LOCKING, when creating a set of type bitmap:ip, adding it to a set of type list:set and populating it from iptables SET target triggers a kernel warning: | WARNING: possible recursive locking detected | 6.12.0-rc7-01692-g5e9a28f41134-dirty #594 Not tainted | -------------------------------------------- | ping/4018 is trying to acquire lock: | ffff8881094a6848 (&set->lock){+.-.}-{2:2}, at: ip_set_add+0x28c/0x360 [ip_set] | | but task is already holding lock: | ffff88811034c048 (&set->lock){+.-.}-{2:2}, at: ip_set_add+0x28c/0x360 [ip_set] This is a false alarm: ipset does not allow nested list:set type, so the loop in list_set_kadd() can never encounter the outer set itself. No other set type supports embedded sets, so this is the only case to consider. To avoid the false report, create a distinct lock class for list:set type ipset locks. Fixes: f830837f0eed ("netfilter: ipset: list:set set type support") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-12-18ipvs: Fix clamp() of ip_vs_conn_tab on small memory systemsDavid Laight
The 'max_avail' value is calculated from the system memory size using order_base_2(). order_base_2(x) is defined as '(x) ? fn(x) : 0'. The compiler generates two copies of the code that follows and then expands clamp(max, min, PAGE_SHIFT - 12) (11 on 32bit). This triggers a compile-time assert since min is 5. In reality a system would have to have less than 512MB memory for the bounds passed to clamp to be reversed. Swap the order of the arguments to clamp() to avoid the warning. Replace the clamp_val() on the line below with clamp(). clamp_val() is just 'an accident waiting to happen' and not needed here. Detected by compile time checks added to clamp(), specifically: minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Closes: https://lore.kernel.org/all/CA+G9fYsT34UkGFKxus63H6UVpYi5GRZkezT9MRLfAbM3f6ke0g@mail.gmail.com/ Fixes: 4f325e26277b ("ipvs: dynamically limit the connection hash table") Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: David Laight <david.laight@aculab.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-12-17inetpeer: do not get a refcount in inet_getpeer()Eric Dumazet
All inet_getpeer() callers except ip4_frag_init() don't need to acquire a permanent refcount on the inetpeer. They can switch to full RCU protection. Move the refcount_inc_not_zero() into ip4_frag_init(), so that all the other callers no longer have to perform a pair of expensive atomic operations on a possibly contended cache line. inet_putpeer() no longer needs to be exported. After this patch, my DUT can receive 8,400,000 UDP packets per second targeting closed ports, using 50% less cpu cycles than before. Also change two calls to l3mdev_master_ifindex() by l3mdev_master_ifindex_rcu() (Ido ideas) Fixes: 8c2bd38b95f7 ("icmp: change the order of rate limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241215175629.1248773-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17inetpeer: update inetpeer timestamp in inet_getpeer()Eric Dumazet
inet_putpeer() will be removed in the following patch, because we will no longer use refcounts. Update inetpeer timestamp (p->dtime) at lookup time. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241215175629.1248773-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17inetpeer: remove create argument of inet_getpeer()Eric Dumazet
All callers of inet_getpeer() want to create an inetpeer. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241215175629.1248773-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17inetpeer: remove create argument of inet_getpeer_v[46]()Eric Dumazet
All callers of inet_getpeer_v4() and inet_getpeer_v6() want to create an inetpeer. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241215175629.1248773-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17net: bridge: constify 'struct bin_attribute'Thomas Weißschuh
The sysfs core now allows instances of 'struct bin_attribute' to be moved into read-only memory. Make use of that to protect them against accidental or malicious modifications. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241216-sysfs-const-bin_attr-net-v1-1-ec460b91f274@weissschuh.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17rtnetlink: Try the outer netns attribute in rtnl_get_peer_net().Kuniyuki Iwashima
Xiao Liang reported that the cited commit changed netns handling in newlink() of netkit, veth, and vxcan. Before the patch, if we don't find a netns attribute in the peer device attributes, we tried to find another netns attribute in the outer netlink attributes by passing it to rtnl_link_get_net(). Let's restore the original behaviour. Fixes: 48327566769a ("rtnetlink: fix double call of rtnl_link_get_net_ifla()") Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/CABAhCORBVVU8P6AHcEkENMj+gD2d3ce9t=A_o48E0yOQp8_wUQ@mail.gmail.com/#t Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Xiao Liang <shaw.leon@gmail.com> Link: https://patch.msgid.link/20241216110432.51488-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17net: page_pool: rename page_pool_is_last_ref()Jakub Kicinski
page_pool_is_last_ref() releases a reference while the name, to me at least, suggests it just checks if the refcount is 1. The semantics of the function are the same as those of atomic_dec_and_test() and refcount_dec_and_test(), so just use the _and_test() suffix. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20241215212938.99210-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17devlink: add devlink_fmsg_dump_skb() functionMateusz Polchlopek
Add devlink_fmsg_dump_skb() function that adds some diagnostic information about skb (like length, pkt type, MAC, etc) to devlink fmsg mechanism using bunch of devlink_fmsg_put() function calls. Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-17net/sched: Add drop reasons for AQM-based qdiscsToke Høiland-Jørgensen
Now that we have generic QDISC_CONGESTED and QDISC_OVERLIMIT drop reasons, let's have all the qdiscs that contain an AQM apply them consistently when dropping packets. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20241214-fq-codel-drop-reasons-v1-1-2a814e884c37@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Remove unix_our_peer().Kuniyuki Iwashima
unix_our_peer() is used only in unix_may_send(). Let's inline it in unix_may_send(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Clean up error paths in unix_dgram_sendmsg().Kuniyuki Iwashima
The error path is complicated in unix_dgram_sendmsg() because there are two timings when other could be non-NULL: when it's fetched from unix_peer_get() and when it's looked up by unix_find_other(). Let's move unix_peer_get() to the else branch for unix_find_other() and clean up the error paths. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Clean up SOCK_DEAD error paths in unix_dgram_sendmsg().Kuniyuki Iwashima
When other has SOCK_DEAD in unix_dgram_sendmsg(), we hold unix_state_lock() for the sender socket first. However, we do not need it for sk->sk_type. Let's move the lock down a bit. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Defer sock_put() to clean up path in unix_dgram_sendmsg().Kuniyuki Iwashima
When other has SOCK_DEAD in unix_dgram_sendmsg(), we call sock_put() for it first and then set NULL to other before jumping to the error path. This is to skip sock_put() in the error path. Let's not set NULL to other and defer the sock_put() to the error path to clean up the labels later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Split restart label in unix_dgram_sendmsg().Kuniyuki Iwashima
There are two paths jumping to the restart label in unix_dgram_sendmsg(). One requires another lookup and sk_filter(), but the other doesn't. Let's split the label to make each flow more straightforward. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Use msg->{msg_name,msg_namelen} in unix_dgram_sendmsg().Kuniyuki Iwashima
In unix_dgram_sendmsg(), we use a local variable sunaddr pointing NULL or msg->msg_name based on msg->msg_namelen. Let's remove sunaddr and simplify the usage. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Move !sunaddr case in unix_dgram_sendmsg().Kuniyuki Iwashima
When other is NULL in unix_dgram_sendmsg(), we check if sunaddr is NULL before looking up a receiver socket. There are three paths going through the check, but it's always false for 2 out of the 3 paths: the first socket lookup and the second 'goto restart'. The condition can be true for the first 'goto restart' only when SOCK_DEAD is flagged for the socket found with msg->msg_name. Let's move the check to the single appropriate path. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Set error only when needed in unix_dgram_sendmsg().Kuniyuki Iwashima
We will introduce skb drop reason for AF_UNIX, then we need to set an errno and a drop reason for each path. Let's set an error only when it's needed in unix_dgram_sendmsg(). Then, we need not (re)set 0 to err. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Clean up error paths in unix_stream_sendmsg().Kuniyuki Iwashima
If we move send_sig() to the SEND_SHUTDOWN check before the while loop, then we can reuse the same kfree_skb() after the pipe_err_free label. Let's gather the scattered kfree_skb()s in error paths. While at it, some style issues are fixed, and the pipe_err_free label is renamed to out_pipe to match other label names. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Set error only when needed in unix_stream_sendmsg().Kuniyuki Iwashima
We will introduce skb drop reason for AF_UNIX, then we need to set an errno and a drop reason for each path. Let's set an error only when it's needed in unix_stream_sendmsg(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Clean up error paths in unix_stream_connect().Kuniyuki Iwashima
The label order is weird in unix_stream_connect(), and all NULL checks are unnecessary if reordered. Let's clean up the error paths to make it easy to set a drop reason for each path. While at it, a comment with the old style is updated. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17af_unix: Set error only when needed in unix_stream_connect().Kuniyuki Iwashima
We will introduce skb drop reason for AF_UNIX, then we need to set an errno and a drop reason for each path. Let's set an error only when it's needed in unix_stream_connect(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17batman-adv: Map VID 0 to untagged TT VLANSven Eckelmann
VID 0 is not a valid VLAN according to "802.1Q-2011" "Table 9-2—Reserved VID values". It is only used to indicate "priority tag" frames which only contain priority information and no VID. The 8021q is also redirecting the priority tagged frames to the underlying interface since commit ad1afb003939 ("vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)"). But at the same time, it automatically adds the VID 0 to all devices to ensure that VID 0 is in the allowed list of the HW filter. This resulted in a VLAN 0 which was always announced in OGM messages. batman-adv should therefore not create a new batadv_softif_vlan for VID 0 and handle all VID 0 related frames using the "untagged" global/local translation tables. Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Antonio Quartulli <antonio@mandelbit.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2024-12-16sock: Introduce SO_RCVPRIORITY socket optionAnna Emese Nyiri
Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the ancillary data returned by recvmsg(). This is analogous to the existing support for SO_RCVMARK, as implemented in commit 6fd1d51cfa253 ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()"). Reviewed-by: Willem de Bruijn <willemb@google.com> Suggested-by: Ferenc Fejes <fejes@inf.elte.hu> Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com> Link: https://patch.msgid.link/20241213084457.45120-5-annaemesenyiri@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16sock: support SO_PRIORITY cmsgAnna Emese Nyiri
The Linux socket API currently allows setting SO_PRIORITY at the socket level, applying a uniform priority to all packets sent through that socket. The exception to this is IP_TOS, when the priority value is calculated during the handling of ancillary data, as implemented in commit f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data"). However, this is a computed value, and there is currently no mechanism to set a custom priority via control messages prior to this patch. According to this patch, if SO_PRIORITY is specified as ancillary data, the packet is sent with the priority value set through sockc->priority, overriding the socket-level values set via the traditional setsockopt() method. This is analogous to the existing support for SO_MARK, as implemented in commit c6af0c227a22 ("ip: support SO_MARK cmsg"). If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that takes precedence is the last one in the cmsg list. This patch has the side effect that raw_send_hdrinc now interprets cmsg IP_TOS. Reviewed-by: Willem de Bruijn <willemb@google.com> Suggested-by: Ferenc Fejes <fejes@inf.elte.hu> Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com> Link: https://patch.msgid.link/20241213084457.45120-3-annaemesenyiri@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16sock: Introduce sk_set_prio_allowed helper functionAnna Emese Nyiri
Simplify priority setting permissions with the 'sk_set_prio_allowed' function, centralizing the validation logic. This change is made in anticipation of a second caller in a following patch. No functional changes. Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com> Link: https://patch.msgid.link/20241213084457.45120-2-annaemesenyiri@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16rxrpc: Fix ability to add more data to a call once MSG_MORE deassertedDavid Howells
When userspace is adding data to an RPC call for transmission, it must pass MSG_MORE to sendmsg() if it intends to add more data in future calls to sendmsg(). Calling sendmsg() without MSG_MORE being asserted closes the transmission phase of the call (assuming sendmsg() adds all the data presented) and further attempts to add more data should be rejected. However, this is no longer the case. The change of call state that was previously the guard got bumped over to the I/O thread, which leaves a window for a repeat sendmsg() to insert more data. This previously went unnoticed, but the more recent patch that changed the structures behind the Tx queue added a warning: WARNING: CPU: 3 PID: 6639 at net/rxrpc/sendmsg.c:296 rxrpc_send_data+0x3f2/0x860 and rejected the additional data, returning error EPROTO. Fix this by adding a guard flag to the call, setting the flag when we queue the final packet and then rejecting further attempts to add data with EPROTO. Fixes: 2d689424b618 ("rxrpc: Move call state changes from sendmsg to I/O thread") Reported-by: syzbot+ff11be94dfcd7a5af8da@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/6757fb68.050a0220.2477f.005f.GAE@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: syzbot+ff11be94dfcd7a5af8da@syzkaller.appspotmail.com cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/2870480.1734037462@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16rxrpc: Disable IRQ, not BH, to take the lock for ->attend_linkDavid Howells
Use spin_lock_irq(), not spin_lock_bh() to take the lock when accessing the ->attend_link() to stop a delay in the I/O thread due to an interrupt being taken in the app thread whilst that holds the lock and vice versa. Fixes: a2ea9a907260 ("rxrpc: Use irq-disabling spinlocks between app and I/O thread") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/2870146.1734037095@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16netdev: fix repeated netlink messages in queue statsJakub Kicinski
The context is supposed to record the next queue to dump, not last dumped. If the dump doesn't fit we will restart from the already-dumped queue, duplicating the message. Before this fix and with the selftest improvements later in this series we see: # ./run_kselftest.sh -t drivers/net:stats.py timeout set to 45 selftests: drivers/net: stats.py KTAP version 1 1..5 ok 1 stats.check_pause ok 2 stats.check_fec ok 3 stats.pkt_byte_sum # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])), # Check failed 45 != 44 repeated queue keys # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1, # Check failed 45 != 44 missing queue keys # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])), # Check failed 45 != 44 repeated queue keys # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1, # Check failed 45 != 44 missing queue keys # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])), # Check failed 103 != 100 repeated queue keys # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1, # Check failed 103 != 100 missing queue keys # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])), # Check failed 102 != 100 repeated queue keys # Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex: # Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1, # Check failed 102 != 100 missing queue keys not ok 4 stats.qstat_by_ifindex ok 5 stats.check_down # Totals: pass:4 fail:1 xfail:0 xpass:0 skip:0 error:0 With the fix: # ./ksft-net-drv/run_kselftest.sh -t drivers/net:stats.py timeout set to 45 selftests: drivers/net: stats.py KTAP version 1 1..5 ok 1 stats.check_pause ok 2 stats.check_fec ok 3 stats.pkt_byte_sum ok 4 stats.qstat_by_ifindex ok 5 stats.check_down # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0 Fixes: ab63a2387cb9 ("netdev: add per-queue statistics") Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241213152244.3080955-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16netdev: fix repeated netlink messages in queue dumpJakub Kicinski
The context is supposed to record the next queue to dump, not last dumped. If the dump doesn't fit we will restart from the already-dumped queue, duplicating the message. Before this fix and with the selftest improvements later in this series we see: # ./run_kselftest.sh -t drivers/net:queues.py timeout set to 45 selftests: drivers/net: queues.py KTAP version 1 1..2 # Check| At /root/ksft-net-drv/drivers/net/./queues.py, line 32, in get_queues: # Check| ksft_eq(queues, expected) # Check failed 102 != 100 # Check| At /root/ksft-net-drv/drivers/net/./queues.py, line 32, in get_queues: # Check| ksft_eq(queues, expected) # Check failed 101 != 100 not ok 1 queues.get_queues ok 2 queues.addremove_queues # Totals: pass:1 fail:1 xfail:0 xpass:0 skip:0 error:0 not ok 1 selftests: drivers/net: queues.py # exit=1 With the fix: # ./ksft-net-drv/run_kselftest.sh -t drivers/net:queues.py timeout set to 45 selftests: drivers/net: queues.py KTAP version 1 1..2 ok 1 queues.get_queues ok 2 queues.addremove_queues # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0 Fixes: 6b6171db7fc8 ("netdev-genl: Add netlink framework functions for queue") Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241213152244.3080955-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16ceph: allocate sparse_ext map only for sparse readsIlya Dryomov
If mounted with sparseread option, ceph_direct_read_write() ends up making an unnecessarily allocation for O_DIRECT writes. Fixes: 03bc06c7b0bd ("ceph: add new mount option to enable sparse reads") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com>
2024-12-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: Auto-merging include/linux/bpf.h Auto-merging include/linux/bpf_verifier.h Auto-merging kernel/bpf/btf.c Auto-merging kernel/bpf/verifier.c Auto-merging kernel/trace/bpf_trace.c Auto-merging tools/testing/selftests/bpf/progs/test_tp_btf_nullable.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-16net: ethtool: Add support for tsconfig command to get/set hwtstamp configKory Maincent
Introduce support for ETHTOOL_MSG_TSCONFIG_GET/SET ethtool netlink socket to read and configure hwtstamp configuration of a PHC provider. Note that simultaneous hwtstamp isn't supported; configuring a new one disables the previous setting. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topologyKory Maincent
Either the MAC or the PHY can provide hwtstamp, so we should be able to read the tsinfo for any hwtstamp provider. Enhance 'get' command to retrieve tsinfo of hwtstamp providers within a network topology. Add support for a specific dump command to retrieve all hwtstamp providers within the network topology, with added functionality for filtered dump to target a single interface. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16net: Add the possibility to support a selected hwtstamp in netdeviceKory Maincent
Introduce the description of a hwtstamp provider, mainly defined with a the hwtstamp source and the phydev pointer. Add a hwtstamp provider description within the netdev structure to allow saving the hwtstamp we want to use. This prepares for future support of an ethtool netlink command to select the desired hwtstamp provider. By default, the old API that does not support hwtstamp selectability is used, meaning the hwtstamp provider pointer is unset. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16net: Make net_hwtstamp_validate accessibleKory Maincent
Make the net_hwtstamp_validate function accessible in prevision to use it from ethtool to validate the hwtstamp configuration before setting it. Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16net: Make dev_get_hwtstamp_phylib accessibleKory Maincent
Make the dev_get_hwtstamp_phylib function accessible in prevision to use it from ethtool to read the hwtstamp current configuration. Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16tls: add counters for rekeySabrina Dubroca
This introduces 5 counters to keep track of key updates: Tls{Rx,Tx}Rekey{Ok,Error} and TlsRxRekeyReceived. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16tls: implement rekey for TLS1.3Sabrina Dubroca
This adds the possibility to change the key and IV when using TLS1.3. Changing the cipher or TLS version is not supported. Once we have updated the RX key, we can unblock the receive side. If the rekey fails, the context is unmodified and userspace is free to retry the update or close the socket. This change only affects tls_sw, since 1.3 offload isn't supported. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-16tls: block decryption when a rekey is pendingSabrina Dubroca
When a TLS handshake record carrying a KeyUpdate message is received, all subsequent records will be encrypted with a new key. We need to stop decrypting incoming records with the old key, and wait until userspace provides a new key. Make a note of this in the RX context just after decrypting that record, and stop recvmsg/splice calls with EKEYEXPIRED until the new key is available. key_update_pending can't be combined with the existing bitfield, because we will read it locklessly in ->poll. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-15mptcp: drop useless "err = 0" in subflow_destroyGeliang Tang
Upon successful return, mptcp_pm_parse_addr() returns 0. There is no need to set "err = 0" after this. So after mptcp_nl_find_ssk() returns, just need to set "err = -ESRCH", then release and free msk socket if it returns NULL. Also, no need to define the variable "subflow" in subflow_destroy(), use mptcp_subflow_ctx(ssk) directly. This patch doesn't change the behaviour of the code, just refactoring. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-7-ddb6d00109a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-15mptcp: change local addr type of subflow_destroyGeliang Tang
Generally, in the path manager interfaces, the local address is defined as an mptcp_pm_addr_entry type address, while the remote address is defined as an mptcp_addr_info type one: (struct mptcp_pm_addr_entry *local, struct mptcp_addr_info *remote) But subflow_destroy() interface uses two mptcp_addr_info type parameters. This patch changes the first one to mptcp_pm_addr_entry type and use helper mptcp_pm_parse_entry() to parse it instead of using mptcp_pm_parse_addr(). This patch doesn't change the behaviour of the code, just refactoring. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-6-ddb6d00109a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-15mptcp: drop free_list for deleting entriesGeliang Tang
mptcp_pm_remove_addrs() actually only deletes one address, which does not match its name. This patch renames it to mptcp_pm_remove_addr_entry() and changes the parameter "rm_list" to "entry". With the help of mptcp_pm_remove_addr_entry(), it's no longer necessary to move the entry to be deleted to free_list and then traverse the list to delete the entry, which is not allowed in BPF. The entry can be directly deleted through list_del_rcu() and sock_kfree_s() now. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-5-ddb6d00109a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-15mptcp: move mptcp_pm_remove_addrs into pm_userspaceGeliang Tang
Since mptcp_pm_remove_addrs() is only called from the userspace PM, this patch moves it into pm_userspace.c. For this, lookup_subflow_by_saddr() and remove_anno_list_by_saddr() helpers need to be exported in protocol.h. Also add "mptcp_" prefix for these helpers. Here, mptcp_pm_remove_addrs() is not changed to a static function because it will be used in BPF Path Manager. This patch doesn't change the behaviour of the code, just refactoring. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-4-ddb6d00109a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-15mptcp: add mptcp_userspace_pm_get_sock helperGeliang Tang
Each userspace pm netlink function uses nla_get_u32() to get the msk token value, then pass it to mptcp_token_get_sock() to get the msk. Finally check whether userspace PM is selected on this msk. It makes sense to wrap them into a helper, named mptcp_userspace_pm_get_sock(), to do this. This patch doesn't change the behaviour of the code, just refactoring. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-3-ddb6d00109a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-15mptcp: add mptcp_for_each_userspace_pm_addr macroGeliang Tang
Similar to mptcp_for_each_subflow() macro, this patch adds a new macro mptcp_for_each_userspace_pm_addr() for userspace PM to iterate over the address entries on the local address list userspace_pm_local_addr_list of the mptcp socket. This patch doesn't change the behaviour of the code, just refactoring. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-2-ddb6d00109a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>