summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-03-03sock: add sock_kmemdup helperGeliang Tang
This patch adds the sock version of kmemdup() helper, named sock_kmemdup(), to duplicate the input "src" memory block using the socket's option memory buffer. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/f828077394c7d1f3560123497348b438c875b510.1740735165.git.tanggeliang@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: tcp_set_window_clamp() cleanupEric Dumazet
Remove one indentation level. Use max_t() and clamp() macros. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: remove READ_ONCE(req->ts_recent)Eric Dumazet
After commit 8d52da23b6c6 ("tcp: Defer ts_recent changes until req is owned"), req->ts_recent is not changed anymore. It is set once in tcp_openreq_init(), bpf_sk_assign_tcp_reqsk() or cookie_tcp_reqsk_alloc() before the req can be seen by other cpus/threads. This completes the revert of eba20811f326 ("tcp: annotate data-races around tcp_rsk(req)->ts_recent"). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wang Hai <wanghai38@huawei.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03net: gro: convert four dev_net() callsEric Dumazet
tcp4_check_fraglist_gro(), tcp6_check_fraglist_gro(), udp4_gro_lookup_skb() and udp6_gro_lookup_skb() assume RCU is held so that the net structure does not disappear. Use dev_net_rcu() instead of dev_net() to get LOCKDEP support. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: convert to dev_net_rcu()Eric Dumazet
TCP uses of dev_net() are under RCU protection, change them to dev_net_rcu() to get LOCKDEP support. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: add four drop reasons to tcp_check_req()Eric Dumazet
Use two existing drop reasons in tcp_check_req(): - TCP_RFC7323_PAWS - TCP_OVERWINDOW Add two new ones: - TCP_RFC7323_TSECR (corresponds to LINUX_MIB_TSECRREJECTED) - TCP_LISTEN_OVERFLOW (when a listener accept queue is full) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: add a drop_reason pointer to tcp_check_req()Eric Dumazet
We want to add new drop reasons for packets dropped in 3WHS in the following patches. tcp_rcv_state_process() has to set reason to TCP_FASTOPEN, because tcp_check_req() will conditionally overwrite the drop_reason. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL.Kuniyuki Iwashima
We converted fib_info hash tables to per-netns one and now ready to convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL. Let's hold rtnl_net_lock() in inet_rtm_newroute() and inet_rtm_delroute(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-13-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Move fib_valid_key_len() to rtm_to_fib_config().Kuniyuki Iwashima
fib_valid_key_len() is called in the beginning of fib_table_insert() or fib_table_delete() to check if the prefix length is valid. fib_table_insert() and fib_table_delete() are called from 3 paths - ip_rt_ioctl() - inet_rtm_newroute() / inet_rtm_delroute() - fib_magic() In the first ioctl() path, rtentry_to_fib_config() checks the prefix length with bad_mask(). Also, fib_magic() always passes the correct prefix: 32 or ifa->ifa_prefixlen, which is already validated. Let's move fib_valid_key_len() to the rtnetlink path, rtm_to_fib_config(). While at it, 2 direct returns in rtm_to_fib_config() are changed to goto to match other places in the same function Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-12-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Hold rtnl_net_lock() in ip_rt_ioctl().Kuniyuki Iwashima
ioctl(SIOCADDRT/SIOCDELRT) calls ip_rt_ioctl() to add/remove a route in the netns of the specified socket. Let's hold rtnl_net_lock() there. Note that rtentry_to_fib_config() can be called without rtnl_net_lock() if we convert rtentry.dev handling to RCU later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-11-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Hold rtnl_net_lock() for ip_fib_net_exit().Kuniyuki Iwashima
ip_fib_net_exit() requires RTNL and is called from fib_net_init() and fib_net_exit_batch(). Let's hold rtnl_net_lock() before ip_fib_net_exit(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Namespacify fib_info hash tables.Kuniyuki Iwashima
We will convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL. Then, we need to have per-netns hash tables for struct fib_info. Let's allocate the hash tables per netns. fib_info_hash, fib_info_hash_bits, and fib_info_cnt are now moved to struct netns_ipv4 and accessed with net->ipv4.fib_XXX. Also, the netns checks are removed from fib_find_info_nh() and fib_find_info(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Add fib_info_hash_grow().Kuniyuki Iwashima
When the number of struct fib_info exceeds the hash table size in fib_create_info(), we try to allocate a new hash table with the doubled size. The allocation is done in fib_create_info(), and if successful, each struct fib_info is moved to the new hash table by fib_info_hash_move(). Let's integrate the allocation and fib_info_hash_move() as fib_info_hash_grow() to make the following change cleaner. While at it, fib_info_hash_grow() is placed near other hash-table-specific functions. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Remove fib_info_hash_size.Kuniyuki Iwashima
We will allocate the fib_info hash tables per netns. There are 5 global variables for fib_info hash tables: fib_info_hash, fib_info_laddrhash, fib_info_hash_size, fib_info_hash_bits, fib_info_cnt. However, fib_info_laddrhash and fib_info_hash_size can be easily calculated from fib_info_hash and fib_info_hash_bits. Let's remove fib_info_hash_size and use (1 << fib_info_hash_bits) instead. Now we need not pass the new hash table size to fib_info_hash_move(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Remove fib_info_laddrhash pointer.Kuniyuki Iwashima
We will allocate the fib_info hash tables per netns. There are 5 global variables for fib_info hash tables: fib_info_hash, fib_info_laddrhash, fib_info_hash_size, fib_info_hash_bits, fib_info_cnt. However, fib_info_laddrhash and fib_info_hash_size can be easily calculated from fib_info_hash and fib_info_hash_bits. Let's remove the fib_info_laddrhash pointer and instead use fib_info_hash + (1 << fib_info_hash_bits). While at it, fib_info_laddrhash_bucket() is moved near other hash-table-specific functions. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Make fib_info_hashfn() return struct hlist_head.Kuniyuki Iwashima
Every time fib_info_hashfn() returns a hash value, we fetch &fib_info_hash[hash]. Let's return the hlist_head pointer from fib_info_hashfn() and rename it to fib_info_hash_bucket() to match a similar function, fib_info_laddrhash_bucket(). Note that we need to move the fib_info_hash assignment earlier in fib_info_hash_move() to use fib_info_hash_bucket() in the for loop. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Allocate fib_info_hash[] during netns initialisation.Kuniyuki Iwashima
We will allocate fib_info_hash[] and fib_info_laddrhash[] for each netns. Currently, fib_info_hash[] is allocated when the first route is added. Let's move the first allocation to a new __net_init function. Note that we must call fib4_semantics_exit() in fib_net_exit_batch() because ->exit() is called earlier than ->exit_batch(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Allocate fib_info_hash[] and fib_info_laddrhash[] by kvcalloc().Kuniyuki Iwashima
Both fib_info_hash[] and fib_info_laddrhash[] are hash tables for struct fib_info and are allocated by kvzmalloc() separately. Let's replace the two kvzmalloc() calls with kvcalloc() to remove the fib_info_laddrhash pointer later. Note that fib_info_hash_alloc() allocates a new hash table based on fib_info_hash_bits because we will remove fib_info_hash_size later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Use cached net in fib_inetaddr_event().Kuniyuki Iwashima
net is available in fib_inetaddr_event(), let's use it. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03llc: do not use skb_get() before dev_queue_xmit()Eric Dumazet
syzbot is able to crash hosts [1], using llc and devices not supporting IFF_TX_SKB_SHARING. In this case, e1000 driver calls eth_skb_pad(), while the skb is shared. Simply replace skb_get() by skb_clone() in net/llc/llc_s_ac.c Note that e1000 driver might have an issue with pktgen, because it does not clear IFF_TX_SKB_SHARING, this is an orthogonal change. We need to audit other skb_get() uses in net/llc. [1] kernel BUG at net/core/skbuff.c:2178 ! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 16371 Comm: syz.2.2764 Not tainted 6.14.0-rc4-syzkaller-00052-gac9c34d1e45a #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:pskb_expand_head+0x6ce/0x1240 net/core/skbuff.c:2178 Call Trace: <TASK> __skb_pad+0x18a/0x610 net/core/skbuff.c:2466 __skb_put_padto include/linux/skbuff.h:3843 [inline] skb_put_padto include/linux/skbuff.h:3862 [inline] eth_skb_pad include/linux/etherdevice.h:656 [inline] e1000_xmit_frame+0x2d99/0x5800 drivers/net/ethernet/intel/e1000/e1000_main.c:3128 __netdev_start_xmit include/linux/netdevice.h:5151 [inline] netdev_start_xmit include/linux/netdevice.h:5160 [inline] xmit_one net/core/dev.c:3806 [inline] dev_hard_start_xmit+0x9a/0x7b0 net/core/dev.c:3822 sch_direct_xmit+0x1ae/0xc30 net/sched/sch_generic.c:343 __dev_xmit_skb net/core/dev.c:4045 [inline] __dev_queue_xmit+0x13d4/0x43e0 net/core/dev.c:4621 dev_queue_xmit include/linux/netdevice.h:3313 [inline] llc_sap_action_send_test_c+0x268/0x320 net/llc/llc_s_ac.c:144 llc_exec_sap_trans_actions net/llc/llc_sap.c:153 [inline] llc_sap_next_state net/llc/llc_sap.c:182 [inline] llc_sap_state_process+0x239/0x510 net/llc/llc_sap.c:209 llc_ui_sendmsg+0xd0d/0x14e0 net/llc/af_llc.c:993 sock_sendmsg_nosec net/socket.c:718 [inline] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+da65c993ae113742a25f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/67c020c0.050a0220.222324.0011.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-03netfilter: nft_ct: Use __refcount_inc() for per-CPU nft_ct_pcpu_template.Sebastian Andrzej Siewior
nft_ct_pcpu_template is a per-CPU variable and relies on disabled BH for its locking. The refcounter is read and if its value is set to one then the refcounter is incremented and variable is used - otherwise it is already in use and left untouched. Without per-CPU locking in local_bh_disable() on PREEMPT_RT the read-then-increment operation is not atomic and therefore racy. This can be avoided by using unconditionally __refcount_inc() which will increment counter and return the old value as an atomic operation. In case the returned counter is not one, the variable is in use and we need to decrement counter. Otherwise we can use it. Use __refcount_inc() instead of read and a conditional increment. Fixes: edee4f1e9245 ("netfilter: nft_ct: add zone id set support") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-03-03wifi: cfg80211: regulatory: improve invalid hints checkingNikita Zhandarovich
Syzbot keeps reporting an issue [1] that occurs when erroneous symbols sent from userspace get through into user_alpha2[] via regulatory_hint_user() call. Such invalid regulatory hints should be rejected. While a sanity check from commit 47caf685a685 ("cfg80211: regulatory: reject invalid hints") looks to be enough to deter these very cases, there is a way to get around it due to 2 reasons. 1) The way isalpha() works, symbols other than latin lower and upper letters may be used to determine a country/domain. For instance, greek letters will also be considered upper/lower letters and for such characters isalpha() will return true as well. However, ISO-3166-1 alpha2 codes should only hold latin characters. 2) While processing a user regulatory request, between reg_process_hint_user() and regulatory_hint_user() there happens to be a call to queue_regulatory_request() which modifies letters in request->alpha2[] with toupper(). This works fine for latin symbols, less so for weird letter characters from the second part of _ctype[]. Syzbot triggers a warning in is_user_regdom_saved() by first sending over an unexpected non-latin letter that gets malformed by toupper() into a character that ends up failing isalpha() check. Prevent this by enhancing is_an_alpha2() to ensure that incoming symbols are latin letters and nothing else. [1] Syzbot report: ------------[ cut here ]------------ Unexpected user alpha2: A� WARNING: CPU: 1 PID: 964 at net/wireless/reg.c:442 is_user_regdom_saved net/wireless/reg.c:440 [inline] WARNING: CPU: 1 PID: 964 at net/wireless/reg.c:442 restore_alpha2 net/wireless/reg.c:3424 [inline] WARNING: CPU: 1 PID: 964 at net/wireless/reg.c:442 restore_regulatory_settings+0x3c0/0x1e50 net/wireless/reg.c:3516 Modules linked in: CPU: 1 UID: 0 PID: 964 Comm: kworker/1:2 Not tainted 6.12.0-rc5-syzkaller-00044-gc1e939a21eb1 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Workqueue: events_power_efficient crda_timeout_work RIP: 0010:is_user_regdom_saved net/wireless/reg.c:440 [inline] RIP: 0010:restore_alpha2 net/wireless/reg.c:3424 [inline] RIP: 0010:restore_regulatory_settings+0x3c0/0x1e50 net/wireless/reg.c:3516 ... Call Trace: <TASK> crda_timeout_work+0x27/0x50 net/wireless/reg.c:542 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa65/0x1850 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f2/0x390 kernel/kthread.c:389 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Reported-by: syzbot+e10709ac3c44f3d4e800@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=e10709ac3c44f3d4e800 Fixes: 09d989d179d0 ("cfg80211: add regulatory hint disconnect support") Cc: stable@kernel.org Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Link: https://patch.msgid.link/20250228134659.1577656-1-n.zhandarovich@fintech.ru Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-02net/tls: use the new scatterwalk functionsEric Biggers
Replace calls to the deprecated function scatterwalk_copychunks() with memcpy_from_scatterwalk(), memcpy_to_scatterwalk(), or scatterwalk_skip() as appropriate. The new functions generally behave more as expected and eliminate the need to call scatterwalk_done() or scatterwalk_pagedone(). However, the new functions intentionally do not advance to the next sg entry right away, which would have broken chain_to_walk() which is accessing the fields of struct scatter_walk directly. To avoid this, replace chain_to_walk() with scatterwalk_get_sglist() which supports the needed functionality. Cc: Boris Pismenny <borisp@nvidia.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-02-28net: gso: fix ownership in __udp_gso_segmentAntoine Tenart
In __udp_gso_segment the skb destructor is removed before segmenting the skb but the socket reference is kept as-is. This is an issue if the original skb is later orphaned as we can hit the following bug: kernel BUG at ./include/linux/skbuff.h:3312! (skb_orphan) RIP: 0010:ip_rcv_core+0x8b2/0xca0 Call Trace: ip_rcv+0xab/0x6e0 __netif_receive_skb_one_core+0x168/0x1b0 process_backlog+0x384/0x1100 __napi_poll.constprop.0+0xa1/0x370 net_rx_action+0x925/0xe50 The above can happen following a sequence of events when using OpenVSwitch, when an OVS_ACTION_ATTR_USERSPACE action precedes an OVS_ACTION_ATTR_OUTPUT action: 1. OVS_ACTION_ATTR_USERSPACE is handled (in do_execute_actions): the skb goes through queue_gso_packets and then __udp_gso_segment, where its destructor is removed. 2. The segments' data are copied and sent to userspace. 3. OVS_ACTION_ATTR_OUTPUT is handled (in do_execute_actions) and the same original skb is sent to its path. 4. If it later hits skb_orphan, we hit the bug. Fix this by also removing the reference to the socket in __udp_gso_segment. Fixes: ad405857b174 ("udp: better wmem accounting on gso") Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250226171352.258045-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-28inet: ping: avoid skb_clone() dance in ping_rcv()Eric Dumazet
ping_rcv() callers currently call skb_free() or consume_skb(), forcing ping_rcv() to clone the skb. After this patch ping_rcv() is now 'consuming' the original skb, either moving to a socket receive queue, or dropping it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250226183437.1457318-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-28ipv4: icmp: do not process ICMP_EXT_ECHOREPLY for broadcast/multicast addressesEric Dumazet
There is no point processing ICMP_EXT_ECHOREPLY for routes which would drop ICMP_ECHOREPLY (RFC 1122 3.2.2.6, 3.2.2.8) This seems an oversight of the initial implementation. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250226183437.1457318-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-28wifi: mac80211: refactor populating mesh related fields in sinfoSarika Sharma
Introduce the sta_set_mesh_sinfo() to populate mesh related fields in sinfo structure for station statistics. This will allow for the simplified population of other fields in the sinfo structure for link level in a subsequent patch to add support for MLO station statistics. No functionality changes added. Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com> Link: https://patch.msgid.link/20250213171632.1646538-3-quic_sarishar@quicinc.com [reword since it's just an internal thing] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-28wifi: cfg80211: reorg sinfo structure elements for meshSarika Sharma
Currently, as multi-link operation(MLO) is not supported for mesh, reorganize the sinfo structure for mesh-specific fields and embed mesh related NL attributes together in organized view. This will allow for the simplified reorganization of sinfo structure for link level in a subsequent patch to add support for MLO station statistics. No functionality changes added. Pahole summary before the reorg of sinfo structure: - size: 256, cachelines: 4, members: 50 - sum members: 239, holes: 4, sum holes: 17 - paddings: 2, sum paddings: 2 - forced alignments: 1, forced holes: 1, sum forced holes: 1 Pahole summary after the reorg of sinfo structure: - size: 248, cachelines: 4, members: 50 - sum members: 239, holes: 4, sum holes: 9 - paddings: 2, sum paddings: 2 - forced alignments: 1, last cacheline: 56 bytes Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com> Link: https://patch.msgid.link/20250213171632.1646538-2-quic_sarishar@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-02-27net: sched: Remove newline at the end of a netlink error messageGal Pressman
Netlink error messages should not have a newline at the end of the string. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250226093904.6632-5-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27net-sysfs: remove unused initial ret valuesAntoine Tenart
In some net-sysfs functions the ret value is initialized but never used as it is always overridden. Remove those. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Link: https://patch.msgid.link/20250226174644.311136-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27Bluetooth: Add check for mgmt_alloc_skb() in mgmt_device_connected()Haoxiang Li
Add check for the return value of mgmt_alloc_skb() in mgmt_device_connected() to prevent null pointer dereference. Fixes: e96741437ef0 ("Bluetooth: mgmt: Make use of mgmt_send_event_skb in MGMT_EV_DEVICE_CONNECTED") Cc: stable@vger.kernel.org Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-02-27Bluetooth: Add check for mgmt_alloc_skb() in mgmt_remote_name()Haoxiang Li
Add check for the return value of mgmt_alloc_skb() in mgmt_remote_name() to prevent null pointer dereference. Fixes: ba17bb62ce41 ("Bluetooth: Fix skb allocation in mgmt_remote_name() & mgmt_device_connected()") Cc: stable@vger.kernel.org Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27Merge tag 'net-6.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth. We didn't get netfilter or wireless PRs this week, so next week's PR is probably going to be bigger. A healthy dose of fixes for bugs introduced in the current release nonetheless. Current release - regressions: - Bluetooth: always allow SCO packets for user channel - af_unix: fix memory leak in unix_dgram_sendmsg() - rxrpc: - remove redundant peer->mtu_lock causing lockdep splats - fix spinlock flavor issues with the peer record hash - eth: iavf: fix circular lock dependency with netdev_lock - net: use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net() RDMA driver register notifier after the device Current release - new code bugs: - ethtool: fix ioctl confusing drivers about desired HDS user config - eth: ixgbe: fix media cage present detection for E610 device Previous releases - regressions: - loopback: avoid sending IP packets without an Ethernet header - mptcp: reset connection when MPTCP opts are dropped after join Previous releases - always broken: - net: better track kernel sockets lifetime - ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels - phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT - eth: enetc: number of error handling fixes - dsa: rtl8366rb: reshuffle the code to fix config / build issue with LED support" * tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits) net: ti: icss-iep: Reject perout generation request idpf: fix checksums set in idpf_rx_rsc() selftests: drv-net: Check if combined-count exists net: ipv6: fix dst ref loop on input in rpl lwt net: ipv6: fix dst ref loop on input in seg6 lwt usbnet: gl620a: fix endpoint checking in genelink_bind() net/mlx5: IRQ, Fix null string in debug print net/mlx5: Restore missing trace event when enabling vport QoS net/mlx5: Fix vport QoS cleanup on error net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination. af_unix: Fix memory leak in unix_dgram_sendmsg() net: Handle napi_schedule() calls from non-interrupt net: Clear old fragment checksum value in napi_reuse_skb gve: unlink old napi when stopping a queue using queue API net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net(). tcp: Defer ts_recent changes until req is owned net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs() net: enetc: remove the mm_lock from the ENETC v4 driver net: enetc: add missing enetc4_link_deinit() net: enetc: update UDP checksum when updating originTimestamp field ...
2025-02-27net: ipv6: fix dst ref loop on input in rpl lwtJustin Iurman
Prevent a dst ref loop on input in rpl_iptunnel. Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel") Cc: Alexander Aring <alex.aring@gmail.com> Cc: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27net: ipv6: fix dst ref loop on input in seg6 lwtJustin Iurman
Prevent a dst ref loop on input in seg6_iptunnel. Fixes: af4a2209b134 ("ipv6: sr: use dst_cache in seg6_input") Cc: David Lebrun <dlebrun@google.com> Cc: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27xdp: remove xdp_alloc_skb_bulk()Alexander Lobakin
The only user was veth, which now uses napi_skb_cache_get_bulk(). It's now preferred over a direct allocation and is exported as well, so remove this one. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27net: skbuff: introduce napi_skb_cache_get_bulk()Alexander Lobakin
Add a function to get an array of skbs from the NAPI percpu cache. It's supposed to be a drop-in replacement for kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the requirement to call it only from the BH) is that it tries to use as many NAPI cache entries for skbs as possible, and allocate new ones only if needed. The logic is as follows: * there is enough skbs in the cache: decache them and return to the caller; * not enough: try refilling the cache first. If there is now enough skbs, return; * still not enough: try allocating skbs directly to the output array with %GFP_ZERO, maybe we'll be able to get some. If there's now enough, return; * still not enough: return as many as we were able to obtain. Most of times, if called from the NAPI polling loop, the first one will be true, sometimes (rarely) the second one. The third and the fourth -- only under heavy memory pressure. It can save significant amounts of CPU cycles if there are GRO cycles and/or Tx completion cycles (anything that descends to napi_skb_cache_put()) happening on this CPU. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27net: gro: expose GRO init/cleanup to use outside of NAPIAlexander Lobakin
Make GRO init and cleanup functions global to be able to use GRO without a NAPI instance. Taking into account already global gro_flush(), it's now fully usable standalone. New functions are not exported, since they're not supposed to be used outside of the kernel core code. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27net: gro: decouple GRO from the NAPI layerAlexander Lobakin
In fact, these two are not tied closely to each other. The only requirements to GRO are to use it in the BH context and have some sane limits on the packet batches, e.g. NAPI has a limit of its budget (64/8/etc.). Move purely GRO fields into a new structure, &gro_node. Embed it into &napi_struct and adjust all the references. gro_node::cached_napi_id is effectively the same as napi_struct::napi_id, but to be used on GRO hotpath to mark skbs. napi_struct::napi_id is now a fully control path field. Three Ethernet drivers use napi_gro_flush() not really meant to be exported, so move it to <net/gro.h> and add that include there. napi_gro_receive() is used in more than 100 drivers, keep it in <linux/netdevice.h>. This does not make GRO ready to use outside of the NAPI context yet. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27pktgen: avoid unused-const-variable warningArnd Bergmann
When extra warnings are enable, there are configurations that build pktgen without CONFIG_XFRM, which leaves a static const variable unused: net/core/pktgen.c:213:1: error: unused variable 'F_IPSEC' [-Werror,-Wunused-const-variable] 213 | PKT_FLAGS | ^~~~~~~~~ net/core/pktgen.c:197:2: note: expanded from macro 'PKT_FLAGS' 197 | pf(IPSEC) /* ipsec on for flows */ \ | ^~~~~~~~~ This could be marked as __maybe_unused, or by making the one use visible to the compiler by slightly rearranging the #ifdef blocks. The second variant looks slightly nicer here, so use that. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Peter Seiderer <ps.report@gmx.net> Link: https://patch.msgid.link/20250225085722.469868-1-arnd@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-26net: move aRFS rmap management and CPU affinity to coreAhmed Zaki
A common task for most drivers is to remember the user-set CPU affinity to its IRQs. On each netdev reset, the driver should re-assign the user's settings to the IRQs. Unify this task across all drivers by moving the CPU affinity to napi->config. However, to move the CPU affinity to core, we also need to move aRFS rmap management since aRFS uses its own IRQ notifiers. For the aRFS, add a new netdev flag "rx_cpu_rmap_auto". Drivers supporting aRFS should set the flag via netif_enable_cpu_rmap() and core will allocate and manage the aRFS rmaps. Freeing the rmap is also done by core when the netdev is freed. For better IRQ affinity management, move the IRQ rmap notifier inside the napi_struct and add new notify.notify and notify.release functions: netif_irq_cpu_rmap_notify() and netif_napi_affinity_release(). Now we have the aRFS rmap management in core, add CPU affinity mask to napi_config. To delegate the CPU affinity management to the core, drivers must: 1 - set the new netdev flag "irq_affinity_auto": netif_enable_irq_affinity(netdev) 2 - create the napi with persistent config: netif_napi_add_config() 3 - bind an IRQ to the napi instance: netif_napi_set_irq() the core will then make sure to use re-assign affinity to the napi's IRQ. The default IRQ mask is set to one cpu starting from the closest NUMA. Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Link: https://patch.msgid.link/20250224232228.990783-2-ahmed.zaki@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26af_unix: Fix memory leak in unix_dgram_sendmsg()Adrian Huang
After running the 'sendmsg02' program of Linux Test Project (LTP), kmemleak reports the following memory leak: # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888243866800 (size 2048): comm "sendmsg02", pid 67, jiffies 4294903166 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 5e 00 00 00 00 00 00 00 ........^....... 01 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............ backtrace (crc 7e96a3f2): kmemleak_alloc+0x56/0x90 kmem_cache_alloc_noprof+0x209/0x450 sk_prot_alloc.constprop.0+0x60/0x160 sk_alloc+0x32/0xc0 unix_create1+0x67/0x2b0 unix_create+0x47/0xa0 __sock_create+0x12e/0x200 __sys_socket+0x6d/0x100 __x64_sys_socket+0x1b/0x30 x64_sys_call+0x7e1/0x2140 do_syscall_64+0x54/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e Commit 689c398885cc ("af_unix: Defer sock_put() to clean up path in unix_dgram_sendmsg().") defers sock_put() in the error handling path. However, it fails to account for the condition 'msg->msg_namelen != 0', resulting in a memory leak when the code jumps to the 'lookup' label. Fix issue by calling sock_put() if 'msg->msg_namelen != 0' is met. Fixes: 689c398885cc ("af_unix: Defer sock_put() to clean up path in unix_dgram_sendmsg().") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Acked-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250225021457.1824-1-ahuang12@lenovo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26net: skb: free up one bit in tx_flagsWillem de Bruijn
The linked series wants to add skb tx completion timestamps. That needs a bit in skb_shared_info.tx_flags, but all are in use. A per-skb bit is only needed for features that are configured on a per packet basis. Per socket features can be read from sk->sk_tsflags. Per packet tsflags can be set in sendmsg using cmsg, but only those in SOF_TIMESTAMPING_TX_RECORD_MASK. Per packet tsflags can also be set without cmsg by sandwiching a send inbetween two setsockopts: val |= SOF_TIMESTAMPING_$FEATURE; setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); write(fd, buf, sz); val &= ~SOF_TIMESTAMPING_$FEATURE; setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); Changing a datapath test from skb_shinfo(skb)->tx_flags to skb->sk->sk_tsflags can change behavior in that case, as the tx_flags is written before the second setsockopt updates sk_tsflags. Therefore, only bits can be reclaimed that cannot be set by cmsg and are also highly unlikely to be used to target individual packets otherwise. Free up the bit currently used for SKBTX_HW_TSTAMP_USE_CYCLES. This selects between clock and free running counter source for HW TX timestamps. It is probable that all packets of the same socket will always use the same source. Link: https://lore.kernel.org/netdev/cover.1739988644.git.pav@iki.fi/ Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/20250225023416.2088705-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26net: Handle napi_schedule() calls from non-interruptFrederic Weisbecker
napi_schedule() is expected to be called either: * From an interrupt, where raised softirqs are handled on IRQ exit * From a softirq disabled section, where raised softirqs are handled on the next call to local_bh_enable(). * From a softirq handler, where raised softirqs are handled on the next round in do_softirq(), or further deferred to a dedicated kthread. Other bare tasks context may end up ignoring the raised NET_RX vector until the next random softirq handling opportunity, which may not happen before a while if the CPU goes idle afterwards with the tick stopped. Such "misuses" have been detected on several places thanks to messages of the kind: "NOHZ tick-stop error: local softirq work is pending, handler #08!!!" For example: __raise_softirq_irqoff __napi_schedule rtl8152_runtime_resume.isra.0 rtl8152_resume usb_resume_interface.isra.0 usb_resume_both __rpm_callback rpm_callback rpm_resume __pm_runtime_resume usb_autoresume_device usb_remote_wakeup hub_event process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm And also: * drivers/net/usb/r8152.c::rtl_work_func_t * drivers/net/netdevsim/netdev.c::nsim_start_xmit There is a long history of issues of this kind: 019edd01d174 ("ath10k: sdio: Add missing BH locking around napi_schdule()") 330068589389 ("idpf: disable local BH when scheduling napi for marker packets") e3d5d70cb483 ("net: lan78xx: fix "softirq work is pending" error") e55c27ed9ccf ("mt76: mt7615: add missing bh-disable around rx napi schedule") c0182aa98570 ("mt76: mt7915: add missing bh-disable around tx napi enable/schedule") 970be1dff26d ("mt76: disable BH around napi_schedule() calls") 019edd01d174 ("ath10k: sdio: Add missing BH locking around napi_schdule()") 30bfec4fec59 ("can: rx-offload: can_rx_offload_threaded_irq_finish(): add new function to be called from threaded interrupt") e63052a5dd3c ("mlx5e: add add missing BH locking around napi_schdule()") 83a0c6e58901 ("i40e: Invoke softirqs after napi_reschedule") bd4ce941c8d5 ("mlx4: Invoke softirqs after napi_reschedule") 8cf699ec849f ("mlx4: do not call napi_schedule() without care") ec13ee80145c ("virtio_net: invoke softirqs after __napi_schedule") This shows that relying on the caller to arrange a proper context for the softirqs to be handled while calling napi_schedule() is very fragile and error prone. Also fixing them can also prove challenging if the caller may be called from different kinds of contexts. Therefore fix this from napi_schedule() itself with waking up ksoftirqd when softirqs are raised from task contexts. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Reported-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Francois Romieu <romieu@fr.zoreil.com> Closes: https://lore.kernel.org/lkml/354a2690-9bbf-4ccb-8769-fa94707a9340@molgen.mpg.de/ Cc: Breno Leitao <leitao@debian.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250223221708.27130-1-frederic@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26net: Clear old fragment checksum value in napi_reuse_skbMohammad Heib
In certain cases, napi_get_frags() returns an skb that points to an old received fragment, This skb may have its skb->ip_summed, csum, and other fields set from previous fragment handling. Some network drivers set skb->ip_summed to either CHECKSUM_COMPLETE or CHECKSUM_UNNECESSARY when getting skb from napi_get_frags(), while others only set skb->ip_summed when RX checksum offload is enabled on the device, and do not set any value for skb->ip_summed when hardware checksum offload is disabled, assuming that the skb->ip_summed initiated to zero by napi_reuse_skb, ionic driver for example will ignore/unset any value for the ip_summed filed if HW checksum offload is disabled, and if we have a situation where the user disables the checksum offload during a traffic that could lead to the following errors shown in the kernel logs: <IRQ> dump_stack_lvl+0x34/0x48 __skb_gro_checksum_complete+0x7e/0x90 tcp6_gro_receive+0xc6/0x190 ipv6_gro_receive+0x1ec/0x430 dev_gro_receive+0x188/0x360 ? ionic_rx_clean+0x25a/0x460 [ionic] napi_gro_frags+0x13c/0x300 ? __pfx_ionic_rx_service+0x10/0x10 [ionic] ionic_rx_service+0x67/0x80 [ionic] ionic_cq_service+0x58/0x90 [ionic] ionic_txrx_napi+0x64/0x1b0 [ionic] __napi_poll+0x27/0x170 net_rx_action+0x29c/0x370 handle_softirqs+0xce/0x270 __irq_exit_rcu+0xa3/0xc0 common_interrupt+0x80/0xa0 </IRQ> This inconsistency sometimes leads to checksum validation issues in the upper layers of the network stack. To resolve this, this patch clears the skb->ip_summed value for each reused skb in by napi_reuse_skb(), ensuring that the caller is responsible for setting the correct checksum status. This eliminates potential checksum validation issues caused by improper handling of skb->ip_summed. Fixes: 76620aafd66f ("gro: New frags interface to avoid copying shinfo") Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250225112852.2507709-1-mheib@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26tcp: be less liberal in TSEcr received while in SYN_RECV stateEric Dumazet
Yong-Hao Zou mentioned that linux was not strict as other OS in 3WHS, for flows using TCP TS option (RFC 7323) As hinted by an old comment in tcp_check_req(), we can check the TSEcr value in the incoming packet corresponds to one of the SYNACK TSval values we have sent. In this patch, I record the oldest and most recent values that SYNACK packets have used. Send a challenge ACK if we receive a TSEcr outside of this range, and increase a new SNMP counter. nstat -az | grep TSEcrRejected TcpExtTSEcrRejected 0 0.0 Due to TCP fastopen implementation, do not apply yet these checks for fastopen flows. v2: No longer use req->num_timeout, but treq->snt_tsval_first to detect when first SYNACK is prepared. This means we make sure to not send an initial zero TSval. Make sure MPTCP and TCP selftests are passing. Change MIB name to TcpExtTSEcrRejected v1: https://lore.kernel.org/netdev/CADVnQykD8i4ArpSZaPKaoNxLJ2if2ts9m4As+=Jvdkrgx1qMHw@mail.gmail.com/T/ Reported-by: Yong-Hao Zou <yonghaoz1994@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250225171048.3105061-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net().Kuniyuki Iwashima
Breno Leitao reported the splat below. [0] Commit 65161fb544aa ("net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().") added the DEBUG_NET_WARN_ON_ONCE(), assuming that the netdev is not registered before register_netdevice_notifier_dev_net(). But the assumption was simply wrong. Let's use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net(). [0]: WARNING: CPU: 25 PID: 849 at net/core/dev.c:2150 register_netdevice_notifier_dev_net (net/core/dev.c:2150) <TASK> ? __warn (kernel/panic.c:242 kernel/panic.c:748) ? register_netdevice_notifier_dev_net (net/core/dev.c:2150) ? register_netdevice_notifier_dev_net (net/core/dev.c:2150) ? report_bug (lib/bug.c:? lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:285) ? exc_invalid_op (arch/x86/kernel/traps.c:309) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? register_netdevice_notifier_dev_net (net/core/dev.c:2150) ? register_netdevice_notifier_dev_net (./include/net/net_namespace.h:406 ./include/linux/netdevice.h:2663 net/core/dev.c:2144) mlx5e_mdev_notifier_event+0x9f/0xf0 mlx5_ib notifier_call_chain.llvm.12241336988804114627 (kernel/notifier.c:85) blocking_notifier_call_chain (kernel/notifier.c:380) mlx5_core_uplink_netdev_event_replay (drivers/net/ethernet/mellanox/mlx5/core/main.c:352) mlx5_ib_roce_init.llvm.12447516292400117075+0x1c6/0x550 mlx5_ib mlx5r_probe+0x375/0x6a0 mlx5_ib ? kernfs_put (./include/linux/instrumented.h:96 ./include/linux/atomic/atomic-arch-fallback.h:2278 ./include/linux/atomic/atomic-instrumented.h:1384 fs/kernfs/dir.c:557) ? auxiliary_match_id (drivers/base/auxiliary.c:174) ? mlx5r_mp_remove+0x160/0x160 mlx5_ib really_probe (drivers/base/dd.c:? drivers/base/dd.c:658) driver_probe_device (drivers/base/dd.c:830) __driver_attach (drivers/base/dd.c:1217) bus_for_each_dev (drivers/base/bus.c:369) ? driver_attach (drivers/base/dd.c:1157) bus_add_driver (drivers/base/bus.c:679) driver_register (drivers/base/driver.c:249) Fixes: 7fb1073300a2 ("net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().") Reported-by: Breno Leitao <leitao@debian.org> Closes: https://lore.kernel.org/netdev/20250224-noisy-cordial-roadrunner-fad40c@leitao/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250225211023.96448-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26Merge tag 'nfs-for-6.14-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Stable Fixes: - O_DIRECT writes should adjust file length Other Bugfixes: - Adjust delegated timestamps for O_DIRECT reads and writes - Prevent looping due to rpc_signal_task() races - Fix a deadlock when recovering state on a sillyrenamed file - Properly handle -ETIMEDOUT errors from tlshd - Suppress build warnings for unused procfs functions - Fix memory leak of lsm_contexts" * tag 'nfs-for-6.14-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: lsm,nfs: fix memory leak of lsm_context sunrpc: suppress warnings for unused procfs functions SUNRPC: Handle -ETIMEDOUT return from tlshd NFSv4: Fix a deadlock when recovering state on a sillyrenamed file SUNRPC: Prevent looping due to rpc_signal_task() races NFS: Adjust delegated timestamps for O_DIRECT reads and writes NFS: O_DIRECT writes must check and adjust the file length
2025-02-26bpf: add get_netns_cookie helper to cgroup_skb programsMahe Tardy
This is needed in the context of Cilium and Tetragon to retrieve netns cookie from hostns when traffic leaves Pod, so that we can correlate skb->sk's netns cookie. Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com> Link: https://lore.kernel.org/r/20250225125031.258740-1-mahe.tardy@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>