path: root/net/ipv4
Age  Commit message  Author
2025-03-18inet: frags: save a pair of atomic operations in reassemblyEric Dumazet
As mentioned in commit 648700f76b03 ("inet: frags: use rhashtables for reassembly units"): A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations. This patch implements this idea, seven years later. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250312082250.1803501-5-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18inet: frags: change inet_frag_kill() to defer refcount updatesEric Dumazet
In the following patch, we no longer assume inet_frag_kill() callers own a reference. Consuming two refcounts from inet_frag_kill() would lead to a UAF. Propagate the pointer to the refs that will be consumed later by the final inet_frag_putn() call. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250312082250.1803501-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18ipv4: frags: remove ipq_put()Eric Dumazet
Replace ipq_put() with inet_frag_putn() Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250312082250.1803501-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18inet: frags: add inet_frag_putn() helperEric Dumazet
inet_frag_putn() can release multiple references in one step. Use it in inet_frags_free_cb(). Replace inet_frag_put(X) with inet_frag_putn(X, 1) Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250312082250.1803501-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
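The idea of releasing several references in one step can be sketched in user space with C11 atomics. This is only an illustration: `struct frag_queue_sketch` and `frag_putn_sketch()` are hypothetical names, and the kernel helper operates on its own refcount type rather than on `atomic_int`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reassembly queue; not the kernel structure. */
struct frag_queue_sketch {
	atomic_int refcnt;
	bool destroyed;
};

/*
 * Release 'refs' references with one atomic subtraction.
 * Returns true when this call dropped the last reference.
 */
bool frag_putn_sketch(struct frag_queue_sketch *q, int refs)
{
	assert(refs >= 0);
	if (refs == 0)
		return false;	/* nothing to release, no atomic operation */

	/* atomic_fetch_sub() returns the value held before the subtraction. */
	if (atomic_fetch_sub(&q->refcnt, refs) == refs) {
		q->destroyed = true;	/* the kernel would free the queue here */
		return true;
	}
	return false;
}
```

A caller that has accumulated two references to drop would call `frag_putn_sketch(q, 2)` once instead of performing two separate decrements, which is where the saved atomic operations come from.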
2025-03-18udp_tunnel: use static call for GRO hooks when possiblePaolo Abeni
It's quite common to have a single UDP tunnel type active in the whole system. In such a case we can replace the indirect call for the UDP tunnel GRO callback with a static call. Add the related accounting in the control path and switch to static call when possible. To keep the code simple use a static array for the registered tunnel types, and size such array based on the kernel config. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/6fd1f9c7651151493ecab174e7b8386a1534170d.1741718157.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18udp_tunnel: create a fastpath GRO lookup.Paolo Abeni
Most UDP tunnels bind a socket to a local port, with ANY address, no peer and no interface index specified. Additionally it's quite common to have a single tunnel device per namespace. Track in each namespace the UDP tunnel socket respecting the above. When only a single one is present, store a reference in the netns. When such reference is not NULL, UDP tunnel GRO lookup just needs to match the incoming packet destination port vs the socket local port. The tunnel socket never sets the reuse[port] flag[s]. When bound to no address and interface, no other socket can exist in the same netns matching the specified local port. Matching packets with non-local destination addresses will be aggregated, and eventually segmented as needed - no behavior changes intended. Note that the UDP tunnel socket reference is stored into struct netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep all the fastpath-related netns fields in the same struct and allow cacheline-based optimization. Currently both the IPv4 and IPv6 socket pointers share the same cacheline as the `udp_table` field. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/4d5c319c4471161829f50cb8436841de81a5edae.1741718157.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
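The fastpath described above can be pictured with a small user-space sketch. All types and names here (`netns_sketch`, `tunnel_add_sketch()`, `gro_fast_lookup_sketch()`) are hypothetical, tunnel removal is left out, and a port mismatch is assumed to mean "no tunnel socket" for the purposes of the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; not the kernel structures. */
struct tunnel_sock_sketch {
	uint16_t local_port;	/* host byte order in this sketch */
};

struct netns_sketch {
	/* Non-NULL only while exactly one wildcard-bound tunnel socket exists. */
	struct tunnel_sock_sketch *single_tunnel;
	int tunnel_count;
};

enum lookup_result_sketch { LOOKUP_HIT, LOOKUP_MISS, LOOKUP_NEED_FULL };

/* Control path: account for a newly created tunnel socket. */
void tunnel_add_sketch(struct netns_sketch *ns, struct tunnel_sock_sketch *sk)
{
	assert(sk != NULL);
	ns->tunnel_count++;
	ns->single_tunnel = (ns->tunnel_count == 1) ? sk : NULL;
}

/* Data path: a single port comparison when the cached reference is set. */
enum lookup_result_sketch
gro_fast_lookup_sketch(const struct netns_sketch *ns, uint16_t dport,
		       struct tunnel_sock_sketch **out)
{
	*out = NULL;
	if (ns->single_tunnel == NULL)
		return LOOKUP_NEED_FULL;	/* fall back to the regular lookup */
	if (ns->single_tunnel->local_port != dport)
		return LOOKUP_MISS;
	*out = ns->single_tunnel;
	return LOOKUP_HIT;
}
```

The point of the design is that the per-packet cost collapses to one pointer check and one port comparison in the common single-tunnel case, while the bookkeeping lives entirely in the control path.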
2025-03-17tcp: Pass flags to __tcp_send_ackIlpo Järvinen
Accurate ECN needs to send custom flags to handle IP-ECN field reflection during handshake. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: add new TCP_TW_ACK_OOW state and allow ECN bits in TOSIlpo Järvinen
ECN bits in TOS are always cleared when sending ACKs in TW. Clearing them is problematic for TCP flows that used Accurate ECN because ECN bits decide which service queue the packet is placed into (L4S vs Classic). Effectively, TW ACKs are always downgraded from L4S to Classic queue which might impact, e.g., delay the ACK will experience on the path compared with the other packets of the flow. Change the TW ACK sending code to differentiate: - In tcp_v4_send_reset(), commit ba9e04a7ddf4f ("ip: fix tos reflection in ack and reset packets") cleans ECN bits for TW reset and this is not affected. - In tcp_v4_timewait_ack(), ECN bits for all TW ACKs are cleaned. But now only ECN bits of ACKs for oow data or paws_reject are cleaned, and ECN bits of other ACKs will not be cleaned. - In tcp_v4_reqsk_send_ack(), commit 66b13d99d96a1 ("ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT") did not clean ECN bits of ACKs for oow data or paws_reject. But now the ECN bits are cleaned for these ACKs. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
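The distinction drawn for TIME_WAIT ACKs can be shown with a tiny sketch. The enum, the mask name and `tw_ack_tos_sketch()` are hypothetical; the only fact relied on is that ECN occupies the two low bits of the IPv4 TOS byte.

```c
#include <assert.h>
#include <stdint.h>

#define ECN_MASK_SKETCH 0x03u	/* ECN occupies the two low bits of the TOS byte */

enum tw_ack_kind_sketch {
	TW_ACK_REGULAR,	/* ordinary ACK sent from TIME_WAIT */
	TW_ACK_OOW,	/* ACK for out-of-window data or a PAWS reject */
};

/* Keep the ECN bits for regular TW ACKs, clear them for OOW/PAWS ACKs. */
uint8_t tw_ack_tos_sketch(uint8_t tw_tos, enum tw_ack_kind_sketch kind)
{
	assert(kind == TW_ACK_REGULAR || kind == TW_ACK_OOW);
	if (kind == TW_ACK_OOW)
		return (uint8_t)(tw_tos & ~ECN_MASK_SKETCH);
	return tw_tos;
}
```

With a TOS byte of 0xb9 (ECN bits 01), a regular TW ACK keeps 0xb9 and therefore stays in the same queue as the rest of the flow, while an ACK for out-of-window data is sent with 0xb8.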
2025-03-17tcp: AccECN support to tcp_add_backlogIlpo Järvinen
AE flag needs to be preserved for AccECN. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17gro: prevent ACE field corruption & better AccECN handlingIlpo Järvinen
There are important differences in how the CWR field behaves in RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter and its changes are important so adjust the flags changed mask accordingly. Also, if CWR is there, set the Accurate ECN GSO flag to avoid corrupting CWR flag somewhere. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17gso: AccECN supportIlpo Järvinen
Handling the CWR flag differs between RFC 3168 ECN and AccECN. With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared starting from 2nd segment which is incompatible with how AccECN handles the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN. With AccECN, CWR flag (or more accurately, the ACE field that also includes ECE & AE flags) changes only when new packet(s) with CE mark arrives so the flag should not be changed within a super-skb. The new skb/feature flags are necessary to prevent such TSO engines corrupting AccECN ACE counters by clearing the CWR flag (if the CWR handling feature cannot be turned off). If NIC is completely unaware of RFC3168 ECN (doesn't support NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag despite supporting also NETIF_F_TSO_ECN, TSO could be safely used with AccECN on such NIC. This should be evaluated on a per-NIC basis (not done in this patch series for any NICs). For the cases where TSO cannot keep its hands off the CWR flag, a GSO fallback is provided by this patch. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
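The difference between the two CWR policies can be illustrated by computing the TCP flags of each segment cut from a super-segment. This is a simplified sketch: `segment_flags_sketch()` is a hypothetical helper, and it ignores the other per-segment adjustments (such as FIN and PSH) that real segmentation performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TCPHDR_CWR_SKETCH 0x80u	/* CWR bit in the classic TCP flags byte */

/*
 * Fill seg_flags[0..n-1] with the flags each segment would carry.
 * accecn == false: RFC 3168 style, CWR survives on the first segment only.
 * accecn == true:  CWR is part of the ACE field and is copied unchanged.
 */
void segment_flags_sketch(uint16_t flags, bool accecn,
			  uint16_t *seg_flags, size_t n)
{
	assert(seg_flags != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		seg_flags[i] = flags;
		if (!accecn && i > 0)
			seg_flags[i] &= (uint16_t)~TCPHDR_CWR_SKETCH;
	}
}
```

For a super-segment carrying ACK|CWR (0x90) split three ways, the RFC 3168 policy yields 0x90, 0x10, 0x10, whereas the AccECN policy yields 0x90 on every segment. A TSO engine applying the first policy to an AccECN flow would therefore alter the ACE value seen by the peer.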
2025-03-17tcp: helpers for ECN mode handlingIlpo Järvinen
Create helpers for TCP ECN modes. No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()Ilpo Järvinen
Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is called only for data segments, not for ACKs (with AccECN, also ACKs may get ECN bits). The extra "layer" in tcp_ecn_check_ce() function just checks for ECN being enabled, that can be moved into tcp_ecn_field_check rather than having the __ variant. No functional changes. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: extend TCP flags to allow AE bit/ACE fieldIlpo Järvinen
With AccECN, there's one additional TCP flag to be used (AE) and ACE field that overloads the definition of AE, CWR, and ECE flags. As tcp_flags was previously only 1 byte, the byte-order stuff needs to be added to its handling. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
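Why byte order starts to matter can be seen from where the AE bit sits: it is the lowest bit of header byte 12, so the flags no longer fit in byte 13 alone. The sketch below reads both bytes in network order and extracts the 3-bit ACE field (AE, CWR, ECE, with AE as the most significant bit). The macro and function names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCP_FLAG_AE_SKETCH	0x100u
#define TCP_FLAG_CWR_SKETCH	0x080u
#define TCP_FLAG_ECE_SKETCH	0x040u
#define TCP_FLAGS9_MASK_SKETCH	0x1ffu	/* AE plus the eight classic flags */

/* hdr points at a TCP header; bytes 12-13 are read as a big-endian word. */
uint16_t tcp_flags9_sketch(const uint8_t *hdr)
{
	assert(hdr != NULL);
	uint16_t word = (uint16_t)((hdr[12] << 8) | hdr[13]);
	return word & TCP_FLAGS9_MASK_SKETCH;
}

/* ACE counter value: AE, CWR and ECE taken together as a 3-bit number. */
unsigned int tcp_ace_sketch(uint16_t flags)
{
	return (flags >> 6) & 0x7u;
}
```

Assembling the word from individual bytes keeps the sketch independent of host endianness; code that loads the 16-bit word directly needs an explicit network-to-host conversion instead.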
2025-03-17tcp: create FLAG_TS_PROGRESSIlpo Järvinen
Whenever timestamp advances, it declares progress which can be used by the other parts of the stack to decide that the ACK is the most recent one seen so far. AccECN will use this flag when deciding whether to use the ACK to update AccECN state or not. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-17tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()Ilpo Järvinen
- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce() out of it
- Move tcp_in_ack_event() later
- While at it, remove the inline from tcp_in_ack_event() and let the compiler decide
Accurate ECN's heuristics do not know if there is going to be an ACE field based CE counter increase or not until after the rtx queue has been processed. Only then is the number of ACKed bytes/pkts available. As CE or not affects presence of FLAG_ECE, that information for tcp_in_ack_event() is not yet available in the old location of the call to tcp_in_ack_event(). Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-08net: move misc netdev_lock flavors to a separate headerJakub Kicinski
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-08udp: expand SKB_DROP_REASON_UDP_CSUM useEric Dumazet
SKB_DROP_REASON_UDP_CSUM can be used in four locations when dropping a packet because of a wrong UDP checksum. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250307102002.2095238-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07tcp: ulp: diag: more info without CAP_NET_ADMINMatthieu Baerts (NGI0)
When introduced in commit 61723b393292 ("tcp: ulp: add functions to dump ulp-specific information"), the whole ULP diag info has been exported only if the requester had CAP_NET_ADMIN. It looks like not everything is sensitive, and some info can be exported to all users in order to ease the debugging from the userspace side without requiring additional capabilities. Each layer should then decide what can be exposed to everybody. The 'net_admin' boolean is then passed to the different layers. On kTLS side, it looks like there is nothing sensitive there: version, cipher type, tx/rx user config type, plus some flags. So, only some metadata about the configuration, no cryptographic info like keys, etc. Then, everything can be exported to all users. On MPTCP side, that's different. The MPTCP-related sequence numbers per subflow should certainly not be exposed to everybody. For example, the DSS mapping and ssn_offset would give all users on the system access to narrow ranges of values for the subflow TCP sequence numbers and MPTCP-level DSNs, and then ease packet injection. The TCP diag interface doesn't expose the TCP sequence numbers for TCP sockets, so best to do the same here. The rest -- token, IDs, flags -- can be exported to everybody. Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-2-06afdd860fc9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07tcp: ulp: diag: always print the name if anyMatthieu Baerts (NGI0)
Since its introduction in commit 61723b393292 ("tcp: ulp: add functions to dump ulp-specific information"), the ULP diag info has been exported only if the requester had CAP_NET_ADMIN. At least the ULP name can be exported without CAP_NET_ADMIN. This will already help identify which layer is being used, e.g. which TCP connections are in fact MPTCP subflows. Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-1-06afdd860fc9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06tcp: clamp window like before the cleanupMatthieu Baerts (NGI0)
A recent cleanup changed the behaviour of tcp_set_window_clamp(). This looks unintentional, and affects MPTCP selftests, e.g. some tests re-establishing a connection after a disconnect are now unstable. Before the cleanup, this operation was done:
    new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
    tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
The cleanup used the 'clamp' macro which takes 3 arguments -- value, lowest, and highest -- and returns a value between the lowest and the highest allowable values. This then assumes ...
    lowest (rcv_ssthresh) <= highest (rcv_wnd)
... which doesn't seem to be always the case here according to the MPTCP selftests, even when running them without MPTCP, but only TCP. For example, when we have ...
    rcv_wnd < rcv_ssthresh < new_rcv_ssthresh
... before the cleanup, the rcv_ssthresh was not changed, while after the cleanup, it is lowered down to rcv_wnd (highest). During a simple test with TCP, here are the values I observed:
    new_window_clamp (val)   rcv_ssthresh (lo)     rcv_wnd (hi)
    117760   (out)           65495              <  65536
    128512   (out)           109595             >  80256    => lo > hi
    1184975  (out)           328987             <  329088
    113664   (out)           65483              <  65536
    117760   (out)           110968             <  110976
    129024   (out)           116527             >  109696   => lo > hi
Here, we can see that it is not that rare to have rcv_ssthresh (lo) higher than rcv_wnd (hi), so the behaviour differs when the clamp() macro is used, even without MPTCP. Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd) here, which seems to be generally the case in my tests with small connections. I then suggest reverting this part, not to change the behaviour.
Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250305-net-next-fix-tcp-win-clamp-v1-1-12afb705d34e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
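The difference between the two forms can be reproduced in user space with the values quoted in the commit message. In this sketch the function names are hypothetical, and `clamp(val, lo, hi)` is assumed to behave like `min(max(val, lo), hi)`.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Behaviour before the cleanup, as quoted in the commit message. */
uint32_t ssthresh_minmax_sketch(uint32_t rcv_ssthresh, uint32_t rcv_wnd,
				uint32_t new_window_clamp)
{
	uint32_t new_rcv_ssthresh = min_u32(rcv_wnd, new_window_clamp);

	return max_u32(new_rcv_ssthresh, rcv_ssthresh);
}

/* Behaviour with clamp(val, lo, hi), assumed to be min(max(val, lo), hi). */
uint32_t ssthresh_clamp_sketch(uint32_t rcv_ssthresh, uint32_t rcv_wnd,
			       uint32_t new_window_clamp)
{
	return min_u32(max_u32(new_window_clamp, rcv_ssthresh), rcv_wnd);
}

/* Two rows from the table: one with lo <= hi, one with lo > hi. */
void check_commit_examples(void)
{
	/* lo <= hi: both forms agree. */
	assert(ssthresh_minmax_sketch(65495, 65536, 117760) == 65536);
	assert(ssthresh_clamp_sketch(65495, 65536, 117760) == 65536);

	/* lo > hi: the old form keeps rcv_ssthresh, clamp lowers it to rcv_wnd. */
	assert(ssthresh_minmax_sketch(109595, 80256, 128512) == 109595);
	assert(ssthresh_clamp_sketch(109595, 80256, 128512) == 80256);
}
```

The two forms are equivalent only while lo <= hi holds. Once rcv_ssthresh exceeds rcv_wnd, the outer operation decides the result: max() in the old code, min() in the clamp form.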
2025-03-06inet: call inet6_ehashfn() once from inet6_hash_connect()Eric Dumazet
inet6_ehashfn() being called from __inet6_check_established() has a big impact on performance, as shown in the Tested section. After prior patch, we can compute the hash for port 0 from inet6_hash_connect(), and derive each hash in __inet_hash_connect() from this initial hash: hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport Apply the same principle for __inet_check_established(), although inet_ehashfn() has a smaller cost. Tested: Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server Before this patch: utime_start=0.286131 utime_end=4.378886 stime_start=11.952556 stime_end=1991.655533 num_transactions=1446830 latency_min=0.001061085 latency_max=12.075275028 latency_mean=0.376375302 latency_stddev=1.361969596 num_samples=306383 throughput=151866.56 perf top: 50.01% [kernel] [k] __inet6_check_established 20.65% [kernel] [k] __inet_hash_connect 15.81% [kernel] [k] inet6_ehashfn 2.92% [kernel] [k] rcu_all_qs 2.34% [kernel] [k] __cond_resched 0.50% [kernel] [k] _raw_spin_lock 0.34% [kernel] [k] sched_balance_trigger 0.24% [kernel] [k] queued_spin_lock_slowpath After this patch: utime_start=0.315047 utime_end=9.257617 stime_start=7.041489 stime_end=1923.688387 num_transactions=3057968 latency_min=0.003041375 latency_max=7.056589232 latency_mean=0.141075048 # Better latency metrics latency_stddev=0.526900516 num_samples=312996 throughput=320677.21 # 111 % increase, and 229 % for the series perf top: inet6_ehashfn is no longer seen. 
39.67% [kernel] [k] __inet_hash_connect 37.06% [kernel] [k] __inet6_check_established 4.79% [kernel] [k] rcu_all_qs 3.82% [kernel] [k] __cond_resched 1.76% [kernel] [k] sched_balance_domains 0.82% [kernel] [k] _raw_spin_lock 0.81% [kernel] [k] sched_balance_rq 0.81% [kernel] [k] sched_balance_trigger 0.76% [kernel] [k] queued_spin_lock_slowpath Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250305034550.879255-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()Eric Dumazet
In order to speedup __inet_hash_connect(), we want to ensure hash values for <source address, port X, destination address, destination port> are not randomly spread, but monotonically increasing. Goal is to allow __inet_hash_connect() to derive the hash value of a candidate 4-tuple with a single addition in the following patch in the series. Given : hash_0 = inet_ehashfn(saddr, 0, daddr, dport) hash_sport = inet_ehashfn(saddr, sport, daddr, dport) Then (hash_sport == hash_0 + sport) for all sport values. As far as I know, there is no security implication with this change. After this patch, when __inet_hash_connect() has to try XXXX candidates, the hash table buckets are contiguous and packed, allowing a better use of cpu caches and hardware prefetchers. Tested: Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server Before this patch: utime_start=0.271607 utime_end=3.847111 stime_start=18.407684 stime_end=1997.485557 num_transactions=1350742 latency_min=0.014131929 latency_max=17.895073144 latency_mean=0.505675853 latency_stddev=2.125164772 num_samples=307884 throughput=139866.80 perf top on client: 56.86% [kernel] [k] __inet6_check_established 17.96% [kernel] [k] __inet_hash_connect 13.88% [kernel] [k] inet6_ehashfn 2.52% [kernel] [k] rcu_all_qs 2.01% [kernel] [k] __cond_resched 0.41% [kernel] [k] _raw_spin_lock After this patch: utime_start=0.286131 utime_end=4.378886 stime_start=11.952556 stime_end=1991.655533 num_transactions=1446830 latency_min=0.001061085 latency_max=12.075275028 latency_mean=0.376375302 latency_stddev=1.361969596 num_samples=306383 throughput=151866.56 perf top: 50.01% [kernel] [k] __inet6_check_established 20.65% [kernel] [k] __inet_hash_connect 15.81% [kernel] [k] inet6_ehashfn 2.92% [kernel] [k] rcu_all_qs 2.34% [kernel] [k] __cond_resched 0.50% [kernel] [k] _raw_spin_lock 0.34% [kernel] [k] sched_balance_trigger 0.24% [kernel] [k] 
queued_spin_lock_slowpath There is indeed an increase of throughput and reduction of latency. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250305034550.879255-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
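The property `hash_sport == hash_0 + sport` follows whenever the local port is kept out of the mixing step and added afterwards. The sketch below uses a toy mixer to show this. `mix32_sketch()` and `ehash_sketch()` are hypothetical stand-ins; the kernel uses its own keyed hash, which is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit mixer, for illustration only. */
static uint32_t mix32_sketch(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x7feb352dU;
	x ^= x >> 15;
	x *= 0x846ca68bU;
	x ^= x >> 16;
	return x;
}

/* Hash of a 4-tuple in which lport contributes by plain addition. */
uint32_t ehash_sketch(uint32_t saddr, uint16_t lport,
		      uint32_t daddr, uint16_t dport)
{
	uint32_t h = mix32_sketch(saddr ^ mix32_sketch(daddr ^ mix32_sketch(dport)));

	return h + lport;	/* unsigned arithmetic, wraps modulo 2^32 */
}

/* Verify hash(saddr, p, daddr, dport) == hash(saddr, 0, daddr, dport) + p. */
void check_additive_property(uint32_t saddr, uint32_t daddr, uint16_t dport)
{
	uint32_t hash0 = ehash_sketch(saddr, 0, daddr, dport);

	for (uint32_t lport = 0; lport <= 65535; lport++)
		assert(ehash_sketch(saddr, (uint16_t)lport, daddr, dport) ==
		       hash0 + lport);
}
```

A port search loop can then compute `hash0` once and obtain each candidate's hash with a single addition. Consecutive ports also map to consecutive hash values, which is what makes the visited buckets contiguous.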
2025-03-06tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()Eric Dumazet
We have platforms with 6 NUMA nodes and 480 cpus. inet_ehash_locks_alloc() currently allocates a single 64KB page to hold all ehash spinlocks. This adds more pressure on a single node. Change inet_ehash_locks_alloc() to use vmalloc() to spread the spinlocks on all online nodes, driven by NUMA policies. At boot time, NUMA policy is interleave=all, meaning that tcp_hashinfo.ehash_locks gets hash dispersion on all nodes. Tested: lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo" 0x000000004e99d30c-0x00000000763f3279 36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo" 0x00000000fd73a33e-0x0000000004b9a177 36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo" 0x00000000db07d7a2-0x00000000ad697d29 8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250305130550.1865988-1-edumazet@google.com Signed-off-by: Jakub Kicinski 
<kuba@kernel.org>
2025-03-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: net/ethtool/cabletest.c 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock") 637399bf7e77 ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device") No Adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-05inet: fix lwtunnel_valid_encap_type() lock imbalanceEric Dumazet
After blamed commit rtm_to_fib_config() now calls lwtunnel_valid_encap_type{_attr}() without RTNL held, triggering an unlock balance in __rtnl_unlock, as reported by syzbot [1] IPv6 and rtm_to_nh_config() are not yet converted. Add a temporary @rtnl_is_held parameter to lwtunnel_valid_encap_type() and lwtunnel_valid_encap_type_attr(). While we are at it replace the two rcu_dereference() in lwtunnel_valid_encap_type() with more appropriate rcu_access_pointer(). [1] syz-executor245/5836 is trying to release lock (rtnl_mutex) at: [<ffffffff89d0e38c>] __rtnl_unlock+0x6c/0xf0 net/core/rtnetlink.c:142 but there are no more locks to release! other info that might help us debug this: no locks held by syz-executor245/5836. stack backtrace: CPU: 0 UID: 0 PID: 5836 Comm: syz-executor245 Not tainted 6.14.0-rc4-syzkaller-00873-g3424291dd242 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_unlock_imbalance_bug+0x25b/0x2d0 kernel/locking/lockdep.c:5289 __lock_release kernel/locking/lockdep.c:5518 [inline] lock_release+0x47e/0xa30 kernel/locking/lockdep.c:5872 __mutex_unlock_slowpath+0xec/0x800 kernel/locking/mutex.c:891 __rtnl_unlock+0x6c/0xf0 net/core/rtnetlink.c:142 lwtunnel_valid_encap_type+0x38a/0x5f0 net/core/lwtunnel.c:169 lwtunnel_valid_encap_type_attr+0x113/0x270 net/core/lwtunnel.c:209 rtm_to_fib_config+0x949/0x14e0 net/ipv4/fib_frontend.c:808 inet_rtm_newroute+0xf6/0x2a0 net/ipv4/fib_frontend.c:917 rtnetlink_rcv_msg+0x791/0xcf0 net/core/rtnetlink.c:6919 netlink_rcv_skb+0x206/0x480 net/netlink/af_netlink.c:2534 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:709 [inline] Fixes: 1dd2af7963e9 ("ipv4: fib: Convert RTM_NEWROUTE and RTM_DELROUTE to per-netns 
RTNL.") Reported-by: syzbot+3f18ef0f7df107a3f6a0@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/67c6f87a.050a0220.38b91b.0147.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250304125918.2763514-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-05net-timestamp: support TCP GSO case for a few missing flagsJason Xing
When I read through the TSO codes, I found out that we probably miss initializing the tx_flags of last seg when TSO is turned off, which means at the following points no more timestamp (for this last one) will be generated. There are three flags to be handled in this patch: 1. SKBTX_HW_TSTAMP 2. SKBTX_BPF 3. SKBTX_SCHED_TSTAMP Note that SKBTX_BPF[1] was added in 6.14.0-rc2 by commit 6b98ec7e882af ("bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback") and only belongs to net-next branch material for now. The common issue of the above three flags can be fixed by this single patch. This patch initializes the tx_flags to SKBTX_ANY_TSTAMP like what the UDP GSO does to make the newly segmented last skb inherit the tx_flags so that requested timestamp will be generated in each certain layer, or else that last one has zero value of tx_flags which leads to no timestamp at all. Fixes: 4ed2d765dfacc ("net-timestamp: TCP timestamping") Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-03-04tcp: use RCU lookup in __inet_hash_connect()Eric Dumazet
When __inet_hash_connect() has to try many 4-tuples before finding an available one, we see a high spinlock cost from the many spin_lock_bh(&head->lock) performed in its loop. This patch adds an RCU lookup to avoid the spinlock cost. check_established() gets a new @rcu_lookup argument. First reason is to not make any changes while head->lock is not held. Second reason is to not make this RCU lookup a second time after the spinlock has been acquired. Tested: Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server Before series: utime_start=0.288582 utime_end=1.548707 stime_start=20.637138 stime_end=2002.489845 num_transactions=484453 latency_min=0.156279245 latency_max=20.922042756 latency_mean=1.546521274 latency_stddev=3.936005194 num_samples=312537 throughput=47426.00 perf top on the client: 49.54% [kernel] [k] _raw_spin_lock 25.87% [kernel] [k] _raw_spin_lock_bh 5.97% [kernel] [k] queued_spin_lock_slowpath 5.67% [kernel] [k] __inet_hash_connect 3.53% [kernel] [k] __inet6_check_established 3.48% [kernel] [k] inet6_ehashfn 0.64% [kernel] [k] rcu_all_qs After this series: utime_start=0.271607 utime_end=3.847111 stime_start=18.407684 stime_end=1997.485557 num_transactions=1350742 latency_min=0.014131929 latency_max=17.895073144 latency_mean=0.505675853 # Nice reduction of latency metrics latency_stddev=2.125164772 num_samples=307884 throughput=139866.80 # 190 % increase perf top on client: 56.86% [kernel] [k] __inet6_check_established 17.96% [kernel] [k] __inet_hash_connect 13.88% [kernel] [k] inet6_ehashfn 2.52% [kernel] [k] rcu_all_qs 2.01% [kernel] [k] __cond_resched 0.41% [kernel] [k] _raw_spin_lock Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250302124237.3913746-5-edumazet@google.com 
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04tcp: add RCU management to inet_bind_bucketEric Dumazet
Add RCU protection to inet_bind_bucket structure. - Add rcu_head field to the structure definition. - Use kfree_rcu() at destroy time, and remove inet_bind_bucket_destroy() first argument. - Use hlist_del_rcu() and hlist_add_head_rcu() methods. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250302124237.3913746-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04tcp: optimize inet_use_bhash2_on_bind()Eric Dumazet
There is no reason to call ipv6_addr_type(). Instead, use highly optimized ipv6_addr_any() and ipv6_addr_v4mapped(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250302124237.3913746-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04tcp: use RCU in __inet{6}_check_established()Eric Dumazet
When __inet_hash_connect() has to try many 4-tuples before finding an available one, we see a high spinlock cost from __inet_check_established() and/or __inet6_check_established(). This patch adds an RCU lookup to avoid the spinlock acquisition when the 4-tuple is found in the hash table. Note that there are still spin_lock_bh() calls in __inet_hash_connect() to protect inet_bind_hashbucket, this will be fixed later in this series. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250302124237.3913746-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04net: rename netns_local to netns_immutableNicolas Dichtel
The name 'netns_local' is confusing. A following commit will export it via netlink, so let's use a more explicit name. Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-03tcp: tcp_set_window_clamp() cleanupEric Dumazet
Remove one indentation level. Use max_t() and clamp() macros. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: remove READ_ONCE(req->ts_recent)Eric Dumazet
After commit 8d52da23b6c6 ("tcp: Defer ts_recent changes until req is owned"), req->ts_recent is not changed anymore. It is set once in tcp_openreq_init(), bpf_sk_assign_tcp_reqsk() or cookie_tcp_reqsk_alloc() before the req can be seen by other cpus/threads. This completes the revert of eba20811f326 ("tcp: annotate data-races around tcp_rsk(req)->ts_recent"). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wang Hai <wanghai38@huawei.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03net: gro: convert four dev_net() callsEric Dumazet
tcp4_check_fraglist_gro(), tcp6_check_fraglist_gro(), udp4_gro_lookup_skb() and udp6_gro_lookup_skb() assume RCU is held so that the net structure does not disappear. Use dev_net_rcu() instead of dev_net() to get LOCKDEP support. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: convert to dev_net_rcu()Eric Dumazet
TCP uses of dev_net() are under RCU protection, change them to dev_net_rcu() to get LOCKDEP support. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: add four drop reasons to tcp_check_req()Eric Dumazet
Use two existing drop reasons in tcp_check_req(): TCP_RFC7323_PAWS and TCP_OVERWINDOW. Add two new ones: TCP_RFC7323_TSECR (corresponds to LINUX_MIB_TSECRREJECTED) and TCP_LISTEN_OVERFLOW (when a listener accept queue is full). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03tcp: add a drop_reason pointer to tcp_check_req()Eric Dumazet
We want to add new drop reasons for packets dropped during the 3WHS (three-way handshake) in the following patches. tcp_rcv_state_process() has to set reason to TCP_FASTOPEN, because tcp_check_req() will conditionally overwrite the drop_reason. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250301201424.2046477-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL.Kuniyuki Iwashima
We have converted the fib_info hash tables to per-netns ones and are now ready to convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL. Let's hold rtnl_net_lock() in inet_rtm_newroute() and inet_rtm_delroute(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-13-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Move fib_valid_key_len() to rtm_to_fib_config().Kuniyuki Iwashima
fib_valid_key_len() is called at the beginning of fib_table_insert() or fib_table_delete() to check if the prefix length is valid. fib_table_insert() and fib_table_delete() are called from 3 paths: ip_rt_ioctl(), inet_rtm_newroute() / inet_rtm_delroute(), and fib_magic(). In the first ioctl() path, rtentry_to_fib_config() checks the prefix length with bad_mask(). Also, fib_magic() always passes the correct prefix: 32 or ifa->ifa_prefixlen, which is already validated. Let's move fib_valid_key_len() to the rtnetlink path, rtm_to_fib_config(). While at it, 2 direct returns in rtm_to_fib_config() are changed to goto to match other places in the same function. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-12-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
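As a rough illustration of what a prefix-length check of this kind verifies, the userspace sketch below rejects a length above 32 and a key that has bits set beyond the prefix. prefix_is_valid() is a hypothetical name chosen for this example; it is an assumption about the shape of such a check, not the kernel's fib_valid_key_len() itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define KEYLENGTH 32 /* number of bits in an IPv4 address */

/* Hypothetical check: key is an IPv4 prefix in host byte order,
 * plen is the prefix length in bits. */
static bool prefix_is_valid(uint32_t key, unsigned int plen)
{
	if (plen > KEYLENGTH)
		return false;

	/* Shifting left by plen discards the network part, so any
	 * remaining bit is a host bit that should have been zero.
	 * The plen < KEYLENGTH guard avoids an undefined shift by 32. */
	if (plen < KEYLENGTH && (uint32_t)(key << plen) != 0)
		return false;

	return true;
}
```

With this sketch, 192.168.1.0/24 (0xC0A80100) passes, while 192.168.1.1/24 (0xC0A80101) is rejected because the low 8 bits are not zero.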
2025-03-03ipv4: fib: Hold rtnl_net_lock() in ip_rt_ioctl().Kuniyuki Iwashima
ioctl(SIOCADDRT/SIOCDELRT) calls ip_rt_ioctl() to add/remove a route in the netns of the specified socket. Let's hold rtnl_net_lock() there. Note that rtentry_to_fib_config() can be called without rtnl_net_lock() if we convert rtentry.dev handling to RCU later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-11-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Hold rtnl_net_lock() for ip_fib_net_exit().Kuniyuki Iwashima
ip_fib_net_exit() requires RTNL and is called from fib_net_init() and fib_net_exit_batch(). Let's hold rtnl_net_lock() before ip_fib_net_exit(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Namespacify fib_info hash tables.Kuniyuki Iwashima
We will convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL. Then, we need to have per-netns hash tables for struct fib_info. Let's allocate the hash tables per netns. fib_info_hash, fib_info_hash_bits, and fib_info_cnt are now moved to struct netns_ipv4 and accessed with net->ipv4.fib_XXX. Also, the netns checks are removed from fib_find_info_nh() and fib_find_info(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Add fib_info_hash_grow().Kuniyuki Iwashima
When the number of struct fib_info exceeds the hash table size in fib_create_info(), we try to allocate a new hash table with the doubled size. The allocation is done in fib_create_info(), and if successful, each struct fib_info is moved to the new hash table by fib_info_hash_move(). Let's integrate the allocation and fib_info_hash_move() as fib_info_hash_grow() to make the following change cleaner. While at it, fib_info_hash_grow() is placed near other hash-table-specific functions. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
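The grow-and-move pattern described here can be sketched in userspace C as follows. struct node, struct table and the table_*() functions are hypothetical names for this example. The masked key used as a hash and the growth threshold (entry count reaching the bucket count) are assumptions, not the kernel's choices.

```c
#include <stdint.h>
#include <stdlib.h>

struct node { uint32_t key; struct node *next; };

struct table {
	struct node **buckets; /* 1 << bits singly linked chains */
	unsigned int bits;
	size_t count;
};

static size_t table_index(uint32_t key, unsigned int bits)
{
	return key & (((size_t)1 << bits) - 1);
}

static int table_init(struct table *t, unsigned int bits)
{
	t->buckets = calloc((size_t)1 << bits, sizeof(*t->buckets));
	t->bits = bits;
	t->count = 0;
	return t->buckets ? 0 : -1;
}

/* Allocate a table of twice the size and move every entry into it.
 * On allocation failure the old table is left untouched. */
static int table_grow(struct table *t)
{
	unsigned int new_bits = t->bits + 1;
	size_t old_size = (size_t)1 << t->bits;
	struct node **nb = calloc((size_t)1 << new_bits, sizeof(*nb));

	if (!nb)
		return -1;

	for (size_t i = 0; i < old_size; i++) {
		struct node *n = t->buckets[i];

		while (n) {
			struct node *next = n->next;
			size_t idx = table_index(n->key, new_bits);

			n->next = nb[idx]; /* push onto the new chain */
			nb[idx] = n;
			n = next;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->bits = new_bits;
	return 0;
}

static void table_insert(struct table *t, struct node *n)
{
	size_t idx;

	/* A failed grow is not fatal: the chains just get longer. */
	if (t->count >= ((size_t)1 << t->bits))
		(void)table_grow(t);

	idx = table_index(n->key, t->bits);
	n->next = t->buckets[idx];
	t->buckets[idx] = n;
	t->count++;
}

static struct node *table_find(const struct table *t, uint32_t key)
{
	struct node *n = t->buckets[table_index(key, t->bits)];

	while (n && n->key != key)
		n = n->next;
	return n;
}
```

Keeping allocation and rehash in one function means the caller only has to decide when to grow, which is the separation the commit message argues for.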
2025-03-03ipv4: fib: Remove fib_info_hash_size.Kuniyuki Iwashima
We will allocate the fib_info hash tables per netns. There are 5 global variables for fib_info hash tables: fib_info_hash, fib_info_laddrhash, fib_info_hash_size, fib_info_hash_bits, fib_info_cnt. However, fib_info_laddrhash and fib_info_hash_size can be easily calculated from fib_info_hash and fib_info_hash_bits. Let's remove fib_info_hash_size and use (1 << fib_info_hash_bits) instead. Now we need not pass the new hash table size to fib_info_hash_move(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Remove fib_info_laddrhash pointer.Kuniyuki Iwashima
We will allocate the fib_info hash tables per netns. There are 5 global variables for fib_info hash tables: fib_info_hash, fib_info_laddrhash, fib_info_hash_size, fib_info_hash_bits, fib_info_cnt. However, fib_info_laddrhash and fib_info_hash_size can be easily calculated from fib_info_hash and fib_info_hash_bits. Let's remove the fib_info_laddrhash pointer and instead use fib_info_hash + (1 << fib_info_hash_bits). While at it, fib_info_laddrhash_bucket() is moved near other hash-table-specific functions. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
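The layout this part of the series converges on (one allocation holding both tables, with the second table and the size derived from the base pointer and the bit count) might look like the userspace sketch below. struct bucket, struct fi_tables and the fi_*() helpers are hypothetical names. calloc() stands in for kvcalloc(), and masking the hash to pick a bucket is an assumption.

```c
#include <stdint.h>
#include <stdlib.h>

struct bucket { void *first; }; /* stand-in for struct hlist_head */

struct fi_tables {
	struct bucket *hash;    /* base of the single allocation */
	unsigned int hash_bits; /* each table has 1 << hash_bits buckets */
};

/* One zeroed allocation for both tables: 2 * (1 << bits) buckets. */
static int fi_tables_alloc(struct fi_tables *t, unsigned int bits)
{
	t->hash = calloc((size_t)2 << bits, sizeof(*t->hash));
	if (!t->hash)
		return -1;
	t->hash_bits = bits;
	return 0;
}

/* The size is derived from the bit count, not stored separately. */
static size_t fi_table_size(const struct fi_tables *t)
{
	return (size_t)1 << t->hash_bits;
}

/* First table: buckets [0, size). */
static struct bucket *fi_hash_bucket(const struct fi_tables *t, uint32_t hash)
{
	return &t->hash[hash & (fi_table_size(t) - 1)];
}

/* Second table starts right after the first: hash + (1 << hash_bits). */
static struct bucket *fi_laddrhash_bucket(const struct fi_tables *t,
					  uint32_t hash)
{
	return &t->hash[fi_table_size(t) + (hash & (fi_table_size(t) - 1))];
}
```

With only a base pointer and a bit count left as state, moving the tables into a per-netns structure means moving two fields plus the entry counter, rather than five separate globals.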
2025-03-03ipv4: fib: Make fib_info_hashfn() return struct hlist_head.Kuniyuki Iwashima
Every time fib_info_hashfn() returns a hash value, we fetch &fib_info_hash[hash]. Let's return the hlist_head pointer from fib_info_hashfn() and rename it to fib_info_hash_bucket() to match a similar function, fib_info_laddrhash_bucket(). Note that we need to move the fib_info_hash assignment earlier in fib_info_hash_move() to use fib_info_hash_bucket() in the for loop. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Allocate fib_info_hash[] during netns initialisation.Kuniyuki Iwashima
We will allocate fib_info_hash[] and fib_info_laddrhash[] for each netns. Currently, fib_info_hash[] is allocated when the first route is added. Let's move the first allocation to a new __net_init function. Note that we must call fib4_semantics_exit() in fib_net_exit_batch() because ->exit() is called earlier than ->exit_batch(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Allocate fib_info_hash[] and fib_info_laddrhash[] by kvcalloc().Kuniyuki Iwashima
Both fib_info_hash[] and fib_info_laddrhash[] are hash tables for struct fib_info and are allocated by kvzalloc() separately. Let's replace the two kvzalloc() calls with kvcalloc() to remove the fib_info_laddrhash pointer later. Note that fib_info_hash_alloc() allocates a new hash table based on fib_info_hash_bits because we will remove fib_info_hash_size later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-03ipv4: fib: Use cached net in fib_inetaddr_event().Kuniyuki Iwashima
net is available in fib_inetaddr_event(), let's use it. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250228042328.96624-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>