summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2023-10-18ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddrEric Dumazet
syzbot reported a data-race while accessing nh->nh_saddr_genid [1] Add annotations, but leave the code lazy as intended. [1] BUG: KCSAN: data-race in fib_select_path / fib_select_path write to 0xffff8881387166f0 of 4 bytes by task 6778 on cpu 1: fib_info_update_nhc_saddr net/ipv4/fib_semantics.c:1334 [inline] fib_result_prefsrc net/ipv4/fib_semantics.c:1354 [inline] fib_select_path+0x292/0x330 net/ipv4/fib_semantics.c:2269 ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810 ip_route_output_key_hash net/ipv4/route.c:2644 [inline] __ip_route_output_key include/net/route.h:134 [inline] ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872 send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61 wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 read to 0xffff8881387166f0 of 4 bytes by task 6759 on cpu 0: fib_result_prefsrc net/ipv4/fib_semantics.c:1350 [inline] fib_select_path+0x1cb/0x330 net/ipv4/fib_semantics.c:2269 ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810 ip_route_output_key_hash net/ipv4/route.c:2644 [inline] __ip_route_output_key include/net/route.h:134 [inline] ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872 send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61 wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 value changed: 0x959d3217 -> 0x959d3218 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6759 Comm: kworker/u4:15 Not tainted 6.6.0-rc4-syzkaller-00029-gcbf3a2cb156a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023 Workqueue: wg-kex-wg1 wg_packet_handshake_send_worker Fixes: 436c3b66ec98 ("ipv4: Invalidate nexthop cache nh_saddr more correctly.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231017192304.82626-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18tcp_bpf: properly release resources on error pathsPaolo Abeni
In the blamed commit below, I completely forgot to release the acquired resources before erroring out in the TCP BPF code, as reported by Dan. Address the issues by replacing the bogus return with a jump to the relevant cleanup code. Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/8f99194c698bcef12666f0a9a999c58f8b1cb52c.1697557782.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skbEric Dumazet
In commit 75eefc6c59fd ("tcp: tsq: add a shortcut in tcp_small_queue_check()") we allowed to send an skb regardless of TSQ limits being hit if rtx queue was empty or had a single skb, in order to better fill the pipe when/if TX completions were slow. Then later, commit 75c119afe14f ("tcp: implement rb-tree based retransmit queue") accidentally removed the special case for one skb in rtx queue. Stefan Wahren reported a regression in single TCP flow throughput using a 100Mbit fec link, starting from commit 65466904b015 ("tcp: adjust TSO packet sizes based on min_rtt"). This last commit only made the regression more visible, because it locked the TCP flow on a particular behavior where TSQ prevented two skbs being pushed downstream, adding silences on the wire between each TSO packet. Many thanks to Stefan for his invaluable help ! Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Link: https://lore.kernel.org/netdev/7f31ddc8-9971-495e-a1f6-819df542e0af@gmx.net/ Reported-by: Stefan Wahren <wahrenst@gmx.net> Tested-by: Stefan Wahren <wahrenst@gmx.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20231017124526.4060202-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18netfilter: xt_mangle: only check verdict part of return valueFlorian Westphal
These checks assume that the caller only returns NF_DROP without any errno embedded in the upper bits. This is fine right now, but followup patches will start to propagate such errors to allow kfree_skb_drop_reason() in the called functions, those would then indicate 'errno << 8 | NF_STOLEN'. To not break things we have to mask those parts out. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-17Merge tag 'ipsec-2023-10-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2023-10-17 1) Fix a slab-use-after-free in xfrm_policy_inexact_list_reinsert. From Dong Chenchen. 2) Fix data-races in the xfrm interfaces dev->stats fields. From Eric Dumazet. 3) Fix a data-race in xfrm_gen_index. From Eric Dumazet. 4) Fix an inet6_dev refcount underflow. From Zhang Changzhong. 5) Check the return value of pskb_trim in esp_remove_trailer for esp4 and esp6. From Ma Ke. 6) Fix a data-race in xfrm_lookup_with_ifid. From Eric Dumazet. * tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: fix a data-race in xfrm_lookup_with_ifid() net: ipv4: fix return value check in esp_remove_trailer net: ipv6: fix return value check in esp_remove_trailer xfrm6: fix inet6_dev refcount underflow problem xfrm: fix a data-race in xfrm_gen_index() xfrm: interface: use DEV_STATS_INC() net: xfrm: skip policies marked as dead while reinserting policies ==================== Link: https://lore.kernel.org/r/20231017083723.1364940-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17tcp: fix excessive TLP and RACK timeouts from HZ roundingNeal Cardwell
We discovered from packet traces of slow loss recovery on kernels with the default HZ=250 setting (and min_rtt < 1ms) that after reordering, when receiving a SACKed sequence range, the RACK reordering timer was firing after about 16ms rather than the desired value of roughly min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the exact same issue. This commit fixes the TLP transmit timer and RACK reordering timer floor calculation to more closely match the intended 2ms floor even on kernels with HZ=250. It does this by adding in a new TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies, instead of the current approach of converting to jiffies and then adding th TCP_TIMEOUT_MIN value of 2 jiffies. Our testing has verified that on kernels with HZ=1000, as expected, this does not produce significant changes in behavior, but on kernels with the default HZ=250 the latency improvement can be large. For example, our tests show that for HZ=250 kernels at low RTTs this fix roughly halves the latency for the RACK reorder timer: instead of mostly firing at 16ms it mostly fires at 8ms. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: bb4d991a28cc ("tcp: adjust tail loss probe timeout") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16tcp: Set pingpong threshold via sysctlHaiyang Zhang
TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance. The pingpong threshold and related code were changed to 3 in the year 2019 in: commit 4a41f453bedf ("tcp: change pingpong threshold to 3") And reverted to 1 in the year 2022 in: commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"") There is no single value that fits all applications. Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for optimal performance based on the application needs. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16ipv4: use tunnel flow flags for tunnel route lookupsBeniamino Galvani
Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to ip_tunnel_key") added a new field to struct ip_tunnel_key to control route lookups. Currently the flag is used by vxlan and geneve tunnels; use it also in udp_tunnel_dst_lookup() so that it affects all tunnel types relying on this function. Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16ipv4: add new arguments to udp_tunnel_dst_lookup()Beniamino Galvani
We want to make the function more generic so that it can be used by other UDP tunnel implementations such as geneve and vxlan. To do that, add the following arguments: - source and destination UDP port; - ifindex of the output interface, needed by vxlan; - the tos, because in some cases it is not taken from struct ip_tunnel_info (for example, when it's inherited from the inner packet); - the dst cache, because not all tunnel types (e.g. vxlan) want to use the one from struct ip_tunnel_info. With these parameters, the function no longer needs the full struct ip_tunnel_info as argument and we can pass only the relevant part of it (struct ip_tunnel_key). Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16ipv4: remove "proto" argument from udp_tunnel_dst_lookup()Beniamino Galvani
The function is now UDP-specific, the protocol is always IPPROTO_UDP. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-16ipv4: rename and move ip_route_output_tunnel()Beniamino Galvani
At the moment ip_route_output_tunnel() is used only by bareudp. Ideally, other UDP tunnel implementations should use it, but to do so the function needs to accept new parameters that are specific for UDP tunnels, such as the ports. Prepare for these changes by renaming the function to udp_tunnel_dst_lookup() and move it to file net/ipv4/udp_tunnel_core.c. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-13tcp: allow again tcp_disconnect() when threads are waitingPaolo Abeni
As reported by Tom, .NET and applications build on top of it rely on connect(AF_UNSPEC) to async cancel pending I/O operations on TCP socket. The blamed commit below caused a regression, as such cancellation can now fail. As suggested by Eric, this change addresses the problem explicitly causing blocking I/O operation to terminate immediately (with an error) when a concurrent disconnect() is executed. Instead of tracking the number of threads blocked on a given socket, track the number of disconnect() issued on such socket. If such counter changes after a blocking operation releasing and re-acquiring the socket lock, error out the current operation. Fixes: 4faeee0cf8a5 ("tcp: deny tcp_disconnect() when threads are waiting") Reported-by: Tom Deseyn <tdeseyn@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not setMartin KaFai Lau
It was reported that there is a compiler warning on the unused variable "sin_addr_len" in af_inet.c when CONFIG_CGROUP_BPF is not set. This patch is to address it similar to the ipv6 counterpart in inet6_getname(). It is to "return sin_addr_len;" instead of "return sizeof(*sin);". Fixes: fefba7d1ae19 ("bpf: Propagate modified uaddrlen from cgroup sockaddr programs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/bpf/20231013185702.3993710-1-martin.lau@linux.dev Closes: https://lore.kernel.org/bpf/20231013114007.2fb09691@canb.auug.org.au/
2023-10-13tcp: Fix listen() warning with v4-mapped-v6 address.Kuniyuki Iwashima
syzbot reported a warning [0] introduced by commit c48ef9c4aed3 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address."). After the cited commit, a v4 socket's address matches the corresponding v4-mapped-v6 tb2 in inet_bind2_bucket_match_addr(), not vice versa. During X.X.X.X -> ::ffff:X.X.X.X order bind()s, the second bind() uses bhash and conflicts properly without checking bhash2 so that we need not check if a v4-mapped-v6 sk matches the corresponding v4 address tb2 in inet_bind2_bucket_match_addr(). However, the repro shows that we need to check that in a no-conflict case. The repro bind()s two sockets to the 2-tuples using SO_REUSEPORT and calls listen() for the first socket: from socket import * s1 = socket() s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1) s1.bind(('127.0.0.1', 0)) s2 = socket(AF_INET6) s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1) s2.bind(('::ffff:127.0.0.1', s1.getsockname()[1])) s1.listen() The second socket should belong to the first socket's tb2, but the second bind() creates another tb2 bucket because inet_bind2_bucket_find() returns NULL in inet_csk_get_port() as the v4-mapped-v6 sk does not match the corresponding v4 address tb2. bhash2[] -> tb2(::ffff:X.X.X.X) -> tb2(X.X.X.X) Then, listen() for the first socket calls inet_csk_get_port(), where the v4 address matches the v4-mapped-v6 tb2 and WARN_ON() is triggered. To avoid that, we need to check if v4-mapped-v6 sk address matches with the corresponding v4 address tb2 in inet_bind2_bucket_match(). The same checks are needed in inet_bind2_bucket_addr_match() too, so we can move all checks there and call it from inet_bind2_bucket_match(). Note that now tb->family is just an address family of tb->(v6_)?rcv_saddr and not of sockets in the bucket. This could be refactored later by defining tb->rcv_saddr as tb->v6_rcv_saddr.s6_addr32[3] and prepending ::ffff: when creating v4 tb2. [0]: WARNING: CPU: 0 PID: 5049 at net/ipv4/inet_connection_sock.c:587 inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587 Modules linked in: CPU: 0 PID: 5049 Comm: syz-executor288 Not tainted 6.6.0-rc2-syzkaller-00018-g2cf0f7156238 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023 RIP: 0010:inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587 Code: 7c 24 08 e8 4c b6 8a 01 31 d2 be 88 01 00 00 48 c7 c7 e0 94 ae 8b e8 59 2e a3 f8 2e 2e 2e 31 c0 e9 04 fe ff ff e8 ca 88 d0 f8 <0f> 0b e9 0f f9 ff ff e8 be 88 d0 f8 49 8d 7e 48 e8 65 ca 5a 00 31 RSP: 0018:ffffc90003abfbf0 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888026429100 RCX: 0000000000000000 RDX: ffff88807edcbb80 RSI: ffffffff88b73d66 RDI: ffff888026c49f38 RBP: ffff888026c49f30 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff9260f200 R13: ffff888026c49880 R14: 0000000000000000 R15: ffff888026429100 FS: 00005555557d5380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000045ad50 CR3: 0000000025754000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> inet_csk_listen_start+0x155/0x360 net/ipv4/inet_connection_sock.c:1256 __inet_listen_sk+0x1b8/0x5c0 net/ipv4/af_inet.c:217 inet_listen+0x93/0xd0 net/ipv4/af_inet.c:239 __sys_listen+0x194/0x270 net/socket.c:1866 __do_sys_listen net/socket.c:1875 [inline] __se_sys_listen net/socket.c:1873 [inline] __x64_sys_listen+0x53/0x80 net/socket.c:1873 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f3a5bce3af9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc1a1c79e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000032 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a5bce3af9 RDX: 00007f3a5bce3af9 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007f3a5bd565f0 R08: 0000000000000006 R09: 0000000000000006 R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000001 R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001 </TASK> Fixes: c48ef9c4aed3 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address.") Reported-by: syzbot+71e724675ba3958edb31@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=71e724675ba3958edb31 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231010013814.70571-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13net: fix IPSTATS_MIB_OUTFORWDATAGRAMS increment after fragment checkHeng Guo
Reproduce environment: network with 3 VM linuxs is connected as below: VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3 VM1: eth0 ip: 192.168.122.207 MTU 1800 VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500 VM3: eth0 ip: 192.168.123.240 MTU 1800 Reproduce: VM1 send 1600 bytes UDP data to VM3 using tools scapy with flags='DF'. scapy command: send(IP(dst="192.168.123.240",flags='DF')/UDP()/str('0'*1600),count=1, inter=1.000000) Result: Before IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0 ...... root@qemux86-64:~# ---------------------------------------------------------------------- After IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss Ip: 1 64 7 0 2 2 0 0 2 5 0 0 0 0 0 0 0 1 0 ...... root@qemux86-64:~# ---------------------------------------------------------------------- ForwDatagrams is always keeping 2 without increment. Issue description and patch: ip_exceeds_mtu() in ip_forward() drops this IP datagram because skb len (1600 sending by scapy) is over MTU(1500 in VM2) if "DF" is set. According to RFC 4293 "3.2.3. IP Statistics Tables", +-------+------>------+----->-----+----->-----+ | InForwDatagrams (6) | OutForwDatagrams (6) | | V +->-+ OutFragReqds | InNoRoutes | | (packets) / (local packet (3) | | | IF is that of the address | +--> OutFragFails | and may not be the receiving IF) | | (packets) the IPSTATS_MIB_OUTFORWDATAGRAMS should be counted before fragment check. The existing implementation, instead, would incease the counter after fragment check: ip_exceeds_mtu() in ipv4 and ip6_pkt_too_big() in ipv6. So do patch to move IPSTATS_MIB_OUTFORWDATAGRAMS counter to ip_forward() for ipv4 and ip6_forward() for ipv6. Test result with patch: Before IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss Ip: 1 64 6 0 2 2 0 0 2 4 0 0 0 0 0 0 0 0 0 ...... root@qemux86-64:~# ---------------------------------------------------------------------- After IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqdss Ip: 1 64 7 0 2 3 0 0 2 5 0 0 0 0 0 0 0 1 0 ...... root@qemux86-64:~# ---------------------------------------------------------------------- ForwDatagrams is updated from 2 to 3. Reviewed-by: Filip Pudak <filip.pudak@windriver.com> Signed-off-by: Heng Guo <heng.guo@windriver.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231011015137.27262-1-heng.guo@windriver.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: kernel/bpf/verifier.c 829955981c55 ("bpf: Fix verifier log for async callback return values") a923819fb2c5 ("bpf: Treat first argument as return value for bpf_throw") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11net: tcp: fix crashes trying to free half-baked MTU probesJakub Kicinski
tcp_stream_alloc_skb() initializes the skb to use tcp_tsorted_anchor which is a union with the destructor. We need to clean that TCP-iness up before freeing. Fixes: 736013292e3c ("tcp: let tcp_mtu_probe() build headless packets") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231010173651.3990234-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11bpf: Propagate modified uaddrlen from cgroup sockaddr programsDaan De Meyer
As prep for adding unix socket support to the cgroup sockaddr hooks, let's propagate the sockaddr length back to the caller after running a bpf cgroup sockaddr hook program. While not important for AF_INET or AF_INET6, the sockaddr length is important when working with AF_UNIX sockaddrs as the size of the sockaddr cannot be determined just from the address family or the sockaddr's contents. __cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as an input/output argument. After running the program, the modified sockaddr length is stored in the uaddrlen pointer. Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-10tcp: change data receiver flowlabel after one dupDavid Morley
This commit changes the data receiver repath behavior to occur after receiving a single duplicate. This can help recover ACK connectivity quicker if a TLP was sent along a nonworking path. For instance, consider the case where we have an initially nonworking forward path and reverse path and subsequently switch to only working forward paths. Before this patch we would have the following behavior. +---------+--------+--------+----------+----------+----------+ | Event | For FL | Rev FL | FP Works | RP Works | Data Del | +---------+--------+--------+----------+----------+----------+ | Initial | A | 1 | N | N | 0 | +---------+--------+--------+----------+----------+----------+ | TLP | A | 1 | N | N | 0 | +---------+--------+--------+----------+----------+----------+ | RTO 1 | B | 1 | Y | N | 1 | +---------+--------+--------+----------+----------+----------+ | RTO 2 | C | 1 | Y | N | 2 | +---------+--------+--------+----------+----------+----------+ | RTO 3 | D | 2 | Y | Y | 3 | +---------+--------+--------+----------+----------+----------+ This patch gets rid of at least RTO 3, avoiding additional unnecessary repaths of a working forward path to a (potentially) nonworking one. In addition, this commit changes the behavior to avoid repathing upon rx of duplicate data if the local endpoint is in CA_Loss (in which case the RTOs will already be changing the outgoing flowlabel). Signed-off-by: David Morley <morleyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Tested-by: David Morley <morleyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-10tcp: record last received ipv6 flowlabelDavid Morley
In order to better estimate whether a data packet has been retransmitted or is the result of a TLP, we save the last received ipv6 flowlabel. To make space for this field we resize the "ato" field in inet_connection_sock as the current value of TCP_DELACK_MAX can be fully contained in 8 bits and add a compile_time_assert ensuring this field is the required size. v2: addressed kernel bot feedback about dccp_delack_timer() v3: addressed build error introduced by commit bbf80d713fe7 ("tcp: derive delack_max from rto_min") Signed-off-by: David Morley <morleyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Tested-by: David Morley <morleyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-10net: ipv4: fix return value check in esp_remove_trailerMa Ke
In esp_remove_trailer(), to avoid an unexpected result returned by pskb_trim, we should check the return value of pskb_trim(). Signed-off-by: Ma Ke <make_ruc2021@163.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-06xfrm: pass struct net to xfrm_decode_session wrappersFlorian Westphal
Preparation patch, extra arg is not used. No functional changes intended. This is needed to replace the xfrm session decode functions with the flow dissector. skb_flow_dissect() cannot be used as-is, because it attempts to deduce the 'struct net' to use for bpf program fetch from skb->sk or skb->dev, but xfrm code path can see skbs that have neither sk or dev filled in. So either flow dissector needs to try harder, e.g. by also trying skb->dst->dev, or we have to pass the struct net explicitly. Passing the struct net doesn't look too bad to me, most places already have it available or can derive it from the output device. Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/netdev/202309271628.27fd2187-oliver.sang@intel.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-10-06xfrm: Support GRO for IPv6 ESP in UDP encapsulationSteffen Klassert
This patch enables the GRO codepath for IPv6 ESP in UDP encapsulated packets. Decapsulation happens at L2 and saves a full round through the stack for each packet. This is also needed to support HW offload for ESP in UDP encapsulation. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2023-10-06xfrm: Support GRO for IPv4 ESP in UDP encapsulationSteffen Klassert
This patch enables the GRO codepath for IPv4 ESP in UDP encapsulated packets. Decapsulation happens at L2 and saves a full round through the stack for each packet. This is also needed to support HW offload for ESP in UDP encapsulation. Enabling this would imporove performance for ESP in UDP datapath, i.e IPsec with NAT in between. By default GRP for ESP-in-UDP is disabled for UDP sockets. To enable this feature for an ESP socket, the following two options need to be set: 1. enable ESP-in-UDP: (this is already set by an IKE daemon). int type = UDP_ENCAP_ESPINUDP; setsockopt(fd, SOL_UDP, UDP_ENCAP, &type, sizeof(type)); 2. To enable GRO for ESP in UDP socket: type = true; setsockopt(fd, SOL_UDP, UDP_GRO, &type, sizeof(type)); Enabling ESP-in-UDP has the side effect of preventing the Linux stack from seeing ESP packets at the L3 (when ESP OFFLOAD is disabled), as packets are immediately decapsulated from UDP and decrypted. This change may affect nftable rules that match on ESP packets at L3. Also tcpdump won't see the ESP packet. Developers/admins are advised to review and adapt any nftable rules accordingly before enabling this feature to prevent potential rule breakage. Also tcpdump will not see from ESP packets from a ESP in UDP flow, when this is enabled. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2023-10-06xfrm: Use the XFRM_GRO to indicate a GRO call on inputSteffen Klassert
This is needed to support GRO for ESP in UDP encapsulation. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
2023-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts (or adjacent changes of note). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04tcp: fix delayed ACKs for MSS boundary conditionNeal Cardwell
This commit fixes poor delayed ACK behavior that can cause poor TCP latency in a particular boundary condition: when an application makes a TCP socket write that is an exact multiple of the MSS size. The problem is that there is painful boundary discontinuity in the current delayed ACK behavior. With the current delayed ACK behavior, we have: (1) If an app reads data when > 1*MSS is unacknowledged, then tcp_cleanup_rbuf() ACKs immediately because of: tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss || (2) If an app reads all received data, and the packets were < 1*MSS, and either (a) the app is not ping-pong or (b) we received two packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause of: ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) || ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) && !inet_csk_in_pingpong_mode(sk))) && (3) *However*: if an app reads exactly 1*MSS of data, tcp_cleanup_rbuf() does not send an immediate ACK. This is true even if the app is not ping-pong and the 1*MSS of data had the PSH bit set, suggesting the sending application completed an application write. Thus if the app is not ping-pong, we have this painful case where >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a write whose last skb is an exact multiple of 1*MSS can get a 40ms delayed ACK. This means that any app that transfers data in one direction and takes care to align write size or packet size with MSS can suffer this problem. With receive zero copy making 4KB MSS values more common, it is becoming more common to have application writes naturally align with MSS, and more applications are likely to encounter this delayed ACK problem. The fix in this commit is to refine the delayed ACK heuristics with a simple check: immediately ACK a received 1*MSS skb with PSH bit set if the app reads all data. Why? If an skb has a len of exactly 1*MSS and has the PSH bit set then it is likely the end of an application write. So more data may not be arriving soon, and yet the data sender may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send an ACK immediately if the app reads all of the data and is not ping-pong. Note that this logic is also executed for the case where len > MSS, but in that case this logic does not matter (and does not hurt) because tcp_cleanup_rbuf() will always ACK immediately if the app reads data and there is more than an MSS of unACKed data. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Xin Guo <guoxin0309@gmail.com> Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04tcp: fix quick-ack counting to count actual ACKs of new dataNeal Cardwell
This commit fixes quick-ack counting so that it only considers that a quick-ack has been provided if we are sending an ACK that newly acknowledges data. The code was erroneously using the number of data segments in outgoing skbs when deciding how many quick-ack credits to remove. This logic does not make sense, and could cause poor performance in request-response workloads, like RPC traffic, where requests or responses can be multi-segment skbs. When a TCP connection decides to send N quick-acks, that is to accelerate the cwnd growth of the congestion control module controlling the remote endpoint of the TCP connection. That quick-ack decision is purely about the incoming data and outgoing ACKs. It has nothing to do with the outgoing data or the size of outgoing data. And in particular, an ACK only serves the intended purpose of allowing the remote congestion control to grow the congestion window quickly if the ACK is ACKing or SACKing new data. The fix is simple: only count packets as serving the goal of the quickack mechanism if they are ACKing/SACKing new data. We can tell whether this is the case by checking inet_csk_ack_scheduled(), since we schedule an ACK exactly when we are ACKing/SACKing new data. Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04ipv4: Set offload_failed flag in fibmatch resultsBenjamin Poirier
Due to a small omission, the offload_failed flag is missing from ipv4 fibmatch results. Make sure it is set correctly. The issue can be witnessed using the following commands: echo "1 1" > /sys/bus/netdevsim/new_device ip link add dummy1 up type dummy ip route add 192.0.2.0/24 dev dummy1 echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload ip route add 198.51.100.0/24 dev dummy1 ip route # 192.168.15.0/24 has rt_trap # 198.51.100.0/24 has rt_offload_failed ip route get 192.168.15.1 fibmatch # Result has rt_trap ip route get 198.51.100.1 fibmatch # Result differs from the route shown by `ip route`, it is missing # rt_offload_failed ip link del dev dummy1 echo 1 > /sys/bus/netdevsim/del_device Fixes: 36c5100e859d ("IPv4: Add "offload failed" indication to routes") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-04Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-10-02 We've added 11 non-merge commits during the last 12 day(s) which contain a total of 12 files changed, 176 insertions(+), 41 deletions(-). The main changes are: 1) Fix BPF verifier to reset backtrack_state masks on global function exit as otherwise subsequent precision tracking would reuse them, from Andrii Nakryiko. 2) Several sockmap fixes for available bytes accounting, from John Fastabend. 3) Reject sk_msg egress redirects to non-TCP sockets given this is only supported for TCP sockets today, from Jakub Sitnicki. 4) Fix a syzkaller splat in bpf_mprog when hitting maximum program limits with BPF_F_BEFORE directive, from Daniel Borkmann and Nikolay Aleksandrov. 5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust size_index for selecting a bpf_mem_cache, from Hou Tao. 6) Fix arch_prepare_bpf_trampoline return code for s390 JIT, from Song Liu. 7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off, from Leon Hwang. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Use kmalloc_size_roundup() to adjust size_index selftest/bpf: Add various selftests for program limits bpf, mprog: Fix maximum program check on mprog attachment bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets bpf, sockmap: Add tests for MSG_F_PEEK bpf, sockmap: Do not inc copied_seq when PEEK flag set bpf: tcp_read_skb needs to pop skb regardless of seq bpf: unconditionally reset backtrack_state masks on global func exit bpf: Fix tr dereferencing selftests/bpf: Check bpf_cubic_acked() is called via struct_ops s390/bpf: Let arch_prepare_bpf_trampoline return program size ==================== Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-03udp_tunnel: Use flex array to simplify codeChristophe JAILLET
'n_tables' is small, UDP_TUNNEL_NIC_MAX_TABLES = 4 as a maximum. So there is no real point to allocate the 'entries' pointers array with a dedicate memory allocation. Using a flexible array for struct udp_tunnel_nic->entries avoids the overhead of an additional memory allocation. This also saves an indirection when the array is accessed. Finally, __counted_by() can be used for run-time bounds checking if configured and supported by the compiler. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/4a096ba9cf981a588aa87235bb91e933ee162b3d.1695542544.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-03tcp_metrics: optimize tcp_metrics_flush_all()Eric Dumazet
This is inspired by several syzbot reports where tcp_metrics_flush_all() was seen in the traces. We can avoid acquiring tcp_metrics_lock for empty buckets, and we should add one cond_resched() to break potential long loops. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-03tcp_metrics: do not create an entry from tcp_init_metrics()Eric Dumazet
tcp_init_metrics() only wants to get metrics if they were previously stored in the cache. Creating an entry is adding useless costs, especially when tcp_no_metrics_save is set. Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-03tcp_metrics: properly set tp->snd_ssthresh in tcp_init_metrics()Eric Dumazet
We need to set tp->snd_ssthresh to TCP_INFINITE_SSTHRESH in the case tcp_get_metrics() fails for some reason. Fixes: 9ad7c049f0f7 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-03tcp_metrics: add missing barriers on deleteEric Dumazet
When removing an item from RCU protected list, we must prevent store-tearing, using rcu_assign_pointer() or WRITE_ONCE(). Fixes: 04f721c671656 ("tcp_metrics: Rewrite tcp_metrics_flush_all") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-03ipv4/fib: send notify when delete source address routesHangbin Liu
After deleting an interface address in fib_del_ifaddr(), the function scans the fib_info list for stray entries and calls fib_flush() and fib_table_flush(). Then the stray entries will be deleted silently and no RTM_DELROUTE notification will be sent. This lack of notification can make routing daemons, or monitor like `ip monitor route` miss the routing changes. e.g. + ip link add dummy1 type dummy + ip link add dummy2 type dummy + ip link set dummy1 up + ip link set dummy2 up + ip addr add 192.168.5.5/24 dev dummy1 + ip route add 7.7.7.0/24 dev dummy2 src 192.168.5.5 + ip -4 route 7.7.7.0/24 dev dummy2 scope link src 192.168.5.5 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5 + ip monitor route + ip addr del 192.168.5.5/24 dev dummy1 Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5 Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5 Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5 As Ido reminded, fib_table_flush() isn't only called when an address is deleted, but also when an interface is deleted or put down. The lack of notification in these cases is deliberate. And commit 7c6bb7d2faaf ("net/ipv6: Add knob to skip DELROUTE message on device down") introduced a sysctl to make IPv6 behave like IPv4 in this regard. So we can't send the route delete notify blindly in fib_table_flush(). To fix this issue, let's add a new flag in "struct fib_info" to track the deleted prefer source address routes, and only send notify for them. After update: + ip monitor route + ip addr del 192.168.5.5/24 dev dummy1 Deleted 192.168.5.0/24 dev dummy1 proto kernel scope link src 192.168.5.5 Deleted broadcast 192.168.5.255 dev dummy1 table local proto kernel scope link src 192.168.5.5 Deleted local 192.168.5.5 dev dummy1 table local proto kernel scope host src 192.168.5.5 Deleted 7.7.7.0/24 dev dummy2 scope link src 192.168.5.5 Suggested-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230922075508.848925-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-01inet: implement lockless getsockopt(IP_MULTICAST_IF)Eric Dumazet
Add missing annotations to inet->mc_index and inet->mc_addr to fix data-races. getsockopt(IP_MULTICAST_IF) can be lockless. setsockopt() side is left for later. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: lockless IP_PKTOPTIONS implementationEric Dumazet
Current implementation is already lockless, because the socket lock is released before reading socket fields. Add missing READ_ONCE() annotations. Note that corresponding WRITE_ONCE() are needed, the order of the patches do not really matter. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: implement lockless getsockopt(IP_UNICAST_IF)Eric Dumazet
Add missing READ_ONCE() annotations when reading inet->uc_index Implementing getsockopt(IP_UNICAST_IF) locklessly seems possible, the setsockopt() part might not be possible at the moment. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: lockless getsockopt(IP_MTU)Eric Dumazet
sk_dst_get() does not require socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: lockless getsockopt(IP_OPTIONS)Eric Dumazet
inet->inet_opt being RCU protected, we can use RCU instead of locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: implement lockless IP_TOSEric Dumazet
Some reads of inet->tos are racy. Add needed READ_ONCE() annotations and convert IP_TOS option lockless. v2: missing changes in include/net/route.h (David Ahern) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: implement lockless IP_MTU_DISCOVEREric Dumazet
inet->pmtudisc can be read locklessly. Implement proper lockless reads and writes to inet->pmtudisc ip_sock_set_mtu_discover() can now be called from arbitrary contexts. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: implement lockless IP_MULTICAST_TTLEric Dumazet
inet->mc_ttl can be read locklessly. Implement proper lockless reads and writes to inet->mc_ttl Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: annotate data-races around sk->sk_dst_pending_confirmEric Dumazet
This field can be read or written without socket lock being held. Add annotations to avoid load-store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: implement lockless SO_MAX_PACING_RATEEric Dumazet
SO_MAX_PACING_RATE setsockopt() does not need to hold the socket lock, because sk->sk_pacing_rate readers can run fine if the value is changed by other threads, after adding READ_ONCE() accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: implement lockless SO_PRIORITYEric Dumazet
This is a followup of 8bf43be799d4 ("net: annotate data-races around sk->sk_priority"). sk->sk_priority can be read and written without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01tcp: derive delack_max from rto_minEric Dumazet
While BPF allows to set icsk->->icsk_delack_max and/or icsk->icsk_rto_min, we have an ip route attribute (RTAX_RTO_MIN) to be able to tune rto_min, but nothing to consequently adjust max delayed ack, which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}). This makes RTAX_RTO_MIN of almost no practical use, unless customers are in big trouble. Modern days datacenter communications want to set rto_min to ~5 ms, and the max delayed ack one jiffie smaller to avoid spurious retransmits. After this patch, an "rto_min 5" route attribute will effectively lower max delayed ack timers to 4 ms. Note in the following ss output, "rto:6 ... ato:4" $ ss -temoi dst XXXXXX State Recv-Q Send-Q Local Address:Port Peer Address:Port Process ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597 ino:255134 sk:1001 <-> skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500 rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121 bytes_received:54823120 segs_out:1370582 segs_in:1370580 data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579 busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920 rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536 While we could argue this patch fixes a bug with RTAX_RTO_MIN, I do not add a Fixes: tag, so that we can soak it a bit before asking backports to stable branches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-29bpf, sockmap: Do not inc copied_seq when PEEK flag setJohn Fastabend
When data is peek'd off the receive queue we shouldn't considered it copied from tcp_sock side. When we increment copied_seq this will confuse tcp_data_ready() because copied_seq can be arbitrarily increased. From application side it results in poll() operations not waking up when expected. Notice tcp stack without BPF recvmsg programs also does not increment copied_seq. We broke this when we moved copied_seq into recvmsg to only update when actual copy was happening. But, it wasn't working correctly either before because the tcp_data_ready() tried to use the copied_seq value to see if data was read by user yet. See fixes tags. Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230926035300.135096-3-john.fastabend@gmail.com