summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-11-18treewide: use get_random_u32_below() instead of deprecated functionJason A. Donenfeld
This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-17sctp: sm_statefuns: Remove pointer casts of the same typeLi zeming
The subh.addip_hdr pointer is also of type (struct sctp_addiphdr *), so it does not require a cast. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20221115020705.3220-1-zeming@nfschina.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-16tcp: annotate data-race around queue->synflood_warnedEric Dumazet
Annotate the lockless read of queue->synflood_warned. Following xchg() has the needed data-race resolution. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16ax25: af_ax25: Remove unnecessary (void*) conversionsLi zeming
The valptr pointer is of (void *) type, so other pointers need not be forced to assign values to it. Signed-off-by: Li zeming <zeming@nfschina.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16tcp: configurable source port perturb table sizeGleb Mazovetskiy
On embedded systems with little memory and no relevant security concerns, it is beneficial to reduce the size of the table. Reducing the size from 2^16 to 2^8 saves 255 KiB of kernel RAM. Makes the table size configurable as an expert option. The size was previously increased from 2^8 to 2^16 in commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16"). Signed-off-by: Gleb Mazovetskiy <glex.spb@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16l2tp: Serialize access to sk_user_data with sk_callback_lockJakub Sitnicki
sk->sk_user_data has multiple users, which are not compatible with each other. Writers must synchronize by grabbing the sk->sk_callback_lock. l2tp currently fails to grab the lock when modifying the underlying tunnel socket fields. Fix it by adding appropriate locking. We err on the side of safety and grab the sk_callback_lock also inside the sk_destruct callback overridden by l2tp, even though there should be no refs allowing access to the sock at the time when sk_destruct gets called. v4: - serialize write to sk_user_data in l2tp sk_destruct v3: - switch from sock lock to sk_callback_lock - document write-protection for sk_user_data v2: - update Fixes to point to origin of the bug - use real names in Reported/Tested-by tags Cc: Tom Parkin <tparkin@katalix.com> Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core") Reported-by: Haowei Yan <g1042620637@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16ipv4: tunnels: use DEV_STATS_INC()Eric Dumazet
Most of code paths in tunnels are lockless (eg NETIF_F_LLTX in tx). Adopt SMP safe DEV_STATS_INC() to update dev->stats fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16ipv6: tunnels: use DEV_STATS_INC()Eric Dumazet
Most of code paths in tunnels are lockless (eg NETIF_F_LLTX in tx). Adopt SMP safe DEV_STATS_{INC|ADD}() to update dev->stats fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16ipv6/sit: use DEV_STATS_INC() to avoid data-racesEric Dumazet
syzbot/KCSAN reported that multiple cpus are updating dev->stats.tx_error concurrently. This is because sit tunnels are NETIF_F_LLTX, meaning their ndo_start_xmit() is not protected by a spinlock. While original KCSAN report was about tx path, rx path has the same issue. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16net: add atomic_long_t to net_device_stats fieldsEric Dumazet
Long standing KCSAN issues are caused by data-race around some dev->stats changes. Most performance critical paths already use per-cpu variables, or per-queue ones. It is reasonable (and more correct) to use atomic operations for the slow paths. This patch adds an union for each field of net_device_stats, so that we can convert paths that are not yet protected by a spinlock or a mutex. netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64 Note that the memcpy() we were using on 64bit arches had no provision to avoid load-tearing, while atomic_long_read() is providing the needed protection at no cost. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16net: __sock_gen_cookie() cleanupEric Dumazet
Adopt atomic64_try_cmpxchg() and remove the loop, to make the intent more obvious. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16net: adopt try_cmpxchg() in napi_{enable|disable}()Eric Dumazet
This makes code a bit cleaner. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16net: adopt try_cmpxchg() in napi_schedule_prep() and napi_complete_done()Eric Dumazet
This makes the code slightly more efficient. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16net: net_{enable|disable}_timestamp() optimizationsEric Dumazet
Adopting atomic_try_cmpxchg() makes the code cleaner. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16ipv6: fib6_new_sernum() optimizationEric Dumazet
Adopt atomic_try_cmpxchg() which is slightly more efficient. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16net: mm_account_pinned_pages() optimizationEric Dumazet
Adopt atomic_long_try_cmpxchg() in mm_account_pinned_pages() as it is slightly more efficient. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually downVladimir Oltean
RFC 2863 says: The lowerLayerDown state is also a refinement on the down state. This new state indicates that this interface runs "on top of" one or more other interfaces (see ifStackTable) and that this interface is down specifically because one or more of these lower-layer interfaces are down. DSA interfaces are virtual network devices, stacked on top of the DSA master, but they have a physical MAC, with a PHY that reports a real link status. But since DSA (perhaps improperly) uses an iflink to describe the relationship to its master since commit c084080151e1 ("dsa: set ->iflink on slave interfaces to the ifindex of the parent"), default_operstate() will misinterpret this to mean that every time the carrier of a DSA interface is not ok, it is because of the master being not ok. In fact, since commit c0a8a9c27493 ("net: dsa: automatically bring user ports down when master goes down"), DSA cannot even in theory be in the lowerLayerDown state, because it just calls dev_close_many(), thereby going down, when the master goes down. We could revert the commit that creates an iflink between a DSA user port and its master, especially since now we have an alternative IFLA_DSA_MASTER which has less side effects. But there may be tooling in use which relies on the iflink, which has existed since 2009. We could also probably do something local within DSA to overwrite what rfc2863_policy() did, in a way similar to hsr_set_operstate(), but this seems like a hack. What seems appropriate is to follow the iflink, and check the carrier status of that interface as well. If that's down too, yes, keep reporting lowerLayerDown, otherwise just down. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16udp: Introduce optional per-netns hash table.Kuniyuki Iwashima
The maximum hash table size is 64K due to the nature of the protocol. [0] It's smaller than TCP, and fewer sockets can cause a performance drop. On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in different netns, creating 32Mi sockets without data transfer in the root netns causes regression for the iperf3's connection. uhash_entries sockets length Gbps 64K 1 1 5.69 1Mi 16 5.27 2Mi 32 4.90 4Mi 64 4.09 8Mi 128 2.96 16Mi 256 2.06 32Mi 512 1.12 The per-netns hash table breaks the lengthy lists into shorter ones. It is useful on a multi-tenant system with thousands of netns. With smaller hash tables, we can look up sockets faster, isolate noisy neighbours, and reduce lock contention. The max size of the per-netns table is 64K as well. This is because the possible hash range by udp_hashfn() always fits in 64K within the same netns and we cannot make full use of the whole buckets larger than 64K. /* 0 < num < 64K -> X < hash < X + 64K */ (num + net_hash_mix(net)) & mask; Also, the min size is 128. We use a bitmap to search for an available port in udp_lib_get_port(). To keep the bitmap on the stack and not fire the CONFIG_FRAME_WARN error at build time, we round up the table size to 128. The sysctl usage is the same with TCP: $ dmesg | cut -d ' ' -f 6- | grep "UDP hash" UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc) # sysctl net.ipv4.udp_hash_entries net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries # sysctl net.ipv4.udp_child_hash_entries net.ipv4.udp_child_hash_entries = 0 # disabled by default # ip netns add test1 # ip netns exec test1 sysctl net.ipv4.udp_hash_entries net.ipv4.udp_hash_entries = -65536 # share the global table # sysctl -w net.ipv4.udp_child_hash_entries=100 net.ipv4.udp_child_hash_entries = 100 # ip netns add test2 # ip netns exec test2 sysctl net.ipv4.udp_hash_entries net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets We could optimise the hash table lookup/iteration further by removing the netns comparison for the per-netns one in the future. Also, we could optimise the sparse udp_hslot layout by putting it in udp_table. [0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16udp: Access &udp_table via net.Kuniyuki Iwashima
We will soon introduce an optional per-netns hash table for UDP. This means we cannot use udp_table directly in most places. Instead, access it via net->ipv4.udp_table. The access will be valid only while initialising udp_table itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16udp: Set NULL to udp_seq_afinfo.udp_table.Kuniyuki Iwashima
We will soon introduce an optional per-netns hash table for UDP. This means we cannot use the global udp_seq_afinfo.udp_table to fetch a UDP hash table. Instead, set NULL to udp_seq_afinfo.udp_table for UDP and get a proper table from net->ipv4.udp_table. Note that we still need udp_seq_afinfo.udp_table for UDP LITE. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16udp: Set NULL to sk->sk_prot->h.udp_table.Kuniyuki Iwashima
We will soon introduce an optional per-netns hash table for UDP. This means we cannot use the global sk->sk_prot->h.udp_table to fetch a UDP hash table. Instead, set NULL to sk->sk_prot->h.udp_table for UDP and get a proper table from net->ipv4.udp_table. Note that we still need sk->sk_prot->h.udp_table for UDP LITE. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16udp: Clean up some functions.Kuniyuki Iwashima
This patch adds no functional change and cleans up some functions that the following patches touch around so that we make them tidy and easy to review/revert. The change is mainly to keep reverse christmas tree order. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16wifi: cfg80211: Avoid clashing function prototypesGustavo A. R. Silva
When built with Control Flow Integrity, function prototypes between caller and function declaration must match. These mismatches are visible at compile time with the new -Wcast-function-type-strict in Clang[1]. Fix a total of 73 warnings like these: drivers/net/wireless/intersil/orinoco/wext.c:1379:27: warning: cast from 'int (*)(struct net_device *, struct iw_request_info *, struct iw_param *, char *)' to 'iw_handler' (aka 'int (*)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') converts to incompatible function type [-Wcast-function-type-strict] IW_HANDLER(SIOCGIWPOWER, (iw_handler)orinoco_ioctl_getpower), ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../net/wireless/wext-compat.c:1607:33: warning: cast from 'int (*)(struct net_device *, struct iw_request_info *, struct iw_point *, char *)' to 'iw_handler' (aka 'int (*)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') converts to incompatible function type [-Wcast-function-type-strict] [IW_IOCTL_IDX(SIOCSIWGENIE)] = (iw_handler) cfg80211_wext_siwgenie, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../drivers/net/wireless/intersil/orinoco/wext.c:1390:27: error: incompatible function pointer types initializing 'const iw_handler' (aka 'int (*const)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') with an expression of type 'int (struct net_device *, struct iw_request_info *, struct iw_param *, char *)' [-Wincompatible-function-pointer-types] IW_HANDLER(SIOCGIWRETRY, cfg80211_wext_giwretry), ^~~~~~~~~~~~~~~~~~~~~~ The cfg80211 Wireless Extension handler callbacks (iw_handler) use a union for the data argument. Actually use the union and perform explicit member selection in the function body instead of having a function prototype mismatch. There are no resulting binary differences before/after changes. These changes were made partly manually and partly with the help of Coccinelle. Link: https://github.com/KSPP/linux/issues/234 Link: https://reviews.llvm.org/D134831 [1] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/a68822bf8dd587988131bb6a295280cb4293f05d.1667934775.git.gustavoars@kernel.org
2022-11-16rxrpc: Fix network address validationDavid Howells
Fix network address validation on entry to uapi functions such as connect() for AF_RXRPC. The check for address compatibility with the transport socket isn't correct and allows an AF_INET6 address to be given to an AF_INET socket, resulting in an oops now that rxrpc is calling udp_sendmsg() directly. Sample program: #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <sys/socket.h> #include <arpa/inet.h> #include <linux/rxrpc.h> static unsigned char ctrl[256] = "\x18\x00\x00\x00\x00\x00\x00\x00\x10\x01\x00\x00\x01"; int main(void) { struct sockaddr_rxrpc srx = { .srx_family = AF_RXRPC, .transport_type = SOCK_DGRAM, .transport_len = 28, .transport.sin6.sin6_family = AF_INET6, }; struct mmsghdr vec = { .msg_hdr.msg_control = ctrl, .msg_hdr.msg_controllen = 0x18, }; int s; s = socket(AF_RXRPC, SOCK_DGRAM, AF_INET); if (s < 0) { perror("socket"); exit(1); } if (connect(s, (struct sockaddr *)&srx, sizeof(srx)) < 0) { perror("connect"); exit(1); } if (sendmmsg(s, &vec, 1, MSG_NOSIGNAL | MSG_MORE) < 0) { perror("sendmmsg"); exit(1); } return 0; } If working properly, connect() should fail with EAFNOSUPPORT. Fixes: ed472b0c8783 ("rxrpc: Call udp_sendmsg() directly") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-16rxrpc: Fix oops from calling udpv6_sendmsg() on AF_INET socketDavid Howells
If rxrpc sees an IPv6 address, it assumes it can call udpv6_sendmsg() on it - even if it got it on an IPv4 socket. Fix do_udp_sendmsg() to give an error in such a case. general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] ... RIP: 0010:ipv6_addr_v4mapped include/net/ipv6.h:749 [inline] RIP: 0010:udpv6_sendmsg+0xd0a/0x2c70 net/ipv6/udp.c:1361 ... Call Trace: do_udp_sendmsg net/rxrpc/output.c:27 [inline] do_udp_sendmsg net/rxrpc/output.c:21 [inline] rxrpc_send_abort_packet+0x73b/0x860 net/rxrpc/output.c:367 rxrpc_release_calls_on_socket+0x211/0x300 net/rxrpc/call_object.c:595 rxrpc_release_sock net/rxrpc/af_rxrpc.c:886 [inline] rxrpc_release+0x263/0x5a0 net/rxrpc/af_rxrpc.c:917 __sock_release+0xcd/0x280 net/socket.c:650 sock_close+0x18/0x20 net/socket.c:1365 __fput+0x27c/0xa90 fs/file_table.c:320 task_work_run+0x16b/0x270 kernel/task_work.c:179 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0xb35/0x2a20 kernel/exit.c:820 do_group_exit+0xd0/0x2a0 kernel/exit.c:950 __do_sys_exit_group kernel/exit.c:961 [inline] __se_sys_exit_group kernel/exit.c:959 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:959 Fixes: ed472b0c8783 ("rxrpc: Call udp_sendmsg() directly") Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-15net: dsa: don't leak tagger-owned storage on switch driver unbindVladimir Oltean
In the initial commit dc452a471dba ("net: dsa: introduce tagger-owned storage for private and shared data"), we had a call to tag_ops->disconnect(dst) issued from dsa_tree_free(), which is called at tree teardown time. There were problems with connecting to a switch tree as a whole, so this got reworked to connecting to individual switches within the tree. In this process, tag_ops->disconnect(ds) was made to be called only from switch.c (cross-chip notifiers emitted as a result of dynamic tag proto changes), but the normal driver teardown code path wasn't replaced with anything. Solve this problem by adding a function that does the opposite of dsa_switch_setup_tag_protocol(), which is called from the equivalent spot in dsa_switch_teardown(). The positioning here also ensures that we won't have any use-after-free in tagging protocol (*rcv) ops, since the teardown sequence is as follows: dsa_tree_teardown -> dsa_tree_teardown_master -> dsa_master_teardown -> unsets master->dsa_ptr, making no further packets match the ETH_P_XDSA packet type handler -> dsa_tree_teardown_ports -> dsa_port_teardown -> dsa_slave_destroy -> unregisters DSA net devices, there is even a synchronize_net() in unregister_netdevice_many() -> dsa_tree_teardown_switches -> dsa_switch_teardown -> dsa_switch_teardown_tag_protocol -> finally frees the tagger-owned storage Fixes: 7f2973149c22 ("net: dsa: make tagging protocols connect to individual switches from a tree") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Saeed Mahameed <saeed@kernel.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20221114143551.1906361-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15net: dsa: remove phylink_validate() methodVladimir Oltean
As of now, no DSA driver uses a custom link mode validation procedure anymore. So remove this DSA operation and let phylink determine what is supported based on config->mac_capabilities (if provided by the driver). Leave a comment why we left the code that we did, and that there is more work to do. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next 1) Fix sparse warning in the new nft_inner expression, reported by Jakub Kicinski. 2) Incorrect vlan header check in nft_inner, from Peng Wu. 3) Two patches to pass reset boolean to expression dump operation, in preparation for allowing to reset stateful expressions in rules. This adds a new NFT_MSG_GETRULE_RESET command. From Phil Sutter. 4) Inconsistent indentation in nft_fib, from Jiapeng Chong. 5) Speed up siphash calculation in conntrack, from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: conntrack: use siphash_4u64 netfilter: rpfilter/fib: clean up some inconsistent indenting netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters netfilter: nft_inner: fix return value check in nft_inner_parse_l2l3() netfilter: nft_payload: use __be16 to store gre version ==================== Link: https://lore.kernel.org/r/20221115095922.139954-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15net/x25: Fix skb leak in x25_lapb_receive_frame()Wei Yongjun
x25_lapb_receive_frame() using skb_copy() to get a private copy of skb, the new skb should be freed in the undersized/fragmented skb error handling path. Otherwise there is a memory leak. Fixes: cb101ed2c3c7 ("x25: Handle undersized/fragmented skbs") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Link: https://lore.kernel.org/r/20221114110519.514538-1-weiyongjun@huaweicloud.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15net: dsa: add support for DSA rx offloading via metadata dstFelix Fietkau
If a metadata dst is present with the type METADATA_HW_PORT_MUX on a dsa cpu port netdev, assume that it carries the port number and that there is no DSA tag present in the skb data. Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15bpf: Expand map key argument of bpf_redirect_map to u64Toke Høiland-Jørgensen
For queueing packets in XDP we want to add a new redirect map type with support for 64-bit indexes. To prepare fore this, expand the width of the 'key' argument to the bpf_redirect_map() helper. Since BPF registers are always 64-bit, this should be safe to do after the fact. Acked-by: Song Liu <song@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15net: dcb: move getapptrust to separate functionDaniel Machon
This patch fixes a frame size warning, reported by kernel test robot. >> net/dcb/dcbnl.c:1230:1: warning: the frame size of 1244 bytes is >> larger than 1024 bytes [-Wframe-larger-than=] The getapptrust part of dcbnl_ieee_fill is moved to a separate function, and the selector array is now dynamically allocated, instead of stack allocated. Tested on microchip sparx5 driver. Fixes: 6182d5875c33 ("net: dcb: add new apptrust attribute") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/20221114092950.2490451-1-daniel.machon@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-15bridge: switchdev: Fix memory leaks when changing VLAN protocolIdo Schimmel
The bridge driver can offload VLANs to the underlying hardware either via switchdev or the 8021q driver. When the former is used, the VLAN is marked in the bridge driver with the 'BR_VLFLAG_ADDED_BY_SWITCHDEV' private flag. To avoid the memory leaks mentioned in the cited commit, the bridge driver will try to delete a VLAN via the 8021q driver if the VLAN is not marked with the previously mentioned flag. When the VLAN protocol of the bridge changes, switchdev drivers are notified via the 'SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL' attribute, but the 8021q driver is also called to add the existing VLANs with the new protocol and delete them with the old protocol. In case the VLANs were offloaded via switchdev, the above behavior is both redundant and buggy. Redundant because the VLANs are already programmed in hardware and drivers that support VLAN protocol change (currently only mlx5) change the protocol upon the switchdev attribute notification. Buggy because the 8021q driver is called despite these VLANs being marked with 'BR_VLFLAG_ADDED_BY_SWITCHDEV'. This leads to memory leaks [1] when the VLANs are deleted. Fix by not calling the 8021q driver for VLANs that were already programmed via switchdev. [1] unreferenced object 0xffff8881f6771200 (size 256): comm "ip", pid 446855, jiffies 4298238841 (age 55.240s) hex dump (first 32 bytes): 00 00 7f 0e 83 88 ff ff 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000012819ac>] vlan_vid_add+0x437/0x750 [<00000000f2281fad>] __br_vlan_set_proto+0x289/0x920 [<000000000632b56f>] br_changelink+0x3d6/0x13f0 [<0000000089d25f04>] __rtnl_newlink+0x8ae/0x14c0 [<00000000f6276baf>] rtnl_newlink+0x5f/0x90 [<00000000746dc902>] rtnetlink_rcv_msg+0x336/0xa00 [<000000001c2241c0>] netlink_rcv_skb+0x11d/0x340 [<0000000010588814>] netlink_unicast+0x438/0x710 [<00000000e1a4cd5c>] netlink_sendmsg+0x788/0xc40 [<00000000e8992d4e>] sock_sendmsg+0xb0/0xe0 [<00000000621b8f91>] ____sys_sendmsg+0x4ff/0x6d0 [<000000000ea26996>] ___sys_sendmsg+0x12e/0x1b0 [<00000000684f7e25>] __sys_sendmsg+0xab/0x130 [<000000004538b104>] do_syscall_64+0x3d/0x90 [<0000000091ed9678>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 279737939a81 ("net: bridge: Fix VLANs memory leak") Reported-by: Vlad Buslov <vladbu@nvidia.com> Tested-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20221114084509.860831-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-15kcm: close race conditions on sk_receive_queueCong Wang
sk->sk_receive_queue is protected by skb queue lock, but for KCM sockets its RX path takes mux->rx_lock to protect more than just skb queue. However, kcm_recvmsg() still only grabs the skb queue lock, so race conditions still exist. We can teach kcm_recvmsg() to grab mux->rx_lock too but this would introduce a potential performance regression as struct kcm_mux can be shared by multiple KCM sockets. So we have to enforce skb queue lock in requeue_rx_msgs() and handle skb peek case carefully in kcm_wait_data(). Fortunately, skb_recv_datagram() already handles it nicely and is widely used by other sockets, we can just switch to skb_recv_datagram() after getting rid of the unnecessary sock lock in kcm_recvmsg() and kcm_splice_read(). Side note: SOCK_DONE is not used by KCM sockets, so it is safe to get rid of this check too. I ran the original syzbot reproducer for 30 min without seeing any issue. Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module") Reported-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com Reported-by: shaozhengchao <shaozhengchao@huawei.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20221114005119.597905-1-xiyou.wangcong@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-15netfilter: conntrack: use siphash_4u64Florian Westphal
This function is used for every packet, siphash_4u64 is noticeably faster than using local buffer + siphash: Before: 1.23% kpktgend_0 [kernel.vmlinux] [k] __siphash_unaligned 0.14% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw After: 0.79% kpktgend_0 [kernel.vmlinux] [k] siphash_4u64 0.15% kpktgend_0 [nf_conntrack] [k] hash_conntrack_raw In the pktgen test this gives about ~2.4% performance improvement. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15netfilter: rpfilter/fib: clean up some inconsistent indentingJiapeng Chong
No functional modification involved. net/ipv4/netfilter/nft_fib_ipv4.c:141 nft_fib4_eval() warn: inconsistent indenting. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2733 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESETPhil Sutter
Analogous to NFT_MSG_GETOBJ_RESET, but for rules: Reset stateful expressions like counters or quotas. The latter two are the only consumers, adjust their 'dump' callbacks to respect the parameter introduced earlier. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15netfilter: nf_tables: Extend nft_expr_ops::dump callback parametersPhil Sutter
Add a 'reset' flag just like with nft_object_ops::dump. This will be useful to reset "anonymous stateful objects", e.g. simple rule counters. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-14bpf: Refactor btf_struct_accessKumar Kartikeya Dwivedi
Instead of having to pass multiple arguments that describe the register, pass the bpf_reg_state into the btf_struct_access callback. Currently, all call sites simply reuse the btf and btf_id of the reg they want to check the access of. The only exception to this pattern is the callsite in check_ptr_to_map_access, hence for that case create a dummy reg to simulate PTR_TO_BTF_ID access. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14tcp: Add listening address to SYN flood messageJamie Bainbridge
The SYN flood message prints the listening port number, but with many processes bound to the same port on different IPs, it's impossible to tell which socket is the problem. Add the listen IP address to the SYN flood message. For IPv6 use "[IP]:port" as per RFC-5952 and to provide ease of copy-paste to "ss" filters. For IPv4 use "IP:port" to match. Each protcol's "any" address and a host address now look like: Possible SYN flooding on port 0.0.0.0:9001. Possible SYN flooding on port 127.0.0.1:9001. Possible SYN flooding on port [::]:9001. Possible SYN flooding on port [fc00::1]:9001. Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/4fedab7ce54a389aeadbdc639f6b4f4988e9d2d7.1668386107.git.jamie.bainbridge@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-14net: dsa: make dsa_master_ioctl() see through port_hwtstamp_get() shimsVladimir Oltean
There are multi-generational drivers like mv88e6xxx which have code like this: int mv88e6xxx_port_hwtstamp_get(struct dsa_switch *ds, int port, struct ifreq *ifr) { if (!chip->info->ptp_support) return -EOPNOTSUPP; ... } DSA wants to deny PTP timestamping on the master if the switch supports timestamping too. However it currently relies on the presence of the port_hwtstamp_get() callback to determine PTP capability, and this clearly does not work in that case (method is present but returns -EOPNOTSUPP). We should not deny PTP on the DSA master for those switches which truly do not support hardware timestamping. Create a dsa_port_supports_hwtstamp() method which actually probes for support by calling port_hwtstamp_get() and seeing whether that returned -EOPNOTSUPP or not. Fixes: f685e609a301 ("net: dsa: Deny PTP on master if switch supports it") Link: https://patchwork.kernel.org/project/netdevbpf/patch/20221110124345.3901389-1-festevam@gmail.com/ Reported-by: Fabio Estevam <festevam@gmail.com> Reported-by: Steffen Bätz <steffen@innosonix.de> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Fabio Estevam <festevam@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14net: flow_offload: add support for ARP frame matchingSteen Hegelund
This adds a new flow_rule_match_arp function that allows drivers to be able to dissect ARP frames. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14ipasdv4/tcp_ipv4: remove redundant assignmentxu xin
The value of 'st->state' has been verified as "TCP_SEQ_STATE_LISTENING", it's unnecessary to assign TCP_SEQ_STATE_LISTENING to it, so we can remove it. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14net: caif: fix double disconnect client in chnl_net_open()Zhengchao Shao
When connecting to client timeout, disconnect client for twice in chnl_net_open(). Remove one. Compile tested only. Fixes: 2aa40aef9deb ("caif: Use link layer MTU instead of fixed MTU") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14rxrpc: Fix missing IPV6 #ifdefDavid Howells
Fix rxrpc_encap_err_rcv() to make the call to ipv6_icmp_error conditional on IPV6 support being enabled. Fixes: b6c66c4324e7 ("rxrpc: Use the core ICMP/ICMP6 parsers") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org
2022-11-11tcp: tcp_wfree() refactoringEric Dumazet
Use try_cmpxchg() (instead of cmpxchg()) in a more readable way. oval = smp_load_acquire(&sk->sk_tsq_flags); do { ... } while (!try_cmpxchg(&sk->sk_tsq_flags, &oval, nval)); Reduce indentation level. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221110190239.3531280-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11tcp: adopt try_cmpxchg() in tcp_release_cb()Eric Dumazet
try_cmpxchg() is slighly more efficient (at least on x86), and smp_load_acquire(&sk->sk_tsq_flags) could avoid a KCSAN report. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221110174829.3403442-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11bridge: Add missing parenthesesIdo Schimmel
No changes in generated code. Reported-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20221110085422.521059-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11mptcp: Fix grammar in a commentMat Martineau
We kept getting initial patches from new contributors to remove a duplicate 'the' (since grammar checking scripts flag it), but submitters never followed up after code review. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11mptcp: get sk from msk directlyGeliang Tang
Use '(struct sock *)msk' to get 'sk' from 'msk' in a more direct way instead of using '&msk->sk.icsk_inet.sk'. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>