summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2024-09-17Merge tag 'irq-core-2024-09-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Core: - Remove a global lock in the affinity setting code The lock protects a cpumask for intermediate results and the lock causes a bottleneck on simultaneous start of multiple virtual machines. Replace the lock and the static cpumask with a per CPU cpumask which is nicely serialized by raw spinlock held when executing this code. - Provide support for giving a suffix to interrupt domain names. That's required to support devices with subfunctions so that the domain names are distinct even if they originate from the same device node. - The usual set of cleanups and enhancements all over the place Drivers: - Support for longarch AVEC interrupt chip - Refurbishment of the Armada driver so it can be extended for new variants. - The usual set of cleanups and enhancements all over the place" * tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) genirq: Use cpumask_intersects() genirq/cpuhotplug: Use cpumask_intersects() irqchip/apple-aic: Only access system registers on SoCs which provide them irqchip/apple-aic: Add a new "Global fast IPIs only" feature level irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi dt-bindings: apple,aic: Document A7-A11 compatibles irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy() genirq/msi: Use kmemdup_array() instead of kmemdup() genirq/proc: Change the return value for set affinity permission error genirq/proc: Use irq_move_pending() in show_irq_affinity() genirq/proc: Correctly set file permissions for affinity control files genirq: Get rid of global lock in irq_do_set_affinity() genirq: Fix typo in struct comment irqchip/loongarch-avec: Add AVEC irqchip support irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING LoongArch: Architectural preparation for AVEC irqchip LoongArch: Move irqchip function prototypes to irq-loongson.h irqchip/loongson-pch-msi: Switch to MSI parent domains softirq: Remove unused 'action' parameter from action callback ...
2024-09-16Merge tag 'vfs-6.12.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This is the work to cleanup and shrink struct file significantly. Right now, (focusing on x86) struct file is 232 bytes. After this series struct file will be 184 bytes aka 3 cacheline and a spare 8 bytes for future extensions at the end of the struct. With struct file being as ubiquitous as it is this should make a difference for file heavy workloads and allow further optimizations in the future. - struct fown_struct was embedded into struct file letting it take up 32 bytes in total when really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct now allocate the struct on demand. This frees up 24 bytes. - Move struct file_ra_state into the union containg the cleanup hooks and move f_iocb_flags out of the union. This closes a 4 byte hole we created earlier and brings struct file to 192 bytes. Which means struct file is 3 cachelines and we managed to shrink it by 40 bytes. - Reorder struct file so that nothing crosses a cacheline. I suspect that in the future we will end up reordering some members to mitigate false sharing issues or just because someone does actually provide really good perf data. - Shrinking struct file to 192 bytes is only part of the work. Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer must be located outside of the object because the cache doesn't know what part of the memory can safely be overwritten as it may be needed to prevent object recycling. That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a new cacheline. So this also contains work to add a new kmem_cache_create_rcu() function that allows the caller to specify an offset where the freelist pointer is supposed to be placed. Thus avoiding the implicit addition of a fourth cacheline. - And finally this removes the f_version member in struct file. The f_version member isn't particularly well-defined. It is mainly used as a cookie to detect concurrent seeks when iterating directories. But it is also abused by some subsystems for completely unrelated things. It is mostly a directory and filesystem specific thing that doesn't really need to live in struct file and with its wonky semantics it really lacks a specific function. For pipes, f_version is (ab)used to defer poll notifications until a write has happened. And struct pipe_inode_info is used by multiple struct files in their ->private_data so there's no chance of pushing that down into file->private_data without introducing another pointer indirection. But pipes don't rely on f_pos_lock so this adds a union into struct file encompassing f_pos_lock and a pipe specific f_pipe member that pipes can use. This union of course can be extended to other file types and is similar to what we do in struct inode already" * tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits) fs: remove f_version pipe: use f_pipe fs: add f_pipe ubifs: store cookie in private data ufs: store cookie in private data udf: store cookie in private data proc: store cookie in private data ocfs2: store cookie in private data input: remove f_version abuse ext4: store cookie in private data ext2: store cookie in private data affs: store cookie in private data fs: add generic_llseek_cookie() fs: use must_set_pos() fs: add must_set_pos() fs: add vfs_setpos_cookie() s390: remove unused f_version ceph: remove unused f_version adi: remove unused f_version mm: Removed @freeptr_offset to prevent doc warning ...
2024-09-13net: fib_rules: Enable DSCP selector usageIdo Schimmel
Now that both IPv4 and IPv6 support the new DSCP selector, enable user space to configure FIB rules that make use of it by changing the policy of the new DSCP attribute so that it accepts values in the range of [0, 63]. Use NLA_U8 rather than NLA_UINT as the field is of fixed size. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240911093748.3662015-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-13net: fib_rules: Add DSCP selector attributeIdo Schimmel
The FIB rule TOS selector is implemented differently between IPv4 and IPv6. In IPv4 it is used to match on the three "Type of Services" bits specified in RFC 791, while in IPv6 is it is used to match on the six DSCP bits specified in RFC 2474. Add a new FIB rule attribute to allow matching on DSCP. The attribute will be used to implement a 'dscp' selector in ip-rule with a consistent behavior between IPv4 and IPv6. For now, set the type of the attribute to 'NLA_REJECT' so that user space will not be able to configure it. This restriction will be lifted once both IPv4 and IPv6 support the new attribute. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240911093748.3662015-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-13bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of errorDaniel Borkmann
For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input arguments, zero the value for the case of an error as otherwise it could leak memory. For tracing, it is not needed given CAP_PERFMON can already read all kernel memory anyway hence bpf_get_func_arg() and bpf_get_func_ret() is skipped in here. Also, the MTU helpers mtu_len pointer value is being written but also read. Technically, the MEM_UNINIT should not be there in order to always force init. Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now implies two things actually: i) write into memory, ii) memory does not have to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory, ii) memory must be initialized. This means that for bpf_*_check_mtu() we're readding the issue we're trying to fix, that is, it would then be able to write back into things like .rodata BPF maps. Follow-up work will rework the MEM_UNINIT semantics such that the intent can be better expressed. For now just clear the *mtu_len on error path which can be lifted later again. Fixes: 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Fix helper writes to read-only mapsDaniel Borkmann
Lonial found an issue that despite user- and BPF-side frozen BPF map (like in case of .rodata), it was still possible to write into it from a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT} as arguments. In check_func_arg() when the argument is as mentioned, the meta->raw_mode is never set. Later, check_helper_mem_access(), under the case of PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the subsequent call to check_map_access_type() and given the BPF map is read-only it succeeds. The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT when results are written into them as opposed to read out of them. The latter indicates that it's okay to pass a pointer to uninitialized memory as the memory is written to anyway. However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM just with additional alignment requirement. So it is better to just get rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the fixed size memory types. For this, add MEM_ALIGNED to additionally ensure alignment given these helpers write directly into the args via *<ptr> = val. The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>). MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated argument types, since in !MEM_FIXED_SIZE cases the verifier does not know the buffer size a priori and therefore cannot blindly write *<ptr> = val. Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-12Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-09-11 We've added 12 non-merge commits during the last 16 day(s) which contain a total of 20 files changed, 228 insertions(+), 30 deletions(-). There's a minor merge conflict in drivers/net/netkit.c: 00d066a4d4ed ("netdev_features: convert NETIF_F_LLTX to dev->lltx") d96608794889 ("netkit: Disable netpoll support") The main changes are: 1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used to easily parse skbs in BPF programs attached to tracepoints, from Philo Lu. 2) Add a cond_resched() point in BPF's sock_hash_free() as there have been several syzbot soft lockup reports recently, from Eric Dumazet. 3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which got noticed when zero copy ice driver started to use it, from Maciej Fijalkowski. 4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs up via netif_receive_skb_list() to better measure latencies, from Daniel Xu. 5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann. 6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but instead gather the actual value via /proc/sys/net/core/max_skb_frags, also from Maciej Fijalkowski. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: sock_map: Add a cond_resched() in sock_hash_free() selftests/bpf: Expand skb dynptr selftests for tp_btf bpf: Allow bpf_dynptr_from_skb() for tp_btf tcp: Use skb__nullable in trace_tcp_send_reset selftests/bpf: Add test for __nullable suffix in tp_btf bpf: Support __nullable argument suffix for tp_btf bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc() tcp_bpf: Remove an unused parameter for bpf_tcp_ingress() bpf, sockmap: Correct spelling skmsg.c netkit: Disable netpoll support Signed-off-by: Jakub Kicinski <kuba@kernel.org> ==================== Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11netdev: add dmabuf introspectionMina Almasry
Add dmabuf information to page_pool stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump page-pool-get ... {'dmabuf': 10, 'id': 456, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 455, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 454, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 453, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 452, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 451, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 450, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 449, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, And queue stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump queue-get ... {'dmabuf': 10, 'id': 8, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 9, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 10, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 11, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 12, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 13, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 14, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 15, 'ifindex': 3, 'type': 'rx'}, Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-14-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11net: add SO_DEVMEM_DONTNEED setsockopt to release RX fragsMina Almasry
Add an interface for the user to notify the kernel that it is done reading the devmem dmabuf frags returned as cmsg. The kernel will drop the reference on the frags to make them available for reuse. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240910171458.219195-11-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11tcp: RX path for devmem TCPMina Almasry
In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling. tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer. tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information: 1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released. The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240910171458.219195-10-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11net: add support for skbs with unreadable fragsMina Almasry
For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and unaccessible to the host. We expect there to be no mixing and matching of device memory frags (unaccessible) with host memory frags (accessible) in the same skb. Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not. __skb_fill_netmem_desc() now checks frags added to skbs for net_iov, and marks the skb as skb->devmem accordingly. Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coalescing devmem skbs with non devmem skbs. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-9-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11net: support non paged skb fragsMina Almasry
Make skb_frag_page() fail in the case where the frag is not backed by a page, and fix its relevant callers to handle this case. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-8-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11memory-provider: dmabuf devmem memory providerMina Almasry
Implement a memory provider that allocates dmabuf devmem in the form of net_iov. The provider receives a reference to the struct netdev_dmabuf_binding via the pool->mp_priv pointer. The driver needs to set this pointer for the provider in the net_iov. The provider obtains a reference on the netdev_dmabuf_binding which guarantees the binding and the underlying mapping remains alive until the provider is destroyed. Usage of PP_FLAG_DMA_MAP is required for this memory provide such that the page_pool can provide the driver with the dma-addrs of the devmem. Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order != 0. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-7-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11page_pool: devmem supportMina Almasry
Convert netmem to be a union of struct page and struct netmem. Overload the LSB of struct netmem* to indicate that it's a net_iov, otherwise it's a page. Currently these entries in struct page are rented by the page_pool and used exclusively by the net stack: struct { unsigned long pp_magic; struct page_pool *pp; unsigned long _pp_mapping_pad; unsigned long dma_addr; atomic_long_t pp_ref_count; }; Mirror these (and only these) entries into struct net_iov and implement netmem helpers that can access these common fields regardless of whether the underlying type is page or net_iov. Implement checks for net_iov in netmem helpers which delegate to mm APIs, to ensure net_iov are never passed to the mm stack. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-6-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11netdev: netdevice devmem allocatorMina Almasry
Implement netdev devmem allocator. The allocator takes a given struct netdev_dmabuf_binding as input and allocates net_iov from that binding. The allocation simply delegates to the binding's genpool for the allocation logic and wraps the returned memory region in a net_iov struct. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-5-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11netdev: support binding dma-buf to netdeviceMina Almasry
Add a netdev_dmabuf_binding struct which represents the dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to rx queues on the netdevice. On the binding, the dma_buf_attach & dma_buf_map_attachment will occur. The entries in the sg_table from mapping will be inserted into a genpool to make it ready for allocation. The chunks in the genpool are owned by a dmabuf_chunk_owner struct which holds the dma-buf offset of the base of the chunk and the dma_addr of the chunk. Both are needed to use allocations that come from this chunk. We create a new type that represents an allocation from the genpool: net_iov. We setup the net_iov allocation size in the genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally allocated by the page pool and given to the drivers. The user can unbind the dmabuf from the netdevice by closing the netlink socket that established the binding. We do this so that the binding is automatically unbound even if the userspace process crashes. The binding and unbinding leaves an indicator in struct netdev_rx_queue that the given queue is bound, and the binding is actuated by resetting the rx queue using the queue API. The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # excluding netlink Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-4-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11net: netdev netlink api to bind dma-buf to a net deviceMina Almasry
API takes the dma-buf fd as input, and binds it to the netdevice. The user can specify the rx queues to bind the dma-buf to. Suggested-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-3-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11netdev: add netdev_rx_queue_restart()Mina Almasry
Add netdev_rx_queue_restart(), which resets an rx queue using the queue API recently merged[1]. The queue API was merged to enable the core net stack to reset individual rx queues to actuate changes in the rx queue's configuration. In later patches in this series, we will use netdev_rx_queue_restart() to reset rx queues after binding or unbinding dmabuf configuration, which will cause reallocation of the page_pool to repopulate its memory using the new configuration. [1] https://lore.kernel.org/netdev/20240430231420.699177-1-shailend@google.com/T/ Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-2-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11sock_map: Add a cond_resched() in sock_hash_free()Eric Dumazet
Several syzbot soft lockup reports all have in common sock_hash_free() If a map with a large number of buckets is destroyed, we need to yield the cpu when needed. Fixes: 75e68e5bf2c7 ("bpf, sockhash: Synchronize delete from bucket list on map free") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240906154449.3742932-1-edumazet@google.com
2024-09-11bpf: Allow bpf_dynptr_from_skb() for tp_btfPhilo Lu
Making tp_btf able to use bpf_dynptr_from_skb(), which is useful for skb parsing, especially for non-linear paged skb data. This is achieved by adding KF_TRUSTED_ARGS flag to bpf_dynptr_from_skb and registering it for TRACING progs. With KF_TRUSTED_ARGS, args from fentry/fexit are excluded, so that unsafe progs like fexit/__kfree_skb are not allowed. We also need the skb dynptr to be read-only in tp_btf. Because may_access_direct_pkt_data() returns false by default when checking bpf_dynptr_from_skb, there is no need to add BPF_PROG_TYPE_TRACING to it explicitly. Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Philo Lu <lulie@linux.alibaba.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240911033719.91468-5-lulie@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-09-10Merge tag 'ipsec-next-2024-09-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2024-09-10 1) Remove an unneeded WARN_ON on packet offload. From Patrisious Haddad. 2) Add a copy from skb_seq_state to buffer function. This is needed for the upcomming IPTFS patchset. From Christian Hopps. 3) Spelling fix in xfrm.h. From Simon Horman. 4) Speed up xfrm policy insertions. From Florian Westphal. 5) Add and revert a patch to support xfrm interfaces for packet offload. This patch was just half cooked. 6) Extend usage of the new xfrm_policy_is_dead_or_sk helper. From Florian Westphal. 7) Update comments on sdb and xfrm_policy. From Florian Westphal. 8) Fix a null pointer dereference in the new policy insertion code From Florian Westphal. 9) Fix an uninitialized variable in the new policy insertion code. From Nathan Chancellor. * tag 'ipsec-next-2024-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: policy: Restore dir assignments in xfrm_hash_rebuild() xfrm: policy: fix null dereference Revert "xfrm: add SA information to the offloaded packet" xfrm: minor update to sdb and xfrm_policy comments xfrm: policy: use recently added helper in more places xfrm: add SA information to the offloaded packet xfrm: policy: remove remaining use of inexact list xfrm: switch migrate to xfrm_policy_lookup_bytype xfrm: policy: don't iterate inexact policies twice at insert time selftests: add xfrm policy insertion speed test script xfrm: Correct spelling in xfrm.h net: add copy from skb_seq_state to buffer function xfrm: Remove documentation WARN_ON to limit return values for offloaded SA ==================== Link: https://patch.msgid.link/20240910065507.2436394-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: remove dev_pick_tx_cpu_id()Jakub Kicinski
dev_pick_tx_cpu_id() has been introduced with two users by commit a4ea8a3dacc3 ("net: Add generic ndo_select_queue functions"). The use in AF_PACKET has been removed in 2019 by commit b71b5837f871 ("packet: rework packet_pick_tx_queue() to use common code selection") The other user was a Netlogic XLP driver, removed in 2021 by commit 47ac6f567c28 ("staging: Remove Netlogic XLP network driver"). It's relatively unlikely that any modern driver will need an .ndo_select_queue implementation which picks purely based on CPU ID and skips XPS, delete dev_pick_tx_cpu_id() Found by code inspection. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240906161059.715546-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09bpf: lwtunnel: Unmask upper DSCP bits in bpf_lwt_xmit_reroute()Ido Schimmel
Unmask the upper DSCP bits when calling ip_route_output_key() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-09net: sysfs: Fix weird usage of class's namespace relevant fieldsZijun Hu
Device class has two namespace relevant fields which are associated by the following usage: struct class { ... const struct kobj_ns_type_operations *ns_type; const void *(*namespace)(const struct device *dev); ... } if (dev->class && dev->class->ns_type) dev->class->namespace(dev); The usage looks weird since it checks @ns_type but calls namespace() it is found for all existing class definitions that the other filed is also assigned once one is assigned in current kernel tree, so fix this weird usage by checking @namespace to call namespace(). Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-06netpoll: remove netpoll_srcuEric Dumazet
netpoll_srcu is currently used from netpoll_poll_disable() and __netpoll_cleanup() Both functions run under RTNL, using netpoll_srcu adds confusion and no additional protection. Moreover the synchronize_srcu() call in __netpoll_cleanup() is performed before clearing np->dev->npinfo, which violates RCU rules. After this patch, netpoll_poll_disable() and netpoll_poll_enable() simply use rtnl_dereference(). This saves a big chunk of memory (more than 192KB on platforms with 512 cpus) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20240905084909.2082486-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-06net/core: make use of the helper macro LIST_HEAD()Hongbo Li
list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Here we can simplify the code. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://patch.msgid.link/20240904093243.3345012-6-lihongbo22@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/phy_device.c 2560db6ede1a ("net: phy: Fix missing of_node_put() for leds") 1dce520abd46 ("net: phy: Use for_each_available_child_of_node_scoped()") https://lore.kernel.org/20240904115823.74333648@canb.auug.org.au Adjacent changes: drivers/net/ethernet/xilinx/xilinx_axienet.h drivers/net/ethernet/xilinx/xilinx_axienet_main.c 858430db28a5 ("net: xilinx: axienet: Fix race in axienet_stop") 76abb5d675c4 ("net: xilinx: axienet: Add statistics support") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05net: napi: Prevent overflow of napi_defer_hard_irqsJoe Damato
In commit 6f8b12d661d0 ("net: napi: add hard irqs deferral feature") napi_defer_irqs was added to net_device and napi_defer_irqs_count was added to napi_struct, both as type int. This value never goes below zero, so there is not reason for it to be a signed int. Change the type for both from int to u32, and add an overflow check to sysfs to limit the value to S32_MAX. The limit of S32_MAX was chosen because the practical limit before this patch was S32_MAX (anything larger was an overflow) and thus there are no behavioral changes introduced. If the extra bit is needed in the future, the limit can be raised. Before this patch: $ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs' $ cat /sys/class/net/eth4/napi_defer_hard_irqs -2147483647 After this patch: $ sudo bash -c 'echo 2147483649 > /sys/class/net/eth4/napi_defer_hard_irqs' bash: line 0: echo: write error: Numerical result out of range Similarly, /sys/class/net/XXXXX/tx_queue_len is defined as unsigned: include/linux/netdevice.h: unsigned int tx_queue_len; And has an overflow check: dev_change_tx_queue_len(..., unsigned long new_len): if (new_len != (unsigned int)new_len) return -ERANGE; Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240904153431.307932-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-03net: dqs: Do not use extern for unused dql_groupBreno Leitao
When CONFIG_DQL is not enabled, dql_group should be treated as a dead declaration. However, its current extern declaration assumes the linker will ignore it, which is generally true across most compiler and architecture combinations. But in certain cases, the linker still attempts to resolve the extern struct, even when the associated code is dead, resulting in a linking error. For instance the following error in loongarch64: >> loongarch64-linux-ld: net-sysfs.c:(.text+0x589c): undefined reference to `dql_group' Modify the declaration of the dead object to be an empty declaration instead of an extern. This change will prevent the linker from attempting to resolve an undefined reference. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409012047.eCaOdfQJ-lkp@intel.com/ Fixes: 74293ea1c4db ("net: sysfs: Do not create sysfs for non BQL device") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Link: https://patch.msgid.link/20240902101734.3260455-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-03netdev_features: convert NETIF_F_NETNS_LOCAL to dev->netns_localAlexander Lobakin
"Interface can't change network namespaces" is rather an attribute, not a feature, and it can't be changed via Ethtool. Make it a "cold" private flag instead of a netdev_feature and free one more bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-03netdev_features: convert NETIF_F_LLTX to dev->lltxAlexander Lobakin
NETIF_F_LLTX can't be changed via Ethtool and is not a feature, rather an attribute, very similar to IFF_NO_QUEUE (and hot). Free one netdev_features_t bit and make it a "hot" private flag. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-03netdevice: convert private flags > BIT(31) to bitfieldsAlexander Lobakin
Make dev->priv_flags `u32` back and define bits higher than 31 as bitfield booleans as per Jakub's suggestion. This simplifies code which accesses these bits with no optimization loss (testb both before/after), allows to not extend &netdev_priv_flags each time, but also scales better as bits > 63 in the future would only add a new u64 to the structure with no complications, comparing to that extending ::priv_flags would require converting it to a bitmap. Note that I picked `unsigned long :1` to not lose any potential optimizations comparing to `bool :1` etc. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-02netdev-genl: Set extack and fix error on napi-getJoe Damato
In commit 27f91aaf49b3 ("netdev-genl: Add netlink framework functions for napi"), when an invalid NAPI ID is specified the return value -EINVAL is used and no extack is set. Change the return value to -ENOENT and set the extack. Before this commit: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --do napi-get --json='{"id": 451}' Netlink error: Invalid argument nl_len = 36 (20) nl_flags = 0x100 nl_type = 2 error: -22 After this commit: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --do napi-get --json='{"id": 451}' Netlink error: No such file or directory nl_len = 44 (28) nl_flags = 0x300 nl_type = 2 error: -2 extack: {'bad-attr': '.id'} Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20240831121707.17562-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-02bpf, sockmap: Correct spelling skmsg.cSimon Horman
Correct spelling in skmsg.c. As reported by codespell. Signed-off-by: Simon Horman <horms@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/bpf/20240829-sockmap-spell-v1-1-a614d76564cc@kernel.org
2024-08-31bpf: Unmask upper DSCP bits in __bpf_redirect_neigh_v4()Ido Schimmel
Unmask the upper DSCP bits when calling ip_route_output_flow() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-29net: prefer strscpy over strcpyHongbo Li
The deprecated helper strcpy() performs no bounds checking on the destination buffer. This could result in linear overflows beyond the end of the buffer, leading to all kinds of misbehaviors. The safe replacement is strscpy() [1]. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/faraday/ftgmac100.c 4186c8d9e6af ("net: ftgmac100: Ensure tx descriptor updates are visible") e24a6c874601 ("net: ftgmac100: Get link speed and duplex for NC-SI") https://lore.kernel.org/0b851ec5-f91d-4dd3-99da-e81b98c9ed28@kernel.org net/ipv4/tcp.c bac76cf89816 ("tcp: fix forever orphan socket caused by tcp_abort") edefba66d929 ("tcp: rstreason: introduce SK_RST_REASON_TCP_STATE for active reset") https://lore.kernel.org/20240828112207.5c199d41@canb.auug.org.au No adjacent changes. Link: https://patch.msgid.link/20240829130829.39148-1-pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-28file: reclaim 24 bytes from f_ownerChristian Brauner
We do embedd struct fown_struct into struct file letting it take up 32 bytes in total. We could tweak struct fown_struct to be more compact but really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct should allocate the struct on demand. This frees up 24 bytes in struct file. That will have some potentially user-visible changes for the ownership fcntl()s. Some of them can now fail due to allocation failures. Practically, that probably will almost never happen as the allocations are small and they only happen once per file. The fown_struct is used during kill_fasync() which is used by e.g., pipes to generate a SIGIO signal. Sending of such signals is conditional on userspace having set an owner for the file using one of the F_OWNER fcntl()s. Such users will be unaffected if struct fown_struct is allocated during the fcntl() call. There are a few subsystems that call __f_setown() expecting file->f_owner to be allocated: (1) tun devices file->f_op->fasync::tun_chr_fasync() -> __f_setown() There are no callers of tun_chr_fasync(). (2) tty devices file->f_op->fasync::tty_fasync() -> __tty_fasync() -> __f_setown() tty_fasync() has no additional callers but __tty_fasync() has. Note that __tty_fasync() only calls __f_setown() if the @on argument is true. It's called from: file->f_op->release::tty_release() -> tty_release() -> __tty_fasync() -> __f_setown() tty_release() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of tty_release() are safe as well. file->f_op->release::tty_open() -> tty_release() -> __tty_fasync() -> __f_setown() __tty_hangup() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of __tty_hangup() are safe as well. From the callchains it's obvious that (1) and (2) end up getting called via file->f_op->fasync(). That can happen either through the F_SETFL fcntl() with the FASYNC flag raised or via the FIOASYNC ioctl(). If FASYNC is requested and the file isn't already FASYNC then file->f_op->fasync() is called with @on true which ends up causing both (1) and (2) to call __f_setown(). (1) and (2) are the only subsystems that call __f_setown() from the file->f_op->fasync() handler. So both (1) and (2) have been updated to allocate a struct fown_struct prior to calling fasync_helper() to register with the fasync infrastructure. That's safe as they both call fasync_helper() which also does allocations if @on is true. The other interesting case are file leases: (3) file leases lease_manager_ops->lm_setup::lease_setup() -> __f_setown() Which in turn is called from: generic_add_lease() -> lease_manager_ops->lm_setup::lease_setup() -> __f_setown() So here again we can simply make generic_add_lease() allocate struct fown_struct prior to the lease_manager_ops->lm_setup::lease_setup() which happens under a spinlock. With that the two remaining subsystems that call __f_setown() are: (4) dnotify (5) sockets Both have their own custom ioctls to set struct fown_struct and both have been converted to allocate a struct fown_struct on demand from their respective ioctls. Interactions with O_PATH are fine as well e.g., when opening a /dev/tty as O_PATH then no file->f_op->open() happens thus no file->f_owner is allocated. That's fine as no file operation will be set for those and the device has never been opened. fcntl()s called on such things will just allocate a ->f_owner on demand. Although I have zero idea why'd you care about f_owner on an O_PATH fd. Link: https://lore.kernel.org/r/20240813-work-f_owner-v2-1-4e9343a79f9f@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-27tc: adjust network header after 2nd vlan pushBoris Sukholitko
<tldr> skb network header of the single-tagged vlan packet continues to point the vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This causes problem at the dissector which expects double-tagged packet network header to point to the inner vlan. The fix is to adjust network header in tcf_act_vlan.c but requires refactoring of skb_vlan_push function. </tldr> Consider the following shell script snippet configuring TC rules on the veth interface: ip link add veth0 type veth peer veth1 ip link set veth0 up ip link set veth1 up tc qdisc add dev veth0 clsact tc filter add dev veth0 ingress pref 10 chain 0 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5 tc filter add dev veth0 ingress pref 20 chain 0 flower \ num_of_vlans 1 action vlan push id 100 \ protocol 0x8100 action goto chain 5 tc filter add dev veth0 ingress pref 30 chain 5 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success" Sending double-tagged vlan packet with the IP payload inside: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d 0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&...... 0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................ 0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to the dmesg. OTOH, sending single-tagged vlan packet: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ...........".... 0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*.......... 0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................ 0030 00 00 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 20, will push the second vlan tag but will *not* match rule 30. IOW, the match at rule 30 fails if the second vlan was freshly pushed by the kernel. Lets look at __skb_flow_dissect working on the double-tagged vlan packet. Here is the relevant code from around net/core/flow_dissector.c:1277 copy-pasted here for convenience: if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX && skb && skb_vlan_tag_present(skb)) { proto = skb->protocol; } else { vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan); if (!vlan) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } proto = vlan->h_vlan_encapsulated_proto; nhoff += sizeof(*vlan); } The "else" clause above gets the protocol of the encapsulated packet from the skb data at the network header location. printk debugging has showed that in the good double-tagged packet case proto is htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet case proto is garbage leading to the failure to match tc filter 30. proto is being set from the skb header pointed by nhoff parameter which is defined at the beginning of __skb_flow_dissect (net/core/flow_dissector.c:1055 in the current version): nhoff = skb_network_offset(skb); Therefore the culprit seems to be that the skb network offset is different between double-tagged packet received from the interface and single-tagged packet having its vlan tag pushed by TC. Lets look at the interesting points of the lifetime of the single/double tagged packets as they traverse our packet flow. Both of them will start at __netif_receive_skb_core where the first vlan tag will be stripped: if (eth_type_vlan(skb->protocol)) { skb = skb_vlan_untag(skb); if (unlikely(!skb)) goto out; } At this stage in double-tagged case skb->data points to the second vlan tag while in single-tagged case skb->data points to the network (eg. IP) header. Looking at TC vlan push action (net/sched/act_vlan.c) we have the following code at tcf_vlan_act (interesting points are in square brackets): if (skb_at_tc_ingress(skb)) [1] skb_push_rcsum(skb, skb->mac_len); .... case TCA_VLAN_ACT_PUSH: err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid | (p->tcfv_push_prio << VLAN_PRIO_SHIFT), 0); if (err) goto drop; break; .... out: if (skb_at_tc_ingress(skb)) [3] skb_pull_rcsum(skb, skb->mac_len); And skb_vlan_push (net/core/skbuff.c:6204) function does: err = __vlan_insert_tag(skb, skb->vlan_proto, skb_vlan_tag_get(skb)); if (err) return err; skb->protocol = skb->vlan_proto; [2] skb->mac_len += VLAN_HLEN; in the case of pushing the second tag. Lets look at what happens with skb->data of the single-tagged packet at each of the above points: 1. As a result of the skb_push_rcsum, skb->data is moved back to the start of the packet. 2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is incremented, skb->data still points to the start of the packet. 3. As a result of the skb_pull_rcsum, skb->data is moved forward by the modified skb->mac_len, thus pointing to the network header again. Then __skb_flow_dissect will get confused by having double-tagged vlan packet with the skb->data at the network header. The solution for the bug is to preserve "skb->data at second vlan header" semantics in the skb_vlan_push function. We do this by manipulating skb->network_header rather than skb->mac_len. skb_vlan_push callers are updated to do skb_reset_mac_len. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-26ethtool: check device is present when getting link settingsJamie Bainbridge
A sysfs reader can race with a device reset or removal, attempting to read device state when the device is not actually present. eg: [exception RIP: qed_get_current_link+17] #8 [ffffb9e4f2907c48] qede_get_link_ksettings at ffffffffc07a994a [qede] #9 [ffffb9e4f2907cd8] __rh_call_get_link_ksettings at ffffffff992b01a3 #10 [ffffb9e4f2907d38] __ethtool_get_link_ksettings at ffffffff992b04e4 #11 [ffffb9e4f2907d90] duplex_show at ffffffff99260300 #12 [ffffb9e4f2907e38] dev_attr_show at ffffffff9905a01c #13 [ffffb9e4f2907e50] sysfs_kf_seq_show at ffffffff98e0145b #14 [ffffb9e4f2907e68] seq_read at ffffffff98d902e3 #15 [ffffb9e4f2907ec8] vfs_read at ffffffff98d657d1 #16 [ffffb9e4f2907f00] ksys_read at ffffffff98d65c3f #17 [ffffb9e4f2907f38] do_syscall_64 at ffffffff98a052fb crash> struct net_device.state ffff9a9d21336000 state = 5, state 5 is __LINK_STATE_START (0b1) and __LINK_STATE_NOCARRIER (0b100). The device is not present, note lack of __LINK_STATE_PRESENT (0b10). This is the same sort of panic as observed in commit 4224cfd7fb65 ("net-sysfs: add check for netdevice being present to speed_show"). There are many other callers of __ethtool_get_link_ksettings() which don't have a device presence check. Move this check into ethtool to protect all callers. Fixes: d519e17e2d01 ("net: export device speed and duplex via sysfs") Fixes: 4224cfd7fb65 ("net-sysfs: add check for netdevice being present to speed_show") Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Link: https://patch.msgid.link/8bae218864beaa44ed01628140475b9bf641c5b0.1724393671.git.jamie.bainbridge@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26net: Correct spelling in net/coreSimon Horman
Correct spelling in net/core. As reported by codespell. Signed-off-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240822-net-spell-v1-13-3a98971ce2d2@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26netpoll: Ensure clean state on setup failuresBreno Leitao
Modify netpoll_setup() and __netpoll_setup() to ensure that the netpoll structure (np) is left in a clean state if setup fails for any reason. This prevents carrying over misconfigured fields in case of partial setup success. Key changes: - np->dev is now set only after successful setup, ensuring it's always NULL if netpoll is not configured or if netpoll_setup() fails. - np->local_ip is zeroed if netpoll setup doesn't complete successfully. - Added DEBUG_NET_WARN_ON_ONCE() checks to catch unexpected states. - Reordered some operations in __netpoll_setup() for better logical flow. These changes improve the reliability of netpoll configuration, since it assures that the structure is fully initialized or totally unset. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20240822111051.179850-2-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-08-23 We've added 10 non-merge commits during the last 15 day(s) which contain a total of 10 files changed, 222 insertions(+), 190 deletions(-). The main changes are: 1) Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case when long-lived sockets miss a chance to set additional callbacks if a sockops program was not attached early in their lifetime, from Alan Maguire. 2) Add a batch of BPF selftest improvements which fix a few bugs and add missing features to improve the test coverage of sockmap/sockhash, from Michal Luczaj. 3) Fix a false-positive Smatch-reported off-by-one in tcp_validate_cookie() which is part of the test_tcp_custom_syncookie BPF selftest, from Kuniyuki Iwashima. 4) Fix the flow_dissector BPF selftest which had a bug in IP header's tot_len calculation doing subtraction after htons() instead of inside htons(), from Asbjørn Sloth Tønnesen. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftest: bpf: Remove mssind boundary check in test_tcp_custom_syncookie.c. selftests/bpf: Introduce __attribute__((cleanup)) in create_pair() selftests/bpf: Exercise SOCK_STREAM unix_inet_redir_to_connected() selftests/bpf: Honour the sotype of af_unix redir tests selftests/bpf: Simplify inet_socketpair() and vsock_socketpair_connectible() selftests/bpf: Socket pair creation, cleanups selftests/bpf: Support more socket types in create_pair() selftests/bpf: Avoid subtraction after htons() in ipip tests selftests/bpf: add sockopt tests for TCP_BPF_SOCK_OPS_CB_FLAGS bpf/bpf_get,set_sockopt: add option to set TCP-BPF sock ops flags ==================== Link: https://patch.msgid.link/20240823134959.1091-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26Merge tag 'nf-next-24-08-23' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: Patch #1 fix checksum calculation in nfnetlink_queue with SCTP, segment GSO packet since skb_zerocopy() does not support GSO_BY_FRAGS, from Antonio Ojea. Patch #2 extend nfnetlink_queue coverage to handle SCTP packets, from Antonio Ojea. Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink, from Donald Hunter. Patch #4 adds a dedicate commit list for sets to speed up intra-transaction lookups, from Florian Westphal. Patch #5 skips removal of element from abort path for the pipapo backend, ditching the shadow copy of this datastructure is sufficient. Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to let users of conncoiunt decide when to enable conntrack, this is needed by openvswitch, from Xin Long. Patch #7 pass context to all nft_parse_register_load() in preparation for the next patch. Patches #8 and #9 reject loads from uninitialized registers from control plane to remove register initialization from datapath. From Florian Westphal. * tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: don't initialize registers in nft_do_chain() netfilter: nf_tables: allow loads only when register is initialized netfilter: nf_tables: pass context structure to nft_parse_register_load netfilter: move nf_ct_netns_get out of nf_conncount_init netfilter: nf_tables: do not remove elements if set backend implements .abort netfilter: nf_tables: store new sets in dedicated list netfilter: nfnetlink: convert kfree_skb to consume_skb selftests: netfilter: nft_queue.sh: sctp coverage netfilter: nfnetlink_queue: unbreak SCTP traffic ==================== Link: https://patch.msgid.link/20240822221939.157858-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-24net: refactor ->ndo_bpf calls into dev_xdp_propagateMina Almasry
When net devices propagate xdp configurations to slave devices, we will need to perform a memory provider check to ensure we're not binding xdp to a device using unreadable netmem. Currently the ->ndo_bpf calls in a few places. Adding checks to all these places would not be ideal. Refactor all the ->ndo_bpf calls into one place where we can add this check in the future. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23rtnetlink: delete redundant judgment statementsLi Zetao
The initial value of err is -ENOBUFS, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23neighbour: delete redundant judgment statementsLi Zetao
The initial value of err is -ENOBUFS, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23fib: rules: delete redundant judgment statementsLi Zetao
The initial value of err is -ENOMEM, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23net: phy: Introduce ethernet link topology representationMaxime Chevallier
Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. The reason being that most of the logic for these configuration, coming from either ethtool netlink or ioctls tend to use netdev->phydev, which in multi-phy systems will reference the PHY closest to the MAC. Introduce a numbering scheme allowing to enumerate PHY devices that belong to any netdev, which can in turn allow userspace to take more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the phy is attached to. The PHY index can be re-used for PHYs that are persistent. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-22pktgen: use cpus_read_lock() in pg_net_init()Eric Dumazet
I have seen the WARN_ON(smp_processor_id() != cpu) firing in pktgen_thread_worker() during tests. We must use cpus_read_lock()/cpus_read_unlock() around the for_each_online_cpu(cpu) loop. While we are at it use WARN_ON_ONCE() to avoid a possible syslog flood. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240821175339.1191779-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>