summaryrefslogtreecommitdiff
path: root/include/linux/skbuff.h
AgeCommit message (Collapse)Author
2022-07-22Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2022-07-22 We've added 73 non-merge commits during the last 12 day(s) which contain a total of 88 files changed, 3458 insertions(+), 860 deletions(-). The main changes are: 1) Implement BPF trampoline for arm64 JIT, from Xu Kuohai. 2) Add ksyscall/kretsyscall section support to libbpf to simplify tracing kernel syscalls through kprobe mechanism, from Andrii Nakryiko. 3) Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel function, from Song Liu & Jiri Olsa. 4) Add new kfunc infrastructure for netfilter's CT e.g. to insert and change entries, from Kumar Kartikeya Dwivedi & Lorenzo Bianconi. 5) Add a ksym BPF iterator to allow for more flexible and efficient interactions with kernel symbols, from Alan Maguire. 6) Bug fixes in libbpf e.g. for uprobe binary path resolution, from Dan Carpenter. 7) Fix BPF subprog function names in stack traces, from Alexei Starovoitov. 8) libbpf support for writing custom perf event readers, from Jon Doron. 9) Switch to use SPDX tag for BPF helper man page, from Alejandro Colomar. 10) Fix xsk send-only sockets when in busy poll mode, from Maciej Fijalkowski. 11) Reparent BPF maps and their charging on memcg offlining, from Roman Gushchin. 12) Multiple follow-up fixes around BPF lsm cgroup infra, from Stanislav Fomichev. 13) Use bootstrap version of bpftool where possible to speed up builds, from Pu Lehui. 14) Cleanup BPF verifier's check_func_arg() handling, from Joanne Koong. 15) Make non-prealloced BPF map allocations low priority to play better with memcg limits, from Yafang Shao. 16) Fix BPF test runner to reject zero-length data for skbs, from Zhengchao Shao. 17) Various smaller cleanups and improvements all over the place. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (73 commits) bpf: Simplify bpf_prog_pack_[size|mask] bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch) bpf, x64: Allow to use caller address from stack ftrace: Allow IPMODIFY and DIRECT ops on the same function ftrace: Add modify_ftrace_direct_multi_nolock bpf/selftests: Fix couldn't retrieve pinned program in xdp veth test bpf: Fix build error in case of !CONFIG_DEBUG_INFO_BTF selftests/bpf: Fix test_verifier failed test in unprivileged mode selftests/bpf: Add negative tests for new nf_conntrack kfuncs selftests/bpf: Add tests for new nf_conntrack kfuncs selftests/bpf: Add verifier tests for trusted kfunc args net: netfilter: Add kfuncs to set and change CT status net: netfilter: Add kfuncs to set and change CT timeout net: netfilter: Add kfuncs to allocate and insert CT net: netfilter: Deduplicate code in bpf_{xdp,skb}_ct_lookup bpf: Add documentation for kfuncs bpf: Add support for forcing kfunc args to be trusted bpf: Switch to new kfunc flags infrastructure tools/resolve_btfids: Add support for 8-byte BTF sets bpf: Introduce 8-byte BTF set ... ==================== Link: https://lore.kernel.org/r/20220722221218.29943-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19Merge branch 'io_uring-zerocopy-send' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux Pavel Begunkov says: ==================== io_uring zerocopy send The patchset implements io_uring zerocopy send. It works with both registered and normal buffers, mixing is allowed but not recommended. Apart from usual request completions, just as with MSG_ZEROCOPY, io_uring separately notifies the userspace when buffers are freed and can be reused (see API design below), which is delivered into io_uring's Completion Queue. Those "buffer-free" notifications are not necessarily per request, but the userspace has control over it and should explicitly attaching a number of requests to a single notification. The series also adds some internal optimisations when used with registered buffers like removing page referencing. From the kernel networking perspective there are two main changes. The first one is passing ubuf_info into the network layer from io_uring (inside of an in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info caching on the io_uring side, but also helps to avoid cross-referencing and synchronisation problems. The second part is an optional optimisation removing page referencing for requests with registered buffers. Benchmarking UDP with an optimised version of the selftest (see [1]), which sends a bunch of requests, waits for completions and repeats. "+ flush" column posts one additional "buffer-free" notification per request, and just "zc" doesn't post buffer notifications at all. NIC (requests / second): IO size | non-zc | zc | zc + flush 4000 | 495134 | 606420 (+22%) | 558971 (+12%) 1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%) 1000 | 584677 | 592088 (+1.2%) | 560885 (-4%) 600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%) dummy (requests / second): IO size | non-zc | zc | zc + flush 8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%) 4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%) 1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%) 600 | 2106794 | 2381527 (+13%) | 2195295 (+4%) Previously it also brought a massive performance speedup compared to the msg_zerocopy tool (see [3]), which is probably not super interesting. There is also an additional bunch of refcounting optimisations that was omitted from the series for simplicity and as they don't change the picture drastically, they will be sent as follow up, as well as flushing optimisations closing the performance gap b/w two last columns. For TCP on localhost (with hacks enabling localhost zerocopy) and including additional overhead for receive: IO size | non-zc | zc 1200 | 4174 | 4148 4096 | 7597 | 11228 Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the omitted optimisations will somewhat help, should look better for 4000, but couldn't test properly because of setup problems. Links: liburing (benchmark + tests): [1] https://github.com/isilence/liburing/tree/zc_v4 kernel repo: [2] https://github.com/isilence/linux/tree/zc_v4 RFC v1: [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/ RFC v2: https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/ Net patches based: git@github.com:isilence/linux.git zc_v4-net-base or https://github.com/isilence/linux/tree/zc_v4-net-base API design overview: The series introduces an io_uring concept of notifactors. From the userspace perspective it's an entity to which it can bind one or more requests and then requesting to flush it. Flushing a notifier makes it impossible to attach new requests to it, and instructs the notifier to post a completion once all requests attached to it are completed and the kernel doesn't need the buffers anymore. Notifications are stored in notification slots, which should be registered as an array in io_uring. Each slot stores only one notifier at any particular moment. Flushing removes it from the slot and the slot automatically replaces it with a new notifier. All operations with notifiers are done by specifying an index of a slot it's currently in. When registering a notification the userspace specifies a u64 tag for each slot, which will be copied in notification completion entries as cqe::user_data. cqe::res is 0 and cqe::flags is equal to wrap around u32 sequence number counting notifiers of a slot. ==================== Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19net: introduce __skb_fill_page_desc_noaccPavel Begunkov
Managed pages contain pinned userspace pages and controlled by upper layers, there is no need in tracking skb->pfmemalloc for them. Introduce a helper for filling frags but ignoring page tracking, it'll be needed later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19net: introduce managed frags infrastructurePavel Begunkov
Some users like io_uring can do page pinning more efficiently, so we want a way to delegate referencing to other subsystems. For that add a new flag called SKBFL_MANAGED_FRAG_REFS. When set, skb doesn't hold page references and upper layers are responsivle to managing page lifetime. It's allowed to convert skbs from managed to normal by calling skb_zcopy_downgrade_managed(). The function will take all needed page references and clear the flag. It's needed, for instance, to avoid mixing managed modes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19net: Allow custom iter handler in msghdrDavid Ahern
Add support for custom iov_iter handling to msghdr. The idea is that in-kernel subsystems want control over how an SG is split. Signed-off-by: David Ahern <dsahern@kernel.org> [pavel: move callback into msghdr] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19bpf: Don't redirect packets with invalid pkt_lenZhengchao Shao
Syzbot found an issue [1]: fq_codel_drop() try to drop a flow whitout any skbs, that is, the flow->head is null. The root cause, as the [2] says, is because that bpf_prog_test_run_skb() run a bpf prog which redirects empty skbs. So we should determine whether the length of the packet modified by bpf prog or others like bpf_prog_test is valid before forwarding it directly. LINK: [1] https://syzkaller.appspot.com/bug?id=0b84da80c2917757915afa89f7738a9d16ec96c5 LINK: [2] https://www.spinics.net/lists/netdev/msg777503.html Reported-by: syzbot+7a12909485b94426aceb@syzkaller.appspotmail.com Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220715115559.139691-1-shaozhengchao@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-18skbuff: add SKBFL_DONT_ORPHAN flagPavel Begunkov
We don't want to list every single ubuf_info callback in skb_orphan_frags(), add a flag controlling the behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-24net: helper function skb_len_addRichard Gobert
Move the len fields manipulation in the skbs to a helper function. There is a comment specifically requesting this and there are several other areas in the code displaying the same pattern which can be refactored. This improves code readability. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Link: https://lore.kernel.org/r/20220622160853.GA6478@debian Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-21net: warn if mac header was not setEric Dumazet
Make sure skb_mac_header(), skb_mac_offset() and skb_mac_header_len() uses are not fooled if the mac header has not been set. These checks are enabled if CONFIG_DEBUG_NET=y This commit will likely expose existing bugs in linux networking stacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220620093017.3366713-1-eric.dumazet@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-07net: skb: move enum skb_drop_reason to standalone header fileMenglong Dong
As the skb drop reasons are getting more and more, move the enum 'skb_drop_reason' and related function to the standalone header 'dropreason.h', as Jakub Kicinski suggested. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-06-02net: add debug info to __skb_pull()Eric Dumazet
While analyzing yet another syzbot report, I found the following patch very useful. It allows to better understand what went wrong. This debug info is only enabled if CONFIG_DEBUG_NET=y, which is the case for syzbot builds. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16net: skb: Remove skb_data_area_size()Ricardo Martinez
skb_data_area_size() is not needed. As Jakub pointed out [1]: For Rx, drivers can use the size passed during skb allocation or use skb_tailroom(). For Tx, drivers should use skb_headlen(). [1] https://lore.kernel.org/netdev/CAHNKnsTmH-rGgWi3jtyC=ktM1DW2W1VJkYoTMJV2Z_Bt498bsg@mail.gmail.com/ Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16net: skb: change the definition SKB_DR_SET()Menglong Dong
The SKB_DR_OR() is used to set the drop reason to a value when it is not set or specified yet. SKB_NOT_DROPPED_YET should also be considered as not set. Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-12skbuff: replace a BUG_ON() with the new DEBUG_NET_WARN_ON_ONCE()Jakub Kicinski
Very few drivers actually have Kconfig knobs for adding -DDEBUG. 8 according to a quick grep, while there are 93 users of skb_checksum_none_assert(). Switch to the new DEBUG_NET_WARN_ON_ONCE() to catch bad skbs. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220511172305.1382810-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-11net: warn if transport header was not setEric Dumazet
Make sure skb_transport_header() and skb_transport_offset() uses are not fooled if the transport header has not been set. This change will likely expose existing bugs in linux networking stacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-10skbuff: render the checksum comment to documentationJakub Kicinski
Long time ago Tom added a giant comment to skbuff.h explaining checksums. Now that we have a place in Documentation for skbuff docs we should render it. Sprinkle some markup while at it. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-10skbuff: rewrite the doc for data-only skbsJakub Kicinski
The comment about shinfo->dataref split is really unhelpful, at least to me. Rewrite it and render it to skb documentation. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-10skbuff: add a basic intro docJakub Kicinski
Add basic skb documentation. It's mostly an intro to the subsequent patches - it would looks strange if we documented advanced topics without covering the basics in any way. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-10ptp: Support late timestamp determinationGerhard Engleder
If a physical clock supports a free running cycle counter, then timestamps shall be based on this time too. For TX it is known in advance before the transmission if a timestamp based on the free running cycle counter is needed. For RX it is impossible to know which timestamp is needed before the packet is received and assigned to a socket. Support late timestamp determination by a network device. Therefore, an address/cookie is stored within the new netdev_data field of struct skb_shared_hwtstamps. This address/cookie is provided to a new network device function called ndo_get_tstamp(), which returns a timestamp based on the normal/adjustable time or based on the free running cycle counter. If function is not supported, then timestamp handling is not changed. This mechanism is intended for RX, but TX use is also possible. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-10ptp: Request cycles for TX timestampGerhard Engleder
The free running cycle counter of physical clocks called cycles shall be used for hardware timestamps to enable synchronisation. Introduce new flag SKBTX_HW_TSTAMP_USE_CYCLES, which signals driver to provide a TX timestamp based on cycles if cycles are supported. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-09net: skb: introduce skb_data_area_size()Ricardo Martinez
Helper to calculate the linear data space in the skb. Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-05net: Make msg_zerocopy_alloc staticDavid Ahern
msg_zerocopy_alloc is only used by msg_zerocopy_realloc; remove the export and make static in skbuff.c Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/r/20220504170947.18773-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-30net: inline skb_zerocopy_iter_dgramPavel Begunkov
skb_zerocopy_iter_dgram() is a small proxy function, inline it. For that, move __zerocopy_sg_from_iter into linux/skbuff.h Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-27Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-04-27 We've added 85 non-merge commits during the last 18 day(s) which contain a total of 163 files changed, 4499 insertions(+), 1521 deletions(-). The main changes are: 1) Teach libbpf to enhance BPF verifier log with human-readable and relevant information about failed CO-RE relocations, from Andrii Nakryiko. 2) Add typed pointer support in BPF maps and enable it for unreferenced pointers (via probe read) and referenced ones that can be passed to in-kernel helpers, from Kumar Kartikeya Dwivedi. 3) Improve xsk to break NAPI loop when rx queue gets full to allow for forward progress to consume descriptors, from Maciej Fijalkowski & Björn Töpel. 4) Fix a small RCU read-side race in BPF_PROG_RUN routines which dereferenced the effective prog array before the rcu_read_lock, from Stanislav Fomichev. 5) Implement BPF atomic operations for RV64 JIT, and add libbpf parsing logic for USDT arguments under riscv{32,64}, from Pu Lehui. 6) Implement libbpf parsing of USDT arguments under aarch64, from Alan Maguire. 7) Enable bpftool build for musl and remove nftw with FTW_ACTIONRETVAL usage so it can be shipped under Alpine which is musl-based, from Dominique Martinet. 8) Clean up {sk,task,inode} local storage trace RCU handling as they do not need to use call_rcu_tasks_trace() barrier, from KP Singh. 9) Improve libbpf API documentation and fix error return handling of various API functions, from Grant Seltzer. 10) Enlarge offset check for bpf_skb_{load,store}_bytes() helpers given data length of frags + frag_list may surpass old offset limit, from Liu Jian. 11) Various improvements to prog_tests in area of logging, test execution and by-name subtest selection, from Mykola Lysenko. 12) Simplify map_btf_id generation for all map types by moving this process to build time with help of resolve_btfids infra, from Menglong Dong. 13) Fix a libbpf bug in probing when falling back to legacy bpf_probe_read*() helpers; the probing caused always to use old helpers, from Runqing Yang. 14) Add support for ARCompact and ARCv2 platforms for libbpf's PT_REGS tracing macros, from Vladimir Isaev. 15) Cleanup BPF selftests to remove old & unneeded rlimit code given kernel switched to memcg-based memory accouting a while ago, from Yafang Shao. 16) Refactor of BPF sysctl handlers to move them to BPF core, from Yan Zhu. 17) Fix BPF selftests in two occasions to work around regressions caused by latest LLVM to unblock CI until their fixes are worked out, from Yonghong Song. 18) Misc cleanups all over the place, from various others. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits) selftests/bpf: Add libbpf's log fixup logic selftests libbpf: Fix up verifier log for unguarded failed CO-RE relos libbpf: Simplify bpf_core_parse_spec() signature libbpf: Refactor CO-RE relo human description formatting routine libbpf: Record subprog-resolved CO-RE relocations unconditionally selftests/bpf: Add CO-RE relos and SEC("?...") to linked_funcs selftests libbpf: Avoid joining .BTF.ext data with BPF programs by section name libbpf: Fix logic for finding matching program for CO-RE relocation libbpf: Drop unhelpful "program too large" guess libbpf: Fix anonymous type check in CO-RE logic bpf: Compute map_btf_id during build time selftests/bpf: Add test for strict BTF type check selftests/bpf: Add verifier tests for kptr selftests/bpf: Add C tests for kptr libbpf: Add kptr type tag macros to bpf_helpers.h bpf: Make BTF type match stricter for release arguments bpf: Teach verifier about kptr_get kfunc helpers bpf: Wire up freeing of referenced kptr bpf: Populate pairs of btf_id and destructor kfunc in btf bpf: Adapt copy_map_value for multiple offset case ... ==================== Link: https://lore.kernel.org/r/20220427224758.20976-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26net: generalize skb freeing deferral to per-cpu listsEric Dumazet
Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions (*) 57.81% [kernel] [k] copy_user_enhanced_fast_string (*) 12.87% [kernel] [k] skb_release_data (*) 4.25% [kernel] [k] __free_one_page (*) 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter (*) 1.59% [kernel] [k] free_unref_page_commit (*) 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter (*) 1.01% [kernel] [k] kfree (*) 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue (*) 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog (*) 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked (*) 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq (*) 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv (*) 0.33% [kernel] [k] __put_page (*) 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-20net: Change skb_ensure_writable()'s write_len param to unsigned int typeLiu Jian
Both pskb_may_pull() and skb_clone_writable()'s length parameters are of type unsigned int already. Therefore, change this function's write_len param to unsigned int type. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20220416105801.88708-3-liujian56@huawei.com
2022-04-17tcp: add drop reason support to tcp_ofo_queue()Eric Dumazet
packets in OFO queue might be redundant, and dropped. tcp_drop() is no longer needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reason support to tcp_prune_ofo_queue()Eric Dumazet
Add one reason for packets dropped from OFO queue because of memory pressure. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add two drop reasons for tcp_ack()Eric Dumazet
Add TCP_TOO_OLD_ACK and TCP_ACK_UNSENT_DATA drop reasons so that tcp_rcv_established() can report them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reasons to tcp_rcv_state_process()Eric Dumazet
Add basic support for drop reasons in tcp_rcv_state_process() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reason support to tcp_validate_incoming()Eric Dumazet
Creates four new drop reasons for the following cases: 1) packet being rejected by RFC 7323 PAWS check 2) packet being rejected by SEQUENCE check 3) Invalid RST packet 4) Invalid SYN packet Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13net: ip: add skb drop reasons to ip forwardingMenglong Dong
Replace kfree_skb() which is used in ip6_forward() and ip_forward() with kfree_skb_reason(). The new drop reason 'SKB_DROP_REASON_PKT_TOO_BIG' is introduced for the case that the length of the packet exceeds MTU and can't fragment. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13net: ipv4: add skb drop reasons to ip_error()Menglong Dong
Eventually, I find out the handler function for inputting route lookup fail: ip_error(). The drop reasons we used in ip_error() are almost corresponding to IPSTATS_MIB_*, and following new reasons are introduced: SKB_DROP_REASON_IP_INADDRERRORS SKB_DROP_REASON_IP_INNOROUTES Isn't the name SKB_DROP_REASON_IP_HOSTUNREACH and SKB_DROP_REASON_IP_NETUNREACH more accurate? To make them corresponding to IPSTATS_MIB_*, we keep their name still. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-13skb: add some helpers for skb drop reasonsMenglong Dong
In order to simply the definition and assignment for 'enum skb_drop_reason', introduce some helpers. SKB_DR() is used to define a variable of type 'enum skb_drop_reason' with the 'SKB_DROP_REASON_NOT_SPECIFIED' initial value. SKB_DR_SET() is used to set the value of the variable. Seems it is a little useless? But it makes the code shorter. SKB_DR_OR() is used to set the value of the variable if it is not set yet, which means its value is SKB_DROP_REASON_NOT_SPECIFIED. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-11net: icmp: add skb drop reasons to icmp protocolMenglong Dong
Replace kfree_skb() used in icmp_rcv() and icmpv6_rcv() with kfree_skb_reason(). In order to get the reasons of the skb drops after icmp message handle, we change the return type of 'handler()' in 'struct icmp_control' from 'bool' to 'enum skb_drop_reason'. This may change its original intention, as 'false' means failure, but 'SKB_NOT_DROPPED_YET' means success now. Therefore, all 'handler' and the call of them need to be handled. Following 'handler' functions are involved: icmp_unreach() icmp_redirect() icmp_echo() icmp_timestamp() icmp_discard() And following new drop reasons are added: SKB_DROP_REASON_ICMP_CSUM SKB_DROP_REASON_INVALID_PROTO The reason 'INVALID_PROTO' is introduced for the case that the packet doesn't follow rfc 1122 and is dropped. This is not a common case, and I believe we can locate the problem from the data in the packet. For now, this 'INVALID_PROTO' is used for the icmp broadcasts with wrong types. Maybe there should be a document file for these reasons. For example, list all the case that causes the 'UNHANDLED_PROTO' and 'INVALID_PROTO' drop reason. Therefore, users can locate their problems according to the document. Reviewed-by: Hao Peng <flyingpeng@tencent.com> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-11net: skb: rename SKB_DROP_REASON_PTYPE_ABSENTMenglong Dong
As David Ahern suggested, the reasons for skb drops should be more general and not be code based. Therefore, rename SKB_DROP_REASON_PTYPE_ABSENT to SKB_DROP_REASON_UNHANDLED_PROTO, which is used for the cases of no L3 protocol handler, no L4 protocol handler, version extensions, etc. From previous discussion, now we have the aim to make these reasons more abstract and users based, avoiding code based. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-06net: remove noblock parameter from skb_recv_datagram()Oliver Hartkopp
skb_recv_datagram() has two parameters 'flags' and 'noblock' that are merged inside skb_recv_datagram() by 'flags | (noblock ? MSG_DONTWAIT : 0)' As 'flags' may contain MSG_DONTWAIT as value most callers split the 'flags' into 'flags' and 'noblock' with finally obsolete bit operations like this: skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc); And this is not even done consistently with the 'flags' parameter. This patch removes the obsolete and costly splitting into two parameters and only performs bit operations when really needed on the caller side. One missing conversion thankfully reported by kernel test robot. I missed to enable kunit tests to build the mctp code. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-22Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-03-21 v2 We've added 137 non-merge commits during the last 17 day(s) which contain a total of 143 files changed, 7123 insertions(+), 1092 deletions(-). The main changes are: 1) Custom SEC() handling in libbpf, from Andrii. 2) subskeleton support, from Delyan. 3) Use btf_tag to recognize __percpu pointers in the verifier, from Hao. 4) Fix net.core.bpf_jit_harden race, from Hou. 5) Fix bpf_sk_lookup remote_port on big-endian, from Jakub. 6) Introduce fprobe (multi kprobe) _without_ arch bits, from Masami. The arch specific bits will come later. 7) Introduce multi_kprobe bpf programs on top of fprobe, from Jiri. 8) Enable non-atomic allocations in local storage, from Joanne. 9) Various var_off ptr_to_btf_id fixed, from Kumar. 10) bpf_ima_file_hash helper, from Roberto. 11) Add "live packet" mode for XDP in BPF_PROG_RUN, from Toke. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (137 commits) selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" bpftool: Fix a bug in subskeleton code generation bpf: Fix bpf_prog_pack when PMU_SIZE is not defined bpf: Fix bpf_prog_pack for multi-node setup bpf: Fix warning for cast from restricted gfp_t in verifier bpf, arm: Fix various typos in comments libbpf: Close fd in bpf_object__reuse_map bpftool: Fix print error when show bpf map bpf: Fix kprobe_multi return probe backtrace Revert "bpf: Add support to inline bpf_get_func_ip helper on x86" bpf: Simplify check in btf_parse_hdr() selftests/bpf/test_lirc_mode2.sh: Exit with proper code bpf: Check for NULL return from bpf_get_btf_vmlinux selftests/bpf: Test skipping stacktrace bpf: Adjust BPF stack helper functions to accommodate skip > 0 bpf: Select proper size for bpf_prog_pack ... ==================== Link: https://lore.kernel.org/r/20220322050159.5507-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-10bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macroMartin KaFai Lau
This patch removes the TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macros. Instead, PKT_VLAN_PRESENT_OFFSET is used because all of them are at the same offset. Comment is added to make it clear that changing the position of tc_at_ingress or mono_delivery_time will require to adjust the defined macros. The earlier discussion can be found here: https://lore.kernel.org/bpf/419d994e-ff61-7c11-0ec7-11fefcb0186e@iogearbox.net/ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220309090450.3710955-1-kafai@fb.com
2022-03-09skb: make drop reason booleanableJakub Kicinski
We have a number of cases where function returns drop/no drop decision as a boolean. Now that we want to report the reason code as well we have to pass extra output arguments. We can make the reason code evaluate correctly as bool. I believe we're good to reorder the reasons as they are reported to user space as strings. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-06net: tun: track dropped skb via kfree_skb_reason()Dongli Zhang
The TUN can be used as vhost-net backend. E.g, the tun_net_xmit() is the interface to forward the skb from TUN to vhost-net/virtio-net. However, there are many "goto drop" in the TUN driver. Therefore, the kfree_skb_reason() is involved at each "goto drop" to help userspace ftrace/ebpf to track the reason for the loss of packets. The below reasons are introduced: - SKB_DROP_REASON_DEV_READY - SKB_DROP_REASON_NOMEM - SKB_DROP_REASON_HDR_TRUNC - SKB_DROP_REASON_TAP_FILTER - SKB_DROP_REASON_TAP_TXFILTER Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joe Jin <joe.jin@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-06net: tap: track dropped skb via kfree_skb_reason()Dongli Zhang
The TAP can be used as vhost-net backend. E.g., the tap_handle_frame() is the interface to forward the skb from TAP to vhost-net/virtio-net. However, there are many "goto drop" in the TAP driver. Therefore, the kfree_skb_reason() is involved at each "goto drop" to help userspace ftrace/ebpf to track the reason for the loss of packets. The below reasons are introduced: - SKB_DROP_REASON_SKB_CSUM - SKB_DROP_REASON_SKB_GSO_SEG - SKB_DROP_REASON_SKB_UCOPY_FAULT - SKB_DROP_REASON_DEV_HDR - SKB_DROP_REASON_FULL_RING Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joe Jin <joe.jin@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04net: dev: use kfree_skb_reason() for __netif_receive_skb_core()Menglong Dong
Add reason for skb drops to __netif_receive_skb_core() when packet_type not found to handle the skb. For this purpose, the drop reason SKB_DROP_REASON_PTYPE_ABSENT is introduced. Take ether packets for example, this case mainly happens when L3 protocol is not supported. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04net: dev: use kfree_skb_reason() for sch_handle_ingress()Menglong Dong
Replace kfree_skb() used in sch_handle_ingress() with kfree_skb_reason(). Following drop reasons are introduced: SKB_DROP_REASON_TC_INGRESS Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04net: dev: use kfree_skb_reason() for do_xdp_generic()Menglong Dong
Replace kfree_skb() used in do_xdp_generic() with kfree_skb_reason(). The drop reason SKB_DROP_REASON_XDP is introduced for this case. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04net: dev: use kfree_skb_reason() for enqueue_to_backlog()Menglong Dong
Replace kfree_skb() used in enqueue_to_backlog() with kfree_skb_reason(). The skb rop reason SKB_DROP_REASON_CPU_BACKLOG is introduced for the case of failing to enqueue the skb to the per CPU backlog queue. The further reason can be backlog queue full or RPS flow limition, and I think we needn't to make further distinctions. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04net: dev: add skb drop reasons to __dev_xmit_skb()Menglong Dong
Add reasons for skb drops to __dev_xmit_skb() by replacing kfree_skb_list() with kfree_skb_list_reason(). The drop reason of SKB_DROP_REASON_QDISC_DROP is introduced for qdisc enqueue fails. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04net: skb: introduce the function kfree_skb_list_reason()Menglong Dong
To report reasons of skb drops, introduce the function kfree_skb_list_reason() and make kfree_skb_list() an inline call to it. This function will be used in the next commit in __dev_xmit_skb(). Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04net: dev: use kfree_skb_reason() for sch_handle_egress()Menglong Dong
Replace kfree_skb() used in sch_handle_egress() with kfree_skb_reason(). The drop reason SKB_DROP_REASON_TC_EGRESS is introduced. Considering the code path of tc egerss, we make it distinct with the drop reason of SKB_DROP_REASON_QDISC_DROP in the next commit. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-03bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingressMartin KaFai Lau
The current tc-bpf@ingress reads and writes the __sk_buff->tstamp as a (rcv) timestamp which currently could either be 0 (not available) or ktime_get_real(). This patch is to backward compatible with the (rcv) timestamp expectation at ingress. If the skb->tstamp has the delivery_time, the bpf insn rewrite will read 0 for tc-bpf running at ingress as it is not available. When writing at ingress, it will also clear the skb->mono_delivery_time bit. /* BPF_READ: a = __sk_buff->tstamp */ if (!skb->tc_at_ingress || !skb->mono_delivery_time) a = skb->tstamp; else a = 0 /* BPF_WRITE: __sk_buff->tstamp = a */ if (skb->tc_at_ingress) skb->mono_delivery_time = 0; skb->tstamp = a; [ A note on the BPF_CGROUP_INET_INGRESS which can also access skb->tstamp. At that point, the skb is delivered locally and skb_clear_delivery_time() has already been done, so the skb->tstamp will only have the (rcv) timestamp. ] If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time has to be cleared also. It could be done together during convert_ctx_access(). However, the latter patch will also expose the skb->mono_delivery_time bit as __sk_buff->delivery_time_type. Changing the delivery_time_type in the background may surprise the user, e.g. the 2nd read on __sk_buff->delivery_time_type may need a READ_ONCE() to avoid compiler optimization. Thus, in expecting the needs in the latter patch, this patch does a check on !skb->tstamp after running the tc-bpf and clears the skb->mono_delivery_time bit if needed. The earlier discussion on v4 [0]. The bpf insn rewrite requires the skb's mono_delivery_time bit and tc_at_ingress bit. They are moved up in sk_buff so that bpf rewrite can be done at a fixed offset. tc_skip_classify is moved together with tc_at_ingress. To get one bit for mono_delivery_time, csum_not_inet is moved down and this bit is currently used by sctp. [0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>