summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-03-28libbpf, xsk: Init all ring members in xsk_umem__create and xsk_socket__createFletcher Dunn
Fix a sharp edge in xsk_umem__create and xsk_socket__create. Almost all of the members of the ring buffer structs are initialized, but the "cached_xxx" variables are not all initialized. The caller is required to zero them. This is needlessly dangerous. The results if you don't do it can be very bad. For example, they can cause xsk_prod_nb_free and xsk_cons_nb_avail to return values greater than the size of the queue. xsk_ring_cons__peek can return an index that does not refer to an item that has been queued. I have confirmed that without this change, my program misbehaves unless I memset the ring buffers to zero before calling the function. Afterwards, my program works without (or with) the memset. Signed-off-by: Fletcher Dunn <fletcherd@valvesoftware.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/85f12913cde94b19bfcb598344701c38@valvesoftware.com
2020-03-27Merge branch 'cgroup-helpers'Alexei Starovoitov
Daniel Borkmann says: ==================== This adds various straight-forward helper improvements and additions to BPF cgroup based connect(), sendmsg(), recvmsg() and bind-related hooks which would allow to implement more fine-grained policies and improve current load balancer limitations we're seeing. For details please see individual patches. I've tested them on Kubernetes & Cilium and also added selftests for the small verifier extension. Thanks! ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-27bpf: Add selftest cases for ctx_or_null argument typeDaniel Borkmann
Add various tests to make sure the verifier keeps catching them: # ./test_verifier [...] #230/p pass ctx or null check, 1: ctx OK #231/p pass ctx or null check, 2: null OK #232/p pass ctx or null check, 3: 1 OK #233/p pass ctx or null check, 4: ctx - const OK #234/p pass ctx or null check, 5: null (connect) OK #235/p pass ctx or null check, 6: null (bind) OK #236/p pass ctx or null check, 7: ctx (bind) OK #237/p pass ctx or null check, 8: null (bind) OK [...] Summary: 1595 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/c74758d07b1b678036465ef7f068a49e9efd3548.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable retrival of pid/tgid/comm from bpf cgroup hooksDaniel Borkmann
We already have the bpf_get_current_uid_gid() helper enabled, and given we now have perf event RB output available for connect(), sendmsg(), recvmsg() and bind-related hooks, add a trivial change to enable bpf_get_current_pid_tgid() and bpf_get_current_comm() as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/18744744ed93c06343be8b41edcfd858706f39d7.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor idDaniel Borkmann
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(), recvmsg() and bind-related hooks in order to retrieve the cgroup v2 context which can then be used as part of the key for BPF map lookups, for example. Given these hooks operate in process context 'current' is always valid and pointing to the app that is performing mentioned syscalls if it's subject to a v2 cgroup. Also with same motivation of commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper") enable retrieval of ancestor from current so the cgroup id can be used for policy lookups which can then forbid connect() / bind(), for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Allow to retrieve cgroup v1 classid from v2 hooksDaniel Borkmann
Today, Kubernetes is still operating on cgroups v1, however, it is possible to retrieve the task's classid based on 'current' out of connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators which attach to the root cgroup v2 hook in a mixed env like in case of Cilium, for example, in order to then correlate certain pod traffic and use it as part of the key for BPF map lookups. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Add netns cookie and enable it for bpf cgroup hooksDaniel Borkmann
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness. In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between initial network namespaces and non-initial one. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace as one example. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not xlated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic. On top of determining whether we're in initial or non-initial network namespace we also have a need for a socket-cookie like mechanism for network namespaces scope. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services. Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well as we would see need. (*) Both externalTrafficPolicy={Local|Cluster} types [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable perf event rb output for bpf cgroup progsDaniel Borkmann
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks are all lacking perf event rb output in order to push notifications or monitoring events up to user space. Back in commit a5a3a828cd00 ("bpf: add perf event notificaton support for sock_ops"), I've worked with Sowmini to enable them for sock_ops where the context part is not used (as opposed to skbs for example where the packet data can be appended). Make the bpf_sockopt_event_output() helper generic and enable it for mentioned hooks. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable retrieval of socket cookie for bind/post-bind hookDaniel Borkmann
We currently make heavy use of the socket cookie in BPF's connect(), sendmsg() and recvmsg() hooks for load-balancing decisions. However, it is currently not enabled/implemented in BPF {post-}bind hooks where it can later be used in combination for correlation in the tc egress path, for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/e9d71f310715332f12d238cc650c1edc5be55119.1585323121.git.daniel@iogearbox.net
2020-03-26bpf: Remove unused vairable 'bpf_xdp_link_lops'YueHaibing
kernel/bpf/syscall.c:2263:34: warning: 'bpf_xdp_link_lops' defined but not used [-Wunused-const-variable=] static const struct bpf_link_ops bpf_xdp_link_lops; ^~~~~~~~~~~~~~~~~ commit 70ed506c3bbc ("bpf: Introduce pinnable bpf_link abstraction") involded this unused variable, remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200326031613.19372-1-yuehaibing@huawei.com
2020-03-26bpf: Factor out attach_type to prog_type mapping for attach/detachAndrii Nakryiko
Factor out logic mapping expected program attach type to program type and subsequent handling of program attach/detach. Also list out all supported cgroup BPF program types explicitly to prevent accidental bugs once more program types are added to a mapping. Do the same for prog_query API. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200325065746.640559-3-andriin@fb.com
2020-03-26bpf: Factor out cgroup storages operationsAndrii Nakryiko
Refactor cgroup attach/detach code to abstract away common operations performed on all types of cgroup storages. This makes high-level logic more apparent, plus allows to reuse more code across multiple functions. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200325065746.640559-2-andriin@fb.com
2020-03-25bpf: Test_verifier, #70 error message updates for 32-bit right shiftJohn Fastabend
After changes to add update_reg_bounds after ALU ops and adding ALU32 bounds tracking the error message is changed in the 32-bit right shift tests. Test "#70/u bounds check after 32-bit right shift with 64-bit input FAIL" now fails with, Unexpected error message! EXP: R0 invalid mem access RES: func#0 @0 7: (b7) r1 = 2 8: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP2 R10=fp0 fp-8_w=mmmmmmmm 8: (67) r1 <<= 31 9: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP4294967296 R10=fp0 fp-8_w=mmmmmmmm 9: (74) w1 >>= 31 10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP0 R10=fp0 fp-8_w=mmmmmmmm 10: (14) w1 -= 2 11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP4294967294 R10=fp0 fp-8_w=mmmmmmmm 11: (0f) r0 += r1 math between map_value pointer and 4294967294 is not allowed And test "#70/p bounds check after 32-bit right shift with 64-bit input FAIL" now fails with, Unexpected error message! EXP: R0 invalid mem access RES: func#0 @0 7: (b7) r1 = 2 8: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv2 R10=fp0 fp-8_w=mmmmmmmm 8: (67) r1 <<= 31 9: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967296 R10=fp0 fp-8_w=mmmmmmmm 9: (74) w1 >>= 31 10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv0 R10=fp0 fp-8_w=mmmmmmmm 10: (14) w1 -= 2 11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967294 R10=fp0 fp-8_w=mmmmmmmm 11: (0f) r0 += r1 last_idx 11 first_idx 0 regs=2 stack=0 before 10: (14) w1 -= 2 regs=2 stack=0 before 9: (74) w1 >>= 31 regs=2 stack=0 before 8: (67) r1 <<= 31 regs=2 stack=0 before 7: (b7) r1 = 2 math between map_value pointer and 4294967294 is not allowed Before this series we did not trip the "math between map_value pointer..." error because check_reg_sane_offset is never called in adjust_ptr_min_max_vals(). Instead we have a register state that looks like this at line 11*, 11: R0_w=map_value(id=0,off=0,ks=8,vs=8, smin_value=0,smax_value=0, umin_value=0,umax_value=0, var_off=(0x0; 0x0)) R1_w=invP(id=0, smin_value=0,smax_value=4294967295, umin_value=0,umax_value=4294967295, var_off=(0xfffffffe; 0x0)) R10=fp(id=0,off=0, smin_value=0,smax_value=0, umin_value=0,umax_value=0, var_off=(0x0; 0x0)) fp-8_w=mmmmmmmm 11: (0f) r0 += r1 In R1 'smin_val != smax_val' yet we have a tnum_const as seen by 'var_off(0xfffffffe; 0x0))' with a 0x0 mask. So we hit this check in adjust_ptr_min_max_vals() if ((known && (smin_val != smax_val || umin_val != umax_val)) || smin_val > smax_val || umin_val > umax_val) { /* Taint dst register if offset had invalid bounds derived from * e.g. dead branches. */ __mark_reg_unknown(env, dst_reg); return 0; } So we don't throw an error here and instead only throw an error later in the verification when the memory access is made. The root cause in verifier without alu32 bounds tracking is having 'umin_value = 0' and 'umax_value = U64_MAX' from BPF_SUB which we set when 'umin_value < umax_val' here, if (dst_reg->umin_value < umax_val) { /* Overflow possible, we know nothing */ dst_reg->umin_value = 0; dst_reg->umax_value = U64_MAX; } else { ...} Later in adjust_calar_min_max_vals we previously did a coerce_reg_to_size() which will clamp the U64_MAX to U32_MAX by truncating to 32bits. But either way without a call to update_reg_bounds the less precise bounds tracking will fall out of the alu op verification. After latest changes we now exit adjust_scalar_min_max_vals with the more precise umin value, due to zero extension propogating bounds from alu32 bounds into alu64 bounds and then calling update_reg_bounds. This then causes the verifier to trigger an earlier error and we get the error in the output above. This patch updates tests to reflect new error message. * I have a local patch to print entire verifier state regardless if we believe it is a constant so we can get a full picture of the state. Usually if tnum_is_const() then bounds are also smin=smax, etc. but this is not always true and is a bit subtle. Being able to see these states helps understand dataflow imo. Let me know if we want something similar upstream. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158507161475.15666.3061518385241144063.stgit@john-Precision-5820-Tower
2020-03-25bpf: Verifer, adjust_scalar_min_max_vals to always call update_reg_bounds()John Fastabend
Currently, for all op verification we call __red_deduce_bounds() and __red_bound_offset() but we only call __update_reg_bounds() in bitwise ops. However, we could benefit from calling __update_reg_bounds() in BPF_ADD, BPF_SUB, and BPF_MUL cases as well. For example, a register with state 'R1_w=invP0' when we subtract from it, w1 -= 2 Before coerce we will now have an smin_value=S64_MIN, smax_value=U64_MAX and unsigned bounds umin_value=0, umax_value=U64_MAX. These will then be clamped to S32_MIN, U32_MAX values by coerce in the case of alu32 op as done in above example. However tnum will be a constant because the ALU op is done on a constant. Without update_reg_bounds() we have a scenario where tnum is a const but our unsigned bounds do not reflect this. By calling update_reg_bounds after coerce to 32bit we further refine the umin_value to U64_MAX in the alu64 case or U32_MAX in the alu32 case above. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158507151689.15666.566796274289413203.stgit@john-Precision-5820-Tower
2020-03-25bpf: Verifer, refactor adjust_scalar_min_max_valsJohn Fastabend
Pull per op ALU logic into individual functions. We are about to add u32 versions of each of these by pull them out the code gets a bit more readable here and nicer in the next patch. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158507149518.15666.15672349629329072411.stgit@john-Precision-5820-Tower
2020-03-26libbpf: Don't allocate 16M for log buffer by defaultStanislav Fomichev
For each prog/btf load we allocate and free 16 megs of verifier buffer. On production systems it doesn't really make sense because the programs/btf have gone through extensive testing and (mostly) guaranteed to successfully load. Let's assume successful case by default and skip buffer allocation on the first try. If there is an error, start with BPF_LOG_BUF_SIZE and double it on each ENOSPC iteration. v3: * Return -ENOMEM when can't allocate log buffer (Andrii Nakryiko) v2: * Don't allocate the buffer at all on the first try (Andrii Nakryiko) Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200325195521.112210-1-sdf@google.com
2020-03-26libbpf: Remove unused parameter `def` to get_map_field_intTobias Klauser
Has been unused since commit ef99b02b23ef ("libbpf: capture value in BTF type info for BTF-defined map defs"). Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200325113655.19341-1-tklauser@distanz.ch
2020-03-26bpf: Document bpf_inspect drgn toolAndrey Ignatov
It's a follow-up for discussion in [1]. drgn tool bpf_inspect.py was merged to drgn repo in [2]. Document it in kernel tree to make BPF developers aware that the tool exists and can help with getting BPF state unavailable via UAPI. For now it's just one tool but the doc is written in a way that allows to cover more tools in the future if needed. Please refer to the doc itself for more details. The patch was tested by `make htmldocs` and sanity-checking that resulting html looks good. v2 -> v3: - two sections: "Description" and "Getting started" (Daniel); - add examples in "Getting started" section (Daniel); - add "Customization" section to show how tool can be customized. v1 -> v2: - better "BPF drgn tools" section (Alexei) [1] https://lore.kernel.org/bpf/20200228201514.GB51456@rdna-mbp/T/#mefed65e8a98116bd5d07d09a570a3eac46724951 [2] https://github.com/osandov/drgn/pull/49 Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200324185135.1431038-1-rdna@fb.com
2020-03-23samples, bpf: Refactor perf_event user program with libbpf bpf_linkDaniel T. Lee
The bpf_program__attach of libbpf(using bpf_link) is much more intuitive than the previous method using ioctl. bpf_program__attach_perf_event manages the enable of perf_event and attach of BPF programs to it, so there's no neeed to do this directly with ioctl. In addition, bpf_link provides consistency in the use of API because it allows disable (detach, destroy) for multiple events to be treated as one bpf_link__destroy. Also, bpf_link__destroy manages the close() of perf_event fd. This commit refactors samples that attach the bpf program to perf_event by using libbbpf instead of ioctl. Also the bpf_load in the samples were removed and migrated to use libbbpf API. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200321100424.1593964-3-danieltimlee@gmail.com
2020-03-23samples, bpf: Move read_trace_pipe to trace_helpersDaniel T. Lee
To reduce the reliance of trace samples (trace*_user) on bpf_load, move read_trace_pipe to trace_helpers. By moving this bpf_loader helper elsewhere, trace functions can be easily migrated to libbbpf. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200321100424.1593964-2-danieltimlee@gmail.com
2020-03-23bpf: Add tests for bpf_sk_storage to bpf_tcp_caMartin KaFai Lau
This patch adds test to exercise the bpf_sk_storage_get() and bpf_sk_storage_delete() helper from the bpf_dctcp.c. The setup and check on the sk_storage is done immediately before and after the connect(). This patch also takes this chance to move the pthread_create() after the connect() has been done. That will remove the need of the "wait_thread" label. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320152107.2169904-1-kafai@fb.com
2020-03-23bpf: Add bpf_sk_storage support to bpf_tcp_caMartin KaFai Lau
This patch adds bpf_sk_storage_get() and bpf_sk_storage_delete() helper to the bpf_tcp_ca's struct_ops. That would allow bpf-tcp-cc to: 1) share sk private data with other bpf progs. 2) use bpf_sk_storage as a private storage for a bpf-tcp-cc if the existing icsk_ca_priv is not big enough. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320152101.2169498-1-kafai@fb.com
2020-03-20selftests/bpf: Fix mix of tabs and spacesBill Wendling
Clang's -Wmisleading-indentation warns about misleading indentations if there's a mixture of spaces and tabs. Remove extraneous spaces. Signed-off-by: Bill Wendling <morbo@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200320201510.217169-1-morbo@google.com
2020-03-20bpf, tcp: Make tcp_bpf_recvmsg staticYueHaibing
After commit f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP"), tcp_bpf_recvmsg() is not used out of tcp_bpf.c, so make it static and remove it from tcp.h. Also move it to BPF_STREAM_PARSER #ifdef to fix unused function warnings. Fixes: f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320023426.60684-3-yuehaibing@huawei.com
2020-03-20bpf, tcp: Fix unused function warningsYueHaibing
If BPF_STREAM_PARSER is not set, gcc warns: net/ipv4/tcp_bpf.c:483:12: warning: 'tcp_bpf_sendpage' defined but not used [-Wunused-function] net/ipv4/tcp_bpf.c:395:12: warning: 'tcp_bpf_sendmsg' defined but not used [-Wunused-function] net/ipv4/tcp_bpf.c:13:13: warning: 'tcp_bpf_stream_read' defined but not used [-Wunused-function] Moves the unused functions into the #ifdef CONFIG_BPF_STREAM_PARSER. Fixes: f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320023426.60684-2-yuehaibing@huawei.com
2020-03-20bpftool: Add struct_ops supportMartin KaFai Lau
This patch adds struct_ops support to the bpftool. To recap a bit on the recent bpf_struct_ops feature on the kernel side: It currently supports "struct tcp_congestion_ops" to be implemented in bpf. At a high level, bpf_struct_ops is struct_ops map populated with a number of bpf progs. bpf_struct_ops currently supports the "struct tcp_congestion_ops". However, the bpf_struct_ops design is generic enough that other kernel struct ops can be supported in the future. Although struct_ops is map+progs at a high lever, there are differences in details. For example, 1) After registering a struct_ops, the struct_ops is held by the kernel subsystem (e.g. tcp-cc). Thus, there is no need to pin a struct_ops map or its progs in order to keep them around. 2) To iterate all struct_ops in a system, it iterates all maps in type BPF_MAP_TYPE_STRUCT_OPS. BPF_MAP_TYPE_STRUCT_OPS is the current usual filter. In the future, it may need to filter by other struct_ops specific properties. e.g. filter by tcp_congestion_ops or other kernel subsystem ops in the future. 3) struct_ops requires the running kernel having BTF info. That allows more flexibility in handling other kernel structs. e.g. it can always dump the latest bpf_map_info. 4) Also, "struct_ops" command is not intended to repeat all features already provided by "map" or "prog". For example, if there really is a need to pin the struct_ops map, the user can use the "map" cmd to do that. While the first attempt was to reuse parts from map/prog.c, it ended up not a lot to share. The only obvious item is the map_parse_fds() but that still requires modifications to accommodate struct_ops map specific filtering (for the immediate and the future needs). Together with the earlier mentioned differences, it is better to part away from map/prog.c. The initial set of subcmds are, register, unregister, show, and dump. For register, it registers all struct_ops maps that can be found in an obj file. Option can be added in the future to specify a particular struct_ops map. Also, the common bpf_tcp_cc is stateless (e.g. bpf_cubic.c and bpf_dctcp.c). The "reuse map" feature is not implemented in this patch and it can be considered later also. For other subcmds, please see the man doc for details. A sample output of dump: [root@arch-fb-vm1 bpf]# bpftool struct_ops dump name cubic [{ "bpf_map_info": { "type": 26, "id": 64, "key_size": 4, "value_size": 256, "max_entries": 1, "map_flags": 0, "name": "cubic", "ifindex": 0, "btf_vmlinux_value_type_id": 18452, "netns_dev": 0, "netns_ino": 0, "btf_id": 52, "btf_key_type_id": 0, "btf_value_type_id": 0 } },{ "bpf_struct_ops_tcp_congestion_ops": { "refcnt": { "refs": { "counter": 1 } }, "state": "BPF_STRUCT_OPS_STATE_INUSE", "data": { "list": { "next": 0, "prev": 0 }, "key": 0, "flags": 0, "init": "void (struct sock *) bictcp_init/prog_id:138", "release": "void (struct sock *) 0", "ssthresh": "u32 (struct sock *) bictcp_recalc_ssthresh/prog_id:141", "cong_avoid": "void (struct sock *, u32, u32) bictcp_cong_avoid/prog_id:140", "set_state": "void (struct sock *, u8) bictcp_state/prog_id:142", "cwnd_event": "void (struct sock *, enum tcp_ca_event) bictcp_cwnd_event/prog_id:139", "in_ack_event": "void (struct sock *, u32) 0", "undo_cwnd": "u32 (struct sock *) tcp_reno_undo_cwnd/prog_id:144", "pkts_acked": "void (struct sock *, const struct ack_sample *) bictcp_acked/prog_id:143", "min_tso_segs": "u32 (struct sock *) 0", "sndbuf_expand": "u32 (struct sock *) 0", "cong_control": "void (struct sock *, const struct rate_sample *) 0", "get_info": "size_t (struct sock *, u32, int *, union tcp_cc_info *) 0", "name": "bpf_cubic", "owner": 0 } } } ] Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200318171656.129650-1-kafai@fb.com
2020-03-20bpftool: Translate prog_id to its bpf prog_nameMartin KaFai Lau
The kernel struct_ops obj has kernel's func ptrs implemented by bpf_progs. The bpf prog_id is stored as the value of the func ptr for introspection purpose. In the latter patch, a struct_ops dump subcmd will be added to introspect these func ptrs. It is desired to print the actual bpf prog_name instead of only printing the prog_id. Since struct_ops is the only usecase storing prog_id in the func ptr, this patch adds a prog_id_as_func_ptr bool (default is false) to "struct btf_dumper" in order not to mis-interpret the ptr value for the other existing use-cases. While printing a func_ptr as a bpf prog_name, this patch also prefix the bpf prog_name with the ptr's func_proto. [ Note that it is the ptr's func_proto instead of the bpf prog's func_proto ] It reuses the current btf_dump_func() to obtain the ptr's func_proto string. Here is an example from the bpf_cubic.c: "void (struct sock *, u32, u32) bictcp_cong_avoid/prog_id:140" Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200318171650.129252-1-kafai@fb.com
2020-03-20bpftool: Print as a string for char arrayMartin KaFai Lau
A char[] is currently printed as an integer array. This patch will print it as a string when 1) The array element type is an one byte int 2) The array element type has a BTF_INT_CHAR encoding or the array element type's name is "char" 3) All characters is between (0x1f, 0x7f) and it is terminated by a null character. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200318171643.129021-1-kafai@fb.com
2020-03-20bpftool: Print the enum's name instead of valueMartin KaFai Lau
This patch prints the enum's name if there is one found in the array of btf_enum. The commit 9eea98497951 ("bpf: fix BTF verification of enums") has details about an enum could have any power-of-2 size (up to 8 bytes). This patch also takes this chance to accommodate these non 4 byte enums. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200318171637.128862-1-kafai@fb.com
2020-03-19bpf: Support llvm-objcopy for vmlinux BTFFangrui Song
Simplify gen_btf logic to make it work with llvm-objcopy. The existing 'file format' and 'architecture' parsing logic is brittle and does not work with llvm-objcopy/llvm-objdump. 'file format' output of llvm-objdump>=11 will match GNU objdump, but 'architecture' (bfdarch) may not. .BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag because it is part of vmlinux image used for introspection. C code can reference the section via linker script defined __start_BTF and __stop_BTF. This fixes a small problem that previous .BTF had the SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data). Additionally, `objcopy -I binary` synthesized symbols _binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not used elsewhere) are replaced with more commonplace __start_BTF and __stop_BTF. Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns "empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?" We use a dd command to change the e_type field in the ELF header from ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting ET_EXEC as an input file is an extremely rare GNU ld feature that lld does not intend to support, because this is error-prone. The output section description .BTF in include/asm-generic/vmlinux.lds.h avoids potential subtle orphan section placement issues and suppresses --orphan-handling=warn warnings. Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux") Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Stanislav Fomichev <sdf@google.com> Tested-by: Andrii Nakryiko <andriin@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://github.com/ClangBuiltLinux/linux/issues/871 Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com
2020-03-17bpf, libbpf: Fix ___bpf_kretprobe_args1(x) macro definitionWenbo Zhang
Use PT_REGS_RC instead of PT_REGS_RET to get ret correctly. Fixes: df8ff35311c8 ("libbpf: Merge selftests' bpf_trace_helpers.h into libbpf's bpf_tracing.h") Signed-off-by: Wenbo Zhang <ethercflow@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200315083252.22274-1-ethercflow@gmail.com
2020-03-17selftests/bpf: Reset process and thread affinity after each test/sub-testAndrii Nakryiko
Some tests and sub-tests are setting "custom" thread/process affinity and don't reset it back. Instead of requiring each test to undo all this, ensure that thread affinity is restored by test_progs test runner itself. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200314013932.4035712-3-andriin@fb.com
2020-03-17selftests/bpf: Fix test_progs's parsing of test numbersAndrii Nakryiko
When specifying disjoint set of tests, test_progs doesn't set skipped test's array elements to false. This leads to spurious execution of tests that should have been skipped. Fix it by explicitly initializing them to false. Fixes: 3a516a0a3a7b ("selftests/bpf: add sub-tests support for test_progs") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200314013932.4035712-2-andriin@fb.com
2020-03-17selftests/bpf: Fix race in tcp_rtt testAndrii Nakryiko
Previous attempt to make tcp_rtt more robust introduced a new race, in which server_done might be set to true before server can actually accept any connection. Fix this by unconditionally waiting for accept(). Given socket is non-blocking, if there are any problems with client side, it should eventually close listening FD and let server thread exit with failure. Fixes: 4cd729fa022c ("selftests/bpf: Make tcp_rtt test more robust to failures") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200314013932.4035712-1-andriin@fb.com
2020-03-17selftests/bpf: Fix nanosleep for real this timeAndrii Nakryiko
Amazingly, some libc implementations don't call __NR_nanosleep syscall from their nanosleep() APIs. Hammer it down with explicit syscall() call and never get back to it again. Also simplify code for timespec initialization. I verified that nanosleep is called w/ printk and in exactly same Linux image that is used in Travis CI. So it should both sleep and call correct syscall. v1->v2: - math is too hard, fix usec -> nsec convertion (Martin); - test_vmlinux has explicit nanosleep() call, convert that one as well. Fixes: 4e1fd25d19e8 ("selftests/bpf: Fix usleep() implementation") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200314002743.3782677-1-andriin@fb.com
2020-03-17selftest/bpf: Fix compilation warning in sockmap_parse_prog.cAndrii Nakryiko
Remove unused len variable, which causes compilation warnings. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200314001834.3727680-1-andriin@fb.com
2020-03-16sfc: fix XDP-redirect in this driverJesper Dangaard Brouer
XDP-redirect is broken in this driver sfc. XDP_REDIRECT requires tailroom for skb_shared_info when creating an SKB based on the redirected xdp_frame (both in cpumap and veth). The fix requires some initial explaining. The driver uses RX page-split when possible. It reserves the top 64 bytes in the RX-page for storing dma_addr (struct efx_rx_page_state). It also have the XDP recommended headroom of XDP_PACKET_HEADROOM (256 bytes). As it doesn't reserve any tailroom, it can still fit two standard MTU (1500) frames into one page. The sizeof struct skb_shared_info in 320 bytes. Thus drivers like ixgbe and i40e, reduce their XDP headroom to 192 bytes, which allows them to fit two frames with max 1536 bytes into a 4K page (192+1536+320=2048). The fix is to reduce this drivers headroom to 128 bytes and add the 320 bytes tailroom. This account for reserved top 64 bytes in the page, and still fit two frame in a page for normal MTUs. We must never go below 128 bytes of headroom for XDP, as one cacheline is for xdp_frame area and next cacheline is reserved for metadata area. Fixes: eb9a36be7f3e ("sfc: perform XDP processing on received packets") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16remoteproc: clean up notification configAlex Elder
Rearrange the config files for remoteproc and IPA to fix their interdependencies. First, have CONFIG_QCOM_Q6V5_MSS select QCOM_Q6V5_IPA_NOTIFY so the notification code is built regardless of whether IPA needs it. Next, represent QCOM_IPA as being dependent on QCOM_Q6V5_MSS rather than setting its value to match QCOM_Q6V5_COMMON (which is selected by QCOM_Q6V5_MSS). Drop all dependencies from QCOM_Q6V5_IPA_NOTIFY. The notification code will be built whenever QCOM_Q6V5_MSS is set, and it has no other dependencies. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16net: kcm: kcmproc.c: Fix RCU list suspicious usage warningMadhuparna Bhowmik
This path fixes the suspicious RCU usage warning reported by kernel test robot. net/kcm/kcmproc.c:#RCU-list_traversed_in_non-reader_section There is no need to use list_for_each_entry_rcu() in kcm_stats_seq_show() as the list is always traversed under knet->mutex held. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16qede: remove some unused code in function qede_selftest_receive_trafficZheng Zengkai
Remove set but not used variables 'sw_comp_cons' and 'hw_comp_cons' to fix gcc '-Wunused-but-set-variable' warning: drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic: drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:20: warning: variable sw_comp_cons set but not used [-Wunused-but-set-variable] drivers/net/ethernet/qlogic/qede/qede_ethtool.c: In function qede_selftest_receive_traffic: drivers/net/ethernet/qlogic/qede/qede_ethtool.c:1569:6: warning: variable hw_comp_cons set but not used [-Wunused-but-set-variable] After removing 'hw_comp_cons',the memory barrier 'rmb()' and its comments become useless, so remove them as well. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16net: sched: set the hw_stats_type in pedit loopJiri Pirko
For a single pedit action, multiple offload entries may be used. Set the hw_stats_type to all of them. Fixes: 44f865801741 ("sched: act: allow user to specify type of HW stats for a filter") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16Merge branch 'net-stmmac-Use-readl_poll_timeout-to-simplify-the-code'David S. Miller
Dejin Zheng says: ==================== net: stmmac: Use readl_poll_timeout() to simplify the code This patch sets just for replace the open-coded loop to the readl_poll_timeout() helper macro for simplify the code in stmmac driver. v2 -> v3: - return whatever error code by readl_poll_timeout() returned. v1 -> v2: - no changed. I am a newbie and sent this patch a month ago (February 6th). So far, I have not received any comments or suggestion. I think it may be lost somewhere in the world, so resend it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16net: stmmac: use readl_poll_timeout() function in dwmac4_dma_reset()Dejin Zheng
The dwmac4_dma_reset() function use an open coded of readl_poll_timeout(). Replace the open coded handling with the proper function. Signed-off-by: Dejin Zheng <zhengdejin5@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16net: stmmac: use readl_poll_timeout() function in init_systime()Dejin Zheng
The init_systime() function use an open coded of readl_poll_timeout(). Replace the open coded handling with the proper function. Signed-off-by: Dejin Zheng <zhengdejin5@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16chcr: remove set but not used variable 'status'YueHaibing
drivers/crypto/chelsio/chcr_ktls.c: In function chcr_ktls_cpl_set_tcb_rpl: drivers/crypto/chelsio/chcr_ktls.c:662:11: warning: variable status set but not used [-Wunused-but-set-variable] commit 8a30923e1598 ("cxgb4/chcr: Save tx keys and handle HW response") involved this unused variable, remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)Era Mayflower
Netlink support of extended packet number cipher suites, allows adding and updating XPN macsec interfaces. Added support in: * Creating interfaces with GCM-AES-XPN-128 and GCM-AES-XPN-256 suites. * Setting and getting 64bit packet numbers with of SAs. * Setting (only on SA creation) and getting ssci of SAs. * Setting salt when installing a SAK. Added 2 cipher suite identifiers according to 802.1AE-2018 table 14-1: * MACSEC_CIPHER_ID_GCM_AES_XPN_128 * MACSEC_CIPHER_ID_GCM_AES_XPN_256 In addition, added 2 new netlink attribute types: * MACSEC_SA_ATTR_SSCI * MACSEC_SA_ATTR_SALT Depends on: macsec: Support XPN frame handling - IEEE 802.1AEbw. Signed-off-by: Era Mayflower <mayflowerera@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16macsec: Support XPN frame handling - IEEE 802.1AEbwEra Mayflower
Support extended packet number cipher suites (802.1AEbw) frames handling. This does not include the needed netlink patches. * Added xpn boolean field to `struct macsec_secy`. * Added ssci field to `struct_macsec_tx_sa` (802.1AE figure 10-5). * Added ssci field to `struct_macsec_rx_sa` (802.1AE figure 10-5). * Added salt field to `struct macsec_key` (802.1AE 10.7 NOTE 1). * Created pn_t type for easy access to lower and upper halves. * Created salt_t type for easy access to the "ssci" and "pn" parts. * Created `macsec_fill_iv_xpn` function to create IV in XPN mode. * Support in PN recovery and preliminary replay check in XPN mode. In addition, according to IEEE 802.1AEbw figure 10-5, the PN of incoming frame can be 0 when XPN cipher suite is used, so fixed the function `macsec_validate_skb` to fail on PN=0 only if XPN is off. Signed-off-by: Era Mayflower <mayflowerera@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15Merge branch 'net-dsa-improve-serdes-integration'David S. Miller
Russell King says: ==================== net: dsa: improve serdes integration Depends on "net: mii clause 37 helpers". Andrew Lunn mentioned that the Serdes PCS found in Marvell DSA switches does not automatically update the switch MACs with the link parameters. Currently, the DSA code implements a work-around for this. This series improves the Serdes integration, making use of the recent phylink changes to support split MAC/PCS setups. One noticable improvement for userspace is that ethtool can now report the link partner's advertisement. This repost has no changes compared to the previous posting; however, the regression Andrew had found which exists even without this patch set has now been fixed by Andrew and merged into the net-next tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: dsa: mv88e6xxx: use PHY_DETECT in mac_link_up/mac_link_downRussell King
Use the status of the PHY_DETECT bit to determine whether we need to force the MAC settings in mac_link_up() and mac_link_down(). Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: dsa: mv88e6xxx: remove port_link_state functionsRussell King
The port_link_state method is only used by mv88e6xxx_port_setup_mac(), which is now only called during port setup, rather than also being called via phylink's mac_config method. Remove this now unnecessary optimisation, which allows us to remove the port_link_state methods as well. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>