summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-05-04tools: bpftool: add simple perf event output readerJakub Kicinski
Users of BPF sooner or later discover perf_event_output() helpers and BPF_MAP_TYPE_PERF_EVENT_ARRAY. Dumping this array type is not possible, however, we can add simple reading of perf events. Create a new event_pipe subcommand for maps, this sub command will only work with BPF_MAP_TYPE_PERF_EVENT_ARRAY maps. Parts of the code from samples/bpf/trace_output_user.c. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04tools: bpftool: move get_possible_cpus() to common codeJakub Kicinski
Move the get_possible_cpus() function to shared code. No functional changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04tools: bpftool: fold hex keyword in command helpJakub Kicinski
Instead of spelling [hex] BYTES everywhere use DATA as keyword for generalized value. This will help us keep the messages concise when longer command are added in the future. It will also be useful once BTF support comes. We will only have to change the definition of DATA. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04nfp: bpf: rewrite map pointers with NFP TIDsJakub Kicinski
Kernel will now replace map fds with actual pointer before calling the offload prepare. We can identify those pointers and replace them with NFP table IDs instead of loading the table ID in code generated for CALL instruction. This allows us to support having the same CALL being used with different maps. Since we don't want to change the FW ABI we still need to move the TID from R1 to portion of R0 before the jump. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04nfp: bpf: perf event output helpers supportJakub Kicinski
Add support for the perf_event_output family of helpers. The implementation on the NFP will not match the host code exactly. The state of the host map and rings is unknown to the device, hence device can't return errors when rings are not installed. The device simply packs the data into a firmware notification message and sends it over to the host, returning success to the program. There is no notion of a host CPU on the device when packets are being processed. Device will only offload programs which set BPF_F_CURRENT_CPU. Still, if map index doesn't match CPU no error will be returned (see above). Dropped/lost firmware notification messages will not cause "lost events" event on the perf ring, they are only visible via device error counters. Firmware notification messages may also get reordered in respect to the packets which caused their generation. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: replace map pointer loads before calling into offloadsJakub Kicinski
Offloads may find host map pointers more useful than map fds. Map pointers can be used to identify the map, while fds are only valid within the context of loading process. Jump to skip_full_check on error in case verifier log overflow has to be handled (replace_map_fd_with_map_ptr() prints to the log, driver prep may do that too in the future). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: export bpf_event_output()Jakub Kicinski
bpf_event_output() is useful for offloads to add events to BPF event rings, export it. Note that export is placed near the stub since tracing is optional and kernel/bpf/core.c is always going to be built. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04nfp: bpf: record offload neutral maps in the driverJakub Kicinski
For asynchronous events originating from the device, like perf event output, we need to be able to make sure that objects being referred to by the FW message are valid on the host. FW events can get queued and reordered. Even if we had a FW message "barrier" we should still protect ourselves from bogus FW output. Add a reverse-mapping hash table and record in it all raw map pointers FW may refer to. Only record neutral maps, i.e. perf event arrays. These are currently the only objects FW can refer to. Use RCU protection on the read side, update side is under RTNL. Since program vs map destruction order is slightly painful for offload simply take an extra reference on all the recorded maps to make sure they don't disappear. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: offload: allow offloaded programs to use perf event arraysJakub Kicinski
BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes. The map only holds glue to perf ring, not actual data. Allow non-offloaded perf event arrays to be used in offloaded programs. Offload driver can extract the events from HW and put them in the map for user space to retrieve. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04Merge branch 'bpf-subprog-mgmt-cleanup'Daniel Borkmann
Jiong Wang says: ==================== This patch set clean up some code logic related with managing subprog information. Part of the set are inspried by Edwin's code in his RFC: "bpf/verifier: subprog/func_call simplifications" but with clearer separation so it could be easier to review. - Path 1 unifies main prog and subprogs. All of them are registered in env->subprog_starts. - After patch 1, it is clear that subprog_starts and subprog_stack_depth could be merged as both of them now have main and subprog unified. Patch 2 therefore does the merge, all subprog information are centred at bpf_subprog_info. - Patch 3 goes further to introduce a new fake "exit" subprog which serves as an ending marker to the subprog list. We could then turn the following code snippets across verifier: if (env->subprog_cnt == cur_subprog + 1) subprog_end = insn_cnt; else subprog_end = env->subprog_info[cur_subprog + 1].start; into: subprog_end = env->subprog_info[cur_subprog + 1].start; There is no functional change by this patch set. No bpf selftest (both non-jit and jit) regression found after this set. v2: - fixed adjust_subprog_starts to also update fake "exit" subprog start. - for John's suggestion on renaming subprog to prog, I could work on a follow-up patch if it is recognized as worth the change. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org>
2018-05-04net/mlx5: fix spelling mistake: "modfiy" -> "modify"Colin Ian King
Trivial fix to spelling mistake in netdev_warn warning message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-04net/mlx5: Cleanup unused field in Work Queue parametersTariq Toukan
Remove the 'linear' field from struct mlx5_wq_param. It is redundant, set but never read. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-04net/mlx5: Fix dump_command mailbox length printedMoshe Shemesh
Dump command mailbox length printed was correct only if data_only flag was set. For the case that data_only flag was clear the offset to stop printing at was wrong and so the buffer printed was too short. Changed the print loop to stop according to number of buffers in mailbox. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-04net/mlx5: Refactor num of blocks in mailbox calculationMoshe Shemesh
Get the logic that calculates the number of blocks in a command mailbox into a dedicated function. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-04net/mlx5: Decrease level of prints about non-existent MKEYLeon Romanovsky
User-controlled application can cause multiple prints as below to flood dmesg. Since knowledge of failed MKey release is important for debug, let's decrease its level to debug. mlx5_core 0000:00:04.0: mlx5_core_destroy_mkey:127:(pid 2352): failed radix tree delete of mkey 0x1ed700 Reported-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-04net/mlx4_en: optimizes get_fixed_ipv6_csum()Eric Dumazet
While trying to support CHECKSUM_COMPLETE for IPV6 fragments, I had to experiments various hacks in get_fixed_ipv6_csum(). I must admit I could not find how to implement this :/ However, get_fixed_ipv6_csum() does a lot of redundant operations, calling csum_partial() twice. First csum_partial() computes the checksum of saddr and daddr, put in @csum_pseudo_hdr. Undone later in the second csum_partial() computed on whole ipv6 header. Then nexthdr is added once, added a second time, then substracted. payload_len is added once, then substracted. Really all this can be reduced to two add_csum(), to add back 6 bytes that were removed by mlx4 when providing hw_checksum in RX descriptor. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04Merge branch 'smc-splice-implementation'David S. Miller
Ursula Braun says: ==================== net/smc: splice implementation Stefan comes up with an smc implementation for splice(). The first three patches are preparational patches, the 4th patch implements splice(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04smc: add support for splice()Stefan Raspl
Provide an implementation for splice() when we are using SMC. See smc_splice_read() for further details. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>< Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04smc: allocate RMBs as compound pagesStefan Raspl
Preparatory work for splice() support. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>< Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04smc: make smc_rx_wait_data() genericStefan Raspl
Turn smc_rx_wait_data into a generic function that can be used at various instances to wait on traffic to complete with varying criteria. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>< Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04smc: simplify abort logicStefan Raspl
Some of the conditions to exit recv() are common in two pathes - cleaning up code by moving the check up so we have it only once. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>< Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Overlapping changes in selftests Makefile. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04Merge branch 'sh_eth-complain-on-access-to-unimplemented-TSU-registers'David S. Miller
Sergei Shtylyov says: ==================== sh_eth: complain on access to unimplemented TSU registers Here's a set of 2 patches against DaveM's 'net-next.git' repo. The 1st patch routes TSU_POST<n> register accesses thru sh_eth_tsu_{read|write}() and the 2nd added WARN_ON() unimplemented register to those functions. I'm going to deal with TSU_ADR{H|L}<n> registers in a later series... ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04sh_eth: WARN_ON() access to unimplemented TSU registerSergei Shtylyov
Commit 3365711df024 ("sh_eth: WARN on access to a register not implemented in a particular chip") added WARN_ON() to sh_eth_{read|write}() but not to sh_eth_tsu_{read|write}(). Now that we've routed almost all TSU register accesses (except TSU_ADR{H|L}<n> -- which are special) thru the latter pair of accessors, it makes sense to check for the unimplemented TSU registers as well... Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04sh_eth: use TSU register accessors for TSU_POST<n>Sergei Shtylyov
There's no particularly good reason TSU_POST<n> registers get accessed circumventing sh_eth_tsu_{read|write}() -- start using those, removing (badly named) sh_eth_tsu_get_post_reg_offset(), while at it... Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04bpf: add faked "ending" subprogJiong Wang
There are quite a few code snippet like the following in verifier: subprog_start = 0; if (env->subprog_cnt == cur_subprog + 1) subprog_end = insn_cnt; else subprog_end = env->subprog_info[cur_subprog + 1].start; The reason is there is no marker in subprog_info array to tell the end of it. We could resolve this issue by introducing a faked "ending" subprog. The special "ending" subprog is with "insn_cnt" as start offset, so it is serving as the end mark whenever we iterate over all subprogs. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: centre subprog information fieldsJiong Wang
It is better to centre all subprog information fields into one structure. This structure could later serve as function node in call graph. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: unify main prog and subprogJiong Wang
Currently, verifier treat main prog and subprog differently. All subprogs detected are kept in env->subprog_starts while main prog is not kept there. Instead, main prog is implicitly defined as the prog start at 0. There is actually no difference between main prog and subprog, it is better to unify them, and register all progs detected into env->subprog_starts. This could also help simplifying some code logic. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04xfrm: use a dedicated slab cache for struct xfrm_stateMathias Krause
struct xfrm_state is rather large (768 bytes here) and therefore wastes quite a lot of memory as it falls into the kmalloc-1024 slab cache, leaving 256 bytes of unused memory per XFRM state object -- a net waste of 25%. Using a dedicated slab cache for struct xfrm_state reduces the level of internal fragmentation to a minimum. On my configuration SLUB chooses to create a slab cache covering 4 pages holding 21 objects, resulting in an average memory waste of ~13 bytes per object -- a net waste of only 1.6%. In my tests this led to memory savings of roughly 2.3MB for 10k XFRM states. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-05-03Merge tag 'linux-kselftest-4.17-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: "This Kselftest update for 4.17-rc4 consists of a fix for a syntax error in the script that runs selftests. Mathieu Desnoyers found this bug in the script on systems running GNU Make 3.8 or older" * tag 'linux-kselftest-4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests: Fix lib.mk run_tests target shell script
2018-05-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Various sockmap fixes from John Fastabend (pinned map handling, blocking in recvmsg, double page put, error handling during redirect failures, etc.) 2) Fix dead code handling in x86-64 JIT, from Gianluca Borello. 3) Missing device put in RDS IB code, from Dag Moxnes. 4) Don't process fast open during repair mode in TCP< from Yuchung Cheng. 5) Move address/port comparison fixes in SCTP, from Xin Long. 6) Handle add a bond slave's master into a bridge properly, from Hangbin Liu. 7) IPv6 multipath code can operate on unitialized memory due to an assumption that the icmp header is in the linear SKB area. Fix from Eric Dumazet. 8) Don't invoke do_tcp_sendpages() recursively via TLS, from Dave Watson. 9) Fix memory leaks in x86-64 JIT, from Daniel Borkmann. 10) RDS leaks kernel memory to userspace, from Eric Dumazet. 11) DCCP can invoke a tasklet on a freed socket, take a refcount. Also from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (78 commits) dccp: fix tasklet usage smc: fix sendpage() call net/smc: handle unregistered buffers net/smc: call consolidation qed: fix spelling mistake: "offloded" -> "offloaded" net/mlx5e: fix spelling mistake: "loobpack" -> "loopback" tcp: restore autocorking rds: do not leak kernel memory to user land qmi_wwan: do not steal interfaces from class drivers ipv4: fix fnhe usage by non-cached routes bpf: sockmap, fix error handling in redirect failures bpf: sockmap, zero sg_size on error when buffer is released bpf: sockmap, fix scatterlist update on error path in send with apply net_sched: fq: take care of throttled flows before reuse ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6" bpf, x64: fix memleak when not converging on calls bpf, x64: fix memleak when not converging after image net/smc: restrict non-blocking connect finish 8139too: Use disable_irq_nosync() in rtl8139_poll_controller() sctp: fix the issue that the cookie-ack with auth can't get processed ...
2018-05-03Merge branch 'parisc-4.17-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fixes from Helge Deller: "Fix two section mismatches, convert to read_persistent_clock64(), add further documentation regarding the HPMC crash handler and make bzImage the default build target" * 'parisc-4.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Fix section mismatches parisc: drivers.c: Fix section mismatches parisc: time: Convert read_persistent_clock() to read_persistent_clock64() parisc: Document rules regarding checksum of HPMC handler parisc: Make bzImage default build target
2018-05-03Merge branch 'move-ld_abs-to-native-BPF'Alexei Starovoitov
Daniel Borkmann says: ==================== This set simplifies BPF JITs significantly by moving ld_abs/ld_ind to native BPF, for details see individual patches. Main rationale is in patch 'implement ld_abs/ld_ind in native bpf'. Thanks! v1 -> v2: - Added missing seen_lds_abs in LDX_MSH and use X = A initially due to being preserved on func call. - Added a large batch of cBPF tests into test_bpf. - Added x32 removal of LD_ABS/LD_IND, so all JITs are covered. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf: sync tools bpf.h uapi headerDaniel Borkmann
Only sync the header from include/uapi/linux/bpf.h. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, x32: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from x32 JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, s390x: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from s390x JIT. Tested on s390x instance on LinuxONE. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, ppc64: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from ppc64 JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, mips64: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from mips64 JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, arm32: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from arm32 JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, sparc64: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from sparc64 JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, arm64: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from arm64 JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf, x64: remove ld_abs/ld_indDaniel Borkmann
Since LD_ABS/LD_IND instructions are now removed from the core and reimplemented through a combination of inlined BPF instructions and a slow-path helper, we can get rid of the complexity from x64 JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf: add skb_load_bytes_relative helperDaniel Borkmann
This adds a small BPF helper similar to bpf_skb_load_bytes() that is able to load relative to mac/net header offset from the skb's linear data. Compared to bpf_skb_load_bytes(), it takes a fifth argument namely start_header, which is either BPF_HDR_START_MAC or BPF_HDR_START_NET. This allows for a more flexible alternative compared to LD_ABS/LD_IND with negative offset. It's enabled for tc BPF programs as well as sock filter program types where it's mainly useful in reuseport programs to ease access to lower header data. Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf: implement ld_abs/ld_ind in native bpfDaniel Borkmann
The main part of this work is to finally allow removal of LD_ABS and LD_IND from the BPF core by reimplementing them through native eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and keeping them around in native eBPF caused way more trouble than actually worth it. To just list some of the security issues in the past: * fdfaf64e7539 ("x86: bpf_jit: support negative offsets") * 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets") * e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler") * 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call") * 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context") * 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context") For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy these days due to their limitations and more efficient/flexible alternatives that have been developed over time such as direct packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a register, the load happens in host endianness and its exception handling can yield unexpected behavior. The latter is explained in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by div/mod by 0 exception") with similar cases of exceptions we had. In native eBPF more recent program types will disable LD_ABS/LD_IND altogether through may_access_skb() in verifier, and given the limitations in terms of exception handling, it's also disabled in programs that use BPF to BPF calls. In terms of cBPF, the LD_ABS/LD_IND is used in networking programs to access packet data. It is not used in seccomp-BPF but programs that use it for socket filtering or reuseport for demuxing with cBPF. This is mostly relevant for applications that have not yet migrated to native eBPF. The main complexity and source of bugs in LD_ABS/LD_IND is coming from their implementation in the various JITs. Most of them keep the model around from cBPF times by implementing a fastpath written in asm. They use typically two from the BPF program hidden CPU registers for caching the skb's headlen (skb->len - skb->data_len) and skb->data. Throughout the JIT phase this requires to keep track whether LD_ABS/LD_IND are used and if so, the two registers need to be recached each time a BPF helper would change the underlying packet data in native eBPF case. At least in eBPF case, available CPU registers are rare and the additional exit path out of the asm written JIT helper makes it also inflexible since not all parts of the JITer are in control from plain C. A LD_ABS/LD_IND implementation in eBPF therefore allows to significantly reduce the complexity in JITs with comparable performance results for them, e.g.: test_bpf tcpdump port 22 tcpdump complex x64 - before 15 21 10 14 19 18 - after 7 10 10 7 10 15 arm64 - before 40 91 92 40 91 151 - after 51 64 73 51 62 113 For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter() and cache the skb's headlen and data in the cBPF prologue. The BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just used as a local temporary variable. This allows to shrink the image on x86_64 also for seccomp programs slightly since mapping to %rsi is not an ereg. In callee-saved R8 and R9 we now track skb data and headlen, respectively. For normal prologue emission in the JITs this does not add any extra instructions since R8, R9 are pushed to stack in any case from eBPF side. cBPF uses the convert_bpf_ld_abs() emitter which probes the fast path inline already and falls back to bpf_skb_load_helper_{8,16,32}() helper relying on the cached skb data and headlen as well. R8 and R9 never need to be reloaded due to bpf_helper_changes_pkt_data() since all skb access in cBPF is read-only. Then, for the case of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally, does neither cache skb data and headlen nor has an inlined fast path. The reason for the latter is that native eBPF does not have any extra registers available anyway, but even if there were, it avoids any reload of skb data and headlen in the first place. Additionally, for the negative offsets, we provide an alternative bpf_skb_load_bytes_relative() helper in eBPF which operates similarly as bpf_skb_load_bytes() and allows for more flexibility. Tested myself on x64, arm64, s390x, from Sandipan on ppc64. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf: migrate ebpf ld_abs/ld_ind tests to test_verifierDaniel Borkmann
Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason is that the eBPF tests from test_bpf module do not go via BPF verifier and therefore any instruction rewrites from verifier cannot take place. Therefore, move them into test_verifier which runs out of user space, so that verfier can rewrite LD_ABS/LD_IND internally in upcoming patches. It will have the same effect since runtime tests are also performed from there. This also allows to finally unexport bpf_skb_vlan_{push,pop}_proto and keep it internal to core kernel. Additionally, also add further cBPF LD_ABS/LD_IND test coverage into test_bpf.ko suite. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf: prefix cbpf internal helpers with bpf_Daniel Borkmann
No change in functionality, just remove the '__' prefix and replace it with a 'bpf_' prefix instead. We later on add a couple of more helpers for cBPF and keeping the scheme with '__' is suboptimal there. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03Merge branch 'AF_XDP-initial-support'Alexei Starovoitov
Björn Töpel says: ==================== This patch set introduces a new address family called AF_XDP that is optimized for high performance packet processing and, in upcoming patch sets, zero-copy semantics. In this patch set, we have removed all zero-copy related code in order to make it smaller, simpler and hopefully more review friendly. This patch set only supports copy-mode for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX using the XDP_DRV path. Zero-copy support requires XDP and driver changes that Jesper Dangaard Brouer is working on. Some of his work has already been accepted. We will publish our zero-copy support for RX and TX on top of his patch sets at a later point in time. An AF_XDP socket (XSK) is created with the normal socket() syscall. Associated with each XSK are two queues: the RX queue and the TX queue. A socket can receive packets on the RX queue and it can send packets on the TX queue. These queues are registered and sized with the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory to have at least one of these queues for each socket. In contrast to AF_PACKET V2/V3 these descriptor queues are separated from packet buffers. An RX or TX descriptor points to a data buffer in a memory area called a UMEM. RX and TX can share the same UMEM so that a packet does not have to be copied between RX and TX. Moreover, if a packet needs to be kept for a while due to a possible retransmit, the descriptor that points to that packet can be changed to point to another and reused right away. This again avoids copying data. This new dedicated packet buffer area is call a UMEM. It consists of a number of equally size frames and each frame has a unique frame id. A descriptor in one of the queues references a frame by referencing its frame id. The user space allocates memory for this UMEM using whatever means it feels is most appropriate (malloc, mmap, huge pages, etc). This memory area is then registered with the kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue and the COMPLETION queue. The fill queue is used by the application to send down frame ids for the kernel to fill in with RX packet data. References to these frames will then appear in the RX queue of the XSK once they have been received. The completion queue, on the other hand, contains frame ids that the kernel has transmitted completely and can now be used again by user space, for either TX or RX. Thus, the frame ids appearing in the completion queue are ids that were previously transmitted using the TX queue. In summary, the RX and FILL queues are used for the RX path and the TX and COMPLETION queues are used for the TX path. The socket is then finally bound with a bind() call to a device and a specific queue id on that device, and it is not until bind is completed that traffic starts to flow. Note that in this patch set, all packet data is copied out to user-space. A new feature in this patch set is that the UMEM can be shared between processes, if desired. If a process wants to do this, it simply skips the registration of the UMEM and its corresponding two queues, sets a flag in the bind call and submits the XSK of the process it would like to share UMEM with as well as its own newly created XSK socket. The new process will then receive frame id references in its own RX queue that point to this shared UMEM. Note that since the queue structures are single-consumer / single-producer (for performance reasons), the new process has to create its own socket with associated RX and TX queues, since it cannot share this with the other process. This is also the reason that there is only one set of FILL and COMPLETION queues per UMEM. It is the responsibility of a single process to handle the UMEM. If multiple-producer / multiple-consumer queues are implemented in the future, this requirement could be relaxed. How is then packets distributed between these two XSK? We have introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The user-space application can place an XSK at an arbitrary place in this map. The XDP program can then redirect a packet to a specific index in this map and at this point XDP validates that the XSK in that map was indeed bound to that device and queue number. If not, the packet is dropped. If the map is empty at that index, the packet is also dropped. This also means that it is currently mandatory to have an XDP program loaded (and one XSK in the XSKMAP) to be able to get any traffic to user space through the XSK. AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the driver does not have support for XDP, or XDP_SKB is explicitly chosen when loading the XDP program, XDP_SKB mode is employed that uses SKBs together with the generic XDP support and copies out the data to user space. A fallback mode that works for any network device. On the other hand, if the driver has support for XDP, it will be used by the AF_XDP code to provide better performance, but there is still a copy of the data into user space. There is a xdpsock benchmarking/test application included that demonstrates how to use AF_XDP sockets with both private and shared UMEMs. Say that you would like your UDP traffic from port 4242 to end up in queue 16, that we will enable AF_XDP on. Here, we use ethtool for this: ethtool -N p3p2 rx-flow-hash udp4 fn ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ action 16 Running the rxdrop benchmark in XDP_DRV mode can then be done using: samples/bpf/xdpsock -i p3p2 -q 16 -r -N For XDP_SKB mode, use the switch "-S" instead of "-N" and all options can be displayed with "-h", as usual. We have run some benchmarks on a dual socket system with two Broadwell E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14 cores which gives a total of 28, but only two cores are used in these experiments. One for TR/RX and one for the user space application. The memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is 8192MB and with 8 of those DIMMs in the system we have 64 GB of total memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The NIC is Intel I40E 40Gbit/s using the i40e driver. Below are the results in Mpps of the I40E NIC benchmark runs for 64 and 1500 byte packets, generated by a commercial packet generator HW outputing packets at full 40 Gbit/s line rate. The results are without retpoline so that we can compare against previous numbers. With retpoline, the AF_XDP numbers drop with between 10 - 15 percent. AF_XDP performance 64 byte packets. Results from V2 in parenthesis. Benchmark XDP_SKB XDP_DRV rxdrop 2.9(3.0) 9.6(9.5) txpush 2.6(2.5) NA* l2fwd 1.9(1.9) 2.5(2.5) (TX using XDP_SKB in both cases) AF_XDP performance 1500 byte packets: Benchmark XDP_SKB XDP_DRV rxdrop 2.1(2.2) 3.3(3.3) l2fwd 1.4(1.4) 1.8(1.8) (TX using XDP_SKB in both cases) * NA since we have no support for TX using the XDP_DRV infrastructure in this patch set. This is for a future patch set since it involves changes to the XDP NDOs. Some of this has been upstreamed by Jesper Dangaard Brouer. XDP performance on our system as a base line: 64 byte packets: XDP stats CPU pps issue-pps XDP-RX CPU 16 32.3(32.9)M 0 1500 byte packets: XDP stats CPU pps issue-pps XDP-RX CPU 16 3.3(3.3)M 0 Changes from V2: * Fixed a race in XSKMAP map found by Will. The code has been completely rearchitected and is now simpler, faster, and hopefully also not racy. Please review and check if it holds. If you would like to diff V2 against V3, you can find them here: https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next The structure of the patch set is as follows: Patches 1-3: Basic socket and umem plumbing Patches 4-9: RX support together with the new XSKMAP Patches 10-13: TX support Patch 14: Statistics support with getsockopt() Patch 15: Sample application We based this patch set on bpf-next commit a3fe1f6f2ada ("tools: bpftool: change time format for program 'loaded at:' information") To do for this patch set: * Syzkaller torture session being worked on Post-series plan: * Optimize performance * Kernel selftest * Kernel load module support of AF_XDP would be nice. Unclear how to achieve this though since our XDP code depends on net/core. * Support for AF_XDP sockets without an XPD program loaded. In this case all the traffic on a queue should go up to the user space socket. * Daniel Borkmann's suggestion for a "copy to XDP socket, and return XDP_PASS" for a tcpdump-like functionality. * And of course getting to zero-copy support in small increments, starting with TX then adding RX. Thanks: Björn and Magnus ==================== Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03samples/bpf: sample application and documentation for AF_XDP socketsMagnus Karlsson
This is a sample application for AF_XDP sockets. The application supports three different modes of operation: rxdrop, txonly and l2fwd. To show-case a simple round-robin load-balancing between a set of sockets in an xskmap, set the RR_LB compile time define option to 1 in "xdpsock.h". v2: The entries variable was calculated twice in {umem,xq}_nb_avail. Co-authored-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03xsk: statistics supportMagnus Karlsson
In this commit, a new getsockopt is added: XDP_STATISTICS. This is used to obtain stats from the sockets. v2: getsockopt now returns size of stats structure. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03xsk: support for TxMagnus Karlsson
Here, Tx support is added. The user fills the Tx queue with frames to be sent by the kernel, and let's the kernel know using the sendmsg syscall. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>