2018-04-24 Revert "net: init sk_cookie for inet socket" (Yafang Shao)
This reverts commit c6849a3ac17e ("net: init sk_cookie for inet socket"). Per discussion with Eric, when sock_net(sk)->cookie_gen is updated, the whole cache line is invalidated; as this cache line is shared with all cpus, that may cause a great performance hit. Below is the data from Eric: "Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on my host" when running a synflood test. Revert it to prevent cache line false sharing. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 Merge branch 'net-DIM-tx' (David S. Miller)
Tal Gilboa says: ==================== Introduce adaptive TX interrupt moderation to net DIM. Net DIM is a library designed for dynamic interrupt moderation. It was implemented and optimized with receive side interrupts in mind, since these are usually the CPU-expensive ones. This patch set introduces adaptive transmit interrupt moderation to net DIM, complete with usage in the mlx5e driver. Using adaptive TX behavior reduces the interrupt rate in multiple scenarios. Furthermore, it is essential for increasing bandwidth in cases where payload aggregation is required. v3: Remove "inline" from functions in .c files (requested by DaveM). Revert adding the "enabled" field to struct net_dim and apply mlx5e structural suggestions (suggested by SaeedM). v2: Rebase over proper tree. v1: Fix compilation issues due to missed function renaming. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 net/mlx5e: Enable adaptive-TX moderation (Tal Gilboa)
Add support for adaptive TX moderation. This greatly reduces the TX interrupt rate and increases bandwidth, mostly for TCP bandwidth over the ARM architecture (below). There is a slight degradation for single-stream TCP with very large message sizes (x86): in that case, any moderation on transmitted packets reduces bandwidth due to hitting the TCP output limit. Since this is a synthetic case, the change is still worth doing.
Performance improvement (ConnectX-4Lx 40GbE, ARM):
- TCP 64B bandwidth with 1-50 streams increased 6-35%.
- TCP 64B bandwidth with 100-500 streams increased 20-70%.
Performance improvement (ConnectX-5 100GbE, x86):
- Bandwidth: increased up to 40% (1024B with 10s of streams).
- Interrupt rate: reduced up to 50% (1024B with 1000s of streams).
Performance degradation (ConnectX-5 100GbE, x86):
- Bandwidth: up to 10% decrease for single-stream TCP (1MB message size, from 51Gb/s to 47Gb/s).
Signed-off-by: Tal Gilboa <talgi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 net/dim: Support adaptive TX moderation (Tal Gilboa)
Interrupt moderation for TX traffic requires different profiles than RX interrupt moderation. The main goal here is to reduce interrupt rate and allow better payload aggregation by keeping SKBs in the TX queue a bit longer. Ping-pong behavior would get a profile with a short timer, so latency wouldn't increase for these scenarios. There might be a slight degradation in bandwidth for single stream with large message sizes, since net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but with many streams performance is always improved. Signed-off-by: Tal Gilboa <talgi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 net/dim: Rename *_get_profile() functions to *_get_rx_moderation() (Tal Gilboa)
Preparation for introducing adaptive TX to net DIM. Signed-off-by: Tal Gilboa <talgi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 vhost_net: use packet weight for rx handler, too (Paolo Abeni)
Similar to commit a2ac99905f1e ("vhost-net: set packet weight of tx polling to 2 * vq size"), we need a packet-based limit for handle_rx, too: otherwise, under an rx flood with small packets, tx can be delayed for a very long time, even without busypolling. The packet limit applied to handle_rx must be the same as the one applied by handle_tx, or we get unfair scheduling between rx and tx. Tying such a limit to the queue length makes it less effective for large queue length values and can introduce large process scheduler latencies, so a constant value is used, like the existing bytes limit. The selected limit has been validated with the PVP[1] performance test with different queue sizes:
queue size     256    512   1024
baseline       366    354    362
weight 128     715    723    670
weight 256     740    745    733
weight 512     600    460    583
weight 1024    423    427    418
A packet weight of 256 gives peak performance in all the tested scenarios. No measurable regression in unidirectional performance tests has been detected. [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
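As a rough illustration of the dual limit described above (a byte quota plus a constant packet quota), here is a small self-contained sketch; the constants and helper are placeholders, not the actual vhost_net code:
====================================
/* Self-contained illustration (not vhost_net code) of bounding a
 * handler loop by both a byte quota and a packet quota, so a flood of
 * tiny packets cannot monopolize the handler the way a pure byte limit
 * would. All names and constants here are placeholders.
 */
#include <stdio.h>

#define BYTE_WEIGHT 0x80000 /* byte-based limit, like the existing one */
#define PKT_WEIGHT  256     /* constant packet limit, as selected above */

/* placeholder for "pull one packet off the queue"; returns its length */
static size_t receive_one_packet(void)
{
  return 64; /* pretend we only ever see 64-byte packets */
}

int main(void)
{
  size_t total_len = 0;
  int pkts = 0;

  for (;;) {
    total_len += receive_one_packet();
    if (++pkts >= PKT_WEIGHT || total_len >= BYTE_WEIGHT)
      break; /* yield; the real code reschedules the handler */
  }
  printf("stopped after %d packets, %zu bytes\n", pkts, total_len);
  return 0;
}
====================================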
2018-04-24 team: fix netconsole setup over team (Xin Long)
The same fix as in commit dbe173079ab5 ("bridge: fix netconsole setup over bridge") is also needed for the team driver. While at it, remove the unnecessary parameter *team from team_port_enable_netpoll(). v1->v2: - fix it in a better way, as bridge does. Fixes: 0fb52a27a04a ("team: cleanup netpoll clode") Reported-by: João Avelino Bellomo Filho <jbellomo@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net: fib_rules: fix l3mdev netlink attr processing (Roopa Prabhu)
Fixes: b16fb418b1bf ("net: fib_rules: add extack support") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 Merge branch 'amd-xgbe-fixes' (David S. Miller)
Tom Lendacky says: ==================== amd-xgbe: AMD XGBE driver fixes 2018-04-23. This patch series addresses some issues in the AMD XGBE driver. The following fixes are included in this driver update series:
- Improve KR auto-negotiation and training (2 patches)
  - Add pre and post auto-negotiation hooks
  - Use the pre and post auto-negotiation hooks to disable CDR tracking during the auto-negotiation page exchange in KR mode
- Check for SFP transceiver signal support and only use the signal if the SFP indicates that it is supported
This patch series is based on net. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 amd-xgbe: Only use the SFP supported transceiver signals (Tom Lendacky)
The SFP eeprom indicates the transceiver signals (Rx LOS, Tx Fault, etc.) that it supports. Update the driver to include checking the eeprom data when deciding whether to use a transceiver signal. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 amd-xgbe: Improve KR auto-negotiation and training (Tom Lendacky)
Update xgbe-phy-v2.c to make use of the auto-negotiation (AN) phy hooks to improve the ability to successfully complete Clause 73 AN when running at 10gbps. Hardware can sometimes have issues with CDR lock when the AN DME page exchange is being performed. The AN and KR training hooks are used as follows:
- The pre AN hook is used to disable CDR tracking in the PHY so that the DME page exchange can be successfully and consistently completed.
- The post KR training hook is used to re-enable CDR tracking so that KR training can successfully complete.
- The post AN hook is used to check for an unsuccessful AN, which will increase a CDR tracking enablement delay (up to a maximum value).
Add two debugfs entries to allow control over use of the CDR tracking workaround. The debugfs entries allow the CDR tracking workaround to be disabled and determine whether to re-enable CDR tracking before or after link training has been initiated. Also, with these changes the receiver reset cycle that is performed during the link status check can be performed less often. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 amd-xgbe: Add pre/post auto-negotiation phy hooks (Tom Lendacky)
Add hooks to the driver auto-negotiation (AN) flow to allow the different phy implementations to perform any steps necessary to improve AN. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
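As a loose sketch of what these hooks amount to (optional callbacks invoked around the AN sequence); the struct and member names are illustrative, not the actual amd-xgbe definitions:
====================================
/* Optional callbacks invoked around auto-negotiation; the AN state
 * machine calls them only if the phy implementation provides them.
 * Names here are illustrative, not the exact amd-xgbe definitions.
 */
struct xgbe_an_hooks_sketch {
        void (*an_pre)(void *pdata);  /* e.g. disable CDR tracking */
        void (*an_post)(void *pdata); /* e.g. re-enable CDR tracking */
};

static void xgbe_an_run_sketch(const struct xgbe_an_hooks_sketch *hooks,
                               void *pdata)
{
        if (hooks->an_pre)
                hooks->an_pre(pdata);

        /* ... Clause 73 DME page exchange and KR training happen here ... */

        if (hooks->an_post)
                hooks->an_post(pdata);
}
====================================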
2018-04-23 pppoe: check sockaddr length in pppoe_connect() (Guillaume Nault)
We must validate sockaddr_len, otherwise userspace can pass less data than we expect and we end up accessing invalid data. Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers") Reported-by: syzbot+4f03bdf92fdf9ef5ddab@syzkaller.appspotmail.com Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
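A sketch of the kind of validation described above, assuming the usual sockaddr_pppox layout; this is illustrative, not the exact code added by the patch:
====================================
/* Illustrative length check before reading fields out of a
 * user-supplied sockaddr; not the exact code added by the patch.
 */
#include <linux/errno.h>
#include <linux/if_pppox.h>
#include <linux/socket.h>

static int pppoe_check_sockaddr_sketch(struct sockaddr *uservaddr,
                                       int sockaddr_len)
{
        struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;

        /* userspace may pass fewer bytes than sizeof(*sp); bail out
         * before dereferencing sp->sa_protocol or the address fields
         */
        if (sockaddr_len < (int)sizeof(struct sockaddr_pppox))
                return -EINVAL;

        if (sp->sa_protocol != PX_PROTO_OE)
                return -EINVAL;

        return 0;
}
====================================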
2018-04-23 l2tp: check sockaddr length in pppol2tp_connect() (Guillaume Nault)
Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that it actually points to valid data. Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Reported-by: syzbot+a70ac890b23b1bf29f5c@syzkaller.appspotmail.com Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 selftests: net: update .gitignore with missing test (Anders Roxell)
Fixes: 192dc405f308 ("selftests: net: add tcp_mmap program") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net: phy: marvell: clear wol event before setting it (Jingju Hou)
If a WOL event happened once, the LED[2] interrupt pin will not be cleared unless we read the CSISR register. If interrupts are in use, the normal interrupt handling will clear the WOL event. Let's clear the WOL event before enabling it if !phy_interrupt_is_valid(). Signed-off-by: Jingju Hou <Jingju.Hou@synaptics.com> Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com> Signed-off-by: David S. Miller <davem@davemloft.net>
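A hedged sketch of the "clear before enable" idea; the register offset used below is an assumption for illustration only, standing in for the driver's CSISR/interrupt-status constant:
====================================
/* The register offset below is an assumption made for this sketch; it
 * stands in for the driver's interrupt-status (CSISR) register define.
 */
#include <linux/phy.h>

#define MARVELL_INT_STATUS_SKETCH 0x13 /* assumed offset, see note above */

static int marvell_clear_wol_sketch(struct phy_device *phydev)
{
        int err = 0;

        /* With interrupts unused, nothing has read the status register
         * since the last WOL event, so the LED[2] interrupt pin stays
         * asserted. Read (and thereby clear) it before re-arming WOL.
         */
        if (!phy_interrupt_is_valid(phydev))
                err = phy_read(phydev, MARVELL_INT_STATUS_SKETCH);

        return err < 0 ? err : 0;
}
====================================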
2018-04-23 dca: make function dca_common_get_tag static (Colin Ian King)
Function dca_common_get_tag is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 Merge branch 'bpf-sockmap-fixes' (Daniel Borkmann)
John Fastabend says: ==================== While testing sockmap with more programs (besides our test programs) I found a couple of issues. The attached series fixes an issue where pinned maps were not working correctly, an issue where blocking sockets returned zero, and an error path where a sock hitting an out-of-memory case resulted in a double page_put() while doing ingress redirects. See individual patches for more details. v2: Incorporated Daniel's feedback to use map ops for the uref put op, which also fixed the build error discovered in v1. v3: rename map_put_uref to map_release_uref ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 bpf: sockmap, fix double page_put on ENOMEM error in redirect path (John Fastabend)
In the case where the socket memory boundary is hit, the redirect path returns an ENOMEM error. However, before checking for this condition the redirect scatterlist buffer is set up with a valid page and length. This is never unwound, so when the buffers are released later in the error path we do a put_page() and clear the scatterlist fields. But, because the initial error happens before completing the scatterlist buffer, we end up with both the original buffer and the redirect buffer pointing to the same page, resulting in duplicate put_page() calls. To fix this, simply move the initial configuration of the redirect scatterlist buffer below the sock memory check. Found this while running a TCP_STREAM test with netperf using Cilium. Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 bpf: sockmap, sk_wait_event needed to handle blocking cases (John Fastabend)
In the recvmsg handler we need to add a wait event to support the blocking use cases. Without this we return zero and may confuse user applications. In the wait event any data received on the sk either via sk_receive_queue or the psock ingress list will wake up the sock. Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
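A sketch of the blocking wait this refers to, built on the standard sk_wait_event() pattern; the wakeup condition is paraphrased and psock_ingress_empty_sketch() is a placeholder, not a real helper:
====================================
/* Sleep until either sk_receive_queue or the psock ingress list
 * (placeholder check below) has data, using sk_wait_event().
 */
#include <linux/wait.h>
#include <net/sock.h>

static bool psock_ingress_empty_sketch(struct sock *sk)
{
        return true; /* placeholder for checking the psock ingress list */
}

static int wait_for_data_sketch(struct sock *sk, long timeo)
{
        DEFINE_WAIT_FUNC(wait, woken_wake_function);
        int ret;

        add_wait_queue(sk_sleep(sk), &wait);
        sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
        ret = sk_wait_event(sk, &timeo,
                            !skb_queue_empty(&sk->sk_receive_queue) ||
                            !psock_ingress_empty_sketch(sk), &wait);
        sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
        remove_wait_queue(sk_sleep(sk), &wait);

        return ret;
}
====================================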
2018-04-24 bpf: sockmap, map_release does not hold refcnt for pinned maps (John Fastabend)
Relying on the map_release hook to decrement the reference counts when a map is removed only works if the map is not being pinned. In the pinned case the ref is decremented immediately and the BPF programs are released. After this the BPF programs may not be in use, which is not what the user would expect. This patch moves the release logic into bpf_map_put_uref() and brings sockmap in line with how a similar case is handled in prog array maps. Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-23 bpf: sockmap sample use clang flag, -target bpf (John Fastabend)
Per Documentation/bpf/bpf_devel_QA.txt, add the -target flag to the sockmap Makefile. Relevant text quoted here: "Otherwise, you can use bpf target. Additionally, you _must_ use bpf target when: - Your program uses data structures with pointer or long / unsigned long types that interface with BPF helpers or context data structures. Access into these structures is verified by the BPF verifier and may result in verification failures if the native architecture is not aligned with the BPF architecture, e.g. 64-bit." An example of this is BPF_PROG_TYPE_SK_MSG, which requires '-target bpf'. Fixes: 69e8cc134bcb ("bpf: sockmap sample program") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-23 bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG (John Fastabend)
BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both the data and the data_end pointers. Additionally, the verifier ensures that every access into the values is a __u64 read. This maps correctly onto the 64-bit BPF architecture. However, to ensure that clang uses the correct types when building on 32-bit architectures, the '-target bpf' option _must_ be specified. To make this clear, add a note to the Documentation. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
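A minimal SK_MSG program showing the access pattern in question; the section name and the trivial bounds check are illustrative, and the program is meant to be built with clang -O2 -target bpf:
====================================
/* data and data_end are void * in struct sk_msg_md, and the verifier
 * expects 64-bit-clean pointer handling, which is why clang must be
 * run with '-target bpf'.
 */
#include <linux/bpf.h>

#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

SEC("sk_msg")
int msg_prog(struct sk_msg_md *msg)
{
        void *data = msg->data;
        void *data_end = msg->data_end;

        /* bounds check before touching payload; relies on BPF-sized pointers */
        if (data + 8 > data_end)
                return SK_PASS;

        return SK_PASS;
}

char _license[] SEC("license") = "GPL";
====================================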
2018-04-23 bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY (Roman Gushchin)
Running bpf programs requires disabled preemption, however at least some* of the BPF_PROG_RUN_ARRAY users do not follow this rule. To fix this bug, and also to make it not happen in the future, let's add explicit preemption disabling/re-enabling to the __BPF_PROG_RUN_ARRAY code. * for example:
[ 17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
...
[ 17.640890] inet6_create+0x3eb/0x520
[ 17.641405] __sock_create+0x242/0x340
[ 17.641939] __sys_socket+0x57/0xe0
[ 17.642370] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 17.642944] SyS_socket+0xa/0x10
[ 17.643357] do_syscall_64+0x79/0x220
[ 17.643879] entry_SYSCALL_64_after_hwframe+0x42/0xb7
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
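A paraphrased sketch of the shape of this change (not the exact __BPF_PROG_RUN_ARRAY body): the macro itself takes care of preemption and RCU around the program loop instead of relying on every caller:
====================================
/* Disable preemption (and take the RCU read lock) inside the macro so
 * every caller gets it right by construction.
 */
#include <linux/preempt.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

#define RUN_PROG_ARRAY_SKETCH(run_one_prog)                     \
        ({                                                      \
                u32 _ret = 1;                                   \
                preempt_disable();                              \
                rcu_read_lock();                                \
                /* walk the prog array, running each program */ \
                _ret &= (run_one_prog);                         \
                rcu_read_unlock();                              \
                preempt_enable();                               \
                _ret;                                           \
        })
====================================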
2018-04-23 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf (David S. Miller)
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net. The following patchset contains Netfilter/IPVS fixes for your net tree, they are:
1) Fix SIP conntrack with phones sending session descriptions for different media types but the same port numbers, from Florian Westphal.
2) Fix incorrect rtnl_lock mutex logic from the IPVS sync thread, from Julian Anastasov.
3) Skip compat array allocation in ebtables if there are no entries, also from Florian.
4) Do not lose left/right bits when shifting marks from xt_connmark, from Jack Ma.
5) Silence a false positive memleak in conntrack extensions, from Cong Wang.
6) Fix CONFIG_NF_REJECT_IPV6=m link problems, from Arnd Bergmann.
7) Cannot kfree a rule that is already in a list in nf_tables; switch the order so this error handling is not required, from Florian Westphal.
8) Release set name in error path, from Florian.
9) Include kmemleak.h in nf_conntrack_extend.c, from Stephen Rothwell.
10) NAT chain and extensions depend on NF_TABLES.
11) Out of bound access when renaming chains, from Taehee Yoo.
12) Incorrect casting in xt_connmark leads to wrong bitshifting.
==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 Merge branch 'ipv6-couple-of-fixes-for-rcu-change-to-from' (David S. Miller)
David Ahern says: ==================== net/ipv6: couple of fixes for rcu change to from So many details... I am thankful for all the robots running the permutations and tools. Two bug fixes from the rcu change to rt->from: 1. missing rcu lock in ip6_negative_advice 2. rcu dereferences in 2 sites ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net/ipv6: Fix missing rcu dereferences on from (David Ahern)
The kbuild test robot reported 2 uses of rt->from not properly accessed using rcu_dereference: 1. add rcu_dereference_protected to rt6_remove_exception_rt and make sure it is always called with the rcu lock held. 2. change rt6_do_redirect to take a reference on 'from' when it is accessed the first time, so it can be used the second time outside of the lock. Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net/ipv6: add rcu locking to ip6_negative_advice (David Ahern)
syzbot reported a suspicious rcu_dereference_check:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
dst_negative_advice include/net/sock.h:1786 [inline]
sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
__sys_setsockopt+0x2df/0x390 net/socket.c:1899
SYSC_setsockopt net/socket.c:1914 [inline]
SyS_setsockopt+0x34/0x50 net/socket.c:1911
Add rcu locking around the call to rt6_check_expired in ip6_negative_advice. Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected") Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
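The shape of the fix, as it would appear inside net/ipv6/route.c (rt6_check_expired() is local to that file, so this is a sketch of the call site rather than standalone code):
====================================
/* The expiry check dereferences the rcu-protected rt->from and
 * therefore has to run inside an rcu read-side critical section.
 */
static void ip6_negative_advice_sketch(struct dst_entry *dst)
{
        struct rt6_info *rt = (struct rt6_info *)dst;
        bool expired;

        rcu_read_lock();
        expired = rt6_check_expired(rt); /* reads the rcu pointer rt->from */
        rcu_read_unlock();

        if (expired) {
                /* ... existing "drop the cached route" handling ... */
        }
}
====================================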
2018-04-23 Merge branch 'qed-debug-data' (David S. Miller)
Denis Bolotin says: ==================== Add configuration information to register dump and debug data. The purpose of this patchset is to add configuration information to the debug data collection, which already contains a register dump. The first patch (removing the ptt) is essential because it prevents the unnecessary ptt acquisition when calling mcp APIs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 qed: Add configuration information to register dump and debug data (Denis Bolotin)
Configuration information is added to the debug data collection, in addition to register dump. Added qed_dbg_nvm_image() that receives an image type, allocates a buffer and reads the image. The images are saved in the buffers and the dump size is updated. Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 qed: Delete unused parameter p_ptt from mcp APIs (Denis Bolotin)
Since nvm image attributes are cached during driver load, acquiring a ptt is not needed when calling qed_mcp_get_nvm_image(). Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net: stmmac: Implement logic to automatically select HW Interface (Jose Abreu)
Move all the core version detection to a common place ("hwif.c") and implement a table which can be used to lookup the correct callbacks for each IP version. This simplifies the initialization flow of each IP version and eases future implementation of new IP versions. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Cc: David S. Miller <davem@davemloft.net> Cc: Joao Pinto <jpinto@synopsys.com> Cc: Vitor Soares <soares@synopsys.com> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Cc: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
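A generic sketch of the lookup-table idea; struct names, fields and version thresholds below are placeholders, not the actual hwif.c definitions:
====================================
/* Map a detected core version range to a set of callbacks. */
#include <linux/kernel.h>

struct stmmac_hw_entry_sketch {
        u32 min_id;          /* lowest core version handled by this entry */
        const void *mac_ops; /* callbacks for this IP version */
        const void *dma_ops;
};

static const struct stmmac_hw_entry_sketch stmmac_hw_sketch[] = {
        { .min_id = 0x35, .mac_ops = NULL, .dma_ops = NULL }, /* older core */
        { .min_id = 0x40, .mac_ops = NULL, .dma_ops = NULL }, /* newer core */
};

static const struct stmmac_hw_entry_sketch *
stmmac_lookup_sketch(u32 core_version)
{
        int i;

        /* walk the table backwards so the newest matching entry wins */
        for (i = ARRAY_SIZE(stmmac_hw_sketch) - 1; i >= 0; i--)
                if (core_version >= stmmac_hw_sketch[i].min_id)
                        return &stmmac_hw_sketch[i];

        return NULL;
}
====================================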
2018-04-23 ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy (Eric Dumazet)
KMSAN reported use of uninit-value that I tracked to lack of proper size check on RTA_TABLE attribute. I also believe RTA_PREFSRC lacks a similar check. Fixes: 86872cb57925 ("[IPv6] route: FIB6 configuration using struct fib6_config") Fixes: c3968a857a6b ("ipv6: RTA_PREFSRC support for ipv6 route source address selection") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
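A sketch of the kind of policy entries such a fix adds; surrounding entries are elided and the exact placement in rtm_ipv6_policy is assumed:
====================================
/* Declaring the expected type/size lets the netlink core reject
 * under-sized RTA_TABLE and RTA_PREFSRC attributes before the route
 * code reads them.
 */
#include <linux/in6.h>
#include <linux/rtnetlink.h>
#include <net/netlink.h>

static const struct nla_policy rtm_ipv6_policy_sketch[RTA_MAX + 1] = {
        /* ... existing entries such as RTA_GATEWAY, RTA_IIF, RTA_OIF ... */
        [RTA_PREFSRC]   = { .len = sizeof(struct in6_addr) },
        [RTA_TABLE]     = { .type = NLA_U32 },
};
====================================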
2018-04-23 r8169: don't use netif_info et al before net_device has been registered (Heiner Kallweit)
There's no benefit in using netif_info et al before the net_device has been registered. We get messages like r8169 0000:03:00.0 (unnamed net_device) (uninitialized): [message] Therefore use dev_info/dev_err instead. As a side effect we don't need parameter dev for function rtl8169_get_mac_version() any longer. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net: init sk_cookie for inet socket (Yafang Shao)
With sk_cookie we can identify a socket, which is very helpful for tracing and statistics, e.g. the tcp tracepoints and ebpf. So we'd better init it by default for inet sockets. When using it, we just need to call atomic64_read(&sk->sk_cookie). Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave (Xin Long)
After commit 8a8efa22f51b ("bonding: sync netpoll code with bridge"), slave_dev npinfo would be set in slave_enable_netpoll when enslaving a dev if bond->dev->npinfo was set. However, now slave_dev npinfo is set to bond->dev->npinfo before calling slave_enable_netpoll. With slave_dev npinfo already set, __netpoll_setup called in slave_enable_netpoll will not call the slave dev's .ndo_netpoll_setup(). As a result, the lower dev of this slave dev can't set its npinfo. One way to reproduce it:
# modprobe bonding
# brctl addbr br0
# brctl addif br0 eth1
# ifconfig bond0 192.168.122.1/24 up
# ifenslave bond0 eth2
# systemctl restart netconsole
# ifenslave bond0 br0
# ifconfig eth2 down
# systemctl restart netconsole
The netpoll won't really work. This patch removes that slave_dev npinfo setting in bond_enslave(). Fixes: 8a8efa22f51b ("bonding: sync netpoll code with bridge") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 Merge branch 'fib-rules-extack-support' (David S. Miller)
Roopa Prabhu says: ==================== fib rules extack support. The first patch refactors code to move fib rule netlink handling into a common function. This became obvious when adding duplicate extack msgs in the add and del paths. The second patch adds extack msgs. v2 - Dropped the ip route get support and selftests from the series to look at the input path some more (as pointed out by Ido). Will come back to that next week when I have some time; resending just the extack part for now. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net: fib_rules: add extack support (Roopa Prabhu)
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
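A hedged sketch of what extack support means in this context: returning a human-readable reason via the netlink extended ack when a rule message is malformed. The check and message string below are illustrative, not the exact fib_rules.c additions:
====================================
/* A malformed new/del rule request gets a human-readable reason via
 * the netlink extended ack instead of a bare errno.
 */
#include <linux/errno.h>
#include <linux/fib_rules.h>
#include <linux/netlink.h>

static int fib_rule_check_sketch(struct fib_rule_hdr *frh, struct nlattr **tb,
                                 struct netlink_ext_ack *extack)
{
        if (frh->src_len && !tb[FRA_SRC]) {
                NL_SET_ERR_MSG(extack, "Source length given without source address");
                return -EINVAL;
        }
        return 0;
}
====================================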
2018-04-23 fib_rules: move common handling of newrule delrule msgs into fib_nl2rule (Roopa Prabhu)
This reduces code duplication in the fib rule add and del paths. Get rid of validate_rulemsg. This became obvious when adding duplicate extack support in fib newrule/delrule error paths. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 tc-testing: updated ife test cases (Roman Mashak)
Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 net: introduce a new tracepoint for tcp_rcv_space_adjust (Yafang Shao)
tcp_rcv_space_adjust is called every time data is copied to user space, so introduce a tcp tracepoint for it, which can show us when the packet is copied to the user. When a tcp packet arrives, tcp_rcv_established() will be called, and with the existing tracepoint tcp_probe we can get the time when this packet arrives. Then this packet will be copied to the user, tcp_rcv_space_adjust will be called, and with this newly introduced tracepoint we can get the time when this packet is copied to the user. With these two tracepoints, we can figure out whether the user program processes this packet immediately or there's latency. Hence in the printk message, sk_cookie is printed as a key to relate tcp_rcv_space_adjust with tcp_probe. Maybe we could export sockfd in this new tracepoint as well; then we could relate this new tracepoint with epoll/read/recv* tracepoints, and finally that could show us the whole lifespan of this packet. But we could also implement that with pid, as these functions are executed in process context. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 tcp: don't read out-of-bounds opsize (Jann Horn)
The old code reads the "opsize" variable from out-of-bounds memory (first byte behind the segment) if a broken TCP segment ends directly after an opcode that is neither EOL nor NOP. The result of the read isn't used for anything, so the worst thing that could theoretically happen is a pagefault; and since the physmap is usually mostly contiguous, even that seems pretty unlikely. The following C reproducer triggers the uninitialized read - however, you can't actually see anything happen unless you put something like a pr_warn() in tcp_parse_md5sig_option() to print the opsize.
====================================
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <assert.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;
int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1) err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req)) err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

#define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))

void sum_accumulate(unsigned int *sum, void *data, int len) {
  assert((len&2)==0);
  for (int i=0; i<len/2; i++) {
    *sum += ntohs(((unsigned short *)data)[i]);
  }
}

unsigned short sum_final(unsigned int sum) {
  sum = (sum >> 16) + (sum & 0xffff);
  sum = (sum >> 16) + (sum & 0xffff);
  return htons(~sum);
}

void fix_ip_sum(struct iphdr *ip) {
  unsigned int sum = 0;
  sum_accumulate(&sum, ip, sizeof(*ip));
  ip->check = sum_final(sum);
}

void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
  unsigned int sum = 0;
  struct {
    unsigned int saddr;
    unsigned int daddr;
    unsigned char pad;
    unsigned char proto_num;
    unsigned short tcp_len;
  } fakehdr = {
    .saddr = ip->saddr,
    .daddr = ip->daddr,
    .proto_num = ip->protocol,
    .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
  };
  sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
  sum_accumulate(&sum, tcp, tcp->doff*4);
  tcp->check = sum_final(sum);
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);
  systemf("ip addr add 192.168.42.1/24 dev %s", devname);

  struct {
    struct iphdr ip;
    struct tcphdr tcp;
    unsigned char tcp_opts[20];
  } __attribute__((packed)) syn_packet = {
    .ip = {
      .ihl = sizeof(struct iphdr)/4,
      .version = 4,
      .tot_len = htons(sizeof(syn_packet)),
      .ttl = 30,
      .protocol = IPPROTO_TCP,
      /* FIXUP check */
      .saddr = IPADDR(192,168,42,2),
      .daddr = IPADDR(192,168,42,1)
    },
    .tcp = {
      .source = htons(1),
      .dest = htons(1337),
      .seq = 0x12345678,
      .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
      .syn = 1,
      .window = htons(64),
      .check = 0 /*FIXUP*/
    },
    .tcp_opts = {
      /* INVALID: trailing MD5SIG opcode after NOPs */
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 19
    }
  };
  fix_ip_sum(&syn_packet.ip);
  fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);

  while (1) {
    int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
    if (write_res != sizeof(syn_packet))
      err(1, "packet write failed");
  }
}
====================================
Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 bpf: btf: Clean up btf.h in uapi (Martin KaFai Lau)
This patch cleans up btf.h in uapi: 1) Rename "name" to "name_off" to better reflect it is an offset to the string section instead of a char array. 2) Remove unused value BTF_FLAGS_COMPR and BTF_MAGIC_SWAP Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-23 bpf: fix virtio-net's length calc for XDP_PASS (Nikita V. Shirokov)
In commit 6870de435b90 ("bpf: make virtio compatible w/ bpf_xdp_adjust_tail") I didn't account for vi->hdr_len during the new packet's length calculation after bpf_prog_run in receive_mergeable. Because of this, all packets, if they were passed to the kernel, were truncated by 12 bytes. Fixes: 6870de435b90 ("bpf: make virtio compatible w/ bpf_xdp_adjust_tail") Reported-by: David Ahern <dsahern@gmail.com> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
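Illustrative arithmetic only (not the exact receive_mergeable code): after the XDP program may have moved the data pointers, the length handed to the stack has to be recomputed from them, plus the virtio-net header that sat in front of the payload:
====================================
/* Recompute the frame length from the (possibly adjusted) XDP data
 * pointers and add back the virtio-net header length.
 */
static unsigned int xdp_pass_len_sketch(const void *data,
                                        const void *data_end,
                                        unsigned int vnet_hdr_len)
{
        return (unsigned int)((const char *)data_end - (const char *)data) +
               vnet_hdr_len;
}
====================================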
2018-04-22 Linux 4.17-rc2 (tag: v4.17-rc2) (Linus Torvalds)
2018-04-22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (David S. Miller)
Daniel Borkmann says: ==================== pull-request: bpf 2018-04-21. The following pull-request contains BPF updates for your *net* tree. The main changes are:
1) Fix a deadlock between mm->mmap_sem and bpf_event_mutex when one task is detaching a BPF prog via perf_event_detach_bpf_prog() and another one dumping through bpf_prog_array_copy_info(). For the latter we move the copy_to_user() out of the bpf_event_mutex lock to fix it, from Yonghong.
2) Fix test_sock and test_sock_addr.sh failures. The former was hitting rlimit issues and the latter required ping to specify the address family, from Yonghong.
3) Remove a dead check in sockmap's sock_map_alloc(), from Jann.
4) Add generated files to BPF kselftests gitignore that were previously missed, from Anders.
==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 ibmvnic: Clean actual number of RX or TX pools (Thomas Falcon)
Avoid using value stored in the login response buffer when cleaning TX and RX buffer pools since these could be inconsistent depending on the device state. Instead use the field in the driver's private data that tracks the number of active pools. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 Merge branch 'net-sched-ife-malformed-ife-packet-fixes' (David S. Miller)
Alexander Aring says: ==================== net: sched: ife: malformed ife packet fixes. As promised at the netdev 2.2 tc workshop, I am working on adding scapy support for tdc testing. It is still work in progress; I will submit the patches to tdc later (they are not in good shape yet). The good news is I have been able to find bugs which normal packet testing would not be able to find. With fuzz testing I was able to craft certain malformed packets that the IFE action was not able to deal with. This patch set fixes those bugs. changes since v4: - use pskb_may_pull before pointer assign changes since v3: - use pskb_may_pull changes since v2: - remove inline from __ife_tlv_meta_valid - add const to cast to meta_tlvhdr - add acked and reviewed tags ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 net: sched: ife: check on metadata length (Alexander Aring)
This patch checks whether the sk buffer has enough data available to dereference the ife header. If not, NULL is returned to signal a malformed ife packet. This avoids crashing the kernel from outside. Signed-off-by: Alexander Aring <aring@mojatatu.com> Reviewed-by: Yotam Gigi <yotam.gi@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
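A sketch of the pskb_may_pull() pattern both ife fixes rely on; the header length below is a stand-in for the real ife header size:
====================================
/* Make sure the bytes are present (and linear) in the skb before any
 * header or TLV pointer is dereferenced.
 */
#include <linux/skbuff.h>

static void *ife_pull_header_sketch(struct sk_buff *skb)
{
        const unsigned int hdr_len = 2; /* stand-in for the ife header size */

        if (!pskb_may_pull(skb, hdr_len))
                return NULL; /* malformed packet, do not dereference */

        return skb->data;
}
====================================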
2018-04-22 net: sched: ife: handle malformed tlv length (Alexander Aring)
There is currently no handling to check for an invalid tlv length. This patch adds such handling to avoid killing the kernel with a malformed ife packet. Signed-off-by: Alexander Aring <aring@mojatatu.com> Reviewed-by: Yotam Gigi <yotam.gi@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>