summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-08-28bpf: additional sockmap self testsJohn Fastabend
Add some more sockmap tests to cover, - forwarding to NULL entries - more than two maps to test list ops - forwarding to different map Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28bpf: sockmap add missing rcu_read_(un)lock in smap_data_readyJohn Fastabend
References to psock must be done inside RCU critical section. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28bpf: sockmap, remove STRPARSER map_flags and add multi-map supportJohn Fastabend
The addition of map_flags BPF_SOCKMAP_STRPARSER flags was to handle a specific use case where we want to have BPF parse program disabled on an entry in a sockmap. However, Alexei found the API a bit cumbersome and I agreed. Lets remove the STRPARSER flag and support the use case by allowing socks to be in multiple maps. This allows users to create two maps one with programs attached and one without. When socks are added to maps they now inherit any programs attached to the map. This is a nice generalization and IMO improves the API. The API rules are less ambiguous and do not need a flag: - When a sock is added to a sockmap we have two cases, i. The sock map does not have any attached programs so we can add sock to map without inheriting bpf programs. The sock may exist in 0 or more other maps. ii. The sock map has an attached BPF program. To avoid duplicate bpf programs we only add the sock entry if it does not have an existing strparser/verdict attached, returning -EBUSY if a program is already attached. Otherwise attach the program and inherit strparser/verdict programs from the sock map. This allows for socks to be in a multiple maps for redirects and inherit a BPF program from a single map. Also this patch simplifies the logic around BPF_{EXIST|NOEXIST|ANY} flags. In the original patch I tried to be extra clever and only update map entries when necessary. Now I've decided the complexity is not worth it. If users constantly update an entry with the same sock for no reason (i.e. update an entry without actually changing any parameters on map or sock) we still do an alloc/release. Using this and allowing multiple entries of a sock to exist in a map the logic becomes much simpler. Note: Now that multiple maps are supported the "maps" pointer called when a socket is closed becomes a list of maps to remove the sock from. To keep the map up to date when a sock is added to the sockmap we must add the map/elem in the list. Likewise when it is removed we must remove it from the list. This results in searching the per psock list on delete operation. On TCP_CLOSE events we walk the list and remove the psock from all map/entry locations. I don't see any perf implications in this because at most I have a psock in two maps. If a psock were to be in many maps its possibly this might be noticeable on delete but I can't think of a reason to dup a psock in many maps. The sk_callback_lock is used to protect read/writes to the list. This was convenient because in all locations we were taking the lock anyways just after working on the list. Also the lock is per sock so in normal cases we shouldn't see any contention. Suggested-by: Alexei Starovoitov <ast@kernel.org> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28bpf: convert sockmap field attach_bpf_fd2 to typeJohn Fastabend
In the initial sockmap API we provided strparser and verdict programs using a single attach command by extending the attach API with a the attach_bpf_fd2 field. However, if we add other programs in the future we will be adding a field for every new possible type, attach_bpf_fd(3,4,..). This seems a bit clumsy for an API. So lets push the programs using two new type fields. BPF_SK_SKB_STREAM_PARSER BPF_SK_SKB_STREAM_VERDICT This has the advantage of having a readable name and can easily be extended in the future. Updates to samples and sockmap included here also generalize tests slightly to support upcoming patch for multiple map support. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-27ARM: dts: rk3228-evb: Fix the compiling errorDavid Wu
This patch solves the following error: arch/arm/boot/dts/rk3228-evb.dtb: ERROR (phandle_references): Reference to non-existent node or label "phy0" Fixess db40f15b53e4 ("ARM: dts: rk3228-evb: Enable the integrated PHY for gmac") Signed-off-by: David Wu <david.wu@rock-chips.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net: mvpp2: fix the packet size configuration for 10GAntoine Ténart
The MVPP22_XLG_CTRL1_FRAMESIZELIMIT define is used as an offset, but is defined as BIT(0). Updated its name to contains "OFFS" as in offset and fix its value using the offset value, 0. Reported-by: Stefan Chulski <stefanc@marvell.com> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Fixes: 76eb1b1de5b6 ("net: mvpp2: set maximum packet size for 10G ports") Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2017-08-25 This series contains updates to i40e and i40evf only. Mitch adjusts the max packet size to account for two VLAN tags. Sudheer provides a fix to ensure that the watchdog timer is scheduled immediately after admin queue operations are scheduled in i40evf_down(). Fixes an issue by adding locking around the admin queue command and update of state variables so that adminq_subtask will have the accurate information whenever it gets scheduled. Anjali fixes a bug where the PF flag setup should happen before the VMDq RSS queue count is initialized for VMDq VSI to get the right number of queues for RSS in the case of x722 devices. Fixed a problem with the hardware ATR eviction feature where the NVM setting was incorrect. Jake separates the flags into two types, hw_features and flags. The hw_features flags contain a set of features which are enabled at init time and will not contain feature flags that can be toggled. Everything else will remain in the flags variable, and can be modified anytime during run time. We should not be directly copying a cpumask_t, since it is bitmap and might not be copied correctly, so use cpumask_copy() instead. Stefan Assmann makes vf _offload_flags more "generic" by renaming it to vf_cap_flags, which allows other capabilities besides offloading to be added. Alan makes it such that if adaptive-rx/tx is enabled, the user cannot make any manual adjustments to interrupt moderation. Also makes it so that if ITR is disabled by adaptive-rx/tx is then enabled, ITR will be re-enabled. v2: Dropped patches #1 & #8 from the original patch series submission, while Jesse and Jake re-work their patches based on feedback from David Miller. Also removed the duplicate patch 3 that was accidentally sent out twice in the previous submission. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'nfp-SR-IOV-ndos-support'David S. Miller
Jakub Kicinski says: ==================== nfp: SR-IOV ndos support This set adds basic SR-IOV including setting/getting VF MAC addresses, VLANs, link state and spoofcheck settings. It is wired up for both vNICs and representors (note: ip link will not report VF settings on VF/PF representors because they are not linked to the PF PCI device). Pablo and team add the basic implementation, Simon and Dirk follow up with the representor plumbing. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25nfp: add basic SR-IOV ndo functions to representorsSimon Horman
Add basic ndo_set/get_vf to support SR-IOV on all types of port representors. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25nfp: add basic SR-IOV ndo functionsPablo Cascón
Add basic ndo_set/get_vf to support SR-IOV. VF to egress phy static mapping by now. Use vfcfg ABI version 2 to write the info to the FW and collect the return value from the mailbox. Signed-off-by: Pablo Cascón <pablo.cascon@netronome.com> Signed-off-by: Jimmy Kizito <jimmy.kizito@netronome.com> Signed-off-by: Rami Tomer <rami.tomer@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25tcp: fix hang in tcp_sendpage_locked()Eric Dumazet
syszkaller got a hang in tcp stack, related to a bug in tcp_sendpage_locked() root@syzkaller:~# cat /proc/3059/stack [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0 [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110 [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50 [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0 [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110 [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60 [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280 [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160 [<ffffffff84089203>] tcp_sendpage+0x43/0x60 [<ffffffff841641da>] inet_sendpage+0x1aa/0x660 [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0 [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0 [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0 [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750 [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330 [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50 [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610 [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'net_sched-clean-up-tc-classes-and-u32-filter'David S. Miller
Cong Wang says: ==================== net_sched: clean up tc classes and u32 filter Patch 1 and patch 2 prepare for patch 3. Major changes are in patch 3 and patch 4, details are there too. v2: Add patch 1 and 2, group all into a patchset Fix a coding style issue in patch 4 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: kill u32_node pointer in QdiscWANG Cong
It is ugly to hide a u32-filter-specific pointer inside Qdisc, this breaks the TC layers: 1. Qdisc is a generic representation, should not have any specific data of any type 2. Qdisc layer is above filter layer, should only save filters in the list of struct tcf_proto. This pointer is used as the head of the chain of u32 hash tables, that is struct tc_u_hnode, because u32 filter is very special, it allows to create multiple hash tables within one qdisc and across multiple u32 filters. Instead of using this ugly pointer, we can just save it in a global hash table key'ed by (dev ifindex, qdisc handle), therefore we can still treat it as a per qdisc basis data structure conceptually. Of course, because of network namespaces, this key is not unique at all, but it is fine as we already have a pointer to Qdisc in struct tc_u_common, we can just compare the pointers when collision. And this only affects slow paths, has no impact to fast path, thanks to the pointer ->tp_c. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: remove tc class reference countingWANG Cong
For TC classes, their ->get() and ->put() are always paired, and the reference counting is completely useless, because: 1) For class modification and dumping paths, we already hold RTNL lock, so all of these ->get(),->change(),->put() are atomic. 2) For filter bindiing/unbinding, we use other reference counter than this one, and they should have RTNL lock too. 3) For ->qlen_notify(), it is special because it is called on ->enqueue() path, but we already hold qdisc tree lock there, and we hold this tree lock when graft or delete the class too, so it should not be gone or changed until we release the tree lock. Therefore, this patch removes ->get() and ->put(), but: 1) Adds a new ->find() to find the pointer to a class by classid, no refcnt. 2) Move the original class destroy upon the last refcnt into ->delete(), right after releasing tree lock. This is fine because the class is already removed from hash when holding the lock. For those who also use ->put() as ->unbind(), just rename them to reflect this change. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: introduce tclass_del_notify()WANG Cong
Like for TC actions, ->delete() is a special case, we have to prepare and fill the notification before delete otherwise would get use-after-free after we remove the reference count. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25net_sched: get rid of more forward declarationsWANG Cong
This is not needed if we move them up properly. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25hinic: skb_pad() frees on errorDan Carpenter
The skb_pad() function frees the skb on error, so this code has a double free. Fixes: 00e57a6d4ad3 ("net-next/hinic: Add Tx operation") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'ipv6-sr-updates'David S. Miller
David Lebrun says: ==================== net: updates for IPv6 Segment Routing v2: seg6_lwt_headroom() is not relevant for lwtunnel_input_redirect() use cases, and L2ENCAP only uses this redirection. Fix incoherence between arbitrary MAC header size support and fixed headroom computation by setting only LWTUNNEL_STATE_INPUT_REDIRECT for L2ENCAP mode. This patch series provides several updates for the SRv6 implementation. The first patch leverages the existing infrastructure to support encapsulation of IPv4 packets. The second patch implements the T.Encaps.L2 SR function, enabling to encapsulate an L2 Ethernet frame within an IPv6+SRH packet. The last three patches update the seg6local lightweight tunnel, and mainly implement four new actions: End.T, End.DX2, End.DX4 and End.DT6. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: implement additional seg6local actionsDavid Lebrun
This patch implements the following seg6local actions. - SEG6_LOCAL_ACTION_END_T: regular SRH processing and forward to the next-hop looked up in the specified routing table. - SEG6_LOCAL_ACTION_END_DX2: decapsulate an L2 frame and forward it to the specified network interface. - SEG6_LOCAL_ACTION_END_DX4: decapsulate an IPv4 packet and forward it, possibly to the specified next-hop. - SEG6_LOCAL_ACTION_END_DT6: decapsulate an IPv6 packet and forward it to the next-hop looked up in the specified routing table. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add helper functions for seg6localDavid Lebrun
This patch adds three helper functions to be used with the seg6local packet processing actions. The decap_and_validate() function will be used by the End.D* actions, that decapsulate an SR-enabled packet. The advance_nextseg() function applies the fundamental operations to update an SRH for the next segment. The lookup_nexthop() function helps select the next-hop for the processed SR packets. It supports an optional next-hop address to route the packet specifically through it, and an optional routing table to use. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: enforce IPv6 packets for seg6local lwtDavid Lebrun
This patch ensures that the seg6local lightweight tunnel is used solely with IPv6 routes and processes only IPv6 packets. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add support for encapsulation of L2 framesDavid Lebrun
This patch implements the L2 frame encapsulation mechanism, referred to as T.Encaps.L2 in the SRv6 specifications [1]. A new type of SRv6 tunnel mode is added (SEG6_IPTUN_MODE_L2ENCAP). It only accepts packets with an existing MAC header (i.e., it will not work for locally generated packets). The resulting packet looks like IPv6 -> SRH -> Ethernet -> original L3 payload. The next header field of the SRH is set to NEXTHDR_NONE. [1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add support for ip4ip6 encapsulationDavid Lebrun
This patch enables the SRv6 encapsulation mode to carry an IPv4 payload. All the infrastructure was already present, I just had to add a parameter to seg6_do_srh_encap() to specify the inner packet protocol, and perform some additional checks. Usage example: ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25i40e: synchronize nvmupdate command and adminq subtaskSudheer Mogilappagari
During NVM update, state machine gets into unrecoverable state because i40e_clean_adminq_subtask can get scheduled after the admin queue command but before other state variables are updated. This causes incorrect input to i40e_nvmupd_check_wait_event and state transitions don't happen. This issue existed before but surfaced after commit 373149fc99a0 ("i40e: Decrease the scope of rtnl lock") This fix adds locking around admin queue command and update of state variables so that adminq_subtask will have accurate information whenever it gets scheduled. Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e: prevent changing ITR if adaptive-rx/tx enabledAlan Brady
Currently the driver allows the user to change (or even disable) interrupt moderation if adaptive-rx/tx is enabled when this should not be the case. Adaptive RX/TX will not respect the user's ITR settings so allowing the user to change it is weird. This bug would also allow the user to disable interrupt moderation with adaptive-rx/tx enabled which doesn't make much sense either. This patch makes it such that if adaptive-rx/tx is enabled, the user cannot make any manual adjustments to interrupt moderation. It also makes it so that if ITR is disabled but adaptive-rx/tx is then enabled, ITR will be re-enabled. Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e: use cpumask_copy instead of direct assignmentJacob Keller
According to the header file cpumask.h, we shouldn't be directly copying a cpumask_t, since its a bitmap and might not be copied correctly. Lets use the provided cpumask_copy() function instead. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40evf: use netdev variable in reset taskAlan Brady
If we're going to bother initializing a variable to reference it we might as well use it. Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e/i40evf: rename vf_offload_flags to vf_cap_flags in struct ↵Stefan Assmann
virtchnl_vf_resource The current name of vf_offload_flags indicates that the bitmap is limited to offload related features. Make this more generic by renaming it to vf_cap_flags, which allows for other capabilities besides offloading to be added. Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e: move check for avoiding VID=0 filters into i40e_vsi_add_vlanJacob Keller
In i40e_vsi_add_vlan we treat attempting to add VID=0 as an error, because it does not do what the caller might expect. We already special case VID=0 in i40e_vlan_rx_add_vid so that we avoid this error when adding the VLAN. This special casing is necessary so that we do not add the VLAN=0 filter since we don't want to stop receiving untagged traffic. Unfortunately, not all callers of i40e_vsi_add_vlan are aware of this, including when we add VLANs from a VF device. Rather than special casing every single caller of i40e_vsi_add_vlan, lets just move this check internally. This makes the code simpler because the caller does not need to be aware of how VLAN=0 is special, and we don't forget to add this check in new places. This fixes a harmless error message displaying when adding a VLAN from within a VF. The message was meaningless but there is no reason to confuse end users and system administrators, and this is now avoided. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e/i40evf: use cmpxchg64 when updating private flags in ethtoolJacob Keller
When a user gives an invalid command to change a private flag which is not supported, either because it is read-only, or the device is not capable of the feature, we simply ignore the request. A naive solution would simply be to report error codes when one of the flags was not supported. However, this causes problems because it makes the operation not atomic. If a user requests multiple private flags together at once we could end up changing one before failing at the second flag. We can do a bit better if we instead update a temporary copy of the flags variable in the loop, and then copy it into place after. If we aren't careful this has the pitfall of potentially silently overwriting any changes caused by other threads. Avoid this by using cmpxchg64 which will compare and swap the flags variable only if it currently matched the old value. We'll report -EAGAIN in the (hopefully rare!) case where the cmpxchg64 fails. This ensures that we can properly report when flags are not supported in an atomic fashion without the risk of overwriting other threads changes. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e: Detect ATR HW Evict NVM issue and disable the featureAnjali Singhai Jain
This patch fixes a problem with the HW ATR eviction feature where the NVM setting was incorrect. This patch detects the issue on X720 adapters and disables the feature if the NVM setting is incorrect. Without this patch, HW ATR Evict feature does not work on broken NVMs and is not detected either. If the HW ATR Evict feature is disabled the SW Eviction feature will take effect. Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Alice Michael <alice.michael@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e: remove workaround for Open Firmware MAC addressJacob Keller
Since commit b499ffb0a22c ("i40e: Look up MAC address in Open Firmware or IDPROM"), we've had support for obtaining the MAC address form Open Firmware or IDPROM. This code relied on sending the Open Firmware address directly to the device firmware instead of relying on our MAC/VLAN filter list. Thus, a work around was introduced in commit b1b15df59232 ("i40e: Explicitly write platform-specific mac address after PF reset") We refactored the Open Firmware address enablement code in the ill-named commit 41c4c2b50d52 ("i40e: allow look-up of MAC address from Open Firmware or IDPROM") Since this refactor, we no longer even set I40E_FLAG_PF_MAC. Further, we don't need this work around, because we actually store the MAC address as part of the MAC/VLAN filter hash. Thus, we will restore the address correctly upon reset. The refactor above failed to revert the workaround, so do that now. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e: separate hw_features from runtime changing flagsJacob Keller
The number of flags found in pf->flags has grown quite large, and there are a lot of different types of flags. Most of the flags are simply hardware features which are enabled on some firmware or some MAC types. Other flags are dynamic run-time flags which enable or disable certain features of the driver. Separate these two types of flags into pf->hw_features and pf->flags. The hw_features list will contain a set of features which are enabled at init time. This will not contain toggles or otherwise dynamically changing features. These flags should not need atomic protections, as they will be set once during init and then be essentially read only. Everything else will remain in the flags variable. These flags may be modified at any time during run time. A future patch may wish to convert these flags into set_bit/clear_bit/test_bit or similar approach to ensure atomic correctness. The I40E_FLAG_MFP_ENABLED flag may be a good fit for hw_features but currently is used by ethtool in the private flags settings, and thus has been left as part of flags. Additionally, I40E_FLAG_DCB_CAPABLE may be a good fit for the hw_features but this patch has not tried to untangle it yet. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e: Fix a bug with VMDq RSS queue allocationAnjali Singhai Jain
The X722 pf flag setup should happen before the VMDq RSS queue count is initialized for VMDq VSI to get the right number of queues for RSS in case of X722 devices. Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Alice Michael <alice.michael@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40evf: prevent VF close returning before state transitions to DOWNSudheer Mogilappagari
Currently i40evf_close() can return before state transitions to __I40EVF_DOWN because of the latency involved in processing and receiving response from PF driver and scheduling of VF watchdog_task. Due to this inconsistency an immediate call to i40evf_open() fails because state is still DOWN_PENDING. When a VF interface is in up state and we try to add it as slave, The bonding driver calls dev_close() and dev_open() in short duration resulting in dev_open returning error. The ifenslave command needs to be run again for dev_open to succeed. This fix ensures that watchdog timer is scheduled immediately after admin queue operations are scheduled in i40evf_down(). In addition a wait condition is added at the end of i40evf_close so that function wont return when state is still DOWN_PENDING. The timeout value is chosen after some profiling and includes some buffer. Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-25i40e/i40evf: adjust packet size to account for double VLANsMitch Williams
Now that the kernel supports double VLAN tags, we should at least play nice. Adjust the max packet size to account for two VLAN tags, not just one. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-24strparser: initialize all callbacksEric Biggers
commit bbb03029a899 ("strparser: Generalize strparser") added more function pointers to 'struct strp_callbacks'; however, kcm_attach() was not updated to initialize them. This could cause the ->lock() and/or ->unlock() function pointers to be set to garbage values, causing a crash in strp_work(). Fix the bug by moving the callback structs into static memory, so unspecified members are zeroed. Also constify them while we're at it. This bug was found by syzkaller, which encountered the following splat: IP: 0x55 PGD 3b1ca067 P4D 3b1ca067 PUD 3b12f067 PMD 0 Oops: 0010 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: kstrp strp_work task: ffff88006bb0e480 task.stack: ffff88006bb10000 RIP: 0010:0x55 RSP: 0018:ffff88006bb17540 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000 RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48 RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000 R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48 R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000 FS: 0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098 worker_thread+0x223/0x1860 kernel/workqueue.c:2233 kthread+0x35e/0x430 kernel/kthread.c:231 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431 Code: Bad RIP value. RIP: 0x55 RSP: ffff88006bb17540 CR2: 0000000000000055 ---[ end trace f0e4920047069cee ]--- Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and CONFIG_AF_KCM=y): #include <linux/bpf.h> #include <linux/kcm.h> #include <linux/types.h> #include <stdint.h> #include <sys/ioctl.h> #include <sys/socket.h> #include <sys/syscall.h> #include <unistd.h> static const struct bpf_insn bpf_insns[3] = { { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */ { .code = 0x95 }, /* BPF_EXIT_INSN() */ }; static const union bpf_attr bpf_attr = { .prog_type = 1, .insn_cnt = 2, .insns = (uintptr_t)&bpf_insns, .license = (uintptr_t)"", }; int main(void) { int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &bpf_attr, sizeof(bpf_attr)); int inet_fd = socket(AF_INET, SOCK_STREAM, 0); int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0); ioctl(kcm_fd, SIOCKCMATTACH, &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd }); } Fixes: bbb03029a899 ("strparser: Generalize strparser") Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24hv_netvsc: Fix rndis_filter_close error during netvsc_removeHaiyang Zhang
We now remove rndis filter before unregister_netdev(), which calls device close. It involves closing rndis filter already removed. This patch fixes this error. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24Merge tag 'mlx5-updates-2017-08-24' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2017-08-24 This series includes updates to mlx5 core driver. From Gal and Saeed, three cleanup patches. From Matan, Low level flow steering improvements and optimizations, - Use more efficient data structures for flow steering objects handling. - Add tracepoints to flow steering operations. - Overall these patches improve flow steering rule insertion rate by a factor of seven in large scales (~50K rules or more). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24hinic: uninitialized variable in hinic_api_cmd_init()Dan Carpenter
We never set the error code in this function. Fixes: eabf0fad81d5 ("net-next/hinic: Initialize api cmd resources") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24net: mv643xx_eth: Be drop monitor friendlyFlorian Fainelli
txq_reclaim() does the normal transmit queue reclamation and rxq_deinit() does the RX ring cleanup, none of these are packet drops, so use dev_consume_skb() for both locations. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24tg3: Be drop monitor friendlyFlorian Fainelli
tg3_tx() does the normal packet TX completion, tigon3_dma_hwbug_workaround() and tg3_tso_bug() both need to allocate a new SKB that is suitable to workaround HW bugs, and finally tg3_free_rings() is doing ring cleanup. Use dev_consume_skb_any() for these 3 locations to be SKB drop monitor friendly. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24Merge branch 'ipv6-Route-ICMPv6-errors-with-the-flow-when-ECMP-in-use'David S. Miller
Jakub Sitnicki says: ==================== ipv6: Route ICMPv6 errors with the flow when ECMP in use This patch set is another take at making Path MTU Discovery work when server nodes are behind a router employing multipath routing in a load-balance or anycast setup (that is, when not every end-node can be reached by every path). The problem has been well described in RFC 7690 [1], but in short - in such setups ICMPv6 PTB errors are not guaranteed to be routed back to the server node that sent a reply that exceeds path MTU. The proposed solution is two-fold: (1) on the server side - reflect the Flow Label [2]. This can be done without modifying the application using a new per-netns sysctl knob that has been proposed independently of this patchset in the patch entitled "ipv6: Add sysctl for per namespace flow label reflection" [3]. (2) on the ECMP router - make the ipv6 routing subsystem look into the ICMPv6 error packets and compute the flow-hash from its payload, i.e. the offending packet that triggered the error. This is the same behavior as ipv4 stack has already. With both parts in place Path MTU Discovery can work past the ECMP router when using IPv6. [1] https://tools.ietf.org/html/rfc7690 [2] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 [3] http://patchwork.ozlabs.org/patch/804870/ v1 -> v2: - don't use "extern" in external function declaration in header file - style change, put as many arguments as possible on the first line of a function call, and align consecutive lines to the first argument - expand the cover letter based on the feedback v2 -> v3: - switch to computing flow-hash using flow dissector to align with recent changes to multipath routing in ipv4 stack - add a sysctl knob for enabling flow label reflection per netns --- Testing has covered multipath routing of ICMPv6 PTB errors in forward and local output path in a simple use-case of an HTTP server sending a reply which is over the path MTU size [3]. I have also checked if the flows get evenly spread over multiple paths (i.e. if there are no regressions) [4]. [3] https://github.com/jsitnicki/tools/tree/master/net/tests/ecmp/pmtud [4] https://github.com/jsitnicki/tools/tree/master/net/tests/ecmp/load-balance ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24ipv6: Use multipath hash from flow info if availableJakub Sitnicki
Allow our callers to influence the choice of ECMP link by honoring the hash passed together with the flow info. This allows for special treatment of ICMP errors which we would like to route over the same path as the IPv6 datagram that triggered the error. Also go through rt6_multipath_hash(), in the usual case when we aren't dealing with an ICMP error, so that there is one central place where multipath hash is computed. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24ipv6: Fold rt6_info_hash_nhsfn() into its only callerJakub Sitnicki
Commit 644d0e656958 ("ipv6 Use get_hash_from_flowi6 for rt6 hash") has turned rt6_info_hash_nhsfn() into a one-liner, so it no longer makes sense to keep it around. Also remove the accompanying comment that has become outdated. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24ipv6: Compute multipath hash for ICMP errors from offending packetJakub Sitnicki
When forwarding or sending out an ICMPv6 error, look at the embedded packet that triggered the error and compute a flow hash over its headers. This let's us route the ICMP error together with the flow it belongs to when multipath (ECMP) routing is in use, which in turn makes Path MTU Discovery work in ECMP load-balanced or anycast setups (RFC 7690). Granted, end-hosts behind the ECMP router (aka servers) need to reflect the IPv6 Flow Label for PMTUD to work. The code is organized to be in parallel with ipv4 stack: ip_multipath_l3_keys -> ip6_multipath_l3_keys fib_multipath_hash -> rt6_multipath_hash Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24net: Extend struct flowi6 with multipath hashJakub Sitnicki
Allow for functions that fill out the IPv6 flow info to also pass a hash computed over the skb contents. The hash value will drive the multipath routing decisions. This is intended for special treatment of ICMPv6 errors, where we would like to make a routing decision based on the flow identifying the offending IPv6 datagram that triggered the error, rather than the flow of the ICMP error itself. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24devlink: Fix devlink_dpipe_table_register() stub signature.David S. Miller
One too many arguments compared to the non-stub version. Reported-by: kbuild test robot <fengguang.wu@intel.com> Fixes: ffd3cdccf214 ("devlink: Add support for dynamic table size") Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24ipv6: Add sysctl for per namespace flow label reflectionJakub Sitnicki
Reflecting IPv6 Flow Label at server nodes is useful in environments that employ multipath routing to load balance the requests. As "IPv6 Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error messages generated in response to a downstream packets from the server can be routed by a load balancer back to the original server without looking at transport headers, if the server applies the flow label reflection. This enables the Path MTU Discovery past the ECMP router in load-balance or anycast environments where each server node is reachable by only one path. Introduce a sysctl to enable flow label reflection per net namespace for all newly created sockets. Same could be earlier achieved only per socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR socket option. [1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24net/mlx5e: make mlx5e_profile constBhumika Goyal
Make this const as it is only passed as an argument to the function mlx5e_create_netdev and the corresponding argument is of type const. Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>