path: root/net
Age  Commit message  Author
2015-08-25  RDS: Don't destroy the rdma id until after we're done using it  (Santosh Shilimkar)
During connection resets, we are destroying the rdma id too soon. We can't destroy it while it is still in use. So let's move rdma_destroy_id() to after we clear the rings.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: Fix assertion level from fatal to warning  (santosh.shilimkar@oracle.com)
Fix the assertion level since it's not fatal and can be hit in normal execution paths. There is no need to take the system down. We keep the WARN_ON() to detect the condition if we get here with bad pages.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: Make sure we do a signaled send for large-send  (santosh.shilimkar@oracle.com)
WRs (Work Requests) always generate a WC (Work Completion) with a signaled send. The default RDS IB code is set up for unsignaled completion. Since an RDS connection is persistent, we can end up sending data even after a large send when the remote end is not active (for any reason). By doing a signaled send at least once per large send, we can detect the problem in the work completion handler, thereby avoiding sending more data to an inactive remote.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
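[A minimal sketch of the signaled-send pattern the commit describes; the helper name and interval are invented for illustration, and the real logic lives in net/rds/ib_send.c.]

	#include <rdma/ib_verbs.h>

	/* Hypothetical helper: request a signaled completion once every
	 * RDS_SIGNAL_INTERVAL posted work requests.  The surviving WC lets
	 * the completion handler notice a dead peer instead of silently
	 * queueing more unsignaled sends behind it. */
	#define RDS_SIGNAL_INTERVAL 16

	static void rds_example_set_send_flags(struct ib_send_wr *wr,
					       unsigned int nr_sent)
	{
		wr->send_flags = 0;		/* default: unsignaled */
		if ((nr_sent % RDS_SIGNAL_INTERVAL) == 0)
			wr->send_flags = IB_SEND_SIGNALED;
	}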
2015-08-25  RDS: Mark message mapped before transmit  (santosh.shilimkar@oracle.com)
rds_send_xmit() marks the rds message map flag after xmit_[rdma/atomic](), which is clearly wrong. We need to maintain the ownership between transport and rds. Also take care of the error path.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: add a sock_destruct callback debug aid  (santosh.shilimkar@oracle.com)
This helps detect processes/apps accidentally trying to destroy an RDS socket which they are sharing with other processes/apps.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: check for congestion updates during rds_send_xmit  (santosh.shilimkar@oracle.com)
Ensure we don't keep sending the data if the link is congested.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: make sure we post recv buffers  (santosh.shilimkar@oracle.com)
If we get an ENOMEM during rds_ib_recv_refill, we might never come back and refill again later. This patch makes sure to kick krdsd into helping out. To achieve this we add an RDS_RECV_REFILL flag and update the refill path based on it, so that at least some thread will keep posting receive buffers. Since krdsd and softirq might both race for the refill, we decide to schedule on the work queue based on ring_low instead of ring_empty.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
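[A hedged sketch of the fallback described above; helper and field names follow the RDS sources (net/rds/ib_recv.c), but treat the details as illustrative.]

	/* If the refill could not complete (ENOMEM) or the ring is running
	 * low, punt to krdsd via the connection's recv work so posting is
	 * guaranteed to be retried.  Keying on "low" rather than "empty"
	 * covers the race between krdsd and softirq refills. */
	static void rds_ib_kick_refill(struct rds_connection *conn,
				       struct rds_ib_connection *ic,
				       bool refill_failed)
	{
		if (rds_conn_up(conn) &&
		    (refill_failed || rds_ib_ring_low(&ic->i_recv_ring)))
			queue_delayed_work(rds_wq, &conn->c_recv_w, 1);
	}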
2015-08-25  RDS: don't update ip address tables if the address hasn't changed  (santosh.shilimkar@oracle.com)
If the ip address tables haven't changed, there is no need to remove them only to add them back again. Let's fix it.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: destroy the ib state earlier during shutdown  (santosh.shilimkar@oracle.com)
Destroy ib state early during shutdown. Otherwise we can get callbacks after the QP isn't really able to handle them.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: always free recv frag as we free its ring entry  (santosh.shilimkar@oracle.com)
We were still seeing rare occurrences of the WARN_ON(recv->r_frag) which indicates that the recv refill path was finding allocated frags in ring entries that were marked free. These were usually followed by OOM crashes. They only seem to be occurring in the presence of completion errors and connection resets. This patch ensures that we free the frag as we mark the ring entry free. This should stop the refill path from finding allocated frags in ring entries that were marked free.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  RDS: restore return value in rds_cmsg_rdma_args()  (santosh.shilimkar@oracle.com)
In rds_cmsg_rdma_args() 'ret' is used by rds_pin_pages(), which returns the number of pinned pages on success. And the same value is returned to the caller of rds_cmsg_rdma_args() on success, which is not intended. Commit f4a3fc03c1d7 ("RDS: Clean up error handling in rds_cmsg_rdma_args") removed the 'ret = 0' line, which broke RDS RDMA mode. Fix it by restoring the return value on rds_pin_pages() success, keeping the clean-up in place.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
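[The shape of the restored logic, as a sketch; the argument list is abbreviated and illustrative.]

	ret = rds_pin_pages(iov->addr, nr_pages, pages, 1);
	if (ret < 0)
		goto out;
	/* rds_pin_pages() returned the pinned-page count; the syscall
	 * contract is 0 on success, so reset it before falling through. */
	ret = 0;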
2015-08-25  tcp: refine pacing rate determination  (Eric Dumazet)
When TCP pacing was added back in linux-3.12, we chose to apply a fixed ratio of 200% against the current rate, to allow probing for optimal throughput even during the slow start phase, where cwnd can be doubled every other RTT. At Google, we found it was better to apply a different ratio while in the Congestion Avoidance phase. This ratio was set to 120%. We've used the normal tcp_in_slow_start() helper for a while, then tuned the condition to select the conservative ratio as soon as cwnd >= ssthresh/2:
- After cwnd reduction, it is safer to ramp up more slowly, as we approach the optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling cwnd every other RTT.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
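[A sketch of the ratio selection only; the real tcp_update_pacing_rate() additionally scales by cwnd and divides by the smoothed RTT.]

	static u32 tcp_pacing_ratio_percent(const struct tcp_sock *tp)
	{
		/* Initial ramp-up (ssthresh == infinity) keeps the aggressive
		 * ratio, so cwnd can still double every other RTT. */
		if (tp->snd_cwnd < tp->snd_ssthresh / 2)
			return 200;	/* slow start: probe for more */
		return 120;		/* congestion avoidance: conservative */
	}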
2015-08-25  xfrm: Use VRF master index if output device is enslaved  (David Ahern)
Directs route lookups to the VRF table. Compiles out if NET_VRF is not enabled. With this patch we are able to successfully bring up ipsec tunnels in VRFs, even with duplicate network configuration.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  tcp: fix slow start after idle vs TSO/GSO  (Eric Dumazet)
Slow start after idle might reduce cwnd, but we perform this check after the first packet was cooked and sent. With TSO/GSO, it means that we might send a full TSO packet even if cwnd should have been reduced to IW10. Moving the SSAI check into skb_entail() makes sense, because we slightly reduce the number of times this check is done, especially for large send() and TCP Small Queues callbacks from softirq context. As Neal pointed out, we also need to perform the check if/when the receive window opens.
Tested: The following packetdrill test demonstrates the problem

// Test of slow start after idle
`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
+0 write(4, ..., 26000) = 26000
+0 > . 1:5001(5000) ack 1
+0 > . 5001:10001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
+.100 < . 1:1(0) ack 10001 win 511
+0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0 > . 10001:20001(10000) ack 1
+0 > P. 20001:26001(6000) ack 1
+.100 < . 1:1(0) ack 26001 win 511
+0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0 > . 26001:31001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 > . 31001:36001(5000) ack 1

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25  batman-adv: beautify supported routing algorithm list  (Marek Lindner)
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: Fix conditional statements indentation  (Sven Eckelmann)
commit 29b9256e6631 ("batman-adv: consider outgoing interface in OGM sending") incorrectly indented the interface check code.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: Add lockdep_asserts for documented external locks  (Sven Eckelmann)
Some functions already have documentation about the locks they require inside their kerneldoc header. These requirements can be checked directly at runtime using lockdep asserts.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
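[The pattern looks like this; the struct, function, and lock names are invented for the example.]

	#include <linux/lockdep.h>
	#include <linux/spinlock.h>
	#include <linux/list.h>

	struct example_priv {
		spinlock_t list_lock;		/* documented protector */
		struct hlist_head entries;
	};

	/* Kerneldoc says: "caller must hold list_lock".  With lockdep
	 * enabled this turns the comment into a runtime-checked contract. */
	static void example_purge_entries(struct example_priv *priv)
	{
		lockdep_assert_held(&priv->list_lock);
		/* ... walk and prune priv->entries ... */
	}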
2015-08-25  batman-adv: Annotate deleting functions with external lock via lockdep  (Sven Eckelmann)
Functions which use (h)list_del* require correct locking when they operate on global lists. Most of the time the search in the list and the delete are done in the same function. All other cases should make it visible that they require a special lock to avoid race conditions. Lockdep asserts can be used to check these problems at runtime when the lockdep functionality is enabled.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: convert bat_priv->tt.req_list to hlist  (Marek Lindner)
Since the list's tail is never accessed, using a double-linked list head wastes memory.
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
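[For context, the memory argument in one line each, using the standard <linux/list.h> types.]

	#include <linux/list.h>

	struct list_head  head;		/* { next, prev }: two pointers */
	struct hlist_head hhead;	/* { first }: one pointer, no O(1) tail */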
2015-08-25  batman-adv: Fix gw_bandwidth calculation on 32 bit systems  (Sven Eckelmann)
The TVLV for the gw_bandwidth stores everything as u32. But the gw_bandwidth code reads a signed long, which limits the maximum value to (2 ** 31 - 1) on systems with a 4 byte long. Also, the input value is always converted from either Mibit/s or Kibit/s to 100Kibit/s. This reduces the values even further when the user sets them via the default unit Kibit/s. It may even cause an integer overflow and end up with a value the user never intended. Instead, read the values as u64, check for possible overflows, do the unit adjustments and then reduce the size to u32.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
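[A sketch of the safer parse; the function name and unit factors are illustrative, and the real conversion lives in net/batman-adv/gateway_common.c.]

	#include <linux/kernel.h>
	#include <linux/math64.h>

	/* Parse in 64 bits, convert to the 100Kibit/s steps stored in the
	 * TVLV, and refuse anything that no longer fits in u32. */
	static int example_gw_bandwidth_to_u32(u64 value, bool mbit, u32 *out)
	{
		if (mbit)
			value *= 10;		/* Mibit/s -> 100Kibit/s steps (illustrative) */
		else
			value = div_u64(value, 100);	/* Kibit/s -> 100Kibit/s steps */

		if (value > U32_MAX)
			return -ERANGE;
		*out = (u32)value;
		return 0;
	}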
2015-08-25  batman-adv: Return EINVAL on invalid gw_bandwidth change  (Sven Eckelmann)
Invalid speed settings by the user are currently acknowledged as correct but not stored. Instead, the return value of the store operation on the file "gw_bandwidth" should indicate that the given value is not acceptable.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: prevent potential hlist double deletion  (Marek Lindner)
The hlist_del_rcu() call in batadv_tt_global_size_mod() does not check whether the element is still part of the list prior to deletion. The atomic list counter should prevent the worst, but converting to hlist_del_init_rcu() ensures the element can't be deleted more than once.
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
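[The one-line difference that provides the hardening; "entry" is a generic placeholder here.]

	/* hlist_del_rcu() leaves the node's pprev poisoned, so a second
	 * deletion corrupts the list.  hlist_del_init_rcu() re-initializes
	 * the node, making a repeat call a harmless no-op. */
	hlist_del_init_rcu(&entry->list);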
2015-08-25  batman-adv: convert orig_node->vlan_list to hlist  (Marek Lindner)
Since the list's tail is never accessed, using a double-linked list head wastes memory.
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: Remove batadv_ types forward declarations  (Sven Eckelmann)
main.h is included in every file and is the only way to access types.h. This makes forward declarations for all types defined in types.h unnecessary.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: rename batadv_new_tt_req_node to batadv_tt_req_node_new  (Marek Lindner)
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: update kernel doc of batadv_tt_global_del_orig_entry()  (Marek Lindner)
The updated kernel doc & additional comment shall prevent accidental copy & paste errors or calling the function without the required precautions.
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: Remove multiple assignment per line  (Sven Eckelmann)
The Linux CodingStyle disallows multiple assignments in a single line. (see chapter 1)
Reported-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: Fix kerneldoc over 80 column lines  (Sven Eckelmann)
Kerneldoc required single-line documentation in the past (before 2009). Therefore, checkpatch's 80-column line limit check was disabled for kerneldoc. But kerneldoc is no longer excluded from it, and checkpatch now enables the check again.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-25  batman-adv: Replace C99 int types with kernel type  (Sven Eckelmann)
(s|u)(8|16|32|64) are the preferred types in the kernel. The use of the standard C99 types u?int(8|16|32|64)_t is objected to by some people, and even checkpatch now warns about using them.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-24  net: Fix RCU splat in af_key  (David Ahern)
Hit the following splat testing a VRF change for ipsec:

[ 113.475692] ===============================
[ 113.476194] [ INFO: suspicious RCU usage. ]
[ 113.476667] 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED Not tainted
[ 113.477545] -------------------------------
[ 113.478013] /work/monster-14/dsa/kernel.git/include/linux/rcupdate.h:568 Illegal context switch in RCU read-side critical section!
[ 113.479288]
[ 113.479288] other info that might help us debug this:
[ 113.479288]
[ 113.480207]
[ 113.480207] rcu_scheduler_active = 1, debug_locks = 1
[ 113.480931] 2 locks held by setkey/6829:
[ 113.481371] #0: (&net->xfrm.xfrm_cfg_mutex){+.+.+.}, at: [<ffffffff814e9887>] pfkey_sendmsg+0xfb/0x213
[ 113.482509] #1: (rcu_read_lock){......}, at: [<ffffffff814e767f>] rcu_read_lock+0x0/0x6e
[ 113.483509]
[ 113.483509] stack backtrace:
[ 113.484041] CPU: 0 PID: 6829 Comm: setkey Not tainted 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED
[ 113.485422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 113.486845] 0000000000000001 ffff88001d4c7a98 ffffffff81518af2 ffffffff81086962
[ 113.487732] ffff88001d538480 ffff88001d4c7ac8 ffffffff8107ae75 ffffffff8180a154
[ 113.488628] 0000000000000b30 0000000000000000 00000000000000d0 ffff88001d4c7ad8
[ 113.489525] Call Trace:
[ 113.489813] [<ffffffff81518af2>] dump_stack+0x4c/0x65
[ 113.490389] [<ffffffff81086962>] ? console_unlock+0x3d6/0x405
[ 113.491039] [<ffffffff8107ae75>] lockdep_rcu_suspicious+0xfa/0x103
[ 113.491735] [<ffffffff81064032>] rcu_preempt_sleep_check+0x45/0x47
[ 113.492442] [<ffffffff8106404d>] ___might_sleep+0x19/0x1c8
[ 113.493077] [<ffffffff81064268>] __might_sleep+0x6c/0x82
[ 113.493681] [<ffffffff81133190>] cache_alloc_debugcheck_before.isra.50+0x1d/0x24
[ 113.494508] [<ffffffff81134876>] kmem_cache_alloc+0x31/0x18f
[ 113.495149] [<ffffffff814012b5>] skb_clone+0x64/0x80
[ 113.495712] [<ffffffff814e6f71>] pfkey_broadcast_one+0x3d/0xff
[ 113.496380] [<ffffffff814e7b84>] pfkey_broadcast+0xb5/0x11e
[ 113.497024] [<ffffffff814e82d1>] pfkey_register+0x191/0x1b1
[ 113.497653] [<ffffffff814e9770>] pfkey_process+0x162/0x17e
[ 113.498274] [<ffffffff814e9895>] pfkey_sendmsg+0x109/0x213

In pfkey_sendmsg the net mutex is taken and then pfkey_broadcast takes the RCU lock. Since pfkey_broadcast takes the RCU lock, the allocation argument is pointless because GFP_ATOMIC must be used between rcu_read_{,un}lock. The one call outside of RCU can be done with GFP_KERNEL.
Fixes: 7f6b9dbd5afbd ("af_key: locking change")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
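[The resulting structure, sketched; signatures are abbreviated and variable names illustrative — see net/key/af_key.c for the real fix.]

	rcu_read_lock();
	sk_for_each_rcu(sk, &net_pfkey->table)
		/* inside an RCU read-side section: must not sleep */
		pfkey_broadcast_one(skb, &skb2, GFP_ATOMIC, sk);
	rcu_read_unlock();

	if (one_sk)
		/* outside RCU: a sleeping allocation is fine */
		err = pfkey_broadcast_one(skb, &skb2, GFP_KERNEL, one_sk);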
2015-08-24  ila: Precompute checksum difference for translations  (Tom Herbert)
In the ILA build state for LWT, compute the checksum difference to apply to transport checksums that include the IPv6 pseudo header. The difference is between the route destination (from fib6_config) and the locator to write.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
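[A sketch of the precomputation using the standard <net/checksum.h> helpers; the cached __wsum is then folded into each packet's transport checksum instead of recomputing per packet.]

	#include <net/checksum.h>

	/* One-time delta between the original 64-bit suffix and the new
	 * locator: sum the new words plus the complement of the old ones. */
	static __wsum example_locator_csum_diff(const __be32 *from,
						const __be32 *to)
	{
		__be32 diff[] = { ~from[0], ~from[1], to[0], to[1] };

		return csum_partial(diff, sizeof(diff), 0);
	}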
2015-08-24  lwt: Add cfg argument to build_state  (Tom Herbert)
Add cfg and family arguments to the lwt build state functions. cfg is a void pointer and will be a pointer to either a fib_config or fib6_config structure. The family parameter indicates which one (either AF_INET or AF_INET6). An LWT encapsulation implementation may use the fib configuration to build the LWT state.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
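[A hedged sketch of how a build_state implementation can consume the new arguments; the function name is invented and the signature follows the commit's description.]

	static int example_build_state(struct net_device *dev, struct nlattr *nla,
				       unsigned int family, const void *cfg,
				       struct lwtunnel_state **ts)
	{
		if (family == AF_INET6) {
			const struct fib6_config *cfg6 = cfg;
			/* e.g. use cfg6->fc_dst as the original destination */
			(void)cfg6;
		} else if (family == AF_INET) {
			const struct fib_config *cfg4 = cfg;
			/* e.g. use cfg4->fc_dst */
			(void)cfg4;
		}
		/* ... allocate and fill *ts ... */
		return 0;
	}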
2015-08-23  Merge tag 'nfc-next-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next  (David S. Miller)
Samuel Ortiz says:
====================
NFC 4.3 pull request
This is the NFC pull request for 4.3. With this one we have:
- A new driver for Samsung's S3FWRN5 NFC chipset. In order to properly support this driver, a few NCI core routines needed to be exported. Future drivers like Intel's Fields Peak will benefit from this.
- SPI support as a physical transport for STM st21nfcb.
- An additional netlink API for sending replies back to userspace from vendor commands.
- 2 small fixes for TI's trf7970a
- A few st-nci fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23  tipc: fix stale link problem during synchronization  (Jon Paul Maloy)
Recent changes to the link synchronization mean that we can now just drop packets arriving on the synchronizing link before the synch point is reached. This has led to significant simplifications of the implementation, but it also turns out to have a flip side that we need to consider. Under unlucky circumstances, the two endpoints may end up repeatedly dropping each other's packets, while immediately asking for retransmission of the same packets, just to drop them once more. This pattern will eventually be broken when the synch point is reached on the other link, but before that, the endpoints may have arrived at the retransmission limit (stale counter) that indicates that the link should be broken. We see this happen on rare occasions. The fix for this is to not ask for retransmissions when a link is in state LINK_SYNCHING. The fact that the link has reached this state means that it has already received the first SYNCH packet, and that it knows the synch point. Hence, it doesn't need any more packets until the other link has reached the synch point, whereafter it can go ahead and ask for the missing packets. However, because of the reduced traffic on the synching link that follows this change, it may now take longer to discover that the synch point has been reached. We compensate for this by letting all packets, on any of the links, trigger a check for synchronization termination. This is possible because the packets themselves don't contain any information that is needed for discovering this condition.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23  tipc: interrupt link synchronization when a link goes down  (Jon Paul Maloy)
When we introduced the new link failover/synch mechanism in commit 6e498158a827fd515b514842e9a06bdf0f75ab86 ("tipc: move link synch and failover to link aggregation level"), we missed the case when the non-tunnel link goes down during the link synchronization period. In this case the tunnel link will remain in state LINK_SYNCHING, leading to unpredictable behavior when the failover procedure is initiated. In this commit, we ensure that the node and the remaining link go back to regular communication state (SELF_UP_PEER_UP/LINK_ESTABLISHED) when one of the parallel links goes down. We also ensure that we don't re-enter synch mode if subsequent SYNCH packets arrive on the remaining link.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23  tipc: eliminate risk of premature link setup during failover  (Jon Paul Maloy)
When a link goes down, and there is still a working link towards its destination node, a failover is initiated, and the failed link is not allowed to re-establish until that procedure is finished. To ensure this, the concerned link endpoints are set to state LINK_FAILINGOVER, and the node endpoints to NODE_FAILINGOVER during the failover period. However, if the link reset is due to a disabled bearer, the corresponding link endpoint is deleted, and only the node endpoint knows about the ongoing failover. Now, if the disabled bearer is re-enabled during the failover period, the discovery mechanism may create a new link endpoint that is ready to be established, even though this is not permitted. This situation may cause both the ongoing failover and any subsequent link synchronization to fail. In this commit, we ensure that a newly created link goes directly to state LINK_FAILINGOVER if the corresponding node state is NODE_FAILINGOVER. This eliminates the problem described above. Furthermore, we tighten the criteria for which packets are allowed to end a failover state in the function tipc_node_check_state(). By checking that the receiving link is up and running, instead of just checking that it is not in failover mode, we eliminate the risk that protocol packets from the re-created link may cause the failover to be prematurely terminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23  netlink: mmap: fix tx type check  (Ken-ichirou MATSUZAWA)
I can't send a netlink message via an mmaped netlink socket since commit a8866ff6a5bce7d0ec465a63bc482a85c09b0d39 ("netlink: make the check for "send from tx_ring" deterministic"). msg->msg_iter.type is set to WRITE (1) in the

	SYSCALL_DEFINE6(sendto, ...
	  import_single_range(WRITE, ...
	    iov_iter_init(1, WRITE, ...

call path, so we need to check the type via iter_is_iovec() to accept the WRITE.
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
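[The corrected test, sketched; the condition shape follows the commit's description, and the surrounding names (netlink_mmap_sendmsg and friends) are taken from af_netlink.c of that era, so treat the details as illustrative.]

	/* Accept an iovec-backed iterator regardless of the direction bit
	 * encoded in msg_iter.type, instead of an exact-type comparison
	 * that the WRITE bit used to defeat. */
	if (iter_is_iovec(&msg->msg_iter) && msg->msg_iter.nr_segs == 1)
		/* message may come from the mmaped tx_ring */
		err = netlink_mmap_sendmsg(sk, msg, dst_portid,
					   dst_group, &scm);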
2015-08-23  fou: Do WARN_ON_ONCE in gue_gro_receive for bad proto callbacks  (Tom Herbert)
Do WARN_ON_ONCE instead of WARN_ON in gue_gro_receive when the offload callbacks are bad (either they don't exist or gro_receive is not specified).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
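[The difference in practice, as a fragment.]

	/* WARN_ON() would print a backtrace for every bad packet on a hot
	 * GRO path; WARN_ON_ONCE() reports the first occurrence and stays
	 * quiet afterwards, while still failing the condition check. */
	if (WARN_ON_ONCE(!ops || !ops->callbacks.gro_receive))
		goto out_unlock;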
2015-08-23  gro: Fix remcsum offload to deal with frags in GRO  (Tom Herbert)
The remote checksum offload GRO did not consider the case that frag0 might be in use. This patch fixes that by accessing headers using the skb_gro functions and not saving offsets relative to skb->head.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-22  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)
Pull 9p regression fix from Al Viro:
"Fix for breakage introduced when switching p9_client_{read,write}() to struct iov_iter * (went into 4.1)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  9p: ensure err is initialized to 0 in p9_client_read/write
2015-08-22  9p: ensure err is initialized to 0 in p9_client_read/write  (Vincent Bernat)
Some users of those functions were providing uninitialized values to them. Notably, when reading 0 bytes from an empty file on a 9P filesystem, the return code of read() was not 0.
Tested with this simple program:

#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
	assert(argc == 2);
	char buffer[256];
	int fd = open(argv[1], O_RDONLY|O_NOCTTY);
	assert(fd >= 0);
	assert(read(fd, buffer, 0) == 0);
	return 0;
}

Cc: stable@vger.kernel.org # v4.1
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-08-21  mm: make page pfmemalloc check more robust  (Michal Hocko)
Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added checks for page->pfmemalloc to __skb_fill_page_desc():

	if (page->pfmemalloc && !page->mapping)
		skb->pfmemalloc = true;

It assumes page->mapping == NULL implies that page->pfmemalloc can be trusted. However, __delete_from_page_cache() can set page->mapping to NULL and leave the page->index value alone. Due to being in a union, a non-zero page->index will be interpreted as true page->pfmemalloc. So the assumption is invalid if the networking code can see such a page. And it seems it can. We have encountered this with an NFS over loopback setup when such a page is attached to a new skbuf. There is no copying going on in this case, so the page confuses __skb_fill_page_desc, which interprets the index as the pfmemalloc flag, and the network stack drops packets that have been allocated using the reserves unless they are to be queued on sockets handling the swapping, which is the case here. That leads to hangs when the nfs client waits for a response from the server which has been dropped and thus never arrives. The struct page is already heavily packed, so rather than finding another hole to put it in, let's do a trick instead. We can reuse the index again but define it to an impossible value (-1UL). This is the page index, so it should never see a value that large. Replace all direct users of page->pfmemalloc with page_is_pfmemalloc(), which will hide this nastiness from unspoiled eyes. The information will get lost if somebody wants to use page->index, obviously, but that was the case before, and the original code expected that the information should be persisted somewhere else if that is really needed (e.g. what SLAB and SLUB do).
[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.com>
Debugged-by: Jiri Bohac <jbohac@suse.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
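[The accessor this change introduces has roughly the following shape, matching the description above; consult include/linux/mm.h of that kernel for the authoritative version.]

	/* pfmemalloc is encoded as an impossible page->index value (-1UL)
	 * instead of a field of its own, since index and pfmemalloc share
	 * a union in struct page. */
	static inline bool page_is_pfmemalloc(const struct page *page)
	{
		return page->index == -1UL;
	}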
2015-08-21  netfilter: xt_TEE: use IS_ENABLED(CONFIG_NF_DUP_IPV6)  (Pablo Neira Ayuso)
Instead of IS_ENABLED(CONFIG_IPV6), otherwise we hit:

net/built-in.o: In function `tee_tg6':
>> xt_TEE.c:(.text+0x6cd8c): undefined reference to `nf_dup_ipv6'

when:

CONFIG_IPV6=y
CONFIG_NF_DUP_IPV4=y
# CONFIG_NF_DUP_IPV6 is not set
CONFIG_NETFILTER_XT_TARGET_TEE=y

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
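[The point in one fragment: the guard has to track the symbol that actually provides nf_dup_ipv6(), not general IPv6 support. The argument list below is abbreviated and illustrative.]

	#if IS_ENABLED(CONFIG_NF_DUP_IPV6)
		/* CONFIG_NF_DUP_IPV6 is built-in or modular here, so the
		 * reference to nf_dup_ipv6 is guaranteed to link. */
		nf_dup_ipv6(skb, hooknum, &info->gw.in6, oif);
	#endif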
2015-08-21  netfilter: nf_dup: fix sparse warnings  (Pablo Neira Ayuso)
>> net/ipv4/netfilter/nft_dup_ipv4.c:29:37: sparse: incorrect type in initializer (different base types)
   net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    expected restricted __be32 [user type] s_addr
   net/ipv4/netfilter/nft_dup_ipv4.c:29:37:    got unsigned int [unsigned] <noident>
>> net/ipv6/netfilter/nf_dup_ipv6.c:48:23: sparse: incorrect type in assignment (different base types)
   net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    expected restricted __be32 [addressable] [assigned] [usertype] flowlabel
   net/ipv6/netfilter/nf_dup_ipv6.c:48:23:    got int
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
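[The usual shape of such a fix: route every value stored into a __be32 field through the byte-order helpers so sparse can prove the endianness. Values here are illustrative.]

	#include <linux/in.h>

	struct in_addr gw = {
		.s_addr = htonl(0x0a000001),	/* 10.0.0.1 as __be32, not a raw int */
	};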
2015-08-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Conflicts:
	drivers/net/usb/qmi_wwan.c
Overlapping additions of new device IDs to qmi_wwan.c
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-21  ipvs: add more mcast parameters for the sync daemon  (Julian Anastasov)
- mcast_group: configure the multicast address, now IPv6 is supported too
- mcast_port: configure the multicast port
- mcast_ttl: configure the multicast TTL/HOP_LIMIT
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-21  ipvs: add sync_maxlen parameter for the sync daemon  (Julian Anastasov)
Allow setups with large MTU to send large sync packets by adding a sync_maxlen parameter. The default value is now based on the MTU but no more than 1500, for compatibility reasons. To avoid problems if the MTU changes, allow fragmentation by sending packets with DF=0. Problem reported by Dan Carpenter.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-21  ipvs: call rtnl_lock early  (Julian Anastasov)
When the sync daemon is started, we need to hold the rtnl lock while calling ip_mc_join_group. Currently, we have a wrong locking order because the correct one is rtnl_lock->__ip_vs_mutex. This is implied by the usage of __ip_vs_mutex in ip_vs_dst_event(), which is called under the rtnl lock during NETDEV_* notifications. Fix the problem by calling rtnl_lock early, only for the start_sync_thread call. As a bonus this fixes the usage of __dev_get_by_name, which was not called under the rtnl lock. This patch actually extends and depends on commit 54ff9ef36bdf ("ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}").
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-21  ipvs: Add ovf scheduler  (Raducu Deaconu)
The weighted overflow scheduling algorithm directs network connections to the server with the highest weight that is currently available and overflows to the next when active connections exceed the node's weight.
Signed-off-by: Raducu Deaconu <rhadoo.io88@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-20  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next  (David S. Miller)
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This is the second pull request; it includes the conflict resolution patch that resulted from the updates that we got for the conntrack template through kmalloc. No changes with regards to the previously sent 15 patches. The following patchset contains Netfilter updates for your net-next tree, they are:
1) Rework the existing nf_tables counter expression to make it per-cpu.
2) Prepare and factor out common packet duplication code from the TEE target so it can be reused from the new dup expression.
3) Add the new dup expression for the nf_tables IPv4 and IPv6 families.
4) Convert the nf_tables limit expression to use a token-based approach with 64-bit precision.
5) Enhance the nf_tables limit expression to support limiting at the packet byte level. This comes after several preparation patches.
6) Add a burst parameter to indicate the amount of packets or bytes that can exceed the limit.
7) Add netns support to nfacct, from Andreas Schultz.
8) Pass the nf_conn_zone structure instead of the zone ID in nf_tables to allow accessing more zone-specific information, from Daniel Borkmann.
9) Allow defining the zone per direction to support netns containers with overlapping network addressing, also from Daniel.
10) Extend the CT target to allow setting the zone based on the skb->mark as a way to support simple mappings from iptables, also from Daniel.
11) Make the nf_tables payload expression aware of the fact that VLAN offload may have removed a vlan header, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>