summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-12-08slip: sl_alloc(): remove unused parameter "dev_t line"Marc Kleine-Budde
The first and only parameter of sl_alloc() is unused, so remove it. Fixes: 5342b77c4123 slip: ("Clean up create and destroy") Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: mvpp2: fix the RSS table entry offsetAntoine Tenart
The macro used to access or set an RSS table entry was using an offset of 8, while it should use an offset of 0. This lead to wrongly configure the RSS table, not accessing the right entries. Fixes: 1d7d15d79fb4 ("net: mvpp2: initialize the RSS tables") Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08Merge branch 'cxgb4-collect-hardware-logs-via-ethtool'David S. Miller
Rahul Lakkireddy says: ==================== cxgb4: collect hardware logs via ethtool Collect more hardware logs via ethtool --get-dump facility. Patch 1 collects on-chip memory layout information. Patch 2 collects on-chip MC memory dumps. Patch 3 collects HMA memory dump. Patch 4 evaluates and skips TX and RX payload regions in memory dumps. Patch 5 collects egress and ingress SGE queue contexts. Patch 6 collects PCIe configuration logs ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08cxgb4: collect PCIe configuration logsRahul Lakkireddy
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08cxgb4: collect egress and ingress SGE queue contextsRahul Lakkireddy
Use meminfo to identify the egress and ingress context regions and fetch all valid contexts from those regions. Also flush all contexts before attempting collection to prevent stale information. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08cxgb4: skip TX and RX payload regions in memory dumpsRahul Lakkireddy
Use meminfo to identify TX and RX payload regions and skip them in collection of EDC, MC, and HMA. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08cxgb4: collect HMA memory dumpRahul Lakkireddy
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08cxgb4: collect MC memory dumpRahul Lakkireddy
Use meminfo to get base address and size of MC memory. Also use same meminfo for EDC memory dumps. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08cxgb4: collect on-chip memory informationRahul Lakkireddy
Collect memory layout of various on-chip memory regions. Move code for collecting on-chip memory information to cudbg_lib.c and update cxgb4_debugfs.c to use the common function. Also include cudbg_entity.h before cudbg_lib.h to avoid adding cudbg entity structure forward declarations in cudbg_lib.h. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08Merge branch 'veth-and-GSO-maximums'David S. Miller
Stephen Hemminger says: ==================== veth and GSO maximums This is the more general way to solving the issue of GSO limits not being set correctly for containers on Azure. If a GSO packet is sent to host that exceeds the limit (reported by NDIS), then the host is forced to do segmentation in software which has noticeable performance impact. The core rtnetlink infrastructure already has the messages and infrastructure to allow changing gso limits. With an updated iproute2 the following already works: # ip li set dev dummy0 gso_max_size 30000 These patches are about making it easier with veth. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08veth: set peer GSO valuesStephen Hemminger
When new veth is created, and GSO values have been configured on one device, clone those values to the peer. For example: # ip link add dev vm1 gso_max_size 65530 type veth peer name vm2 This should create vm1 <--> vm2 with both having GSO maximum size set to 65530. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08rtnetlink: allow GSO maximums to be set on device creationStephen Hemminger
Netlink device already allows changing GSO sizes with ip set command. The part that is missing is allowing overriding GSO settings on device creation. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: stmmac: fix broken dma_interrupt handling for multi-queuesNiklas Cassel
There is nothing that says that number of TX queues == number of RX queues. E.g. the ARTPEC-6 SoC has 2 TX queues and 1 RX queue. This code is obviously wrong: for (chan = 0; chan < tx_channel_count; chan++) { struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan]; priv->rx_queue has size MTL_MAX_RX_QUEUES, so this will send an uninitialized napi_struct to __napi_schedule(), causing us to crash in net_rx_action(), because napi_struct->poll is zero. [12846.759880] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [12846.768014] pgd = (ptrval) [12846.770742] [00000000] *pgd=39ec7831, *pte=00000000, *ppte=00000000 [12846.777023] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM [12846.782942] Modules linked in: [12846.785998] CPU: 0 PID: 161 Comm: dropbear Not tainted 4.15.0-rc2-00285-gf5fb5f2f39a7 #36 [12846.794177] Hardware name: Axis ARTPEC-6 Platform [12846.798879] task: (ptrval) task.stack: (ptrval) [12846.803407] PC is at 0x0 [12846.805942] LR is at net_rx_action+0x274/0x43c [12846.810383] pc : [<00000000>] lr : [<80bff064>] psr: 200e0113 [12846.816648] sp : b90d9ae8 ip : b90d9ae8 fp : b90d9b44 [12846.821871] r10: 00000008 r9 : 0013250e r8 : 00000100 [12846.827094] r7 : 0000012c r6 : 00000000 r5 : 00000001 r4 : bac84900 [12846.833619] r3 : 00000000 r2 : b90d9b08 r1 : 00000000 r0 : bac84900 Since each DMA channel can be used for rx and tx simultaneously, the current code should probably be rewritten so that napi_struct is embedded in a new struct stmmac_channel. That way, stmmac_poll() can call stmmac_tx_clean() on just the tx queue where we got the IRQ, instead of looping through all tx queues. This is also how the xgbe driver does it (another driver for this IP). Fixes: c22a3f48ef99 ("net: stmmac: adding multiple napi mechanism") Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08Merge branch 'tcp-RACK-loss-recovery-bug-fixes'David S. Miller
Yuchung Cheng says: ==================== tcp: RACK loss recovery bug fixes This patch set has four minor bug fixes in TCP RACK loss recovery. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp: evaluate packet losses upon RTT changeYuchung Cheng
RACK skips an ACK unless it advances the most recently delivered TX timestamp (rack.mstamp). Since RACK also uses the most recent RTT to decide if a packet is lost, RACK should still run the loss detection whenever the most recent RTT changes. For example, an ACK that does not advance the timestamp but triggers the cwnd undo due to reordering, would then use the most recent (higher) RTT measurement to detect further losses. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp: fix off-by-one bug in RACKYuchung Cheng
RACK should mark a packet lost when remaining wait time is zero. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp: always evaluate losses in RACK upon undoYuchung Cheng
When sender detects spurious retransmission, all packets marked lost are remarked to be in-flight. However some may be considered lost based on its timestamps in RACK. This patch forces RACK to re-evaluate, which may be skipped previously if the ACK does not advance RACK timestamp. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp: correctly test congestion state in RACKYuchung Cheng
RACK does not test the loss recovery state correctly to compute the reordering window. It assumes if lost_out is zero then TCP is not in loss recovery. But it can be zero during recovery before calling tcp_rack_detect_loss(): when an ACK acknowledges all packets marked lost before receiving this ACK, but has not yet to discover new ones by tcp_rack_detect_loss(). The fix is to simply test the congestion state directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: dsa: lan9303: Protect ALR operations with mutexEgil Hjelmeland
ALR table operations are a sequence of related register operations which should be protected from concurrent access. The alr_cache should also be protected. Add alr_mutex doing that. Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: fix use-after-free in tcf_block_put_extJiri Pirko
Since the block is freed with last chain being put, once we reach the end of iteration of list_for_each_entry_safe, the block may be already freed. I'm hitting this only by creating and deleting clsact: [ 202.171952] ================================================================== [ 202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390 [ 202.187590] Read of size 8 at addr ffff880225539a80 by task tc/796 [ 202.194508] [ 202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5 [ 202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016 [ 202.213613] Call Trace: [ 202.216369] dump_stack+0xda/0x169 [ 202.220192] ? dma_virt_map_sg+0x147/0x147 [ 202.224790] ? show_regs_print_info+0x54/0x54 [ 202.229691] ? tcf_chain_destroy+0x1dc/0x250 [ 202.234494] print_address_description+0x83/0x3d0 [ 202.239781] ? tcf_block_put_ext+0x240/0x390 [ 202.244575] kasan_report+0x1ba/0x460 [ 202.248707] ? tcf_block_put_ext+0x240/0x390 [ 202.253518] tcf_block_put_ext+0x240/0x390 [ 202.258117] ? tcf_chain_flush+0x290/0x290 [ 202.262708] ? qdisc_hash_del+0x82/0x1a0 [ 202.267111] ? qdisc_hash_add+0x50/0x50 [ 202.271411] ? __lock_is_held+0x5f/0x1a0 [ 202.275843] clsact_destroy+0x3d/0x80 [sch_ingress] [ 202.281323] qdisc_destroy+0xcb/0x240 [ 202.285445] qdisc_graft+0x216/0x7b0 [ 202.289497] tc_get_qdisc+0x260/0x560 Fix this by holding the block also by chain 0 and put chain 0 explicitly, out of the list_for_each_entry_safe loop at the very end of tcf_block_put_ext. Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08bnxt_en: Fix sources of spurious netpoll warningsCalvin Owens
After applying 2270bc5da3497945 ("bnxt_en: Fix netpoll handling") and 903649e718f80da2 ("bnxt_en: Improve -ENOMEM logic in NAPI poll loop."), we still see the following WARN fire: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1875170 at net/core/netpoll.c:165 netpoll_poll_dev+0x15a/0x160 bnxt_poll+0x0/0xd0 exceeded budget in poll <snip> Call Trace: [<ffffffff814be5cd>] dump_stack+0x4d/0x70 [<ffffffff8107e013>] __warn+0xd3/0xf0 [<ffffffff8107e07f>] warn_slowpath_fmt+0x4f/0x60 [<ffffffff8179519a>] netpoll_poll_dev+0x15a/0x160 [<ffffffff81795f38>] netpoll_send_skb_on_dev+0x168/0x250 [<ffffffff817962fc>] netpoll_send_udp+0x2dc/0x440 [<ffffffff815fa9be>] write_ext_msg+0x20e/0x250 [<ffffffff810c8125>] call_console_drivers.constprop.23+0xa5/0x110 [<ffffffff810c9549>] console_unlock+0x339/0x5b0 [<ffffffff810c9a88>] vprintk_emit+0x2c8/0x450 [<ffffffff810c9d5f>] vprintk_default+0x1f/0x30 [<ffffffff81173df5>] printk+0x48/0x50 [<ffffffffa0197713>] edac_raw_mc_handle_error+0x563/0x5c0 [edac_core] [<ffffffffa0197b9b>] edac_mc_handle_error+0x42b/0x6e0 [edac_core] [<ffffffffa01c3a60>] sbridge_mce_output_error+0x410/0x10d0 [sb_edac] [<ffffffffa01c47cc>] sbridge_check_error+0xac/0x130 [sb_edac] [<ffffffffa0197f3c>] edac_mc_workq_function+0x3c/0x90 [edac_core] [<ffffffff81095f8b>] process_one_work+0x19b/0x480 [<ffffffff810967ca>] worker_thread+0x6a/0x520 [<ffffffff8109c7c4>] kthread+0xe4/0x100 [<ffffffff81884c52>] ret_from_fork+0x22/0x40 This happens because we increment rx_pkts on -ENOMEM and -EIO, resulting in rx_pkts > 0. Fix this by only bumping rx_pkts if we were actually given a non-zero budget. Signed-off-by: Calvin Owens <calvinowens@fb.com> Acked-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08Merge branch 'lockless-qdisc-series'David S. Miller
John Fastabend says: ==================== lockless qdisc series This series adds support for building lockless qdiscs. This is the result of noticing the qdisc lock is a common hot-spot in perf analysis of the Linux network stack, especially when testing with high packet per second rates. However, nothing is free and most qdiscs rely on the qdisc lock for their data structures so each qdisc must be converted on a case by case basis. In this series, to kick things off, we make pfifo_fast, mq, and mqprio lockless. Follow up series can address additional qdiscs as needed. For example sch_tbf might be useful. To allow this the lockless design is an opt-in flag. In some future utopia we convert all qdiscs and we get to drop this case analysis, but in order to make progress we live in the real-world. There are also a handful of optimizations I have behind this series and a few code cleanups that I couldn't figure out how to fit neatly into this series with out increasing the patch count. Once this is in additional patches can address this. The most notable is in skb_dequeue we can push the consumer lock out a bit and consume multiple skbs off the skb_array in pfifo fast per iteration. Ideally we could push arrays of packets at drivers as well but we would need the infrastructure for this. The other notable improvement is to do less locking in the overrun cases where bad tx queue list and gso_skb are being hit. Although, nice in theory in practice this is the error case and I haven't found a benchmark where this matters yet. For testing... My first test case uses multiple containers (via cilium) where multiple client containers use 'wrk' to benchmark connections with a server container running lighttpd. Where lighttpd is configured to use multiple threads, one per core. Additionally this test has a proxy agent running so all traffic takes an extra hop through a proxy container. In these cases each TCP packet traverses the egress qdisc layer at least four times and the ingress qdisc layer an additional four times. This makes for a good stress test IMO, perf details below. The other micro-benchmark I run is injecting packets directly into qdisc layer using pktgen. This uses the benchmark script, ./pktgen_bench_xmit_mode_queue_xmit.sh Benchmarks taken in two cases, "base" running latest net-next no changes to qdisc layer and "qdisc" tests run with qdisc lockless updates. Numbers reported in req/sec. All virtual 'veth' devices run with pfifo_fast in the qdisc test case. `wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"` conns 16 32 64 1024 ----------------------------------------------- base: 18831 20201 21393 29151 qdisc: 19309 21063 23899 29265 notice in all cases we see performance improvement when running with qdisc case. Microbenchmarks using pktgen are as follows, `pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 20000000 base(mq): 2.1Mpps base(pfifo_fast): 2.1Mpps qdisc(mq): 2.6Mpps qdisc(pfifo_fast): 2.6Mpps notice numbers are the same for mq and pfifo_fast because only testing a single thread here. In both tests we see a nice bump in performance gain. The key with 'mq' is it is already per txq ring so contention is minimal in the above cases. Qdiscs such as tbf or htb which have more contention will likely show larger gains when/if lockless versions are implemented. Thanks to everyone who helped with this work especially Daniel Borkmann, Eric Dumazet and Willem de Bruijn for discussing the design and reviewing versions of the code. Changes from the RFC: dropped a couple patches off the end, fixed a bug with skb_queue_walk_safe not unlinking skb in all cases, fixed a lockdep splat with pfifo_fast_destroy not calling *_bh lock variant, addressed _most_ of Willem's comments, there was a bug in the bulk locking (final patch) of the RFC series. @Willem, I left out lockdep annotation for a follow on series to add lockdep more completely, rather than just in code I touched. Comments and feedback welcome. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: pfifo_fast use skb_arrayJohn Fastabend
This converts the pfifo_fast qdisc to use the skb_array data structure and set the lockless qdisc bit. pfifo_fast is the first qdisc to support the lockless bit that can be a child of a qdisc requiring locking. So we add logic to clear the lock bit on initialization in these cases when the qdisc graft operation occurs. This also removes the logic used to pick the next band to dequeue from and instead just checks a per priority array for packets from top priority to lowest. This might need to be a bit more clever but seems to work for now. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: skb_array: expose peek APIJohn Fastabend
This adds a peek routine to skb_array.h for use with qdisc. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprioJohn Fastabend
The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then called independently for enqueue and dequeue operations. However statistics are aggregated and pushed up to the "master" qdisc. This patch adds support for any of the sub-qdiscs to be per cpu statistic qdiscs. To handle this case add a check when calculating stats and aggregate the per cpu stats if needed. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqJohn Fastabend
The sch_mq qdisc creates a sub-qdisc per tx queue which are then called independently for enqueue and dequeue operations. However statistics are aggregated and pushed up to the "master" qdisc. This patch adds support for any of the sub-qdiscs to be per cpu statistic qdiscs. To handle this case add a check when calculating stats and aggregate the per cpu stats if needed. Also exports __gnet_stats_copy_queue() to use as a helper function. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: helpers to sum qlen and qlen for per cpu logicJohn Fastabend
Add qdisc qlen helper routines for lockless qdiscs to use. The qdisc qlen is no longer used in the hotpath but it is reported via stats query on the qdisc so it still needs to be tracked. This adds the per cpu operations needed along with a helper to return the summation of per cpu stats. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: check for frozen queue before skb_bad_txq checkJohn Fastabend
I can not think of any reason to pull the bad txq skb off the qdisc if the txq we plan to send this on is still frozen. So check for frozen queue first and abort before dequeuing either skb_bad_txq skb or normal qdisc dequeue() skb. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: use skb list for skb_bad_txJohn Fastabend
Similar to how gso is handled use skb list for skb_bad_tx this is required with lockless qdiscs because we may have multiple cores attempting to push skbs into skb_bad_tx concurrently Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: drop qdisc_reset from dev_graft_qdiscJohn Fastabend
In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy' operation is called on the old qdisc. The destroy operation will wait a rcu grace period and call qdisc_rcu_free(). At which point gso_cpu_skb is free'd along with all stats so no need to zero stats and gso_cpu_skb from the graft operation itself. Further after dropping the qdisc locks we can not continue to call qdisc_reset before waiting an rcu grace period so that the qdisc is detached from all cpus. By removing the qdisc_reset() here we get the correct property of waiting an rcu grace period and letting the qdisc_destroy operation clean up the qdisc correctly. Note, a refcnt greater than 1 would cause the destroy operation to be aborted however if this ever happened the reference to the qdisc would be lost and we would have a memory leak. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: explicit locking in gso_cpu fallbackJohn Fastabend
This work is preparing the qdisc layer to support egress lockless qdiscs. If we are running the egress qdisc lockless in the case we overrun the netdev, for whatever reason, the netdev returns a busy error code and the skb is parked on the gso_skb pointer. With many cores all hitting this case at once its possible to have multiple sk_buffs here so we turn gso_skb into a queue. This should be the edge case and if we see this frequently then the netdev/qdisc layer needs to back off. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: a dflt qdisc may be used with per cpu statsJohn Fastabend
Enable dflt qdisc support for per cpu stats before this patch a dflt qdisc was required to use the global statistics qstats and bstats. This adds a static flags field to qdisc_ops that is propagated into qdisc->flags in qdisc allocate call. This allows the allocation block to completely allocate the qdisc object so we don't have dangling allocations after qdisc init. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: provide per cpu qstat helpersJohn Fastabend
The per cpu qstats support was added with per cpu bstat support which is currently used by the ingress qdisc. This patch adds a set of helpers needed to make other qdiscs that use qstats per cpu as well. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: remove remaining uses for qdisc_qlen in xmit pathJohn Fastabend
sch_direct_xmit() uses qdisc_qlen as a return value but all call sites of the routine only check if it is zero or not. Simplify the logic so that we don't need to return an actual queue length value. This introduces a case now where sch_direct_xmit would have returned a qlen of zero but now it returns true. However in this case all call sites of sch_direct_xmit will implement a dequeue() and get a null skb and abort. This trades tracking qlen in the hotpath for an extra dequeue operation. Overall this seems to be good for performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: allow qdiscs to handle lockingJohn Fastabend
This patch adds a flag for queueing disciplines to indicate the stack does not need to use the qdisc lock to protect operations. This can be used to build lockless scheduling algorithms and improving performance. The flag is checked in the tx path and the qdisc lock is only taken if it is not set. For now use a conditional if statement. Later we could be more aggressive if it proves worthwhile and use a static key or wrap this in a likely(). Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason for this is synchronizing a qlen counter across threads proves to cost more than doing the enqueue/dequeue operations when tested with pktgen. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: cleanup qdisc_run and __qdisc_run semanticsJohn Fastabend
Currently __qdisc_run calls qdisc_run_end() but does not call qdisc_run_begin(). This makes it hard to track pairs of qdisc_run_{begin,end} across function calls. To simplify reading these code paths this patch moves begin/end calls into qdisc_run(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08Merge branch 'tcp-bbr-sampling-fixes'David S. Miller
Neal Cardwell says: ==================== TCP BBR sampling fixes for loss recovery undo This patch series has a few minor bug fixes for cases where spurious loss recoveries can trick BBR estimators into estimating that the available bandwidth is much lower than the true available bandwidth. In both cases the fix here is to just reset the estimator upon loss recovery undo. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp_bbr: reset long-term bandwidth sampling on loss recovery undoNeal Cardwell
Fix BBR so that upon notification of a loss recovery undo BBR resets long-term bandwidth sampling. Under high reordering, reordering events can be interpreted as loss. If the reordering and spurious loss estimates are high enough, this can cause BBR to spuriously estimate that we are seeing loss rates high enough to trigger long-term bandwidth estimation. To avoid that problem, this commit resets long-term bandwidth sampling on loss recovery undo events. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp_bbr: reset full pipe detection on loss recovery undoNeal Cardwell
Fix BBR so that upon notification of a loss recovery undo BBR resets the full pipe detection (STARTUP exit) state machine. Under high reordering, reordering events can be interpreted as loss. If the reordering and spurious loss estimates are high enough, this could previously cause BBR to spuriously estimate that the pipe is full. Since spurious loss recovery means that our overall sending will have slowed down spuriously, this commit gives a flow more time to probe robustly for bandwidth and decide the pipe is really full. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp_bbr: record "full bw reached" decision in new full_bw_reached bitNeal Cardwell
This commit records the "full bw reached" decision in a new full_bw_reached bit. This is a pure refactor that does not change the current behavior, but enables subsequent fixes and improvements. In particular, this enables simple and clean fixes because the full_bw and full_bw_cnt can be unconditionally zeroed without worrying about forgetting that we estimated we filled the pipe in Startup. And it enables future improvements because multiple code paths can be used for estimating that we filled the pipe in Startup; any new code paths only need to set this bit when they think the pipe is full. Note that this fix intentionally reduces the width of the full_bw_cnt counter, since we have never used the most significant bit. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08sfc: pass valid pointers from efx_enqueue_unwindBert Kenward
The bytes_compl and pkts_compl pointers passed to efx_dequeue_buffers cannot be NULL. Add a paranoid warning to check this condition and fix the one case where they were NULL. efx_enqueue_unwind() is called very rarely, during error handling. Without this fix it would fail with a NULL pointer dereference in efx_dequeue_buffer, with efx_enqueue_skb in the call stack. Fixes: e9117e5099ea ("sfc: Firmware-Assisted TSO version 2") Reported-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Bert Kenward <bkenward@solarflare.com> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08gianfar: Disable EEE autoneg by defaultClaudiu Manoil
This controller does not support EEE, but it may connect to a PHY which supports EEE and advertises EEE by default, while its link partner also advertises EEE. If this happens, the PHY enters low power mode when the traffic rate is low and causes packet loss. This patch disables EEE advertisement by default for any PHY that gianfar connects to, to prevent the above unwanted outcome. Signed-off-by: Shaohui Xie <Shaohui.Xie@nxp.com> Tested-by: Yangbo Lu <Yangbo.lu@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08virtio_net: Disable interrupts if napi_complete_done rescheduled napiToshiaki Makita
Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been able to be rescheduled within napi_complete_done() even in non-busypoll case, but virtnet_poll() always enabled interrupts before complete, and when napi was rescheduled within napi_complete_done() it did not disable interrupts. This caused more interrupts when event idx is disabled. According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ processing") we cannot place virtqueue_enable_cb_prepare() after NAPI_STATE_SCHED is cleared, so disable interrupts again if napi_complete_done() returned false. Tested with vhost-user of OVS 2.7 on host, which does not have the event idx feature. * Before patch: $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 212992 1472 60.00 32763206 0 6430.32 212992 60.00 23384299 4589.56 Interrupts on guest: 9872369 Packets/interrupt: 2.37 * After patch $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 212992 1472 60.00 32794646 0 6436.49 212992 60.00 32793501 6436.27 Interrupts on guest: 4941299 Packets/interrupt: 6.64 Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: - three more patches in regard to the SPDX license tags. The missing tags for the files in arch/s390/kvm will be merged via the KVM tree. With that all s390 related files should have their SPDX tags. - a patch to get rid of 'struct timespec' in the DASD driver. - bug fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: fix compat system call table s390/mm: fix off-by-one bug in 5-level page table handling s390: Remove redudant license text s390: add a few more SPDX identifiers s390/dasd: prevent prefix I/O error s390: always save and restore all registers on context switch s390/dasd: remove 'struct timespec' usage s390/qdio: restrict target-full handling to IQDIO s390/qdio: consider ERROR buffers for inbound-full condition s390/virtio: add BSD license to virtio-ccw
2017-12-08Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "Fix some more FP register fallout from the SVE patches and also some problems with the PGD tracking in our software PAN emulation code, after we received a crash report from a 3.18 kernel running a backport. Summary: - fix SW PAN pgd shadowing for kernel threads, EFI and exiting user tasks - fix FP register leak when a task_struct is re-allocated - fix potential use-after-free in FP state tracking used by KVM" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64/sve: Avoid dereference of dead task_struct in KVM guest entry arm64: SW PAN: Update saved ttbr0 value on enter_lazy_tlb arm64: SW PAN: Point saved ttbr0 at the zero page when switching to init_mm arm64: fpsimd: Abstract out binding of task's fpsimd context to the cpu. arm64: fpsimd: Prevent registers leaking from dead tasks
2017-12-08Merge tag 'acpi-4.15-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fix from Rafael Wysocki: "This fixes an out of bounds warning from KASAN in the ACPI CPPC driver" * tag 'acpi-4.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / CPPC: Fix KASAN global out of bounds warning
2017-12-08Merge tag 'pm-4.15-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "This fixes an issue in the device runtime PM framework that prevents customer devices from resuming if runtime PM is disabled for one or more of their supplier devices (as reflected by device links between those devices)" * tag 'pm-4.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / runtime: Fix handling of suppliers with disabled runtime PM
2017-12-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2017-12-07 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Detailed documentation of BPF development process from Daniel. 2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops from Lawrence. 3) Minor follow up for bpf_skb_set_tunnel_key() from William. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tuntap: fix possible deadlock when fail to register netdevJason Wang
Private destructor could be called when register_netdev() fail with rtnl lock held. This will lead deadlock in tun_free_netdev() who tries to hold rtnl_lock. Fixing this by switching to use spinlock to synchronize. Fixes: 96f84061620c ("tun: add eBPF based queue selection method") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08of: overlay: Make node skipping in init_overlay_changeset() clearerGeert Uytterhoeven
Make it more clear that nodes without "__overlay__" subnodes are skipped, by reverting the logic and using continue. This also reduces indentation level. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Rob Herring <robh@kernel.org>