AgeCommit message (Collapse)Author
2018-09-14xen/balloon: add runtime control for scrubbing ballooned out pagesMarek Marczykowski-Górecki
Scrubbing pages on initial balloon down can take some time, especially in nested virtualization case (nested EPT is slow). When HVM/PVH guest is started with memory= significantly lower than maxmem=, all the extra pages will be scrubbed before returning to Xen. But since most of them weren't used at all at that point, Xen needs to populate them first (from populate-on-demand pool). In nested virt case (Xen inside KVM) this slows down the guest boot by 15-30s with just 1.5GB needed to be returned to Xen. Add runtime parameter to enable/disable it, to allow initially disabling scrubbing, then enable it back during boot (for example in initramfs). Such usage relies on assumption that a) most pages ballooned out during initial boot weren't used at all, and b) even if they were, very few secrets are in the guest at that time (before any serious userspace kicks in). Convert CONFIG_XEN_SCRUB_PAGES to CONFIG_XEN_SCRUB_PAGES_DEFAULT (also enabled by default), controlling default value for the new runtime switch. Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2018-09-14xen/manage: don't complain about an empty value in control/sysrq nodeVitaly Kuznetsov
When guest receives a sysrq request from the host it acknowledges it by writing '\0' to control/sysrq xenstore node. This, however, makes the xenstore watch fire again, but xenbus_scanf() fails to parse the empty value with the "%c" format string:
sysrq: SysRq : Emergency Sync
Emergency Sync complete
xen:manage: Error -34 reading sysrq code in control/sysrq
Ignore -ERANGE the same way we already ignore -ENOENT, empty value in control/sysrq is totally legal. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2018-09-14batman-adv: Mark debugfs functionality as deprecatedSven Eckelmann
CONFIG_BATMAN_ADV_DEBUGFS is disabled by default because debugfs is not supported for batman-adv interfaces in any non-default netns. Any remaining users of this interface should still be informed about the deprecation and the generic netlink alternative. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-14batman-adv: Start new development cycleSimon Wunderlich
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-14asm-generic: io: Fix ioport_map() for !CONFIG_GENERIC_IOMAP && CONFIG_INDIRECT_PIOAndrew Murray
The !CONFIG_GENERIC_IOMAP version of ioport_map uses MMIO_UPPER_LIMIT to prevent users from making I/O accesses outside the expected I/O range - however it erroneously treats MMIO_UPPER_LIMIT as a mask which is contradictory to its other users. The introduction of CONFIG_INDIRECT_PIO, which subtracts an arbitrary amount from IO_SPACE_LIMIT to form MMIO_UPPER_LIMIT, results in ioport_map mangling the given port rather than capping it. We address this by aligning more closely with the CONFIG_GENERIC_IOMAP implementation of ioport_map by using the comparison operator and returning NULL where the port exceeds MMIO_UPPER_LIMIT. Though note that we preserve the existing behavior of masking with IO_SPACE_LIMIT such that we don't break existing buggy drivers that somehow rely on this masking. Fixes: 5745392e0c2b ("PCI: Apply the new generic I/O management on PCI IO hosts") Reported-by: Will Deacon <will.deacon@arm.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-09-13Merge tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printkLinus Torvalds
Pull printk fix from Petr Mladek: "Revert a commit that caused "quiet", "debug", and "loglevel" early parameters to be ignored for early boot messages" * tag 'printk-for-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: Revert "printk: make sure to print log on console."
2018-09-13Merge tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsLinus Torvalds
Pull overlayfs fixes from Miklos Szeredi: "This fixes a regression in the recent file stacking update, reported and fixed by Amir Goldstein. The fix is fairly trivial, but involves adding a fadvise() f_op and the associated churn in the vfs. As discussed on -fsdevel, there are other possible uses for this method, than allowing proper stacking for overlays. And there's one other fix for a syzkaller detected oops" * tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix oopses in ovl_fill_super() failure paths ovl: add ovl_fadvise() vfs: implement readahead(2) using POSIX_FADV_WILLNEED vfs: add the fadvise() file operation Documentation/filesystems: update documentation of file_operations ovl: fix GPF in swapfile_activate of file from overlayfs over xfs ovl: respect FIEMAP_FLAG_SYNC flag
2018-09-13Merge tag 'for-linus-20180913' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Three fixes that should go into this series. This contains: - Increase number of policies supported by blk-cgroup. With blk-iolatency, we now have four in kernel, but we had a hard limit of three... - Fix regression in null_blk, where the zoned support broke queue_mode=0 (bio based). - NVMe pull request, with a single fix for an issue in the rdma code" * tag 'for-linus-20180913' of git://git.kernel.dk/linux-block: null_blk: fix zoned support for non-rq based operation blk-cgroup: increase number of supported policies nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-13Merge tag 'for-4.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dmLinus Torvalds
Pull device mapper fixes from Mike Snitzer: - DM verity fix for crash due to using vmalloc'd buffers with the asynchronous crypto hash API. - Fix to both DM crypt and DM integrity targets to discontinue using CRYPTO_TFM_REQ_MAY_SLEEP because its use of GFP_KERNEL can lead to deadlock by recursing back into a filesystem. - Various DM raid fixes related to reshape and rebuild races. - Fix for DM thin-provisioning to avoid data corruption that was a side-effect of needing to abort DM thin metadata transaction due to running out of metadata space. Fix is to reserve a small amount of metadata space so that once it is used the DM thin-pool can finish its active transaction before switching to read-only mode. * tag 'for-4.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm thin metadata: try to avoid ever aborting transactions dm raid: bump target version, update comments and documentation dm raid: fix RAID leg rebuild errors dm raid: fix rebuild of specific devices by updating superblock dm raid: fix stripe adding reshape deadlock dm raid: fix reshape race on small devices dm: disable CRYPTO_TFM_REQ_MAY_SLEEP to fix a GFP_KERNEL recursion deadlock dm verity: fix crash on bufio buffer that was allocated with vmalloc
2018-09-13Merge tag 'drm-fixes-2018-09-14' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Dave Airlie: "This is the general drm fixes pull for rc4. i915: - Two GVT fixes (one for the mm reference issue you pointed out) - Gen 2 video playback fix - IPS timeout error suppression on Broadwell amdgpu: - Small memory leak - SR-IOV reset - locking fix - updated SDMA golden registers nouveau: - Remove some leftover debugging" * tag 'drm-fixes-2018-09-14' of git://anongit.freedesktop.org/drm/drm: drm/nouveau/devinit: fix warning when PMU/PRE_OS is missing drm/amdgpu: fix error handling in amdgpu_cs_user_fence_chunk drm/i915/overlay: Allocate physical registers from stolen drm/amdgpu: move PSP init prior to IH in gpu reset drm/amdgpu: Fix SDMA hang in prt mode v2 drm/amdgpu: fix amdgpu_mn_unlock() in the CS error path drm/i915/bdw: Increase IPS disable timeout to 100ms drm/i915/gvt: Fix the incorrect length of child_device_config issue drm/i915/gvt: Fix life cycle reference on KVM mm
2018-09-13Merge tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds
Pull pstore fix from Kees Cook: "This fixes a 6 year old pstore bug that everyone just got lucky in avoiding, likely due only using page-aligned persistent ram regions: - Handle page-vs-byte offset handling between iomap and vmap (Bin Yang)" * tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Fix incorrect persistent ram buffer mapping
2018-09-13Merge tag 'mmc-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds
Pull MMC host fixes from Ulf Hansson: - meson-mx-sdio: Fix OF child-node lookup - omap_hsmmc: Fix wakeirq handling on removal * tag 'mmc-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: mmc: meson-mx-sdio: fix OF child-node lookup mmc: omap_hsmmc: fix wakeirq handling on removal
2018-09-13Merge tag 'pinctrl-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrlLinus Torvalds
Pull pin control fixes from Linus Walleij: - A complicated IRQ fix for the MSM driver (see commit) - Fix the group/function check in the Ingenic driver - Deal with a possible NULL pointer dereference in the Madera driver * tag 'pinctrl-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: madera: Fix possible NULL pointer with pdata config pinctrl: ingenic: Fix group & function error checking pinctrl: msm: Really mask level interrupts to prevent latching
2018-09-13Merge branch 'for-4.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpuLinus Torvalds
Pull percpu maintainership update from Tejun Heo: "This updates the MAINTAINERS file to transfer the percpu tree maintainership to Dennis Zhou. Dennis rewrote a good portion of the percpu allocator, knows most of percpu related code, is already listed as a co-maintainer, has been reliable, and now sits right behind me. I'll keep reviewing and involved with percpu stuff and am sure that Dennis will soon make a better maintainer than I ever was" * 'for-4.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: MAINTAINERS: Make Dennis the percpu tree maintainer
2018-09-13Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernelLinus Torvalds
Pull hexagon fixes from Richard Kuo: "Some fixes for compile warnings" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel: hexagon: modify ffs() and fls() to return int arch/hexagon: fix kernel/dma.c build warning
2018-09-13Merge tag 's390-4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linuxLinus Torvalds
Pull s390 fixes from Martin Schwidefsky: - One fix for the zcrypt driver to correctly handle incomplete encryption/decryption operations. - A cleanup for the aqmask/apmask parsing to avoid variable length arrays on the stack. * tag 's390-4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/zcrypt: remove VLA usage from the AP bus s390/crypto: Fix return code checking in cbc_paes_crypt()
2018-09-13mm: get rid of vmacache_flush_all() entirelyLinus Torvalds
Jann Horn points out that the vmacache_flush_all() function is not only potentially expensive, it's buggy too. It also happens to be entirely unnecessary, because the sequence number overflow case can be avoided by simply making the sequence number be 64-bit. That doesn't even grow the data structures in question, because the other adjacent fields are already 64-bit. So simplify the whole thing by just making the sequence number overflow case go away entirely, which gets rid of all the complications and makes the code faster too. Win-win. [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics also just goes away entirely with this ] Reported-by: Jann Horn <jannh@google.com> Suggested-by: Will Deacon <will.deacon@arm.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-13net/ibm/emac: Remove VLA usageKees Cook
In the quest to remove all stack VLA usage from the kernel[1], this removes the VLA used for the emac xaht registers size. Since the size of registers can only ever be 4 or 8, as detected in emac_init_config(), the max can be hardcoded and a runtime test added for robustness. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Cc: "David S. Miller" <davem@davemloft.net> Cc: Christian Lamparter <chunkeey@gmail.com> Cc: Ivan Mikhaylov <ivan@de.ibm.com> Cc: netdev@vger.kernel.org Co-developed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-14Merge branch 'linux-4.19' of git://github.com/skeggsb/linux into drm-fixesDave Airlie
One more nouveau fix to remove some debug warnings. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Ben Skeggs <bskeggs@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/CABDvA==GF63dy8a9j611=-0x8G6FRu7uC-ZQypsLO_hqV4OAcA@mail.gmail.com
2018-09-14Merge branch 'drm-fixes-4.19' of git://people.freedesktop.org/~agd5f/linux into drm-fixesDave Airlie
A few fixes for 4.19: - Fix a small memory leak - SR-IOV reset fix - Fix locking in MMU-notifier error path - Updated SDMA golden settings to fix a PRT hang Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180912154735.2683-1-alexander.deucher@amd.com
2018-09-14Merge tag 'drm-intel-fixes-2018-09-11' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixesDave Airlie
This contains a regression fix for video playbacks on gen 2 hardware, an IPS timeout error suppression on Broadwell and GVT bucked with "Most critical one is to fix KVM's mm reference when we access guest memory, issue was raised by Linus [1], and another one with virtual opregion fix." [1] - https://lists.freedesktop.org/archives/intel-gvt-dev/2018-August/004130.html Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180911223229.GA30328@intel.com
2018-09-13socket: fix struct ifreq size in compat ioctlJohannes Berg
As reported by Robert O'Callahan, since Viro's commit to kill dev_ifsioc() we attempt to copy too much data in compat mode, which may lead to EFAULT when the 32-bit version of struct ifreq sits at/near the end of a page boundary, and the next page isn't mapped. Fix this by passing the appropriate compat/non-compat size to copy and using that, as before the dev_ifsioc() removal. This works because only the embedded "struct ifmap" has different size, and this is only used in SIOCGIFMAP/SIOCSIFMAP which has a different handler. All other parts of the union are naturally compatible. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199469. Fixes: bf4405737f9f ("kill dev_ifsioc()") Reported-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13pktgen: Fix fall-through annotationGustavo A. R. Silva
Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tg3: Fix fall-through annotationsGustavo A. R. Silva
Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13MAINTAINERS: Make Dennis the percpu tree maintainerTejun Heo
Dennis rewrote a significant portion of the percpu allocator and has shown that he can respond in a timely and helpful manner when issues are reported against percpu allocator. Let's make Dennis the percpu tree maintainer. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com>
2018-09-13gso_segment: Reset skb->mac_len after modifying network headerToke Høiland-Jørgensen
When splitting a GSO segment that consists of encapsulated packets, the skb->mac_len of the segments can end up being set wrong, causing packet drops in particular when using act_mirred and ifb interfaces in combination with a qdisc that splits GSO packets. This happens because at the time skb_segment() is called, network_header will point to the inner header, throwing off the calculation in skb_reset_mac_len(). The network_header is subsequently adjusted by the outer IP gso_segment handlers, but they don't set the mac_len. Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6 gso_segment handlers, after they modify the network_header. Many thanks to Eric Dumazet for his help in identifying the cause of the bug. Acked-by: Dave Taht <dave.taht@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13vxlan: Remove duplicated include from vxlan.hYueHaibing
Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetoothDavid S. Miller
Johan Hedberg says: ==================== pull request: bluetooth 2018-09-13 A few Bluetooth fixes for the 4.19-rc series: - Fixed rw_semaphore leak in hci_ldisc - Fixed local Out-of-Band pairing data handling Let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13Merge branch 'tls-don-t-leave-keys-in-kernel-memory'David S. Miller
Sabrina Dubroca says: ==================== tls: don't leave keys in kernel memory There are a few places where the RX/TX key for a TLS socket is copied to kernel memory. This series clears those memory areas when they're no longer needed. v2: add union tls_crypto_context, following Vakul Garg's comment swap patch 2 and 3, using new union in patch 3 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tls: clear key material from kernel memory when do_tls_setsockopt_conf failsSabrina Dubroca
Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tls: zero the crypto information from tls_context before freeingSabrina Dubroca
This contains key material in crypto_send_aes_gcm_128 and crypto_recv_aes_gcm_128. Introduce union tls_crypto_context, and replace the two identical unions directly embedded in struct tls_context with it. We can then use this union to clean up the memory in the new tls_ctx_free() function. Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tls: don't copy the key out of tls12_crypto_info_aes_gcm_128Sabrina Dubroca
There's no need to copy the key to an on-stack buffer before calling crypto_aead_setkey(). Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13neighbour: confirm neigh entries when ARP packet is receivedVasily Khoruzhick
Update 'confirmed' timestamp when ARP packet is received. It shouldn't affect locktime logic and anyway entry can be confirmed by any higher-layer protocol. Thus it makes sense to confirm it when ARP packet is received. Fixes: 77d7123342dc ("neighbour: update neigh timestamps iff update is effective") Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13net: rtnl_configure_link: fix dev flags changes arg to __dev_notify_flagsRoopa Prabhu
This fix addresses https://bugzilla.kernel.org/show_bug.cgi?id=201071 Commit 5025f7f7d506 wrongly relied on __dev_change_flags to notify users of dev flag changes in the case when dev->rtnl_link_state = RTNL_LINK_INITIALIZED. Fix it by indicating flag changes explicitly to __dev_notify_flags. Fixes: 5025f7f7d506 ("rtnetlink: add rtnl_link_state check in rtnl_configure_link") Reported-By: Liam mcbirnie <liam.mcbirnie@boeing.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13net/appletalk: fix minor pointer leak to userspace in SIOCFINDIPDDPRTWilly Tarreau
Fields ->dev and ->next of struct ipddp_route may be copied to userspace on the SIOCFINDIPDDPRT ioctl. This is only accessible to CAP_NET_ADMIN though. Let's manually copy the relevant fields instead of using memcpy(). BugLink: http://blog.infosectcbr.com.au/2018/09/linux-kernel-infoleaks.html Cc: Jann Horn <jannh@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13hv_netvsc: fix schedule in RCU contextStephen Hemminger
When netvsc device is removed it can call reschedule in RCU context. This happens because canceling the subchannel setup work could (in theory) cause a reschedule when manipulating the timer. To reproduce, run with lockdep enabled kernel and unbind a network device from hv_netvsc (via sysfs).
[ 160.682011] WARNING: suspicious RCU usage
[ 160.707466] 4.19.0-rc3-uio+ #2 Not tainted
[ 160.709937] -----------------------------
[ 160.712352] ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section!
[ 160.723691]
[ 160.723691] other info that might help us debug this:
[ 160.723691]
[ 160.730955]
[ 160.730955] rcu_scheduler_active = 2, debug_locks = 1
[ 160.762813] 5 locks held by rebind-eth.sh/1812:
[ 160.766851] #0: 000000008befa37a (sb_writers#6){.+.+}, at: vfs_write+0x184/0x1b0
[ 160.773416] #1: 00000000b097f236 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[ 160.783766] #2: 0000000041ee6889 (kn->count#3){++++}, at: kernfs_fop_write+0xeb/0x1a0
[ 160.787465] #3: 0000000056d92a74 (&dev->mutex){....}, at: device_release_driver_internal+0x39/0x250
[ 160.816987] #4: 0000000030f6031e (rcu_read_lock){....}, at: netvsc_remove+0x1e/0x250 [hv_netvsc]
[ 160.828629]
[ 160.828629] stack backtrace:
[ 160.831966] CPU: 1 PID: 1812 Comm: rebind-eth.sh Not tainted 4.19.0-rc3-uio+ #2
[ 160.832952] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v1.0 11/26/2012
[ 160.832952] Call Trace:
[ 160.832952] dump_stack+0x85/0xcb
[ 160.832952] ___might_sleep+0x1a3/0x240
[ 160.832952] __flush_work+0x57/0x2e0
[ 160.832952] ? __mutex_lock+0x83/0x990
[ 160.832952] ? __kernfs_remove+0x24f/0x2e0
[ 160.832952] ? __kernfs_remove+0x1b2/0x2e0
[ 160.832952] ? mark_held_locks+0x50/0x80
[ 160.832952] ? get_work_pool+0x90/0x90
[ 160.832952] __cancel_work_timer+0x13c/0x1e0
[ 160.832952] ? netvsc_remove+0x1e/0x250 [hv_netvsc]
[ 160.832952] ? __lock_is_held+0x55/0x90
[ 160.832952] netvsc_remove+0x9a/0x250 [hv_netvsc]
[ 160.832952] vmbus_remove+0x26/0x30
[ 160.832952] device_release_driver_internal+0x18a/0x250
[ 160.832952] unbind_store+0xb4/0x180
[ 160.832952] kernfs_fop_write+0x113/0x1a0
[ 160.832952] __vfs_write+0x36/0x1a0
[ 160.832952] ? rcu_read_lock_sched_held+0x6b/0x80
[ 160.832952] ? rcu_sync_lockdep_assert+0x2e/0x60
[ 160.832952] ? __sb_start_write+0x141/0x1a0
[ 160.832952] ? vfs_write+0x184/0x1b0
[ 160.832952] vfs_write+0xbe/0x1b0
[ 160.832952] ksys_write+0x55/0xc0
[ 160.832952] do_syscall_64+0x60/0x1b0
[ 160.832952] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 160.832952] RIP: 0033:0x7fe48f4c8154
Resolve this by getting RTNL earlier. This is safe because the subchannel work queue does trylock on RTNL and will detect the race. Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13net: dsa: b53: Do not fail when IRQ are not initializedFlorian Fainelli
When the Device Tree is not providing the per-port interrupts, do not fail during b53_srab_irq_enable() but instead bail out gracefully. The SRAB driver is used on the BCM5301X (Northstar) platforms which do not yet have the SRAB interrupts wired up. Fixes: 16994374a6fc ("net: dsa: b53: Make SRAB driver manage port interrupts") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13Merge branch 'vhost_net-TX-batching'David S. Miller
Jason Wang says: ==================== vhost_net TX batching This series tries to batch submitting packets to underlayer socket through msg_control during sendmsg(). This is done by:
1) Doing userspace copy inside vhost_net
2) Build XDP buff
3) Batch at most 64 (VHOST_NET_BATCH) XDP buffs and submit them once through msg_control during sendmsg().
4) Underlayer sockets can use XDP buffs directly when XDP is enabled, or build skb based on XDP buff.
For packets that cannot be built easily with XDP, or when batch submission is hard (e.g. sndbuf is limited), we will go for the previous slow path, passing iov iterator to underlayer socket through sendmsg() once per packet. This can help to improve cache utilization and avoid lots of indirect calls with sendmsg(). It can also co-operate with the batching support of the underlayer sockets (e.g. the case of XDP redirection through maps). Testpmd(txonly) in guest shows obvious improvements:
Test /+pps%
XDP_DROP on TAP /+44.8%
XDP_REDIRECT on TAP /+29%
macvtap (skb) /+26%
Netperf TCP_STREAM TX from guest shows obvious improvements on small packet:
size/session/+thu%/+normalize%
64/ 1/ +2%/ 0%
64/ 2/ +3%/ +1%
64/ 4/ +7%/ +5%
64/ 8/ +8%/ +6%
256/ 1/ +3%/ 0%
256/ 2/ +10%/ +7%
256/ 4/ +26%/ +22%
256/ 8/ +27%/ +23%
512/ 1/ +3%/ +2%
512/ 2/ +19%/ +14%
512/ 4/ +43%/ +40%
512/ 8/ +45%/ +41%
1024/ 1/ +4%/ 0%
1024/ 2/ +27%/ +21%
1024/ 4/ +38%/ +73%
1024/ 8/ +15%/ +24%
2048/ 1/ +10%/ +7%
2048/ 2/ +16%/ +12%
2048/ 4/ 0%/ +2%
2048/ 8/ 0%/ +2%
4096/ 1/ +36%/ +60%
4096/ 2/ -11%/ -26%
4096/ 4/ 0%/ +14%
4096/ 8/ 0%/ +4%
16384/ 1/ -1%/ +5%
16384/ 2/ 0%/ +2%
16384/ 4/ 0%/ -3%
16384/ 8/ 0%/ +4%
65535/ 1/ 0%/ +10%
65535/ 2/ 0%/ +8%
65535/ 4/ 0%/ +1%
65535/ 8/ 0%/ +3%
Please review. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13vhost_net: batch submitting XDP buffers to underlayer socketsJason Wang
This patch implements XDP batching for vhost_net. The idea is first to try to do userspace copy and build XDP buff directly in vhost. Instead of submitting the packet immediately, vhost_net will batch them in an array and submit every 64 (VHOST_NET_BATCH) packets to the underlayer sockets through msg_control of sendmsg(). When XDP is enabled on the TUN/TAP, TUN/TAP can process XDP inside a loop without caring about GUP, thus it can do batch map flushing. When XDP is not enabled or not supported, the underlayer socket needs to build skb and pass it to network core. The batched packet submission allows us to do batching like netif_receive_skb_list() in the future. This saves lots of indirect calls for better cache utilization. For the case that we can't do batching, e.g. when sndbuf is limited or packet size is too large, we will go for the usual one packet per sendmsg() way. Doing testpmd on various setups gives us:
Test /+pps%
XDP_DROP on TAP /+44.8%
XDP_REDIRECT on TAP /+29%
macvtap (skb) /+26%
Netperf tests show obvious improvements for small packet transmission:
size/session/+thu%/+normalize%
64/ 1/ +2%/ 0%
64/ 2/ +3%/ +1%
64/ 4/ +7%/ +5%
64/ 8/ +8%/ +6%
256/ 1/ +3%/ 0%
256/ 2/ +10%/ +7%
256/ 4/ +26%/ +22%
256/ 8/ +27%/ +23%
512/ 1/ +3%/ +2%
512/ 2/ +19%/ +14%
512/ 4/ +43%/ +40%
512/ 8/ +45%/ +41%
1024/ 1/ +4%/ 0%
1024/ 2/ +27%/ +21%
1024/ 4/ +38%/ +73%
1024/ 8/ +15%/ +24%
2048/ 1/ +10%/ +7%
2048/ 2/ +16%/ +12%
2048/ 4/ 0%/ +2%
2048/ 8/ 0%/ +2%
4096/ 1/ +36%/ +60%
4096/ 2/ -11%/ -26%
4096/ 4/ 0%/ +14%
4096/ 8/ 0%/ +4%
16384/ 1/ -1%/ +5%
16384/ 2/ 0%/ +2%
16384/ 4/ 0%/ -3%
16384/ 8/ 0%/ +4%
65535/ 1/ 0%/ +10%
65535/ 2/ 0%/ +8%
65535/ 4/ 0%/ +1%
65535/ 8/ 0%/ +3%
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tap: accept an array of XDP buffs through sendmsg()Jason Wang
This patch implements the TUN_MSG_PTR msg_control type. This type allows the caller to pass an array of XDP buffs to tuntap through the ptr field of the tun_msg_control. Tap will build skbs from those XDP buffers. This avoids lots of indirect calls, which improves icache utilization, and allows XDP batched flushing when doing XDP redirection. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: accept an array of XDP buffs through sendmsg()Jason Wang
This patch implements the TUN_MSG_PTR msg_control type. This type allows the caller to pass an array of XDP buffs to tuntap through the ptr field of the tun_msg_control. If an XDP program is attached, tuntap can run the XDP program directly. If not, tuntap will build an skb and do a fast receive, since part of the work has been done by vhost_net. This avoids lots of indirect calls, which improves icache utilization, and allows XDP batched flushing when doing XDP redirection. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tun: switch to new type of msg_controlJason Wang
This patch introduces a new tun/tap specific msg_control:
#define TUN_MSG_UBUF 1
#define TUN_MSG_PTR 2
struct tun_msg_ctl {
        int type;
        void *ptr;
};
This allows us to pass different kinds of msg_control through sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which will be used by the existing vhost_net zerocopy code. The second is XDP buff, which allows vhost_net to pass XDP buff to TUN. This could be used to implement accepting an array of XDP buffs from vhost_net in the following patches. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: move XDP flushing out of tun_do_xdp()Jason Wang
This will allow adding batch flushing on top. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: split out XDP logicJason Wang
This patch splits out the XDP logic into a single function. This allows it to be reused by the XDP batching path in the following patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: tweak on the path of skb XDP case in tun_build_skb()Jason Wang
If we're sure not to go native XDP, there's no need for several things like bh and rcu handling. So this patch introduces a helper to build the skb and hold the page refcnt. When we find that we will go through the skb path, build the skb directly. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: simplify error handling in tun_build_skb()Jason Wang
There's no need to duplicate the page get logic in each action. So this patch gets the page and calculates the offset before processing XDP actions (except for XDP_DROP), and undoes them on errors (we don't care about performance on errors). This will be used for factoring out the XDP logic. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: enable bh early during processing XDPJason Wang
This patch moves the bh enabling a little bit earlier; this will be used for factoring out the core XDP logic of tuntap. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: switch to use XDP_PACKET_HEADROOMJason Wang
Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13net: sock: introduce SOCK_XDPJason Wang
This patch introduces a new sock flag - SOCK_XDP. This will be used for notifying the upper layer that an XDP program is attached on the lower socket and requires extra headroom. TUN will be the first user. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>