summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-04-28PCI: portdrv: Set driver_managed_dmaLu Baolu
If a switch lacks ACS P2P Request Redirect, a device below the switch can bypass the IOMMU and DMA directly to other devices below the switch, so all the downstream devices must be in the same IOMMU group as the switch itself. The existing VFIO framework allows the portdrv driver to be bound to the bridge while its downstream devices are assigned to user space. The pci_dma_configure() marks the IOMMU group as containing only devices with kernel drivers that manage DMA. Avoid this default behavior for the portdrv driver in order for compatibility with the current VFIO usage. We achieve this by setting ".driver_managed_dma = true" in pci_driver structure. It is safe because the portdrv driver meets below criteria: - This driver doesn't use DMA, as you can't find any related calls like pci_set_master() or any kernel DMA API (dma_map_*() and etc.). - It doesn't use MMIO as you can't find ioremap() or similar calls. It's tolerant to userspace possibly also touching the same MMIO registers via P2P DMA access. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20220418005000.897664-7-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28PCI: pci_stub: Set driver_managed_dmaLu Baolu
The current VFIO implementation allows pci-stub driver to be bound to a PCI device with other devices in the same IOMMU group being assigned to userspace. The pci-stub driver has no dependencies on DMA or the IOVA mapping of the device, but it does prevent the user from having direct access to the device, which is useful in some circumstances. The pci_dma_configure() marks the iommu_group as containing only devices with kernel drivers that manage DMA. For compatibility with the VFIO usage, avoid this default behavior for the pci_stub. This allows the pci_stub still able to be used by the admin to block driver binding after applying the DMA ownership to VFIO. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20220418005000.897664-6-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28bus: platform,amba,fsl-mc,PCI: Add device DMA ownership managementLu Baolu
The devices on platform/amba/fsl-mc/PCI buses could be bound to drivers with the device DMA managed by kernel drivers or user-space applications. Unfortunately, multiple devices may be placed in the same IOMMU group because they cannot be isolated from each other. The DMA on these devices must either be entirely under kernel control or userspace control, never a mixture. Otherwise the driver integrity is not guaranteed because they could access each other through the peer-to-peer accesses which by-pass the IOMMU protection. This checks and sets the default DMA mode during driver binding, and cleanups during driver unbinding. In the default mode, the device DMA is managed by the device driver which handles DMA operations through the kernel DMA APIs (see Documentation/core-api/dma-api.rst). For cases where the devices are assigned for userspace control through the userspace driver framework(i.e. VFIO), the drivers(for example, vfio_pci/ vfio_platfrom etc.) may set a new flag (driver_managed_dma) to skip this default setting in the assumption that the drivers know what they are doing with the device DMA. Calling iommu_device_use_default_domain() before {of,acpi}_dma_configure is currently a problem. As things stand, the IOMMU driver ignored the initial iommu_probe_device() call when the device was added, since at that point it had no fwspec yet. In this situation, {of,acpi}_iommu_configure() are retriggering iommu_probe_device() after the IOMMU driver has seen the firmware data via .of_xlate to learn that it actually responsible for the given device. As the result, before that gets fixed, iommu_use_default_domain() goes at the end, and calls arch_teardown_dma_ops() if it fails. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Stuart Yoder <stuyoder@gmail.com> Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20220418005000.897664-5-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28amba: Stop sharing platform_dma_configure()Lu Baolu
Stop sharing platform_dma_configure() helper as they are about to have their own bus dma_configure callbacks. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220418005000.897664-4-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28driver core: Add dma_cleanup callback in bus_typeLu Baolu
The bus_type structure defines dma_configure() callback for bus drivers to configure DMA on the devices. This adds the paired dma_cleanup() callback and calls it during driver unbinding so that bus drivers can do some cleanup work. One use case for this paired DMA callbacks is for the bus driver to check for DMA ownership conflicts during driver binding, where multiple devices belonging to a same IOMMU group (the minimum granularity of isolation and protection) may be assigned to kernel drivers or user space respectively. Without this change, for example, the vfio driver has to listen to a bus BOUND_DRIVER event and then BUG_ON() in case of dma ownership conflict. This leads to bad user experience since careless driver binding operation may crash the system if the admin overlooks the group restriction. Aside from bad design, this leads to a security problem as a root user, even with lockdown=integrity, can force the kernel to BUG. With this change, the bus driver could check and set the DMA ownership in driver binding process and fail on ownership conflicts. The DMA ownership should be released during driver unbinding. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220418005000.897664-3-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-28iommu: Add DMA ownership management interfacesLu Baolu
Multiple devices may be placed in the same IOMMU group because they cannot be isolated from each other. These devices must either be entirely under kernel control or userspace control, never a mixture. This adds dma ownership management in iommu core and exposes several interfaces for the device drivers and the device userspace assignment framework (i.e. VFIO), so that any conflict between user and kernel controlled dma could be detected at the beginning. The device driver oriented interfaces are, int iommu_device_use_default_domain(struct device *dev); void iommu_device_unuse_default_domain(struct device *dev); By calling iommu_device_use_default_domain(), the device driver tells the iommu layer that the device dma is handled through the kernel DMA APIs. The iommu layer will manage the IOVA and use the default domain for DMA address translation. The device user-space assignment framework oriented interfaces are, int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner); void iommu_group_release_dma_owner(struct iommu_group *group); bool iommu_group_dma_owner_claimed(struct iommu_group *group); The device userspace assignment must be disallowed if the DMA owner claiming interface returns failure. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20220418005000.897664-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-04-24Linux 5.18-rc4Linus Torvalds
2022-04-24Merge tag 'sched_urgent_for_v5.18_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Fix a corner case when calculating sched runqueue variables That fix also removes a check for a zero divisor in the code, without mentioning it. Vincent clarified that it's ok after I whined about it: https://lore.kernel.org/all/CAKfTPtD2QEyZ6ADd5WrwETMOX0XOwJGnVddt7VHgfURdqgOS-Q@mail.gmail.com/ * tag 'sched_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/pelt: Fix attach_entity_load_avg() corner case
2022-04-24Merge tag 'powerpc-5.18-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Partly revert a change to our timer_interrupt() that caused lockups with high res timers disabled. - Fix a bug in KVM TCE handling that could corrupt kernel memory. - Two commits fixing Power9/Power10 perf alternative event selection. Thanks to Alexey Kardashevskiy, Athira Rajeev, David Gibson, Frederic Barrat, Madhavan Srinivasan, Miguel Ojeda, and Nicholas Piggin. * tag 'powerpc-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/perf: Fix 32bit compile powerpc/perf: Fix power10 event alternatives powerpc/perf: Fix power9 event alternatives KVM: PPC: Fix TCE handling for VFIO powerpc/time: Always set decrementer in timer_interrupt()
2022-04-24Merge tag 'perf_urgent_for_v5.18_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Add Sapphire Rapids CPU support - Fix a perf vmalloc-ed buffer mapping error (PERF_USE_VMALLOC in use) * tag 'perf_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/cstate: Add SAPPHIRERAPIDS_X CPU support perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled
2022-04-24Merge tag 'edac_urgent_for_v5.18_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fix from Borislav Petkov: - Read the reported error count from the proper register on synopsys_edac * tag 'edac_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/synopsys: Read the error count from the correct register
2022-04-24kvmalloc: use vmalloc_huge for vmalloc allocationsLinus Torvalds
Since commit 559089e0a93d ("vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP"), the use of hugepage mappings for vmalloc is an opt-in strategy, because it caused a number of problems that weren't noticed until x86 enabled it too. One of the issues was fixed by Nick Piggin in commit 3b8000ae185c ("mm/vmalloc: huge vmalloc backing pages should be split rather than compound"), but I'm still worried about page protection issues, and VM_FLUSH_RESET_PERMS in particular. However, like the hash table allocation case (commit f2edd118d02d: "page_alloc: use vmalloc_huge for large system hash"), the use of kvmalloc() should be safe from any such games, since the returned pointer might be a SLUB allocation, and as such no user should reasonably be using it in any odd ways. We also know that the allocations are fairly large, since it falls back to the vmalloc case only when a kmalloc() fails. So using a hugepage mapping seems both safe and relevant. This patch does show a weakness in the opt-in strategy: since the opt-in flag is in the 'vm_flags', not the usual gfp_t allocation flags, very few of the usual interfaces actually expose it. That's not much of an issue in this case that already used one of the fairly specialized low-level vmalloc interfaces for the allocation, but for a lot of other vmalloc() users that might want to opt in, it's going to be very inconvenient. We'll either have to fix any compatibility problems, or expose it in the gfp flags (__GFP_COMP would have made a lot of sense) to allow normal vmalloc() users to use hugepage mappings. That said, the cases that really matter were probably already taken care of by the hash tabel allocation. Link: https://lore.kernel.org/all/20220415164413.2727220-1-song@kernel.org/ Link: https://lore.kernel.org/all/CAHk-=whao=iosX1s5Z4SF-ZGa-ebAukJoAdUJFk5SPwnofV+Vg@mail.gmail.com/ Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Cc: Song Liu <songliubraving@fb.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-24page_alloc: use vmalloc_huge for large system hashSong Liu
Use vmalloc_huge() in alloc_large_system_hash() so that large system hash (>= PMD_SIZE) could benefit from huge pages. Note that vmalloc_huge only allocates huge pages for systems with HAVE_ARCH_HUGE_VMALLOC. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-23Merge tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull ksmbd server fixes from Steve French: - cap maximum sector size reported to avoid mount problems - reference count fix - fix filename rename race * tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION ksmbd: increment reference count of parent fp ksmbd: remove filename in ksmbd_file
2022-04-23Merge tag 'arc-5.18-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: - Assorted fixes * tag 'arc-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: remove redundant READ_ONCE() in cmpxchg loop ARC: atomic: cleanup atomic-llsc definitions arc: drop definitions of pgd_index() and pgd_offset{, _k}() entirely ARC: dts: align SPI NOR node name with dtschema ARC: Remove a redundant memset() ARC: fix typos in comments ARC: entry: fix syscall_trace_exit argument
2022-04-23Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "One fix for an information leak caused by copying a buffer to userspace without checking for error first in the sr driver" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sr: Do not leak information in ioctl
2022-04-23Merge tag 'for-linus-5.18-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "A simple cleanup patch and a refcount fix for Xen on Arm" * tag 'for-linus-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: arm/xen: Fix some refcount leaks xen: Convert kmap() to kmap_local_page()
2022-04-23Merge tag 'drm-fixes-2022-04-23' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull more drm fixes from Dave Airlie: "Maarten was away, so Maxine stepped up and sent me the drm-fixes merge, so no point leaving it for another week. The big change is an OF revert around bridge/panels, it may have some driver fallout, but hopefully this revert gets them shook out in the next week easier. Otherwise it's a bunch of locking/refcounts across drivers, a radeon dma_resv logic fix and some raspberry pi panel fixes. panel: - revert of patch that broke panel/bridge issues dma-buf: - remove unused header file. amdgpu: - partial revert of locking change radeon: - fix dma_resv logic inversion panel: - pi touchscreen panel init fixes vc4: - build fix - runtime pm refcount fix vmwgfx: - refcounting fix" * tag 'drm-fixes-2022-04-23' of git://anongit.freedesktop.org/drm/drm: drm/amdgpu: partial revert "remove ctx->lock" v2 Revert "drm: of: Lookup if child node has panel or bridge" Revert "drm: of: Properly try all possible cases for bridge/panel detection" drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage drm/vmwgfx: Fix gem refcounting and memory evictions drm/vc4: Fix build error when CONFIG_DRM_VC4=y && CONFIG_RASPBERRYPI_FIRMWARE=m drm/panel/raspberrypi-touchscreen: Initialise the bridge in prepare drm/panel/raspberrypi-touchscreen: Avoid NULL deref if not initialised dma-buf-map: remove renamed header file drm/radeon: fix logic inversion in radeon_sync_resv
2022-04-23Merge tag 'input-for-v5.18-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: - a new set of keycodes to be used by marine navigation systems - minor fixes to omap4-keypad and cypress-sf drivers * tag 'input-for-v5.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: add Marine Navigation Keycodes Input: omap4-keypad - fix pm_runtime_get_sync() error checking Input: cypress-sf - register a callback to disable the regulators
2022-04-23Merge tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Just two small regression fixes for bcache" * tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block: bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook() bcache: put bch_bio_map() back to correct location in journal_write_unlocked()
2022-04-23Merge tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Just two small fixes - one fixing a potential leak for the iovec for larger requests added in this cycle, and one fixing a theoretical leak with CQE_SKIP and IOPOLL" * tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block: io_uring: fix leaks on IOPOLL and CQE_SKIP io_uring: free iovec if file assignment fails
2022-04-23Merge tag 'perf-tools-fixes-for-v5.18-2022-04-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools fixes from Arnaldo Carvalho de Melo: - Fix header include for LLVM >= 14 when building with libclang. - Allow access to 'data_src' for auxtrace in 'perf script' with ARM SPE perf.data files, fixing processing data with such attributes. - Fix error message for test case 71 ("Convert perf time to TSC") on s390, where it is not supported. * tag 'perf-tools-fixes-for-v5.18-2022-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: perf test: Fix error message for test case 71 on s390, where it is not supported perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event perf script: Always allow field 'data_src' for auxtrace perf clang: Fix header include for LLVM >= 14
2022-04-23sparc: cacheflush_32.h needs struct pageRandy Dunlap
Add a struct page forward declaration to cacheflush_32.h. Fixes this build warning: CC drivers/crypto/xilinx/zynqmp-sha.o In file included from arch/sparc/include/asm/cacheflush.h:11, from include/linux/cacheflush.h:5, from drivers/crypto/xilinx/zynqmp-sha.c:6: arch/sparc/include/asm/cacheflush_32.h:38:37: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration 38 | void sparc_flush_page_to_ram(struct page *page); Exposed by commit 0e03b8fd2936 ("crypto: xilinx - Turn SHA into a tristate and allow COMPILE_TEST") but not Fixes: that commit because the underlying problem is older. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David S. Miller <davem@davemloft.net> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: sparclinux@vger.kernel.org Acked-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-23Merge tag 'drm-misc-fixes-2022-04-22' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes Two fixes for the raspberrypi panel initialisation, one fix for a logic inversion in radeon, a build and pm refcounting fix for vc4, two reverts for drm_of_get_bridge that caused a number of regression and a locking regression for amdgpu. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20220422084403.2xrhf3jusdej5yo4@houat
2022-04-22Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix some syzbot-detected bugs, as well as other bugs found by I/O injection testing. Change ext4's fallocate to consistently drop set[ug]id bits when an fallocate operation might possibly change the user-visible contents of a file. Also, improve handling of potentially invalid values in the the s_overhead_cluster superblock field to avoid ext4 returning a negative number of free blocks" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: fix a potential race while discarding reserved buffers after an abort ext4: update the cached overhead value in the superblock ext4: force overhead calculation if the s_overhead_cluster makes no sense ext4: fix overhead calculation to account for the reserved gdt blocks ext4, doc: fix incorrect h_reserved size ext4: limit length to bitmap_maxbytes - blocksize in punch_hole ext4: fix use-after-free in ext4_search_dir ext4: fix bug_on in start_this_handle during umount filesystem ext4: fix symlink file size not match to file content ext4: fix fallocate to use file_modified to update permissions consistently
2022-04-22Merge tag 'ata-5.18-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata Pull ATA fix from Damien Le Moal: "A single fix to avoid a NULL pointer dereference in the pata_marvell driver with adapters not supporting DMA, from Zheyu" * tag 'ata-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: pata_marvell: Check the 'bmdma_addr' beforing reading
2022-04-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "The main and larger change here is a workaround for AMD's lack of cache coherency for encrypted-memory guests. I have another patch pending, but it's waiting for review from the architecture maintainers. RISC-V: - Remove 's' & 'u' as valid ISA extension - Do not allow disabling the base extensions 'i'/'m'/'a'/'c' x86: - Fix NMI watchdog in guests on AMD - Fix for SEV cache incoherency issues - Don't re-acquire SRCU lock in complete_emulated_io() - Avoid NULL pointer deref if VM creation fails - Fix race conditions between APICv disabling and vCPU creation - Bugfixes for disabling of APICv - Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume selftests: - Do not use bitfields larger than 32-bits, they differ between GCC and clang" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: selftests: introduce and use more page size-related constants kvm: selftests: do not use bitfields larger than 32-bits for PTEs KVM: SEV: add cache flush to solve SEV cache incoherency issues KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs KVM: SVM: Simplify and harden helper to flush SEV guest page(s) KVM: selftests: Silence compiler warning in the kvm_page_table_test KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume KVM: SPDX style and spelling fixes KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race KVM: nVMX: Defer APICv updates while L2 is active until L1 is active KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io() RISC-V: KVM: Restrict the extensions that can be disabled RISC-V: KVM: Remove 's' & 'u' as valid ISA extension
2022-04-22perf test: Fix error message for test case 71 on s390, where it is not supportedThomas Richter
Test case 71 'Convert perf time to TSC' is not supported on s390. Subtest 71.1 is skipped with the correct message, but subtest 71.2 is not skipped and fails. The root cause is function evlist__open() called from test__perf_time_to_tsc(). evlist__open() returns -ENOENT because the event cycles:u is not supported by the selected PMU, for example platform s390 on z/VM or an x86_64 virtual machine. The PMU driver returns -ENOENT in this case. This error is leads to the failure. Fix this by returning TEST_SKIP on -ENOENT. Output before: 71: Convert perf time to TSC: 71.1: TSC support: Skip (This architecture does not support) 71.2: Perf time to TSC: FAILED! Output after: 71: Convert perf time to TSC: 71.1: TSC support: Skip (This architecture does not support) 71.2: Perf time to TSC: Skip (perf_read_tsc_conversion is not supported) This also happens on an x86_64 virtual machine: # uname -m x86_64 $ ./perf test -F 71 71: Convert perf time to TSC : 71.1: TSC support : Ok 71.2: Perf time to TSC : FAILED! $ Committer testing: Continues to work on x86_64: $ perf test 71 71: Convert perf time to TSC : 71.1: TSC support : Ok 71.2: Perf time to TSC : Ok $ Fixes: 290fa68bdc458863 ("perf test tsc: Fix error message when not supported") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Chengdong Li <chengdongli@tencent.com> Cc: chengdongli@tencent.com Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Link: https://lore.kernel.org/r/20220420062921.1211825-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-04-22perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE eventLeo Yan
Since commit bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem info is not available") "perf mem report" and "perf report --mem-mode" don't report result if the PERF_SAMPLE_DATA_SRC bit is missed in sample type. The commit ffab487052054162 ("perf: arm-spe: Fix perf report --mem-mode") partially fixes the issue. It adds PERF_SAMPLE_DATA_SRC bit for Arm SPE event, this allows the perf data file generated by kernel v5.18-rc1 or later version can be reported properly. On the other hand, perf tool still fails to be backward compatibility for a data file recorded by an older version's perf which contains Arm SPE trace data. This patch is a workaround in reporting phase, when detects ARM SPE PMU event and without PERF_SAMPLE_DATA_SRC bit, it will force to set the bit in the sample type and give a warning info. Fixes: bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem info is not available") Reviewed-by: James Clark <james.clark@arm.com> Signed-off-by: Leo Yan <leo.yan@linaro.org> Tested-by: German Gomez <german.gomez@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Link: https://lore.kernel.org/r/20220414123201.842754-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-04-22perf script: Always allow field 'data_src' for auxtraceLeo Yan
If use command 'perf script -F,+data_src' to dump memory samples with Arm SPE trace data, it reports error: # perf script -F,+data_src Samples for 'dummy:u' event do not have DATA_SRC attribute set. Cannot print 'data_src' field. This is because the 'dummy:u' event is absent DATA_SRC bit in its sample type, so if a file contains AUX area tracing data then always allow field 'data_src' to be selected as an option for perf script. Fixes: e55ed3423c1bb29f ("perf arm-spe: Synthesize memory event") Signed-off-by: Leo Yan <leo.yan@linaro.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: German Gomez <german.gomez@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20220417114837.839896-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-04-22perf clang: Fix header include for LLVM >= 14Guilherme Amadio
The header TargetRegistry.h has moved in LLVM/clang 14. Committer notes: The problem as noticed when building in ubuntu:22.04: 90 98.61 ubuntu:22.04 : FAIL gcc version 11.2.0 (Ubuntu 11.2.0-19ubuntu1) util/c++/clang.cpp:23:10: fatal error: llvm/Support/TargetRegistry.h: No such file or directory 23 | #include "llvm/Support/TargetRegistry.h" | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ compilation terminated. Fixed after applying this patch. Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org> Signed-off-by: Guilherme Amadio <amadio@gentoo.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://twitter.com/GuilhermeAmadio/status/1514970524232921088 Link: http://lore.kernel.org/lkml/Ylp0M/VYgHOxtcnF@gentoo.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-04-22gpio: Request interrupts after IRQ is initializedMario Limonciello
Commit 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members before initialization") attempted to fix a race condition that lead to a NULL pointer, but in the process caused a regression for _AEI/_EVT declared GPIOs. This manifests in messages showing deferred probing while trying to allocate IRQs like so: amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0000 to IRQ, err -517 amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x002C to IRQ, err -517 amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003D to IRQ, err -517 [ .. more of the same .. ] The code for walking _AEI doesn't handle deferred probing and so this leads to non-functional GPIO interrupts. Fix this issue by moving the call to `acpi_gpiochip_request_interrupts` to occur after gc->irc.initialized is set. Fixes: 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members before initialization") Link: https://lore.kernel.org/linux-gpio/BL1PR12MB51577A77F000A008AA694675E2EF9@BL1PR12MB5157.namprd12.prod.outlook.com/ Link: https://bugzilla.suse.com/show_bug.cgi?id=1198697 Link: https://bugzilla.kernel.org/show_bug.cgi?id=215850 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1979 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1976 Reported-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Shreeya Patel <shreeya.patel@collabora.com> Tested-By: Samuel Čavoj <samuel@cavoj.net> Tested-By: lukeluk498@gmail.com Link: Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-and-tested-by: Takashi Iwai <tiwai@suse.de> Cc: Shreeya Patel <shreeya.patel@collabora.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-22Merge tag 'riscv-for-linus-5.18-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes Palmer Dabbelt: - A pair of build fixes for the recent cpuidle driver - A fix for systems without sv57 that manifests as a crash early in boot * tag 'riscv-for-linus-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: RISC-V: cpuidle: fix Kconfig select for RISCV_SBI_CPUIDLE RISC-V: mm: Fix set_satp_mode() for platform not having Sv57 cpuidle: riscv: support non-SMP config
2022-04-22Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "There's no real pattern to the fixes, but the main one fixes our pmd_leaf() definition to resolve a NULL dereference on the migration path. - Fix PMU event validation in the absence of any event counters - Fix allmodconfig build using clang in conjunction with binutils - Fix definitions of pXd_leaf() to handle PROT_NONE entries - More typo fixes" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: mm: fix p?d_leaf() arm64: fix typos in comments arm64: Improve HAVE_DYNAMIC_FTRACE_WITH_REGS selection for clang arm_pmu: Validate single/group leader events
2022-04-22arm/xen: Fix some refcount leaksMiaoqian Lin
The of_find_compatible_node() function returns a node pointer with refcount incremented, We should use of_node_put() on it when done Add the missing of_node_put() to release the refcount. Fixes: 9b08aaa3199a ("ARM: XEN: Move xen_early_init() before efi_init()") Fixes: b2371587fe0c ("arm/xen: Read extended regions from DT and init Xen resource") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
2022-04-22Merge tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarrayLinus Torvalds
Pull xarray fixes from Matthew Wilcox: "Syzbot found a nasty race between large page splitting and page lookup. Details in the commit log, but fortunately it has a reliable reproducer. I thought it better to send this one to you straight away. Also fix the test suite build for kmem_cache_alloc_lru()" * tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarray: XArray: Disallow sibling entries of nodes tools: Add kmem_cache_alloc_lru()
2022-04-22Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Four fixes, two of them for stable: - fcollapse fix - reconnect lock fix - DFS oops fix - minor cleanup patch" * tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: destage any unwritten data to the server before calling copychunk_write cifs: use correct lock type in cifs_reconnect() cifs: fix NULL ptr dereference in refresh_mounts() cifs: Use kzalloc instead of kmalloc/memset
2022-04-22Merge tag 'fs.fixes.v5.18-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull mount_setattr fix from Christian Brauner: "The recent cleanup in e257039f0fc7 ("mount_setattr(): clean the control flow and calling conventions") switched the mount attribute codepaths from do-while to for loops as they are more idiomatic when walking mounts. However, we did originally choose do-while constructs because if we request a mount or mount tree to be made read-only we need to hold writers in the following way: The mount attribute code will grab lock_mount_hash() and then call mnt_hold_writers() which will _unconditionally_ set MNT_WRITE_HOLD on the mount. Any callers that need write access have to call mnt_want_write(). They will immediately see that MNT_WRITE_HOLD is set on the mount and the caller will then either spin (on non-preempt-rt) or wait on lock_mount_hash() (on preempt-rt). The fact that MNT_WRITE_HOLD is set unconditionally means that once mnt_hold_writers() returns we need to _always_ pair it with mnt_unhold_writers() in both the failure and success paths. The do-while constructs did take care of this. But Al's change to a for loop in the failure path stops on the first mount we failed to change mount attributes _without_ going into the loop to call mnt_unhold_writers(). This in turn means that once we failed to make a mount read-only via mount_setattr() - i.e. there are already writers on that mount - we will block any writers indefinitely. Fix this by ensuring that the for loop always unsets MNT_WRITE_HOLD including the first mount we failed to change to read-only. Also sprinkle a few comments into the cleanup code to remind people about what is happening including myself. After all, I didn't catch it during review. This is only relevant on mainline and was reported by syzbot. Details about the syzbot reports are all in the commit message" * tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: unset MNT_WRITE_HOLD on failure
2022-04-22Merge tag 'sound-5.18-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "At this time, the majority of changes are for pending ASoC fixes while a few usual HD-audio and USB-audio quirks are found. Almost all patches are small device-specific fixes, and nothing worrisome stands out, so far" * tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (37 commits) ALSA: hda/realtek: Add quirk for Clevo NP70PNP ALSA: hda: intel-dsp-config: Add RaptorLake PCI IDs ALSA: hda/realtek: Enable mute/micmute LEDs and limit mic boost on EliteBook 845/865 G9 ALSA: usb-audio: Clear MIDI port active flag after draining ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX. ALSA: hda/i915: Fix one too many pci_dev_put() ALSA: hda/hdmi: add HDMI codec VID for Raptorlake-P ALSA: hda/hdmi: fix warning about PCM count when used with SOF sound/oss/dmasound: fix 'dmasound_setup' defined but not used firmware: cs_dsp: Fix overrun of unterminated control name string ASoC: codecs: Fix an error handling path in (rx|tx|va)_macro_probe() ASoC: Intel: sof_es8336: Add a quirk for Huawei Matebook D15 ASoC: Intel: sof_es8336: add a quirk for headset at mic1 port ASoC: Intel: sof_es8336: support a separate gpio to control headphone ASoC: Intel: sof_es8336: simplify speaker gpio naming ASoC: wm8731: Disable the regulator when probing fails ASoC: Intel: soc-acpi: correct device endpoints for max98373 ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use ASoC: SOF: topology: Fix memory leak in sof_control_load() ASoC: SOF: topology: cleanup dailinks on widget unload ...
2022-04-22XArray: Disallow sibling entries of nodesMatthew Wilcox (Oracle)
There is a race between xas_split() and xas_load() which can result in the wrong page being returned, and thus data corruption. Fortunately, it's hard to hit (syzbot took three months to find it) and often guarded with VM_BUG_ON(). The anatomy of this race is: thread A thread B order-9 page is stored at index 0x200 lookup of page at index 0x274 page split starts load of sibling entry at offset 9 stores nodes at offsets 8-15 load of entry at offset 8 The entry at offset 8 turns out to be a node, and so we descend into it, and load the page at index 0x234 instead of 0x274. This is hard to fix on the split side; we could replace the entire node that contains the order-9 page instead of replacing the eight entries. Fixing it on the lookup side is easier; just disallow sibling entries that point to nodes. This cannot ever be a useful thing as the descent would not know the correct offset to use within the new node. The test suite continues to pass, but I have not added a new test for this bug. Reported-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com Tested-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-04-22tools: Add kmem_cache_alloc_lru()Matthew Wilcox (Oracle)
Turn kmem_cache_alloc() into a wrapper around kmem_cache_alloc_lru(). Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Li Wang <liwang@redhat.com>
2022-04-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "13 patches. Subsystems affected by this patch series: mm (memory-failure, memcg, userfaultfd, hugetlbfs, mremap, oom-kill, kasan, hmm), and kcov" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() kcov: don't generate a warning on vm_insert_page()'s failure MAINTAINERS: add Vincenzo Frascino to KASAN reviewers oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup selftest/vm: add skip support to mremap_test selftest/vm: support xfail in mremap_test selftest/vm: verify remap destination address in mremap_test selftest/vm: verify mmap addr in mremap_test mm, hugetlb: allow for "high" userspace addresses userfaultfd: mark uffd_wp regardless of VM_WRITE flag memcg: sync flush only if periodic flush is delayed mm/memory-failure.c: skip huge_zero_page in memory_failure() mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
2022-04-22mm/vmalloc: huge vmalloc backing pages should be split rather than compoundNicholas Piggin
Huge vmalloc higher-order backing pages were allocated with __GFP_COMP in order to allow the sub-pages to be refcounted by callers such as "remap_vmalloc_page [sic]" (remap_vmalloc_range). However a similar problem exists for other struct page fields callers use, for example fb_deferred_io_fault() takes a vmalloc'ed page and not only refcounts it but uses ->lru, ->mapping, ->index. This is not compatible with compound sub-pages, and can cause bad page state issues like BUG: Bad page state in process swapper/0 pfn:00743 page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743 flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff) raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: corrupted mapping in tail page Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810 Call Trace: dump_stack_lvl+0x74/0xa8 (unreliable) bad_page+0x12c/0x170 free_tail_pages_check+0xe8/0x190 free_pcp_prepare+0x31c/0x4e0 free_unref_page+0x40/0x1b0 __vunmap+0x1d8/0x420 ... The correct approach is to use split high-order pages for the huge vmalloc backing. These allow callers to treat them in exactly the same way as individually-allocated order-0 pages. Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Cc: Song Liu <songliubraving@fb.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-22arm64: mm: fix p?d_leaf()Muchun Song
The pmd_leaf() is used to test a leaf mapped PMD, however, it misses the PROT_NONE mapped PMD on arm64. Fix it. A real world issue [1] caused by this was reported by Qian Cai. Also fix pud_leaf(). Link: https://patchwork.kernel.org/comment/24798260/ [1] Fixes: 8aa82df3c123 ("arm64: mm: add p?d_leaf() definitions") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Link: https://lore.kernel.org/r/20220422060033.48711-1-songmuchun@bytedance.com Signed-off-by: Will Deacon <will@kernel.org>
2022-04-21Merge tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Dave Airlie: "Extra quiet after Easter, only have minor i915 and msm pulls. However I haven't seen a PR from our misc tree in a little while, I've cc'ed all the suspects. Once that unblocks I expect a bit larger bunch of patches to arrive. Otherwise as I said, one msm revert and two i915 fixes. msm: - revert iommu change that broke some platforms. i915: - Unset enable_psr2_sel_fetch if PSR2 detection fails - Fix to detect when VRR is turned off from panel settings" * tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm: drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails drm/msm: Revert "drm/msm: Stop using iommu_present()" drm/i915/display/vrr: Reset VRR capable property on a long hpd
2022-04-21mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()Alistair Popple
In some cases it is possible for mmu_interval_notifier_remove() to race with mn_tree_inv_end() allowing it to return while the notifier data structure is still in use. Consider the following sequence: CPU0 - mn_tree_inv_end() CPU1 - mmu_interval_notifier_remove() ----------------------------------- ------------------------------------ spin_lock(subscriptions->lock); seq = subscriptions->invalidate_seq; spin_lock(subscriptions->lock); spin_unlock(subscriptions->lock); subscriptions->invalidate_seq++; wait_event(invalidate_seq != seq); return; interval_tree_remove(interval_sub); kfree(interval_sub); spin_unlock(subscriptions->lock); wake_up_all(); As the wait_event() condition is true it will return immediately. This can lead to use-after-free type errors if the caller frees the data structure containing the interval notifier subscription while it is still on a deferred list. Fix this by taking the appropriate lock when reading invalidate_seq to ensure proper synchronisation. I observed this whilst running stress testing during some development. You do have to be pretty unlucky, but it leads to the usual problems of use-after-free (memory corruption, kernel crash, difficult to diagnose WARN_ON, etc). Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier") Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Christian König <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21kcov: don't generate a warning on vm_insert_page()'s failureAleksandr Nogikh
vm_insert_page()'s failure is not an unexpected condition, so don't do WARN_ONCE() in such a case. Instead, print a kernel message and just return an error code. This flaw has been reported under an OOM condition by sysbot [1]. The message is mainly for the benefit of the test log, in this case the fuzzer's log so that humans inspecting the log can figure out what was going on. KCOV is a testing tool, so I think being a little more chatty when KCOV unexpectedly is about to fail will save someone debugging time. We don't want the WARN, because it's not a kernel bug that syzbot should report, and failure can happen if the fuzzer tries hard enough (as above). Link: https://lkml.kernel.org/r/Ylkr2xrVbhQYwNLf@elver.google.com [1] Link: https://lkml.kernel.org/r/20220401182512.249282-1-nogikh@google.com Fixes: b3d7fe86fbd0 ("kcov: properly handle subsequent mmap calls"), Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21MAINTAINERS: add Vincenzo Frascino to KASAN reviewersVincenzo Frascino
Add my email address to KASAN reviewers list to make sure that I am Cc'ed in all the KASAN changes that may affect arm64 MTE. Link: https://lkml.kernel.org/r/20220419170640.21404-1-vincenzo.frascino@arm.com Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanupNico Pache
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can be targeted by the oom reaper. This mapping is used to store the futex robust list head; the kernel does not keep a copy of the robust list and instead references a userspace address to maintain the robustness during a process death. A race can occur between exit_mm and the oom reaper that allows the oom reaper to free the memory of the futex robust list before the exit path has handled the futex death: CPU1 CPU2 -------------------------------------------------------------------- page_fault do_exit "signal" wake_oom_reaper oom_reaper oom_reap_task_mm (invalidates mm) exit_mm exit_mm_release futex_exit_release futex_cleanup exit_robust_list get_user (EFAULT- can't access memory) If the get_user EFAULT's, the kernel will be unable to recover the waiters on the robust_list, leaving userspace mutexes hung indefinitely. Delay the OOM reaper, allowing more time for the exit path to perform the futex cleanup. Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer Based on a patch by Michal Hocko. Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1] Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Signed-off-by: Joel Savitz <jsavitz@redhat.com> Signed-off-by: Nico Pache <npache@redhat.com> Co-developed-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Herton R. Krzesinski <herton@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Savitz <jsavitz@redhat.com> Cc: Darren Hart <dvhart@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21selftest/vm: add skip support to mremap_testSidhartha Kumar
Allow the mremap test to be skipped due to errors such as failing to parse the mmap_min_addr sysctl. Link: https://lkml.kernel.org/r/20220420215721.4868-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>