path: root/virt
AgeCommit message (Collapse)Author
2021-06-24KVM: do not allow mapping valid but non-reference-counted pagesNicholas Piggin
It's possible to create a region which maps valid but non-refcounted pages (e.g., tail pages of non-compound higher order allocations). These host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family of APIs, which take a reference to the page, which takes it from 0 to 1. When the reference is dropped, this will free the page incorrectly. Fix this by only taking a reference on valid pages if it was non-zero, which indicates it is participating in normal refcounting (and can be released with put_page). This addresses CVE-2021-22543. Signed-off-by: Nicholas Piggin <> Tested-by: Paolo Bonzini <> Cc: Signed-off-by: Paolo Bonzini <>
2021-05-27KVM: VMX: update vcpu posted-interrupt descriptor when assigning deviceMarcelo Tosatti
For VMX, when a vcpu enters HLT emulation, pi_post_block will: 1) Add vcpu to per-cpu list of blocked vcpus. 2) Program the posted-interrupt descriptor "notification vector" to POSTED_INTR_WAKEUP_VECTOR With interrupt remapping, an interrupt will set the PIR bit for the vector programmed for the device on the CPU, test-and-set the ON bit on the posted interrupt descriptor, and if the ON bit is clear generate an interrupt for the notification vector. This way, the target CPU wakes upon a device interrupt and wakes up the target vcpu. Problem is that pi_post_block only programs the notification vector if kvm_arch_has_assigned_device() is true. Its possible for the following to happen: 1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false, notification vector is not programmed 2) device is assigned to VM 3) device interrupts vcpu V, sets ON bit (notification vector not programmed, so pcpu P remains in idle) 4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set, kvm_vcpu_kick is skipped 5) vcpu 0 busy spins on vcpu V's response for several seconds, until RCU watchdog NMIs all vCPUs. To fix this, use the start_assignment kvm_x86_ops callback to kick vcpus out of the halt loop, so the notification vector is properly reprogrammed to the wakeup vector. Reported-by: Pei Zhang <> Signed-off-by: Marcelo Tosatti <> Message-Id: <20210526172014.GA29007@fuller.cnet> Signed-off-by: Paolo Bonzini <>
2021-05-27KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCKMarcelo Tosatti
KVM_REQ_UNBLOCK will be used to exit a vcpu from its inner vcpu halt emulation loop. Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch PowerPC to arch specific request bit. Signed-off-by: Marcelo Tosatti <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-05-27KVM: PPC: exit halt polling on need_resched()Wanpeng Li
This is inspired by commit 262de4102c7bb8 (kvm: exit halt polling on need_resched() as well). Due to PPC implements an arch specific halt polling logic, we have to the need_resched() check there as well. This patch adds a helper function that can be shared between book3s and generic halt-polling loops. Reviewed-by: David Matlack <> Reviewed-by: Venkatesh Srinivas <> Cc: Ben Segall <> Cc: Venkatesh Srinivas <> Cc: Jim Mattson <> Cc: David Matlack <> Cc: Paul Mackerras <> Cc: Suraj Jitindar Singh <> Signed-off-by: Wanpeng Li <> Message-Id: <> [Make the function inline. - Paolo] Signed-off-by: Paolo Bonzini <>
2021-05-17Merge tag 'kvmarm-fixes-5.13-1' of ↵Paolo Bonzini
git:// into HEAD KVM/arm64 fixes for 5.13, take #1 - Fix regression with irqbypass not restarting the guest on failed connect - Fix regression with debug register decoding resulting in overlapping access - Commit exception state on exit to usrspace - Fix the MMU notifier return values - Add missing 'static' qualifiers in the new host stage-2 code
2021-05-15Revert "irqbypass: do not start cons/prod when failed connect"Zhu Lingshan
This reverts commit a979a6aa009f3c99689432e0cdb5402a4463fb88. The reverted commit may cause VM freeze on arm64 with GICv4, where stopping a consumer is implemented by suspending the VM. Should the connect fail, the VM will not be resumed, which is a bit of a problem. It also erroneously calls the producer destructor unconditionally, which is unexpected. Reported-by: Shaokun Zhang <> Suggested-by: Marc Zyngier <> Acked-by: Jason Wang <> Acked-by: Michael S. Tsirkin <> Reviewed-by: Eric Auger <> Tested-by: Shaokun Zhang <> Signed-off-by: Zhu Lingshan <> [maz: tags and cc-stable, commit message update] Signed-off-by: Marc Zyngier <> Fixes: a979a6aa009f ("irqbypass: do not start cons/prod when failed connect") Link: Link: Cc:
2021-05-07kvm: Cap halt polling at kvm->max_halt_poll_nsDavid Matlack
When growing halt-polling, there is no check that the poll time exceeds the per-VM limit. It's possible for vcpu->halt_poll_ns to grow past kvm->max_halt_poll_ns and stay there until a halt which takes longer than kvm->halt_poll_ns. Signed-off-by: David Matlack <> Signed-off-by: Venkatesh Srinivas <> Message-Id: <> Cc: Signed-off-by: Paolo Bonzini <>
2021-05-03kvm: exit halt polling on need_resched() as wellBenjamin Segall
single_task_running() is usually more general than need_resched() but CFS_BANDWIDTH throttling will use resched_task() when there is just one task to get the task to block. This was causing long-need_resched warnings and was likely allowing VMs to overrun their quota when halt polling. Signed-off-by: Ben Segall <> Signed-off-by: Venkatesh Srinivas <> Message-Id: <> Signed-off-by: Paolo Bonzini <> Cc: Reviewed-by: Jim Mattson <>
2021-04-21KVM: Boost vCPU candidate in user mode which is delivering interruptWanpeng Li
Both lock holder vCPU and IPI receiver that has halted are condidate for boost. However, the PLE handler was originally designed to deal with the lock holder preemption problem. The Intel PLE occurs when the spinlock waiter is in kernel mode. This assumption doesn't hold for IPI receiver, they can be in either kernel or user mode. the vCPU candidate in user mode will not be boosted even if they should respond to IPIs. Some benchmarks like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most of the time they are running in user mode. It can lead to a large number of continuous PLE events because the IPI sender causes PLE events repeatedly until the receiver is scheduled while the receiver is not candidate for a boost. This patch boosts the vCPU candidiate in user mode which is delivery interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs VM in over-subscribe scenario (The host machine is 2 socket, 48 cores, 96 HTs Intel CLX box). There is no performance regression for other benchmarks like Unixbench spawn (most of the time contend read/write lock in kernel mode), ebizzy (most of the time contend read/write sem and TLB shoodtdown in kernel mode). Signed-off-by: Wanpeng Li <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-21KVM: x86: Support KVM VMs sharing SEV contextNathan Tempelman
Add a capability for userspace to mirror SEV encryption context from one vm to another. On our side, this is intended to support a Migration Helper vCPU, but it can also be used generically to support other in-guest workloads scheduled by the host. The intention is for the primary guest and the mirror to have nearly identical memslots. The primary benefits of this are that: 1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they can't accidentally clobber each other. 2) The VMs can have different memory-views, which is necessary for post-copy migration (the migration vCPUs on the target need to read and write to pages, when the primary guest would VMEXIT). This does not change the threat model for AMD SEV. Any memory involved is still owned by the primary guest and its initial state is still attested to through the normal SEV_LAUNCH_* flows. If userspace wanted to circumvent SEV, they could achieve the same effect by simply attaching a vCPU to the primary VM. This patch deliberately leaves userspace in charge of the memslots for the mirror, as it already has the power to mess with them in the primary guest. This patch does not support SEV-ES (much less SNP), as it does not handle handing off attested VMSAs to the mirror. For additional context, we need a Migration Helper because SEV PSP migration is far too slow for our live migration on its own. Using an in-guest migrator lets us speed this up significantly. Signed-off-by: Nathan Tempelman <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-20KVM: Add proper lockdep assertion in I/O bus unregisterSean Christopherson
Convert a comment above kvm_io_bus_unregister_dev() into an actual lockdep assertion, and opportunistically add curly braces to a multi-line for-loop. Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-20KVM: Stop looking for coalesced MMIO zones if the bus is destroyedSean Christopherson
Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev() fails to allocate memory for the new instance of the bus. If it can't instantiate a new bus, unregister_dev() destroys all devices _except_ the target device. But, it doesn't tell the caller that it obliterated the bus and invoked the destructor for all devices that were on the bus. In the coalesced MMIO case, this can result in a deleted list entry dereference due to attempting to continue iterating on coalesced_zones after future entries (in the walk) have been deleted. Opportunistically add curly braces to the for-loop, which encompasses many lines but sneaks by without braces due to the guts being a single if statement. Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()") Cc: Reported-by: Hao Sun <> Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-20KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing SRCUSean Christopherson
If allocating a new instance of an I/O bus fails when unregistering a device, wait to destroy the device until after all readers are guaranteed to see the new null bus. Destroying devices before the bus is nullified could lead to use-after-free since readers expect the devices on their reference of the bus to remain valid. Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()") Cc: Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-17KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslotSean Christopherson
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been detected in the memslots, i.e. don't take the lock for notifications that don't affect the guest. For small VMs, spurious locking is a minor annoyance. And for "volatile" setups where the majority of notifications _are_ relevant, this barely qualifies as an optimization. But, for large VMs (hundreds of threads) with static setups, e.g. no page migration, no swapping, etc..., the vast majority of MMU notifier callbacks will be unrelated to the guest, e.g. will often be in response to the userspace VMM adjusting its own virtual address space. In such large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from handling page faults. In some scenarios it can even be "fatal" in the sense that it causes unacceptable brownouts, e.g. when rebuilding huge pages after live migration, a significant percentage of vCPUs will be attempting to handle page faults. x86's TDP MMU implementation is especially susceptible to spurious locking due it taking mmu_lock for read when handling page faults. Because rwlock is fair, a single writer will stall future readers, while the writer is itself stalled waiting for in-progress readers to complete. This is exacerbated by the MMU notifiers often firing multiple times in quick succession, e.g. moving a page will (always?) invoke three separate notifiers: .invalidate_range_start(), invalidate_range_end(), and .change_pte(). Unnecessarily taking mmu_lock each time means even a single spurious sequence can be problematic. Note, this optimizes only the unpaired callbacks. Optimizing the .invalidate_range_{start,end}() pairs is more complex and will be done in a future patch. Suggested-by: Ben Gardon <> Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-17KVM: Move MMU notifier's mmu_lock acquisition into common helperSean Christopherson
Acquire and release mmu_lock in the __kvm_handle_hva_range() helper instead of requiring the caller to do the same. This paves the way for future patches to take mmu_lock if and only if an overlapping memslot is found, without also having to introduce the on_lock() shenanigans used to manipulate the notifier count and sequence. No functional change intended. Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-17KVM: Kill off the old hva-based MMU notifier callbacksSean Christopherson
Yank out the hva-based MMU notifier APIs now that all architectures that use the notifiers have moved to the gfn-based APIs. No functional change intended. Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-17KVM: Move x86's MMU notifier memslot walkers to generic codeSean Christopherson
Move the hva->gfn lookup for MMU notifiers into common code. Every arch does a similar lookup, and some arch code is all but identical across multiple architectures. In addition to consolidating code, this will allow introducing optimizations that will benefit all architectures without incurring multiple walks of the memslots, e.g. by taking mmu_lock if and only if a relevant range exists in the memslots. The use of __always_inline to avoid indirect call retpolines, as done by x86, may also benefit other architectures. Consolidating the lookups also fixes a wart in x86, where the legacy MMU and TDP MMU each do their own memslot walks. Lastly, future enhancements to the memslot implementation, e.g. to add an interval tree to track host address, will need to touch far less arch specific code. MIPS, PPC, and arm64 will be converted one at a time in future patches. Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-17KVM: Assert that notifier count is elevated in .change_pte()Sean Christopherson
In KVM's .change_pte() notification callback, replace the notifier sequence bump with a WARN_ON assertion that the notifier count is elevated. An elevated count provides stricter protections than bumping the sequence, and the sequence is guarnateed to be bumped before the count hits zero. When .change_pte() was added by commit 828502d30073 ("ksm: add mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary as .change_pte() would be invoked without any surrounding notifications. However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end"), all calls to .change_pte() are guaranteed to be surrounded by start() and end(), and so are guaranteed to run with an elevated notifier count. Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats the purpose of .change_pte(). Every arch's kvm_set_spte_hva() assumes .change_pte() is called when the relevant SPTE is present in KVM's MMU, as the original goal was to accelerate Kernel Samepage Merging (KSM) by updating KVM's SPTEs without requiring a VM-Exit (due to invalidating the SPTE). I.e. it means that .change_pte() is effectively dead code on _all_ architectures. x86 and MIPS are clearcut nops if the old SPTE is not-present, and that is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE, which again should be a nop due to the invalidation. arm64 is a bit murky, but it's also likely a nop because kvm_pgtable_stage2_map() is called without a cache pointer, which means it will map an entry if and only if an existing PTE was found. For now, take advantage of the bug to simplify future consolidation of KVMs's MMU notifier code. Doing so will not greatly complicate fixing .change_pte(), assuming it's even worth fixing. .change_pte() has been broken for 8+ years and no one has complained. Even if there are KSM+KVM users that care deeply about its performance, the benefits of avoiding VM-Exits via .change_pte() need to be reevaluated to justify the added complexity and testing burden. Ripping out .change_pte() entirely would be a lot easier. Signed-off-by: Sean Christopherson <> Signed-off-by: Paolo Bonzini <>
2021-04-17KVM: Explicitly use GFP_KERNEL_ACCOUNT for 'struct kvm_vcpu' allocationsSean Christopherson
Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious that that the allocations are accounted, to make it easier to audit KVM's allocations in the future, and to be consistent with other cache usage in KVM. When using SLAB/SLUB, this is a nop as the cache itself is created with SLAB_ACCOUNT. When using SLOB, there are caveats within caveats. SLOB doesn't honor SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU allocations now being accounted. But, even that depends on internal SLOB details as SLOB will only go to the page allocator when its cache is depleted. That just happens to be extremely likely for vCPUs because the size of kvm_vcpu is larger than the a page for almost all combinations of architecture and page size. Whether or not the SLOB behavior is by design is unknown; it's just as likely that no SLOB users care about accounding and so no one has bothered to implemented support in SLOB. Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup users, if any exist. Reviewed-by: Wanpeng Li <> Acked-by: David Rientjes <> Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-04-17KVM: Move arm64's MMU notifier trace events to generic codeSean Christopherson
Move arm64's MMU notifier trace events into common code in preparation for doing the hva->gfn lookup in common code. The alternative would be to trace the gfn instead of hva, but that's not obviously better and could also be done in common code. Tracing the notifiers is also quite handy for debug regardless of architecture. Remove a completely redundant tracepoint from PPC e500. Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-02-22KVM: x86/mmu: Consider the hva in mmu_notifier retryDavid Stevens
Track the range being invalidated by mmu_notifier and skip page fault retries if the fault address is not affected by the in-progress invalidation. Handle concurrent invalidations by finding the minimal range which includes all ranges being invalidated. Although the combined range may include unrelated addresses and cannot be shrunk as individual invalidation operations complete, it is unlikely the marginal gains of proper range tracking are worth the additional complexity. The primary benefit of this change is the reduction in the likelihood of extreme latency when handing a page fault due to another thread having been preempted while modifying host virtual addresses. Signed-off-by: David Stevens <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-02-09KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped()Sean Christopherson
Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving a so called "remapped" hva/pfn pair. In theory, the hva could resolve to a pfn in high memory on a 32-bit kernel. This bug was inadvertantly exposed by commit bd2fae8da794 ("KVM: do not assume PTE is writable after follow_pfn"), which added an error PFN value to the mix, causing gcc to comlain about overflowing the unsigned long. arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’: include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’ to ‘long unsigned int’ changes value from ‘9218868437227405314’ to ‘2’ [-Werror=overflow] 89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) | ^ virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’ Cc: Fixes: add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up") Signed-off-by: Sean Christopherson <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-02-09mm: provide a saner PTE walking API for modulesPaolo Bonzini
Currently, the follow_pfn function is exported for modules but follow_pte is not. However, follow_pfn is very easy to misuse, because it does not provide protections (so most of its callers assume the page is writable!) and because it returns after having already unlocked the page table lock. Provide instead a simplified version of follow_pte that does not have the pmdpp and range arguments. The older version survives as follow_invalidate_pte() for use by fs/dax.c. Reviewed-by: Jason Gunthorpe <> Signed-off-by: Paolo Bonzini <>
2021-02-04KVM: x86/mmu: Use an rwlock for the x86 MMUBen Gardon
Add a read / write lock to be used in place of the MMU spinlock on x86. The rwlock will enable the TDP MMU to handle page faults, and other operations in parallel in future commits. Reviewed-by: Peter Feiner <> Signed-off-by: Ben Gardon <> Message-Id: <> [Introduce virt/kvm/mmu_lock.h - Paolo] Signed-off-by: Paolo Bonzini <>
2021-02-04KVM: X86: use vzalloc() instead of vmalloc/memsetTian Tao
fixed the following warning: /virt/kvm/dirty_ring.c:70:20-27: WARNING: vzalloc should be used for ring -> dirty_gfns, instead of vmalloc/memset. Signed-off-by: Tian Tao <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2021-02-04KVM: do not assume PTE is writable after follow_pfnPaolo Bonzini
In order to convert an HVA to a PFN, KVM usually tries to use the get_user_pages family of functinso. This however is not possible for VM_IO vmas; in that case, KVM instead uses follow_pfn. In doing this however KVM loses the information on whether the PFN is writable. That is usually not a problem because the main use of VM_IO vmas with KVM is for BARs in PCI device assignment, however it is a bug. To fix it, use follow_pte and check pte_write while under the protection of the PTE lock. The information can be used to fail hva_to_pfn_remapped or passed back to the caller via *writable. Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05); however, even older version have the same issue, all the way back to commit 2e2e3738af33 ("KVM: Handle vma regions with no backing page", 2008-07-20), as they also did not check whether the PFN was writable. Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page") Reported-by: David Stevens <> Cc: Cc: Jann Horn <> Cc: Jason Gunthorpe <> Cc: Signed-off-by: Paolo Bonzini <>
2021-01-25Merge tag 'kvmarm-fixes-5.11-2' of ↵Paolo Bonzini
git:// into HEAD KVM/arm64 fixes for 5.11, take #2 - Don't allow tagged pointers to point to memslots - Filter out ARMv8.1+ PMU events on v8.0 hardware - Hide PMU registers from userspace when no PMU is configured - More PMU cleanups - Don't try to handle broken PSCI firmware - More sys_reg() to reg_to_encoding() conversions
2021-01-21KVM: Forbid the use of tagged userspace addresses for memslotsMarc Zyngier
The use of a tagged address could be pretty confusing for the whole memslot infrastructure as well as the MMU notifiers. Forbid it altogether, as it never quite worked the first place. Cc: Reported-by: Rick Edgecombe <> Reviewed-by: Catalin Marinas <> Signed-off-by: Marc Zyngier <>
2021-01-08Merge tag 'for-linus' of git:// Torvalds
Pull kvm fixes from Paolo Bonzini: "x86: - Fixes for the new scalable MMU - Fixes for migration of nested hypervisors on AMD - Fix for clang integrated assembler - Fix for left shift by 64 (UBSAN) - Small cleanups - Straggler SEV-ES patch ARM: - VM init cleanups - PSCI relay cleanups - Kill CONFIG_KVM_ARM_PMU - Fixup __init annotations - Fixup reg_to_encoding() - Fix spurious PMCR_EL0 access Misc: - selftests cleanups" * tag 'for-linus' of git:// (38 commits) KVM: x86: __kvm_vcpu_halt can be static KVM: SVM: Add support for booting APs in an SEV-ES guest KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit KVM: nSVM: mark vmcb as dirty when forcingly leaving the guest mode KVM: nSVM: correctly restore nested_run_pending on migration KVM: x86/mmu: Clarify TDP MMU page list invariants KVM: x86/mmu: Ensure TDP MMU roots are freed after yield kvm: check tlbs_dirty directly KVM: x86: change in pv_eoi_get_pending() to make code more readable MAINTAINERS: Really update email address for Sean Christopherson KVM: x86: fix shift out of bounds reported by UBSAN KVM: selftests: Implement perf_test_util more conventionally KVM: selftests: Use vm_create_with_vcpus in create_vm KVM: selftests: Factor out guest mode code KVM/SVM: Remove leftover __svm_vcpu_run prototype from svm.c KVM: SVM: Add register operand to vmsave call in sev_es_vcpu_load KVM: x86/mmu: Optimize not-present/MMIO SPTE check in get_mmio_spte() KVM: x86/mmu: Use raw level to index into MMIO walks' sptes array KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte() ...
2021-01-07kvm: check tlbs_dirty directlyLai Jiangshan
In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as: need_tlb_flush |= kvm->tlbs_dirty; with need_tlb_flush's type being int and tlbs_dirty's type being long. It means that tlbs_dirty is always used as int and the higher 32 bits is useless. We need to check tlbs_dirty in a correct way and this change checks it directly without propagating it to need_tlb_flush. Note: it's _extremely_ unlikely this neglecting of higher 32 bits can cause problems in practice. It would require encountering tlbs_dirty on a 4 billion count boundary, and KVM would need to be using shadow paging or be running a nested guest. Cc: Fixes: a4ee1ca4a36e ("KVM: MMU: delay flush all tlbs on sync_page path") Signed-off-by: Lai Jiangshan <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-12-20Merge tag 'for-linus' of git:// Torvalds
Pull KVM updates from Paolo Bonzini: "Much x86 work was pushed out to 5.12, but ARM more than made up for it. ARM: - PSCI relay at EL2 when "protected KVM" is enabled - New exception injection code - Simplification of AArch32 system register handling - Fix PMU accesses when no PMU is enabled - Expose CSV3 on non-Meltdown hosts - Cache hierarchy discovery fixes - PV steal-time cleanups - Allow function pointers at EL2 - Various host EL2 entry cleanups - Simplification of the EL2 vector allocation s390: - memcg accouting for s390 specific parts of kvm and gmap - selftest for diag318 - new kvm_stat for when async_pf falls back to sync x86: - Tracepoints for the new pagetable code from 5.10 - Catch VFIO and KVM irqfd events before userspace - Reporting dirty pages to userspace with a ring buffer - SEV-ES host support - Nested VMX support for wait-for-SIPI activity state - New feature flag (AVX512 FP16) - New system ioctl to report Hyper-V-compatible paravirtualization features Generic: - Selftest improvements" * tag 'for-linus' of git:// (171 commits) KVM: SVM: fix 32-bit compilation KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting KVM: SVM: Provide support to launch and run an SEV-ES guest KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests KVM: SVM: Provide support for SEV-ES vCPU loading KVM: SVM: Provide support for SEV-ES vCPU creation/loading KVM: SVM: Update ASID allocation to support SEV-ES guests KVM: SVM: Set the encryption mask for the SVM host save area KVM: SVM: Add NMI support for an SEV-ES guest KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest KVM: SVM: Do not report support for SMM for an SEV-ES guest KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES KVM: SVM: Add support for CR8 write traps for an SEV-ES guest KVM: SVM: Add support for CR4 write traps for an SEV-ES guest KVM: SVM: Add support for CR0 write traps for an SEV-ES guest KVM: SVM: Add support for EFER write traps for an SEV-ES guest KVM: SVM: Support string IO operations for an SEV-ES guest KVM: SVM: Support MMIO for an SEV-ES guest KVM: SVM: Create trace events for VMGEXIT MSR protocol processing KVM: SVM: Create trace events for VMGEXIT processing ...
2020-12-19mm, kvm: account kvm_vcpu_mmap to kmemcgShakeel Butt
A VCPU of a VM can allocate couple of pages which can be mmap'ed by the user space application. At the moment this memory is not charged to the memcg of the VMM. On a large machine running large number of VMs or small number of VMs having large number of VCPUs, this unaccounted memory can be very significant. So, charge this memory to the memcg of the VMM. Please note that lifetime of these allocations corresponds to the lifetime of the VMM. Link: Signed-off-by: Shakeel Butt <> Acked-by: Roman Gushchin <> Acked-by: Paolo Bonzini <> Cc: Johannes Weiner <> Cc: Michal Hocko <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2020-11-15KVM: Don't allocate dirty bitmap if dirty ring is enabledPeter Xu
Because kvm dirty rings and kvm dirty log is used in an exclusive way, Let's avoid creating the dirty_bitmap when kvm dirty ring is enabled. At the meantime, since the dirty_bitmap will be conditionally created now, we can't use it as a sign of "whether this memory slot enabled dirty tracking". Change users like that to check against the kvm memory slot flags. Note that there still can be chances where the kvm memory slot got its dirty_bitmap allocated, _if_ the memory slots are created before enabling of the dirty rings and at the same time with the dirty tracking capability enabled, they'll still with the dirty_bitmap. However it should not hurt much (e.g., the bitmaps will always be freed if they are there), and the real users normally won't trigger this because dirty bit tracking flag should in most cases only be applied to kvm slots only before migration starts, that should be far latter than kvm initializes (VM starts). Signed-off-by: Peter Xu <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-11-15KVM: Make dirty ring exclusive to dirty bitmap logPeter Xu
There's no good reason to use both the dirty bitmap logging and the new dirty ring buffer to track dirty bits. We should be able to even support both of them at the same time, but it could complicate things which could actually help little. Let's simply make it the rule before we enable dirty ring on any arch, that we don't allow these two interfaces to be used together. The big world switch would be KVM_CAP_DIRTY_LOG_RING capability enablement. That's where we'll switch from the default dirty logging way to the dirty ring way. As long as kvm->dirty_ring_size is setup correctly, we'll once and for all switch to the dirty ring buffer mode for the current virtual machine. Signed-off-by: Peter Xu <> Message-Id: <> [Change errno from EINVAL to ENXIO. - Paolo] Signed-off-by: Paolo Bonzini <>
2020-11-15KVM: X86: Implement ring-based dirty memory trackingPeter Xu
This patch is heavily based on previous work from Lei Cao <> and Paolo Bonzini <>. [1] KVM currently uses large bitmaps to track dirty memory. These bitmaps are copied to userspace when userspace queries KVM for its dirty page information. The use of bitmaps is mostly sufficient for live migration, as large parts of memory are be dirtied from one log-dirty pass to another. However, in a checkpointing system, the number of dirty pages is small and in fact it is often bounded---the VM is paused when it has dirtied a pre-defined number of pages. Traversing a large, sparsely populated bitmap to find set bits is time-consuming, as is copying the bitmap to user-space. A similar issue will be there for live migration when the guest memory is huge while the page dirty procedure is trivial. In that case for each dirty sync we need to pull the whole dirty bitmap to userspace and analyse every bit even if it's mostly zeros. The preferred data structure for above scenarios is a dense list of guest frame numbers (GFN). This patch series stores the dirty list in kernel memory that can be memory mapped into userspace to allow speedy harvesting. This patch enables dirty ring for X86 only. However it should be easily extended to other archs as well. [1] Signed-off-by: Lei Cao <> Signed-off-by: Paolo Bonzini <> Signed-off-by: Peter Xu <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-11-15KVM: Pass in kvm pointer into mark_page_dirty_in_slot()Peter Xu
The context will be needed to implement the kvm dirty ring. Signed-off-by: Peter Xu <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-11-15KVM: remove kvm_clear_guest_pagePaolo Bonzini
kvm_clear_guest_page is not used anymore after "KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from kvm_clear_guest. We can just inline it in its sole user. Signed-off-by: Paolo Bonzini <>
2020-11-15kvm/eventfd: Drain events from eventfd in irqfd_wakeup()David Woodhouse
Don't allow the events to accumulate in the eventfd counter, drain them as they are handled. Signed-off-by: David Woodhouse <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-11-15kvm/eventfd: Use priority waitqueue to catch events before userspaceDavid Woodhouse
As far as I can tell, when we use posted interrupts we silently cut off the events from userspace, if it's listening on the same eventfd that feeds the irqfd. I like that behaviour. Let's do it all the time, even without posted interrupts. It makes it much easier to handle IRQ remapping invalidation without having to constantly add/remove the fd from the userspace poll set. We can just leave userspace polling on it, and the bypass will... well... bypass it. Signed-off-by: David Woodhouse <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-10-23kvm: x86/mmu: Support dirty logging for the TDP MMUBen Gardon
Dirty logging is a key feature of the KVM MMU and must be supported by the TDP MMU. Add support for both the write protection and PML dirty logging modes. Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell machine. This series introduced no new failures. This series can be viewed in Gerrit at: Signed-off-by: Ben Gardon <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-10-21KVM: Cache as_id in kvm_memory_slotPeter Xu
Cache the address space ID just like the slot ID. It will be used in order to fill in the dirty ring entries. Suggested-by: Paolo Bonzini <> Suggested-by: Sean Christopherson <> Reviewed-by: Sean Christopherson <> Signed-off-by: Peter Xu <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-09-28KVM: use struct_size() and flex_array_size() helpers in ↵Rustam Kovhaev
kvm_io_bus_unregister_dev() Make use of the struct_size() helper to avoid any potential type mistakes and protect against potential integer overflows Make use of the flex_array_size() helper to calculate the size of a flexible array member within an enclosing structure Suggested-by: Gustavo A. R. Silva <> Signed-off-by: Rustam Kovhaev <> Message-Id: <> Reviewed-by: Gustavo A. R. Silva <> Signed-off-by: Paolo Bonzini <>
2020-09-28kvm/eventfd: move wildcard calculation outside loopYi Li
There is no need to calculate wildcard in each iteration since wildcard is not changed. Signed-off-by: Yi Li <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-09-11KVM: fix memory leak in kvm_io_bus_unregister_dev()Rustam Kovhaev
when kmalloc() fails in kvm_io_bus_unregister_dev(), before removing the bus, we should iterate over all other devices linked to it and call kvm_iodevice_destructor() for them Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail") Cc: Reported-and-tested-by: Link: Signed-off-by: Rustam Kovhaev <> Reviewed-by: Vitaly Kuznetsov <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-09-11Merge tag 'kvmarm-fixes-5.9-1' of ↵Paolo Bonzini
git:// into HEAD KVM/arm64 fixes for Linux 5.9, take #1 - Multiple stolen time fixes, with a new capability to match x86 - Fix for hugetlbfs mappings when PUD and PMD are the same level - Fix for hugetlbfs mappings when PTE mappings are enforced (dirty logging, for example) - Fix tracing output of 64bit values
2020-08-21KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()Will Deacon
The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <> Cc: Marc Zyngier <> Cc: Suzuki K Poulose <> Cc: James Morse <> Signed-off-by: Will Deacon <> Message-Id: <> Signed-off-by: Paolo Bonzini <>
2020-08-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction, mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util, memory-hotplug, cleanups, uaccess, migration, gup, pagemap), - various other subsystems (alpha, misc, sparse, bitmap, lib, bitops, checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump, exec, kdump, rapidio, panic, kcov, kgdb, ipc). * emailed patches from Andrew Morton <>: (164 commits) mm/gup: remove task_struct pointer for all gup code mm: clean up the last pieces of page fault accountings mm/xtensa: use general page fault accounting mm/x86: use general page fault accounting mm/sparc64: use general page fault accounting mm/sparc32: use general page fault accounting mm/sh: use general page fault accounting mm/s390: use general page fault accounting mm/riscv: use general page fault accounting mm/powerpc: use general page fault accounting mm/parisc: use general page fault accounting mm/openrisc: use general page fault accounting mm/nios2: use general page fault accounting mm/nds32: use general page fault accounting mm/mips: use general page fault accounting mm/microblaze: use general page fault accounting mm/m68k: use general page fault accounting mm/ia64: use general page fault accounting mm/hexagon: use general page fault accounting mm/csky: use general page fault accounting ...
2020-08-12mm/gup: remove task_struct pointer for all gup codePeter Xu
After the cleanup of page fault accounting, gup does not need to pass task_struct around any more. Remove that parameter in the whole gup stack. Signed-off-by: Peter Xu <> Signed-off-by: Andrew Morton <> Reviewed-by: John Hubbard <> Link: Signed-off-by: Linus Torvalds <>
2020-08-11Merge tag 'for_linus' of git:// Torvalds
Pull virtio updates from Michael Tsirkin: - IRQ bypass support for vdpa and IFC - MLX5 vdpa driver - Endianness fixes for virtio drivers - Misc other fixes * tag 'for_linus' of git:// (71 commits) vdpa/mlx5: fix up endian-ness for mtu vdpa: Fix pointer math bug in vdpasim_get_config() vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config() vdpa/mlx5: fix memory allocation failure checks vdpa/mlx5: Fix uninitialised variable in core/mr.c vdpa_sim: init iommu lock virtio_config: fix up warnings on parisc vdpa/mlx5: Add VDPA driver for supported mlx5 devices vdpa/mlx5: Add shared memory registration code vdpa/mlx5: Add support library for mlx5 VDPA implementation vdpa/mlx5: Add hardware descriptive header file vdpa: Modify get_vq_state() to return error code net/vdpa: Use struct for set/get vq state vdpa: remove hard coded virtq num vdpasim: support batch updating vhost-vdpa: support IOTLB batching hints vhost-vdpa: support get/set backend features vhost: generialize backend features setting/getting vhost-vdpa: refine ioctl pre-processing vDPA: dont change vq irq after DRIVER_OK ...
2020-08-10Merge tag 'locking-urgent-2020-08-10' of ↵Linus Torvalds
git:// Pull locking updates from Thomas Gleixner: "A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various situations caused by the lockdep additions to seqcount to validate that the write side critical sections are non-preemptible. - The seqcount associated lock debug addons which were blocked by the above fallout. seqcount writers contrary to seqlock writers must be externally serialized, which usually happens via locking - except for strict per CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot validate that the lock is held. This new debug mechanism adds the concept of associated locks. sequence count has now lock type variants and corresponding initializers which take a pointer to the associated lock used for writer serialization. If lockdep is enabled the pointer is stored and write_seqcount_begin() has a lockdep assertion to validate that the lock is held. Aside of the type and the initializer no other code changes are required at the seqcount usage sites. The rest of the seqcount API is unchanged and determines the type at compile time with the help of _Generic which is possible now that the minimal GCC version has been moved up. Adding this lockdep coverage unearthed a handful of seqcount bugs which have been addressed already independent of this. While generally useful this comes with a Trojan Horse twist: On RT kernels the write side critical section can become preemtible if the writers are serialized by an associated lock, which leads to the well known reader preempts writer livelock. RT prevents this by storing the associated lock pointer independent of lockdep in the seqcount and changing the reader side to block on the lock when a reader detects that a writer is in the write side critical section. - Conversion of seqcount usage sites to associated types and initializers" * tag 'locking-urgent-2020-08-10' of git:// (25 commits) locking/seqlock, headers: Untangle the spaghetti monster locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header x86/headers: Remove APIC headers from <asm/smp.h> seqcount: More consistent seqprop names seqcount: Compress SEQCNT_LOCKNAME_ZERO() seqlock: Fold seqcount_LOCKNAME_init() definition seqlock: Fold seqcount_LOCKNAME_t definition seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g hrtimer: Use sequence counter with associated raw spinlock kvm/eventfd: Use sequence counter with associated spinlock userfaultfd: Use sequence counter with associated spinlock NFSv4: Use sequence counter with associated spinlock iocost: Use sequence counter with associated spinlock raid5: Use sequence counter with associated spinlock vfs: Use sequence counter with associated spinlock timekeeping: Use sequence counter with associated raw spinlock xfrm: policy: Use sequence counters with associated lock netfilter: nft_set_rbtree: Use sequence counter with associated rwlock netfilter: conntrack: Use sequence counter with associated spinlock sched: tasks: Use sequence counter with associated spinlock ...