path: root/arch/x86/kvm
Age  Commit message  (Author)
6 days  Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 ITS mitigation from Dave Hansen: "Mitigate Indirect Target Selection (ITS) issue. I'd describe this one as a good old CPU bug where the behavior is _obviously_ wrong, but since it just results in bad predictions it wasn't wrong enough to notice. Well, the researchers noticed and also realized that this bug undermined a bunch of existing indirect branch mitigations. Thus the unusually wide impact on this one. Details: ITS is a bug in some Intel CPUs that affects indirect branches including RETs in the first half of a cacheline. Due to ITS such branches may get wrongly predicted to the target of a (direct or indirect) branch that is located in the second half of a cacheline. Researchers at VUSec found this behavior and reported it to Intel. Affected processors: - Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet Lake, Ice Lake, Tiger Lake and Rocket Lake. Scope of impact: - Guest/host isolation: When eIBRS is used for guest/host isolation, the indirect branches in the VMM may still be predicted with targets corresponding to direct branches in the guest. - Intra-mode using cBPF: cBPF can be used to poison the branch history to exploit ITS. Realigning the indirect branches and RETs mitigates this attack vector. - User/kernel: With eIBRS enabled user/kernel isolation is *not* impacted by ITS. - Indirect Branch Prediction Barrier (IBPB): Due to this bug indirect branches may be predicted with targets corresponding to direct branches which were executed prior to IBPB. This will be fixed in the microcode. Mitigation: As indirect branches in the first half of a cacheline are affected, the mitigation is to replace those indirect branches with a call to a thunk that is aligned to the second half of the cacheline. RETs that take prediction from the RSB are not affected, but they may be affected by an RSB-underflow condition. So, RETs in the first half of a cacheline are also patched to a return thunk that executes the RET aligned to the second half of the cacheline" * tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftest/x86/bugs: Add selftests for ITS x86/its: FineIBT-paranoid vs ITS x86/its: Use dynamic thunks for indirect branches x86/ibt: Keep IBT disabled during alternative patching mm/execmem: Unify early execmem_cache behaviour x86/its: Align RETs in BHB clear sequence to avoid thunking x86/its: Add support for RSB stuffing mitigation x86/its: Add "vmexit" option to skip mitigation on some CPUs x86/its: Enable Indirect Target Selection mitigation x86/its: Add support for ITS-safe return thunk x86/its: Add support for ITS-safe indirect thunk x86/its: Enumerate Indirect Target Selection (ITS) bug Documentation: x86/bugs/its: Add ITS documentation
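The "first half of a cacheline" criterion above is easy to state in code. A minimal, self-contained sketch (the 64-byte line size is the usual x86 value; the helper name is illustrative and not taken from the kernel patches):

    #include <stdbool.h>
    #include <stdint.h>

    /* An indirect branch or RET is ITS-affected if its address falls in the
     * lower 32 bytes of its 64-byte cacheline; the mitigation retargets such
     * sites to thunks placed in the upper half of a cacheline.
     */
    static bool its_affected_address(uint64_t addr)
    {
        return (addr % 64) < 32;
    }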
9 days  x86/its: Enumerate Indirect Target Selection (ITS) bug  (Pawan Gupta)
The ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches in the first half of a cache line to get predicted to the target of a branch located in the second half of the cache line. Set X86_BUG_ITS on affected CPUs. Mitigation to follow in later commits. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
10 days  KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitions  (Sean Christopherson)
Set the magic BP_SPEC_REDUCE bit to mitigate SRSO when running VMs if and only if KVM has at least one active VM. Leaving the bit set at all times unfortunately degrades performance by a wee bit more than expected. Use a dedicated spinlock and counter instead of hooking virtualization enablement, as changing the behavior of kvm.enable_virt_at_load based on SRSO_BP_SPEC_REDUCE is painful, and has its own drawbacks, e.g. could result in performance issues for flows that are sensitive to VM creation latency. Defer setting BP_SPEC_REDUCE until VMRUN is imminent to avoid impacting performance on CPUs that aren't running VMs, e.g. if a setup is using housekeeping CPUs. Setting BP_SPEC_REDUCE in task context, i.e. without blasting IPIs to all CPUs, also helps avoid serializing 1<=>N transitions without incurring a gross amount of complexity (see the Link for details on how ugly coordinating via IPIs gets). Link: https://lore.kernel.org/all/aBOnzNCngyS_pQIW@google.com Fixes: 8442df2b49ed ("x86/bugs: KVM: Add support for SRSO_MSR_FIX") Reported-by: Michael Larabel <Michael@michaellarabel.com> Closes: https://www.phoronix.com/review/linux-615-amd-regression Cc: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250505180300.973137-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
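The 0 <=> 1 bookkeeping described above reduces to a counter guarded by a dedicated lock; a simplified sketch (variable and helper names are illustrative, not the actual kvm-amd symbols):

    static DEFINE_SPINLOCK(srso_vm_count_lock);
    static unsigned int srso_nr_vms;

    static void srso_vm_created(void)
    {
        spin_lock(&srso_vm_count_lock);
        if (srso_nr_vms++ == 0)         /* 0 => 1: first VM, arm BP_SPEC_REDUCE */
            arm_bp_spec_reduce();       /* illustrative helper */
        spin_unlock(&srso_vm_count_lock);
    }

    static void srso_vm_destroyed(void)
    {
        spin_lock(&srso_vm_count_lock);
        if (--srso_nr_vms == 0)         /* 1 => 0: last VM, clear the bit again */
            clear_bp_spec_reduce();     /* illustrative helper */
        spin_unlock(&srso_vm_count_lock);
    }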
2025-05-02  KVM: x86/mmu: Prevent installing hugepages when mem attributes are changing  (Sean Christopherson)
When changing memory attributes on a subset of a potential hugepage, add the hugepage to the invalidation range tracking to prevent installing a hugepage until the attributes are fully updated. Like the actual hugepage tracking updates in kvm_arch_post_set_memory_attributes(), process only the head and tail pages, as any potential hugepages that are entirely covered by the range will already be tracked. Note, only hugepage chunks whose current attributes are NOT mixed need to be added to the invalidation set, as mixed attributes already prevent installing a hugepage, and it's perfectly safe to install a smaller mapping for a gfn whose attributes aren't changing. Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory") Cc: stable@vger.kernel.org Reported-by: Michael Roth <michael.roth@amd.com> Tested-by: Michael Roth <michael.roth@amd.com> Link: https://lore.kernel.org/r/20250430220954.522672-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02  KVM: SVM: Update dump_ghcb() to use the GHCB snapshot fields  (Tom Lendacky)
Commit 4e15a0ddc3ff ("KVM: SEV: snapshot the GHCB before accessing it") updated the SEV code to take a snapshot of the GHCB before using it. But the dump_ghcb() function wasn't updated to use the snapshot locations. This results in incorrect output from dump_ghcb() for the "is_valid" and "valid_bitmap" fields. Update dump_ghcb() to use the proper locations. Fixes: 4e15a0ddc3ff ("KVM: SEV: snapshot the GHCB before accessing it") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Link: https://lore.kernel.org/r/8f03878443681496008b1b37b7c4bf77a342b459.1745866531.git.thomas.lendacky@amd.com [sean: add comment and snapshot qualifier] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28  KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()  (Yan Zhao)
Check request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to free obsolete roots in kvm_mmu_reload() to prevent kvm_mmu_reload() from seeing a stale obsolete root. Since kvm_mmu_reload() can be called outside the vcpu_enter_guest() path (e.g., kvm_arch_vcpu_pre_fault_memory()), it may be invoked after a root has been marked obsolete and before vcpu_enter_guest() is invoked to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS and set root.hpa to invalid. This causes kvm_mmu_reload() to fail to load a new root, which can lead to kvm_arch_vcpu_pre_fault_memory() being stuck in the while loop in kvm_tdp_map_page() since RET_PF_RETRY is always returned due to is_page_fault_stale(). Keep the existing check of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS in vcpu_enter_guest() since the cost of kvm_check_request() is negligible, especially a check that's guarded by kvm_request_pending(). Export symbol of kvm_mmu_free_obsolete_roots() as kvm_mmu_reload() is inline and may be called outside of kvm.ko. Fixes: 6e01b7601dfe ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013333.5817-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
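The shape of the fix is a request check at the top of kvm_mmu_reload(); a rough sketch (condensed, not the literal patch):

    static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
    {
        /* Free obsolete roots first so a stale (not-yet-invalidated) root.hpa
         * isn't mistaken for a usable root below.
         */
        if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
            kvm_mmu_free_obsolete_roots(vcpu);

        if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
            return 0;

        return kvm_mmu_load(vcpu);
    }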
2025-04-24  KVM: x86: Check that the high 32bits are clear in kvm_arch_vcpu_ioctl_run()  (Dan Carpenter)
The "kvm_run->kvm_valid_regs" and "kvm_run->kvm_dirty_regs" variables are u64 type. We are only using the lowest 3 bits but we want to ensure that the users are not passing invalid bits so that we can use the remaining bits in the future. However "sync_valid_fields" and kvm_sync_valid_fields() are u32 type so the check only ensures that the lower 32 bits are clear. Fix this by changing the types to u64. Fixes: 74c1807f6c4f ("KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/ec25aad1-113e-4c6e-8941-43d432251398@stanley.mountain Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24  KVM: SVM: Forcibly leave SMM mode on SHUTDOWN interception  (Mikhail Lobanov)
Previously, commit ed129ec9057f ("KVM: x86: forcibly leave nested mode on vCPU reset") addressed an issue where a triple fault occurring in nested mode could lead to use-after-free scenarios. However, the commit did not handle the analogous situation for System Management Mode (SMM). This omission results in triggering a WARN when KVM forces a vCPU INIT after SHUTDOWN interception while the vCPU is in SMM. This situation was reproduced using Syzkaller by: 1) Creating a KVM VM and vCPU 2) Sending a KVM_SMI ioctl to explicitly enter SMM 3) Executing invalid instructions causing consecutive exceptions and eventually a triple fault The issue manifests as follows: WARNING: CPU: 0 PID: 25506 at arch/x86/kvm/x86.c:12112 kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112 Modules linked in: CPU: 0 PID: 25506 Comm: syz-executor.0 Not tainted 6.1.130-syzkaller-00157-g164fe5dde9b6 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112 Call Trace: <TASK> shutdown_interception+0x66/0xb0 arch/x86/kvm/svm/svm.c:2136 svm_invoke_exit_handler+0x110/0x530 arch/x86/kvm/svm/svm.c:3395 svm_handle_exit+0x424/0x920 arch/x86/kvm/svm/svm.c:3457 vcpu_enter_guest arch/x86/kvm/x86.c:10959 [inline] vcpu_run+0x2c43/0x5a90 arch/x86/kvm/x86.c:11062 kvm_arch_vcpu_ioctl_run+0x50f/0x1cf0 arch/x86/kvm/x86.c:11283 kvm_vcpu_ioctl+0x570/0xf00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4122 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Architecturally, INIT is blocked when the CPU is in SMM, hence KVM's WARN() in kvm_vcpu_reset() to guard against KVM bugs, e.g. to detect improper emulation of INIT. SHUTDOWN on SVM is a weird edge case where KVM needs to do _something_ sane with the VMCB, since it's technically undefined, and INIT is the least awful choice given KVM's ABI. So, double down on stuffing INIT on SHUTDOWN, and force the vCPU out of SMM to avoid any weirdness (and the WARN). Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: ed129ec9057f ("KVM: x86: forcibly leave nested mode on vCPU reset") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru> Link: https://lore.kernel.org/r/20250414171207.155121-1-m.lobanov@rosa.ru [sean: massage changelog, make it clear this isn't architectural behavior] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24  KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILING  (Adrian Hunter)
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
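The guard amounts to checking the flag before touching the register cache; a sketch (the wrapper name is illustrative):

    static unsigned long profiled_rip(struct kvm_vcpu *vcpu)
    {
        /* RIP isn't readable for guests with protected state (e.g. TDX);
         * reading it would trip the WARN_ON_ONCE in vt_cache_reg().
         */
        if (vcpu->arch.guest_state_protected)
            return 0;

        return kvm_rip_read(vcpu);
    }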
2025-04-24  KVM: x86: Do not use kvm_rip_read() unconditionally in KVM tracepoints  (Adrian Hunter)
Not all VMs allow access to RIP. Check guest_state_protected before calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for TDX VMs. Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250415104821.247234-2-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24  KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added  (Sean Christopherson)
Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM attempts to track an AMD IRTE entry without metadata. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24  KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer  (Sean Christopherson)
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure irqfd->producer isn't modified while kvm_irq_routing_update() is running. The only lock held when a producer is added/removed is irqbypass's mutex. Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
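A stripped-down sketch of the producer-add side with the lock taken (error handling and the IRTE/routing update are elided; not the literal patch):

    int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
                                         struct irq_bypass_producer *prod)
    {
        struct kvm_kernel_irqfd *irqfd =
            container_of(cons, struct kvm_kernel_irqfd, consumer);
        struct kvm *kvm = irqfd->kvm;

        spin_lock_irq(&kvm->irqfds.lock);
        /* With irqfds.lock held, kvm_irq_routing_update() can't observe a
         * half-updated irqfd->producer.
         */
        irqfd->producer = prod;
        spin_unlock_irq(&kvm->irqfds.lock);

        return 0;
    }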
2025-04-24  KVM: x86: Explicitly treat routing entry type changes as changes  (Sean Christopherson)
Explicitly treat type differences as GSI routing changes, as comparing MSI data between two entries could get a false negative, e.g. if userspace changed the type but left the type-specific data as-is. Fixes: 515a0c79e796 ("kvm: irqfd: avoid update unmodified entries of the routing") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24  KVM: x86: Reset IRTE to host control if *new* route isn't postable  (Sean Christopherson)
Restore an IRTE back to host control (remapped or posted MSI mode) if the *new* GSI route prevents posting the IRQ directly to a vCPU, regardless of the GSI routing type. Updating the IRTE if and only if the new GSI is an MSI results in KVM leaving an IRTE posting to a vCPU. The dangling IRTE can result in interrupts being incorrectly delivered to the guest, and in the worst case scenario can result in use-after-free, e.g. if the VM is torn down, but the underlying host IRQ isn't freed. Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts") Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24  KVM: SVM: Allocate IR data using atomic allocation  (Sean Christopherson)
Allocate SVM's interrupt remapping metadata using GFP_ATOMIC as svm_ir_list_add() is called with IRQs disabled and irqfds.lock held when kvm_irq_routing_update() reacts to GSI routing changes. Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
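Because the path runs with IRQs off and a spinlock held, the allocation must not sleep; the essence of the change (struct name illustrative):

    struct amd_ir_metadata *ir;

    /* GFP_KERNEL may sleep, which is forbidden with IRQs disabled and
     * irqfds.lock held; GFP_ATOMIC never sleeps (but can fail under pressure).
     */
    ir = kzalloc(sizeof(*ir), GFP_ATOMIC);
    if (!ir)
        return -ENOMEM;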
2025-04-24  KVM: SVM: Don't update IRTEs if APICv/AVIC is disabled  (Sean Christopherson)
Skip IRTE updates if AVIC is disabled/unsupported, as forcing the IRTE into remapped mode (kvm_vcpu_apicv_active() will never be true) is unnecessary and wasteful. The IOMMU driver is responsible for putting IRTEs into remapped mode when an IRQ is allocated by a device, long before that device is assigned to a VM. I.e. the kernel as a whole has major issues if the IRTE isn't already in remapped mode. Opportunistically use kvm_arch_has_irq_bypass() to query for APICv/AVIC, so that all checks in KVM x86 incorporate the same information. Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250401161804.842968-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24  KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline  (Paolo Bonzini)
kvm_arch_has_irq_bypass() is a small function and even though it does not appear in any *really* hot paths, it's also not entirely rare. Make it inline---it also works out nicely in preparation for using it in kvm-intel.ko and kvm-amd.ko, since the function is not currently exported. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-08  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm fixes from Paolo Bonzini: "ARM: - Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk stage-1 page tables) to align with the architecture. This avoids possibly taking an SEA at EL2 on the page table walk or using an architecturally UNKNOWN fault IPA - Use acquire/release semantics in the KVM FF-A proxy to avoid reading a stale value for the FF-A version - Fix KVM guest driver to match PV CPUID hypercall ABI - Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM selftests, which is the only memory type for which atomic instructions are architecturally guaranteed to work s390: - Don't use %pK for debug printing and tracepoints x86: - Use a separate subclass when acquiring KVM's per-CPU posted interrupts wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list of vCPUs to wake, to work around a false positive deadlock. The schedule out code runs with a scheduler lock that the wakeup handler takes in the opposite order; but it does so with IRQs disabled and cannot run concurrently with a wakeup - Explicitly zero-initialize on-stack CPUID unions - Allow building irqbypass.ko as a module when kvm.ko is a module - Wrap relatively expensive sanity check with KVM_PROVE_MMU - Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses selftests: - Add more scenarios to the MONITOR/MWAIT test - Add option to rseq test to override /dev/cpu_dma_latency - Bring list of exit reasons up to date - Cleanup Makefile to list once tests that are valid on all architectures Other: - Documentation fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits) KVM: arm64: Use acquire/release to communicate FF-A version negotiation KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable KVM: arm64: selftests: Introduce and use hardware-definition macros KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list KVM: x86: Explicitly zero-initialize on-stack CPUID unions KVM: Allow building irqbypass.ko as as module when kvm.ko is a module KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses Documentation: kvm: remove KVM_CAP_MIPS_TE Documentation: kvm: organize capabilities in the right section Documentation: kvm: fix some definition lists Documentation: kvm: drop "Capability" heading from capabilities Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE selftests: kvm: list once tests that are valid on all architectures selftests: kvm: bring list of exit reasons up to date selftests: kvm: revamp MONITOR/MWAIT tests KVM: arm64: Don't translate FAR if invalid/unsafe ...
2025-04-05  treewide: Switch/rename to timer_delete[_sync]()  (Thomas Gleixner)
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-04  Merge branch 'kvm-pi-fix-lockdep' into HEAD  (Paolo Bonzini)
2025-04-04  KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive  (Yan Zhao)
Use a separate subclass when acquiring KVM's per-CPU posted interrupts wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list of vCPUs to wake, to work around a false positive deadlock. Chain exists of: &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); lock(&rq->__lock); lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); lock(&p->pi_lock); *** DEADLOCK *** In the wakeup handler, the callchain is *always*: sysvec_kvm_posted_intr_wakeup_ipi() | --> pi_wakeup_handler() | --> kvm_vcpu_wake_up() | --> try_to_wake_up(), and the lock order is: &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock. For the schedule out path, the callchain is always (for all intents and purposes; if the kernel is preemptible, kvm_sched_out() can be called from something other than schedule(), but the beginning of the callchain will be the same point in vcpu_block()): vcpu_block() | --> schedule() | --> kvm_sched_out() | --> vmx_vcpu_put() | --> vmx_vcpu_pi_put() | --> pi_enable_wakeup_handler() and the lock order is: &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for wakeup, and complains about the A=>C versus C=>A inversion. In practice, deadlock can't occur between schedule out and the wakeup handler as they are mutually exclusive. The entirety of the schedule out code that runs with the problematic scheduler locks held does so with IRQs disabled, i.e. can't run concurrently with the wakeup handler. Use a subclass instead of disabling lockdep entirely, and tell lockdep that both subclasses are being acquired when loading a vCPU, as the sched_out and sched_in paths are NOT mutually exclusive, e.g. CPU 0 CPU 1 --------------- --------------- vCPU0 sched_out vCPU1 sched_in vCPU1 sched_out vCPU 0 sched_in where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup list+lock. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-ID: <20250401154727.835231-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
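The subclass mechanism boils down to telling lockdep that the schedule-out acquisition is a distinct lock class; a rough sketch (the enum names are illustrative):

    enum {
        PI_LOCK_DEFAULT,    /* wakeup handler, vcpu_load, etc. */
        PI_LOCK_SCHED_OUT,  /* pi_enable_wakeup_handler() only */
    };

    /* Schedule-out path: IRQs are off, so this can't race the wakeup handler;
     * a separate lockdep subclass suppresses the false-positive report.
     */
    raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
                         PI_LOCK_SCHED_OUT);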
2025-04-04  KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list  (Sean Christopherson)
Assert that IRQs are already disabled when putting a vCPU on a CPU's PI wakeup list, as opposed to saving/disabling+restoring IRQs. KVM relies on IRQs being disabled until the vCPU task is fully scheduled out, i.e. until the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue), as attempting to wake the task while it's being scheduled out could lead to deadlock. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250401154727.835231-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
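In code, this is swapping a save/restore pair for an assertion that documents the invariant; a sketch with illustrative names:

    /* Before: defensively mask IRQs around the wakeup-list update. */
    local_irq_save(flags);
    list_add_tail(&vmx->pi_wakeup_list, &per_cpu(wakeup_vcpus_on_cpu, cpu));
    local_irq_restore(flags);

    /* After: rely on (and assert) the scheduler's guarantee instead. */
    lockdep_assert_irqs_disabled();
    list_add_tail(&vmx->pi_wakeup_list, &per_cpu(wakeup_vcpus_on_cpu, cpu));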
2025-04-04  KVM: x86: Explicitly zero-initialize on-stack CPUID unions  (Sean Christopherson)
Explicitly zero/empty-initialize the unions used for PMU related CPUID entries, instead of manually zeroing all fields (hopefully), or in the case of 0x80000022, relying on the compiler to clobber the uninitialized bitfields. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-ID: <20250315024102.2361628-1-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
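A small illustration of the pattern (the union is a stand-in, not KVM's actual definition):

    union pmu_cpuid_ebx {
        struct {
            u8 num_counters;
            u8 reserved[3];
        } split;
        u32 full;
    };

    /* Empty initializer: every bit, including reserved fields, starts at 0. */
    union pmu_cpuid_ebx ebx = { };
    ebx.split.num_counters = 6;     /* set only the fields that are supported */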
2025-04-04  KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU  (Sean Christopherson)
Wrap the TDP MMU page counter in CONFIG_KVM_PROVE_MMU so that the sanity check is omitted from production builds, and more importantly to remove the atomic accesses to account pages. A one-off memory leak in production is relatively uninteresting, and a WARN_ON won't help mitigate a systemic issue; it's as much about helping triage memory leaks as it is about detecting them in the first place, and doesn't magically stop the leaks. I.e. production environments will be quite sad if a severe KVM bug escapes, regardless of whether or not KVM WARNs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250315023448.2358456-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
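The shape of the change is simply fencing the accounting (and the corresponding WARN) behind the debug Kconfig; a sketch, not the literal diff:

    #ifdef CONFIG_KVM_PROVE_MMU
        atomic64_inc(&kvm->arch.tdp_mmu_pages);     /* debug-only accounting */
    #endif

    /* ...and at VM teardown... */
    #ifdef CONFIG_KVM_PROVE_MMU
        WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));   /* leak detector */
    #endif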
2025-04-04  KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses  (Sean Christopherson)
Acquire a lock on kvm->srcu when userspace is getting MP state to handle a rather extreme edge case where "accepting" APIC events, i.e. processing pending INIT or SIPI, can trigger accesses to guest memory. If the vCPU is in L2 with INIT *and* a TRIPLE_FAULT request pending, then getting MP state will trigger a nested VM-Exit by way of ->check_nested_events(), and emulating the nested VM-Exit can access guest memory. The splat was originally hit by syzkaller on a Google-internal kernel, and reproduced on an upstream kernel by hacking the triple_fault_event_test selftest to stuff a pending INIT, store an MSR on VM-Exit (to generate a memory access on VMX), and do vcpu_mp_state_get() to trigger the scenario. ============================= WARNING: suspicious RCU usage 6.14.0-rc3-b112d356288b-vmx/pi_lockdep_false_pos-lock #3 Not tainted ----------------------------- include/linux/kvm_host.h:1058 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by triple_fault_ev/1256: #0: ffff88810df5a330 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x8b/0x9a0 [kvm] stack backtrace: CPU: 11 UID: 1000 PID: 1256 Comm: triple_fault_ev Not tainted 6.14.0-rc3-b112d356288b-vmx #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x144/0x190 kvm_vcpu_gfn_to_memslot+0x156/0x180 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] read_and_check_msr_entry+0x2e/0x180 [kvm_intel] __nested_vmx_vmexit+0x550/0xde0 [kvm_intel] kvm_check_nested_events+0x1b/0x30 [kvm] kvm_apic_accept_events+0x33/0x100 [kvm] kvm_arch_vcpu_ioctl_get_mpstate+0x30/0x1d0 [kvm] kvm_vcpu_ioctl+0x33e/0x9a0 [kvm] __x64_sys_ioctl+0x8b/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250401150504.829812-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
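The fix itself is small: wrap the guest-memory-reaching work in an SRCU read-side critical section; a condensed sketch (the inner helper is illustrative):

    int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
                                        struct kvm_mp_state *mp_state)
    {
        int idx, ret;

        vcpu_load(vcpu);

        /* Accepting pending INIT/SIPI can emulate a nested VM-Exit, which may
         * read guest memory and therefore requires holding kvm->srcu.
         */
        idx = srcu_read_lock(&vcpu->kvm->srcu);
        ret = __get_mpstate(vcpu, mp_state);    /* illustrative inner helper */
        srcu_read_unlock(&vcpu->kvm->srcu, idx);

        vcpu_put(vcpu);
        return ret;
    }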
2025-03-25  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm updates from Paolo Bonzini: "ARM: - Nested virtualization support for VGICv3, giving the nested hypervisor control of the VGIC hardware when running an L2 VM - Removal of 'late' nested virtualization feature register masking, making the supported feature set directly visible to userspace - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers - Paravirtual interface for discovering the set of CPU implementations where a VM may run, addressing a longstanding issue of guest CPU errata awareness in big-little systems and cross-implementation VM migration - Userspace control of the registers responsible for identifying a particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1), allowing VMs to be migrated cross-implementation - pKVM updates, including support for tracking stage-2 page table allocations in the protected hypervisor in the 'SecPageTable' stat - Fixes to vPMU, ensuring that userspace updates to the vPMU after KVM_RUN are reflected into the backing perf events LoongArch: - Remove unnecessary header include path - Assume constant PGD during VM context switch - Add perf events support for guest VM RISC-V: - Disable the kernel perf counter during configure - KVM selftests improvements for PMU - Fix warning at the time of KVM module removal x86: - Add support for aging of SPTEs without holding mmu_lock. Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs. This includes an implementation of per-rmap-entry locking; aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas locking an rmap for write requires taking both the per-rmap spinlock and the mmu_lock. Note that this decreases slightly the accuracy of accessed-page information, because changes to the SPTE outside aging might not use atomic operations even if they could race against a clear of the Accessed bit. This is deliberate because KVM and mm/ tolerate false positives/negatives for accessed information, and testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information. - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to coalesce updates when multiple pieces of vCPU state are changing, e.g. as part of a nested transition - Fix a variety of nested emulation bugs, and add VMX support for synthesizing nested VM-Exit on interception (instead of injecting #UD into L2) - Drop "support" for async page faults for protected guests that do not set SEND_ALWAYS (i.e. that only want async page faults at CPL3) - Bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. Particularly, destroy vCPUs before the MMU, despite the latter being a VM-wide operation - Add common secure TSC infrastructure for use within SNP and in the future TDX - Block KVM_CAP_SYNC_REGS if guest state is protected. 
It does not make sense to use the capability if the relevant registers are not available for reading or writing - Don't take kvm->lock when iterating over vCPUs in the suspend notifier to fix a largely theoretical deadlock - Use the vCPU's actual Xen PV clock information when starting the Xen timer, as the cached state in arch.hv_clock can be stale/bogus - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only accounts for kvmclock, and there's no evidence that the flag is actually supported by Xen guests - Clean up the per-vCPU "cache" of its reference pvclock, and instead only track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately expensive to compute, and rarely changes for modern setups) - Don't write to the Xen hypercall page on MSR writes that are initiated by the host (userspace or KVM) to fix a class of bugs where KVM can write to guest memory at unexpected times, e.g. during vCPU creation if userspace has set the Xen hypercall MSR index to collide with an MSR that KVM emulates - Restrict the Xen hypercall MSR index to the unofficial synthetic range to reduce the set of possible collisions with MSRs that are emulated by KVM (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside in the synthetic range) - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID entries when updating PV clocks; there is no guarantee PV clocks will be updated between TSC frequency changes and CPUID emulation, and guest reads of the TSC leaves should be rare, i.e. are not a hot path x86 (Intel): - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1 - Pass XFD_ERR as the payload when injecting #NM, as a preparatory step for upcoming FRED virtualization support - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits, both as a general cleanup and in anticipation of adding support for emulating Mode-Based Execution Control (MBEC) - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest state while KVM is in the middle of emulating nested VM-Enter - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs in anticipation of adding sanity checks for secondary exit controls (the primary field is out of bits) x86 (AMD): - Ensure the PSP driver is initialized when both the PSP and KVM modules are built-in (the initcall framework doesn't handle dependencies) - Use long-term pins when registering encrypted memory regions, so that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to excessive fragmentation - Add macros and helpers for setting GHCB return/error codes - Add support for Idle HLT interception, which elides interception if the vCPU has a pending, unmasked virtual IRQ when HLT is executed - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical address - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g. because the vCPU was "destroyed" via SNP's AP Creation hypercall - Reject SNP AP Creation if the requested SEV features for the vCPU don't match the VM's configured set of features Selftests: - Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data instead of executing code. 
The theory is that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters - Fix a flaw in the Intel PMU counters test where it asserts that an event is counting correctly without actually knowing what the event counts on the underlying hardware - Fix a variety of flaws, bugs, and false failures/passes in dirty_log_test, and improve its coverage by collecting all dirty entries on each iteration - Fix a few minor bugs related to handling of stats FDs - Add infrastructure to make vCPU and VM stats FDs available to tests by default (open the FDs during VM/vCPU creation) - Relax an assertion on the number of HLT exits in the xAPIC IPI test when running on a CPU that supports AMD's Idle HLT (which elides interception of HLT if a virtual IRQ is pending and unmasked)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits) RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed RISC-V: KVM: Teardown riscv specific bits after kvm_exit LoongArch: KVM: Register perf callbacks for guest LoongArch: KVM: Implement arch-specific functions for guest perf LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel() LoongArch: KVM: Remove PGD saving during VM context switch LoongArch: KVM: Remove unnecessary header include path KVM: arm64: Tear down vGIC on failed vCPU creation KVM: arm64: PMU: Reload when resetting KVM: arm64: PMU: Reload when user modifies registers KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs KVM: arm64: PMU: Assume PMU presence in pmu-emul.c KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu KVM: arm64: Factor out pKVM hyp vcpu creation to separate function KVM: arm64: Initialize HCRX_EL2 traps in pKVM KVM: arm64: Factor out setting HCRX_EL2 traps into separate function KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected KVM: x86: Add infrastructure for secure TSC KVM: x86: Push down setting vcpu.arch.user_set_tsc ...
2025-03-25  Merge tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 speculation mitigation updates from Borislav Petkov: - Some preparatory work to convert the mitigations machinery to mitigating attack vectors instead of single vulnerabilities - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag - Add support for a Zen5-specific SRSO mitigation - Cleanups and minor improvements * tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2 x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds x86/bugs: Relocate mds/taa/mmio/rfds defines x86/bugs: Add X86_BUG_SPECTRE_V2_USER x86/bugs: Remove X86_FEATURE_USE_IBPB KVM: nVMX: Always use IBPB to properly virtualize IBRS x86/bugs: Use a static branch to guard IBPB on vCPU switch x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set() x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation() x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers x86/bugs: KVM: Add support for SRSO_MSR_FIX
2025-03-25  Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-20  Merge branch 'kvm-pre-tdx' into HEAD  (Paolo Bonzini)
- Add common secure TSC infrastructure for use within SNP and in the future TDX - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make sense to use the capability if the relevant registers are not available for reading or writing.
2025-03-20  Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD  (Paolo Bonzini)
The immediate issue being fixed here is a nVMX bug where KVM fails to detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI). However, checking for a pending interrupt accesses the legacy PIC, and x86's kvm_arch_destroy_vm() currently frees the PIC before destroying vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results in a NULL pointer deref; that's a prerequisite for the nVMX fix. The remaining patches attempt to bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. E.g. KVM currently unloads each vCPU's MMUs in a separate operation from destroying vCPUs, all because when guest SMP support was added, KVM had a kludgy MMU teardown flow that broke when a VM had more than one vCPU. And that oddity lived on, for 18 years... Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-19  Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM Xen changes for 6.15 - Don't write to the Xen hypercall page on MSR writes that are initiated by the host (userspace or KVM) to fix a class of bugs where KVM can write to guest memory at unexpected times, e.g. during vCPU creation if userspace has set the Xen hypercall MSR index to collide with an MSR that KVM emulates. - Restrict the Xen hypercall MSR index to the unofficial synthetic range to reduce the set of possible collisions with MSRs that are emulated by KVM (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside in the synthetic range). - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config. - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID entries when updating PV clocks, as there is no guarantee PV clocks will be updated between TSC frequency changes and CPUID emulation, and guest reads of Xen TSC should be rare, i.e. are not a hot path.
2025-03-19  Merge tag 'kvm-x86-pvclock-6.15' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM PV clock changes for 6.15: - Don't take kvm->lock when iterating over vCPUs in the suspend notifier to fix a largely theoretical deadlock. - Use the vCPU's actual Xen PV clock information when starting the Xen timer, as the cached state in arch.hv_clock can be stale/bogus. - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different PV clocks. - Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only accounts for kvmclock, and there's no evidence that the flag is actually supported by Xen guests. - Clean up the per-vCPU "cache" of its reference pvclock, and instead only track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately expensive to compute, and rarely changes for modern setups).
2025-03-19  Merge tag 'kvm-x86-svm-6.15' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM SVM changes for 6.15 - Ensure the PSP driver is initialized when both the PSP and KVM modules are built-in (the initcall framework doesn't handle dependencies). - Use long-term pins when registering encrypted memory regions, so that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to excessive fragmentation. - Add macros and helpers for setting GHCB return/error codes. - Add support for Idle HLT interception, which elides interception if the vCPU has a pending, unmasked virtual IRQ when HLT is executed. - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical address. - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g. because the vCPU was "destroyed" via SNP's AP Creation hypercall. - Reject SNP AP Creation if the requested SEV features for the vCPU don't match the VM's configured set of features. - Misc cleanups
2025-03-19  Merge tag 'kvm-x86-vmx-6.15' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM VMX changes for 6.15 - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1. - Pass XFD_ERR as a pseudo-payload when injecting #NM as a preparatory step for upcoming FRED virtualization support. - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits as a general cleanup, and in anticipation of adding support for emulating Mode-Based Execution (MBEC). - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest state while KVM is in the middle of emulating nested VM-Enter. - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs in anticipation of adding sanity checks for secondary exit controls (the primary field is out of bits).
2025-03-19  Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM x86 misc changes for 6.15: - Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT. - Add a helper to consolidate handling of mp_state transitions, and use it to clear pv_unhalted whenever a vCPU is made RUNNABLE. - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to coalesce updates when multiple pieces of vCPU state are changing, e.g. as part of a nested transition. - Fix a variety of nested emulation bugs, and add VMX support for synthesizing nested VM-Exit on interception (instead of injecting #UD into L2). - Drop "support" for PV Async #PF with protected guests without SEND_ALWAYS, as KVM can't get the current CPL. - Misc cleanups
2025-03-19  Merge tag 'kvm-x86-mmu-6.15' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM x86/mmu changes for 6.15 Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where "fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs, e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting in memory. For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and KVM doesn't need to access any metadata to age SPTEs. For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn, acquire the rmap's spinlock with read-only permissions, which allows hardening and optimizing the locking and aging, e.g. locking an rmap for write requires mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple concurrent readers aren't supported. To avoid forcing all SPTE updates to use atomic operations (clearing the Accessed bit out of mmu_lock makes it inherently volatile), rework and rename spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude the Accessed bit. KVM (and mm/) already tolerates false positives/negatives for Accessed information, and all testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information.
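Conceptually, the per-rmap lock lives in bit 1 of the rmap head value itself; a simplified sketch of the acquire side (the real helpers in mmu.c handle more cases, e.g. read-only acquisition for aging):

    #define KVM_RMAP_LOCKED BIT(1)      /* single-bit spinlock in the rmap head */

    static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
    {
        unsigned long old_val, new_val;

        do {
            /* Expect the lock bit to be clear; spin until it is. */
            old_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
            new_val = old_val | KVM_RMAP_LOCKED;
        } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

        return old_val;     /* unlocked value, restored by the unlock side */
    }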
2025-03-19  Merge tag 'v6.14-rc7' into x86/core, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-14  KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected  (Paolo Bonzini)
KVM_CAP_SYNC_REGS does not make sense for VMs with protected guest state, since the register values cannot actually be written. Return 0 when using the VM-level KVM_CHECK_EXTENSION ioctl, and accordingly return -EINVAL from KVM_RUN if the valid/dirty fields are nonzero. However, on exit from KVM_RUN userspace could have placed a nonzero value into kvm_run->kvm_valid_regs, so check guest_state_protected again and skip store_regs() in that case. Cc: stable@vger.kernel.org Fixes: 517987e3fb19 ("KVM: x86: add fields to struct kvm_arch for CoCo features") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250306202923.646075-1-pbonzini@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
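Condensed from the description above, the two checks look roughly like this (not the literal patch; the field names are believed to match the CoCo fields referenced in the Fixes tag, but treat them as illustrative):

    /* KVM_CHECK_EXTENSION: advertise no sync-regs support for protected VMs. */
    case KVM_CAP_SYNC_REGS:
        r = kvm->arch.has_protected_state ? 0 : KVM_SYNC_X86_VALID_FIELDS;
        break;

    /* KVM_RUN: reject valid/dirty bits that userspace set anyway. */
    if (vcpu->arch.guest_state_protected &&
        (kvm_run->kvm_valid_regs || kvm_run->kvm_dirty_regs))
        return -EINVAL;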
2025-03-14  KVM: x86: Add infrastructure for secure TSC  (Isaku Yamahata)
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit changing TSC offset/multiplier when guest_tsc_protected is true. X86 confidential computing technology defines protected guest TSC so that the VMM can't change the TSC offset/multiplier once vCPU is initialized. SEV-SNP defines Secure TSC as optional, whereas TDX mandates it. KVM has common logic on x86 that tries to guess or adjust TSC offset/multiplier for better guest TSC and TSC interrupt latency at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes (kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR (kvm_set_msr_common()). The current x86 KVM implementation conflicts with protected TSC because the VMM can't change the TSC offset/multiplier. Because KVM emulates the TSC timer or the TSC deadline timer with the TSC offset/multiplier, the TSC timer interrupt is injected into the guest at the wrong time if the KVM TSC offset is different from what the TDX module determined. Originally this issue was found by the cyclic test of rt-tests [1] as the latency in the TDX case is worse than the VMX value + TDX SEAMCALL overhead. It turned out that the KVM TSC offset is different from what the TDX module determines. Disable or ignore the KVM logic to change/adjust the TSC offset/multiplier, thus keeping the KVM TSC offset/multiplier the same as the value of the TDX module. Writes to MSR_IA32_TSC are also blocked as they amount to a change in the TSC offset. [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: x86: Push down setting vcpu.arch.user_set_tsc  (Isaku Yamahata)
Push down setting vcpu.arch.user_set_tsc to true from kvm_synchronize_tsc() to __kvm_synchronize_tsc(), as preparation, so that the two callers don't have to modify user_set_tsc directly. Later, __kvm_synchronize_tsc() will be changed to prohibit changing TSC synchronization for TDX guests; pushing the assignment down means the caller sites won't need to be touched just to change user_set_tsc. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <62b1a7a35d6961844786b6e47e8ecb774af7a228.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: x86: move vm_destroy callback at end of kvm_arch_destroy_vm  (Paolo Bonzini)
TDX needs to free the TDR control structures last, after all paging structures have been torn down; move the vm_destroy callback at a suitable place. The new place is also okay for AMD; the main difference is that the MMU has been torn down and, if anything, that is better done before the SNP ASID is released. Extracted from a patch by Yan Zhao. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-09  Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM x86 fixes for 6.14-rcN #2 - Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow. - Ensure DEBUGCTL is context switched on AMD to avoid running the guest with the host's value, which can lead to unexpected bus lock #DBs. - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly emulate BTF. KVM's lack of context switching has meant BTF has always been broken to some extent. - Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest can enable DebugSwap without KVM's knowledge. - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO memory" phase without actually generating a write-protection fault. - Fix a printf() goof in the SEV smoke test that causes build failures with -Werror. - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2 isn't supported by KVM.
2025-03-04  KVM: x86: Remove the unreachable case for 0x80000022 leaf in __do_cpuid_func()  (Xiaoyao Li)
Remove dead/unreachable (and misguided) code in KVM's processing of 0x80000022. The case statement breaks early if PERFMON_V2 isn't supported, i.e. kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2) must be true when KVM reaches the code to set up EBX. Note, early versions of the patch that became commit 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") didn't break early on lack of PERFMON_V2 support, and instead enumerated the effective number of counters KVM could emulate. All of that code was flawed, e.g. the APM explicitly states EBX is valid only for v2: "Performance Monitoring Version 2 supported. When set, CPUID_Fn8000_0022_EBX reports the number of available performance counters." When the flaw of not respecting v2 support was addressed, the misguided stuffing of the number of counters got left behind. Link: https://lore.kernel.org/all/20220919093453.71737-4-likexu@tencent.com Fixes: 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250304082314.472202-2-xiaoyao.li@intel.com [sean: elaborate on the situation a bit more, add Fixes] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-04  KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVM  (Xiaoyao Li)
Fix a goof where KVM sets CPUID.0x80000022.EAX to CPUID.0x80000022.EBX instead of zeroing both when PERFMON_V2 isn't supported by KVM. In practice, barring a buggy CPU (or vCPU model when running nested) only the !enable_pmu case is affected, as KVM always supports PERFMON_V2 if it's available in hardware, i.e. CPUID.0x80000022.EBX will be '0' if PERFMON_V2 is unsupported. For the !enable_pmu case, the bug is relatively benign as KVM will refuse to enable PMU capabilities, but a VMM that reflects KVM's supported CPUID into the guest could inadvertently induce #GPs in the guest due to advertising support for MSRs that KVM refuses to emulate. Fixes: 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250304082314.472202-3-xiaoyao.li@intel.com [sean: massage shortlog and changelog, tag for stable] Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
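In essence, with entry being the CPUID output entry being populated (illustrative, not the literal diff):

    /* Buggy: EAX picks up whatever is in EBX instead of being cleared. */
    entry->eax = entry->ebx;

    /* Fixed: both outputs are explicitly zeroed when PERFMON_V2 is absent. */
    entry->eax = entry->ebx = 0;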
2025-03-04  KVM: VMX: Use named operands in inline asm  (Josh Poimboeuf)
Convert the non-asm-goto version of the inline asm in __vmcs_readl() to use named operands, similar to its asm-goto version. Do this in preparation of changing the ASM_CALL_CONSTRAINT primitive. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sean Christopherson <seanjc@google.com> Cc: linux-kernel@vger.kernel.org
2025-03-03  KVM: VMX: Extract checks on entry/exit control pairs to a helper macro  (Sean Christopherson)
Extract the checking of entry/exit pairs to a helper macro so that the code can be reused to process the upcoming "secondary" exit controls (the primary exit controls field is out of bits). Use a macro instead of a function to support different sized variables (all secondary exit controls will be optional and so the MSR doesn't have the fixed-0/fixed-1 split). Taking the largest size as input is trivial, but handling the modification of KVM's to-be-used controls is much trickier, e.g. would require bitmap games to clear bits from a 32-bit bitmap vs. a 64-bit bitmap. Opportunistically add sanity checks to ensure the size of the controls match (yay, macro!), e.g. to detect bugs where KVM passes in the pairs for primary exit controls, but its variable for the secondary exit controls. To help users triage mismatches, print the control bits that are checked, not just the actual value. For the foreseeable future, that provides enough information for a user to determine which fields mismatched. E.g. until secondary entry controls comes along, all entry bits and thus all error messages are guaranteed to be unique. To avoid returning from a macro, which can get quite dangerous, simply process all pairs even if error_on_inconsistent_vmcs_config is set. The speed at which KVM rejects module load is not at all interesting. Keep the error message a "once" printk, even though it would be nice to print out all mismatching pairs. In practice, the most likely scenario is that a single pair will mismatch on all CPUs. Printing all mismatches generates redundant messages in that situation, and can be extremely noisy on systems with large numbers of CPUs. If a CPU has multiple mismatches, not printing every bad pair is the least of the user's concerns. Cc: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250227005353.3216123-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03  KVM: SVM: Invalidate "next" SNP VMSA GPA even on failure  (Sean Christopherson)
When processing an SNP AP Creation event, invalidate the "next" VMSA GPA even if acquiring the page/pfn for the new VMSA fails. In practice, the next GPA will never be used regardless of whether or not it's invalidated, as the entire flow is guarded by snp_ap_waiting_for_reset, and said guard and snp_vmsa_gpa are always written as a pair. But that's really hard to see in the code. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03  KVM: SVM: Use guard(mutex) to simplify SNP vCPU state updates  (Sean Christopherson)
Use guard(mutex) in sev_snp_init_protected_guest_state() and pull in its lock-protected inner helper. Without an unlock trampoline (and even with one), there is no real need for an inner helper. Eliminating the helper also avoids having to fixup the open coded "lockdep" WARN_ON(). Opportunistically drop the error message if KVM can't obtain the pfn for the new target VMSA. The error message provides zero information that can't be gleaned from the fact that the vCPU is stuck. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03  KVM: SVM: Mark VMCB dirty before processing incoming snp_vmsa_gpa  (Sean Christopherson)
Mark the VMCB dirty, i.e. zero control.clean, prior to handling the new VMSA. Nothing in the VALID_PAGE() case touches control.clean, and isolating the VALID_PAGE() code will allow simplifying the overall logic. Note, the VMCB probably doesn't need to be marked dirty when the VMSA is invalid, as KVM will disallow running the vCPU in such a state. But it also doesn't hurt anything. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03  KVM: SVM: Use guard(mutex) to simplify SNP AP Creation error handling  (Sean Christopherson)
Use guard(mutex) in sev_snp_ap_creation() and modify the error paths to return directly instead of jumping to a common exit point. No functional change intended. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
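The guard(mutex) pattern from <linux/cleanup.h> is what lets the error paths return directly; an illustrative before/after (the lock field path and the check() helper are placeholders, not the actual SEV code):

    /* Before: every error path must jump to the common unlock label. */
    mutex_lock(&svm->snp_vmsa_mutex);
    if (!check(svm)) {
        ret = -EINVAL;
        goto out;
    }
    /* ... handle the request ... */
    out:
    mutex_unlock(&svm->snp_vmsa_mutex);
    return ret;

    /* After: the mutex is released automatically when the scope ends. */
    guard(mutex)(&svm->snp_vmsa_mutex);
    if (!check(svm))
        return -EINVAL;
    /* ... handle the request ... */
    return 0;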