summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2025-03-03KVM: SVM: Invalidate "next" SNP VMSA GPA even on failureSean Christopherson
When processing an SNP AP Creation event, invalidate the "next" VMSA GPA even if acquiring the page/pfn for the new VMSA fails. In practice, the next GPA will never be used regardless of whether or not its invalidated, as the entire flow is guarded by snp_ap_waiting_for_reset, and said guard and snp_vmsa_gpa are always written as a pair. But that's really hard to see in the code. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Use guard(mutex) to simplify SNP vCPU state updatesSean Christopherson
Use guard(mutex) in sev_snp_init_protected_guest_state() and pull in its lock-protected inner helper. Without an unlock trampoline (and even with one), there is no real need for an inner helper. Eliminating the helper also avoids having to fixup the open coded "lockdep" WARN_ON(). Opportunistically drop the error message if KVM can't obtain the pfn for the new target VMSA. The error message provides zero information that can't be gleaned from the fact that the vCPU is stuck. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Mark VMCB dirty before processing incoming snp_vmsa_gpaSean Christopherson
Mark the VMCB dirty, i.e. zero control.clean, prior to handling the new VMSA. Nothing in the VALID_PAGE() case touches control.clean, and isolating the VALID_PAGE() code will allow simplifying the overall logic. Note, the VMCB probably doesn't need to be marked dirty when the VMSA is invalid, as KVM will disallow running the vCPU in such a state. But it also doesn't hurt anything. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Use guard(mutex) to simplify SNP AP Creation error handlingSean Christopherson
Use guard(mutex) in sev_snp_ap_creation() and modify the error paths to return directly instead of jumping to a common exit point. No functional change intended. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Simplify request+kick logic in SNP AP Creation handlingSean Christopherson
Drop the local "kick" variable and the unnecessary "fallthrough" logic from sev_snp_ap_creation(), and simply pivot on the request when deciding whether or not to immediate force a state update on the target vCPU. No functional change intended. Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Require AP's "requested" SEV_FEATURES to match KVM's viewSean Christopherson
When handling an "AP Create" event, return an error if the "requested" SEV features for the vCPU don't exactly match KVM's view of the VM-scoped features. There is no known use case for heterogeneous SEV features across vCPUs, and while KVM can't actually enforce an exact match since the value in RAX isn't guaranteed to match what the guest shoved into the VMSA, KVM can at least avoid knowingly letting the guest run in an unsupported state. E.g. if a VM is created with DebugSwap disabled, KVM will intercept #DBs and DRs for all vCPUs, even if an AP is "created" with DebugSwap enabled in its VMSA. Note, the GHCB spec only "requires" that "AP use the same interrupt injection mechanism as the BSP", but given the disaster that is DebugSwap and SEV_FEATURES in general, it's safe to say that AMD didn't consider all possible complications with mismatching features between the BSP and APs. Opportunistically fold the check into the relevant request flavors; the "request < AP_DESTROY" check is just a bizarre way of implementing the AP_CREATE_ON_INIT => AP_CREATE fallthrough. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Don't change target vCPU state on AP Creation VMGEXIT errorSean Christopherson
If KVM rejects an AP Creation event, leave the target vCPU state as-is. Nothing in the GHCB suggests the hypervisor is *allowed* to muck with vCPU state on failure, let alone required to do so. Furthermore, kicking only in the !ON_INIT case leads to divergent behavior, and even the "kick" case is non-deterministic. E.g. if an ON_INIT request fails, the guest can successfully retry if the fixed AP Creation request is made prior to sending INIT. And if a !ON_INIT fails, the guest can successfully retry if the fixed AP Creation request is handled before the target vCPU processes KVM's KVM_REQ_UPDATE_PROTECTED_GUEST_STATE. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Refuse to attempt VRMUN if an SEV-ES+ guest has an invalid VMSASean Christopherson
Explicitly reject KVM_RUN with KVM_EXIT_FAIL_ENTRY if userspace "coerces" KVM into running an SEV-ES+ guest with an invalid VMSA, e.g. by modifying a vCPU's mp_state to be RUNNABLE after an SNP vCPU has undergone a Destroy event. On Destroy or failed Create, KVM marks the vCPU HALTED so that *KVM* doesn't run the vCPU, but nothing prevents a misbehaving VMM from manually making the vCPU RUNNABLE via KVM_SET_MP_STATE. Attempting VMRUN with an invalid VMSA should be harmless, but knowingly executing VMRUN with bad control state is at best dodgy. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3Sean Christopherson
Never rely on the CPU to restore/load host DR0..DR3 values, even if the CPU supports DebugSwap, as there are no guarantees that SNP guests will actually enable DebugSwap on APs. E.g. if KVM were to rely on the CPU to load DR0..DR3 and skipped them during hw_breakpoint_restore(), KVM would run with clobbered-to-zero DRs if an SNP guest created APs without DebugSwap enabled. Update the comment to explain the dangers, and hopefully prevent breaking KVM in the future. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Save host DR masks on CPUs with DebugSwapSean Christopherson
When running SEV-SNP guests on a CPU that supports DebugSwap, always save the host's DR0..DR3 mask MSR values irrespective of whether or not DebugSwap is enabled, to ensure the host values aren't clobbered by the CPU. And for now, also save DR0..DR3, even though doing so isn't necessary (see below). SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to create a VMSA with guest-controlled SEV_FEATURES. A well behaved guest can inform the hypervisor, i.e. KVM, of its "requested" features, but on CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from lying about which SEV features are being enabled (or not!). If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the MSRs with '0' if KVM doesn't save its desired value. Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM restores all DRs in common x86 code as needed via hw_breakpoint_restore(). I.e. there is no risk of host DR0..DR3 being clobbered (when it matters). However, there is a flaw in the opposite direction; because the guest can lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's knowledge, KVM must not rely on the CPU to restore DRs. Defer fixing that wart, as it's more of a documentation issue than a bug in the code. Note, KVM added support for DebugSwap on commit d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes, as the underlying flaw exists in hardware, not in KVM. I.e. all kernels that support SEV-SNP need to be patched, not just kernels with KVM's full support for DebugSwap (ignoring that DebugSwap support landed first). Opportunistically fix an incorrect statement in the comment; on CPUs without DebugSwap, the CPU does NOT save or load debug registers, i.e. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Cc: stable@vger.kernel.org Cc: Naveen N Rao <naveen@kernel.org> Cc: Kim Phillips <kim.phillips@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-01kvm: retry nx_huge_page_recovery_thread creationKeith Busch
A VMM may send a non-fatal signal to its threads, including vCPU tasks, at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU task receives the signal while its trying to spawn the huge page recovery vhost task, then KVM_RUN will fail due to copy_process() returning -ERESTARTNOINTR. Rework call_once() to mark the call complete if and only if the called function succeeds, and plumb the function's true error code back to the call_once() invoker. This provides userspace with the correct, non-fatal error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge page task. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> [implemented the kvm user side] Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-3-kbusch@meta.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-01vhost: return task creation error instead of NULLKeith Busch
Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-28KVM: x86: Always set mp_state to RUNNABLE on wakeup from HLTSean Christopherson
When emulating HLT and a wake event is already pending, explicitly mark the vCPU RUNNABLE (via kvm_set_mp_state()) instead of assuming the vCPU is already in the appropriate state. Barring a KVM bug, it should be impossible for the vCPU to be in a non-RUNNABLE state, but there is no advantage to relying on that to hold true, and ensuring the vCPU is made RUNNABLE avoids non-deterministic behavior with respect to pv_unhalted. E.g. if the vCPU is not already RUNNABLE, then depending on when pv_unhalted is set, KVM could either leave the vCPU in the non-RUNNABLE state (set before __kvm_emulate_halt()), or transition the vCPU to HALTED and then RUNNABLE (pv_unhalted set after the kvm_vcpu_has_events() check). Link: https://lore.kernel.org/r/20250224174156.2362059-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQsSean Christopherson
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle debugctl bits from IRQ context, e.g. when enabling/disabling events via smp_call_function_single(). Taking the snapshot (long) before IRQs are disabled could result in KVM effectively clobbering DEBUGCTL due to using a stale snapshot. Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabledSean Christopherson
Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's value on #VMEXIT) if it diverges from the host's value and LBR virtualization is disabled, as hardware only context switches DEBUGCTL if LBR virtualization is fully enabled. Running the guest with the host's value has likely been mildly problematic for quite some time, e.g. it will result in undesirable behavior if BTF diverges (with the caveat that KVM now suppresses guest BTF due to lack of support). But the bug became fatal with the introduction of Bus Lock Trap ("Detect" in kernel paralance) support for AMD (commit 408eb7417a92 ("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will trigger an unexpected #DB. Note, suppressing the bus lock #DB, i.e. simply resuming the guest without injecting a #DB, is not an option. It wouldn't address the general issue with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible side effects if BusLockTrap is left enabled. If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to clear it by software are ignored. But if BusLockTrap is enabled, software can clear DR6.BLD: Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2) to 1. When bus lock trap is enabled, ... The processor indicates that this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11). DR6[11] previously had been defined to be always 1. and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by other #DBs: All other #DB exceptions leave DR6[BLD] unmodified E.g. leaving BusLockTrap enable can confuse a legacy guest that writes '0' to reset DR6. Reported-by: rangemachine@gmail.com Reported-by: whanos@sergal.fun Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787 Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL in common x86Sean Christopherson
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot. Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Suppress DEBUGCTL.BTF on AMDSean Christopherson
Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support BTF, and fully enabling BTF virtualization is non-trivial due to interactions with the emulator, guest_debug, #DB interception, nested SVM, etc. Don't inject #GP if the guest attempts to set BTF, as there's no way to communicate lack of support to the guest, and instead suppress the flag and treat the WRMSR as (partially) unsupported. In short, make KVM behave the same on AMD and Intel (VMX already squashes BTF). Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF has "worked" in any capacity is if the guest simultaneously enables LBRs. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective valueSean Christopherson
Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed the architectural behavior of the bits and broke backwards compatibility. On CPUs without BusLockTrap (or at least, in APMs from before ~2023), bits 5:2 controlled the behavior of external pins: Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write. Software uses thesebits to control the type of information reported by the four external performance-monitoring/breakpoint pins on the processor. When a PBi bit is cleared to 0, the corresponding external pin (BPi) reports performance-monitor information. When a PBi bit is set to 1, the corresponding external pin (BPi) reports breakpoint information. With the introduction of BusLockTrap, presumably to be compatible with Intel CPUs, AMD redefined bit 2 to be BLCKDB: Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to enable generation of a #DB trap following successful execution of a bus lock when CPL is > 0. and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ". Ideally, KVM would treat bits 5:2 as reserved. Defer that change to a feature cleanup to avoid breaking existing guest in LTS kernels. For now, drop the bits to retain backwards compatibility (of a sort). Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits is visible to the guest, whereas now the guest will always see '0'. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Inject #GP if memory operand for INVPCID is non-canonicalSean Christopherson
Inject a #GP if the memory operand received by INVCPID is non-canonical. The APM clearly states that the intercept takes priority over all #GP checks except the CPL0 restriction. Of course, that begs the question of how the CPU generates a linear address in the first place. Tracing confirms that EXITINFO1 does hold a linear address, at least for 64-bit mode guests (hooray GS prefix). Unfortunately, the APM says absolutely nothing about the EXITINFO fields for INVPCID intercepts, so it's not at all clear what's supposed to happen. Add a FIXME to call out that KVM still does the wrong thing for 32-bit guests, and if the stack segment is used for the memory operand. Cc: Babu Moger <babu.moger@amd.com> Cc: Jim Mattson <jmattson@google.com> Fixes: 4407a797e941 ("KVM: SVM: Enable INVPCID feature on AMD") Link: https://lore.kernel.org/r/20250224174522.2363400-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: VMX: Reject KVM_RUN if userspace forces emulation during nested VM-EnterSean Christopherson
Extend KVM's restrictions on userspace forcing "emulation required" at odd times to cover stuffing invalid guest state while a nested run is pending. Clobbering guest state while KVM is in the middle of emulating VM-Enter is nonsensical, as it puts the vCPU into an architecturally impossible state, and will trip KVM's sanity check that guards against KVM bugs, e.g. helps detect missed VMX consistency checks. WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline] WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637 Modules linked in: CPU: 3 UID: 0 PID: 6336 Comm: syz.0.73 Not tainted 6.13.0-rc1-syzkaller-00316-gb5f217084ab3 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:__vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline] RIP: 0010:vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637 <TASK> vcpu_enter_guest arch/x86/kvm/x86.c:11081 [inline] vcpu_run+0x3047/0x4f50 arch/x86/kvm/x86.c:11242 kvm_arch_vcpu_ioctl_run+0x44a/0x1740 arch/x86/kvm/x86.c:11560 kvm_vcpu_ioctl+0x6ce/0x1520 virt/kvm/kvm_main.c:4340 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl fs/ioctl.c:892 [inline] __x64_sys_ioctl+0x190/0x200 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> Reported-by: syzbot+ac0bc3a70282b4d586cc@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67598fb9.050a0220.17f54a.003b.GAE@google.com Debugged-by: James Houghton <jthoughton@google.com> Tested-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250224171409.2348647-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadowSean Christopherson
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff() so that VMRUN is not executed in an STI shadow. AMD CPUs have a quirk (some would say "bug"), where the STI shadow bleeds into the guest's intr_state field if a #VMEXIT occurs during injection of an event, i.e. if the VMRUN doesn't complete before the subsequent #VMEXIT. The spurious "interrupts masked" state is relatively benign, as it only occurs during event injection and is transient. Because KVM is already injecting an event, the guest can't be in HLT, and if KVM is querying IRQ blocking for injection, then KVM would need to force an immediate exit anyways since injecting multiple events is impossible. However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the spurious STI shadow is visible to L1 when running a nested VM, which can trip sanity checks, e.g. in VMware's VMM. Hoist the STI+CLI all the way to C code, as the aforementioned calls to guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already possible. I.e. if there's kernel code that is confused by running with RFLAGS.IF=1, then it's already a problem. In practice, since GIF=0 also blocks NMIs, the only change in exposure to non-KVM code (relative to surrounding VMRUN with STI+CLI) is exception handling code, and except for the kvm_rebooting=1 case, all exception in the core VM-Enter/VM-Exit path are fatal. Use the "raw" variants to enable/disable IRQs to avoid tracing in the "no instrumentation" code; the guest state helpers also take care of tracing IRQ state. Oppurtunstically document why KVM needs to do STI in the first place. Reported-by: Doug Covelli <doug.covelli@broadcom.com> Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com Fixes: f14eec0a3203 ("KVM: SVM: move more vmentry code to assembly") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86/tdp_mmu: Remove tdp_mmu_for_each_pte()Nikolay Borisov
That macro acts as a different name for for_each_tdp_pte, apart from adding cognitive load it doesn't bring any value. Let's remove it. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20250226074131.312565-1-nik.borisov@suse.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bitsSean Christopherson
Define independent macros for the RWX protection bits that are enumerated via EXIT_QUALIFICATION for EPT Violations, and tie them to the RWX bits in EPT entries via compile-time asserts. Piggybacking the EPTE defines works for now, but it creates holes in the EPT_VIOLATION_xxx macros and will cause headaches if/when KVM emulates Mode-Based Execution (MBEC), or any other features that introduces additional protection information. Opportunistically rename EPT_VIOLATION_RWX_MASK to EPT_VIOLATION_PROT_MASK so that it doesn't become stale if/when MBEC support is added. No functional change intended. Cc: Jon Kohler <jon@nutanix.com> Cc: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20250227000705.3199706-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-27KVM: nVMX: Always use IBPB to properly virtualize IBRSYosry Ahmed
On synthesized nested VM-exits in VMX, an IBPB is performed if IBRS is advertised to the guest to properly provide separate prediction domains for L1 and L2. However, this is currently conditional on X86_FEATURE_USE_IBPB, which depends on the host spectre_v2_user mitigation. In short, if spectre_v2_user=no, IBRS is not virtualized correctly and L1 becomes susceptible to attacks from L2. Fix this by performing the IBPB regardless of X86_FEATURE_USE_IBPB. Fixes: 2e7eab81425a ("KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS") Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250227012712.3193063-6-yosry.ahmed@linux.dev
2025-02-27x86/bugs: Use a static branch to guard IBPB on vCPU switchYosry Ahmed
Instead of using X86_FEATURE_USE_IBPB to guard the IBPB execution in KVM when a new vCPU is loaded, introduce a static branch, similar to switch_mm_*_ibpb. This makes it obvious in spectre_v2_user_select_mitigation() what exactly is being toggled, instead of the unclear X86_FEATURE_USE_IBPB (which will be shortly removed). It also provides more fine-grained control, making it simpler to change/add paths that control the IBPB in the vCPU switch path without affecting other IBPBs. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250227012712.3193063-5-yosry.ahmed@linux.dev
2025-02-27x86/bugs: Move the X86_FEATURE_USE_IBPB check into callersYosry Ahmed
indirect_branch_prediction_barrier() only performs the MSR write if X86_FEATURE_USE_IBPB is set, using alternative_msr_write(). In preparation for removing X86_FEATURE_USE_IBPB, move the feature check into the callers so that they can be addressed one-by-one, and use X86_FEATURE_IBPB instead to guard the MSR write. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250227012712.3193063-2-yosry.ahmed@linux.dev
2025-02-26KVM: Drop kvm_arch_sync_events() now that all implementations are nopsSean Christopherson
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other arch has ever used it). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20250224235542.2562848-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Fold guts of kvm_arch_sync_events() into kvm_arch_pre_destroy_vm()Sean Christopherson
Fold the guts of kvm_arch_sync_events() into kvm_arch_pre_destroy_vm(), as the kvmclock and PIT background workers only need to be stopped before destroying vCPUs (to avoid accessing vCPUs as they are being freed); it's a-ok for them to be running while the VM is visible on the global vm_list. Note, the PIT also needs to be stopped before IRQ routing is freed (because KVM's IRQ routing is garbage and assumes there is always non-NULL routing). Opportunistically add comments to explain why KVM stops/frees certain assets early. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Unload MMUs during vCPU destruction, not beforeSean Christopherson
When destroying a VM, unload a vCPU's MMUs as part of normal vCPU freeing, instead of as a separate prepratory action. Unloading MMUs ahead of time is a holdover from commit 7b53aa565084 ("KVM: Fix vcpu freeing for guest smp"), which "fixed" a rather egregious flaw where KVM would attempt to free *all* MMU pages when destroying a vCPU. At the time, KVM would spin on all MMU pages in a VM when free a single vCPU, and so would hang due to the way KVM pins and zaps root pages (roots are invalidated but not freed if they are pinned by a vCPU). static void free_mmu_pages(struct kvm_vcpu *vcpu) { struct kvm_mmu_page *page; while (!list_empty(&vcpu->kvm->active_mmu_pages)) { page = container_of(vcpu->kvm->active_mmu_pages.next, struct kvm_mmu_page, link); kvm_mmu_zap_page(vcpu->kvm, page); } free_page((unsigned long)vcpu->mmu.pae_root); } Now that KVM doesn't try to free all MMU pages when destroying a single vCPU, there's no need to unpin roots prior to destroying a vCPU. Note! While KVM mostly destroys all MMUs before calling kvm_arch_destroy_vm() (see commit f00be0cae4e6 ("KVM: MMU: do not free active mmu pages in free_mmu_pages()")), unpinning MMU roots during vCPU destruction will unfortunately trigger remote TLB flushes, i.e. will try to send requests to all vCPUs. Happily, thanks to commit 27592ae8dbe4 ("KVM: Move wiping of the kvm->vcpus array to common code"), that's a non-issue as freed vCPUs are naturally skipped by xa_for_each_range(), i.e. by kvm_for_each_vcpu(). Prior to that commit, KVM x86 rather stupidly freed vCPUs one-by-one, and _then_ nullified them, one-by-one. I.e. triggering a VM-wide request would hit a use-after-free. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Don't load/put vCPU when unloading its MMU during teardownSean Christopherson
Don't load (and then put) a vCPU when unloading its MMU during VM destruction, as nothing in kvm_mmu_unload() accesses vCPU state beyond the root page/address of each MMU, i.e. can't possible need to run with the vCPU loaded. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26x86/bugs: KVM: Add support for SRSO_MSR_FIXBorislav Petkov
Add support for CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate SRSO. Enable BpSpecReduce to mitigate SRSO across guest/host boundaries. Switch back to enabling the bit when virtualization is enabled and to clear the bit when virtualization is disabled because using a MSR slot would clear the bit when the guest is exited and any training the guest has done, would potentially influence the host kernel when execution enters the kernel and hasn't VMRUN the guest yet. More detail on the public thread in Link below. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org
2025-02-26KVM: nVMX: Process events on nested VM-Exit if injectable IRQ or NMI is pendingSean Christopherson
Process pending events on nested VM-Exit if the vCPU has an injectable IRQ or NMI, as the event may have become pending while L2 was active, i.e. may not be tracked in the context of vmcs01. E.g. if L1 has passed its APIC through to L2 and an IRQ arrives while L2 is active, then KVM needs to request an IRQ window prior to running L1, otherwise delivery of the IRQ will be delayed until KVM happens to process events for some other reason. The missed failure is detected by vmx_apic_passthrough_tpr_threshold_test in KVM-Unit-Tests, but has effectively been masked due to a flaw in KVM's PIC emulation that causes KVM to make spurious KVM_REQ_EVENT requests (and apparently no one ever ran the test with split IRQ chips). Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Free vCPUs before freeing VM stateSean Christopherson
Free vCPUs before freeing any VM state, as both SVM and VMX may access VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs to be kicked out of nested guest mode. Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was called") partially fixed the issue, but for unknown reasons only moved the MMU unloading before VM destruction. Complete the change, and free all vCPU state prior to destroying VM state, as nVMX accesses even more state than nSVM. In addition to the AVIC, KVM can hit a use-after-free on MSR filters: kvm_msr_allowed+0x4c/0xd0 __kvm_set_msr+0x12d/0x1e0 kvm_set_msr+0x19/0x40 load_vmcs12_host_state+0x2d8/0x6e0 [kvm_intel] nested_vmx_vmexit+0x715/0xbd0 [kvm_intel] nested_vmx_free_vcpu+0x33/0x50 [kvm_intel] vmx_free_vcpu+0x54/0xc0 [kvm_intel] kvm_arch_vcpu_destroy+0x28/0xf0 kvm_vcpu_destroy+0x12/0x50 kvm_arch_destroy_vm+0x12c/0x1c0 kvm_put_kvm+0x263/0x3c0 kvm_vm_release+0x21/0x30 and an upcoming fix to process injectable interrupts on nested VM-Exit will access the PIC: BUG: kernel NULL pointer dereference, address: 0000000000000090 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 23 UID: 1000 PID: 2658 Comm: kvm-nx-lpage-re RIP: 0010:kvm_cpu_has_extint+0x2f/0x60 [kvm] Call Trace: <TASK> kvm_cpu_has_injectable_intr+0xe/0x60 [kvm] nested_vmx_vmexit+0x2d7/0xdf0 [kvm_intel] nested_vmx_free_vcpu+0x40/0x50 [kvm_intel] vmx_vcpu_free+0x2d/0x80 [kvm_intel] kvm_arch_vcpu_destroy+0x2d/0x130 [kvm] kvm_destroy_vcpus+0x8a/0x100 [kvm] kvm_arch_destroy_vm+0xa7/0x1d0 [kvm] kvm_destroy_vm+0x172/0x300 [kvm] kvm_vcpu_release+0x31/0x50 [kvm] Inarguably, both nSVM and nVMX need to be fixed, but punt on those cleanups for the moment. Conceptually, vCPUs should be freed before VM state. Assets like the I/O APIC and PIC _must_ be allocated before vCPUs are created, so it stands to reason that they must be freed _after_ vCPUs are destroyed. Reported-by: Aaron Lewis <aaronlewis@google.com> Closes: https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com Cc: Jim Mattson <jmattson@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-25KVM: SVM: Add Idle HLT intercept supportManali Shukla
Add support for "Idle HLT" interception on AMD CPUs, and enable Idle HLT interception instead of "normal" HLT interception for all VMs for which HLT-exiting is enabled. Idle HLT provides a mild performance boost for all VM types, by avoiding a VM-Exit in the scenario where KVM would immediately "wake" and resume the vCPU. Idle HLT makes HLT-exiting conditional on the vCPU not having a valid, unmasked interrupt. Specifically, a VM-Exit occurs on execution of HLT if and only if there are no pending V_IRQ or V_NMI events. Note, Idle is a replacement for full HLT interception, i.e. enabling HLT interception would result in all HLT instructions causing unconditional VM-Exits. Per the APM: When both HLT and Idle HLT intercepts are active at the same time, the HLT intercept takes priority. This intercept occurs only if a virtual interrupt is not pending (V_INTR or V_NMI). For KVM's use of V_IRQ (also called V_INTR in the APM) to detect interrupt windows, the net effect of enabling Idle HLT is that, if a virtual interupt is pending and unmasked at the time of HLT, the vCPU will take a V_IRQ intercept instead of a HLT intercept. When AVIC is enabled, Idle HLT works as intended: the vCPU continues unimpeded and services the pending virtual interrupt. Note, the APM's description of V_IRQ interaction with AVIC is quite confusing, and requires piecing together implied behavior. Per the APM, when AVIC is enabled, V_IRQ *from the VMCB* is ignored: When AVIC mode is enabled for a virtual processor, the V_IRQ, V_INTR_PRIO, V_INTR_VECTOR, and V_IGN_TPR fields in the VMCB are ignored. Which seems to contradict the behavior of Idle HLT: This intercept occurs only if a virtual interrupt is not pending (V_INTR or V_NMI). What's not explicitly stated is that hardware's internal copy of V_IRQ (and related fields) *are* still active, i.e. are presumably used to cache information from the virtual APIC. Handle Idle HLT exits as if they were normal HLT exits, e.g. don't try to optimize the handling under the assumption that there isn't a pending IRQ. Irrespective of AVIC, Idle HLT is inherently racy with respect to the vIRR, as KVM can set vIRR bits asychronously. No changes are required to support KVM's use Idle HLT while running L2. In fact, supporting Idle HLT is actually a bug fix to some extent. If L1 wants to intercept HLT, recalc_intercepts() will enable HLT interception in vmcb02 and forward the intercept to L1 as normal. But if L1 does not want to intercept HLT, then KVM will run L2 with Idle HLT enabled and HLT interception disabled. If a V_IRQ or V_NMI for L2 becomes pending and L2 executes HLT, then use of Idle HLT will do the right thing, i.e. not #VMEXIT and instead deliver the virtual event. KVM currently doesn't handle this scenario correctly, e.g. doesn't check V_IRQ or V_NMI in vmcs02 as part of kvm_vcpu_has_events(). Do not expose Idle HLT to L1 at this time, as supporting nested Idle HLT is more complex than just enumerating the feature, e.g. requires KVM to handle the aforementioned scenarios of V_IRQ and V_NMI at the time of exit. Signed-off-by: Manali Shukla <Manali.Shukla@amd.com> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://bugzilla.kernel.org/attachment.cgi?id=306250 Link: https://lore.kernel.org/r/20250128124812.7324-3-manali.shukla@amd.com [sean: rewrite changelog, drop nested "support"] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: SVM: Provide helpers to set the error codeMelody Wang
Provide helpers to set the error code when converting VMGEXIT SW_EXITINFO1 and SW_EXITINFO2 codes from plain numbers to proper defines. Add comments for better code readability. No functionality changed. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Melody Wang <huibo.wang@amd.com> Link: https://lore.kernel.org/r/20250225213937.2471419-3-huibo.wang@amd.com [sean: tweak comments, fix formatting goofs] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: SVM: Convert plain error code numbers to definesMelody Wang
Convert VMGEXIT SW_EXITINFO1 codes from plain numbers to proper defines. Opportunistically update the comment for the malformed input "sub-error" codes to state that they are defined by the GHCB, and to capure the relationship to the malformed input response. No functional change intended. Signed-off-by: Melody Wang <huibo.wang@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com> Link: https://lore.kernel.org/r/20250225213937.2471419-2-huibo.wang@amd.com [sean: update comments] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: VMX: Pass XFD_ERR as pseudo-payload when injecting #NMSean Christopherson
Pass XFD_ERR via KVM's exception payload mechanism when injecting an #NM after interception so that XFD_ERR can be propagated to FRED's event_data field without needing a dedicated field (which would need to be migrated). For non-FRED vCPUs, this is a glorified NOP as kvm_deliver_exception_payload() will simply do nothing (which is desirable and correct). Signed-off-by: Xin Li (Intel) <xin@zytor.com> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20241001050110.3643764-15-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: VMX: Don't modify guest XFD_ERR if CR0.TS=1Sean Christopherson
Don't update the guest's XFD_ERR MSR if CR0.TS is set; per the SDM, XFD_ERR is not modified if CR0.TS=1. Although it's not explicitly stated in the SDM, conceptually it makes sense the CR0.TS check would be done prior to the XFD_ERR check, e.g. CR0.TS=1 blocks all SIMD state, whereas XFD blocks only XTILE state. Device-not-available exceptions that are not due to XFD - those resulting from setting CR0.TS to 1 - do not modify the IA32_XFD_ERR MSR. Opportunistically update the comment to call out that XFD_ERR is updated before the VM-Exit check occurs. Nothing in the SDM explicitly calls out this behavior, but logically it must be the behavior, otherwise reading XFD_ERR in handle_nm_fault_irqoff() would return stale data, i.e. the to-be-delivered XFD_ERR value would need to be saved in EXIT_QUALIFICATION, a la DR6 for #DB and CR2 for #PF, so that software could capture the guest value. Fixes: ec5be88ab29f ("kvm: x86: Intercept #NM for saving IA32_XFD_ERR") Signed-off-by: Xin Li (Intel) <xin@zytor.com> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20241001050110.3643764-3-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: x86: Use a dedicated flow for queueing re-injected exceptionsSean Christopherson
Open code the filling of vcpu->arch.exception in kvm_requeue_exception() instead of bouncing through kvm_multiple_exception(), as re-injection doesn't actually share that much code with "normal" injection, e.g. the VM-Exit interception check, payload delivery, and nested exception code is all bypassed as those flows only apply during initial injection. When FRED comes along, the special casing will only get worse, as FRED explicitly tracks nested exceptions and essentially delivers the payload on the stack frame, i.e. re-injection will need more inputs, and normal injection will have yet more code that needs to be bypassed when KVM is re-injecting an exception. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: x86: Rename and invert async #PF's send_user_only flag to send_alwaysSean Christopherson
Rename send_user_only to avoid "user", because KVM's ABI is to not inject page faults into CPL0, whereas "user" in x86 is specifically CPL3. Invert the polarity to keep the naming simple and unambiguous. E.g. while KVM often refers to CPL0 as "kernel", that terminology isn't ubiquitous, and "send_kernel" could be misconstrued as "send only to kernel". Link: https://lore.kernel.org/r/20250215010609.1199982-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: x86: Don't inject PV async #PF if SEND_ALWAYS=0 and guest state is ↵Sean Christopherson
protected Don't inject PV async #PFs into guests with protected register state, i.e. SEV-ES and SEV-SNP guests, unless the guest has opted-in to receiving #PFs at CPL0. For protected guests, the actual CPL of the guest is unknown. Note, no sane CoCo guest should enable PV async #PF, but the current state of Linux-as-a-CoCo-guest isn't entirely sane. Fixes: add5e2f04541 ("KVM: SVM: Add support for the SEV-ES VMSA") Link: https://lore.kernel.org/r/20250215010609.1199982-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25KVM: x86: Update Xen TSC leaves during CPUID emulationFred Griffoul
The Xen emulation in KVM modifies certain CPUID leaves to expose TSC information to the guest. Previously, these CPUID leaves were updated whenever guest time changed, but this conflicts with KVM_SET_CPUID/KVM_SET_CPUID2 ioctls which reject changes to CPUID entries on running vCPUs. Fix this by updating the TSC information directly in the CPUID emulation handler instead of modifying the vCPU's CPUID entries. Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20250124150539.69975-1-fgriffo@amazon.co.uk Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: nVMX: Synthesize nested VM-Exit for supported emulation interceptsSean Christopherson
When emulating an instruction on behalf of L2 that L1 wants to intercept, generate a nested VM-Exit instead of injecting a #UD into L2. Now that (most of) the necessary information is available, synthesizing a VM-Exit isn't terribly difficult. Punt on decoding the ModR/M for descriptor table exits for now. There is no evidence that any hypervisor intercepts descriptor table accesses *and* uses the EXIT_QUALIFICATION to expedite emulation, i.e. it's not worth delaying basic support for. To avoid doing more harm than good, e.g. by putting L2 into an infinite or effectively corrupting its code stream, inject #UD if the instruction length is nonsensical. Link: https://lore.kernel.org/r/20250201015518.689704-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: nVMX: Allow the caller to provide instruction length on nested VM-ExitSean Christopherson
Rework the nested VM-Exit helper to take the instruction length as a parameter, and convert nested_vmx_vmexit() into a "default" wrapper that grabs the length from vmcs02 as appropriate. This will allow KVM to set the correct instruction length when synthesizing a nested VM-Exit when emulating an instruction that L1 wants to intercept. No functional change intended, as the path to prepare_vmcs12()'s reading of vmcs02.VM_EXIT_INSTRUCTION_LEN is gated on the same set of conditions as the VMREAD in the new nested_vmx_vmexit(). Link: https://lore.kernel.org/r/20250201015518.689704-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: x86: Add a #define for the architectural max instruction lengthSean Christopherson
Add a #define to capture x86's architecturally defined max instruction length instead of open coding the literal in a variety of places. No functional change intended. Link: https://lore.kernel.org/r/20250201015518.689704-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: x86: Plumb the emulator's starting RIP into nested intercept checksSean Christopherson
When checking for intercept when emulating an instruction on behalf of L2, pass the emulator's view of the RIP of the instruction being emulated to vendor code. Unlike SVM, which communicates the next RIP on VM-Exit, VMX communicates the length of the instruction that generated the VM-Exit, i.e. requires the current and next RIPs. Note, unless userspace modifies RIP during a userspace exit that requires completion, kvm_rip_read() will contain the same information. Pass the emulator's view largely out of a paranoia, and because there is no meaningful cost in doing so. Link: https://lore.kernel.org/r/20250201015518.689704-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: x86: Plumb the src/dst operand types through to .check_intercept()Sean Christopherson
When checking for intercept when emulating an instruction on behalf of L2, forward the source and destination operand types to vendor code so that VMX can synthesize the correct EXIT_QUALIFICATION for port I/O VM-Exits. Link: https://lore.kernel.org/r/20250201015518.689704-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: nVMX: Consolidate missing X86EMUL_INTERCEPTED logic in L2 emulationSean Christopherson
Refactor the handling of port I/O interception checks when emulating on behalf of L2 in anticipation of synthesizing a nested VM-Exit to L1 instead of injecting a #UD into L2. No functional change intended. Link: https://lore.kernel.org/r/20250201015518.689704-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: nVMX: Emulate HLT in L2 if it's not interceptedSean Christopherson
Extend VMX's nested intercept logic for emulated instructions to handle HLT interception, primarily for testing purposes. Failure to allow emulation of HLT isn't all that interesting, as emulating HLT while L2 is active either requires forced emulation (and no #UD intercept in L1), TLB games in the guest to coerce KVM into emulating the wrong instruction, or a bug elsewhere in KVM. E.g. without commit 47ef3ef843c0 ("KVM: VMX: Handle event vectoring error in check_emulate_instruction()"), KVM can end up trying to emulate HLT if RIP happens to point at a HLT when a vectored event arrives with L2's IDT pointing at emulated MMIO. Note, vmx_check_intercept() is still broken when L1 wants to intercept an instruction, as KVM injects a #UD instead of synthesizing a nested VM-Exit. That issue extends far beyond HLT, punt on it for now. Link: https://lore.kernel.org/r/20250201015518.689704-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24KVM: nVMX: Allow emulating RDPID on behalf of L2Sean Christopherson
Return X86EMUL_CONTINUE instead X86EMUL_UNHANDLEABLE when emulating RDPID on behalf of L2 and L1 _does_ expose RDPID/RDTSCP to L2. When RDPID emulation was added by commit fb6d4d340e05 ("KVM: x86: emulate RDPID"), KVM incorrectly allowed emulation by default. Commit 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode") fixed that flaw, but missed that RDPID emulation was relying on the common return path to allow emulation on behalf of L2. Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode") Link: https://lore.kernel.org/r/20250201015518.689704-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>