Age | Commit message (Collapse) | Author |
|
Skip instruction emulation and let the TDX guest retry for MMIO emulation
after installing the MMIO SPTE with suppress #VE bit cleared.
TDX protects TDX guest state from VMM, instructions in guest memory cannot
be emulated. MMIO emulation is the only case that triggers the instruction
emulation code path for TDX guest.
The MMIO emulation handling flow as following:
- The TDX guest issues a vMMIO instruction. (The GPA must be shared and is
not covered by KVM memory slot.)
- The default SPTE entry for shared-EPT by KVM has suppress #VE bit set. So
EPT violation causes TD exit to KVM.
- Trigger KVM page fault handler and install a new SPTE with suppress #VE
bit cleared.
- Skip instruction emulation and return X86EMU_RETRY_INSTR to let the vCPU
retry.
- TDX guest re-executes the vMMIO instruction.
- TDX guest gets #VE because KVM has cleared #VE suppress bit.
- TDX guest #VE handler converts MMIO into TDG.VP.VMCALL<MMIO>
Return X86EMU_RETRY_INSTR in the callback check_emulate_instruction() for
TDX guests to retry the MMIO instruction. Also, the instruction emulation
handling will be skipped, so that the callback check_intercept() will never
be called for TDX guest.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
TDX protects TDX guest state from VMM. Implement access methods for TDX
guest state to ignore them or return zero. Because those methods can be
called by kvm ioctls to set/get cpu registers, they don't have KVM_BUG_ON.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Implement TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value is
zero, return success code and zero in output registers.
TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to
enumerate which TDG.VP.VMCALL sub leaves are supported. This hypercall is
for future enhancement of the Guest-Host-Communication Interface (GHCI)
specification. The GHCI version of 344426-001US defines it to require
input R12 to be zero and to return zero in output registers, R11, R12, R13,
and R14 so that guest TD enumerates no enhancement.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Allow TDX guest to configure LMCE (Local Machine Check Event) by handling
MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL.
MCE and MCA are advertised via cpuid based on the TDX module spec. Guest
kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the
platform or not. If LMCE is opted-in by the platform, guest kernel can
access IA32_MCG_EXT_CTL to enable/disable LMCE.
Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid
failure when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE. E.g.,
Linux guest will treat the failure as a #GP(0).
Userspace VMM may not opt-in LMCE by default, e.g., QEMU disables it by
default, "-cpu lmce=on" is needed in QEMU command line to opt-in it.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and
wire up KVM backend functions.
For complete_emulated_msr() callback, instead of injecting #GP on error,
implement tdx_complete_emulated_msr() to set return code on error. Also
set return value on MSR read according to the values from kvm x86
registers.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add functions to implement MSR related callbacks, .set_msr(), .get_msr(),
and .has_emulated_msr(), for preparation of handling hypercalls from TDX
guest for PV RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX.
There are three classes of MSR virtualization for TDX.
- Non-configurable: TDX module directly virtualizes it. VMM can't configure
it, the value set by KVM_SET_MSRS is ignored.
- Configurable: TDX module directly virtualizes it. VMM can configure it at
VM creation time. The value set by KVM_SET_MSRS is used.
- #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used.
For the MSRs belonging to the #VE case, the TDX module injects #VE to the
TDX guest upon RDMSR or WRMSR. The exact list of such MSRs is defined in
TDX Module ABI Spec.
Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>,
which are defined in GHCI (Guest-Host Communication Interface) so that the
host VMM (e.g. KVM) can virtualize the MSRs.
TDX doesn't allow VMM to configure interception of MSR accesses. Ignore
KVM_REQ_MSR_FILTER_CHANGED for TDX guest. If the userspace has set any
MSR filters, it will be applied when handling
TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move KVM_MAX_MCE_BANKS to header file so that it can be used for TDX in
a future patch.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Handle TDX PV HLT hypercall and the interrupt status due to it.
TDX guest status is protected, KVM can't get the interrupt status
of TDX guest and it assumes interrupt is always allowed unless TDX
guest calls TDVMCALL with HLT, which passes the interrupt blocked flag.
If the guest halted with interrupt enabled, also query pending RVI by
checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall.
Update vt_interrupt_allowed() for TDX based on interrupt blocked flag
passed by HLT TDVMCALL. Do not wakeup TD vCPU if interrupt is blocked
for VT-d PI.
For NMIs, KVM cannot determine the NMI blocking status for TDX guests,
so KVM always assumes NMIs are not blocked. In the unlikely scenario
where a guest invokes the PV HLT hypercall within an NMI handler, this
could result in a spurious wakeup. The guest should implement the PV
HLT hypercall within a loop if it truly requires no interruptions, since
NMI could be unblocked by an IRET due to an exception occurring before
the PV HLT is executed in the NMI handler.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM
according to TDX Guest Host Communication Interface (GHCI).
For TDX, most CPUID leaf/sub-leaf combinations are virtualized by
the TDX module while some trigger #VE. On #VE, TDX guest can issue
TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID)
to request VMM to emulate CPUID operation.
Morph PV CPUID hypercall to EXIT_REASON_CPUID and wire up to the KVM
backend function.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rewrite changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters
contention, since the page removal path does not expect error and is less
sensitive to the performance penalty caused by kicking off vCPUs.
Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock,
KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX
module.
- tdh_mem_track() may contend with tdh_vp_enter().
- tdh_mem_range_block()/tdh_mem_page_remove() may contend with
tdh_vp_enter() and TDCALLs.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
TDCS epoch tdh_vp_enter tdh_mem_track
------------------------------------------------------------
SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
tdh_mem_range_block
------------------------------------------------------------
SEPT entry tdh_mem_range_block (Host lock)
tdh_mem_page_remove (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
Use a TDX specific per-VM flag wait_for_sept_zap along with
KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering
TD, thereby avoiding the potential contention. Apply the kick-off and no
vCPU entering only after each SEAMCALL busy error to minimize the window of
no TD entry, as the contention due to 0-step mitigation or TDCALLs is
expected to be rare.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().
TDX EPT violation installs private pages via tdh_mem_sept_add() and
tdh_mem_page_aug(). The two may have contention with tdh_vp_enter() or
TDCALLs.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation)
tdh_mem_page_aug
------------------------------------------------------------
SEPT entry tdh_mem_sept_add (Host lock)
tdh_mem_page_aug (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation still persists.
The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER,
which works as follows:
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
checks if the guest RIP is the same as last guest RIP on TD entry.
-if yes, it means the EPT violation is caused by the same instruction
that caused the last VM exit.
Then, the TDX module increases the guest RIP no-progress count.
When the count increases from 0 to the threshold (currently 6),
the TDX module records the faulting GPA into a
last_epf_gpa_list.
-if no, it means the guest RIP has made progress.
So, the TDX module resets the RIP no-progress count and the
last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
TD entry) checks if the last_epf_gpa_list is empty.
-if yes, TD entry continues without acquiring the lock on the SEPT tree.
-if no, it triggers the 0-step mitigation by acquiring the exclusive
lock on SEPT tree, walking the EPT tree to check if all page
faults caused by the GPAs in the last_epf_gpa_list have been
resolved before continuing TD entry.
Since KVM TDP MMU usually re-enters guest whenever it exits to userspace
(e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is possible for a
tdh_vp_enter() to be called more than the threshold count before a page
fault is addressed, triggering contention when tdh_vp_enter() attempts to
acquire exclusive lock on SEPT tree.
Retry locally in TDX EPT violation handler to reduce the count of invoking
tdh_vp_enter(), hence reducing the possibility of its contention with
tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation and
the contention are still not eliminated due to KVM_EXIT_MEMORY_FAULT,
signals/interrupts, and cases when one instruction faults more GFNs than
the threshold count.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Detect SEPT violations that occur when an SEPT entry is in PENDING state
while the TD is configured not to receive #VE on SEPT violations.
A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE
to 1 in tdh_mng_init() or modifying pending_ve_disable to 1 in TDCS when
flexible_pending_ve is permitted. In such cases, the TDX module will not
inject #VE into the TD upon encountering an EPT violation caused by an SEPT
entry in the PENDING state. Instead, TDX module will exit to VMM and set
extended exit qualification type to PENDING_EPT_VIOLATION and exit
qualification bit 6:3 to 0.
Since #VE will not be injected to such TDs, they are not able to be
notified to accept a GPA. TD accessing before accepting a private GPA
is regarded as an error within the guest.
Detect such guest error by inspecting the (extended) exit qualification
bits and make such VM dead.
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For TDX, on EPT violation, call common __vmx_handle_ept_violation() to
trigger x86 MMU code; on EPT misconfiguration, bug the VM since it
shouldn't happen.
EPT violation due to instruction fetch should never be triggered from
shared memory in TDX guest. If such EPT violation occurs, treat it as
broken hardware.
EPT misconfiguration shouldn't happen on neither shared nor secure EPT for
TDX guests.
- TDX module guarantees no EPT misconfiguration on secure EPT. Per TDX
module v1.5 spec section 9.4 "Secure EPT Induced TD Exits":
"By design, since secure EPT is fully controlled by the TDX module, an
EPT misconfiguration on a private GPA indicates a TDX module bug and is
handled as a fatal error."
- For shared EPT, the MMIO caching optimization, which is the only case
where current KVM configures EPT entries to generate EPT
misconfiguration, is implemented in a different way for TDX guests. KVM
configures EPT entries to non-present value without suppressing #VE bit.
It causes #VE in the TDX guest and the guest will call TDG.VP.VMCALL to
request MMIO emulation.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
[binbin: rework changelog]
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Handle VM exit caused by "other SMI" for TDX, by returning back to
userspace for Machine Check System Management Interrupt (MSMI) case or
ignoring it and resume vCPU for non-MSMI case.
For VMX, SMM transition can happen in both VMX non-root mode and VMX
root mode. Unlike VMX, in SEAM root mode (TDX module), all interrupts
are blocked. If an SMI occurs in SEAM non-root mode (TD guest), the SMI
causes VM exit to TDX module, then SEAMRET to KVM. Once it exits to KVM,
SMI is delivered and handled by kernel handler right away.
An SMI can be "I/O SMI" or "other SMI". For TDX, there will be no I/O SMI
because I/O instructions inside TDX guest trigger #VE and TDX guest needs
to use TDVMCALL to request VMM to do I/O emulation.
For "other SMI", there are two cases:
- MSMI case. When BIOS eMCA MCE-SMI morphing is enabled, the #MC occurs in
TDX guest will be delivered as an MSMI. It causes an
EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit
qualification. On VM exit, TDX module checks whether the "other SMI" is
caused by an MSMI or not. If so, TDX module marks TD as fatal,
preventing further TD entries, and then completes the TD exit flow to KVM
with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD. After
TD exit, the MSMI is delivered and eventually handled by the kernel
machine check handler (7911f145de5f x86/mce: Implement recovery for
errors in TDX/SEAM non-root mode), i.e., the memory page is marked as
poisoned and it won't be freed to the free list when the TDX guest is
terminated. Since the TDX guest is dead, follow other non-recoverable
cases, exit to userspace.
- For non-MSMI case, KVM doesn't need to do anything, just continue TDX
vCPU execution.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX.
NMI Handling: Just like the VMX case, NMI remains blocked after exiting
from TDX guest for NMI-induced exits [*]. Handle NMI-induced exits for
TDX guests in the same way as they are handled for VMX guests, i.e.,
handle NMI in tdx_vcpu_enter_exit() by calling the vmx_handle_nmi()
helper.
Interrupt and Exception Handling: Similar to the VMX case, external
interrupts and exceptions (machine check is the only exception type
KVM handles for TDX guests) are handled in the .handle_exit_irqoff()
callback.
For other exceptions, because TDX guest state is protected, exceptions in
TDX guests can't be intercepted. TDX VMM isn't supposed to handle these
exceptions. If unexpected exception occurs, exit to userspace with
KVM_EXIT_EXCEPTION.
For external interrupt, increase the statistics, same as the VMX case.
[*]: Some old TDX modules have a bug which makes NMI unblocked after
exiting from TDX guest for NMI-induced exits. This could potentially
lead to nested NMIs: a new NMI arrives when KVM is manually calling the
host NMI handler. This is an architectural violation, but it doesn't
have real harm until FRED is enabled together with TDX (for non-FRED,
the host NMI handler can handle nested NMIs). Given this is rare to
happen and has no real harm, ignore this for the initial TDX support.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a helper to handles NMI exit.
TDX handles the NMI exit the same as VMX case. Add a helper to share the
code with TDX, expose the helper in common.h.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that
vmx_handle_exit_irqoff() can be reused by TDX code.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
TDX protects TDX guest APIC state from VMM. Implement access methods of
TDX guest vAPIC state to ignore them or return zero.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Force APICv active for TDX guests in KVM because APICv is always enabled
by TDX module.
From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.
After TDX vCPU init, APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2} are
rejected due to has_protected_state for TDs and guest_state_protected
for TDX vCPUs are set. Reject KVM_{GET,SET}_LAPIC from userspace since
migration is not supported yet, so that userspace cannot disable APIC.
For various APICv inhibit reasons:
- APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv
in tdx_bringup(). If !enable_apicv, TDX support will be disabled.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is
mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note:
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory
allocation fails for KVM apic_map.)
- APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support
HyperV guest yet.
- APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is
checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support
KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is
mandatory.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects
userspace to set APIC base.
- The rest inhibit reasons are relevant only to AMD's AVIC, including
APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN,
APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.
(For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept
EOI for TDX guests neither, but KVM enforces KVM_IRQCHIP_SPLIT for TDX
guests, which eliminates the in-kernel PIT.)
Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is
disabled for TDX guests.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Enforce KVM_IRQCHIP_SPLIT for TDX guests to disallow in-kernel I/O APIC
while in-kernel local APIC is needed.
APICv is always enabled by TDX module and TDX Module doesn't allow the
hypervisor to modify the EOI-bitmap, i.e. all EOIs are accelerated and
never trigger exits. Level-triggered interrupts and other things depending
on EOI VM-Exit can't be faithfully emulated in KVM. Also, the lazy check
of pending APIC EOI for RTC edge-triggered interrupts, which was introduced
as a workaround when EOI cannot be intercepted, doesn't work for TDX either
because kvm_apic_pending_eoi() checks vIRR and vISR, but both values are
invisible in KVM.
If the guest induces generation of a level-triggered interrupt, the VMM is
left with the choice of dropping the interrupt, sending it as-is, or
converting it to an edge-triggered interrupt. Ditto for KVM. All of those
options will make the guest unhappy. There's no architectural behavior KVM
can provide that's better than sending the interrupt and hoping for the
best.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Always block INIT and SIPI events for the TDX guest because the TDX module
doesn't provide API for VMM to inject INIT IPI or SIPI.
TDX defines its own vCPU creation and initialization sequence including
multiple seamcalls. Also, it's only allowed during TD build time.
Given that TDX guest is para-virtualized to boot BSP/APs, normally there
shouldn't be any INIT/SIPI event for TDX guest. If any, three options to
handle them:
1. Always block INIT/SIPI request.
2. (Silently) ignore INIT/SIPI request during delivery.
3. Return error to guest TDs somehow.
Choose option 1 for simplicity. Since INIT and SIPI are always blocked,
INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, no
need to add new interface or helper function for INIT/SIPI delivery.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Handle SMI request as what KVM does for CONFIG_KVM_SMM=n, i.e. return
-ENOTTY, and add KVM_BUG_ON() to SMI related OPs for TD.
TDX doesn't support system-management mode (SMM) and system-management
interrupt (SMI) in guest TDs. Because guest state (vCPU state, memory
state) is protected, it must go through the TDX module APIs to change
guest state. However, the TDX module doesn't provide a way for VMM to
inject SMI into guest TD or a way for VMM to switch guest vCPU mode into
SMM.
MSR_IA32_SMBASE will not be emulated for TDX guest, -ENOTTY will be
returned when SMI is requested.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make
the NMI pending in the TDX module. If there is a further pending NMI in
KVM, collapse it to the one pending in the TDX module.
VMM can request the TDX module to inject a NMI into a TDX vCPU by setting
the PEND_NMI TDVPS field to 1. Following that, VMM can call TDH.VP.ENTER
to run the vCPU and the TDX module will attempt to inject the NMI as soon
as possible.
KVM has the following 3 cases to inject two NMIs when handling simultaneous
NMIs and they need to be injected in a back-to-back way. Otherwise, OS
kernel may fire a warning about the unknown NMI [1]:
K1. One NMI is being handled in the guest and one NMI pending in KVM.
KVM requests NMI window exit to inject the pending NMI.
K2. Two NMIs are pending in KVM.
KVM injects the first NMI and requests NMI window exit to inject the
second NMI.
K3. A previous NMI needs to be rejected and one NMI pending in KVM.
KVM first requests force immediate exit followed by a VM entry to
complete the NMI rejection. Then, during the force immediate exit, KVM
requests NMI window exit to inject the pending NMI.
For TDX, PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one
NMI in the TDX module. Also, the vCPU state is protected, KVM doesn't know
the NMI blocking states of TDX vCPU, KVM has to assume NMI is always
unmasked and allowed. When KVM sees PEND_NMI is 1 after a TD exit, it
means the previous NMI needs to be re-injected.
Based on KVM's NMI handling flow, there are following 6 cases:
In NMI handler TDX module KVM
T1. No PEND_NMI=0 1 pending NMI
T2. No PEND_NMI=0 2 pending NMIs
T3. No PEND_NMI=1 1 pending NMI
T4. Yes PEND_NMI=0 1 pending NMI
T5. Yes PEND_NMI=0 2 pending NMIs
T6. Yes PEND_NMI=1 1 pending NMI
K1 is mapped to T4.
K2 is mapped to T2 or T5.
K3 is mapped to T3 or T6.
Note: KVM doesn't know whether NMI is blocked by a NMI or not, case T5 and
T6 can happen.
When handling pending NMI in KVM for TDX guest, what KVM can do is to add a
pending NMI in TDX module when PEND_NMI is 0. T1 and T4 can be handled by
this way. However, TDX doesn't allow KVM to request NMI window exit
directly, if PEND_NMI is already set and there is still pending NMI in KVM,
the only way KVM could try is to request a force immediate exit. But for
case T5 and T6, force immediate exit will result in infinite loop because
force immediate exit makes it no progress in the NMI handler, so that the
pending NMI in the TDX module can never be injected.
Considering on X86 bare metal, multiple NMIs could collapse into one NMI,
e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all NMI
sources in the NMI handler to avoid missing handling of some NMI events.
Based on that, for the above 3 cases (K1-K3), only case K1 must inject the
second NMI because the guest NMI handler may have already polled some of
the NMI sources, which could include the source of the pending NMI, the
pending NMI must be injected to avoid the lost of NMI. For case K2 and K3,
the guest OS will poll all NMI sources (including the sources caused by the
second NMI and further NMI collapsed) when the delivery of the first NMI,
KVM doesn't have the necessity to inject the second NMI.
To handle the NMI injection properly for TDX, there are two options:
- Option 1: Modify the KVM's NMI handling common code, to collapse the
second pending NMI for K2 and K3.
- Option 2: Do it in TDX specific way. When the previous NMI is still
pending in the TDX module, i.e. it has not been delivered to TDX guest
yet, collapse the pending NMI in KVM into the previous one.
This patch goes with option 2 because it is simple and doesn't impact other
VM types. Option 1 may need more discussions.
This is the first need to access vCPU scope metadata in the "management"
class. Make needed accessors available.
[1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Call kvm_wait_lapic_expire() when POSTED_INTR_ON is set and the vector
for LVTT is set in PIR before TD entry.
KVM always assumes a timer IRQ was injected if APIC state is protected.
For TDX guest, APIC state is protected and KVM injects timer IRQ via posted
interrupt. To avoid unnecessary wait calls, only call
kvm_wait_lapic_expire() when a timer IRQ was injected, i.e., POSTED_INTR_ON
is set and the vector for LVTT is set in PIR.
Add a helper to test PIR.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path. The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.
Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Implement non-NMI interrupt injection for TDX via posted interrupt.
As CPU state is protected and APICv is enabled for the TDX guest, TDX
supports non-NMI interrupt injection only by posted interrupt. Posted
interrupt descriptors (PIDs) are allocated in shared memory, KVM can
update them directly. If target vCPU is in non-root mode, send posted
interrupt notification to the vCPU and hardware will sync PIR to vIRR
atomically. Otherwise, kick it to pick up the interrupt from PID. To
post pending interrupts in the PID, KVM can generate a self-IPI with
notification vector prior to TD entry.
Since the guest status of TD vCPU is protected, assume interrupt is
always allowed. Ignore the code path for event injection mechanism or
LAPIC emulation for TDX.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move posted interrupt delivery code to common header so that TDX can
leverage it.
No functional change intended.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014757.897978-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Disable PI wakeup for IPI virtualization (IPIv) case for TDX.
When a vCPU is being scheduled out, notification vector is switched and
pi_wakeup_handler() is enabled when the vCPU has interrupt enabled and
posted interrupt is used to wake up the vCPU.
For VMX, a blocked vCPU can be the target of posted interrupts when using
IPIv or VT-d PI. TDX doesn't support IPIv, disable PI wakeup for IPIv.
Also, since the guest status of TD vCPU is protected, assume interrupt is
always enabled for TD. (PV HLT hypercall is not support yet, TDX guest
tells VMM whether HLT is called with interrupt disabled or not.)
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector. And
to determine if TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.
Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.
Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.
For the TD exits not due to HLT TDCALL, skip checking RVI pending in
tdx_protected_apic_has_interrupt(). Except for the guest being stupid
(e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to
have an interrupt in RVI that is fully unmasked. There is no any CPU flows
that modify RVI in the middle of instruction execution. I.e. if RVI is
non-zero, then either the interrupt has been pending since before the TD
exit, or the instruction caused the TD exit is in an STI/SS shadow. KVM
doesn't care about STI/SS shadows outside of the HALTED case. And if the
interrupt was pending before TD exit, then it _must_ be blocked, otherwise
the interrupt would have been serviced at the instruction boundary.
For the HLT TDCALL case, it will be handled in a future patch when HLT
TDCALL is supported.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the
leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according
to TDX Guest Host Communication Interface (GHCI) spec.
For TDX guests, VMM is not allowed to access vCPU registers and the private
memory, and the code instructions must be fetched from the private memory.
So MMIO emulation implemented for non-TDX VMs is not possible for TDX
guests.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest. The #VE handling is supposed to emulate the MMIO
instruction inside the guest and convert it into a TDVMCALL with the
leaf #VE.RequestMMIO, which equals to EXIT_REASON_EPT_VIOLATION.
The requested MMIO address must be in shared GPA space. The shared bit
is stripped after check because the existing code for MMIO emulation is
not aware of the shared bit.
The MMIO GPA shouldn't have a valid memslot, also the attribute of the GPA
should be shared. KVM could do the checks before exiting to userspace,
however, even if KVM does the check, there still will be race conditions
between the check in KVM and the emulation of MMIO access in userspace due
to a memslot hotplug, or a memory attribute conversion. If userspace
doesn't check the attribute of the GPA and the attribute happens to be
private, it will not pose a security risk or cause an MCE, but it can lead
to another issue. E.g., in QEMU, treating a GPA with private attribute as
shared when it falls within RAM's range can result in extra memory
consumption during the emulation to the access to the HVA of the GPA.
There are two options: 1) Do the check both in KVM and userspace. 2) Do
the check only in QEMU. This patch chooses option 2, i.e. KVM omits the
memslot and attribute checks, and expects userspace to do the checks.
Similar to normal MMIO emulation, try to handle the MMIO in kernel first,
if kernel can't support it, forward the request to userspace. Export
needed symbols used for MMIO handling.
Fragments handling is not needed for TDX PV MMIO because GPA is provided,
if a MMIO access crosses page boundary, it should be continuous in GPA.
Also, the size is limited to 1, 2, 4, 8 bytes. No further split needed.
Allow cross page access because no extra handling needed after checking
both start and end GPA are shared GPAs.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Emulate port I/O requested by TDX guest via TDVMCALL with leaf
Instruction.IO (same value as EXIT_REASON_IO_INSTRUCTION) according to
TDX Guest Host Communication Interface (GHCI).
All port I/O instructions inside the TDX guest trigger the #VE exception.
On #VE triggered by I/O instructions, TDX guest can call TDVMCALL with
leaf Instruction.IO to request VMM to emulate I/O instructions.
Similar to normal port I/O emulation, try to handle the port I/O in kernel
first, if kernel can't support it, forward the request to userspace.
Note string I/O operations are not supported in TDX. Guest should unroll
them before calling the TDVMCALL.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with
a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for
handling.
TD guest can use TDG.VP.VMCALL<ReportFatalError> to report the fatal
error it has experienced. This hypercall is special because TD guest
is requesting a termination with the error information, KVM needs to
forward the hypercall to userspace anyway, KVM doesn't do parsing or
conversion, it just dumps the 16 general-purpose registers to userspace
and let userspace decide what to do.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.
MapGPA is used by TDX guest to request to map a GPA range as private
or shared memory. It needs to exit to userspace for handling. KVM has
already implemented a similar hypercall KVM_HC_MAP_GPA_RANGE, which will
exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks,
convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request
to userspace.
To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
range is split into 2MB chunks and check interrupt pending between chunks.
This allows for timely injection of interrupts and prevents issues with
guest lockup detection. TDX guest should retry the operation for the
GPA starting at the address specified in R11 when the TDVMCALL return
TDVMCALL_RETRY as status code.
Note userspace needs to enable KVM_CAP_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE bit set for TD VM.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Handle KVM hypercall for TDX according to TDX Guest-Host Communication
Interface (GHCI) specification.
The TDX GHCI specification defines the ABI for the guest TD to issue
hypercalls. When R10 is non-zero, it indicates the TDG.VP.VMCALL is
vendor-specific. KVM uses R10 as KVM hypercall number and R11-R14
as 4 arguments, while the error code is returned in R10.
Morph the TDG.VP.VMCALL with KVM hypercall to EXIT_REASON_VMCALL and
marshall r10~r14 from vp_enter_args to the appropriate x86 registers for
KVM hypercall handling.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a place holder and related helper functions for preparation of
TDG.VP.VMCALL handling.
The TDX module specification defines TDG.VP.VMCALL API (TDVMCALL for short)
for the guest TD to call hypercall to VMM. When the guest TD issues a
TDVMCALL, the guest TD exits to VMM with a new exit reason. The arguments
from the guest TD and returned values from the VMM are passed in the guest
registers. The guest RCX register indicates which registers are used.
Define helper functions to access those registers.
A new VMX exit reason TDCALL is added to indicate the exit is due to
TDVMCALL from the guest TD. Define the TDCALL exit reason and add a place
holder to handle such exit.
Some leafs of TDCALL will be morphed to another VMX exit reason instead of
EXIT_REASON_TDCALL, add a helper tdcall_to_vmx_exit_reason() as a place
holder to do the conversion.
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014225.897298-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Introduce the wiring for handling TDX VM exits by implementing the
callbacks .get_exit_info(), .get_entry_info(), and .handle_exit().
Additionally, add error handling during the TDX VM exit flow, and add a
place holder to handle various exit reasons.
Store VMX exit reason and exit qualification in struct vcpu_vt for TDX,
so that TDX/VMX can use the same helpers to get exit reason and exit
qualification. Store extended exit qualification and exit GPA info in
struct vcpu_tdx because they are used by TDX code only.
Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.*
operations due to secure EPT or TD EPOCH. If the contention occurs,
the return value will have TDX_OPERAND_BUSY set, prompting the vCPU to
attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED,
not EXIT_FASTPATH_REENTER_GUEST, so that the interrupts pending during
IN_GUEST_MODE can be delivered for sure. Otherwise, the requester of
KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly.
Error Handling:
- TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the
CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX
isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM
firmware is not loaded or disabled.
- TDX_ERROR: This indicates some check failed in the TDX module, preventing
the vCPU from running.
- Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it
separately before handling TDX_NON_RECOVERABLE because when off-TD debug
is not enabled, TDX_NON_RECOVERABLE is set.
- TDX_NON_RECOVERABLE: Set by the TDX module when the error is
non-recoverable, indicating that the TDX guest is dead or the vCPU is
disabled.
A special case is triple fault, which also sets TDX_NON_RECOVERABLE but
exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case.
- Any unhandled VM exit reason will also return to userspace with
KVM_EXIT_INTERNAL_ERROR.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014225.897298-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move pv_unhalted check out of kvm_vcpu_has_events(), check pv_unhalted
explicitly when handling PV unhalt and expose kvm_vcpu_has_events().
kvm_vcpu_has_events() returns true if pv_unhalted is set, and pv_unhalted
is only cleared on transitions to KVM_MP_STATE_RUNNABLE. If the guest
initiates a spurious wakeup, pv_unhalted could be left set in perpetuity.
Currently, this is not problematic because kvm_vcpu_has_events() is only
called when handling PV unhalt. However, if kvm_vcpu_has_events() is used
for other purposes in the future, it could return the unexpected results.
Export kvm_vcpu_has_events() for its usage in broader contexts.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
in via the macro.
When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
____kvm_emulate_hypercall() can just read registers internally based on KVM
hypercall ABI, and those registers can be removed from the
__kvm_emulate_hypercall() macro.
Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
remove it from the __kvm_emulate_hypercall() macro as well.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20250222014225.897298-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.
TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit. Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs. KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.
MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.
Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Save the IA32_DEBUGCTL MSR before entering a TDX VCPU and restore it
afterwards. The TDX Module preserves bits 1, 12, and 14, so if no
other bits are set, no restore is done.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-12-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not
yet implemented, so disable support for TSX and WAITPKG for now. Clear the
associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error
if those bits are set in KVM_TDX_INIT_VM.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-11-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Several MSRs are clobbered on TD exit that are not used by Linux while
in ring 0. Ensure the cached value of the MSR is updated on vcpu_put,
and the MSRs themselves before returning to ring 3.
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-10-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Several MSRs are constant and only used in userspace(ring 3). But VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to
guest's values and leverages user return notifier to restore them when the
kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also
caches the value it wrote to an MSR last time.
TDX module unconditionally resets some of these MSRs to architectural INIT
state on TD exit. It makes the cached values in kvm_user_return_msrs are
inconsistent with values in hardware. This inconsistency needs to be
fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring
some MSRs to the host's values. kvm_set_user_return_msr() can help correct
this case, but it is not optimal as it always does a wrmsr. So, introduce
a variation of kvm_set_user_return_msr() to update cached values and skip
that wrmsr.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
On exiting from the guest TD, xsave state is clobbered; restore it.
Do not use kvm_load_host_xsave_state(), as it relies on vcpu->arch
to find out whether other KVM_RUN code has loaded guest state into
XCR0/PKRU/XSS or not. In the case of TDX, the exit values are known
independent of the guest CR0 and CR4, and in fact the latter are not
available.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-8-adrian.hunter@intel.com>
[Rewrite to not use kvm_load_host_xsave_state. - Paolo]
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
On entering/exiting TDX vcpu, preserved or clobbered CPU state is different
from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
Save/restore kernel GS base MSR.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-7-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter().
Ensure the TDX VCPU is in a correct state to run.
Do not pass arguments from/to vcpu->arch.regs[] unconditionally. Instead,
marshall state to/from the appropriate x86 registers only when needed,
i.e., to handle some TDVMCALL sub-leaves following KVM's ABI to leverage
the existing code.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-6-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct
vcpu_vt, to share the code between VMX/TDX as much as possible and to make
TDX exit handling more VMX like.
No functional change intended.
[Adrian: move code that depends on struct vcpu_vmx back to vmx.h]
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-5-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move the handling of SEPT zap errors caused by unsuccessful execution of
tdh_mem_page_add() in KVM_TDX_INIT_MEM_REGION from
tdx_sept_drop_private_spte() to tdx_sept_zap_private_spte(). Introduce a
new helper function tdx_is_sept_zap_err_due_to_premap() to detect this
specific error.
During the IOCTL KVM_TDX_INIT_MEM_REGION, KVM premaps leaf SPTEs in the
mirror page table before the corresponding entry in the private page table
is successfully installed by tdh_mem_page_add(). If an error occurs during
the invocation of tdh_mem_page_add(), a mismatch between the mirror and
private page tables results in SEAMCALLs for SEPT zap returning the error
code TDX_EPT_ENTRY_STATE_INCORRECT.
The error TDX_EPT_WALK_FAILED is not possible because, during
KVM_TDX_INIT_MEM_REGION, KVM only premaps leaf SPTEs after successfully
mapping non-leaf SPTEs. Unlike leaf SPTEs, there is no mismatch in non-leaf
PTEs between the mirror and private page tables. Therefore, during zap,
SEAMCALLs should find an empty leaf entry in the private EPT, leading to
the error TDX_EPT_ENTRY_STATE_INCORRECT instead of TDX_EPT_WALK_FAILED.
Since tdh_mem_range_block() is always invoked before tdh_mem_page_remove(),
move the handling of SEPT zap errors from tdx_sept_drop_private_spte() to
tdx_sept_zap_private_spte(). In tdx_sept_zap_private_spte(), return 0 for
errors due to premap to skip executing other SEAMCALLs for zap, which are
unnecessary. Return 1 to indicate no other errors, allowing the execution
of other zap SEAMCALLs to continue.
The failure of tdh_mem_page_add() is uncommon and has not been observed in
real workloads. Currently, this failure is only hypothetically triggered by
skipping the real SEAMCALL and faking the add error in the SEAMCALL
wrapper. Additionally, without this fix, there will be no host crashes or
other severe issues.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250217085642.19696-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Wrap vmx_update_cpu_dirty_logging so as to ignore requests to update
CPU dirty logging for TDs, as basic TDX does not support the PML feature.
Invoking vmx_update_cpu_dirty_logging() for TDs would cause an incorrect
access to a kvm_vmx struct for a TDX VM, so block that before it happens.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and
set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled.
Do not set it for TDs.
Until now, cpu_dirty_log_size was a system-wide value that is used for
all VMs and is set to the PML buffer size when PML was enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.
Making cpu_dirty_log_size a per-VM value allows it to be ther PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request them
to update CPU dirty logging status (turn on/off PML in VMCS).
This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES
still works if the corresponding has no flag KVM_MEM_GUEST_MEMFD.
KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.
Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202 [1]
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|