path: root/arch/x86/kvm/mmu
2025-06-12KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORYPaolo Bonzini
Only let userspace pass the same addresses that were used in KVM_SET_USER_MEMORY_REGION (or KVM_SET_USER_MEMORY_REGION2); gpas in the upper half of the address space are an implementation detail of TDX and KVM. Extracted from a patch by Sean Christopherson <seanjc@google.com>. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
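For illustration, a minimal userspace sketch of the ioctl this affects (the gpa and size are made-up example values; the memslot setup is assumed to have been done earlier):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /*
     * Pre-fault 2 MiB at a gpa that was registered via
     * KVM_SET_USER_MEMORY_REGION2.  Decorating the gpa with TDX's shared
     * bit is now rejected by KVM, per the commit above.
     */
    static int pre_fault(int vcpu_fd)
    {
            struct kvm_pre_fault_memory range = {
                    .gpa  = 0x100000,           /* example address */
                    .size = 2ULL * 1024 * 1024,
            };

            return ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
    }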
2025-06-12KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORYPaolo Bonzini
A bug[*] was reported for the TDX case when enabling KVM_PRE_FAULT_MEMORY in QEMU. It turns out that the @gpa passed to kvm_mmu_do_page_fault() doesn't have the shared bit set when its memory attribute is shared, which leads to the wrong root being chosen in tdp_mmu_get_root_for_fault(). Fix it by embedding the direct bits in the gpa that is passed to kvm_tdp_map_page() when the memory at that gpa is not private. [*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/ Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20250611001018.2179964-1-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
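A sketch of the shape of the fix, assuming the kvm_gfn_direct_bits() helper from the mirror-page-table series (not necessarily the verbatim patch):

    /*
     * Fold the VM's direct bits (for TDX, the shared bit) into the gpa for
     * non-private memory so that tdp_mmu_get_root_for_fault() selects the
     * direct (shared) root rather than the mirror root.
     */
    u64 direct_bits = 0;

    if (!kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
            direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));

    r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);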
2025-05-27Merge tag 'kvm-x86-mmu-6.16' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.16: - Refine and harden handling of spurious faults. - Use kvm_x86_call() instead of open coding static_call().
2025-05-26Merge tag 'loongarch-kvm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEADPaolo Bonzini
LoongArch KVM changes for v6.16 1. Don't flush tlb if HW PTW supported. 2. Add LoongArch KVM selftests support.
2025-05-16KVM: x86/mmu: Use kvm_x86_call() instead of manual static_call()Sean Christopherson
Use KVM's preferred kvm_x86_call() wrapper to invoke static calls related to mirror page tables. No functional change intended. Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables") Fixes: 94faba8999b9 ("KVM: x86/tdp_mmu: Propagate tearing down mirror page tables") Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250331182703.725214-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
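An illustrative before/after for one of the hooks named in the Fixes tags (the argument list here is abbreviated/assumed, not copied from the patch):

    /* Before: open-coded static call */
    ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, pfn);

    /* After: KVM's preferred wrapper, which expands to the same static call */
    ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, pfn);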
2025-05-02KVM: x86/mmu: Prevent installing hugepages when mem attributes are changingSean Christopherson
When changing memory attributes on a subset of a potential hugepage, add the hugepage to the invalidation range tracking to prevent installing a hugepage until the attributes are fully updated. Like the actual hugepage tracking updates in kvm_arch_post_set_memory_attributes(), process only the head and tail pages, as any potential hugepages that are entirely covered by the range will already be tracked. Note, only hugepage chunks whose current attributes are NOT mixed need to be added to the invalidation set, as mixed attributes already prevent installing a hugepage, and it's perfectly safe to install a smaller mapping for a gfn whose attributes aren't changing. Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory") Cc: stable@vger.kernel.org Reported-by: Michael Roth <michael.roth@amd.com> Tested-by: Michael Roth <michael.roth@amd.com> Link: https://lore.kernel.org/r/20250430220954.522672-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()Yan Zhao
Check request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to free obsolete roots in kvm_mmu_reload() to prevent kvm_mmu_reload() from seeing a stale obsolete root. Since kvm_mmu_reload() can be called outside the vcpu_enter_guest() path (e.g., kvm_arch_vcpu_pre_fault_memory()), it may be invoked after a root has been marked obsolete and before vcpu_enter_guest() is invoked to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS and set root.hpa to invalid. This causes kvm_mmu_reload() to fail to load a new root, which can leave kvm_arch_vcpu_pre_fault_memory() stuck in the while loop in kvm_tdp_map_page(), since RET_PF_RETRY is always returned due to is_page_fault_stale(). Keep the existing check of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS in vcpu_enter_guest(), since the cost of kvm_check_request() is negligible, especially for a check that's guarded by kvm_request_pending(). Export kvm_mmu_free_obsolete_roots(), as kvm_mmu_reload() is inline and may be called outside of kvm.ko. Fixes: 6e01b7601dfe ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013333.5817-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
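The resulting shape of kvm_mmu_reload(), per the changelog (a sketch, not necessarily the verbatim patch):

    static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
    {
            /*
             * Free obsolete roots first so that a root marked obsolete
             * outside of vcpu_enter_guest() (e.g. via pre-fault) is not
             * reused below.
             */
            if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
                    kvm_mmu_free_obsolete_roots(vcpu);

            if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
                    return 0;

            return kvm_mmu_load(vcpu);
    }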
2025-04-28KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMUYan Zhao
Warn if the PFN changes on a shadow-present SPTE in mmu_set_spte(). KVM should _never_ change the PFN of a shadow-present SPTE. There is a WARN_ON_ONCE() in mmu_spte_update() to detect PFN changes on shadow-present SPTEs; however, it is not hittable from mmu_set_spte(), since mmu_set_spte() invokes drop_spte() before mmu_spte_update(), which clears the SPTE to a !shadow-present state. So, add a WARN_ON_ONCE() in mmu_set_spte(), before invoking drop_spte(), to warn on a PFN change of a shadow-present SPTE. For a spurious prefetch fault, return RET_PF_SPURIOUS directly only when the PFN is not changed. When the PFN changes, fall through to the sequence of drop_spte(), warn of the PFN change, make_spte(), TLB flush, rmap_add(). Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013310.5781-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
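A hedged, simplified sketch of the relevant mmu_set_spte() branch (condensed from the description above, not the verbatim diff):

    if (is_shadow_present_pte(*sptep)) {
            if (prefetch && is_last_spte(*sptep, level) &&
                pfn == spte_to_pfn(*sptep))
                    return RET_PF_SPURIOUS;  /* PFN unchanged: truly spurious */

            if (pfn != spte_to_pfn(*sptep)) {
                    WARN_ON_ONCE(1);         /* PFN of a present SPTE changed */
                    drop_spte(vcpu->kvm, sptep);
                    flush = true;            /* then make_spte(), TLB flush, rmap_add() */
            }
    }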
2025-04-28KVM: x86/tdp_mmu: WARN if PFN changes for spurious faultsYan Zhao
Add a WARN() to assert that KVM does _not_ change the PFN of a shadow-present SPTE during spurious fault handling. KVM should _never_ change the PFN of a shadow-present SPTE and TDP MMU already BUG()s on this. However, spurious faults just return early before the existing BUG() could be hit. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013238.5732-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: x86/tdp_mmu: Merge prefetch and access checks for spurious faultsYan Zhao
Combine prefetch and is_access_allowed() checks into a unified path to detect spurious faults, since both cases now share identical logic. No functional changes. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013210.5701-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch faultYan Zhao
Instead of simply treating a prefetch fault as spurious when there's a shadow-present old SPTE, further check if the old SPTE is leaf to determine if a prefetch fault is spurious. It's not reasonable to treat a prefetch fault as spurious when there's a shadow-present non-leaf SPTE without a corresponding shadow-present leaf SPTE. e.g., in the following sequence, a prefetch fault should not be considered spurious: 1. add a memslot with size 4K 2. prefault GPA A in the memslot 3. delete the memslot (zap all disabled) 4. re-add the memslot with size 2M 5. prefault GPA A again. In step 5, the prefetch fault attempts to install a 2M huge entry. Since step 3 zaps the leaf SPTE for GPA A while keeping the non-leaf SPTE, the leaf entry will remain empty after step 5 if the prefetch fault is regarded as spurious due to a shadow-present non-leaf SPTE. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013111.5648-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
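The strengthened check, as a sketch (where "spte" is the old SPTE value and "level" its level in the paging structure):

    /*
     * Both shadow-present *and* leaf are required for "spurious"; a present
     * non-leaf SPTE says nothing about whether the final mapping exists.
     */
    if (fault->prefetch && is_shadow_present_pte(spte) &&
        is_last_spte(spte, level))
            return RET_PF_SPURIOUS;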
2025-04-07Merge branch 'kvm-tdx-initial' into HEADPaolo Bonzini
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode). The series is in turn split into multiple sub-series, each with a separate merge commit: - Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs. - MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT. - vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state. - Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP. - Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode") - Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMUSean Christopherson
Wrap the TDP MMU page counter in CONFIG_KVM_PROVE_MMU so that the sanity check is omitted from production builds, and more importantly to remove the atomic accesses to account pages. A one-off memory leak in production is relatively uninteresting, and a WARN_ON won't help mitigate a systemic issue; it's as much about helping triage memory leaks as it is about detecting them in the first place, and doesn't magically stop the leaks. I.e. production environments will be quite sad if a severe KVM bug escapes, regardless of whether or not KVM WARNs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250315023448.2358456-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
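A sketch of the change: the accounting and its atomics compile away unless CONFIG_KVM_PROVE_MMU is set (field name per the changelog; placement illustrative):

    #ifdef CONFIG_KVM_PROVE_MMU
            atomic64_inc(&kvm->arch.tdp_mmu_pages);   /* on allocating a TDP MMU page */
    #endif

    #ifdef CONFIG_KVM_PROVE_MMU
            atomic64_dec(&kvm->arch.tdp_mmu_pages);   /* on freeing it */
    #endif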
2025-03-19Merge tag 'kvm-x86-vmx-6.15' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.15 - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1. - Pass XFD_ERR as a pseudo-payload when injecting #NM as a preparatory step for upcoming FRED virtualization support. - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits as a general cleanup, and in anticipation of adding support for emulating Mode-Based Execution (MBEC). - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest state while KVM is in the middle of emulating nested VM-Enter. - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs in anticipation of adding sanity checks for secondary exit controls (the primary field is out of bits).
2025-03-19Merge tag 'kvm-x86-mmu-6.15' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86/mmu changes for 6.15 Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where "fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs, e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting in memory. For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and KVM doesn't need to access any metadata to age SPTEs. For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn, acquire the rmap's spinlock with read-only permissions, which allows hardening and optimizing the locking and aging, e.g. locking an rmap for write requires mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple concurrent readers aren't supported. To avoid forcing all SPTE updates to use atomic operations (clearing the Accessed bit out of mmu_lock makes it inherently volatile), rework and rename spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude the Accessed bit. KVM (and mm/) already tolerates false positives/negatives for Accessed information, and all testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information.
2025-03-14KVM: x86: remove shadow_memtype_maskPaolo Bonzini
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled, if shadow_memtype_mask is zero. As long as vmx_get_mt_mask is not called for the shadow paging case, there is no need to consult shadow_memtype_mask and it can be removed altogether. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PATYan Zhao
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have KVM ignore guest PAT when this quirk is enabled. On AMD platforms, KVM always honors guest PAT. On Intel however there are two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop is not supported. Second, UC access on certain Intel platforms can be very slow[1] and honoring guest PAT on those platforms may break some old guests that accidentally specify video RAM as UC. Those old guests may never expect the slowness since KVM always forced WB previously. See [2]. So, introduce a quirk that KVM can enable by default on all Intel platforms to avoid breaking old unmodifiable guests. Newer userspace can disable this quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail if self-snoop is not supported, i.e. if KVM cannot obey the wish. The quirk is a no-op on AMD and also if any assigned devices have non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is another example of a quirk that is sometimes automatically disabled. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1] Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2] Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com> [Use supported_quirks/inapplicable_quirks to support both AMD and no-self-snoop cases, as well as to remove the shadow_memtype_mask check from kvm_mmu_may_ignore_guest_pat(). - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
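Quirks are disabled from userspace via KVM_CAP_DISABLE_QUIRKS2 on the VM fd; a minimal sketch:

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Opt out of the quirk so KVM honors guest PAT. */
    static int honor_guest_pat(int vm_fd)
    {
            struct kvm_enable_cap cap = {
                    .cap  = KVM_CAP_DISABLE_QUIRKS2,
                    .args = { KVM_X86_QUIRK_IGNORE_GUEST_PAT },
            };

            /* Fails if self-snoop is unsupported, per the changelog above. */
            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }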
2025-03-14KVM: x86: Make cpu_dirty_log_size a per-VM valueYan Zhao
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled. Do not set it for TDs. Until now, cpu_dirty_log_size was a system-wide value that is used for all VMs and is set to the PML buffer size when PML was enabled in VMX. However, PML is not currently supported for TDs, though PML remains available for normal VMs as long as the feature is supported by hardware and enabled in VMX. Making cpu_dirty_log_size a per-VM value allows it to be the PML buffer size for normal VMs and 0 for TDs. This allows functions like kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to determine if PML is supported, in order to kick off vCPUs or request them to update CPU dirty logging status (turn on/off PML in VMCS). This fixes an issue first reported in [1], where QEMU attaches an emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES still works if the corresponding memslot does not have the KVM_MEM_GUEST_MEMFD flag. KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx struct for a TDX VM. Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com> Reported-by: Pedro Principeza <pedro.principeza@canonical.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Closes: https://github.com/canonical/tdx/issues/202 Link: https://github.com/canonical/tdx/issues/202 [1] Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86/mmu: Add parameter "kvm" to kvm_mmu_page_ad_need_write_protect()Yan Zhao
Add a parameter "kvm" to kvm_mmu_page_ad_need_write_protect() and its caller tdp_mmu_need_write_protect(). This is preparation for making cpu_dirty_log_size a per-VM value rather than a system-wide value. No functional change intended. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: Add parameter "kvm" to kvm_cpu_dirty_log_size() and its callersYan Zhao
Add a parameter "kvm" to kvm_cpu_dirty_log_size() and to its callers: kvm_dirty_ring_get_rsvd_entries() and kvm_dirty_ring_alloc(). This is preparation for making cpu_dirty_log_size a per-VM value rather than a system-wide value. No functional change intended. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86/mmu: Export kvm_tdp_map_page()Rick Edgecombe
In future changes, CoCo-specific code will need to call kvm_tdp_map_page() from within its respective gmem_post_populate() callbacks. Export it so this can be done from vendor-specific code. Since kvm_mmu_reload() will be needed for this operation, export its callee kvm_mmu_load() as well. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073827.22270-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM deadYan Zhao
Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise, kvm_tdp_map_page() may get stuck in the kernel loop when there's only one vCPU in the VM (or if the other vCPUs are not executing ioctls), even if fatal errors have occurred. kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY, regardless of the specific error code from tdp_mmu_set_spte_atomic(), tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable in general cases where the only possible error code from these functions is -EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL errors. Since this -EIO error is also a fatal error, check for a dead VM in kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending. The error -EIO is uncommon and has not been observed in real workloads. Currently, it is only hypothetically triggered by bypassing the real SEAMCALL and faking an error in the SEAMCALL wrapper. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250220102728.24546-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
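A hedged sketch of the retry loop in kvm_tdp_map_page() with the new check (simplified; exact call arguments may differ):

    do {
            if (signal_pending(current))
                    return -EINTR;

            /* New: a dead VM (e.g. after a fatal SEAMCALL error) can't progress. */
            if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
                    return -EIO;

            cond_resched();
            r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
    } while (r == RET_PF_RETRY);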
2025-03-14KVM: TDX: Set per-VM shadow_mmio_value to 0Isaku Yamahata
Set per-VM shadow_mmio_value to 0 for TDX. With enable_mmio_caching on, KVM installs MMIO SPTEs for TDs. To correctly configure MMIO SPTEs, TDX requires the per-VM shadow_mmio_value to be set to 0. This is necessary to override the default value of the suppress VE bit in the SPTE, which is 1, and to ensure value 0 in RWX bits. For MMIO SPTE, the spte value changes as follows: 1. initial value (suppress VE bit is set) 2. Guest issues MMIO and triggers EPT violation 3. KVM updates SPTE value to MMIO value (suppress VE bit is cleared) 4. Guest MMIO resumes. It triggers VE exception in guest TD 5. Guest VE handler issues TDG.VP.VMCALL<MMIO> 6. KVM handles MMIO 7. Guest VE handler resumes its execution after MMIO instruction Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073743.22214-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86/mmu: Add setter for shadow_mmio_valueIsaku Yamahata
Future changes will want to set shadow_mmio_value from TDX code. Add a setter helper with a name that makes more sense from that context. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [split into new patch] Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073730.22200-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
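The setter itself is trivial; a sketch based on the changelog (TDX then calls it with 0, per the "Set per-VM shadow_mmio_value to 0" commit above):

    void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
    {
            kvm->arch.shadow_mmio_value = mmio_value;
    }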
2025-03-14KVM: TDX: Require TDP MMU, mmio caching and EPT A/D bits for TDXIsaku Yamahata
Disable TDX support when the TDP MMU, MMIO caching, or EPT A/D bits aren't supported. As the TDP MMU is now more mainstream than the legacy MMU, legacy MMU support for TDX isn't implemented. TDX requires KVM MMIO caching. Without MMIO caching, KVM would fall back to MMIO emulation without installing SPTEs for MMIOs. However, a TDX guest is protected, and KVM would hit errors when trying to emulate MMIOs for a TDX guest during instruction decoding. So, a TDX guest relies on SPTEs being installed for MMIOs, with no RWX bits set and the suppress-VE bit unset, to inject a #VE into the guest. The TDX guest then issues a TDVMCALL in the #VE handler to perform instruction decoding and have the host do MMIO emulation. TDX also relies on EPT A/D bits, which have been supported in all CPUs since Haswell. Relying on them avoids RWX bits being masked out in the mirror page table for prefaulted entries. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- Requested by Sean at [1]. [1] https://lore.kernel.org/kvm/Zva4aORxE9ljlMNe@google.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86/mmu: Do not enable page track for TD guestYan Zhao
Fail kvm_page_track_write_tracking_enabled() if the VM type is TDX, so that external page-track users fail in kvm_page_track_register_notifier(), since TDX does not support write protection and hence page tracking. No need to fail KVM-internal users of page tracking (i.e. for shadow pages), because TDX always runs with EPT enabled and currently the TDX module does not emulate and send VMLAUNCH/VMRESUME VM-Exits to the VMM. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Cc: Yuan Yao <yuan.yao@linux.intel.com> Message-ID: <20241112073515.22028-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMUIsaku Yamahata
Export a function to walk down the TDP without modifying it and simply check if a GPA is mapped. Future changes will support pre-populating TDX private memory. In order to implement this KVM will need to check if a given GFN is already pre-populated in the mirrored EPT. [1] There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within the KVM MMU that almost does what is required. However, to make sense of the results, MMU internal PTE helpers are needed. Refactor the code to provide a helper that can be used outside of the KVM MMU code. Refactoring the KVM page fault handler to support this lookup usage was also considered, but it was an awkward fit. kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini. Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1] Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073457.22011-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
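A usage sketch of the exported helper (the error choice here is illustrative, not from the patch):

    /* Skip pre-population if the GPA already has a mapping in the mirror EPT. */
    if (kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa))
            return -EEXIST;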
2025-03-14KVM: x86/mmu: Implement memslot deletion for TDXRick Edgecombe
Update the attr_filter field to zap both private and shared mappings for TDX when a memslot is deleted. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073426.21997-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86/mmu: Take guest PA into consideration when calculating TDP levelXiaoyao Li
For TDX, the maxpa (CPUID.0x80000008.EAX[7:0]) is fixed as native and the max_gpa (CPUID.0x80000008.EAX[23:16]) is configurable and used to configure the EPT level and GPAW. Use max_gpa to determine the TDP level. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-01kvm: retry nx_huge_page_recovery_thread creationKeith Busch
A VMM may send a non-fatal signal to its threads, including vCPU tasks, at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU task receives the signal while it's trying to spawn the huge page recovery vhost task, then KVM_RUN will fail due to copy_process() returning -ERESTARTNOINTR. Rework call_once() to mark the call complete if and only if the called function succeeds, and plumb the function's true error code back to the call_once() invoker. This provides userspace with the correct, non-fatal error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows a subsequent KVM_RUN to succeed by virtue of retrying creation of the NX huge page task. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> [implemented the kvm user side] Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-3-kbusch@meta.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
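A hedged sketch of the reworked call_once() (state names and locking approximate include/linux/call_once.h; the key property is that the once is only marked complete on success):

    static inline int call_once(struct once *once, int (*cb)(struct once *))
    {
            int r;

            /* Fast path: pairs with atomic_set_release() below. */
            if (atomic_read_acquire(&once->state) == ONCE_COMPLETED)
                    return 0;

            guard(mutex)(&once->lock);
            if (atomic_read(&once->state) == ONCE_COMPLETED)
                    return 0;

            r = cb(once);
            if (!r)  /* complete if and only if the callback succeeded */
                    atomic_set_release(&once->state, ONCE_COMPLETED);
            return r;  /* e.g. -ERESTARTNOINTR propagates to the invoker */
    }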
2025-03-01vhost: return task creation error instead of NULLKeith Busch
Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-28KVM: x86/tdp_mmu: Remove tdp_mmu_for_each_pte()Nikolay Borisov
That macro acts as a different name for for_each_tdp_pte; apart from adding cognitive load, it doesn't bring any value. Let's remove it. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20250226074131.312565-1-nik.borisov@suse.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bitsSean Christopherson
Define independent macros for the RWX protection bits that are enumerated via EXIT_QUALIFICATION for EPT Violations, and tie them to the RWX bits in EPT entries via compile-time asserts. Piggybacking the EPTE defines works for now, but it creates holes in the EPT_VIOLATION_xxx macros and will cause headaches if/when KVM emulates Mode-Based Execution (MBEC), or any other feature that introduces additional protection information. Opportunistically rename EPT_VIOLATION_RWX_MASK to EPT_VIOLATION_PROT_MASK so that it doesn't become stale if/when MBEC support is added. No functional change intended. Cc: Jon Kohler <jon@nutanix.com> Cc: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20250227000705.3199706-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Walk rmaps (shadow MMU) without holding mmu_lock when aging gfnsSean Christopherson
Convert the shadow MMU to use per-rmap locking instead of the per-VM mmu_lock to protect rmaps when aging SPTEs. When A/D bits are enabled, it is safe to simply clear the Accessed bits, i.e. KVM just needs to ensure the parent page table isn't freed. The less obvious case is marking SPTEs for access tracking in the non-A/D case (for EPT only). Because aging a gfn means making the SPTE not-present, KVM needs to play nice with the case where the CPU has TLB entries for a SPTE that is not-present in memory. For example, when doing dirty tracking, if KVM encounters a non-present shadow accessed SPTE, KVM must know to do a TLB invalidation. Fortunately, KVM already provides (and relies upon) the necessary functionality. E.g. KVM doesn't flush TLBs when aging pages (even in the clear_flush_young() case), and when harvesting dirty bitmaps, KVM flushes based on the dirty bitmaps, not on SPTEs. Co-developed-by: James Houghton <jthoughton@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-12-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Add support for lockless walks of rmap SPTEsSean Christopherson
Add a lockless version of for_each_rmap_spte(), which is pretty much the same as the normal version, except that it doesn't BUG() the host if a non-present SPTE is encountered. When mmu_lock is held, it should be impossible for a different task to zap a SPTE, _and_ zapped SPTEs must be removed from their rmap chain prior to dropping mmu_lock. Thus, the normal walker BUG()s if a non-present SPTE is encountered as something is wildly broken. When walking rmaps without holding mmu_lock, the SPTEs pointed at by the rmap chain can be zapped/dropped, and so a lockless walk can observe a non-present SPTE if it runs concurrently with a different operation that is zapping SPTEs. Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-11-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lockSean Christopherson
Steal another bit from rmap entries (which are word aligned pointers, i.e. have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use the bit to implement a *very* rudimentary per-rmap spinlock. The only anticipated usage of the lock outside of mmu_lock is for aging gfns, and collisions between aging and other MMU rmap operations are quite rare, e.g. unless userspace is being silly and aging a tiny range over and over in a tight loop, time between contention when aging an actively running VM is O(seconds). In short, a more sophisticated locking scheme shouldn't be necessary. Note, the lock only protects the rmap structure itself; SPTEs that are pointed at by a locked rmap can still be modified and zapped by another task (KVM drops/zaps SPTEs before deleting the rmap entries). Co-developed-by: James Houghton <jthoughton@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
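A hedged, simplified sketch of the per-rmap bit spinlock (names and exact barriers approximate the patch):

    #define KVM_RMAP_LOCKED BIT(1)

    static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
    {
            unsigned long old_val, new_val;

            old_val = READ_ONCE(rmap_head->val);
            do {
                    /* Assume the LOCKED bit is clear, then try to set it. */
                    old_val &= ~KVM_RMAP_LOCKED;
                    new_val = old_val | KVM_RMAP_LOCKED;
            } while (!atomic_long_try_cmpxchg_acquire(
                             (atomic_long_t *)&rmap_head->val, &old_val, new_val));

            return old_val;   /* the pre-lock value, sans lock bit */
    }

    static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
                                unsigned long new_val)
    {
            WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
            /* The release store also drops the lock bit. */
            atomic_long_set_release((atomic_long_t *)&rmap_head->val, new_val);
    }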
2025-02-14KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lockSean Christopherson
Refactor the pte_list and rmap code to always read and write rmap_head->val exactly once, e.g. by collecting changes in a local variable and then propagating those changes back to rmap_head->val as appropriate. This will allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into the rmap value alongside the MANY bit. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-9-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0James Houghton
When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and only if the VM has at least one shadow page, as opposed to checking if the VM has rmaps. Checking for rmaps will effectively yield a false positive if the VM ran nested TDP VMs in the past, but is not currently doing so. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-8-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as youngJames Houghton
Reorder the processing of the TDP MMU versus the shadow MMU when aging SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP MMU reports that the page is young, i.e. completely avoid taking mmu_lock if the TDP MMU SPTE is young. Swap the order for the test-and-age helper as well for consistency. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-7-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Age TDP MMU SPTEs without holding mmu_lockSean Christopherson
Walk the TDP MMU in an RCU read-side critical section without holding mmu_lock when harvesting and potentially updating age information on TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots, and do all SPTE aging with atomic updates; while clobbering Accessed information is ok, KVM must not corrupt other bits, e.g. must not drop a Dirty or Writable bit when making a SPTE young. If updating a SPTE to mark it for access tracking fails, leave it as is and treat it as if it were young. If the SPTE is being actively modified, it is most likely young. Acquire and release mmu_lock for write when harvesting age information from the shadow MMU, as the shadow MMU doesn't yet support aging outside of mmu_lock. Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Always update A/D-disabled SPTEs atomicallySean Christopherson
In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them for access-tracking outside of mmu_lock. Coupled with restoring access-tracked SPTEs in the fast page fault handler, the end result is that A/D-disabled SPTEs will be volatile at all times. Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/all/Z60bhK96JnKIgqZQ@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Don't force atomic update if only the Accessed bit is volatileJames Houghton
Don't force SPTE modifications to be done atomically if the only volatile bit in the SPTE is the Accessed bit. KVM and the primary MMU tolerate stale aging state, and the probability of an Accessed bit A/D assist being clobbered *and* affecting aging is likely far lower than the probability of consuming stale information due to not flushing TLBs when aging. Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better capture the nature of the helper. Opportunistically do s/write/update on the TDP MMU wrapper, as it's not simply the "write" that needs to be done atomically, it's the entire update, i.e. the entire read-modify-write operation needs to be done atomically so that KVM has an accurate view of the old SPTE. Leave kvm_tdp_mmu_write_spte_atomic() as is. While the name is imperfect, it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with kvm_tdp_mmu_read_spte(). And renaming all of those isn't obviously a net positive, and would require significant churn. Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Factor out spte atomic bit clearing routineJames Houghton
This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a follow-up patch to enable lockless Accessed bit clearing. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-4-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
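The helper itself is small; a sketch consistent with the changelog (atomic64_fetch_and() clears the mask bits and returns the old SPTE value, with no cmpxchg loop needed):

    static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
    {
            atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);

            return (u64)atomic64_fetch_and(~mask, sptep_atomic);
    }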
2025-02-12KVM: nSVM: Enter guest mode before initializing nested NPT MMUSean Christopherson
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest mode prior to initializing the MMU for nested NPT so that guest_mode is set in the MMU's role. KVM's model is that all L2 MMUs are tagged with guest_mode, as the behavior of hypervisor MMUs tends to be significantly different than kernel MMUs. Practically speaking, the bug is relatively benign, as KVM only directly queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths that are optimizations (mmu_page_zap_pte() and shadow_mmu_try_split_huge_pages()). And while the role is incorporated into shadow page usage, because nested NPT requires KVM to be using NPT for L1, reusing shadow pages across L1 and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs will have direct=0. Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning of the flow instead of forcing guest_mode in the MMU, as nothing in nested_vmcb02_prepare_control() between the old and new locations touches TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present inconsistent vCPU state to the MMU. Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control") Cc: stable@vger.kernel.org Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-04KVM: x86/mmu: Ensure NX huge page recovery thread is alive before wakingSean Christopherson
When waking a VM's NX huge page recovery thread, ensure the thread is actually alive before trying to wake it. Now that the thread is spawned on-demand during KVM_RUN, a VM without a recovery thread is reachable via the related module params. BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:vhost_task_wake+0x5/0x10 Call Trace: <TASK> set_nx_huge_pages+0xcc/0x1e0 [kvm] param_attr_store+0x8a/0xd0 module_attr_store+0x1a/0x30 kernfs_fop_write_iter+0x12f/0x1e0 vfs_write+0x233/0x3e0 ksys_write+0x60/0xd0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f3b52710104 </TASK> Modules linked in: kvm_intel kvm CR2: 0000000000000040 Fixes: 931656b9e2ff ("kvm: defer huge page recovery vhost task to later") Cc: stable@vger.kernel.org Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250124234623.3609069-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
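A sketch of the fix's shape (the helper name here is illustrative; the essential point is the NULL check before vhost_task_wake()):

    static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
    {
            /*
             * The task is spawned on-demand during KVM_RUN, so it may still
             * be NULL when the related module params are written.
             */
            struct vhost_task *task =
                    READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);

            if (task)
                    vhost_task_wake(task);
    }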
2025-01-24kvm: defer huge page recovery vhost task to laterKeith Busch
Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once, except it is actually usable because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-20Merge branch 'kvm-mirror-page-tables' into HEADPaolo Bonzini
As part of enabling TDX virtual machines, support separation of private/shared EPT into separate roots. Confidential computing solutions almost invariably have concepts of private and shared memory, but they may differ a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared memory in the other. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs. As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs.

There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory:

  external EPT - Hidden within the TDX module, modified via TDX module calls.
  mirror EPT   - Bookkeeping tree used as an optimization by KVM, not used by the processor.
  direct EPT   - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM.

Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the same operations on the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the groundwork for doing this. In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE can no longer be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting the operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks.

Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast(). For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and private. These operations must not zap any private memory that is in use by the guest. This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations. Thus a TDX VM will only zap private memory on memslot deletion and on conversion between private and shared memory, which is triggered by the guest. To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example:

  Memslot deletion           - Private and shared
  MMU notifier based zapping - Shared only
  Conversion to shared       - Private only
  Conversion to private      - Shared only

Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.
2025-01-20Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.14: - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature. - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM. - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively. - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20Merge tag 'kvm-x86-mmu-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.14: - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
2025-01-15KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()Yan Zhao
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID) instead of 1 in kvm_mmu_page_fault(). The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e., npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(), kvm_handle_page_fault()). They either check if the return value is > 0 (as in npf_interception()) or pass it further to vcpu_run() to decide whether to break out of the kernel loop and return to the user when r <= 0. Therefore, returning any positive value is equivalent to returning 1. Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure a positive return value. This is a preparation to allow TDX's EPT violation handler to check the RET_PF* value and retry internally for RET_PF_RETRY. No functional changes are intended. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
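A hedged sketch of the caller-side contract described above (kvm_mmu_page_fault()'s signature is real; the mapping of return values is condensed from the changelog):

    r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
    if (r < 0)
            return r;   /* fatal error, propagate to userspace */
    if (r > 0)
            return 1;   /* RET_PF_RETRY/FIXED/SPURIOUS etc.: resume the guest */
    return 0;           /* r == 0: exit to userspace to complete the access */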