summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2024-11-04KVM: x86: Move kvm_set_apic_base() implementation to lapic.c (from x86.c)Sean Christopherson
Move kvm_set_apic_base() to lapic.c so that the bulk of KVM's local APIC code resides in lapic.c, regardless of whether or not KVM is emulating the local APIC in-kernel. This will also allow making various helpers visible only to lapic.c. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-6-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Inline kvm_get_apic_mode() in lapic.hSean Christopherson
Inline kvm_get_apic_mode() in lapic.h to avoid a CALL+RET as well as an export. The underlying kvm_apic_mode() helper is public information, i.e. there is no state/information that needs to be hidden from vendor modules. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-5-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Get vcpu->arch.apic_base directly and drop kvm_get_apic_base()Sean Christopherson
Access KVM's emulated APIC base MSR value directly instead of bouncing through a helper, as there is no reason to add a layer of indirection, and there are other MSRs with a "set" but no "get", e.g. EFER. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-4-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Drop superfluous kvm_lapic_set_base() call when setting APIC stateSean Christopherson
Now that kvm_lapic_set_base() does nothing if the "new" APIC base MSR is the same as the current value, drop the kvm_lapic_set_base() call in the KVM_SET_LAPIC flow that passes in the current value, as it too does nothing. Note, the purpose of invoking kvm_lapic_set_base() was purely to set apic->base_address (see commit 5dbc8f3fed0b ("KVM: use kvm_lapic_set_base() to change apic_base")). And there is no evidence that explicitly setting apic->base_address in KVM_SET_LAPIC ever had any functional impact; even in the original commit 96ad2cc61324 ("KVM: in-kernel LAPIC save and restore support"), all flows that set apic_base also set apic->base_address to the same address. E.g. svm_create_vcpu() did open code a write to apic_base, svm->vcpu.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; but it also called kvm_create_lapic() when irqchip_in_kernel() is true. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-3-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Short-circuit all kvm_lapic_set_base() if MSR value isn't changingSean Christopherson
Do nothing in kvm_lapic_set_base() if the APIC base MSR value is the same as the current value. All flows except the handling of the base address explicitly take effect if and only if relevant bits are changing. For the base address, invoking kvm_lapic_set_base() before KVM initializes the base to APIC_DEFAULT_PHYS_BASE during vCPU RESET would be a KVM bug, i.e. KVM _must_ initialize apic->base_address before exposing the vCPU (to userspace or KVM at-large). Note, the inhibit is intended to be set if the base address is _changed_ from the default, i.e. is also covered by the RESET behavior. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-2-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Drop per-VM zapped_obsolete_pages listVipin Sharma
Drop the per-VM zapped_obsolete_pages list now that the usage from the defunct mmu_shrinker is gone, and instead use a local list to track pages in kvm_zap_obsolete_pages(), the sole remaining user of zapped_obsolete_pages. Opportunistically add an assertion to verify and document that slots_lock must be held, i.e. that there can only be one active instance of kvm_zap_obsolete_pages() at any given time, and by doing so also prove that using a local list instead of a per-VM list doesn't change any functionality (beyond trivialities like list initialization). Signed-off-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Remove KVM's MMU shrinkerVipin Sharma
Remove KVM's MMU shrinker and (almost) all of its related code, as the current implementation is very disruptive to VMs (if it ever runs), without providing any meaningful benefit[1]. Alternatively, KVM could repurpose its shrinker, e.g. to reclaim pages from the per-vCPU caches[2], but given that no one has complained about lack of TDP MMU support for the shrinker in the 3+ years since the TDP MMU was enabled by default, it's safe to say that there is likely no real use case for initiating reclaim of KVM's page tables from the shrinker. And while clever/cute, reclaiming the per-vCPU caches doesn't scale the same way that reclaiming in-use page table pages does. E.g. the amount of memory being used by a VM doesn't always directly correlate with the number vCPUs, and even when it does, reclaiming a few pages from per-vCPU caches likely won't make much of a dent in the VM's total memory usage, especially for VMs with huge amounts of memory. Lastly, if it turns out that there is a strong use case for dropping the per-vCPU caches, re-introducing the shrinker registration is trivial compared to the complexity of actually reclaiming pages from the caches. [1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com [2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com [sean: keep zapped_obsolete_pages for now, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: WARN if huge page recovery triggered during dirty loggingDavid Matlack
WARN and bail out of recover_huge_pages_range() if dirty logging is enabled. KVM shouldn't be recovering huge pages during dirty logging anyway, since KVM needs to track writes at 4KiB. However it's not out of the possibility that that changes in the future. If KVM wants to recover huge pages during dirty logging, make_huge_spte() must be updated to write-protect the new huge page mapping. Otherwise, writes through the newly recovered huge page mapping will not be tracked. Note that this potential risk did not exist back when KVM zapped to recover huge page mappings, since subsequent accesses would just be faulted in at PG_LEVEL_4K if dirty logging was enabled. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-7-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte()David Matlack
Rename make_huge_page_split_spte() to make_small_spte(). This ensures that the usage of "small_spte" and "huge_spte" are consistent between make_huge_spte() and make_small_spte(). This should also reduce some confusion as make_huge_page_split_spte() almost reads like it will create a huge SPTE, when in fact it is creating a small SPTE to split the huge SPTE. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-6-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zappingDavid Matlack
Recover TDP MMU huge page mappings in-place instead of zapping them when dirty logging is disabled, and rename functions that recover huge page mappings when dirty logging is disabled to move away from the "zap collapsible spte" terminology. Before KVM flushes TLBs, guest accesses may be translated through either the (stale) small SPTE or the (new) huge SPTE. This is already possible when KVM is doing eager page splitting (where TLB flushes are also batched), and when vCPUs are faulting in huge mappings (where TLBs are flushed after the new huge SPTE is installed). Recovering huge pages reduces the number of page faults when dirty logging is disabled: $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g Before: 393,599 kvm:kvm_page_fault After: 262,575 kvm:kvm_page_fault vCPU throughput and the latency of disabling dirty-logging are about equal compared to zapping, but avoiding faults can be beneficial to remove vCPU jitter in extreme scenarios. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Refactor TDP MMU iter need resched checkSean Christopherson
Refactor the TDP MMU iterator "need resched" checks into a helper function so they can be called from a different code path in a subsequent commit. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-4-dmatlack@google.com [sean: rebase on a swapped order of checks] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Demote the WARN on yielded in xxx_cond_resched() to ↵Sean Christopherson
KVM_MMU_WARN_ON Convert the WARN in tdp_mmu_iter_cond_resched() that the iterator hasn't already yielded to a KVM_MMU_WARN_ON() so the code is compiled out for production kernels (assuming production kernels disable KVM_PROVE_MMU). Checking for a needed reschedule is a hot path, and KVM sanity checks iter->yielded in several other less-hot paths, i.e. the odds of KVM not flagging that something went sideways are quite low. Furthermore, the odds of KVM not noticing *and* the WARN detecting something worth investigating are even lower. Link: https://lore.kernel.org/r/20241031170633.1502783-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Check yielded_gfn for forward progress iff resched is neededSean Christopherson
Swap the order of the checks in tdp_mmu_iter_cond_resched() so that KVM checks to see if a resched is needed _before_ checking to see if yielding must be disallowed to guarantee forward progress. Iterating over TDP MMU SPTEs is a hot path, e.g. tearing down a root can touch millions of SPTEs, and not needing to reschedule is by far the common case. On the other hand, disallowing yielding because forward progress has not been made is a very rare case. Returning early for the common case (no resched), effectively reduces the number of checks from 2 to 1 for the common case, and should make the code slightly more predictable for the CPU. To resolve a weird conundrum where the forward progress check currently returns false, but the need resched check subtly returns iter->yielded, which _should_ be false (enforced by a WARN), return false unconditionally (which might also help make the sequence more predictable). If KVM has a bug where iter->yielded is left danging, continuing to yield is neither right nor wrong, it was simply an artifact of how the original code was written. Unconditionally returning false when yielding is unnecessary or unwanted will also allow extracting the "should resched" logic to a separate helper in a future patch. Cc: David Matlack <dmatlack@google.com> Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20241031170633.1502783-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-03fdget(), trivial conversionsAl Viro
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-01KVM: x86: Remove ordering check b/w MSR_PLATFORM_INFO and MISC_FEATURES_ENABLESSean Christopherson
Drop KVM's odd restriction that disallows clearing CPUID_FAULT in MSR_PLATFORM_INFO if CPL>0 CPUID faulting is enabled in MSR_MISC_FEATURES_ENABLES. KVM generally doesn't require specific ordering when userspace sets MSRs, and the completely arbitrary order of MSRs in emulated_msrs_all means that a userspace that uses KVM's list verbatim could run afoul of the check. Dropping the restriction obviously means that userspace could stuff a nonsensical vCPU model, but that's the case all over KVM. KVM typically restricts userspace MSR writes only when it makes things easier for KVM and/or userspace. Link: https://lore.kernel.org/r/20240802185511.305849-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Reject userspace attempts to access ARCH_CAPABILITIES w/o supportSean Christopherson
Reject userspace accesses to ARCH_CAPABILITIES if the MSR isn't supposed to exist, according to guest CPUID. However, "reject" accesses with KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0' are ignored if KVM advertised support ARCH_CAPABILITIES. KVM's ABI is that userspace must set guest CPUID prior to setting MSRs, and that setting MSRs that aren't supposed exist is disallowed (modulo the '0' exemption). Link: https://lore.kernel.org/r/20240802185511.305849-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: VMX: Remove restriction that PMU version > 0 for PERF_CAPABILITIESSean Christopherson
Drop the restriction that the PMU version is non-zero when handling writes to PERF_CAPABILITIES now that KVM unconditionally checks for PDCM support. Link: https://lore.kernel.org/r/20240802185511.305849-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Reject userspace attempts to access PERF_CAPABILITIES w/o PDCMSean Christopherson
Reject userspace accesses to PERF_CAPABILITIES if PDCM isn't set in guest CPUID, i.e. if the vCPU doesn't actually have PERF_CAPABILITIES. But! Do so via KVM_MSR_RET_UNSUPPORTED, so that reads get '0' and writes of '0' are ignored if KVM advertised support PERF_CAPABILITIES. KVM's ABI is that userspace must set guest CPUID prior to setting MSRs, and that setting MSRs that aren't supposed exist is disallowed (modulo the '0' exemption). Link: https://lore.kernel.org/r/20240802185511.305849-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Quirk initialization of feature MSRs to KVM's max configurationSean Christopherson
Add a quirk to control KVM's misguided initialization of select feature MSRs to KVM's max configuration, as enabling features by default violates KVM's approach of letting userspace own the vCPU model, and is actively problematic for MSRs that are conditionally supported, as the vCPU will end up with an MSR value that userspace can't restore. E.g. if the vCPU is configured with PDCM=0, userspace will save and attempt to restore a non-zero PERF_CAPABILITIES, thanks to KVM's meddling. Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Disallow changing MSR_PLATFORM_INFO after vCPU has runSean Christopherson
Tag MSR_PLATFORM_INFO as a feature MSR (because it is), i.e. disallow it from being modified after the vCPU has run. To make KVM's selftest compliant, simply delete the userspace MSR write that restores KVM's original value at the end of the test. Verifying that userspace can write back what it originally read is uninteresting in this particular case, because KVM doesn't enforce _any_ bits in the MSR, i.e. userspace should be able to write any arbitrary value. Link: https://lore.kernel.org/r/20240802185511.305849-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Co-locate initialization of feature MSRs in kvm_arch_vcpu_create()Sean Christopherson
Bunch all of the feature MSR initialization in kvm_arch_vcpu_create() so that it can be easily quirked in a future patch. No functional change intended. Link: https://lore.kernel.org/r/20240802185511.305849-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: nVMX: fix canonical check of vmcs12 HOST_RIPMaxim Levitsky
HOST_RIP canonical check should check the L1 of CR4.LA57 stored in the vmcs12 rather than the current L1's because it is legal to change the CR4.LA57 value during VM exit from L2 to L1. This is a theoretical bug though, because it is highly unlikely that a VM exit will change the CR4.LA57 from the value it had on VM entry. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-5-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: model canonical checks more preciselyMaxim Levitsky
As a result of a recent investigation, it was determined that x86 CPUs which support 5-level paging, don't always respect CR4.LA57 when doing canonical checks. In particular: 1. MSRs which contain a linear address, allow full 57-bitcanonical address regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE. 2. All hidden segment bases and GDT/IDT bases also behave like MSRs. This means that full 57-bit canonical address can be loaded to them regardless of CR4.LA57, both using MSRS (e.g GS_BASE) and instructions (e.g LGDT). 3. TLB invalidation instructions also allow the user to use full 57-bit address regardless of the CR4.LA57. Finally, it must be noted that the CPU doesn't prevent the user from disabling 5-level paging, even when the full 57-bit canonical address is present in one of the registers mentioned above (e.g GDT base). In fact, this can happen without any userspace help, when the CPU enters SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE are left to contain a non-canonical address in regard to the new mode. Since most of the affected MSRs and all segment bases can be read and written freely by the guest without any KVM intervention, this patch makes the emulator closely follow hardware behavior, which means that the emulator doesn't take in the account the guest CPUID support for 5-level paging, and only takes in the account the host CPU support. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Add X86EMUL_F_MSR and X86EMUL_F_DT_LOAD to aid canonical checksMaxim Levitsky
Add emulation flags for MSR accesses and Descriptor Tables loads, and pass the new flags as appropriate to emul_is_noncanonical_address(). The flags will be used to perform the correct canonical check, as the type of access affects whether or not CR4.LA57 is consulted when determining the canonical bit. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com [sean: split to separate patch, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Route non-canonical checks in emulator through emulate_opsMaxim Levitsky
Add emulate_ops.is_canonical_addr() to perform (non-)canonical checks in the emulator, which will allow extending is_noncanonical_address() to support different flavors of canonical checks, e.g. for descriptor table bases vs. MSRs, without needing duplicate logic in the emulator. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com [sean: separate from additional of flags, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: drop x86.h include from cpuid.hMaxim Levitsky
Drop x86.h include from cpuid.h to allow the x86.h to include the cpuid.h instead. Also fix various places where x86.h was implicitly included via cpuid.h Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-2-mlevitsk@redhat.com [sean: fixup a missed include in mtrr.c] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Use '0' for guest RIP if PMI encounters protected guest stateSean Christopherson
Explicitly return '0' for guest RIP when handling a PMI VM-Exit for a vCPU with protected guest state, i.e. when KVM can't read the real RIP. While there is no "right" value, and profiling a protect guest is rather futile, returning the last known RIP is worse than returning obviously "bad" data. E.g. for SEV-ES+, the last known RIP will often point somewhere in the guest's boot flow. Opportunistically add WARNs to effectively assert that the in_kernel() and get_ip() callbacks are restricted to the common PMI handler, as the return values for the protected guest state case are largely arbitrary, i.e. only make any sense whatsoever for PMIs, where the returned values have no functional impact and thus don't truly matter. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Add lockdep-guarded asserts on register cache usageSean Christopherson
When lockdep is enabled, assert that KVM accesses the register caches if and only if cache fills are guaranteed to consume fresh data, i.e. when KVM when KVM is in control of the code sequence. Concretely, the caches can only be used from task context (synchronous) or when handling a PMI VM-Exit (asynchronous, but only in specific windows where the caches are in a known, stable state). Generally speaking, there are very few flows where reading register state from an asynchronous context is correct or even necessary. So, rather than trying to figure out a generic solution, simply disallow using the caches outside of task context by default, and deal with any future exceptions on a case-by-case basis _if_ they arise. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Bypass register cache when querying CPL from kvm_sched_out()Sean Christopherson
When querying guest CPL to determine if a vCPU was preempted while in kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from the VMCS on Intel CPUs. If the kernel is running with full preemption enabled, using the register cache in the preemption path can result in stale and/or uninitialized data being cached in the segment cache. In particular the following scenario is currently possible: - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset(). - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES. - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache. As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Note, the VM-Enter failure can also be avoided by moving the call to vmx_segment_cache_clear() until after the vmx_vcpu_reset() initializes all segments. However, while that change is correct and desirable (and will come along shortly), it does not address the underlying problem that accessing KVM's register caches from !task context is generally unsafe. In addition to fixing the immediate bug, bypassing the cache for this particular case will allow hardening KVM register caching log to assert that the caches are accessed only when KVM _knows_ it is safe to do so. Fixes: de63ad4cf497 ("KVM: X86: implement the logic for spinlock optimization") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: AMD's IBPB is not equivalent to Intel's IBPBJim Mattson
From Intel's documentation [1], "CPUID.(EAX=07H,ECX=0):EDX[26] enumerates support for indirect branch restricted speculation (IBRS) and the indirect branch predictor barrier (IBPB)." Further, from [2], "Software that executed before the IBPB command cannot control the predicted targets of indirect branches (4) executed after the command on the same logical processor," where footnote 4 reads, "Note that indirect branches include near call indirect, near jump indirect and near return instructions. Because it includes near returns, it follows that **RSB entries created before an IBPB command cannot control the predicted targets of returns executed after the command on the same logical processor.**" [emphasis mine] On the other hand, AMD's IBPB "may not prevent return branch predictions from being specified by pre-IBPB branch targets" [3]. However, some AMD processors have an "enhanced IBPB" [terminology mine] which does clear the return address predictor. This feature is enumerated by CPUID.80000008:EDX.IBPB_RET[bit 30] [4]. Adjust the cross-vendor features enumerated by KVM_GET_SUPPORTED_CPUID accordingly. [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/cpuid-enumeration-and-architectural-msrs.html [2] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/speculative-execution-side-channel-mitigations.html#Footnotes [3] https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1040.html [4] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf Fixes: 0c54914d0c52 ("KVM: x86: use Intel speculation bugs and features as derived in generic x86 code") Suggested-by: Venkatesh Srinivas <venkateshs@chromium.org> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20241011214353.1625057-5-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Advertise AMD_IBPB_RET to userspaceJim Mattson
This is an inherent feature of IA32_PRED_CMD[0], so it is trivially virtualizable (as long as IA32_PRED_CMD[0] is virtualized). Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20241011214353.1625057-4-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Ensure vcpu->mode is loaded from memory in kvm_vcpu_exit_request()Sean Christopherson
Wrap kvm_vcpu_exit_request()'s load of vcpu->mode with READ_ONCE() to ensure the variable is re-loaded from memory, as there is no guarantee the caller provides the necessary annotations to ensure KVM sees a fresh value, e.g. the VM-Exit fastpath could theoretically reuse the pre-VM-Enter value. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240828232013.768446-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Fix a comment inside __kvm_set_or_clear_apicv_inhibit()Kai Huang
Change svm_vcpu_run() to vcpu_enter_guest() in the comment of __kvm_set_or_clear_apicv_inhibit() to make it reflect the fact. When one thread updates VM's APICv state due to updating the APICv inhibit reasons, it kicks off all vCPUs and makes them wait until the new reason has been updated and can be seen by all vCPUs. There was one WARN() to make sure VM's APICv state is consistent with vCPU's APICv state in the svm_vcpu_run(). Commit ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86") moved that WARN() to x86 common code vcpu_enter_guest() due to the logic is not unique to SVM, and added comments to both __kvm_set_or_clear_apicv_inhibit() and vcpu_enter_guest() to explain this. However, although the comment in __kvm_set_or_clear_apicv_inhibit() mentioned the WARN(), it seems forgot to reflect that the WARN() had been moved to x86 common, i.e., it still mentioned the svm_vcpu_run() but not vcpu_enter_guest(). Fix it. Note after the change the first line that contains vcpu_enter_guest() exceeds 80 characters, but leave it as is to make the diff clean. Fixes: ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86") Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/e462e7001b8668649347f879c66597d3327dbac2.1728383775.git.kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Fix a comment inside kvm_vcpu_update_apicv()Kai Huang
The sentence "... so that KVM can the AVIC doorbell to ..." doesn't have a verb. Fix it. After adding the verb 'use', that line exceeds 80 characters. Thus wrap the 'to' to the next line. Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/666e991edf81e1fccfba9466f3fe65965fcba897.1728383775.git.kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Batch TLB flushes when zapping collapsible TDP MMU SPTEsDavid Matlack
Set SPTEs directly to SHADOW_NONPRESENT_VALUE and batch up TLB flushes when zapping collapsible SPTEs, rather than freezing them first. Freezing the SPTE first is not required. It is fine for another thread holding mmu_lock for read to immediately install a present entry before TLBs are flushed because the underlying mapping is not changing. vCPUs that translate through the stale 4K mappings or a new huge page mapping will still observe the same GPA->HPA translations. KVM must only flush TLBs before dropping RCU (to avoid use-after-free of the zapped page tables) and before dropping mmu_lock (to synchronize with mmu_notifiers invalidating mappings). In VMs backed with 2MiB pages, batching TLB flushes improves the time it takes to zap collapsible SPTEs to disable dirty logging: $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g Before: Disabling dirty logging time: 14.334453428s (131072 flushes) After: Disabling dirty logging time: 4.794969689s (76 flushes) Skipping freezing SPTEs also avoids stalling vCPU threads on the frozen SPTE for the time it takes to perform a remote TLB flush. vCPUs faulting on the zapped mapping can now immediately install a new huge mapping and proceed with guest execution. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-3-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level()David Matlack
Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All callers pass in PG_LEVEL_NUM, so @max_level can be replaced with PG_LEVEL_NUM in the function body. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86: Don't emit TLB flushes when aging SPTEs for mmu_notifiersSean Christopherson
Follow x86's primary MMU, which hasn't flushed TLBs when clearing Accessed bits for 10+ years, and skip all TLB flushes when aging SPTEs in response to a clear_flush_young() mmu_notifier event. As documented in x86's ptep_clear_flush_young(), the probability and impact of "bad" reclaim due to stale A-bit information is relatively low, whereas the performance cost of TLB flushes is relatively high. I.e. the cost of flushing TLBs outweighs the benefits. On KVM x86, the cost of TLB flushes is even higher, as KVM doesn't batch TLB flushes for mmu_notifier events (KVM's mmu_notifier contract with MM makes it all but impossible), and sending IPIs forces all running vCPUs to go through a VM-Exit => VM-Enter roundtrip. Furthermore, MGLRU aging of secondary MMUs is expected to use flush-less mmu_notifiers, i.e. flushing for the !MGLRU will make even less sense, and will be actively confusing as it wouldn't be clear why KVM "needs" to flush TLBs for legacy LRU aging, but not for MGLRU aging. Cc: James Houghton <jthoughton@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/20240926013506.860253-18-jthoughton@google.com Link: https://lore.kernel.org/r/20241011021051.1557902-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Set Dirty bit for new SPTEs, even if _hardware_ A/D bits are ↵Sean Christopherson
disabled When making a SPTE, set the Dirty bit in the SPTE as appropriate, even if hardware A/D bits are disabled. Only EPT allows A/D bits to be disabled, and for EPT, the bits are software-available (ignored by hardware) when A/D bits are disabled, i.e. it is perfectly legal for KVM to use the Dirty to track dirty pages in software. Link: https://lore.kernel.org/r/20241011021051.1557902-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Dedup logic for detecting TLB flushes on leaf SPTE changesSean Christopherson
Now that the shadow MMU and TDP MMU have identical logic for detecting required TLB flushes when updating SPTEs, move said logic to a helper so that the TDP MMU code can benefit from the comments that are currently exclusive to the shadow MMU. No functional change intended. Link: https://lore.kernel.org/r/20241011021051.1557902-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE foundSean Christopherson
Return immediately if a young SPTE is found when testing, but not updating, SPTEs. The return value is a boolean, i.e. whether there is one young SPTE or fifty is irrelevant (ignoring the fact that it's impossible for there to be fifty SPTEs, as KVM has a hard limit on the number of valid TDP MMU roots). Link: https://lore.kernel.org/r/20241011021051.1557902-15-seanjc@google.com [sean: use guard(rcu)(), as suggested by Paolo] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn rangeSean Christopherson
Skip invalid TDP MMU roots when aging a gfn range. There is zero reason to process invalid roots, as they by definition hold stale information. E.g. if a root is invalid because its from a previous memslot generation, in the unlikely event the root has a SPTE for the gfn, then odds are good that the gfn=>hva mapping is different, i.e. doesn't map to the hva that is being aged by the primary MMU. Link: https://lore.kernel.org/r/20241011021051.1557902-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabledSean Christopherson
Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware, i.e. propagate accessed information to SPTE.Accessed even when KVM is doing manual tracking by making SPTEs not-present. In addition to eliminating a small amount of code in is_accessed_spte(), this also paves the way for preserving Accessed information when a SPTE is zapped in response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped because NUMA balancing kicks in. Note, EPT is the only flavor of paging in which A/D bits are conditionally enabled, and the Accessed (and Dirty) bit is software-available when A/D bits are disabled. Note #2, there are currently no concrete plans to preserve Accessed information. Explorations on that front were the initial catalyst, but the cleanup is the motivation for the actual commit. Link: https://lore.kernel.org/r/20241011021051.1557902-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabledSean Christopherson
Set shadow_dirty_mask to the architectural EPT Dirty bit value even if A/D bits are disabled at the module level, i.e. even if KVM will never enable A/D bits in hardware. Doing so provides consistent behavior for Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets shadow_accessed_mask but not shadow_dirty_mask. Functionally, this should be one big nop, as consumption of shadow_dirty_mask is always guarded by a check that hardware A/D bits are enabled. Link: https://lore.kernel.org/r/20241011021051.1557902-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabledSean Christopherson
Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D bits are enabled, set shadow_accessed_mask for EPT even when A/D bits are disabled in hardware. This will allow using shadow_accessed_mask for software purposes, e.g. to preserve accessed status in a non-present SPTE acros NUMA balancing, if something like that is ever desirable. Link: https://lore.kernel.org/r/20241011021051.1557902-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabledSean Christopherson
Add a dedicated flag to track if KVM has enabled A/D bits at the module level, instead of inferring the state based on whether or not the MMU's shadow_accessed_mask is non-zero. This will allow defining and using shadow_accessed_mask even when A/D bits aren't used by hardware. Link: https://lore.kernel.org/r/20241011021051.1557902-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writableSean Christopherson
Do a remote TLB flush if installing a leaf SPTE overwrites an existing leaf SPTE (with the same target pfn, which is enforced by a BUG() in handle_changed_spte()) and clears the MMU-Writable bit. Since the TDP MMU passes ACC_ALL to make_spte(), i.e. always requests a Writable SPTE, the only scenario in which make_spte() should create a !MMU-Writable SPTE is if the gfn is write-tracked or if KVM is prefetching a SPTE. When write-protecting for write-tracking, KVM must hold mmu_lock for write, i.e. can't race with a vCPU faulting in the SPTE. And when prefetching a SPTE, the TDP MMU takes care to avoid clobbering a shadow-present SPTE, i.e. it should be impossible to replace a MMU-writable SPTE with a !MMU-writable SPTE when handling a TDP MMU fault. Cc: David Matlack <dmatlack@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20241011021051.1557902-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Fold mmu_spte_update_no_track() into mmu_spte_update()Sean Christopherson
Fold the guts of mmu_spte_update_no_track() into mmu_spte_update() now that the latter doesn't flush when clearing A/D bits, i.e. now that there is no need to explicitly avoid TLB flushes when aging SPTEs. Opportunistically WARN if mmu_spte_update() requests a TLB flush when aging SPTEs, as aging should never modify a SPTE in such a way that KVM thinks a TLB flush is needed. Link: https://lore.kernel.org/r/20241011021051.1557902-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Drop ignored return value from kvm_tdp_mmu_clear_dirty_slot()Sean Christopherson
Drop the return value from kvm_tdp_mmu_clear_dirty_slot() as its sole caller ignores the result (KVM flushes after clearing dirty logs based on the logs themselves, not based on SPTEs). Cc: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20241011021051.1557902-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Don't flush TLBs when clearing Dirty bit in shadow MMUSean Christopherson
Don't force a TLB flush when an SPTE update in the shadow MMU happens to clear the Dirty bit, as KVM unconditionally flushes TLBs when enabling dirty logging, and when clearing dirty logs, KVM flushes based on its software structures, not the SPTEs. I.e. the flows that care about accurate Dirty bit information already ensure there are no stale TLB entries. Opportunistically drop is_dirty_spte() as mmu_spte_update() was the sole caller. Link: https://lore.kernel.org/r/20241011021051.1557902-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bitSean Christopherson
Don't force a TLB flush if mmu_spte_update() clears the Accessed bit, as access tracking tolerates false negatives, as evidenced by the mmu_notifier hooks that explicitly test and age SPTEs without doing a TLB flush. In practice, this is very nearly a nop. spte_write_protect() and spte_clear_dirty() never clear the Accessed bit. make_spte() always sets the Accessed bit for !prefetch scenarios. FNAME(sync_spte) only sets SPTE if the protection bits are changing, i.e. if a flush will be needed regardless of the Accessed bits. And FNAME(pte_prefetch) sets SPTE if and only if the old SPTE is !PRESENT. That leaves kvm_arch_async_page_ready() as the one path that will generate a !ACCESSED SPTE *and* overwrite a PRESENT SPTE. And that's very arguably a bug, as clobbering a valid SPTE in that case is nonsensical. Tested-by: Alex Bennée <alex.bennee@linaro.org> Link: https://lore.kernel.org/r/20241011021051.1557902-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>