Set "next_rip" in the emulation interception info passed to vendor code
using the emulator context's "_eip", not "eip". "eip" holds RIP from the
start of emulation, i.e. the RIP of the instruction that's being emulated,
whereas _eip tracks the context's current position in decoding the code
stream, which at the time of the intercept checks is effectively the RIP
of the next instruction.
Passing the current RIP as next_rip causes SVM to stuff the wrong value
into vmcb12->control.next_rip if a nested VM-Exit is generated, i.e.
if L1 wants to intercept the instruction, and could result in L1 putting
L2 into an infinite loop due to restarting L2 with the same RIP over and
over.
Fixes: 8a76d7f25f8f ("KVM: x86: Add x86 callback for intercept check")
Link: https://lore.kernel.org/r/20250201015518.689704-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When emulating PAUSE on behalf of L2, check for interception in vmcs12 by
looking at primary execution controls, not secondary execution controls.
Checking for PAUSE_EXITING in secondary execution controls effectively
results in KVM looking for BUS_LOCK_DETECTION, which KVM doesn't expose to
L1, i.e. is always off in vmcs12, and ultimately results in KVM failing to
"intercept" PAUSE.
Because KVM doesn't handle interception during emulation correctly on VMX,
i.e. the "fixed" code is still quite broken, and not intercepting PAUSE is
relatively benign, for all intents and purposes the bug means that L2 gets
to live when it would otherwise get an unexpected #UD.
Fixes: 4984563823f0 ("KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not intercepted")
Link: https://lore.kernel.org/r/20250201015518.689704-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that all KVM usage of the Xen HVM config information is buried behind
CONFIG_KVM_XEN=y, move the per-VM kvm_xen_hvm_config field out of kvm_arch
and into kvm_xen.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Query kvm_xen_enabled when detecting writes to the Xen hypercall page MSR
so that the check is optimized away in the likely scenario that Xen isn't
enabled for the VM.
Deliberately open code the check instead of using kvm_xen_msr_enabled() in
order to avoid a double load of xen_hvm_config.msr (which is admittedly
rather pointless given the widespread lack of READ_ONCE() usage on the
plethora of vCPU-scoped accesses to kvm->arch.xen state).
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to detect writes to the Xen hypercall page MSR, and provide a
stub for CONFIG_KVM_XEN=n to optimize out the check for kernels built
without Xen support.
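For illustration, the helper pair might look roughly like this (the helper
name and field layout are assumptions based on the changelog, not verbatim
KVM code):

  #ifdef CONFIG_KVM_XEN
  static inline bool kvm_xen_is_hypercall_page_msr(struct kvm *kvm, u32 msr)
  {
  	/* An MSR index of '0' means the hypercall page is disabled. */
  	return msr && msr == kvm->arch.xen_hvm_config.msr;
  }
  #else
  static inline bool kvm_xen_is_hypercall_page_msr(struct kvm *kvm, u32 msr)
  {
  	return false;	/* Constant-folds the caller's check away. */
  }
  #endif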
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250215011437.1203084-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject userspace attempts to set the Xen hypercall page MSR to an index
outside of the "standard" virtualization range [0x40000000, 0x4fffffff],
as KVM is not equipped to handle collisions with real MSRs, e.g. KVM
doesn't update MSR interception, conflicts with VMCS/VMCB fields, special
case writes in KVM, etc.
While the MSR index isn't strictly ABI, i.e. can theoretically float to
any value, in practice no known VMM sets the MSR index to anything other
than 0x40000000 or 0x40000200.
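For illustration, the check might look roughly like the following sketch,
assuming the userspace-provided config arrives as a struct
kvm_xen_hvm_config (the error value is illustrative):

  /* Reject hypercall page MSR indices outside the synthetic MSR range. */
  if (xhc->msr &&
      (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff))
  	return -EINVAL;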
Cc: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
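The shape of the conversion (timer and callback names are hypothetical):

  /* Before: two-step init with an open-coded callback assignment. */
  hrtimer_init(&ctx->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
  ctx->timer.function = my_timer_fn;

  /* After: one call initializes the timer and sets the callback. */
  hrtimer_setup(&ctx->timer, my_timer_fn, CLOCK_MONOTONIC,
  		HRTIMER_MODE_ABS_HARD);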
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/5051cfe7ed48ef9913bf2583eeca6795cb53d6ae.1738746821.git.namcao@linutronix.de
|
|
KVM fixes for 6.14 part 1
- Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being emulated
by KVM to fix a NULL pointer dereference.
- Enter guest mode (L2) from KVM's perspective before initializing the vCPU's
nested NPT MMU so that the MMU is properly tagged for L2, not L1.
- Load the guest's DR6 outside of the innermost .vcpu_run() loop, as the
guest's value may be stale if a VM-Exit is handled in the fastpath.
|
|
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading. Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Convert the shadow MMU to use per-rmap locking instead of the per-VM
mmu_lock to protect rmaps when aging SPTEs. When A/D bits are enabled, it
is safe to simply clear the Accessed bits, i.e. KVM just needs to ensure
the parent page table isn't freed.
The less obvious case is marking SPTEs for access tracking in the
non-A/D case (for EPT only). Because aging a gfn means making the SPTE
not-present, KVM needs to play nice with the case where the CPU has TLB
entries for a SPTE that is not-present in memory. For example, when
doing dirty tracking, if KVM encounters a non-present shadow accessed SPTE,
KVM must know to do a TLB invalidation.
Fortunately, KVM already provides (and relies upon) the necessary
functionality. E.g. KVM doesn't flush TLBs when aging pages (even in the
clear_flush_young() case), and when harvesting dirty bitmaps, KVM flushes
based on the dirty bitmaps, not on SPTEs.
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-12-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a lockless version of for_each_rmap_spte(), which is pretty much the
same as the normal version, except that it doesn't BUG() the host if a
non-present SPTE is encountered. When mmu_lock is held, it should be
impossible for a different task to zap a SPTE, _and_ zapped SPTEs must
be removed from their rmap chain prior to dropping mmu_lock. Thus, the
normal walker BUG()s if a non-present SPTE is encountered as something is
wildly broken.
When walking rmaps without holding mmu_lock, the SPTEs pointed at by the
rmap chain can be zapped/dropped, and so a lockless walk can observe a
non-present SPTE if it runs concurrently with a different operation that
is zapping SPTEs.
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-11-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock. The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds). In short, a more sophisticated locking scheme shouldn't be
necessary.
Note, the lock only protects the rmap structure itself; SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries).
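A rough sketch of the lock (bit position and names are illustrative, and
the memory ordering is simplified):

  #define KVM_RMAP_LOCKED	BIT(1)

  static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
  {
  	unsigned long old_val, new_val;

  	old_val = READ_ONCE(rmap_head->val);
  	do {
  		/* Spin until the lock bit is clear, then try to take it. */
  		while (old_val & KVM_RMAP_LOCKED) {
  			cpu_relax();
  			old_val = READ_ONCE(rmap_head->val);
  		}
  		new_val = old_val | KVM_RMAP_LOCKED;
  	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

  	/* Return the unlocked snapshot so the caller can operate on it. */
  	return old_val;
  }

  static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
  			    unsigned long new_val)
  {
  	/* Publish the (possibly modified) value with the lock bit clear. */
  	smp_store_release(&rmap_head->val, new_val & ~KVM_RMAP_LOCKED);
  }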
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Refactor the pte_list and rmap code to always read and write rmap_head->val
exactly once, e.g. by collecting changes in a local variable and then
propagating those changes back to rmap_head->val as appropriate. This will
allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into
the rmap value alongside the MANY bit.
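Illustrative shape of the pattern (pte_list_add_many() is a hypothetical
stand-in for the pte_list_desc handling):

  unsigned long rmap_val = rmap_head->val;	/* single read */

  if (!rmap_val)
  	rmap_val = (unsigned long)spte;		/* first entry: bare pointer */
  else
  	rmap_val = pte_list_add_many(rmap_val, spte);

  rmap_head->val = rmap_val;			/* single write */

Funneling all updates through exactly one read and one write is what makes
it possible to preserve a LOCKED bit across the modification.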
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-9-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and
only if the VM has at least one shadow page, as opposed to checking if the
VM has rmaps. Checking for rmaps will effectively yield a false positive
if the VM ran nested TDP VMs in the past, but is not currently doing so.
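A sketch of the refined check (helper and field names are assumptions):

  static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
  {
  	/* Shadow pages, not rmap allocations, indicate rmaps may have SPTEs. */
  	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
  }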
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-8-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reorder the processing of the TDP MMU versus the shadow MMU when aging
SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP
MMU reports that the page is young, i.e. completely avoid taking mmu_lock
if the TDP MMU SPTE is young. Swap the order for the test-and-age helper
as well for consistency.
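Condensed sketch of the resulting flow (names are illustrative; the real
code uses separate age and test-age paths):

  bool young = false;

  if (tdp_mmu_enabled)
  	young = kvm_tdp_mmu_age_gfn_range(kvm, range);

  /* Test-only: if the TDP MMU already reports young, skip mmu_lock. */
  if (test_only && young)
  	return true;

  if (kvm_memslots_have_rmaps(kvm)) {
  	write_lock(&kvm->mmu_lock);
  	young |= kvm_rmap_age_gfn_range(kvm, range, test_only);
  	write_unlock(&kvm->mmu_lock);
  }

  return young;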
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-7-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Walk the TDP MMU in an RCU read-side critical section without holding
mmu_lock when harvesting and potentially updating age information on
TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots,
and do all SPTE aging with atomic updates; while clobbering Accessed
information is ok, KVM must not corrupt other bits, e.g. must not drop
a Dirty or Writable bit when making a SPTE young.
If updating a SPTE to mark it for access tracking fails, leave it as is
and treat it as if it were young. If the SPTE is being actively modified,
it is most likely young.
Acquire and release mmu_lock for write when harvesting age information
from the shadow MMU, as the shadow MMU doesn't yet support aging outside
of mmu_lock.
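Condensing the TDP MMU rules into a sketch (helper names follow KVM
conventions but should be treated as illustrative):

  u64 spte = kvm_tdp_mmu_read_spte(iter.sptep);

  if (is_accessed_spte(spte)) {
  	young = true;

  	if (spte_ad_enabled(spte)) {
  		/* Atomically clear only the Accessed bit. */
  		tdp_mmu_clear_spte_bits_atomic(iter.sptep,
  					       shadow_accessed_mask);
  	} else {
  		/*
  		 * Switch to the access-tracked format; if the cmpxchg
  		 * loses a race, leave the SPTE as-is, as an actively
  		 * modified SPTE is almost certainly young anyway.
  		 */
  		u64 new_spte = mark_spte_for_access_track(spte);

  		(void)try_cmpxchg64((u64 *)iter.sptep, &spte, new_spte);
  	}
  }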
Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled
SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them
for access-tracking outside of mmu_lock. Coupled with restoring access-
tracked SPTEs in the fast page fault handler, the end result is that
A/D-disabled SPTEs will be volatile at all times.
Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/all/Z60bhK96JnKIgqZQ@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't force SPTE modifications to be done atomically if the only volatile
bit in the SPTE is the Accessed bit. KVM and the primary MMU tolerate
stale aging state, and the probability of an Accessed bit A/D assist being
clobbered *and* affecting aging is likely far lower than the probability
of consuming stale information due to not flushing TLBs when aging.
Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better
capture the nature of the helper.
Opportunistically do s/write/update on the TDP MMU wrapper, as it's not
simply the "write" that needs to be done atomically, it's the entire
update, i.e. the entire read-modify-write operation needs to be done
atomically so that KVM has an accurate view of the old SPTE.
Leave kvm_tdp_mmu_write_spte_atomic() as is. While the name is imperfect,
it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with
kvm_tdp_mmu_read_spte(). And renaming all of those isn't obviously a net
positive, and would require significant churn.
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
follow-up patch to enable lockless Accessed bit clearing.
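A plausible shape for the helper, given its stated purpose (treat this as
a sketch, not the verbatim implementation):

  static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
  {
  	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);

  	/* Clear @mask and return the old SPTE value in one atomic op. */
  	return (u64)atomic64_fetch_and((s64)~mask, sptep_atomic);
  }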
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-4-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When registering an encrypted memory region for SEV-MEM/SEV-ES guests,
pin the pages with FOLL_LONGTERM so that the pages are migrated out of
MIGRATE_CMA/ZONE_MOVABLE. Failure to do so violates the CMA/MOVABLE
mechanisms and can result in fragmentation due to unmovable pages, e.g.
can make CMA allocations fail.
Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/1739241423-14326-1-git-send-email-yangge1116@126.com
[sean: massage changelog, make @flags an unsigned int]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating PV clocks, handle the Xen-specific UNSTABLE_TSC override in
the main kvm_guest_time_update() by simply clearing PVCLOCK_TSC_STABLE_BIT
in the flags of the reference pvclock structure. Expand the comment to
(hopefully) make it obvious that Xen clocks need to be processed after all
clocks that care about the TSC_STABLE flag.
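Sketch of the override (assuming the Xen config still lives in kvm_arch at
this point in the series, and that hv_clock is the reference copy):

  /* Xen's unstable-TSC override: clear the bit once, in the reference
   * copy, so every pvclock derived from it sees the result. */
  if (kvm->arch.xen_hvm_config.flags & KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE)
  	hv_clock.flags &= ~PVCLOCK_TSC_STABLE_BIT;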
No functional change intended.
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating paravirtual clocks, set up the Hyper-V TSC page before
Xen PV clocks. This will allow dropping xen_pvclock_tsc_unstable in favor
of simply clearing PVCLOCK_TSC_STABLE_BIT in the reference flags.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove the per-vCPU "cache" of the reference pvclock and instead cache
only the TSC shift+multiplier. All other fields in pvclock are fully
recomputed by kvm_guest_time_update(), i.e. aren't actually persisted.
In addition to shaving a few bytes, explicitly tracking the TSC shift/mul
fields makes it easier to see that those fields are tied to hw_tsc_khz
(they exist to avoid having to do expensive math in the common case).
And conversely, not tracking the other fields makes it easier to see that
things like the version number are pulled from the guest's copy, not from
KVM's reference.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pass the reference pvclock structure that's used to set up each individual
pvclock as a parameter to kvm_setup_guest_pvclock() as a preparatory step
toward removing kvm_vcpu_arch.hv_clock.
No functional change intended.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Handle "guest stopped" propagation only for kvmclock, as the flag is set
if and only if kvmclock is "active", i.e. can only be set for Xen PV clock
if kvmclock *and* Xen PV clock are in-use by the guest, which creates very
bizarre behavior for the guest.
Simply restrict the flag to kvmclock instead of trying to handle Xen PV
clock, as propagation of PVCLOCK_GUEST_STOPPED was unintentionally added
during a refactoring, and while Xen proper defines
XEN_PVCLOCK_GUEST_STOPPED, there's no evidence that Xen guests actually
support the flag.
Check and clear pvclock_set_guest_stopped_request if and only if kvmclock
is active to preserve the original behavior, i.e. keep the flag pending
if kvmclock happens to be disabled when KVM processes the initial request.
Fixes: aa096aa0a05f ("KVM: x86/xen: setup pvclock updates")
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating a specific PV clock, make a full copy of KVM's reference
copy/cache so that PVCLOCK_GUEST_STOPPED doesn't bleed across clocks.
E.g. in the unlikely scenario the guest has enabled both kvmclock and Xen
PV clock, a dangling GUEST_STOPPED in kvmclock would bleed into Xen PV
clock.
Using a local copy of the pvclock structure also sets the stage for
eliminating the per-vCPU copy/cache (only the TSC frequency information
actually "needs" to be cached/persisted).
Fixes: aa096aa0a05f ("KVM: x86/xen: setup pvclock updates")
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use the guest's copy of its pvclock when starting a Xen timer, as KVM's
reference copy may not be up-to-date, i.e. may yield a false positive of
sorts. In the unlikely scenario that the guest is starting a Xen timer
and has used a Xen pvclock in the past, but has since turned it "off",
then vcpu->arch.hv_clock may be stale, as KVM's reference copy is updated
if and only if at least one pvclock is enabled.
Furthermore, vcpu->arch.hv_clock is currently used by three different
pvclocks: kvmclock, Xen, and Xen compat. While it's extremely unlikely a
guest would ever enable multiple pvclocks, effectively sharing KVM's
reference clock could yield very weird behavior. Using the guest's active
Xen pvclock instead of KVM's reference will allow dropping KVM's
reference copy.
Fixes: 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Handle "guest stopped" requests once per guest time update in preparation
of restoring KVM's historical behavior of setting PVCLOCK_GUEST_STOPPED
for kvmclock and only kvmclock. For now, simply move the code to minimize
the probability of an unintentional change in functionality.
Note, in practice, all clocks are guaranteed to see the request (or not)
even though each PV clock processes the request individually, as KVM holds
vcpu->mutex (blocks KVM_KVMCLOCK_CTRL) and it should be impossible for
KVM's suspend notifier to run while KVM is handling requests. And because
the helper updates the reference flags, all subsequent PV clock updates
will pick up PVCLOCK_GUEST_STOPPED.
Note #2, once PVCLOCK_GUEST_STOPPED is restricted to kvmclock, the
horrific #ifdef will go away.
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the local pvclock_flags in kvm_guest_time_update(); the local variable
is immediately shoved into the per-vCPU "cache" and thus serves no purpose.
No functional change intended.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop KVM's handling of kvm_set_guest_paused() failure when reacting to a
SUSPEND notification, as kvm_set_guest_paused() only "fails" if the vCPU
isn't using kvmclock, and KVM's notifier callback pre-checks that kvmclock
is active. I.e. barring some bizarre edge case that shouldn't be treated
as an error in the first place, kvm_arch_suspend_notifier() can't fail.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When queueing vCPU PVCLOCK updates in response to SUSPEND or HIBERNATE,
don't take kvm->lock, as doing so can trigger a largely theoretical
deadlock; it is perfectly safe to iterate over the xarray of vCPUs without
holding kvm->lock, and kvm->lock doesn't protect kvm_set_guest_paused() in
any way (pv_time.active and pvclock_set_guest_stopped_request are
protected by vcpu->mutex, not kvm->lock).
Reported-by: syzbot+352e553a86e0d75f5120@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/677c0f36.050a0220.3b3668.0014.GAE@google.com
Fixes: 7d62874f69d7 ("kvm: x86: implement KVM PM-notifier")
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries. Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept
on Intel, but not AMD).
Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:
SKX 6850 => 6450
ICX 9000 => 8800
EMR 7900 => 7700
Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.
Skipping updates entirely if L2 is active and CPUID is being intercepted
by L1 could work for the common case. However, simply skipping updates if
L2 is active is *very* subtly dangerous and complex. Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would require detecting changes
to L1 state. KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit. The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.
To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT quirk means that MWAIT is _conditionally_ a dynamic
CPUID feature.
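For illustration, the guard could look roughly like so (the exact set of
disallowed features is illustrative; note that MWAIT is absent):

  static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
  					      unsigned int x86_feature)
  {
  	/*
  	 * Dynamic feature bits may be stale now that runtime updates are
  	 * deferred; reject querying them at compile time.
  	 */
  	BUILD_BUG_ON(x86_feature == X86_FEATURE_OSXSAVE ||
  		     x86_feature == X86_FEATURE_OSPKE);

  	return guest_cpu_cap_get(vcpu, x86_feature);	/* existing lookup */
  }

BUILD_BUG_ON() works here because the function is __always_inline and
callers pass compile-time-constant feature values.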
Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait(), but there's no obvious advantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.
Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug. The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled. E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT. KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.
Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework MONITOR/MWAIT emulation to query X86_FEATURE_MWAIT if and only if
the MISC_ENABLE_NO_MWAIT quirk is enabled, in which case MWAIT is not a
dynamic, KVM-controlled CPUID feature. KVM's funky ABI for that quirk is
to emulate MONITOR/MWAIT as nops if userspace sets MWAIT in guest CPUID.
For the case where KVM owns the MWAIT feature bit, check MISC_ENABLES
itself, i.e. check the actual control, not its reflection in guest CPUID.
Avoiding consumption of dynamic CPUID features will allow KVM to defer
runtime CPUID updates until kvm_emulate_cpuid(), i.e. until the updates
become visible to the guest. Alternatively, KVM could play other games
with runtime CPUID updates, e.g. by precisely specifying which feature
bits to update, but doing so adds non-trivial complexity and doesn't solve
the underlying issue of unnecessary updates causing meaningful overhead
for nested virtualization roundtrips.
Link: https://lore.kernel.org/r/20241211013302.1347853-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When emulating CPUID, retrieve MSR_IA32_TSX_CTRL.TSX_CTRL_CPUID_CLEAR if
and only if RTM and/or HLE feature bits need to be cleared. Getting the
MSR value is unnecessary if neither bit is set, and avoiding the lookup
saves ~80 cycles for vCPUs without RTM or HLE.
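Sketch of the reordered check (the MSR-read helper's signature is an
assumption):

  u64 data;

  /* Skip the MSR read entirely when neither RTM nor HLE is set. */
  if ((*ebx & (feature_bit(RTM) | feature_bit(HLE))) &&
      !__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
      (data & TSX_CTRL_CPUID_CLEAR))
  	*ebx &= ~(feature_bit(RTM) | feature_bit(HLE));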
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20241211013302.1347853-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework xstate_required_size() to use a for-loop and continue, to make it
more obvious that the xstate_sizes[] lookups are indeed correctly bounded,
and to make it (hopefully) easier to understand that the loop is iterating
over supported XSAVE features.
Link: https://lore.kernel.org/r/20241211013302.1347853-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The type of an unsuffixed decimal constant is the first of int, long, and
long long that can represent its value, so the constant 48 has type int.
In the following min() call,
  g_phys_as = min(g_phys_as, 48);
the int constant mismatches the unsigned type of g_phys_as, which triggers
a build warning/error (with CONFIG_KVM_WERROR=y) because the kernel's min()
macro requires both arguments to have the same type. Adding the U suffix
to explicitly make the constant unsigned int fixes the issue.
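I.e. the fixed call is simply:

  g_phys_as = min(g_phys_as, 48U);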
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Link: https://lore.kernel.org/r/20250127013837.12983-1-haifeng.zhao@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In kvm_set_mp_state(), ensure that vcpu->arch.pv.pv_unhalted is always
cleared on a transition to KVM_MP_STATE_RUNNABLE, so that the next HLT
instruction will be respected.
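A sketch of the helper's intended behavior (the exact body in KVM may
differ slightly):

  static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state)
  {
  	vcpu->arch.mp_state = mp_state;
  	/* A runnable vCPU no longer has a pending PV unhalt. */
  	if (mp_state == KVM_MP_STATE_RUNNABLE)
  		vcpu->arch.pv.pv_unhalted = false;
  }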
Fixes: 6aef266c6e17 ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks")
Fixes: b6b8a1451fc4 ("KVM: nVMX: Rework interception of IRQs and NMIs")
Fixes: 38c0b192bd6d ("KVM: SVM: leave halted state on vmexit")
Fixes: 1a65105a5aba ("KVM: x86/xen: handle PV spinlocks slowpath")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-3-jmattson@google.com
[sean: add Xen PV spinlocks to the list of Fixes, tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Replace all open-coded assignments to vcpu->arch.mp_state with calls
to a new helper, kvm_set_mp_state(), to centralize all changes to
mp_state.
No functional change intended.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use kvfree_rcu() to free the old optimized APIC instead of open coding a
rough equivalent via call_rcu() and a callback function.
Note, there is a subtle functional change, as rcu_barrier() doesn't wait on
kvfree_rcu(), but does wait on call_rcu(). Not forcing rcu_barrier() to
wait is safe and desirable in this case, as KVM doesn't care when an old
map is actually freed. In fact, using kvfree_rcu() fixes a largely
theoretical use-after-free. Because KVM _doesn't_ do rcu_barrier() to
wait for kvm_apic_map_free() to complete, if KVM-the-module is unloaded in
the RCU grace period before kvm_apic_map_free() is invoked, KVM's callback
could run after module unload.
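The conversion has the following shape (assuming the map's rcu_head field
is named "rcu"):

  /* Before: open-coded callback whose only job is to free the map. */
  call_rcu(&old_map->rcu, kvm_apic_map_free);

  /* After: no module-owned callback, and rcu_barrier() won't wait. */
  kvfree_rcu(old_map, rcu);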
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250122073456.2950-1-lirongqing@baidu.com
[sean: rework changelog, call out rcu_barrier() interaction]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating the emulated PIC IRQ status, set "wakeup_needed" if and only
if a new interrupt was found, i.e. if the incoming level is non-zero and
an IRQ is being raised. The bug is relatively benign, as KVM will signal
a spurious wakeup, e.g. set KVM_REQ_EVENT and kick target vCPUs, but KVM
will never actually inject a spurious IRQ as kvm_cpu_has_extint() cares
only about the "output" field.
Fixes: 7049467b5383 ("KVM: remove isr_ack logic from PIC")
Signed-off-by: Liam Ni <zhiguangni01@gmail.com>
Link: https://lore.kernel.org/r/CACZJ9cX2R_=qgvLdaqbB_DUJhv08c674b67Ln_Qb9yyVwgE16w@mail.gmail.com
[sean: reconstruct patch, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.
When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit. This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.
For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.
But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.
Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop. If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.
The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup. E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).
L1's view:
==========
<L1 disables DR interception>
CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0
A: L1 Writes DR6
CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1
B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec
D: L1 reads DR6, arch.dr6 = 0
CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0
CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0
L2 reads DR6, L1 disables DR interception
CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0
CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0
L2 detects failure
CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT
L1 reads DR6 (confirms failure)
CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0
L0's view:
==========
L2 reads DR6, arch.dr6 = 0
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
L2 => L1 nested VM-Exit
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23
L1 writes DR7, L0 disables DR interception
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23
L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0
A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>
B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23
C: L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0
L1 => L2 nested VM-Enter
CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME
L0 reads DR6, arch.dr6 = 0
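The shape of the fix, as a hedged sketch (the hook name and guard follow
the commits cited below; the placement is the point):

  /* In common x86 code, before entering the inner .vcpu_run() loop, so a
   * fastpath VM-Exit can't trigger a reload of a stale vcpu->arch.dr6. */
  if (vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)
  	kvm_x86_call(set_dr6)(vcpu, vcpu->arch.dr6);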
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com
Fixes: 375e28ffc0cf ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT")
Fixes: d67668e9dd76 ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest
mode prior to initializing the MMU for nested NPT so that guest_mode is
set in the MMU's role. KVM's model is that all L2 MMUs are tagged with
guest_mode, as the behavior of hypervisor MMUs tends to be significantly
different than kernel MMUs.
Practically speaking, the bug is relatively benign, as KVM only directly
queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and
kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths
that are optimizations (mmu_page_zap_pte() and
shadow_mmu_try_split_huge_pages()).
And while the role is incorporated into shadow page usage, because nested
NPT requires KVM to be using NPT for L1, reusing shadow pages across L1
and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs
will have direct=0.
Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning
of the flow instead of forcing guest_mode in the MMU, as nothing in
nested_vmcb02_prepare_control() between the old and new locations touches
TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present
inconsistent vCPU state to the MMU.
Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control")
Cc: stable@vger.kernel.org
Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise support for Hyper-V's SEND_IPI and SEND_IPI_EX hypercalls if and
only if the local APIC is emulated/virtualized by KVM, and explicitly reject
said hypercalls if the local APIC is emulated in userspace, i.e. don't rely
on userspace to opt-in to KVM_CAP_HYPERV_ENFORCE_CPUID.
Rejecting SEND_IPI and SEND_IPI_EX fixes a NULL-pointer dereference if
Hyper-V enlightenments are exposed to the guest without an in-kernel local
APIC:
dump_stack+0xbe/0xfd
__kasan_report.cold+0x34/0x84
kasan_report+0x3a/0x50
__apic_accept_irq+0x3a/0x5c0
kvm_hv_send_ipi.isra.0+0x34e/0x820
kvm_hv_hypercall+0x8d9/0x9d0
kvm_emulate_hypercall+0x506/0x7e0
__vmx_handle_exit+0x283/0xb60
vmx_handle_exit+0x1d/0xd0
vcpu_enter_guest+0x16b0/0x24c0
vcpu_run+0xc0/0x550
kvm_arch_vcpu_ioctl_run+0x170/0x6d0
kvm_vcpu_ioctl+0x413/0xb20
__se_sys_ioctl+0x111/0x160
do_syscall_64+0x30/0x40
entry_SYSCALL_64_after_hwframe+0x67/0xd1
Note, checking the sending vCPU is sufficient, as the per-VM irqchip_mode
can't be modified after vCPUs are created, i.e. if one vCPU has an
in-kernel local APIC, then all vCPUs have an in-kernel local APIC.
Reported-by: Dongjie Zou <zoudongjie@huawei.com>
Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls")
Fixes: 2bc39970e932 ("x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID")
Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20250118003454.2619573-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading. Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Simplify code by replacing &to_kvm_svm(kvm)->sev_info with
to_kvm_sev_info() helper function. Wherever possible, drop the local
variable declaration and directly use the helper instead.
No functional changes.
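For reference, the helper is trivial (sketch):

  static __always_inline struct kvm_sev_info *to_kvm_sev_info(struct kvm *kvm)
  {
  	return &to_kvm_svm(kvm)->sev_info;
  }

so call sites shrink from "&to_kvm_svm(kvm)->sev_info" to
"to_kvm_sev_info(kvm)".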
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250123055140.144378-1-nikunj@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The Xen hypercall page MSR is write-only. When the guest writes an address
to the MSR, the hypervisor populates the referenced page with hypercall
functions.
There is no reason for the host ever to write to the MSR, and it isn't
even readable.
Allowing host writes to trigger hypercall page setup allows userspace to
attack the kernel, as kvm_xen_write_hypercall_page() takes multiple
locks and writes to guest memory. E.g. if userspace sets the MSR to
MSR_IA32_XSS, KVM's write to MSR_IA32_XSS during vCPU creation will
trigger an SRCU violation due to writing guest memory:
=============================
WARNING: suspicious RCU usage
6.13.0-rc3
-----------------------------
include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage!
stack backtrace:
CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0x90
lockdep_rcu_suspicious+0x176/0x1c0
kvm_vcpu_gfn_to_memslot+0x259/0x280
kvm_vcpu_write_guest+0x3a/0xa0
kvm_xen_write_hypercall_page+0x268/0x300
kvm_set_msr_common+0xc44/0x1940
vmx_set_msr+0x9db/0x1fc0
kvm_vcpu_reset+0x857/0xb50
kvm_arch_vcpu_create+0x37e/0x4d0
kvm_vm_ioctl+0x669/0x2100
__x64_sys_ioctl+0xc1/0xf0
do_syscall_64+0xc5/0x210
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7feda371b539
While the MSR index isn't strictly ABI, i.e. can theoretically float to
any value, in practice no known VMM sets the MSR index to anything other
than 0x40000000 or 0x40000200.
Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/de0437379dfab11e431a23c8ce41a29234c06cbf.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When waking a VM's NX huge page recovery thread, ensure the thread is
actually alive before trying to wake it. Now that the thread is spawned
on-demand during KVM_RUN, a VM without a recovery thread is reachable via
the related module params.
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:vhost_task_wake+0x5/0x10
Call Trace:
<TASK>
set_nx_huge_pages+0xcc/0x1e0 [kvm]
param_attr_store+0x8a/0xd0
module_attr_store+0x1a/0x30
kernfs_fop_write_iter+0x12f/0x1e0
vfs_write+0x233/0x3e0
ksys_write+0x60/0xd0
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f3b52710104
</TASK>
Modules linked in: kvm_intel kvm
CR2: 0000000000000040
Fixes: 931656b9e2ff ("kvm: defer huge page recovery vhost task to later")
Cc: stable@vger.kernel.org
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250124234623.3609069-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The only statement in a kvm_arch_post_init_vm implementation
can be moved into the x86 kvm_arch_init_vm. Do so and remove all
traces from architecture-independent code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
SYNTHESIZED_F() generally is used together with setup_force_cpu_cap(),
i.e. when it makes sense to present the feature even if cpuid does not
have it *and* the VM is not able to see the difference. For example,
it can be used when mitigations on the host automatically protect
the guest as well.
The "SYNTHESIZED_F(SRSO_USER_KERNEL_NO)" line came in as a conflict
resolution between the CPUID overhaul from the KVM tree and support
for the feature in the x86 tree. Using it right now does not hurt,
or make a difference for that matter, because there is no
setup_force_cpu_cap(X86_FEATURE_SRSO_USER_KERNEL_NO). However, it
is a little less future proof in case such a setup_force_cpu_cap()
appears later, for a case where the kernel somehow is not vulnerable
but the guest would have to apply the mitigation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Pull kvm updates from Paolo Bonzini:
"Loongarch:
- Clear LLBCTL if secondary mmu mapping changes
- Add hypercall service support for usermode VMM
x86:
- Add a comment to kvm_mmu_do_page_fault() to explain why KVM
performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
enabled
- Ensure that all SEV code is compiled out when disabled in Kconfig,
even if building with less brilliant compilers
- Remove a redundant TLB flush on AMD processors when guest CR4.PGE
changes
- Use str_enabled_disabled() to replace open coded strings
- Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
APICv cache prior to every VM-Enter
- Overhaul KVM's CPUID feature infrastructure to track all vCPU
capabilities instead of just those where KVM needs to manage state
and/or explicitly enable the feature in hardware. Along the way,
refactor the code to make it easier to add features, and to make it
more self-documenting how KVM is handling each feature
- Rework KVM's handling of VM-Exits during event vectoring; this
plugs holes where KVM unintentionally puts the vCPU into infinite
loops in some scenarios (e.g. if emulation is triggered by the
exit), and brings parity between VMX and SVM
- Add pending request and interrupt injection information to the
kvm_exit and kvm_entry tracepoints respectively
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU
when loading guest/host PKRU, due to a refactoring of the kernel
helpers that didn't account for KVM's pre-checking of the need to
do WRPKRU
- Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not.
Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
went to userspace, and all the others did not; the new code need
not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not
- As part of enabling TDX virtual machines, support
separation of private/shared EPT into separate roots.
When TDX will be enabled, operations on private pages will need to
go through the privileged TDX Module via SEAMCALLs; as a result,
they are limited and relatively slow compared to reading a PTE.
The patches included in 6.14 allow KVM to keep a mirror of the
private EPT in host memory, and define entries in kvm_x86_ops to
operate on external page tables such as the TDX private EPT
- The recently introduced conversion of the NX-page reclamation
kthread to vhost_task moved the task under the main process. The
task is created as soon as KVM_CREATE_VM was invoked and this, of
course, broke userspace that didn't expect to see any child task of
the VM process until it started creating its own userspace threads.
In particular crosvm refuses to fork() if procfs shows any child
task, so unbreak it by creating the task lazily. This is arguably a
userspace bug, as there can be other kinds of legitimate worker
tasks and they wouldn't impede fork(); but it's not like userspace
has a way to distinguish kernel worker tasks right now. Should they
show as "Kthread: 1" in proc/.../status?
x86 - Intel:
- Fix a bug where KVM updates hardware's APICv cache of the highest
ISR bit while L2 is active, which ultimately results in a
hardware-accelerated L1 EOI effectively being lost
- Honor event priority when emulating Posted Interrupt delivery
during nested VM-Enter by queueing KVM_REQ_EVENT instead of
immediately handling the interrupt
- Rework KVM's processing of the Page-Modification Logging buffer to
reap entries in the same order they were created, i.e. to mark gfns
dirty in the same order that hardware marked the page/PTE dirty
- Misc cleanups
Generic:
- Cleanup and harden kvm_set_memory_region(); add proper lockdep
assertions when setting memory regions and add a dedicated API for
setting KVM-internal memory regions. The API can then explicitly
disallow all flags for KVM-internal memory regions
- Explicitly verify the target vCPU is online in kvm_get_vcpu() to
fix a bug where KVM would return a pointer to a vCPU prior to it
being fully online, and give kvm_for_each_vcpu() similar treatment
to fix a similar flaw
- Wait for a vCPU to come online prior to executing a vCPU ioctl, to
fix a bug where userspace could coerce KVM into handling the ioctl
on a vCPU that isn't yet onlined
- Gracefully handle xarray insertion failures; even though such
failures are impossible in practice after xa_reserve(), reserving
an entry is always followed by xa_store() which does not know (or
differentiate) whether there was an xa_reserve() before or not
RISC-V:
- Zabha, Svvptc, and Ziccrse extension support for guests. None of
them require anything in KVM except for detecting them and marking
them as supported; Zabha adds byte and halfword atomic operations,
while the others are markers for specific operation of the TLB and
of LL/SC instructions respectively
- Virtualize SBI system suspend extension for Guest/VM
- Support firmware counters which can be used by the guests to
collect statistics about traps that occur in the host
Selftests:
- Rework vcpu_get_reg() to return a value instead of using an
out-param, and update all affected arch code accordingly
- Convert the max_guest_memory_test into a more generic
mmu_stress_test. The basic gist of the "conversion" is to have the
test do mprotect() on guest memory while vCPUs are accessing said
memory, e.g. to verify KVM and mmu_notifiers are working as
intended
- Play nice with treewide builds of unsupported architectures, e.g.
arm (32-bit), as KVM selftests' Makefile doesn't do anything to
ensure the target architecture is actually one KVM selftests
supports
- Use the kernel's $(ARCH) definition instead of the target triple
for arch specific directories, e.g. arm64 instead of aarch64,
mainly so as not to be different from the rest of the kernel
- Ensure that format strings for logging statements are checked by
the compiler even when the logging statement itself is disabled
- Attempt to whack the last LLC references/misses mole in the Intel
PMU counters test by adding a data load and doing CLFLUSH{OPT} on
the data instead of the code being executed. It seems that modern
Intel CPUs have learned new code prefetching tricks that bypass the
PMU counters
- Fix a flaw in the Intel PMU counters test where it asserts that
events are counting correctly without actually knowing what the
events count given the underlying hardware; this can happen if
Intel reuses a formerly microarchitecture-specific event encoding
as an architectural event, as was the case for Top-Down Slots"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
kvm: defer huge page recovery vhost task to later
KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
KVM: Disallow all flags for KVM-internal memslots
KVM: x86: Drop double-underscores from __kvm_set_memory_region()
KVM: Add a dedicated API for setting KVM-internal memslots
KVM: Assert slots_lock is held when setting memory regions
KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
LoongArch: KVM: Add hypercall service support for usermode VMM
LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
KVM: VMX: read the PML log in the same order as it was written
KVM: VMX: refactor PML terminology
KVM: VMX: Fix comment of handle_vmx_instruction()
KVM: VMX: Reinstate __exit attribute for vmx_exit()
KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
KVM: x86: Use LVT_TIMER instead of an open coded literal
RISC-V: KVM: Add new exit statstics for redirected traps
RISC-V: KVM: Update firmware counters for various events
RISC-V: KVM: Redirect instruction access fault trap to guest
...
|