path: root/arch/x86/kvm
2025-01-25Merge tag 'hyperv-next-signed-20250123' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Introduce a new set of Hyper-V headers in include/hyperv and replace
   the old hyperv-tlfs.h with the new headers (Nuno Das Neves)

 - Fixes for the Hyper-V VTL mode (Roman Kisel)

 - Fixes for cpu mask usage in Hyper-V code (Michael Kelley)

 - Document the guest VM hibernation behaviour (Michael Kelley)

 - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain)

* tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Documentation: hyperv: Add overview of guest VM hibernation
  hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id()
  hyperv: Do not overlap the hvcall IO areas in get_vtl()
  hyperv: Enable the hypercall output page for the VTL mode
  hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
  Drivers: hv: vmbus: Log on missing offers if any
  Drivers: hv: vmbus: Wait for boot-time offers during boot and resume
  uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup
  iommu/hyper-v: Don't assume cpu_possible_mask is dense
  Drivers: hv: Don't assume cpu_possible_mask is dense
  x86/hyperv: Don't assume cpu_possible_mask is dense
  hyperv: Remove the now unused hyperv-tlfs.h files
  hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
  hyperv: Add new Hyper-V headers in include/hyperv
  hyperv: Clean up unnecessary #includes
  hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-24kvm: defer huge page recovery vhost task to laterKeith Busch
Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once; except it is actually usable, because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
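For context, the pthread_once() comparison above is about the lack of an argument. A minimal userspace sketch of a "call once, but with an argument" helper, purely illustrative of the concept (the names and layout here are not the actual include/linux/call_once.h API added by this patch):

    #include <pthread.h>
    #include <stdbool.h>

    struct once {
            pthread_mutex_t lock;
            bool done;
    };

    #define ONCE_INIT { PTHREAD_MUTEX_INITIALIZER, false }

    /* Run cb(arg) the first time this is reached; later calls are no-ops. */
    static void call_once(struct once *once, void (*cb)(void *arg), void *arg)
    {
            pthread_mutex_lock(&once->lock);
            if (!once->done) {
                    cb(arg);
                    once->done = true;
            }
            pthread_mutex_unlock(&once->lock);
    }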
2025-01-21Merge tag 'kthread-for-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "Kthreads affinity follow either of 4 existing different patterns:

  1) Per-CPU kthreads must stay affine to a single CPU and never execute
     relevant code on any other CPU. This is currently handled by smpboot
     code which takes care of CPU-hotplug operations. Affinity here is a
     correctness constraint.

  2) Some kthreads _have_ to be affine to a specific set of CPUs and
     can't run anywhere else. The affinity is set through
     kthread_bind_mask() and the subsystem takes care by itself to handle
     CPU-hotplug operations. Affinity here is assumed to be a correctness
     constraint.

  3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
     This is not a correctness constraint but merely a preference in
     terms of memory locality. kswapd and kcompactd both fall into this
     category. The affinity is set manually like for any other task and
     CPU-hotplug is supposed to be handled by the relevant subsystem so
     that the task is properly reaffined whenever a given CPU from the
     node comes up. Also care should be taken so that the node affinity
     doesn't cross isolated (nohz_full) cpumask boundaries.

  4) Similar to the previous point except kthreads have a _preferred_
     affinity different than a node. Both RCU boost kthreads and RCU exp
     kworkers fall into this category as they refer to "RCU nodes" from
     a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handle
  CPU-hotplug operations and CPU isolation. Each of which do it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred affinity
     different than the specified node. When the preferred affinity can't
     be applied because the possible targets are offline or isolated
     (nohz_full), the kthread is affine to the housekeeping CPUs (which
     means to all online CPUs most of the time or only the non-nohz_full
     CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add some correctness check to ensure kthread_bind() is always called
     before the first kthread wake up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
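As an aside, a usage sketch of the new preferred-affinity flow described above; the exact kthread_affine_preferred() signature is assumed from the text rather than copied from the patches, and the worker name is illustrative:

    #include <linux/kthread.h>
    #include <linux/cpumask.h>
    #include <linux/sched.h>
    #include <linux/err.h>

    static int my_worker_fn(void *data)
    {
            while (!kthread_should_stop())
                    schedule_timeout_interruptible(HZ);
            return 0;
    }

    static struct task_struct *start_my_worker(int node, const struct cpumask *preferred)
    {
            struct task_struct *t;

            t = kthread_create_on_node(my_worker_fn, NULL, node, "my_worker");
            if (IS_ERR(t))
                    return t;

            /*
             * Preferred (not mandatory) affinity, declared before the first
             * wake-up; falls back to housekeeping CPUs if the targets are
             * offline or isolated.
             */
            kthread_affine_preferred(t, preferred);

            wake_up_process(t);
            return t;
    }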
2025-01-21Merge tag 'objtool-core-2025-01-20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Introduce the generic section-based annotation infrastructure a.k.a.
   ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)

 - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
    - ANNOTATE_NOENDBR
    - ANNOTATE_RETPOLINE_SAFE
    - instrumentation_{begin,end}()
    - VALIDATE_UNRET_BEGIN
    - ANNOTATE_IGNORE_ALTERNATIVE
    - ANNOTATE_INTRA_FUNCTION_CALL
    - {.UN}REACHABLE

 - Optimize the annotation-sections parsing code (Peter Zijlstra)

 - Centralize annotation definitions in <linux/objtool.h>

 - Unify & simplify the barrier_before_unreachable()/unreachable()
   definitions (Peter Zijlstra)

 - Convert unreachable() calls to BUG() in x86 code, as unreachable()
   has unreliable code generation (Peter Zijlstra)

 - Remove annotate_reachable() and annotate_unreachable(), as it's
   unreliable against compiler optimizations (Peter Zijlstra)

 - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)

 - Robustify the annotation code by warning about unknown annotation
   types (Peter Zijlstra)

 - Allow arch code to discover jump table size, in preparation of
   annotated jump table support (Ard Biesheuvel)

* tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Convert unreachable() to BUG()
  objtool: Allow arch code to discover jump table size
  objtool: Warn about unknown annotation types
  objtool: Fix ANNOTATE_REACHABLE to be a normal annotation
  objtool: Convert {.UN}REACHABLE to ANNOTATE
  objtool: Remove annotate_{,un}reachable()
  loongarch: Use ASM_REACHABLE
  x86: Convert unreachable() to BUG()
  unreachable: Unify
  objtool: Collect more annotations in objtool.h
  objtool: Collapse annotate sequences
  objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE
  objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE
  objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE
  objtool: Convert instrumentation_{begin,end}() to ANNOTATE
  objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE
  objtool: Convert ANNOTATE_NOENDBR to ANNOTATE
  objtool: Generic annotation infrastructure
2025-01-20Merge branch 'kvm-mirror-page-tables' into HEADPaolo Bonzini
As part of enabling TDX virtual machines, support separation of private/shared EPT into separate roots.

Confidential computing solutions almost invariably have concepts of private and shared memory, but they may differ a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared in another. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs.

As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs.

There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory:

  external EPT - Hidden within the TDX module, modified via TDX module calls.

  mirror EPT   - Bookkeeping tree used as an optimization by KVM, not used by
                 the processor.

  direct EPT   - Normal EPT that maps unencrypted shared memory. Managed like
                 the EPT of a normal VM.

Modifying external EPT
----------------------

Modifications to the mirrored page tables need to also perform the same operations to the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the groundwork for doing this.

In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE cannot anymore be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks.

Zapping external EPT
--------------------

While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().

For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and private. These operations must not zap any private memory that is in use by the guest.

This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations. Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory which is triggered by the guest.

To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example:

  Memslot deletion           - Private and shared
  MMU notifier based zapping - Shared only
  Conversion to shared       - Private only
  Conversion to private      - Shared only

Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.
2025-01-20Merge branch 'kvm-userspace-hypercall' into HEADPaolo Bonzini
Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not.
2025-01-20Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.14:

 - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware. Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.

 - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some
   scenarios (e.g. if emulation is triggered by the exit), and brings parity
   between VMX and SVM.

 - Add pending request and interrupt injection information to the kvm_exit
   and kvm_entry tracepoints respectively.

 - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20Merge tag 'kvm-x86-vmx-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.14:

 - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
   while L2 is active, which ultimately results in a hardware-accelerated L1
   EOI effectively being lost.

 - Honor event priority when emulating Posted Interrupt delivery during
   nested VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling
   the interrupt.

 - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
   prior to every VM-Enter.

 - Rework KVM's processing of the Page-Modification Logging buffer to reap
   entries in the same order they were created, i.e. to mark gfns dirty in
   the same order that hardware marked the page/PTE dirty.

 - Misc cleanups.
2025-01-20Merge tag 'kvm-x86-svm-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.14:

 - Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code
   is optimized away when building with less than brilliant compilers.

 - Remove a now-redundant TLB flush when guest CR4.PGE changes.

 - Use str_enabled_disabled() to replace open coded strings.
2025-01-20Merge tag 'kvm-x86-mmu-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.14: - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
2025-01-20Merge tag 'kvm-memslots-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM kvm_set_memory_region() cleanups and hardening for 6.14:

 - Add proper lockdep assertions when setting memory regions.

 - Add a dedicated API for setting KVM-internal memory regions.

 - Explicitly disallow all flags for KVM-internal memory regions.
2025-01-15KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()Yan Zhao
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID) instead of 1 in kvm_mmu_page_fault(). The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e., npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(), kvm_handle_page_fault()). They either check if the return value is > 0 (as in npf_interception()) or pass it further to vcpu_run() to decide whether to break out of the kernel loop and return to the user when r <= 0. Therefore, returning any positive value is equivalent to returning 1. Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure a positive return value. This is a preparation to allow TDX's EPT violation handler to check the RET_PF* value and retry internally for RET_PF_RETRY. No functional changes are intended. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
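To make the caller-side contract above concrete, a small sketch (the wrapper function here is invented for illustration): any positive RET_PF_* value means "handled, resume the guest", so returning the enum value is equivalent to returning 1:

    static int example_handle_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
    {
            int r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);

            if (r < 0)      /* -errno: propagate the failure */
                    return r;
            if (r > 0)      /* any positive RET_PF_* value: resume the guest */
                    return 1;
            return 0;       /* exit to userspace */
    }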
2025-01-14KVM: x86: Drop double-underscores from __kvm_set_memory_region()Sean Christopherson
Now that there's no outer wrapper for __kvm_set_memory_region() and it's static, drop its double-underscore prefix. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14KVM: Add a dedicated API for setting KVM-internal memslotsSean Christopherson
Add a dedicated API for setting internal memslots, and have it explicitly disallow setting userspace memslots. Setting a userspace memslot without a direct command from userspace would result in all manner of issues. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-14KVM: Assert slots_lock is held when setting memory regionsSean Christopherson
Add proper lockdep assertions in __kvm_set_memory_region() and __x86_set_memory_region() instead of relying on comments. Opportunistically delete __kvm_set_memory_region()'s entire function comment as the API doesn't allocate memory or select a gfn, and the "mostly for framebuffers" comment hasn't been true for a very long time. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
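The shape of the assertion being added, sketched standalone (the function name is illustrative): the lock requirement is enforced by lockdep rather than stated in a comment:

    static void example_set_memory_region(struct kvm *kvm)
    {
            lockdep_assert_held(&kvm->slots_lock);

            /* ... actually update the memslots ... */
    }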
2025-01-10KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()Thorsten Blum
Remove hard-coded strings by using the str_enabled_disabled() helper function. Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20250110101100.272312-2-thorsten.blum@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
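For illustration, the kind of call site this cleans up (the message text is an example, not the exact hunk from the patch):

    #include <linux/printk.h>
    #include <linux/string_choices.h>

    static void report_npt(bool npt_enabled)
    {
            /* Before: pr_info("Nested Paging %s\n", npt_enabled ? "enabled" : "disabled"); */
            pr_info("Nested Paging %s\n", str_enabled_disabled(npt_enabled));
    }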
2025-01-10hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.hNuno Das Neves
Switch to using hvhdk.h everywhere in the kernel. This header includes all the new Hyper-V headers in include/hyperv, which form a superset of the definitions found in hyperv-tlfs.h. This makes it easier to add new Hyper-V interfaces without being restricted to those in the TLFS doc (reflected in hyperv-tlfs.h). To be more consistent with the original Hyper-V code, the names of some definitions are changed slightly. Update those where needed. Update comments in mshyperv.h files to point to include/hyperv for adding new definitions. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-01-08KVM: VMX: read the PML log in the same order as it was writtenMaxim Levitsky
Intel's PRM specifies that the CPU writes to the PML log 'backwards' or in other words, it first writes entry 511, then entry 510 and so on. I also confirmed on the bare metal that the CPU indeed writes the entries in this order.

KVM, on the other hand, reads the entries in the opposite order, from the last written entry and towards entry 511, and dumps them in this order to the dirty ring.

Usually this doesn't matter, except for one complex nesting case: KVM retries the instructions that cause MMU faults. This might cause an emulated PML log entry to be visible to L1's hypervisor before the actual memory write was committed. This happens when the L0 MMU fault is followed directly by the VM exit to L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full' event.

This problem doesn't have a noticeable real-world impact because this write retry is not much different from the guest writing to the same page multiple times, which is also not reflected in the dirty log. The users of the dirty logging only rely on correct reporting of the clean pages, or in other words they assume that if a page is clean, then no writes were committed to it since the moment it was marked clean.

However KVM has a kvm_dirty_log_test selftest, a test that tests both the clean and the dirty pages vs the memory contents, and can fail if it detects a dirty page which has an old value at the offset 0 which the test writes. To avoid failure, the test has a workaround for this specific problem: the test skips checking memory that belongs to the last dirty ring entry that it has seen, relying on the fact that as long as memory writes are committed in-order, only the last entry can belong to a not yet committed memory write.

However, since L1's KVM is reading the PML log in the opposite direction from the one L0 wrote it in, the last dirty ring entry often will not be the last entry written by L0. To fix this, switch the order in which KVM reads the PML log.

Note that this issue is not present on the bare metal, because on the bare metal, an update of the A/D bits of a present entry, PML logging and the actual memory write are all done by the CPU without any hypervisor intervention and pending interrupt evaluation, thus once a PML log and/or vCPU kick happens, all memory writes that are in the PML log are committed to memory. The only exception to this rule is when the guest hits a not present EPT entry, in which case KVM first reads (backward) the PML log, dumps it to the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs this to the dirty ring, thus making that entry the last one in the dirty ring.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241219221034.903927-3-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08KVM: VMX: refactor PML terminologyMaxim Levitsky
Rename PML_ENTITY_NUM to PML_LOG_NR_ENTRIES. Add PML_HEAD_INDEX to specify the first entry that the CPU writes. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241219221034.903927-2-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
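Putting the two patches above together, a simplified sketch of the new read order using the renamed constants (the helper and the dirty-marking call are illustrative, not the exact KVM code). Hardware fills the log from entry PML_HEAD_INDEX (511) downwards, so reading "in the order it was written" means walking from PML_HEAD_INDEX down toward the current PML index:

    #define PML_LOG_NR_ENTRIES      512
    #define PML_HEAD_INDEX          (PML_LOG_NR_ENTRIES - 1)

    static void example_flush_pml_buffer(const u64 *pml_buf, int pml_idx)
    {
            int i;

            /* Valid entries are pml_idx+1 .. PML_HEAD_INDEX; walk oldest (511) to newest. */
            for (i = PML_HEAD_INDEX; i > pml_idx; i--) {
                    u64 gpa = pml_buf[i];

                    /* example_mark_gpa_dirty(gpa); -- hypothetical helper */
                    (void)gpa;
            }
    }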
2025-01-08KVM: VMX: Fix comment of handle_vmx_instruction()Gao Shiyuan
Fix a goof in handle_vmx_instruction()'s comment where it references the non-existent nested_vmx_setup(); the function that overwrites the exit handlers is nested_vmx_hardware_setup(). Note, this isn't a case of a stale comment, e.g. due to the function being renamed. The comment has always been wrong. Fixes: e4027cfafd78 ("KVM: nVMX: Set callbacks for nested functions during hardware setup") Signed-off-by: Gao Shiyuan <gaoshiyuan@baidu.com> Link: https://lore.kernel.org/r/20250103153814.73903-1-gaoshiyuan@baidu.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08KVM: VMX: Reinstate __exit attribute for vmx_exit()Costas Argyris
Tag vmx_exit() with __exit now that it's no longer used by vmx_init(). Commit a7b9020b06ec ("x86/l1tf: Handle EPT disabled state proper") dropped the "__exit" attribute from vmx_exit() because vmx_init() was changed to call vmx_exit(). However, commit e32b120071ea ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace") changed vmx_init() to call __vmx_exit() instead of the "full" vmx_exit(). This made it possible to mark vmx_exit() as "__exit" again, as it originally was, and enjoy the benefits that it provides (the function can be discarded from memory in situations where it cannot be called, like the module being built-in or module unloading being disabled in the kernel). Signed-off-by: Costas Argyris <costas.argyris@amd.com> Link: https://lore.kernel.org/r/20250102154050.2403-1-costas.argyris@amd.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()Thorsten Blum
Remove hard-coded strings by using the str_enabled_disabled() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20241227094450.674104-2-thorsten.blum@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08KVM: x86: Avoid double RDPKRU when loading host/guest PKRUSean Christopherson
Use the raw wrpkru() helper when loading the guest/host's PKRU on switch to/from guest context, as the write_pkru() wrapper incurs an unnecessary rdpkru(). In both paths, KVM is guaranteed to have performed RDPKRU since the last possible write, i.e. KVM has a fresh cache of the current value in hardware. This effectively restores KVM's behavior to that of KVM prior to commit c806e88734b9 ("x86/pkeys: Provide *pkru() helpers"), which renamed the raw helper from __write_pkru() => wrpkru(), and turned __write_pkru() into a wrapper. Commit 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different from current") then added the extra RDPKRU to avoid an unnecessary WRPKRU, but completely missed that KVM already optimized away pointless writes. Reported-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different from current") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20241221011647.3747448-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
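The distinction at issue, sketched (the wrapper function is invented for this example): write_pkru() does a hidden rdpkru() to skip redundant writes, whereas the raw wrpkru() writes unconditionally, and KVM already has a fresh cached value to compare against:

    static inline void example_load_pkru(u32 want, u32 cached)
    {
            if (want != cached)     /* KVM's own check, using its cached value */
                    wrpkru(want);   /* raw write, no extra RDPKRU */
    }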
2025-01-08KVM: x86: Use LVT_TIMER instead of an open coded literalLiam Ni
Use LVT_TIMER instead of the literal '0' to clean up the apic_lvt_mask lookup when emulating handling writes to APIC_LVTT. No functional change intended. Signed-off-by: Liam Ni <zhiguangni01@gmail.com> [sean: manually regenerate patch (whitespace damaged), massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
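Illustrative before/after of the cleanup (not the exact hunk from the patch):

    /* before */
    val &= apic_lvt_mask[0];
    /* after */
    val &= apic_lvt_mask[LVT_TIMER];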
2025-01-08treewide: Introduce kthread_run_worker[_on_cpu]()Frederic Weisbecker
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
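A usage sketch of the consolidated naming, assuming the new helper mirrors kthread_create_worker()'s signature (the worker name is illustrative):

    #include <linux/kthread.h>

    static struct kthread_worker *start_example_worker(void)
    {
            /* Like kthread_run(): create the worker's kthread and start it. */
            return kthread_run_worker(0, "example_worker");
    }

A worker that needs kthread_bind_mask() or kthread_affine_preferred() would instead use kthread_create_worker(), apply the affinity, and only then start it.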
2024-12-30KVM: x86: Advertise SRSO_USER_KERNEL_NO to userspaceBorislav Petkov (AMD)
SRSO_USER_KERNEL_NO denotes whether the CPU is affected by SRSO across user/kernel boundaries. Advertise it to guest userspace. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20241202120416.6054-3-bp@kernel.org
2024-12-23KVM: x86/mmu: Prevent aliased memslot GFNsRick Edgecombe
Add a few sanity checks to prevent memslot GFNs from ever having alias bits set.

Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables.

For TDX, the shared half will be mapped in the higher alias, with a "shared bit" set in the GPA. However, KVM will still manage it with the same memslots as the private half. This means memslot lookups and zapping operations will be provided with a GFN without the shared bit set. If these memslot GFNs ever had the bit that selects between the two aliases it could lead to unexpected behavior in the complicated code that directs faulting or zapping operations between the roots that map the two aliases.

As a safety measure, prevent memslots from being set at a GFN range that contains the alias bit. Also, check in kvm_faultin_pfn() for the fault path. This latter check does less today, as the alias bits are specifically stripped from the GFN being checked, however future code could possibly call into the fault handler in a way that skips this stripping. Since kvm_faultin_pfn() now has many references to vcpu->kvm, extract it to a local variable.

Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
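A sketch of the kind of sanity check being added (kvm_gfn_direct_bits() is the accessor implied by later patches in this series; treat the exact form as an assumption):

    static int example_check_memslot_gfns(struct kvm *kvm, gfn_t base_gfn, u64 npages)
    {
            gfn_t alias_bits = kvm_gfn_direct_bits(kvm);

            /* Memslot GFNs must never carry the bit that selects an alias. */
            if ((base_gfn | (base_gfn + npages - 1)) & alias_bits)
                    return -EINVAL;

            return 0;
    }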
2024-12-23KVM: x86/tdp_mmu: Don't zap valid mirror roots in kvm_tdp_mmu_zap_all()Rick Edgecombe
Don't zap valid mirror roots in kvm_tdp_mmu_zap_all(), which in effect means only direct roots (invalid and valid). For TDX, kvm_tdp_mmu_zap_all() is only called during MMU notifier release. Since mirrored EPT comes from guest mem, it will never be mapped to userspace, and so this won't apply. But in addition to being unnecessary, mirrored EPT is cleaned up in a special way during VM destruction. Pass the KVM_INVALID_ROOTS bit into __for_each_tdp_mmu_root_yield_safe() as well, to clean up invalid direct roots, as is the current behavior. While at it, remove an obsolete reference to work item-based zapping. Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-18-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()Isaku Yamahata
Rename kvm_tdp_mmu_invalidate_all_roots() to kvm_tdp_mmu_invalidate_roots(), and have it take an enum kvm_tdp_mmu_root_types argument. kvm_tdp_mmu_invalidate_roots() is called with different root types. For kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing down a VM it needs to invalidate all roots. Have the callers only invalidate the required roots instead of all roots. Within kvm_tdp_mmu_invalidate_roots(), respect the root type passed by checking the root type in the root iterator. Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-17-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/tdp_mmu: Propagate tearing down mirror page tablesIsaku Yamahata
Integrate hooks for mirroring page table operations for cases where TDX will zap PTEs or free page tables. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly though calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. Since calls into the TDX module are relatively slow, walking private page tables by making calls into the TDX module would not be efficient. Because of this, previous changes have taught the TDP MMU to keep a mirror root, which is separate, unmapped TDP root that private operations can be directed to. Currently this root is disconnected from the guest. Now add plumbing to propagate changes to the "external" page tables being mirrored. Just create the x86_ops for now, leave plumbing the operations into the TDX module for future patches. Add two operations for tearing down page tables, one for freeing page tables (free_external_spt) and one for zapping PTEs (remove_external_spte). Define them such that remove_external_spte will perform a TLB flush as well. (in TDX terms "ensure there are no active translations"). TDX MMU support will exclude certain MMU operations, so only plug in the mirroring x86 ops where they will be needed. For zapping/freeing, only hook tdp_mmu_iter_set_spte() which is used for mapping and linking PTs. Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and kvm_mmu_zap_all_fast(). In previous changes to address races around concurrent populating using tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set FROZEN_SPTE in the mirrored page tables while performing the external operations. Such a solution is not needed for the tear down paths in TDX as these will always be performed with the mmu_lock held for write. Sprinkle some KVM_BUG_ON()s to reflect this. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-16-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/tdp_mmu: Propagate building mirror page tablesIsaku Yamahata
Integrate hooks for mirroring page table operations for cases where TDX will set PTEs or link page tables.

Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables.

Since calls into the TDX module are relatively slow, walking private page tables by making calls into the TDX module would not be efficient. Because of this, previous changes have taught the TDP MMU to keep a mirror root, which is a separate, unmapped TDP root that private operations can be directed to. Currently this root is disconnected from any actual guest mapping. Now add plumbing to propagate changes to the "external" page tables being mirrored. Just create the x86_ops for now, leave plumbing the operations into the TDX module for future patches.

Add two operations for setting up external page tables, one for linking new page tables and one for setting leaf PTEs. Don't add any op for configuring the root PFN, as TDX handles this itself. Also don't provide a way to set permissions on the PTEs, as TDX doesn't support it.

This results in MMU "mirroring" support that is very targeted towards TDX. Since it is likely there will be no other user, the main benefit of making the support generic is to keep TDX specific *looking* code outside of the MMU. As a generic feature it will make enough sense from TDX's perspective. For developers unfamiliar with TDX arch it can express the general concepts such that they can continue to work in the code.

TDX MMU support will exclude certain MMU operations, so only plug in the mirroring x86 ops where they will be needed. For setting/linking, only hook tdp_mmu_set_spte_atomic() which is used for mapping and linking PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for setting PTEs in operations unsupported by TDX: splitting huge pages and write protecting. Sprinkle KVM_BUG_ON()s to document as code that these paths are not supported for mirrored page tables. For zapping operations, leave those for near future changes.

Many operations in the TDP MMU depend on atomicity of the PTE update. While the mirror PTE on KVM's side can be updated atomically, the update that happens inside the external operations (S-EPT updates via TDX module call) can't happen atomically with the mirror update. The following race could result during two vCPUs populating private memory:

  * vcpu 1: atomically update 2M level mirror EPT entry to be present
  * vcpu 2: read 2M level EPT entry that is present
  * vcpu 2: walk down into 4K level EPT
  * vcpu 2: atomically update 4K level mirror EPT entry to be present
  * vcpu 2: set_external_spte() to update 4K secure EPT entry => error
            because 2M secure EPT entry is not populated yet
  * vcpu 1: link_external_spt() to update 2M secure EPT entry

Prevent this by setting the mirror PTE to FROZEN_SPTE while the reflect operations are performed. Only write the actual mirror PTE value once the reflect operations have completed. When trying to set a PTE to present and encountering a frozen SPTE, retry the fault.

By doing this the race is prevented as follows:

  * vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
  * vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
  * vcpu 2: find that the EPT entry is frozen, abandon the page table walk
            and resume guest execution
  * vcpu 1: link_external_spt() to update 2M secure EPT entry
  * vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
  * vcpu 2: resume guest execution

Depending on vcpu 1 state, vcpu 2 may result in EPT violation again or make progress on guest execution.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-15-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
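Pseudo-code sketch of the freeze/propagate/unfreeze sequence described above (helper names and the exact write primitives are illustrative, not the actual KVM functions):

    static int example_set_mirror_and_external_spte(u64 *mirror_sptep, u64 old_spte,
                                                    u64 new_spte)
    {
            /* 1) Freeze the mirror PTE; concurrent faults see FROZEN_SPTE and retry. */
            if (!try_cmpxchg64(mirror_sptep, &old_spte, FROZEN_SPTE))
                    return -EBUSY;          /* lost the race, caller retries the fault */

            /* 2) Do the slow, non-atomic external (S-EPT) update via the new x86_ops hook. */
            /* set_external_spte(...); */

            /* 3) Unfreeze: publish the real mirror value only after step 2 succeeded. */
            WRITE_ONCE(*mirror_sptep, new_spte);
            return 0;
    }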
2024-12-23KVM: x86/tdp_mmu: Propagate attr_filter to MMU notifier callbacksPaolo Bonzini
Teach the MMU notifier callbacks how to check kvm_gfn_range.process to filter which KVM MMU root types to operate on. The private GPAs are backed by guest memfd. Such memory is not subjected to MMU notifier callbacks because it can't be mapped into the host user address space. Now kvm_gfn_range conveys info about which root to operate on. Enhance the callback to filter the root page table type. The KVM MMU notifier comes down to two functions. kvm_tdp_mmu_unmap_gfn_range() and __kvm_tdp_mmu_age_gfn_range(): - invalidate_range_start() calls kvm_tdp_mmu_unmap_gfn_range() - invalidate_range_end() doesn't call into arch code - the other callbacks call __kvm_tdp_mmu_age_gfn_range() For VM's without a private/shared split in the EPT, all operations should target the normal(direct) root. With the switch from for_each_tdp_mmu_root() to __for_each_tdp_mmu_root() in kvm_tdp_mmu_handle_gfn(), there are no longer any users of for_each_tdp_mmu_root(). Remove it. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-14-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/tdp_mmu: Support mirror root for TDP MMUIsaku Yamahata
Add the ability for the TDP MMU to maintain a mirror of a separate mapping.

Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables.

In order to handle both shared and private memory, KVM needs to learn to handle faults and other operations on the correct root for the operation. KVM could learn the concept of private roots, and operate on them by calling out to operations that call into the TDX module. But there are two problems with that:

 1. Calls into the TDX module are relatively slow compared to the simple
    accesses required to read a PTE managed directly by KVM.

 2. Other Coco technologies deal with private memory completely differently
    and it will make the code confusing when being read from their
    perspective. Special operations added for TDX that set private or zap
    private memory will have nothing to do with these other private memory
    technologies (SEV, etc).

To handle these, instead teach the TDP MMU about a new concept "mirror roots". Such roots maintain page tables that are not actually mapped, and are just used to traverse quickly to determine if the mid level page tables need to be installed. When the memory being mirrored needs to actually be changed, calls can be made via x86_ops.

   private KVM page fault  |
           |               |
           V               |
      private GPA          |      CPU protected EPTP
           |               |              |
           V               |              V
     mirror PT root        |       external PT root
           |               |              |
           V               |              V
       mirror PT --hook to propagate--> external PT
                           |              |
                           |              V
                           |      private guest page
                           |
    non-encrypted memory   |       encrypted memory
                           |

Leave calling out to actually update the private page tables that are being mirrored for later changes. Just implement the handling of MMU operations on mirrored roots.

In order to direct operations to the correct root, add root types KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct roots to private/shared with conditionals. It could also be implemented by making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line up such that conversion could be a direct assignment with a cast. Don't do this because the mapping of private to mirrored is confusing enough. So it is worth not hiding the logic in type casting.

Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place in kvm_mmu_free_roots(), because the private root that is being mirrored cannot be rebuilt like a normal root. It needs to persist for the lifetime of the VM.

The TDX module will also need to be provided with page tables to use for the actual mapping being mirrored by the mirrored page tables. Allocate these in the mapping path using the recently added kvm_mmu_alloc_external_spt().

Don't support 2M pages for now. This is avoided by forcing 4k pages in the fault. Add a KVM_BUG_ON() to verify.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/tdp_mmu: Take root in tdp_mmu_for_each_pte()Isaku Yamahata
Take the root as an argument of tdp_mmu_for_each_pte() instead of looking it up in the mmu. With no other purpose of passing the mmu, drop it. Future changes will want to change which root is used based on the context of the MMU operation. So change the callers to pass in the root currently used, mmu->root.hpa in a preparatory patch to make the later one smaller and easier to review. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-12-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/tdp_mmu: Introduce KVM MMU root types to specify page table typeIsaku Yamahata
Define an enum kvm_tdp_mmu_root_types to specify the KVM MMU root type [1] so that the iterator on the root page table can consistently filter the root page table type instead of only_valid. TDX KVM will operate on KVM page tables with specified types. Shared page table, private page table, or both. Introduce an enum instead of bool only_valid so that we can easily enhance page table types applicable to shared, private, or both in addition to valid or not. Replace only_valid=false with KVM_ANY_ROOTS and only_valid=true with KVM_ANY_VALID_ROOTS. Use KVM_ANY_ROOTS and KVM_ANY_VALID_ROOTS to wrap KVM_VALID_ROOTS to avoid further code churn when direct vs mirror root concepts are introduced in future patches. Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ [1] Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-11-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
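One plausible shape of the enum, built from the names in this changelog (the exact values and aliasing are assumptions; later patches add direct/mirror variants):

    enum kvm_tdp_mmu_root_types {
            KVM_INVALID_ROOTS       = BIT(0),
            KVM_VALID_ROOTS         = BIT(1),

            KVM_ANY_VALID_ROOTS     = KVM_VALID_ROOTS,
            KVM_ANY_ROOTS           = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
    };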
2024-12-23KVM: x86/tdp_mmu: Extract root invalid check from tdx_mmu_next_root()Isaku Yamahata
Extract tdp_mmu_root_match() to check if the root has given types and use it for the root page table iterator. It checks only_invalid now. TDX KVM operates on a shared page table only (Shared-EPT), a mirrored page table only (Secure-EPT), or both based on the operation. KVM MMU notifier operations only on shared page table. KVM guest_memfd invalidation operations only on mirrored page table, and so on. Introduce a centralized matching function instead of open coding matching logic in the iterator. The next step is to extend the function to check whether the page is shared or private Link: https://lore.kernel.org/kvm/ZivazWQw1oCU8VBC@google.com/ Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-10-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/mmu: Support GFN direct bitsIsaku Yamahata
Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid in implementing TDX shared memory.

Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables.

For TDX, the shared half will be mapped in the higher alias, with a "shared bit" set in the GPA. However, KVM will still manage it with the same memslots as the private half. This means memslot lookups and zapping operations will be provided with a GFN without the shared bit set. So KVM will either need to apply or strip the shared bit before mapping or zapping the shared EPT. Having GFNs sometimes have the shared bit and sometimes not would make the code confusing.

So instead arrange the code such that GFNs never have the shared bit set. Create a concept of "direct bits" that are stripped from the fault address when setting fault->gfn, and applied within the TDP MMU iterator. Calling code will behave as if it is operating on the PTE mapping the GFN (without shared bits) but within the iterator, the actual mappings will be shifted using bits specific for the root. SPs will have the GFN set without the shared bit. In the end the TDP MMU will behave like it is mapping things at the GFN without the shared bit but with a strange page table format where everything is offset by the shared bit.

Since TDX only needs to shift the mapping like this for the shared bit, which is mapped as the normal TDP root, add a "gfn_direct_bits" field to the kvm_arch structure for each VM with a default value of 0. It will have the bit set at the position of the GPA shared bit in GFN through TD specific initialization code. Keep TDX specific concepts out of the MMU code by not naming it "shared".

Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN ranges. In the convention established above, these would need to target the shifted GFN range. It won't matter functionally, since the actual implementation will always result in a full flush for the only planned user (TDX). For correctness reasons, future changes can provide a TDX x86_ops.flush_remote_tlbs_range implementation to return -EOPNOTSUPP and force the full flush for TDs.

This leaves one problem. Some operations use a concept of max GFN (i.e. kvm_mmu_max_gfn()), to iterate over the whole TDP range. When applying the direct mask to the start of the range, the iterator would end up skipping iterating over the range not covered by the direct mask bit. For safety, make sure the __tdp_mmu_zap_root() operation iterates over the full GFN range supported by the underlying TDP format. Add a new iterator helper, for_each_tdp_pte_min_level_all(), that iterates the entire TDP GFN range, regardless of root.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-9-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/tdp_mmu: Take struct kvm in iter loopsIsaku Yamahata
Add a struct kvm argument to the TDP MMU iterators. Future changes will want to change how the iterator behaves based on a member of struct kvm. Change the signature and callers of the iterator loop helpers in a separate patch to make the future one easier to review. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-8-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return voidRick Edgecombe
The kvm_tdp_mmu_alloc_root() function currently always returns 0. This allows for the caller, mmu_alloc_direct_roots(), to call kvm_tdp_mmu_alloc_root() and also return 0 in one line: return kvm_tdp_mmu_alloc_root(vcpu); So it is useful even though the return value of kvm_tdp_mmu_alloc_root() is always the same. However, in future changes, kvm_tdp_mmu_alloc_root() will be called twice in mmu_alloc_direct_roots(). This will force the first call to either awkwardly handle the return value that will always be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void. Do it in a separate change so the future change will be cleaner. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240718211230.1492011-7-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_roleIsaku Yamahata
Introduce a "is_mirror" member to the kvm_mmu_page_role union to identify SPTEs associated with the mirrored EPT. The TDX module maintains the private half of the EPT mapped in the TD in its protected memory. KVM keeps a copy of the private GPAs in a mirrored EPT tree within host memory. This "is_mirror" attribute enables vCPUs to find and get the root page of mirrored EPT from the MMU root list for a guest TD. This also allows KVM MMU code to detect changes in mirrored EPT according to the "is_mirror" mmu page role and propagate the changes to the private EPT managed by TDX module. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-6-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/mmu: Add an external pointer to struct kvm_mmu_pageIsaku Yamahata
Add an external pointer to struct kvm_mmu_page for TDX's private page table and add helper functions to allocate/initialize/free a private page table page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU doesn't use unsync_children and write_flooding_count, pack them to have room for a pointer and use a union to avoid memory overhead.

For private GPA, CPU refers to a private page table whose contents are encrypted. The dedicated APIs to operate on it (e.g. updating/reading its PTE entry) are used, and their cost is expensive.

When KVM resolves the KVM page fault, it walks the page tables. To reuse the existing KVM MMU code and mitigate the heavy cost of directly walking the private page table, allocate two sets of page tables for the private half of the GPA space. For the page tables that KVM will walk, allocate them like normal and refer to them as mirror page tables. Additionally allocate one more page for the page tables the CPU will walk, and call them external page tables. Resolve the KVM page fault with the existing code, and do additional operations necessary for modifying the external page table in future patches.

The relationship of the types of page tables in this scheme is depicted below:

              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
     shared GPA           private GPA            |
        |                      |                 |
        V                      V                 |
 shared PT root         mirror PT root           |    private PT root
        |                      |                 |           |
        V                      V                 |           V
     shared PT             mirror PT --propagate--> external PT
        |                      |                 |           |
        |                      \-----------------+------\    |
        V                                        |      V    V
  shared guest page                              |    private guest page
                                                 |
     non-encrypted memory                        |    encrypted memory

  PT          - Page table
  Shared PT   - Visible to KVM, and the CPU uses it for shared mappings.
  External PT - The CPU uses it, but it is invisible to KVM. TDX module
                updates this table to map private guest pages.
  Mirror PT   - It is visible to KVM, but the CPU doesn't use it. KVM uses
                it to propagate PT change to the actual private PT.

Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it to the TDX vm type.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: Add member to struct kvm_gfn_range to indicate private/sharedIsaku Yamahata
Add new members to struct kvm_gfn_range to indicate which mapping (private-vs-shared) to operate on: enum kvm_gfn_range_filter attr_filter. Update the core zapping operations to set them appropriately.

TDX utilizes two GPA aliases for the same memslots, one for memory that is for private memory and one that is for shared. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations such as guest driven conversion of memory between private and shared should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared). So zapping operations will by default zap both aliases. Add fields in struct kvm_gfn_range to allow callers to specify which aliases so they can only target the aliases appropriate for their specific operation.

There was feedback that target aliases should be specified such that the default value (0) is to operate on both aliases. Several options were considered: several variations of having separate bools defined such that the default behavior was to process both aliases. They either allowed nonsensical configurations, or were confusing for the caller. A simple enum was also explored and was close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value. Catch ranges that didn't have the target aliases specified by looking for that specific value.

Set target alias with enum appropriately for these MMU operations:

 - For KVM's mmu notifier callbacks, zap shared pages only because private
   pages won't have a userspace mapping

 - For setting memory attributes, kvm_arch_pre_set_memory_attributes()
   chooses the aliases based on the attribute.

 - For guest_memfd invalidations, zap private only.

Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
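A sketch of the new member in context; the filter values are shown as bit flags since 0 is deliberately not a valid filter, and the surrounding field layout is illustrative rather than copied from the patch:

    enum kvm_gfn_range_filter {
            KVM_FILTER_SHARED       = BIT(0),
            KVM_FILTER_PRIVATE      = BIT(1),
    };

    struct kvm_gfn_range {
            struct kvm_memory_slot *slot;
            gfn_t start;
            gfn_t end;
            union kvm_mmu_notifier_arg arg;
            enum kvm_gfn_range_filter attr_filter;  /* which alias(es) to act on; 0 is invalid */
            bool may_block;
    };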
2024-12-23KVM: x86/mmu: Zap invalid roots with mmu_lock holding for write at uninitRick Edgecombe
Prepare for a future TDX patch which asserts that atomic zapping (i.e. zapping with mmu_lock taken for read) doesn't operate on mirror roots. When tearing down a VM, all roots have to be zapped (including mirror roots once they're in place), so do that with the mmu_lock taken for write. kvm_mmu_uninit_tdp_mmu() is invoked either before or after executing any atomic operations on SPTEs by vCPU threads. Therefore, it will not impact the performance of vCPU threads if kvm_tdp_mmu_zap_invalidated_roots() acquires mmu_lock for write to zap invalid roots. Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-2-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
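A hypothetical sketch of the teardown-time behavior described above (the real code lives in kvm_mmu_uninit_tdp_mmu(); the argument list of the zap helper is illustrative):

    write_lock(&kvm->mmu_lock);
    /*
     * No vCPU can be running while the VM is torn down, so holding
     * mmu_lock for write is essentially free and guarantees this zap
     * never races with atomic (read-locked) SPTE updates, which a
     * future assertion for mirror roots relies on.
     */
    kvm_tdp_mmu_zap_invalidated_roots(kvm, /* shared = */ false);
    write_unlock(&kvm->mmu_lock);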
2024-12-22KVM: x86: Refactor __kvm_emulate_hypercall() into a macroPaolo Bonzini
Rework __kvm_emulate_hypercall() into a macro so that completion of hypercalls that don't exit to userspace uses direct function calls to the completion helper, i.e. doesn't trigger a retpoline when RETPOLINE=y. Opportunistically take the names of the input registers, as opposed to taking the input values, to preemptively dedup more of the calling code (TDX needs to use different registers). Use the direct GPR accessors to read values to avoid the pointless marking of the registers as available (KVM requires GPRs to always be available). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20241128004344.4072099-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
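The register-name trick relies on token pasting into KVM's direct GPR accessors; a simplified, hypothetical sketch (argument list abridged, inner helper name illustrative):

    /*
     * Callers pass register *names* (rax, rbx, ...), which are pasted
     * into the direct accessors, e.g. kvm_rax_read().  The completion
     * helper is invoked by name, so no indirect call (and no retpoline)
     * is emitted when the hypercall doesn't exit to userspace.
     */
    #define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, complete_hypercall)   \
    ({                                                                       \
            int __ret = ____kvm_emulate_hypercall(_vcpu,                     \
                                                  kvm_##nr##_read(_vcpu),    \
                                                  kvm_##a0##_read(_vcpu),    \
                                                  kvm_##a1##_read(_vcpu));   \
            if (__ret > 0)                                                   \
                    __ret = complete_hypercall(_vcpu);                       \
            __ret;                                                           \
    })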
2024-12-22KVM: x86: Always complete hypercall via function callbackSean Christopherson
Finish "emulation" of KVM hypercalls by function callback, even when the hypercall is handled entirely within KVM, i.e. doesn't require an exit to userspace, and refactor __kvm_emulate_hypercall()'s return value to *only* communicate whether or not KVM should exit to userspace or resume the guest. (Ab)Use vcpu->run->hypercall.ret to propagate the return value to the callback, purely to avoid having to add a trampoline for every completion callback. Using the function return value for KVM's control flow eliminates the multiplexed return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall) means "exit to userspace". Note, the unnecessary extra indirect call and thus potential retpoline will be eliminated in the near future by converting the intermediate layer to a macro. Suggested-by: Binbin Wu <binbin.wu@linux.intel.com> Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20241128004344.4072099-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
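A hedged sketch of the resulting flow (the exit condition and callback name stand in for the real code; the actual callback invocation sits in the outer wrapper):

    /* Inside the hypercall emulation path, after handling the hypercall: */
    vcpu->run->hypercall.ret = ret;            /* value destined for the guest */
    if (needs_userspace_exit) {
            vcpu->arch.complete_userspace_io = complete_hypercall;
            return 0;                          /* 0 == exit to userspace */
    }
    return complete_hypercall(vcpu);           /* > 0 == resume the guest */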
2024-12-22KVM: x86: Bump hypercall stat prior to fully completing hypercallSean Christopherson
Increment the "hypercalls" stat for KVM hypercalls as soon as KVM knows it will skip the guest instruction, i.e. once KVM is committed to emulating the hypercall. Waiting until completion adds no known value, and creates a discrepancy where the stat will be bumped if KVM exits to userspace as a result of trying to skip the instruction, but not if the hypercall itself exits. Handling the stat in common code will also avoid the need for another helper to dedup code when TDX comes along (TDX needs a separate completion path due to GPR usage differences). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20241128004344.4072099-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22KVM: x86: Move "emulate hypercall" function declarations to x86.hSean Christopherson
Move the declarations for the hypercall emulation APIs to x86.h. While the helpers are exported, they are intended to be consumed only by KVM vendor modules, i.e. don't need to be exposed to the kernel at-large. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20241128004344.4072099-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22KVM: x86: Add a helper to check for user interception of KVM hypercallsBinbin Wu
Add and use user_exit_on_hypercall() to check if userspace wants to handle a KVM hypercall instead of open-coding the logic everywhere. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> [sean: squash into one patch, keep explicit KVM_HC_MAP_GPA_RANGE check] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Message-ID: <20241128004344.4072099-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
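The helper amounts to checking the hypercall's bit in the VM's userspace-exit mask, roughly (sketch):

    static inline bool user_exit_on_hypercall(struct kvm *kvm, unsigned long hc_nr)
    {
            return kvm->arch.hypercall_exit_enabled & BIT(hc_nr);
    }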
2024-12-22KVM: x86: clear vcpu->run->hypercall.ret before exiting for KVM_EXIT_HYPERCALLPaolo Bonzini
QEMU up to 9.2.0 assumes that vcpu->run->hypercall.ret is 0 on exit and never modifies it when processing KVM_EXIT_HYPERCALL. Make this explicit in the code, to avoid breakage when KVM starts modifying that field. This in principle is not a good idea... It would have been much better if KVM had set the field to -KVM_ENOSYS from the beginning, so that a dumb userspace that does nothing on KVM_EXIT_HYPERCALL would tell the guest it does not support KVM_HC_MAP_GPA_RANGE. However, breaking userspace is a Very Bad Thing, as everybody should know. Reported-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
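Concretely, the KVM_HC_MAP_GPA_RANGE exit path now zeroes the field before handing KVM_EXIT_HYPERCALL to userspace, along these lines (sketch):

    vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
    vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
    /*
     * Keep the historical "ret is 0 on exit" behavior explicit so that
     * userspace that never touches the field (e.g. QEMU <= 9.2.0) still
     * sees 0 once KVM starts writing to hypercall.ret elsewhere.
     */
    vcpu->run->hypercall.ret = 0;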
2024-12-22Merge tag 'kvm-x86-fixes-6.13-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 fixes for 6.13:

 - Disable AVIC on SNP-enabled systems that don't allow writes to the virtual APIC page, as such hosts will hit unexpected RMP #PFs in the host when running VMs of any flavor.

 - Fix a WARN in the hypercall completion path due to KVM trying to determine if a guest with protected register state is in 64-bit mode (KVM's ABI is to assume such guests only make hypercalls in 64-bit mode).

 - Allow the guest to write to supported bits in MSR_AMD64_DE_CFG to fix a regression with Windows guests, and because KVM's read-only behavior appears to be entirely made up.

 - Treat TDP MMU faults as spurious if the faulting access is allowed given the existing SPTE. This fixes a benign WARN (other than the WARN itself) due to unexpectedly replacing a writable SPTE with a read-only SPTE.