summaryrefslogtreecommitdiff
path: root/Documentation/virt
AgeCommit message (Collapse)Author
2025-04-04Documentation: kvm: remove KVM_CAP_MIPS_TEPaolo Bonzini
Trap and emulate virtualization is not available anymore for MIPS. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04Documentation: kvm: organize capabilities in the right sectionPaolo Bonzini
Categorize the capabilities correctly. Section 6 is for enabled vCPU capabilities; section 7 is for enabled VM capabilities; section 8 is for informational ones. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04Documentation: kvm: fix some definition listsPaolo Bonzini
Ensure that they have a ":" in front of the defined item. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04Documentation: kvm: drop "Capability" heading from capabilitiesPaolo Bonzini
It is redundant, and sometimes wrong. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCEPaolo Bonzini
The capability is incorrectly called KVM_CAP_PPC_MULTITCE in the documentation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINEPaolo Bonzini
TSC_DEADLINE is now advertised unconditionally by KVM_GET_SUPPORTED_CPUID, since commit 9be4ec35d668 ("KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID", 2024-12-18). Adjust the documentation to reflect the new behavior. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250331150550.510320-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-20Merge tag 'kvmarm-6.15' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.15 - Nested virtualization support for VGICv3, giving the nested hypervisor control of the VGIC hardware when running an L2 VM - Removal of 'late' nested virtualization feature register masking, making the supported feature set directly visible to userspace - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers - Paravirtual interface for discovering the set of CPU implementations where a VM may run, addressing a longstanding issue of guest CPU errata awareness in big-little systems and cross-implementation VM migration - Userspace control of the registers responsible for identifying a particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1), allowing VMs to be migrated cross-implementation - pKVM updates, including support for tracking stage-2 page table allocations in the protected hypervisor in the 'SecPageTable' stat - Fixes to vPMU, ensuring that userspace updates to the vPMU after KVM_RUN are reflected into the backing perf events
2025-03-19Merge branch 'kvm-arm64/writable-midr' into kvmarm/nextOliver Upton
* kvm-arm64/writable-midr: : Writable implementation ID registers, courtesy of Sebastian Ott : : Introduce a new capability that allows userspace to set the : ID registers that identify a CPU implementation: MIDR_EL1, REVIDR_EL1, : and AIDR_EL1. Also plug a hole in KVM's trap configuration where : SMIDR_EL1 was readable at EL1, despite the fact that KVM does not : support SME. KVM: arm64: Fix documentation for KVM_CAP_ARM_WRITABLE_IMP_ID_REGS KVM: arm64: Copy MIDR_EL1 into hyp VM when it is writable KVM: arm64: Copy guest CTR_EL0 into hyp VM KVM: selftests: arm64: Test writes to MIDR,REVIDR,AIDR KVM: arm64: Allow userspace to change the implementation ID registers KVM: arm64: Load VPIDR_EL2 with the VM's MIDR_EL1 value KVM: arm64: Maintain per-VM copy of implementation ID regs KVM: arm64: Set HCR_EL2.TID1 unconditionally Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19Merge branch 'kvm-arm64/pv-cpuid' into kvmarm/nextOliver Upton
* kvm-arm64/pv-cpuid: : Paravirtualized implementation ID, courtesy of Shameer Kolothum : : Big-little has historically been a pain in the ass to virtualize. The : implementation ID (MIDR, REVIDR, AIDR) of a vCPU can change at the whim : of vCPU scheduling. This can be particularly annoying when the guest : needs to know the underlying implementation to mitigate errata. : : "Hyperscalers" face a similar scheduling problem, where VMs may freely : migrate between hosts in a pool of heterogenous hardware. And yes, our : server-class friends are equally riddled with errata too. : : In absence of an architected solution to this wart on the ecosystem, : introduce support for paravirtualizing the implementation exposed : to a VM, allowing the VMM to describe the pool of implementations that a : VM may be exposed to due to scheduling/migration. : : Userspace is expected to intercept and handle these hypercalls using the : SMCCC filter UAPI, should it choose to do so. smccc: kvm_guest: Fix kernel builds for 32 bit arm KVM: selftests: Add test for KVM_REG_ARM_VENDOR_HYP_BMAP_2 smccc/kvm_guest: Enable errata based on implementation CPUs arm64: Make  _midr_in_range_list() an exported function KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2 KVM: arm64: Specify hypercall ABI for retrieving target implementations arm64: Modify _midr_range() functions to read MIDR/REVIDR internally Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19Merge branch 'kvm-arm64/nv-vgic' into kvmarm/nextOliver Upton
* kvm-arm64/nv-vgic: : NV VGICv3 support, courtesy of Marc Zyngier : : Support for emulating the GIC hypervisor controls and managing shadow : VGICv3 state for the L1 hypervisor. As part of it, bring in support for : taking IRQs to the L1 and UAPI to manage the VGIC maintenance interrupt. KVM: arm64: nv: Fail KVM init if asking for NV without GICv3 KVM: arm64: nv: Allow userland to set VGIC maintenance IRQ KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup KVM: arm64: nv: Propagate used_lrs between L1 and L0 contexts KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2 KVM: arm64: nv: Respect virtual HCR_EL2.TWx setting KVM: arm64: nv: Add Maintenance Interrupt emulation KVM: arm64: nv: Handle L2->L1 transition on interrupt injection KVM: arm64: nv: Nested GICv3 emulation KVM: arm64: nv: Sanitise ICH_HCR_EL2 accesses KVM: arm64: nv: Plumb handling of GICv3 EL2 accesses KVM: arm64: nv: Add ICH_*_EL2 registers to vpcu_sysreg KVM: arm64: nv: Load timer before the GIC arm64: sysreg: Add layout for ICH_MISR_EL2 arm64: sysreg: Add layout for ICH_VTR_EL2 arm64: sysreg: Add layout for ICH_HCR_EL2 Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM Xen changes for 6.15 - Don't write to the Xen hypercall page on MSR writes that are initiated by the host (userspace or KVM) to fix a class of bugs where KVM can write to guest memory at unexpected times, e.g. during vCPU creation if userspace has set the Xen hypercall MSR index to collide with an MSR that KVM emulates. - Restrict the Xen hypercall MSR indx to the unofficial synthetic range to reduce the set of possible collisions with MSRs that are emulated by KVM (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside in the synthetic range). - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config. - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID entries when updating PV clocks, as there is no guarantee PV clocks will be updated between TSC frequency changes and CPUID emulation, and guest reads of Xen TSC should be rare, i.e. are not a hot path.
2025-03-05KVM: arm64: Fix documentation for KVM_CAP_ARM_WRITABLE_IMP_ID_REGSOliver Upton
The capability actually fails with EINVAL if vCPUs have already been created. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250305230825.484091-4-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03KVM: arm64: nv: Allow userland to set VGIC maintenance IRQAndre Przywara
The VGIC maintenance IRQ signals various conditions about the LRs, when the GIC's virtualization extension is used. So far we didn't need it, but nested virtualization needs to know about this interrupt, so add a userland interface to setup the IRQ number. The architecture mandates that it must be a PPI, on top of that this code only exports a per-device option, so the PPI is the same on all VCPUs. Signed-off-by: Andre Przywara <andre.przywara@arm.com> [added some bits of documentation] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250225172930.1850838-16-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2Shameer Kolothum
The vendor_hyp_bmap bitmap holds the information about the Vendor Hyp services available to the user space and can be get/set using {G, S}ET_ONE_REG interfaces. This is done using the pseudo-firmware bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP. At present, this bitmap is a 64 bit one and since the function numbers for newly added DISCOVER_IPML_* hypercalls are 64-65, introduce another pseudo-firmware bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP_2. Reviewed-by: Sebastian Ott <sebott@redhat.com> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20250221140229.12588-4-shameerali.kolothum.thodi@huawei.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26KVM: arm64: Specify hypercall ABI for retrieving target implementationsShameer Kolothum
If the Guest requires migration to multiple targets, these hypercalls will provide a way to retrieve the target CPU implementations from the user space VMM. Subsequent patch will use this to enable the associated errata. Suggested-by: Oliver Upton <oliver.upton@linux.dev> Suggested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Sebastian Ott <sebott@redhat.com> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20250221140229.12588-3-shameerali.kolothum.thodi@huawei.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26KVM: arm64: Document ordering requirements for irqbypassOliver Upton
One of the not-so-obvious requirements for restoring a VM is ensuring that the vITS has been restored _before_ creating any irqbypass mappings. This is because KVM needs to get the guest translation for MSIs to correctly assemble the vLPI mapping getting installed in the physical ITS. Document the restore ordering requirements necessary for GICv4 vLPI injection to work. Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250226183124.82094-5-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26KVM: arm64: Allow userspace to change the implementation ID registersSebastian Ott
KVM's treatment of the ID registers that describe the implementation (MIDR, REVIDR, and AIDR) is interesting, to say the least. On the userspace-facing end of it, KVM presents the values of the boot CPU on all vCPUs and treats them as invariant. On the guest side of things KVM presents the hardware values of the local CPU, which can change during CPU migration in a big-little system. While one may call this fragile, there is at least some degree of predictability around it. For example, if a VMM wanted to present big-little to a guest, it could affine vCPUs accordingly to the correct clusters. All of this makes a giant mess out of adding support for making these implementation ID registers writable. Avoid breaking the rather subtle ABI around the old way of doing things by requiring opt-in from userspace to make the registers writable. When the cap is enabled, allow userspace to set MIDR, REVIDR, and AIDR to any non-reserved value and present those values consistently across all vCPUs. Signed-off-by: Sebastian Ott <sebott@redhat.com> [oliver: changelog, capability] Link: https://lore.kernel.org/r/20250225005401.679536-5-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24KVM: x86/xen: Restrict hypercall MSR to unofficial synthetic rangeSean Christopherson
Reject userspace attempts to set the Xen hypercall page MSR to an index outside of the "standard" virtualization range [0x40000000, 0x4fffffff], as KVM is not equipped to handle collisions with real MSRs, e.g. KVM doesn't update MSR interception, conflicts with VMCS/VMCB fields, special case writes in KVM, etc. While the MSR index isn't strictly ABI, i.e. can theoretically float to any value, in practice no known VMM sets the MSR index to anything other than 0x40000000 or 0x40000200. Cc: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250215011437.1203084-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Don't force atomic update if only the Accessed bit is volatileJames Houghton
Don't force SPTE modifications to be done atomically if the only volatile bit in the SPTE is the Accessed bit. KVM and the primary MMU tolerate stale aging state, and the probability of an Accessed bit A/D assist being clobbered *and* affecting again is likely far lower than the probability of consuming stale information due to not flushing TLBs when aging. Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better capture the nature of the helper. Opportunstically do s/write/update on the TDP MMU wrapper, as it's not simply the "write" that needs to be done atomically, it's the entire update, i.e. the entire read-modify-write operation needs to be done atomically so that KVM has an accurate view of the old SPTE. Leave kvm_tdp_mmu_write_spte_atomic() as is. While the name is imperfect, it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with kvm_tdp_mmu_read_spte(). And renaming all of those isn't obviously a net positive, and would require significant churn. Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-31KVM: s390: fake memslot for ucontrol VMsClaudio Imbrenda
Create a fake memslot for ucontrol VMs. The fake memslot identity-maps userspace. Now memslots will always be present, and ucontrol is not a special case anymore. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20250123144627.312456-4-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250123144627.312456-4-imbrenda@linux.ibm.com>
2025-01-28Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull KVM/arm64 updates from Will Deacon: "New features: - Support for non-protected guest in protected mode, achieving near feature parity with the non-protected mode - Support for the EL2 timers as part of the ongoing NV support - Allow control of hardware tracing for nVHE/hVHE Improvements, fixes and cleanups: - Massive cleanup of the debug infrastructure, making it a bit less awkward and definitely easier to maintain. This should pave the way for further optimisations - Complete rewrite of pKVM's fixed-feature infrastructure, aligning it with the rest of KVM and making the code easier to follow - Large simplification of pKVM's memory protection infrastructure - Better handling of RES0/RES1 fields for memory-backed system registers - Add a workaround for Qualcomm's Snapdragon X CPUs, which suffer from a pretty nasty timer bug - Small collection of cleanups and low-impact fixes" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (87 commits) arm64/sysreg: Get rid of TRFCR_ELx SysregFields KVM: arm64: nv: Fix doc header layout for timers KVM: arm64: nv: Apply RESx settings to sysreg reset values KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors KVM: arm64: Fix selftests after sysreg field name update coresight: Pass guest TRFCR value to KVM KVM: arm64: Support trace filtering for guests KVM: arm64: coresight: Give TRBE enabled state to KVM coresight: trbe: Remove redundant disable call arm64/sysreg/tools: Move TRFCR definitions to sysreg tools: arm64: Update sysreg.h header files KVM: arm64: Drop pkvm_mem_transition for host/hyp donations KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing KVM: arm64: Drop pkvm_mem_transition for FF-A KVM: arm64: Explicitly handle BRBE traps as UNDEFINED KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe() arm64: kvm: Introduce nvhe stack size constants KVM: arm64: Fix nVHE stacktrace VA bits mask KVM: arm64: Fix FEAT_MTE in pKVM Documentation: Update the behaviour of "kvm-arm.mode" ...
2025-01-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "Loongarch: - Clear LLBCTL if secondary mmu mapping changes - Add hypercall service support for usermode VMM x86: - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled - Ensure that all SEV code is compiled out when disabled in Kconfig, even if building with less brilliant compilers - Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes - Use str_enabled_disabled() to replace open coded strings - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache prior to every VM-Enter - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU - Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not - As part of enabling TDX virtual machines, support support separation of private/shared EPT into separate roots. When TDX will be enabled, operations on private pages will need to go through the privileged TDX Module via SEAMCALLs; as a result, they are limited and relatively slow compared to reading a PTE. The patches included in 6.14 allow KVM to keep a mirror of the private EPT in host memory, and define entries in kvm_x86_ops to operate on external page tables such as the TDX private EPT - The recently introduced conversion of the NX-page reclamation kthread to vhost_task moved the task under the main process. The task is created as soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that didn't expect to see any child task of the VM process until it started creating its own userspace threads. In particular crosvm refuses to fork() if procfs shows any child task, so unbreak it by creating the task lazily. This is arguably a userspace bug, as there can be other kinds of legitimate worker tasks and they wouldn't impede fork(); but it's not like userspace has a way to distinguish kernel worker tasks right now. Should they show as "Kthread: 1" in proc/.../status? x86 - Intel: - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit while L2 is active, while ultimately results in a hardware-accelerated L1 EOI effectively being lost - Honor event priority when emulating Posted Interrupt delivery during nested VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the interrupt - Rework KVM's processing of the Page-Modification Logging buffer to reap entries in the same order they were created, i.e. to mark gfns dirty in the same order that hardware marked the page/PTE dirty - Misc cleanups Generic: - Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when setting memory regions and add a dedicated API for setting KVM-internal memory regions. The API can then explicitly disallow all flags for KVM-internal memory regions - Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug where KVM would return a pointer to a vCPU prior to it being fully online, and give kvm_for_each_vcpu() similar treatment to fix a similar flaw - Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a bug where userspace could coerce KVM into handling the ioctl on a vCPU that isn't yet onlined - Gracefully handle xarray insertion failures; even though such failures are impossible in practice after xa_reserve(), reserving an entry is always followed by xa_store() which does not know (or differentiate) whether there was an xa_reserve() before or not RISC-V: - Zabha, Svvptc, and Ziccrse extension support for guests. None of them require anything in KVM except for detecting them and marking them as supported; Zabha adds byte and halfword atomic operations, while the others are markers for specific operation of the TLB and of LL/SC instructions respectively - Virtualize SBI system suspend extension for Guest/VM - Support firmware counters which can be used by the guests to collect statistics about traps that occur in the host Selftests: - Rework vcpu_get_reg() to return a value instead of using an out-param, and update all affected arch code accordingly - Convert the max_guest_memory_test into a more generic mmu_stress_test. The basic gist of the "conversion" is to have the test do mprotect() on guest memory while vCPUs are accessing said memory, e.g. to verify KVM and mmu_notifiers are working as intended - Play nice with treewrite builds of unsupported architectures, e.g. arm (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the target architecture is actually one KVM selftests supports - Use the kernel's $(ARCH) definition instead of the target triple for arch specific directories, e.g. arm64 instead of aarch64, mainly so as not to be different from the rest of the kernel - Ensure that format strings for logging statements are checked by the compiler even when the logging statement itself is disabled - Attempt to whack the last LLC references/misses mole in the Intel PMU counters test by adding a data load and doing CLFLUSH{OPT} on the data instead of the code being executed. It seems that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters - Fix a flaw in the Intel PMU counters test where it asserts that events are counting correctly without actually knowing what the events count given the underlying hardware; this can happen if Intel reuses a formerly microarchitecture-specific event encoding as an architectural event, as was the case for Top-Down Slots" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits) kvm: defer huge page recovery vhost task to later KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() KVM: Disallow all flags for KVM-internal memslots KVM: x86: Drop double-underscores from __kvm_set_memory_region() KVM: Add a dedicated API for setting KVM-internal memslots KVM: Assert slots_lock is held when setting memory regions KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API) LoongArch: KVM: Add hypercall service support for usermode VMM LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup() KVM: VMX: read the PML log in the same order as it was written KVM: VMX: refactor PML terminology KVM: VMX: Fix comment of handle_vmx_instruction() KVM: VMX: Reinstate __exit attribute for vmx_exit() KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup() KVM: x86: Avoid double RDPKRU when loading host/guest PKRU KVM: x86: Use LVT_TIMER instead of an open coded literal RISC-V: KVM: Add new exit statstics for redirected traps RISC-V: KVM: Update firmware counters for various events RISC-V: KVM: Redirect instruction access fault trap to guest ...
2025-01-25Merge tag 'hyperv-next-signed-20250123' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Introduce a new set of Hyper-V headers in include/hyperv and replace the old hyperv-tlfs.h with the new headers (Nuno Das Neves) - Fixes for the Hyper-V VTL mode (Roman Kisel) - Fixes for cpu mask usage in Hyper-V code (Michael Kelley) - Document the guest VM hibernation behaviour (Michael Kelley) - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain) * tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Documentation: hyperv: Add overview of guest VM hibernation hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id() hyperv: Do not overlap the hvcall IO areas in get_vtl() hyperv: Enable the hypercall output page for the VTL mode hv_balloon: Fallback to generic_online_page() for non-HV hot added mem Drivers: hv: vmbus: Log on missing offers if any Drivers: hv: vmbus: Wait for boot-time offers during boot and resume uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup iommu/hyper-v: Don't assume cpu_possible_mask is dense Drivers: hv: Don't assume cpu_possible_mask is dense x86/hyperv: Don't assume cpu_possible_mask is dense hyperv: Remove the now unused hyperv-tlfs.h files hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h hyperv: Add new Hyper-V headers in include/hyperv hyperv: Clean up unnecessary #includes hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-21Merge tag 'docs-6.14' of git://git.lwn.net/linuxLinus Torvalds
Pull Documentation updates from Jonathan Corbet: - Quite a bit of Chinese and Spanish translation work - Clarifying that Git commit IDs >12chars are OK - A new nvme-multipath document - A reorganization of the admin-guide top-level page to make it readable - Clarification of the role of Acked-by and maintainer discretion on their acceptance - Some reorganization of debugging-oriented docs ... and typo fixes, documentation updates, etc as usual * tag 'docs-6.14' of git://git.lwn.net/linux: (50 commits) Documentation: Fix x86_64 UEFI outdated references to elilo Documentation/sysctl: Add timer_migration to kernel.rst docs/mm: Physical memory: Remove zone_t docs: submitting-patches: clarify that signers may use their discretion on tags docs: submitting-patches: clarify difference between Acked-by and Reviewed-by docs: submitting-patches: clarify Acked-by and introduce "# Suffix" Documentation: bug-hunting.rst: remove odd contact information docs/zh_CN: Add sak index Chinese translation doc: module: DEFAULT_SYMBOL_NAMESPACE must be defined before #includes doc: module: Fix documented type of namespace Documentation/kernel-parameters: Fix a reference to vga-softcursor.rst docs/zh_CN: Add landlock index Chinese translation Documentation: Fix typo localmodonfig -> localmodconfig overlayfs.rst: Fix and improve grammar docs/zh_CN: Add siphash index Chinese translation docs/zh_CN: Add security IMA-templates Chinese translation docs/zh_CN: Add security digsig Chinese translation Align git commit ID abbreviation guidelines and checks docs: process: submitting-patches: split canonical patch format section docs/zh_CN: Add security lsm Chinese translation ...
2025-01-20Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.14: - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature. - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM. - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively. - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-16KVM: arm64: nv: Fix doc header layout for timersMarc Zyngier
Stephen reports that 'make htmldocs' spits out a warning ("Documentation/virt/kvm/devices/vcpu.rst:147: WARNING: Definition list ends without a blank line; unexpected unindent."). Fix it by keeping all the timer attributes on a single line. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-13Documentation: hyperv: Add overview of guest VM hibernationMichael Kelley
Add documentation on how hibernation works in a guest VM on Hyper-V. Describe how VMBus devices and the VMBus itself are hibernated and resumed, along with various limitations. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20250113145645.1320942-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250113145645.1320942-1-mhklinux@outlook.com>
2025-01-07KVM: s390: Reject KVM_SET_GSI_ROUTING on ucontrol VMsChristoph Schlameuss
Prevent null pointer dereference when processing KVM_IRQ_ROUTING_S390_ADAPTER routing entries. The ioctl cannot be processed for ucontrol VMs. Fixes: f65470661f36 ("KVM: s390/interrupt: do not pin adapter interrupt pages") Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Tested-by: Hariharan Mari <hari55@linux.ibm.com> Reviewed-by: Hariharan Mari <hari55@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/r/20241216092140.329196-4-schlameuss@linux.ibm.com Message-ID: <20241216092140.329196-4-schlameuss@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-07KVM: s390: Reject setting flic pfault attributes on ucontrol VMsChristoph Schlameuss
Prevent null pointer dereference when processing the KVM_DEV_FLIC_APF_ENABLE and KVM_DEV_FLIC_APF_DISABLE_WAIT ioctls in the interrupt controller. Fixes: 3c038e6be0e2 ("KVM: async_pf: Async page fault support on s390") Reported-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Hariharan Mari <hari55@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/r/20241216092140.329196-2-schlameuss@linux.ibm.com Message-ID: <20241216092140.329196-2-schlameuss@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-02KVM: arm64: nv: Document EL2 timer APIMarc Zyngier
Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241217142321.763801-13-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-18KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUIDSean Christopherson
Unconditionally advertise TSC_DEADLINE_TIMER via KVM_GET_SUPPORTED_CPUID, as KVM always emulates deadline mode, *if* the VM has an in-kernel local APIC. The odds of a VMM emulating the local APIC in userspace, not emulating the TSC deadline timer, _and_ reflecting KVM_GET_SUPPORTED_CPUID back into KVM_SET_CPUID2, i.e. the risk of over-advertising and breaking any setups, is extremely low. KVM has _unconditionally_ advertised X2APIC via CPUID since commit 0d1de2d901f4 ("KVM: Always report x2apic as supported feature"), and it is completely impossible for userspace to emulate X2APIC as KVM doesn't support forwarding the MSR accesses to userspace. I.e. KVM has relied on userspace VMMs to not misreport local APIC capabilities for nearly 13 years. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241128013424.4096668-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creationSean Christopherson
Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless, e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after vCPU creation. vCPUs may also end up with an inconsistent configuration if exits are disabled between creation of multiple vCPUs. Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-11Documentation: kvm: fix typo in api.rstGianfranco Trad
Fix minor typo in api.rst where the word physical was misspelled as physcial. Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20241115011831.300705-5-gianf.trad@gmail.com
2024-11-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "The biggest change here is eliminating the awful idea that KVM had of essentially guessing which pfns are refcounted pages. The reason to do so was that KVM needs to map both non-refcounted pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain refcounted pages. However, the result was security issues in the past, and more recently the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by struct page but is not refcounted. In particular this broke virtio-gpu blob resources (which directly map host graphics buffers into the guest as "vram" for the virtio-gpu device) with the amdgpu driver, because amdgpu allocates non-compound higher order pages and the tail pages could not be mapped into KVM. This requires adjusting all uses of struct page in the per-architecture code, to always work on the pfn whenever possible. The large series that did this, from David Stevens and Sean Christopherson, also cleaned up substantially the set of functions that provided arch code with the pfn for a host virtual addresses. The previous maze of twisty little passages, all different, is replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages) saving almost 200 lines of code. ARM: - Support for stage-1 permission indirection (FEAT_S1PIE) and permission overlays (FEAT_S1POE), including nested virt + the emulated page table walker - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call was introduced in PSCIv1.3 as a mechanism to request hibernation, similar to the S4 state in ACPI - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As part of it, introduce trivial initialization of the host's MPAM context so KVM can use the corresponding traps - PMU support under nested virtualization, honoring the guest hypervisor's trap configuration and event filtering when running a nested guest - Fixes to vgic ITS serialization where stale device/interrupt table entries are not zeroed when the mapping is invalidated by the VM - Avoid emulated MMIO completion if userspace has requested synchronous external abort injection - Various fixes and cleanups affecting pKVM, vCPU initialization, and selftests LoongArch: - Add iocsr and mmio bus simulation in kernel. - Add in-kernel interrupt controller emulation. - Add support for virtualization extensions to the eiointc irqchip. PPC: - Drop lingering and utterly obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect documentation references to non-existing ioctls RISC-V: - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side s390: - New selftests: more ucontrol selftests and CPU model sanity checks - Support for the gen17 CPU model - List registers supported by KVM_GET/SET_ONE_REG in the documentation x86: - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve documentation, harden against unexpected changes. Even if the hardware A/D tracking is disabled, it is possible to use the hardware-defined A/D bits to track if a PFN is Accessed and/or Dirty, and that removes a lot of special cases. - Elide TLB flushes when aging secondary PTEs, as has been done in x86's primary MMU for over 10 years. - Recover huge pages in-place in the TDP MMU when dirty page logging is toggled off, instead of zapping them and waiting until the page is re-accessed to create a huge mapping. This reduces vCPU jitter. - Batch TLB flushes when dirty page logging is toggled off. This reduces the time it takes to disable dirty logging by ~3x. - Remove the shrinker that was (poorly) attempting to reclaim shadow page tables in low-memory situations. - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Advertise CPUIDs for new instructions in Clearwater Forest - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which in turn can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe; harden the cache accessors to try to prevent similar issues from occuring in the future. The issue that triggered this change was already fixed in 6.12, but was still kinda latent. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups - Switch hugepage recovery thread to use vhost_task. These kthreads can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory); therefore KVM tried to place the thread in the VM's cgroups and charge the CPU time consumed by that work to the VM's container. However the kthreads did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM instances inside could not complete freezing. Fix this by replacing the kthread with a PF_USER_WORKER thread, via the vhost_task abstraction. Another 100+ lines removed, with generally better behavior too like having these threads properly parented in the process tree. - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't really work; there was really nothing to work around anyway: the broken patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the erratum. - Fix 6.12 regression where CONFIG_KVM will be built as a module even if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'. x86 selftests: - x86 selftests can now use AVX. Documentation: - Use rST internal links - Reorganize the introduction to the API document Generic: - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead of RCU, so that running a vCPU on a different task doesn't encounter long due to having to wait for all CPUs become quiescent. In general both reads and writes are rare, but userspace that supports confidential computing is introducing the use of "helper" vCPUs that may jump from one host processor to another. Those will be very happy to trigger a synchronize_rcu(), and the effect on performance is quite the disaster" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits) KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD KVM: x86: add back X86_LOCAL_APIC dependency Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()" KVM: x86: switch hugepage recovery thread to vhost_task KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest Documentation: KVM: fix malformed table irqchip/loongson-eiointc: Add virt extension support LoongArch: KVM: Add irqfd support LoongArch: KVM: Add PCHPIC user mode read and write functions LoongArch: KVM: Add PCHPIC read and write functions LoongArch: KVM: Add PCHPIC device support LoongArch: KVM: Add EIOINTC user mode read and write functions LoongArch: KVM: Add EIOINTC read and write functions LoongArch: KVM: Add EIOINTC device support LoongArch: KVM: Add IPI user mode read and write function LoongArch: KVM: Add IPI read and write function LoongArch: KVM: Add IPI device support LoongArch: KVM: Add iocsr and mmio bus simulation in kernel KVM: arm64: Pass on SVE mapping failures ...
2024-11-18Merge tag 's390-6.13-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Heiko Carstens: - Add firmware sysfs interface which allows user space to retrieve the dump area size of the machine - Add 'measurement_chars_full' CHPID sysfs attribute to make the complete associated Channel-Measurements Characteristics Block available - Add virtio-mem support - Move gmap aka KVM page fault handling from the main fault handler to KVM code. This is the first step to make s390 KVM page fault handling similar to other architectures. With this first step the main fault handler does not have any special handling anymore, and therefore convert it to support LOCK_MM_AND_FIND_VMA - With gcc 14 s390 support for flag output operand support for inline assemblies was added. This allows for several optimizations: - Provide a cmpxchg inline assembly which makes use of this, and provide all variants of arch_try_cmpxchg() so that the compiler can generate slightly better code - Convert a few cmpxchg() loops to try_cmpxchg() loops - Similar to x86 add a CC_OUT() helper macro (and other macros), and convert all inline assemblies to make use of them, so that depending on compiler version better code can be generated - List installed host-key hashes in sysfs if the machine supports the Query Ultravisor Keys UVC - Add 'Retrieve Secret' ioctl which allows user space in protected execution guests to retrieve previously stored secrets from the Ultravisor - Add pkey-uv module which supports the conversion of Ultravisor retrievable secrets to protected keys - Extend the existing paes cipher to exploit the full AES-XTS hardware acceleration introduced with message-security assist extension 10 - Convert hopefully all sysfs show functions to use sysfs_emit() so that the constant flow of such patches stop - For PCI devices make use of the newly added Topology ID attribute to enable whole card multi-function support despite the change to PCHID per port. Additionally improve the overall robustness and usability of the multifunction support - Various other small improvements, fixes, and cleanups * tag 's390-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (133 commits) s390/cio/ioasm: Convert to use flag output macros s390/cio/qdio: Convert to use flag output macros s390/sclp: Convert to use flag output macros s390/dasd: Convert to use flag output macros s390/boot/physmem: Convert to use flag output macros s390/pci: Convert to use flag output macros s390/kvm: Convert to use flag output macros s390/extmem: Convert to use flag output macros s390/string: Convert to use flag output macros s390/diag: Convert to use flag output macros s390/irq: Convert to use flag output macros s390/smp: Convert to use flag output macros s390/uv: Convert to use flag output macros s390/pai: Convert to use flag output macros s390/mm: Convert to use flag output macros s390/cpu_mf: Convert to use flag output macros s390/cpcmd: Convert to use flag output macros s390/topology: Convert to use flag output macros s390/time: Convert to use flag output macros s390/pageattr: Convert to use flag output macros ...
2024-11-14Merge tag 'kvmarm-6.13' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.13, part #1 - Support for stage-1 permission indirection (FEAT_S1PIE) and permission overlays (FEAT_S1POE), including nested virt + the emulated page table walker - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call was introduced in PSCIv1.3 as a mechanism to request hibernation, similar to the S4 state in ACPI - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As part of it, introduce trivial initialization of the host's MPAM context so KVM can use the corresponding traps - PMU support under nested virtualization, honoring the guest hypervisor's trap configuration and event filtering when running a nested guest - Fixes to vgic ITS serialization where stale device/interrupt table entries are not zeroed when the mapping is invalidated by the VM - Avoid emulated MMIO completion if userspace has requested synchronous external abort injection - Various fixes and cleanups affecting pKVM, vCPU initialization, and selftests
2024-11-13Documentation: KVM: fix malformed tablePaolo Bonzini
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 5f6a3badbb74 ("KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-13Merge branch 'kvm-docs-6.13' into HEADPaolo Bonzini
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect references to non-existing ioctls - List registers supported by KVM_GET/SET_ONE_REG on s390 - Use rST internal links - Reorganize the introduction to the API document
2024-11-13Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.13 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe, and harden the cache accessors to try to prevent similar issues from occuring in the future. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups
2024-11-11Merge branch kvm-arm64/psci-1.3 into kvmarm/nextOliver Upton
* kvm-arm64/psci-1.3: : PSCI v1.3 support, courtesy of David Woodhouse : : Bump KVM's PSCI implementation up to v1.3, with the added bonus of : implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls, : this gets relayed to userspace for further processing with a new : KVM_SYSTEM_EVENT_SHUTDOWN flag. : : As an added bonus, implement client-side support for hibernation with : the SYSTEM_OFF2 call. arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call KVM: selftests: Add test for PSCI SYSTEM_OFF2 KVM: arm64: Add support for PSCI v1.2 and v1.3 KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation firmware/psci: Add definitions for PSCI v1.3 specification Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-08Documentation: kvm: reorganize introductionPaolo Bonzini
Reorganize the text to mention file descriptors as early as possible. Also mention capabilities early as they are a central part of KVM's API. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08Documentation: kvm: replace section numbers with linksPaolo Bonzini
In order to simplify further introduction of hyperlinks, replace explicit section numbers with rST hyperlinks. The section numbers could actually be removed now, but I'm not going to do a huge change throughout the file for an RFC... Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08Documentation: kvm: fix a few mistakesPaolo Bonzini
The only occurrence "Capability: none" actually meant the same as "basic". Fix that and a few more aesthetic or content issues in the document. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: powerpc: remove remaining traces of KVM_CAP_PPC_RMAPaolo Bonzini
This was only needed for PPC970 support, which is long gone: the implementation was removed in 2014. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-07Documentation: s390-diag.rst: Document diag500(STORAGE LIMIT) subfunctionDavid Hildenbrand
Let's document our new diag500 subfunction that can be implemented by userspace. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-3-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-07Documentation: s390-diag.rst: Make diag500 a generic KVM hypercallDavid Hildenbrand
Let's make it a generic KVM hypercall, allowing other subfunctions to be more independent of virtio. While at it, document that unsupported/unimplemented subfunctions result in a SPECIFICATION exception. This is a preparation for documenting a new subfunction. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241025141453.1210600-2-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-01KVM: x86: Quirk initialization of feature MSRs to KVM's max configurationSean Christopherson
Add a quirk to control KVM's misguided initialization of select feature MSRs to KVM's max configuration, as enabling features by default violates KVM's approach of letting userspace own the vCPU model, and is actively problematic for MSRs that are conditionally supported, as the vCPU will end up with an MSR value that userspace can't restore. E.g. if the vCPU is configured with PDCM=0, userspace will save and attempt to restore a non-zero PERF_CAPABILITIES, thanks to KVM's meddling. Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Document an erratum in KVM_SET_VCPU_EVENTS on Intel CPUsSean Christopherson
Document a flaw in KVM's ABI which lets userspace attempt to inject a "bad" hardware exception event, and thus induce VM-Fail on Intel CPUs. Fixing the flaw is a fool's errand, as AMD doesn't sanity check the validity of the error code, Intel CPUs that support CET relax the check for Protected Mode, userspace can change the mode after queueing an exception, KVM ignores the error code when emulating Real Mode exceptions, and so on and so forth. The VM-Fail itself doesn't harm KVM or the kernel beyond triggering a ratelimited pr_warn(), so just document the oddity. Link: https://lore.kernel.org/r/20240802200420.330769-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-25KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIsSean Christopherson
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass "false", and remove a comment blurb about KVM running only the "GUP fast" part in atomic context. No functional change intended. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-13-seanjc@google.com>
2024-10-25KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEsSean Christopherson
Now that KVM doesn't clobber Accessed bits of shadow-present SPTEs, e.g. when prefetching, mark folios as accessed only when zapping leaf SPTEs, which is a rough heuristic for "only in response to an mmu_notifier invalidation". Page aging and LRUs are tolerant of false negatives, i.e. KVM doesn't need to be precise for correctness, and re-marking folios as accessed when zapping entire roots or when zapping collapsible SPTEs is expensive and adds very little value. E.g. when a VM is dying, all of its memory is being freed; marking folios accessed at that time provides no known value. Similarly, because KVM marks folios as accessed when creating SPTEs, marking all folios as accessed when userspace happens to delete a memslot doesn't add value. The folio was marked access when the old SPTE was created, and will be marked accessed yet again if a vCPU accesses the pfn again after reloading a new root. Zapping collapsible SPTEs is a similar story; marking folios accessed just because userspace disable dirty logging is a side effect of KVM behavior, not a deliberate goal. As an intermediate step, a.k.a. bisection point, towards *never* marking folios accessed when dropping SPTEs, mark folios accessed when the primary MMU might be invalidating mappings, as such zappings are not KVM initiated, i.e. might actually be related to page aging and LRU activity. Note, x86 is the only KVM architecture that "double dips"; every other arch marks pfns as accessed only when mapping into the guest, not when mapping into the guest _and_ when removing from the guest. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-10-seanjc@google.com>