summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2025-04-28KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMUYan Zhao
Warn if PFN changes on shadow-present SPTE in mmu_set_spte(). KVM should _never_ change the PFN of a shadow-present SPTE. In mmu_set_spte(), there is a WARN_ON_ONCE() on pfn changes on shadow-present SPTE in mmu_spte_update() to detect this condition. However, that WARN_ON_ONCE() is not hittable since mmu_set_spte() invokes drop_spte() earlier before mmu_spte_update(), which clears SPTE to a !shadow-present state. So, before invoking drop_spte(), add a WARN_ON_ONCE() in mmu_set_spte() to warn PFN change of a shadow-present SPTE. For the spurious prefetch fault, only return RET_PF_SPURIOUS directly when PFN is not changed. When PFN changes, fall through to follow the sequence of drop_spte(), warn of PFN change, make_spte(), flush tlb, rmap_add(). Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013310.5781-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: x86/tdp_mmu: WARN if PFN changes for spurious faultsYan Zhao
Add a WARN() to assert that KVM does _not_ change the PFN of a shadow-present SPTE during spurious fault handling. KVM should _never_ change the PFN of a shadow-present SPTE and TDP MMU already BUG()s on this. However, spurious faults just return early before the existing BUG() could be hit. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013238.5732-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: x86/tdp_mmu: Merge prefetch and access checks for spurious faultsYan Zhao
Combine prefetch and is_access_allowed() checks into a unified path to detect spurious faults, since both cases now share identical logic. No functional changes. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013210.5701-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: x86/mmu: Further check old SPTE is leaf for spurious prefetch faultYan Zhao
Instead of simply treating a prefetch fault as spurious when there's a shadow-present old SPTE, further check if the old SPTE is leaf to determine if a prefetch fault is spurious. It's not reasonable to treat a prefetch fault as spurious when there's a shadow-present non-leaf SPTE without a corresponding shadow-present leaf SPTE. e.g., in the following sequence, a prefetch fault should not be considered spurious: 1. add a memslot with size 4K 2. prefault GPA A in the memslot 3. delete the memslot (zap all disabled) 4. re-add the memslot with size 2M 5. prefault GPA A again. In step 5, the prefetch fault attempts to install a 2M huge entry. Since step 3 zaps the leaf SPTE for GPA A while keeping the non-leaf SPTE, the leaf entry will remain empty after step 5 if the fetch fault is regarded as spurious due to a shadow-present non-leaf SPTE. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250318013111.5648-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: VMX: Flush shadow VMCS on emergency rebootChao Gao
Ensure the shadow VMCS cache is evicted during an emergency reboot to prevent potential memory corruption if the cache is evicted after reboot. This issue was identified through code inspection, as __loaded_vmcs_clear() flushes both the normal VMCS and the shadow VMCS. Avoid checking the "launched" state during an emergency reboot, unlike the behavior in __loaded_vmcs_clear(). This is important because reboot NMIs can interfere with operations like copy_shadow_to_vmcs12(), where shadow VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur right after the VMCS load, the shadow VMCSes will be active but the "launched" state may not be set. Fixes: 16f5b9034b69 ("KVM: nVMX: Copy processor-specific shadow-vmcs to VMCS12") Cc: stable@vger.kernel.org Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250324140849.2099723-1-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: SVM: Treat DEBUGCTL[5:2] as reservedSean Christopherson
Stop ignoring DEBUGCTL[5:2] on AMD CPUs and instead treat them as reserved. KVM has never properly virtualized AMD's legacy PBi bits, but did allow the guest (and host userspace) to set the bits. To avoid breaking guests when running on CPUs with BusLockTrap, which redefined bit 2 to BLCKDB and made bits 5:3 reserved, a previous KVM change ignored bits 5:3, e.g. so that legacy guest software wouldn't inadvertently enable BusLockTrap or hit a VMRUN failure due to setting reserved. To allow for virtualizing BusLockTrap and whatever future features may use bits 5:3, treat bits 5:2 as reserved (and hope that doing so doesn't break any existing guests). Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28x86/bugs: Allow retbleed=stuff only on IntelDavid Kaplan
The retbleed=stuff mitigation is only applicable for Intel CPUs affected by retbleed. If this option is selected for another vendor, print a warning and fall back to the AUTO option. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-10-david.kaplan@amd.com
2025-04-28x86/bugs: Restructure spectre_v1 mitigationDavid Kaplan
Restructure spectre_v1 to use select/apply functions to create consistent vulnerability handling. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-9-david.kaplan@amd.com
2025-04-28x86/crc: drop "glue" from filenamesEric Biggers
The use of the term "glue" in filenames is a Crypto API-ism that rarely shows up elsewhere in lib/ or arch/*/lib/. I think adopting it there was a mistake. The library just uses standard functions, so the amount of code that could be considered "glue" is quite small. And while often the C functions just wrap the assembly functions, there are also cases like crc32c_arch() in arch/x86/lib/crc32-glue.c that blur the line by in-lining the actual implementation into the C function. That's not "glue code", but rather the actual code. Therefore, let's drop "glue" from the filenames and instead use e.g. crc32.c instead of crc32-glue.c. Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250424002038.179114-8-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-28lib/crc: make the CPU feature static keys __ro_after_initEric Biggers
All of the CRC library's CPU feature static_keys are initialized by initcalls and never change afterwards, so there's no need for them to be in the regular .data section. Put them in .data..ro_after_init instead. Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Link: https://lore.kernel.org/r/20250413154350.10819-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-28x86/bugs: Restructure GDS mitigationDavid Kaplan
Restructure GDS mitigation to use select/apply functions to create consistent vulnerability handling. Define new AUTO mitigation for GDS. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-8-david.kaplan@amd.com
2025-04-28x86/bugs: Restructure SRBDS mitigationDavid Kaplan
Restructure SRBDS to use select/apply functions to create consistent vulnerability handling. Define new AUTO mitigation for SRBDS. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-7-david.kaplan@amd.com
2025-04-28x86/bugs: Remove md_clear_*_mitigation()David Kaplan
The functionality in md_clear_update_mitigation() and md_clear_select_mitigation() is now integrated into the select/update functions for the MDS, TAA, MMIO, and RFDS vulnerabilities. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-6-david.kaplan@amd.com
2025-04-28x86/bugs: Restructure RFDS mitigationDavid Kaplan
Restructure RFDS mitigation to use select/update/apply functions to create consistent vulnerability handling. [ bp: Rename the oneline helper to what it checks. ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-5-david.kaplan@amd.com
2025-04-28crypto: x86/polyval - Use API partial block handlingHerbert Xu
Use the Crypto API partial block handling. Also remove the unnecessary SIMD fallback path. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-28crypto: lib/poly1305 - remove INTERNAL symbol and selection of CRYPTOEric Biggers
Now that the architecture-optimized Poly1305 kconfig symbols are defined regardless of CRYPTO, there is no need for CRYPTO_LIB_POLY1305 to select CRYPTO. So, remove that. This makes the indirection through the CRYPTO_LIB_POLY1305_INTERNAL symbol unnecessary, so get rid of that and just use CRYPTO_LIB_POLY1305 directly. Finally, make the fallback to the generic implementation use a default value instead of a select; this makes it consistent with how the arch-optimized code gets enabled and also with how CRYPTO_LIB_BLAKE2S_GENERIC gets enabled. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-28crypto: lib/chacha - remove INTERNAL symbol and selection of CRYPTOEric Biggers
Now that the architecture-optimized ChaCha kconfig symbols are defined regardless of CRYPTO, there is no need for CRYPTO_LIB_CHACHA to select CRYPTO. So, remove that. This makes the indirection through the CRYPTO_LIB_CHACHA_INTERNAL symbol unnecessary, so get rid of that and just use CRYPTO_LIB_CHACHA directly. Finally, make the fallback to the generic implementation use a default value instead of a select; this makes it consistent with how the arch-optimized code gets enabled and also with how CRYPTO_LIB_BLAKE2S_GENERIC gets enabled. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-28crypto: x86 - move library functions to arch/x86/lib/crypto/Eric Biggers
Continue disentangling the crypto library functions from the generic crypto infrastructure by moving the x86 BLAKE2s, ChaCha, and Poly1305 library functions into a new directory arch/x86/lib/crypto/ that does not depend on CRYPTO. This mirrors the distinction between crypto/ and lib/crypto/. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-28crypto: x86 - drop redundant dependencies on X86Eric Biggers
arch/x86/crypto/Kconfig is sourced only when CONFIG_X86=y, so there is no need for the symbols defined inside it to depend on X86. In the case of CRYPTO_TWOFISH_586 and CRYPTO_TWOFISH_X86_64, the dependency was actually on '(X86 || UML_X86)', which suggests that these two symbols were intended to be available under user-mode Linux as well. Yet, again these symbols were defined only when CONFIG_X86=y, so that was not the case. Just remove this redundant dependency. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-04-28x86/bugs: Restructure MMIO mitigationDavid Kaplan
Restructure MMIO mitigation to use select/update/apply functions to create consistent vulnerability handling. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-4-david.kaplan@amd.com
2025-04-28x86/bugs: Restructure TAA mitigationDavid Kaplan
Restructure TAA mitigation to use select/update/apply functions to create consistent vulnerability handling. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-3-david.kaplan@amd.com
2025-04-28x86/bugs: Restructure MDS mitigationDavid Kaplan
Restructure MDS mitigation selection to use select/update/apply functions to create consistent vulnerability handling. [ bp: rename and beef up comment over VERW mitigation selected var for maximum clarity. ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-2-david.kaplan@amd.com
2025-04-26Merge tag 'x86-urgent-2025-04-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix 32-bit kernel boot crash if passed physical memory with more than 32 address bits - Fix Xen PV crash - Work around build bug in certain limited build environments - Fix CTEST instruction decoding in insn_decoder_test * tag 'x86-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/insn: Fix CTEST instruction decoding x86/boot: Work around broken busybox 'truncate' tool x86/mm: Fix _pgd_alloc() for Xen PV mode x86/e820: Discard high memory that can't be addressed by 32-bit systems
2025-04-26Merge tag 'perf-urgent-2025-04-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc perf events fixes from Ingo Molnar: - Use POLLERR for events in error state, instead of the ambiguous POLLHUP error value - Fix non-sampling (counting) events on certain x86 platforms * tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix non-sampling (counting) events on certain x86 platforms perf/core: Change to POLLERR for pinned events with error
2025-04-26platform/x86/amd/pmc: Use FCH_PM_BASE definitionMario Limonciello
The s2idle MMIO quirk uses a scratch register in the FCH. Adjust the code to clarify that. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: platform-driver-x86@vger.kernel.org Link: https://lore.kernel.org/r/20250422234830.2840784-5-superm1@kernel.org
2025-04-26i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to ↵Mario Limonciello
<asm/amd/fch.h> SB800_PIIX4_FCH_PM_ADDR is used to indicate the base address for the FCH PM registers. Multiple drivers may need this base address, so move related defines to a common header location and rename them accordingly. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Andi Shyti <andi.shyti@kernel.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Jean Delvare <jdelvare@suse.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Sanket Goswami <Sanket.Goswami@amd.com> Cc: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: linux-i2c@vger.kernel.org Link: https://lore.kernel.org/r/20250422234830.2840784-4-superm1@kernel.org
2025-04-25KVM: SVM: avoid frequency indirect callsPeng Hao
When retpoline is enabled, indirect function calls introduce additional performance overhead. Avoid frequent indirect calls to VMGEXIT when SEV is enabled. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20250306075425.66693-1-flyingpeng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25KVM: SEV: Configure "ALLOWED_SEV_FEATURES" VMCB FieldKim Phillips
AMD EPYC 5th generation processors have introduced a feature that allows the hypervisor to control the SEV_FEATURES that are set for, or by, a guest [1]. ALLOWED_SEV_FEATURES can be used by the hypervisor to enforce that SEV-ES and SEV-SNP guests cannot enable features that the hypervisor does not want to be enabled. Always enable ALLOWED_SEV_FEATURES. A VMRUN will fail if any non-reserved bits are 1 in SEV_FEATURES but are 0 in ALLOWED_SEV_FEATURES. Some SEV_FEATURES - currently PmcVirtualization and SecureAvic (see Appendix B, Table B-4) - require an opt-in via ALLOWED_SEV_FEATURES, i.e. are off-by-default, whereas all other features are effectively on-by-default, but still honor ALLOWED_SEV_FEATURES. [1] Section 15.36.20 "Allowed SEV Features", AMD64 Architecture Programmer's Manual, Pub. 24593 Rev. 3.42 - March 2024: https://bugzilla.kernel.org/attachment.cgi?id=306250 Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250310201603.1217954-3-kim.phillips@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25x86/cpufeatures: Add "Allowed SEV Features" FeatureKishon Vijay Abraham I
Add CPU feature detection for "Allowed SEV Features" to allow the Hypervisor to enforce that SEV-ES and SEV-SNP guest VMs cannot enable features (via SEV_FEATURES) that the Hypervisor does not support or wish to be enabled. Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/20250310201603.1217954-2-kim.phillips@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25KVM: SVM: Add a mutex to dump_vmcb() to prevent concurrent outputTom Lendacky
If multiple VMRUN instructions fail, resulting in calls to dump_vmcb(), the output can become interleaved and it is impossible to identify which line of output belongs to which VMCB. Add a mutex to dump_vmcb() so that the output is serialized. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/a880678afd9488e1dd6017445802712f7c02cc6d.1742477213.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25KVM: SVM: Include the vCPU ID when dumping a VMCBTom Lendacky
Provide the vCPU ID of the VMCB in dump_vmcb(). Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/ee0af5a6c1a49aebb4a8291071c3f68cacf107b2.1742477213.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25KVM: SVM: Add the type of VM for which the VMCB/VMSA is being dumpedTom Lendacky
Add the type of VM (SVM, SEV, SEV-ES, or SEV-SNP) being dumped to the dump_vmcb() function. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/7a183a8beedf4ee26c42001160e073a884fe466e.1742477213.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25KVM: SVM: Dump guest register state in dump_vmcb()Tom Lendacky
Guest register state can be useful when debugging, include it as part of dump_vmcb(). Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/a4131a10c082a93610cac12b35dca90292e50f50.1742477213.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25KVM: SVM: Decrypt SEV VMSA in dump_vmcb() if debugging is enabledTom Lendacky
An SEV-ES/SEV-SNP VM save area (VMSA) can be decrypted if the guest policy allows debugging. Update the dump_vmcb() routine to output some of the SEV VMSA contents if possible. This can be useful for debug purposes. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/ea3b852c295b6f4b200925ed6b6e2c90d9475e71.1742477213.git.thomas.lendacky@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for everyone's favorite pile of sand: Cavium ThunderX x86: - Bugfixes from a planned posted interrupt rework - Do not use kvm_rip_read() unconditionally to cater for guests with inaccessible register state" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILING KVM: x86: Do not use kvm_rip_read() unconditionally in KVM tracepoints KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added iommu/amd: WARN if KVM attempts to set vCPU affinity without posted intrrupts iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer KVM: x86: Explicitly treat routing entry type changes as changes KVM: x86: Reset IRTE to host control if *new* route isn't postable KVM: SVM: Allocate IR data using atomic allocation KVM: SVM: Don't update IRTEs if APICv/AVIC is disabled KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline arm64: Rework checks for broken Cavium HW in the PI code
2025-04-25perf/x86: Optimize the is_x86_eventKan Liang
The current is_x86_event has to go through the hybrid_pmus list to find the matched pmu, then check if it's a X86 PMU and a X86 event. It's not necessary. The X86 PMU has a unique type ID on a non-hybrid machine, and a unique capability type. They are good enough to do the check. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250424134718.311934-5-kan.liang@linux.intel.com
2025-04-25perf/x86/intel: Check the X86 leader for ACR groupKan Liang
The auto counter reload group also requires a group flag in the leader. The leader must be a X86 event. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250424134718.311934-4-kan.liang@linux.intel.com
2025-04-25Merge branch 'perf/urgent'Peter Zijlstra
Merge urgent fixes for dependencies. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-04-25perf/x86/intel/ds: Fix counter backwards of non-precise events ↵Kan Liang
counters-snapshotting The counter backwards may be observed in the PMI handler when counters-snapshotting some non-precise events in the freq mode. For the non-precise events, it's possible the counters-snapshotting records a positive value for an overflowed PEBS event. Then the HW auto-reload mechanism reset the counter to 0 immediately. Because the pebs_event_reset is cleared in the freq mode, which doesn't set the PERF_X86_EVENT_AUTO_RELOAD. In the PMI handler, 0 will be read rather than the positive value recorded in the counters-snapshotting record. The counters-snapshotting case has to be specially handled. Since the event value has been updated when processing the counters-snapshotting record, only needs to set the new period for the counter via x86_pmu_set_period(). Fixes: e02e9b0374c3 ("perf/x86/intel: Support PEBS counters snapshotting") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250424134718.311934-6-kan.liang@linux.intel.com
2025-04-25perf/x86/intel: Check the X86 leader for pebs_counter_event_groupKan Liang
The PEBS counters snapshotting group also requires a group flag in the leader. The leader must be a X86 event. Fixes: e02e9b0374c3 ("perf/x86/intel: Support PEBS counters snapshotting") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250424134718.311934-3-kan.liang@linux.intel.com
2025-04-25perf/x86/intel: Only check the group flag for X86 leaderKan Liang
A warning in intel_pmu_lbr_counters_reorder() may be triggered by below perf command. perf record -e "{cpu-clock,cycles/call-graph="lbr"/}" -- sleep 1 It's because the group is mistakenly treated as a branch counter group. The hw.flags of the leader are used to determine whether a group is a branch counters group. However, the hw.flags is only available for a hardware event. The field to store the flags is a union type. For a software event, it's a hrtimer. The corresponding bit may be set if the leader is a software event. For a branch counter group and other groups that have a group flag (e.g., topdown, PEBS counters snapshotting, and ACR), the leader must be a X86 event. Check the X86 event before checking the flag. The patch only fixes the issue for the branch counter group. The following patch will fix the other groups. There may be an alternative way to fix the issue by moving the hw.flags out of the union type. It should work for now. But it's still possible that the flags will be used by other types of events later. As long as that type of event is used as a leader, a similar issue will be triggered. So the alternative way is dropped. Fixes: 33744916196b ("perf/x86/intel: Support branch counters logging") Closes: https://lore.kernel.org/lkml/20250412091423.1839809-1-luogengkun@huaweicloud.com/ Reported-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250424134718.311934-2-kan.liang@linux.intel.com
2025-04-24KVM: VMX: Use LEAVE in vmx_do_interrupt_irqoff()Uros Bizjak
Micro-optimize vmx_do_interrupt_irqoff() by substituting MOV %RBP,%RSP; POP %RBP instruction sequence with equivalent LEAVE instruction. GCC compiler does this by default for a generic tuning and for all modern processors: DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave", m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_ZHAOXIN | m_TREMONT | m_CORE_HYBRID | m_CORE_ATOM | m_GENERIC) The new code also saves a couple of bytes, from: 27: 48 89 ec mov %rbp,%rsp 2a: 5d pop %rbp to: 27: c9 leave No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20250414081131.97374-2-ubizjak@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: nVMX: Check MSR load/store list counts during VM-Enter consistency checksSean Christopherson
Explicitly verify the MSR load/store list counts are below the advertised limit as part of the initial consistency checks on the lists, so that code that consumes the count doesn't need to worry about extreme edge cases. Enforcing the limit during the initial checks fixes a flaw on 32-bit KVM where a sufficiently high @count could lead to overflow: arch/x86/kvm/vmx/nested.c:834 nested_vmx_check_msr_switch() warn: potential user controlled sizeof overflow 'addr + count * 16' '0-u64max + 16-68719476720' arch/x86/kvm/vmx/nested.c 827 static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, 828 u32 count, u64 addr) 829 { 830 if (count == 0) 831 return 0; 832 833 if (!kvm_vcpu_is_legal_aligned_gpa(vcpu, addr, 16) || --> 834 !kvm_vcpu_is_legal_gpa(vcpu, (addr + count * sizeof(struct vmx_msr_entry) - 1))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ While the SDM doesn't explicitly state an illegal count results in VM-Fail, the SDM states that exceeding the limit may result in undefined behavior. I.e. the SDM gives hardware, and thus KVM, carte blanche to do literally anything in response to a count that exceeds the "recommended" limit. If the limit is exceeded, undefined processor behavior may result (including a machine check during the VMX transition). KVM already enforces the limit when processing the MSRs, i.e. already signals a late VM-Exit Consistency Check for VM-Enter, and generates a VMX Abort for VM-Exit. I.e. explicitly checking the limits simply means KVM will signal VM-Fail instead of VM-Exit or VMX Abort. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/44961459-2759-4164-b604-f6bd43da8ce9@stanley.mountain Link: https://lore.kernel.org/r/20250315024402.2363098-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: SVM: Fix SNP AP destroy race with VMRUNTom Lendacky
An AP destroy request for a target vCPU is typically followed by an RMPADJUST to remove the VMSA attribute from the page currently being used as the VMSA for the target vCPU. This can result in a vCPU that is about to VMRUN to exit with #VMEXIT_INVALID. This usually does not happen as APs are typically sitting in HLT when being destroyed and therefore the vCPU thread is not running at the time. However, if HLT is allowed inside the VM, then the vCPU could be about to VMRUN when the VMSA attribute is removed from the VMSA page, resulting in a #VMEXIT_INVALID when the vCPU actually issues the VMRUN and causing the guest to crash. An RMPADJUST against an in-use (already running) VMSA results in a #NPF for the vCPU issuing the RMPADJUST, so the VMSA attribute cannot be changed until the VMRUN for target vCPU exits. The Qemu command line option '-overcommit cpu-pm=on' is an example of allowing HLT inside the guest. Update the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event to include the KVM_REQUEST_WAIT flag. The kvm_vcpu_kick() function will not wait for requests to be honored, so create kvm_make_request_and_kick() that will add a new event request and honor the KVM_REQUEST_WAIT flag. This will ensure that the target vCPU sees the AP destroy request before returning to the initiating vCPU should the target vCPU be in guest mode. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/fe2c885bf35643dd224e91294edb6777d5df23a4.1743097196.git.thomas.lendacky@amd.com [sean: add a comment explaining the use of smp_send_reschedule()] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/irq: KVM: Add helper for harvesting PIR to deduplicate KVM and posted MSIsSean Christopherson
Now that posted MSI and KVM harvesting of PIR is identical, extract the code (and posted MSI's wonderful comment) to a common helper. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Use arch_xchg() when processing PIR to avoid instrumentationSean Christopherson
Use arch_xchg() when moving IRQs from the PIR to the vIRR, purely to avoid instrumentation so that KVM is compatible with the needs of posted MSI. This will allow extracting the core PIR logic to common code and sharing it between KVM and posted MSI handling. Link: https://lore.kernel.org/r/20250401163447.846608-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Isolate pure loads from atomic XCHG when processing PIRSean Christopherson
Rework KVM's processing of the PIR to use the same algorithm as posted MSIs, i.e. to do READ(x4) => XCHG(x4) instead of (READ+XCHG)(x4). Given KVM's long-standing, sub-optimal use of 32-bit accesses to the PIR, it's safe to say far more thought and investigation was put into handling the PIR for posted MSIs, i.e. there's no reason to assume KVM's existing logic is meaningful, let alone superior. Matching the processing done by posted MSIs will also allow deduplicating the code between KVM and posted MSIs. See the comment for handle_pending_pir() added by commit 1b03d82ba15e ("x86/irq: Install posted MSI notification handler") for details on why isolating loads from XCHG is desirable. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250401163447.846608-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Process PIR using 64-bit accesses on 64-bit kernelsSean Christopherson
Process the PIR at the natural kernel width, i.e. in 64-bit chunks on 64-bit kernels, so that the worst case of having a posted IRQ in each chunk of the vIRR only requires 4 loads and xchgs from/to the PIR, not 8. Deliberately use a "continue" to skip empty entries so that the code is a carbon copy of handle_pending_pir(), in anticipation of deduplicating KVM and posted MSI logic. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250401163447.846608-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/irq: KVM: Track PIR bitmap as an "unsigned long" arraySean Christopherson
Track the PIR bitmap in posted interrupt descriptor structures as an array of unsigned longs instead of using unionized arrays for KVM (u32s) versus IRQ management (u64s). In practice, because the non-KVM usage is (sanely) restricted to 64-bit kernels, all existing usage of the u64 variant is already working with unsigned longs. Using "unsigned long" for the array will allow reworking KVM's processing of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will allow optimizing KVM by reducing the number of atomic accesses to PIR. Opportunstically replace the open coded literals in the posted MSIs code with the appropriate macro. Deliberately don't use ARRAY_SIZE() in the for-loops, even though it would be cleaner from a certain perspective, in anticipation of decoupling the processing from the array declaration. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Ensure vIRR isn't reloaded at odd times when sync'ing PIRSean Christopherson
Read each vIRR exactly once when shuffling IRQs from the PIR to the vAPIC to ensure getting the highest priority IRQ from the chunk doesn't reload from the vIRR. In practice, a reload is functionally benign as vcpu->mutex is held and so IRQs can be consumed, i.e. new IRQs can appear, but existing IRQs can't disappear. Link: https://lore.kernel.org/r/20250401163447.846608-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>