Age | Commit message (Collapse) | Author |
|
counters-snapshotting
The counter backwards may be observed in the PMI handler when
counters-snapshotting some non-precise events in the freq mode.
For the non-precise events, it's possible the counters-snapshotting
records a positive value for an overflowed PEBS event. Then the HW
auto-reload mechanism reset the counter to 0 immediately. Because the
pebs_event_reset is cleared in the freq mode, which doesn't set the
PERF_X86_EVENT_AUTO_RELOAD.
In the PMI handler, 0 will be read rather than the positive value
recorded in the counters-snapshotting record.
The counters-snapshotting case has to be specially handled. Since the
event value has been updated when processing the counters-snapshotting
record, only needs to set the new period for the counter via
x86_pmu_set_period().
Fixes: e02e9b0374c3 ("perf/x86/intel: Support PEBS counters snapshotting")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424134718.311934-6-kan.liang@linux.intel.com
|
|
The PEBS counters snapshotting group also requires a group flag in the
leader. The leader must be a X86 event.
Fixes: e02e9b0374c3 ("perf/x86/intel: Support PEBS counters snapshotting")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424134718.311934-3-kan.liang@linux.intel.com
|
|
A warning in intel_pmu_lbr_counters_reorder() may be triggered by below
perf command.
perf record -e "{cpu-clock,cycles/call-graph="lbr"/}" -- sleep 1
It's because the group is mistakenly treated as a branch counter group.
The hw.flags of the leader are used to determine whether a group is a
branch counters group. However, the hw.flags is only available for a
hardware event. The field to store the flags is a union type. For a
software event, it's a hrtimer. The corresponding bit may be set if the
leader is a software event.
For a branch counter group and other groups that have a group flag
(e.g., topdown, PEBS counters snapshotting, and ACR), the leader must
be a X86 event. Check the X86 event before checking the flag.
The patch only fixes the issue for the branch counter group.
The following patch will fix the other groups.
There may be an alternative way to fix the issue by moving the hw.flags
out of the union type. It should work for now. But it's still possible
that the flags will be used by other types of events later. As long as
that type of event is used as a leader, a similar issue will be
triggered. So the alternative way is dropped.
Fixes: 33744916196b ("perf/x86/intel: Support branch counters logging")
Closes: https://lore.kernel.org/lkml/20250412091423.1839809-1-luogengkun@huaweicloud.com/
Reported-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250424134718.311934-2-kan.liang@linux.intel.com
|
|
Micro-optimize vmx_do_interrupt_irqoff() by substituting
MOV %RBP,%RSP; POP %RBP instruction sequence with equivalent
LEAVE instruction. GCC compiler does this by default for
a generic tuning and for all modern processors:
DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_ZHAOXIN
| m_TREMONT | m_CORE_HYBRID | m_CORE_ATOM | m_GENERIC)
The new code also saves a couple of bytes, from:
27: 48 89 ec mov %rbp,%rsp
2a: 5d pop %rbp
to:
27: c9 leave
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250414081131.97374-2-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly verify the MSR load/store list counts are below the advertised
limit as part of the initial consistency checks on the lists, so that code
that consumes the count doesn't need to worry about extreme edge cases.
Enforcing the limit during the initial checks fixes a flaw on 32-bit KVM
where a sufficiently high @count could lead to overflow:
arch/x86/kvm/vmx/nested.c:834 nested_vmx_check_msr_switch()
warn: potential user controlled sizeof overflow 'addr + count * 16' '0-u64max + 16-68719476720'
arch/x86/kvm/vmx/nested.c
827 static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu,
828 u32 count, u64 addr)
829 {
830 if (count == 0)
831 return 0;
832
833 if (!kvm_vcpu_is_legal_aligned_gpa(vcpu, addr, 16) ||
--> 834 !kvm_vcpu_is_legal_gpa(vcpu, (addr + count * sizeof(struct vmx_msr_entry) - 1)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
While the SDM doesn't explicitly state an illegal count results in VM-Fail,
the SDM states that exceeding the limit may result in undefined behavior.
I.e. the SDM gives hardware, and thus KVM, carte blanche to do literally
anything in response to a count that exceeds the "recommended" limit.
If the limit is exceeded, undefined processor behavior may result
(including a machine check during the VMX transition).
KVM already enforces the limit when processing the MSRs, i.e. already
signals a late VM-Exit Consistency Check for VM-Enter, and generates a
VMX Abort for VM-Exit. I.e. explicitly checking the limits simply means
KVM will signal VM-Fail instead of VM-Exit or VMX Abort.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/44961459-2759-4164-b604-f6bd43da8ce9@stanley.mountain
Link: https://lore.kernel.org/r/20250315024402.2363098-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
An AP destroy request for a target vCPU is typically followed by an
RMPADJUST to remove the VMSA attribute from the page currently being
used as the VMSA for the target vCPU. This can result in a vCPU that
is about to VMRUN to exit with #VMEXIT_INVALID.
This usually does not happen as APs are typically sitting in HLT when
being destroyed and therefore the vCPU thread is not running at the time.
However, if HLT is allowed inside the VM, then the vCPU could be about to
VMRUN when the VMSA attribute is removed from the VMSA page, resulting in
a #VMEXIT_INVALID when the vCPU actually issues the VMRUN and causing the
guest to crash. An RMPADJUST against an in-use (already running) VMSA
results in a #NPF for the vCPU issuing the RMPADJUST, so the VMSA
attribute cannot be changed until the VMRUN for target vCPU exits. The
Qemu command line option '-overcommit cpu-pm=on' is an example of allowing
HLT inside the guest.
Update the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event to include the
KVM_REQUEST_WAIT flag. The kvm_vcpu_kick() function will not wait for
requests to be honored, so create kvm_make_request_and_kick() that will
add a new event request and honor the KVM_REQUEST_WAIT flag. This will
ensure that the target vCPU sees the AP destroy request before returning
to the initiating vCPU should the target vCPU be in guest mode.
Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/fe2c885bf35643dd224e91294edb6777d5df23a4.1743097196.git.thomas.lendacky@amd.com
[sean: add a comment explaining the use of smp_send_reschedule()]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that posted MSI and KVM harvesting of PIR is identical, extract the
code (and posted MSI's wonderful comment) to a common helper.
No functional change intended.
Link: https://lore.kernel.org/r/20250401163447.846608-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use arch_xchg() when moving IRQs from the PIR to the vIRR, purely to avoid
instrumentation so that KVM is compatible with the needs of posted MSI.
This will allow extracting the core PIR logic to common code and sharing
it between KVM and posted MSI handling.
Link: https://lore.kernel.org/r/20250401163447.846608-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rework KVM's processing of the PIR to use the same algorithm as posted
MSIs, i.e. to do READ(x4) => XCHG(x4) instead of (READ+XCHG)(x4). Given
KVM's long-standing, sub-optimal use of 32-bit accesses to the PIR, it's
safe to say far more thought and investigation was put into handling the
PIR for posted MSIs, i.e. there's no reason to assume KVM's existing
logic is meaningful, let alone superior.
Matching the processing done by posted MSIs will also allow deduplicating
the code between KVM and posted MSIs.
See the comment for handle_pending_pir() added by commit 1b03d82ba15e
("x86/irq: Install posted MSI notification handler") for details on
why isolating loads from XCHG is desirable.
Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250401163447.846608-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Process the PIR at the natural kernel width, i.e. in 64-bit chunks on
64-bit kernels, so that the worst case of having a posted IRQ in each
chunk of the vIRR only requires 4 loads and xchgs from/to the PIR, not 8.
Deliberately use a "continue" to skip empty entries so that the code is a
carbon copy of handle_pending_pir(), in anticipation of deduplicating KVM
and posted MSI logic.
Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250401163447.846608-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track the PIR bitmap in posted interrupt descriptor structures as an array
of unsigned longs instead of using unionized arrays for KVM (u32s) versus
IRQ management (u64s). In practice, because the non-KVM usage is (sanely)
restricted to 64-bit kernels, all existing usage of the u64 variant is
already working with unsigned longs.
Using "unsigned long" for the array will allow reworking KVM's processing
of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will
allow optimizing KVM by reducing the number of atomic accesses to PIR.
Opportunstically replace the open coded literals in the posted MSIs code
with the appropriate macro. Deliberately don't use ARRAY_SIZE() in the
for-loops, even though it would be cleaner from a certain perspective, in
anticipation of decoupling the processing from the array declaration.
No functional change intended.
Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Read each vIRR exactly once when shuffling IRQs from the PIR to the vAPIC
to ensure getting the highest priority IRQ from the chunk doesn't reload
from the vIRR. In practice, a reload is functionally benign as vcpu->mutex
is held and so IRQs can be consumed, i.e. new IRQs can appear, but existing
IRQs can't disappear.
Link: https://lore.kernel.org/r/20250401163447.846608-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track whether or not at least one IRQ was found in PIR during the initial
loop to load PIR chunks from memory. Doing so generates slightly better
code (arguably) for processing the for-loop of XCHGs, especially for the
case where there are no pending IRQs.
Note, while PIR can be modified between the initial load and the XCHG, it
can only _gain_ new IRQs, i.e. there is no danger of a false positive due
to the final version of pir_copy[] being empty.
Opportunistically convert the boolean to an "unsigned long" and compute
the effective boolean result via bitwise-OR. Some compilers, e.g.
clang-14, need the extra "hint" to elide conditional branches.
Opportunistically rename the variable in anticipation of moving the PIR
accesses to a common helper that can be shared by posted MSIs and KVM.
Old:
<+74>: test %rdx,%rdx
<+77>: je 0xffffffff812bbeb0 <handle_pending_pir+144>
<pir[0]>
<+88>: mov $0x1,%dl>
<+90>: test %rsi,%rsi
<+93>: je 0xffffffff812bbe8c <handle_pending_pir+108>
<pir[1]>
<+106>: mov $0x1,%dl
<+108>: test %rcx,%rcx
<+111>: je 0xffffffff812bbe9e <handle_pending_pir+126>
<pir[2]>
<+124>: mov $0x1,%dl
<+126>: test %rax,%rax
<+129>: je 0xffffffff812bbeb9 <handle_pending_pir+153>
<pir[3]>
<+142>: jmp 0xffffffff812bbec1 <handle_pending_pir+161>
<+144>: xor %edx,%edx
<+146>: test %rsi,%rsi
<+149>: jne 0xffffffff812bbe7f <handle_pending_pir+95>
<+151>: jmp 0xffffffff812bbe8c <handle_pending_pir+108>
<+153>: test %dl,%dl
<+155>: je 0xffffffff812bbf8e <handle_pending_pir+366>
New:
<+74>: mov %rax,%r8
<+77>: or %rcx,%r8
<+80>: or %rdx,%r8
<+83>: or %rsi,%r8
<+86>: setne %bl
<+89>: je 0xffffffff812bbf88 <handle_pending_pir+360>
<+95>: test %rsi,%rsi
<+98>: je 0xffffffff812bbe8d <handle_pending_pir+109>
<pir[0]>
<+109>: test %rdx,%rdx
<+112>: je 0xffffffff812bbe9d <handle_pending_pir+125>
<pir[1]>
<+125>: test %rcx,%rcx
<+128>: je 0xffffffff812bbead <handle_pending_pir+141>
<pir[2]>
<+141>: test %rax,%rax
<+144>: je 0xffffffff812bbebd <handle_pending_pir+157>
<pir[3]>
Link: https://lore.kernel.org/r/20250401163447.846608-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Ensure the PIR is read exactly once at the start of handle_pending_pir(),
to guarantee that checking for an outstanding posted interrupt in a given
chuck doesn't reload the chunk from the "real" PIR. Functionally, a reload
is benign, but it would defeat the purpose of pre-loading into a copy.
Fixes: 1b03d82ba15e ("x86/irq: Install posted MSI notification handler")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250401163447.846608-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
insn_decoder_test found a problem with decoding APX CTEST instructions:
Found an x86 instruction decoder bug, please report this.
ffffffff810021df 62 54 94 05 85 ff ctestneq
objdump says 6 bytes, but insn_get_length() says 5
It happens because x86-opcode-map.txt doesn't specify arguments for the
instruction and the decoder doesn't expect to see ModRM byte.
Fixes: 690ca3a3067f ("x86/insn: Add support for APX EVEX instructions to the opcode map")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org # v6.10+
Link: https://lore.kernel.org/r/20250423065815.2003231-1-kirill.shutemov@linux.intel.com
|
|
Add a module param to each KVM vendor module to allow disabling device
posted interrupts without having to sacrifice all of APICv/AVIC, and to
also effectively enumerate to userspace whether or not KVM may be
utilizing device posted IRQs. Disabling device posted interrupts is
very desirable for testing, and can even be desirable for production
environments, e.g. if the host kernel wants to interpose on device
interrupts.
Put the module param in kvm-{amd,intel}.ko instead of kvm.ko to match
the overall APICv/AVIC controls, and to avoid complications with said
controls. E.g. if the param is in kvm.ko, KVM needs to be snapshot the
original user-defined value to play nice with a vendor module being
reloaded with different enable_apicv settings.
Link: https://lore.kernel.org/r/20250401161804.842968-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When starting device assignment, i.e. potential IRQ bypass, don't blast
KVM_REQ_UNBLOCK if APICv is disabled/unsupported. There is no need to
wake vCPUs if they can never use VT-d posted IRQs (sending UNBLOCK guards
against races being vCPUs blocking and devices starting IRQ bypass).
Opportunistically use kvm_arch_has_irq_bypass() for all relevant checks in
the VMX Posted Interrupt code so that all checks in KVM x86 incorporate
the same information (once AMD/AVIC is given similar treatment).
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250401161804.842968-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rescan I/O APIC routes for a vCPU after handling an intercepted I/O APIC
EOI for an IRQ that is not targeting said vCPU, i.e. after handling what's
effectively a stale EOI VM-Exit. If a level-triggered IRQ is in-flight
when IRQ routing changes, e.g. because the guest changes routing from its
IRQ handler, then KVM intercepts EOIs on both the new and old target vCPUs,
so that the in-flight IRQ can be de-asserted when it's EOI'd.
However, only the EOI for the in-flight IRQ needs to be intercepted, as
IRQs on the same vector with the new routing are coincidental, i.e. occur
only if the guest is reusing the vector for multiple interrupt sources.
If the I/O APIC routes aren't rescanned, KVM will unnecessarily intercept
EOIs for the vector and negative impact the vCPU's interrupt performance.
Note, both commit db2bdcbbbd32 ("KVM: x86: fix edge EOI and IOAPIC reconfig
race") and commit 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI
and IOAPIC reconfigure race") mentioned this issue, but it was considered
a "rare" occurrence thus was not addressed. However in real environments,
this issue can happen even in a well-behaved guest.
Cc: Kai Huang <kai.huang@intel.com>
Co-developed-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: weizijie <zijie.wei@linux.alibaba.com>
[sean: massage changelog and comments, use int/-1, reset at scan]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract the vCPU specific EOI interception logic for I/O APIC emulation
into a common helper for userspace and in-kernel emulation in anticipation
of optimizing the "pending EOI" case.
No functional change intended.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract and isolate the trigger mode check in kvm_scan_ioapic_routes() in
anticipation of moving destination matching logic to a common helper (for
userspace vs. in-kernel I/O APIC emulation).
No functional change intended.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The latest AMD platform has introduced a new instruction called PREFETCHI.
This instruction loads a cache line from a specified memory address into
the indicated data or instruction cache level, based on locality reference
hints.
Feature bit definition:
CPUID_Fn80000021_EAX [bit 20] - Indicates support for IC prefetch.
This feature is analogous to Intel's PREFETCHITI (CPUID.(EAX=7,ECX=1):EDX),
though the CPUID bit definitions differ between AMD and Intel.
Advertise support to userspace, as no additional enabling is necessary
(PREFETCHI can't be intercepted as there's no instruction specific behavior
that needs to be virtualize).
The feature is documented in Processor Programming Reference (PPR)
for AMD Family 1Ah Model 02h, Revision C1 (Link below).
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ee1c08fc400bb574a2b8f2c6a0bd9def10a29d35.1744130533.git.babu.moger@amd.com
[sean: rewrite shortlog to highlight the KVM functionality]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WRMSR_XX_BASE_NS is bit 1 so put it there, add some new bits as
comments only.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250324160617.15379-1-bp@kernel.org
[sean: skip the FSRS/FSRC placeholders to avoid confusion]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Returning a literal X86EMUL_CONTINUE is slightly clearer than returning
rc.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/7604cbbf-15e6-45a8-afec-cf5be46c2924@stanley.mountain
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise support for WRMSRNS (WRMSR non-serializing) to userspace if the
instruction is supported by the underlying CPU. From a virtualization
perspective, the only difference between WRMSRNS and WRMSR is that VM-Exits
due to WRMSRNS set EXIT_QUALIFICATION to '1'. WRMSRNS doesn't require a
new enabling control, shares the same basic exit reason, and behaves the
same as WRMSR with respect to MSR interception.
WRMSR and WRMSRNS use the same basic exit reason (see Appendix C). For
WRMSR, the exit qualification is 0, while for WRMSRNS it is 1.
Don't do anything different when emulating WRMSRNS vs. WRMSR, as KVM can't
do anything less, i.e. can't make emulation non-serializing. The
motivation for the guest to use WRMSRNS instead of WRMSR is to avoid
immediately serializing the CPU when the necessary serialization is
guaranteed by some other mechanism, i.e. WRMSRNS being fully serializing
isn't guest-visible, just less performant.
Suggested-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250227010111.3222742-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename the WRMSRNS instruction opcode macro so that it doesn't collide
with X86_FEATURE_WRMSRNS when using token pasting to generate references
to X86_FEATURE_WRMSRNS. KVM heavily uses token pasting to generate KVM's
set of support feature bits, and adding WRMSRNS support in KVM will run
will run afoul of the opcode macro.
arch/x86/kvm/cpuid.c:719:37: error: pasting "X86_FEATURE_" and "" "" does not
give a valid preprocessing token
719 | u32 __leaf = __feature_leaf(X86_FEATURE_##name); \
| ^~~~~~~~~~~~
KVM has worked around one such collision in the past by #undef'ing the
problematic macro in order to avoid blocking a KVM rework, but such games
are generally undesirable, e.g. requires bleeding macro details into KVM,
risks weird behavior if what KVM is #undef'ing changes, etc.
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250227010111.3222742-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Commit 2e7eab81425a ("KVM: VMX: Execute IBPB on emulated VM-exit when
guest has IBRS") added an IBPB in the emulated VM-exit path on Intel to
properly virtualize IBRS by providing separate predictor modes for L1
and L2.
AMD requires similar handling, except when IbrsSameMode is enumerated by
the host CPU (which is the case on most/all AMD CPUs). With
IbrsSameMode, hardware IBRS is sufficient and no extra handling is
needed from KVM.
Generalize the handling in nested_vmx_vmexit() by moving it into a
generic function, add the AMD handling, and use it in
nested_svm_vmexit() too. The main reason for using a generic function is
to have a single place to park the huge comment about virtualizing IBRS.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250221163352.3818347-4-yosry.ahmed@linux.dev
[sean: use kvm_nested_vmexit_handle_spec_ctrl() for the helper]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
If IBRS provides same mode (kernel/user or host/guest) protection on the
host, then by definition it also provides same mode protection in the
guest. In fact, all different modes from the guest's perspective are the
same mode from the host's perspective anyway.
Propagate IbrsSameMode to the guests.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250221163352.3818347-3-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Per the APM [1]:
Some processors, identified by CPUID Fn8000_0008_EBX[IbrsSameMode]
(bit 19) = 1, provide additional speculation limits. For these
processors, when IBRS is set, indirect branch predictions are not
influenced by any prior indirect branches, regardless of mode (CPL
and guest/host) and regardless of whether the prior indirect branches
occurred before or after the setting of IBRS. This is referred to as
Same Mode IBRS.
Define this feature bit, which will be used by KVM to determine if an
IBPB is required on nested VM-exits in SVM.
[1] AMD64 Architecture Programmer's Manual Pub. 40332, Rev 4.08 - April
2024, Volume 2, 3.2.9 Speculation Control MSRs
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250221163352.3818347-2-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The "kvm_run->kvm_valid_regs" and "kvm_run->kvm_dirty_regs" variables are
u64 type. We are only using the lowest 3 bits but we want to ensure that
the users are not passing invalid bits so that we can use the remaining
bits in the future.
However "sync_valid_fields" and kvm_sync_valid_fields() are u32 type so
the check only ensures that the lower 32 bits are clear. Fix this by
changing the types to u64.
Fixes: 74c1807f6c4f ("KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/ec25aad1-113e-4c6e-8941-43d432251398@stanley.mountain
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Previously, commit ed129ec9057f ("KVM: x86: forcibly leave nested mode
on vCPU reset") addressed an issue where a triple fault occurring in
nested mode could lead to use-after-free scenarios. However, the commit
did not handle the analogous situation for System Management Mode (SMM).
This omission results in triggering a WARN when KVM forces a vCPU INIT
after SHUTDOWN interception while the vCPU is in SMM. This situation was
reprodused using Syzkaller by:
1) Creating a KVM VM and vCPU
2) Sending a KVM_SMI ioctl to explicitly enter SMM
3) Executing invalid instructions causing consecutive exceptions and
eventually a triple fault
The issue manifests as follows:
WARNING: CPU: 0 PID: 25506 at arch/x86/kvm/x86.c:12112
kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
Modules linked in:
CPU: 0 PID: 25506 Comm: syz-executor.0 Not tainted
6.1.130-syzkaller-00157-g164fe5dde9b6 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
RIP: 0010:kvm_vcpu_reset+0x1d2/0x1530 arch/x86/kvm/x86.c:12112
Call Trace:
<TASK>
shutdown_interception+0x66/0xb0 arch/x86/kvm/svm/svm.c:2136
svm_invoke_exit_handler+0x110/0x530 arch/x86/kvm/svm/svm.c:3395
svm_handle_exit+0x424/0x920 arch/x86/kvm/svm/svm.c:3457
vcpu_enter_guest arch/x86/kvm/x86.c:10959 [inline]
vcpu_run+0x2c43/0x5a90 arch/x86/kvm/x86.c:11062
kvm_arch_vcpu_ioctl_run+0x50f/0x1cf0 arch/x86/kvm/x86.c:11283
kvm_vcpu_ioctl+0x570/0xf00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4122
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Architecturally, INIT is blocked when the CPU is in SMM, hence KVM's WARN()
in kvm_vcpu_reset() to guard against KVM bugs, e.g. to detect improper
emulation of INIT. SHUTDOWN on SVM is a weird edge case where KVM needs to
do _something_ sane with the VMCB, since it's technically undefined, and
INIT is the least awful choice given KVM's ABI.
So, double down on stuffing INIT on SHUTDOWN, and force the vCPU out of
SMM to avoid any weirdness (and the WARN).
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: ed129ec9057f ("KVM: x86: forcibly leave nested mode on vCPU reset")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
Link: https://lore.kernel.org/r/20250414171207.155121-1-m.lobanov@rosa.ru
[sean: massage changelog, make it clear this isn't architectural behavior]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Perf doesn't work at perf stat for hardware events on certain x86 platforms:
$perf stat -- sleep 1
Performance counter stats for 'sleep 1':
16.44 msec task-clock # 0.016 CPUs utilized
2 context-switches # 121.691 /sec
0 cpu-migrations # 0.000 /sec
54 page-faults # 3.286 K/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
The reason is that the check in x86_pmu_hw_config() for sampling events is
unexpectedly applied to counting events as well.
It should only impact x86 platforms with limit_period used for non-PEBS
events. For Intel platforms, it should only impact some older platforms,
e.g., HSW, BDW and NHM.
Fixes: 88ec7eedbbd2 ("perf/x86: Fix low freqency setting issue")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250423064724.3716211-1-luogengkun@huaweicloud.com
|
|
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI
code, adding an open-coded erratum check for Cavium ThunderX
* Bugfixes from a planned posted interrupt rework
* Do not use kvm_rip_read() unconditionally to cater for guests
with inaccessible register state.
|
|
The GNU coreutils version of truncate, which is the original, accepts a
% prefix for the -s size argument which means the file in question
should be padded to a multiple of the given size. This is currently used
to pad the setup block of bzImage to a multiple of 4k before appending
the decompressor.
busybox reimplements truncate but does not support this idiom, and
therefore fails the build since commit
9c54baab4401 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it")
Since very little build code within the kernel depends on the 'truncate'
utility, work around this incompatibility by avoiding truncate altogether,
and relying on dd to perform the padding.
Fixes: 9c54baab4401 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it")
Reported-by: <phasta@kernel.org>
Tested-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250424101917.1552527-2-ardb+git@google.com
|
|
This commits breaks SNP guests:
234cf67fc3bd ("x86/sev: Split off startup code from core code")
The SNP guest boots, but no longer has access to the VMPCK keys needed
to communicate with the ASP, which is used, for example, to obtain an
attestation report.
The secrets_pa value is defined as static in both startup.c and
core.c. It is set by a function in startup.c and so when used in
core.c its value will be 0.
Share it again and add the sev_ prefix to put it into the global
SEV symbols namespace.
[ mingo: Renamed to sev_secrets_pa ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Link: https://lore.kernel.org/r/cf878810-81ed-3017-52c6-ce6aa41b5f01@amd.com
|
|
Not all VMs allow access to RIP. Check guest_state_protected before
calling kvm_rip_read().
This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for
TDX VMs.
Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Not all VMs allow access to RIP. Check guest_state_protected before
calling kvm_rip_read().
This avoids, for example, hitting WARN_ON_ONCE in vt_cache_reg() for
TDX VMs.
Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-2-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM
attempts to track an AMD IRTE entry without metadata.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure
irqfd->producer isn't modified while kvm_irq_routing_update() is running.
The only lock held when a producer is added/removed is irqbypass's mutex.
Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Explicitly treat type differences as GSI routing changes, as comparing MSI
data between two entries could get a false negative, e.g. if userspace
changed the type but left the type-specific data as-is.
Fixes: 515a0c79e796 ("kvm: irqfd: avoid update unmodified entries of the routing")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Restore an IRTE back to host control (remapped or posted MSI mode) if the
*new* GSI route prevents posting the IRQ directly to a vCPU, regardless of
the GSI routing type. Updating the IRTE if and only if the new GSI is an
MSI results in KVM leaving an IRTE posting to a vCPU.
The dangling IRTE can result in interrupts being incorrectly delivered to
the guest, and in the worst case scenario can result in use-after-free,
e.g. if the VM is torn down, but the underlying host IRQ isn't freed.
Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts")
Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Allocate SVM's interrupt remapping metadata using GFP_ATOMIC as
svm_ir_list_add() is called with IRQs are disabled and irqfs.lock held
when kvm_irq_routing_update() reacts to GSI routing changes.
Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Skip IRTE updates if AVIC is disabled/unsupported, as forcing the IRTE
into remapped mode (kvm_vcpu_apicv_active() will never be true) is
unnecessary and wasteful. The IOMMU driver is responsible for putting
IRTEs into remapped mode when an IRQ is allocated by a device, long before
that device is assigned to a VM. I.e. the kernel as a whole has major
issues if the IRTE isn't already in remapped mode.
Opportunsitically kvm_arch_has_irq_bypass() to query for APICv/AVIC, so
so that all checks in KVM x86 incorporate the same information.
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250401161804.842968-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare.
Make it inline---it also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently exported.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Recently _pgd_alloc() was switched from using __get_free_pages() to
pagetable_alloc_noprof(), which might return a compound page in case
the allocation order is larger than 0.
On x86 this will be the case if CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
is set, even if PTI has been disabled at runtime.
When running as a Xen PV guest (this will always disable PTI), using
a compound page for a PGD will result in VM_BUG_ON_PGFLAGS being
triggered when the Xen code tries to pin the PGD.
Fix the Xen issue together with the not needed 8k allocation for a
PGD with PTI disabled by replacing PGD_ALLOCATION_ORDER with an
inline helper returning the needed order for PGD allocations.
Fixes: a9b3c355c2e6 ("asm-generic: pgalloc: provide generic __pgd_{alloc,free}")
Reported-by: Petr Vaněk <arkamar@atlas.cz>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Petr Vaněk <arkamar@atlas.cz>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250422131717.25724-1-jgross%40suse.com
|
|
Use the Crypto API partial block handling.
Also remove the unnecessary SIMD fallback path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Use the Crypto API partial block handling.
Also remove the unnecessary SIMD fallback path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Now that all sha256_base users have been converted to use the API
partial block handling, remove the partial block helpers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Use the Crypto API partial block handling.
Also remove the unnecessary SIMD fallback path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
objtool already struggles to identify jump tables correctly in non-PIC
code, where the idiom is something like
jmpq *table(,%idx,8)
and the table is a list of absolute addresses of jump targets.
When using -fPIC, both the table reference as well as the jump targets
are emitted in a RIP-relative manner, resulting in something like
leaq table(%rip), %tbl
movslq (%tbl,%idx,4), %offset
addq %offset, %tbl
jmpq *%tbl
and the table is a list of offsets of the jump targets relative to the
start of the entire table.
Considering that this sequence of instructions can be interleaved with
other instructions that have nothing to do with the jump table in
question, it is extremely difficult to infer the control flow by
deriving the jump targets from the indirect jump, the location of the
table and the relative offsets it contains.
So let's not bother and disable jump tables for code built with -fPIC
under arch/x86/boot/startup.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250422210510.600354-2-ardb+git@google.com
|
|
Use the Crypto API partial block handling.
Also remove the unnecessary SIMD fallback path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|