Age | Commit message (Collapse) | Author |
|
Remove the redundant .hwapic_irr_update() ops.
If a vCPU has APICv enabled, KVM updates its RVI before VM-enter to L1
in vmx_sync_pir_to_irr(). This guarantees RVI is up-to-date and aligned
with the vIRR in the virtual APIC. So, no need to update RVI every time
the vIRR changes.
Note that KVM never updates vmcs02 RVI in .hwapic_irr_update() or
vmx_sync_pir_to_irr(). So, removing .hwapic_irr_update() has no
impact to the nested case.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241111085947.432645-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add VMX/SVM specific interrupt injection info the kvm_entry tracepoint.
As is done with kvm_exit, gather the information via a kvm_x86_ops hook
to avoid the moderately costly VMREADs on VMX when the tracepoint isn't
enabled.
Opportunistically rename the parameters in the get_exit_info()
declaration to match the names used by both SVM and VMX.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240910200350.264245-2-mlevitsk@redhat.com
[sean: drop is_guest_mode() change, use intr_info/error_code for names]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move handling of emulation during event vectoring, which KVM doesn't
support, into VMX's check_emulate_instruction(), so that KVM detects
all unsupported emulation, not just cached emulated MMIO (EPT misconfig).
E.g. on emulated MMIO that isn't cached (EPT Violation) or occurs with
legacy shadow paging (#PF).
Rejecting emulation on other sources of emulation also fixes a largely
theoretical flaw (thanks to the "unprotect and retry" logic), where KVM
could incorrectly inject a #DF:
1. CPU executes an instruction and hits a #GP
2. While vectoring the #GP, a shadow #PF occurs
3. On the #PF VM-Exit, KVM re-injects #GP
4. KVM emulates because of the write-protected page
5. KVM "successfully" emulates and also detects the #GP
6. KVM synthesizes a #GP, and since #GP has already been injected,
incorrectly escalates to a #DF.
Fix the comment about EMULTYPE_PF as this flag doesn't necessarily
mean MMIO anymore: it can also be set due to the write protection
violation.
Note, handle_ept_misconfig() checks vmx_check_emulate_instruction() before
attempting emulation of any kind.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-5-iorlov@amazon.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract VMX code for unhandleable VM-Exit during vectoring into
vendor-agnostic function so that boiler-plate code can be shared by SVM.
To avoid unnecessarily complexity in the helper, unconditionally report a
GPA to userspace instead of having a conditional entry. For exits that
don't report a GPA, i.e. everything except EPT Misconfig, simply report
KVM's "invalid GPA".
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com
[sean: clarify that the INVALID_GPA logic is new]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Replace the internals of the governed features framework with a more
comprehensive "guest CPU capabilities" implementation, i.e. with a guest
version of kvm_cpu_caps. Keep the skeleton of governed features around
for now as vmx_adjust_sec_exec_control() relies on detecting governed
features to do the right thing for XSAVES, and switching all guest feature
queries to guest_cpu_cap_has() requires subtle and non-trivial changes,
i.e. is best done as a standalone change.
Tracking *all* guest capabilities that KVM cares will allow excising the
poorly named "governed features" framework, and effectively optimizes all
KVM queries of guest capabilities, i.e. doesn't require making a
subjective decision as to whether or not a feature is worth "governing",
and doesn't require adding the code to do so.
The cost of tracking all features is currently 92 bytes per vCPU on 64-bit
kernels: 100 bytes for cpu_caps versus 8 bytes for governed_features.
That cost is well worth paying even if the only benefit was eliminating
the "governed features" terminology. And practically speaking, the real
cost is zero unless those 92 bytes pushes the size of vcpu_vmx or vcpu_svm
into a new order-N allocation, and if that happens there are better ways
to reduce the footprint of kvm_vcpu_arch, e.g. making the PMU and/or MTRR
state separate allocations.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM only searches for KVM's PV CPUID base when userspace sets
guest CPUID, drop the cache and simply do the search every time.
Practically speaking, this is a nop except for situations where userspace
sets CPUID _after_ running the vCPU, which is anything but a hot path,
e.g. QEMU does so only when hotplugging a vCPU. And on the flip side,
caching guest CPUID information, especially information that is used to
query/modify _other_ CPUID state, is inherently dangerous as it's all too
easy to use stale information, i.e. KVM should only cache CPUID state when
the performance and/or programming benefits justify it.
Link: https://lore.kernel.org/r/20241128013424.4096668-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The leaf names are not consistent. Give them all a CPUID_LEAF_ prefix
for consistency and vertical alignment.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Jiang <dave.jiang@intel.com> # for ioatdma bits
Link: https://lore.kernel.org/all/20241213205040.7B0C3241%40davehans-spike.ostc.intel.com
|
|
Move the XSAVE-related CPUID leaf definitions to common code. Then,
use the new definition to remove the last magic number from the CPUID
level dependency table.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205037.43C57CDE%40davehans-spike.ostc.intel.com
|
|
All the code that reads the CPUID frequency information leaf hard-codes
a magic number. Give it a symbolic name and use it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205036.4397658F%40davehans-spike.ostc.intel.com
|
|
Prepare to use the TSC CPUID leaf definition more widely by moving
it to the common header.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205033.68799E53%40davehans-spike.ostc.intel.com
|
|
The DCA leaf number is also hard-coded in the CPUID level dependency
table. Move its definition to common code and use it.
While at it, fix up the naming and types in the probe code. All
CPUID data is provided in 32-bit registers, not 'unsigned long'.
Also stop referring to "level_9". Move away from test_bit()
because the type is no longer an 'unsigned long'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205032.476A30FE%40davehans-spike.ostc.intel.com
|
|
Begin constructing a common place to keep all CPUID leaf definitions.
Move CPUID_MWAIT_LEAF to the CPUID header and include it where
needed.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/all/20241213205028.EE94D02A%40davehans-spike.ostc.intel.com
|
|
All the users of 'x86_cpu_desc' are gone. Zap it from the tree.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185133.AF0BF2BC%40davehans-spike.ostc.intel.com
|
|
The x86_match_cpu() infrastructure can match CPU steppings. Since
there are only 16 possible steppings, the matching infrastructure goes
all out and stores the stepping match as a bitmap. That means it can
match any possible steppings in a single list entry. Fun.
But it exposes this bitmap to each of the X86_MATCH_*() helpers when
none of them really need a bitmap. It makes up for this by exporting a
helper (X86_STEPPINGS()) which converts a contiguous stepping range
into the bitmap which every single user leverages.
Instead of a bitmap, have the main helper for this sort of thing
(X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up
actually being even more compact than before.
Leave the helper in place (renamed to __X86_STEPPINGS()) to make it
more clear what is going on instead of just having a random GENMASK()
in the middle of an already complicated macro.
One oddity that I hit was this macro:
X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues)
It *could* have been converted over to take a min/max stepping value
for each entry. But that would have been a bit too verbose and would
prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0)
from sticking out.
Instead, just have it take a *maximum* stepping and imply that the match
is from 0=>max_stepping. This is functional for all the cases now and
also retains the nice property of having INTEL_COMETLAKE_L stepping 0
stick out like a sore thumb.
skx_cpuids[] is goofy. It uses the stepping match but encodes all
possible steppings. Just use a normal, non-stepping match helper.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
|
|
The 'x86_cpu_id' and 'x86_cpu_desc' structures are very similar and
need to be consolidated. There is a microcode version matching
function for 'x86_cpu_desc' but not 'x86_cpu_id'.
Create one for 'x86_cpu_id'.
This essentially just leverages the x86_cpu_id->driver_data field
to replace the less generic x86_cpu_desc->x86_microcode_rev field.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185128.8F24EEFC%40davehans-spike.ostc.intel.com
|
|
The hypercall page is no longer needed. It can be removed, as from the
Xen perspective it is optional.
But, from Linux's perspective, it removes naked RET instructions that
escape the speculative protections that Call Depth Tracking and/or
Untrain Ret are trying to achieve.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
|
|
Call the Xen hypervisor via the new xen_hypercall_func static-call
instead of the hypercall page.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
|
|
Add generic hypercall functions usable for all normal (i.e. not iret)
hypercalls. Depending on the guest type and the processor vendor
different functions need to be used due to the to be used instruction
for entering the hypervisor:
- PV guests need to use syscall
- HVM/PVH guests on Intel need to use vmcall
- HVM/PVH guests on AMD and Hygon need to use vmmcall
As PVH guests need to issue hypercalls very early during boot, there
is a 4th hypercall function needed for HVM/PVH which can be used on
Intel and AMD processors. It will check the vendor type and then set
the Intel or AMD specific function to use via static_call().
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
|
|
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX
can defer the update until after nested VM-Exit if an EOI for L1's vAPIC
occurs while L2 is active.
Note, commit d39850f57d21 ("KVM: x86: Drop @vcpu parameter from
kvm_x86_ops.hwapic_isr_update()") removed the parameter with the
justification that doing so "allows for a decent amount of (future)
cleanup in the APIC code", but it's not at all clear what cleanup was
intended, or if it was ever realized.
No functional change intended.
Cc: stable@vger.kernel.org
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
A segmented RMP table allows for improved locality of reference between
the memory protected by the RMP and the RMP entries themselves.
Add support to detect and initialize a segmented RMP table with multiple
segments as configured by the system BIOS. While the RMPREAD instruction
will be used to read an RMP entry in a segmented RMP, initialization and
debugging capabilities will require the mapping of the segments.
The RMP_CFG MSR indicates if segmented RMP support is enabled and, if
enabled, the amount of memory that an RMP segment covers. When segmented
RMP support is enabled, the RMP_BASE MSR points to the start of the RMP
bookkeeping area, which is 16K in size. The RMP Segment Table (RST) is
located immediately after the bookkeeping area and is 4K in size. The RST
contains up to 512 8-byte entries that identify the location of the RMP
segment and amount of memory mapped by the segment (which must be less
than or equal to the configured segment size). The physical address that
is covered by a segment is based on the segment size and the index of the
segment in the RST. The RMP entry for a physical address is based on the
offset within the segment.
For example, if the segment size is 64GB (0x1000000000 or 1 << 36), then
physical address 0x9000800000 is RST entry 9 (0x9000800000 >> 36) and
RST entry 9 covers physical memory 0x9000000000 to 0x9FFFFFFFFF.
The RMP entry index within the RMP segment is the physical address
AND-ed with the segment mask, 64GB - 1 (0xFFFFFFFFF), and then
right-shifted 12 bits or PHYS_PFN(0x9000800000 & 0xFFFFFFFFF), which
is 0x800.
CPUID 0x80000025_EBX[9:0] describes the number of RMP segments that can
be cached by the hardware. Additionally, if CPUID 0x80000025_EBX[10] is
set, then the number of actual RMP segments defined cannot exceed the
number of RMP segments that can be cached and can be used as a maximum
RST index.
[ bp: Unify printk hex format specifiers. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/02afd0ffd097a19cb6e5fb1bb76eb110496c5b11.1734101742.git.thomas.lendacky@amd.com
|
|
The RMPREAD instruction returns an architecture defined format of an
RMP table entry. This is the preferred method for examining RMP entries.
The instruction is advertised in CPUID 0x8000001f_EAX[21]. Use this
instruction when available.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/72c734ac8b324bbc0c839b2c093a11af4a8881fa.1733172653.git.thomas.lendacky@amd.com
|
|
Add static_call_update_early() for updating static-call targets in
very early boot.
This will be needed for support of Xen guest type specific hypercall
functions.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
|
|
In order to be able to differentiate between AMD and Intel based
systems for very early hypercalls without having to rely on the Xen
hypercall page, make get_cpu_vendor() non-static.
Refactor early_cpu_init() for the same reason by splitting out the
loop initializing cpu_devs() into an externally callable function.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
cpu_feature_enabled() should be used in most cases when CPU feature
support needs to be tested in code. Document that.
Reported-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241031103401.GBZyNdGQ-ZyXKyzC_z@fat_crate.local
|
|
Rename the helper to better reflect its function.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20241202073139.448208-1-kirill.shutemov@linux.intel.com
|
|
The pv_ops::cpu.wbinvd paravirt callback is a leftover of lguest times.
Today it is no longer needed, as all users use the native WBINVD
implementation.
Remove the callback and rename native_wbinvd() to wbinvd().
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241203071550.26487-1-jgross@suse.com
|
|
Linux defined feature bits X86_FEATURE_P3 and X86_FEATURE_P4 are not
used anywhere. Commit f31d731e4467 ("x86: use X86_FEATURE_NOPL in
alternatives") got rid of the last usage in 2008. Remove the related
mappings and code.
Just like all X86_FEATURE bits, the raw bit numbers can be exposed to
userspace via MODULE_DEVICE_TABLE(). There is a very small theoretical
chance of userspace getting confused if these bits got reassigned and
changed logical meaning. But these bits were never used for a device
table, so it's highly unlikely this will ever happen in practice.
[ dhansen: clarify userspace visibility of these bits ]
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/20241107233000.2742619-1-sohil.mehta%40intel.com
|
|
The kernel's virtual mapping of the relocate_kernel page currently needs
to be RWX because it is written to before the %cr3 switch.
Now that the relocate_kernel page has its own .data section and local
variables, it can also have *global* variables. So eliminate the separate
page_list argument, and write the same information directly to variables
in the relocate_kernel page instead. This way, the relocate_kernel code
itself doesn't need to copy it.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-11-dwmw2@infradead.org
|
|
Now that the copy is executed instead of the original, the relocate_kernel
page can live in the kernel's .text section. This will allow subsequent
commits to actually add real data to it and clean up the code somewhat as
well as making the control page ROX.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-9-dwmw2@infradead.org
|
|
Now that the following fix:
d0ceea662d45 ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables")
stops kernel_ident_mapping_init() from scribbling over the end of a
4KiB PGD by assuming the following 4KiB will be a userspace PGD,
there's no good reason for the kexec PGD to be part of a single
8KiB allocation with the control_code_page.
( It's not clear that that was the reason for x86_64 kexec doing it that
way in the first place either; there were no comments to that effect and
it seems to have been the case even before PTI came along. It looks like
it was just a happy accident which prevented memory corruption on kexec. )
Either way, it definitely isn't needed now. Just allocate the PGD
separately on x86_64, like i386 already does.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-6-dwmw2@infradead.org
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Setting and clearing CPU bits in the mm_cpumask is only ever done
by the CPU itself, from the context switch code or the TLB flush
code.
Synchronization is handled by switch_mm_irqs_off() blocking interrupts.
Sending TLB flush IPIs to CPUs that are in the mm_cpumask, but no
longer running the program causes a regression in the will-it-scale
tlbflush2 test. This test is contrived, but a large regression here
might cause a small regression in some real world workload.
Instead of always sending IPIs to CPUs that are in the mm_cpumask,
but no longer running the program, send these IPIs only once a second.
The rest of the time we can skip over CPUs where the loaded_mm is
different from the target mm.
Reported-by: kernel test roboto <oliver.sang@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241204210316.612ee573@fangorn
Closes: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/
|
|
Dump TD configuration on boot. Attributes and TD_CTLS define TD
behavior. This information is useful for tracking down bugs.
The output ends up looking like this in practice:
[ 0.000000] tdx: Guest detected
[ 0.000000] tdx: Attributes: SEPT_VE_DISABLE
[ 0.000000] tdx: TD_CTLS: PENDING_VE_DISABLE ENUM_TOPOLOGY VIRT_CPUID2 REDUCE_VE
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/20241202072458.447455-1-kirill.shutemov%40linux.intel.com
|
|
The early boot code runs from a 1:1 mapping of memory, and may execute
before the kernel virtual mapping is even up. This means absolute symbol
references cannot be permitted in this code.
UBSAN injects references to global data structures into the code, and
without -fPIC, those references are emitted as absolute references to
kernel virtual addresses. Accessing those will fault before the kernel
virtual mapping is up, so UBSAN needs to be disabled in early boot code.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205112804.3416920-13-ardb+git@google.com
|
|
Implicit absolute symbol references (e.g., taking the address of a
global variable) must be avoided in the C code that runs from the early
1:1 mapping of the kernel, given that this is a practice that violates
assumptions on the part of the toolchain. I.e., RIP-relative and
absolute references are expected to produce the same values, and so the
compiler is free to choose either. However, the code currently assumes
that RIP-relative references are never emitted here.
So an explicit virtual-to-physical offset needs to be used instead to
derive the kernel virtual addresses of _text and _end, instead of simply
taking the addresses and assuming that the compiler will not choose to
use a RIP-relative references in this particular case.
Currently, phys_base is already used to perform such calculations, but
it is derived from the kernel virtual address of _text, which is taken
using an implicit absolute symbol reference. So instead, derive this
VA-to-PA offset in asm code, and pass it to the C startup code.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205112804.3416920-11-ardb+git@google.com
|
|
The set_p4d() and set_pgd() functions (in 4-level or 5-level page table setups
respectively) assume that the root page table is actually a 8KiB allocation,
with the userspace root immediately after the kernel root page table (so that
the former can enforce NX on on all the subordinate page tables, which are
actually shared).
However, users of the kernel_ident_mapping_init() code do not give it an 8KiB
allocation for its PGD. Both swsusp_arch_resume() and acpi_mp_setup_reset()
allocate only a single 4KiB page. The kexec code on x86_64 currently gets
away with it purely by chance, because it allocates 8KiB for its "control
code page" and then actually uses the first half for the PGD, then copies the
actual trampoline code into the second half only after the identmap code has
finished scribbling over it.
Fix this by defining a _PAGE_NOPTISHADOW bit (which can use the same bit as
_PAGE_SAVED_DIRTY since one is only for the PGD/P4D root and the other is
exclusively for leaf PTEs.). This instructs __pti_set_user_pgtbl() not to
write to the userspace 'shadow' PGD.
Strictly, the _PAGE_NOPTISHADOW bit doesn't need to be written out to the
actual page tables; since __pti_set_user_pgtbl() returns the value to be
written to the kernel page table, it could be filtered out. But there seems
to be no benefit to actually doing so.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/412c90a4df7aef077141d9f68d19cbe5602d6c6d.camel@infradead.org
Cc: stable@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
|
|
Originally, #VE was defined as the TDX behavior in order to support
paravirtualization of x86 features that can’t be virtualized by the TDX
module. The intention is that if guest software wishes to use such a
feature, it implements some logic to support this. This logic resides in
the #VE exception handler it may work in cooperation with the host VMM.
Theoretically, the guest TD’s #VE handler was supposed to act as a "TDX
enlightenment agent" inside the TD. However, in practice, the #VE
handler is simplistic:
- #VE on CPUID is handled by returning all-0 to the code which
executed CPUID. In many cases, an all-0 value is not the correct
value, and may cause improper operation.
- #VE on RDMSR is handled by requesting the MSR value from the host
VMM. This is prone to security issues since the host VMM is
untrusted. It may also be functionally incorrect in case the
expected operation is to paravirtualize some CPU functionality.
Newer TDX modules provide a "REDUCE_VE" feature. When enabled, it
drastically cuts cases when guests receive #VE on MSR and CPUID
accesses. Basically, instead of punting the problem to the VMM, the
TDX module fills in good data. What the TDX module provides is
obviously highly specific to the MSR or CPUID. This is all spelled
out in excruciating detail in the TDX specs.
Enable REDUCE_VE. Make TDX guest behaviour less odd, and closer to
how a normal CPU behaves.
Note that enabling of the feature doesn't eliminate need in #VE handler
for CPUID and MSR accesses. Some MSRs still generate #VE (notably
APIC-related) and kernel needs CPUID #VE handler to ask VMM for leafs in
hypervisor range.
[ dhansen: changelog tweaks, rename/rework VE reduction function ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/20241202072431.447380-1-kirill.shutemov%40linux.intel.com
|
|
Rename the helper to better reflect its function.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/all/20241202073139.448208-1-kirill.shutemov%40linux.intel.com
|
|
Following new HSMP messages are available on family 0x1A, model 0x00-0x1F
platforms with protocol version 7. Add support for them in the driver.
- SetXgmiPstateRange(26h)
- CpuRailIsoFreqPolicy(27h)
- DfcEnable(28h)
- GetRaplUnit(30h)
- GetRaplCoreCounter(31h)
- GetRaplPackageCounter(32h)
Also update HSMP message PwrEfficiencyModeSelection-21h. This message is
updated to include GET option in recent firmware.
Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Reviewed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Link: https://lore.kernel.org/r/20241118102752.11703-1-suma.hegde@amd.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
Currently REACHABLE is weird for being on the instruction after the
instruction it modifies.
Since all REACHABLE annotations have an explicit instruction, flip
them around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094312.494176035@infradead.org
|
|
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094312.353431347@infradead.org
|
|
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094311.786598147@infradead.org
|
|
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094311.465691316@infradead.org
|
|
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094311.145275669@infradead.org
|
|
On x86, topology_core_id() returns a unique core ID within the PKG
domain. Looking at match_smt() suggests that a core ID just needs to be
unique within a LLC domain. For use cases such as the core RAPL PMU,
there exists a need for a unique core ID across the entire system with
multiple PKG domains. Introduce topology_logical_core_id() to derive a
unique core ID across the system.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20241115060805.447565-3-Dhananjay.Ugwekar@amd.com
|
|
Modify the pebs_basic and pebs_meminfo structs to make the bitfields
more explicit to ease readability of the code.
Co-developed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119135504.1463839-3-kan.liang@linux.intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a terminating zero end-element to the array describing AMD CPUs
affected by erratum 1386 so that the matching loop actually
terminates instead of going off into the weeds
- Update the boot protocol documentation to mention the fact that the
preferred address to load the kernel to is considered in the
relocatable kernel case too
- Flush the memory buffer containing the microcode patch after applying
microcode on AMD Zen1 and Zen2, to avoid unnecessary slowdowns
- Make sure the PPIN CPU feature flag is cleared on all CPUs if PPIN
has been disabled
* tag 'x86_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Terminate the erratum_1386_microcode array
x86/Documentation: Update algo in init_size description of boot protocol
x86/microcode/AMD: Flush patch buffer mapping after application
x86/mm: Carve out INVLPG inline asm for use by others
x86/cpu: Fix PPIN initialization
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull rust trace event support from Steven Rostedt:
"Allow Rust code to have trace events
Trace events is a popular way to debug what is happening inside the
kernel or just to find out what is happening. Rust code is being added
to the Linux kernel but it currently does not support the tracing
infrastructure. Add support of trace events inside Rust code"
* tag 'trace-rust-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rust: jump_label: skip formatting generated file
jump_label: rust: pass a mut ptr to `static_key_count`
samples: rust: fix `rust_print` build making it a combined module
rust: add arch_static_branch
jump_label: adjust inline asm to be consistent
rust: samples: add tracepoint to Rust sample
rust: add tracepoint support
rust: add static_branch_unlikely for static_key_false
|
|
Josh Poimboeuf reports that he got a "will-it-scale.per_process_ops 1.9%
improvement" report for his patch that changed __get_user() to use
pointer masking instead of the explicit speculation barrier. However,
that patch doesn't actually work in the general case, because some (very
bad) architecture-specific code actually depends on __get_user() also
working on kernel addresses.
A profile showed that the offending __get_user() was the futex code,
which really should be fixed up to not use that horrid legacy case.
Rewrite futex_get_value_locked() to use the modern user acccess helpers,
and inline it so that the compiler not only avoids the function call for
a few instructions, but can do CSE on the address masking.
It also turns out the x86 futex functions have unnecessary barriers in
other places, so let's fix those up too.
Link: https://lore.kernel.org/all/20241115230653.hfvzyf3aqqntgp63@jpoimboe/
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ZyulbYuvrkshfsd2@antipodes
|