summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu
AgeCommit message (Collapse)Author
2024-04-05x86/microcode/AMD: Remove unused PATCH_MAX_SIZE macroBorislav Petkov (AMD)
Orphaned after 05e91e721138 ("x86/microcode/AMD: Rip out static buffers") No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-04-05x86/microcode/AMD: Avoid -Wformat warning with clang-15Arnd Bergmann
Older versions of clang show a warning for amd.c after a fix for a gcc warning: arch/x86/kernel/cpu/microcode/amd.c:478:47: error: format specifies type \ 'unsigned char' but the argument has type 'u16' (aka 'unsigned short') [-Werror,-Wformat] "amd-ucode/microcode_amd_fam%02hhxh.bin", family); ~~~~~~ ^~~~~~ %02hx In clang-16 and higher, this warning is disabled by default, but clang-15 is still supported, and it's trivial to avoid by adapting the types according to the range of the passed data and the format string. [ bp: Massage commit message. ] Fixes: 2e9064faccd1 ("x86/microcode/amd: Fix snprintf() format string warning in W=1 build") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240405204919.1003409-1-arnd@kernel.org
2024-04-04x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()Borislav Petkov (AMD)
Modifying a MCA bank's MCA_CTL bits which control which error types to be reported is done over /sys/devices/system/machinecheck/ ├── machinecheck0 │   ├── bank0 │   ├── bank1 │   ├── bank10 │   ├── bank11 ... sysfs nodes by writing the new bit mask of events to enable. When the write is accepted, the kernel deletes all current timers and reinits all banks. Doing that in parallel can lead to initializing a timer which is already armed and in the timer wheel, i.e., in use already: ODEBUG: init active (active state 0) object: ffff888063a28000 object type: timer_list hint: mce_timer_fn+0x0/0x240 arch/x86/kernel/cpu/mce/core.c:2642 WARNING: CPU: 0 PID: 8120 at lib/debugobjects.c:514 debug_print_object+0x1a0/0x2a0 lib/debugobjects.c:514 Fix that by grabbing the sysfs mutex as the rest of the MCA sysfs code does. Reported by: Yue Sun <samsun1006219@gmail.com> Reported by: xingwei lee <xrivendell7@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/CAEkJfYNiENwQY8yV1LYJ9LjJs%2Bx_-PqMv98gKig55=2vbzffRw@mail.gmail.com
2024-04-04x86/extable: Remove unused fixup type EX_TYPE_COPYTong Tiangen
After 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function") rewrote __copy_user_nocache() to use EX_TYPE_UACCESS instead of the EX_TYPE_COPY exception type, there are no more EX_TYPE_COPY users, so remove it. [ bp: Massage commit message. ] Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240204082627.3892816-2-tongtiangen@huawei.com
2024-04-04x86/CPU/AMD: Track SNP host status with cc_platform_*()Borislav Petkov (AMD)
The host SNP worthiness can determined later, after alternatives have been patched, in snp_rmptable_init() depending on cmdline options like iommu=pt which is incompatible with SNP, for example. Which means that one cannot use X86_FEATURE_SEV_SNP and will need to have a special flag for that control. Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places. Move kdump_sev_callback() to its rightful place, while at it. Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Srikanth Aithal <sraithal@amd.com> Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
2024-04-03x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offlineReinette Chatre
Tony encountered this OOPS when the last CPU of a domain goes offline while running a kernel built with CONFIG_NO_HZ_FULL: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 Oops: 0000 [#1] PREEMPT SMP NOPTI ... RIP: 0010:__find_nth_andnot_bit+0x66/0x110 ... Call Trace: <TASK> ? __die() ? page_fault_oops() ? exc_page_fault() ? asm_exc_page_fault() cpumask_any_housekeeping() mbm_setup_overflow_handler() resctrl_offline_cpu() resctrl_arch_offline_cpu() cpuhp_invoke_callback() cpuhp_thread_fun() smpboot_thread_fn() kthread() ret_from_fork() ret_from_fork_asm() </TASK> The NULL pointer dereference is encountered while searching for another online CPU in the domain (of which there are none) that can be used to run the MBM overflow handler. Because the kernel is configured with CONFIG_NO_HZ_FULL the search for another CPU (in its effort to prefer those CPUs that aren't marked nohz_full) consults the mask representing the nohz_full CPUs, tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y tick_nohz_full_mask is not allocated unless the kernel is booted with the "nohz_full=" parameter and because of that any access to tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled(). Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled(). The latter ensures tick_nohz_full_mask can be accessed safely and can be used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not. [ Use Ingo's suggestion that combines the two NO_HZ checks into one. ] Fixes: a4846aaf3945 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow") Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
2024-03-26x86/mce: Dynamically size space for machine check recordsTony Luck
Systems with a large number of CPUs may generate a large number of machine check records when things go seriously wrong. But Linux has a fixed-size buffer that can only capture a few dozen errors. Allocate space based on the number of CPUs (with a minimum value based on the historical fixed buffer that could store 80 records). [ bp: Rename local var from tmpp to something more telling: gpool. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Reviewed-by: Avadhut Naik <avadhut.naik@amd.com> Link: https://lore.kernel.org/r/20240307192704.37213-1-tony.luck@intel.com
2024-03-26x86/cpu: Clear TME feature flag if TME is not enabled by BIOSBingsong Si
When TME is disabled by BIOS, the dmesg output is: x86/tme: not enabled by BIOS ... and TME functionality is not enabled by the kernel, but the TME feature is still shown in /proc/cpuinfo. Clear it. [ mingo: Clarified changelog ] Signed-off-by: Bingsong Si <sibs@chinatelecom.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "Huang, Kai" <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240311071938.13247-1-sibs@chinatelecom.cn
2024-03-25x86/CPU/AMD: Improve the erratum 1386 workaroundBorislav Petkov (AMD)
Disable XSAVES only on machines which haven't loaded the microcode revision containing the erratum fix. This will come in handy when running archaic OSes as guests. OSes whose brilliant programmers thought that CPUID is overrated and one should not query it but use features directly, ala shoot first, ask questions later... but only if you're alive after the shooting. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20240324200525.GBZgCHhYFsBj12PrKv@fat_crate.local
2024-03-25perf/x86/amd/lbr: Use freeze based on availabilitySandipan Das
Currently, the LBR code assumes that LBR Freeze is supported on all processors when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX] bit 1 is set. This is incorrect as the availability of the feature is additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set, which may not be set for all Zen 4 processors. Define a new feature bit for LBR and PMC freeze and set the freeze enable bit (FLBRI) in DebugCtl (MSR 0x1d9) conditionally. It should still be possible to use LBR without freeze for profile-guided optimization of user programs by using an user-only branch filter during profiling. When the user-only filter is enabled, branches are no longer recorded after the transition to CPL 0 upon PMI arrival. When branch entries are read in the PMI handler, the branch stack does not change. E.g. $ perf record -j any,u -e ex_ret_brn_tkn ./workload Since the feature bit is visible under flags in /proc/cpuinfo, it can be used to determine the feasibility of use-cases which require LBR Freeze to be supported by the hardware such as profile-guided optimization of kernels. Fixes: ca5b7c0d9621 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
2024-03-24Merge tag 'x86-urgent-2024-03-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - Ensure that the encryption mask at boot is properly propagated on 5-level page tables, otherwise the PGD entry is incorrectly set to non-encrypted, which causes system crashes during boot. - Undo the deferred 5-level page table setup as it cannot work with memory encryption enabled. - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset to the default value but the cached variable is not, so subsequent comparisons might yield the wrong result and as a consequence the result prevents updating the MSR. - Register the local APIC address only once in the MPPARSE enumeration to prevent triggering the related WARN_ONs() in the APIC and topology code. - Handle the case where no APIC is found gracefully by registering a fake APIC in the topology code. That makes all related topology functions work correctly and does not affect the actual APIC driver code at all. - Don't evaluate logical IDs during early boot as the local APIC IDs are not yet enumerated and the invoked function returns an error code. Nothing requires the logical IDs before the final CPUID enumeration takes place, which happens after the enumeration. - Cure the fallout of the per CPU rework on UP which misplaced the copying of boot_cpu_data to per CPU data so that the final update to boot_cpu_data got lost which caused inconsistent state and boot crashes. - Use copy_from_kernel_nofault() in the kprobes setup as there is no guarantee that the address can be safely accessed. - Reorder struct members in struct saved_context to work around another kmemleak false positive - Remove the buggy code which tries to update the E820 kexec table for setup_data as that is never passed to the kexec kernel. - Update the resource control documentation to use the proper units. - Fix a Kconfig warning observed with tinyconfig * tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/64: Move 5-level paging global variable assignments back x86/boot/64: Apply encryption mask to 5-level pagetable update x86/cpu: Add model number for another Intel Arrow Lake mobile processor x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD Documentation/x86: Document that resctrl bandwidth control units are MiB x86/mpparse: Register APIC address only once x86/topology: Handle the !APIC case gracefully x86/topology: Don't evaluate logical IDs during early boot x86/cpu: Ensure that CPU info updates are propagated on UP kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address x86/pm: Work around false positive kmemleak report in msr_build_context() x86/kexec: Do not update E820 kexec table for setup_data x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
2024-03-23x86/topology: Handle the !APIC case gracefullyThomas Gleixner
If there is no local APIC enumerated and registered then the topology bitmaps are empty. Therefore, topology_init_possible_cpus() will die with a division by zero exception. Prevent this by registering a fake APIC id to populate the topology bitmap. This also allows to use all topology query interfaces unconditionally. It does not affect the actual APIC code because either the local APIC address was not registered or no local APIC could be detected. Fixes: f1f758a80516 ("x86/topology: Add a mechanism to track topology via APIC IDs") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de
2024-03-23x86/topology: Don't evaluate logical IDs during early bootThomas Gleixner
The local APICs have not yet been enumerated so the logical ID evaluation from the topology bitmaps does not work and would return an error code. Skip the evaluation during the early boot CPUID evaluation and only apply it on the final run. Fixes: 380414be78bf ("x86/cpu/topology: Use topology logical mapping mechanism") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.186943142@linutronix.de
2024-03-23x86/cpu: Ensure that CPU info updates are propagated on UPThomas Gleixner
The boot sequence evaluates CPUID information twice: 1) During early boot 2) When finalizing the early setup right before mitigations are selected and alternatives are patched. In both cases the evaluation is stored in boot_cpu_data, but on UP the copying of boot_cpu_data to the per CPU info of the boot CPU happens between #1 and #2. So any update which happens in #2 is never propagated to the per CPU info instance. Consolidate the whole logic and copy boot_cpu_data right before applying alternatives as that's the point where boot_cpu_data is in it's final state and not supposed to change anymore. This also removes the voodoo mb() from smp_prepare_cpus_common() which had absolutely no purpose. Fixes: 71eb4893cfaf ("x86/percpu: Cure per CPU madness on UP") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.127642785@linutronix.de
2024-03-22x86/cpu/topology: Add support for the AMD 0x80000026 leafThomas Gleixner
On AMD processors that support extended CPUID leaf 0x80000026, use the extended leaf to parse the topology information. In case of a failure, fall back to parsing the information from CPUID leaf 0xb. CPUID leaf 0x80000026 exposes the "CCX" and "CCD (Die)" information on AMD processors which have been mapped to TOPO_TILE_DOMAIN and TOPO_DIE_DOMAIN respectively. Since this information was previously not available via CPUID leaf 0xb or 0x8000001e, the "die_id", "logical_die_id", "max_die_per_pkg", "die_cpus", and "die_cpus_list" will differ with this addition on AMD processors that support extended CPUID leaf 0x80000026 and contain more than one "CCD (Die)" on the package. For example, following are the changes in the values reported by "/sys/kernel/debug/x86/topo/cpus/16" after applying this patch on a 4th Generation AMD EPYC System (1 x 128C/256T): (CPU16 is the first CPU of the second CCD on the package) tip:x86/apic tip:x86/apic + this patch online: 1 1 initial_apicid: 80 80 apicid: 80 80 pkg_id: 0 0 die_id: 0 4 * cu_id: 255 255 core_id: 64 64 logical_pkg_id: 0 0 logical_die_id: 0 4 * llc_id: 8 8 l2c_id: 65535 65535 amd_node_id: 0 0 amd_nodes_per_pkg: 1 1 num_threads: 256 256 num_cores: 128 128 max_dies_per_pkg: 1 8 * max_threads_per_core:2 2 [ prateek: commit log, updated comment in topoext_amd.c, changed has_0xb to has_topoext, rebased the changes on tip:x86/apic, tested the changes on 4th Gen AMD EPYC system ] [ mingo: tidy up the changelog a bit more ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240314050432.1710-1-kprateek.nayak@amd.com
2024-03-21x86/cpu: Get rid of an unnecessary local variable in get_cpu_address_sizes()Borislav Petkov (AMD)
Drop 'vp_bits_from_cpuid' as it is not really needed. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240316120706.4352-1-bp@alien8.de
2024-03-21Merge tag 'hyperv-next-signed-20240320' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Use Hyper-V entropy to seed guest random number generator (Michael Kelley) - Convert to platform remove callback returning void for vmbus (Uwe Kleine-König) - Introduce hv_get_hypervisor_version function (Nuno Das Neves) - Rename some HV_REGISTER_* defines for consistency (Nuno Das Neves) - Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_* (Nuno Das Neves) - Cosmetic changes for hv_spinlock.c (Purna Pavan Chandra Aekkaladevi) - Use per cpu initial stack for vtl context (Saurabh Sengar) * tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Use Hyper-V entropy to seed guest random number generator x86/hyperv: Cosmetic changes for hv_spinlock.c hyperv-tlfs: Rename some HV_REGISTER_* defines for consistency hv: vmbus: Convert to platform remove callback returning void mshyperv: Introduce hv_get_hypervisor_version function x86/hyperv: Use per cpu initial stack for vtl context hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*
2024-03-18x86/hyperv: Use Hyper-V entropy to seed guest random number generatorMichael Kelley
A Hyper-V host provides its guest VMs with entropy in a custom ACPI table named "OEM0". The entropy bits are updated each time Hyper-V boots the VM, and are suitable for seeding the Linux guest random number generator (rng). See a brief description of OEM0 in [1]. Generation 2 VMs on Hyper-V use UEFI to boot. Existing EFI code in Linux seeds the rng with entropy bits from the EFI_RNG_PROTOCOL. Via this path, the rng is seeded very early during boot with good entropy. The ACPI OEM0 table provided in such VMs is an additional source of entropy. Generation 1 VMs on Hyper-V boot from BIOS. For these VMs, Linux doesn't currently get any entropy from the Hyper-V host. While this is not fundamentally broken because Linux can generate its own entropy, using the Hyper-V host provided entropy would get the rng off to a better start and would do so earlier in the boot process. Improve the rng seeding for Generation 1 VMs by having Hyper-V specific code in Linux take advantage of the OEM0 table to seed the rng. For Generation 2 VMs, use the OEM0 table to provide additional entropy beyond the EFI_RNG_PROTOCOL. Because the OEM0 table is custom to Hyper-V, parse it directly in the Hyper-V code in the Linux kernel and use add_bootloader_randomness() to add it to the rng. Once the entropy bits are read from OEM0, zero them out in the table so they don't appear in /sys/firmware/acpi/tables/OEM0 in the running VM. The zero'ing is done out of an abundance of caution to avoid potential security risks to the rng. Also set the OEM0 data length to zero so a kexec or other subsequent use of the table won't try to use the zero'ed bits. [1] https://download.microsoft.com/download/1/c/9/1c9813b8-089c-4fef-b2ad-ad80e79403ba/Whitepaper%20-%20The%20Windows%2010%20random%20number%20generation%20infrastructure.pdf Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://lore.kernel.org/r/20240318155408.216851-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240318155408.216851-1-mhklinux@outlook.com>
2024-03-18hyperv-tlfs: Rename some HV_REGISTER_* defines for consistencyNuno Das Neves
Rename HV_REGISTER_GUEST_OSID to HV_REGISTER_GUEST_OS_ID. This matches the existing HV_X64_MSR_GUEST_OS_ID. Rename HV_REGISTER_CRASH_* to HV_REGISTER_GUEST_CRASH_*. Including GUEST_ is consistent with other #defines such as HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE. The new names also match the TLFS document more accurately, i.e. HvRegisterGuestCrash*. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Link: https://lore.kernel.org/r/1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-16x86/CPU/AMD: Update the Zenbleed microcode revisionsBorislav Petkov (AMD)
Update them to the correct revision numbers. Fixes: 522b1d69219d ("x86/cpu/amd: Add a Zenbleed fix") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ...
2024-03-14Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-12Merge tag 'rfds-for-linus-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RFDS mitigation from Dave Hansen: "RFDS is a CPU vulnerability that may allow a malicious userspace to infer stale register values from kernel space. Kernel registers can have all kinds of secrets in them so the mitigation is basically to wait until the kernel is about to return to userspace and has user values in the registers. At that point there is little chance of kernel secrets ending up in the registers and the microarchitectural state can be cleared. This leverages some recent robustness fixes for the existing MDS vulnerability. Both MDS and RFDS use the VERW instruction for mitigation" * tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests x86/rfds: Mitigate Register File Data Sampling (RFDS) Documentation/hw-vuln: Add documentation for RFDS x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
2024-03-12mshyperv: Introduce hv_get_hypervisor_version functionNuno Das Neves
Introduce x86_64 and arm64 functions to get the hypervisor version information and store it in a structure for simpler parsing. Use the new function to get and parse the version at boot time. While at it, move the printing code to hv_common_init() so it is not duplicated. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Acked-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-11Merge tag 'x86-cleanups-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups, including a large series from Thomas Gleixner to cure sparse warnings" * tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Drop unused declaration of proc_nmi_enabled() x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address() x86/percpu: Cure per CPU madness on UP smp: Consolidate smp_prepare_boot_cpu() x86/msr: Add missing __percpu annotations x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h> perf/x86/amd/uncore: Fix __percpu annotation x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP) x86/apm_32: Remove dead function apm_get_battery_status() x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11Merge tag 'x86_sev_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add the x86 part of the SEV-SNP host support. This will allow the kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment up to date. This is the x86 part and there is a KVM part which did not get ready in time for the merge window so latter will be forthcoming in the next cycle. - Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel - The usual set of fixes, cleanups and improvements all over the place * tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/sev: Disable KMSAN for memory encryption TUs x86/sev: Dump SEV_STATUS crypto: ccp - Have it depend on AMD_IOMMU iommu/amd: Fix failure return from snp_lookup_rmpentry() x86/sev: Fix position dependent variable references in startup code crypto: ccp: Make snp_range_list static x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT Documentation: virt: Fix up pre-formatted text block for SEV ioctls crypto: ccp: Add the SNP_SET_CONFIG command crypto: ccp: Add the SNP_COMMIT command crypto: ccp: Add the SNP_PLATFORM_STATUS command x86/cpufeatures: Enable/unmask SEV-SNP CPU feature KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown crypto: ccp: Handle legacy SEV commands when SNP is enabled crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled crypto: ccp: Handle the legacy TMR allocation when SNP is enabled x86/sev: Introduce an SNP leaked pages list crypto: ccp: Provide an API to issue SEV and SNP commands ...
2024-03-11Merge tag 'x86_cache_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull resource control updates from Borislav Petkov: - Rework different aspects of the resctrl code like adding arch-specific accessors and splitting the locking, in order to accomodate ARM's MPAM implementation of hw resource control and be able to use the same filesystem control interface like on x86. Work by James Morse - Improve the memory bandwidth throttling heuristic to handle workloads with not too regular load levels which end up penalized unnecessarily - Use CPUID to detect the memory bandwidth enforcement limit on AMD - The usual set of fixes * tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) x86/resctrl: Remove lockdep annotation that triggers false positive x86/resctrl: Separate arch and fs resctrl locks x86/resctrl: Move domain helper migration into resctrl_offline_cpu() x86/resctrl: Add CPU offline callback for resctrl work x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU x86/resctrl: Add CPU online callback for resctrl work x86/resctrl: Add helpers for system wide mon/alloc capable x86/resctrl: Make rdt_enable_key the arch's decision to switch x86/resctrl: Move alloc/mon static keys into helpers x86/resctrl: Make resctrl_mounted checks explicit x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read() x86/resctrl: Allow resctrl_arch_rmid_read() to sleep x86/resctrl: Queue mon_event_read() instead of sending an IPI x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow x86/resctrl: Move CLOSID/RMID matching and setting to use helpers x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding x86/resctrl: Track the number of dirty RMID a CLOSID has x86/resctrl: Allow RMID allocation to be scoped by CLOSID x86/resctrl: Access per-rmid structures by index ...
2024-03-11Merge tag 'x86_mtrr_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MTRR update from Borislav Petkov: - Relax the PAT MSR programming which was unnecessarily using the MTRR programming protocol of disabling the cache around the changes. The reason behind this is the current algorithm triggering a #VE exception for TDX guests and unnecessarily complicating things * tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pat: Simplify the PAT programming protocol
2024-03-11Merge tag 'x86_cpu_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu update from Borislav Petkov: - Have AMD Zen common init code run on all families from Zen1 onwards in order to save some future enablement effort * tag 'x86_cpu_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/CPU/AMD: Do the common init on future Zens too
2024-03-11Merge tag 'ras_core_for_v6.9_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixlet from Borislav Petkov: - Constify yet another static struct bus_type instance now that the driver core can handle that * tag 'ras_core_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Make mce_subsys const
2024-03-11Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-03-11x86/rfds: Mitigate Register File Data Sampling (RFDS)Pawan Gupta
RFDS is a CPU vulnerability that may allow userspace to infer kernel stale data previously used in floating point registers, vector registers and integer registers. RFDS only affects certain Intel Atom processors. Intel released a microcode update that uses VERW instruction to clear the affected CPU buffers. Unlike MDS, none of the affected cores support SMT. Add RFDS bug infrastructure and enable the VERW based mitigation by default, that clears the affected buffers just before exiting to userspace. Also add sysfs reporting and cmdline parameter "reg_file_data_sampling" to control the mitigation. For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is setPawan Gupta
Currently MMIO Stale Data mitigation for CPUs not affected by MDS/TAA is to only deploy VERW at VMentry by enabling mmio_stale_data_clear static branch. No mitigation is needed for kernel->user transitions. If such CPUs are also affected by RFDS, its mitigation may set X86_FEATURE_CLEAR_CPU_BUF to deploy VERW at kernel->user and VMentry. This could result in duplicate VERW at VMentry. Fix this by disabling mmio_stale_data_clear static branch when X86_FEATURE_CLEAR_CPU_BUF is enabled. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
2024-03-11Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.9: - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter.
2024-03-06x86/topology: Ignore non-present APIC IDs in a present packageThomas Gleixner
Borislav reported that one of his systems has a broken MADT table which advertises eight present APICs and 24 non-present APICs in the same package. The non-present ones are considered hot-pluggable by the topology evaluation code, which is obviously bogus as there is no way to hot-plug within the same package. As the topology evaluation code accounts for hot-pluggable CPUs in a package, the maximum number of cores per package is computed wrong, which in turn causes the uncore performance counter driver to access non-existing MSRs. It will probably confuse other entities which rely on the maximum number of cores and threads per package too. Cure this by ignoring hot-pluggable APIC IDs within a present package. In theory it would be reasonable to just do this unconditionally, but then there is this thing called reality^Wvirtualization which ruins everything. Virtualization is the only existing user of "physical" hotplug and the virtualization tools allow the above scenario. Whether that is actually in use or not is unknown. As it can be argued that the virtualization case is not affected by the issues which exposed the reported problem, allow the bogosity if the kernel determined that it is running in a VM for now. Fixes: 89b0f15f408f ("x86/cpu/topology: Get rid of cpuinfo::x86_max_cores") Reported-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/87a5nbvccx.ffs@tglx
2024-03-04x86/idle: Select idle routine only onceThomas Gleixner
The idle routine selection is done on every CPU bringup operation and has a guard in place which is effective after the first invocation, which is a pointless exercise. Invoke it once on the boot CPU and mark the related functions __init. The guard check has to stay as xen_set_default_idle() runs early. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/87edcu6vaq.ffs@tglx
2024-03-04x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_currentThomas Gleixner
Sparse rightfully complains: bugs.c:71:9: sparse: warning: incorrect type in initializer (different address spaces) bugs.c:71:9: sparse: expected void const [noderef] __percpu *__vpp_verify bugs.c:71:9: sparse: got unsigned long long * The reason is that x86_spec_ctrl_current which is a per CPU variable is exported with EXPORT_SYMBOL_GPL(). Use EXPORT_PER_CPU_SYMBOL_GPL() instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.732288812@linutronix.de
2024-03-04x86/percpu: Cure per CPU madness on UPThomas Gleixner
On UP builds Sparse complains rightfully about accesses to cpu_info with per CPU accessors: cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces) cacheinfo.c:282:30: sparse: expected void const [noderef] __percpu *__vpp_verify cacheinfo.c:282:30: sparse: got unsigned int * The reason is that on UP builds cpu_info which is a per CPU variable on SMP is mapped to boot_cpu_info which is a regular variable. There is a hideous accessor cpu_data() which tries to hide this, but it's not sufficient as some places require raw accessors and generates worse code than the regular per CPU accessors. Waste sizeof(struct x86_cpuinfo) memory on UP and provide the per CPU cpu_info unconditionally. This requires to update the CPU info on the boot CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function and implement exactly that. This allows to use regular per CPU accessors uncoditionally and paves the way to remove the cpu_data() hackery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.622511517@linutronix.de
2024-03-04x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>Thomas Gleixner
To clean up the per CPU insanity of UP which causes sparse to be rightfully unhappy and prevents the usage of the generic per CPU accessors on cpu_info it is necessary to include <linux/percpu.h> into <asm/msr.h>. Including <linux/percpu.h> into <asm/msr.h> is impossible because it ends up in header dependency hell. The problem is that <asm/processor.h> includes <asm/msr.h>. The inclusion of <linux/percpu.h> results in a compile fail where the compiler cannot longer handle an include in <asm/cpufeature.h> which references boot_cpu_data which is defined in <asm/processor.h>. The only reason why <asm/msr.h> is included in <asm/processor.h> are the set/get_debugctlmsr() inlines. They are defined there because <asm/processor.h> is such a nice dump ground for everything. In fact they belong obviously into <asm/debugreg.h>. Move them to <asm/debugreg.h> and fix up the resulting damage which is just exposing the reliance on random include chains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.454678686@linutronix.de
2024-03-04hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*Nuno Das Neves
The HV_REGISTER_ are used as arguments to hv_set/get_register(), which delegate to arch-specific mechanisms for getting/setting synthetic Hyper-V MSRs. On arm64, HV_REGISTER_ defines are synthetic VP registers accessed via the get/set vp registers hypercalls. The naming matches the TLFS document, although these register names are not specific to arm64. However, on x86 the prefix HV_REGISTER_ indicates Hyper-V MSRs accessed via rdmsrl()/wrmsrl(). This is not consistent with the TLFS doc, where HV_REGISTER_ is *only* used for used for VP register names used by the get/set register hypercalls. To fix this inconsistency and prevent future confusion, change the arch-generic aliases used by callers of hv_set/get_register() to have the prefix HV_MSR_ instead of HV_REGISTER_. Use the prefix HV_X64_MSR_ for the x86-only Hyper-V MSRs. On x86, the generic HV_MSR_'s point to the corresponding HV_X64_MSR_. Move the arm64 HV_REGISTER_* defines to the asm-generic hyperv-tlfs.h, since these are not specific to arm64. On arm64, the generic HV_MSR_'s point to the corresponding HV_REGISTER_. While at it, rename hv_get/set_registers() and related functions to hv_get/set_msr(), hv_get/set_nested_msr(), etc. These are only used for Hyper-V MSRs and this naming makes that clear. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-02-27Merge branch 'x86/urgent' into x86/apic, to resolve conflictsIngo Molnar
Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/cpu/intel.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-27x86/apic: Build the x86 topology enumeration functions on UP APIC builds tooIngo Molnar
These functions are mostly pointless on UP, but nevertheless the 64-bit UP APIC build already depends on the existence of topology_apply_cmdline_limits_early(), which caused a build bug, resolve it by making them available under CONFIG_X86_LOCAL_APIC, as their prototypes already are. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-26x86/cpu/intel: Detect TME keyid bits before setting MTRR mask registersPaolo Bonzini
MKTME repurposes the high bit of physical address to key id for encryption key and, even though MAXPHYADDR in CPUID[0x80000008] remains the same, the valid bits in the MTRR mask register are based on the reduced number of physical address bits. detect_tme() in arch/x86/kernel/cpu/intel.c detects TME and subtracts it from the total usable physical bits, but it is called too late. Move the call to early_init_intel() so that it is called in setup_arch(), before MTRRs are setup. This fixes boot on TDX-enabled systems, which until now only worked with "disable_mtrr_cleanup". Without the patch, the values written to the MTRRs mask registers were 52-bit wide (e.g. 0x000fffff_80000800) and the writes failed; with the patch, the values are 46-bit wide, which matches the reduced MAXPHYADDR that is shown in /proc/cpuinfo. Reported-by: Zixi Chen <zixchen@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20240131230902.1867092-3-pbonzini%40redhat.com
2024-02-26x86/cpu: Allow reducing x86_phys_bits during early_identify_cpu()Paolo Bonzini
In commit fbf6449f84bf ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach"), the initialization of c->x86_phys_bits was moved after this_cpu->c_early_init(c). This is incorrect because early_init_amd() expected to be able to reduce the value according to the contents of CPUID leaf 0x8000001f. Fortunately, the bug was negated by init_amd()'s call to early_init_amd(), which does reduce x86_phys_bits in the end. However, this is very late in the boot process and, most notably, the wrong value is used for x86_phys_bits when setting up MTRRs. To fix this, call get_cpu_address_sizes() as soon as X86_FEATURE_CPUID is set/cleared, and c->extended_cpuid_level is retrieved. Fixes: fbf6449f84bf ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20240131230902.1867092-2-pbonzini%40redhat.com
2024-02-23x86, crash: wrap crash dumping code into crash related ifdefsBaoquan He
Now crash codes under kernel/ folder has been split out from kexec code, crash dumping can be separated from kexec reboot in config items on x86 with some adjustments. Here, also change some ifdefs or IS_ENABLED() check to more appropriate ones, e,g - #ifdef CONFIG_KEXEC_CORE -> #ifdef CONFIG_CRASH_DUMP - (!IS_ENABLED(CONFIG_KEXEC_CORE)) - > (!IS_ENABLED(CONFIG_CRASH_RESERVE)) [bhe@redhat.com: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope] Link: https://lore.kernel.org/all/SN6PR02MB4157931105FA68D72E3D3DB8D47B2@SN6PR02MB4157.namprd02.prod.outlook.com/T/#u Link: https://lkml.kernel.org/r/20240124051254.67105-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspaceSean Christopherson
Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query whether or not the CPU supports 5-level EPT paging. EPT capabilities are enumerated via MSR, i.e. aren't accessible to userspace without help from the kernel, and knowing whether or not 5-level EPT is supported is useful for debug, triage, testing, etc. For example, when EPT is enabled, bits 51:48 of guest physical addresses are consumed by the CPU if and only if 5-level EPT is enabled. For CPUs with MAXPHYADDR > 48, KVM *can't* map all legal guest memory without 5-level EPT, making 5-level EPT support valuable information for userspace. Reported-by: Yi Lai <yi1.lai@intel.com> Cc: Tao Su <tao1.su@linux.intel.com> Cc: Xudong Hao <xudong.hao@intel.com> Link: https://lore.kernel.org/r/20240110002340.485595-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22x86/resctrl: Remove lockdep annotation that triggers false positiveJames Morse
get_domain_from_cpu() walks a list of domains to find the one that contains the specified CPU. This needs to be protected against races with CPU hotplug when the list is modified. It has recently gained a lockdep annotation to check this. The lockdep annotation causes false positives when called via IPI as the lock is held, but by another process. Remove it. [ bp: Refresh it ontop of x86/cache. ] Fixes: fb700810d30b ("x86/resctrl: Separate arch and fs resctrl locks") Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/ZdUSwOM9UUNpw84Y@agluck-desk3
2024-02-20x86/pat: Simplify the PAT programming protocolKirill A. Shutemov
The programming protocol for the PAT MSR follows the MTRR programming protocol. However, this protocol is cumbersome and requires disabling caching (CR0.CD=1), which is not possible on some platforms. Specifically, a TDX guest is not allowed to set CR0.CD. It triggers a #VE exception. It turns out that the requirement to follow the MTRR programming protocol for PAT programming is unnecessarily strict. The new Intel Software Developer Manual (http://www.intel.com/sdm) (December 2023) relaxes this requirement, please refer to the section titled "Programming the PAT" for more information. In short, this section provides an alternative PAT update sequence which doesn't need to disable caches around the PAT update but only to flush those caches and TLBs. The AMD documentation does not link PAT programming to MTRR and is there fore, fine too. The kernel only needs to flush the TLB after updating the PAT MSR. The set_memory code already takes care of flushing the TLB and cache when changing the memory type of a page. [ bp: Expand commit message. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240124130650.496056-1-kirill.shutemov@linux.intel.com
2024-02-19x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static keyPawan Gupta
The VERW mitigation at exit-to-user is enabled via a static branch mds_user_clear. This static branch is never toggled after boot, and can be safely replaced with an ALTERNATIVE() which is convenient to use in asm. Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user path. Also remove the now redundant VERW in exc_nmi() and arch_exit_to_user_mode(). Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com