summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2025-03-25Merge tag 'hyperv-next-signed-20250324' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Add support for running as the root partition in Hyper-V (Microsoft Hypervisor) by exposing /dev/mshv (Nuno and various people) - Add support for CPU offlining in Hyper-V (Hamza Mahfooz) - Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael Kelley, Thorsten Blum) * tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits) x86/hyperv: fix an indentation issue in mshyperv.h x86/hyperv: Add comments about hv_vpset and var size hypercall input args Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs hyperv: Add definitions for root partition driver to hv headers x86: hyperv: Add mshv_handler() irq handler and setup function Drivers: hv: Introduce per-cpu event ring tail Drivers: hv: Export some functions for use by root partition module acpi: numa: Export node_to_pxm() hyperv: Introduce hv_recommend_using_aeoi() arm64/hyperv: Add some missing functions to arm64 x86/mshyperv: Add support for extended Hyper-V features hyperv: Log hypercall status codes as strings x86/hyperv: Fix check of return value from snp_set_vmsa() x86/hyperv: Add VTL mode callback for restarting the system x86/hyperv: Add VTL mode emergency restart callback hyperv: Remove unused union and structs hyperv: Add CONFIG_MSHV_ROOT to gate root partition support hyperv: Change hv_root_partition into a function hyperv: Convert hypercall statuses to linux error codes drivers/hv: add CPU offlining support ...
2025-03-25Merge tag 'ras_core_for_v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS update from Borislav Petkov: - A cleanup to the MCE notification machinery * tag 'ras_core_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce/inject: Remove call to mce_notify_irq()
2025-03-25Merge tag 'x86_cache_for_v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: - First part of the MPAM work: split the architectural part of resctrl from the filesystem part so that ARM's MPAM varian of resource control can be added later while sharing the user interface with x86 (James Morse) * tag 'x86_cache_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) x86/resctrl: Move get_{mon,ctrl}_domain_from_cpu() to live with their callers x86/resctrl: Move get_config_index() to a header x86/resctrl: Handle throttle_mode for SMBA resources x86/resctrl: Move RFTYPE flags to be managed by resctrl x86/resctrl: Make resctrl_arch_pseudo_lock_fn() take a plr x86/resctrl: Make prefetch_disable_bits belong to the arch code x86/resctrl: Allow an architecture to disable pseudo lock x86/resctrl: Add resctrl_arch_ prefix to pseudo lock functions x86/resctrl: Move mbm_cfg_mask to struct rdt_resource x86/resctrl: Move mba_mbps_default_event init to filesystem code x86/resctrl: Change mon_event_config_{read,write}() to be arch helpers x86/resctrl: Add resctrl_arch_is_evt_configurable() to abstract BMEC x86/resctrl: Move the is_mbm_*_enabled() helpers to asm/resctrl.h x86/resctrl: Rewrite and move the for_each_*_rdt_resource() walkers x86/resctrl: Move monitor init work to a resctrl init call x86/resctrl: Move monitor exit work to a resctrl exit call x86/resctrl: Add an arch helper to reset one resource x86/resctrl: Move resctrl types to a separate header x86/resctrl: Move rdt_find_domain() to be visible to arch and fs code x86/resctrl: Expose resctrl fs's init function to the rest of the kernel ...
2025-03-25Merge tag 'x86_bugs_for_v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 speculation mitigation updates from Borislav Petkov: - Some preparatory work to convert the mitigations machinery to mitigating attack vectors instead of single vulnerabilities - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag - Add support for a Zen5-specific SRSO mitigation - Cleanups and minor improvements * tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2 x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds x86/bugs: Relocate mds/taa/mmio/rfds defines x86/bugs: Add X86_BUG_SPECTRE_V2_USER x86/bugs: Remove X86_FEATURE_USE_IBPB KVM: nVMX: Always use IBPB to properly virtualize IBRS x86/bugs: Use a static branch to guard IBPB on vCPU switch x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set() x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation() x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers x86/bugs: KVM: Add support for SRSO_MSR_FIX
2025-03-25Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "Nothing major this time around. Apart from the usual perf/PMU updates, some page table cleanups, the notable features are average CPU frequency based on the AMUv1 counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in the uaccess routines. Perf and PMUs: - Support for the 'Rainier' CPU PMU from Arm - Preparatory driver changes and cleanups that pave the way for BRBE support - Support for partial virtualisation of the Apple-M1 PMU - Support for the second event filter in Arm CSPMU designs - Minor fixes and cleanups (CMN and DWC PMUs) - Enable EL2 requirements for FEAT_PMUv3p9 Power, CPU topology: - Support for AMUv1-based average CPU frequency - Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It adds a generic topology_is_primary_thread() function overridden by x86 and powerpc New(ish) features: - MOPS (memcpy/memset) support for the uaccess routines Security/confidential compute: - Fix the DMA address for devices used in Realms with Arm CCA. The CCA architecture uses the address bit to differentiate between shared and private addresses - Spectre-BHB: assume CPUs Linux doesn't know about vulnerable by default Memory management clean-ups: - Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs - Some minor page table accessor clean-ups - PIE/POE (permission indirection/overlay) helpers clean-up Kselftests: - MTE: skip hugetlb tests if MTE is not supported on such mappings and user correct naming for sync/async tag checking modes Miscellaneous: - Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people request) - Sysreg updates for new register fields - CPU type info for some Qualcomm Kryo cores" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits) arm64: mm: Don't use %pK through printk perf/arm_cspmu: Fix missing io.h include arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists arm64: cputype: Add MIDR_CORTEX_A76AE arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list arm64/sysreg: Enforce whole word match for open/close tokens arm64/sysreg: Fix unbalanced closing block arm64: Kconfig: Enable HOTPLUG_SMT arm64: topology: Support SMT control on ACPI based system arch_topology: Support SMT control for OF based system cpu/SMT: Provide a default topology_is_primary_thread() arm64/mm: Define PTDESC_ORDER perf/arm_cspmu: Add PMEVFILT2R support perf/arm_cspmu: Generalise event filtering perf/arm_cspmu: Move register definitons to header arm64/kernel: Always use level 2 or higher for early mappings arm64/mm: Drop PXD_TABLE_BIT arm64/mm: Check pmd_table() in pmd_trans_huge() ...
2025-03-25Merge tag 'irq-drivers-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq driver updates from Thomas Gleixner: - Support for hard indices on RISC-V. The hart index identifies a hart (core) within a specific interrupt domain in RISC-V's Priviledged Architecture. - Rework of the RISC-V MSI driver This moves the driver over to the generic MSI library and solves the affinity problem of unmaskable PCI/MSI controllers. Unmaskable PCI/MSI controllers are prone to lose interrupts when the MSI message is updated to change the affinity because the message write consists of three 32-bit subsequent writes, which update address and data. As these writes are non-atomic versus the device raising an interrupt, the device can observe a half written update and issue an interrupt on the wrong vector. This is mitiated by a carefully orchestrated step by step update and the observation of an eventually pending interrupt on the CPU which issues the update. The algorithm follows the well established method of the X86 MSI driver. - A new driver for the RISC-V Sophgo SG2042 MSI controller - Overhaul of the Renesas RZQ2L driver Simplification of the probe function by using devm_*() mechanisms, which avoid the endless list of error prone gotos in the failure paths. - Expand the Renesas RZV2H driver to support RZ/G3E SoCs - A workaround for Rockchip 3568002 erratum in the GIC-V3 driver to ensure that the addressing is limited to the lower 32-bit of the physical address space. - Add support for the Allwinner AS23 NMI controller - Expand the IMX irqsteer driver to handle up to 960 input interrupts - The usual small updates, cleanups and device tree changes * tag 'irq-drivers-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) irqchip/imx-irqsteer: Support up to 960 input interrupts irqchip/sunxi-nmi: Support Allwinner A523 NMI controller dt-bindings: irq: sun7i-nmi: Document the Allwinner A523 NMI controller irqchip/davinci-cp-intc: Remove public header irqchip/renesas-rzv2h: Add RZ/G3E support irqchip/renesas-rzv2h: Update macros ICU_TSSR_TSSEL_{MASK,PREP} irqchip/renesas-rzv2h: Update TSSR_TIEN macro irqchip/renesas-rzv2h: Add field_width to struct rzv2h_hw_info irqchip/renesas-rzv2h: Add max_tssel to struct rzv2h_hw_info irqchip/renesas-rzv2h: Add struct rzv2h_hw_info with t_offs variable irqchip/renesas-rzv2h: Use devm_pm_runtime_enable() irqchip/renesas-rzv2h: Use devm_reset_control_get_exclusive_deasserted() irqchip/renesas-rzv2h: Simplify rzv2h_icu_init() irqchip/renesas-rzv2h: Drop irqchip from struct rzv2h_icu_priv irqchip/renesas-rzv2h: Fix wrong variable usage in rzv2h_tint_set_type() dt-bindings: interrupt-controller: renesas,rzv2h-icu: Document RZ/G3E SoC riscv: sophgo: dts: Add msi controller for SG2042 irqchip: Add the Sophgo SG2042 MSI interrupt controller dt-bindings: interrupt-controller: Add Sophgo SG2042 MSI arm64: dts: rockchip: rk356x: Move PCIe MSI to use GIC ITS instead of MBI ...
2025-03-25x86/split_lock: Simplify reenablingMaksim Davydov
When split_lock_mitigate is disabled, each CPU needs its own delayed_work structure. They are used to reenable split lock detection after its disabling. But delayed_work structure must be correctly initialized after its allocation. Current implementation uses deferred initialization that makes the split lock handler code unclear. The code can be simplified a bit if the initialization is moved to the appropriate initcall. sld_setup() is called before setup_per_cpu_areas(), thus it can't be used for this purpose, so introduce an independent initcall for the initialization. [ mingo: Simplified the 'work' assignment line a bit more. ] Signed-off-by: Maksim Davydov <davydov-max@yandex-team.ru> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250325085807.171885-1-davydov-max@yandex-team.ru
2025-03-25x86/fpu: Update the outdated comment above fpstate_init_user()Chao Gao
fpu_init_fpstate_user() was removed in: commit 582b01b6ab27 ("x86/fpu: Remove old KVM FPU interface"). Update that comment to accurately reflect the current state regarding its callers. Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250324131931.2097905-1-chao.gao@intel.com
2025-03-25x86/early_printk: Add support for MMIO-based UARTsDenis Mukhin
During the bring-up of an x86 board, the kernel was crashing before reaching the platform's console driver because of a bug in the firmware, leaving no trace of the boot progress. The only available method to debug the kernel boot process was via the platform's MMIO-based UART, as the board lacked an I/O port-based UART, PCI UART, or functional video output. Then it turned out that earlyprintk= does not have a knob to configure the MMIO-mapped UART. Extend the early printk facility to support platform MMIO-based UARTs on x86 systems, enabling debugging during the system bring-up phase. The command line syntax to enable platform MMIO-based UART is: earlyprintk=mmio,membase[,{nocfg|baudrate}][,keep] Note, the change does not integrate MMIO-based UART support to: arch/x86/boot/early_serial_console.c Also, update kernel parameters documentation with the new syntax and add the missing 'nocfg' setting to the PCI serial cards description. Signed-off-by: Denis Mukhin <dmukhin@ford.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250324-earlyprintk-v3-1-aee7421dc469@ford.com
2025-03-25x86/dumpstack: Fix inaccurate unwinding from exception stacks due to ↵Jann Horn
misplaced assignment Commit: 2e4be0d011f2 ("x86/show_trace_log_lvl: Ensure stack pointer is aligned, again") was intended to ensure alignment of the stack pointer; but it also moved the initialization of the "stack" variable down into the loop header. This was likely intended as a no-op cleanup, since the commit message does not mention it; however, this caused a behavioral change because the value of "regs" is different between the two places. Originally, get_stack_pointer() used the regs provided by the caller; after that commit, get_stack_pointer() instead uses the regs at the top of the stack frame the unwinder is looking at. Often, there are no such regs at all, and "regs" is NULL, causing get_stack_pointer() to fall back to the task's current stack pointer, which is not what we want here, but probably happens to mostly work. Other times, the original regs will point to another regs frame - in that case, the linear guess unwind logic in show_trace_log_lvl() will start unwinding too far up the stack, causing the first frame found by the proper unwinder to never be visited, resulting in a stack trace consisting purely of guess lines. Fix it by moving the "stack = " assignment back where it belongs. Fixes: 2e4be0d011f2 ("x86/show_trace_log_lvl: Ensure stack pointer is aligned, again") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250325-2025-03-unwind-fixes-v1-2-acd774364768@google.com
2025-03-24Merge tag 'x86-cleanups-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Miscellaneous x86 cleanups by Arnd Bergmann, Charles Han, Mirsad Todorovac, Randy Dunlap, Thorsten Blum and Zhang Kunbo" * tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/coco: Replace 'static const cc_mask' with the newly introduced cc_get_mask() function x86/delay: Fix inconsistent whitespace selftests/x86/syscall: Fix coccinelle WARNING recommending the use of ARRAY_SIZE() x86/platform: Fix missing declaration of 'x86_apple_machine' x86/irq: Fix missing declaration of 'io_apic_irqs' x86/usercopy: Fix kernel-doc func param name in clean_cache_range()'s description x86/apic: Use str_disabled_enabled() helper in print_ipi_mode()
2025-03-24Merge tag 'x86-fpu-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/fpu updates from Ingo Molnar: - Improve crypto performance by making kernel-mode FPU reliably usable in softirqs ((Eric Biggers) - Fully optimize out WARN_ON_FPU() (Eric Biggers) - Initial steps to support Support Intel APX (Advanced Performance Extensions) (Chang S. Bae) - Fix KASAN for arch_dup_task_struct() (Benjamin Berg) - Refine and simplify the FPU magic number check during signal return (Chang S. Bae) - Fix inconsistencies in guest FPU xfeatures (Chao Gao, Stanislav Spassov) - selftests/x86/xstate: Introduce common code for testing extended states (Chang S. Bae) - Misc fixes and cleanups (Borislav Petkov, Colin Ian King, Uros Bizjak) * tag 'x86-fpu-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu/xstate: Fix inconsistencies in guest FPU xfeatures x86/fpu: Clarify the "xa" symbolic name used in the XSTATE* macros x86/fpu: Use XSAVE{,OPT,C,S} and XRSTOR{,S} mnemonics in xstate.h x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs x86/fpu/xstate: Simplify print_xstate_features() x86/fpu: Refine and simplify the magic number check during signal return selftests/x86/xstate: Fix spelling mistake "hader" -> "header" x86/fpu: Avoid copying dynamic FP state from init_task in arch_dup_task_struct() vmlinux.lds.h: Remove entry to place init_task onto init_stack selftests/x86/avx: Add AVX tests selftests/x86/xstate: Clarify supported xstates selftests/x86/xstate: Consolidate test invocations into a single entry selftests/x86/xstate: Introduce signal ABI test selftests/x86/xstate: Refactor ptrace ABI test selftests/x86/xstate: Refactor context switching test selftests/x86/xstate: Enumerate and name xstate components selftests/x86/xstate: Refactor XSAVE helpers for general use selftests/x86: Consolidate redundant signal helper functions x86/fpu: Fix guest FPU state buffer allocation size x86/fpu: Fully optimize out WARN_ON_FPU()
2025-03-24Merge tag 'x86-boot-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot code updates from Ingo Molnar: - Memblock setup and other early boot code cleanups (Mike Rapoport) - Export e820_table_kexec[] to sysfs (Dave Young) - Baby steps of adding relocate_kernel() debugging support (David Woodhouse) - Replace open-coded parity calculation with parity8() (Kuan-Wei Chiu) - Move the LA57 trampoline to separate source file (Ard Biesheuvel) - Misc micro-optimizations (Uros Bizjak) - Drop obsolete E820_TYPE_RESERVED_KERN and related code (Mike Rapoport) * tag 'x86-boot-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kexec: Add relocate_kernel() debugging support: Load a GDT x86/boot: Move the LA57 trampoline to separate source file x86/boot: Do not test if AC and ID eflags are changeable on x86_64 x86/bootflag: Replace open-coded parity calculation with parity8() x86/bootflag: Micro-optimize sbf_write() x86/boot: Add missing has_cpuflag() prototype x86/kexec: Export e820_table_kexec[] to sysfs x86/boot: Change some static bootflag functions to bool x86/e820: Drop obsolete E820_TYPE_RESERVED_KERN and related code x86/boot: Split parsing of boot_params into the parse_boot_params() helper function x86/boot: Split kernel resources setup into the setup_kernel_resources() helper function x86/boot: Move setting of memblock parameters to e820__memblock_setup()
2025-03-24Merge tag 'x86-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: "x86 CPU features support: - Generate the <asm/cpufeaturemasks.h> header based on build config (H. Peter Anvin, Xin Li) - x86 CPUID parsing updates and fixes (Ahmed S. Darwish) - Introduce the 'setcpuid=' boot parameter (Brendan Jackman) - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan Jackman) - Utilize CPU-type for CPU matching (Pawan Gupta) - Warn about unmet CPU feature dependencies (Sohil Mehta) - Prepare for new Intel Family numbers (Sohil Mehta) Percpu code: - Standardize & reorganize the x86 percpu layout and related cleanups (Brian Gerst) - Convert the stackprotector canary to a regular percpu variable (Brian Gerst) - Add a percpu subsection for cache hot data (Brian Gerst) - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak) - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak) MM: - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction (Rik van Riel) - Rework ROX cache to avoid writable copy (Mike Rapoport) - PAT: restore large ROX pages after fragmentation (Kirill A. Shutemov, Mike Rapoport) - Make memremap(MEMREMAP_WB) map memory as encrypted by default (Kirill A. Shutemov) - Robustify page table initialization (Kirill A. Shutemov) - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn) - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW (Matthew Wilcox) KASLR: - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir Singh) CPU bugs: - Implement FineIBT-BHI mitigation (Peter Zijlstra) - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta) - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta) - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta) System calls: - Break up entry/common.c (Brian Gerst) - Move sysctls into arch/x86 (Joel Granados) Intel LAM support updates: (Maciej Wieczor-Retman) - selftests/lam: Move cpu_has_la57() to use cpuinfo flag - selftests/lam: Skip test if LAM is disabled - selftests/lam: Test get_user() LAM pointer handling AMD SMN access updates: - Add SMN offsets to exclusive region access (Mario Limonciello) - Add support for debugfs access to SMN registers (Mario Limonciello) - Have HSMP use SMN through AMD_NODE (Yazen Ghannam) Power management updates: (Patryk Wlazlyn) - Allow calling mwait_play_dead with an arbitrary hint - ACPI/processor_idle: Add FFH state handling - intel_idle: Provide the default enter_dead() handler - Eliminate mwait_play_dead_cpuid_hint() Build system: - Raise the minimum GCC version to 8.1 (Brian Gerst) - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor) Kconfig: (Arnd Bergmann) - Add cmpxchg8b support back to Geode CPUs - Drop 32-bit "bigsmp" machine support - Rework CONFIG_GENERIC_CPU compiler flags - Drop configuration options for early 64-bit CPUs - Remove CONFIG_HIGHMEM64G support - Drop CONFIG_SWIOTLB for PAE - Drop support for CONFIG_HIGHPTE - Document CONFIG_X86_INTEL_MID as 64-bit-only - Remove old STA2x11 support - Only allow CONFIG_EISA for 32-bit Headers: - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers (Thomas Huth) Assembly code & machine code patching: - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf) - x86/alternatives: Simplify callthunk patching (Peter Zijlstra) - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf) - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf) - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra) - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> (Uros Bizjak) - Use named operands in inline asm (Uros Bizjak) - Improve performance by using asm_inline() for atomic locking instructions (Uros Bizjak) Earlyprintk: - Harden early_serial (Peter Zijlstra) NMI handler: - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus() (Waiman Long) Miscellaneous fixes and cleanups: - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner, Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly Kuznetsov, Xin Li, liuye" * tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits) zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault x86/asm: Make asm export of __ref_stack_chk_guard unconditional x86/mm: Only do broadcast flush from reclaim if pages were unmapped perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones perf/x86/intel, x86/cpu: Simplify Intel PMU initialization x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions x86/asm: Use asm_inline() instead of asm() in clwb() x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h> x86/hweight: Use asm_inline() instead of asm() x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm() x86/hweight: Use named operands in inline asm() x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP x86/head/64: Avoid Clang < 17 stack protector in startup code x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro x86/cpu/intel: Limit the non-architectural constant_tsc model checks x86/mm/pat: Replace Intel x86_model checks with VFM ones x86/cpu/intel: Fix fast string initialization for extended Families ...
2025-03-24Merge tag 'perf-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core: - Move perf_event sysctls into kernel/events/ (Joel Granados) - Use POLLHUP for pinned events in error (Namhyung Kim) - Avoid the read if the count is already updated (Peter Zijlstra) - Allow the EPOLLRDNORM flag for poll (Tao Chen) - locking/percpu-rwsem: Add guard support [ NOTE: this got (mis-)merged into the perf tree due to related work ] (Peter Zijlstra) perf_pmu_unregister() related improvements: (Peter Zijlstra) - Simplify the perf_event_alloc() error path - Simplify the perf_pmu_register() error path - Simplify perf_pmu_register() - Simplify perf_init_event() - Simplify perf_event_alloc() - Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_count - Add this_cpc() helper - Introduce perf_free_addr_filters() - Robustify perf_event_free_bpf_prog() - Simplify the perf_mmap() control flow - Further simplify perf_mmap() - Remove retry loop from perf_mmap() - Lift event->mmap_mutex in perf_mmap() - Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes - Fix perf_mmap() failure path Uprobes: - Harden x86 uretprobe syscall trampoline check (Jiri Olsa) - Remove redundant spinlock in uprobe_deny_signal() (Liao Chang) - Remove the spinlock within handle_singlestep() (Liao Chang) x86 Intel PMU enhancements: - Support PEBS counters snapshotting (Kan Liang) - Fix intel_pmu_read_event() (Kan Liang) - Extend per event callchain limit to branch stack (Kan Liang) - Fix system-wide LBR profiling (Kan Liang) - Allocate bts_ctx only if necessary (Li RongQing) - Apply static call for drain_pebs (Peter Zijlstra) x86 AMD PMU enhancements: (Ravi Bangoria) - Remove pointless sample period check - Fix ->config to sample period calculation for OP PMU - Fix perf_ibs_op.cnt_mask for CurCnt - Don't allow freq mode event creation through ->config interface - Add PMU specific minimum period - Add ->check_period() callback - Ceil sample_period to min_period - Add support for OP Load Latency Filtering - Update DTLB/PageSize decode logic Hardware breakpoints: - Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar Bhaskar) Hardlockup detector improvements: (Li Huafei) - perf_event memory leak - Warn if watchdog_ev is leaked Fixes and cleanups: - Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter Zijlstra, Ravi Bangoria, Thorsten Blum, XieLudan)" * tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits) perf: Fix __percpu annotation perf: Clean up pmu specific data perf/x86: Remove swap_task_ctx() perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode perf: Supply task information to sched_task() perf: attach/detach PMU specific data locking/percpu-rwsem: Add guard support perf: Save PMU specific data in task_struct perf: Extend per event callchain limit to branch stack perf/ring_buffer: Allow the EPOLLRDNORM flag for poll perf/core: Use POLLHUP for pinned events in error perf/core: Use sysfs_emit() instead of scnprintf() perf/core: Remove optional 'size' arguments from strscpy() calls perf/x86/intel/bts: Check if bts_ctx is allocated when calling BTS functions uprobes/x86: Harden uretprobe syscall trampoline check watchdog/hardlockup/perf: Warn if watchdog_ev is leaked watchdog/hardlockup/perf: Fix perf_event memory leak perf/x86: Annotate struct bts_buffer::buf with __counted_by() perf/core: Clean up perf_try_init_event() perf/core: Fix perf_mmap() failure path ...
2025-03-24Merge tag 'sched-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-24Merge tag 'objtool-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - The biggest change is the new option to automatically fail the build on objtool warnings: CONFIG_OBJTOOL_WERROR. While there are no currently known unfixed false positives left, such an expansion in the severity of objtool warnings inevitably creates a risk of build failures, so it's disabled by default and depends on !COMPILE_TEST, so it shouldn't be enabled on allyesconfig/allmodconfig builds and won't be forced on people who just accept build-time defaults in 'make oldconfig'. While the option is strongly recommended, only people who enable it explicitly should see it. (Josh Poimboeuf) - Disable branch profiling in noinstr code with a broad brush that includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf) - Create backup object files on objtool errors and print exact objtool arguments to make failure analysis easier (Josh Poimboeuf) - Improve noreturn handling (Josh Poimboeuf) - Improve rodata handling (Tiezhu Yang) - Support jump tables, switch tables and goto tables on LoongArch (Tiezhu Yang) - Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar) * tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) tracing: Disable branch profiling in noinstr code objtool: Use O_CREAT with explicit mode mask objtool: Add CONFIG_OBJTOOL_WERROR objtool: Create backup on error and print args objtool: Change "warning:" to "error:" for --Werror objtool: Add --Werror option objtool: Add --output option objtool: Upgrade "Linked object detected" warning to error objtool: Consolidate option validation objtool: Remove --unret dependency on --rethunk objtool: Increase per-function WARN_FUNC() rate limit objtool: Update documentation objtool: Improve __noreturn annotation warning objtool: Fix error handling inconsistencies in check() x86/traps: Make exc_double_fault() consistently noreturn LoongArch: Enable jump table for objtool objtool/LoongArch: Add support for goto table objtool/LoongArch: Add support for switch table objtool: Handle PC relative relocation type objtool: Handle different entry size of rodata ...
2025-03-24Merge tag 'locking-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Locking primitives: - Micro-optimize percpu_{,try_}cmpxchg{64,128}_op() and {,try_}cmpxchg{64,128} on x86 (Uros Bizjak) - mutexes: extend debug checks in mutex_lock() (Yunhui Cui) - Misc cleanups (Uros Bizjak) Lockdep: - Fix might_fault() lockdep check of current->mm->mmap_lock (Peter Zijlstra) - Don't disable interrupts on RT in disable_irq_nosync_lockdep.*() (Sebastian Andrzej Siewior) - Disable KASAN instrumentation of lockdep.c (Waiman Long) - Add kasan_check_byte() check in lock_acquire() (Waiman Long) - Misc cleanups (Sebastian Andrzej Siewior) Rust runtime integration: - Use Pin for all LockClassKey usages (Mitchell Levy) - sync: Add accessor for the lock behind a given guard (Alice Ryhl) - sync: condvar: Add wait_interruptible_freezable() (Alice Ryhl) - sync: lock: Add an example for Guard:: Lock_ref() (Boqun Feng) Split-lock detection feature (x86): - Fix warning mode with disabled mitigation mode (Maksim Davydov) Locking events: - Add locking events for rtmutex slow paths (Waiman Long) - Add locking events for lockdep (Waiman Long)" * tag 'locking-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Remove disable_irq_lockdep() lockdep: Don't disable interrupts on RT in disable_irq_nosync_lockdep.*() rust: lockdep: Use Pin for all LockClassKey usages rust: sync: condvar: Add wait_interruptible_freezable() rust: sync: lock: Add an example for Guard:: Lock_ref() rust: sync: Add accessor for the lock behind a given guard locking/lockdep: Add kasan_check_byte() check in lock_acquire() locking/lockdep: Disable KASAN instrumentation of lockdep.c locking/lock_events: Add locking events for lockdep locking/lock_events: Add locking events for rtmutex slow paths x86/split_lock: Fix the delayed detection logic lockdep/mm: Fix might_fault() lockdep check of current->mm->mmap_lock x86/locking: Remove semicolon from "lock" prefix locking/mutex: Add MUTEX_WARN_ON() into fast path x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
2025-03-24Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq updates for 6.15-rc1: - Manage sysfs attributes and boost frequencies efficiently from cpufreq core to reduce boilerplate code from drivers (Viresh Kumar). - Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider, Dhananjay Ugwekar, Imran Shaik, and zuoqian). - Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky Bai). - cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski). - Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng). - Optimize the amd-pstate driver to avoid cases where call paths end up calling the same writes multiple times and needlessly caching variables through code reorganization, locking overhaul and tracing adjustments (Mario Limonciello, Dhananjay Ugwekar). - Make it possible to avoid enabling capacity-aware scheduling (CAS) in the intel_pstate driver and relocate a check for out-of-band (OOB) platform handling in it to make it detect OOB before checking HWP availability (Rafael Wysocki). - Fix dbs_update() to avoid inadvertent conversions of negative integer values to unsigned int which causes CPU frequency selection to be inaccurate in some cases when the "conservative" cpufreq governor is in use (Jie Zhan). * pm-cpufreq: (91 commits) dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650 dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1 dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible cpufreq: Init cpufreq only for present CPUs cpufreq: tegra186: Share policy per cluster cpufreq/amd-pstate: Drop actions in amd_pstate_epp_cpu_offline() cpufreq/amd-pstate: Stop caching EPP cpufreq/amd-pstate: Rework CPPC enabling cpufreq/amd-pstate: Drop debug statements for policy setting cpufreq/amd-pstate: Update cppc_req_cached for shared mem EPP writes cpufreq/amd-pstate: Move all EPP tracing into *_update_perf and *_set_epp functions cpufreq/amd-pstate: Cache CPPC request in shared mem case too cpufreq/amd-pstate: Replace all AMD_CPPC_* macros with masks cpufreq/amd-pstate-ut: Adjust variable scope cpufreq/amd-pstate-ut: Run on all of the correct CPUs cpufreq/amd-pstate-ut: Drop SUCCESS and FAIL enums cpufreq/amd-pstate-ut: Allow lowest nonlinear and lowest to be the same cpufreq/amd-pstate-ut: Use _free macro to free put policy cpufreq/amd-pstate: Drop `cppc_cap1_cached` ...
2025-03-24Merge branches 'acpi-x86', 'acpi-platform-profile', 'acpi-apei' and 'acpi-misc'Rafael J. Wysocki
Merge an ACPI CPPC update, ACPI platform-profile driver updates, an ACPI APEI update and a MAINTAINERS update related to ACPI for 6.15-rc1: - Add a missing header file include to the x86 arch CPPC code (Mario Limonciello). - Rework the sysfs attributes implementation in the ACPI platform-profile driver and improve the unregistration code in it (Nathan Chancellor, Kurt Borja). - Prevent the ACPI HED driver from being built as a module and change its initcall level to subsys_initcall to avoid initialization ordering issues related to it (Xiaofei Tan). - Update a maintainer email address in the ACPI PMIC entry in MAINTAINERS (Mika Westerberg). * acpi-x86: x86/ACPI: CPPC: Add missing include * acpi-platform-profile: ACPI: platform_profile: Improve platform_profile_unregister() ACPI: platform-profile: Fix CFI violation when accessing sysfs files * acpi-apei: ACPI: HED: Always initialize before evged * acpi-misc: MAINTAINERS: Use my kernel.org address for ACPI PMIC work
2025-03-22tracing: Disable branch profiling in noinstr codeJosh Poimboeuf
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update() for each use of likely() or unlikely(). That breaks noinstr rules if the affected function is annotated as noinstr. Disable branch profiling for files with noinstr functions. In addition to some individual files, this also includes the entire arch/x86 subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle directories, all of which are noinstr-heavy. Due to the nature of how sched binaries are built by combining multiple .c files into one, branch profiling is disabled more broadly across the sched code than would otherwise be needed. This fixes many warnings like the following: vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section ... Reported-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-20x86: hyperv: Add mshv_handler() irq handler and setup functionNuno Das Neves
Add mshv_handler() to process messages related to managing guest partitions such as intercepts, doorbells, and scheduling messages. In a (non-nested) root partition, the same interrupt vector is shared between the vmbus and mshv_root drivers. Introduce a stub for mshv_handler() and call it in sysvec_hyperv_callback alongside vmbus_handler(). Even though both handlers will be called for every Hyper-V interrupt, the messages for each driver are delivered to different offsets within the SYNIC message page, so they won't step on each other. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Link: https://lore.kernel.org/r/1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com>
2025-03-20Drivers: hv: Export some functions for use by root partition moduleNuno Das Neves
hv_get_hypervisor_version(), hv_call_deposit_pages(), and hv_call_create_vp(), are all needed in-module with CONFIG_MSHV_ROOT=m. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@microsoft.linux.com> Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/r/1741980536-3865-7-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1741980536-3865-7-git-send-email-nunodasneves@linux.microsoft.com>
2025-03-20x86/mshyperv: Add support for extended Hyper-V featuresStanislav Kinsburskii
Extend the "ms_hyperv_info" structure to include a new field, "ext_features", for capturing extended Hyper-V features. Update the "ms_hyperv_init_platform" function to retrieve these features using the cpuid instruction and include them in the informational output. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com>
2025-03-19x86/cpu/intel: Limit the non-architectural constant_tsc model checksSohil Mehta
X86_FEATURE_CONSTANT_TSC is a Linux-defined, synthesized feature flag. It is used across several vendors. Intel CPUs will set the feature when the architectural CPUID.80000007.EDX[1] bit is set. There are also some Intel CPUs that have the X86_FEATURE_CONSTANT_TSC behavior but don't enumerate it with the architectural bit. Those currently have a model range check. Today, virtually all of the CPUs that have the CPUID bit *also* match the "model >= 0x0e" check. This is confusing. Instead of an open-ended check, pick some models (INTEL_IVYBRIDGE and P4_WILLAMETTE) as the end of goofy CPUs that should enumerate the bit but don't. These models are relatively arbitrary but conservative pick for this. This makes it obvious that later CPUs (like Family 18+) no longer need to synthesize X86_FEATURE_CONSTANT_TSC. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250219184133.816753-14-sohil.mehta@intel.com
2025-03-19x86/cpu/intel: Fix fast string initialization for extended FamiliesSohil Mehta
X86_FEATURE_REP_GOOD is a linux defined feature flag to track whether fast string operations should be used for copy_page(). It is also used as a second alternative for clear_page() if enhanced fast string operations (ERMS) are not available. X86_FEATURE_ERMS is an Intel-specific hardware-defined feature flag that tracks hardware support for Enhanced Fast strings. It is used to track whether Fast strings should be used for similar memory copy and memory clearing operations. On top of these, there is a FAST_STRING enable bit in the IA32_MISC_ENABLE MSR. It is typically controlled by the BIOS to provide a hint to the hardware and the OS on whether fast string operations are preferred. Commit: 161ec53c702c ("x86, mem, intel: Initialize Enhanced REP MOVSB/STOSB") introduced a mechanism to honor the BIOS preference for fast string operations and clear the above feature flags if needed. Unfortunately, the current initialization code for Intel to set and clear these bits is confusing at best and likely incorrect. X86_FEATURE_REP_GOOD is cleared in early_init_intel() if MISC_ENABLE.FAST_STRING is 0. But it gets set later on unconditionally for all Family 6 processors in init_intel(). This not only overrides the BIOS preference but also contradicts the earlier check. Fix this by combining the related checks and always relying on the BIOS provided preference for fast string operations. This simplification makes sure the upcoming Intel Family 18 and 19 models are covered as well. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250219184133.816753-12-sohil.mehta@intel.com
2025-03-19x86/smpboot: Fix INIT delay assignment for extended Intel FamiliesSohil Mehta
Some old crusty CPUs need an extra delay that slows down booting. See the comment above 'init_udelay' for details. Newer CPUs don't need the delay. Right now, for Intel, Family 6 and only Family 6 skips the delay. That leaves out both the Family 15 (Pentium 4s) and brand new Family 18/19 models. The omission of Family 15 (Pentium 4s) seems like an oversight and 18/19 do not need the delay. Skip the delay on all Intel processors Family 6 and beyond. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250219184133.816753-11-sohil.mehta@intel.com
2025-03-19x86/smpboot: Remove confusing quirk usage in INIT delaySohil Mehta
Very old multiprocessor systems required a 10 msec delay between asserting and de-asserting INIT but modern processors do not require this delay. Over time the usage of the "quirk" wording while setting the INIT delay has become misleading. The code comments suggest that modern processors need to be quirked, which clears the default init_udelay of 10 msec, while legacy processors don't need the quirk and continue to use the default init_udelay. With a lot more modern processors, the wording should be inverted if at all needed. Instead, simplify the comments and the code by getting rid of "quirk" usage altogether and clarifying the following: - Old legacy processors -> Set the "legacy" 10 msec delay - Modern processors -> Do not set any delay No functional change. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250219184133.816753-10-sohil.mehta@intel.com
2025-03-19x86/acpi/cstate: Improve Intel Family model checksSohil Mehta
Update the Intel Family checks to consistently use Family 15 instead of Family 0xF. Also, get rid of one of last usages of x86_model by using the new VFM checks. Update the incorrect comment since the check has changed since the initial commit: ee1ca48fae7e ("ACPI: Disable ARB_DISABLE on platforms where it is not needed") The two changes were: - 3e2ada5867b7 ("ACPI: fix Compaq Evo N800c (Pentium 4m) boot hang regression") removed the P4 - Family 15. - 03a05ed11529 ("ACPI: Use the ARB_DISABLE for the CPU which model id is less than 0x0f.") got rid of CORE_YONAH - Family 6, model E. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20250219184133.816753-9-sohil.mehta@intel.com
2025-03-19x86/cpu/intel: Replace Family 5 model checks with VFM onesSohil Mehta
Introduce names for some Family 5 models and convert some of the checks to be VFM based. Also, to keep the file sorted by family, move Family 5 to the top of the header file. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250219184133.816753-8-sohil.mehta@intel.com
2025-03-19x86/cpu/intel: Replace Family 15 checks with VFM onesSohil Mehta
Introduce names for some old pentium 4 models and replace the x86_model checks with VFM ones. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250219184133.816753-7-sohil.mehta@intel.com
2025-03-19x86/cpu/intel: Replace early Family 6 checks with VFM onesSohil Mehta
Introduce names for some old pentium models and replace the x86_model checks with VFM ones. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250219184133.816753-6-sohil.mehta@intel.com
2025-03-19x86/mtrr: Modify a x86_model check to an Intel VFM checkSohil Mehta
Simplify one of the last few Intel x86_model checks in arch/x86 by substituting it with a VFM one. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250219184133.816753-5-sohil.mehta@intel.com
2025-03-19x86/microcode: Update the Intel processor flag scan checkSohil Mehta
The Family model check to read the processor flag MSR is misleading and potentially incorrect. It doesn't consider Family while comparing the model number. The original check did have a Family number but it got lost/moved during refactoring. intel_collect_cpu_info() is called through multiple paths such as early initialization, CPU hotplug as well as IFS image load. Some of these flows would be error prone due to the ambiguous check. Correct the processor flag scan check to use a Family number and update it to a VFM based one to make it more readable. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250219184133.816753-4-sohil.mehta@intel.com
2025-03-19x86/cpu/intel: Fix the MOVSL alignment preference for extended FamiliesSohil Mehta
The alignment preference for 32-bit MOVSL based bulk memory move has been 8-byte for a long time. However this preference is only set for Family 6 and 15 processors. Use the same preference for upcoming Family numbers 18 and 19. Also, use a simpler VFM based check instead of switching based on Family numbers. Refresh the comment to reflect the new check. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250219184133.816753-3-sohil.mehta@intel.com
2025-03-19x86/apic: Fix 32-bit APIC initialization for extended Intel FamiliesSohil Mehta
APIC detection is currently limited to a few specific Families and will not match the upcoming Families >=18. Extend the check to include all Families 6 or greater. Also convert it to a VFM check to make it simpler. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250219184133.816753-2-sohil.mehta@intel.com
2025-03-19x86/syscall: Move sys_ni_syscall()Brian Gerst
Move sys_ni_syscall() to kernel/process.c, and remove the now empty entry/common.c No functional changes. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20250314151220.862768-6-brgerst@gmail.com
2025-03-19x86/amd_node: Add support for debugfs access to SMN registersMario Limonciello
There are certain registers on AMD Zen systems that can only be accessed through SMN. Introduce a new interface that provides debugfs files for accessing SMN. As this introduces the capability for userspace to manipulate the hardware in unpredictable ways, taint the kernel when writing. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-3-b5cc997e471b@amd.com
2025-03-19x86/amd_node: Add SMN offsets to exclusive region accessMario Limonciello
Offsets 0x60 and 0x64 are used internally by kernel drivers that call the amd_smn_read() and amd_smn_write() functions. If userspace accesses the regions at the same time as the kernel it may cause malfunctions in drivers using the offsets. Add these offsets to the exclusions so that the kernel is tainted if a non locked down userspace tries to access them. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-2-b5cc997e471b@amd.com
2025-03-19x86/amd_node, platform/x86/amd/hsmp: Have HSMP use SMN through AMD_NODEYazen Ghannam
The HSMP interface is just an SMN interface with different offsets. Define an HSMP wrapper in the SMN code and have the HSMP platform driver use that rather than a local solution. Also, remove the "root" member from AMD_NB, since there are no more users of it. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Carlos Bilbao <carlos.bilbao@kernel.org> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-1-b5cc997e471b@amd.com
2025-03-19x86/mtrr: Use str_enabled_disabled() helper in print_mtrr_state()Thorsten Blum
Remove hard-coded strings by using the str_enabled_disabled() helper function. Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/all/20250117144900.171684-2-thorsten.blum%40linux.dev
2025-03-19x86/cpufeatures: Warn about unmet CPU feature dependenciesSohil Mehta
Currently, the cpuid_deps[] table is only exercised when a particular feature is explicitly disabled and clear_cpu_cap() is called. However, some of these listed dependencies might already be missing during boot. These types of errors shouldn't generally happen in production environments, but they could sometimes sneak through, especially when VMs and Kconfigs are in the mix. Also, the kernel might introduce artificial dependencies between unrelated features, such as making LAM depend on LASS. Unexpected failures can occur when the kernel tries to use such features. Add a simple boot-time scan of the cpuid_deps[] table to detect the missing dependencies. One option is to disable all of such features during boot, but that may cause regressions in existing systems. For now, just warn about the missing dependencies to create awareness. As a trade-off between spamming the kernel log and keeping track of all the features that have been warned about, only warn about the first missing dependency. Any subsequent unmet dependency will only be logged after the first one has been resolved. Features are typically represented through unsigned integers within the kernel, though some of them have user-friendly names if they are exposed via /proc/cpuinfo. Show the friendlier name if available, otherwise display the X86_FEATURE_* numerals to make it easier to identify the feature. Suggested-by: Tony Luck <tony.luck@intel.com> Suggested-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20250313201608.3304135-1-sohil.mehta@intel.com
2025-03-19x86/rfds: Exclude P-only parts from the RFDS affected listPawan Gupta
The affected CPU table (cpu_vuln_blacklist) marks Alderlake and Raptorlake P-only parts affected by RFDS. This is not true because only E-cores are affected by RFDS. With the current family/model matching it is not possible to differentiate the unaffected parts, as the affected and unaffected hybrid variants have the same model number. Add a cpu-type match as well for such parts so as to exclude P-only parts being marked as affected. Note, family/model and cpu-type enumeration could be inaccurate in virtualized environments. In a guest affected status is decided by RFDS_NO and RFDS_CLEAR bits exposed by VMMs. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-5-e8514dcaaff2@linux.intel.com
2025-03-19x86/cpu: Update x86_match_cpu() to also use cpu-typePawan Gupta
Non-hybrid CPU variants that share the same Family/Model could be differentiated by their cpu-type. x86_match_cpu() currently does not use cpu-type for CPU matching. Dave Hansen suggested to use below conditions to match CPU-type: 1. If CPU_TYPE_ANY (the wildcard), then matched 2. If hybrid, then matched 3. If !hybrid, look at the boot CPU and compare the cpu-type to determine if it is a match. This special case for hybrid systems allows more compact vulnerability list. Imagine that "Haswell" CPUs might or might not be hybrid and that only Atom cores are vulnerable to Meltdown. That means there are three possibilities: 1. P-core only 2. Atom only 3. Atom + P-core (aka. hybrid) One might be tempted to code up the vulnerability list like this: MATCH( HASWELL, X86_FEATURE_HYBRID, MELTDOWN) MATCH_TYPE(HASWELL, ATOM, MELTDOWN) Logically, this matches #2 and #3. But that's a little silly. You would only ask for the "ATOM" match in cases where there *WERE* hybrid cores in play. You shouldn't have to _also_ ask for hybrid cores explicitly. In short, assume that processors that enumerate Hybrid==1 have a vulnerable core type. Update x86_match_cpu() to also match cpu-type. Also treat hybrid systems as special, and match them to any cpu-type. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-4-e8514dcaaff2@linux.intel.com
2025-03-19x86/cpufeatures: Generate the <asm/cpufeaturemasks.h> header based on build ↵H. Peter Anvin (Intel)
config Introduce an AWK script to auto-generate the <asm/cpufeaturemasks.h> header with required and disabled feature masks based on <asm/cpufeatures.h> and the current build config. Thus for any CPU feature with a build config, e.g., X86_FRED, simply add: config X86_DISABLED_FEATURE_FRED def_bool y depends on !X86_FRED to arch/x86/Kconfig.cpufeatures, instead of adding a conditional CPU feature disable flag, e.g., DISABLE_FRED. Lastly, the generated required and disabled feature masks will be added to their corresponding feature masks for this particular compile-time configuration. [ Xin: build integration improvements ] [ mingo: Improved changelog and comments ] Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250305184725.3341760-3-xin@zytor.com
2025-03-19x86/acpi: Replace manual page table initialization with ↵Kirill A. Shutemov
kernel_ident_mapping_init() The init_transition_pgtable() functions maps the page with asm_acpi_mp_play_dead() into an identity mapping. Replace open-coded manual page table initialization with kernel_ident_mapping_init() to avoid code duplication. Use x86_mapping_info::offset to get the page mapped at the correct location. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20241016111458.846228-3-kirill.shutemov@linux.intel.com
2025-03-19x86/mm: Enable AMD translation cache extensionsRik van Riel
With AMD TCE (translation cache extensions) only the intermediate mappings that cover the address range zapped by INVLPG / INVLPGB get invalidated, rather than all intermediate mappings getting zapped at every TLB invalidation. This can help reduce the TLB miss rate, by keeping more intermediate mappings in the cache. From the AMD manual: Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on TLB entries. When this bit is 0, these instructions remove the target PTE from the TLB as well as all upper-level table entries that are cached in the TLB, whether or not they are associated with the target PTE. When this bit is set, these instructions will remove the target PTE and only those upper-level entries that lead to the target PTE in the page table hierarchy, leaving unrelated upper-level entries intact. [ bp: use cpu_has()... I know, it is a mess. ] Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250226030129.530345-13-riel@surriel.com
2025-03-19x86/mm: Add INVLPGB feature and Kconfig entryRik van Riel
In addition, the CPU advertises the maximum number of pages that can be shot down with one INVLPGB instruction in CPUID. Save that information for later use. [ bp: use cpu_has(), typos, massage. ] Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250226030129.530345-3-riel@surriel.com
2025-03-19Merge tag 'v6.14-rc7' into x86/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-17x86/mce: use is_copy_from_user() to determine copy-from-user contextShuai Xue
Patch series "mm/hwpoison: Fix regressions in memory failure handling", v4. ## 1. What am I trying to do: This patchset resolves two critical regressions related to memory failure handling that have appeared in the upstream kernel since version 5.17, as compared to 5.10 LTS. - copyin case: poison found in user page while kernel copying from user space - instr case: poison found while instruction fetching in user space ## 2. What is the expected outcome and why - For copyin case: Kernel can recover from poison found where kernel is doing get_user() or copy_from_user() if those places get an error return and the kernel return -EFAULT to the process instead of crashing. More specifily, MCE handler checks the fixup handler type to decide whether an in kernel #MC can be recovered. When EX_TYPE_UACCESS is found, the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return a -EFAULT to user space. - For instr case: If a poison found while instruction fetching in user space, full recovery is possible. User process takes #PF, Linux allocates a new page and fills by reading from storage. ## 3. What actually happens and why - For copyin case: kernel panic since v5.17 Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup type for copy-from-user operations, changing it from EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. It breaks previous EX_TYPE_UACCESS handling when posion found in get_user() or copy_from_user(). - For instr case: user process is killed by a SIGBUS signal due to #CMCI and #MCE race When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. ### Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1] Prior to Icelake memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill because it's not an urgent problem that no core is on the verge of consuming that bad data. It's also found that multi SRAO UCE may cause nested MCE interrupts and finally become an IERR. Hence, Intel downgrades the machine check bank signature of patrol scrub from SRAO to UCNA (Uncorrected, No Action required), and signal changed to #CMCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UC*NA* signature name. ### Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1] Having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read. Thus: 1) Core is clever and thinks address A is needed soon, issues a speculative read. 2) Core finds it is going to use address A soon after sending the read request 3) The CMCI from the memory controller is in a race with MCE from the core that will soon try to retire the load from address A. Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison). ## Why user process is killed for instr case Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") tries to fix noise message "Memory error not recovered" and skips duplicate SIGBUSs due to the race. But it also introduced a bug that kill_accessing_process() return -EHWPOISON for instr case, as result, kill_me_maybe() send a SIGBUS to user process. # 4. The fix, in my opinion, should be: - For copyin case: The key point is whether the error context is in a read from user memory. We do not care about the ex-type if we know its a MOV reading from userspace. is_copy_from_user() return true when both of the following two checks are true: - the current instruction is copy - source address is user memory If copy_user is true, we set m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV; Then do_machine_check() will try fixup_exception() first. - For instr case: let kill_accessing_process() return 0 to prevent a SIGBUS. - For patch 3: The return value of memory_failure() is quite important while discussed instr case regression with Tony and Miaohe for patch 2, so add comment about the return value. This patch (of 3): Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable fixup type, EX_TYPE_EFAULT_REG, and commit 4c132d1d844a ("x86/futex: Remove .fixup usage") updated the extable fixup type for copy-from-user operations, changing it from EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. The error context for copy-from-user operations no longer functions as an in-kernel recovery context. Consequently, the error context for copy-from-user operations no longer functions as an in-kernel recovery context, resulting in kernel panics with the message: "Machine check: Data load in unrecoverable area of kernel." To address this, it is crucial to identify if an error context involves a read operation from user memory. The function is_copy_from_user() can be utilized to determine: - the current operation is copy - when reading user memory When these conditions are met, is_copy_from_user() will return true, confirming that it is indeed a direct copy from user memory. This check is essential for correctly handling the context of errors in these operations without relying on the extable fixup types that previously allowed for in-kernel recovery. So, use is_copy_from_user() to determine if a context is copy user directly. Link: https://lkml.kernel.org/r/20250312112852.82415-1-xueshuai@linux.alibaba.com Link: https://lkml.kernel.org/r/20250312112852.82415-2-xueshuai@linux.alibaba.com Fixes: 4c132d1d844a ("x86/futex: Remove .fixup usage") Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>