summaryrefslogtreecommitdiff
path: root/arch/x86/include/asm
AgeCommit message (Collapse)Author
2025-04-11x86/alternatives: Rename 'apply_relocation()' to 'text_poke_apply_relocation()'Ingo Molnar
Join the text_poke_*() API namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-52-mingo@kernel.org
2025-04-11x86/alternatives: Move declarations of vmlinux.lds.S defined section symbols ↵Ingo Molnar
to <asm/alternative.h> Move it from the middle of a .c file next to the similar declarations of __alt_instructions[] et al. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-49-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'POKE_MAX_OPCODE_SIZE' to 'TEXT_POKE_MAX_OPCODE_SIZE'Ingo Molnar
Join the TEXT_POKE_ namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-47-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'text_poke_sync()' to 'smp_text_poke_sync_each_cpu()'Ingo Molnar
Unlike sync_core(), text_poke_sync() is a very heavy operation, as it sends an IPI to every online CPU in the system and waits for completion. Reflect this in the name. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-41-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'text_poke_queue()' to 'smp_text_poke_batch_add()'Ingo Molnar
This name is actively confusing as well, because the simple text_poke*() APIs use MM-switching based code patching, while text_poke_queue() is part of the INT3 based text_poke_int3_*() machinery that is an additional layer of functionality on top of regular text_poke*() functionality. Rename it to smp_text_poke_batch_add() to make it clear which layer it belongs to. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-17-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'text_poke_finish()' to 'smp_text_poke_batch_finish()'Ingo Molnar
This name is actively confusing as well, because the simple text_poke*() APIs use MM-switching based code patching, while text_poke_finish() is part of the INT3 based text_poke_int3_*() machinery that is an additional layer of functionality on top of regular text_poke*() functionality. Rename it to smp_text_poke_batch_finish() to make it clear which layer it belongs to. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-16-mingo@kernel.org
2025-04-11x86/alternatives: Update comments in int3_emulate_push()Ingo Molnar
The idtentry macro in entry_64.S hasn't had a create_gap option for 5 years - update the comment. (Also clean up the entire comment block while at it.) Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-13-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'poking_addr' to 'text_poke_mm_addr'Ingo Molnar
Put it into the text_poke_* namespace of <asm/text-patching.h>. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-10-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'poking_mm' to 'text_poke_mm'Ingo Molnar
Put it into the text_poke_* namespace of <asm/text-patching.h>. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-9-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'poke_int3_handler()' to 'smp_text_poke_int3_handler()'Ingo Molnar
All related functions in this subsystem already have a text_poke_int3_ prefix - add it to the trap handler as well. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-8-mingo@kernel.org
2025-04-11x86/alternatives: Rename 'text_poke_bp()' to 'smp_text_poke_single()'Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juergen Gross <jgross@suse.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250411054105.2341982-7-mingo@kernel.org
2025-04-10Merge tag 'x86-urgent-2025-04-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix CPU topology related regression that limited Xen PV guests to a single CPU - Fix ancient e820__register_nosave_regions() bugs that were causing problems with kexec's artificial memory maps - Fix an S4 hibernation crash caused by two missing ENDBR's that were mistakenly removed in a recent commit - Fix a resctrl serialization bug - Fix early_printk documentation and comments - Fix RSB bugs, combined with preparatory updates to better match the code to vendor recommendations. - Add RSB mitigation document - Fix/update documentation - Fix the erratum_1386_microcode[] table to be NULL terminated * tag 'x86-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ibt: Fix hibernate x86/cpu: Avoid running off the end of an AMD erratum table Documentation/x86: Zap the subsection letters Documentation/x86: Update the naming of CPU features for /proc/cpuinfo x86/bugs: Add RSB mitigation document x86/bugs: Don't fill RSB on context switch with eIBRS x86/bugs: Don't fill RSB on VMEXIT with eIBRS+retpoline x86/bugs: Fix RSB clearing in indirect_branch_prediction_barrier() x86/bugs: Use SBPB in write_ibpb() if applicable x86/bugs: Rename entry_ibpb() to write_ibpb() x86/early_printk: Use 'mmio32' for consistency, fix comments x86/resctrl: Fix rdtgroup_mkdir()'s unlocked use of kernfs_node::name x86/e820: Fix handling of subpage regions when calculating nosave ranges in e820__register_nosave_regions() x86/acpi: Don't limit CPUs to 1 for Xen PV guests due to disabled ACPI
2025-04-10Merge tag 'objtool-urgent-2025-04-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc objtool fixes from Ingo Molnar: - Remove the recently introduced ANNOTATE_IGNORE_ALTERNATIVE noise from clac()/stac() code to make .s files more readable - Fix INSN_SYSCALL / INSN_SYSRET semantics - Fix various false-positive warnings * tag 'objtool-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix false-positive "ignoring unreachables" warning objtool: Remove ANNOTATE_IGNORE_ALTERNATIVE from CLAC/STAC objtool, xen: Fix INSN_SYSCALL / INSN_SYSRET semantics objtool: Stop UNRET validation on UD2 objtool: Split INSN_CONTEXT_SWITCH into INSN_SYSCALL and INSN_SYSRET objtool: Fix INSN_CONTEXT_SWITCH handling in validate_unret()
2025-04-10x86/sev: Add SVSM vTPM probe/send_command functionsStefano Garzarella
Add two new functions to probe and send commands to the SVSM vTPM. They leverage the two calls defined by the AMD SVSM specification [1] for the vTPM protocol: SVSM_VTPM_QUERY and SVSM_VTPM_CMD. Expose snp_svsm_vtpm_send_command() to be used by a TPM driver. [1] "Secure VM Service Module for SEV-SNP Guests" Publication # 58019 Revision: 1.00 [ bp: Some doc touchups. ] Co-developed-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Co-developed-by: Claudio Carvalho <cclaudio@linux.ibm.com> Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lore.kernel.org/r/20250403100943.120738-2-sgarzare@redhat.com
2025-04-10x86/kexec: Add 8250 MMIO serial port outputDavid Woodhouse
This supports the same 32-bit MMIO-mapped 8250 as the early_printk code. It's not clear why the early_printk code supports this form and only this form; the actual runtime 8250_pci doesn't seem to support it. But having hacked up QEMU to expose such a device, early_printk does work with it, and now so does the kexec debug code. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250326142404.256980-3-dwmw2@infradead.org
2025-04-10x86/kexec: Add 8250 serial port outputDavid Woodhouse
If a serial port was configured for early_printk, use it for debug output from the relocate_kernel exception handler too. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250326142404.256980-2-dwmw2@infradead.org
2025-04-10x86/msr: Rename 'native_wrmsrl()' to 'native_wrmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl_on_cpu()' to 'wrmsrq_on_cpu()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_on_cpu()' to 'rdmsrq_on_cpu()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl_safe_on_cpu()' to 'wrmsrq_safe_on_cpu()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_safe_on_cpu()' to 'rdmsrq_safe_on_cpu()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl_safe()' to 'wrmsrq_safe()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_safe()' to 'rdmsrq_safe()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl()' to 'wrmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Standardize on 'u32' MSR indices in <asm/msr.h>Ingo Molnar
This is the customary type used for hardware ABIs. Suggested-by: Xin Li <xin@zytor.com> Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Harmonize the prototype and definition of do_trace_rdpmc()Ingo Molnar
In <asm/msr.h> the first parameter of do_trace_rdpmc() is named 'msr': extern void do_trace_rdpmc(unsigned int msr, u64 val, int failed); But in the definition it's 'counter': void do_trace_rdpmc(unsigned counter, u64 val, int failed) Use 'msr' in both cases, and change the type to u32. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Use u64 in rdmsrl_safe() and paravirt_read_pmc()Ingo Molnar
The paravirt_read_pmc() result is in fact only loaded into an u64 variable. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Standardize on u64 in <asm/msr-index.h>Ingo Molnar
Also fix some nearby whitespace damage while at it. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Standardize on u64 in <asm/msr.h>Ingo Molnar
There's 9 uses of 'unsigned long long' in <asm/msr.h>, which is really the same as 'u64', which is used 34 times. Standardize on u64. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86: Remove __FORCE_ORDER workaroundUros Bizjak
GCC PR82602 that caused invalid scheduling of volatile asms: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82602 was fixed for gcc-8.1.0, the current minimum version of the compiler required to compile the kernel. Remove workaround that prevented invalid scheduling for compilers, affected by PR82602. There were no differences between old and new kernel object file when compiled for x86_64 defconfig with gcc-8.1.0. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20250407112316.378347-1-ubizjak@gmail.com
2025-04-09x86/mm: Consolidate initmem_init()Mike Rapoport (Microsoft)
There are 4 wariants of initmem_init(), for 32 and 64 bits and for CONFIG_NUMA enabled and disabled. After commit bbeb69ce3013 ("x86/mm: Remove CONFIG_HIGHMEM64G support") NUMA is not supported on 32 bit kernels anymore, and arch/x86/mm/numa_32.c can be just deleted and setup_bootmem_allocator() with completely misleading name can be folded into initmem_init(). For 64 bits the NUMA variant calls x86_numa_init() and !NUMA variant sets all memory to node 0. The later can be split out into inline helper called x86_numa_init() and then both initmem_init() functions become the same. Split out memblock_set_node() from initmem_init() for !NUMA on 64 bit into x86_numa_init() helper and remove arch/x86/mm/numa_*.c that only contained initmem_init() variants for NUMA configs. Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Len Brown <len.brown@intel.com> Link: https://lore.kernel.org/r/20250409122815.420041-1-rppt@kernel.org
2025-04-09Merge tag 'v6.15-rc1' into x86/mm, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-09x86/uaccess: Predict valid_user_address() returning trueMateusz Guzik
This works around what seems to be an optimization bug in GCC (at least 13.3.0), where it predicts access_ok() to fail despite the hint to the contrary. _copy_to_user() contains: if (access_ok(to, n)) { instrument_copy_to_user(to, from, n); n = raw_copy_to_user(to, from, n); } Where access_ok() is likely(__access_ok(addr, size)), yet the compiler emits conditional jumps forward for the case where it succeeds: <+0>: endbr64 <+4>: mov %rdx,%rcx <+7>: mov %rdx,%rax <+10>: xor %edx,%edx <+12>: add %rdi,%rcx <+15>: setb %dl <+18>: movabs $0x123456789abcdef,%r8 <+28>: test %rdx,%rdx <+31>: jne 0xffffffff81b3b7c6 <_copy_to_user+38> <+33>: cmp %rcx,%r8 <+36>: jae 0xffffffff81b3b7cb <_copy_to_user+43> <+38>: jmp 0xffffffff822673e0 <__x86_return_thunk> <+43>: nop <+44>: nop <+45>: nop <+46>: mov %rax,%rcx <+49>: rep movsb %ds:(%rsi),%es:(%rdi) <+51>: nop <+52>: nop <+53>: nop <+54>: mov %rcx,%rax <+57>: nop <+58>: nop <+59>: nop <+60>: jmp 0xffffffff822673e0 <__x86_return_thunk> Patching _copy_to_user() to likely() around the access_ok() use does not change the asm. However, spelling out the prediction *within* valid_user_address() does the trick: <+0>: endbr64 <+4>: xor %eax,%eax <+6>: mov %rdx,%rcx <+9>: add %rdi,%rdx <+12>: setb %al <+15>: movabs $0x123456789abcdef,%r8 <+25>: test %rax,%rax <+28>: jne 0xffffffff81b315e6 <_copy_to_user+54> <+30>: cmp %rdx,%r8 <+33>: jb 0xffffffff81b315e6 <_copy_to_user+54> <+35>: nop <+36>: nop <+37>: nop <+38>: rep movsb %ds:(%rsi),%es:(%rdi) <+40>: nop <+41>: nop <+42>: nop <+43>: nop <+44>: nop <+45>: nop <+46>: mov %rcx,%rax <+49>: jmp 0xffffffff82255ba0 <__x86_return_thunk> <+54>: mov %rcx,%rax <+57>: jmp 0xffffffff82255ba0 <__x86_return_thunk> Since we kinda expect valid_user_address() to be likely anyway, add the likely() annotation that also happens to work around this compiler bug. [ mingo: Moved the unlikely() branch into valid_user_address() & updated the changelog ] Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250401203029.1132135-1-mjguzik@gmail.com
2025-04-09Merge tag 'v6.15-rc1' into x86/asm, to refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-09x86/bugs: Fix RSB clearing in indirect_branch_prediction_barrier()Josh Poimboeuf
IBPB is expected to clear the RSB. However, if X86_BUG_IBPB_NO_RET is set, that doesn't happen. Make indirect_branch_prediction_barrier() take that into account by calling write_ibpb() which clears RSB on X86_BUG_IBPB_NO_RET: /* Make sure IBPB clears return stack preductions too. */ FILL_RETURN_BUFFER %rax, RSB_CLEAR_LOOPS, X86_BUG_IBPB_NO_RET Note that, as of the previous patch, write_ibpb() also reads 'x86_pred_cmd' in order to use SBPB when applicable: movl _ASM_RIP(x86_pred_cmd), %eax Therefore that existing behavior in indirect_branch_prediction_barrier() is not lost. Fixes: 50e4b3b94090 ("x86/entry: Have entry_ibpb() invalidate return predictions") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/bba68888c511743d4cd65564d1fc41438907523f.1744148254.git.jpoimboe@kernel.org
2025-04-09x86/bugs: Rename entry_ibpb() to write_ibpb()Josh Poimboeuf
There's nothing entry-specific about entry_ibpb(). In preparation for calling it from elsewhere, rename it to write_ibpb(). Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/1e54ace131e79b760de3fe828264e26d0896e3ac.1744148254.git.jpoimboe@kernel.org
2025-04-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk stage-1 page tables) to align with the architecture. This avoids possibly taking an SEA at EL2 on the page table walk or using an architecturally UNKNOWN fault IPA - Use acquire/release semantics in the KVM FF-A proxy to avoid reading a stale value for the FF-A version - Fix KVM guest driver to match PV CPUID hypercall ABI - Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM selftests, which is the only memory type for which atomic instructions are architecturally guaranteed to work s390: - Don't use %pK for debug printing and tracepoints x86: - Use a separate subclass when acquiring KVM's per-CPU posted interrupts wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list of vCPUs to wake, to workaround a false positive deadlock. The schedule out code runs with a scheduler lock that the wakeup handler takes in the opposite order; but it does so with IRQs disabled and cannot run concurrently with a wakeup - Explicitly zero-initialize on-stack CPUID unions - Allow building irqbypass.ko as as module when kvm.ko is a module - Wrap relatively expensive sanity check with KVM_PROVE_MMU - Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses selftests: - Add more scenarios to the MONITOR/MWAIT test - Add option to rseq test to override /dev/cpu_dma_latency - Bring list of exit reasons up to date - Cleanup Makefile to list once tests that are valid on all architectures Other: - Documentation fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits) KVM: arm64: Use acquire/release to communicate FF-A version negotiation KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable KVM: arm64: selftests: Introduce and use hardware-definition macros KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list KVM: x86: Explicitly zero-initialize on-stack CPUID unions KVM: Allow building irqbypass.ko as as module when kvm.ko is a module KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses Documentation: kvm: remove KVM_CAP_MIPS_TE Documentation: kvm: organize capabilities in the right section Documentation: kvm: fix some definition lists Documentation: kvm: drop "Capability" heading from capabilities Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE selftests: kvm: list once tests that are valid on all architectures selftests: kvm: bring list of exit reasons up to date selftests: kvm: revamp MONITOR/MWAIT tests KVM: arm64: Don't translate FAR if invalid/unsafe ...
2025-04-08objtool: Remove ANNOTATE_IGNORE_ALTERNATIVE from CLAC/STACJosh Poimboeuf
ANNOTATE_IGNORE_ALTERNATIVE adds additional noise to the code generated by CLAC/STAC alternatives, hurting readability for those whose read uaccess-related code generation on a regular basis. Remove the annotation specifically for the "NOP patched with CLAC/STAC" case in favor of a manual check. Leave the other uses of that annotation in place as they're less common and more difficult to detect. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/fc972ba4995d826fcfb8d02733a14be8d670900b.1744098446.git.jpoimboe@kernel.org
2025-04-08perf/x86/intel: Support auto counter reloadKan Liang
The relative rates among two or more events are useful for performance analysis, e.g., a high branch miss rate may indicate a performance issue. Usually, the samples with a relative rate that exceeds some threshold are more useful. However, the traditional sampling takes samples of events separately. To get the relative rates among two or more events, a high sample rate is required, which can bring high overhead. Many samples taken in the non-hotspot area are also dropped (useless) in the post-process. The auto counter reload (ACR) feature takes samples when the relative rate of two or more events exceeds some threshold, which provides the fine-grained information at a low cost. To support the feature, two sets of MSRs are introduced. For a given counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s) can cause a reload of that counter. The reload value is stored in the IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C. The details can be found at Intel SDM (085), Volume 3, 21.9.11 Auto Counter Reload. In the hw_config(), an ACR event is specially configured, because the cause/reloadable counter mask has to be applied to the dyn_constraint. Besides the HW limit, e.g., not support perf metrics, PDist and etc, a SW limit is applied as well. ACR events in a group must be contiguous. It facilitates the later conversion from the event idx to the counter idx. Otherwise, the intel_pmu_acr_late_setup() has to traverse the whole event list again to find the "cause" event. Also, add a new flag PERF_X86_EVENT_ACR to indicate an ACR group, which is set to the group leader. The late setup() is also required for an ACR group. It's to convert the event idx to the counter idx, and saved it in hw.config1. The ACR configuration MSRs are only updated in the enable_event(). The disable_event() doesn't clear the ACR CFG register. Add acr_cfg_b/acr_cfg_c in the struct cpu_hw_events to cache the MSR values. It can avoid a MSR write if the value is not changed. Expose an acr_mask to the sysfs. The perf tool can utilize the new format to configure the relation of events in the group. The bit sequence of the acr_mask follows the events enabled order of the group. Example: Here is the snippet of the mispredict.c. Since the array has a random numbers, jumps are random and often mispredicted. The mispredicted rate depends on the compared value. For the Loop1, ~11% of all branches are mispredicted. For the Loop2, ~21% of all branches are mispredicted. main() { ... for (i = 0; i < N; i++) data[i] = rand() % 256; ... /* Loop 1 */ for (k = 0; k < 50; k++) for (i = 0; i < N; i++) if (data[i] >= 64) sum += data[i]; ... ... /* Loop 2 */ for (k = 0; k < 50; k++) for (i = 0; i < N; i++) if (data[i] >= 128) sum += data[i]; ... } Usually, a code with a high branch miss rate means a bad performance. To understand the branch miss rate of the codes, the traditional method usually samples both branches and branch-misses events. E.g., perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}" -c 1000000 -- ./mispredict [ perf record: Woken up 4 times to write data ] [ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ] The 5106 samples are from both events and spread in both Loops. In the post-process stage, a user can know that the Loop 2 has a 21% branch miss rate. Then they can focus on the samples of branch-misses events for the Loop 2. With this patch, the user can generate the samples only when the branch miss rate > 20%. For example, perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu, cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}" -- ./mispredict (Two different periods are applied to branch-misses and branch-instructions. The ratio is set to 20%. If the branch-instructions is overflowed first, the branch-miss rate < 20%. No samples should be generated. All counters should be automatically reloaded. If the branch-misses is overflowed first, the branch-miss rate > 20%. A sample triggered by the branch-misses event should be generated. Just the counter of the branch-instructions should be automatically reloaded. The branch-misses event should only be automatically reloaded when the branch-instructions is overflowed. So the "cause" event is the branch-instructions event. The acr_mask is set to 0x2, since the event index in the group of branch-instructions is 1. The branch-instructions event is automatically reloaded no matter which events are overflowed. So the "cause" events are the branch-misses and the branch-instructions event. The acr_mask should be set to 0x3.) [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ] $perf report Percent │154: movl $0x0,-0x14(%rbp) │ ↓ jmp 1af │ for (i = j; i < N; i++) │15d: mov -0x10(%rbp),%eax │ mov %eax,-0x18(%rbp) │ ↓ jmp 1a2 │ if (data[i] >= 128) │165: mov -0x18(%rbp),%eax │ cltq │ lea 0x0(,%rax,4),%rdx │ mov -0x8(%rbp),%rax │ add %rdx,%rax │ mov (%rax),%eax │ ┌──cmp $0x7f,%eax 100.00 0.00 │ ├──jle 19e │ │sum += data[i]; The 2498 samples are all from the branch-misses events for the Loop 2. The number of samples and overhead is significantly reduced without losing any information. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://lkml.kernel.org/r/20250327195217.2683619-6-kan.liang@linux.intel.com
2025-04-08perf/x86/intel: Add CPUID enumeration for the auto counter reloadKan Liang
The counters that support the auto counter reload feature can be enumerated in the CPUID Leaf 0x23 sub-leaf 0x2. Add acr_cntr_mask to store the mask of counters which are reloadable. Add acr_cause_mask to store the mask of counters which can cause reload. Since the e-core and p-core may have different numbers of counters, track the masks in the struct x86_hybrid_pmu as well. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://lkml.kernel.org/r/20250327195217.2683619-5-kan.liang@linux.intel.com
2025-04-07Merge branch 'kvm-tdx-initial' into HEADPaolo Bonzini
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode). The series is in turn split into multiple sub-series, each with a separate merge commit: - Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs. - MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT. - vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state. - Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP. - Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode") - Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-07Merge branch 'kvm-6.15-rc2-fixes' into HEADPaolo Bonzini
2025-04-06x86/boot/compressed: Merge the local pgtable.h include into <asm/boot.h>Ard Biesheuvel
Merge the local include "pgtable.h" -which declares the API of the 5-level paging trampoline- into <asm/boot.h> so that its implementation in la57toggle.S as well as the calling code can be decoupled from the traditional decompressor. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250401133416.1436741-9-ardb+git@google.com
2025-04-06x86/boot: Use __ALIGN_KERNEL_MASK() instead of open coded analogueAndy Shevchenko
LOAD_PHYSICAL_ADDR is calculated as an aligned (up) CONFIG_PHYSICAL_START with the respective alignment value CONFIG_PHYSICAL_ALIGN. However, the code is written openly while we have __ALIGN_KERNEL_MASK() macro that does the same. This macro has nothing special, that's why it may be used in assembler code or linker scripts (on the contrary __ALIGN_KERNEL() may not). Do it so. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250404165303.3657139-1-andriy.shevchenko@linux.intel.com
2025-04-04Merge tag 'x86-urgent-2025-04-04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix a performance regression on AMD iGPU and dGPU drivers, related to the unintended activation of DMA bounce buffers that regressed game performance if KASLR disturbed things just enough - Fix a copy_user_generic() performance regression on certain older non-FSRM/ERMS CPUs - Fix a Clang build warning due to a semantic merge conflict the Kunit tree generated with the x86 tree - Fix FRED related system hang during S4 resume - Remove an unused API * tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fred: Fix system hang during S4 resume with FRED enabled x86/platform/iosf_mbi: Remove unused iosf_mbi_unregister_pmic_bus_access_notifier() x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers x86/tools: Drop duplicate unlikely() definition in insn_decoder_test.c x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
2025-04-04KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMUSean Christopherson
Wrap the TDP MMU page counter in CONFIG_KVM_PROVE_MMU so that the sanity check is omitted from production builds, and more importantly to remove the atomic accesses to account pages. A one-off memory leak in production is relatively uninteresting, and a WARN_ON won't help mitigate a systemic issue; it's as much about helping triage memory leaks as it is about detecting them in the first place, and doesn't magically stop the leaks. I.e. production environments will be quite sad if a severe KVM bug escapes, regardless of whether or not KVM WARNs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250315023448.2358456-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-03x86/idle: Use MONITOR and MWAIT mnemonics in <asm/mwait.h>Uros Bizjak
Current minimum required version of binutils is 2.25, which supports MONITOR and MWAIT instruction mnemonics. Replace the byte-wise specification of MONITOR and MWAIT with these proper mnemonics. No functional change intended. Note: LLVM assembler is not able to assemble correct forms of MONITOR and MWAIT instructions with explicit operands and reports: error: invalid operand for instruction monitor %rax,%ecx,%edx ^~~~ # https://lore.kernel.org/oe-kbuild-all/202504030802.2lEVBSpN-lkp@intel.com/ Use instruction mnemonics with implicit operands to work around this issue. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250403125111.429805-1-ubizjak@gmail.com
2025-04-03x86/idle: Change arguments of mwait_idle_with_hints() to u32Uros Bizjak
All functions in mwait_idle_with_hints() cast eax and ecx arguments to u32. Propagate argument type to the enclosing function. Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250403073105.245987-1-ubizjak@gmail.com
2025-04-03x86/idle: Remove CONFIG_AS_TPAUSEUros Bizjak
There is not much point in CONFIG_AS_TPAUSE at all when the emitted assembly is always the same - it only obfuscates the __tpause() code in essence. Remove the TPAUSE insn mnemonic from __tpause() and leave only the equivalent byte-wise definition. This can then be changed back to insn mnemonic once binutils 2.31.1 is the minimum version to build the kernel. (Right now it's 2.25.) Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Rik van Riel <riel@surriel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250402180827.3762-4-ubizjak@gmail.com