summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2024-11-16Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff
2024-11-15x86/efi: Apply EFI Memory Attributes after kexecNicolas Saenz Julienne
Kexec bypasses EFI's switch to virtual mode. In exchange, it has its own routine, kexec_enter_virtual_mode(), which replays the mappings made by the original kernel. Unfortunately, that function fails to reinstate EFI's memory attributes, which would've otherwise been set after entering virtual mode. Remediate this by calling efi_runtime_update_mappings() within kexec's routine. Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-11-15x86/efi: Drop support for the EFI_PROPERTIES_TABLENicolas Saenz Julienne
Drop support for the EFI_PROPERTIES_TABLE. It was a failed, short-lived experiment that broke the boot both on Linux and Windows, and was replaced by the EFI_MEMORY_ATTRIBUTES_TABLE shortly after. Suggested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2024-11-15crypto: aesni - Move back to module_initHerbert Xu
This patch reverts commit 0fbafd06bdde938884f7326548d3df812b267c3c ("crypto: aesni - fix failing setkey for rfc4106-gcm-aesni") by moving the aesni init function back to module_init from late_initcall. The original patch was needed because tests were synchronous. This is no longer the case so there is no need to postpone the registration. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-11-14crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32Dave Vasilevsky
Fixes boot failures on 6.9 on PPC_BOOK3S_32 machines using Open Firmware. On these machines, the kernel refuses to boot from non-zero PHYSICAL_START, which occurs when CRASH_DUMP is on. Since most PPC_BOOK3S_32 machines boot via Open Firmware, it should default to off for them. Users booting via some other mechanism can still turn it on explicitly. Does not change the default on any other architectures for the time being. Link: https://lkml.kernel.org/r/20240917163720.1644584-1-dave@vasilevsky.ca Fixes: 75bc255a7444 ("crash: clean up kdump related config items") Signed-off-by: Dave Vasilevsky <dave@vasilevsky.ca> Reported-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Closes: https://lists.debian.org/debian-powerpc/2024/07/msg00001.html Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Baoquan He <bhe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14KVM: x86: switch hugepage recovery thread to vhost_taskPaolo Bonzini
kvm_vm_create_worker_thread() is meant to be used for kthreads that can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory). Therefore it wants to charge the CPU time consumed by that work to the VM's container. However, because of these threads, cgroups which have kvm instances inside never complete freezing. This can be trivially reproduced: root@test ~# mkdir /sys/fs/cgroup/test root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs root@test ~# qemu-system-x86_64 -nographic -enable-kvm and in another terminal: root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze root@test ~# cat /sys/fs/cgroup/test/cgroup.events populated 1 frozen 0 The cgroup freezing happens in the signal delivery path but kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never calls into the signal delivery path and thus never gets frozen. Because the cgroup freezer determines whether a given cgroup is frozen by comparing the number of frozen threads to the total number of threads in the cgroup, the cgroup never becomes frozen and users waiting for the state transition may hang indefinitely. Since the worker kthread is tied to a user process, it's better if it behaves similarly to user tasks as much as possible, including being able to send SIGSTOP and SIGCONT. In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted to be and more: not only it inherits the userspace process's cgroups, it has other niceties like being parented properly in the process tree. Use it instead of the homegrown alternative. Incidentally, the new code is also better behaved when you flip recovery back and forth to disabled and back to enabled. If your recovery period is 1 minute, it will run the next recovery after 1 minute independent of how many times you flipped the parameter. (Commit message based on emails from Tejun). Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Luca Boccassi <bluca@debian.org> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Luca Boccassi <bluca@debian.org> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-14Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', ↵Catalin Marinas
'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core * arm64/for-next/perf: perf: Switch back to struct platform_driver::remove() perf: arm_pmuv3: Add support for Samsung Mongoose PMU dt-bindings: arm: pmu: Add Samsung Mongoose core compatible perf/dwc_pcie: Fix typos in event names perf/dwc_pcie: Add support for Ampere SoCs ARM: pmuv3: Add missing write_pmuacr() perf/marvell: Marvell PEM performance monitor support perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control perf/dwc_pcie: Convert the events with mixed case to lowercase perf/cxlpmu: Support missing events in 3.1 spec perf: imx_perf: add support for i.MX91 platform dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible drivers perf: remove unused field pmu_node * for-next/gcs: (42 commits) : arm64 Guarded Control Stack user-space support kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c arm64/gcs: Fix outdated ptrace documentation kselftest/arm64: Ensure stable names for GCS stress test results kselftest/arm64: Validate that GCS push and write permissions work kselftest/arm64: Enable GCS for the FP stress tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Add GCS signal tests kselftest/arm64: Add test coverage for GCS mode locking kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add very basic GCS test program kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Verify the GCS hwcap arm64: Add Kconfig for Guarded Control Stack (GCS) arm64/ptrace: Expose GCS via ptrace and core files arm64/signal: Expose GCS state in signal frames arm64/signal: Set up and restore the GCS context for signal handlers arm64/mm: Implement map_shadow_stack() ... * for-next/probes: : Various arm64 uprobes/kprobes cleanups arm64: insn: Simulate nop instruction for better uprobe performance arm64: probes: Remove probe_opcode_t arm64: probes: Cleanup kprobes endianness conversions arm64: probes: Move kprobes-specific fields arm64: probes: Fix uprobes for big-endian kernels arm64: probes: Fix simulate_ldr*_literal() arm64: probes: Remove broken LDR (literal) uprobe support * for-next/asm-offsets: : arm64 asm-offsets.c cleanup (remove unused offsets) arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE arm64: asm-offsets: remove VM_EXEC and PAGE_SZ arm64: asm-offsets: remove MM_CONTEXT_ID arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET arm64: asm-offsets: remove VMA_VM_* arm64: asm-offsets: remove TSK_ACTIVE_MM * for-next/tlb: : TLB flushing optimisations arm64: optimize flush tlb kernel range arm64: tlbflush: add __flush_tlb_range_limit_excess() * for-next/misc: : Miscellaneous patches arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled arm64/ptrace: Clarify documentation of VL configuration via ptrace acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() arm64: uprobes: Optimize cache flushes for xol slot acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block() arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings arm64/mm: Sanity check PTE address before runtime P4D/PUD folding arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont() ACPI: GTDT: Tighten the check for the array of platform timer structures arm64/fpsimd: Fix a typo arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers arm64: Return early when break handler is found on linked-list arm64/mm: Re-organize arch_make_huge_pte() arm64/mm: Drop _PROT_SECT_DEFAULT arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV arm64: head: Drop SWAPPER_TABLE_SHIFT arm64: cpufeature: add POE to cpucap_is_possible() arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t * for-next/mte: : Various MTE improvements selftests: arm64: add hugetlb mte tests hugetlb: arm64: add mte support * for-next/sysreg: : arm64 sysreg updates arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09 * for-next/stacktrace: : arm64 stacktrace improvements arm64: preserve pt_regs::stackframe during exec*() arm64: stacktrace: unwind exception boundaries arm64: stacktrace: split unwind_consume_stack() arm64: stacktrace: report recovered PCs arm64: stacktrace: report source of unwind data arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk() arm64: use a common struct frame_record arm64: pt_regs: swap 'unused' and 'pmr' fields arm64: pt_regs: rename "pmr_save" -> "pmr" arm64: pt_regs: remove stale big-endian layout arm64: pt_regs: assert pt_regs is a multiple of 16 bytes * for-next/hwcap3: : Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4) arm64: Support AT_HWCAP3 binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4 * for-next/kselftest: (30 commits) : arm64 kselftest fixes/cleanups kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() kselftest/arm64: Corrupt P0 in the irritator when testing SSVE kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests kselftest/arm64: Fix build with stricter assemblers kselftest/arm64: Test signal handler state modification in fp-stress kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test kselftest/arm64: Implement irritators for ZA and ZT kselftest/arm64: Remove unused ADRs from irritator handlers kselftest/arm64: Correct misleading comments on fp-stress irritators kselftest/arm64: Poll less often while waiting for fp-stress children kselftest/arm64: Increase frequency of signal delivery in fp-stress kselftest/arm64: Fix encoding for SVE B16B16 test ... * for-next/crc32: : Optimise CRC32 using PMULL instructions arm64/crc32: Implement 4-way interleave using PMULL arm64/crc32: Reorganize bit/byte ordering macros arm64/lib: Handle CRC-32 alternative in C code * for-next/guest-cca: : Support for running Linux as a guest in Arm CCA arm64: Document Arm Confidential Compute virt: arm-cca-guest: TSM_REPORT support for realms arm64: Enable memory encrypt for Realms arm64: mm: Avoid TLBI when marking pages as valid arm64: Enforce bounce buffers for realm DMA efi: arm64: Map Device with Prot Shared arm64: rsi: Map unprotected MMIO as decrypted arm64: rsi: Add support for checking whether an MMIO is protected arm64: realm: Query IPA size from the RMM arm64: Detect if in a realm and set RIPAS RAM arm64: rsi: Add RSI definitions * for-next/haft: : Support for arm64 FEAT_HAFT arm64: pgtable: Warn unexpected pmdp_test_and_clear_young() arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG arm64: Add support for FEAT_HAFT arm64: setup: name 'tcr2' register arm64/sysreg: Update ID_AA64MMFR1_EL1 register * for-next/scs: : Dynamic shadow call stack fixes arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
2024-11-14Merge tag 'loongarch-kvm-6.13' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.13 1. Add iocsr and mmio bus simulation in kernel. 2. Add in-kernel interrupt controller emulation. 3. Add virt extension support for eiointc irqchip.
2024-11-14perf/core: Correct perf sampling with guest VMsColton Lewis
Previously any PMU overflow interrupt that fired while a VCPU was loaded was recorded as a guest event whether it truly was or not. This resulted in nonsense perf recordings that did not honor perf_event_attr.exclude_guest and recorded guest IPs where it should have recorded host IPs. Rework the sampling logic to only record guest samples for events with exclude_guest = 0. This way any host-only events with exclude_guest set will never see unexpected guest samples. The behaviour of events with exclude_guest = 0 is unchanged. Note that events configured to sample both host and guest may still misattribute a PMI that arrived in the host as a guest event depending on KVM arch and vendor behavior. Signed-off-by: Colton Lewis <coltonlewis@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241113190156.2145593-6-coltonlewis@google.com
2024-11-14perf/x86: Refactor misc flag assignmentsColton Lewis
Break the assignment logic for misc flags into their own respective functions to reduce the complexity of the nested logic. Signed-off-by: Colton Lewis <coltonlewis@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20241113190156.2145593-5-coltonlewis@google.com
2024-11-14perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()Colton Lewis
For clarity, rename the arch-specific definitions of these functions to perf_arch_* to denote they are arch-specifc. Define the generic-named functions in one place where they can call the arch-specific ones as needed. Signed-off-by: Colton Lewis <coltonlewis@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20241113190156.2145593-3-coltonlewis@google.com
2024-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. In particular to bring the fix in commit aa30eb3260b2 ("bpf: Force checkpoint when jmp history is too long"). The follow up verifier work depends on it. And the fix in commit 6801cf7890f2 ("selftests/bpf: Use -4095 as the bad address for bits iterator"). It's fixing instability of BPF CI on s390 arch. No conflicts. Adjacent changes in: Auto-merging arch/Kconfig Auto-merging kernel/bpf/helpers.c Auto-merging kernel/bpf/memalloc.c Auto-merging kernel/bpf/verifier.c Auto-merging mm/slab_common.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-13KVM: x86: expose MSR_PLATFORM_INFO as a feature MSRPaolo Bonzini
For userspace that wants to disable KVM_X86_QUIRK_STUFF_FEATURE_MSRS, it is useful to know what bits can be set to 1 in MSR_PLATFORM_INFO (apart from the TSC ratio). The right way to do that is via /dev/kvm's feature MSR mechanism. In fact, MSR_PLATFORM_INFO is already a feature MSR for the purpose of blocking updates after the vCPU is run, but KVM_GET_MSRS did not return a valid value for it. Just like in a VM that leaves KVM_X86_QUIRK_STUFF_FEATURE_MSRS enabled, the TSC ratio field is left to 0. Only bit 31 is set. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-13x86: KVM: Advertise CPUIDs for new instructions in Clearwater ForestTao Su
Latest Intel platform Clearwater Forest has introduced new instructions enumerated by CPUIDs of SHA512, SM3, SM4 and AVX-VNNI-INT16. Advertise these CPUIDs to userspace so that guests can query them directly. SHA512, SM3 and SM4 are on an expected-dense CPUID leaf and some other bits on this leaf have kernel usages. Considering they have not truly kernel usages, hide them in /proc/cpuinfo. These new instructions only operate in xmm, ymm registers and have no new VMX controls, so there is no additional host enabling required for guests to use these instructions, i.e. advertising these CPUIDs to userspace is safe. Tested-by: Jiaan Lu <jiaan.lu@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Signed-off-by: Tao Su <tao1.su@linux.intel.com> Message-ID: <20241105054825.870939-1-tao1.su@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-13x86/mm: Fix a kdump kernel failure on SME system when CONFIG_IMA_KEXEC=yBaoquan He
The kdump kernel is broken on SME systems with CONFIG_IMA_KEXEC=y enabled. Debugging traced the issue back to b69a2afd5afc ("x86/kexec: Carry forward IMA measurement log on kexec"). Testing was previously not conducted on SME systems with CONFIG_IMA_KEXEC enabled, which led to the oversight, with the following incarnation: ... ima: No TPM chip found, activating TPM-bypass! Loading compiled-in module X.509 certificates Loaded X.509 cert 'Build time autogenerated kernel key: 18ae0bc7e79b64700122bb1d6a904b070fef2656' ima: Allocated hash algorithm: sha256 Oops: general protection fault, probably for non-canonical address 0xcfacfdfe6660003e: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-rc2+ #14 Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.20.0 05/03/2023 RIP: 0010:ima_restore_measurement_list Call Trace: <TASK> ? show_trace_log_lvl ? show_trace_log_lvl ? ima_load_kexec_buffer ? __die_body.cold ? die_addr ? exc_general_protection ? asm_exc_general_protection ? ima_restore_measurement_list ? vprintk_emit ? ima_load_kexec_buffer ima_load_kexec_buffer ima_init ? __pfx_init_ima init_ima ? __pfx_init_ima do_one_initcall do_initcalls ? __pfx_kernel_init kernel_init_freeable kernel_init ret_from_fork ? __pfx_kernel_init ret_from_fork_asm </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- ... Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Rebooting in 10 seconds.. Adding debug printks showed that the stored addr and size of ima_kexec buffer are not decrypted correctly like: ima: ima_load_kexec_buffer, buffer:0xcfacfdfe6660003e, size:0xe48066052d5df359 Three types of setup_data info — SETUP_EFI, - SETUP_IMA, and - SETUP_RNG_SEED are passed to the kexec/kdump kernel. Only the ima_kexec buffer experienced incorrect decryption. Debugging identified a bug in early_memremap_is_setup_data(), where an incorrect range calculation occurred due to the len variable in struct setup_data ended up only representing the length of the data field, excluding the struct's size, and thus leading to miscalculation. Address a similar issue in memremap_is_setup_data() while at it. [ bp: Heavily massage. ] Fixes: b3c72fc9a78e ("x86/boot: Introduce setup_indirect") Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20240911081615.262202-3-bhe@redhat.com
2024-11-13Merge branch 'kvm-docs-6.13' into HEADPaolo Bonzini
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect references to non-existing ioctls - List registers supported by KVM_GET/SET_ONE_REG on s390 - Use rST internal links - Reorganize the introduction to the API document
2024-11-13Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.13 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe, and harden the cache accessors to try to prevent similar issues from occuring in the future. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups
2024-11-13Merge tag 'kvm-x86-vmx-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX change for 6.13 - Remove __invept()'s unused @gpa param, which was left behind when KVM dropped code for invalidating a specific GPA (Intel never officially documented support for single-address INVEPT; presumably pre-production CPUs supported it at some point).
2024-11-13Merge tag 'kvm-x86-mmu-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.13 - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve documentation, harden against unexpected changes, and to simplify A/D-disabled MMUs by using the hardware-defined A/D bits to track if a PFN is Accessed and/or Dirty. - Elide TLB flushes when aging SPTEs, as has been done in x86's primary MMU for over 10 years. - Batch TLB flushes when zapping collapsible TDP MMU SPTEs, i.e. when dirty logging is toggled off, which reduces the time it takes to disable dirty logging by ~3x. - Recover huge pages in-place in the TDP MMU instead of zapping the SP and waiting until the page is re-accessed to create a huge mapping. Proactively installing huge pages can reduce vCPU jitter in extreme scenarios. - Remove support for (poorly) reclaiming page tables in shadow MMUs via the primary MMU's shrinker interface.
2024-11-13x86/mm/tlb: Put cpumask_test_cpu() check in switch_mm_irqs_off() under ↵Rik van Riel
CONFIG_DEBUG_VM On a web server workload, the cpumask_test_cpu() inside the WARN_ON_ONCE() in the 'prev == next branch' takes about 17% of all the CPU time of switch_mm_irqs_off(). On a large fleet, this WARN_ON_ONCE() has not fired in at least a month, possibly never. Move this test under CONFIG_DEBUG_VM so it does not get compiled in production kernels. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20241109003727.3958374-4-riel@surriel.com
2024-11-12bpf, x86: Propagate tailcall info only for subprogsLeon Hwang
In x64 JIT, propagate tailcall info only for subprogs, not for helpers or kfuncs. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20241107134529.8602-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf, x86: Support private stack in jitYonghong Song
Private stack is allocated in function bpf_int_jit_compile() with alignment 8. Private stack allocation size includes the stack size determined by verifier and additional space to protect stack overflow and underflow. See below an illustration: ---> memory address increasing [8 bytes to protect overflow] [normal stack] [8 bytes to protect underflow] If overflow/underflow is detected, kernel messages will be emited in dmesg like BPF private stack overflow/underflow detected for prog Fx BPF Private stack overflow/underflow detected for prog bpf_prog_a41699c234a1567a_subprog1x Those messages are generated when I made some changes to jitted code to intentially cause overflow for some progs. For the jited prog, The x86 register 9 (X86_REG_R9) is used to replace bpf frame register (BPF_REG_10). The private stack is used per subprog per cpu. The X86_REG_R9 is saved and restored around every func call (not including tailcall) to maintain correctness of X86_REG_R9. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163922.2224385-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depthYonghong Song
Refactor the code to avoid repeated usage of bpf_prog->aux->stack_depth in do_jit() func. If the private stack is used, the stack_depth will be 0 for that prog. Refactoring make it easy to adjust stack_depth. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163917.2224189-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "x86 and selftests fixes. x86: - When emulating a guest TLB flush for a nested guest, flush vpid01, not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if L2 and L1 are sharing VPID '0' (from L1's perspective). - Fix a bug in the SNP initialization flow where KVM would return '0' to userspace instead of -errno on failure. - Move the Intel PT virtualization (i.e. outputting host trace to host buffer and guest trace to guest buffer) behind CONFIG_BROKEN. - Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START - Fix a bug where KVM fails to inject an interrupt from the IRR after KVM_SET_LAPIC. Selftests: - Increase the timeout for the memslot performance selftest to avoid false failures on arm64 and nested x86 platforms. - Fix a goof in the guest_memfd selftest where a for-loop initialized a bit mask to zero instead of BIT(0). - Disable strict aliasing when building KVM selftests to prevent the compiler from treating things like "u64 *" to "uint64_t *" cases as undefined behavior, which can lead to nasty, hard to debug failures. - Force -march=x86-64-v2 for KVM x86 selftests if and only if the uarch is supported by the compiler. - Fix broken compilation of kvm selftests after a header sync in tools/" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN KVM: x86: Unconditionally set irr_pending when updating APICv state kvm: svm: Fix gctx page leak on invalid inputs KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB KVM: SVM: Propagate error from snp_guest_req_init() to userspace KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported KVM: selftests: Disable strict aliasing KVM: selftests: fix unintentional noop test in guest_memfd_test.c KVM: selftests: memslot_perf_test: increase guest sync timeout
2024-11-12x86/sgx: Use vmalloc_array() instead of vmalloc()Thorsten Blum
Use vmalloc_array() instead of vmalloc() to calculate the number of bytes to allocate. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20241112182633.172944-2-thorsten.blum%40linux.dev
2024-11-12platform/x86/amd/hsmp: mark hsmp_msg_desc_table[] as maybe_unusedArnd Bergmann
After the file got split, there are now W=1 warnings for users that include it without referencing hsmp_msg_desc_table: In file included from arch/x86/include/asm/amd_hsmp.h:6, from drivers/platform/x86/amd/hsmp/plat.c:12: arch/x86/include/uapi/asm/amd_hsmp.h:91:35: error: 'hsmp_msg_desc_table' defined but not used [-Werror=unused-const-variable=] 91 | static const struct hsmp_msg_desc hsmp_msg_desc_table[] = { | ^~~~~~~~~~~~~~~~~~~ Mark it as __attribute__((maybe_unused)) to shut up the warning but keep it in the file in case it is used from userland. The __maybe_unused shorthand unfortunately isn't available in userspace, so this has to be the long form. While it is not envisioned a normal userspace program could benefit from having this table as part of UAPI, it seems there is non-zero chance this array is used by some userspace tests so it is retained for now (see the Link below). Fixes: e47c018a0ee6 ("platform/x86/amd/hsmp: Move platform device specific code to plat.c") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/platform-driver-x86/CAPhsuW7mDRswhVjYf+4iinO+sph_rQ1JykEof+apoiSOVwOXXQ@mail.gmail.com/ Link: https://lore.kernel.org/r/20241028163553.2452486-1-arnd@kernel.org Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-11-12x86/cpu: Remove redundant CONFIG_NUMA guard around numa_add_cpu()Shivank Garg
Remove unnecessary CONFIG_NUMA #ifdef around numa_add_cpu() since the function is already properly handled in <asm/numa.h> for both NUMA and non-NUMA configurations. For !CONFIG_NUMA builds, numa_add_cpu() is defined as an empty function. Simplify the code without any functionality change. Testing: Build CONFIG_NUMA=n Signed-off-by: Shivank Garg <shivankg@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241112072346.428623-1-shivankg@amd.com
2024-11-11x86/platform/intel-mid: Replace deprecated PCI functionsPhilipp Stanner
pcim_iomap_table() and pcim_request_regions() have been deprecated in e354bb84a4c1 ("PCI: Deprecate pcim_iomap_table(), pcim_iomap_regions_request_all()") and d140f80f6035 ("PCI: Deprecate pcim_iomap_regions() in favor of pcim_iomap_region()"), respectively. Replace these functions with pcim_iomap_region(). Additionally, pass the actual driver name to pcim_iomap_region() instead of the previous pci_name(), since the @name parameter should always reflect which driver owns a region. Signed-off-by: Philipp Stanner <pstanner@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20241111103602.16615-2-pstanner@redhat.com
2024-11-11perf/x86/amd/uncore: Avoid a false positive warning about snprintf ↵Jean Delvare
truncation in amd_uncore_umc_ctx_init Fix the following warning: CC [M] arch/x86/events/amd/uncore.o arch/x86/events/amd/uncore.c: In function ‘amd_uncore_umc_ctx_init’: arch/x86/events/amd/uncore.c:951:52: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size 8 [-Wformat-truncation=] snprintf(pmu->name, sizeof(pmu->name), "amd_umc_%d", index); ^~ arch/x86/events/amd/uncore.c:951:43: note: directive argument in the range [0, 2147483647] snprintf(pmu->name, sizeof(pmu->name), "amd_umc_%d", index); ^~~~~~~~~~~~ arch/x86/events/amd/uncore.c:951:4: note: ‘snprintf’ output between 10 and 19 bytes into a destination of size 16 snprintf(pmu->name, sizeof(pmu->name), "amd_umc_%d", index); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ As far as I can see, there can't be more than UNCORE_GROUP_MAX (256) groups and each group can't have more than 255 PMU, so the number printed by this %d can't exceed 65279, that's only 5 digits and would fit into the buffer. So it's a false positive warning. But we can make the compiler happy by declaring index as a 16-bit number. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sandipan Das <sandipan.das@amd.com> Link: https://lore.kernel.org/r/20241105095253.18f34b4d@endymion.delvare
2024-11-11sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.Sebastian Andrzej Siewior
Add the "Lazy" part to the comment for TIF_NEED_RESCHED_LAZY so it is not the same as TIF_NEED_RESCHED. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241106162449.sk6rDddk@linutronix.de
2024-11-08Merge tag 'kvm-riscv-6.13-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini
KVM/riscv changes for 6.13 - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side
2024-11-08x86/cpu: Make sure flag_is_changeable_p() is always being usedAndy Shevchenko
When flag_is_changeable_p() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: arch/x86/kernel/cpu/common.c:351:19: error: unused function 'flag_is_changeable_p' [-Werror,-Wunused-function] 351 | static inline int flag_is_changeable_p(u32 flag) | ^~~~~~~~~~~~~~~~~~~~ Fix this by moving core around to make sure flag_is_changeable_p() is always being used. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). While at it, fix the argument type to be unsigned long along with the local variables, although it currently only runs in 32-bit cases. Besides that, makes it return boolean instead of int. This induces the change of the returning type of have_cpuid_p() to be boolean as well. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Link: https://lore.kernel.org/all/20241108153105.1578186-1-andriy.shevchenko%40linux.intel.com
2024-11-08x86/stackprotector: Work around strict Clang TLS symbol requirementsArd Biesheuvel
GCC and Clang both implement stack protector support based on Thread Local Storage (TLS) variables, and this is used in the kernel to implement per-task stack cookies, by copying a task's stack cookie into a per-CPU variable every time it is scheduled in. Both now also implement -mstack-protector-guard-symbol=, which permits the TLS variable to be specified directly. This is useful because it will allow to move away from using a fixed offset of 40 bytes into the per-CPU area on x86_64, which requires a lot of special handling in the per-CPU code and the runtime relocation code. However, while GCC is rather lax in its implementation of this command line option, Clang actually requires that the provided symbol name refers to a TLS variable (i.e., one declared with __thread), although it also permits the variable to be undeclared entirely, in which case it will use an implicit declaration of the right type. The upshot of this is that Clang will emit the correct references to the stack cookie variable in most cases, e.g., 10d: 64 a1 00 00 00 00 mov %fs:0x0,%eax 10f: R_386_32 __stack_chk_guard However, if a non-TLS definition of the symbol in question is visible in the same compilation unit (which amounts to the whole of vmlinux if LTO is enabled), it will drop the per-CPU prefix and emit a load from a bogus address. Work around this by using a symbol name that never occurs in C code, and emit it as an alias in the linker script. Fixes: 3fb0fdb3bbe7 ("x86/stackprotector/32: Make the canary into a regular percpu variable") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Cc: stable@vger.kernel.org Link: https://github.com/ClangBuiltLinux/linux/issues/1854 Link: https://lore.kernel.org/r/20241105155801.1779119-2-brgerst@gmail.com
2024-11-08KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKENSean Christopherson
Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support for virtualizing Intel PT via guest/host mode unless BROKEN=y. There are myriad bugs in the implementation, some of which are fatal to the guest, and others which put the stability and health of the host at risk. For guest fatalities, the most glaring issue is that KVM fails to ensure tracing is disabled, and *stays* disabled prior to VM-Enter, which is necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing is enabled (enforced via a VMX consistency check). Per the SDM: If the logical processor is operating with Intel PT enabled (if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load IA32_RTIT_CTL" VM-entry control must be 0. On the host side, KVM doesn't validate the guest CPUID configuration provided by userspace, and even worse, uses the guest configuration to decide what MSRs to save/load at VM-Enter and VM-Exit. E.g. configuring guest CPUID to enumerate more address ranges than are supported in hardware will result in KVM trying to passthrough, save, and load non-existent MSRs, which generates a variety of WARNs, ToPA ERRORs in the host, a potential deadlock, etc. Fixes: f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode") Cc: stable@vger.kernel.org Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20241101185031.1799556-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: x86: Unconditionally set irr_pending when updating APICv stateSean Christopherson
Always set irr_pending (to true) when updating APICv status to fix a bug where KVM fails to set irr_pending when userspace sets APIC state and APICv is disabled, which ultimate results in KVM failing to inject the pending interrupt(s) that userspace stuffed into the vIRR, until another interrupt happens to be emulated by KVM. Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to be true if APICv is enabled, because not all vIRR updates will be visible to KVM. Hit the bug with a big hammer, even though strictly speaking KVM can scan the vIRR and set/clear irr_pending as appropriate for this specific case. The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as the shortlog suggests, deleted code that updated irr_pending. Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with with the crucial difference that kvm_apic_update_apicv() did the scan even when APICv was being *disabled*, e.g. due to an AVIC inhibition. struct kvm_lapic *apic = vcpu->arch.apic; if (vcpu->arch.apicv_active) { /* irr_pending is always true when apicv is activated. */ apic->irr_pending = true; apic->isr_count = 1; } else { apic->irr_pending = (apic_search_irr(apic) != -1); apic->isr_count = count_vectors(apic->regs + APIC_ISR); } And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm: lapic: Introduce APICv update helper function"), prior to which KVM unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed that the new virtual APIC state could have a pending IRQ. Furthermore, in addition to introducing this issue, commit 755c2bf87860 also papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv as disabled prior to searching the IRR. Waiting until KVM emulates an EOI to update irr_pending "works", but only because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of memory barriers in between. I.e. leaving irr_pending set is basically hacking around bad ordering. So, effectively revert to the pre-b26a695a1d78 behavior for state restore, even though it's sub-optimal if no IRQs are pending, in order to provide a minimal fix, but leave behind a FIXME to document the ugliness. With luck, the ordering issue will be fixed and the mess will be cleaned up in the not-too-distant future. Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Yong He <zhuangel570@gmail.com> Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241106015135.2462147-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08kvm: svm: Fix gctx page leak on invalid inputsDionna Glaze
Ensure that snp gctx page allocation is adequately deallocated on failure during snp_launch_start. Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command") CC: Sean Christopherson <seanjc@google.com> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: Borislav Petkov <bp@alien8.de> CC: Dave Hansen <dave.hansen@linux.intel.com> CC: Ashish Kalra <ashish.kalra@amd.com> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: John Allen <john.allen@amd.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: "David S. Miller" <davem@davemloft.net> CC: Michael Roth <michael.roth@amd.com> CC: Luis Chamberlain <mcgrof@kernel.org> CC: Russ Weight <russ.weight@linux.dev> CC: Danilo Krummrich <dakr@redhat.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: "Rafael J. Wysocki" <rafael@kernel.org> CC: Tianfei zhang <tianfei.zhang@intel.com> CC: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08Merge tag 'kvm-x86-fixes-6.12-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 and selftests fixes for 6.12: - Increase the timeout for the memslot performance selftest to avoid false failures on arm64 and nested x86 platforms. - Fix a goof in the guest_memfd selftest where a for-loop initialized a bit mask to zero instead of BIT(0). - Disable strict aliasing when building KVM selftests to prevent the compiler from treating things like "u64 *" to "uint64_t *" cases as undefined behavior, which can lead to nasty, hard to debug failures. - Force -march=x86-64-v2 for KVM x86 selftests if and only if the uarch is supported by the compiler. - When emulating a guest TLB flush for a nested guest, flush vpid01, not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if L2 and L1 are sharing VPID '0' (from L1's perspective). - Fix a bug in the SNP initialization flow where KVM would return '0' to userspace instead of -errno on failure.
2024-11-07bootmem: stop using page->indexMatthew Wilcox (Oracle)
Encode the type into the bottom four bits of page->private and the info into the remaining bits. Also turn the bootmem type into a named enum. [arnd@arndb.de: bootmem: add bootmem_type stub function] Link: https://lkml.kernel.org/r/20241015143802.577613-1-arnd@kernel.org [akpm@linux-foundation.org: fix build with !CONFIG_HAVE_BOOTMEM_INFO_NODE] Link: https://lore.kernel.org/oe-kbuild-all/202410090311.eaqcL7IZ-lkp@intel.com/ Link: https://lkml.kernel.org/r/20241005200121.3231142-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07x86/module: enable ROX caches for module text on 64 bitMike Rapoport (Microsoft)
Enable execmem's cache of PMD_SIZE'ed pages mapped as ROX for module text allocations on 64 bit. Link: https://lkml.kernel.org/r/20241023162711.2579610-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07x86/module: prepare module loading for ROX allocations of textMike Rapoport (Microsoft)
When module text memory will be allocated with ROX permissions, the memory at the actual address where the module will live will contain invalid instructions and there will be a writable copy that contains the actual module code. Update relocations and alternatives patching to deal with it. [rppt@kernel.org: fix writable address in cfi_rewrite_endbr()] Link: https://lkml.kernel.org/r/ZysRwR29Ji8CcbXc@kernel.org Link: https://lkml.kernel.org/r/20241023162711.2579610-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Tested-by: Nathan Chancellor <nathan@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07arch: introduce set_direct_map_valid_noflush()Mike Rapoport (Microsoft)
Add an API that will allow updates of the direct/linear map for a set of physically contiguous pages. It will be used in the following patches. Link: https://lkml.kernel.org/r/20241023162711.2579610-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07asm-generic: introduce text-patching.hMike Rapoport (Microsoft)
Several architectures support text patching, but they name the header files that declare patching functions differently. Make all such headers consistently named text-patching.h and add an empty header in asm-generic for architectures that do not support text patching. Link: https://lkml.kernel.org/r/20241023162711.2579610-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07x86/tdx: Enable CPU topology enumerationKirill A. Shutemov
TDX 1.0 defines baseline behaviour of TDX guest platform. TDX 1.0 generates a #VE when accessing topology-related CPUID leafs (0xB and 0x1F) and the X2APIC_APICID MSR. The kernel returns all zeros on CPUID topology. In practice, this means that the kernel can only boot with a plain topology. Any complications will cause problems. The ENUM_TOPOLOGY feature allows the VMM to provide topology information to the guest. Enabling the feature eliminates topology-related #VEs: the TDX module virtualizes accesses to the CPUID leafs and the MSR. Enable ENUM_TOPOLOGY if it is available. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20241104103803.195705-5-kirill.shutemov%40linux.intel.com
2024-11-07x86/tdx: Dynamically disable SEPT violations from causing #VEsKirill A. Shutemov
Memory access #VEs are hard for Linux to handle in contexts like the entry code or NMIs. But other OSes need them for functionality. There's a static (pre-guest-boot) way for a VMM to choose one or the other. But VMMs don't always know which OS they are booting, so they choose to deliver those #VEs so the "other" OSes will work. That, unfortunately has left us in the lurch and exposed to these hard-to-handle #VEs. The TDX module has introduced a new feature. Even if the static configuration is set to "send nasty #VEs", the kernel can dynamically request that they be disabled. Once they are disabled, access to private memory that is not in the Mapped state in the Secure-EPT (SEPT) will result in an exit to the VMM rather than injecting a #VE. Check if the feature is available and disable SEPT #VE if possible. If the TD is allowed to disable/enable SEPT #VEs, the ATTR_SEPT_VE_DISABLE attribute is no longer reliable. It reflects the initial state of the control for the TD, but it will not be updated if someone (e.g. bootloader) changes it before the kernel starts. Kernel must check TDCS_TD_CTLS bit to determine if SEPT #VEs are enabled or disabled. [ dhansen: remove 'return' at end of function ] Fixes: 373e715e31bf ("x86/tdx: Panic on bad configs that #VE on "private" memory access") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20241104103803.195705-4-kirill.shutemov%40linux.intel.com
2024-11-07x86/tdx: Rename tdx_parse_tdinfo() to tdx_setup()Kirill A. Shutemov
Rename tdx_parse_tdinfo() to tdx_setup() and move setting NOTIFY_ENABLES there. The function will be extended to adjust TD configuration. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20241104103803.195705-3-kirill.shutemov%40linux.intel.com
2024-11-07x86/tdx: Introduce wrappers to read and write TD metadataKirill A. Shutemov
The TDG_VM_WR TDCALL is used to ask the TDX module to change some TD-specific VM configuration. There is currently only one user in the kernel of this TDCALL leaf. More will be added shortly. Refactor to make way for more users of TDG_VM_WR who will need to modify other TD configuration values. Add a wrapper for the TDG_VM_RD TDCALL that requests TD-specific metadata from the TDX module. There are currently no users for TDG_VM_RD. Mark it as __maybe_unused until the first user appears. This is preparation for enumeration and enabling optional TD features. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20241104103803.195705-2-kirill.shutemov%40linux.intel.com
2024-11-07x86/boot: Remove unused function atou()Dr. David Alan Gilbert
I can't find any sign of atou() having been used. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240913005753.1392431-1-linux@treblig.org
2024-11-07um: fix sparse warnings in signal codeBenjamin Berg
sparse reports that various places were missing the __user tag in casts. In addition, one location was using 0 instead of NULL. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241031142017.430420-2-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-07um: fix sparse warnings from regset refactorBenjamin Berg
Some variables were not tagged with __user and another was not marked as static even though it should be. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410280655.gOlEFwdG-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202410281821.WSPsAwq7-lkp@intel.com/ Fixes: 3f17fed21491 ("um: switch to regset API and depend on XSTATE") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241031142017.430420-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-07x86/sev: Cleanup vc_handle_msr()Borislav Petkov (AMD)
Carve out the MSR_SVSM_CAA into a helper with the suggestion that upcoming future users should do the same. Rename that silly exit_info_1 into what it actually means in this function - whether the MSR access is a read or a write. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20241106172647.GAZyum1zngPDwyD2IJ@fat_crate.local