summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2024-12-05x86/boot: Move .head.text into its own output sectionArd Biesheuvel
In order to be able to double check that vmlinux is emitted without absolute symbol references in .head.text, it needs to be distinguishable from the rest of .text in the ELF metadata. So move .head.text into its own ELF section. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-15-ardb+git@google.com
2024-12-05x86/kernel: Move ENTRY_TEXT to the start of the imageArd Biesheuvel
Since commit: 7734a0f31e99 ("x86/boot: Robustify calling startup_{32,64}() from the decompressor code") it is no longer necessary for .head.text to appear at the start of the image. Since ENTRY_TEXT needs to appear PMD-aligned, it is easier to just place it at the start of the image, rather than line it up with the end of the .text section. The amount of padding required should be the same, but this arrangement also permits .head.text to be split off and emitted separately, which is needed by a subsequent change. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-14-ardb+git@google.com
2024-12-05x86/boot: Disable UBSAN in early boot codeArd Biesheuvel
The early boot code runs from a 1:1 mapping of memory, and may execute before the kernel virtual mapping is even up. This means absolute symbol references cannot be permitted in this code. UBSAN injects references to global data structures into the code, and without -fPIC, those references are emitted as absolute references to kernel virtual addresses. Accessing those will fault before the kernel virtual mapping is up, so UBSAN needs to be disabled in early boot code. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-13-ardb+git@google.com
2024-12-05x86/boot/64: Avoid intentional absolute symbol references in .head.textArd Biesheuvel
The code in .head.text executes from a 1:1 mapping and cannot generally refer to global variables using their kernel virtual addresses. However, there are some occurrences of such references that are valid: the kernel virtual addresses of _text and _end are needed to populate the page tables correctly, and some other section markers are used in a similar way. To avoid the need for making exceptions to the rule that .head.text must not contain any absolute symbol references, derive these addresses from the RIP-relative 1:1 mapped physical addresses, which can be safely determined using RIP_REL_REF(). Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-12-ardb+git@google.com
2024-12-05x86/boot/64: Determine VA/PA offset before entering C codeArd Biesheuvel
Implicit absolute symbol references (e.g., taking the address of a global variable) must be avoided in the C code that runs from the early 1:1 mapping of the kernel, given that this is a practice that violates assumptions on the part of the toolchain. I.e., RIP-relative and absolute references are expected to produce the same values, and so the compiler is free to choose either. However, the code currently assumes that RIP-relative references are never emitted here. So an explicit virtual-to-physical offset needs to be used instead to derive the kernel virtual addresses of _text and _end, instead of simply taking the addresses and assuming that the compiler will not choose to use a RIP-relative references in this particular case. Currently, phys_base is already used to perform such calculations, but it is derived from the kernel virtual address of _text, which is taken using an implicit absolute symbol reference. So instead, derive this VA-to-PA offset in asm code, and pass it to the C startup code. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-11-ardb+git@google.com
2024-12-05x86/sev: Avoid WARN()s and panic()s in early boot codeArd Biesheuvel
Using WARN() or panic() while executing from the early 1:1 mapping is unlikely to do anything useful: the string literals are passed using their kernel virtual addresses which are not even mapped yet. But even if they were, calling into the printk() machinery from the early 1:1 mapped code is not going to get very far. So drop the WARN()s entirely, and replace panic() with a deadloop. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20241205112804.3416920-10-ardb+git@google.com
2024-12-05x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tablesDavid Woodhouse
The set_p4d() and set_pgd() functions (in 4-level or 5-level page table setups respectively) assume that the root page table is actually a 8KiB allocation, with the userspace root immediately after the kernel root page table (so that the former can enforce NX on on all the subordinate page tables, which are actually shared). However, users of the kernel_ident_mapping_init() code do not give it an 8KiB allocation for its PGD. Both swsusp_arch_resume() and acpi_mp_setup_reset() allocate only a single 4KiB page. The kexec code on x86_64 currently gets away with it purely by chance, because it allocates 8KiB for its "control code page" and then actually uses the first half for the PGD, then copies the actual trampoline code into the second half only after the identmap code has finished scribbling over it. Fix this by defining a _PAGE_NOPTISHADOW bit (which can use the same bit as _PAGE_SAVED_DIRTY since one is only for the PGD/P4D root and the other is exclusively for leaf PTEs.). This instructs __pti_set_user_pgtbl() not to write to the userspace 'shadow' PGD. Strictly, the _PAGE_NOPTISHADOW bit doesn't need to be written out to the actual page tables; since __pti_set_user_pgtbl() returns the value to be written to the kernel page table, it could be filtered out. But there seems to be no benefit to actually doing so. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/412c90a4df7aef077141d9f68d19cbe5602d6c6d.camel@infradead.org Cc: stable@kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com>
2024-12-04x86/tdx: Disable unnecessary virtualization exceptionsKirill A. Shutemov
Originally, #VE was defined as the TDX behavior in order to support paravirtualization of x86 features that can’t be virtualized by the TDX module. The intention is that if guest software wishes to use such a feature, it implements some logic to support this. This logic resides in the #VE exception handler it may work in cooperation with the host VMM. Theoretically, the guest TD’s #VE handler was supposed to act as a "TDX enlightenment agent" inside the TD. However, in practice, the #VE handler is simplistic: - #VE on CPUID is handled by returning all-0 to the code which executed CPUID. In many cases, an all-0 value is not the correct value, and may cause improper operation. - #VE on RDMSR is handled by requesting the MSR value from the host VMM. This is prone to security issues since the host VMM is untrusted. It may also be functionally incorrect in case the expected operation is to paravirtualize some CPU functionality. Newer TDX modules provide a "REDUCE_VE" feature. When enabled, it drastically cuts cases when guests receive #VE on MSR and CPUID accesses. Basically, instead of punting the problem to the VMM, the TDX module fills in good data. What the TDX module provides is obviously highly specific to the MSR or CPUID. This is all spelled out in excruciating detail in the TDX specs. Enable REDUCE_VE. Make TDX guest behaviour less odd, and closer to how a normal CPU behaves. Note that enabling of the feature doesn't eliminate need in #VE handler for CPUID and MSR accesses. Some MSRs still generate #VE (notably APIC-related) and kernel needs CPUID #VE handler to ask VMM for leafs in hypervisor range. [ dhansen: changelog tweaks, rename/rework VE reduction function ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/all/20241202072431.447380-1-kirill.shutemov%40linux.intel.com
2024-12-04x86/cpu: Add Lunar Lake to list of CPUs with a broken MONITOR implementationLen Brown
Under some conditions, MONITOR wakeups on Lunar Lake processors can be lost, resulting in significant user-visible delays. Add Lunar Lake to X86_BUG_MONITOR so that wake_up_idle_cpu() always sends an IPI, avoiding this potential delay. Reported originally here: https://bugzilla.kernel.org/show_bug.cgi?id=219364 [ dhansen: tweak subject ] Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/a4aa8842a3c3bfdb7fe9807710eef159cbf0e705.1731463305.git.len.brown%40intel.com
2024-12-04x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()Kirill A. Shutemov
Rename the helper to better reflect its function. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/all/20241202073139.448208-1-kirill.shutemov%40linux.intel.com
2024-12-04Merge import NS conversion from ↵Ilpo Järvinen
'https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git' into for-next
2024-12-02x86/pkeys: Ensure updated PKRU value is XRSTOR'dAruna Ramakrishna
When XSTATE_BV[i] is 0, and XRSTOR attempts to restore state component 'i' it ignores any value in the XSAVE buffer and instead restores the state component's init value. This means that if XSAVE writes XSTATE_BV[PKRU]=0 then XRSTOR will ignore the value that update_pkru_in_sigframe() writes to the XSAVE buffer. XSTATE_BV[PKRU] only gets written as 0 if PKRU is in its init state. On Intel CPUs, basically never happens because the kernel usually overwrites the init value (aside: this is why we didn't notice this bug until now). But on AMD, the init tracker is more aggressive and will track PKRU as being in its init state upon any wrpkru(0x0). Unfortunately, sig_prepare_pkru() does just that: wrpkru(0x0). This writes XSTATE_BV[PKRU]=0 which makes XRSTOR ignore the PKRU value in the sigframe. To fix this, always overwrite the sigframe XSTATE_BV with a value that has XSTATE_BV[PKRU]==1. This ensures that XRSTOR will not ignore what update_pkru_in_sigframe() wrote. The problematic sequence of events is something like this: Userspace does: * wrpkru(0xffff0000) (or whatever) * Hardware sets: XINUSE[PKRU]=1 Signal happens, kernel is entered: * sig_prepare_pkru() => wrpkru(0x00000000) * Hardware sets: XINUSE[PKRU]=0 (aggressive AMD init tracker) * XSAVE writes most of XSAVE buffer, including XSTATE_BV[PKRU]=XINUSE[PKRU]=0 * update_pkru_in_sigframe() overwrites PKRU in XSAVE buffer ... signal handling * XRSTOR sees XSTATE_BV[PKRU]==0, ignores just-written value from update_pkru_in_sigframe() Fixes: 70044df250d0 ("x86/pkeys: Update PKRU to enable all pkeys before XSAVE") Suggested-by: Rudi Horn <rudi.horn@oracle.com> Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241119174520.3987538-3-aruna.ramakrishna%40oracle.com
2024-12-02x86/pkeys: Change caller of update_pkru_in_sigframe()Aruna Ramakrishna
update_pkru_in_sigframe() will shortly need some information which is only available inside xsave_to_user_sigframe(). Move update_pkru_in_sigframe() inside the other function to make it easier to provide it that information. No functional changes. Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241119174520.3987538-2-aruna.ramakrishna%40oracle.com
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02platform/x86/amd/hsmp: Add support for HSMP protocol version 7 messagesSuma Hegde
Following new HSMP messages are available on family 0x1A, model 0x00-0x1F platforms with protocol version 7. Add support for them in the driver. - SetXgmiPstateRange(26h) - CpuRailIsoFreqPolicy(27h) - DfcEnable(28h) - GetRaplUnit(30h) - GetRaplCoreCounter(31h) - GetRaplPackageCounter(32h) Also update HSMP message PwrEfficiencyModeSelection-21h. This message is updated to include GET option in recent firmware. Signed-off-by: Suma Hegde <suma.hegde@amd.com> Reviewed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com> Link: https://lore.kernel.org/r/20241118102752.11703-1-suma.hegde@amd.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-12-02x86/boot/compressed: Remove unused header includes from kaslr.cBorislav Petkov (AMD)
Nothing is using the linux/ namespace headers anymore. Remove them. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241130122644.GAZ0sEhD3Bm_9ZAIuc@fat_crate.local
2024-12-02objtool: Fix ANNOTATE_REACHABLE to be a normal annotationPeter Zijlstra
Currently REACHABLE is weird for being on the instruction after the instruction it modifies. Since all REACHABLE annotations have an explicit instruction, flip them around. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094312.494176035@infradead.org
2024-12-02objtool: Convert {.UN}REACHABLE to ANNOTATEPeter Zijlstra
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094312.353431347@infradead.org
2024-12-02x86: Convert unreachable() to BUG()Peter Zijlstra
Avoid unreachable() as it can (and will in the absence of UBSAN) generate fallthrough code. Use BUG() so we get a UD2 trap (with unreachable annotation). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094312.028316261@infradead.org
2024-12-02objtool: Collect more annotations in objtool.hPeter Zijlstra
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094311.786598147@infradead.org
2024-12-02objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATEPeter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094311.465691316@infradead.org
2024-12-02objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATEPeter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094311.145275669@infradead.org
2024-12-02perf/x86/rapl: Add core energy counter support for AMD CPUsDhananjay Ugwekar
Add a new "power_core" PMU and "energy-core" event for monitoring energy consumption by each individual core. The existing energy-cores event aggregates the energy consumption of CPU cores at the package level. This new event aligns with the AMD's per-core energy counters. Tested the package level and core level PMU counters with workloads pinned to different CPUs. Results with workload pinned to CPU 4 in core 4 on an AMD Zen4 Genoa machine: $ sudo perf stat --per-core -e power_core/energy-core/ -- taskset -c 4 stress-ng --matrix 1 --timeout 5s stress-ng: info: [21250] setting to a 5 second run per stressor stress-ng: info: [21250] dispatching hogs: 1 matrix stress-ng: info: [21250] successful run completed in 5.00s Performance counter stats for 'system wide': S0-D0-C0 1 0.00 Joules power_core/energy-core/ S0-D0-C1 1 0.00 Joules power_core/energy-core/ S0-D0-C2 1 0.00 Joules power_core/energy-core/ S0-D0-C3 1 0.00 Joules power_core/energy-core/ S0-D0-C4 1 8.43 Joules power_core/energy-core/ S0-D0-C5 1 0.00 Joules power_core/energy-core/ S0-D0-C6 1 0.00 Joules power_core/energy-core/ S0-D0-C7 1 0.00 Joules power_core/energy-core/ S0-D1-C8 1 0.00 Joules power_core/energy-core/ S0-D1-C9 1 0.00 Joules power_core/energy-core/ S0-D1-C10 1 0.00 Joules power_core/energy-core/ Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Link: https://lore.kernel.org/r/20241115060805.447565-11-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Move the cntr_mask to rapl_pmus structDhananjay Ugwekar
Prepare for the addition of RAPL core energy counter support. Move cntr_mask to rapl_pmus struct instead of adding a new global cntr_mask for the new RAPL power_core PMU. This will also ensure that the second "core_cntr_mask" is only created if needed (i.e. in case of AMD CPUs). Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-10-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Remove the global variable rapl_msrsDhananjay Ugwekar
Prepare for the addition of RAPL core energy counter support. After making the rapl_model struct global, the rapl_msrs global variable isn't needed, so remove it. Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-9-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Modify the generic variable names to *_pkg*Dhananjay Ugwekar
Prepare for the addition of RAPL core energy counter support. Replace the generic names with *_pkg*, to later on differentiate between the scopes of the two different PMUs and their variables. No functional change. Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-8-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Add arguments to the init and cleanup functionsDhananjay Ugwekar
Prepare for the addition of RAPL core energy counter support. Add arguments to the init and cleanup functions, which will help in initialization and cleaning up of two separate PMUs. No functional change. Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-7-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Make rapl_model struct globalDhananjay Ugwekar
Prepare for the addition of RAPL core energy counter support. As there will always be just one rapl_model variable on a system, make it global, to make it easier to access it from any function. No functional change. Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-6-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Rename rapl_pmu variablesDhananjay Ugwekar
Rename struct rapl_pmu variables from "pmu" to "rapl_pmu", to avoid any confusion between the variables of two different structs pmu and rapl_pmu. As rapl_pmu also contains a pointer to struct pmu, which leads to situations in code like pmu->pmu, which is needlessly confusing. Above scenario is replaced with much more readable rapl_pmu->pmu with this change. Also rename "pmus" member in rapl_pmus struct, for same reason. No functional change. Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-5-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Remove the cpu_to_rapl_pmu() functionDhananjay Ugwekar
Prepare for the addition of RAPL core energy counter support. Post which, one CPU might be mapped to more than one rapl_pmu (package/die one and a core one). So, remove the cpu_to_rapl_pmu() function. Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-4-Dhananjay.Ugwekar@amd.com
2024-12-02x86/topology: Introduce topology_logical_core_id()K Prateek Nayak
On x86, topology_core_id() returns a unique core ID within the PKG domain. Looking at match_smt() suggests that a core ID just needs to be unique within a LLC domain. For use cases such as the core RAPL PMU, there exists a need for a unique core ID across the entire system with multiple PKG domains. Introduce topology_logical_core_id() to derive a unique core ID across the system. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20241115060805.447565-3-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/rapl: Remove the unused get_rapl_pmu_cpumask() functionDhananjay Ugwekar
commit 9e9af8bbb5f9 ("perf/x86/rapl: Clean up cpumask and hotplug") removes the cpumask handling from rapl. Post that, we no longer need the get_rapl_pmu_cpumask() function. So remove it. Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241115060805.447565-2-Dhananjay.Ugwekar@amd.com
2024-12-02perf/x86/intel/ds: Simplify the PEBS records processing for adaptive PEBSKan Liang
The current code may iterate all the PEBS records in the DS area several times. The first loop is to find all active events and calculate the available records for each event. Then iterate the whole buffer again and again to process available records until all active events are processed. The algorithm is inherited from the old generations. The old PEBS hardware does not deal well with the situation when events happen near each other. SW has to drop the error records. Multiple iterations are required. The hardware limit has been addressed on newer platforms with adaptive PEBS. A simple one-iteration algorithm is introduced. The samples are output by record order with the patch, rather than the event order. It doesn't impact the post-processing. The perf tool always sorts the records by time before presenting them to the end user. In an NMI, the last record has to be specially handled. Add a last[] variable to track the last unprocessed record of each event. Test: 11 PEBS events are used in the perf test. Only the basic information is collected. perf record -e instructions:up,...,instructions:up -c 2000003 benchmark The ftrace is used to record the duration of the intel_pmu_drain_pebs_icl(). The average duration reduced from 62.04us to 57.94us. A small improvement can be observed with the new algorithm. Also, the implementation becomes simpler and more straightforward. Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20241119135504.1463839-5-kan.liang@linux.intel.com
2024-12-02perf/x86/intel/ds: Factor out functions for PEBS records processingKan Liang
Factor out functions to process normal and the last PEBS records, which can be shared with the later patch. Move the event updating related codes (intel_pmu_save_and_restart()) to the end, where all samples have been processed. For the current usage, it doesn't matter when perf updates event counts and reset the counter. Because all counters are stopped when the PEBS buffer is drained. Drop the return of the !intel_pmu_save_and_restart(event) check. Because it never happen. The intel_pmu_save_and_restart(event) only returns 0, when !hwc->event_base or the period_left > 0. - The !hwc->event_base is impossible for the PEBS event, since the PEBS event is only available on GP and fixed counters, which always have a valid hwc->event_base. - The check only happens for the case of non-AUTO_RELOAD and single PEBS, which implies that the event must be overflowed. The period_left must be always <= 0 for an overflowed event after the x86_pmu_update(). Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119135504.1463839-4-kan.liang@linux.intel.com
2024-12-02perf/x86/intel/ds: Clarify adaptive PEBS processingKan Liang
Modify the pebs_basic and pebs_meminfo structs to make the bitfields more explicit to ease readability of the code. Co-developed-by: Stephane Eranian <eranian@google.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119135504.1463839-3-kan.liang@linux.intel.com
2024-12-02perf/x86/intel/ds: Unconditionally drain PEBS DS when changing PEBS_DATA_CFGKan Liang
The PEBS kernel warnings can still be observed with the below case. when the below commands are running in parallel for a while. while true; do perf record --no-buildid -a --intr-regs=AX \ -e cpu/event=0xd0,umask=0x81/pp \ -c 10003 -o /dev/null ./triad; done & while true; do perf record -e 'cpu/mem-loads,ldlat=3/uP' -W -d -- ./dtlb done The commit b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG") intends to flush the entire PEBS buffer before the hardware is reprogrammed. However, it fails in the above case. The first perf command utilizes the large PEBS, while the second perf command only utilizes a single PEBS. When the second perf event is added, only the n_pebs++. The intel_pmu_pebs_enable() is invoked after intel_pmu_pebs_add(). So the cpuc->n_pebs == cpuc->n_large_pebs check in the intel_pmu_drain_large_pebs() fails. The PEBS DS is not flushed. The new PEBS event should not be taken into account when flushing the existing PEBS DS. The check is unnecessary here. Before the hardware is reprogrammed, all the stale records must be drained unconditionally. For single PEBS or PEBS-vi-pt, the DS must be empty. The drain_pebs() can handle the empty case. There is no harm to unconditionally drain the PEBS DS. Fixes: b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241119135504.1463839-2-kan.liang@linux.intel.com
2024-12-02Merge branch 'perf/urgent'Peter Zijlstra
Keep in sync with the urgent bits. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-12-02perf/x86/intel: Add Arrow Lake U supportKan Liang
From PMU's perspective, the new Arrow Lake U is the same as the Meteor Lake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241121180526.2364759-1-kan.liang@linux.intel.com
2024-12-02Merge tag 'v6.13-rc1' into perf/core, to refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-12-01x86/crc-t10dif: expose CRC-T10DIF function through libEric Biggers
Move the x86 CRC-T10DIF assembly code into the lib directory and wire it up to the library interface. This allows it to be used without going through the crypto API. It remains usable via the crypto API too via the shash algorithms that use the library interface. Thus all the arch-specific "shash" code becomes unnecessary and is removed. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01x86/crc32: expose CRC32 functions through libEric Biggers
Move the x86 CRC32 assembly code into the lib directory and wire it up to the library interface. This allows it to be used without going through the crypto API. It remains usable via the crypto API too via the shash algorithms that use the library interface. Thus all the arch-specific "shash" code becomes unnecessary and is removed. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01x86/crc32: update prototype for crc32_pclmul_le_16()Eric Biggers
- Change the len parameter from unsigned int to size_t, so that the library function which takes a size_t can safely use this code. - Move the crc parameter to the front, as this is the usual convention. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20241202010844.144356-13-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01x86/crc32: update prototype for crc_pcl()Eric Biggers
- Change the len parameter from unsigned int to size_t, so that the library function which takes a size_t can safely use this code. - Rename to crc32c_x86_3way() which is much clearer. - Move the crc parameter to the front, as this is the usual convention. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20241202010844.144356-12-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01Merge tag 'x86_urgent_for_v6.13_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add a terminating zero end-element to the array describing AMD CPUs affected by erratum 1386 so that the matching loop actually terminates instead of going off into the weeds - Update the boot protocol documentation to mention the fact that the preferred address to load the kernel to is considered in the relocatable kernel case too - Flush the memory buffer containing the microcode patch after applying microcode on AMD Zen1 and Zen2, to avoid unnecessary slowdowns - Make sure the PPIN CPU feature flag is cleared on all CPUs if PPIN has been disabled * tag 'x86_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/CPU/AMD: Terminate the erratum_1386_microcode array x86/Documentation: Update algo in init_size description of boot protocol x86/microcode/AMD: Flush patch buffer mapping after application x86/mm: Carve out INVLPG inline asm for use by others x86/cpu: Fix PPIN initialization
2024-11-30Merge tag 'kbuild-v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add generic support for built-in boot DTB files - Enable TAB cycling for dialog buttons in nconfig - Fix issues in streamline_config.pl - Refactor Kconfig - Add support for Clang's AutoFDO (Automatic Feedback-Directed Optimization) - Add support for Clang's Propeller, a profile-guided optimization. - Change the working directory to the external module directory for M= builds - Support building external modules in a separate output directory - Enable objtool for *.mod.o and additional kernel objects - Use lz4 instead of deprecated lz4c - Work around a performance issue with "git describe" - Refactor modpost * tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (85 commits) kbuild: rename .tmp_vmlinux.kallsyms0.syms to .tmp_vmlinux0.syms gitignore: Don't ignore 'tags' directory kbuild: add dependency from vmlinux to resolve_btfids modpost: replace tdb_hash() with hash_str() kbuild: deb-pkg: add python3:native to build dependency genksyms: reduce indentation in export_symbol() modpost: improve error messages in device_id_check() modpost: rename alias symbol for MODULE_DEVICE_TABLE() modpost: rename variables in handle_moddevtable() modpost: move strstarts() to modpost.h modpost: convert do_usb_table() to a generic handler modpost: convert do_of_table() to a generic handler modpost: convert do_pnp_device_entry() to a generic handler modpost: convert do_pnp_card_entries() to a generic handler modpost: call module_alias_printf() from all do_*_entry() functions modpost: pass (struct module *) to do_*_entry() functions modpost: remove DEF_FIELD_ADDR_VAR() macro modpost: deduplicate MODULE_ALIAS() for all drivers modpost: introduce module_alias_printf() helper modpost: remove unnecessary check in do_acpi_entry() ...
2024-11-30Merge tag 'uml-for-linus-6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML updates from Richard Weinberger: - Lots of cleanups, mostly from Benjamin Berg and Tiwei Bie - Removal of unused code - Fix for sparse warnings - Cleanup around stub_exe() * tag 'uml-for-linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (68 commits) hostfs: Fix the NULL vs IS_ERR() bug for __filemap_get_folio() um: move thread info into task um: Always dump trace for specified task in show_stack um: vector: Do not use drvdata in release um: net: Do not use drvdata in release um: ubd: Do not use drvdata in release um: ubd: Initialize ubd's disk pointer in ubd_add um: virtio_uml: query the number of vqs if supported um: virtio_uml: fix call_fd IRQ allocation um: virtio_uml: send SET_MEM_TABLE message with the exact size um: remove broken double fault detection um: remove duplicate UM_NSEC_PER_SEC definition um: remove file sync for stub data um: always include kconfig.h and compiler-version.h um: set DONTDUMP and DONTFORK flags on KASAN shadow memory um: fix sparse warnings in signal code um: fix sparse warnings from regset refactor um: Remove double zero check um: fix stub exe build with CONFIG_GCOV um: Use os_set_pdeathsig helper in winch thread/process ...
2024-11-26Merge tag 'pci-v6.13-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Make pci_stop_dev() and pci_destroy_dev() safe so concurrent callers can't stop a device multiple times, even as we migrate from the global pci_rescan_remove_lock to finer-grained locking (Keith Busch) - Improve pci_walk_bus() implementation by making it recursive and moving locking up to avoid need for a 'locked' parameter (Keith Busch) - Unexport pci_walk_bus_locked(), which is only used internally by the PCI core (Keith Busch) - Detect some Thunderbolt chips that are built-in and hence 'trustworthy' by a heuristic since the 'ExternalFacingPort' and 'usb4-host-interface' ACPI properties are not quite enough (Esther Shimanovich) Resource management: - Use PCI bus addresses (not CPU addresses) in 'ranges' properties when building dynamic DT nodes so systems where PCI and CPU addresses differ work correctly (Andrea della Porta) - Tidy resource sizing and assignment with helpers to reduce redundancy (Ilpo Järvinen) - Improve pdev_sort_resources() 'bogus alignment' warning to be more specific (Ilpo Järvinen) Driver binding: - Convert driver .remove_new() callbacks to .remove() again to finish the conversion from returning 'int' to being 'void' (Sergio Paracuellos) - Export pcim_request_all_regions(), a managed interface to request all BARs (Philipp Stanner) - Replace pcim_iomap_regions_request_all() with pcim_request_all_regions(), and pcim_iomap_table()[n] with pcim_iomap(n), in the following drivers: ahci, crypto qat, crypto octeontx2, intel_th, iwlwifi, ntb idt, serial rp2, ALSA korg1212 (Philipp Stanner) - Remove the now unused pcim_iomap_regions_request_all() (Philipp Stanner) - Export pcim_iounmap_region(), a managed interface to unmap and release a PCI BAR (Philipp Stanner) - Replace pcim_iomap_regions(mask) with pcim_iomap_region(n), and pcim_iounmap_regions(mask) with pcim_iounmap_region(n), in the following drivers: fpga dfl-pci, block mtip32xx, gpio-merrifield, cavium (Philipp Stanner) Error handling: - Add sysfs 'reset_subordinate' to reset the entire hierarchy below a bridge; previously Secondary Bus Reset could only be used when there was a single device below a bridge (Keith Busch) - Warn if we reset a running device where the driver didn't register pci_error_handlers notification callbacks (Keith Busch) ASPM: - Disable ASPM L1 before touching L1 PM Substates to follow the spec closer and avoid a CPU load timeout on some platforms (Ajay Agarwal) - Set devices below Intel VMD to D0 before enabling ASPM L1 Substates as required per spec for all L1 Substates changes (Jian-Hong Pan) Power management: - Enable starfive controller runtime PM before probing host bridge (Mayank Rana) - Enable runtime power management for host bridges (Krishna chaitanya chundru) Power control: - Use of_platform_device_create() instead of of_platform_populate() to create pwrctl platform devices so we can control it based on the child nodes (Manivannan Sadhasivam) - Create pwrctrl platform devices only if there's a relevant power supply property (Manivannan Sadhasivam) - Add device link from the pwrctl supplier to the PCI dev to ensure pwrctl drivers are probed before the PCI dev driver; this avoids a race where pwrctl could change device power state while the PCI driver was active (Manivannan Sadhasivam) - Find pwrctl device for removal with of_find_device_by_node() instead of searching all children of the parent (Manivannan Sadhasivam) - Rename 'pwrctl' to 'pwrctrl' to match new bandwidth controller ('bwctrl') and hotplug files (Bjorn Helgaas) Bandwidth control: - Add read/modify/write locking for Link Control 2, which is used to manage Link speed (Ilpo Järvinen) - Extract Link Bandwidth Management Status check into pcie_lbms_seen(), where it can be shared between the bandwidth controller and quirks that use it to help retrain failed links (Ilpo Järvinen) - Re-add Link Bandwidth notification support with updates to address the reasons it was previously reverted (Alexandru Gagniuc, Ilpo Järvinen) - Add pcie_set_target_speed() and related functionality so drivers can manage PCIe Link speed based on thermal or other constraints (Ilpo Järvinen) - Add a thermal cooling driver to throttle PCIe Links via the existing thermal management framework (Ilpo Järvinen) - Add a userspace selftest for the PCIe bandwidth controller (Ilpo Järvinen) PCI device hotplug: - Add hotplug controller driver for Marvell OCTEON multi-function device where function 0 has a management console interface to enable/disable and provision various personalities for the other functions (Shijith Thotton) - Retain a reference to the pci_bus for the lifetime of a pci_slot to avoid a use-after-free when the thunderbolt driver resets USB4 host routers on boot, causing hotplug remove/add of downstream docks or other devices (Lukas Wunner) - Remove unused cpcihp struct cpci_hp_controller_ops.hardware_test (Guilherme Giacomo Simoes) - Remove unused cpqphp struct ctrl_dbg.ctrl (Christophe JAILLET) - Use pci_bus_read_dev_vendor_id() instead of hand-coded presence detection in cpqphp (Ilpo Järvinen) - Simplify cpqphp enumeration, which is already simple-minded and doesn't handle devices below hot-added bridges (Ilpo Järvinen) Virtualization: - Add ACS quirk for Wangxun FF5xxx NICs, which don't advertise an ACS capability but do isolate functions as though PCI_ACS_RR and PCI_ACS_CR were set, so the functions can be in independent IOMMU groups (Mengyuan Lou) TLP Processing Hints (TPH): - Add and document TLP Processing Hints (TPH) support so drivers can enable and disable TPH and the kernel can save/restore TPH configuration (Wei Huang) - Add TPH Steering Tag support so drivers can retrieve Steering Tag values associated with specific CPUs via an ACPI _DSM to improve performance by directing DMA writes closer to their consumers (Wei Huang) Data Object Exchange (DOE): - Wait up to 1 second for DOE Busy bit to clear before writing a request to the mailbox to avoid failures if the mailbox is still busy from a previous transfer (Gregory Price) Endpoint framework: - Skip attempts to allocate from endpoint controller memory window if the requested size is larger than the window (Damien Le Moal) - Add and document pci_epc_mem_map() and pci_epc_mem_unmap() to handle controller-specific size and alignment constraints, and add test cases to the endpoint test driver (Damien Le Moal) - Implement dwc pci_epc_ops.align_addr() so pci_epc_mem_map() can observe DWC-specific alignment requirements (Damien Le Moal) - Synchronously cancel command handler work in endpoint test before cleaning up DMA and BARs (Damien Le Moal) - Respect endpoint page size in dw_pcie_ep_align_addr() (Niklas Cassel) - Use dw_pcie_ep_align_addr() in dw_pcie_ep_raise_msi_irq() and dw_pcie_ep_raise_msix_irq() instead of open coding the equivalent (Niklas Cassel) - Avoid NULL dereference if Modem Host Interface Endpoint lacks 'mmio' DT property (Zhongqiu Han) - Release PCI domain ID of Endpoint controller parent (not controller itself) and before unregistering the controller, to avoid use-after-free (Zijun Hu) - Clear secondary (not primary) EPC in pci_epc_remove_epf() when removing the secondary controller associated with an NTB (Zijun Hu) Cadence PCIe controller driver: - Lower severity of 'phy-names' message (Bartosz Wawrzyniak) Freescale i.MX6 PCIe controller driver: - Fix suspend/resume support on i.MX6QDL, which has a hardware erratum that prevents use of L2 (Stefan Eichenberger) Intel VMD host bridge driver: - Add 0xb60b and 0xb06f Device IDs for client SKUs (Nirmal Patel) MediaTek PCIe Gen3 controller driver: - Update mediatek-gen3 DT binding to require the exact number of clocks for each SoC (Fei Shao) - Add support for DT 'max-link-speed' and 'num-lanes' properties to restrict the link speed and width (AngeloGioacchino Del Regno) Microchip PolarFlare PCIe controller driver: - Add DT and driver support for using either of the two PolarFire Root Ports (Conor Dooley) NVIDIA Tegra194 PCIe controller driver: - Move endpoint controller cleanups that depend on refclk from the host to the notifier that tells us the host has deasserted PERST#, when refclk should be valid (Manivannan Sadhasivam) Qualcomm PCIe controller driver: - Add qcom SAR2130P DT binding with an additional clock (Dmitry Baryshkov) - Enable MSI interrupts if 'global' IRQ is supported, since a previous commit unintentionally masked them (Manivannan Sadhasivam) - Move endpoint controller cleanups that depend on refclk from the host to the notifier that tells us the host has deasserted PERST#, when refclk should be valid (Manivannan Sadhasivam) - Add DT binding and driver support for IPQ9574, with Synopsys IP v5.80a and Qcom IP 1.27.0 (devi priya) - Move the OPP "operating-points-v2" table from the qcom,pcie-sm8450.yaml DT binding to qcom,pcie-common.yaml, where it can be used by other Qcom platforms (Qiang Yu) - Add 'global' SPI interrupt for events like link-up, link-down to qcom,pcie-x1e80100 DT binding so we can start enumeration when the link comes up (Qiang Yu) - Disable ASPM L0s for qcom,pcie-x1e80100 since the PHY is not tuned to support this (Qiang Yu) - Add ops_1_21_0 for SC8280X family SoC, which doesn't use the 'iommu-map' DT property and doesn't need BDF-to-SID translation (Qiang Yu) Rockchip PCIe controller driver: - Define ROCKCHIP_PCIE_AT_SIZE_ALIGN to replace magic 256 endpoint .align value (Damien Le Moal) - When unmapping an endpoint window, compute the region index instead of searching for it, and verify that the address was mapped (Damien Le Moal) - When mapping an endpoint window, verify that the address hasn't been mapped already (Damien Le Moal) - Implement pci_epc_ops.align_addr() for rockchip-ep (Damien Le Moal) - Fix MSI IRQ data mapping to observe the alignment constraint, which fixes intermittent page faults in memcpy_toio() and memcpy_fromio() (Damien Le Moal) - Rename rockchip_pcie_parse_ep_dt() to rockchip_pcie_ep_get_resources() for consistency with similar DT interfaces (Damien Le Moal) - Skip the unnecessary link train in rockchip_pcie_ep_probe() and do it only in the endpoint start operation (Damien Le Moal) - Implement pci_epc_ops.stop_link() to disable link training and controller configuration (Damien Le Moal) - Attempt link training at 5 GT/s when both partners support it (Damien Le Moal) - Add a handler for PERST# signal so we can detect host-initiated resets and start link training after PERST# is deasserted (Damien Le Moal) Synopsys DesignWare PCIe controller driver: - Clear outbound address on unmap so dw_pcie_find_index() won't match an ATU index that was already unmapped (Damien Le Moal) - Use of_property_present() instead of of_property_read_bool() when testing for presence of non-boolean DT properties (Rob Herring) - Advertise 1MB size if endpoint supports Resizable BARs, which was inadvertently lost in v6.11 (Niklas Cassel) TI J721E PCIe driver: - Add PCIe support for J722S SoC (Siddharth Vadapalli) - Delay PCIE_T_PVPERL_MS (100 ms), not just PCIE_T_PERST_CLK_US (100 us), before deasserting PERST# to ensure power and refclk are stable (Siddharth Vadapalli) TI Keystone PCIe controller driver: - Set the 'ti,keystone-pcie' mode so v3.65a devices work in Root Complex mode (Kishon Vijay Abraham I) - Try to avoid unrecoverable SError for attempts to issue config transactions when the link is down; this is racy but the best we can do (Kishon Vijay Abraham I) Miscellaneous: - Reorganize kerneldoc parameter names to match order in function signature (Julia Lawall) - Fix sysfs reset_method_store() memory leak (Todd Kjos) - Simplify pci_create_slot() (Ilpo Järvinen) - Fix incorrect printf format specifiers in pcitest (Luo Yifan)" * tag 'pci-v6.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (127 commits) PCI: rockchip-ep: Handle PERST# signal in EP mode PCI: rockchip-ep: Improve link training PCI: rockship-ep: Implement the pci_epc_ops::stop_link() operation PCI: rockchip-ep: Refactor endpoint link training enable PCI: rockchip-ep: Refactor rockchip_pcie_ep_probe() MSI-X hiding PCI: rockchip-ep: Refactor rockchip_pcie_ep_probe() memory allocations PCI: rockchip-ep: Rename rockchip_pcie_parse_ep_dt() PCI: rockchip-ep: Fix MSI IRQ data mapping PCI: rockchip-ep: Implement the pci_epc_ops::align_addr() operation PCI: rockchip-ep: Improve rockchip_pcie_ep_map_addr() PCI: rockchip-ep: Improve rockchip_pcie_ep_unmap_addr() PCI: rockchip-ep: Use a macro to define EP controller .align feature PCI: rockchip-ep: Fix address translation unit programming PCI/pwrctrl: Rename pwrctrl functions and structures PCI/pwrctrl: Rename pwrctl files to pwrctrl PCI/pwrctl: Remove pwrctl device without iterating over all children of pwrctl parent PCI/pwrctl: Ensure that pwrctl drivers are probed before PCI client drivers PCI/pwrctl: Create pwrctl device only if at least one power supply is present PCI/pwrctl: Use of_platform_device_create() to create pwrctl devices tools: PCI: Fix incorrect printf format specifiers ...
2024-11-27kbuild: Add Propeller configuration for kernel buildRong Xu
Add the build support for using Clang's Propeller optimizer. Like AutoFDO, Propeller uses hardware sampling to gather information about the frequency of execution of different code paths within a binary. This information is then used to guide the compiler's optimization decisions, resulting in a more efficient binary. The support requires a Clang compiler LLVM 19 or later, and the create_llvm_prof tool (https://github.com/google/autofdo/releases/tag/v0.30.1). This commit is limited to x86 platforms that support PMU features like LBR on Intel machines and AMD Zen3 BRS. Here is an example workflow for building an AutoFDO+Propeller optimized kernel: 1) Build the kernel on the host machine, with AutoFDO and Propeller build config CONFIG_AUTOFDO_CLANG=y CONFIG_PROPELLER_CLANG=y then $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> “<autofdo_profile>” is the profile collected when doing a non-Propeller AutoFDO build. This step builds a kernel that has the same optimization level as AutoFDO, plus a metadata section that records basic block information. This kernel image runs as fast as an AutoFDO optimized kernel. 2) Install the kernel on test/production machines. 3) Run the load tests. The '-c' option in perf specifies the sample event period. We suggest using a suitable prime number, like 500009, for this purpose. For Intel platforms: $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c <count> \ -o <perf_file> -- <loadtest> For AMD platforms: The supported system are: Zen3 with BRS, or Zen4 with amd_lbr_v2 # To see if Zen3 support LBR: $ cat proc/cpuinfo | grep " brs" # To see if Zen4 support LBR: $ cat proc/cpuinfo | grep amd_lbr_v2 # If the result is yes, then collect the profile using: $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \ -N -b -c <count> -o <perf_file> -- <loadtest> 4) (Optional) Download the raw perf file to the host machine. 5) Generate Propeller profile: $ create_llvm_prof --binary=<vmlinux> --profile=<perf_file> \ --format=propeller --propeller_output_module_name \ --out=<propeller_profile_prefix>_cc_profile.txt \ --propeller_symorder=<propeller_profile_prefix>_ld_profile.txt “create_llvm_prof” is the profile conversion tool, and a prebuilt binary for linux can be found on https://github.com/google/autofdo/releases/tag/v0.30.1 (can also build from source). "<propeller_profile_prefix>" can be something like "/home/user/dir/any_string". This command generates a pair of Propeller profiles: "<propeller_profile_prefix>_cc_profile.txt" and "<propeller_profile_prefix>_ld_profile.txt". 6) Rebuild the kernel using the AutoFDO and Propeller profile files. CONFIG_AUTOFDO_CLANG=y CONFIG_PROPELLER_CLANG=y and $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> \ CLANG_PROPELLER_PROFILE_PREFIX=<propeller_profile_prefix> Co-developed-by: Han Shen <shenhan@google.com> Signed-off-by: Han Shen <shenhan@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Sriraman Tallam <tmsriram@google.com> Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Tested-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-26x86/CPU/AMD: Terminate the erratum_1386_microcode arraySebastian Andrzej Siewior
The erratum_1386_microcode array requires an empty entry at the end. Otherwise x86_match_cpu_with_stepping() will continue iterate the array after it ended. Add an empty entry to erratum_1386_microcode to its end. Fixes: 29ba89f189528 ("x86/CPU/AMD: Improve the erratum 1386 workaround") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20241126134722.480975-1-bigeasy@linutronix.de
2024-11-25Merge tag 'trace-rust-v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull rust trace event support from Steven Rostedt: "Allow Rust code to have trace events Trace events is a popular way to debug what is happening inside the kernel or just to find out what is happening. Rust code is being added to the Linux kernel but it currently does not support the tracing infrastructure. Add support of trace events inside Rust code" * tag 'trace-rust-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rust: jump_label: skip formatting generated file jump_label: rust: pass a mut ptr to `static_key_count` samples: rust: fix `rust_print` build making it a combined module rust: add arch_static_branch jump_label: adjust inline asm to be consistent rust: samples: add tracepoint to Rust sample rust: add tracepoint support rust: add static_branch_unlikely for static_key_false