linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-09-01	intel_idle: Remove unnecessary address-of operators	Kaushlendra Kumar
	Remove redundant address-of operators (&) when assigning the intel_idle() function pointer to the .enter field in cpuidle_state structures.in C, the & is not needed for function names. This change improves code consistency and readability by using the more conventional form without the & operator. No functional change intended. Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com> Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20250818085124.3897921-1-kaushlendra.kumar@intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-08-11	intel_idle: Allow loading ACPI tables for any family	Len Brown
	There is no reason to limit intel_idle's loading of ACPI tables to family 6. Upcoming Intel processors are not in family 6. Below "Fixes" really means "applies cleanly until". That syntax commit didn't change the previous logic, but shows this patch applies back 5-years. Fixes: 4a9f45a0533f ("intel_idle: Convert to new X86 CPU match macros") Signed-off-by: Len Brown <len.brown@intel.com> Link: https://patch.msgid.link/06101aa4fe784e5b0be1cb2c0bdd9afcf16bd9d4.1754681697.git.len.brown@intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-06-10	intel_idle: Update arguments of mwait_idle_with_hints()	Uros Bizjak
	Commit a17b37a3f416 ("x86/idle: Change arguments of mwait_idle_with_hints() to u32") changed the type of arguments of mwait_idle_with_hints() from unsigned long to u32. Change the type of variables in the call to mwait_idle_with_hints() to unsigned int to follow the change. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20250609063528.48715-1-ubizjak@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-06-07	intel_idle: Rescan "dead" SMT siblings during initialization	Rafael J. Wysocki
	Make intel_idle_init() call arch_cpu_rescan_dead_smt_siblings() after successfully registering intel_idle as the cpuidle driver so as to allow the "dead" SMT siblings (if any) to go into deep idle states. This is necessary for the processor to be able to reach deep package C-states (like PC10) going forward which is requisite for reducing power sufficiently in suspend-to-idle, among other things. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/10669885.nUPlyArG6x@rjwysocki.net
2025-06-06	intel_idle: Use subsys_initcall_sync() for initialization	Rafael J. Wysocki
	It is not necessary to wait until the device_initcall() stage with intel_idle initialization. All of its dependencies are met after all subsys_initcall()s have run, so subsys_initcall_sync() can be used for initializing it. It is also better to ensure that intel_idle will always initialize before the ACPI processor driver that uses module_init() for its initialization. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/2994397.e9J7NaK4W3@rjwysocki.net
2025-05-27	Merge tag 'pm-6.16-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Once again, the changes are dominated by cpufreq updates, but this time the majority of them are cpufreq core changes, mostly related to the introduction of policy locking guards and __free() usage, and fixes related to boost handling. Still, there is also a significant update of the intel_pstate driver making it register an energy model when running on a hybrid platform which is used for enabling energy-aware scheduling (EAS) if the driver operates in the passive mode (and schedutil is used as the cpufreq governor for all CPUs which is the passive mode default). There are some amd-pstate driver updates too, for a good measure, including the "Requested CPU Min frequency" BIOS option support and new online/offline callbacks. In the cpuidle space, the most significant change is the addition of a C1 demotion on/off sysfs knob to intel_idle which should help some users to configure their systems more precisely. There is also the conversion of the PSCI cpuidle driver to a faux device one and there are two small updates of cpuidle governors. Device power management is also modified quite a bit, especially the handling of devices with asynchronous suspend and resume enabled during system transitions. They are now going to be handled more asynchronously during suspend transitions and somewhat less aggressively during resume transitions. Apart from the above, the operating performance points (OPP) library is now going to use mutex locking guards and scope-based cleanup helpers and there is the usual bunch of assorted fixes and code cleanups. Specifics: - Fix potential division-by-zero error in em_compute_costs() (Yaxiong Tian) - Fix typos in energy model documentation and example driver code (Moon Hee Lee, Atul Kumar Pant) - Rearrange the energy model management code and add a new function for adjusting a CPU energy model after adjusting the capacity of the given CPU to it (Rafael Wysocki) - Refactor cpufreq_online(), add and use cpufreq policy locking guards, use __free() in policy reference counting, and clean up core cpufreq code on top of that (Rafael Wysocki) - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh Kumar) - Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay Ugwekar) - Add offline, online and suspend callbacks to the amd-pstate driver, rename and use the existing amd_pstate_epp callbacks in it (Dhananjay Ugwekar) - Add support for the "Requested CPU Min frequency" BIOS option to the amd-pstate driver (Dhananjay Ugwekar) - Reset amd-pstate driver mode after running selftests (Swapnil Sapkal) - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan Chancellor) - Add helper for governor checks to the schedutil cpufreq governor and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki) - Populate the cpu_capacity sysfs entries from the intel_pstate driver after registering asym capacity support (Ricardo Neri) - Add support for enabling Energy-aware scheduling (EAS) to the intel_pstate driver when operating in the passive mode on a hybrid platform (Rafael Wysocki) - Drop redundant cpus_read_lock() from store_local_boost() in the cpufreq core (Seyediman Seyedarab) - Replace sscanf() with kstrtouint() in the cpufreq code and use a symbol instead of a raw number in it (Bowen Yu) - Add support for autonomous CPU performance state selection to the CPPC cpufreq driver (Lifeng Zheng) - OPP: Add dev_pm_opp_set_level() (Praveen Talari) - Introduce scope-based cleanup headers and mutex locking guards in OPP core (Viresh Kumar) - Switch OPP to use kmemdup_array() (Zhang Enpei) - Optimize bucket assignment when next_timer_ns equals KTIME_MAX in the menu cpuidle governor (Zhongqiu Han) - Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla) - Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem Bityutskiy) - Fix typos in two comments in the teo cpuidle governor (Atul Kumar Pant) - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja Kalla) - Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki) - Add new devm_ functions for enabling runtime PM and runtime PM reference counting (Bence Csókás) - Remove size arguments from strscpy() calls in the hibernation core code (Thorsten Blum) - Adjust the handling of devices with asynchronous suspend enabled during system suspend and resume to start resuming them immediately after resuming their parents and to start suspending such a device immediately after suspending its first child (Rafael Wysocki) - Adjust messages printed during tasks freezing to avoid using pr_cont() (Andrew Sayers, Paul Menzel) - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan Zhang) - Add missing wakeup source attribute relax_count to sysfs and remove the space character at the end ofi the string produced by pm_show_wakelocks() (Zijun Hu) - Add configurable pm_test delay for hibernation (Zihuan Zhang) - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the cypd4226 device on Tegra boards from suspending prematurely (Jon Hunter) - Unbreak printing PM debug messages during hibernation and clean up some related code (Rafael Wysocki) - Add a systemd service to run cpupower and change cpupower binding's Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)" * tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits) cpufreq: CPPC: Add support for autonomous selection cpufreq: Update sscanf() to kstrtouint() cpufreq: Replace magic number OPP: switch to use kmemdup_array() PM: freezer: Rewrite restarting tasks log to remove stray done. PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn() cpufreq: drop redundant cpus_read_lock() from store_local_boost() cpupower: do not install files to /etc/default/ cpupower: do not call systemctl at install time cpupower: do not write DESTDIR to cpupower.service PM: sleep: Introduce pm_sleep_transition_in_progress() cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver() cpufreq: intel_pstate: Document hybrid processor support cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache cpufreq: intel_pstate: EAS support for hybrid platforms PM: EM: Introduce em_adjust_cpu_capacity() PM: EM: Move CPU capacity check to em_adjust_new_capacity() PM: EM: Documentation: Fix typos in example driver code cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas() PM: sleep: Introduce pm_suspend_in_progress() ...
2025-05-15	x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header	Ahmed S. Darwish
	The main CPUID header <asm/cpuid.h> was originally a storefront for the headers: <asm/cpuid/api.h> <asm/cpuid/leaf_0x2_api.h> Now that the latter CPUID(0x2) header has been merged into the former, there is no practical difference between <asm/cpuid.h> and <asm/cpuid/api.h>. Migrate all users to the <asm/cpuid/api.h> header, in preparation of the removal of <asm/cpuid.h>. Don't remove <asm/cpuid.h> just yet, in case some new code in -next started using it. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: x86-cpuid@lists.linux.dev Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de
2025-05-02	x86/msr: Add explicit includes of <asm/msr.h>	Xin Li (Intel)
	For historic reasons there are some TSC-related functions in the <asm/msr.h> header, even though there's an <asm/tsc.h> header. To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h> to <asm/tsc.h> and to eventually eliminate the inclusion of <asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency to the source files that reference definitions from <asm/msr.h>. [ mingo: Clarified the changelog. ] Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
2025-04-15	intel_idle: Add C1 demotion on/off sysfs knob	Artem Bityutskiy
	Add a sysfs knob to enable/disable C1 demotion for the following Intel platforms: Sapphire Rapids Xeon, Emerald Rapids Xeon, Granite Rapids Xeon, Sierra Forest Xeon, and Grand Ridge SoC. This sysfs file shows up as "/sys/devices/system/cpu/cpuidle/intel_c1_demotion". The C1 demotion feature involves the platform firmware demoting deep C-state requests from the OS (e.g., C6 requests) to C1. The idea is that firmware monitors CPU wake-up rate, and if it is higher than a platform-specific threshold, the firmware demotes deep C-state requests to C1. For example, Linux requests C6, but firmware noticed too many wake-ups per second, and it keeps the CPU in C1. When the CPU stays in C1 long enough, the platform promotes it back to C6. The default value for C1 demotion is whatever is configured by BIOS. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20250317135541.1471754-2-dedekind1@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-10	x86/msr: Rename 'wrmsrl()' to 'wrmsrq()'	Ingo Molnar
	Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10	x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'	Ingo Molnar
	Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-25	Merge tag 'pm-6.15-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These are dominated by cpufreq updates which in turn are dominated by updates related to boost support in the core and drivers and amd-pstate driver optimizations. Apart from the above, there are some cpuidle updates including a rework of the most recent idle intervals handling in the venerable menu governor that leads to significant improvements in some performance benchmarks, as the governor is now more likely to predict a shorter idle duration in some cases, and there are updates of the core device power management code, mostly related to system suspend and resume, that should help to avoid potential issues arising when the drivers of devices depending on one another want to use different optimizations. There is also a usual collection of assorted fixes and cleanups, including removal of some unused code. Specifics: - Manage sysfs attributes and boost frequencies efficiently from cpufreq core to reduce boilerplate code in drivers (Viresh Kumar) - Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider, Dhananjay Ugwekar, Imran Shaik, zuoqian) - Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky Bai) - cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski) - Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng) - Optimize the amd-pstate driver to avoid cases where call paths end up calling the same writes multiple times and needlessly caching variables through code reorganization, locking overhaul and tracing adjustments (Mario Limonciello, Dhananjay Ugwekar) - Make it possible to avoid enabling capacity-aware scheduling (CAS) in the intel_pstate driver and relocate a check for out-of-band (OOB) platform handling in it to make it detect OOB before checking HWP availability (Rafael Wysocki) - Fix dbs_update() to avoid inadvertent conversions of negative integer values to unsigned int which causes CPU frequency selection to be inaccurate in some cases when the "conservative" cpufreq governor is in use (Jie Zhan) - Update the handling of the most recent idle intervals in the menu cpuidle governor to prevent useful information from being discarded by it in some cases and improve the prediction accuracy (Rafael Wysocki) - Make it possible to tell the intel_idle driver to ignore its built-in table of idle states for the given processor, clean up the handling of auto-demotion disabling on Baytrail and Cherrytrail chips in it, and update its MAINTAINERS entry (David Arcari, Artem Bityutskiy, Rafael Wysocki) - Make some cpuidle drivers use for_each_present_cpu() instead of for_each_possible_cpu() during initialization to avoid issues occurring when nosmp or maxcpus=0 are used (Jacky Bai) - Clean up the Energy Model handling code somewhat (Rafael Wysocki) - Use kfree_rcu() to simplify the handling of runtime Energy Model updates (Li RongQing) - Add an entry for the Energy Model framework to MAINTAINERS as properly maintained (Lukasz Luba) - Address RCU-related sparse warnings in the Energy Model code (Rafael Wysocki) - Remove ENERGY_MODEL dependency on SMP and allow it to be selected when DEVFREQ is set without CPUFREQ so it can be used on a wider range of systems (Jeson Gao) - Unify error handling during runtime suspend and runtime resume in the core to help drivers to implement more consistent runtime PM error handling (Rafael Wysocki) - Drop a redundant check from pm_runtime_force_resume() and rearrange documentation related to __pm_runtime_disable() (Rafael Wysocki) - Rework the handling of the "smart suspend" driver flag in the PM core to avoid issues hat may occur when drivers using it depend on some other drivers and clean up the related PM core code (Rafael Wysocki, Colin Ian King) - Fix the handling of devices with the power.direct_complete flag set if device_suspend() returns an error for at least one device to avoid situations in which some of them may not be resumed (Rafael Wysocki) - Use mutex_trylock() in hibernate_compressor_param_set() to avoid a possible deadlock that may occur if the "compressor" hibernation module parameter is accessed during the registration of a new ieee80211 device (Lizhi Xu) - Suppress sleeping parent warning in device_pm_add() in the case when new children are added under a device with the power.direct_complete set after it has been processed by device_resume() (Xu Yang) - Remove needless return in three void functions related to system wakeup (Zijun Hu) - Replace deprecated kmap_atomic() with kmap_local_page() in the hibernation core code (David Reaver) - Remove unused helper functions related to system sleep (David Alan Gilbert) - Clean up s2idle_enter() so it does not lock and unlock CPU offline in vain and update comments in it (Ulf Hansson) - Clean up broken white space in dpm_wait_for_children() (Geert Uytterhoeven) - Update the cpupower utility to fix lib version-ing in it and memory leaks in error legs, remove hard-coded values, and implement CPU physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah Khan, Yiwei Lin, Zhongqiu Han)" * tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (139 commits) PM: sleep: Fix bit masking operation dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650 dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1 dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible cpufreq: Init cpufreq only for present CPUs PM: sleep: Fix handling devices with direct_complete set on errors cpuidle: Init cpuidle only for present CPUs PM: clk: Remove unused pm_clk_remove() PM: sleep: core: Fix indentation in dpm_wait_for_children() PM: s2idle: Extend comment in s2idle_enter() PM: s2idle: Drop redundant locks when entering s2idle PM: sleep: Remove unused pm_generic_ wrappers cpufreq: tegra186: Share policy per cluster cpupower: Make lib versioning scheme more obvious and fix version link PM: EM: Rework the depends on for CONFIG_ENERGY_MODEL PM: EM: Address RCU-related sparse warnings cpupower: Implement CPU physical core querying pm: cpupower: remove hard-coded topology depth values pm: cpupower: Fix cmd_monitor() error legs to free cpu_topology ...
2025-03-24	Merge tag 'x86-core-2025-03-22' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: "x86 CPU features support: - Generate the <asm/cpufeaturemasks.h> header based on build config (H. Peter Anvin, Xin Li) - x86 CPUID parsing updates and fixes (Ahmed S. Darwish) - Introduce the 'setcpuid=' boot parameter (Brendan Jackman) - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan Jackman) - Utilize CPU-type for CPU matching (Pawan Gupta) - Warn about unmet CPU feature dependencies (Sohil Mehta) - Prepare for new Intel Family numbers (Sohil Mehta) Percpu code: - Standardize & reorganize the x86 percpu layout and related cleanups (Brian Gerst) - Convert the stackprotector canary to a regular percpu variable (Brian Gerst) - Add a percpu subsection for cache hot data (Brian Gerst) - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak) - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak) MM: - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction (Rik van Riel) - Rework ROX cache to avoid writable copy (Mike Rapoport) - PAT: restore large ROX pages after fragmentation (Kirill A. Shutemov, Mike Rapoport) - Make memremap(MEMREMAP_WB) map memory as encrypted by default (Kirill A. Shutemov) - Robustify page table initialization (Kirill A. Shutemov) - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn) - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW (Matthew Wilcox) KASLR: - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir Singh) CPU bugs: - Implement FineIBT-BHI mitigation (Peter Zijlstra) - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta) - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta) - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta) System calls: - Break up entry/common.c (Brian Gerst) - Move sysctls into arch/x86 (Joel Granados) Intel LAM support updates: (Maciej Wieczor-Retman) - selftests/lam: Move cpu_has_la57() to use cpuinfo flag - selftests/lam: Skip test if LAM is disabled - selftests/lam: Test get_user() LAM pointer handling AMD SMN access updates: - Add SMN offsets to exclusive region access (Mario Limonciello) - Add support for debugfs access to SMN registers (Mario Limonciello) - Have HSMP use SMN through AMD_NODE (Yazen Ghannam) Power management updates: (Patryk Wlazlyn) - Allow calling mwait_play_dead with an arbitrary hint - ACPI/processor_idle: Add FFH state handling - intel_idle: Provide the default enter_dead() handler - Eliminate mwait_play_dead_cpuid_hint() Build system: - Raise the minimum GCC version to 8.1 (Brian Gerst) - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor) Kconfig: (Arnd Bergmann) - Add cmpxchg8b support back to Geode CPUs - Drop 32-bit "bigsmp" machine support - Rework CONFIG_GENERIC_CPU compiler flags - Drop configuration options for early 64-bit CPUs - Remove CONFIG_HIGHMEM64G support - Drop CONFIG_SWIOTLB for PAE - Drop support for CONFIG_HIGHPTE - Document CONFIG_X86_INTEL_MID as 64-bit-only - Remove old STA2x11 support - Only allow CONFIG_EISA for 32-bit Headers: - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers (Thomas Huth) Assembly code & machine code patching: - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf) - x86/alternatives: Simplify callthunk patching (Peter Zijlstra) - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf) - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf) - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra) - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> (Uros Bizjak) - Use named operands in inline asm (Uros Bizjak) - Improve performance by using asm_inline() for atomic locking instructions (Uros Bizjak) Earlyprintk: - Harden early_serial (Peter Zijlstra) NMI handler: - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus() (Waiman Long) Miscellaneous fixes and cleanups: - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner, Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly Kuznetsov, Xin Li, liuye" * tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits) zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault x86/asm: Make asm export of __ref_stack_chk_guard unconditional x86/mm: Only do broadcast flush from reclaim if pages were unmapped perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones perf/x86/intel, x86/cpu: Simplify Intel PMU initialization x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions x86/asm: Use asm_inline() instead of asm() in clwb() x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h> x86/hweight: Use asm_inline() instead of asm() x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm() x86/hweight: Use named operands in inline asm() x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP x86/head/64: Avoid Clang < 17 stack protector in startup code x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro x86/cpu/intel: Limit the non-architectural constant_tsc model checks x86/mm/pat: Replace Intel x86_model checks with VFM ones x86/cpu/intel: Fix fast string initialization for extended Families ...
2025-03-22	tracing: Disable branch profiling in noinstr code	Josh Poimboeuf
	CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update() for each use of likely() or unlikely(). That breaks noinstr rules if the affected function is annotated as noinstr. Disable branch profiling for files with noinstr functions. In addition to some individual files, this also includes the entire arch/x86 subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle directories, all of which are noinstr-heavy. Due to the nature of how sched binaries are built by combining multiple .c files into one, branch profiling is disabled more broadly across the sched code than would otherwise be needed. This fixes many warnings like the following: vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section ... Reported-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-04	Merge branch 'x86/urgent' into x86/cpu, to pick up dependent commits	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-03	Merge back earlier cpuidle material for 6.15	Rafael J. Wysocki

2025-02-28	intel_idle: Handle older CPUs, which stop the TSC in deeper C states, correctly	Thomas Gleixner
	The Intel idle driver is preferred over the ACPI processor idle driver, but fails to implement the work around for Core2 generation CPUs, where the TSC stops in C2 and deeper C-states. This causes stalls and boot delays, when the clocksource watchdog does not catch the unstable TSC before the CPU goes deep idle for the first time. The ACPI driver marks the TSC unstable when it detects that the CPU supports C2 or deeper and the CPU does not have a non-stop TSC. Add the equivivalent work around to the Intel idle driver to cure that. Fixes: 18734958e9bf ("intel_idle: Use ACPI _CST for processor models without C-state tables") Reported-by: Fab Stz <fabstz-it@yahoo.fr> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Fab Stz <fabstz-it@yahoo.fr> Cc: All applicable <stable@vger.kernel.org> Closes: https://lore.kernel.org/all/10cf96aa-1276-4bd4-8966-c890377030c3@yahoo.fr Link: https://patch.msgid.link/87bjupfy7f.ffs@tglx Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-25	intel_idle: introduce 'no_native' module parameter	David Arcari
	Since commit 18734958e9bf ("intel_idle: Use ACPI _CST for processor models without C-state tables") the intel_idle driver has had the ability to use the ACPI _CST to populate C-states when the processor model is not recognized. However, even when the processor model is recognized (native mode) there are cases where it is useful to make the driver ignore the per-CPU idle states in lieu of ACPI C-states (such as specific application performance). Add a new 'no_native' module parameter to provide this functionality. Signed-off-by: David Arcari <darcari@redhat.com> Link: https://patch.msgid.link/20250220151120.1131122-1-darcari@redhat.com Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> [ rjw: Spell CPU in capitals ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-19	intel_idle: clean up BYT/CHT auto demotion disable	Artem Bityutskiy
	Bay Trail (BYT) and Cherry Trail (CHT) platforms have a very specific way of disabling auto-demotion via specific MSR bits. Clean up the code so that BYT/CHT-specifics do not show up in the common 'struct idle_cpu' data structure. Remove the 'byt_auto_demotion_disable_flag' flag from 'struct idle_cpu', because a better coding pattern is to avoid very case-specific fields like 'bool byt_auto_demotion_disable_flag' in a common data structure, which is used for all platforms, not only BYT/CHT. The code is just more readable when common data structures contain only commonly used fields. Instead, match BYT/CHT in the 'intel_idle_init_cstates_icpu()' function, and introduce a small helper to take care of BYT/CHT auto-demotion. This is consistent with how platform-specific things are done for other platforms. No intended functional changes, compile-tested only. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20250210071253.2991030-1-dedekind1@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-05	intel_idle: Provide the default enter_dead() handler	Patryk Wlazlyn
	Recent Intel platforms require idle driver to provide information about the MWAIT hint used to enter the deepest idle state in the play_dead code. Provide the default enter_dead() handler for all of the platforms and allow overwriting with a custom handler for each platform if needed. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/all/20250205155211.329780-4-artem.bityutskiy%40linux.intel.com
2025-01-22	Merge tag 'pm-6.14-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "The majority of changes here are cpufreq updates which are dominated by amd-pstate driver changes, like in the previous cycle. Moreover, changes related to amd-pstate are also the majority of cpupower utility updates. Included are some pieces of new hardware support, like the addition of Clearwater Forest processors support to intel_idle, new cpufreq driver for Airoha SoCs, and Apple cpufreq driver extensions to support more SoCs. The intel_pstate driver is also extended to be able to support new platforms by using ACPI CPPC to compute scaling factors between HWP performance states and frequency. The rest is mostly fixes and cleanups in assorted pieces of power management code. Specifics: - Use str_enable_disable()-like helpers in cpufreq (Krzysztof Kozlowski) - Extend the Apple cpufreq driver to support more SoCs (Hector Martin, Nick Chan) - Add new cpufreq driver for Airoha SoCs (Christian Marangi) - Fix using cpufreq-dt as module (Andreas Kemnade) - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter Edwards, Sibi Sankar, Manivannan Sadhasivam) - Fix the maximum supported frequency computation in the ACPI cpufreq driver to avoid relying on unfounded assumptions (Gautham Shenoy) - Fix an amd-pstate driver regression with preferred core rankings not being used (Mario Limonciello) - Fix a precision issue with frequency calculation in the amd-pstate driver (Naresh Solanki) - Add ftrace event to the amd-pstate driver for active mode (Mario Limonciello) - Set default EPP policy on Ryzen processors in amd-pstate (Mario Limonciello) - Clean up the amd-pstate cpufreq driver and optimize it to increase code reuse (Mario Limonciello, Dhananjay Ugwekar) - Use CPPC to get scaling factors between HWP performance levels and frequency in the intel_pstate driver and make it stop using a built-in scaling factor for Arrow Lake processors (Rafael Wysocki) - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for consistency with CPU offline (Christian Loehle) - Fix superfluous updates caused by need_freq_update in the schedutil cpufreq governor (Sultan Alsawaf) - Allow configuring the system suspend-resume (DPM) watchdog to warn earlier than panic (Douglas Anderson) - Implement devm_device_init_wakeup() helper and introduce a device- managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan) - Remove direct inclusions of 'pm_wakeup.h' which should be only included via 'device.h' (Wolfram Sang) - Clean up two comments in the core system-wide PM code (Rafael Wysocki, Randy Dunlap) - Add Clearwater Forest processor support to the intel_idle cpuidle driver (Artem Bityutskiy) - Clean up the Exynos devfreq driver and devfreq core (Markus Elfring, Jeongjun Park) - Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong, Joe Hattori) - Implement dev_pm_opp_get_bw() (Neil Armstrong) - Expose OPP reference counting helpers for Rust (Viresh Kumar) - Fix TSC MHz calculation in cpupower (He Rongguang) - Add install and uninstall options to bindings Makefile and add header changes for cpufreq.h to SWIG bindings in cpupower (John B. Wyatt IV) - Add missing residency header changes in cpuidle.h to SWIG bindings in cpupower (John B. Wyatt IV) - Add output files to .gitignore and clean them up in "make clean" in selftests/cpufreq (Li Zhijian) - Fix cross-compilation in cpupower Makefile (Peng Fan) - Revise the is_valid flag handling for idle_monitor in the cpupower utility (wangfushuai) - Extend and clean up AMD processors support in cpupower (Mario Limonciello)" * tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (67 commits) PM / OPP: Add reference counting helpers for Rust implementation PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq() cpufreq: Use str_enable_disable()-like helpers cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic PM: sleep: convert comment from kernel-doc to plain comment cpufreq: ACPI: Fix max-frequency computation pm: cpupower: Add missing residency header changes in cpuidle.h to SWIG PM / devfreq: exynos: remove unused function parameter OPP: OF: Fix an OF node leak in _opp_add_static_v2() cpufreq/amd-pstate: Refactor max frequency calculation cpufreq/amd-pstate: Fix prefcore rankings pm: cpupower: Add header changes for cpufreq.h to SWIG bindings cpufreq: sparc: change kzalloc to kcalloc cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT cpufreq: apple-soc: Increase cluster switch timeout to 400us cpufreq: apple-soc: Use 32-bit read for status register ...
2024-12-18	x86/cpu: Make all all CPUID leaf names consistent	Dave Hansen
	The leaf names are not consistent. Give them all a CPUID_LEAF_ prefix for consistency and vertical alignment. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Jiang <dave.jiang@intel.com> # for ioatdma bits Link: https://lore.kernel.org/all/20241213205040.7B0C3241%40davehans-spike.ostc.intel.com
2024-12-18	x86/cpu: Remove unnecessary MwAIT leaf checks	Dave Hansen
	The CPUID leaf dependency checker will remove X86_FEATURE_MWAIT if the CPUID level is below the required level (CPUID_MWAIT_LEAF). Thus, if you check X86_FEATURE_MWAIT you do not need to also check the CPUID level. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241213205030.9B42B458%40davehans-spike.ostc.intel.com
2024-12-18	x86/cpu: Move MWAIT leaf definition to common header	Dave Hansen
	Begin constructing a common place to keep all CPUID leaf definitions. Move CPUID_MWAIT_LEAF to the CPUID header and include it where needed. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/all/20241213205028.EE94D02A%40davehans-spike.ostc.intel.com
2024-12-10	intel_idle: add Clearwater Forest SoC support	Artem Bityutskiy
	Clearwater Forest (CWF) SoC has the same C-states as Sierra Forest (SRF) SoC. Add CWF support by re-using the SRF C-states table. Note: it is expected that CWF C-states will have same or very similar characteristics as SRF C-states (latency and target residency). However, there is a possibility that the characteristics will end up being different enough when the CWF platform development is finished. In that case, a separate CWF C-states table will be created and populated with the CWF-specific characteristics (latency and target residency). Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20241203130306.1559024-1-artem.bityutskiy@linux.intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-11	intel_idle: add Granite Rapids Xeon D support	Artem Bityutskiy
	Add Granite Rapids Xeon D C-states support: C1, C1E, C6, and C6P. The C-states are basically the same as in Granite Rapids Xeon SP/AP, but characteristics (latency, target residency) are a bit different. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20241107115608.52233-1-artem.bityutskiy@linux.intel.com [ rjw: Changelog edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-25	intel_idle: fix ACPI _CST matching for newer Xeon platforms	Artem Bityutskiy
	Background ~~~~~~~~~~ The driver uses 'use_acpi = true' in C-state custom table for all Xeon platforms. The meaning of this flag is as follows. 1. If a C-state from the custom table is defined in ACPI _CST (matched by the mwait hint), then enable this C-state. 2. Otherwise, disable this C-state, unless the C-sate definition in the custom table has the 'CPUIDLE_FLAG_ALWAYS_ENABLE' flag set, in which case enabled it. The goal is to honor BIOS C6 settings - If BIOS disables C6, disable it by default in the OS too (but it can be enabled via sysfs). This works well on Xeons that expose only one flavor of C6. This are all Xeons except for the newest Granite Rapids (GNR) and Sierra Forest (SRF). The problem ~~~~~~~~~~~ GNR and SRF have 2 flavors of C6: C6/C6P on GNR, C6S/C6SP on SRF. The the "P" flavor allows for the package C6, while the "non-P" flavor allows only for core/module C6. As far as this patch is concerned, both GNR and SRF platforms are handled the same way. Therefore, further discussion is focused on GNR, but it applies to SRF as well. On Intel Xeon platforms, BIOS exposes only 2 ACPI C-states: C1 and C2. Well, depending on BIOS settings, C2 may be named as C3. But there still will be only 2 states - C1 and C3. But this is a non-essential detail, so further discussion is focused on the ACPI C1 and C2 case. On pre-GNR/SRF Xeon platforms, ACPI C1 is mapped to C1 or C1E, and ACPI C2 is mapped to C6. The 'use_acpi' flag works just fine: * If ACPI C2 enabled, enable C6. * Otherwise, disable C6. However, on GNR there are 2 flavors of C6, so BIOS maps ACPI C2 to either C6 or C6P, depending on the user settings. As a result, due to the 'use_acpi' flag, 'intel_idle' disables least one of the C6 flavors. BIOS \| OS \| Verdict ----------------------------------------------------\|--------- ACPI C2 disabled \| C6 disabled, C6P disabled \| OK ACPI C2 mapped to C6 \| C6 enabled, C6P disabled \| Not OK ACPI C2 mapped to C6P \| C6 disabled, C6P enabled \| Not OK The goal of 'use_acpi' is to honor BIOS ACPI C2 disabled case, which works fine. But if ACPI C2 is enabled, the goal is to enable all flavors of C6, not just one of the flavors. This was overlooked when enabling GNR/SRF platforms. In other words, before GNR/SRF, the ACPI C2 status was binary - enabled or disabled. But it is not binary on GNR/SRF, however the goal is to continue treat it as binary. The fix ~~~~~~~ Notice, that current algorithm matches ACPI and custom table C-states by the mwait hint. However, mwait hint consists of the 'state' and 'sub-state' parts, and all C6 flavors have the same state value of 0x20, but different sub-state values. Introduce new C-state table flag - CPUIDLE_FLAG_PARTIAL_HINT_MATCH and add it to both C6 flavors of the GNR/SRF platforms. When matching ACPI _CST and custom table C-states, match only the start part if the C-state has CPUIDLE_FLAG_PARTIAL_HINT_MATCH, other wise match both state and sub-state parts (as before). With this fix, GNR C-states enabled/disabled status looks like this. BIOS \| OS ---------------------------------------------------- ACPI C2 disabled \| C6 disabled, C6P disabled ACPI C2 mapped to C6 \| C6 enabled, C6P enabled ACPI C2 mapped to C6P \| C6 enabled, C6P enabled Possible alternative ~~~~~~~~~~~~~~~~~~~~ The alternative would be to remove 'use_acpi' flag for GNR and SRF. This would be a simpler solution, but it would violate the principle of least surprise - users of Xeon platforms are used to the fact that intel_idle honors C6 enabled/disabled flag. It is more consistent user experience if GNR/SRF continue doing so. How tested ~~~~~~~~~~ Tested on GNR and SRF platform with all the 3 BIOS configurations: ACPI C2 disabled, mapped to C6/C6S, mapped to C6P/C6SP. Tested on Ice lake Xeon and Sapphire Rapids Xeon platforms with ACPI C2 enabled and disabled, just to verify that the patch does not break older Xeons. Fixes: 92813fd5b156 ("intel_idle: add Sierra Forest SoC support") Fixes: 370406bf5738 ("intel_idle: add Granite Rapids Xeon support") Cc: 6.8+ <stable@vger.kernel.org> # 6.8+ Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20240913165143.4140073-1-dedekind1@gmail.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-20	intel_idle: Disable promotion to C1E on Jasper Lake and Elkhart Lake	Kai-Heng Feng
	PCIe ethernet throughut is sub-optimal on Jasper Lake and Elkhart Lake. The CPU can take long time to exit to C0 to handle IRQ and perform DMA when C1E has been entered. For this reason, adjust intel_idle to disable promotion to C1E and still use C-states from ACPI _CST on those two platforms. Link: https://bugzilla.kernel.org/show_bug.cgi?id=219023 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Link: https://patch.msgid.link/20240820041128.102452-1-kai.heng.feng@canonical.com [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-19	intel_idle: add Granite Rapids Xeon support	Artem Bityutskiy
	Add Granite Rapids Xeon C-states, which are C1, C1E, C6, and C6P. Comparing to previous Xeon Generations (e.g., Emerald Rapids), C6 requests end up only in core C6 state, and no package C-state promotion takes place even if all cores in the package are in core C6. C6P requests also end up in core C6, but if all cores have requested C6P, the SoC will enter the package C6 state. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Link: https://patch.msgid.link/20240806160310.3719205-1-artem.bityutskiy@linux.intel.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-07	intel_idle: Switch to new Intel CPU model defines	Tony Luck
	New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-03-05	cpuidle: ACPI/intel: fix MWAIT hint target C-state computation	He Rongguang
	According to x86 spec ([1] and [2]), MWAIT hint_address[7:4] plus 1 is the corresponding C-state, and 0xF means C0. ACPI C-state table usually only contains C1+, but nothing prevents ACPI firmware from presenting a C-state (maybe C1+) but using MWAIT address C0 (i.e., 0xF in ACPI FFH MWAIT hint address). And if this is the case, Linux erroneously treat this cstate as C16, while actually this should be valid C0 instead of C16, as per the specifications. Since ACPI firmware is out of Linux kernel scope, fix the kernel handling of 0xF ->(to) C0 in this situation. This is found when a tweaked ACPI C-state table is presented by Qemu to VM. Also modify the intel_idle case for code consistency. [1]. Intel SDM Vol 2, Table 4-11. MWAIT Hints Register (EAX): "Value of 0 means C1; 1 means C2 and so on Value of 01111B means C0". [2]. AMD manual Vol 3, MWAIT: "The processor C-state is EAX[7:4]+1, so to request C0 is to place the value F in EAX[7:4] and to request C1 is to place the value 0 in EAX[7:4].". Signed-off-by: He Rongguang <herongguang@linux.alibaba.com> [ rjw: Subject and changelog edits, whitespace fixups ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-01-09	Merge tag 'pm-6.8-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These add support for new processors (Sierra Forest, Grand Ridge and Meteor Lake) to the intel_idle driver, make intel_pstate run on Emerald Rapids without HWP support and adjust it to utilize EPP values supplied by the platform firmware, fix issues, clean up code and improve documentation. The most significant fix addresses deadlocks in the core system-wide resume code that occur if async_schedule_dev() attempts to run its argument function synchronously (for example, due to a memory allocation failure). It rearranges the code in question which may increase the system resume time in some cases, but this basically is a removal of a premature optimization. That optimization will be added back later, but properly this time. Specifics: - Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs to the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui) - Do not enable interrupts when entering idle in the haltpoll cpuidle driver (Borislav Petkov) - Add Emerald Rapids support in no-HWP mode to the intel_pstate cpufreq driver (Zhenguo Yao) - Use EPP values programmed by the platform firmware as balanced performance ones by default in intel_pstate (Srinivas Pandruvada) - Add a missing function return value check to the SCMI cpufreq driver to avoid unexpected behavior (Alexandra Diupina) - Fix parameter type warning in the armada-8k cpufreq driver (Gregory CLEMENT) - Rework trans_stat_show() in the devfreq core code to avoid buffer overflows (Christian Marangi) - Synchronize devfreq_monitor_[start/stop] so as to prevent a timer list corruption from occurring when devfreq governors are switched frequently (Mukesh Ojha) - Fix possible deadlocks in the core system-wide PM code that occur if device-handling functions cannot be executed asynchronously during resume from system-wide suspend (Rafael J. Wysocki) - Clean up unnecessary local variable initializations in multiple places in the hibernation code (Wang chaodong, Li zeming) - Adjust core hibernation code to avoid missing wakeup events that occur after saving an image to persistent storage (Chris Feng) - Update hibernation code to enforce correct ordering during image compression and decompression (Hongchen Zhang) - Use kmap_local_page() instead of kmap_atomic() in copy_data_page() during hibernation and restore (Chen Haonan) - Adjust documentation and code comments to reflect recent tasks freezer changes (Kevin Hao) - Repair excess function parameter description warning in the hibernation image-saving code (Randy Dunlap) - Fix _set_required_opps when opp is NULL (Bryan O'Donoghue) - Use device_get_match_data() in the OPP code for TI (Rob Herring) - Clean up OPP level and other parts and call dev_pm_opp_set_opp() recursively for required OPPs (Viresh Kumar)" * tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits) OPP: Rename 'rate_clk_single' OPP: Pass rounded rate to _set_opp() OPP: Relocate dev_pm_opp_sync_regulators() PM: sleep: Fix possible deadlocks in core system-wide PM code OPP: Move dev_pm_opp_icc_bw to internal opp.h async: Introduce async_schedule_dev_nocall() async: Split async_schedule_node_domain() cpuidle: haltpoll: Do not enable interrupts when entering idle OPP: Fix _set_required_opps when opp is NULL OPP: The level field is always of unsigned int type PM: hibernate: Repair excess function parameter description warning PM: sleep: Remove obsolete comment from unlock_system_sleep() cpufreq: intel_pstate: Add Emerald Rapids support in no-HWP mode Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes PM: hibernate: Use kmap_local_page() in copy_data_page() intel_idle: add Sierra Forest SoC support intel_idle: add Grand Ridge SoC support PM / devfreq: Synchronize devfreq_monitor_[start/stop] cpufreq: armada-8k: Fix parameter type warning PM: hibernate: Enforce ordering during image compression/decompression ...
2023-12-19	intel_idle: add Sierra Forest SoC support	Artem Bityutskiy
	Add Sierra Forest SoC C-states, which are C1, C1E, C6S, and C6SP. Sierra Forest SoC is built with modules, each module includes 4 cores (Crestmont microarchitecture). There is one L2 cache per module, shared between the 4 cores. There is no core C6 state, but there is C6S state, which has module scope: when all 4 cores request C6S, the entire module (4 cores + L2 cache) enters the low power state. C6SP state has package scope - when all modules in the package enter C6S, the package enters the power state mode. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-19	intel_idle: add Grand Ridge SoC support	Artem Bityutskiy
	Add Intel Grand Ridge SoC C-states, which are C1, C1E, and C6S. The Grand Ridge SoC is built with modules, each module includes 4 cores (Crestmont microarchitecture). There is one L2 cache per module, shared between the 4 cores. There is no core C6 state, but there is C6S state, which has module scope: when all 4 cores request C6S, the entire module (4 cores + L2 cache) enters the low power state. Package C6 is not supported by Grand Ridge SoC. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-11	intel_idle: Add Meteorlake support	Zhang Rui
	Add intel_idle support for MeteorLake. C1 and C1E states on Meteorlake are mutually exclusive, like Alderlake and Raptorlake, but they have little latency difference with measureable power difference, so always enable "C1E promotion" bit and expose C1E only. Expose C6 because it has less power compared with C1E, and smaller latency compared with C8/C10. Ignore C8 and expose C10, because C8 does not show latency advantage compared with C10. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-29	x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram	Peter Zijlstra
	intel_idle_irq() re-enables IRQs very early. As a result, an interrupt may fire before mwait() is eventually called. If such an interrupt queues a timer, it may go unnoticed until mwait returns and the idle loop handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't help because a local timer enqueue doesn't set that flag. The issue is mitigated by the fact that this idle handler is only invoked for shallow C-states when, presumably, the next tick is supposed to be close enough. There may still be rare cases though when the next tick is far away and the selected C-state is shallow, resulting in a timer getting ignored for a while. Fix this with using sti_mwait() whose IRQ-reenablement only triggers upon calling mwait(), dealing with the race while keeping the interrupt latency within acceptable bounds. Fixes: c227233ad64c (intel_idle: enable interrupts before C1 on Xeons) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lkml.kernel.org/r/20231115151325.6262-3-frederic@kernel.org
2023-10-07	intel_idle: Add ibrs_off module parameter to force-disable IBRS	Waiman Long
	Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle") disables IBRS when the cstate is 6 or lower. However, there are some use cases where a customer may want to use max_cstate=1 to lower latency. Such use cases will suffer from the performance degradation caused by the enabling of IBRS in the sibling idle thread. Add a "ibrs_off" module parameter to force disable IBRS and the CPUIDLE_FLAG_IRQ_ENABLE flag if set. In the case of a Skylake server with max_cstate=1, this new ibrs_off option will likely increase the IRQ response latency as IRQ will now be disabled. When running SPECjbb2015 with cstates set to C1 on a Skylake system. First test when the kernel is booted with: "intel_idle.ibrs_off": max-jOPS = 117828, critical-jOPS = 66047 Then retest when the kernel is booted without the "intel_idle.ibrs_off" added: max-jOPS = 116408, critical-jOPS = 58958 That means booting with "intel_idle.ibrs_off" improves performance by: max-jOPS: +1.2%, which could be considered noise range. critical-jOPS: +12%, which is definitely a solid improvement. The admin-guide/pm/intel_idle.rst file is updated to add a description about the new "ibrs_off" module parameter. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-5-longman@redhat.com
2023-10-07	intel_idle: Use __update_spec_ctrl() in intel_idle_ibrs()	Waiman Long
	When intel_idle_ibrs() is called, it modifies the SPEC_CTRL MSR to 0 in order disable IBRS. However, the new MSR value isn't reflected in x86_spec_ctrl_current which is at odd with the other code that keep track of its state in that percpu variable. Use the new __update_spec_ctrl() to have the x86_spec_ctrl_current percpu value properly updated. Since spec-ctrl.h includes both msr.h and nospec-branch.h, we can remove those from the include file list. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-4-longman@redhat.com
2023-08-28	Merge tag 'perf-core-2023-08-28' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event updates from Ingo Molnar: - AMD IBS improvements - Intel PMU driver updates - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events - Micro-optimize software events and the ring-buffer code - Misc cleanups & fixes * tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/uncore: Remove unnecessary ?: operator around pcibios_err_to_errno() call perf/x86/intel: Add Crestmont PMU x86/cpu: Update Hybrids x86/cpu: Fix Crestmont uarch x86/cpu: Fix Gracemont uarch perf: Remove unused extern declaration arch_perf_get_page_size() perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability arm_pmu: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability perf/x86: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability perf/x86/ibs: Set mem_lvl_num, mem_remote and mem_hops for data_src perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA perf/mem: Introduce PERF_MEM_LVLNUM_UNC perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin locking/arch: Avoid variable shadowing in local_try_cmpxchg() perf/core: Use local64_try_cmpxchg in perf_swevent_set_period perf/x86: Use local64_try_cmpxchg perf/amd: Prevent grouping of IBS events
2023-08-09	x86/cpu: Fix Gracemont uarch	Peter Zijlstra
	Alderlake N is an E-core only product using Gracemont micro-architecture. It fits the pre-existing naming scheme perfectly fine, adhere to it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230807150405.686834933@infradead.org
2023-07-19	Revert "intel_idle: Add support for using intel_idle in a VM guest using ↵	Rafael J. Wysocki
	just hlt" This reverts commit 2f3d08f074b0 ("intel_idle: Add support for using intel_idle in a VM guest using just hlt"), because it causes functional issues to appear and it is not really useful without a related commit that got reverted previously. Link: https://lore.kernel.org/linux-pm/5c7de6d5-7706-c4a5-7c41-146db1269aff@intel.com Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Requested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-19	Revert "intel_idle: Add a "Long HLT" C1 state for the VM guest mode"	Rafael J. Wysocki
	This reverts commit 0fac214bb75e ("intel_idle: Add a "Long HLT" C1 state for the VM guest mode"), because there is a coding mistake in it and its validity is questioned. Link: https://lore.kernel.org/all/20230711132553.GN3062772@hirez.programming.kicks-ass.net Requested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-07-19	Revert "intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()"	Rafael J. Wysocki
	This reverts commit b2918089d5cb ("intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()"), because the commit fixed by it will be reverted. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-06-28	intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()	Rafael J. Wysocki
	The caller of (recently added) matchup_vm_state_with_baremetal() is an __init function and it uses some __initdata data structures, so add the __init annotation to it for consistency. This addresses the following build warnings: WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x51 (section: .text) -> intel_idle_max_cstate_reached (section: .init.text) WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x62 (section: .text) -> cpuidle_state_table (section: .init.data) WARNING: modpost: vmlinux: section mismatch in reference: matchup_vm_state_with_baremetal+0x79 (section: .text) -> icpu (section: .init.data) Fixes: 0fac214bb75e ("intel_idle: Add a "Long HLT" C1 state for the VM guest mode") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
2023-06-21	intel_idle: Add a "Long HLT" C1 state for the VM guest mode	Arjan van de Ven
	intel_idle will, for the bare metal case, usually have one or more deep power states that have the CPUIDLE_FLAG_TLB_FLUSHED flag set. When a state with this flag is selected by the cpuidle framework, it will also flush the TLBs as part of entering this state. The benefit of doing this is that the kernel does not need to wake the cpu out of this deep power state just to flush the TLBs... for which the latency can be very high due to the exit latency of deep power states. In a VM guest currently, this benefit of avoiding the wakeup does not exist, while the problem (long exit latency) is even more severe. Linux will need to wake up a vCPU (causing the host to either come out of a deep C state, or the VMM to have to deschedule something else to schedule the vCPU) which can take a very long time.. adding a lot of latency to tlb flush operations (including munmap and others). To solve this, add a "Long HLT" C state to the state table for the VM guest case that has the CPUIDLE_FLAG_TLB_FLUSHED flag set. The result of that is that for long idle periods (where the VMM is likely to do things that cause large latency) the cpuidle framework will flush the TLBs (and avoid the wakeups), while for short/quick idle durations, the existing behavior is retained. Now, there is still only "hlt" available in the guest, but for long idle, the host can go to a deeper state (say C6). There is a reasonable debate one can have to what to set for the exit_latency and break even point for this "Long HLT" state. The good news is that intel_idle has these values available for the underlying CPU (even when mwait is not exposed). The solution thus is to just use the latency and break even of the deepest state from the bare metal CPU. This is under the assumption that this is a pretty reasonable estimate of what the VMM would do to cause latency. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-06-16	intel_idle: Add support for using intel_idle in a VM guest using just hlt	Arjan van de Ven
	In a typical VM guest, the mwait instruction is not available, leaving only the 'hlt' instruction (which causes a VMEXIT to the host). So for this common case, intel_idle will detect the lack of mwait, and fail to initialize (after which another idle method would step in which will just use hlt always). Other (non-common) cases exist; the table below shows the before/after for these: +------------+--------------------------+-------------------------+ \| Hypervisor \| Idle method before patch \| Idle method after patch \| \| exposes \| \| \| +============+==========================+=========================+ \| nothing \| default_idle fallback \| intel_idle VM table \| \| (common) \| (straight "hlt") \| \| +------------+--------------------------+-------------------------+ \| mwait \| intel_idle mwait table \| intel_idle mwait table \| +------------+--------------------------+-------------------------+ \| ACPI \| ACPI C1 state ("hlt") \| intel_idle VM table \| +------------+--------------------------+-------------------------+ This is only applicable to CPUs known by intel_idle. For the bare metal case, unknown CPU models will use the ACPI tables (when available) to get estimates for latency and break even point for longer idle states. In guests, the common case is that ACPI tables are not available, but even when they are available, they can't and don't provide the latency information for the longer (mwait based) states. For this scenario (unknown CPU model), the default_idle mode (no ACPI) or ACPI C1 (ACPI avaible) will be used. By providing capability to do this with the intel_idle driver, we can do better than the fallback or ACPI table methods. While this current change only gets us to the existing behavior, later patches in this series will add new capabilities such as optimized TLB flushing. In order to do this, a simplified version of the initialization function for VM guests is created, and this will be called if the CPU is recognized, but mwait is not supported, and we're in a VM guest. One thing to note is that the max latency (and break even) of this C1 state is higher than the typical bare metal C1 state. Because hlt causes a vmexit, and the cost of vmexit + hypervisor overhead + vmenter is typically in the order of upto 5 microseconds... even if the hypervisor does not actually goes into a hardware power saving state. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> [ rjw: Dropped redundant checks from should_verify_mwait() ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-06-12	intel_idle: clean up the (new) state_update_enter_method function	Arjan van de Ven
	Now that the logic for state_update_enter_method() is in its own function, the long if .. else if .. else if .. else if chain can be simplified by just returning from the function at the various places. This does not change functionality, but it makes the logic much simpler to read or modify later. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-06-12	intel_idle: refactor state->enter manipulation into its own function	Arjan van de Ven
	Since the 6.4 kernel, the logic for updating a state's enter method based on "environmental conditions" (command line options, cpu sidechannel workarounds etc etc) has gotten pretty complex. This patch refactors this into a seperate small, self contained function (no behavior changes) for improved readability and to make future changes to this logic easier to do and understand. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-04-27	intel_idle: mark few variables as __read_mostly	Artem Bityutskiy
	The intention is to clean up the code and make it look a bit more consistent. Mark all unitialized module parameter variables as __read_mostly, not just one of them. The other parameters are read-mostly too. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-04-27	intel_idle: do not sprinkle module parameter definitions around	Artem Bityutskiy
	This is a cleanup which improves code consistency. Move the force_irq_on module parameter variable and definition to the same place where we have variables and definitions for other module parameters. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>