path: root/kernel
Age          Commit message                                              Author
2025-11-09  gcov: add support for GCC 15  Peter Oberparleiter

Using gcov on kernels compiled with GCC 15 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 15. Tested with GCC 14.3.0 and GCC 15.2.0.

Link: https://lkml.kernel.org/r/20251028115125.1319410-1-oberpar@linux.ibm.com
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reported-by: Matthieu Baerts <matttbe@kernel.org>
Closes: https://github.com/linux-test-project/lcov/issues/445
Tested-by: Matthieu Baerts <matttbe@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
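In outline, this kind of change is a version-gated bump of GCOV_COUNTERS in kernel/gcov/gcc_4_7.c. A minimal sketch follows; the GCC 15 value shown and the abbreviated version ladder are assumptions for illustration, not taken from this log:

  /* kernel/gcov/gcc_4_7.c -- sketch only */
  #if (__GNUC__ >= 15)
  #define GCOV_COUNTERS	10	/* assumed; must match GCC 15's gcov-counter.def */
  #elif (__GNUC__ >= 14)
  #define GCOV_COUNTERS	9
  /* ... existing ladder for older compilers ... */
  #endif

If the value is too small, the kernel writes .gcda records that the GCC 15 gcov tooling cannot parse, which matches the truncated-file symptom described above.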
2025-11-09  kho: allocate metadata directly from the buddy allocator  Pasha Tatashin
KHO allocates metadata for its preserved memory map using the slab allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory. A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock. When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded. To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of slab. Link: https://lkml.kernel.org/r/20251021000852.2924827-4-pasha.tatashin@soleen.com Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Matlack <dmatlack@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
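The shape of the substitution described above, sketched with illustrative helper names (not the actual patch): slab calls are replaced with page allocations, which KFENCE never intercepts because it only hooks the slab allocators.

  #include <linux/gfp.h>

  /* Sketch: KHO metadata chunks fit in a page, so take them straight
   * from the buddy allocator instead of kzalloc().
   */
  static void *kho_alloc_metadata_page(void)
  {
  	return (void *)get_zeroed_page(GFP_KERNEL);
  }

  static void kho_free_metadata_page(void *p)
  {
  	free_page((unsigned long)p);
  }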
2025-11-09  kho: increase metadata bitmap size to PAGE_SIZE  Pasha Tatashin

KHO memory preservation metadata is kept in 512-byte chunks, which forces their allocation from the slab allocator. Slabs are not safe to use with KHO because of KFENCE, and because partial slabs may leak into the next kernel. Change the chunk size to PAGE_SIZE.

KFENCE in particular may cause memory corruption: it randomly hands out slab objects that can lie within the scratch area, because KFENCE allocates its objects before the KHO scratch area is marked as a CMA region.

While this change could potentially increase metadata overhead on systems with sparsely preserved memory, this is being mitigated by ongoing work to reduce sparseness during preservation via 1G guest pages. Furthermore, this change aligns with future work on a stateless KHO, which will also use page-sized bitmaps for its radix tree metadata.

Link: https://lkml.kernel.org/r/20251021000852.2924827-3-pasha.tatashin@soleen.com
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-09  kho: warn and fail on metadata or preserved memory in scratch area  Pasha Tatashin

Patch series "KHO: kfence + KHO memory corruption fix", v3.

This series fixes a memory corruption bug in KHO that occurs when KFENCE is enabled. The root cause is that KHO metadata, allocated via kzalloc(), can be randomly serviced by kfence_alloc(). When a kernel boots via KHO, the early memblock allocator is restricted to a "scratch area". This forces the KFENCE pool to be allocated within this scratch area, creating a conflict. If KHO metadata is subsequently placed in this pool, it gets corrupted during the next kexec operation.

Google is using KHO and has had obscure crashes due to this memory corruption, with stacks all over the place. I would prefer this fix to be properly backported to stable so we can also automatically consume it once we switch to the upstream KHO.

Patch 1/3 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG) that adds checks to detect and fail any operation that attempts to place KHO metadata or preserved memory within the scratch area. This serves as a validation and diagnostic tool to confirm the problem without affecting production builds.
Patch 2/3 increases the bitmap size to PAGE_SIZE, so the buddy allocator can be used.
Patch 3/3 provides the fix by modifying KHO to allocate its metadata directly from the buddy allocator instead of slab. This bypasses the KFENCE interception entirely.

This patch (of 3):

It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption.

Add checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.

[rppt@kernel.org: fix KEXEC_HANDOVER_DEBUG Kconfig dependency]
Link: https://lkml.kernel.org/r/aQHUyyFtiNZhx8jo@kernel.org
[pasha.tatashin@soleen.com: build fix]
Link: https://lkml.kernel.org/r/CA+CK2bBnorfsTymKtv4rKvqGBHs=y=MjEMMRg_tE-RME6n-zUw@mail.gmail.com
Link: https://lkml.kernel.org/r/20251021000852.2924827-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20251021000852.2924827-2-pasha.tatashin@soleen.com
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-08  Merge tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds

Pull scheduler fix from Ingo Molnar:

 "Fix a group-throttling bug in the fair scheduler"

* tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
2025-11-08  Merge tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds

Pull perf event fix from Ingo Molnar:

 "Fix a system hang caused by cpu-clock events deadlock"

* tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix system hang caused by cpu-clock usage
2025-11-08  Merge tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds

Pull locking fix from Ingo Molnar:

 "Fix (well, cut in half) a futex performance regression on PowerPC"

* tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Optimize per-cpu reference counting
2025-11-07  audit: merge loops in __audit_inode_child()  Ricardo Robaina

Whenever there's audit context, __audit_inode_child() gets called numerous times, which can lead to high latency in scenarios that create too many sysfs/debugfs entries at once, for instance, upon device_add_disk() invocation.

  # uname -r
  6.18.0-rc2+
  # auditctl -a always,exit -F path=/tmp -k foo
  # time insmod loop max_loop=1000
  real    0m46.676s
  user    0m0.000s
  sys     0m46.405s

  # perf record -a insmod loop max_loop=1000
  # perf report --stdio | grep __audit_inode_child
  32.73%  insmod  [kernel.kallsyms]  [k] __audit_inode_child

__audit_inode_child() searches for both the parent and the child in two different loops that iterate over the same list. This process can be optimized by merging these into a single loop, without changing the function behavior or affecting the code's readability.

This patch merges the two loops that walk through the list context->names_list into a single loop. This optimization resulted in around 51% performance enhancement for the benchmark.

  # uname -r
  6.18.0-rc2-enhancedv3+
  # auditctl -a always,exit -F path=/tmp -k foo
  # time insmod loop max_loop=1000
  real    0m22.899s
  user    0m0.001s
  sys     0m22.652s

Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
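The single-pass shape of such a merge looks roughly like the sketch below; the matches_parent()/matches_child() helpers stand in for the real comparison logic and are not actual kernel functions:

  /* Sketch only: one walk over context->names_list finds both entries. */
  struct audit_names *n, *found_parent = NULL, *found_child = NULL;

  list_for_each_entry(n, &context->names_list, list) {
  	if (!found_parent && matches_parent(n, parent))
  		found_parent = n;	/* previously: first loop */
  	else if (!found_child && matches_child(n, dname))
  		found_child = n;	/* previously: second loop over the same list */
  	if (found_parent && found_child)
  		break;
  }

The per-call work drops from two list traversals to one, which is what shows up as the large reduction in __audit_inode_child() time in the perf profile above.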
2025-11-07  audit: Use kzalloc() instead of kmalloc()/memset() in audit_krule_to_data()  Gongwei Li

Replace kmalloc+memset with kzalloc for better readability and simplicity. This addresses the warning below:

  WARNING: kzalloc should be used for data, instead of kmalloc/memset

Signed-off-by: Gongwei Li <ligongwei@kylinos.cn>
[PM: subject and description tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
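The substitution is the standard one-liner; in the sketch below, sz stands in for the real size expression used in audit_krule_to_data():

  /* before: allocate, then clear */
  data = kmalloc(sz, GFP_KERNEL);
  if (!data)
  	return NULL;
  memset(data, 0, sz);

  /* after: kzalloc() returns zeroed memory in one call */
  data = kzalloc(sz, GFP_KERNEL);
  if (!data)
  	return NULL;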
2025-11-07  Merge tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  Linus Torvalds

Pull tracing fixes from Steven Rostedt:

 - Check for reader catching up in ring_buffer_map_get_reader()

   If the reader catches up to the writer in the memory mapped ring buffer then calling rb_get_reader_page() will return NULL as there are no pages left. But this isn't checked for before calling rb_get_reader_page() and the return of NULL causes a warning. If it is detected that the reader caught up to the writer, then simply exit the routine

 - Fix memory leak in histogram create_field_var()

   A couple of the error paths in create_field_var() did not properly clean up what was allocated. Make sure everything is freed properly on error

 - Fix help message of tools latency_collector

   The help message incorrectly stated that "-t" was the same as "--threads" whereas "--threads" is actually represented by "-e"

* tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/tools: Fix incorrcet short option in usage text for --threads
  tracing: Fix memory leaks in create_field_var()
  ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up
2025-11-07  PM: hibernate: Fix style issues in save_compressed_image()  Mario Limonciello (AMD)

Address two issues indicated by checkpatch:
 - Trailing statements should be on next line.
 - Prefer 'unsigned int' to bare use of 'unsigned'.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
[ rjw: Changelog edits ]
Link: https://patch.msgid.link/20251106045158.3198061-4-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-07  PM: hibernate: Use atomic64_t for compressed_size variable  Mario Limonciello (AMD)
`compressed_size` can overflow, showing nonsensical values. Change from `atomic_t` to `atomic64_t` to prevent overflow. Fixes: a06c6f5d3cc9 ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ Link: https://patch.msgid.link/20251106045158.3198061-3-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
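The pattern of such a fix, sketched (where the counter lives is illustrative): a 32-bit atomic_t wraps once roughly 2 GiB of compressed data has been summed, so the running total is kept in an atomic64_t instead.

  #include <linux/atomic.h>

  static atomic64_t compressed_size = ATOMIC64_INIT(0);

  /* producer side: account each compressed block */
  static void account_compressed(size_t cmp_len)
  {
  	atomic64_add(cmp_len, &compressed_size);
  }

  /* reporting side: read the exact 64-bit total */
  static u64 total_compressed(void)
  {
  	return atomic64_read(&compressed_size);
  }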
2025-11-07  PM: hibernate: Emit an error when image writing fails  Mario Limonciello (AMD)
If image writing fails, a return code is passed up to the caller, but none of the callers log anything to the log and so the only record of it is the return code that userspace gets. Adjust the logging so that the image size and speed of writing is only emitted on success and if there is an error, it's saved to the logs. Fixes: a06c6f5d3cc9 ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ [ rjw: Added missing braces after "else", changelog edits ] Link: https://patch.msgid.link/20251106045158.3198061-2-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-07  refscale: Do not disable interrupts for tests involving local_bh_enable()  Paul E. McKenney
Some kernel configurations prohibit invoking local_bh_enable() while interrupts are disabled. However, refscale disables interrupts to reduce OS noise during the tests, which results in splats. This commit therefore adds an ->enable_irqs flag to the ref_scale_ops structure, and refrains from disabling interrupts when that flag is set. This flag is set for the "bh" and "incpercpubh" scale_type module-parameter values. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07  refscale: Add non-atomic per-CPU increment readers  Paul E. McKenney
This commit adds refscale readers based on READ_ONCE() and WRITE_ONCE() that are unprotected (can lose counts, "refscale.scale_type=incpercpu"), preempt-disabled ("refscale.scale_type=incpercpupreempt"), bh-disabled ("refscale.scale_type=incpercpubh"), and irq-disabled ("refscale.scale_type=incpercpuirqsave"). On my x86 laptop, these are about 4.3ns, 3.8ns, and 7.3ns per pair, respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07  refscale: Add this_cpu_inc() readers  Paul E. McKenney

This commit adds refscale readers based on this_cpu_inc() ("refscale.scale_type=percpuinc"). On my x86 laptop, these are about 4.5ns per pair.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07  refscale: Add preempt_disable() readers  Paul E. McKenney
This commit adds refscale readers based on preempt_disable() and preempt_enable() ("refscale.scale_type=preempt"). On my x86 laptop, these are about 2.8ns. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07  refscale: Add local_bh_disable() readers  Paul E. McKenney
This commit adds refscale readers based on local_bh_disable() and local_bh_enable() ("refscale.scale_type=bh"). On my x86 laptop, these are about 4.9ns. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07  refscale: Add local_irq_disable() and local_irq_save() readers  Paul E. McKenney
This commit adds refscale readers based on local_irq_disable() and local_irq_enable() ("refscale.scale_type=irq") and on local_irq_save() and local_irq_restore ("refscale.scale_type=irqsave"). On my x86 laptop, these are about 2.8ns and 7.5ns per enable/disable pair, respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07  printk: nbcon: Allow unsafe write_atomic() for panic  John Ogness
There may be console drivers that have not yet figured out a way to implement safe atomic printing (->write_atomic() callback). These drivers could choose to only implement threaded printing (->write_thread() callback), but then it is guaranteed that _no_ output will be printed during panic. Not even attempted. As a result, developers may be tempted to implement unsafe ->write_atomic() callbacks and/or implement some sort of custom deferred printing trickery to try to make it work. This goes against the principle intention of the nbcon API as well as endangers other nbcon drivers that are doing things correctly (safely). As a compromise, allow nbcon drivers to implement unsafe ->write_atomic() callbacks by providing a new console flag CON_NBCON_ATOMIC_UNSAFE. When specified, the ->write_atomic() callback for that console will _only_ be called during the final "hope and pray" flush attempt at the end of a panic: nbcon_atomic_flush_unsafe(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/lkml/b2qps3uywhmjaym4mht2wpxul4yqtuuayeoq4iv4k3zf5wdgh3@tocu6c7mj4lt Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/all/swdpckuwwlv3uiessmtnf2jwlx3jusw6u7fpk5iggqo4t2vdws@7rpjso4gr7qp/ [1] Link: https://lore.kernel.org/all/20251103-fix_netpoll_aa-v4-1-4cfecdf6da7c@debian.org/ [2] Link: https://patch.msgid.link/20251027161212.334219-2-john.ogness@linutronix.de [pmladek@suse.com: Fix build with rework/nbcon-in-kdb branch.] Signed-off-by: Petr Mladek <pmladek@suse.com>
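From a driver's point of view, opting in looks roughly like the sketch below; the driver name and callback bodies are made up for illustration, only the CON_NBCON_ATOMIC_UNSAFE flag comes from this change:

  #include <linux/console.h>

  static void example_write_thread(struct console *con,
  				 struct nbcon_write_context *wctxt)
  {
  	/* normal, preemptible printing path */
  }

  static void example_write_atomic(struct console *con,
  				 struct nbcon_write_context *wctxt)
  {
  	/* known-unsafe path: with the flag below it is only invoked from
  	 * the final nbcon_atomic_flush_unsafe() attempt at end of panic */
  }

  static struct console example_console = {
  	.name		= "examplecon",
  	.write_thread	= example_write_thread,
  	.write_atomic	= example_write_atomic,
  	.flags		= CON_NBCON | CON_NBCON_ATOMIC_UNSAFE,
  };

This keeps regular operation on the safe threaded path while still giving the console a last chance to emit the panic backtrace.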
2025-11-07  srcu: Add SRCU_READ_FLAVOR_FAST_UPDOWN CPP macro  Paul E. McKenney
This commit adds the SRCU_READ_FLAVOR_FAST_UPDOWN=0x8 macro and adjusts rcutorture to make use of it. In this commit, both SRCU_READ_FLAVOR_FAST=0x4 and the new SRCU_READ_FLAVOR_FAST_UPDOWN test SRCU-fast. When the SRCU-fast-updown is added, the new SRCU_READ_FLAVOR_FAST_UPDOWN macro will test it when passed to the rcutorture.reader_flavor module parameter. The old SRCU_READ_FLAVOR_FAST macro's value changed from 0x8 to 0x4. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-07  rcu: Mark diagnostic functions as notrace  Paul E. McKenney
The rcu_lockdep_current_cpu_online(), rcu_read_lock_sched_held(), rcu_read_lock_held(), rcu_read_lock_bh_held(), rcu_read_lock_any_held() are used by tracing-related code paths, so putting traces on them is unlikely to make anyone happy. This commit therefore marks them all "notrace". Reported-by: Leon Hwang <leon.hwang@linux.dev> Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-06  tracing: Fix memory leaks in create_field_var()  Zilin Guan
The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself. Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: 02205a6752f22 ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-06  ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up  Steven Rostedt
The function ring_buffer_map_get_reader() is a bit more strict than the other get reader functions, and except for certain situations the rb_get_reader_page() should not return NULL. If it does, it triggers a warning. This warning was triggering but after looking at why, it was because another acceptable situation was happening and it wasn't checked for. If the reader catches up to the writer and there's still data to be read on the reader page, then the rb_get_reader_page() will return NULL as there's no new page to get. In this situation, the reader page should not be updated and no warning should trigger. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/ Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-06  bpf: Use kmalloc_nolock() in range tree  Puranjay Mohan
The range tree uses bpf_mem_alloc() that is safe to be called from all contexts and uses a pre-allocated pool of memory to serve these allocations. Replace bpf_mem_alloc() with kmalloc_nolock() as it can be called safely from all contexts and is more scalable than bpf_mem_alloc(). Remove the migrate_disable/enable pairs as they were only needed for bpf_mem_alloc() as it does per-cpu operations, kmalloc_nolock() doesn't need this. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251106170608.4800-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-06  cgroup: Fix sleeping from invalid context warning on PREEMPT_RT  Tejun Heo
cgroup_task_dead() is called from finish_task_switch() which runs with preemption disabled and doesn't allow scheduling even on PREEMPT_RT. The function needs to acquire css_set_lock which is a regular spinlock that can sleep on RT kernels, leading to "sleeping function called from invalid context" warnings. css_set_lock is too large in scope to convert to a raw_spinlock. However, the unlinking operations don't need to run synchronously - they just need to complete after the task is done running. On PREEMPT_RT, defer the work through irq_work. While the work doesn't need to happen immediately, it can't be delayed indefinitely either as the dead task pins the cgroup and task_struct can be pinned indefinitely. Use the lazy version of irq_work to allow batching and lower impact while ensuring timely completion. v2: Use IRQ_WORK_INIT_LAZY instead of immediate irq_work and add explanation for why the work can't be delayed indefinitely (Sebastian Andrzej Siewior). Fixes: d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") Reported-by: Calvin Owens <calvin@wbinvd.org> Link: https://lore.kernel.org/r/20251104181114.489391-1-calvin@wbinvd.org Signed-off-by: Tejun Heo <tj@kernel.org>
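The deferral pattern described above looks roughly like the following sketch; the handler body, the placement of the work item, and the function names are illustrative, only IRQ_WORK_INIT_LAZY, css_set_lock and the PREEMPT_RT condition come from the description:

  #include <linux/irq_work.h>

  static void cgroup_task_dead_fn(struct irq_work *work)
  {
  	/* Runs outside finish_task_switch(); on PREEMPT_RT this is a
  	 * context where taking the sleeping css_set_lock is allowed.
  	 * The actual unlinking of the dead task's css_set goes here.
  	 */
  }

  /* Lazy: batched with the next tick instead of raising a self-IPI,
   * which is enough because the unlink only has to happen "soon".
   */
  static struct irq_work cgroup_task_dead_work =
  	IRQ_WORK_INIT_LAZY(cgroup_task_dead_fn);

  static void cgroup_task_dead_defer(void)
  {
  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
  		irq_work_queue(&cgroup_task_dead_work);
  }

A single work item is shown for brevity; the real change additionally has to track every dead task until its unlink has run.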
2025-11-07  tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobe  Masami Hiramatsu (Google)

__unregister_trace_fprobe() checks tf->tuser and puts it when removing a tprobe. However, disable_trace_fprobe() does not do this and only calls unregister_fprobe(), so it forgets to disable the tracepoint_user. If the trace_fprobe has a tuser, put it when disabling the tprobe so that the tracepoint callbacks are unregistered correctly.

Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/
Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-11-07  tracing: tprobe-events: Fix to register tracepoint correctly  Masami Hiramatsu (Google)

Since __tracepoint_user_init() calls tracepoint_user_register() without initializing tuser->tpoint with the given tracepoint, it does not register the tracepoint stub function as a callback correctly, and the tprobe does not work. Initialize tuser->tpoint correctly before tracepoint_user_register() so that it sets up the tracepoint callback.

I confirmed the example below works fine again.

  echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events
  echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable
  cat /sys/kernel/tracing/trace_pipe

Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/
Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event")
Reported-by: Beau Belgrave <beaub@linux.microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>
2025-11-06  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  Jakub Kicinski

Cross-merge networking fixes after downstream PR (net-6.18-rc5).

Conflicts:

drivers/net/wireless/ath/ath12k/mac.c
  9222582ec524 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
  6917e268c433 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net

Adjacent changes:

drivers/net/ethernet/intel/Kconfig
  b1d16f7c0063 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
  93f53db9f9dc ("ice: switch to Page Pool")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06  futex: Optimize per-cpu reference counting  Peter Zijlstra

Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely).

Further optimize the per-cpu reference counter by:

 - switching from RCU to preempt;
 - using __this_cpu_*() since we now have preempt disabled;
 - switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock().

Having preemption disabled allows using __this_cpu_*() provided the only access to the variable is in task context -- which is the case here.

Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier(). This is very similar to the percpu_down_read_internal() fast-path.

The reason this is significant for PowerPC is that it uses the generic this_cpu_*() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, so not having to use explicit barriers saves a bunch.

Combined this reduces the performance gap by half, down to some 5%.

Fixes: 760e6f7befba ("futex: Remove support for IMMUTABLE")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net
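Put together, the get-side fast path has roughly this shape; the field names (state, refcnt) and the FR_PERCPU label are illustrative, only fph, FR_ATOMIC and the barrier reasoning come from the description above:

  /* Sketch of the fast path after the change. */
  static bool futex_private_hash_get(struct futex_private_hash *fph)
  {
  	bool ret = false;

  	/* Disabling preemption blocks the RCU grace period that any
  	 * transition to FR_ATOMIC must wait for, so __this_cpu_inc()
  	 * and a plain READ_ONCE() are sufficient here.
  	 */
  	preempt_disable();
  	if (READ_ONCE(fph->state) == FR_PERCPU) {
  		__this_cpu_inc(*fph->refcnt);
  		ret = true;
  	}
  	preempt_enable();

  	return ret;
  }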
2025-11-06  sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining  Aaron Lu

When a cfs_rq is to be throttled, its limbo list should be empty, and that's why there is a warn in tg_throttle_down() for a non-empty cfs_rq->throttled_limbo_list.

When running a test with the following hierarchy:

           root
          /    \
         A*    ...
       / | \
     ...     B
            / \
           C*

where both A and C have quota settings, that warn on a non-empty limbo list is triggered for a cfs_rq of C, let's call it cfs_rq_c (and ignore the cpu part of the cfs_rq for the sake of simpler representation). Debug showed it happened like this:

Task group C is created and quota is set, so in tg_set_cfs_bandwidth(), cfs_rq_c is initialized with runtime_enabled set, runtime_remaining equal to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c, *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled), these throttled tasks are directly placed into cfs_rq_c's limbo list by enqueue_throttled_task().

Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these tasks. The first enqueue triggers check_enqueue_throttle(), and with zero runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it can't get more runtime and enters tg_throttle_down(), where the warning is hit due to remaining tasks in the limbo list.

Triggering throttle on the unthrottle path is chaotic, since the status of a cfs_rq that is being unthrottled can end up in a mixed state. Fix this by granting 1ns to the cfs_rq in tg_set_cfs_bandwidth(). This ensures cfs_rq_c has a positive runtime_remaining when initialized as unthrottled and cannot enter tg_unthrottle_up() with zero runtime_remaining.

Also, update outdated comments in tg_throttle_down() since unthrottle_cfs_rq() is no longer called with zero runtime_remaining. While at it, remove a redundant assignment to se in tg_throttle_down().

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reviewed-By: Benjamin Segall <bsegall@google.com>
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Hao Jia <jiahao1@lixiang.com>
Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
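The fix itself is tiny; a sketch of the relevant per-CPU loop in tg_set_cfs_bandwidth(), with locking and the rest of the function elided:

  /* Sketch: seed each cfs_rq with a token 1 ns of runtime instead of 0,
   * so a freshly configured, unthrottled cfs_rq can never reach
   * throttle_cfs_rq() with zero runtime_remaining while it is being
   * unthrottled.
   */
  for_each_online_cpu(i) {
  	struct cfs_rq *cfs_rq = tg->cfs_rq[i];

  	cfs_rq->runtime_enabled = runtime_enabled;
  	cfs_rq->runtime_remaining = 1;
  }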
2025-11-05  bpf: disasm: add support for BPF_JMP|BPF_JA|BPF_X  Anton Protopopov

Add support for indirect jump instruction. Example output from bpftool:

  0: (79) r3 = *(u64 *)(r1 +0)
  1: (25) if r3 > 0x4 goto pc+666
  2: (67) r3 <<= 3
  3: (18) r1 = 0xffffbeefspameggs
  5: (0f) r1 += r3
  6: (79) r1 = *(u64 *)(r1 +0)
  7: (0d) gotox r1

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251105090410.1250500-10-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05  bpf, x86: add support for indirect jumps  Anton Protopopov

Add support for a new instruction

  BPF_JMP|BPF_X|BPF_JA, SRC=0, DST=Rx, off=0, imm=0

which does an indirect jump to a location stored in Rx. The register Rx should have type PTR_TO_INSN. This new type assures that the Rx register contains a value (or a range of values) loaded from a correct jump table: a map of type instruction array.

For example, for a C switch LLVM will generate the following code:

  0: r3 = r1                   # "switch (r3)"
  1: if r3 > 0x13 goto +0x666  # check r3 boundaries
  2: r3 <<= 0x3                # adjust to an index in array of addresses
  3: r1 = 0xbeef ll            # r1 is PTR_TO_MAP_VALUE, r1->map_ptr=M
  5: r1 += r3                  # r1 inherits boundaries from r3
  6: r1 = *(u64 *)(r1 + 0x0)   # r1 now has type INSN_TO_PTR
  7: gotox r1                  # jit will generate proper code

Here the gotox instruction corresponds to one particular map. It is possible however to have a gotox instruction which can be loaded from different maps, e.g.

  0: r1 &= 0x1
  1: r2 <<= 0x3
  2: r3 = 0x0 ll               # load from map M_1
  4: r3 += r2
  5: if r1 == 0x0 goto +0x4
  6: r1 <<= 0x3
  7: r3 = 0x0 ll               # load from map M_2
  9: r3 += r1
  A: r1 = *(u64 *)(r3 + 0x0)
  B: gotox r1                  # jump to target loaded from M_1 or M_2

During the check_cfg stage the verifier will collect all the maps which point to inside the subprog being verified. When building the config, the high 16 bytes of the insn_state are used, so this patch (theoretically) supports jump tables of up to 2^16 slots.

During the later stage, in check_indirect_jump, it is checked that the register Rx was loaded from a particular instruction array.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251105090410.1250500-9-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05  bpf: support instructions arrays with constants blinding  Anton Protopopov
When bpf_jit_harden is enabled, all constants in the BPF code are blinded to prevent JIT spraying attacks. This happens during JIT phase. Adjust all the related instruction arrays accordingly. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-6-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-05  bpf, x86: add new map type: instructions array  Anton Protopopov

On bpf(BPF_PROG_LOAD) syscall user-supplied BPF programs are translated by the verifier into "xlated" BPF programs. During this process the original instruction offsets might be adjusted and/or individual instructions might be replaced by new sets of instructions, or deleted.

Add a new BPF map type which is aimed to keep track of how, for a given program, the original instructions were relocated during the verification. Also, besides keeping track of the original -> xlated mapping, make the x86 JIT build the xlated -> jitted mapping for every instruction listed in an instruction array. This is required for every future application of instruction arrays: static keys, indirect jumps and indirect calls.

A map of the BPF_MAP_TYPE_INSN_ARRAY type must be created with u32 keys and values of size 8. The values have different semantics for userspace and for BPF space. For userspace a value consists of two u32 values: xlated and jitted offsets. For the BPF side the value is a real pointer to a jitted instruction.

On map creation/initialization, before loading the program, each element of the map should be initialized to point to an instruction offset within the program. Before the program load such maps should be made frozen. After the program verification xlated and jitted offsets can be read via the bpf(2) syscall. If a tracked instruction is removed by the verifier, then the xlated offset is set to (u32)-1 which is considered to be too big for a valid BPF program offset.

One such map can, obviously, be used to track one and only one BPF program. If the verification process was unsuccessful, then the same map can be re-used to verify the program with a different log level. However, if the program was loaded fine, then such a map, being frozen in any case, can't be reused by other programs even after the program release.

Example. Consider the following original and xlated programs:

  Original prog:                  Xlated prog:
   0: r1 = 0x0                     0: r1 = 0
   1: *(u32 *)(r10 - 0x4) = r1     1: *(u32 *)(r10 -4) = r1
   2: r2 = r10                     2: r2 = r10
   3: r2 += -0x4                   3: r2 += -4
   4: r1 = 0x0 ll                  4: r1 = map[id:88]
   6: call 0x1                     6: r1 += 272
                                   7: r0 = *(u32 *)(r2 +0)
                                   8: if r0 >= 0x1 goto pc+3
                                   9: r0 <<= 3
                                  10: r0 += r1
                                  11: goto pc+1
                                  12: r0 = 0
   7: r6 = r0                     13: r6 = r0
   8: if r6 == 0x0 goto +0x2      14: if r6 == 0x0 goto pc+4
   9: call 0x76                   15: r0 = 0xffffffff8d2079c0
                                  17: r0 = *(u64 *)(r0 +0)
  10: *(u64 *)(r6 + 0x0) = r0     18: *(u64 *)(r6 +0) = r0
  11: r0 = 0x0                    19: r0 = 0x0
  12: exit                        20: exit

An instruction array map containing, e.g., instructions [0,4,7,12] will be translated by the verifier to [0,4,13,20]. A map with index 5 (the middle of a 16-byte instruction) or indexes greater than 12 (outside the program boundaries) would be rejected.

The functionality provided by this patch will be extended in consequent patches to implement BPF Static Keys, indirect jumps, and indirect calls.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251105090410.1250500-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-06  rcutorture: Remove redundant rcutorture_one_extend() from rcu_torture_one_read()  Paul E. McKenney
This commit removes a harmless but potentially confusing invocation of rcutorture_one_extend() within rcu_torture_one_read(). The immediately preceding call to rcu_torture_one_read_start() already does this cleanup, and the other call to rcu_torture_one_read_start() already relies on this. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-06  locktorture: Fix memory leak in param_set_cpumask()  Wang Liang

With CONFIG_CPUMASK_OFFSTACK=y, the 'bind_writers' buffer is allocated via alloc_cpumask_var() in param_set_cpumask(). But it is not freed when setting the module parameter multiple times via the sysfs interface or when removing the module. The kmemleak trace below is seen for this issue:

  unreferenced object 0xffff888100aabff8 (size 8):
    comm "bash", pid 323, jiffies 4295059233
    hex dump (first 8 bytes):
      07 00 00 00 00 00 00 00                          ........
    backtrace (crc ac50919):
      __kmalloc_node_noprof+0x2e5/0x420
      alloc_cpumask_var_node+0x1f/0x30
      param_set_cpumask+0x26/0xb0 [locktorture]
      param_attr_store+0x93/0x100
      module_attr_store+0x1b/0x30
      kernfs_fop_write_iter+0x114/0x1b0
      vfs_write+0x300/0x410
      ksys_write+0x60/0xd0
      do_syscall_64+0xa4/0x260
      entry_SYSCALL_64_after_hwframe+0x77/0x7f

This issue can be reproduced by:

  insmod locktorture.ko bind_writers=1
  rmmod locktorture

or:

  insmod locktorture.ko bind_writers=1
  echo 2 > /sys/module/locktorture/parameters/bind_writers

Considering that setting the module parameter 'bind_writers' or 'bind_readers' via the sysfs interface has no real effect, set the parameter permissions to 0444. To fix the memory leak when removing the module, free the 'bind_writers' and 'bind_readers' memory in lock_torture_cleanup().

Fixes: 73e341242483 ("locktorture: Add readers_bind and writers_bind module parameters")
Suggested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
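The cleanup side of such a fix follows the usual cpumask pattern; a sketch (the helper name is illustrative, bind_writers/bind_readers are the cpumask_var_t parameters named above):

  /* Called from lock_torture_cleanup(): release the cpumasks allocated
   * by param_set_cpumask(). Only does real work with
   * CONFIG_CPUMASK_OFFSTACK=y; otherwise free_cpumask_var() is a no-op.
   */
  static void lock_torture_free_bind_masks(void)
  {
  	free_cpumask_var(bind_writers);
  	free_cpumask_var(bind_readers);
  }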
2025-11-05  srcu: Require special srcu_struct define/init for SRCU-fast readers  Paul E. McKenney
This commit adds CONFIG_PROVE_RCU=y checking to enforce the new rule that srcu_struct structures passed to srcu_read_lock_fast() and other SRCU-fast read-side markers be either initialized with init_srcu_struct_fast() on the one hand or defined using either DEFINE_SRCU_FAST() or DEFINE_STATIC_SRCU_FAST(). This will enable removal of the non-debug read-side checks from srcu_read_lock_fast() and friends, which on my laptop provides a 25% speedup (which admittedly amounts to about half a nanosecond, but when tracing fastpaths...) Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  rcutorture: Exercise DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast()  Paul E. McKenney
This commit updates the initialization for the "srcu" and "srcud" torture types to use DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast(), respectively, when reader_flavor is equal to SRCU_READ_FLAVOR_FAST. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Make grace-period determination use ssp->srcu_reader_flavor  Paul E. McKenney

This commit causes the srcu_readers_unlock_idx() function to take the srcu_struct structure's ->srcu_reader_flavor field into account. This ensures that structures defined via DEFINE_SRCU_FAST() or initialized via init_srcu_struct_fast() have their grace periods use synchronize_srcu() or synchronize_srcu_expedited() instead of smp_mb(), even before the first SRCU reader has been entered.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Create a DEFINE_SRCU_FAST()  Paul E. McKenney

This commit creates DEFINE_SRCU_FAST() and DEFINE_STATIC_SRCU_FAST() macros that are similar to DEFINE_SRCU() and DEFINE_STATIC_SRCU(), but which create srcu_struct structures that are usable only by readers initiated by srcu_read_lock_fast() and friends.

This commit does make DEFINE_SRCU_FAST() available to modules, in which case the per-CPU srcu_data structures are not created at compile time, but rather at module-load time. This means that the ->srcu_reader_flavor field of the srcu_data structure is not available. Therefore, this commit instead creates an ->srcu_reader_flavor field in the srcu_struct structure, adds arguments to the DEFINE_SRCU()-related macros to initialize this new field, and extends the checks in the __srcu_check_read_flavor() function to include this new field.

This commit also allows dynamically allocated srcu_struct structures to be marked for SRCU-fast readers. It does so by defining a new init_srcu_struct_fast() function that marks the specified srcu_struct structure for use by srcu_read_lock_fast() and friends.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  rcutorture: Test srcu_expedite_current()  Paul E. McKenney
This commit adds a ->exp_current member to the rcu_torture_ops structure to test the srcu_expedite_current() function. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Create an srcu_expedite_current() function  Paul E. McKenney
This commit creates an srcu_expedite_current() function that expedites the current (and possibly the next) SRCU grace period for the specified srcu_struct structure. This functionality will be inherited by RCU Tasks Trace courtesy of its mapping to SRCU fast. If the current SRCU grace period is already waiting, that wait will complete before the expediting takes effect. If there is no SRCU grace period in flight, this function might well create one. [ paulmck: Apply Zqiang feedback for PREEMPT_RT use. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <bpf@vger.kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-11-05  srcu: Permit Tiny SRCU srcu_read_unlock() with interrupts disabled  Paul E. McKenney
The current Tiny SRCU implementation of srcu_read_unlock() awakens the grace-period processing when exiting the outermost SRCU read-side critical section. However, not all Linux-kernel configurations and contexts permit swake_up_one() to be invoked while interrupts are disabled, and this can result in indefinitely extended SRCU grace periods. This commit therefore only invokes swake_up_one() when interrupts are enabled, and introduces polling to the grace-period workqueue handler. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Zqiang <qiang.zhang@linux.dev> Closes: https://lore.kernel.org/oe-lkp/202508261642.b15eefbb-lkp@intel.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
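The unlock-side guard described above has roughly this shape (a sketch with the surrounding Tiny SRCU code elided; the grace-period workqueue handler is assumed to poll, so a skipped wakeup is eventually noticed):

  void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
  {
  	int newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) - 1;

  	WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
  	/* Only wake the grace-period waiter when it is legal to do so. */
  	if (!newval && READ_ONCE(ssp->srcu_gp_waiting) && !irqs_disabled())
  		swake_up_one(&ssp->srcu_wq);
  }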
2025-11-05  trace: use override credential guard  Christian Brauner
Use override credential guards for scoped credential override with automatic restoration on scope exit. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-12-b447b82f2c9b@kernel.org Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  trace: use prepare credential guard  Christian Brauner
Use the prepare credential guard for allocating a new set of credentials. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-11-b447b82f2c9b@kernel.org Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races  Tejun Heo

The warned bitfields in struct scx_sched are updated racily from concurrent CPUs causing RMW races, which is fine for these boolean warning flags. Add a comment marking this area to prevent future fields that can't tolerate racy updates from being added here.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05  cgroup/cpuset: Globally track isolated_cpus update  Waiman Long
The current cpuset code passes a local isolcpus_updated flag around in a number of functions to determine if external isolation related cpumasks like wq_unbound_cpumask should be updated. It is a bit cumbersome and makes the code more complex. Simplify the code by using a global boolean flag "isolated_cpus_updating" to track this. This flag will be set in isolated_cpus_update() and cleared in update_isolation_cpumasks(). No functional change is expected. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05  cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition  Waiman Long

Commit 4a74e418881f ("cgroup/cpuset: Check partition conflict with housekeeping setup") is supposed to ensure that domain isolated CPUs designated by the "isolcpus" boot command line option stay either in the root partition or in isolated partitions. However, the required check wasn't implemented when a remote partition was created or when an existing partition changed type from "root" to "isolated". Even though this is a relatively minor issue, we still need to add the required prstate_housekeeping_conflict() call in the right places to ensure that the rule is strictly followed.

The following steps can be used to reproduce the problem before this fix.

  # fmt -1 /proc/cmdline | grep isolcpus
  isolcpus=9
  # cd /sys/fs/cgroup/
  # echo +cpuset > cgroup.subtree_control
  # mkdir test
  # echo 9 > test/cpuset.cpus
  # echo isolated > test/cpuset.cpus.partition
  # cat test/cpuset.cpus.partition
  isolated
  # cat test/cpuset.cpus.effective
  9
  # echo root > test/cpuset.cpus.partition
  # cat test/cpuset.cpus.effective
  9
  # cat test/cpuset.cpus.partition
  root

With this fix, the last few steps will become:

  # echo root > test/cpuset.cpus.partition
  # cat test/cpuset.cpus.effective
  0-8,10-95
  # cat test/cpuset.cpus.partition
  root invalid (partition config conflicts with housekeeping setup)

Reported-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05  cgroup/cpuset: Move up prstate_housekeeping_conflict() helper  Waiman Long
Move up the prstate_housekeeping_conflict() helper so that it can be used in remote partition code. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>