path: root/kernel
2025-02-27sched/core: Prevent rescheduling when interrupts are disabledThomas Gleixner
David reported a warning observed while loop testing kexec jump:

  Interrupts enabled after irqrouter_resume+0x0/0x50
  WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
   kernel_kexec+0xf6/0x180
   __do_sys_reboot+0x206/0x250
   do_syscall_64+0x95/0x180

The corresponding interrupt flag trace:

  hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
  hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90

That means __up_console_sem() was invoked with interrupts enabled. Further instrumentation revealed that in the interrupt disabled section of kexec jump one of the syscore_suspend() callbacks woke up a task, which set the NEED_RESCHED flag. A later callback in the resume path invoked cond_resched() which in turn led to the invocation of the scheduler:

   __cond_resched+0x21/0x60
   down_timeout+0x18/0x60
   acpi_os_wait_semaphore+0x4c/0x80
   acpi_ut_acquire_mutex+0x3d/0x100
   acpi_ns_get_node+0x27/0x60
   acpi_ns_evaluate+0x1cb/0x2d0
   acpi_rs_set_srs_method_data+0x156/0x190
   acpi_pci_link_set+0x11c/0x290
   irqrouter_resume+0x54/0x60
   syscore_resume+0x6a/0x200
   kernel_kexec+0x145/0x1c0
   __do_sys_reboot+0xeb/0x240
   do_syscall_64+0x95/0x180

This is a long-standing problem, which probably became more visible with the recent printk changes. Something does a task wakeup and the scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and invokes schedule() from a completely bogus context. The scheduler enables interrupts after context switching, which causes the above warning at the end. Quite a few of the code paths in syscore_suspend()/resume() can result in triggering a wakeup with exactly the same consequences. They might not have done so yet, but as they share a lot of code with normal operations it's just a question of time. The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling models. Full preemption is not affected as cond_resched() is disabled and the preemption check preemptible() takes the interrupt disabled flag into account. Cure the problem by adding a corresponding check into cond_resched(). Reported-by: David Woodhouse <dwmw@amazon.co.uk> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
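As a rough sketch of the kind of guard being described (illustrative pseudocode, not the literal patch; the real cond_resched() plumbing uses different helpers):

  #include <linux/irqflags.h>
  #include <linux/sched.h>

  static int cond_resched_sketch(void)
  {
          /*
           * A wakeup performed while interrupts were disabled may have set
           * NEED_RESCHED. Scheduling from here would be bogus and would
           * re-enable interrupts behind the caller's back, so bail out and
           * let a sane context handle the reschedule instead.
           */
          if (irqs_disabled())
                  return 0;

          if (should_resched(0)) {
                  schedule();
                  return 1;
          }
          return 0;
  }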
2025-02-27x86/bpf: Fix BPF percpu accessesBrian Gerst
Due to this recent commit in the x86 tree: 9d7de2aa8b41 ("Use relative percpu offsets") percpu addresses went from positive offsets from the GSBASE to negative kernel virtual addresses. The BPF verifier has an optimization for x86-64 that loads the address of cpu_number into a register, but was only doing a 32-bit load which truncates negative addresses. Change it to a 64-bit load so that the address is properly sign-extended. Fixes: 9d7de2aa8b41 ("Use relative percpu offsets") Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250227195302.1667654-1-brgerst@gmail.com
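The truncation problem is easy to demonstrate in isolation; the following stand-alone C program (with a made-up address value) shows how keeping only the low 32 bits of a negative kernel virtual address corrupts it:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          /* A percpu symbol address that is now a negative kernel VA
           * (value invented for illustration).
           */
          uint64_t addr = 0xffffffffbfff1234ULL;

          /* What a 32-bit load followed by zero-extension yields: */
          uint64_t truncated = (uint32_t)addr;

          /* What the fixed 64-bit load preserves: */
          uint64_t full = addr;

          printf("truncated: %#llx\nfull:      %#llx\n",
                 (unsigned long long)truncated, (unsigned long long)full);
          return 0;
  }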
2025-02-27Change inode_operations.mkdir to return struct dentry *NeilBrown
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
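As a rough sketch of the new contract (a hypothetical filesystem; the helper name is made up and the prototype abbreviated), an ->mkdir implementation now looks like:

  static struct dentry *examplefs_mkdir(struct mnt_idmap *idmap,
                                        struct inode *dir,
                                        struct dentry *dentry,
                                        umode_t mode)
  {
          int err = examplefs_do_mkdir(dir, dentry, mode); /* hypothetical */

          if (err)
                  return ERR_PTR(err);    /* an error occurred */

          return NULL;                    /* created; the passed-in dentry was used */
          /* or: return new_dentry;          a different dentry was spliced in */
  }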
2025-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27bpf: Use try_alloc_pages() to allocate pages for bpf needs.Alexei Starovoitov
Use try_alloc_pages() and free_pages_nolock() for BPF needs when context doesn't allow using normal alloc_pages. This is a prerequisite for further work. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250222024427.30294-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-27cgroup/rstat: Fix forceidle time in cpu.statAbel Wu
The commit b824766504e4 ("cgroup/rstat: add force idle show helper") retrieves forceidle_time outside cgroup_rstat_lock for non-root cgroups which can be potentially inconsistent with other stats. Rather than reverting that commit, fix it in a way that retains the effort of cleaning up the ifdef-messes. Fixes: b824766504e4 ("cgroup/rstat: add force idle show helper") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-27bpf: cpumap: switch to napi_skb_cache_get_bulk()Alexander Lobakin
Now that cpumap uses GRO, which drops unused skb heads to the NAPI cache, use napi_skb_cache_get_bulk() to try to reuse cached entries and lower MM layer pressure. Always disable the BH before checking and running the cpumap-pinned XDP prog and don't re-enable it in between that and allocating an skb bulk, as we can access the NAPI caches only from the BH context. The better GRO aggregates packets, the fewer new skbs will be allocated. If an aggregated skb contains 16 frags, this means 15 skbs were returned to the cache, so the next 15 skbs will be built without allocating anything. The same trafficgen UDP GRO test now shows:

                 GRO off   GRO on
  threaded GRO   2.3       4      Mpps
  thr bulk GRO   2.4       4.7    Mpps
  diff           +4        +17    %

Comparing to the baseline cpumap:

  baseline       2.7       N/A    Mpps
  thr bulk GRO   2.4       4.7    Mpps
  diff           -11       +74    %

Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27bpf: cpumap: reuse skb array instead of a linked list to chain skbsAlexander Lobakin
cpumap still uses linked lists to store a list of skbs to pass to the stack. Now that we don't use listified Rx in favor of napi_gro_receive(), the linked list is now unneeded overhead. Inside the polling loop, we already have an array of skbs. Let's reuse it for skbs passed to cpumap (generic XDP) and keep them there in case of XDP_PASS when a program is installed to the map itself. Don't list regular xdp_frames after converting them to skbs as well; store them in the mentioned array (but *before* generic skbs as the latter have lower priority) and call gro_receive_skb() for each array element after they're done. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27bpf: cpumap: switch to GRO from netif_receive_skb_list()Alexander Lobakin
cpumap has its own BH context based on a kthread. It has a sane batch size of 8 frames per cycle. GRO can be used here on its own. Adjust cpumap calls to the upper stack to use the GRO API instead of netif_receive_skb_list(), which processes skbs in batches but doesn't involve the GRO layer at all. In plenty of tests, GRO performs better than listified receiving even given that it has to calculate full frame checksums on the CPU. As GRO passes the skbs to the upper stack in batches of @gro_normal_batch, i.e. 8 by default, and skb->dev points to the device where the frame comes from, it is enough to disable the GRO netdev feature on it to completely restore the original behaviour: untouched frames will still be bulked and passed to the upper stack by 8, as it was with netif_receive_skb_list(). Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27Merge patch series "prep patches for my mkdir series"Christian Brauner
NeilBrown <neilb@suse.de> says: These two patches are cleanups and dependencies for my mkdir changes and subsequent directory locking changes. * patches from https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de: (2 commits) nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() Link: https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-26trace/osnoise: Add trace events for samplesTomas Glozar
Add trace events that fire at osnoise and timerlat sample generation, in addition to the already existing noise and threshold events. This allows processing the samples directly in the kernel, either with ftrace triggers or with BPF. Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com Signed-off-by: Tomas Glozar <tglozar@redhat.com> Tested-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27tracing: fprobe-events: Log error for exceeding the number of entry argsMasami Hiramatsu (Google)
Add an error message when the number of entry arguments exceeds the maximum size of entry data. This is currently checked when registering the fprobe, but in this case no error message is shown in the error_log file. Link: https://lore.kernel.org/all/174055074269.4079315.17809232650360988538.stgit@mhiramat.tok.corp.google.com/ Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and fprobe)") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27tracing: tprobe-events: Reject invalid tracepoint nameMasami Hiramatsu (Google)
Commit 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") allows the user to set a tprobe on a non-existent tracepoint, but it does not check that the tracepoint name is acceptable. So a tprobe can end up with invalid characters in its event name (e.g. with a subsystem prefix). In this case, the event is not shown in the events directory. Reject such invalid tracepoint names. The tracepoint name must consist only of alphabetic characters, digits, or '_'. Link: https://lore.kernel.org/all/174055073461.4079315.15875502830565214255.stgit@mhiramat.tok.corp.google.com/ Fixes: 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: stable@vger.kernel.org
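The accepted character set can be expressed as a small stand-alone check (illustrative only; the in-tree helper may differ):

  #include <ctype.h>
  #include <stdbool.h>

  static bool is_valid_tracepoint_name(const char *name)
  {
          if (!name || !*name)
                  return false;

          for (; *name; name++) {
                  /* Only alphanumeric characters and '_' are allowed. */
                  if (!isalnum((unsigned char)*name) && *name != '_')
                          return false;
          }
          return true;
  }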
2025-02-27tracing: tprobe-events: Fix a memory leak when tprobe with $retvalMasami Hiramatsu (Google)
Fix a memory leak when a tprobe is defined with $retval. This combination is not allowed, but parse_symbol_and_return() does not free *symbol, which should not be used if it returns an error. Thus, the *symbol memory is leaked in that error path. Link: https://lore.kernel.org/all/174055072650.4079315.3063014346697447838.stgit@mhiramat.tok.corp.google.com/ Fixes: ce51e6153f77 ("tracing: fprobe-event: Fix to check tracepoint event and return") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: stable@vger.kernel.org
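The general shape of this kind of fix is to release the allocation on the rejected path before returning, roughly like the following simplified, hypothetical sketch (not the actual function body):

  #include <linux/slab.h>
  #include <linux/errno.h>

  static int parse_symbol_and_return_sketch(const char *arg, char **symbol,
                                            bool uses_retval)
  {
          *symbol = kstrdup(arg, GFP_KERNEL);
          if (!*symbol)
                  return -ENOMEM;

          if (uses_retval) {
                  /*
                   * $retval is not allowed for a tprobe: free the string
                   * allocated above so the error path does not leak it.
                   */
                  kfree(*symbol);
                  *symbol = NULL;
                  return -EINVAL;
          }
          return 0;
  }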
2025-02-26Merge tag 'wq-for-6.14-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue update from Tejun Heo: "This contains a patch to improve debug visibility. While it isn't a fix, the change carries virtually no risk and makes it substantially easier to chase down a class of problems" * tag 'wq-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Log additional details when rejecting work
2025-02-26Merge tag 'sched_ext-for-6.14-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "pick_task_scx() has a workaround to avoid stalling when the fair class's balance() says yes but pick_task() says no. The workaround was incorrectly deciding to keep the prev task running if the task is on SCX even when the task is in a sleeping state, which can lead to several confusing failure modes. Fix it by testing whether the prev task is currently queued on SCX instead" * tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
2025-02-26perf: Remove unnecessary parameter of security checkLuo Gengkun
It seems that the attr parameter was never used in security checks since it was first introduced by: commit da97e18458fb ("perf_event: Add support for LSM and SELinux checks") so remove it. Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-02-26bpf: Fix deadlock between rcu_tasks_trace and event_mutex.Alexei Starovoitov
Fix the following deadlock:

  CPU A
    _free_event()
    perf_kprobe_destroy()
    mutex_lock(&event_mutex)
    perf_trace_event_unreg()
    synchronize_rcu_tasks_trace()

There are several paths where _free_event() grabs event_mutex and calls sync_rcu_tasks_trace. Above is one such case.

  CPU B
    bpf_prog_test_run_syscall()
    rcu_read_lock_trace()
    bpf_prog_run_pin_on_cpu()
    bpf_prog_load()
    bpf_tracing_func_proto()
    trace_set_clr_event()
    mutex_lock(&event_mutex)

Delegate trace_set_clr_event() to a workqueue to avoid such lock dependency. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250224221637.4780-1-alexei.starovoitov@gmail.com
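One way to picture the delegation (names are hypothetical; only the pattern matters): the mutex-taking call is queued to a workqueue, so it runs on a kworker that is never inside an rcu_read_lock_trace() section.

  #include <linux/workqueue.h>
  #include <linux/trace_events.h>

  static void set_clr_event_work_fn(struct work_struct *work)
  {
          /*
           * Runs on a kworker in plain process context, outside any
           * rcu_read_lock_trace() section, so taking event_mutex here
           * cannot close the loop with synchronize_rcu_tasks_trace().
           */
          trace_set_clr_event("syscalls", NULL, 1);
  }

  static DECLARE_WORK(set_clr_event_work, set_clr_event_work_fn);

  static void request_set_clr_event(void)
  {
          schedule_work(&set_clr_event_work);
  }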
2025-02-26posix-clock: Remove duplicate compat ioctl() handlerThomas Weißschuh
The normal and compat ioctl handlers are identical, which is fine as compat ioctls are detected and handled dynamically inside the underlying clock implementation. The duplicate definition however is unnecessary. Just reuse the regular ioctl handler also for compat ioctls. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lore.kernel.org/all/20250225-posix-clock-compat-cleanup-v2-1-30de86457a2b@weissschuh.net
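The resulting shape of the file_operations table is then roughly the following sketch (abbreviated, not a verbatim copy of kernel/time/posix-clock.c):

  static const struct file_operations posix_clock_file_operations_sketch = {
          .owner          = THIS_MODULE,
          .unlocked_ioctl = posix_clock_ioctl,
          /*
           * Point compat callers at the very same handler: the underlying
           * clock implementation already copes with compat ioctls.
           */
          .compat_ioctl   = posix_clock_ioctl,
          .open           = posix_clock_open,
          .release        = posix_clock_release,
          .read           = posix_clock_read,
          .poll           = posix_clock_poll,
  };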
2025-02-26genirq: Remove IRQ_EDGE_EOI_HANDLERMichael Ellerman
The powerpc Cell blade support, now removed, was the only user of IRQ_EDGE_EOI_HANDLER, so remove it. Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20241218105523.416573-21-mpe@ellerman.id.au
2025-02-26static_call_inline: Provide trampoline address when updating sitesChristophe Leroy
In preparation of support of inline static calls on powerpc, provide trampoline address when updating sites, so that when the destination function is too far for a direct function call, the call site is patched with a call to the trampoline. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/5efe0cffc38d6f69b1ec13988a99f1acff551abf.1733245362.git.christophe.leroy@csgroup.eu
2025-02-26rseq: Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=yMichael Jeanson
With CONFIG_DEBUG_RSEQ=y, an in-kernel copy of the read-only fields is kept synchronized with the user-space fields. Ensure the updates are done in lockstep in case we error out on a write to user-space. Fixes: 7d5265ffcd8b ("rseq: Validate read-only fields under DEBUG_RSEQ config") Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20250225202500.731245-1-mjeanson@efficios.com
2025-02-26futex: Use a hashmask instead of hashsizeSebastian Andrzej Siewior
The global hash uses futex_hashsize to save the number of hash buckets that were allocated during system boot. On each futex_hash() invocation, one is subtracted from this number to get the mask. This can be optimized by saving the mask directly, avoiding the subtraction on each futex_hash() invocation. Rename futex_hashsize to futex_hashmask and save the mask of the allocated hash map. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/all/20250226091057.bX8vObR4@linutronix.de
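The difference in the hot path can be illustrated with a trivial stand-alone program (bucket count invented for the example):

  #include <stdio.h>

  static unsigned long hashsize = 256;           /* buckets, power of two */
  static unsigned long hashmask = 256 - 1;       /* computed once at boot */

  static unsigned long bucket_via_size(unsigned long hash)
  {
          return hash & (hashsize - 1);          /* subtraction on every lookup */
  }

  static unsigned long bucket_via_mask(unsigned long hash)
  {
          return hash & hashmask;                /* mask is ready to use */
  }

  int main(void)
  {
          unsigned long h = 0x1234fUL;

          printf("%lu %lu\n", bucket_via_size(h), bucket_via_mask(h));
          return 0;
  }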
2025-02-26x86/cfi: Add 'cfi=warn' boot optionPeter Zijlstra
Rebuilding with CONFIG_CFI_PERMISSIVE=y enabled is such a pain, esp. since clang is so slow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250224124159.924496481@infradead.org
2025-02-25selftests/bpf: Test gen_pro/epilogue that generate kfuncsAmery Hung
Test gen_prologue and gen_epilogue that generate kfuncs that have not been seen in the main program. The main bpf program and return value checks are identical to pro_epilogue.c introduced in commit 47e69431b57a ("selftests/bpf: Test gen_prologue and gen_epilogue"). However, now when bpf_testmod_st_ops detects a program name with the prefix "test_kfunc_", it generates slightly different prologue and epilogue: they still add 1000 to args->a in the prologue, add 10000 to args->a and set r0 to 2 * args->a in the epilogue, but involve kfuncs. At a high level, the alternative version of the prologue and epilogue looks like this:

  cgrp = bpf_cgroup_from_id(0);
  if (cgrp)
          bpf_cgroup_release(cgrp);
  else
          /* Perform what original bpf_testmod_st_ops prologue or
           * epilogue does
           */

Since 0 is never a valid cgroup id, the original prologue or epilogue logic will be performed. As a result, the __retval check should expect the exact same return value. Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25bpf: Search and add kfuncs in struct_ops prologue and epilogueAmery Hung
Currently, add_kfunc_call() is only invoked once before the main verification loop. Therefore, the verifier could not find the bpf_kfunc_btf_tab of a new kfunc call which is not seen in user-defined struct_ops operators but is introduced in gen_prologue or gen_epilogue during do_misc_fixup(). Fix this by searching for kfuncs in the patching instruction buffer and adding them to prog->aux->kfunc_tab. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25bpf: abort verification if env->cur_state->loop_entry != NULLEduard Zingerman
In addition to warning, abort verification with -EFAULT. If env->cur_state->loop_entry != NULL, something is irrecoverably buggy. Fixes: bbbc02b7445e ("bpf: copy_verifier_state() should copy 'loop_entry' field") Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250225003838.135319-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
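The shape of such a guard, sketched as a stand-alone helper (hypothetical name, simplified message, not the actual patch):

  static int check_loop_entry_consistency(struct bpf_verifier_env *env)
  {
          /*
           * At this point cur_state->loop_entry must already have been
           * consumed; if it is still set, internal verifier state is
           * corrupted, so warn and fail hard instead of continuing.
           */
          if (env->cur_state->loop_entry) {
                  verbose(env, "verifier bug: loop_entry still set\n");
                  return -EFAULT;
          }
          return 0;
  }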
2025-02-25uprobes: Remove too strict lockdep_assert() condition in hprobe_expire()Andrii Nakryiko
hprobe_expire() is used to atomically switch a pending uretprobe instance (struct return_instance) from being SRCU protected to being refcounted. This can be done from the background timer thread, or synchronously within the current thread when the task is forked. In the former case, return_instance has to be protected through an RCU read lock, and that's what hprobe_expire() used to check with lockdep_assert(rcu_read_lock_held()). But in the latter case (hprobe_expire() called from dup_utask()) there is no RCU lock being held, and it's both unnecessary and inconvenient. Inconvenient due to the intervening memory allocations inside dup_return_instance()'s loop. Unnecessary because dup_utask() is called synchronously in the current thread, and no uretprobe can run at that point, so return_instance can't be freed either. So drop the rcu_read_lock_held() condition, and expand the corresponding comment to explain the necessary lifetime guarantees. The lockdep_assert()-detected issue is a false positive. Fixes: dd1a7567784e ("uprobes: SRCU-protect uretprobe lifetime (with timeout)") Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250225223214.2970740-1-andrii@kernel.org
2025-02-25tracing: Add traceoff_after_boot optionSteven Rostedt
Sometimes tracing is used to debug issues during the boot process. Since the trace buffer has a limited amount of storage, it may be prudent to disable tracing after the boot is finished, otherwise the critical information may be overwritten. With this option, the main tracing buffer will be turned off at the end of the boot process. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Borislav Petkov <bp@alien8.de> Link: https://lore.kernel.org/20250208103017.48a7ec83@batman.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behaviorAndrea Righi
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node() should always return a CPU from the specified node, regardless of its idle state. Also clarify this logic in the function documentation. Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called ↵Tejun Heo
without balance() a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") added a workaround to handle the cases where pick_task_scx() is called without a preceding balance_scx(), which is due to a fair class bug where pick_task_fair() may return NULL after a true return from balance_fair(). The workaround detects when pick_task_scx() is called without preceding balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid stalling. Unfortunately, the workaround code was testing whether @prev was on SCX to decide whether to keep the task running. This is incorrect as the task may be on SCX but no longer runnable. This could lead to a non-runnable task being returned from pick_task_scx(), which causes interesting confusions and failures. e.g. A common failure mode is the task ending up with (!on_rq && on_cpu) state which can cause potential wakers to busy loop, which can easily lead to deadlocks. Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes @prev_on_scx only used in one place. Open code the usage and improve the comment while at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Pat Cody <patcody@meta.com> Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-25ftrace: Check against is_kernel_text() instead of kaslr_offset()Steven Rostedt
As kaslr_offset() is architecture dependent and also may not be defined by all architectures, when zeroing out unused weak functions, do not check against kaslr_offset(), but instead check if the address is within the kernel text sections. If KASLR added a shift to the zeroed out function, it would still not be located in the kernel text. This is a more robust way to test if the text is valid or not. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: "Arnd Bergmann" <arnd@arndb.de> Link: https://lore.kernel.org/20250225182054.471759017@goodmis.org Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table") Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: Mark Brown <broonie@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20250224180805.GA1536711@ax162/ Closes: https://lore.kernel.org/all/5225b07b-a9b2-4558-9d5f-aa60b19f6317@sirena.org.uk/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
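Conceptually the validity test turns into something like the sketch below (helper name hypothetical):

  static bool mcount_addr_is_usable(unsigned long addr)
  {
          /*
           * A zeroed-out weak-function entry may still carry the KASLR
           * shift, but it will never point into the kernel text sections,
           * so checking the text range works on every architecture,
           * with or without kaslr_offset().
           */
          return is_kernel_text(addr);
  }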
2025-02-25ftrace: Test mcount_loc addr before calling ftrace_call_addr()Steven Rostedt
The addresses in the mcount_loc can be zeroed and then moved by KASLR making them invalid addresses. ftrace_call_addr() for ARM 64 expects a valid address to kernel text. If the addr read from the mcount_loc section is invalid, it must not call ftrace_call_addr(). Move the addr check before calling ftrace_call_addr() in ftrace_process_locs(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/20250225182054.290128736@goodmis.org Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table") Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: "Arnd Bergmann" <arnd@arndb.de> Tested-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20250225025631.GA271248@ax162/ Closes: https://lore.kernel.org/all/91523154-072b-437b-bbdc-0b70e9783fd0@app.fastmail.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25perf/core: Fix low freq setting via IOC_PERIODKan Liang
A low attr::freq value cannot be set via IOC_PERIOD on some platforms. The perf_event_check_period() introduced in: 81ec3f3c4c4d ("perf/x86: Add check_period PMU callback") was intended to check the period, rather than the frequency. A low frequency may be mistakenly rejected by limit_period(). Fix it. Fixes: 81ec3f3c4c4d ("perf/x86: Add check_period PMU callback") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250117151913.3043942-2-kan.liang@linux.intel.com Closes: https://lore.kernel.org/lkml/20250115154949.3147-1-ravi.bangoria@amd.com/
2025-02-24padata: switch padata_find_next() to using cpumask_next_wrap()Yury Norov
Calling cpumask_next_wrap_old() with starting CPU == -1 effectively means the request to find next CPU, wrapping around if needed. cpumask_next_wrap() is the proper replacement for that. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2025-02-24cpumask: deprecate cpumask_next_wrap()Yury Norov
The next patch aligns the implementation of cpumask_next_wrap() with find_next_bit_wrap(), and it changes the function signature. To make the transition smooth, this patch deprecates the current implementation by adding an _old suffix. The following patches switch current users to the new implementation one by one. No functional changes were intended. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2025-02-24bpf: Fix kmemleak warning for percpu hashmapYonghong Song
Vlad Poenaru reported the following kmemleak issue:

  unreferenced object 0x606fd7c44ac8 (size 32):
    backtrace (crc 0):
      pcpu_alloc_noprof+0x730/0xeb0
      bpf_map_alloc_percpu+0x69/0xc0
      prealloc_init+0x9d/0x1b0
      htab_map_alloc+0x363/0x510
      map_create+0x215/0x3a0
      __sys_bpf+0x16b/0x3e0
      __x64_sys_bpf+0x18/0x20
      do_syscall_64+0x7b/0x150
      entry_SYSCALL_64_after_hwframe+0x4b/0x53

Further investigation shows the reason is a store of the percpu pointer in htab_elem_set_ptr() that is not 8-byte aligned:

  *(void __percpu **)(l->key + key_size) = pptr;

Note that the whole htab_elem alignment is 8 (for x86_64). If the key_size is 4, that means pptr is stored at a location which is 4-byte aligned but not 8-byte aligned. In mm/kmemleak.c, scan_block() scans the memory with an 8-byte stride, so it won't detect the above pptr, hence the reported memory leak. In htab_map_alloc(), we already have:

  htab->elem_size = sizeof(struct htab_elem) + round_up(htab->map.key_size, 8);
  if (percpu)
          htab->elem_size += sizeof(void *);
  else
          htab->elem_size += round_up(htab->map.value_size, 8);

So storing pptr with 8-byte alignment won't cause any problem and can fix kmemleak too. The issue can be reproduced with a bpf selftest as well:
  1. Enable the CONFIG_DEBUG_KMEMLEAK config.
  2. Add a getchar() before skel destroy in test_hash_map() in prog_tests/for_each.c. The purpose is to keep the map available so the kmemleak can be detected.
  3. Run './test_progs -t for_each/hash_map &' and a kmemleak should be reported.
Reported-by: Vlad Poenaru <thevlad@meta.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250224175514.2207227-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
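The effect of the alignment on kmemleak's scanner can be illustrated with a tiny stand-alone program:

  #include <stdio.h>

  #define round_up(x, a)  (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

  int main(void)
  {
          unsigned long key_size = 4;

          /* Old layout: pptr stored right after the key. */
          unsigned long old_off = key_size;
          /* Fixed layout: pptr stored at the next 8-byte boundary. */
          unsigned long new_off = round_up(key_size, 8);

          /*
           * kmemleak's scan_block() walks memory with an 8-byte stride, so
           * only pointers stored at offsets that are multiples of 8
           * (relative to an 8-byte aligned base) can be found.
           */
          printf("old offset %lu -> %s by an 8-byte scan\n",
                 old_off, old_off % 8 ? "missed" : "found");
          printf("new offset %lu -> %s by an 8-byte scan\n",
                 new_off, new_off % 8 ? "missed" : "found");
          return 0;
  }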
2025-02-24uprobes: Reject the shared zeropage in uprobe_write_opcode()Tong Tiangen
We triggered the following crash in syzkaller tests: BUG: Bad page state in process syz.7.38 pfn:1eff3 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1eff3 flags: 0x3fffff00004004(referenced|reserved|node=0|zone=1|lastcpupid=0x1fffff) raw: 003fffff00004004 ffffe6c6c07bfcc8 ffffe6c6c07bfcc8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000fffffffe 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x32/0x50 bad_page+0x69/0xf0 free_unref_page_prepare+0x401/0x500 free_unref_page+0x6d/0x1b0 uprobe_write_opcode+0x460/0x8e0 install_breakpoint.part.0+0x51/0x80 register_for_each_vma+0x1d9/0x2b0 __uprobe_register+0x245/0x300 bpf_uprobe_multi_link_attach+0x29b/0x4f0 link_create+0x1e2/0x280 __sys_bpf+0x75f/0xac0 __x64_sys_bpf+0x1a/0x30 do_syscall_64+0x56/0x100 entry_SYSCALL_64_after_hwframe+0x78/0xe2 BUG: Bad rss-counter state mm:00000000452453e0 type:MM_FILEPAGES val:-1 The following syzkaller test case can be used to reproduce: r2 = creat(&(0x7f0000000000)='./file0\x00', 0x8) write$nbd(r2, &(0x7f0000000580)=ANY=[], 0x10) r4 = openat(0xffffffffffffff9c, &(0x7f0000000040)='./file0\x00', 0x42, 0x0) mmap$IORING_OFF_SQ_RING(&(0x7f0000ffd000/0x3000)=nil, 0x3000, 0x0, 0x12, r4, 0x0) r5 = userfaultfd(0x80801) ioctl$UFFDIO_API(r5, 0xc018aa3f, &(0x7f0000000040)={0xaa, 0x20}) r6 = userfaultfd(0x80801) ioctl$UFFDIO_API(r6, 0xc018aa3f, &(0x7f0000000140)) ioctl$UFFDIO_REGISTER(r6, 0xc020aa00, &(0x7f0000000100)={{&(0x7f0000ffc000/0x4000)=nil, 0x4000}, 0x2}) ioctl$UFFDIO_ZEROPAGE(r5, 0xc020aa04, &(0x7f0000000000)={{&(0x7f0000ffd000/0x1000)=nil, 0x1000}}) r7 = bpf$PROG_LOAD(0x5, &(0x7f0000000140)={0x2, 0x3, &(0x7f0000000200)=ANY=[@ANYBLOB="1800000000120000000000000000000095"], &(0x7f0000000000)='GPL\x00', 0x7, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, @fallback=0x30, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x10, 0x0, @void, @value}, 0x94) bpf$BPF_LINK_CREATE_XDP(0x1c, &(0x7f0000000040)={r7, 0x0, 0x30, 0x1e, @val=@uprobe_multi={&(0x7f0000000080)='./file0\x00', &(0x7f0000000100)=[0x2], 0x0, 0x0, 0x1}}, 0x40) The cause is that zero pfn is set to the PTE without increasing the RSS count in mfill_atomic_pte_zeropage() and the refcount of zero folio does not increase accordingly. Then, the operation on the same pfn is performed in uprobe_write_opcode()->__replace_page() to unconditional decrease the RSS count and old_folio's refcount. Therefore, two bugs are introduced: 1. The RSS count is incorrect, when process exit, the check_mm() report error "Bad rss-count". 2. The reserved folio (zero folio) is freed when folio->refcount is zero, then free_pages_prepare->free_page_is_bad() report error "Bad page state". There is more, the following warning could also theoretically be triggered: __replace_page() -> ... -> folio_remove_rmap_pte() -> VM_WARN_ON_FOLIO(is_zero_folio(folio), folio) Considering that uprobe hit on the zero folio is a very rare case, just reject zero old folio immediately after get_user_page_vma_remote(). 
[ mingo: Cleaned up the changelog ] Fixes: 7396fa818d62 ("uprobes/core: Make background page replacement logic account for rss_stat counters") Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints") Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250224031149.1598949-1-tongtiangen@huawei.com
2025-02-24seccomp: avoid the lock trip seccomp_filter_release in common caseMateusz Guzik
The vast majority of threads don't have any seccomp filters, while the lock taken here is shared between all threads in a given process and is frequently used. Safety of the check relies on the following:
  - seccomp_filter_release is only legally called for PF_EXITING threads
  - SIGNAL_GROUP_EXIT is only ever set with the sighand lock held
  - PF_EXITING is only ever set with the sighand lock held *or* after SIGNAL_GROUP_EXIT is set *or* the process is single-threaded
  - seccomp_sync_threads holds the sighand lock and skips all threads if SIGNAL_GROUP_EXIT is set, PF_EXITING threads if not
The resulting reduction of contention gives me a 5% boost in a microbenchmark spawning and killing threads within the same process. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250213170911.1140187-1-mjguzik@gmail.com Signed-off-by: Kees Cook <kees@kernel.org>
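In sketch form, the common case becomes (simplified, not the verbatim patch):

  void seccomp_filter_release_sketch(struct task_struct *tsk)
  {
          /*
           * The common case: no filter was ever attached to this thread,
           * so there is nothing to release and no reason to touch the
           * sighand lock shared by the whole process.
           */
          if (!tsk->seccomp.filter)
                  return;

          /* Slow path: take the lock and detach the filter chain. */
  }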
2025-02-24perf/core: Order the PMU list to fix warning about unordered pmu_ctx_listLuo Gengkun
Syzkaller triggers a warning due to prev_epc->pmu != next_epc->pmu in perf_event_swap_task_ctx_data(). vmcore shows that two lists have the same perf_event_pmu_context, but not in the same order. The problem is that the order of pmu_ctx_list for the parent is impacted by the time when an event/PMU is added, while the order for a child is impacted by the event order in the pinned_groups and flexible_groups. So the order of pmu_ctx_list in the parent and child may be different. To fix this problem, insert the perf_event_pmu_context into its proper place after iteration of the pmu_ctx_list. The following testcase can trigger the above warning:

  # perf record -e cycles --call-graph lbr -- taskset -c 3 ./a.out &
  # perf stat -e cpu-clock,cs -p xxx   // xxx is the pid of a.out

  test.c

  void main()
  {
          int count = 0;
          pid_t pid;

          printf("%d running\n", getpid());
          sleep(30);
          printf("running\n");

          pid = fork();
          if (pid == -1) {
                  printf("fork error\n");
                  return;
          }
          if (pid == 0) {
                  while (1) {
                          count++;
                  }
          } else {
                  while (1) {
                          count++;
                  }
          }
  }

The testcase first opens an LBR event, so it will allocate task_ctx_data, and then opens tracepoint and software events, so the parent context will have 3 different perf_event_pmu_contexts. On inheritance, the child ctx will insert the perf_event_pmu_context in another order and the warning will trigger. [ mingo: Tidied up the changelog. ] Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250122073356.1824736-1-luogengkun@huaweicloud.com
2025-02-24perf/core: Add RCU read lock protection to perf_iterate_ctx()Breno Leitao
The perf_iterate_ctx() function performs RCU list traversal but currently lacks RCU read lock protection. This causes lockdep warnings when running perf probe with unshare(1) under CONFIG_PROVE_RCU_LIST=y:

  WARNING: suspicious RCU usage
  kernel/events/core.c:8168 RCU-list traversed in non-reader section!!
  Call Trace:
    lockdep_rcu_suspicious
    ? perf_event_addr_filters_apply
    perf_iterate_ctx
    perf_event_exec
    begin_new_exec
    ? load_elf_phdrs
    load_elf_binary
    ? lock_acquire
    ? find_held_lock
    ? bprm_execve
    bprm_execve
    do_execveat_common.isra.0
    __x64_sys_execve
    do_syscall_64
    entry_SYSCALL_64_after_hwframe

This protection was previously present but was removed in commit bd2756811766 ("perf: Rewrite core context handling"). Add back the necessary rcu_read_lock()/rcu_read_unlock() pair around the perf_iterate_ctx() call in perf_event_exec(). [ mingo: Use scoped_guard() as suggested by Peter ] Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250117-fix_perf_rcu-v1-1-13cb9210fc6a@debian.org
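The essence of the fix, shown with an explicit lock/unlock pair rather than the scoped guard used in the final patch (sketch only):

  static void perf_event_exec_sketch(struct perf_event_context *ctx)
  {
          /*
           * perf_iterate_ctx() walks an RCU-protected list, so the
           * traversal must sit inside an RCU read-side critical section.
           */
          rcu_read_lock();
          perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
          rcu_read_unlock();
  }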
2025-02-24sched_ext: idle: Introduce scx_bpf_nr_node_ids()Andrea Righi
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-23bpf: Refactor check_ctx_access()Amery Hung
Reduce the variable passing madness surrounding check_ctx_access(). Currently, check_mem_access() passes many pointers to local variables to check_ctx_access(). They are used to initialize "struct bpf_insn_access_aux info" in check_ctx_access() and then passed to is_valid_access(). Then, check_ctx_access() takes the data out of info and writes it back through the pointers to pass it back. This can be simplified by moving info up to check_mem_access(). No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20250221175644.1822383-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-22Merge tag 'sched-urgent-2025-02-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq fixes from Ingo Molnar: - Fix overly spread-out RSEQ concurrency ID allocation pattern that regressed certain workloads - Fix RSEQ registration syscall behavior on -EFAULT errors when CONFIG_DEBUG_RSEQ=y (This debug option is disabled on most distributions) * tag 'sched-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: Fix rseq registration with CONFIG_DEBUG_RSEQ sched: Compact RSEQ concurrency IDs with reduced threads and affinity
2025-02-22Merge tag 'perf-urgent-2025-02-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fixes from Ingo Molnar: "Fix x86 Intel Lion Cove CPU event constraints, and fix uprobes debug/error printk output pointer-value verbosity" * tag 'perf-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix event constraints for LNC uprobes: Don't use %pK through printk
2025-02-22Merge tag 'ftrace-v6.14-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt:

 "Function graph accounting fixes:

  - Fix the manager ops hashes

    The function graph registers a "manager ops" and "sub-ops" to ftrace. The manager ops does not have any callback but calls the sub-ops callbacks. The manager ops hash (what is used to tell ftrace what functions to attach to) is built from the sub-ops it manages. There was an error in the way it built the hash. An empty hash means to attach to all functions. When the manager ops had one sub-ops it properly copied its hash. But when the manager ops had more than one sub-ops, it went into a loop to make a set of all functions it needed to add to the hash. If any of the subops hashes was empty, that would mean to attach to all functions. The error was that the first iteration of the loop passed in an empty hash to start with in order to add the other hashes. That starting hash was mistaken as meaning to attach to all functions. This made the manager ops attach to all functions whenever it had two or more sub-ops, even if each sub-op was attached to only a single function.

  - Do not add duplicate entries to the manager ops hash

    If two or more subops hashes trace the same function, an entry for that function will be added to the manager ops for each subops. This causes waste and extra overhead.

 Fprobe accounting fixes:

  - Remove last function from fprobe hash

    Fprobes has an ftrace hash to manage which functions an fprobe is attached to. It also has a counter of how many fprobes are attached. When the last fprobe is removed, it unregisters the fprobe from ftrace but does not remove the functions the last fprobe was attached to from the hash. This leaves the old functions attached. When a new fprobe is added, the fprobe infrastructure attaches not only to the functions of the new fprobe, but also to the functions of the last fprobe.

  - Fix accounting of the fprobe counter

    When an fprobe is added, it updates a counter. If the counter goes from zero to one, it attaches its ops to ftrace. When an fprobe is removed, the counter is decremented. If the counter goes from 1 to zero, it removes the fprobe's ops from ftrace. There was an issue where if two fprobes trace the same function, the addition of each fprobe would increment the counter. But when removing the first of the fprobes, it would notice that another fprobe is still attached to one of its functions, so it does not remove the functions from the ftrace ops. But it also did not decrement the counter, so when the last fprobe is removed, the counter is still one. This leaves the fprobe's callback still registered with ftrace, and it is still called by the functions defined by the fprobe's ops hash. Worse yet, because all the functions from the fprobe ops hash have been removed, that tells ftrace that it wants to trace all functions. Thus, this puts the system in a state where every function is calling the fprobe callback handler (which does nothing as there are no registered fprobes), which causes a good 13% slowdown of the entire system.

 Other updates:

  - Add a selftest to test the above issues to prevent regressions.

  - Fix preempt count accounting in function tracing

    Better recursion protection was added to function tracing which added another layer of preempt disable. As the preempt_count gets traced in the event, it needs to subtract the amount of preempt disabling the tracer does to record what the preempt_count was when the trace was triggered.
- Fix memory leak in output of set_event A variable is passed by the seq_file functions in the location that is set by the return of the next() function. The start() function allocates it and the stop() function frees it. But when the last item is found, the next() returns NULL which leaks the data that was allocated in start(). The m->private is used for something else, so have next() free the data when it returns NULL, as stop() will then just receive NULL in that case" * tag 'ftrace-v6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix memory leak when reading set_event file ftrace: Correct preemption accounting for function tracing. selftests/ftrace: Update fprobe test to check enabled_functions file fprobe: Fix accounting of when to unregister from function graph fprobe: Always unregister fgraph function from ops ftrace: Do not add duplicate entries in subops manager ops ftrace: Fix accounting of adding subops to a manager ops
2025-02-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-02-20 We've added 19 non-merge commits during the last 8 day(s) which contain a total of 35 files changed, 1126 insertions(+), 53 deletions(-). The main changes are: 1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing 2) Add network TX timestamping support to BPF sock_ops, from Jason Xing 3) Add TX metadata Launch Time support, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: igc: Add launch time support to XDP ZC igc: Refactor empty frame insertion for launch time support net: stmmac: Add launch time support to XDP ZC selftests/bpf: Add launch time request to xdp_hw_metadata xsk: Add launch time hardware offload support to XDP Tx metadata selftests/bpf: Add simple bpf tests in the tx path for timestamping feature bpf: Support selective sampling for bpf timestamping bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING bpf: Disable unsafe helpers in TX timestamping callbacks bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping bpf: Add networking timestamping support to bpf_get/setsockopt() selftests/bpf: Add rto max for bpf_setsockopt test bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt ==================== Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21Merge tag 'drm-fixes-2025-02-22' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Dave Airlie: "Weekly drm fixes pull request, lots of small things all over, msm has a bunch of things but all very small, xe, i915, a fix for the cgroup dmem controller. core: - remove MAINTAINERS entry cgroup/dmem: - use correct function for pool descendants panel: - fix signal polarity issue jd9365da-h3 nouveau: - folio handling fix - config fix amdxdna: - fix missing header xe: - Fix error handling in xe_irq_install - Fix devcoredump format i915: - Use spin_lock_irqsave() in interruptible context on guc submission - Fixes on DDI and TRANS programming - Make sure all planes in use by the joiner have their crtc included - Fix 128b/132b modeset issues msm: - More catalog fixes: - to skip watchdog programming through top block if its not present - fix the setting of WB mask to ensure the WB input control is programmed correctly through ping-pong - drop lm_pair for sm6150 as that chipset does not have any 3dmerge block - Fix the mode validation logic for DP/eDP to account for widebus (2ppc) to allow high clock resolutions - Fix to disable dither during encoder disable as otherwise this was causing kms_writeback failure due to resource sharing between WB and DSI paths as DSI uses dither but WB does not - Fixes for virtual planes, namely to drop extraneous return and fix uninitialized variables - Fix to avoid spill-over of DSC encoder block bits when programming the bits-per-component - Fixes in the DSI PHY to protect against concurrent access of PHY_CMN_CLK_CFG regs between clock and display drivers - Core/GPU: - Fix non-blocking fence wait incorrectly rounding up to 1 jiffy timeout - Only print GMU fw version once, instead of each time the GPU resumes" * tag 'drm-fixes-2025-02-22' of https://gitlab.freedesktop.org/drm/kernel: (28 commits) drm/i915/dp: Fix disabling the transcoder function in 128b/132b mode drm/i915/dp: Fix error handling during 128b/132b link training accel/amdxdna: Add missing include linux/slab.h MAINTAINERS: Remove myself drm/nouveau/pmu: Fix gp10b firmware guard cgroup/dmem: Don't open-code css_for_each_descendant_pre drm/xe/guc: Fix size_t print format drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump drm/i915: Make sure all planes in use by the joiner have their crtc included drm/i915/ddi: Fix HDMI port width programming in DDI_BUF_CTL drm/i915/dsi: Use TRANS_DDI_FUNC_CTL's own port width macro drm/xe: Fix error handling in xe_irq_install() drm/i915/gt: Use spin_lock_irqsave() in interruptible context drm/msm/dsi/phy: Do not overwite PHY_CMN_CLK_CFG1 when choosing bitclk source drm/msm/dsi/phy: Protect PHY_CMN_CLK_CFG1 against clock driver drm/msm/dsi/phy: Protect PHY_CMN_CLK_CFG0 updated from driver side drm/msm/dpu: Drop extraneous return in dpu_crtc_reassign_planes() drm/msm/dpu: Don't leak bits_per_component into random DSC_ENC fields drm/msm/dpu: Disable dither in phys encoder cleanup drm/msm/dpu: Fix uninitialized variable ...
2025-02-21sched: Add unlikey branch hints to several system callsColin Ian King
Adding an unlikely() hint on early error return paths improves the run-time performance of several sched related system calls. Benchmarking on an i9-12900 shows the following per system call performance improvements:

                       before     after      improvement
  sched_getattr        182.4ns    170.6ns    ~6.5%
  sched_setattr        284.3ns    267.6ns    ~5.9%
  sched_getparam       161.6ns    148.1ns    ~8.4%
  sched_setparam       1265.4ns   1227.6ns   ~3.0%
  sched_getscheduler   129.4ns    118.2ns    ~8.7%
  sched_setscheduler   1237.3ns   1216.7ns   ~1.7%

Results are based on running 20 tests with turbo disabled (to reduce clock freq turbo changes), with 10 second run per test based on the number of system calls per second. The % standard deviation of the measurements for the 20 tests was 0.05% to 0.40%, so the results are reliable. Tested on kernel build with gcc 14.2.1 Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250219142423.45516-1-colin.i.king@gmail.com
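The pattern being applied is the classic cold-error-path annotation; for an early parameter check of the kind these syscalls perform, the change is simply:

  /* before */
  if (!param || pid < 0)
          return -EINVAL;

  /*
   * after: mark the error return as the cold path, so the compiler keeps
   * the common case as straight-line code and moves the error return out
   * of the hot path
   */
  if (unlikely(!param || pid < 0))
          return -EINVAL;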
2025-02-21sched/core: Remove duplicate included header file stats.hThorsten Blum
The header file stats.h is included twice. Remove the redundant include and the following make includecheck warning: stats.h is included more than once Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250219111756.3070-2-thorsten.blum@linux.dev