path: root/kernel
Age | Commit message | Author
2025-03-20pidfs: improve multi-threaded exec and premature thread-group leader exit pollingChristian Brauner
This is another attempt trying to make pidfd polling for multi-threaded exec and premature thread-group leader exit consistent. A quick recap of these two cases: (1) During a multi-threaded exec by a subthread, i.e., non-thread-group leader thread, all other threads in the thread-group including the thread-group leader are killed and the struct pid of the thread-group leader will be taken over by the subthread that called exec. IOW, two tasks change their TIDs. (2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited. Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded, an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded, no exit notification is generated for the old thread-group leader. The correct behavior would be to simply not generate an exit notification on the struct pid of a subthread exec because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit prematurely as mentioned in (2). In that case an exit notification is reliably generated but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. So far there was no way to distinguish between (1) and (2) internally. This tiny series tries to address this problem by discarding PIDFD_THREAD notification on premature thread-group leader exit. If that works correctly then no exit notifications are generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec afterwards no exit notification will be generated until that task exits or it creates subthreads and repeats the cycle. Co-Developed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
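For reference, a minimal userspace sketch of the polling pattern discussed above (assuming a kernel and uapi headers that provide PIDFD_THREAD and pidfd_open(2); error handling is mostly omitted):

    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/pidfd.h>   /* PIDFD_THREAD */

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        pid_t tid = (pid_t)atoi(argv[1]);
        /* Open a thread-scoped pidfd; there is no glibc wrapper for pidfd_open(). */
        int pidfd = (int)syscall(SYS_pidfd_open, tid, PIDFD_THREAD);
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

        if (pidfd < 0)
            return 1;
        /* Blocks until the kernel generates an exit notification for this task. */
        poll(&pfd, 1, -1);
        printf("exit notification for tid %d (revents=0x%x)\n", tid, (unsigned)pfd.revents);
        close(pidfd);
        return 0;
    }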
2025-03-20tracing: Constify struct event_trigger_opsChristophe JAILLET
'struct event_trigger_ops' instances are not modified in this code. Constifying these structures moves some data to a read-only section, which increases overall security, especially when the structure holds some function pointers. On x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 31368 9024 6200 46592 b600 kernel/trace/trace_events_trigger.o After: ===== text data bss dec hex filename 31752 8608 6200 46560 b5e0 kernel/trace/trace_events_trigger.o Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/66e8f990e649678e4be37d4d1a19158ca0dea2f4.1741521295.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
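As a generic illustration of the technique (not an excerpt from the patch), marking an ops table of function pointers const lets the compiler place it in a read-only section:

    struct demo_ops {
        int  (*init)(void *data);
        void (*free)(void *data);
    };

    static int  demo_init(void *data) { (void)data; return 0; }
    static void demo_free(void *data) { (void)data; }

    /* 'const' moves the table from writable .data into .rodata, so the
     * function pointers cannot be overwritten at runtime. */
    static const struct demo_ops demo_ops_table = {
        .init = demo_init,
        .free = demo_free,
    };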
2025-03-20Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into nextJoerg Roedel
2025-03-19sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditionalIngo Molnar
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well. Reflect this reality and enable this functionality unconditionally. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-19sched/debug: Make 'const_debug' tunables unconditional __read_mostlyIngo Molnar
With CONFIG_SCHED_DEBUG becoming unconditional, remove the extra 'const_debug' indirection towards __read_mostly. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-3-mingo@kernel.org
2025-03-19sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()Ingo Molnar
The scheduler has this special SCHED_WARN_ON() facility that depends on CONFIG_SCHED_DEBUG. Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN_ON() to WARN_ON_ONCE(). Note that the warning output isn't 100% equivalent: #define SCHED_WARN_ON(x) WARN_ONCE(x, #x) Because SCHED_WARN_ON() would output the 'x' condition as well, while WARN_ON_ONCE() will only show a backtrace. Hopefully these are rare enough to not really matter. If they do, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
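A possible shape for the variant suggested above, shown only as a hedged sketch (this helper does not exist in the kernel): keep the warn-once semantics but also print the stringified condition:

    /* Hypothetical: warn at most once and include the failed condition text. */
    #define WARN_ON_ONCE_COND(cond)  WARN_ONCE((cond), "%s\n", #cond)

    /* e.g.: WARN_ON_ONCE_COND(rq->nr_running < 0);
     * prints "rq->nr_running < 0" plus the backtrace, once. */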
2025-03-19cgroup/rstat: avoid disabling irqs for O(num_cpu)Eric Dumazet
cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while iterating all possible cpus. It only drops the lock if there is scheduler or spin lock contention. If neither, then interrupts can be disabled for a long time. On large machines this can disable interrupts for a long enough time to drop network packets. On 400+ CPU machines I've seen interrupts disabled for over 40 msec. Prevent rstat from disabling interrupts while processing all possible cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This approach was previously discussed in https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, though this was in the context of a non-irq rstat spin lock. Benchmark this change with: 1) a single stat_reader process with 400 threads, each reading a test memcg's memory.stat repeatedly for 10 seconds. 2) 400 memory hog processes running in the test memcg and repeatedly charging memory until oom killed. Then they repeat charging and oom killing. v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds interrupts are disabled by rstat for 45341 usec: # => started at: _raw_spin_lock_irq # => ended at: cgroup_rstat_flush # # # _------=> CPU# # / _-----=> irqs-off/BH-disabled # | / _----=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # |||| / _-=> migrate-disable # ||||| / delay # cmd pid |||||| time | caller # \ / |||||| \ | / stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq stat_rea-96532 52d.... 45342us : cgroup_rstat_flush stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush stat_rea-96532 52d.... 45343us : <stack trace> => memcg1_stat_format => memory_stat_format => memory_stat_show => seq_read_iter => vfs_read => ksys_read => do_syscall_64 => entry_SYSCALL_64_after_hwframe With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the longest holder. The longest irqs-off holder has irqs disabled for 4142 usec, a huge reduction from the previous 45341 usec rstat finding. Running the stat_reader memory.stat reader for 10 seconds: - without memory hogs: 9.84M accesses => 12.7M accesses - with memory hogs: 9.46M accesses => 11.1M accesses The throughput of memory.stat access improves. The mode of memory.stat access latency after grouping by power-of-2 buckets: - without memory hogs: 64 usec => 16 usec - with memory hogs: 64 usec => 8 usec The memory.stat latency improves. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Greg Thelen <gthelen@google.com> Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
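The shape of the change, as a hedged sketch rather than a verbatim excerpt (the per-CPU flush helper name is illustrative): take and release the irq-safe lock inside the per-CPU loop, so interrupts are only disabled for one CPU's worth of flushing at a time:

    int cpu;

    for_each_possible_cpu(cpu) {
        spin_lock_irq(&cgroup_rstat_lock);
        flush_rstat_for_cpu(cgrp, cpu);      /* illustrative helper name */
        spin_unlock_irq(&cgroup_rstat_lock); /* IRQs re-enabled between CPUs */
    }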
2025-03-19bpf: Maintain FIFO property for rqspinlock unlockKumar Kartikeya Dwivedi
Since out-of-order unlocks are unsupported for rqspinlock, and irqsave variants enforce strict FIFO ordering anyway, make the same change for normal non-irqsave variants, such that FIFO ordering is enforced. Two new verifier state fields (active_lock_id, active_lock_ptr) are used to denote the top of the stack, and prev_id and prev_ptr are ascertained whenever popping the topmost entry through an unlock. Take special care to make these fields part of the state comparison in refsafe. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-25-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19bpf: Implement verifier support for rqspinlockKumar Kartikeya Dwivedi
Introduce verifier-side support for rqspinlock kfuncs. The first step is allowing bpf_res_spin_lock type to be defined in map values and allocated objects, so BTF-side is updated with a new BPF_RES_SPIN_LOCK field to recognize and validate. An object cannot have both bpf_spin_lock and bpf_res_spin_lock; only one of them (and at most one per object, like before) may be present. The bpf_res_spin_lock can also be used to protect objects that require lock protection for their kfuncs, like BPF rbtree and linked list. The verifier plumbing to simulate success and failure cases when calling the kfuncs is done by pushing a new verifier state to the verifier state stack which will verify the failure case upon calling the kfunc. The path where success is indicated creates all lock reference state and IRQ state (if necessary for irqsave variants). In the case of failure, the state clears the registers r0-r5, sets the return value, and skips kfunc processing, proceeding to the next instruction. When marking the return value for the success case, the value is marked as 0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program, whenever the user checks the return value as 'if (ret)' or 'if (ret < 0)' the verifier never traverses such branches for success cases, and would be aware that the lock is not held in such cases. We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs are invoked. We introduce a kfunc_class state to avoid mixing lock irqrestore kfuncs with IRQ state created by bpf_local_irq_save. With all this infrastructure, these kfuncs become usable in programs while satisfying all safety properties required by the kernel. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-24-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19bpf: Introduce rqspinlock kfuncsKumar Kartikeya Dwivedi
Introduce four new kfuncs, bpf_res_spin_lock, and bpf_res_spin_unlock, and their irqsave/irqrestore variants, which wrap the rqspinlock APIs. bpf_res_spin_lock returns a conditional result, depending on whether the lock was acquired (NULL is returned when lock acquisition succeeds, non-NULL upon failure). The memory pointed to by the returned pointer upon failure can be dereferenced after the NULL check to obtain the error code. Instead of using the old bpf_spin_lock type, introduce a new type with the same layout, and the same alignment, but a different name to avoid type confusion. Preemption is disabled upon successful lock acquisition, however IRQs are not. Special kfuncs can be introduced later to allow disabling IRQs when taking a spin lock. Resilient locks are safe against AA deadlocks, hence not disabling IRQs currently does not allow violation of kernel safety. __irq_flag annotation is used to accept IRQ flags for the IRQ-variants, with the same semantics as existing bpf_local_irq_{save, restore}. These kfuncs will require additional verifier-side support in subsequent commits, to allow programs to hold multiple locks at the same time. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-23-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
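A hedged sketch of how a BPF program might use these kfuncs once the verifier support described above is in place. The __ksym declarations and the int-return convention are assumptions inferred from the commit descriptions (failure reported via the return value), not verified prototypes:

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    /* Assumed kfunc prototypes -- check the kernel headers for the real ones. */
    extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
    extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

    struct elem {
        struct bpf_res_spin_lock lock;   /* lock embedded in the map value */
        long counter;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct elem);
    } vals SEC(".maps");

    SEC("tc")
    int count_packets(void *ctx)
    {
        int key = 0;
        struct elem *e = bpf_map_lookup_elem(&vals, &key);

        /* A non-zero return means the lock was NOT acquired (e.g. deadlock or
         * timeout); the verifier knows no lock is held on this path. */
        if (!e || bpf_res_spin_lock(&e->lock))
            return 0;
        e->counter++;
        bpf_res_spin_unlock(&e->lock);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";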
2025-03-19bpf: Convert lpm_trie.c to rqspinlockKumar Kartikeya Dwivedi
Convert all LPM trie usage of raw_spinlock to rqspinlock. Note that rcu_dereference_protected in trie_delete_elem is switched over to plain rcu_dereference: the RCU read lock should be held from the BPF program side or the eBPF syscall path, and the trie->lock is just acquired before the dereference. It is not clear from the commit history why the protected variant was used, but the above reasoning makes sense, so switch over. Closes: https://lore.kernel.org/lkml/000000000000adb08b061413919e@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-22-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
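The kind of change being described, in generic hedged form (pointer and lock names here are illustrative, not the trie code):

    /* Before: documents that an external lock protects the update-side pointer. */
    node = rcu_dereference_protected(parent->child, lockdep_is_held(&trie->lock));

    /* After: a plain read-side dereference; the caller is expected to be inside
     * an RCU read-side critical section (BPF program or eBPF syscall path). */
    node = rcu_dereference(parent->child);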
2025-03-19bpf: Convert percpu_freelist.c to rqspinlockKumar Kartikeya Dwivedi
Convert the percpu_freelist.c code to use rqspinlock, and remove the extralist fallback and trylock-based acquisitions to avoid deadlocks. Key thing to note is the retained while (true) loop to search through other CPUs when failing to push a node due to locking errors. This retains the behavior of the old code, where it would keep trying until it would be able to successfully push the node back into the freelist of a CPU. Technically, we should start iteration for this loop from raw_smp_processor_id() + 1, but to avoid hitting the edge of nr_cpus, we skip execution in the loop body instead. Closes: https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@mail.gmail.com Closes: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-21-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19bpf: Convert hashtab.c to rqspinlockKumar Kartikeya Dwivedi
Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed per-cpu counter crud from the code base which is no longer necessary. Closes: https://lore.kernel.org/bpf/675302fd.050a0220.2477f.0004.GAE@google.com Closes: https://lore.kernel.org/bpf/000000000000b3e63e061eed3f6b@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-20-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add locktorture supportKumar Kartikeya Dwivedi
Introduce locktorture support for rqspinlock using the newly added macros as the first in-kernel user and consumer. Guard the code with CONFIG_BPF_SYSCALL ifdef since rqspinlock is not available otherwise. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-19-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add entry to Makefile, MAINTAINERSKumar Kartikeya Dwivedi
Ensure that the rqspinlock code is only built when the BPF subsystem is compiled in. Depending on queued spinlock support, we may or may not end up building the queued spinlock slowpath, and instead fallback to the test-and-set implementation. Also add entries to MAINTAINERS file. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-18-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add basic support for CONFIG_PARAVIRTKumar Kartikeya Dwivedi
We ripped out PV and virtualization related bits from rqspinlock in an earlier commit, however, a fair lock performs poorly within a virtual machine when the lock holder is preempted. As such, retain the virt_spin_lock fallback to test and set lock, but with timeout and deadlock detection. We can do this by simply depending on the resilient_tas_spin_lock implementation from the previous patch. We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that requires more involved algorithmic changes and introduces more complexity. It can be done when the need arises in the future. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-15-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add a test-and-set fallbackKumar Kartikeya Dwivedi
Include a test-and-set fallback when queued spinlock support is not available. Introduce a rqspinlock type to act as a fallback when qspinlock support is absent. Include ifdef guards to ensure the slow path in this file is only compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add further logic to ensure fallback to the test-and-set implementation when queued spinlock support is unavailable on an architecture. Unlike other waiting loops in rqspinlock code, the one for test-and-set has no theoretical upper bound under contention, therefore we need a longer timeout than usual. Bump it up to a second in this case. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
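A heavily simplified sketch of the test-and-set fallback idea (illustrative only; the real code plugs into the rqspinlock timeout and deadlock-detection machinery rather than open-coding a deadline):

    #include <linux/atomic.h>
    #include <linux/errno.h>
    #include <linux/ktime.h>
    #include <linux/processor.h>

    /* Spin on a simple test-and-set word, but give up after 'timeout_ns'
     * instead of spinning forever behind a stalled owner. */
    static int tas_spin_lock_timeout(atomic_t *lock, u64 timeout_ns)
    {
        u64 deadline = ktime_get_mono_fast_ns() + timeout_ns;

        for (;;) {
            if (atomic_read(lock) == 0 &&
                atomic_cmpxchg_acquire(lock, 0, 1) == 0)
                return 0;                  /* acquired */
            if (ktime_get_mono_fast_ns() > deadline)
                return -ETIMEDOUT;         /* possible stall: report failure */
            cpu_relax();
        }
    }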
2025-03-19rqspinlock: Add deadlock detection and recoveryKumar Kartikeya Dwivedi
While the timeout logic provides guarantees for the waiter's forward progress, the time until a stalling waiter unblocks can still be long. The default timeout of 1/4 sec can be excessively long for some use cases. Additionally, custom timeouts may exacerbate recovery time. Introduce logic to detect common cases of deadlocks and perform quicker recovery. This is done by dividing the time from entry into the locking slow path until the timeout into intervals of 1 ms. Then, after each interval elapses, deadlock detection is performed, while also polling the lock word to ensure we can quickly break out of the detection logic and proceed with lock acquisition. A 'held_locks' table is maintained per-CPU where the entry at the bottom denotes a lock being waited for or already taken. Entries coming before it denote locks that are already held. The current CPU's table can thus be looked at to detect AA deadlocks. The tables from other CPUs can be looked at to discover ABBA situations. Finally, when a matching entry for the lock being taken on the current CPU is found on some other CPU, a deadlock situation is detected. This function can take a long time, therefore the lock word is constantly polled in each loop iteration to ensure we can preempt detection and proceed with lock acquisition, using the is_lock_released check. We set 'spin' member of rqspinlock_timeout struct to 0 to trigger deadlock checks immediately to perform faster recovery. Note: Extending lock word size by 4 bytes to record owner CPU can allow faster detection for ABBA. It is typically the owner which participates in a ABBA situation. However, to keep compatibility with existing lock words in the kernel (struct qspinlock), and given deadlocks are a rare event triggered by bugs, we choose to favor compatibility over faster detection. The release_held_lock_entry function requires an smp_wmb, while the release store on unlock will provide the necessary ordering for us. Add comments to document the subtleties of why this is correct. It is possible for stores to be reordered still, but in the context of the deadlock detection algorithm, a release barrier is sufficient and needn't be stronger for unlock's case. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
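A heavily simplified sketch of the per-CPU table idea, for the AA case only (structure layout and names are illustrative, not the actual rqspinlock code):

    #include <linux/percpu.h>
    #include <linux/types.h>

    #define NR_HELD 31

    struct held_locks_table {
        int   cnt;                 /* last entry = lock being waited for/taken */
        void *locks[NR_HELD];
    };
    static DEFINE_PER_CPU(struct held_locks_table, held_locks);

    /* AA check: is the lock we are about to wait for already held on this CPU?
     * Entries before the bottom one are locks that are already held. */
    static bool check_aa_deadlock(void *lock)
    {
        struct held_locks_table *t = this_cpu_ptr(&held_locks);
        int i;

        for (i = 0; i < t->cnt - 1; i++)
            if (t->locks[i] == lock)
                return true;
        return false;
    }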
2025-03-19rqspinlock: Protect waiters in trylock fallback from stallsKumar Kartikeya Dwivedi
When we run out of maximum rqnodes, the original queued spin lock slow path falls back to a try lock. In such a case, we are again susceptible to stalls in case the lock owner fails to make progress. We use the timeout as a fallback to break out of this loop and return to the caller. This is a fallback for an extreme edge case, when on the same CPU we run out of all 4 qnodes. When could this happen? We are in the slow path in task context, we get interrupted by an IRQ, which while in the slow path gets interrupted by an NMI, which in the slow path gets another nested NMI, which enters the slow path. All of the interruptions happen after node->count++. We use RES_DEF_TIMEOUT as our spinning duration, but in the case of this fallback, no fairness is guaranteed, so the duration may be too small for contended cases, as the waiting time is not bounded. Since this is an extreme corner case, let's just prefer timing out instead of attempting to spin for longer. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Protect waiters in queue from stallsKumar Kartikeya Dwivedi
Implement the wait queue cleanup algorithm for rqspinlock. There are three forms of waiters in the original queued spin lock algorithm. The first is the waiter which acquires the pending bit and spins on the lock word without forming a wait queue. The second is the head waiter that is the first waiter heading the wait queue. The third form is of all the non-head waiters queued behind the head, waiting to be signalled through their MCS node to overtake the responsibility of the head. In this commit, we are concerned with the second and third kind. First, we augment the waiting loop of the head of the wait queue with a timeout. When this timeout happens, all waiters part of the wait queue will abort their lock acquisition attempts. This happens in three steps. First, the head breaks out of its loop waiting for pending and locked bits to turn to 0, and non-head waiters break out of their MCS node spin (more on that later). Next, every waiter (head or non-head) attempts to check whether they are also the tail waiter, in such a case they attempt to zero out the tail word and allow a new queue to be built up for this lock. If they succeed, they have no one to signal next in the queue to stop spinning. Otherwise, they signal the MCS node of the next waiter to break out of its spin and try resetting the tail word back to 0. This goes on until the tail waiter is found. In case of races, the new tail will be responsible for performing the same task, as the old tail will then fail to reset the tail word and wait for its next pointer to be updated before it signals the new tail to do the same. We terminate the whole wait queue because of two main reasons. Firstly, we eschew per-waiter timeouts with one applied at the head of the wait queue. This allows everyone to break out faster once we've seen the owner / pending waiter not responding for the timeout duration from the head. Secondly, it avoids complicated synchronization, because when not leaving in FIFO order, prev's next pointer needs to be fixed up etc. Lastly, all of these waiters release the rqnode and return to the caller. This patch underscores the point that rqspinlock's timeout does not apply to each waiter individually, and cannot be relied upon as an upper bound. It is possible for the rqspinlock waiters to return early from a failed lock acquisition attempt as soon as stalls are detected. The head waiter cannot directly WRITE_ONCE the tail to zero, as it may race with a concurrent xchg and a non-head waiter linking its MCS node to the head's MCS node through 'prev->next' assignment. One notable thing is that we must use RES_DEF_TIMEOUT * 2 as our maximum duration for the waiting loop (for the wait queue head), since we may have both the owner and pending bit waiter ahead of us, and in the worst case, need to span their maximum permitted critical section lengths. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Protect pending bit owners from stallsKumar Kartikeya Dwivedi
The pending bit is used to avoid queueing in case the lock is uncontended, and has demonstrated benefits for the 2 contender scenario, esp. on x86. In case the pending bit is acquired and we wait for the locked bit to disappear, we may get stuck due to the lock owner not making progress. Hence, this waiting loop must be protected with a timeout check. To perform a graceful recovery once we decide to abort our lock acquisition attempt in this case, we must unset the pending bit since we own it. All waiters undoing their changes and exiting gracefully allows the lock word to be restored to the unlocked state once all participants (owner, waiters) have been recovered, and the lock remains usable. Hence, set the pending bit back to zero before returning to the caller. Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout event statistics. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-10-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Hardcode cond_acquire loops for arm64Kumar Kartikeya Dwivedi
Currently, for rqspinlock usage, the implementation of smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are susceptible to stalls on arm64, because they do not guarantee that the conditional expression will be repeatedly invoked if the address being loaded from is not written to by other CPUs. When support for event-streams is absent (which unblocks stuck WFE-based loops every ~100us), we may end up being stuck forever. This causes a problem for us, as we need to repeatedly invoke the RES_CHECK_TIMEOUT in the spin loop to break out when the timeout expires. Let us import the smp_cond_load_acquire_timewait implementation Ankur is proposing in [0], and then fallback to it once it is merged. While we rely on the implementation to amortize the cost of sampling check_timeout for us, it will not happen when event stream support is unavailable. This is not the common case, and it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns comparison, hence just let it be. [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com Cc: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add support for timeoutsKumar Kartikeya Dwivedi
Introduce policy macro RES_CHECK_TIMEOUT which can be used to detect when the timeout has expired for the slow path to return an error. It depends on being passed two variables initialized to 0: ts, ret. The 'ts' parameter is of type rqspinlock_timeout. This macro resolves to the (ret) expression so that it can be used in statements like smp_cond_load_acquire to break the waiting loop condition. The 'spin' member is used to amortize the cost of checking time by dispatching to the implementation every 64k iterations. The 'timeout_end' member is used to keep track of the timestamp that denotes the end of the waiting period. The 'ret' parameter denotes the status of the timeout, and can be checked in the slow path to detect timeouts after waiting loops. The 'duration' member is used to store the timeout duration for each waiting loop. The default timeout value defined in the header (RES_DEF_TIMEOUT) is 0.25 seconds. This macro will be used as a condition for waiting loops in the slow path. Since each waiting loop applies a fresh timeout using the same rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the values can be easily reinitialized to the default state. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
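A hedged sketch of what such a policy macro can look like (field names follow the description above; this is not the exact kernel definition, and check_timeout() stands in for the real clock-reading helper):

    struct rqspinlock_timeout {
        u64 timeout_end;   /* absolute deadline; set on the first real check   */
        u64 duration;      /* timeout length for the current waiting loop      */
        u32 spin;          /* amortizes clock reads to one per 64k iterations  */
    };

    #define RES_CHECK_TIMEOUT(ts, ret)                                  \
        ({                                                              \
            if (!((ts).spin++ & 0xffff))                                \
                (ret) = check_timeout(&(ts)); /* reads the clock */     \
            (ret);                                                      \
        })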
2025-03-19rqspinlock: Drop PV and virtualization supportKumar Kartikeya Dwivedi
Changes to rqspinlock in subsequent commits will be algorithmic modifications, which won't remain in agreement with the implementations of paravirt spin lock and virt_spin_lock support. These future changes include measures for terminating waiting loops in slow path after a certain point. While using a fair lock like qspinlock directly inside virtual machines leads to suboptimal performance under certain conditions, we cannot use the existing virtualization support before we make it resilient as well. Therefore, drop it for now. Note that we need to drop qspinlock_stat.h, as it's only relevant in case of CONFIG_PARAVIRT_SPINLOCKS=y, but we need to keep lock_events.h in the includes, which was indirectly pulled in before. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19rqspinlock: Add rqspinlock.h headerKumar Kartikeya Dwivedi
This header contains the public declarations usable in the rest of the kernel for rqspinlock. Let's also type alias qspinlock to rqspinlock_t to ensure consistent use of the new lock type. We want to remove dependence on the qspinlock type in later patches as we need to provide a test-and-set fallback, hence begin abstracting away from now onwards. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
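The aliasing described above amounts to something like the following (a sketch of the idea, not the exact header contents):

    /* For now the resilient lock reuses the queued spinlock layout; later
     * patches abstract this away for the test-and-set fallback. */
    typedef struct qspinlock rqspinlock_t;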
2025-03-19locking: Copy out qspinlock.c to kernel/bpf/rqspinlock.cKumar Kartikeya Dwivedi
In preparation for introducing a new lock implementation, Resilient Queued Spin Lock, or rqspinlock, we first begin our modifications by using the existing qspinlock.c code as the base. Simply copy the code to a new file and rename functions and variables from 'queued' to 'resilient_queued'. Since we place the file in kernel/bpf, include needs to be relative. This helps each subsequent commit in clearly showing how and where the code is being changed. The only change after a literal copy in this commit is renaming the functions where necessary, and rename qnodes to rqnodes. Let's also use EXPORT_SYMBOL_GPL for rqspinlock slowpath. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19locking: Allow obtaining result of arch_mcs_spin_lock_contendedKumar Kartikeya Dwivedi
To support upcoming changes that require inspecting the return value once the conditional waiting loop in arch_mcs_spin_lock_contended terminates, modify the macro to preserve the result of smp_cond_load_acquire. This enables checking the return value as needed, which will help disambiguate the MCS node’s locked state in future patches. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19locking: Move common qspinlock helpers to a private headerKumar Kartikeya Dwivedi
Move qspinlock helper functions that encode, decode tail word, set and clear the pending and locked bits, and other miscellaneous definitions and macros to a private header. To this end, create a qspinlock.h header file in kernel/locking. Subsequent commits will introduce a modified qspinlock slow path function, thus moving shared code to a private header will help minimize unnecessary code duplication. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-19pidfs: ensure that PIDFS_INFO_EXIT is availableChristian Brauner
When we currently create a pidfd we check that the task hasn't been reaped right before we create the pidfd. But it is of course possible that by the time we return the pidfd to userspace the task has already been reaped since we don't check again after having created a dentry for it. This was fine until now because that race was meaningless. But now that we provide PIDFD_INFO_EXIT it is a problem because it is possible that the kernel returns a reaped pidfd and it depends on the race whether PIDFD_INFO_EXIT information is available. This depends on whether the task gets reaped before or after a dentry has been attached to struct pid. Make this consistent and only return pidfds for reaped tasks if PIDFD_INFO_EXIT information is available. This is done by performing another check whether the task has been reaped right after we attached a dentry to struct pid. Since pidfs_exit() is called before struct pid's task linkage is removed the case where the task got reaped but a dentry was already attached to struct pid and exit information was recorded and published can be handled correctly. In that case we do return a pidfd for a reaped task like we would've before. Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19Merge tag 'v6.14-rc7' into x86/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-18bpf: clarify a misleading verifier error messageAndrea Terzolo
The current verifier error message states that tail_calls are not allowed in non-JITed programs with BPF-to-BPF calls. While this is accurate, it is not the only scenario where this restriction applies. Some architectures do not support this feature combination even when programs are JITed. This update improves the error message to better reflect these limitations. Suggested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrea Terzolo <andreaterzolo3@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20250318083551.8192-1-andreaterzolo3@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18bpf: Reject attaching fexit/fmod_ret to __noreturn functionsYafang Shao
If we attach fexit/fmod_ret to __noreturn functions, the bpf trampoline image will be left over even after the bpf link has been destroyed. Take attaching fexit to do_exit() as an example. The fexit works as follows, bpf_trampoline + __bpf_tramp_enter + percpu_ref_get(&tr->pcref); + call do_exit() + __bpf_tramp_exit + percpu_ref_put(&tr->pcref); Since do_exit() never returns, the refcnt of the trampoline image is never decremented, preventing it from being freed. That can be verified as follows, $ bpftool link show <<<< nothing output $ grep "bpf_trampoline_[0-9]" /proc/kallsyms ffffffffc04cb000 t bpf_trampoline_6442526459 [bpf] <<<< leftover In this patch, all functions annotated with __noreturn are rejected, except for the following cases: - Functions that result in a system reboot, such as panic, machine_real_restart and rust_begin_unwind - Functions that are never executed by tasks, such as rest_init and cpu_startup_entry - Functions implemented in assembly, such as rewind_stack_and_make_dead and xen_cpu_bringup_again, lack an associated BTF ID. With this change, attaching fexit probes to functions like do_exit() will be rejected. $ ./fexit libbpf: prog 'fexit': BPF program load failed: -EINVAL libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG -- Attaching fexit/fmod_ret to __noreturn functions is rejected. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20250318114447.75484-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storageMartin KaFai Lau
The current cgrp storage has a percpu counter, bpf_cgrp_storage_busy, to detect potential deadlock at a spin_lock that the local storage acquires during new storage creation. There are false positives. It turns out to be too noisy in production. For example, a bpf prog may be doing a bpf_cgrp_storage_get on map_a. An IRQ comes in and triggers another bpf_cgrp_storage_get on a different map_b. It will then trigger the false positive deadlock check in the percpu counter. On top of that, both are doing lookups only and have no need to create new storage, so practically they do not need to acquire the spin_lock. The bpf_task_storage_get already has a strategy to minimize this false positive by only failing if the bpf_task_storage_get needs to create a new storage and the percpu counter is busy. Creating a new storage is the only time it must acquire the spin_lock. This patch borrows the same idea. Unlike task storage that has a separate variant for tracing (_recur) and non-tracing, this patch stays with one bpf_cgrp_storage_get helper to keep it simple for now in light of the upcoming res_spin_lock. The variable could potentially use a better name noTbusy instead of nobusy. This patch follows the same naming in bpf_task_storage_get for now. I have tested it by temporarily adding noinline to the cgroup_storage_lookup(), traced it by fentry, and the fentry program succeeded in calling bpf_cgrp_storage_get(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250318182759.3676094-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18locking: Move MCS struct definition to public headerKumar Kartikeya Dwivedi
Move the definition of the struct mcs_spinlock from the private mcs_spinlock.h header in kernel/locking to the mcs_spinlock.h asm-generic header, since we will need to reference it from the qspinlock.h header in subsequent commits. Reviewed-by: Barret Rhoden <brho@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250316040541.108729-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
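For context, the structure being moved is small; it looks essentially like this (quoted from memory of kernel/locking/mcs_spinlock.h, so treat the exact comments and layout as approximate):

    struct mcs_spinlock {
        struct mcs_spinlock *next;  /* next waiter in the MCS queue     */
        int locked;                 /* 1 if lock acquired               */
        int count;                  /* nesting count, see qspinlock.c   */
    };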
2025-03-18bpf: Make perf_event_read_output accessible in all program types.Emil Tsalapatis
The perf_event_read_event_output helper is currently only available to tracing programs, but is useful for other BPF programs like sched_ext schedulers. When the helper is available, provide its bpf_func_proto directly from the bpf base_proto. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250318030753.10949-1-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18s390: Move s390 sysctls into their own file under arch/s390joel granados
Move s390 sysctls (spin_retry and userprocess_debug) into their own files under arch/s390. Create two new sysctl tables (s390_{fault,spin}_sysctl_table) which will be initialized with arch_initcall, placing them after their original place in proc_root_init. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: joel granados <joel.granados@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20250306-jag-mv_ctltables-v2-6-71b243c8d3f8@kernel.org Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18fs: dedup handling of struct filename init and refcounts bumpsMateusz Guzik
No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250313142744.1323281-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-17mm/rmap: basic MM owner tracking for large folios (!hugetlb)David Hildenbrand
For small folios, we traditionally use the mapcount to decide whether it was "certainly mapped exclusively" by a single MM (mapcount == 1) or whether it "maybe mapped shared" by multiple MMs (mapcount > 1). For PMD-sized folios that were PMD-mapped, we were able to use a similar mechanism (single PMD mapping), but for PTE-mapped folios and in the future folios that span multiple PMDs, this does not work. So we need a different mechanism to handle large folios. Let's add a new mechanism to detect whether a large folio is "certainly mapped exclusively", or whether it is "maybe mapped shared". We'll use this information next to optimize CoW reuse for PTE-mapped anonymous THP, and to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), independent of per-page mapcounts. For each large folio, we'll have two slots, whereby a slot stores: (1) an MM id: unique id assigned to each MM (2) a per-MM mapcount If a slot is unoccupied, it can be taken by the next MM that maps folio page. In addition, we'll remember the current state -- "mapped exclusively" vs. "maybe mapped shared" -- and use a bit spinlock to sync on updates and to reduce the total number of atomic accesses on updates. In the future, it might be possible to squeeze a proper spinlock into "struct folio". For now, keep it simple, as we require the whole thing with THP only, that is incompatible with RT. As we have to squeeze this information into the "struct folio" of even folios of order-1 (2 pages), and we generally want to reduce the required metadata, we'll assign each MM a unique ID that can fit into an int. In total, we can squeeze everything into 4x int (2x long) on 64bit. 32bit support is a bit challenging, because we only have 2x long == 2x int in order-1 folios. But we can make it work for now, because we neither expect many MMs nor very large folios on 32bit. We will reliably detect folios as "mapped exclusively" vs. "mapped shared" as long as only two MMs map pages of a folio at one point in time -- for example with fork() and short-lived child processes, or with apps that hand over state from one instance to another. As soon as three MMs are involved at the same time, we might detect "maybe mapped shared" although the folio is "mapped exclusively". Example 1: (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0 (2) App2 faults in a folio page -> Tracked as MM1 (4) App1 unmaps all folio pages -> We will detect "mapped exclusively". Example 2: (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0 (2) App2 faults in a folio page -> Tracked as MM1 (3) App3 faults in a folio page -> No slot available, tracked as "unknown" (4) App1 and App2 unmap all folio pages -> We will detect "maybe mapped shared". Make use of __always_inline to keep possible performance degradation when (un)mapping large folios to a minimum. Note: by squeezing the two flags into the "unsigned long" that stores the MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic setting/clearing of the "maybe mapped shared" bit, effectively not adding any new atomics on the hot path when updating the large mapcount + new metadata, which further helps reduce the runtime overhead in micro-benchmarks. 
Link: https://lkml.kernel.org/r/20250303163014.1128035-13-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17Merge tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds
Pull probes fixes from Masami Hiramatsu: - Clean up tprobe correctly on module unload Tracepoint probes do not set TRACEPOINT_STUB on the 'tpoint' pointer when unloading a module, thus they show as a normal 'fprobe' instead of 'tprobe' and never come back - Fix leakage of tprobe module refcount When a tprobe's target module is loaded, it gets the module's refcount in the module notifier but forgets to put it after registering the probe on it. Fix it by getting the refcount only when registering the tprobe. * tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: tprobe-events: Fix leakage of module refcount tracing: tprobe-events: Fix to clean up tprobe correctly when module unload
2025-03-17bpftool: Using the right format specifiersJiayuan Chen
Fix some format specifier errors, such as using %d for int and %u for unsigned int, as well as for other byte-length types. Perform type casts using the type derived from the data type itself; for example, if a value is originally an int, it is cast to unsigned int when an unsigned format is required. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250311112809.81901-3-jiayuan.chen@linux.dev
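A small generic example of the class of fix being described (not taken from the bpftool patch):

    #include <stdio.h>

    int main(void)
    {
        int signed_val = -1;
        unsigned int unsigned_val = 42;
        unsigned short u16_val = 7;

        printf("%d\n", signed_val);    /* %d for int             */
        printf("%u\n", unsigned_val);  /* %u for unsigned int    */
        printf("%hu\n", u16_val);      /* %hu for unsigned short */
        /* Wrong: printf("%d\n", unsigned_val); -- mismatched signedness */
        return 0;
    }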
2025-03-17bpf: Return prog btf_id without capable checkMykyta Yatsenko
Return the prog's btf_id from bpf_prog_get_info_by_fd regardless of the capable check. This patch enables the scenario where an freplace program, running from a user namespace, needs to query the target prog's BTF. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20250317174039.161275-3-mykyta.yatsenko5@gmail.com
2025-03-17bpf: BPF token support for BPF_BTF_GET_FD_BY_IDMykyta Yatsenko
Currently BPF_BTF_GET_FD_BY_ID requires CAP_SYS_ADMIN, which does not allow running it from a user namespace. This creates a problem when an freplace program running from a user namespace needs to query the target program's BTF. This patch relaxes the capable check from CAP_SYS_ADMIN to CAP_BPF and adds support for a BPF token that can be passed in the syscall attributes. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250317174039.161275-2-mykyta.yatsenko5@gmail.com
2025-03-17kexec_core: accept unaccepted kexec segments' destination addressesYan Zhao
The UEFI Specification version 2.9 introduces the concept of memory acceptance: some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, require memory to be accepted before it can be used by the guest. Accepting memory is expensive. The memory must be allocated by the VMM and then brought to a known safe state: cache must be flushed, memory must be zeroed with the guest's encryption key, and associated metadata must be manipulated. These operations must be performed from a trusted environment (firmware or TDX module). Switching context to and from it also takes time. This cost adds up. On large confidential VMs, memory acceptance alone can take minutes. It is better to delay memory acceptance until the memory is actually needed. The kernel accepts memory when it is allocated from buddy allocator for the first time. This reduces boot time and decreases memory overhead as the VMM can allocate memory as needed. It does not work when the guest attempts to kexec into a new kernel. The kexec segments' destination addresses are not allocated by the buddy allocator. Instead, they are searched from normal system RAM (top-down or bottom-up) and exclude driver-managed memory, ACPI, persistent, and reserved memory. Unaccepted memory is normal system RAM from kernel point of view and kexec can place segments there. Kexec bypasses the code path in buddy allocator where memory gets accepted and it leads to a crash when kexec accesses segments' memory. Accept the destination addresses during the kexec load, immediately after they pass sanity checks. This ensures the code is located in a common place shared by both the kexec_load and kexec_file_load system calls. This will not conflict with the accounting in try_to_accept_memory_one() since the accounting is set during kernel boot and decremented when pages are moved to the freelists. There is no harm in invoking accept_memory() on a page before making it available to the buddy allocator. No need to worry about re-accepting memory since accept_memory() checks the unaccepted bitmap before accepting a memory page. Although a user may perform kexec loading without ever triggering the jump, it doesn't impact much since kexec loading is not in a performance-critical path. Additionally, the destination addresses are always searched and found in the same location on a given system. Changes to the destination address searching logic to locate only memory in either unaccepted or accepted status are unnecessary and complicated. [kirill.shutemov@linux.intel.com: update the commit message] Link: https://lkml.kernel.org/r/20250307084411.2150367-1-kirill.shutemov@linux.intel.com Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ashish Kalra <Ashish.Kalra@amd.com> Cc: Baoquan He <bhe@redhat.com> Cc: Jianxiong Gao <jxgao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17watchdog/perf: optimize bytes copied and remove manual NUL-terminationThorsten Blum
Currently, up to 23 bytes of the source string are copied to the destination buffer (including the comma and anything after it), only to then manually NUL-terminate the destination buffer again at index 'len' (where the comma was found). Fix this by calling strscpy() with 'len' instead of the destination buffer size to copy only as many bytes from the source string as needed. Change the length check to allow 'len' to be less than or equal to the destination buffer size to fill the whole buffer if needed. Remove the if-check for the return value of strscpy(), because calling strscpy() with 'len' always truncates the source string at the comma as expected and NUL-terminates the destination buffer at the corresponding index instead. Remove the manual NUL-termination. No functional changes intended. Link: https://lkml.kernel.org/r/20250313133004.36406-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
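A generic sketch of the pattern being described (buffer and variable names are illustrative, not the watchdog code itself): pass the number of bytes to copy as the strscpy() size so the copy stops and NUL-terminates at the comma in one step:

    #include <linux/string.h>

    static char dest[24];

    static void copy_up_to_comma(const char *src)
    {
        const char *comma = strchr(src, ',');
        /* bytes before the comma, or the whole string if there is none */
        size_t len = comma ? (size_t)(comma - src) : strlen(src);

        if (len < sizeof(dest))
            /* copies at most 'len' bytes and NUL-terminates the result */
            strscpy(dest, src, len + 1);
    }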
2025-03-17signal: avoid clearing TIF_SIGPENDING in recalc_sigpending() if unsetMateusz Guzik
Clearing is an atomic op and the flag is not set most of the time. When creating and destroying threads in the same process with the pthread family, the primary bottleneck is calls to sigprocmask which take the process-wide sighand lock. Avoiding the atomic gives me a 2% bump in start/teardown rate at 24-core scale. [akpm@linux-foundation.org: add unlikely() as well] Link: https://lkml.kernel.org/r/20250303134908.423242-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
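The change boils down to guarding the atomic clear, roughly like this (a sketch of the idea as applied in kernel/signal.c; helper names may differ slightly):

    void recalc_sigpending(void)
    {
        if (!recalc_sigpending_tsk(current) && !freezing(current)) {
            /* Only touch the flag if it is actually set: clearing is an
             * atomic RMW, and the flag is clear most of the time. */
            if (unlikely(test_thread_flag(TIF_SIGPENDING)))
                clear_thread_flag(TIF_SIGPENDING);
        }
    }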
2025-03-17printk: Add an option to allow ttynull to be a default console deviceAdam Simonelli
The new option is CONFIG_NULL_TTY_DEFAULT_CONSOLE. If enabled, and CONFIG_VT is disabled, ttynull will become the default primary console device. ttynull will usually be the only console device with this option enabled. Some architectures do call add_preferred_console() which may add another console though. Motivation: Many distributions ship with CONFIG_VT enabled. On tested desktop hardware, if CONFIG_VT is disabled, the default console device falls back to /dev/ttyS0 instead of /dev/tty. This could cause issues in user space, and hardware problems: 1. The user space issues include the case where /dev/ttyS0 is disconnected, and the TCGETS ioctl, which some user space libraries use as a probe to determine if a file is a tty, is called on /dev/console and fails. Programs that call isatty() on /dev/console and get an incorrect false value may skip expected logging to /dev/console. 2. The hardware issues include the case where a user has a science instrument or other device connected to the /dev/ttyS0 port; if they were to upgrade to a kernel that disables the CONFIG_VT option, kernel logs will then be sent to the device connected to /dev/ttyS0 unless they edit their kernel command line manually. The new CONFIG_NULL_TTY_DEFAULT_CONSOLE option will give users and distribution maintainers an option to avoid this. Disabling CONFIG_VT and enabling CONFIG_NULL_TTY_DEFAULT_CONSOLE will ensure the default kernel console behavior is not dependent on hardware configuration by default, and avoid unexpected new behavior on devices connected to the /dev/ttyS0 serial port. Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Adam Simonelli <adamsimonelli@gmail.com> Link: https://lore.kernel.org/r/20250314160749.3286153-2-adamsimonelli@gmail.com [pmladek@suse.com: Fixed indentation of the commit message.] Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-03-17sched/topology: Stop exposing partition_sched_domains_lockedJuri Lelli
There are no callers of partition_sched_domains_locked() outside topology.c. Stop exposing the function. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MSC96a8FcqWV3G@jlelli-thinkpadt14gen4.remote.csb
2025-03-17cgroup/cpuset: Remove partition_and_rebuild_sched_domainsJuri Lelli
partition_and_rebuild_sched_domains() and partition_sched_domains() are now equivalent. Remove the former as a nice clean up. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <llong@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MR4ryNDJZDzsSG@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/topology: Remove redundant dl_clear_root_domain callJuri Lelli
We completely clean and restore root domains bandwidth accounting after every root domains change, so the dl_clear_root_domain() call in partition_sched_domains_locked() is redundant. Remove it. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <llong@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRtcX4tz4tcLRR@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Rebuild root domain accounting after every updateJuri Lelli
Rebuilding of root domains accounting information (total_bw) is currently broken in some cases, e.g. suspend/resume on aarch64. The problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile. Fix it by simplifying things: make sure bandwidth accounting is cleared and completely restored after root domain changes (after root domains are again stable). To be sure we always call dl_rebuild_rd_accounting while holding cpuset_mutex, we also add a cpuset_reset_sched_domains() wrapper. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Co-developed-by: Waiman Long <llong@redhat.com> Signed-off-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb