summaryrefslogtreecommitdiff
path: root/include/linux/sched
AgeCommit message (Collapse)Author
2025-12-05Merge tag 'mm-stable-2025-12-03-21-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki) Rework the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT) "ksm: fix exec/fork inheritance" (xu xin) Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec "mm/zswap: misc cleanup of code and documentations" (SeongJae Park) Some light maintenance work on the zswap code "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira) Enhance the /sys/kernel/debug/page_owner debug feature by adding unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn) Minor alterations to the page allocator's per-cpu-pages feature "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra) Address a scalability issue in userfaultfd's UFFDIO_MOVE operation "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov) "drivers/base/node: fold node register and unregister functions" (Donet Tom) Clean up the NUMA node handling code a little "mm: some optimizations for prot numa" (Kefeng Wang) Cleanups and small optimizations to the NUMA allocation hinting code "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn) Address long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang) Remove some now-unnecessary work from page reclaim "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park) Enhance the DAMOS auto-tuning feature "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan) Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes) Enhance the new(ish) file_operations.mmap_prepare() method and port additional callsites from the old ->mmap() over to ->mmap_prepare() "Fix stale IOTLB entries for kernel address space" (Lu Baolu) Fix a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang) Clean up and optimize the folio splitting code "mm, swap: misc cleanup and bugfix" (Kairui Song) Some cleanups and a minor fix in the swap discard code "mm/damon: misc documentation fixups" (SeongJae Park) "mm/damon: support pin-point targets removal" (SeongJae Park) Permit userspace to remove a specific monitoring target in the middle of the current targets list "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo) A couple of cleanups related to mm header file inclusion "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He) improve the selection of swap devices for NUMA machines "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista) Change the memory block labels from macros to enums so they will appear in kernel debug info "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes) Address an inefficiency when KSM unmerges an address range "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park) Fix leaks and unhandled malloc() failures in DAMON userspace unit tests "some cleanups for pageout()" (Baolin Wang) Clean up a couple of minor things in the page scanner's writeback-for-eviction code "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu) Move hugetlb's sysfs/sysctl handling code into a new file "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes) Make the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes) Reduce mmap lock contention for callers performing VMA guard region operations "vma_start_write_killable" (Matthew Wilcox) Start work on permitting applications to be killed when they are waiting on a read_lock on the VMA lock "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park) Add additional userspace testing of DAMON's "commit" feature "mm/damon: misc cleanups" (SeongJae Park) "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes) Address the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another "mm: support device-private THP" (Balbir Singh) Introduce support for Transparent Huge Page (THP) migration in zone device-private memory "Optimize folio split in memory failure" (Zi Yan) "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang) Some more cleanups in the folio splitting code "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes) Clean up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t "reparent the THP split queue" (Muchun Song) Reparent the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources "unify PMD scan results and remove redundant cleanup" (Wei Yang) A little cleanup in the hugepage collapse code "zram: introduce writeback bio batching" (Sergey Senozhatsky) Improve zram writeback efficiency by introducing batched bio writeback support "memcg: cleanup the memcg stats interfaces" (Shakeel Butt) Clean up our handling of the interrupt safety of some memcg stats "make vmalloc gfp flags usage more apparent" (Vishal Moola) Clean up vmalloc's handling of incoming GFP flags "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang) Teach soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension "mm: swap: small fixes and comment cleanups" (Youngjun Park) Fix a small bug and clean up some of the swap code "initial work on making VMA flags a bitmap" (Lorenzo Stoakes) Start work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park) Address a possible bug in the swap discard code and clean things up a little [ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu: register device memory for poison handling") because it looks broken to me, I've asked for clarification - Linus ] * tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm: fix vma_start_write_killable() signal handling mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate mm/swapfile: fix list iteration when next node is removed during discard fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling mm/kfence: add reboot notifier to disable KFENCE on shutdown memcg: remove inc/dec_lruvec_kmem_state helpers selftests/mm/uffd: initialize char variable to Null mm: fix DEBUG_RODATA_TEST indentation in Kconfig mm: introduce VMA flags bitmap type tools/testing/vma: eliminate dependency on vma->__vm_flags mm: simplify and rename mm flags function for clarity mm: declare VMA flags by bit zram: fix a spelling mistake mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity mm/vmscan: skip increasing kswapd_failures when reclaim was boosted pagemap: update BUDDY flag documentation mm: swap: remove scan_swap_map_slots() references from comments mm: swap: change swap_alloc_slow() to void mm, swap: remove redundant comment for read_swap_cache_async mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational ...
2025-12-03Merge tag 'sched_ext-for-6.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Improve recovery from misbehaving BPF schedulers. When a scheduler puts many tasks with varying affinity restrictions on a shared DSQ, CPUs scanning through tasks they cannot run can overwhelm the system, causing lockups. Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example scheduler to demonstrate this scenario. - Add lockless peek operation for DSQs to reduce lock contention for schedulers that need to query queue state during load balancing. - Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks. - Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in structs for future aux__prog parameter. - Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's idle state changes. - Fix migration tasks being incorrectly downgraded from stop_sched_class to rt_sched_class across sched_ext enable/disable. Applied late as the fix is low risk and the bug subtle but needs stable backporting. - Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT reliability, and backward compatibility improvements. * tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits) sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks sched_ext: tools: Removing duplicate targets during non-cross compilation sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs sched_ext: Update comments replacing breather with aborting mechanism sched_ext: Implement load balancer for bypass mode sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked() sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR sched_ext: Add scx_cpu0 example scheduler sched_ext: Hook up hardlockup detector sched_ext: Make handle_lockup() propagate scx_verror() result sched_ext: Refactor lockup handlers into handle_lockup() sched_ext: Make scx_exit() and scx_vexit() return bool sched_ext: Exit dispatch and move operations immediately when aborting sched_ext: Simplify breather mechanism with scx_aborting flag sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode sched_ext: Refactor do_enqueue_task() local and global DSQ paths sched_ext: Use shorter slice in bypass mode sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races sched_ext: Minor cleanups to scx_task_iter ...
2025-12-01Merge tag 'sched-core-2025-12-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scalability and load-balancing improvements: - Enable scheduler feature NEXT_BUDDY (Mel Gorman) - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman) - Skip sched_balance_running cmpxchg when balance is not due (Tim Chen) - Implement generic code for architecture specific sched domain NUMA distances (Tim Chen) - Optimize the NUMA distances of the sched-domains builds of Intel Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim Chen) - Implement proportional newidle balance: a randomized algorithm that runs newidle balancing proportional to its success rate. (Peter Zijlstra) Scheduler infrastructure changes: - Implement the 'sched_change' scoped_guard() pattern for the entire scheduler (Peter Zijlstra) - More broadly utilize the sched_change guard (Peter Zijlstra) - Add support to pick functions to take runqueue-flags (Joel Fernandes) - Provide and use set_need_resched_current() (Peter Zijlstra) Fair scheduling enhancements: - Forfeit vruntime on yield (Fernand Sieber) - Only update stats for allowed CPUs when looking for dst group (Adam Li) CPU-core scheduling enhancements: - Optimize core cookie matching check (Fernand Sieber) Deadline scheduler fixes: - Only set free_cpus for online runqueues (Doug Berger) - Fix dl_server time accounting (Peter Zijlstra) - Fix dl_server stop condition (Peter Zijlstra) Proxy scheduling fixes: - Yield the donor task (Fernand Sieber) Fixes and cleanups: - Fix do_set_cpus_allowed() locking (Peter Zijlstra) - Fix migrate_disable_switch() locking (Peter Zijlstra) - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() (Hao Jia) - Increase sched_tick_remote timeout (Phil Auld) - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth Hegde) - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)" * tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) sched: Provide and use set_need_resched_current() sched/fair: Proportional newidle balance sched/fair: Small cleanup to update_newidle_cost() sched/fair: Small cleanup to sched_balance_newidle() sched/fair: Revert max_newidle_lb_cost bump sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals sched/fair: Enable scheduler feature NEXT_BUDDY sched: Increase sched_tick_remote timeout sched/fair: Have SD_SERIALIZE affect newidle balancing sched/fair: Skip sched_balance_running cmpxchg when balance is not due sched/deadline: Minor cleanup in select_task_rq_dl() sched/deadline: Use cpumask_weight_and() in dl_bw_cpus sched/deadline: Document dl_server sched/deadline: Fix dl_server stop condition sched/deadline: Fix dl_server time accounting sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() sched/eevdf: Fix min_vruntime vs avg_vruntime sched/core: Add comment explaining force-idle vruntime snapshots sched/core: Optimize core cookie matching check sched/proxy: Yield the donor task ...
2025-11-17sched/fair: Proportional newidle balancePeter Zijlstra
Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
2025-11-16mm: consistently use current->mm in mm_get_unmapped_area()Ryan Roberts
mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() / arch_get_unmapped_area_topdown(), both of which search current->mm for some free space. Neither take an mm_struct - they implicitly operate on current->mm. But the wrapper takes an mm_struct and uses it to decide whether to search bottom up or top down. All callers pass in current->mm for this, so everything is working consistently. But it feels like an accident waiting to happen; eventually someone will call that function with a different mm, expecting to find free space in it, but what gets returned is free space in the current mm. So let's simplify by removing the parameter and have the wrapper use current->mm to decide which end to start at. Now everything is consistent and self-documenting. Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16mm: skip might_alloc() warnings when PF_MEMALLOC is setUladzislau Rezki (Sony)
might_alloc() catches invalid blocking allocations in contexts where sleeping is not allowed. However when PF_MEMALLOC is set, the page allocator already skips reclaim and other blocking paths. In such cases, a blocking gfp_mask does not actually lead to blocking, so triggering might_alloc() splats is misleading. Adjust might_alloc() to skip warnings when the current task has PF_MEMALLOC set, matching the allocator's actual blocking behaviour. Link: https://lkml.kernel.org/r/20251007122035.56347-9-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-14sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docsTejun Heo
With the buddy lockup detector, smp_processor_id() returns the detecting CPU, not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the locked CPU number from watchdog_hardlockup_check() as a parameter instead. Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and scx_rcu_cpu_stall() documenting their return value semantics. Suggested-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Factor out scx_dsq_list_node cursor initialization into ↵Tejun Heo
INIT_DSQ_LIST_CURSOR Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR macro in preparation for additional users. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Hook up hardlockup detectorTejun Heo
A poorly behaving BPF scheduler can trigger hard lockup. For example, on a large system with many tasks pinned to different subsets of CPUs, if the BPF scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock can be contended to the point where hardlockup triggers. Unfortunately, hardlockup can be the first signal out of such situations, thus requiring hardlockup handling. Hook scx_hardlockup() into the hardlockup detector to try kicking out the current scheduler in an attempt to recover the system to a good state. The handling strategy can delay watchdog taking its own action by one polling period; however, given that the only remediation for hardlockup is crash, this is likely an acceptable trade-off. v2: Add missing dummy scx_hardlockup() definition for !CONFIG_SCHED_CLASS_EXT (kernel test bot). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Cc: Douglas Anderson <dianders@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass modeTejun Heo
Bypass mode routes tasks through fallback dispatch queues. Originally a single global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node") changed this to per-node DSQs to resolve NUMA-related livelocks. Dan Schatzberg found per-node DSQs can still livelock when many threads are pinned to different small CPU subsets: each CPU must scan many incompatible tasks to find runnable ones, causing severe contention with high CPU counts. Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default idle CPU selection and direct dispatch handle most cases well. This introduces a failure mode when tasks concentrate on one CPU in over-saturated systems. If the BPF scheduler severely skews placement before triggering bypass, that CPU's queue may be too long to drain, causing RCU stalls. A load balancer in a future patch will address this. The bypass DSQ is separate from local DSQ to enable load balancing: local DSQs use rq locks, preventing efficient scanning and transfer across CPUs, especially problematic when systems are already contended. v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Use shorter slice in bypass modeTejun Heo
There have been reported cases of bypass mode not making forward progress fast enough. The 20ms default slice is unnecessarily long for bypass mode where the primary goal is ensuring all tasks can make forward progress. Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically switch to it when entering bypass mode. Also make the bypass slice value tunable through the slice_bypass_us module parameter (adjustable between 100us and 100ms) to make it easier to test whether slice durations are a factor in problem cases. v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan). v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea). Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05coredump: mark struct mm_struct as constChristian Brauner
We don't actually modify it. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-7-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to ↵Tejun Heo
finish_task_switch() sched_ext_free() was called from __put_task_struct() when the last reference to the task is dropped, which could be long after the task has finished running. This causes cgroup-related problems: - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d during scheduler load, because the cgroup might be destroyed/unlinked while the zombie or dead task is still lingering on the scx_tasks list. - ops.cgroup_exit() could be called before ops.exit_task() is called on all member tasks, leading to incorrect exit ordering. Fix by moving it to finish_task_switch() to be called right after the final context switch away from the dying task, matching when sched_class->task_dead() is called. Rename it to sched_ext_dead() to match the new calling context. By calling sched_ext_dead() before cgroup_task_dead(), we ensure that: - Tasks visible on scx_tasks list have valid cgroups during scheduler load, as cgroup_mutex prevents cgroup destruction while the task is still linked. - All member tasks have ops.exit_task() called and are removed from scx_tasks before the cgroup can be destroyed and trigger ops.cgroup_exit(). This fix is made possible by the cgroup_task_dead() split in the previous patch. This also makes more sense resource-wise as there's no point in keeping scheduler side resources around for dead tasks. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-15sched_ext: Add lockless peek operation for DSQsRyan Newton
The builtin DSQ queue data structures are meant to be used by a wide range of different sched_ext schedulers with different demands on these data structures. They might be per-cpu with low-contention, or high-contention shared queues. Unfortunately, DSQs have a coarse-grained lock around the whole data structure. Without going all the way to a lock-free, more scalable implementation, a small step we can take to reduce lock contention is to allow a lockless, small-fixed-cost peek at the head of the queue. This change allows certain custom SCX schedulers to cheaply peek at queues, e.g. during load balancing, before locking them. But it represents a few extra memory operations to update the pointer each time the DSQ is modified, including a memory barrier on ARM so the write appears correctly ordered. This commit adds a first_task pointer field which is updated atomically when the DSQ is modified, and allows any thread to peek at the head of the queue without holding the lock. Signed-off-by: Ryan Newton <newton@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-14sched/ext: Implement cgroup_set_idle() callbackzhidao su
Implement the missing cgroup_set_idle() callback that was marked as a TODO. This allows BPF schedulers to be notified when a cgroup's idle state changes, enabling them to adjust their scheduling behavior accordingly. The implementation follows the same pattern as other cgroup callbacks like cgroup_set_weight() and cgroup_set_bandwidth(). It checks if the BPF scheduler has implemented the callback and invokes it with the appropriate parameters. Fixes a spelling error in the cgroup_set_bandwidth() documentation. tj: s/scx_cgroup_rwsem/scx_cgroup_ops_rwsem/ to fix build breakage. Signed-off-by: zhidao su <soolaugust@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-02Merge tag 'mm-nonmm-stable-2025-10-02-15-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ida: Remove the ida_simple_xxx() API" from Christophe Jaillet completes the removal of this legacy IDR API - "panic: introduce panic status function family" from Jinchao Wang provides a number of cleanups to the panic code and its various helpers, which were rather ad-hoc and scattered all over the place - "tools/delaytop: implement real-time keyboard interaction support" from Fan Yu adds a few nice user-facing usability changes to the delaytop monitoring tool - "efi: Fix EFI boot with kexec handover (KHO)" from Evangelos Petrongonas fixes a panic which was happening with the combination of EFI and KHO - "Squashfs: performance improvement and a sanity check" from Phillip Lougher teaches squashfs's lseek() about SEEK_DATA/SEEK_HOLE. A mere 150x speedup was measured for a well-chosen microbenchmark - plus another 50-odd singleton patches all over the place * tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (75 commits) Squashfs: reject negative file sizes in squashfs_read_inode() kallsyms: use kmalloc_array() instead of kmalloc() MAINTAINERS: update Sibi Sankar's email address Squashfs: add SEEK_DATA/SEEK_HOLE support Squashfs: add additional inode sanity checking lib/genalloc: fix device leak in of_gen_pool_get() panic: remove CONFIG_PANIC_ON_OOPS_VALUE ocfs2: fix double free in user_cluster_connect() checkpatch: suppress strscpy warnings for userspace tools cramfs: fix incorrect physical page address calculation kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit Squashfs: fix uninit-value in squashfs_get_parent kho: only fill kimage if KHO is finalized ocfs2: avoid extra calls to strlen() after ocfs2_sprintf_system_inode_name() kernel/sys.c: fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths sched/task.h: fix the wrong comment on task_lock() nesting with tasklist_lock coccinelle: platform_no_drv_owner: handle also built-in drivers coccinelle: of_table: handle SPI device ID tables lib/decompress: use designated initializers for struct compress_format efi: support booting with kexec handover (KHO) ...
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-09-30Merge tag 'sched-core-2025-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core scheduler changes: - Make migrate_{en,dis}able() inline, to improve performance (Menglong Dong) - Move STDL_INIT() functions out-of-line (Peter Zijlstra) - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra) Fair scheduling: - Defer throttling to when tasks exit to user-space, to reduce the chance & impact of throttle-preemption with held locks and other resources (Aaron Lu, Valentin Schneider) - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the warning was getting triggered on certain topologies (Peter Zijlstra) Misc cleanups & fixes: - Header cleanups (Menglong Dong) - Fix race in push_dl_task() (Harshit Agarwal)" * tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix some typos in include/linux/preempt.h sched: Make migrate_{en,dis}able() inline rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c sched/fair: Do not balance task to a throttled cfs_rq sched/fair: Do not special case tasks in throttled hierarchy sched/fair: update_cfs_group() for throttled cfs_rqs sched/fair: Propagate load for throttled cfs_rq sched/fair: Get rid of throttled_lb_pair() sched/fair: Task based throttle time accounting sched/fair: Switch to task based throttle model sched/fair: Implement throttle task work and related helpers sched/fair: Add related data structure for task based throttle sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig sched: Move STDL_INIT() functions out-of-line sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask() sched/deadline: Fix race in push_dl_task()
2025-09-30Merge tag 'cgroup-for-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Extensive cpuset code cleanup and refactoring work with no functional changes: CPU mask computation logic refactoring, introducing new helpers, removing redundant code paths, and improving error handling for better maintainability. - A few bug fixes to cpuset including fixes for partition creation failures when isolcpus is in use, missing error returns, and null pointer access prevention in free_tmpmasks(). - Core cgroup changes include replacing the global percpu_rwsem with per-threadgroup rwsem when writing to cgroup.procs for better scalability, workqueue conversions to use WQ_PERCPU and system_percpu_wq to prepare for workqueue default switching from percpu to unbound, and removal of unused code including the post_attach callback. - New cgroup.stat.local time accounting feature that tracks frozen time duration. - Misc changes including selftests updates (new freezer time tests and backward compatibility fixes), documentation sync, string function safety improvements, and 64-bit division fixes. * tag 'cgroup-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (39 commits) cpuset: remove is_prs_invalid helper cpuset: remove impossible warning in update_parent_effective_cpumask cpuset: remove redundant special case for null input in node mask update cpuset: fix missing error return in update_cpumask cpuset: Use new excpus for nocpu error check when enabling root partition cpuset: fix failure to enable isolated partition when containing isolcpus Documentation: cgroup-v2: Sync manual toctree cpuset: use partition_cpus_change for setting exclusive cpus cpuset: use parse_cpulist for setting cpus.exclusive cpuset: introduce partition_cpus_change cpuset: refactor cpus_allowed_validate_change cpuset: refactor out validate_partition cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers cpuset: refactor CPU mask buffer parsing logic cpuset: Refactor exclusive CPU mask computation logic cpuset: change return type of is_partition_[in]valid to bool cpuset: remove unused assignment to trialcs->partition_root_state cpuset: move the root cpuset write check earlier cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock ...
2025-09-30Merge tag 'sched_ext-for-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Code organization cleanup. Separate internal types and accessors to ext_internal.h to reduce the size of ext.c and improve maintainability. - Prepare for cgroup sub-scheduler support by adding @sch parameter to various functions and helpers, reorganizing scheduler instance handling, and dropping obsolete helpers like scx_kf_exit() and kf_cpu_valid(). - Add new scx_bpf_cpu_curr() and scx_bpf_locked_rq() BPF helpers to provide safer access patterns with proper RCU protection. scx_bpf_cpu_rq() is deprecated with warnings due to potential race conditions. - Improve debugging with migration-disabled counter in error state dumps, SCX_EFLAG_INITIALIZED flag, bitfields for warning flags, and other enhancements to help diagnose issues. - Use cgroup_lock/unlock() for cgroup synchronization instead of scx_cgroup_rwsem based synchronization. This is simpler and allows enable/disable paths to synchronize against cgroup changes independent of the CPU controller. - rhashtable_lookup() replacement to avoid redundant RCU locking was reverted due to RCU usage warnings. Will be redone once rhashtable is updated to use rcu_dereference_all(). - Other misc updates and fixes including bypass handling improvements, scx_task_iter_relock() improvements, tools/sched_ext updates, and compatibility helpers. * tag 'sched_ext-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (28 commits) Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()" sched_ext: Misc updates around scx_sched instance pointer sched_ext: Drop scx_kf_exit() and scx_kf_error() sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit() sched_ext: Drop kf_cpu_valid() sched_ext: Add the @sch parameter to ext_idle helpers sched_ext: Add the @sch parameter to __bstr_format() sched_ext: Separate out scx_kick_cpu() and add @sch to it tools/sched_ext: scx_qmap: Make debug output quieter by default sched_ext: Make qmap dump operation non-destructive sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init() sched_ext: Use bitfields for boolean warning flags sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq() sched_ext: Improve SCX_KF_DISPATCH comment sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast() sched_ext: Verify RCU protection in scx_bpf_cpu_curr() sched_ext: Add migration-disabled counter to error state dump sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning tools/sched_ext: Add compat helper for scx_bpf_cpu_curr() sched_ext: deprecation warn for scx_bpf_cpu_rq() ...
2025-09-23sched_ext: Improve SCX_KF_DISPATCH commentTejun Heo
The comment for SCX_KF_DISPATCH was incomplete and didn't explain that ops.dispatch() may temporarily release the rq lock, allowing ENQUEUE and SELECT_CPU operations to be nested inside DISPATCH contexts. Update the comment to clarify this nesting behavior and provide better context for when these operations can occur within dispatch. Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-22sched/task.h: fix the wrong comment on task_lock() nesting with tasklist_lockOleg Nesterov
The ancient comment above task_lock() states that it can be nested outside of read_lock(&tasklist_lock), but this is no longer true: CPU_0 CPU_1 CPU_2 task_lock() read_lock(tasklist) write_lock_irq(tasklist) read_lock(tasklist) task_lock() Unless CPU_0 calls read_lock() in IRQ context, queued_read_lock_slowpath() won't get the lock immediately, it will spin waiting for the pending writer on CPU_2, resulting in a deadlock. Link: https://lkml.kernel.org/r/20250914110908.GA18769@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-21mm: constify arch_pick_mmap_layout() for improved const-correctnessMax Kellermann
This function only reads from the rlimit pointer (but writes to the mm_struct pointer which is kept without `const`). All callees are already const-ified or (internal functions) are being constified by this patch. Link: https://lkml.kernel.org/r/20250901205021.3573313-9-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13mm: update coredump logic to correctly use bitmap mm flagsLorenzo Stoakes
The coredump logic is slightly different from other users in that it both stores mm flags and additionally sets and gets using masks. Since the MMF_DUMPABLE_* flags must remain as they are for uABI reasons, and of course these are within the first 32-bits of the flags, it is reasonable to provide access to these in the same fashion so this logic can all still keep working as it has been. Therefore, introduce coredump-specific helpers __mm_flags_get_dumpable() and __mm_flags_set_mask_dumpable() for this purpose, and update all core dump users of mm flags to use these. [lorenzo.stoakes@oracle.com: abstract set_mask_bits() invocation to mm_types.h to satisfy ARC] Link: https://lkml.kernel.org/r/0e7ad263-1ff7-446d-81fe-97cff9c0e7ed@lucifer.local Link: https://lkml.kernel.org/r/2a5075f7e3c5b367d988178c79a3063d12ee53a9.1755012943.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dev Jain <dev.jain@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-10cgroup: replace global percpu_rwsem with per threadgroup resem when writing ↵Yi Tao
to cgroup.procs The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP doesn't require write locking cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To avoid affecting other users, the per threadgroup rwsem is only used when the favordynmods is enabled. As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. With this patch, under heavy fork and exec interference, the long-tail latency of cgroup migration has been reduced from milliseconds to microseconds. Under heavy cgroup migration interference, the multi-CPU score of the spawn test case in UnixBench increased by 9%. tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to pr_warn_once(). Signed-off-by: Yi Tao <escape@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03sched: Move STDL_INIT() functions out-of-linePeter Zijlstra
Since all these functions are address-taken in SDTL_INIT() and called indirectly, it doesn't really make sense for them to be inline. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-03sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()Peter Zijlstra
Leon [1] and Vinicius [2] noted a topology_span_sane() warning during their testing starting from v6.16-rc1. Debug that followed pointed to the tl->mask() for the NODE domain being incorrectly resolved to that of the highest NUMA domain. tl->mask() for NODE is set to the sd_numa_mask() which depends on the global "sched_domains_curr_level" hack. "sched_domains_curr_level" is set to the "tl->numa_level" during tl traversal in build_sched_domains() calling sd_init() but was not reset before topology_span_sane(). Since "tl->numa_level" still reflected the old value from build_sched_domains(), topology_span_sane() for the NODE domain trips when the span of the last NUMA domain overlaps. Instead of replicating the "sched_domains_curr_level" hack, get rid of it entirely and instead, pass the entire "sched_domain_topology_level" object to tl->cpumask() function to prevent such mishap in the future. sd_numa_mask() now directly references "tl->numa_level" instead of relying on the global "sched_domains_curr_level" hack to index into sched_domains_numa_masks[]. The original warning was reproducible on the following NUMA topology reported by Leon: $ sudo numactl -H available: 5 nodes (0-4) node 0 cpus: 0 1 node 0 size: 2927 MB node 0 free: 1603 MB node 1 cpus: 2 3 node 1 size: 3023 MB node 1 free: 3008 MB node 2 cpus: 4 5 node 2 size: 3023 MB node 2 free: 3007 MB node 3 cpus: 6 7 node 3 size: 3023 MB node 3 free: 3002 MB node 4 cpus: 8 9 node 4 size: 3022 MB node 4 free: 2718 MB node distances: node 0 1 2 3 4 0: 10 39 38 37 36 1: 39 10 38 37 36 2: 38 38 10 37 36 3: 37 37 37 10 36 4: 36 36 36 36 10 The above topology can be mimicked using the following QEMU cmd that was used to reproduce the warning and test the fix: sudo qemu-system-x86_64 -enable-kvm -cpu host \ -m 20G -smp cpus=10,sockets=10 -machine q35 \ -object memory-backend-ram,size=4G,id=m0 \ -object memory-backend-ram,size=4G,id=m1 \ -object memory-backend-ram,size=4G,id=m2 \ -object memory-backend-ram,size=4G,id=m3 \ -object memory-backend-ram,size=4G,id=m4 \ -numa node,cpus=0-1,memdev=m0,nodeid=0 \ -numa node,cpus=2-3,memdev=m1,nodeid=1 \ -numa node,cpus=4-5,memdev=m2,nodeid=2 \ -numa node,cpus=6-7,memdev=m3,nodeid=3 \ -numa node,cpus=8-9,memdev=m4,nodeid=4 \ -numa dist,src=0,dst=1,val=39 \ -numa dist,src=0,dst=2,val=38 \ -numa dist,src=0,dst=3,val=37 \ -numa dist,src=0,dst=4,val=36 \ -numa dist,src=1,dst=0,val=39 \ -numa dist,src=1,dst=2,val=38 \ -numa dist,src=1,dst=3,val=37 \ -numa dist,src=1,dst=4,val=36 \ -numa dist,src=2,dst=0,val=38 \ -numa dist,src=2,dst=1,val=38 \ -numa dist,src=2,dst=3,val=37 \ -numa dist,src=2,dst=4,val=36 \ -numa dist,src=3,dst=0,val=37 \ -numa dist,src=3,dst=1,val=37 \ -numa dist,src=3,dst=2,val=37 \ -numa dist,src=3,dst=4,val=36 \ -numa dist,src=4,dst=0,val=36 \ -numa dist,src=4,dst=1,val=36 \ -numa dist,src=4,dst=2,val=36 \ -numa dist,src=4,dst=3,val=36 \ ... [ prateek: Moved common functions to include/linux/sched/topology.h, reuse the common bits for s390 and ppc, commit message ] Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1] Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Valentin Schneider <vschneid@redhat.com> # x86 Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
2025-09-01copy_process: pass clone_flags as u64 across calltreeSimon Schuster
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-31Merge tag 'sched_ext-for-6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Add support for cgroup "cpu.max" interface - Code organization cleanup so that ext_idle.c doesn't depend on the source-file-inclusion build method of sched/ - Drop UP paths in accordance with sched core changes - Documentation and other misc changes * tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix scx_bpf_reenqueue_local() reference sched_ext: Drop kfuncs marked for removal in 6.15 sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments sched_ext: Add support for cgroup bandwidth control interface sched_ext, sched/core: Factor out struct scx_task_group sched_ext: Return NULL in llc_span sched_ext: Always use SMP versions in kernel/sched/ext_idle.h sched_ext: Always use SMP versions in kernel/sched/ext_idle.c sched_ext: Always use SMP versions in kernel/sched/ext.h sched_ext: Always use SMP versions in kernel/sched/ext.c sched_ext: Documentation: Clarify time slice handling in task lifecycle sched_ext: Make scx_locked_rq() inline sched_ext: Make scx_rq_bypassing() inline sched_ext: idle: Make local functions static in ext_idle.c sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
2025-07-29Merge tag 'sched-core-2025-07-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core scheduler changes: - Better tracking of maximum lag of tasks in presence of different slices duration, for better handling of lag in the fair scheduler (Vincent Guittot) - Clean up and standardize #if/#else/#endif markers throughout the entire scheduler code base (Ingo Molnar) - Make SMP unconditional: build the SMP scheduler's data structures and logic on UP kernel too, even though they are not used, to simplify the scheduler and remove around 200 #ifdef/[#else]/#endif blocks from the scheduler (Ingo Molnar) - Reorganize cgroup bandwidth control interface handling for better interfacing with sched_ext (Tejun Heo) Balancing: - Bump sd->max_newidle_lb_cost when newidle balance fails (Chris Mason) - Remove sched_domain_topology_level::flags to simplify the code (Prateek Nayak) - Simplify and clean up build_sched_topology() (Li Chen) - Optimize build_sched_topology() on large machines (Li Chen) Real-time scheduling: - Add initial version of proxy execution: a mechanism for mutex-owning tasks to inherit the scheduling context of higher priority waiters. Currently limited to a single runqueue and conditional on CONFIG_EXPERT, and other limitations (John Stultz, Peter Zijlstra, Valentin Schneider) - Deadline scheduler (Juri Lelli): - Fix dl_servers initialization order (Juri Lelli) - Fix DL scheduler's root domain reinitialization logic (Juri Lelli) - Fix accounting bugs after global limits change (Juri Lelli) - Fix scalability regression by implementing less agressive dl_server handling (Peter Zijlstra) PSI: - Improve scalability by optimizing psi_group_change() cpu_clock() usage (Peter Zijlstra) Rust changes: - Make Task, CondVar and PollCondVar methods inline to avoid unnecessary function calls (Kunwu Chan, Panagiotis Foliadis) - Add might_sleep() support for Rust code: Rust's "#[track_caller]" mechanism is used so that Rust's might_sleep() doesn't need to be defined as a macro (Fujita Tomonori) - Introduce file_from_location() (Boqun Feng) Debugging & instrumentation: - Make clangd usable with scheduler source code files again (Peter Zijlstra) - tools: Add root_domains_dump.py which dumps root domains info (Juri Lelli) - tools: Add dl_bw_dump.py for printing bandwidth accounting info (Juri Lelli) Misc cleanups & fixes: - Remove play_idle() (Feng Lee) - Fix check_preemption_disabled() (Sebastian Andrzej Siewior) - Do not call __put_task_struct() on RT if pi_blocked_on is set (Luis Claudio R. Goncalves) - Correct the comment in place_entity() (wang wei)" * tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits) sched/idle: Remove play_idle() sched: Do not call __put_task_struct() on rt if pi_blocked_on is set sched: Start blocked_on chain processing in find_proxy_task() sched: Fix proxy/current (push,pull)ability sched: Add an initial sketch of the find_proxy_task() function sched: Fix runtime accounting w/ split exec & sched contexts sched: Move update_curr_task logic into update_curr_se locking/mutex: Add p->blocked_on wrappers for correctness checks locking/mutex: Rework task_struct::blocked_on sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable sched/topology: Remove sched_domain_topology_level::flags x86/smpboot: avoid SMT domain attach/destroy if SMT is not enabled x86/smpboot: moves x86_topology to static initialize and truncate x86/smpboot: remove redundant CONFIG_SCHED_SMT smpboot: introduce SDTL_INIT() helper to tidy sched topology setup tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info tools/sched: Add root_domains_dump.py which dumps root domains info sched/deadline: Fix accounting after global limits change sched/deadline: Reset extra_bw to max_bw when clearing root domains sched/deadline: Initialize dl_servers after SMP ...
2025-07-29Merge tag 'locking-futex-2025-07-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex updates from Thomas Gleixner: - Switch the reference counting to a RCU based per-CPU reference to address a performance bottleneck vs the single instance rcuref variant - Make the futex selftest build on 32-bit architectures which only support 64-bit time_t, e.g. RISCV-32 - Cleanups and improvements in selftests and futex bench * tag 'locking-futex-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests/futex: Fix spelling mistake "Succeffuly" -> "Successfully" selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t perf bench futex: Remove support for IMMUTABLE selftests/futex: Remove support for IMMUTABLE futex: Remove support for IMMUTABLE futex: Make futex_private_hash_get() static futex: Use RCU-based per-CPU reference counting instead of rcuref_t selftests/futex: Adapt the private hash test to RCU related changes
2025-07-26sched/task_stack: Add missing const qualifier to end_of_stack()Kees Cook
Add missing const qualifier to the non-CONFIG_THREAD_INFO_IN_TASK version of end_of_stack() to match the CONFIG_THREAD_INFO_IN_TASK version. Fixes a warning with CONFIG_KSTACK_ERASE=y on archs that don't select THREAD_INFO_IN_TASK (such as LoongArch): error: passing 'const struct task_struct *' to parameter of type 'struct task_struct *' discards qualifiers The stackleak_task_low_bound() function correctly uses a const task parameter, but the legacy end_of_stack() prototype didn't like that. Build tested on loongarch (with CONFIG_KSTACK_ERASE=y) and m68k (with CONFIG_DEBUG_STACK_USAGE=y). Fixes: a45728fd4120 ("LoongArch: Enable HAVE_ARCH_STACKLEAK") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20250726004313.GA3650901@ax162 Cc: Youling Tang <tangyouling@kylinos.cn> Cc: Huacai Chen <chenhuacai@loongson.cn> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-14sched: Do not call __put_task_struct() on rt if pi_blocked_on is setLuis Claudio R. Goncalves
With PREEMPT_RT enabled, some of the calls to put_task_struct() coming from rt_mutex_adjust_prio_chain() could happen in preemptible context and with a mutex enqueued. That could lead to this sequence: rt_mutex_adjust_prio_chain() put_task_struct() __put_task_struct() sched_ext_free() spin_lock_irqsave() rtlock_lock() ---> TRIGGERS lockdep_assert(!current->pi_blocked_on); This is not a SCHED_EXT bug. The first cleanup function called by __put_task_struct() is sched_ext_free() and it happens to take a (RT) spin_lock, which in the scenario described above, would trigger the lockdep assertion of "!current->pi_blocked_on". Crystal Wood was able to identify the problem as __put_task_struct() being called during rt_mutex_adjust_prio_chain(), in the context of a process with a mutex enqueued. Instead of adding more complex conditions to decide when to directly call __put_task_struct() and when to defer the call, unconditionally resort to the deferred call on PREEMPT_RT to simplify the code. Fixes: 893cdaaa3977 ("sched: avoid false lockdep splat in put_task_struct()") Suggested-by: Crystal Wood <crwood@redhat.com> Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Wander Lairson Costa <wander@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/aGvTz5VaPFyj0pBV@uudg.org
2025-07-14sched/topology: Remove sched_domain_topology_level::flagsK Prateek Nayak
Support for overlapping domains added in commit e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans") also allowed forcefully setting SD_OVERLAP for !NUMA domains via FORCE_SD_OVERLAP sched_feat(). Since NUMA domains had to be presumed overlapping to ensure correct behavior, "sched_domain_topology_level::flags" was introduced. NUMA domains added the SDTL_OVERLAP flag would ensure SD_OVERLAP was always added during build_sched_domains() for these domains, even when FORCE_SD_OVERLAP was off. Condition for adding the SD_OVERLAP flag at the aforementioned commit was as follows: if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP)) sd->flags |= SD_OVERLAP; The FORCE_SD_OVERLAP debug feature was removed in commit af85596c74de ("sched/topology: Remove FORCE_SD_OVERLAP") which left the NUMA domains as the exclusive users of SDTL_OVERLAP, SD_OVERLAP, and SD_NUMA flags. Get rid of SDTL_OVERLAP and SD_OVERLAP as they have become redundant and instead rely on SD_NUMA to detect the only overlapping domain currently supported. Since SDTL_OVERLAP was the only user of "tl->flags", get rid of "sched_domain_topology_level::flags" too. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/ba4dbdf8-bc37-493d-b2e0-2efb00ea3e19@amd.com
2025-07-14smpboot: introduce SDTL_INIT() helper to tidy sched topology setupLi Chen
Define a small SDTL_INIT(maskfn, flagsfn, name) macro and use it to build the sched_domain_topology_level array. Purely a cleanup; behaviour is unchanged. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-2-me@linux.beauty
2025-07-11futex: Use RCU-based per-CPU reference counting instead of rcuref_tPeter Zijlstra
The use of rcuref_t for reference counting introduces a performance bottleneck when accessed concurrently by multiple threads during futex operations. Replace rcuref_t with special crafted per-CPU reference counters. The lifetime logic remains the same. The newly allocate private hash starts in FR_PERCPU state. In this state, each futex operation that requires the private hash uses a per-CPU counter (an unsigned int) for incrementing or decrementing the reference count. When the private hash is about to be replaced, the per-CPU counters are migrated to a atomic_t counter mm_struct::futex_atomic. The migration process: - Waiting for one RCU grace period to ensure all users observe the current private hash. This can be skipped if a grace period elapsed since the private hash was assigned. - futex_private_hash::state is set to FR_ATOMIC, forcing all users to use mm_struct::futex_atomic for reference counting. - After a RCU grace period, all users are guaranteed to be using the atomic counter. The per-CPU counters can now be summed up and added to the atomic_t counter. If the resulting count is zero, the hash can be safely replaced. Otherwise, active users still hold a valid reference. - Once the atomic reference count drops to zero, the next futex operation will switch to the new private hash. call_rcu_hurry() is used to speed up transition which otherwise might be delay with RCU_LAZY. There is nothing wrong with using call_rcu(). The side effects would be that on auto scaling the new hash is used later and the SET_SLOTS prctl() will block longer. [bigeasy: commit description + mm get/ put_async] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
2025-06-25sched_ext: Drop kfuncs marked for removal in 6.15Jake Hillion
sched_ext performed a kfunc renaming pass in 6.13 and kept the old names around for compatibility with old binaries. These were scheduled for cleanup in 6.15 but were missed. Submitting for cleanup in for-next. Removed the kfuncs, their flags, and any references I could find to them in doc comments. Left the entries in include/scx/compat.bpf.h as they're still useful to make new binaries compatible with old kernels. Tested by applying to my kernel. It builds and a modern version of scx_lavd loads fine. Signed-off-by: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-24sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panicDavid Dai
For systems using a sched_ext scheduler and has panic_on_rcu_stall enabled, try kicking out the current scheduler before issuing a panic. While there are numerous reasons for RCU CPU stalls that are not directly attributed to the scheduler, deferring the panic gives sched_ext an opportunity to provide additional debug info when ejecting the current scheduler. Also, handling the event more gracefully allows us to potentially recover the system instead of incurring additional down time. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: David Dai <david.dai@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20sched_ext: Add support for cgroup bandwidth control interfaceTejun Heo
From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Fri, 13 Jun 2025 15:06:47 -1000 - Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED. - Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH. - Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise. - Update tg_set_bandwidth() to update the parameters for both fair and SCX. - Add bandwidth control parameters to struct scx_cgroup_init_args. - Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates. - Update scx_qmap and maximal selftest to test the new feature. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20sched_ext, sched/core: Factor out struct scx_task_groupTejun Heo
More sched_ext fields will be added to struct task_group. In preparation, factor out sched_ext fields into struct scx_task_group to reduce clutter in the common header. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13sched/smp: Use the SMP version of sched_exec()Ingo Molnar
Simplify the scheduler making CONFIG_SMP=y sched_exec() code unconditional. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-27-mingo@kernel.org
2025-06-13sched/smp: Make SMP unconditionalIngo Molnar
Simplify the scheduler by making CONFIG_SMP=y primitives and data structures unconditional. Introduce transitory wrappers for functionality not yet converted to SMP. Note that this patch is pretty large, because there's no clear separation between various aspects of the SMP scheduler, it's basically a huge block of #ifdef CONFIG_SMP. A fair amount of it has to be switched on for it to boot and work on UP systems. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-21-mingo@kernel.org
2025-05-31Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...
2025-05-11task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END checkPasha Tatashin
Remove __HAVE_ARCH_KSTACK_END as it has been obsolete since removal of metag architecture in v4.17. Link: https://lkml.kernel.org/r/20250403-kstack-end-v1-1-7798e71f34d1@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-2-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-16sched/topology: Introduce sched_update_asym_prefer_cpu()K Prateek Nayak
A subset of AMD Processors supporting Preferred Core Rankings also feature the ability to dynamically switch these rankings at runtime to bias load balancing towards or away from the LLC domain with larger cache. To support dynamically updating "sg->asym_prefer_cpu" without needing to rebuild the sched domain, introduce sched_update_asym_prefer_cpu() which recomutes the "asym_prefer_cpu" when the core-ranking of a CPU changes. sched_update_asym_prefer_cpu() swaps the "sg->asym_prefer_cpu" with the CPU whose ranking has changed if the new ranking is greater than that of the "asym_prefer_cpu". If CPU whose ranking has changed is the current "asym_prefer_cpu", it scans the CPUs of the sched groups to find the new "asym_prefer_cpu" and sets it accordingly. get_group() for non-overlapping sched domains returns the sched group for the first CPU in the sched_group_span() which ensures all CPUs in the group see the updated value of "asym_prefer_cpu". Overlapping groups are allocated differently and will require moving the "asym_prefer_cpu" to "sg->sgc" but since the current implementations do not set "SD_ASYM_PACKING" at NUMA domains, skip additional indirection and place a SCHED_WARN_ON() to alert any future users. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409053446.23367-3-kprateek.nayak@amd.com
2025-04-02Merge tag 'objtool-urgent-2025-04-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Ingo Molnar: "These are objtool fixes and updates by Josh Poimboeuf, centered around the fallout from the new CONFIG_OBJTOOL_WERROR=y feature, which, despite its default-off nature, increased the profile/impact of objtool warnings: - Improve error handling and the presentation of warnings/errors - Revert the new summary warning line that some test-bot tools interpreted as new regressions - Fix a number of objtool warnings in various drivers, core kernel code and architecture code. About half of them are potential problems related to out-of-bounds accesses or potential undefined behavior, the other half are additional objtool annotations - Update objtool to latest (known) compiler quirks and objtool bugs triggered by compiler code generation - Misc fixes" * tag 'objtool-urgent-2025-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) objtool/loongarch: Add unwind hints in prepare_frametrace() rcu-tasks: Always inline rcu_irq_work_resched() context_tracking: Always inline ct_{nmi,irq}_{enter,exit}() sched/smt: Always inline sched_smt_active() objtool: Fix verbose disassembly if CROSS_COMPILE isn't set objtool: Change "warning:" to "error: " for fatal errors objtool: Always fail on fatal errors Revert "objtool: Increase per-function WARN_FUNC() rate limit" objtool: Append "()" to function name in "unexpected end of section" warning objtool: Ignore end-of-section jumps for KCOV/GCOV objtool: Silence more KCOV warnings, part 2 objtool, drm/vmwgfx: Don't ignore vmw_send_msg() for ORC objtool: Fix STACK_FRAME_NON_STANDARD for cold subfunctions objtool: Fix segfault in ignore_unreachable_insn() objtool: Fix NULL printf() '%s' argument in builtin-check.c:save_argv() objtool, lkdtm: Obfuscate the do_nothing() pointer objtool, regulator: rk808: Remove potential undefined behavior in rk806_set_mode_dcdc() objtool, ASoC: codecs: wcd934x: Remove potential undefined behavior in wcd934x_slim_irq_handler() objtool, Input: cyapa - Remove undefined behavior in cyapa_update_fw_store() objtool, panic: Disable SMAP in __stack_chk_fail() ...
2025-04-01sched/smt: Always inline sched_smt_active()Josh Poimboeuf
sched_smt_active() can be called from noinstr code, so it should always be inlined. The CONFIG_SCHED_SMT version already has __always_inline. Do the same for its !CONFIG_SCHED_SMT counterpart. Fixes the following warning: vmlinux.o: error: objtool: intel_idle_ibrs+0x13: call to sched_smt_active() leaves .noinstr.text section Fixes: 321a874a7ef8 ("sched/smt: Expose sched_smt_present static key") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/1d03907b0a247cf7fb5c1d518de378864f603060.1743481539.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/r/202503311434.lyw2Tveh-lkp@intel.com/
2025-03-25Merge tag 'timers-core-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Fix a memory ordering issue in posix-timers Posix-timer lookup is lockless and reevaluates the timer validity under the timer lock, but the update which validates the timer is not protected by the timer lock. That allows the store to be reordered against the initialization stores, so that the lookup side can observe a partially initialized timer. That's mostly a theoretical problem, but incorrect nevertheless. - Fix a long standing inconsistency of the coarse time getters The coarse time getters read the base time of the current update cycle without reading the actual hardware clock. NTP frequency adjustment can set the base time backwards. The fine grained interfaces compensate this by reading the clock and applying the new conversion factor, but the coarse grained time getters use the base time directly. That allows the user to observe time going backwards. Cure it by always forwarding base time, when NTP changes the frequency with an immediate step. - Rework of posix-timer hashing The posix-timer hash is not scalable and due to the CRIU timer restore mechanism prone to massive contention on the global hash bucket lock. Replace the global hash lock with a fine grained per bucket locking scheme to address that. - Rework the proc/$PID/timers interface. /proc/$PID/timers is provided for CRIU to be able to restore a timer. The printout happens with sighand lock held and interrupts disabled. That's not required as this can be done with RCU protection as well. - Provide a sane mechanism for CRIU to restore a timer ID CRIU restores timers by creating and deleting them until the kernel internal per process ID counter reached the requested ID. That's horribly slow for sparse timer IDs. Provide a prctl() which allows CRIU to restore a timer with a given ID. When enabled the ID pointer is used as input pointer to read the requested ID from user space. When disabled, the normal allocation scheme (next ID) is active as before. This is backwards compatible for both kernel and user space. - Make hrtimer_update_function() less expensive. The sanity checks are valuable, but expensive for high frequency usage in io/uring. Make the debug checks conditional and enable them only when lockdep is enabled. - Small updates, cleanups and improvements * tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) selftests/timers: Improve skew_consistency by testing with other clockids timekeeping: Fix possible inconsistencies in _COARSE clockids posix-timers: Drop redundant memset() invocation selftests/timers/posix-timers: Add a test for exact allocation mode posix-timers: Provide a mechanism to allocate a given timer ID posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held posix-timers: Make per process list RCU safe posix-timers: Avoid false cacheline sharing posix-timers: Switch to jhash32() posix-timers: Improve hash table performance posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t posix-timers: Make lock_timer() use guard() posix-timers: Rework timer removal posix-timers: Simplify lock/unlock_timer() posix-timers: Use guards in a few places posix-timers: Remove SLAB_PANIC from kmem cache posix-timers: Remove a few paranoid warnings posix-timers: Cleanup includes posix-timers: Add cond_resched() to posix_timer_add() search loop posix-timers: Initialise timer before adding it to the hash table ...
2025-03-24Merge tag 'sched-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-24Merge tag 'sched_ext-for-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Add mechanism to count and report internal events. This significantly improves visibility on subtle corner conditions. - The default idle CPU selection logic is revamped and improved in multiple ways including being made topology aware. - sched_ext was disabling ttwu_queue for simplicity, which can be costly when hardware topology is more complex. Implement SCX_OPS_ALLOWED_QUEUED_WAKEUP so that BPF schedulers can selectively enable ttwu_queue. - tools/sched_ext updates to improve compatibility among others. - Other misc updates and fixes. - sched_ext/for-6.14-fixes were pulled a few times to receive prerequisite fixes and resolve conflicts. * tag 'sched_ext-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (42 commits) sched_ext: idle: Refactor scx_select_cpu_dfl() sched_ext: idle: Honor idle flags in the built-in idle selection policy sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local() sched_ext: Add trace point to track sched_ext core events sched_ext: Change the event type from u64 to s64 sched_ext: Documentation: add task lifecycle summary tools/sched_ext: Provide a compatible helper for scx_bpf_events() selftests/sched_ext: Add NUMA-aware scheduler test tools/sched_ext: Provide consistent access to scx flags sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior sched_ext: idle: Introduce scx_bpf_nr_node_ids() sched_ext: idle: Introduce node-aware idle cpu kfunc helpers sched_ext: idle: Per-node idle cpumasks sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE sched_ext: idle: Make idle static keys private sched/topology: Introduce for_each_node_numadist() iterator mm/numa: Introduce nearest_node_nodemask() nodemask: numa: reorganize inclusion path nodemask: add nodes_copy() tools/sched_ext: Sync with scx repo ...