summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-06-19maple_tree: fix MA_STATE_PREALLOC flag in mas_preallocate()Liam R. Howlett
Temporarily clear the preallocation flag when explicitly requesting allocations. Pre-existing allocations are already counted against the request through mas_node_count_gfp(), but the allocations will not happen if the MA_STATE_PREALLOC flag is set. This flag is meant to avoid re-allocating in bulk allocation mode, and to detect issues with preallocation calculations. The MA_STATE_PREALLOC flag should also always be set on zero allocations so that detection of underflow allocations will print a WARN_ON() during consumption. User visible effect of this flaw is a WARN_ON() followed by a null pointer dereference when subsequent requests for larger number of nodes is ignored, such as the vma merge retry in mmap_region() caused by drivers altering the vma flags (which happens in v6.6, at least) Link: https://lkml.kernel.org/r/20250616184521.3382795-3-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reported-by: Hailong Liu <hailong.liu@oppo.com> Link: https://lore.kernel.org/all/1652f7eb-a51b-4fee-8058-c73af63bacd1@oppo.com/ Link: https://lore.kernel.org/all/20250428184058.1416274-1-Liam.Howlett@oracle.com/ Link: https://lore.kernel.org/all/20250429014754.1479118-1-Liam.Howlett@oracle.com/ Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Hailong Liu <hailong.liu@oppo.com> Cc: zhangpeng.00@bytedance.com <zhangpeng.00@bytedance.com> Cc: Steve Kang <Steve.Kang@unisoc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19MAINTAINERS: add missing test files to mm gup sectionLorenzo Stoakes
We previously overlooked GUP test files that sensibly should belong to the GUP section, include them now. Link: https://lkml.kernel.org/r/20250616200844.560225-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19MAINTAINERS: add missing mm/workingset.c file to mm reclaim sectionLorenzo Stoakes
The working set logic belongs very much to the reclaim section and is otherwise not assigned to any other MAINTAINERS section so add it here. Link: https://lkml.kernel.org/r/20250616201643.561626-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19selftests/mm: skip uprobe vma merge test if uprobes are not enabledPedro Falcato
If uprobes are not enabled, the test currently fails with: 7151 12:46:54.627936 # # # RUN merge.handle_uprobe_upon_merged_vma ... 7152 12:46:54.639014 # # f /sys/bus/event_source/devices/uprobe/type 7153 12:46:54.639306 # # fopen: No such file or directory 7154 12:46:54.650451 # # # merge.c:473:handle_uprobe_upon_merged_vma:Expected read_sysfs("/sys/bus/event_source/devices/uprobe/type", &type) (1) == 0 (0) 7155 12:46:54.650730 # # # handle_uprobe_upon_merged_vma: Test terminated by assertion 7156 12:46:54.661750 # # # FAIL merge.handle_uprobe_upon_merged_vma 7157 12:46:54.662030 # # not ok 8 merge.handle_uprobe_upon_merged_vma Skipping is a more sane and friendly behavior here. Link: https://lkml.kernel.org/r/20250610122209.3177587-1-pfalcato@suse.de Fixes: efe99fabeb11 ("selftests/mm: add test about uprobe pte be orphan during vma merge") Signed-off-by: Pedro Falcato <pfalcato@suse.de> Reported-by: Aishwarya <aishwarya.tcv@arm.com> Closes: https://lore.kernel.org/linux-mm/20250610103729.72440-1-aishwarya.tcv@arm.com/ Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by : Donet Tom <donettom@linux.ibm.com> Reviewed-by : Donet Tom <donettom@linux.ibm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19bcache: remove unnecessary select MIN_HEAPKuan-Wei Chiu
After reverting the transition to the generic min heap library, bcache no longer depends on MIN_HEAP. The select entry can be removed to reduce code size and shrink the kernel's attack surface. This change effectively reverts the bcache-related part of commit 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions"). This is part of a series of changes to address a performance regression caused by the use of the generic min_heap implementation. As reported by Robert, bcache now suffers from latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. These regressions degrade bcache's effectiveness as a low-latency cache layer and lead to frequent timeouts and application stalls in production environments. Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com Link: https://lkml.kernel.org/r/20250614202353.1632957-4-visitorckw@gmail.com Fixes: 866898efbb25 ("bcache: remove heap-related macros and switch to generic min_heap") Fixes: 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions") Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reported-by: Robert Pang <robertpang@google.com> Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com Acked-by: Coly Li <colyli@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19Revert "bcache: remove heap-related macros and switch to generic min_heap"Kuan-Wei Chiu
This reverts commit 866898efbb25bb44fd42848318e46db9e785973a. The generic bottom-up min_heap implementation causes performance regression in invalidate_buckets_lru(), a hot path in bcache. Before the cache is fully populated, new_bucket_prio() often returns zero, leading to many equal comparisons. In such cases, bottom-up sift_down performs up to 2 * log2(n) comparisons, while the original top-down approach completes with just O() comparisons, resulting in a measurable performance gap. The performance degradation is further worsened by the non-inlined min_heap API functions introduced in commit 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions"), adding function call overhead to this critical path. As reported by Robert, bcache now suffers from latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. These regressions degrade bcache's effectiveness as a low-latency cache layer and lead to frequent timeouts and application stalls in production environments. This revert aims to restore bcache's original low-latency behavior. Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com Link: https://lkml.kernel.org/r/20250614202353.1632957-3-visitorckw@gmail.com Fixes: 866898efbb25 ("bcache: remove heap-related macros and switch to generic min_heap") Fixes: 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions") Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reported-by: Robert Pang <robertpang@google.com> Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com Acked-by: Coly Li <colyli@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19Revert "bcache: update min_heap_callbacks to use default builtin swap"Kuan-Wei Chiu
Patch series "bcache: Revert min_heap migration due to performance regression". This patch series reverts the migration of bcache from its original heap implementation to the generic min_heap library. While the original change aimed to simplify the code and improve maintainability, it introduced a severe performance regression in real-world scenarios. As reported by Robert, systems using bcache now suffer from periodic latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. This degrades bcache's value as a low-latency caching layer, and leads to frequent timeouts and application stalls in production environments. The primary cause of this regression is the behavior of the generic min_heap implementation's bottom-up sift_down, which performs up to 2 * log2(n) comparisons when many elements are equal. The original top-down variant used by bcache only required O(1) comparisons in such cases. The issue was further exacerbated by commit 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions"), which introduced non-inlined versions of the min_heap API, adding function call overhead to a performance-critical hot path. This patch (of 3): This reverts commit 3d8a9a1c35227c3f1b0bd132c9f0a80dbda07b65. Although removing the custom swap function simplified the code, this change is part of a broader migration to the generic min_heap API that introduced significant performance regressions in bcache. As reported by Robert, bcache now suffers from latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. These regressions degrade bcache's effectiveness as a low-latency cache layer and lead to frequent timeouts and application stalls in production environments. This revert is part of a series of changes to restore previous performance by undoing the min_heap transition. Link: https://lkml.kernel.org/r/20250614202353.1632957-1-visitorckw@gmail.com Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com Link: https://lkml.kernel.org/r/20250614202353.1632957-2-visitorckw@gmail.com Fixes: 866898efbb25 ("bcache: remove heap-related macros and switch to generic min_heap") Fixes: 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions") Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reported-by: Robert Pang <robertpang@google.com> Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com Acked-by: Coly Li <colyli@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19selftests/mm: add configs to fix testcase failureDev Jain
If CONFIG_UPROBES is not set, a merge subtest fails: Failure log: 7151 12:46:54.627936 # # # RUN merge.handle_uprobe_upon_merged_vma ... 7152 12:46:54.639014 # # f /sys/bus/event_source/devices/uprobe/type 7153 12:46:54.639306 # # fopen: No such file or directory 7154 12:46:54.650451 # # # merge.c:473:handle_uprobe_upon_merged_vma:Expected read_sysfs("/sys/bus/event_source/devices/uprobe/type", &type) (1) == 0 (0) 7155 12:46:54.650730 # # # handle_uprobe_upon_merged_vma: Test terminated by assertion 7156 12:46:54.661750 # # # FAIL merge.handle_uprobe_upon_merged_vma 7157 12:46:54.662030 # # not ok 8 merge.handle_uprobe_upon_merged_vma CONFIG_UPROBES is enabled by CONFIG_UPROBE_EVENTS, which gets enabled by CONFIG_FTRACE. Therefore add these configs to selftests/mm/config so that CI systems can include this config in the kernel build. To be completely safe, add CONFIG_PROFILING too, to enable the dependency chain PROFILING -> PERF_EVENTS -> UPROBE_EVENTS -> UPROBES. Link: https://lkml.kernel.org/r/20250613034912.53791-1-dev.jain@arm.com Fixes: efe99fabeb11 ("selftests/mm: add test about uprobe pte be orphan during vma merge") Signed-off-by: Dev Jain <dev.jain@arm.com> Reported-by: Aishwarya <aishwarya.tcv@arm.com> Closes: https://lore.kernel.org/all/20250610103729.72440-1-aishwarya.tcv@arm.com/ Tested-by: Aishwarya TCV <aishwarya.tcv@arm.com> Tested-by : Donet Tom <donettom@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Pu Lehui <pulehui@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19kho: initialize tail pages for higher order folios properlyPratyush Yadav
Currently, when restoring higher order folios, kho_restore_folio() only calls prep_compound_page() on all the pages. That is not enough to properly initialize the folios. The managed page count does not get updated, the reserved flag does not get dropped, and page count does not get initialized properly. Restoring a higher order folio with it results in the following BUG with CONFIG_DEBUG_VM when attempting to free the folio: BUG: Bad page state in process test pfn:104e2b page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x104e2b flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff) raw: 002fffff80000000 0000000000000000 00000000ffffffff 0000000000000000 raw: ffffffffffffffff 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: nonzero _refcount [...] Call Trace: <TASK> dump_stack_lvl+0x4b/0x70 bad_page.cold+0x97/0xb2 __free_frozen_pages+0x616/0x850 [...] Combine the path for 0-order and higher order folios, initialize the tail pages with a count of zero, and call adjust_managed_page_count() to account for all the pages instead of just missing them. In addition, since all the KHO-preserved pages get marked with MEMBLOCK_RSRV_NOINIT by deserialize_bitmap(), the reserved flag is not actually set (as can also be seen from the flags of the dumped page in the logs above). So drop the ClearPageReserved() calls. [ptyadav@amazon.de: declare i in the loop instead of at the top] Link: https://lkml.kernel.org/r/20250613125916.39272-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20250605171143.76963-1-pratyush@kernel.org Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19MAINTAINERS: add linux-mm@ list to Kexec HandoverPratyush Yadav
Along with kexec, KHO also has parts dealing with memory management, like page/folio initialization, memblock, and preserving/unpreserving memory for next kernel. Copy linux-mm@ to KHO patches so the right set of eyes can look at changes to those parts. Link: https://lkml.kernel.org/r/20250613131917.4488-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19mm: userfaultfd: fix race of userfaultfd_move and swap cacheKairui Song
This commit fixes two kinds of races, they may have different results: Barry reported a BUG_ON in commit c50f8e6053b0, we may see the same BUG_ON if the filemap lookup returned NULL and folio is added to swap cache after that. If another kind of race is triggered (folio changed after lookup) we may see RSS counter is corrupted: [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1 [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1 Because the folio is being accounted to the wrong VMA. I'm not sure if there will be any data corruption though, seems no. The issues above are critical already. On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache lookup, and tries to move the found folio to the faulting vma. Currently, it relies on checking the PTE value to ensure that the moved folio still belongs to the src swap entry and that no new folio has been added to the swap cache, which turns out to be unreliable. While working and reviewing the swap table series with Barry, following existing races are observed and reproduced [1]: In the example below, move_pages_pte is moving src_pte to dst_pte, where src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the swap cache: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1 ... < interrupted> ... <swapin src_pte, alloc and use folio A> // folio A is a new allocated folio // and get installed into src_pte <frees swap entry S1> // src_pte now points to folio A, S1 // has swap count == 0, it can be freed // by folio_swap_swap or swap // allocator's reclaim. <try to swap out another folio B> // folio B is a folio in another VMA. <put folio B to swap cache using S1 > // S1 is freed, folio B can use it // for swap out with no problem. ... folio = filemap_get_folio(S1) // Got folio B here !!! ... < interrupted again> ... <swapin folio B and free S1> // Now S1 is free to be used again. <swapout src_pte & folio A using S1> // Now src_pte is a swap entry PTE // holding S1 again. folio_trylock(folio) move_swap_pte double_pt_lock is_pte_pages_stable // Check passed because src_pte == S1 folio_move_anon_rmap(...) // Moved invalid folio B here !!! The race window is very short and requires multiple collisions of multiple rare events, so it's very unlikely to happen, but with a deliberately constructed reproducer and increased time window, it can be reproduced easily. This can be fixed by checking if the folio returned by filemap is the valid swap cache folio after acquiring the folio lock. Another similar race is possible: filemap_get_folio may return NULL, but folio (A) could be swapped in and then swapped out again using the same swap entry after the lookup. In such a case, folio (A) may remain in the swap cache, so it must be moved too: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1, and S1 is not in swap cache folio = filemap_get_folio(S1) // Got NULL ... < interrupted again> ... <swapin folio A and free S1> <swapout folio A re-using S1> move_swap_pte double_pt_lock is_pte_pages_stable // Check passed because src_pte == S1 folio_move_anon_rmap(...) // folio A is ignored !!! Fix this by checking the swap cache again after acquiring the src_pte lock. And to avoid the filemap overhead, we check swap_map directly [2]. The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so far we don't need to worry about that, since folios can only be exposed to the swap cache in the swap out path, and this is covered in this patch by checking the swap cache again after acquiring the src_pte lock. Testing with a simple C program that allocates and moves several GB of memory did not show any observable performance change. Link: https://lkml.kernel.org/r/20250604151038.21968-1-ryncsn@gmail.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Kairui Song <kasong@tencent.com> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2] Reviewed-by: Lokesh Gidra <lokeshgidra@google.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chris Li <chrisl@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"David Hildenbrand
After commit 1aaf8c122918 ("mm: gup: fix infinite loop within __get_longterm_locked") we are able to longterm pin folios that are not supposed to get longterm pinned, simply because they temporarily have the LRU flag cleared (esp. temporarily isolated). For example, two __get_longterm_locked() callers can race, or __get_longterm_locked() can race with anything else that temporarily isolates folios. The introducing commit mentions the use case of a driver that uses vm_ops->fault to insert pages allocated through cma_alloc() into the page tables, assuming they can later get longterm pinned. These pages/ folios would never have the LRU flag set and consequently cannot get isolated. There is no known in-tree user making use of that so far, fortunately. To handle that in the future -- and avoid retrying forever to isolate/migrate them -- we will need a different mechanism for the CMA area *owner* to indicate that it actually already allocated the page and is fine with longterm pinning it. The LRU flag is not suitable for that. Probably we can lookup the relevant CMA area and query the bitmap; we only have have to care about some races, probably. If already allocated, we could just allow longterm pinning) Anyhow, let's fix the "must not be longterm pinned" problem first by reverting the original commit. Link: https://lkml.kernel.org/r/20250611131314.594529-1-david@redhat.com Fixes: 1aaf8c122918 ("mm: gup: fix infinite loop within __get_longterm_locked") Signed-off-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/all/20250522092755.GA3277597@tiffany/ Reported-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Xu <peterx@redhat.com> Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Cc: Aijun Sun <aijun.sun@unisoc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19selftests/mm: increase timeout from 180 to 900 secondsShivank Garg
The mm selftests are timing out with the current 180-second limit. Testing shows that run_vmtests.sh takes approximately 11 minutes (664 seconds) to complete. Increase the timeout to 900 seconds (15 minutes) to provide sufficient buffer for the tests to complete successfully. Link: https://lkml.kernel.org/r/20250609120606.73145-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19mm/shmem, swap: fix softlockup with mTHP swapinKairui Song
Following softlockup can be easily reproduced on my test machine with: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled swapon /dev/zram0 # zram0 is a 48G swap device mkdir -p /sys/fs/cgroup/memory/test echo 1G > /sys/fs/cgroup/test/memory.max echo $BASHPID > /sys/fs/cgroup/test/cgroup.procs while true; do dd if=/dev/zero of=/tmp/test.img bs=1M count=5120 cat /tmp/test.img > /dev/null rm /tmp/test.img done Then after a while: watchdog: BUG: soft lockup - CPU#0 stuck for 763s! [cat:5787] Modules linked in: zram virtiofs CPU: 0 UID: 0 PID: 5787 Comm: cat Kdump: loaded Tainted: G L 6.15.0.orig-gf3021d9246bc-dirty #118 PREEMPT(voluntary)· Tainted: [L]=SOFTLOCKUP Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015 RIP: 0010:mpol_shared_policy_lookup+0xd/0x70 Code: e9 b8 b4 ff ff 31 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 <48> 8b 1f 48 85 db 74 41 4c 8d 67 08 48 89 fb 48 89 f5 4c 89 e7 e8 RSP: 0018:ffffc90002b1fc28 EFLAGS: 00000202 RAX: 00000000001c20ca RBX: 0000000000724e1e RCX: 0000000000000001 RDX: ffff888118e214c8 RSI: 0000000000057d42 RDI: ffff888118e21518 RBP: 000000000002bec8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000bf4 R11: 0000000000000000 R12: 0000000000000001 R13: 00000000001c20ca R14: 00000000001c20ca R15: 0000000000000000 FS: 00007f03f995c740(0000) GS:ffff88a07ad9a000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f03f98f1000 CR3: 0000000144626004 CR4: 0000000000770eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> shmem_alloc_folio+0x31/0xc0 shmem_swapin_folio+0x309/0xcf0 ? filemap_get_entry+0x117/0x1e0 ? xas_load+0xd/0xb0 ? filemap_get_entry+0x101/0x1e0 shmem_get_folio_gfp+0x2ed/0x5b0 shmem_file_read_iter+0x7f/0x2e0 vfs_read+0x252/0x330 ksys_read+0x68/0xf0 do_syscall_64+0x4c/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f03f9a46991 Code: 00 48 8b 15 81 14 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 20 ad 01 00 f3 0f 1e fa 80 3d 35 97 10 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec RSP: 002b:00007fff3c52bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f03f9a46991 RDX: 0000000000040000 RSI: 00007f03f98ba000 RDI: 0000000000000003 RBP: 00007fff3c52bd50 R08: 0000000000000000 R09: 00007f03f9b9a380 R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000 R13: 00007f03f98ba000 R14: 0000000000000003 R15: 0000000000000000 </TASK> The reason is simple, readahead brought some order 0 folio in swap cache, and the swapin mTHP folio being allocated is in conflict with it, so swapcache_prepare fails and causes shmem_swap_alloc_folio to return -EEXIST, and shmem simply retries again and again causing this loop. Fix it by applying a similar fix for anon mTHP swapin. The performance change is very slight, time of swapin 10g zero folios with shmem (test for 12 times): Before: 2.47s After: 2.48s [kasong@tencent.com: add comment] Link: https://lkml.kernel.org/r/20250610181645.45922-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250610181645.45922-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250609171751.36305-1-ryncsn@gmail.com Fixes: 1dd44c0af4fa ("mm: shmem: skip swapcache for swapin of synchronous swap device") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-15Linux 6.16-rc2v6.16-rc2Linus Torvalds
2025-06-15Merge tag 'kbuild-fixes-v6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Move warnings about linux/export.h from W=1 to W=2 - Fix structure type overrides in gendwarfksyms * tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: gendwarfksyms: Fix structure type overrides kbuild: move warnings about linux/export.h from W=1 to W=2
2025-06-16gendwarfksyms: Fix structure type overridesSami Tolvanen
As we always iterate through the entire die_map when expanding type strings, recursively processing referenced types in type_expand_child() is not actually necessary. Furthermore, the type_string kABI rule added in commit c9083467f7b9 ("gendwarfksyms: Add a kABI rule to override type strings") can fail to override type strings for structures due to a missing kabi_get_type_string() check in this function. Fix the issue by dropping the unnecessary recursion and moving the override check to type_expand(). Note that symbol versions are otherwise unchanged with this patch. Fixes: c9083467f7b9 ("gendwarfksyms: Add a kABI rule to override type strings") Reported-by: Giuliano Procida <gprocida@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-06-16kbuild: move warnings about linux/export.h from W=1 to W=2Masahiro Yamada
This hides excessive warnings, as nobody builds with W=2. Fixes: a934a57a42f6 ("scripts/misc-check: check missing #include <linux/export.h> when W=1") Fixes: 7d95680d64ac ("scripts/misc-check: check unnecessary #include <linux/export.h> when W=1") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com>
2025-06-14Merge tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - SMB3.1.1 POSIX extensions fix for char remapping - Fix for repeated directory listings when directory leases enabled - deferred close handle reuse fix * tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: improve directory cache reuse for readdir operations smb: client: fix perf regression with deferred closes smb: client: disable path remapping with POSIX extensions
2025-06-14Merge tag 'iommu-fixes-v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu fix from Joerg Roedel: - Fix PTE size calculation for NVidia Tegra * tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: iommu/tegra: Fix incorrect size calculation
2025-06-14Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - Fix for a deadlock on queue freeze with zoned writes - Fix for zoned append emulation - Two bio folio fixes, for sparsemem and for very large folios - Fix for a performance regression introduced in 6.13 when plug insertion was changed - Fix for NVMe passthrough handling for polled IO - Document the ublk auto registration feature - loop lockdep warning fix * tag 'block-6.16-20250614' of git://git.kernel.dk/linux: nvme: always punt polled uring_cmd end_io work to task_work Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists block: Fix bvec_set_folio() for very large folios bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP block: use plug request list tail for one-shot backmerge attempt block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) loop: move lo_set_size() out of queue freeze
2025-06-14Merge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix for a race between SQPOLL exit and fdinfo reading. It's slim and I was only able to reproduce this with an artificial delay in the kernel. Followup sparse fix as well to unify the access to ->thread. - Fix for multiple buffer peeking, avoiding truncation if possible. - Run local task_work for IOPOLL reaping when the ring is exiting. This currently isn't done due to an assumption that polled IO will never need task_work, but a fix on the block side is going to change that. * tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux: io_uring: run local task_work from ring exit IOPOLL reaping io_uring/kbuf: don't truncate end buffer for multiple buffer peeks io_uring: consistently use rcu semantics with sqpoll thread io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()
2025-06-14Merge tag 'rust-fixes-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux Pull Rust fix from Miguel Ojeda: - 'hrtimer': fix future compile error when the 'impl_has_hr_timer!' macro starts to get called * tag 'rust-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: rust: time: Fix compile error in impl_has_hr_timer macro
2025-06-14Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. Only 4 are for MM" * tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: add mmap_prepare() compatibility layer for nested file systems init: fix build warnings about export.h MAINTAINERS: add Barry as a THP reviewer drivers/rapidio/rio_cm.c: prevent possible heap overwrite mm: close theoretical race where stale TLB entries could linger mm/vma: reset VMA iterator on commit_merge() OOM failure docs: proc: update VmFlags documentation in smaps scatterlist: fix extraneous '@'-sign kernel-doc notation selftests/mm: skip failed memfd setups in gup_longterm
2025-06-13Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "All fixes for drivers. The core change in the error handler is simply to translate an ALUA specific sense code into a retry the ALUA components can handle and won't impact any other devices" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: error: alua: I/O errors for ALUA state transitions scsi: storvsc: Increase the timeouts to storvsc_timeout scsi: s390: zfcp: Ensure synchronous unit_add scsi: iscsi: Fix incorrect error path labels for flashnode operations scsi: mvsas: Fix typos in per-phy comments and SAS cmd port registers scsi: core: ufs: Fix a hang in the error handler
2025-06-13Merge tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Dave Airlie: "Quiet week, only two pull requests came my way, xe has a couple of fixes and then a bunch of fixes across the board, vc4 probably fixes the biggest problem: vc4: - Fix infinite EPROBE_DEFER loop in vc4 probing amdxdna: - Fix amdxdna firmware size meson: - modesetting fixes sitronix: - Kconfig fix for st7171-i2c dma-buf: - Fix -EBUSY WARN_ON_ONCE in dma-buf udmabuf: - Use dma_sync_sgtable_for_cpu in udmabuf xe: - Fix regression disallowing 64K SVM migration - Use a bounce buffer for WA BB" * tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel: drm/xe/lrc: Use a temporary buffer for WA BB udmabuf: use sgtable-based scatterlist wrappers dma-buf: fix compare in WARN_ON_ONCE drm/sitronix: st7571-i2c: Select VIDEOMODE_HELPERS drm/meson: fix more rounding issues with 59.94Hz modes drm/meson: use vclk_freq instead of pixel_freq in debug print drm/meson: fix debug log statement when setting the HDMI clocks drm/vc4: fix infinite EPROBE_DEFER loop drm/xe/svm: Fix regression disallowing 64K SVM migration accel/amdxdna: Fix incorrect PSP firmware size
2025-06-13io_uring: run local task_work from ring exit IOPOLL reapingJens Axboe
In preparation for needing to shift NVMe passthrough to always use task_work for polled IO completions, ensure that those are suitably run at exit time. See commit: 9ce6c9875f3e ("nvme: always punt polled uring_cmd end_io work to task_work") for details on why that is necessary. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13nvme: always punt polled uring_cmd end_io work to task_workJens Axboe
Currently NVMe uring_cmd completions will complete locally, if they are polled. This is done because those completions are always invoked from task context. And while that is true, there's no guarantee that it's invoked under the right ring context, or even task. If someone does NVMe passthrough via multiple threads and with a limited number of poll queues, then ringA may find completions from ringB. For that case, completing the request may not be sound. Always just punt the passthrough completions via task_work, which will redirect the completion, if needed. Cc: stable@vger.kernel.org Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13Merge tag 'acpi-6.16-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These fix an ACPI APEI error injection driver failure that started to occur after switching it over to using a faux device, address an EC driver issue related to invalid ECDT tables, clean up the usage of mwait_idle_with_hints() in the ACPI PAD driver, add a new IRQ override quirk, and fix a NULL pointer dereference related to nosmp: - Update the faux device handling code in the driver core and address an ACPI APEI error injection driver failure that started to occur after switching it over to using a faux device on top of that (Dan Williams) - Update data types of variables passed as arguments to mwait_idle_with_hints() in the ACPI PAD (processor aggregator device) driver to match the function definition after recent changes (Uros Bizjak) - Fix a NULL pointer dereference in the ACPI CPPC library that occurs when nosmp is passed to the kernel in the command line (Yunhui Cui) - Ignore ECDT tables with an invalid ID string to prevent using an incorrect GPE for signaling events on some systems (Armin Wolf) - Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan)" * tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: resource: Use IRQ override on MACHENIKE 16P ACPI: EC: Ignore ECDT tables with an invalid ID string ACPI: CPPC: Fix NULL pointer dereference when nosmp is used ACPI: PAD: Update arguments of mwait_idle_with_hints() ACPI: APEI: EINJ: Do not fail einj_init() on faux_device_create() failure driver core: faux: Quiet probe failures driver core: faux: Suppress bind attributes
2025-06-13Merge tag 'pm-6.16-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix the cpupower utility installation, fix up the recently added Rust abstractions for cpufreq and OPP, restore the x86 update eliminating mwait_play_dead_cpuid_hint() that has been reverted during the 6.16 merge window along with preventing the failure caused by it from happening, and clean up mwait_idle_with_hints() usage in intel_idle: - Implement CpuId Rust abstraction and use it to fix doctest failure related to the recently introduced cpumask abstraction (Viresh Kumar) - Do minor cleanups in the `# Safety` sections for cpufreq abstractions added recently (Viresh Kumar) - Unbreak cpupower systemd service units installation on some systems by adding a unitdir variable for specifying the location to install them (Francesco Poli) - Eliminate mwait_play_dead_cpuid_hint() again after reverting its elimination during the 6.16 merge window due to a problem with handling "dead" SMT siblings, but this time prevent leaving them in C1 after initialization by taking them online and back offline when a proper cpuidle driver for the platform has been registered (Rafael Wysocki) - Update data types of variables passed as arguments to mwait_idle_with_hints() to match the function definition after recent changes (Uros Bizjak)" * tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: rust: cpu: Add CpuId::current() to retrieve current CPU ID rust: Use CpuId in place of raw CPU numbers rust: cpu: Introduce CpuId abstraction intel_idle: Update arguments of mwait_idle_with_hints() cpufreq: Convert `/// SAFETY` lines to `# Safety` sections cpupower: split unitdir from libdir in Makefile Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" ACPI: processor: Rescan "dead" SMT siblings during initialization intel_idle: Rescan "dead" SMT siblings during initialization x86/smp: PM/hibernate: Split arch_resume_nosmt() intel_idle: Use subsys_initcall_sync() for initialization
2025-06-13Merge branches 'acpi-pad', 'acpi-cppc', 'acpi-ec' and 'acpi-resource'Rafael J. Wysocki
Merge assorted ACPI updates for 6.16-rc2: - Update data types of variables passed as arguments to mwait_idle_with_hints() in the ACPI PAD (processor aggregator device) driver to match the function definition after recent changes (Uros Bizjak). - Fix a NULL pointer dereference in the ACPI CPPC library that occurs when nosmp is passed to the kernel in the command line (Yunhui Cui). - Ignore ECDT tables with an invalid ID string to prevent using an incorrect GPE for signaling events on some systems (Armin Wolf). - Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan). * acpi-pad: ACPI: PAD: Update arguments of mwait_idle_with_hints() * acpi-cppc: ACPI: CPPC: Fix NULL pointer dereference when nosmp is used * acpi-ec: ACPI: EC: Ignore ECDT tables with an invalid ID string * acpi-resource: ACPI: resource: Use IRQ override on MACHENIKE 16P
2025-06-13Merge branch 'pm-cpuidle'Rafael J. Wysocki
Merge cpuidle updates for 6.16-rc2: - Update data types of variables passed as arguments to mwait_idle_with_hints() to match the function definition after recent changes (Uros Bizjak). - Eliminate mwait_play_dead_cpuid_hint() again after reverting its elimination during the merge window due to a problem with handling "dead" SMT siblings, but this time prevent leaving them in C1 after initialization by taking them online and back offline when a proper cpuidle driver for the platform has been registered (Rafael Wysocki). * pm-cpuidle: intel_idle: Update arguments of mwait_idle_with_hints() Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" ACPI: processor: Rescan "dead" SMT siblings during initialization intel_idle: Rescan "dead" SMT siblings during initialization x86/smp: PM/hibernate: Split arch_resume_nosmt() intel_idle: Use subsys_initcall_sync() for initialization
2025-06-13Merge branch 'pm-tools'Rafael J. Wysocki
Merge a cpupower utility fix for 6.16-rc2 that unbreaks systemd service units installation on some sysems (Francesco Poli). * pm-tools: cpupower: split unitdir from libdir in Makefile
2025-06-13Merge tag 'spi-fix-v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "A collection of driver specific fixes, most minor apart from the OMAP ones which disable some recent performance optimisations in some non-standard cases where we could start driving the bus incorrectly. The change to the stm32-ospi driver to use the newer reset APIs is a fix for interactions with other IP sharing the same reset line in some SoCs" * tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine spi: stm32-ospi: clean up on error in probe() spi: stm32-ospi: Make usage of reset_control_acquire/release() API spi: offload: check offload ops existence before disabling the trigger spi: spi-pci1xxxx: Fix error code in probe spi: loongson: Fix build warnings about export.h spi: omap2-mcspi: Disable multi-mode when the previous message kept CS asserted spi: omap2-mcspi: Disable multi mode when CS should be kept asserted after message
2025-06-13Merge tag 'regulator-fix-v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fix from Mark Brown: "One minor fix for a leak in the DT parsing code in the max20086 driver" * tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: max20086: Fix refcount leak in max20086_parse_regulators_dt()
2025-06-13posix-cpu-timers: fix race between handle_posix_cpu_timers() and ↵Oleg Nesterov
posix_cpu_timer_del() If an exiting non-autoreaping task has already passed exit_notify() and calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent or debugger right after unlock_task_sighand(). If a concurrent posix_cpu_timer_del() runs at that moment, it won't be able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or lock_task_sighand() will fail. Add the tsk->exit_state check into run_posix_cpu_timers() to fix this. This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because exit_task_work() is called before exit_notify(). But the check still makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail anyway in this case. Cc: stable@vger.kernel.org Reported-by: Benoît Sevens <bsevens@google.com> Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-06-13Merge tag 'trace-v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: - Do not free "head" variable in filter_free_subsystem_filters() The first error path jumps to "free_now" label but first frees the newly allocated "head" variable. But the "free_now" code checks this variable, and if it is not NULL, it will iterate the list. As this list variable was already initialized, the "free_now" code will not do anything as it is empty. But freeing it will cause a UAF bug. The error path should simply jump to the "free_now" label and leave the "head" variable alone. * tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Do not free "head" on error path of filter_free_subsystem_filters()
2025-06-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Rework of system register accessors for system registers that are directly writen to memory, so that sanitisation of the in-memory value happens at the correct time (after the read, or before the write). For convenience, RMW-style accessors are also provided. - Multiple fixes for the so-called "arch-timer-edge-cases' selftest, which was always broken. x86: - Make KVM_PRE_FAULT_MEMORY stricter for TDX, allowing userspace to pass only the "untouched" addresses and flipping the shared/private bit in the implementation. - Disable SEV-SNP support on initialization failure * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORY KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY KVM: SEV: Disable SEV-SNP support on initialization failure KVM: arm64: selftests: Determine effective counter width in arch_timer_edge_cases KVM: arm64: selftests: Fix xVAL init in arch_timer_edge_cases KVM: arm64: selftests: Fix thread migration in arch_timer_edge_cases KVM: arm64: selftests: Fix help text for arch_timer_edge_cases KVM: arm64: Make __vcpu_sys_reg() a pure rvalue operand KVM: arm64: Don't use __vcpu_sys_reg() to get the address of a sysreg KVM: arm64: Add RMW specific sysreg accessor KVM: arm64: Add assignment-specific sysreg accessor
2025-06-13io_uring/kbuf: don't truncate end buffer for multiple buffer peeksJens Axboe
If peeking a bunch of buffers, normally io_ring_buffers_peek() will truncate the end buffer. This isn't optimal as presumably more data will be arriving later, and hence it's better to stop with the last full buffer rather than truncate the end buffer. Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13Merge tag 'v6.16-p4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "Fix a broken self-test in hkdf (new regression)" * tag 'v6.16-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: hkdf - move to late_initcall
2025-06-13Merge tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "As usual, highlighting the ones users have been noticing: - Fix a small issue with has_case_insensitive not being propagated on snapshot creation; this led to fsck errors, which we're harmless because we're not using this flag yet (it's for overlayfs + casefolding). - Log the error being corrected in the journal when we're doing fsck repair: this was one of the "lessons learned" from the i_nlink 0 -> subvolume deletion bug, where reconstructing what had happened by analyzing the journal was a bit more difficult than it needed to be. - Don't schedule btree node scan to run in the superblock: this fixes a regression from the 6.16 recovery passes rework, and let to it running unnecessarily. The real issue here is that we don't have online, "self healing" style topology repair yet: topology repair currently has to run before we go RW, which means that we may schedule it unnecessarily after a transient error. This will be fixed in the future. - We now track, in btree node flags, the reason it was scheduled to be rewritten. We discovered a deadlock in recovery when many btree nodes need to be rewritten because they're degraded: fully fixing this will take some work but it's now easier to see what's going on. For the bug report where this came up, a device had been kicked RO due to transient errors: manually setting it back to RW was sufficient to allow recovery to succeed. - Mark a few more fsck errors as autofix: as a reminder to users, please do keep reporting cases where something needs to be repaired and is not repaired automatically (i.e. cases where -o fix_errors or fsck -y is required). - rcu_pending.c now works with PREEMPT_RT - 'bcachefs device add', then umount, then remount wasn't working - we now emit a uevent so that the new device's new superblock is correctly picked up - Assorted repair fixes: btree node scan will no longer incorrectly update sb->version_min, - Assorted syzbot fixes" * tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs: (23 commits) bcachefs: Don't trace should_be_locked unless changing bcachefs: Ensure that snapshot creation propagates has_case_insensitive bcachefs: Print devices we're mounting on multi device filesystems bcachefs: Don't trust sb->nr_devices in members_to_text() bcachefs: Fix version checks in validate_bset() bcachefs: ioctl: avoid stack overflow warning bcachefs: Don't pass trans to fsck_err() in gc_accounting_done bcachefs: Fix leak in bch2_fs_recovery() error path bcachefs: Fix rcu_pending for PREEMPT_RT bcachefs: Fix downgrade_table_extra() bcachefs: Don't put rhashtable on stack bcachefs: Make sure opts.read_only gets propagated back to VFS bcachefs: Fix possible console lock involved deadlock bcachefs: mark more errors autofix bcachefs: Don't persistently run scan_for_btree_nodes bcachefs: Read error message now prints if self healing bcachefs: Only run 'increase_depth' for keys from btree node csan bcachefs: Mark need_discard_freespace_key_bad autofix bcachefs: Update /dev/disk/by-uuid on device add bcachefs: Add more flags to btree nodes for rewrite reason ...
2025-06-13Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublistsBagas Sanjaya
Stephen Rothwell reports htmldocs warning on ublk docs: Documentation/block/ublk.rst:414: ERROR: Unexpected indentation. [docutils] Fix the warning by separating sublists of auto buffer registration fallback behavior from their appropriate parent list item. Fixes: ff20c516485e ("ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-next/20250612132638.193de386@canb.auug.org.au/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20250613023857.15971-1-bagasdotme@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13iommu/tegra: Fix incorrect size calculationJason Gunthorpe
This driver uses a mixture of ways to get the size of a PTE, tegra_smmu_set_pde() did it as sizeof(*pd) which became wrong when pd switched to a struct tegra_pd. Switch pd back to a u32* in tegra_smmu_set_pde() so the sizeof(*pd) returns 4. Fixes: 50568f87d1e2 ("iommu/terga: Do not use struct page as the handle for as->pd memory") Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt> Closes: https://lore.kernel.org/all/62e7f7fe-6200-4e4f-ad42-d58ad272baa6@tecnico.ulisboa.pt/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Thierry Reding <treding@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Tested-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt> Link: https://lore.kernel.org/r/0-v1-da7b8b3d57eb+ce-iommu_terga_sizeof_jgg@nvidia.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-06-13block: Fix bvec_set_folio() for very large foliosMatthew Wilcox (Oracle)
Similarly to 26064d3e2b4d ("block: fix adding folio to bio"), if we attempt to add a folio that is larger than 4GB, we'll silently truncate the offset and len. Widen the parameters to size_t, assert that the length is less than 4GB and set the first page that contains the interesting data rather than the first page of the folio. Fixes: 26db5ee15851 (block: add a bvec_set_folio helper) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250612144255.2850278-1-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAPMatthew Wilcox (Oracle)
It is possible for physically contiguous folios to have discontiguous struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not. This is correctly handled by folio_page_idx(), so remove this open-coded implementation. Fixes: 640d1930bef4 (block: Add bio_for_each_folio_all()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250612144126.2849931-1-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engineThangaraj Samynathan
Removes MSI-X from the interrupt request path, as the DMA engine used by the SPI controller does not support MSI-X interrupts. Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com> Link: https://patch.msgid.link/20250612023059.71726-1-thangaraj.s@microchip.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-06-13Merge tag 'drm-misc-fixes-2025-06-12' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes drm-misc-fixes for v6.16-rc2: - Fix infinite EPROBE_DEFER loop in vc4 probing. - Fix amdxdna firmware size. - mode fixes for meson. - Kconfig fix for st7171-i2c. - Fix -EBUSY WARN_ON_ONCE in dma-buf - Use dma_sync_sgtable_for_cpu in udmabuf. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://lore.kernel.org/r/62c06195-8bc1-4dae-8777-e86d94e4d9d9@linux.intel.com
2025-06-12mm: add mmap_prepare() compatibility layer for nested file systemsLorenzo Stoakes
Nested file systems, that is those which invoke call_mmap() within their own f_op->mmap() handlers, may encounter underlying file systems which provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). We have a chicken-and-egg scenario here - until all file systems are converted to using .mmap_prepare(), we cannot convert these nested handlers, as we can't call f_op->mmap from an .mmap_prepare() hook. So we have to do it the other way round - invoke the .mmap_prepare() hook from an .mmap() one. in order to do so, we need to convert VMA state into a struct vm_area_desc descriptor, invoking the underlying file system's f_op->mmap_prepare() callback passing a pointer to this, and then setting VMA state accordingly and safely. This patch achieves this via the compat_vma_mmap_prepare() function, which we invoke from call_mmap() if f_op->mmap_prepare() is specified in the passed in file pointer. We place the fundamental logic into mm/vma.h where VMA manipulation belongs. We also update the VMA userland tests to accommodate the changes. The compat_vma_mmap_prepare() function and its associated machinery is temporary, and will be removed once the conversion of file systems is complete. We carefully place this code so it can be used with CONFIG_MMU and also with cutting edge nommu silicon. [akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build] [lorenzo.stoakes@oracle.com: remove unused declarations] Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-13Merge tag 'drm-xe-fixes-2025-06-12' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Fix regression disallowing 64K SVM migration (Maarten) - Use a bounce buffer for WA BB (Lucas) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/aEsBQoh5Si3ouPgE@fedora
2025-06-12Merge tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linuxLinus Torvalds
Pull bitmap fix from Yury Norov: "Fix for __GENMASK() and __GENMASK_ULL() in UAPI" * tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linux: uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again