path: root/mm
Age  Commit message  Author
2025-11-24  mm/zone_device: support large zone device private folios  Balbir Singh
Patch series "mm: support device-private THP", v7.

This patch series introduces support for Transparent Huge Page (THP) migration in zone device-private memory. The implementation enables efficient migration of large folios between system memory and device-private memory.

Background

Current zone device-private memory implementation only supports PAGE_SIZE granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and device memory

This series extends the existing zone device-private infrastructure to support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed

In my local testing (using lib/test_hmm) and a throughput test, the series shows a 350% improvement in data transfer throughput and an 80% improvement in latency.

These patches build on the earlier posts by Ralph Campbell [1].

Two new flags are added in vma_migration to select and mark compound pages. migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize() support migration of these pages when MIGRATE_VMA_SELECT_COMPOUND is passed in as arguments.

The series also adds zone device awareness to (m)THP pages along with fault handling of large zone device private pages. The page vma walk and the rmap code are also zone device aware. Support has also been added for folios that might need to be split in the middle of migration (when the src and dst do not agree on MIGRATE_PFN_COMPOUND); that occurs when the src side of the migration can migrate large pages, but the destination has not been able to allocate large pages. The code supported and used folio_split() when migrating THP pages; this is used when MIGRATE_VMA_SELECT_COMPOUND is not passed as an argument to migrate_vma_setup().

The test infrastructure lib/test_hmm.c has been enhanced to support THP migration. A new ioctl to emulate failure of large page allocations has been added to test the folio split code path. hmm-tests.c has new test cases for huge page migration and to test the folio split path. A new throughput test has been added as well.

The nouveau dmem code has been enhanced to use the new THP migration capability.

mTHP support:

The patches hard-code HPAGE_PMD_NR in a few places, but the code has been kept generic to support various order sizes. With additional refactoring of the code, support of different order sizes should be possible.

The future plan is to post enhancements to support mTHP with a rough design as follows:
1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize

The mTHP patches can be built on top of this series; the key design elements that need to be worked out are infrastructure and driver support for multiple ordered pages and their migration.

HMM support for large folios was added in 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry").

This patch (of 16):

Add routines to support allocation of large order zone device folios and helper functions for zone device folios, to check if a folio is device private and helpers for setting zone device data.

When large folios are used, the existing page_free() callback in pgmap is called when the folio is freed; this is true for both PAGE_SIZE and higher order pages.

Zone device private large folios do not support deferred split and scan like normal THP folios.
Link: https://lkml.kernel.org/r/20251001065707.920170-1-balbirs@nvidia.com Link: https://lkml.kernel.org/r/20251001065707.920170-2-balbirs@nvidia.com Link: https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/ [1] Signed-off-by: Balbir Singh <balbirs@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Lyude Paul <lyude@redhat.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Francois Dugast <francois.dugast@intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
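The mTHP plan in the cover letter (steps 2 to 4) amounts to picking the largest order that both the driver and the address range allow, with base pages as the fallback. The following is a minimal userspace sketch of that selection, not code from the series: `pick_migrate_order()`, `order_mask_t` and the bitmask representation of supported orders are all hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical: bit N set means the driver can migrate folios of order N. */
typedef unsigned long order_mask_t;

/*
 * Return the highest order that (a) the driver supports, (b) start_pfn is
 * naturally aligned to, and (c) fits in the nr_pages that remain.
 * Falls back to order 0 (a base page) when no larger order qualifies.
 */
static unsigned int pick_migrate_order(order_mask_t supported,
                                       unsigned long start_pfn,
                                       unsigned long nr_pages,
                                       unsigned int max_order)
{
    assert(nr_pages > 0);

    for (unsigned int order = max_order; order > 0; order--) {
        unsigned long span = 1UL << order;

        if (!(supported & (1UL << order)))
            continue;               /* driver cannot handle this order */
        if (start_pfn & (span - 1))
            continue;               /* range is not aligned for this order */
        if (span > nr_pages)
            continue;               /* folio would overrun the range */
        return order;
    }
    return 0;
}
```

With orders 0, 4 and 9 supported, an aligned 512-page range selects order 9, while the same start with only 100 pages left drops to order 4, which mirrors the "seamless fallback" the series describes for the PMD-sized case.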
2025-11-24  Merge branch 'mm-hotfixes-stable' into mm-stable in order to merge  Andrew Morton
"mm/huge_memory: only get folio_order() once during __folio_split()" into mm-stable.
2025-11-24  mm/filemap: fix logic around SIGBUS in filemap_map_pages()  Kiryl Shutsemau
Chris noticed that filemap_map_pages() calculates can_map_large only once for the first page in the fault around range. The value is not valid for the following pages in the range and must be recalculated. Instead of recalculating can_map_large on each iteration, pass down file_end to filemap_map_folio_range() and let it make the decision on what can be mapped. Link: https://lkml.kernel.org/r/20251120161411.859078-1-kirill@shutemov.name Fixes: 74207de2ba10 ("mm/memory: do not populate page table entries beyond i_size") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Mason <clm@meta.com> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
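The point of the fix is that the "may this be mapped in full?" answer depends on each folio's position relative to the end of the file, so a value computed for the first folio cannot be reused for later ones. The sketch below is a simplified userspace model of a per-folio decision driven by file_end. `struct folio_range`, `can_map_whole_folio()` and `mappable_pages()` are hypothetical names, and the real kernel check has additional conditions that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: a run of pages at a file offset. */
struct folio_range {
    unsigned long index;    /* first page index covered by the folio */
    unsigned long nr_pages; /* number of pages in the folio */
};

/* True if the whole folio lies below file_end (first page index past EOF). */
static bool can_map_whole_folio(const struct folio_range *f,
                                unsigned long file_end)
{
    return f->index + f->nr_pages <= file_end;
}

/* Number of pages of this folio that may be mapped, decided per folio. */
static unsigned long mappable_pages(const struct folio_range *f,
                                    unsigned long file_end)
{
    if (can_map_whole_folio(f, file_end))
        return f->nr_pages;
    if (f->index >= file_end)
        return 0;                    /* folio is entirely past EOF */
    return file_end - f->index;      /* only the part below EOF */
}
```

For a file ending at page index 20, a 16-page folio at index 0 is mappable in full, but the next folio at index 16 only contributes 4 pages; carrying the first answer over to the second folio is the kind of mistake the commit message describes.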
2025-11-24  mm/huge_memory: fix NULL pointer deference when splitting folio  Wei Yang
Commit c010d47f107f ("mm: thp: split huge page to any lower order pages") introduced an early check on the folio's order via mapping->flags before proceeding with the split work. This check introduced a bug: for shmem folios in the swap cache and truncated folios, the mapping pointer can be NULL. Accessing mapping->flags in this state leads directly to a NULL pointer dereference. This commit fixes the issue by moving the check for mapping != NULL before any attempt to access mapping->flags. Link: https://lkml.kernel.org/r/20251119235302.24773-1-richard.weiyang@gmail.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
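The fix is an ordering change: test the pointer first, then read through it. A minimal userspace sketch of that ordering follows. `struct mapping_model`, `MODEL_LARGE_FOLIO_SUPPORT` and `split_precheck()` are hypothetical, and the `-1` return is a placeholder rather than the kernel's actual error code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space. */
struct mapping_model {
    unsigned long flags;
};

/* Hypothetical flag bit: the mapping can hold folios larger than order 0. */
#define MODEL_LARGE_FOLIO_SUPPORT (1UL << 0)

/*
 * Return 0 if a split to new_order may proceed, -1 if it must be refused.
 * The NULL test comes before any read of mapping->flags, so a folio whose
 * mapping has gone away (for example after truncation) is rejected instead
 * of being dereferenced.
 */
static int split_precheck(const struct mapping_model *mapping,
                          unsigned int new_order)
{
    if (!mapping)
        return -1;
    if (new_order > 0 && !(mapping->flags & MODEL_LARGE_FOLIO_SUPPORT))
        return -1;
    return 0;
}
```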
2025-11-24  mm/mmap_lock: reset maple state on lock_vma_under_rcu() retry  Liam R. Howlett
The retry in lock_vma_under_rcu() drops the rcu read lock before reacquiring the lock and trying again. This may cause a use-after-free if the maple node the maple state was using was freed. The maple state is protected by the rcu read lock. When the lock is dropped, the state cannot be reused as it tracks pointers to objects that may be freed during the time where the lock was not held. Any time the rcu read lock is dropped, the maple state must be invalidated. Resetting the address and state to MA_START is the safest course of action, which will result in the next operation starting from the top of the tree. Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure"), vma_start_read() would drop rcu read lock and return NULL, so the retry would not have happened. However, now that vma_start_read() drops rcu read lock on failure followed by a retry, we may end up using a freed maple tree node cached in the maple state. [surenb@google.com: changelog alteration] Link: https://lkml.kernel.org/r/CAJuCfpEWMD-Z1j=nPYHcQW4F7E2Wka09KTXzGv7VE7oW1S8hcw@mail.gmail.com Link: https://lkml.kernel.org/r/20251111215605.1721380-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure") Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
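The rule stated above (any cached walk state must be discarded once the read lock has been dropped) can be modelled in userspace without a real maple tree. In this sketch every name is hypothetical: `struct walk_state` stands in for the maple state, `WALK_START` for MA_START, and `versions[i]` for the tree contents that are live during attempt `i`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor: caches the node reached by the previous walk. */
struct walk_state {
    unsigned long addr;  /* address being looked up */
    const int *node;     /* cached node, valid only while the lock is held */
};

#define WALK_START NULL  /* models MA_START: the next walk begins at the root */

static void walk_state_reset(struct walk_state *ws, unsigned long addr)
{
    ws->addr = addr;
    ws->node = WALK_START;
}

/* Reuse the cached node if there is one, otherwise start from the root. */
static const int *walk(struct walk_state *ws, const int *root)
{
    if (ws->node == WALK_START)
        ws->node = root;
    return ws->node;
}

/*
 * Look up addr, retrying after `failures` failed lock attempts.
 * versions[i] is the node that is live during attempt i; older entries
 * model nodes that were freed while the read lock was not held.
 */
static int lookup_retry(const int *versions, int failures, unsigned long addr)
{
    struct walk_state ws;
    int attempt = 0;

    walk_state_reset(&ws, addr);
    for (;;) {
        const int *node = walk(&ws, &versions[attempt]);

        if (attempt == failures)
            return *node;           /* lock taken, node is current */

        /* Lock attempt failed and the read lock was dropped: forget the
         * cached node before retrying, it may have been freed. */
        walk_state_reset(&ws, addr);
        attempt++;
    }
}
```

Without the `walk_state_reset()` call in the retry path, `walk()` would keep returning the node cached during the first attempt, which in the kernel case is the use-after-free the commit fixes.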
2025-11-24  mm/memfd: fix information leak in hugetlb folios  Deepanshu Kartikey
When allocating hugetlb folios for memfd, three initialization steps are missing:
1. Folios are not zeroed, leading to kernel memory disclosure to userspace
2. Folios are not marked uptodate before adding to page cache
3. hugetlb_fault_mutex is not taken before hugetlb_add_to_page_cache()

The memfd allocation path bypasses the normal page fault handler (hugetlb_no_page) which would handle all of these initialization steps. This is problematic especially for udmabuf use cases where folios are pinned and directly accessed by userspace via DMA.

Fix by matching the initialization pattern used in hugetlb_no_page():
- Zero the folio using folio_zero_user() which is optimized for huge pages
- Mark it uptodate with folio_mark_uptodate()
- Take hugetlb_fault_mutex before adding to page cache to prevent races

The folio_zero_user() change also fixes a potential security issue where uninitialized kernel memory could be disclosed to userspace through read() or mmap() operations on the memfd.

Link: https://lkml.kernel.org/r/20251112145034.2320452-1-kartikey406@gmail.com Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reported-by: syzbot+f64019ba229e3a5c411b@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/20251112031631.2315651-1-kartikey406@gmail.com/ [v1] Closes: https://syzkaller.appspot.com/bug?extid=f64019ba229e3a5c411b Suggested-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Tested-by: syzbot+f64019ba229e3a5c411b@syzkaller.appspotmail.com Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> (v2) Cc: Christoph Hellwig <hch@lst.de> (v6) Cc: Dave Airlie <airlied@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-24  mm: swap: remove duplicate nr_swap_pages decrement in get_swap_page_of_type()  Youngjun Park
After commit 4f78252da887, nr_swap_pages is decremented in swap_range_alloc(). Since cluster_alloc_swap_entry() calls swap_range_alloc() internally, the decrement in get_swap_page_of_type() causes double-decrementing. As a representative userspace-visible runtime example of the impact, /proc/meminfo reports increasingly inaccurate SwapFree values. The discrepancy grows with each swap allocation, and during hibernation when large amounts of memory are written to swap, the reported value can deviate significantly from actual available swap space, misleading users and monitoring tools. Remove the duplicate decrement. Link: https://lkml.kernel.org/r/20251102082456.79807-1-youngjun.park@lge.com Fixes: 4f78252da887 ("mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()") Signed-off-by: Youngjun Park <youngjun.park@lge.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
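The accounting error can be reproduced with a toy counter. The sketch below is a userspace model only: the `_model` functions are hypothetical stand-ins named after the kernel functions mentioned above, and each allocation is assumed to take exactly one entry.

```c
#include <assert.h>

/* Model of the global free-swap counter (nr_swap_pages in the kernel). */
static long nr_swap_pages_model;

/* Models swap_range_alloc(): the single place that accounts an allocation. */
static void range_alloc_model(long nr_entries)
{
    assert(nr_entries > 0);
    nr_swap_pages_model -= nr_entries;
}

/* Models cluster_alloc_swap_entry(): allocates one entry via the above. */
static void cluster_alloc_model(void)
{
    range_alloc_model(1);
}

/* Before the fix: the caller decrements again, so one entry costs two. */
static long alloc_one_buggy(long free_before)
{
    nr_swap_pages_model = free_before;
    cluster_alloc_model();
    nr_swap_pages_model--;          /* duplicate decrement */
    return nr_swap_pages_model;
}

/* After the fix: accounting happens only inside range_alloc_model(). */
static long alloc_one_fixed(long free_before)
{
    nr_swap_pages_model = free_before;
    cluster_alloc_model();
    return nr_swap_pages_model;
}
```

Starting from 100 free entries, the buggy path reports 98 after a single allocation and the fixed path reports 99. Repeated over many allocations, that per-call error is the growing SwapFree discrepancy the commit message describes.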
2025-11-23  mempool: clarify behavior of mempool_alloc_preallocated()  Thomas Weißschuh
The documentation of that function promises to never sleep. However on PREEMPT_RT a spinlock_t might in fact sleep. Reword the documentation so users can predict its behavior better. mempool could also replace spinlock_t with raw_spinlock_t which doesn't sleep even on PREEMPT_RT but that would take away the improved preemptibility of sleeping locks. Link: https://lkml.kernel.org/r/20251014-mempool-doc-v1-1-bc9ebf169700@linutronix.de Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-11-23  mempool: drop the file name in the top of file comment  Christoph Hellwig
Mentioning the name of the file is redundant, so drop it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113084022.1255121-12-hch@lst.de Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-11-23  mempool: de-typedef  Christoph Hellwig
Switch all uses of the deprecated mempool_t typedef in the core mempool code to use struct mempool instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113084022.1255121-11-hch@lst.de Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-11-23  mempool: remove mempool_{init,create}_kvmalloc_pool  Christoph Hellwig
This was added for bcachefs and is unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113084022.1255121-10-hch@lst.de Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-11-23  mempool: legitimize the io_schedule_timeout in mempool_alloc_from_pool  Christoph Hellwig
The timeout here is an old workaround with a Fixme comment. But thinking about it, it makes sense to keep it, so reword the comment. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113084022.1255121-9-hch@lst.de Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-11-23  mempool: add mempool_{alloc,free}_bulk  Christoph Hellwig
Add a version of the mempool allocator that works for batch allocations of multiple objects. Calling mempool_alloc in a loop is not safe because it could deadlock if multiple threads are performing such an allocation at the same time. As an extra benefit the interface is built so that the same array can be used for alloc_pages_bulk / release_pages so that at least for page backed mempools the fast path can use a nice batch optimization. Note that mempool_alloc_bulk does not take a gfp_mask argument as it must always be able to sleep and doesn't support any non-trivial modifiers. NOFS or NOIO constraints must be set through the scoped API. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113084022.1255121-8-hch@lst.de Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
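The deadlock mentioned above arises when several callers each hold part of the reserve while waiting for the rest. The sketch below models that in userspace with two callers and no real threads. It assumes an all-or-nothing reservation for the bulk case, which is one way to avoid the problem and is not taken from the patch; `struct pool_model` and every function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a pool's reserve: only the free count matters. */
struct pool_model {
    unsigned int free;
};

/* Take a single element, as a loop around a one-object allocator would. */
static bool take_one(struct pool_model *p)
{
    if (p->free == 0)
        return false;
    p->free--;
    return true;
}

/* Take count elements or none at all. */
static bool take_bulk(struct pool_model *p, unsigned int count)
{
    if (p->free < count)
        return false;
    p->free -= count;
    return true;
}

/*
 * Two callers each need `need` elements and take them one at a time in
 * lock-step, never giving back a partial batch. Returns how many callers
 * obtained everything they need; 0 means both are stuck holding part of
 * the reserve.
 */
static unsigned int finishers_one_by_one(unsigned int pool_size,
                                         unsigned int need)
{
    struct pool_model p = { .free = pool_size };
    unsigned int held[2] = { 0, 0 };
    bool progress = true;

    while (progress) {
        progress = false;
        for (int c = 0; c < 2; c++) {
            if (held[c] < need && take_one(&p)) {
                held[c]++;
                progress = true;
            }
        }
    }
    return (held[0] == need) + (held[1] == need);
}

/*
 * Same two callers using all-or-nothing batches. A caller that cannot get
 * its batch holds nothing while it waits, so the other one can finish and
 * return its elements to the pool.
 */
static unsigned int finishers_bulk(unsigned int pool_size, unsigned int need)
{
    struct pool_model p = { .free = pool_size };
    unsigned int done = 0;

    for (int c = 0; c < 2; c++) {
        if (take_bulk(&p, need)) {
            done++;
            p.free += need;  /* caller finishes and frees its batch */
        }
    }
    return done;
}
```

With a reserve of 2 and a need of 2 per caller, the one-by-one variant leaves each caller holding one element and neither can finish, while the batched variant lets both complete in turn.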
2025-11-23  mempool: factor out a mempool_alloc_from_pool helper  Christoph Hellwig
Add a helper for the mempool_alloc slowpath to better separate it from the fast path, and also use it to implement mempool_alloc_preallocated which shares the same logic. [hughd@google.com: fix lack of retrying with __GFP_DIRECT_RECLAIM] [vbabka@suse.cz: really use limited flags for first mempool attempt] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113084022.1255121-7-hch@lst.de Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-11-20  mm/damon/tests/core-kunit: remove DAMON_MIN_REGION redefinition  SeongJae Park
A few DAMON core functions including damon_set_regions() were hard-coded to use DAMON_MIN_REGION as their regions management granularity. For simple and human-readable unit tests' expectations, DAMON core layer kunit test re-defines DAMON_MIN_REGION to '1'. A previous patch series [1] has removed the hard-coded part but kept the redefinition and updated related function calls to explicitly use DAMON_MIN_REGION. Remove the unnecessary redefinition and update relevant function calls to pass literals (number '1') instead of the DAMON_MIN_REGION. Link: https://lkml.kernel.org/r/20251112154114.66053-7-sj@kernel.org Link: https://lore.kernel.org/20250828171242.59810-1-sj@kernel.org [1] Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/vaddr: consistently use only pmd_entry for damos_migrate  SeongJae Park
For page table walks, it is usual [1] to have only one pmd entry function. The vaddr.c code for DAMOS_MIGRATE_{HOT,COLD} is not following the pattern. Instead, it uses both pmd and pte entry functions without a special reason. Refactor it to use only the pmd entry function, to make the code under mm/ more consistent. Link: https://lkml.kernel.org/r/20251112154114.66053-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Suggested-by: David Hildenbrand <david@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/vaddr: use vm_normal_folio{,_pmd}() instead of damon_get_folio()  SeongJae Park
A few page table walk entry callback functions in vaddr.c use damon_get_folio() with p{te,md}_pfn() to get the folio, and then put_folio(). Simplify and drop unnecessary folio get/put by using vm_normal_folio() and its friends instead. Note that this cleanup was suggested by David Hildenbrand during a review of another patch series [1] and the patch was updated following the suggestion. This patch further applies the cleanup to DAMON code that was merged before the patch. Link: https://lkml.kernel.org/r/20251112154114.66053-5-sj@kernel.org Link: https://lore.kernel.org/0cb3d5a5-683b-4dba-90a8-b45ab83eec53@redhat.com [1] Signed-off-by: SeongJae Park <sj@kernel.org> Suggested-by: David Hildenbrand <david@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/vaddr: cleanup using pmd_trans_huge_lock()  SeongJae Park
Three pmd walk functions in vaddr.c are using pmd_trans_huge() and pmd_lock() to handle THPs. Simplify the code by replacing the two function calls with a single pmd_trans_huge_lock() call. Note that this cleanup is not only reducing the lines of code, but also simplifies code execution flows for migration entries case, as kindly explained [1] by Hugh, who suggested this cleanup. [sj@kernel.org: provide lvalue to pmd_present()] Link: https://lkml.kernel.org/r/20251117154415.11041-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251112154114.66053-4-sj@kernel.org Link: https://lore.kernel.org/296c2b3f-6748-158f-b85d-2952165c0588@google.com [1] Signed-off-by: SeongJae Park <sj@kernel.org> Suggested-by: Hugh Dickins <hughd@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon: rename damos->filters to damos->core_filters  SeongJae Park
DAMOS filters that are handled by the ops layer are linked to damos->ops_filters. Owing to the ops_ prefix on the name, it is easy to understand it is for ops layer handled filters. The other types of filters, which are handled by the core layer, are linked to damos->filters. Because of the name, it is easy to confuse the list is there for not only core layer handled ones but all filters. Avoid such confusions by renaming the field to core_filters. Link: https://lkml.kernel.org/r/20251112154114.66053-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon: rename damos core filter helpers to have word core  SeongJae Park
Patch series "mm/damon: misc cleanups". Yet another batch of misc cleanups and refactoring for DAMON code, tests, and documents. First two patches (1 and 2) rename DAMOS core filters related code for readability. Three following patches (3-5) refactor page table walk callback functions in DAMON, as suggested by Hugh and David, and I promised. Next two patches (6 and 7) refactor DAMON core layer kunit test and sysfs interface selftest to be simple and deduplicated. Final two patches (8 and 9) fix up sphinx and grammatical errors on documents. This patch (of 9): DAMOS filters handled by the core layer are called core filters, while those handled by the ops layer are called ops filters. They share the same type but are managed in different places since core filters are evaluated before the ops filters. They also have different helper functions that depend on their managed places. The helper functions for ops filters have the '_ops_' keyword in their names, so it is easy to know they are for ops filters. Meanwhile, the helper functions for core filters do not have the 'core' keyword in their names. This makes it easy to mistakenly use them for ops filters. Actually there was such a bug. To avoid future mistakes from similar confusions, rename DAMOS core filters helper functions to have a keyword 'core' in their names.
Link: https://lkml.kernel.org/r/20251112154114.66053-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251112154114.66053-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Justin Stitt <justinstitt@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: add damon_commit_target_regions() test  SeongJae Park
Add a new test for damon_commit_target_regions(). Link: https://lkml.kernel.org/r/20251111184415.141757-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: add damos_commit() test  SeongJae Park
Add a new unit test for damos_commit(). Link: https://lkml.kernel.org/r/20251111184415.141757-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: add damos_commit_dests() test  SeongJae Park
Add a new unit test for damos_commit_dests(). Link: https://lkml.kernel.org/r/20251111184415.141757-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/core: pass migrate_dests to damos_commit_dests()  SeongJae Park
damos_commit_dests() receives 'struct damos' pointers, while it uses only their ->migrate_dests fields. This makes code unnecessarily difficult to read. It also makes unit tests writing complicated. Refactor the function to receive pointers to the ->migrate_dests fields. Link: https://lkml.kernel.org/r/20251111184415.141757-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: add damos_commit_quota() test  SeongJae Park
Add a new unit test for damos_commit_quota(). Link: https://lkml.kernel.org/r/20251111184415.141757-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: add damos_commit_quota_goals() test  SeongJae Park
Add a new unit test for damos_commit_quota_goals(). Link: https://lkml.kernel.org/r/20251111184415.141757-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: add damos_commit_quota_goal() test  SeongJae Park
Add a new unit test for damos_commit_quota_goal(). Link: https://lkml.kernel.org/r/20251111184415.141757-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: add test cases to damos_test_commit_filter()  SeongJae Park
damos_test_commit_filter() is covering only a single test case. Extend it to cover multiple combinations of inputs. Link: https://lkml.kernel.org/r/20251111184415.141757-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: extend damos_test_commit_filter_for() for union fields  SeongJae Park
damos_commit_filter() also updates union fields of 'struct damos_filter'. Extend damos_test_commit_filter_for() to cover the expectations of the union fields. Link: https://lkml.kernel.org/r/20251111184415.141757-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: split out damos_test_commit_filter() core logic  SeongJae Park
damos_test_commit_filter() is written for only a single test case. Split out the core logic of damos_test_commit_filter() as a general one so that it can be reused for multiple test cases. Link: https://lkml.kernel.org/r/20251111184415.141757-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20  mm/damon/tests/core-kunit: remove dynamic allocs on damos_test_commit_filter()  SeongJae Park
Patch series "mm/damon/tests: add more tests for online parameters commit". A DAMON feature called parameters "commit" allows DAMON API callers and ABI users to update nearly every DAMON parameter while DAMON is running. This is being used for flexible DAMON use cases such as taking a snapshot of the monitoring results with minimum overhead, or adjusting access-aware system operations (DAMOS) for user-space driven auto-tuning or investigations. Compared to the usefulness of the feature and size of the implementation, the test coverage is pretty small. Only the filter commit part has a single test case, namely damos_test_commit_filter(). Actually, we found and fixed a few bugs of the feature in the past. The single existing test was also added to avoid reintroduction of a found bug. Add more unit tests for the feature. First four patches (1-4) refactor and extend the existing test for DAMOS filter commit for multiple test cases. Next three patches (5-7) add tests for DAMOS quota commit. Next two patches (8 and 9) refactor damos_commit_dests() for ease of code reading and test writing, and implement a new unit test of the function that is being refactored in a test-friendly way. Final two patches (10 and 11) further add new unit tests for damos_commit() and damon_commit_target_regions(). This patch (of 11): damos_test_commit_filter() is dynamically allocating test-purpose DAMOS filters. Allocation failure checks are making the code longer, complicated, and difficult to extend for more test cases. Refactor the code to remove the dynamic allocation. Link: https://lkml.kernel.org/r/20251111184415.141757-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251111184415.141757-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: use vma_start_write_killable() in dup_mmap()Matthew Wilcox (Oracle)
Allow waiting for the VMA write lock to be interrupted by fatal signals. The explicit check for fatal_signal_pending() can be removed as it is checked during vma_start_write_killable(). Improves the latency of killing the task as we do not wait for the reader to finish before checking for signals. Link: https://lkml.kernel.org/r/20251110203204.1454057-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Chris Li <chriscli@google.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: add vma_start_write_killable()Matthew Wilcox (Oracle)
Patch series "vma_start_write_killable", v2. When we added the VMA lock, we made a major oversight in not adding a killable variant. That can run us into trouble where a thread takes the VMA lock for read (eg handling a page fault) and then goes out to lunch for an hour (eg doing reclaim). Another thread tries to modify the VMA, taking the mmap_lock for write, then attempts to lock the VMA for write. That blocks on the first thread, and ensures that every other page fault now tries to take the mmap_lock for read. Because everything's in an uninterruptible sleep, we can't kill the task, which makes me angry. This patchset just adds vma_start_write_killable() and converts one caller to use it. Most users are somewhat tricky to convert, so expect follow-up individual patches per call-site which need careful analysis to make sure we've done proper cleanup. This patch (of 2): The vma can be held read-locked for a substantial period of time, eg if memory allocation needs to go into reclaim. It's useful to be able to send fatal signals to threads which are waiting for the write lock. Link: https://lkml.kernel.org/r/20251110203204.1454057-1-willy@infradead.org Link: https://lkml.kernel.org/r/20251110203204.1454057-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Chris Li <chriscli@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
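The calling pattern this series describes can be sketched in user space. This is only a model under stated assumptions: `struct vma_sketch`, `start_write_killable_sketch()` and `modify_vma_sketch()` are hypothetical names, the "wait" is simulated rather than performed, and the 0 / -EINTR return convention is assumed rather than taken from the commit text.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA; only the lock state is modelled. */
struct vma_sketch {
	bool read_locked;   /* some other thread holds the VMA read lock */
	bool write_locked;  /* we hold the VMA write lock */
};

/*
 * Model of a killable write-lock attempt (assumed convention: 0 on
 * success, -EINTR if a fatal signal arrives while waiting for a reader).
 */
static int start_write_killable_sketch(struct vma_sketch *vma,
				       bool fatal_signal_pending)
{
	if (vma->read_locked) {
		/* The real code would sleep here in a killable state. */
		if (fatal_signal_pending)
			return -EINTR;
		/* Simulate the reader eventually dropping its lock. */
		vma->read_locked = false;
	}
	vma->write_locked = true;
	return 0;
}

/* A caller must be prepared to unwind when the lock is not obtained. */
static int modify_vma_sketch(struct vma_sketch *vma, bool fatal_signal_pending)
{
	int err = start_write_killable_sketch(vma, fatal_signal_pending);

	if (err)
		return err;  /* nothing was changed, so nothing to undo */

	/* ... the modification would go here, under the write lock ... */
	return 0;
}
```

The point of the sketch is the error path: every converted call site has to cope with the lock attempt failing, which is why the series converts only one caller and leaves the rest for per-call-site analysis.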
2025-11-20mm: vmstat: correct the comment above preempt_disable_nested()Qi Zheng
The comment explaining why these parts use preempt_disable_nested() is in __mod_zone_page_state(), not in __mod_node_page_state(), so we should see __mod_zone_page_state(). Just correct it. Link: https://lkml.kernel.org/r/20251110084437.46701-1-qi.zheng@linux.dev Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Harry Yoo <harry.yoo@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm/madvise: allow guard page install/remove under VMA lockLorenzo Stoakes
We only need to keep the page table stable so we can perform this operation under the VMA lock. PTE installation is stabilised via the PTE lock. One caveat is that, if we prepare vma->anon_vma we must hold the mmap read lock. We can account for this by adapting the VMA locking logic to explicitly check for this case and prevent a VMA lock from being acquired should it be the case. This check is safe, as while we might be raced on anon_vma installation, this would simply make the check conservative, there's no way for us to see an anon_vma and then for it to be cleared, as doing so requires the mmap/VMA write lock. We abstract the VMA lock validity logic to is_vma_lock_sufficient() for this purpose, and add prepares_anon_vma() to abstract the anon_vma logic. In order to do this we need to have a way of installing page tables explicitly for an identified VMA, so we export walk_page_range_vma() in an unsafe variant - walk_page_range_vma_unsafe() and use this should the VMA read lock be taken. We additionally update the comments in madvise_guard_install() to more accurately reflect the cases in which the logic may be reattempted, specifically THP huge pages being present. Link: https://lkml.kernel.org/r/cca1edbd99cd1386ad20556d08ebdb356c45ef91.1762795245.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: SeongJae Park <sj@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: rename walk_page_range_mm()Lorenzo Stoakes
Patch series "mm: perform guard region install/remove under VMA lock", v2. There is no reason why we can't perform guard region operations under the VMA lock, as long as we take proper precautions to ensure that we do so in a safe manner. This is fine, as VMA lock acquisition is always best-effort, so if we are unable to do so, we can simply fall back to using the mmap read lock. Doing so will reduce mmap lock contention for callers performing guard region operations and help establish a precedent of trying to use the VMA lock where possible. As part of this change we perform a trivial rename of page walk functions which bypass safety checks (i.e. whether or not mm_walk_ops->install_pte is specified) in order that we can keep naming consistent with the mm walk. This is because we need to expose a VMA-specific walk that still allows us to install PTE entries. This patch (of 2): Make it clear we're referencing an unsafe variant of this function explicitly. This is laying the foundation for exposing more such functions and maintaining a consistent naming scheme. As a part of this change, rename check_ops_valid() to check_ops_safe() for consistency. Link: https://lkml.kernel.org/r/cover.1762795245.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/c684d91464a438d6e31172c9450416a373f10649.1762795245.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: SeongJae Park <sj@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: memcg: dump memcg protection info on oom or alloc failuresShakeel Butt
Currently the kernel dumps memory state on oom and allocation failures. One of the questions usually raised on those dumps is why the kernel has not reclaimed the reclaimable memory instead of triggering oom. One potential reason is the usage of memory protection provided by memcg. So, let's also dump the memory protected by the memcg in such reports to ease the debugging. Link: https://lkml.kernel.org/r/20251107234041.3632644-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: set the VM_MAYBE_GUARD flag on guard region installLorenzo Stoakes
Now we have established the VM_MAYBE_GUARD flag and added the capacity to set it atomically, do so upon MADV_GUARD_INSTALL. The places where this flag is used currently and matter are: * VMA merge - performed under mmap/VMA write lock, therefore excluding racing writes. * /proc/$pid/smaps - can race the write, however this isn't meaningful as the flag write is performed at the point of the guard region being established, and thus an smaps reader can't reasonably expect to avoid races. Due to atomicity, a reader will observe either the flag being set or not. Therefore consistency will be maintained. In all other cases the flag being set is irrelevant and atomicity guarantees other flags will be read correctly. Note that non-atomic updates of unrelated flags do not cause an issue with this flag being set atomically, as writes of other flags are performed under mmap/VMA write lock, and these atomic writes are performed under mmap/VMA read lock, which excludes the write, avoiding RMW races. Note that we do not encounter issues with KCSAN by adjusting this flag atomically, as we are only updating a single bit in the flag bitmap and therefore we do not need to annotate these changes. We intentionally set this flag in advance of actually updating the page tables, to ensure that any racing atomic read of this flag will only return false prior to page tables being updated, to allow for serialisation via page table locks. Note that we set vma->anon_vma for anonymous mappings. This is because the expectation for anonymous mappings is that an anon_vma is established should they possess any page table mappings. This is also consistent with what we were doing prior to this patch (unconditionally setting anon_vma on guard region installation). We also need to update retract_page_tables() to ensure that madvise(..., MADV_COLLAPSE) doesn't incorrectly collapse file-backed ranges that contain guard regions. 
This was previously guarded by anon_vma being set to catch MAP_PRIVATE cases, but the introduction of VM_MAYBE_GUARD necessitates that we check this flag instead. We utilise vma_flag_test_atomic() to do so - we first perform an optimistic check, then after the PTE page table lock is held, we can check again safely, as upon guard marker install the flag is set atomically prior to the page table lock being taken to actually apply it. So if the initial check fails either: * Page table retraction acquires page table lock prior to VM_MAYBE_GUARD being set - guard marker installation will be blocked until page table retraction is complete. OR: * Guard marker installation acquires page table lock after setting VM_MAYBE_GUARD, which raced and didn't pick this up in the initial optimistic check, blocking page table retraction until the guard regions are installed - the second VM_MAYBE_GUARD check will prevent page table retraction. Either way we're safe. We refactor the retraction checks into a single file_backed_vma_is_retractable(), there doesn't seem to be any reason that the checks were separated as before. Note that VM_MAYBE_GUARD being set atomically remains correct as vma_needs_copy() is invoked with the mmap and VMA write locks held, excluding any race with madvise_guard_install(). 
Link: https://lkml.kernel.org/r/e9e9ce95b6ac17497de7f60fc110c7dd9e489e8d.1763460113.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrei Vagin <avagin@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
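The "set the flag atomically, then check it again under the page table lock" idea above can be illustrated with C11 atomics. This is a user-space sketch, not the kernel code: `flag_set_atomic()`, `flag_test_atomic()` and `retractable_sketch()` are hypothetical names, and the flag word is a plain `atomic_ulong` rather than a real VMA. Only the value 0x800 for VM_MAYBE_GUARD comes from the series.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define VM_MAYBE_GUARD 0x800UL  /* value stated in the series */

/* Set a single bit without disturbing concurrent readers of other bits. */
static void flag_set_atomic(atomic_ulong *flags, unsigned long bit)
{
	atomic_fetch_or(flags, bit);
}

/* Atomically read the word and test one bit. */
static bool flag_test_atomic(atomic_ulong *flags, unsigned long bit)
{
	return (atomic_load(flags) & bit) != 0;
}

/*
 * Retraction-side check (sketch): a range is only a candidate for page
 * table retraction while the guard flag is clear. In the scheme described
 * above this test is done twice: once optimistically, and once more after
 * the page table lock has been taken.
 */
static bool retractable_sketch(atomic_ulong *flags)
{
	return !flag_test_atomic(flags, VM_MAYBE_GUARD);
}
```

Because the installer sets the bit before it takes the page table lock, a retractor that re-tests the bit while holding that lock either sees the bit, or has already excluded the installer for the duration of the retraction.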
2025-11-20mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD oneLorenzo Stoakes
Gather all the VMA flags whose presence implies that page tables must be copied on fork into a single bitmap - VM_COPY_ON_FORK - and use this rather than specifying individual flags in vma_needs_copy(). We also add VM_MAYBE_GUARD to this list, as it being set on a VMA implies that there may be metadata contained in the page tables (that is - guard markers) which will not and cannot be propagated upon fork. This was already being done manually previously in vma_needs_copy(), but this makes it very explicit, alongside VM_PFNMAP, VM_MIXEDMAP and VM_UFFD_WP all of which imply the same. Note that VM_STICKY flags ought generally to be marked VM_COPY_ON_FORK too - because equally a flag being VM_STICKY indicates that the VMA contains metadata that is not propagated by being faulted in - i.e. that the VMA metadata does not fully describe the VMA alone, and thus we must propagate whatever metadata there is on a fork. However, for maximum flexibility, we do not make this necessarily the case here. 
Link: https://lkml.kernel.org/r/5d41b24e7bc622cda0af92b6d558d7f4c0d1bc8c.1763460113.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
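The copy-on-fork decision described above reduces to a single mask test. The following is a simplified sketch: `needs_copy_sketch()` is a hypothetical stand-in for vma_needs_copy(), the anon_vma state is reduced to a boolean, and all flag values other than VM_MAYBE_GUARD (0x800, from the series) are illustrative placeholders.

```c
#include <stdbool.h>

typedef unsigned long vm_flags_t;

#define VM_MAYBE_GUARD 0x00000800UL  /* value stated in the series */
#define VM_PFNMAP      0x00000400UL  /* illustrative value */
#define VM_UFFD_WP     0x00001000UL  /* illustrative value */
#define VM_MIXEDMAP    0x10000000UL  /* illustrative value */

/* Flags whose presence means page tables hold state a fault cannot rebuild. */
#define VM_COPY_ON_FORK \
	(VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_MAYBE_GUARD)

/*
 * Sketch of the fork-time decision: copy page tables if any copy-on-fork
 * flag is set, or if the mapping has anonymous memory attached.
 */
static bool needs_copy_sketch(vm_flags_t flags, bool has_anon_vma)
{
	if (flags & VM_COPY_ON_FORK)
		return true;
	return has_anon_vma;
}
```

With the flags gathered in one mask, adding another "must copy" flag later is a one-line change to the mask rather than another condition in the function.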
2025-11-20mm: implement sticky VMA flagsLorenzo Stoakes
It is useful to be able to designate that certain flags are 'sticky', that is, if two VMAs are merged one with a flag of this nature and one without, the merged VMA sets this flag. As a result we ignore these flags for the purposes of determining VMA flag differences between VMAs being considered for merge. This patch therefore updates the VMA merge logic to perform this action, with flags possessing this property being described in the VM_STICKY bitmap. Those flags which ought to be ignored for the purposes of VMA merge are described in the VM_IGNORE_MERGE bitmap, which the VMA merge logic is also updated to use. As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge. Ultimately it seems that we should make VM_SOFTDIRTY a sticky flag in its own right, but this change is out of scope for this series. The only sticky flag designated as such is VM_MAYBE_GUARD, so as a result of this change, once the VMA flag is set upon guard region installation, VMAs with guard ranges will now not have their merge behaviour impacted as a result and can be freely merged with other VMAs without VM_MAYBE_GUARD set. Also update the comments for vma_modify_flags() to directly reference sticky flags now we have established the concept. We also update the VMA userland tests to account for the changes. 
Link: https://lkml.kernel.org/r/22ad5269f7669d62afb42ce0c79bad70b994c58d.1763460113.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrei Vagin <avagin@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
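The two properties of sticky flags (they do not block a merge, and they survive it) can be sketched as plain bit arithmetic. This is an illustration only: `flags_mergeable()` and `merged_flags()` are hypothetical helpers, the merge is reduced to flag words, and every flag value except VM_MAYBE_GUARD (0x800, from the series) is a placeholder.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long vm_flags_t;

#define VM_READ        0x00000001UL  /* illustrative value */
#define VM_WRITE       0x00000002UL  /* illustrative value */
#define VM_MAYBE_GUARD 0x00000800UL  /* value stated in the series */
#define VM_SOFTDIRTY   0x08000000UL  /* illustrative value */

/* Sticky flags: kept on the merged VMA if either side had them. */
#define VM_STICKY       VM_MAYBE_GUARD
/* Flags that must not count as a difference when comparing for merge. */
#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)

/* Two flag words are merge-compatible if they differ only in ignored bits. */
static bool flags_mergeable(vm_flags_t a, vm_flags_t b)
{
	return ((a ^ b) & ~VM_IGNORE_MERGE) == 0;
}

/* The merged VMA keeps the target's flags plus any sticky bits of the other. */
static vm_flags_t merged_flags(vm_flags_t target, vm_flags_t other)
{
	assert(flags_mergeable(target, other));
	return target | (other & VM_STICKY);
}
```

For example, merging a read/write VMA with a read/write VMA that also has VM_MAYBE_GUARD is permitted, and the result carries VM_MAYBE_GUARD, while a VM_SOFTDIRTY difference is ignored for comparison but is not propagated by this sketch.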
2025-11-20mm: update vma_modify_flags() to handle residual flags, documentLorenzo Stoakes
The vma_modify_*() family of functions each either perform splits, a merge or no changes at all in preparation for the requested modification to occur. When doing so for a VMA flags change, we currently don't account for any flags which may remain (for instance, VM_SOFTDIRTY) despite the requested change in the case that a merge succeeded. This is made more important by subsequent patches which will introduce the concept of sticky VMA flags which rely on this behaviour. This patch fixes this by passing the VMA flags parameter as a pointer and updating it accordingly on merge and updating callers to accommodate for this. Additionally, while we are here, we add kdocs for each of the vma_modify_*() functions, as the fact that the requested modification is not performed is confusing so it is useful to make this abundantly clear. We also update the VMA userland tests to account for this change. Link: https://lkml.kernel.org/r/23b5b549b0eaefb2922625626e58c2a352f3e93c.1763460113.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrei Vagin <avagin@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smapsLorenzo Stoakes
Patch series "introduce VM_MAYBE_GUARD and make it sticky", v4. Currently, guard regions are not visible to users except through /proc/$pid/pagemap, with no explicit visibility at the VMA level. This makes the feature less useful, as it isn't entirely apparent which VMAs may have these entries present, especially when performing actions which walk through memory regions such as those performed by CRIU. This series addresses this issue by introducing the VM_MAYBE_GUARD flag which fulfils this role, updating the smaps logic to display an entry for these. The semantics of this flag are that a guard region MAY be present if set (we cannot be sure, as we can't efficiently track whether an MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if not set the VMA definitely does NOT have any guard regions present. It's problematic to establish this flag without further action, because that means that VMAs with guard regions in them become non-mergeable with adjacent VMAs for no especially good reason. To work around this, this series also introduces the concept of 'sticky' VMA flags - that is flags which: a. if set in one VMA and not in another still permit those VMAs to be merged (if otherwise compatible). b. When they are merged, the resultant VMA must have the flag set. The VMA logic is updated to propagate these flags correctly. Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve an issue with file-backed guard regions - previously these established an anon_vma object for file-backed mappings solely to have vma_needs_copy() correctly propagate guard region mappings to child processes. We introduce a new flag alias VM_COPY_ON_FORK (which currently only specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly for this flag and to copy page tables if it is present, which resolves this issue. Additionally, we add the ability for allow-listed VMA flags to be atomically writable with only mmap/VMA read locks held. 
The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure does not cause any races by being allowed to do so. This allows us to maintain guard region installation as a read-locked operation and not endure the overhead of obtaining a write lock here. Finally we introduce extensive VMA userland tests to assert that the sticky VMA logic behaves correctly as well as guard region self tests to assert that smaps visibility is correctly implemented. This patch (of 9): Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions). Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level. This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges. This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split). We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't been reused yet. We also update the smaps logic and documentation to identify these VMAs. Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork. 
We do not actually implement usage of this flag in mm/madvise.c yet as we need to allow some VMA flags to be applied atomically under mmap/VMA read lock in order to avoid the need to acquire a write lock for this purpose. Link: https://lkml.kernel.org/r/cover.1763460113.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/cf8ef821eba29b6c5b5e138fffe95d6dcabdedb9.1763460113.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Andrei Vagin <avagin@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
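From user space, the smaps visibility described above would be consumed by scanning the VmFlags line of each mapping. The helper below is a hedged sketch: `vmflags_has()` is a hypothetical function, and the two-letter mnemonic used for VM_MAYBE_GUARD in the usage note ("gu") is an assumption that should be checked against the kernel's proc documentation before relying on it.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if a "VmFlags:" line from /proc/$pid/smaps contains the
 * given whitespace-separated mnemonic. Lines that are not VmFlags lines
 * never match.
 */
static bool vmflags_has(const char *line, const char *mnemonic)
{
	static const char prefix[] = "VmFlags:";
	size_t plen = sizeof(prefix) - 1;
	size_t mlen = strlen(mnemonic);
	const char *p;

	if (mlen == 0 || strncmp(line, prefix, plen) != 0)
		return false;

	p = line + plen;
	while (*p) {
		size_t tok;

		p += strspn(p, " \t\n");       /* skip separators */
		tok = strcspn(p, " \t\n");     /* length of next token */
		if (tok == mlen && strncmp(p, mnemonic, mlen) == 0)
			return true;
		p += tok;
	}
	return false;
}
```

A caller would read smaps line by line and, assuming the mnemonic is "gu", treat `vmflags_has(line, "gu")` as "this VMA may contain guard regions"; a VMA without the flag definitely has none, so only flagged VMAs need a finer-grained pagemap scan.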
2025-11-20mm/hugetlb: extract sysctl into hugetlb_sysctl.cHui Zhu
Following the extraction of sysfs code, this patch moves the sysctl interface implementation into a dedicated file to further improve code organization and maintainability of the hugetlb subsystem. The following components are moved to mm/hugetlb_sysctl.c: - proc_hugetlb_doulongvec_minmax() - hugetlb_sysctl_handler_common() - hugetlb_sysctl_handler() - hugetlb_mempolicy_sysctl_handler() (CONFIG_NUMA) - hugetlb_overcommit_handler() - hugetlb_table[] sysctl table definition - hugetlb_sysctl_init() The hugetlb_internal.h header file is updated to declare the sysctl initialization function with proper #ifdef guards for configurations without CONFIG_SYSCTL support. The Makefile is updated to compile hugetlb_sysctl.o when CONFIG_HUGETLBFS is enabled. This refactoring reduces the size of hugetlb.c and logically separates the sysctl interface from core hugetlb management code. MAINTAINERS is updated to add new file hugetlb_sysctl.c. No functional changes are introduced; all code is moved as-is from hugetlb.c with consistent formatting. Link: https://lkml.kernel.org/r/5bbee7ab5be71d0bb1aebec38642d7e83526bb7a.1762398359.git.zhuhui@kylinos.cn Signed-off-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm/hugetlb: extract sysfs into hugetlb_sysfs.cHui Zhu
Patch series "mm/hugetlb: refactor sysfs/sysctl interfaces", v5. hugetlb.c has grown significantly and become difficult to maintain. This patch series extracts the sysfs and sysctl interface code into separate dedicated files to improve code organization. The refactoring includes: - Patch 1: Extract sysfs interface into mm/hugetlb_sysfs.c - Patch 2: Extract sysctl interface into mm/hugetlb_sysctl.c No functional changes are introduced in this series. The code is moved as-is, with only minor formatting adjustments for code style consistency. This should make future maintenance and enhancements to the hugetlb subsystem easier. Testing: The patch series has been compile-tested and maintains the same functionality as the original code. This patch (of 2): Currently, hugetlb.c contains both core management logic and sysfs interface implementations, making it difficult to maintain. This patch extracts the sysfs-related code into a dedicated file to improve code organization. The following components are moved to mm/hugetlb_sysfs.c: - sysfs attribute definitions and handlers - sysfs kobject management functions - NUMA per-node hstate attribute registration Several inline helper functions and macros are moved to mm/hugetlb_internal.h: - hstate_is_gigantic_no_runtime() - next_node_allowed() - get_valid_node_allowed() - hstate_next_node_to_alloc() - hstate_next_node_to_free() - for_each_node_mask_to_alloc/to_free macros To support code sharing, these functions are changed from static to exported symbols: - remove_hugetlb_folio() - add_hugetlb_folio() - init_new_hugetlb_folio() - prep_and_add_allocated_folios() - demote_pool_huge_page() - __nr_hugepages_store_common() The Makefile is updated to compile hugetlb_sysfs.o when CONFIG_HUGETLBFS is enabled. This maintains all existing functionality while improving maintainability by separating concerns. MAINTAINERS is updated to add new file hugetlb_sysfs.c. 
Link: https://lkml.kernel.org/r/cover.1762398359.git.zhuhui@kylinos.cn Link: https://lkml.kernel.org/r/656a03dff7e2bb20e24e841ede81fdca01d21410.1762398359.git.zhuhui@kylinos.cn Signed-off-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: constify __dump_folio() argumentsMatthew Wilcox (Oracle)
These arguments aren't modified by the function; mark them as const to help the compiler. Link: https://lkml.kernel.org/r/20251106203526.2368275-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20memory_hotplug: optimise try_offline_memory_block()Matthew Wilcox (Oracle)
Extract the zone number directly from the page instead of using the page's zone number to look up the zone and asking the zone what its number is. Link: https://lkml.kernel.org/r/20251106201452.2292631-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20migrate: optimise alloc_migration_target()Matthew Wilcox (Oracle)
Extract the zone number directly from the folio instead of using the folio's zone number to look up the zone and asking the zone what its number is. [ziy@nvidia.com: fix folio_zonenum() return type] Link: https://lkml.kernel.org/r/26E8FF35-503E-4F14-98F7-7B4FA25FBD37@nvidia.com Link: https://lkml.kernel.org/r/20251106201452.2292631-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20hugetlb: optimise hugetlb_folio_init_tail_vmemmap()Matthew Wilcox (Oracle)
Extract the zone number directly from the folio instead of using the folio's zone number to look up the zone and asking the zone what its number is. Also we should use &folio->page instead of casting from folio to page. Link: https://lkml.kernel.org/r/20251106201452.2292631-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
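The round trip that these three optimisation commits remove can be shown with a small user-space model. Everything here is a simplified assumption: a single node, a made-up bit layout (`ZONES_PGSHIFT`, `ZONES_WIDTH`), and cut-down `struct page` and `struct zone` types; only the shape of the two lookups mirrors the description.

```c
#include <stddef.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

struct zone { int placeholder; };
struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; };
struct page { unsigned long flags; };

/* Hypothetical layout: the zone number lives in two bits of page->flags. */
#define ZONES_PGSHIFT 4
#define ZONES_WIDTH   2
#define ZONES_MASK    ((1UL << ZONES_WIDTH) - 1)

static struct pglist_data node0;  /* single-node model */

/* Direct: read the zone number straight out of the flags word. */
static enum zone_type page_zonenum(const struct page *page)
{
	return (enum zone_type)((page->flags >> ZONES_PGSHIFT) & ZONES_MASK);
}

/* Indirect, step 1: use the zone number to find the zone structure. */
static struct zone *page_zone(const struct page *page)
{
	return &node0.node_zones[page_zonenum(page)];
}

/* Indirect, step 2: ask the zone for its index within the node. */
static enum zone_type zone_idx(const struct zone *zone)
{
	return (enum zone_type)(zone - node0.node_zones);
}

/* Build a page whose flags encode the given zone (helper for the model). */
static struct page page_in_zone(enum zone_type zt)
{
	struct page p = { .flags = (unsigned long)zt << ZONES_PGSHIFT };
	return p;
}
```

In this model `zone_idx(page_zone(&p))` and `page_zonenum(&p)` always agree, but the former performs an array lookup and a pointer subtraction only to recover the number it started with.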
2025-11-20mm: kill mm_wr_locked from unmap_vmas() and unmap_single_vma()Kefeng Wang
Kill mm_wr_locked since commit f8e97613fed2 ("mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()") removed the user. Link: https://lkml.kernel.org/r/20251104085709.2688433-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-20mm: vmscan: simplify the folio refcount check in pageout()Baolin Wang
Since we no longer attempt to write back filesystem folios in pageout() (they will be filtered out by the following check in pageout()), and only tmpfs/shmem folios and anonymous swapcache folios can be written back, we can remove the redundant folio_test_private() when checking the folio's refcount, as tmpfs/shmem and swapcache folios do not use the PG_private flag. While we're at it, we can open-code the folio refcount check instead of adding a simple helper that has only one user. Link: https://lkml.kernel.org/r/4cbbec5bb92397aa4597105f1f499aabf7a1901c.1758166683.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>