summaryrefslogtreecommitdiff
path: root/mm/hugetlb.c
AgeCommit message (Collapse)Author
2025-04-01mm/hugetlb: move hugetlb_sysctl_init() to the __init sectionMarc Herbert
hugetlb_sysctl_init() is only invoked once by an __init function and is merely a wrapper around another __init function so there is not reason to keep it. Fixes the following warning when toning down some GCC inline options: WARNING: modpost: vmlinux: section mismatch in reference: hugetlb_sysctl_init+0x1b (section: .text) -> __register_sysctl_init (section: .init.text) Link: https://lkml.kernel.org/r/20250319060041.2737320-1-marc.herbert@linux.intel.com Signed-off-by: Marc Herbert <Marc.Herbert@linux.intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21mm/mm_init: rename __init_reserved_page_zone to __init_page_from_nidMike Rapoport (Microsoft)
__init_reserved_page_zone() function finds the zone for pfn and nid and performs initialization of a struct page with that zone and nid. There is nothing in that function about reserved pages and it is misnamed. Rename it to __init_page_from_nid() to better reflect what the function does. Link: https://lkml.kernel.org/r/20250225083017.567649-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/hugetlb: update nr_huge_pages and surplus_huge_pages togetherLiu Shixin
In alloc_surplus_hugetlb_folio(), we increase nr_huge_pages and surplus_huge_pages separately. In the middle window, if we set nr_hugepages to smaller and satisfy count < persistent_huge_pages(h), the surplus_huge_pages will be increased by adjust_pool_surplus(). After adding delay in the middle window, we can reproduce the problem easily by following step: 1. echo 3 > /proc/sys/vm/nr_overcommit_hugepages 2. mmap two hugepages. When nr_huge_pages=2 and surplus_huge_pages=1, goto step 3. 3. echo 0 > /proc/sys/vm/nr_huge_pages Finally, nr_huge_pages is less than surplus_huge_pages. To fix the problem, call only_alloc_fresh_hugetlb_folio() instead and move down __prep_account_new_huge_page() into the hugetlb_lock. Link: https://lkml.kernel.org/r/20250305035409.2391344-1-liushixin2@huawei.com Fixes: 0c397daea1d4 ("mm, hugetlb: further simplify hugetlb allocation API") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: move hugetlb specific things in folio to page[3]David Hildenbrand
Let's just move the hugetlb specific stuff to a separate page, and stop letting it overlay other fields for now. This frees up some space in page[2], which we will use on 32bit to free up some space in page[1]. While we could move these things to page[3] instead, it's cleaner to just move the hugetlb specific things out of the way and pack the core-folio stuff as tight as possible. ... and we can minimize the work required in dump_folio. We can now avoid re-initializing &folio->_deferred_list in hugetlb code. Hopefully dynamically allocating "strut folio" in the future will further clean this up. Link: https://lkml.kernel.org/r/20250303163014.1128035-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirks^H^Hski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michal Koutn <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: tejun heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: hugetlb: log time needed to allocate hugepagesThomas Prescher
Having this information allows users to easily tune the hugepages_node_threads parameter. Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-3-7db8c6dc0453@cyberus-technology.de Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: hugetlb: add hugetlb_alloc_threads cmdline optionThomas Prescher
Add a command line option that enables control of how many threads should be used to allocate huge pages. [akpm@linux-foundation.org: tidy up a comment] Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-2-7db8c6dc0453@cyberus-technology.de Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: hugetlb: improve parallel huge page allocation timeThomas Prescher
Patch series "Add a command line option that enables control of how many threads should be used to allocate huge pages", v2. Allocating huge pages can take a very long time on servers with terabytes of memory even when they are allocated at boot time where the allocation happens in parallel. Before this series, the kernel used a hard coded value of 2 threads per NUMA node for these allocations. This value might have been good enough in the past but it is not sufficient to fully utilize newer systems. This series changes the default so the kernel uses 25% of the available hardware threads for these allocations. In addition, we allow the user that wish to micro-optimize the allocation time to override this value via a new kernel parameter. We tested this on 2 generations of Xeon CPUs and the results show a big improvement of the overall allocation time. +-----------------------+-------+-------+-------+-------+-------+ | threads | 8 | 16 | 32 | 64 | 128 | +-----------------------+-------+-------+-------+-------+-------+ | skylake 144 cpus | 44s | 22s | 16s | 19s | 20s | | cascade lake 192 cpus | 39s | 20s | 11s | 10s | 9s | +-----------------------+-------+-------+-------+-------+-------+ On skylake, we see an improvment of 2.75x when using 32 threads, on cascade lake we can get even better at 4.3x when we use 128 threads. This speedup is quite significant and users of large machines like these should have the option to make the machines boot as fast as possible. This patch (of 3): Before this patch, the kernel currently used a hard coded value of 2 threads per NUMA node for these allocations. This patch changes this policy and the kernel now uses 25% of the available hardware threads for the allocations. Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-0-7db8c6dc0453@cyberus-technology.de Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-1-7db8c6dc0453@cyberus-technology.de Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: move hugetlb CMA code in to its own fileFrank van der Linden
hugetlb.c contained a number of CONFIG_CMA ifdefs, and the code inside them was large enough to merit being in its own file, so move it, cleaning up things a bit. Hide some direct variable access behind functions to accommodate the move. No functional change intended. Link: https://lkml.kernel.org/r/20250228182928.2645936-28-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: enable bootmem allocation from CMA areasFrank van der Linden
If hugetlb_cma_only is enabled, we know that hugetlb pages can only be allocated from CMA. Now that there is an interface to do early reservations from a CMA area (returning memblock memory), it can be used to allocate hugetlb pages from CMA. This also allows for doing pre-HVO on these pages (if enabled). Make sure to initialize the page structures and associated data correctly. Create a flag to signal that a hugetlb page has been allocated from CMA to make things a little easier. Some configurations of powerpc have a special hugetlb bootmem allocator, so introduce a boolean arch_specific_huge_bootmem_alloc that returns true if such an allocator is present. In that case, CMA bootmem allocations can't be used, so check that function before trying. Link: https://lkml.kernel.org/r/20250228182928.2645936-27-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: add hugetlb_cma_only cmdline optionFrank van der Linden
Add an option to force hugetlb gigantic pages to be allocated using CMA only (if hugetlb_cma is enabled). This avoids a fallback to allocation from the rest of system memory if the CMA allocation fails. This makes the size of hugetlb_cma a hard upper boundary for gigantic hugetlb page allocations. This is useful because, with a large CMA area, the kernel's unmovable allocations will have less room to work with and it is undesirable for new hugetlb gigantic page allocations to be done from that remaining area. It will eat in to the space available for unmovable allocations, leading to unwanted system behavior (OOMs because the kernel fails to do unmovable allocations). So, with this enabled, an administrator can force a hard upper bound for runtime gigantic page allocations, and have more predictable system behavior. Link: https://lkml.kernel.org/r/20250228182928.2645936-26-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: do pre-HVO for bootmem allocated pagesFrank van der Linden
For large systems, the overhead of vmemmap pages for hugetlb is substantial. It's about 1.5% of memory, which is about 45G for a 3T system. If you want to configure most of that system for hugetlb (e.g. to use as backing memory for VMs), there is a chance of running out of memory on boot, even though you know that the 45G will become available later. To avoid this scenario, and since it's a waste to first allocate and then free that 45G during boot, do pre-HVO for hugetlb bootmem allocated pages ('gigantic' pages). pre-HVO is done by adding functions that are called from sparse_init_nid_early and sparse_init_nid_late. The first is called before memmap allocation, so it takes care of allocating memmap HVO-style. The second verifies that all bootmem pages look good, specifically it checks that they do not intersect with multiple zones. This can only be done from sparse_init_nid_late path, when zones have been initialized. The hugetlb page size must be aligned to the section size, and aligned to the size of memory described by the number of page structures contained in one PMD (since pre-HVO is not prepared to split PMDs). This should be true for most 'gigantic' pages, it is for 1G pages on x86, where both of these alignment requirements are 128M. This will only have an effect if hugetlb_bootmem_alloc was called early in boot. If not, it won't do anything, and HVO for bootmem hugetlb pages works as before. Link: https://lkml.kernel.org/r/20250228182928.2645936-20-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: add pre-HVO frameworkFrank van der Linden
Define flags for pre-HVOed bootmem hugetlb pages, and act on them. The most important flag is the HVO flag, signalling that a bootmem allocated gigantic page has already been HVO-ed. If this flag is seen by the hugetlb bootmem gather code, the page is marked as HVO optimized. The HVO code will then not try to optimize it again. Instead, it will just map the tail page mirror pages read-only, completing the HVO steps. No functional change, as nothing sets the flags yet. Link: https://lkml.kernel.org/r/20250228182928.2645936-18-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: move huge_boot_pages list init to hugetlb_bootmem_allocFrank van der Linden
Instead of initializing the per-node hugetlb bootmem pages list from the alloc function, we can now do it in a somewhat cleaner way, since there is an explicit hugetlb_bootmem_alloc function. Initialize the lists there. Link: https://lkml.kernel.org/r/20250228182928.2645936-17-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: deal with multiple calls to hugetlb_bootmem_allocFrank van der Linden
Architectures that want pre-HVO of hugetlb vmemmap pages will need to call hugetlb_bootmem_alloc from an earlier spot in boot (before sparse_init). To facilitate some architectures doing this, protect hugetlb_bootmem_alloc against multiple calls. Also provide a helper function to check if it's been called, so that the early HVO code, to be added later, can see if there is anything to do. Link: https://lkml.kernel.org/r/20250228182928.2645936-16-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: check bootmem pages for zone intersectionsFrank van der Linden
Bootmem hugetlb pages are allocated using memblock, which isn't (and mostly can't be) aware of zones. So, they may end up crossing zone boundaries. This would create confusion, a hugetlb page that is part of multiple zones is bad. Worse, HVO might then end up stealthily re-assigning pages to a different zone when a hugetlb page is freed, since the tail page structures beyond the first vmemmap page would inherit the zone of the first page structures. While the chance of this happening is low, you can definitely create a configuration where this happens (especially using ZONE_MOVABLE). To avoid this issue, check if bootmem hugetlb pages intersect with multiple zones during the gather phase, and discard them, handing them to the page allocator, if they do. Record the number of invalid bootmem pages per node and subtract them from the number of available pages at the end, making it easier to do these checks in multiple places later on. Link: https://lkml.kernel.org/r/20250228182928.2645936-14-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: set migratetype for bootmem foliosFrank van der Linden
The pageblocks that back memblock allocated hugetlb folios might not have the migrate type set, in the CONFIG_DEFERRED_STRUCT_PAGE_INIT case. memblock allocated hugetlb folios might be given to the buddy allocator eventually (if nr_hugepages is lowered), so make sure that the migrate type for the pageblocks contained in them is set when initializing them. Set it to the default that memmap init also uses (MIGRATE_MOVABLE). Link: https://lkml.kernel.org/r/20250228182928.2645936-12-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: convert cmdline parameters from setup to earlyFrank van der Linden
Convert the cmdline parameters (hugepagesz, hugepages, default_hugepagesz and hugetlb_free_vmemmap) to early parameters. Since parse_early_param might run before MMU setups on some platforms (powerpc), validation of huge page sizes as specified in command line parameters would fail. So instead, for the hstate-related values, just record the them and parse them on demand, from hugetlb_bootmem_alloc. The allocation of hugetlb bootmem pages is now done in hugetlb_bootmem_alloc, which is called explicitly at the start of mm_core_init(). core_initcall would be too late, as that happens with memblock already torn down. This change will allow earlier allocation and initialization of bootmem hugetlb pages later on. No functional change intended. Link: https://lkml.kernel.org/r/20250228182928.2645936-8-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: use online nodes for bootmem allocationFrank van der Linden
Later commits will move hugetlb bootmem allocation to earlier in init, when N_MEMORY has not yet been set on nodes. Use online nodes instead. At most, this wastes just a few cycles once during boot (and most likely none). Link: https://lkml.kernel.org/r/20250228182928.2645936-7-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: remove redundant __ClearPageReservedFrank van der Linden
In hugetlb_folio_init_tail_vmemmap, the reserved flag is cleared for the tail page just before it is zeroed out, which is redundant. Remove the __ClearPageReserved call. Link: https://lkml.kernel.org/r/20250228182928.2645936-6-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm, hugetlb: use cma_declare_contiguous_multiFrank van der Linden
hugetlb_cma is fine with using multiple CMA ranges, as long as it can get its gigantic pages allocated from them. So, use cma_declare_contiguous_multi to allow for multiple ranges, increasing the chances of getting what we want on systems with gaps in physical memory. Link: https://lkml.kernel.org/r/20250228182928.2645936-5-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/hugetlb: fix surplus pages in dissolve_free_huge_page()Jinjiang Tu
In dissolve_free_huge_page(), free huge pages are dissolved without adjusting surplus count. However, free huge pages may be accounted as surplus pages, and will lead to wrong surplus count. I reproduce this issue on qemu. The steps are: 1) Node1 is memory-less at first. Hot-add memory to node1 by executing the two commands in qemu monitor: object_add memory-backend-ram,id=mem1,size=1G device_add pc-dimm,id=dimm1,memdev=mem1,node=1 2) online one memory block of Node1 with: echo online_movable > /sys/devices/system/node/node1/memoryX/state 3) create 64 huge pages for node1 4) run a program to reserve (don't consume) all the huge pages 5) echo 0 > nr_huge_pages for node1. After this step, free huge pages in Node1 are surplus. 6) create 80 huge pages for node0 7) offline memory of node1, The memory range to offline contains the free surplus huge pages created in step3) ~ step5) echo offline > /sys/devices/system/node/node1/memoryX/state 8) kill the program in step 4) The result: Node0 Node1 total 80 0 free 80 0 surplus 0 61 To fix it, adjust surplus when destroying huge pages if the node has surplus pages in dissolve_free_hugetlb_folio(). The result with this patch: Node0 Node1 total 80 0 free 80 0 surplus 0 0 Link: https://lkml.kernel.org/r/20250304132106.2872754-1-tujinjiang@huawei.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-08Merge tag 'mm-hotfixes-stable-2025-03-08-16-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "33 hotfixes. 24 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 26 are for MM and 7 are for non-MM. - "mm: memory_failure: unmap poisoned folio during migrate properly" from Ma Wupeng fixes a couple of two year old bugs involving the migration of hwpoisoned folios. - "selftests/damon: three fixes for false results" from SeongJae Park fixes three one year old bugs in the SAMON selftest code. The remainder are singletons and doubletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-08-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (33 commits) mm/page_alloc: fix uninitialized variable rapidio: add check for rio_add_net() in rio_scan_alloc_net() rapidio: fix an API misues when rio_add_net() fails MAINTAINERS: .mailmap: update Sumit Garg's email address Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" mm: fix finish_fault() handling for large folios mm: don't skip arch_sync_kernel_mappings() in error paths mm: shmem: remove unnecessary warning in shmem_writepage() userfaultfd: fix PTE unmapping stack-allocated PTE copies userfaultfd: do not block on locking a large folio with raised refcount mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages mm: shmem: fix potential data corruption during shmem swapin mm: fix kernel BUG when userfaultfd_move encounters swapcache selftests/damon/damon_nr_regions: sort collected regiosn before checking with min/max boundaries selftests/damon/damon_nr_regions: set ops update for merge results check to 100ms selftests/damon/damos_quota: make real expectation of quota exceeds include/linux/log2.h: mark is_power_of_2() with __always_inline NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback mm, swap: avoid BUG_ON in relocate_cluster() mm: swap: use correct step in loop to wait all clusters in wait_for_allocation() ...
2025-03-05mm/hugetlb: wait for hugetlb folios to be freedGe Yang
Since the introduction of commit c77c0a8ac4c52 ("mm/hugetlb: defer freeing of huge pages if in non-task context"), which supports deferring the freeing of hugetlb pages, the allocation of contiguous memory through cma_alloc() may fail probabilistically. In the CMA allocation process, if it is found that the CMA area is occupied by in-use hugetlb folios, these in-use hugetlb folios need to be migrated to another location. When there are no available hugetlb folios in the free hugetlb pool during the migration of in-use hugetlb folios, new folios are allocated from the buddy system. A temporary state is set on the newly allocated folio. Upon completion of the hugetlb folio migration, the temporary state is transferred from the new folios to the old folios. Normally, when the old folios with the temporary state are freed, it is directly released back to the buddy system. However, due to the deferred freeing of hugetlb pages, the PageBuddy() check fails, ultimately leading to the failure of cma_alloc(). Here is a simplified call trace illustrating the process: cma_alloc() ->__alloc_contig_migrate_range() // Migrate in-use hugetlb folios ->unmap_and_move_huge_page() ->folio_putback_hugetlb() // Free old folios ->test_pages_isolated() ->__test_page_isolated_in_pageblock() ->PageBuddy(page) // Check if the page is in buddy To resolve this issue, we have implemented a function named wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb folios are properly released back to the buddy system after their migration is completed. By invoking wait_for_freed_hugetlb_folios() before calling PageBuddy(), we ensure that PageBuddy() will succeed. Link: https://lkml.kernel.org/r/1739936804-18199-1-git-send-email-yangge1116@126.com Fixes: c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context") Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-01Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "Ryan's been hard at work finding and fixing mm bugs in the arm64 code, so here's a small crop of fixes for -rc5. The main changes are to fix our zapping of non-present PTEs for hugetlb entries created using the contiguous bit in the page-table rather than a block entry at the level above. Prior to these fixes, we were pulling the contiguous bit back out of the PTE in order to determine the size of the hugetlb page but this is clearly bogus if the thing isn't present and consequently both the clearing of the PTE(s) and the TLB invalidation were unreliable. Although the problem was found by code inspection, we really don't want this sitting around waiting to trigger and the changes are CC'd to stable accordingly. Note that the diffstat looks a lot worse than it really is; huge_ptep_get_and_clear() now takes a size argument from the core code and so all the arch implementations of that have been updated in a pretty mechanical fashion. - Fix a sporadic boot failure due to incorrect randomization of the linear map on systems that support it - Fix the zapping (both clearing the entries *and* invalidating the TLB) of hugetlb PTEs constructed using the contiguous bit" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear() arm64/mm: Fix Boot panic on Ampere Altra
2025-02-27mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()Ryan Roberts
In order to fix a bug, arm64 needs to be told the size of the huge page for which the huge_pte is being cleared in huge_ptep_get_and_clear(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear() and set_huge_pte_at(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, loongarch, mips, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. Cc: stable@vger.kernel.org Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit") Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390 Link: https://lore.kernel.org/r/20250226120656.2400136-2-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-02-17mm: hugetlb: avoid fallback for specific node allocation of 1G pagesLuiz Capitulino
When using the HugeTLB kernel command-line to allocate 1G pages from a specific node, such as: default_hugepagesz=1G hugepages=1:1 If node 1 happens to not have enough memory for the requested number of 1G pages, the allocation falls back to other nodes. A quick way to reproduce this is by creating a KVM guest with a memory-less node and trying to allocate 1 1G page from it. Instead of failing, the allocation will fallback to other nodes. This defeats the purpose of node specific allocation. Also, specific node allocation for 2M pages don't have this behavior: the allocation will just fail for the pages it can't satisfy. This issue happens because HugeTLB calls memblock_alloc_try_nid_raw() for 1G boot-time allocation as this function falls back to other nodes if the allocation can't be satisfied. Use memblock_alloc_exact_nid_raw() instead, which ensures that the allocation will only be satisfied from the specified node. Link: https://lkml.kernel.org/r/20250211034856.629371-1-luizcap@redhat.com Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01mm/hugetlb: fix hugepage allocation for interleaved memory nodesRitesh Harjani (IBM)
gather_bootmem_prealloc() assumes the start nid as 0 and size as num_node_state(N_MEMORY). That means in case if memory attached numa nodes are interleaved, then gather_bootmem_prealloc_parallel() will fail to scan few of these nodes. Since memory attached numa nodes can be interleaved in any fashion, hence ensure that the current code checks for all numa node ids (.size = nr_node_ids). Let's still keep max_threads as N_MEMORY, so that it can distributes all nr_node_ids among the these many no. threads. e.g. qemu cmdline ======================== numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20" mem_cmd="-object memory-backend-ram,id=mem1,size=16G" w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2): ========================== ~ # cat /proc/meminfo |grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB Hugetlb: 0 kB with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2): =========================== ~ # cat /proc/meminfo |grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 2 HugePages_Free: 2 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB Hugetlb: 2097152 kB Link: https://lkml.kernel.org/r/f8d8dad3a5471d284f54185f65d575a6aaab692b.1736592534.git.ritesh.list@gmail.com Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reported-by: Pavithra Prakash <pavrampu@linux.ibm.com> Suggested-by: Muchun Song <muchun.song@linux.dev> Tested-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Gang Li <gang.li@linux.dev> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-28treewide: const qualify ctl_tables where applicableJoel Granados
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-25mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()David Hildenbrand
We are demoting hugetlb folios to smaller hugetlb folios; let's avoid messing with pages where avoidable and handle it more similar to __split_huge_page_tail(). Link: https://lkml.kernel.org/r/20250113131611.2554758-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: rename folio_putback_active_hugetlb() to folio_putback_hugetlb()David Hildenbrand
Now that folio_putback_hugetlb() is only called on folios that were previously isolated through folio_isolate_hugetlb(), let's rename it to match folio_putback_lru(). Add some kernel doc to clarify how this function is supposed to be used. Link: https://lkml.kernel.org/r/20250113131611.2554758-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/migrate: don't call folio_putback_active_hugetlb() on dst hugetlb folioDavid Hildenbrand
We replaced a simple put_page() by a putback_active_hugepage() call in commit 3aaa76e125c1 ("mm: migrate: hugetlb: putback destination hugepage to active list"), to set the "active" flag on the dst hugetlb folio. Nowadays, we decoupled the "active" list from the flag, by calling the flag "migratable". Calling "putback" on something that wasn't allocated is weird and not future proof, especially if we might reach that path when migration failed and we just want to free the freshly allocated hugetlb folio. Let's simply handle the migratable flag and the active list flag in move_hugetlb_state(), where we know that allocation succeeded and already handle the temporary flag; use a simple folio_put() to return our reference. Link: https://lkml.kernel.org/r/20250113131611.2554758-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: rename isolate_hugetlb() to folio_isolate_hugetlb()David Hildenbrand
Let's make the function name match "folio_isolate_lru()", and add some kernel doc. Link: https://lkml.kernel.org/r/20250113131611.2554758-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: unify restore reserve accounting for new allocationsPeter Xu
Either hugetlb pages dequeued from hstate, or newly allocated from buddy, would require restore-reserve accounting to be managed properly. Merge the two paths on it. Add a small comment to make it slightly nicer. Link: https://lkml.kernel.org/r/20250107204002.2683356-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: drop vma_has_reserves()Peter Xu
After the previous cleanup, vma_has_reserves() is mostly an empty helper except that it says "use reserve count" is inverted meaning from "needs a global reserve count", which is still true. To avoid confusions on having two inverted ways to ask the same question, always use the gbl_chg everywhere, and drop the function. When at it, rename "chg" to "gbl_chg" in dequeue_hugetlb_folio_vma(). It might be helpful for readers to see that the "chg" here is the global reserve count, not the vma resv count. Link: https://lkml.kernel.org/r/20250107204002.2683356-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: simplify vma_has_reserves()Peter Xu
vma_has_reserves() is a helper "trying" to know whether the vma should consume one reservation when allocating the hugetlb folio. However it's not clear on why we need such complexity, as such information is already represented in the "chg" variable. From alloc_hugetlb_folio() context, "chg" (or in the function's context, "gbl_chg") is defined as: - If gbl_chg=1, the allocation cannot reuse an existing reservation - If gbl_chg=0, the allocation should reuse an existing reservation Firstly, map_chg is defined as following, to cover all cases of hugetlb reservation scenarios (mostly, via vma_needs_reservation(), but cow_from_owner is an outlier): CONDITION HAS RESERVATION? ========= ================ - SHARED: always check against per-inode resv_map (ignore NONRESERVE) - If resv exists ==> YES [1] - If not ==> NO [2] - PRIVATE: complicated... - Request came from a CoW from owner resv map ==> NO [3] (when cow_from_owner==true) - If does not own a resv_map at all.. ==> NO [4] (examples: VM_NORESERVE, private fork()) - If owns a resv_map, but resv donsn't exists ==> NO [5] - If owns a resv_map, and resv exists ==> YES [6] Further on, gbl_chg considered spool setup, so that is a decision based on all the context. If we look at vma_has_reserves(), it almost does check that has already been processed by map_chg accounting (I marked each return value to the case above): static bool vma_has_reserves(struct vm_area_struct *vma, long chg) { if (vma->vm_flags & VM_NORESERVE) { if (vma->vm_flags & VM_MAYSHARE && chg == 0) return true; ==> [1] else return false; ==> [2] or [4] } if (vma->vm_flags & VM_MAYSHARE) { if (chg) return false; ==> [2] else return true; ==> [1] } if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { if (chg) return false; ==> [5] else return true; ==> [6] } return false; ==> [4] } It didn't check [3], but [3] case was actually already covered now by the "chg" / "gbl_chg" / "map_chg" calculations. In short, vma_has_reserves() doesn't provide anything more than return "!chg".. so just simplify all the things. There're a lot of comments describing truncation races, IIUC there should have no race as long as map_chg is properly done. Link: https://lkml.kernel.org/r/20250107204002.2683356-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: clean up map/global resv accounting when allocatePeter Xu
alloc_hugetlb_folio() isn't a function easy to read, especially on reservation accountings for either VMA or globally (majorly, spool only). The 1st complexity lies in the special private CoW path, aka, cow_from_owner=true case. The 2nd complexity may be the confusing updates of gbl_chg after it's set once, which looks like they can change anytime on the fly. Logically, cow_from_user is only about vma reservation. We could already decouple the flag and consolidate it into map charge flag very early. Then we don't need to keep checking the CoW special flag every time. This patch does it by making map_chg a tri-state flag. Tri-state needed is unfortunate, and it's because currently vma_needs_reservation() has a side effect internally, that it must be followed by either a end() or commit(). We keep the same semantic as before on one thing: "if (map_chg)" means we need a separate per-vma resv count. It keeps most of the old code like before untouched with the new enum. After this patch, we take these steps to decide these variables, hopefully slightly easier to follow: - First, decide map_chg. This will take cow_from_owner into account, once and for all. It's about whether we could take a resv count from the vma, no matter it's shared, private, etc. - Then, decide gbl_chg. The only diff here is spool, comparing to map_chg. Now only update each flag once and for all, instead of keep any of them flipping which can be very hard to follow. With cow_from_owner merged into map_chg, we could remove quite a few such checks all over. Side benefit of such is that we can get rid of one more confusing flag, which is deferred_reserve. Cleanup the comments a bit too. E.g., MAP_NORESERVE may not need to check against spool limit, AFAIU, if it's on a shared mapping, and if the page cache folio has its inode's resv map available (in which case map_chg would have been set zero, hence the code should be correct, not the comment). There's one trivial detail that needs attention that this patch touched, which is this check right after vma_commit_reservation(): if (map_chg > map_commit) It changes to: if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) It should behave the same like before, because previously the only way to make "map_chg > map_commit" happen is map_chg=1 && map_commit=0. That's exactly the rewritten line. Meanwhile, either commit() or end() will need to be skipped if ENFORCE, to keep the old behavior. Even though it looks a lot changed, but no functional change expected. Link: https://lkml.kernel.org/r/20250107204002.2683356-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: rename avoid_reserve to cow_from_ownerPeter Xu
The old name "avoid_reserve" can be too generic and can be used wrongly in the new call sites that want to allocate a hugetlb folio. It's confusing on two things: (1) whether one can opt-in to avoid global reservation, and (2) whether it should take more than one count. In reality, this flag is only used in an extremely hacky path, in an extremely hacky way in hugetlb CoW path only, and always use with 1 saying "skip global reservation". Rename the flag to avoid future abuse of this flag, making it a boolean so as to reflect its true representation that it's not a counter. To make it even harder to abuse, add a comment above the function to explain it. Link: https://lkml.kernel.org/r/20250107204002.2683356-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: stop using avoid_reserve flag in fork()Peter Xu
When fork() and stumble on top of a dma-pinned hugetlb private page, CoW must happen during fork() to guarantee dma coherency. In this specific path, hugetlb pages need to be allocated for the child process. Stop using avoid_reserve=1 flag here: it's not required to be used here, as dest_vma (which is destined to be a MAP_PRIVATE hugetlb vma) will have no private vma resv map, and that will make sure it won't be able to use a vma reservation later. No functional change intended with this change. Said that, it's still wanted to do this, so as to reduce the usage of avoid_reserve to the only one user, which is also why this flag was introduced initially in commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"). I don't see whoever else should set it at all. Further patch will clean up resv accounting based on this. Link: https://lkml.kernel.org/r/20250107204002.2683356-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: fix avoid_reserve to allow taking folio from subpoolPeter Xu
Patch series "mm/hugetlb: Refactor hugetlb allocation resv accounting", v2. This is a follow up on Ackerley's series here as replacement: https://lore.kernel.org/r/cover.1728684491.git.ackerleytng@google.com The goal of this series is to cleanup hugetlb resv accounting, especially during folio allocation, to decouple a few things: - Hugetlb folios v.s. Hugetlbfs: IOW, the hope is in the future hugetlb folios can be allocated completely without hugetlbfs. - Decouple VMA v.s. hugetlb folio allocations: allocating a hugetlb folio should not always require a hugetlbfs VMA. For example, either it got allocated from the inode level (see hugetlbfs_fallocate() where it used a pesudo VMA for allocation), or it can be allocated by other kernel subsystems. It paves way for other users to allocate hugetlb folios out of either system reservations, or subpools (instead of hugetlbfs, as a file system). For longer term, this prepares hugetlb as a separate concept versus hugetlbfs, so that hugetlb folios can be allocated by not only hugetlbfs and other things. Tests I've done: - I had a reproducer in patch 1 for the bug I found, this will start to work after patch 1 or the whole set applied. - Hugetlb regression tests (on x86_64 2MBs), includes: - All vmtests on hugetlbfs - libhugetlbfs test suite (which may fail some tests, but no new failures will be introduced by this series, so all such failures happen before this series so shouldn't be relevant). This patch (of 7): Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"), avoid_reserve was introduced for a special case of CoW on hugetlb private mappings, and only if the owner VMA is trying to allocate yet another hugetlb folio that is not reserved within the private vma reserved map. Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate"), alloc_huge_page() enforced to not consume any global reservation as long as avoid_reserve=true. This operation doesn't look correct, because even if it will enforce the allocation to not use global reservation at all, it will still try to take one reservation from the spool (if the subpool existed). Then since the spool reserved pages take from global reservation, it'll also take one reservation globally. Logically it can cause global reservation to go wrong. I wrote a reproducer below, trigger this special path, and every run of such program will cause global reservation count to increment by one, until it hits the number of free pages: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <stdio.h> #include <fcntl.h> #include <errno.h> #include <unistd.h> #include <stdlib.h> #include <sys/mman.h> #define MSIZE (2UL << 20) int main(int argc, char *argv[]) { const char *path; int *buf; int fd, ret; pid_t child; if (argc < 2) { printf("usage: %s <hugetlb_file>\n", argv[0]); return -1; } path = argv[1]; fd = open(path, O_RDWR | O_CREAT, 0666); if (fd < 0) { perror("open failed"); return -1; } ret = fallocate(fd, 0, 0, MSIZE); if (ret != 0) { perror("fallocate"); return -1; } buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); if (buf == MAP_FAILED) { perror("mmap() failed"); return -1; } /* Allocate a page */ *buf = 1; child = fork(); if (child == 0) { /* child doesn't need to do anything */ exit(0); } /* Trigger CoW from owner */ *buf = 2; munmap(buf, MSIZE); close(fd); unlink(path); return 0; } It can only reproduce with a sub-mount when there're reserved pages on the spool, like: # sysctl vm.nr_hugepages=128 # mkdir ./hugetlb-pool # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool Then run the reproducer on the mountpoint: # ./reproducer ./hugetlb-pool/test Fix it by taking the reservation from spool if available. In general, avoid_reserve is IMHO more about "avoid vma resv map", not spool's. I copied stable, however I have no intention for backporting if it's not a clean cherry-pick, because private hugetlb mapping, and then fork() on top is too rare to hit. Link: https://lkml.kernel.org/r/20250107204002.2683356-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20250107204002.2683356-2-peterx@redhat.com Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate") Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: replace free hugepage folios after migrationyangge
My machine has 4 NUMA nodes, each equipped with 32GB of memory. I have configured each NUMA node with 16GB of CMA and 16GB of in-use hugetlb pages. The allocation of contiguous memory via cma_alloc() can fail probabilistically. When there are free hugetlb folios in the hugetlb pool, during the migration of in-use hugetlb folios, new folios are allocated from the free hugetlb pool. After the migration is completed, the old folios are released back to the free hugetlb pool instead of being returned to the buddy system. This can cause test_pages_isolated() check to fail, ultimately leading to the failure of cma_alloc(). Call trace: cma_alloc() __alloc_contig_migrate_range() // migrate in-use hugepage test_pages_isolated() __test_page_isolated_in_pageblock() PageBuddy(page) // check if the page is in buddy To address this issue, we introduce a function named replace_free_hugepage_folios(). This function will replace the hugepage in the free hugepage pool with a new one and release the old one to the buddy system. After the migration of in-use hugetlb pages is completed, we will invoke replace_free_hugepage_folios() to ensure that these hugepages are properly released to the buddy system. Following this step, when test_pages_isolated() is executed for inspection, it will successfully pass. Additionally, when alloc_contig_range() is used to migrate multiple in-use hugetlb pages, it can result in some in-use hugetlb pages being released back to the free hugetlb pool and subsequently being reallocated and used again. For example: [huge 0] [huge 1] To migrate huge 0, we obtain huge x from the pool. After the migration is completed, we return the now-freed huge 0 back to the pool. When it's time to migrate huge 1, we can simply reuse the now-freed huge 0 from the pool. As a result, when replace_free_hugepage_folios() is executed, it cannot release huge 0 back to the buddy system. To address this issue, we should prevent the reuse of isolated free hugepages during the migration process. Link: https://lkml.kernel.org/r/1734503588-16254-1-git-send-email-yangge1116@126.com Link: https://lkml.kernel.org/r/1736582300-11364-1-git-send-email-yangge1116@126.com Signed-off-by: yangge <yangge1116@126.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13memcg/hugetlb: introduce mem_cgroup_charge_hugetlbJoshua Hahn
This patch introduces mem_cgroup_charge_hugetlb which combines the logic of mem_cgroup_hugetlb_try_charge / mem_cgroup_hugetlb_commit_charge and removes the need for mem_cgroup_hugetlb_cancel_charge. It also reduces the footprint of memcg in hugetlb code and consolidates all memcg related error paths into one. Link: https://lkml.kernel.org/r/20241211203951.764733-3-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/hugetlb: support FOLL_FORCE|FOLL_WRITEGuillaume Morin
Eric reported that PTRACE_POKETEXT fails when applications use hugetlb for mapping text using huge pages. Before commit 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings"), PTRACE_POKETEXT worked by accident, but it was buggy and silently ended up mapping pages writable into the page tables even though VM_WRITE was not set. In general, FOLL_FORCE|FOLL_WRITE does currently not work with hugetlb. Let's implement FOLL_FORCE|FOLL_WRITE properly for hugetlb, such that what used to work in the past by accident now properly works, allowing applications using hugetlb for text etc. to get properly debugged. This change might also be required to implement uprobes support for hugetlb [1]. [1] https://lore.kernel.org/lkml/ZiK50qob9yl5e0Xz@bender.morinfr.org/ Link: https://lkml.kernel.org/r/Z1NshNfWuzUCPebA@bender.morinfr.org Signed-off-by: Guillaume Morin <guillaume@morinfr.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Hagberg <ehagberg@janestreet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/hugetlb: don't map folios writable without VM_WRITE when copying during ↵David Hildenbrand
fork() If we have to trigger a hugetlb folio copy during fork() because the anon folio might be pinned, we currently unconditionally create a writable PTE. However, the VMA might not have write permissions (VM_WRITE) at that point. Fix it by checking the VMA for VM_WRITE. Make the code less error prone by moving checking for VM_WRITE into make_huge_pte(), and letting callers only specify whether we should try making it writable. A simple reproducer that longterm-pins the folios using liburing to then mprotect(PROT_READ) the folios befor fork() [1] results in: Before: [FAIL] access should not have worked After: [PASS] access did not work as expected [1] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/hugetlb-mkwrite-fork.c This is rather a corner case, so stable might not be warranted. Link: https://lkml.kernel.org/r/20241204153100.1967364-1-david@redhat.com Fixes: 4eae4efa2c29 ("hugetlb: do early cow when page pinned on src mm") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Guillaume Morin <guillaume@morinfr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13hugetlb: prioritize surplus allocation from current nodeKoichiro Den
Previously, surplus allocations triggered by mmap were typically made from the node where the process was running. On a page fault, the area was reliably dequeued from the hugepage_freelists for that node. However, since commit 003af997c8a9 ("hugetlb: force allocating surplus hugepages on mempolicy allowed nodes"), dequeue_hugetlb_folio_vma() may fall back to other nodes unnecessarily even if there is no MPOL_BIND policy, causing folios to be dequeued from nodes other than the current one. Also, allocating from the node where the current process is running is likely to result in a performance win, as mmap-ing processes often touch the area not so long after allocation. This change minimizes surprises for users relying on the previous behavior while maintaining the benefit introduced by the commit. So, prioritize the node the current process is running on when possible. Link: https://lkml.kernel.org/r/20241204165503.628784-1-koichiro.den@canonical.com Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Acked-by: Aristeu Rozanski <aris@ruivo.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: clear uffd-wp PTE/PMD state on mremap()Ryan Roberts
When mremap()ing a memory region previously registered with userfaultfd as write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in flag clearing leads to a mismatch between the vma flags (which have uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to trigger a warning in page_table_check_pte_flags() due to setting the pte to writable while uffd-wp is still set. Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any such mremap() so that the values are consistent with the existing clearing of VM_UFFD_WP. Be careful to clear the logical flag regardless of its physical form; a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE, huge PMD and hugetlb paths. Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/ Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl") Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-30mm: hugetlb: independent PMD page table shared countLiu Shixin
The folio refcount may be increased unexpectly through try_get_folio() by caller such as split_huge_pages. In huge_pmd_unshare(), we use refcount to check whether a pmd page table is shared. The check is incorrect if the refcount is increased by the above caller, and this can cause the page table leaked: BUG: Bad page state in process sh pfn:109324 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324 flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff) page_type: f2(table) raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000 raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000 page dumped because: nonzero mapcount ... CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G B 6.13.0-rc2master+ #7 Tainted: [B]=BAD_PAGE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x80/0xf8 dump_stack+0x18/0x28 bad_page+0x8c/0x130 free_page_is_bad_report+0xa4/0xb0 free_unref_page+0x3cc/0x620 __folio_put+0xf4/0x158 split_huge_pages_all+0x1e0/0x3e8 split_huge_pages_write+0x25c/0x2d8 full_proxy_write+0x64/0xd8 vfs_write+0xcc/0x280 ksys_write+0x70/0x110 __arm64_sys_write+0x24/0x38 invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x34/0x128 el0t_64_sync_handler+0xc8/0xd0 el0t_64_sync+0x190/0x198 The issue may be triggered by damon, offline_page, page_idle, etc, which will increase the refcount of page table. 1. The page table itself will be discarded after reporting the "nonzero mapcount". 2. The HugeTLB page mapped by the page table miss freeing since we treat the page table as shared and a shared page table will not be unmapped. Fix it by introducing independent PMD page table shared count. As described by comment, pt_index/pt_mm/pt_frag_refcount are used for s390 gmap, x86 pgds and powerpc, pt_share_count is used for x86/arm64/riscv pmds, so we can reuse the field as pt_share_count. Link: https://lkml.kernel.org/r/20241216071147.3984217-1-liushixin2@huawei.com Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ken Chen <kenneth.w.chen@intel.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18mm: use aligned address in copy_user_gigantic_page()Kefeng Wang
In current kernel, hugetlb_wp() calls copy_user_large_folio() with the fault address. Where the fault address may be not aligned with the huge page size. Then, copy_user_large_folio() may call copy_user_gigantic_page() with the address, while copy_user_gigantic_page() requires the address to be huge page size aligned. So, this may cause memory corruption or information leak, addtional, use more obvious naming 'addr_hint' instead of 'addr' for copy_user_gigantic_page(). Link: https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com Fixes: 530dd9926dc1 ("mm: memory: improve copy_user_large_folio()") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14memcg/hugetlb: add hugeTLB counters to memcgJoshua Hahn
This patch introduces a new counter to memory.stat that tracks hugeTLB usage, only if hugeTLB accounting is done to memory.current. This feature is enabled the same way hugeTLB accounting is enabled, via the memory_hugetlb_accounting mount flag for cgroupsv2. 1. Why is this patch necessary? Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds hugeTLB usage to memory.current. However, the metric is not reported in memory.stat. Given that users often interpret memory.stat as a breakdown of the value reported in memory.current, the disparity between the two reports can be confusing. This patch solves this problem by including the metric in memory.stat as well, but only if it is also reported in memory.current (it would also be confusing if the value was reported in memory.stat, but not in memory.current) Aside from the consistency between the two files, we also see benefits in observability. Userspace might be interested in the hugeTLB footprint of cgroups for many reasons. For instance, system admins might want to verify that hugeTLB usage is distributed as expected across tasks: i.e. memory-intensive tasks are using more hugeTLB pages than tasks that don't consume a lot of memory, or are seen to fault frequently. Note that this is separate from wanting to inspect the distribution for limiting purposes (in which case, hugeTLB controller makes more sense). 2. We already have a hugeTLB controller. Why not use that? It is true that hugeTLB tracks the exact value that we want. In fact, by enabling the hugeTLB controller, we get all of the observability benefits that I mentioned above, and users can check the total hugeTLB usage, verify if it is distributed as expected, etc. With this said, there are 2 problems: (a) They are still not reported in memory.stat, which means the disparity between the memcg reports are still there. (b) We cannot reasonably expect users to enable the hugeTLB controller just for the sake of hugeTLB usage reporting, especially since they don't have any use for hugeTLB usage enforcing [2]. 3. Implementation Details: In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge which is called from alloc_hugetlb_folio will set memcg for the folio only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters only if the feature is enabled. Regardless of whether memcg accounts for hugetlb, the newly added global counter is updated and shown in /proc/vmstat. The global counter is added because vmstats is the preferred framework for cgroup stats. It makes stat items consistent between global and cgroups. It also provides a per-node breakdown, which is useful. Because it does not use cgroup-specific hooks, we also keep generic MM code separate from memcg code. [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/ [2] Of course, we can't make a new patch for every feature that can be duplicated. However, since the existing solution of enabling the hugeTLB controller is an imperfect solution that still leaves a discrepancy between memory.stat and memory.curent, I think that it is reasonable to isolate the feature in this case. Link: https://lkml.kernel.org/r/20241101204402.1885383-1-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Suggested-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: add PTE_MARKER_GUARD PTE markerLorenzo Stoakes
Add a new PTE marker that results in any access causing the accessing process to segfault. This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range. This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination. Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect. We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or MADV_FREE. We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard marker be encountered there, as we explicitly do not support this operation and this should not occur. Link: https://lkml.kernel.org/r/f47f3d5acca2dcf9bbf655b6d33f3dc713e4a4a0.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabkba@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/hugetlb: perform vmemmap optimization batchly for specific node allocationsuhua
When HVO is enabled and huge page memory allocs are made, the freed memory can be aggregated into higher order memory in the following paths, which facilitates further allocs for higher order memory. echo 200000 > /proc/sys/vm/nr_hugepages echo 200000 > /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages grub default_hugepagesz=2M hugepagesz=2M hugepages=200000 Currently not support for releasing aggregations to higher order in the following way, which will releasing to lower order. grub: default_hugepagesz=2M hugepagesz=2M hugepages=0:100000,1:100000 This patch supports the release of huge page optimizations aggregates to higher order memory. eg: cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-xxx ... default_hugepagesz=2M hugepagesz=2M hugepages=0:100000,1:100000 Before: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Unmovable 55282 97039 99307 0 1 1 0 1 1 1 0 Node 0, zone Normal, type Movable 25 11 345 87 48 21 2 20 9 3 75061 Node 0, zone Normal, type Reclaimable 4 2 2 4 3 0 2 1 1 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 ... Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 1, zone Normal, type Unmovable 98888 99650 99679 2 3 1 2 2 2 0 0 Node 1, zone Normal, type Movable 1 1 0 1 1 0 1 0 1 1 75937 Node 1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 1, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 After: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 ... Node 0, zone Normal, type Unmovable 152 158 37 2 2 0 3 4 2 6 717 Node 0, zone Normal, type Movable 1 37 53 3 55 49 16 6 2 1 75000 Node 0, zone Normal, type Reclaimable 1 4 3 1 2 1 1 1 1 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 ... Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 1, zone Normal, type Unmovable 5 3 2 1 3 4 2 2 2 0 779 Node 1, zone Normal, type Movable 1 0 1 1 1 0 1 0 1 1 75849 Node 1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 1, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Link: https://lkml.kernel.org/r/20241012070802.1876-1-suhua1@kingsoft.com Signed-off-by: suhua <suhua1@kingsoft.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>