author | Luiz Capitulino <luizcap@redhat.com> | 2025-03-06 17:44:50 -0500 |
---|---|---|
committer | Andrew Morton <akpm@linux-foundation.org> | 2025-03-17 22:06:57 -0700 |
commit | 9039b9096ea27a20f0349d1537537663c935c8ed (patch) | |
tree | 0c74fd7d4f89264f50f3d0bcbc7ad1c052a07530 /mm | |
parent | 11e88e9265ec192cff33fc2e43e36c211851b32c (diff) |
mm: page_ext: add an iteration API for page extensions
Patch series "mm: page_ext: Introduce new iteration API", v3.
Introduction
============
[ Thanks to David Hildenbrand for identifying the root cause of this
issue and providing guidance on how to fix it. The new API idea, bugs
and misconceptions are all mine though ]
Currently, trying to reserve 1G pages with page_owner=on and sparsemem
causes a crash. The reproducer is very simple:
1. Build the kernel with CONFIG_SPARSEMEM=y and the table extensions
2. Pass 'default_hugepagesz=1G page_owner=on' in the kernel command-line
3. Reserve one 1G page at run-time, this should crash (see patch 1 for
backtrace)
[ A crash with page_table_check is also possible, but harder to trigger ]
Apparently, starting with commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP
for gigantic folios") we now pass the full allocation order to page
extension clients and the page extension implementation assumes that all
PFNs of an allocation range will be stored in the same memory section (which
is not true for 1G pages).
To fix this, this series introduces a new iteration API for page extension
objects. The API checks if the next page extension object can be retrieved
from the current section or if it needs to look it up in another
section.
Please find all details in patch 1.
I tested this series on arm64 and x86 by reserving 1G pages at run-time
and doing kernel builds (always with page_owner=on and page_table_check=on).
This patch (of 3):
The page extension implementation assumes that all page extensions of a
given page order are stored in the same memory section. The function
page_ext_next() relies on this assumption by adding an offset to the
current object to return the next adjacent page extension.
This behavior works as expected for flatmem but fails for sparsemem when
using 1G pages. The commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for
gigantic folios") exposes this issue, making a crash possible when
using page_owner or page_table_check page extensions.
The problem is that for 1G pages, the page extensions may span memory
section boundaries and be stored in different memory sections. This issue
was not visible before commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP
for gigantic folios") because alloc_contig_pages() never passed more than
MAX_PAGE_ORDER to post_alloc_hook(). However, the series that introduced
that commit changed this behavior, allowing the full 1G page order to
be passed.
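To put numbers on the section-boundary problem described above, here is a
small worked example. It is only a sketch: the EX_-prefixed constants are
assumptions (4 KiB base pages, 128 MiB memory sections as on x86-64, and a
maximum buddy order of 10), not values taken from this patch, and other
configurations differ.

```c
#include <assert.h>

/* Assumed configuration, for illustration only. */
#define EX_PAGE_SHIFT         12	/* 4 KiB base pages */
#define EX_SECTION_SIZE_BITS  27	/* 128 MiB memory sections */
#define EX_MAX_PAGE_ORDER     10	/* largest buddy allocation order */

#define EX_PFN_SECTION_SHIFT  (EX_SECTION_SIZE_BITS - EX_PAGE_SHIFT)
#define EX_PAGES_PER_SECTION  (1UL << EX_PFN_SECTION_SHIFT)

/* Number of base pages backing one 1G page. */
static unsigned long ex_pages_per_1g(void)
{
	return (1UL << 30) >> EX_PAGE_SHIFT;
}

/* Number of memory sections touched by the range [start_pfn, start_pfn + nr_pages). */
static unsigned long ex_sections_spanned(unsigned long start_pfn,
					 unsigned long nr_pages)
{
	unsigned long first, last;

	assert(nr_pages > 0);
	first = start_pfn >> EX_PFN_SECTION_SHIFT;
	last = (start_pfn + nr_pages - 1) >> EX_PFN_SECTION_SHIFT;
	return last - first + 1;
}
```

Under these assumptions a 1G page covers 262144 PFNs spread over 8
sections, whereas an aligned order-10 allocation (1024 PFNs) stays within
a single section. That is consistent with the issue not being visible
while allocations were capped at MAX_PAGE_ORDER.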
Reproducer:
1. Build the kernel with CONFIG_SPARSEMEM=y and table extensions
support
2. Pass 'default_hugepagesz=1G page_owner=on' in the kernel command-line
3. Reserve one 1G page at run-time, this should crash (backtrace below)
To address this issue, this commit introduces a new API for iterating
through page extensions. The main iteration macro is for_each_page_ext()
and it must be called with the RCU read lock taken. Here's a usage
example:
"""
struct page_ext_iter iter;
struct page_ext *page_ext;
...
rcu_read_lock();
for_each_page_ext(page, 1 << order, page_ext, iter) {
struct my_page_ext *obj = get_my_page_ext_obj(page_ext);
...
}
rcu_read_unlock();
"""
The loop construct uses page_ext_iter_next() which checks to see if we
have crossed sections in the iteration. In this case,
page_ext_iter_next() retrieves the next page_ext object from another
section.
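The section-crossing check can be modelled in plain userspace C. The sketch
below is a toy model, not the kernel implementation: every toy_-prefixed
name, the 4-page section size and the three-section layout are hypothetical
and exist only to illustrate the idea of stepping to the adjacent entry
within a section and doing a fresh lookup when a boundary is crossed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: all names and sizes here are hypothetical. */
#define TOY_PAGES_PER_SECTION 4UL
#define TOY_NR_SECTIONS       3UL

struct toy_page_ext { unsigned long pfn; };

/* One separately defined table per section, as with sparsemem. */
static struct toy_page_ext toy_sec0[TOY_PAGES_PER_SECTION];
static struct toy_page_ext toy_sec1[TOY_PAGES_PER_SECTION];
static struct toy_page_ext toy_sec2[TOY_PAGES_PER_SECTION];
static struct toy_page_ext *const toy_tables[] = { toy_sec0, toy_sec1, toy_sec2 };

struct toy_iter {
	unsigned long index;		/* pages visited so far */
	unsigned long start_pfn;
	struct toy_page_ext *ext;	/* current entry, NULL when exhausted */
};

/* Full lookup: find the section first, then the entry inside it. */
static struct toy_page_ext *toy_lookup(unsigned long pfn)
{
	unsigned long sec = pfn / TOY_PAGES_PER_SECTION;

	if (sec >= TOY_NR_SECTIONS)
		return NULL;
	return &toy_tables[sec][pfn % TOY_PAGES_PER_SECTION];
}

static struct toy_page_ext *toy_iter_begin(struct toy_iter *it, unsigned long pfn)
{
	it->index = 0;
	it->start_pfn = pfn;
	it->ext = toy_lookup(pfn);
	return it->ext;
}

static struct toy_page_ext *toy_iter_next(struct toy_iter *it)
{
	unsigned long pfn;

	if (!it->ext)
		return NULL;
	pfn = it->start_pfn + ++it->index;
	if (pfn % TOY_PAGES_PER_SECTION == 0)
		it->ext = toy_lookup(pfn);	/* crossed into another section */
	else
		it->ext = it->ext + 1;		/* same section: adjacent entry */
	return it->ext;
}

/* Record each entry's own PFN so that a walk can be verified. */
static void toy_init(void)
{
	unsigned long pfn;

	for (pfn = 0; pfn < TOY_NR_SECTIONS * TOY_PAGES_PER_SECTION; pfn++) {
		struct toy_page_ext *ext = toy_lookup(pfn);

		assert(ext);
		ext->pfn = pfn;
	}
}

/* Walk up to @count entries; return how many carried the expected PFN. */
static unsigned long toy_walk(unsigned long start_pfn, unsigned long count)
{
	struct toy_iter it;
	struct toy_page_ext *ext;
	unsigned long matched = 0;

	for (ext = toy_iter_begin(&it, start_pfn);
	     ext && it.index < count;
	     ext = toy_iter_next(&it))
		if (ext->pfn == start_pfn + it.index)
			matched++;
	return matched;
}
```

After toy_init(), toy_walk(2, 8) visits PFNs 2 to 9 across all three toy
sections and finds the expected entry each time. Only the two steps that
land on a section boundary pay for a full lookup; the others just step to
the adjacent entry.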
Thanks to David Hildenbrand for helping identify the root cause and
providing suggestions on how to fix and optimize the solution (final
implementation and bugs are all mine though).
Lastly, here's the backtrace (without KASAN you can get random crashes):
[ 76.052526] BUG: KASAN: slab-out-of-bounds in __update_page_owner_handle+0x238/0x298
[ 76.060283] Write of size 4 at addr ffff07ff96240038 by task tee/3598
[ 76.066714]
[ 76.068203] CPU: 88 UID: 0 PID: 3598 Comm: tee Kdump: loaded Not tainted 6.13.0-rep1 #3
[ 76.076202] Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 2.10.20220810 (SCP: 2.10.20220810) 2022/08/10
[ 76.088972] Call trace:
[ 76.091411] show_stack+0x20/0x38 (C)
[ 76.095073] dump_stack_lvl+0x80/0xf8
[ 76.098733] print_address_description.constprop.0+0x88/0x398
[ 76.104476] print_report+0xa8/0x278
[ 76.108041] kasan_report+0xa8/0xf8
[ 76.111520] __asan_report_store4_noabort+0x20/0x30
[ 76.116391] __update_page_owner_handle+0x238/0x298
[ 76.121259] __set_page_owner+0xdc/0x140
[ 76.125173] post_alloc_hook+0x190/0x1d8
[ 76.129090] alloc_contig_range_noprof+0x54c/0x890
[ 76.133874] alloc_contig_pages_noprof+0x35c/0x4a8
[ 76.138656] alloc_gigantic_folio.isra.0+0x2c0/0x368
[ 76.143616] only_alloc_fresh_hugetlb_folio.isra.0+0x24/0x150
[ 76.149353] alloc_pool_huge_folio+0x11c/0x1f8
[ 76.153787] set_max_huge_pages+0x364/0xca8
[ 76.157961] __nr_hugepages_store_common+0xb0/0x1a0
[ 76.162829] nr_hugepages_store+0x108/0x118
[ 76.167003] kobj_attr_store+0x3c/0x70
[ 76.170745] sysfs_kf_write+0xfc/0x188
[ 76.174492] kernfs_fop_write_iter+0x274/0x3e0
[ 76.178927] vfs_write+0x64c/0x8e0
[ 76.182323] ksys_write+0xf8/0x1f0
[ 76.185716] __arm64_sys_write+0x74/0xb0
[ 76.189630] invoke_syscall.constprop.0+0xd8/0x1e0
[ 76.194412] do_el0_svc+0x164/0x1e0
[ 76.197891] el0_svc+0x40/0xe0
[ 76.200939] el0t_64_sync_handler+0x144/0x168
[ 76.205287] el0t_64_sync+0x1ac/0x1b0
Link: https://lkml.kernel.org/r/cover.1741301089.git.luizcap@redhat.com
Link: https://lkml.kernel.org/r/a45893880b7e1601082d39d2c5c8b50bcc096305.1741301089.git.luizcap@redhat.com
Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios")
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Luiz Capitulino <luizcap@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'mm')
-rw-r--r-- | mm/page_ext.c | 13 |
1 files changed, 13 insertions, 0 deletions
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 641d93f6af4c..c351fdfe9e9a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -508,6 +508,19 @@ void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
 #endif
 
 /**
+ * page_ext_lookup() - Lookup a page extension for a PFN.
+ * @pfn: PFN of the page we're interested in.
+ *
+ * Must be called with RCU read lock taken and @pfn must be valid.
+ *
+ * Return: NULL if no page_ext exists for this page.
+ */
+struct page_ext *page_ext_lookup(unsigned long pfn)
+{
+	return lookup_page_ext(pfn_to_page(pfn));
+}
+
+/**
  * page_ext_get() - Get the extended information for a page.
  * @page: The page we're interested in.
  *