|
Let's sanity-check in folio_set_order() whether we would be trying to
create a folio with an order that would make it exceed MAX_FOLIO_ORDER.
This will enable the check whenever a folio/compound page is initialized
through prep_compound_head() / prep_compound_page() with
CONFIG_DEBUG_VM set.
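A minimal sketch of the check being added (abridged; the real helper
lives in mm/internal.h and the warning is only compiled in with
CONFIG_DEBUG_VM):

	static inline void folio_set_order(struct folio *folio, unsigned int order)
	{
		if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
			return;
		/* New: catch orders that would exceed the supported maximum. */
		VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER);
		folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
	}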
Link: https://lkml.kernel.org/r/20250901150359.867252-11-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Grepping for "prep_compound_page" leaves one clueless as to how devdax
gets its compound pages initialized.
Let's add a comment that might help find this open-coded
prep_compound_page() initialization more easily.
Further, let's be less smart about the ordering of initialization and just
perform the prep_compound_head() call after all tail pages have been
initialized: just like prep_compound_page() does.
No need for a comment to describe the initialization order: again, just
like prep_compound_page().
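Roughly, the resulting initialization order in the devdax path (a sketch;
prep_compound_tail()/prep_compound_head() are the existing helpers, the
surrounding loop is abbreviated):

	/* Tail pages first, head last: just like prep_compound_page(). */
	for (i = 1; i < nr_pages; i++)
		prep_compound_tail(page, i);
	prep_compound_head(page, order);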
Link: https://lkml.kernel.org/r/20250901150359.867252-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's check that no hstate that corresponds to an unreasonable folio size
is registered by an architecture. If we were to succeed in registering
one, we could later try allocating an unsupported gigantic folio size.
Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER
is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we
have to use a BUILD_BUG_ON_INVALID() to make it compile.
No existing kernel configuration should be able to trigger this check:
either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured, or
gigantic folios will not exceed a memory section (the case on sparc).
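Sketches of the two checks described above (placement and exact form may
differ in the patch itself):

	/* Build time: BUILD_BUG_ON_INVALID() only syntax-checks the
	 * expression, so a dynamic HUGETLB_PAGE_ORDER (powerpc) compiles. */
	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);

	/* Run time: refuse to register an hstate with an unreasonable order. */
	if (WARN_ON(order > MAX_FOLIO_ORDER))
		return;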
Link: https://lkml.kernel.org/r/20250901150359.867252-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's reject unreasonable folio sizes early, where we can still fail.
We'll add sanity checks to prep_compound_head()/prep_compound_page()
next.
Is there a way to configure a system such that unreasonable folio sizes
are possible? It would already be rather questionable.
If so, we'd probably want to bail out earlier, where we can avoid a WARN
and instead report a proper error message that indicates what went
wrong.
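The general shape of such an early rejection (illustrative only):

	if (WARN_ON_ONCE(nr_pages > MAX_FOLIO_NR_PAGES))
		return -EINVAL;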
Link: https://lkml.kernel.org/r/20250901150359.867252-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's reject unreasonable folio sizes early in
alloc_contig_range_noprof(), which in turn makes folio_alloc_gigantic()
reject them properly.
To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
and calculate MAX_FOLIO_NR_PAGES based on that.
While at it, let's just make the order a "const unsigned order".
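A sketch of the restructuring (the value of MAX_FOLIO_ORDER is
config-dependent; PUD_ORDER here is purely illustrative):

	/* Define the order first, derive the page count from it. */
	#define MAX_FOLIO_ORDER		PUD_ORDER	/* illustrative */
	#define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)

	/* In alloc_contig_range_noprof(), roughly: */
	const unsigned int order = ilog2(end - start);

	if (WARN_ON_ONCE(order > MAX_FOLIO_ORDER))
		return -EINVAL;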
Link: https://lkml.kernel.org/r/20250901150359.867252-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: remove nth_page()", v2.
As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.
To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to then go from (++pfn)->page, which is rather nasty.
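For reference, this is essentially the current definition in
include/linux/mm.h:

	#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
	#define nth_page(page, n)	pfn_to_page(page_to_pfn((page)) + (n))
	#else
	#define nth_page(page, n)	((page) + (n))
	#endif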
Likely, many people have no idea when nth_page() is required and when it
might be dropped.
We refer to such problematic PFN ranges as "non-contiguous pages". If we
only deal with "contiguous pages", there is no need for nth_page().
Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).
So let's handle all that, add sanity checks, and remove nth_page().
Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #13 : disallow folios to have non-contiguous pages
Patch #14 -> #20 : remove nth_page() usage within folios
Patch #22 : disallow CMA allocations of non-contiguous pages
Patch #23 -> #33 : sanity-check + remove nth_page() usage within SG entry
Patch #34 : sanity-check + remove nth_page() usage in
unpin_user_page_range_dirty_lock()
Patch #35 : remove nth_page() in kfence
Patch #36 : adjust stale comment regarding nth_page
Patch #37 : mm: remove nth_page()
A lot of this is inspired by the discussion at [1] between Linus, Jason
and me, so kudos to them.
This patch (of 37):
In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.
However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user from disabling VMEMMAP: just like we already do for
arm64, s390 and x86.
So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without
SPARSEMEM_VMEMMAP.
This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc. All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).
This is a preparation for not supporting
(1) folio sizes that exceed a single memory section
(2) CMA allocations of non-contiguous page ranges
in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly failing).
Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MFD_ALL_FLAGS is already an unsigned int. Remove redundant casts to
unsigned int.
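Illustrative before/after (a hypothetical call site, not the exact hunk):

	/* before */
	if (flags & ~(unsigned int)MFD_ALL_FLAGS)
		return -EINVAL;
	/* after: MFD_ALL_FLAGS is already unsigned int, the cast adds nothing */
	if (flags & ~MFD_ALL_FLAGS)
		return -EINVAL;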
Link: https://lkml.kernel.org/r/efbbe6093b64a5b19f974871d5262d6e75dff2c0.1756639225.git.joeypabalinas@gmail.com
Signed-off-by: Joey Pabalinas <joeypabalinas@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joey Pabalinas <joeypabalinas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
zpdesc is the page descriptor used by the zsmalloc backend allocator,
which in turn is used by zswap and zram. The zpool layer is gone.
Link: https://lkml.kernel.org/r/20250829162212.208258-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With zswap using zsmalloc directly, there are no more in-tree users of
this code. Remove it.
With zpool gone, zsmalloc is now always a simple dependency and no
longer something the user needs to configure. Hide CONFIG_ZSMALLOC
from the user and have zswap and zram pull it in as needed.
Link: https://lkml.kernel.org/r/20250829162212.208258-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: remove zpool".
zpool is an indirection layer for zswap to switch between multiple
allocator backends at runtime. Since 6.15, zsmalloc is the only allocator
left in-tree, so there is no point in keeping zpool around.
This patch (of 3):
zswap goes through the zpool layer to enable runtime-switching of
allocator backends for compressed data. However, since zbud and z3fold
were removed in 6.15, zsmalloc has been the only option available.
As such, the zpool indirection is unnecessary. Make zswap deal with
zsmalloc directly. This is comparable to zram, which also directly
interacts with zsmalloc and has never supported a different backend.
Note that this does not preclude future improvements and experiments with
different allocation strategies. Should it become necessary, it's
possible to provide an alternate implementation for the zsmalloc API,
selectable at compile time. However, zsmalloc is also rather mature and
feature rich, with years of widespread production exposure; it's
encouraged to make incremental improvements rather than fork it.
In any case, the complexity of runtime pluggability seems excessive and
unjustified at this time. Switch zswap to zsmalloc to remove the last
user of the zpool API.
[hannes@cmpxchg.org: fix default compressor test]
Link: https://lkml.kernel.org/r/20250915153640.GA828739@cmpxchg.org
Link: https://lkml.kernel.org/r/20250829162212.208258-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20250829162212.208258-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Nacked-by: Vitaly Wool <vitaly.wool@konsulko.se>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "kasan: unify kasan_enabled() and remove arch-specific
implementations", v6.
This patch series addresses the fragmentation in KASAN initialization
across architectures by introducing a unified approach that eliminates
duplicate static keys and arch-specific kasan_arch_is_ready()
implementations.
The core issue is that different architectures have inconsistent approaches
to KASAN readiness tracking:
- PowerPC, LoongArch, and UML each implement their own kasan_arch_is_ready()
- Only HW_TAGS mode had a unified static key (kasan_flag_enabled)
- Generic and SW_TAGS modes relied on arch-specific solutions
or always-on behavior
This patch (of 2):
Introduce CONFIG_ARCH_DEFER_KASAN to identify architectures [1] that need
to defer KASAN initialization until shadow memory is properly set up, and
unify the static key infrastructure across all KASAN modes.
[1] PowerPC, UML and LoongArch select ARCH_DEFER_KASAN.
Let's replace kasan_arch_is_ready() with the existing kasan_enabled()
check, which tests the static key when the architecture selects
ARCH_DEFER_KASAN or supports HW_TAGS mode. For other architectures,
kasan_enabled() resolves the enablement at compile time.
Now KASAN users can use a single kasan_enabled() check everywhere.
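A simplified sketch of the unified check after this change:

	#if defined(CONFIG_ARCH_DEFER_KASAN) || defined(CONFIG_KASAN_HW_TAGS)
	/* One shared static key, flipped once shadow/tag setup is done. */
	DECLARE_STATIC_KEY_FALSE(kasan_flag_enabled);

	static __always_inline bool kasan_enabled(void)
	{
		return static_branch_likely(&kasan_flag_enabled);
	}
	#else
	/* No deferral needed: enablement is known at compile time. */
	static __always_inline bool kasan_enabled(void)
	{
		return IS_ENABLED(CONFIG_KASAN);
	}
	#endif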
Link: https://lkml.kernel.org/r/20250810125746.1105476-1-snovitoll@gmail.com
Link: https://lkml.kernel.org/r/20250810125746.1105476-2-snovitoll@gmail.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217049
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> #powerpc
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Gow <davidgow@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@loongson.cn>
Cc: Marco Elver <elver@google.com>
Cc: Qing Zhang <zhangqing@loongson.cn>
Cc: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
changes required by mm-stable material: hugetlb and damon.
|
|
There can be multiple inode switch works that are trying to switch
inodes to / from the same wb. This can happen in particular if some
cgroup exits which owns many (thousands) inodes and we need to switch
them all. In this case several inode_switch_wbs_work_fn() instances will
be just spinning on the same wb->list_lock while only one of them makes
forward progress. This wastes CPU cycles and quickly leads to softlockup
reports and an unusable system.
Instead of running several inode_switch_wbs_work_fn() instances in
parallel switching to the same wb and contending on wb->list_lock, run
just one work item per wb and manage a queue of isw items switching to
this wb.
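The shape of the fix, as a sketch (the queue/dequeue helper names here
are assumed for illustration only):

	/* One work item per wb drains a queue of pending switch contexts,
	 * so wb->list_lock sees a single switching worker at a time. */
	static void inode_switch_wbs_work_fn(struct work_struct *work)
	{
		struct bdi_writeback *wb = wb_from_switch_work(work);
		struct inode_switch_wbs_context *isw;

		while ((isw = isw_dequeue(wb)))
			process_inode_switch_wbs(wb, isw);
	}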
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
This patch paves the path to enable huge mappings in vmalloc space and
linear map space by default on arm64. For this we must ensure that we
can handle any permission games on the kernel (init_mm) pagetable.
Previously, __change_memory_common() used apply_to_page_range() which
does not support changing permissions for block mappings. We move away
from this by using the pagewalk API, similar to what riscv does right
now. It is the responsibility of the caller to ensure that the range
over which permissions are being changed falls on leaf mapping
boundaries. For systems with BBML2, this will be handled in future
patches by dynamically splitting the mappings when required.
Unlike apply_to_page_range(), the pagewalk API currently enforces the
init_mm.mmap_lock to be held. To avoid the unnecessary bottleneck of the
mmap_lock for our usecase, this patch extends this generic API to be
used locklessly, so as to retain the existing behaviour for changing
permissions. Apart from this reason, it is noted at [1] that KFENCE can
manipulate kernel pgtable entries during softirqs. It does this by
calling set_memory_valid() -> __change_memory_common(). This being a
non-sleepable context, we cannot take the init_mm mmap lock.
Add comments to highlight the conditions under which we can use the
lockless variant - no underlying VMA, and the user having exclusive
control over the range, thus guaranteeing no concurrent access.
We require that the start and end of a given range do not partially
overlap block mappings, or cont mappings. Return -EINVAL in case a
partial block mapping is detected in any of the PGD/P4D/PUD/PMD levels;
add a corresponding comment in update_range_prot() to warn that
eliminating such a condition is the responsibility of the caller.
Note that, the pte level callback may change permissions for a whole
contpte block, and that will be done one pte at a time, as opposed to an
atomic operation for the block mappings. This is fine as any access will
decode either the old or the new permission until the TLBI.
apply_to_page_range() currently performs all pte level callbacks while
in lazy mmu mode. Since arm64 can optimize performance by batching
barriers when modifying kernel pgtables in lazy mmu mode, we would like
to continue to benefit from this optimisation. Unfortunately
walk_kernel_page_table_range() does not use lazy mmu mode. However,
since the pagewalk framework is not allocating any memory, we can safely
bracket the whole operation inside lazy mmu mode ourselves. Therefore,
wrap the call to walk_kernel_page_table_range() with the lazy MMU
helpers.
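A sketch of the bracketing (arguments abbreviated; the walk ops name is
assumed, while walk_kernel_page_table_range() and the lazy MMU helpers
are existing interfaces):

	arch_enter_lazy_mmu_mode();
	ret = walk_kernel_page_table_range(start, start + size,
					   &pageattr_ops, NULL, &data);
	arch_leave_lazy_mmu_mode();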
Link: https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/ [1]
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Yang Shi <yshi@os.amperecomputing.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.17-rc7).
No conflicts.
Adjacent changes:
drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
9536fbe10c9d ("net/mlx5e: Add PSP steering in local NIC RX")
7601a0a46216 ("net/mlx5e: Add a miss level for ipsec crypto offload")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"15 hotfixes. 11 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 13 of these
fixes are for MM.
The usual shower of singletons, plus
- fixes from Hugh to address various misbehaviors in get_user_pages()
- patches from SeongJae to address a quite severe issue in DAMON
- another series also from SeongJae which completes some fixes for a
DAMON startup issue"
* tag 'mm-hotfixes-stable-2025-09-17-21-10' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
zram: fix slot write race condition
nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/*
samples/damon/mtier: avoid starting DAMON before initialization
samples/damon/prcl: avoid starting DAMON before initialization
samples/damon/wsse: avoid starting DAMON before initialization
MAINTAINERS: add Lance Yang as a THP reviewer
MAINTAINERS: add Jann Horn as rmap reviewer
mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control
mm/damon/core: introduce damon_call_control->dealloc_on_cancel
mm: folio_may_be_lru_cached() unless folio_test_large()
mm: revert "mm: vmscan.c: fix OOM on swap stress test"
mm: revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
mm/gup: check ref_count instead of lru before migration
|
|
kmsan_handle_dma_sg() has a call to the kmsan_handle_dma() function which
was missed during the conversion to physical addresses. Update that
caller too and fix the following compilation error:
mm/kmsan/hooks.c:372:6: error: too many arguments to function call, expected 3, have 4
371 | kmsan_handle_dma(sg_page(item), item->offset, item->length,
| ~~~~~~~~~~~~~~~~
372 | dir);
| ^~~
mm/kmsan/hooks.c:362:19: note: 'kmsan_handle_dma' declared here
362 | EXPORT_SYMBOL_GPL(kmsan_handle_dma);
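Presumably the fixed call site ends up as follows (sg_phys() being the
standard scatterlist helper):

	/* kmsan_handle_dma() now takes a physical address, not (page, offset). */
	kmsan_handle_dma(sg_phys(item), item->length, dir);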
Fixes: 6eb1e769b2c1 ("kmsan: convert kmsan_handle_dma to use physical addresses")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509170638.AMGNCMEE-lkp@intel.com/
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/4b2d7d0175b30177733bbbd42bf979d77eb73c29.1758090947.git.leon@kernel.org
|
|
alloc_slab_obj_exts() should mark failed obj_exts vector allocations
independently of whether the vector is being allocated for a new or an
existing slab. The current implementation skips doing this for existing
slabs. Fix this by marking failed allocations unconditionally.
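A simplified sketch of the fix (mark_failed_objexts_alloc() is the helper
from the commit cited in Fixes:):

	vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp, node);
	if (!vec) {
		/* Was done only for new slabs; now done unconditionally. */
		mark_failed_objexts_alloc(slab);
		return -ENOMEM;
	}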
Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations")
Reported-by: Shakeel Butt <shakeel.butt@linux.dev>
Closes: https://lore.kernel.org/all/avhakjldsgczmq356gkwmvfilyvf7o6temvcmtt5lqd4fhp5rk@47gp2ropyixg/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org # v6.10+
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When object extension vector allocation fails, we set slab->obj_exts to
OBJEXTS_ALLOC_FAIL to indicate the failure. Later, once the vector is
successfully allocated, we will use this flag to mark codetag references
stored in that vector as empty to avoid codetag warnings.
slab_obj_exts(), used to retrieve the slab->obj_exts vector pointer,
checks slab->obj_exts for being either NULL or a pointer with the
MEMCG_DATA_OBJEXTS bit set. However, it does not handle the case when
slab->obj_exts equals OBJEXTS_ALLOC_FAIL. Add the missing condition to
avoid an extra warning.
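A sketch of the debug check with the added condition (simplified):

	unsigned long obj_exts = READ_ONCE(slab->obj_exts);

	/* NULL, a tagged vector pointer, or the failure marker are all OK. */
	VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS) &&
		       obj_exts != OBJEXTS_ALLOC_FAIL, slab_page(slab));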
Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations")
Reported-by: Shakeel Butt <shakeel.butt@linux.dev>
Closes: https://lore.kernel.org/all/jftidhymri2af5u3xtcqry3cfu6aqzte3uzlznhlaylgrdztsi@5vpjnzpsemf5/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org # v6.10+
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The struct slab pointer has been obtained from the object being freed on
all the paths that lead to this function. In all cases this already
includes the test for slab type of the struct page which struct slab is
overlaying. Thus we would not reach this function if it was not a valid
slab pointer in the first place.
One less obvious case is that kmem_cache_free() trusts virt_to_slab()
blindly so it may be NULL if the slab type check is false. But with
SLAB_CONSISTENCY_CHECKS, cache_from_obj() called also from
kmem_cache_free() catches this and returns NULL, which terminates
freeing immediately.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
We touch slab->freelist and slab->inuse before checking the slab pointer
is actually sane. Do that validation first, which will be safer. We can
thus also remove the check from alloc_debug_processing().
This adds a new "s->flags & SLAB_CONSISTENCY_CHECKS" test but
alloc_single_from_partial() is only called for caches with debugging
enabled so it's acceptable.
In alloc_single_from_new_slab() we just created the struct slab and call
alloc_debug_processing() to mainly set up redzones, tracking etc, while
not really expecting the consistency checks to fail. Thus don't validate
it there.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In alloc_debug_processing() we can call validate_slab_ptr() upfront and
then don't need to recheck when alloc_consistency_checks() fails for
other reasons.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
We will want to do the validation earlier in some callers or remove it
completely, so extract it from check_slab() first. No functional change.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
This will make it clear where we currently cast struct slab to folio
only to check the slab type, and allow changing the implementation
later with the memdesc conversion.
For now use a struct page based implementation instead of struct folio
to be compatible with further upcoming changes.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
We already know that slab is a valid slab as that's checked by the
caller. In the future, we won't be able to get to a slab pointer
from a non-slab page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the prefix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
for_each_free_mem_pfn_range_in_zone_from() and its "backend" implementation
__next_mem_pfn_range_in_zone() were only used by deferred initialization of
the memory map.
Remove them as they are not used anymore.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
deferred_init_memmap_chunk() calls deferred_init_maxorder() to initialize
struct pages in MAX_ORDER_NR_PAGES because according to commit 0e56acae4b4d
("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger
sections") this provides better cache locality than initializing the memory
map in larger sections.
The looping through free memory ranges is quite cumbersome in the current
implementation as it is divided between deferred_init_memmap_chunk() and
deferred_init_maxorder(). Besides, the latter has two loops, one that
initializes struct pages and another one that frees them.
There is no need for two loops because it is safe to free pages in groups
smaller than MAX_ORDER_NR_PAGES. Even if the lookup for a buddy page
accesses a struct page ahead of the pages being initialized, that page is
guaranteed to be initialized either by memmap_init_reserved_pages() or by
init_unavailable_range().
Simplify the code by moving initialization and freeing of the pages into
deferred_init_memmap_chunk() and dropping deferred_init_maxorder().
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
deferred_init_memmap() loops over free memory ranges and creates a
padata_mt_job for every free range that intersects with the zone being
initialized.
padata_do_multithreaded() then splits every such range to several chunks
and runs a thread that initializes struct pages in that chunk using
deferred_init_memmap_chunk(). The number of threads is limited by the
number of CPUs on the node (or 1 for memoryless nodes).
Looping through free memory ranges is then repeated in
deferred_init_memmap_chunk() first to find the first range that should be
initialized and then to traverse the ranges until the end of the chunk is
reached.
Remove the loop over free memory regions in deferred_init_memmap() and pass
the entire zone to padata_do_multithreaded() so that it will be divided to
several chunks by the parallelization code.
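A sketch of the single job covering the whole zone (struct padata_mt_job
and its fields are real; the values and the thread_fn wrapper name are
illustrative):

	struct padata_mt_job job = {
		.thread_fn	= deferred_init_memmap_job,
		.fn_arg		= zone,
		.start		= zone->zone_start_pfn,
		.size		= zone->spanned_pages,
		.align		= PAGES_PER_SECTION,
		.min_chunk	= PAGES_PER_SECTION,
		.max_threads	= max_threads,
	};

	padata_do_multithreaded(&job);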
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
deferred_grow_zone() initializes one or more sections in the memory map
if buddy runs out of initialized struct pages when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
It loops through memblock regions and initializes and frees pages in
MAX_ORDER_NR_PAGES chunks.
Essentially the same loop is implemented in deferred_init_memmap_chunk(),
the only actual difference is that deferred_init_memmap_chunk() does not
count initialized pages.
Make deferred_init_memmap_chunk() count the initialized pages and return
their number, wrap it with deferred_init_memmap_job() for multithreaded
initialization with padata_do_multithreaded() and replace open-coded
initialization of struct pages in deferred_grow_zone() with a call to
deferred_init_memmap_chunk().
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
We iterate pfn alignments from order 0 up to MAX_PAGE_ORDER to find a
large buddy. But if the order is less than the alignment order of
start_pfn, we get the same pfn and do the same check again.
Iterate from start_pfn's alignment order instead to reduce the duplicated
work.
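A sketch of the starting-order computation (illustrative):

	/* Orders below start_pfn's alignment order would retest the same
	 * (already aligned) pfn, so start from the alignment order. */
	order = start_pfn ? min_t(unsigned int, __ffs(start_pfn), MAX_PAGE_ORDER)
			  : MAX_PAGE_ORDER;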
[richard.weiyang@gmail.com: add comment on assignment of order]
Link: https://lkml.kernel.org/r/20250828091618.7869-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250902025807.11467-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Adopting addr_unit would make DAMON_MIN_REGION 'addr_unit * 4096' bytes and
cause data alignment issues[1].
Add damon_ctx->min_sz_region to change DAMON_MIN_REGION from a global
macro value to a per-context variable.
Link: https://lkml.kernel.org/r/20250828171242.59810-12-sj@kernel.org
Link: https://lore.kernel.org/all/527714dd-0e33-43ab-bbbd-d89670ba79e7@huawei.com [1]
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Only DAMON kernel API callers can use the addr_unit parameter. Implement
a sysfs file to let DAMON sysfs ABI users use it.
Additionally, addr_unit must be set to a non-zero value.
Link: https://lkml.kernel.org/r/20250828171242.59810-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add support of addr_unit for DAMOS_STAT action handling from the DAMOS
operation implementation for the physical address space.
Link: https://lkml.kernel.org/r/20250828171242.59810-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add support of addr_unit for DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD
action handling from the DAMOS operation implementation for the physical
address space.
Link: https://lkml.kernel.org/r/20250828171242.59810-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add support of addr_unit for DAMOS_LRU_PRIO and DAMOS_LRU_DEPRIO action
handling from the DAMOS operation implementation for the physical address
space.
Link: https://lkml.kernel.org/r/20250828171242.59810-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add support of addr_unit for DAMOS_PAGEOUT action handling from the DAMOS
operation implementation for the physical address space.
Link: https://lkml.kernel.org/r/20250828171242.59810-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add support of the addr_unit parameter for the access monitoring operations of paddr.
Link: https://lkml.kernel.org/r/20250828171242.59810-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: support ARM32 with LPAE", v3.
Previously, DAMON's physical address space monitoring only supported
memory ranges below 4GB on LPAE-enabled systems. This was due to the use
of 'unsigned long' in 'struct damon_addr_range', which is 32-bit on ARM32
even with LPAE enabled[1].
To add DAMON support for ARM32 with LPAE enabled, a new core layer
parameter called 'addr_unit' was introduced[2]. The operations set layer
can translate a core layer address to the real address by multiplying the
core layer address by the parameter value. Support of the parameter is up
to each operations set layer implementation, though. For example, operations
set implementations for virtual address space can simply ignore the
parameter. Add the support on paddr, which is the DAMON operations set
implementation for the physical address space, as we have a clear use case
for that.
This patch (of 11):
In some cases, some of the real addresses that are handled by the
underlying operations set cannot be handled by DAMON, since it uses only
'unsigned long' as the address type. Using DAMON for physical address space
monitoring of 32 bit ARM devices with large physical address extension
(LPAE) is one example[1].
Add a parameter named 'addr_unit' to the core layer to help such cases.
DAMON core API callers can set it as the scale factor that will be used
by the operations set for translating the core layer's addresses to real
addresses, by multiplying the core layer address by the parameter value.
Support of the parameter is up to each operations set layer. The support
from the physical address space operations set (paddr) will be added with
following commits.
Link: https://lkml.kernel.org/r/20250828171242.59810-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250828171242.59810-2-sj@kernel.org
Link: https://lore.kernel.org/20250408075553.959388-1-zuoze1@huawei.com [1]
Link: https://lore.kernel.org/all/20250416042551.158131-1-sj@kernel.org/ [2]
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
enum pageblock_bits defines the meaning of pageblock bits. Currently
PB_migratetype_bits says the lowest 3 bits represent the migratetype, and
the definitions of PB_migrate_end/MIGRATETYPE_MASK rely on it with magical
computation.
Remove the definitions of PB_migratetype_bits/PB_migrate_end. Use
PB_migrate_[0|1|2] to represent the lowest bits for the migratetype. Then
we can simplify the related definitions.
Also, MIGRATETYPE_AND_ISO_MASK is MIGRATETYPE_MASK plus the isolation bit.
Using MIGRATETYPE_MASK in the definition of MIGRATETYPE_AND_ISO_MASK looks
cleaner.
No functional change intended.
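A sketch of the reworked definitions (illustrative; see the patch for the
exact enum):

	enum pageblock_bits {
		PB_migrate_0,
		PB_migrate_1,
		PB_migrate_2,		/* 3 bits for the migratetype */
		PB_migrate_isolate,	/* standalone isolation bit */
		NR_PAGEBLOCK_BITS
	};

	#define MIGRATETYPE_MASK \
		(BIT(PB_migrate_0) | BIT(PB_migrate_1) | BIT(PB_migrate_2))
	#define MIGRATETYPE_AND_ISO_MASK \
		(MIGRATETYPE_MASK | BIT(PB_migrate_isolate))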
Link: https://lkml.kernel.org/r/20250827070105.16864-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/pageblock: improve readability of some pageblock
handling", v3.
While reading the code, I found two points where the readability of
pageblock handling could be improved.
Patch 1: the isolate bit is standalone and there are dedicated helpers.
Instead of checking the bit directly, we can use the helpers.
Patch 2: remove PB_migratetype_bits and PB_migrate_end to reduce magical
computation.
This patch (of 2):
Since commit e904bce2d9d4 ("mm/page_isolation: make page isolation a
standalone bit"), dedicated helpers to handle isolation have been
available. Change to use these helpers for better readability.
No functional change intended.
Link: https://lkml.kernel.org/r/20250827070105.16864-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250827070105.16864-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Kernel file pages are tricky to track because they are indistinguishable
from files whose usage is accounted to the root cgroup.
To maintain good accounting, introduce a vmstat counter tracking kernel
file pages.
Confirmed that these work as expected at a high level by mounting a btrfs
using AS_KERNEL_FILE for metadata pages, and seeing the counter rise with
fs usage then go back to a minimal level after drop_caches and finally
down to 0 after unmounting the fs.
Link: https://lkml.kernel.org/r/08ff633e3a005ed5f7691bfd9f58a5df8e474339.1755812945.git.boris@bur.io
Signed-off-by: Boris Burkov <boris@bur.io>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: syzbot@syzkaller.appspotmail.com
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qu Wenruo <wqu@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "introduce kernel file mapped folios", v4.
Btrfs currently tracks its metadata pages in the page cache, using a fake
inode (fs_info->btree_inode) with offsets corresponding to where the
metadata is stored in the filesystem's full logical address space.
A consequence of this is that when btrfs uses filemap_add_folio(), this
usage is charged to the cgroup of whichever task happens to be running at
the time. These folios don't belong to any particular user cgroup, so I
don't think it makes much sense for them to be charged in that way. Some
negative consequences as a result:
- A task can be holding some important btrfs locks, then need to lookup
some metadata and go into reclaim, extending the duration it holds
that lock for, and unfairly pushing its own reclaim pain onto other
cgroups.
- If that cgroup goes into reclaim, it might reclaim these folios a
different non-reclaiming cgroup might need soon. This is naturally
offset by LRU reclaim, but still.
We have two options for how to manage such file pages:
1. charge them to the root cgroup.
2. don't charge them to any cgroup at all.
Option 2 breaks the invariant that every mapped page has a cgroup. This
is workable, but unnecessarily risky. Therefore, go with option 1.
A very similar proposal to use the root cgroup was previously made by Qu,
where he eventually proposed the idea of setting it per address_space.
This makes good sense for the btrfs use case, as the behavior should apply
to all use of the address_space, not select allocations. I.e., if someone
adds another filemap_add_folio() call using btrfs's btree_inode, we would
almost certainly want to account that to the root cgroup as well.
This patch (of 3):
Add the flag AS_KERNEL_FILE to the address_space to indicate that this
mapping's memory is exempt from the usual memcg accounting.
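Usage would look roughly like this (helper name assumed to follow the
mapping_set_* convention):

	/* btrfs would mark its btree_inode mapping so those folios are
	 * charged to the root cgroup rather than the current task's. */
	mapping_set_kernel_file(fs_info->btree_inode->i_mapping);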
[boris@bur.io: fix CONFIG_MEMCG build for AS_KERNEL_FILE]
Link: https://lkml.kernel.org/r/6de59ddeec81b5c294d337c001ba0061631d4ec6.1755816635.git.boris@bur.io
Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com/
Link: https://lkml.kernel.org/r/f09c4e2c90351d4cb30a1969f7a863b9238bd291.1755812945.git.boris@bur.io
Signed-off-by: Boris Burkov <boris@bur.io>
Suggested-by: Qu Wenruo <wqu@suse.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "selftests/mm: uffd-stress fixes", v2.
This patchset ensures that the number of hugepages is correctly set in the
system so that the uffd-stress test does not fail due to the racy nature
of the test. Patch 1 changes the hugepage constraint in the
run_vmtests.sh script, whereas patch 2 changes the constraint in the test
itself.
This patch (of 2):
We observed uffd-stress selftest failure on arm64 and intermittent
failures on x86 too:
running ./uffd-stress hugetlb-private 128 32
bounces: 17, mode: rnd read, ERROR: UFFDIO_COPY error: -12 (errno=12, @uffd-common.c:617) [FAIL]
not ok 18 uffd-stress hugetlb-private 128 32 # exit=1
For this particular case, the number of free hugepages from run_vmtests.sh
will be 128, and the test will allocate 64 hugepages in the source
location. The stress() function will start spawning threads which will
operate on the destination location, triggering uffd-operations like
UFFDIO_COPY from src to dst, which means that we will require 64 more
hugepages for the dst location.
Let us observe the locking_thread() function. It will lock the mutex kept
at dst, triggering uffd-copy. Suppose that 127 (64 for src and 63 for
dst) hugepages have been reserved. In case of BOUNCE_RANDOM, it may
happen that two threads trying to lock the mutex at dst, try to do so at
the same hugepage number. If one thread succeeds in reserving the last
hugepage, then the other thread may fail in alloc_hugetlb_folio(),
returning -ENOMEM. I can confirm that this is indeed the case by this
hacky patch:
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6929,6 +6929,11 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 
 	folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
 	if (IS_ERR(folio)) {
+		pte_t *actual_pte = hugetlb_walk(dst_vma, dst_addr, PMD_SIZE);
+		if (actual_pte) {
+			ret = -EEXIST;
+			goto out;
+		}
 		ret = -ENOMEM;
 		goto out;
 	}
This code path gets triggered, indicating that the PMD at which one thread
is trying to map a hugepage gets filled by a racing thread.
Therefore, instead of using freepgs to compute the amount of memory, use
freepgs - (min(32, nr_cpus) - 1), so that the test still has some extra
hugepages to use. The adjustment is a function of min(32, nr_cpus) - the
value of nr_parallel in the test - because in the worst case, nr_parallel
number of threads will try to map a hugepage on the same PMD, one will win
the allocation race, and the other nr_parallel - 1 threads will fail, so
we need extra nr_parallel - 1 hugepages to satisfy this request. Note
that, in case the adjusted value underflows, there is a check for the
number of free hugepages in the test itself, which will fail:
get_free_hugepages() < bytes / page_size. A negative value will be passed
on to bytes, which is of type size_t; thus the RHS will become a large
value and the check will fail, so we are safe.
Link: https://lkml.kernel.org/r/20250909061531.57272-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20250909061531.57272-2-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We've got the number of pages in the folio earlier, thus remove the redundant
folio_nr_pages() call.
Link: https://lkml.kernel.org/r/67c80182ebd949e3894908e01e224697c143aabb.1756200587.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is more straightforward to use the term `folio'. No functional changes.
Link: https://lkml.kernel.org/r/a2d39608d99cba1130cacd9cffbafc6949193c08.1756200587.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently order is signed in one version of the function and unsigned in
the other. Tidy that up.
In page_alloc.c, order is unsigned in the vast majority of cases. But,
there is a cluster of exceptions in compaction-related code (probably
stemming from the fact that compact_control.order is signed). So, prefer
local consistency and make this one signed too.
Link: https://lkml.kernel.org/r/20250826-cleanup-should_compact_retry-v1-1-d2ca89727fcf@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
folio_index() is only needed for mixed usage of the page cache and swap
cache. The remaining three callers in page-writeback are for page cache
tag marking. The swap cache space doesn't use tags (it explicitly calls
mapping_set_no_writeback_tags()), so use folio->index here directly.
Link: https://lkml.kernel.org/r/20250825163721.17734-1-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In general, khugepaged_scan_mm_slot() iterates the khugepaged_scan.mm_head
list to get a mm_struct to collapse memory from.
Using the list_*() helpers makes the list iteration operation more
obvious.
No functional change.
Link: https://lkml.kernel.org/r/20250822025732.9025-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Using forced_collapse directly is clearer and enforce_sysfs is not really
needed.
Link: https://lkml.kernel.org/r/20250821150038.2025521-1-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|