summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-05-12mm: page-flags-layout.h: change the KASAN_TAG_WIDTH for HW_TAGSGuilherme Giacomo Simoes
KASAN_TAG_WIDTH is 8 bits for both (HW_TAGS and SW_TAGS), but for HW_TAGS the KASAN_TAG_WIDTH can be 4 bits bits because due to the design of the MTE the memory words for storing metadata only need 4 bits. Change the preprocessor define KASAN_TAG_WIDTH for check if SW_TAGS is define, so KASAN_TAG_WIDTH should be 8 bits, but if HW_TAGS is define, so KASAN_TAG_WIDTH should be 4 bits to save a few flags bits. Link: https://lkml.kernel.org/r/20250428201409.5482-1-trintaeoitogc@gmail.com Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: establish mm/vma_exec.c for shared exec/mm VMA functionalityLorenzo Stoakes
Patch series "move all VMA allocation, freeing and duplication logic to mm", v3. Currently VMA allocation, freeing and duplication exist in kernel/fork.c, which is a violation of separation of concerns, and leaves these functions exposed to the rest of the kernel when they are in fact internal implementation details. Resolve this by moving this logic to mm, and making it internal to vma.c, vma.h. This also allows us, in future, to provide userland testing around this functionality. We additionally abstract dup_mmap() to mm, being careful to ensure kernel/fork.c acceses this via the mm internal header so it is not exposed elsewhere in the kernel. As part of this change, also abstract initial stack allocation performed in __bprm_mm_init() out of fs code into mm via the create_init_stack_vma(), as this code uses vm_area_alloc() and vm_area_free(). In order to do so sensibly, we introduce a new mm/vma_exec.c file, which contains the code that is shared by mm and exec. This file is added to both memory mapping and exec sections in MAINTAINERS so both sets of maintainers can maintain oversight. As part of this change, we also move relocate_vma_down() to mm/vma_exec.c so all shared mm/exec functionality is kept in one place. We add code shared between nommu and mmu-enabled configurations in order to share VMA allocation, freeing and duplication code correctly while also keeping these functions available in userland VMA testing. This is achieved by adding a mm/vma_init.c file which is also compiled by the userland tests. This patch (of 4): There is functionality that overlaps the exec and memory mapping subsystems. While it properly belongs in mm, it is important that exec maintainers maintain oversight of this functionality correctly. We can establish both goals by adding a new mm/vma_exec.c file which contains these 'glue' functions, and have fs/exec.c import them. As a part of this change, to ensure that proper oversight is achieved, add the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. scripts/get_maintainer.pl can correctly handle files in multiple entries and this neatly handles the cross-over. [akpm@linux-foundation.org: fix comment typo] Link: https://lkml.kernel.org/r/80f0d0c6-0b68-47f9-ab78-0ab7f74677fc@lucifer.local Link: https://lkml.kernel.org/r/cover.1745853549.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/91f2cee8f17d65214a9d83abb7011aa15f1ea690.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: add folio_expected_ref_count() for reference count calculationShivank Garg
Patch series " JFS: Implement migrate_folio for jfs_metapage_aops" v5. This patchset addresses a warning that occurs during memory compaction due to JFS's missing migrate_folio operation. The warning was introduced by commit 7ee3647243e5 ("migrate: Remove call to ->writepage") which added explicit warnings when filesystem don't implement migrate_folio. The syzbot reported following [1]: jfs_metapage_aops does not implement migrate_folio WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 fallback_migrate_folio mm/migrate.c:953 [inline] WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 Modules linked in: CPU: 1 UID: 0 PID: 5861 Comm: syz-executor280 Not tainted 6.15.0-rc1-next-20250411-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 RIP: 0010:fallback_migrate_folio mm/migrate.c:953 [inline] RIP: 0010:move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 To fix this issue, this series implement metapage_migrate_folio() for JFS which handles both single and multiple metapages per page configurations. While most filesystems leverage existing migration implementations like filemap_migrate_folio(), buffer_migrate_folio_norefs() or buffer_migrate_folio() (which internally used folio_expected_refs()), JFS's metapage architecture requires special handling of its private data during migration. To support this, this series introduce the folio_expected_ref_count(), which calculates external references to a folio from page/swap cache, private data, and page table mappings. This standardized implementation replaces the previous ad-hoc folio_expected_refs() function and enables JFS to accurately determine whether a folio has unexpected references before attempting migration. Implement folio_expected_ref_count() to calculate expected folio reference counts from: - Page/swap cache (1 per page) - Private data (1) - Page table mappings (1 per map) While originally needed for page migration operations, this improved implementation standardizes reference counting by consolidating all refcount contributors into a single, reusable function that can benefit any subsystem needing to detect unexpected references to folios. The folio_expected_ref_count() returns the sum of these external references without including any reference the caller itself might hold. Callers comparing against the actual folio_ref_count() must account for their own references separately. Link: https://syzkaller.appspot.com/bug?extid=8bb6fd945af4e0ad9299 [1] Link: https://lkml.kernel.org/r/20250430100150.279751-1-shivankg@amd.com Link: https://lkml.kernel.org/r/20250430100150.279751-2-shivankg@amd.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Co-developed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12util_macros.h: make the header more resilientAndy Shevchenko
Add missing header inclusions. Link: https://lkml.kernel.org/r/20250428072754.3265274-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12sched/numa: add tracepoint that tracks the skipping of numa balancing due to ↵Libo Chen
cpuset memory pinning Unlike sched_skip_vma_numa tracepoint which tracks skipped VMAs, this tracks the task subjected to cpuset.mems pinning and prints out its allowed memory node mask. Link: https://lkml.kernel.org/r/20250424024523.2298272-3-libo.chen@oracle.com Signed-off-by: Libo Chen <libo.chen@oracle.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Chen Yu <yu.c.chen@intel.com> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Srikanth Aithal <sraithal@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/rmap: inline folio_test_large_maybe_mapped_shared() into callersLance Yang
To prevent the function from being used when CONFIG_MM_ID is disabled, we intend to inline it into its few callers, which also would help maintain the expected code placement. Link: https://lkml.kernel.org/r/20250424155606.57488-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: implement for_each_valid_pfn() for CONFIG_SPARSEMEMDavid Woodhouse
Implement for_each_valid_pfn() based on two helper functions. The first_valid_pfn() function largely mirrors pfn_valid(), calling into a pfn_section_first_valid() helper which is trivial for the !VMEMMAP case, and in the VMEMMAP case will skip to the next subsection as needed. Since next_valid_pfn() knows that its argument *is* a valid PFN, it doesn't need to do any checking at all while iterating over the low bits within a (sub)section mask; the whole (sub)section is either present or not. Note that the VMEMMAP version of pfn_section_first_valid() may return a value *higher* than end_pfn when skipping to the next subsection, and first_valid_pfn() happily returns that higher value. This is fine. [dwmw2@infradead.org: fix next_valid_pfn() for sparsemem] Link: https://lkml.kernel.org/r/c15100fcf6781a60b852c4dbb43bdc98a678fcf0.camel@infradead.org Link: https://lkml.kernel.org/r/20250423133821.789413-4-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: implement for_each_valid_pfn() for CONFIG_FLATMEMDavid Woodhouse
In the FLATMEM case, the default pfn_valid() just checks that the PFN is within the range [ ARCH_PFN_OFFSET .. ARCH_PFN_OFFSET + max_mapnr ). The for_each_valid_pfn() function can therefore be a simple for() loop using those as min/max respectively. Link: https://lkml.kernel.org/r/20250423133821.789413-3-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: introduce for_each_valid_pfn() and use it from reserve_bootmem_region()David Woodhouse
Patch series "mm: Introduce for_each_valid_pfn()", v4. There are cases where a naïve loop over a PFN range, calling pfn_valid() on each one, is horribly inefficient. Ruihan Li reported the case where memmap_init() iterates all the way from zero to a potentially large value of ARCH_PFN_OFFSET, and we at Amazon found the reserve_bootmem_region() one as it affects hypervisor live update. Others are more cosmetic. By introducing a for_each_valid_pfn() helper it can optimise away a lot of pointless calls to pfn_valid(), skipping immediately to the next valid PFN and also skipping *all* checks within a valid (sub)region according to the granularity of the memory model in use. This patch (of 7) Especially since commit 9092d4f7a1f8 ("memblock: update initialization of reserved pages"), the reserve_bootmem_region() function can spend a significant amount of time iterating over every 4KiB PFN in a range, calling pfn_valid() on each one, and ultimately doing absolutely nothing. On a platform used for virtualization, with large NOMAP regions that eventually get used for guest RAM, this leads to a significant increase in steal time experienced during kexec for a live update. Introduce for_each_valid_pfn() and use it from reserve_bootmem_region(). This implementation is precisely the same naïve loop that the functio used to have, but subsequent commits will provide optimised versions for FLATMEM and SPARSEMEM, and this version will remain for those architectures which provide their own pfn_valid() implementation, until/unless they also provide a matching for_each_valid_pfn(). Link: https://lkml.kernel.org/r/20250423133821.789413-1-dwmw2@infradead.org Link: https://lkml.kernel.org/r/20250423133821.789413-2-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: add KHO support to kexec file loadsAlexander Graf
Kexec has 2 modes: A user space driven mode and a kernel driven mode. For the kernel driven mode, kernel code determines the physical addresses of all target buffers that the payload gets copied into. With KHO, we can only safely copy payloads into the "scratch area". Teach the kexec file loader about it, so it only allocates for that area. In addition, enlighten it with support to ask the KHO subsystem for its respective payloads to copy into target memory. Also teach the KHO subsystem how to fill the images for file loads. Link: https://lkml.kernel.org/r/20250509074635.3187114-8-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: enable KHO support for memory preservationMike Rapoport (Microsoft)
Introduce APIs allowing KHO users to preserve memory across kexec and get access to that memory after boot of the kexeced kernel kho_preserve_folio() - record a folio to be preserved over kexec kho_restore_folio() - recreates the folio from the preserved memory kho_preserve_phys() - record physically contiguous range to be preserved over kexec. The memory preservations are tracked by two levels of xarrays to manage chunks of per-order 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a 1TB x86 system would fit inside a single 512 byte bitmap. For order 0 allocations each bitmap will cover 16M of address space. Thus, for 16G of memory at most 512K of bitmap memory will be needed for order 0. At serialization time all bitmaps are recorded in a linked list of pages for the next kernel to process and the physical address of the list is recorded in KHO FDT. The next kernel then processes that list, reserves the memory ranges and later, when a user requests a folio or a physical range, KHO restores corresponding memory map entries. Link: https://lkml.kernel.org/r/20250509074635.3187114-7-changyuanl@google.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: add KHO parsing supportAlexander Graf
When we have a KHO kexec, we get an FDT blob and scratch region to populate the state of the system. Provide helper functions that allow architecture code to easily handle memory reservations based on them and give device drivers visibility into the KHO FDT and memory reservations so they can recover their own state. Include a fix from Arnd Bergmann <arnd@arndb.de> https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/. Link: https://lkml.kernel.org/r/20250509074635.3187114-6-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: add Kexec HandOver (KHO) generation helpersAlexander Graf
Add the infrastructure to generate Kexec HandOver metadata. Kexec HandOver is a mechanism that allows Linux to preserve state - arbitrary properties as well as memory locations - across kexec. It does so using 2 concepts: 1) KHO FDT - Every KHO kexec carries a KHO specific flattened device tree blob that describes preserved memory regions. Device drivers can register to KHO to serialize and preserve their states before kexec. 2) Scratch Regions - CMA regions that we allocate in the first kernel. CMA gives us the guarantee that no handover pages land in those regions, because handover pages must be at a static physical memory location. We use these regions as the place to load future kexec images so that they won't collide with any handover data. Link: https://lkml.kernel.org/r/20250509074635.3187114-5-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Pratyush Yadav <ptyadav@amazon.de> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12memblock: introduce memmap_init_kho_scratch()Mike Rapoport (Microsoft)
With deferred initialization of struct page it will be necessary to initialize memory map for KHO scratch regions early. Add memmap_init_kho_scratch() method that will allow such initialization in upcoming patches. Link: https://lkml.kernel.org/r/20250509074635.3187114-4-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12memblock: add support for scratch memoryAlexander Graf
With KHO (Kexec HandOver), we need a way to ensure that the new kernel does not allocate memory on top of any memory regions that the previous kernel was handing over. But to know where those are, we need to include them in the memblock.reserved array which may not be big enough to hold all ranges that need to be persisted across kexec. To resize the array, we need to allocate memory. That brings us into a catch 22 situation. The solution to that is limit memblock allocations to the scratch regions: safe regions to operate in the case when there is memory that should remain intact across kexec. KHO provides several "scratch regions" as part of its metadata. These scratch regions are contiguous memory blocks that known not to contain any memory that should be persisted across kexec. These regions should be large enough to accommodate all memblock allocations done by the kexeced kernel. We introduce a new memblock_set_scratch_only() function that allows KHO to indicate that any memblock allocation must happen from the scratch regions. Later, we may want to perform another KHO kexec. For that, we reuse the same scratch regions. To ensure that no eventually handed over data gets allocated inside a scratch region, we flip the semantics of the scratch region with memblock_clear_scratch_only(): After that call, no allocations may happen from scratch memblock regions. We will lift that restriction in the next patch. Link: https://lkml.kernel.org/r/20250509074635.3187114-3-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12memblock: add MEMBLOCK_RSRV_KERN flagMike Rapoport (Microsoft)
Patch series "kexec: introduce Kexec HandOver (KHO)", v8. Kexec today considers itself purely a boot loader: When we enter the new kernel, any state the previous kernel left behind is irrelevant and the new kernel reinitializes the system. However, there are use cases where this mode of operation is not what we actually want. In virtualization hosts for example, we want to use kexec to update the host kernel while virtual machine memory stays untouched. When we add device assignment to the mix, we also need to ensure that IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we need to do the same for the PCI subsystem. If we want to kexec while an SEV-SNP enabled virtual machine is running, we need to preserve the VM context pages and physical memory. See "pkernfs: Persisting guest memory and kernel/device state safely across kexec" Linux Plumbers Conference 2023 presentation for details: https://lpc.events/event/17/contributions/1485/ To start us on the journey to support all the use cases above, this patch implements basic infrastructure to allow hand over of kernel state across kexec (Kexec HandOver, aka KHO). As a really simple example target, we use memblock's reserve_mem. With this patchset applied, memory that was reserved using "reserve_mem" command line options remains intact after kexec and it is guaranteed to reside at the same physical address. == Alternatives == There are alternative approaches to (parts of) the problems above: * Memory Pools [1] - preallocated persistent memory region + allocator * PRMEM [2] - resizable persistent memory regions with fixed metadata pointer on the kernel command line + allocator * Pkernfs [3] - preallocated file system for in-kernel data with fixed address location on the kernel command line * PKRAM [4] - handover of user space pages using a fixed metadata page specified via command line All of the approaches above fundamentally have the same problem: They require the administrator to explicitly carve out a physical memory location because they have no mechanism outside of the kernel command line to pass data (including memory reservations) between kexec'ing kernels. KHO provides that base foundation. We will determine later whether we still need any of the approaches above for fast bulk memory handover of for example IOMMU page tables. But IMHO they would all be users of KHO, with KHO providing the foundational primitive to pass metadata and bulk memory reservations as well as provide easy versioning for data. == Overview == We introduce a metadata file that the kernels pass between each other. How they pass it is architecture specific. The file's format is a Flattened Device Tree (fdt) which has a generator and parser already included in Linux. KHO is enabled in the kernel command line by `kho=on`. When the root user enables KHO through /sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks to every KHO users to register preserved memory regions, which contain drivers' states. When the actual kexec happens, the fdt is part of the image set that we boot into. In addition, we keep "scratch regions" available for kexec: physically contiguous memory regions that are guaranteed to not have any memory that KHO would preserve. The new kernel bootstraps itself using the scratch regions and sets all handed over memory as in use. When drivers initialize that support KHO, they introspect the fdt, restore preserved memory regions, and retrieve their states stored in the preserved memory. == Limitations == Currently KHO is only implemented for file based kexec. The kernel interfaces in the patch set are already in place to support user space kexec as well, but it is still not implemented it yet inside kexec tools. == How to Use == To use the code, please boot the kernel with the "kho=on" command line parameter. KHO will automatically create scratch regions. If you want to set the scratch size explicitly you can use "kho_scratch=" command line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a 16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB per NUMA node scratch regions on boot. Make sure to have a reserved memory range requested with reserv_mem command line option, for example, "reserve_mem=64m:4k:n1". Then before you invoke file based "kexec -l", finalize KHO FDT: # echo 1 > /sys/kernel/debug/kho/out/finalize You can preview the generated FDT using `dtc`, # dtc /sys/kernel/debug/kho/out/fdt # dtc /sys/kernel/debug/kho/out/sub_fdts/memblock `dtc` is available on ubuntu by `sudo apt-get install device-tree-compiler`. Now kexec into the new kernel, # kexec -l Image --initrd=initrd -s # kexec -e (The order of KHO finalization and "kexec -l" does not matter.) The new kernel will boot up and contain the previous kernel's reserve_mem contents at the same physical address as the first kernel. You can also review the FDT passed from the old kernel, # dtc /sys/kernel/debug/kho/in/fdt # dtc /sys/kernel/debug/kho/in/sub_fdts/memblock This patch (of 17): To denote areas that were reserved for kernel use either directly with memblock_reserve_kern() or via memblock allocations. Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/ Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/ Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/ Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/ Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12khugepaged: pass folio instead of head page to trace eventsFan Ni
The trace functions trace_mm_collapse_huge_page_isolate() and trace_mm_khugepaged_scan_pmd() each have a single user, which always passes in the head page of a folio. Refactor both functions to take a folio directly. Link: https://lkml.kernel.org/r/20250425002425.533698-1-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Reviewed-by: Nico Pache <npache@redhat.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Mariano Pache <npache@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: remove unused macro INIT_PASIDCheng-Han Wu
The macro INIT_PASID was originally used by mm_init_pasid. However, since commit a6cbd44093ef ("kernel/fork: Initialize mm's PASID"), mm_init_pasid has been removed. Therefore, INIT_PASID is no longer needed and is removed. Link: https://lkml.kernel.org/r/20250427145004.13049-1-hank20010209@gmail.com Signed-off-by: Cheng-Han Wu <hank20010209@gmail.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: add swappiness=max arg to memory.reclaim for only anon reclaimZhongkun He
Patch series "add max arg to swappiness in memory.reclaim and lru_gen", v4. This patchset adds max arg to swappiness in memory.reclaim and lru_gen for anon only proactive memory reclaim. With commit <68cd9050d871> ("mm: add swappiness= arg to memory.reclaim") we can submit an additional swappiness=<val> argument to memory.reclaim. It is very useful because we can dynamically adjust the reclamation ratio based on the anonymous folios and file folios of each cgroup. For example,when swappiness is set to 0, we only reclaim from file folios. But we can not relciam memory just from anon folios. This patchset introduces a new macro, SWAPPINESS_ANON_ONLY, defined as MAX_SWAPPINESS + 1, represent the max arg semantics. It specifically indicates that reclamation should occur only from anonymous pages. Patch 1 adds swappiness=max arg to memory.reclaim suggested-by: Yosry Ahmed Patch 2 add more comments for cache_trim_mode from Johannes Weiner in [1]. Patch 3 add max arg to lru_gen for proactive memory reclaim in MGLRU. The MGLRU already supports reclaiming exclusively from anonymous pages. This patch formalizes that behavior by introducing a max parameter to represent the corresponding semantics. Patch 4 using SWAPPINESS_ANON_ONLY in MGLRU Using SWAPPINESS_ANON_ONLY instead of MAX_SWAPPINESS + 1 to indicate reclaiming only from anonymous pages makes the code more readable and explicit Here is the previous discussion: https://lore.kernel.org/all/20250314033350.1156370-1-hezhongkun.hzk@bytedance.com/ https://lore.kernel.org/all/20250312094337.2296278-1-hezhongkun.hzk@bytedance.com/ https://lore.kernel.org/all/20250318135330.3358345-1-hezhongkun.hzk@bytedance.com/ This patch (of 4): With commit <68cd9050d871> ("mm: add swappiness= arg to memory.reclaim") we can submit an additional swappiness=<val> argument to memory.reclaim. It is very useful because we can dynamically adjust the reclamation ratio based on the anonymous folios and file folios of each cgroup. For example,when swappiness is set to 0, we only reclaim from file folios. However,we have also encountered a new issue: when swappiness is set to the MAX_SWAPPINESS, it may still only reclaim file folios. So, we hope to add a new arg 'swappiness=max' in memory.reclaim where proactive memory reclaim only reclaims from anonymous folios when swappiness is set to max. The swappiness semantics from a user perspective remain unchanged. For example, something like this: echo "2M swappiness=max" > /sys/fs/cgroup/memory.reclaim will perform reclaim on the rootcg with a swappiness setting of 'max' (a new mode) regardless of the file folios. Users have a more comprehensive view of the application's memory distribution because there are many metrics available. For example, if we find that a certain cgroup has a large number of inactive anon folios, we can reclaim only those and skip file folios, because with the zram/zswap, the IO tradeoff that cache_trim_mode or other file first logic is making doesn't hold - file refaults will cause IO, whereas anon decompression will not. With this patch, the swappiness argument of memory.reclaim has a new mode 'max', means reclaiming just from anonymous folios both in traditional LRU and MGLRU. Link: https://lkml.kernel.org/r/cover.1745225696.git.hezhongkun.hzk@bytedance.com Link: https://lore.kernel.org/all/20250314141833.GA1316033@cmpxchg.org/ [1] Link: https://lkml.kernel.org/r/519e12b9b1f8c31a01e228c8b4b91a2419684f77.1745225696.git.hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/hugetlb: use separate nodemask for bootmem allocationsFrank van der Linden
Hugetlb boot allocation has used online nodes for allocation since commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation"). This was needed to be able to do the allocations earlier in boot, before N_MEMORY was set. This might lead to a different distribution of gigantic hugepages across NUMA nodes if there are memoryless nodes in the system. What happens is that the memoryless nodes are tried, but then the memblock allocation fails and falls back, which usually means that the node that has the highest physical address available will be used (top-down allocation). While this will end up getting the same number of hugetlb pages, they might not be be distributed the same way. The fallback for each memoryless node might not end up coming from the same node as the successful round-robin allocation from N_MEMORY nodes. While administrators that rely on having a specific number of hugepages per node should use the hugepages=N:X syntax, it's better not to change the old behavior for the plain hugepages=N case. To do this, construct a nodemask for hugetlb bootmem purposes only, containing nodes that have memory. Then use that for round-robin bootmem allocations. This saves some cycles, and the added advantage here is that hugetlb_cma can use it too, avoiding the older issue of pointless attempts to create a CMA area for memoryless nodes (which will also cause the per-node CMA area size to be too small). Link: https://lkml.kernel.org/r/20250402205613.3086864-1-fvdl@google.com Fixes: de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation") Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/memcg: move mem_cgroup_init() ahead of cgroup_init()Huan Yang
Patch series "Use kmem_cache for memcg alloc", v3. (willy tldr: "you've gone from allocating 8 objects per 32KiB to allocating 13 objects per 32KiB, a 62% improvement in memory consumption" [1]) The mem_cgroup_alloc function creates mem_cgroup struct and it's associated structures including mem_cgroup_per_node. Through detailed analysis on our test machine (Arm64, 16GB RAM, 6.6 kernel, 1 NUMA node, memcgv2 with nokmem,nosocket,cgroup_disable=pressure), we can observe the memory allocation for these structures using the following shell commands: # Enable tracing echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable echo 1 > /sys/kernel/tracing/tracing_on cat /sys/kernel/tracing/trace_pipe | grep kmalloc | grep mem_cgroup # Trigger allocation if cgroup subtree do not enable memcg echo +memory > /sys/fs/cgroup/cgroup.subtree_control Ftrace Output: # mem_cgroup struct allocation sh-6312 [000] ..... 58015.698365: kmalloc: call_site=mem_cgroup_css_alloc+0xd8/0x5b4 ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-6312 [000] ..... 58015.698389: kmalloc: call_site=mem_cgroup_css_alloc+0x1d8/0x5b4 ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false Key Observations: 1. Both structures use kmalloc with requested sizes between 2KB-4KB 2. Allocation alignment forces 4KB slab usage due to pre-defined sizes (64B, 128B,..., 2KB, 4KB, 8KB) 3. Memory waste per memcg instance: Base struct: 4096 - 2312 = 1784 bytes Per-node struct: 4096 - 2896 = 1200 bytes Total waste: 2984 bytes (1-node system) NUMA scaling: (1200 + 8) * nr_node_ids bytes So, it's a little waste. This patchset introduces dedicated kmem_cache: Patch2 - mem_cgroup kmem_cache - memcg_cachep Patch3 - mem_cgroup_per_node kmem_cache - memcg_pn_cachep The benefits of this change can be observed with the following tracing commands: # Enable tracing echo 1 > /sys/kernel/tracing/events/kmem/kmem_cache_alloc/enable echo 1 > /sys/kernel/tracing/tracing_on cat /sys/kernel/tracing/trace_pipe | grep kmem_cache_alloc | grep mem_cgroup # In another terminal: echo +memory > /sys/fs/cgroup/cgroup.subtree_control The output might now look like this: # mem_cgroup struct allocation sh-9827 [000] ..... 289.513598: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806 bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-9827 [000] ..... 289.513602: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false This indicates that the `mem_cgroup` struct now requests 2312 bytes and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes and is allocated 2944 bytes. The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the `kmem_cache`. Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as: # mem_cgroup struct allocation sh-9269 [003] ..... 80.396366: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475 bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-9269 [003] ..... 80.396411: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6 bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial for performance. Please let me know if there are any issues or if I've misunderstood anything. This patchset also move mem_cgroup_init ahead of cgroup_init() due to cgroup_init() will allocate root_mem_cgroup, but each initcall invoke after cgroup_init, so if each kmem_cache do not prepare, we need testing NULL before use it. This patch (of 3): When cgroup_init() creates root_mem_cgroup through css_alloc callback, some critical resources might not be fully initialized, forcing later operations to perform conditional checks for resource availability. This patch move mem_cgroup_init() to address the init order, it invoke before cgroup_init, so, compare to subsys_initcall, it can use to prepare some key resources before root_mem_cgroup alloc. Link: https://lkml.kernel.org/r/aAsRCj-niMMTtmK8@casper.infradead.org [1] Link: https://lkml.kernel.org/r/20250425031935.76411-1-link@vivo.com Link: https://lkml.kernel.org/r/20250425031935.76411-2-link@vivo.com Signed-off-by: Huan Yang <link@vivo.com> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Francesco Valla <francesco@valla.it> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Huang Shijie <shijie@os.amperecomputing.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Raul E Rangel <rrangel@chromium.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/huge_memory: remove useless folio pointers passingGavin Guo
Since the previous commit "mm/huge_memory: Adjust try_to_migrate_one() and split_huge_pmd_locked()" has simplified the logic by leveraging the folio verification in page_vma_mapped_walk(), this patch removes the unnecessary folio pointers passing. Link: https://lkml.kernel.org/r/20250425103859.825879-3-gavinguo@igalia.com Link: https://lore.kernel.org/all/98d1d195-7821-4627-b518-83103ade56c0@redhat.com/ Link: https://lore.kernel.org/all/91599a3c-e69e-4d79-bac5-5013c96203d7@redhat.com/ Signed-off-by: Gavin Guo <gavinguo@igalia.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Florent Revest <revest@google.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12vmscan,cgroup: apply mems_effective to reclaimGregory Price
It is possible for a reclaimer to cause demotions of an lruvec belonging to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply this limitation based on the lruvec's memcg and prevent demotion. Notably, this may still allow demotion of shared libraries or any memory first instantiated in another cgroup. This means cpusets still cannot cannot guarantee complete isolation when demotion is enabled, and the docs have been updated to reflect this. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently - with the noted exceptions. Note on locking: The cgroup_get_e_css reference protects the css->effective_mems, and calls of this interface would be subject to the same race conditions associated with a non-atomic access to cs->effective_mems. So while this interface cannot make strong guarantees of correctness, it can therefore avoid taking a global or rcu_read_lock for performance. Link: https://lkml.kernel.org/r/20250424202806.52632-3-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12cpuset: rename cpuset_node_allowed to cpuset_current_node_allowedGregory Price
Patch series "vmscan: enforce mems_effective during demotion", v5. Change reclaim to respect cpuset.mems_effective during demotion when possible. Presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. Implement cpuset_node_allowed() to check the cpuset.mems_effective associated wih the mem_cgroup of the lruvec being scanned. This only applies to cgroup/cpuset v2, as cpuset exists in a different hierarchy than mem_cgroup in v1. This requires renaming the existing cpuset_node_allowed() to be cpuset_current_now_allowed() - which is more descriptive anyway - to implement the new cpuset_node_allowed() which takes a target cgroup. This patch (of 2): Rename cpuset_node_allowed to reflect that the function checks the current task's cpuset.mems. This allows us to make a new cpuset_node_allowed function that checks a target cgroup's cpuset.mems. Link: https://lkml.kernel.org/r/20250424202806.52632-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20250424202806.52632-2-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12Update Christoph's Email address and make it consistentChristoph Lameter (Ampere)
Use cl@gentwo.org throughout and remove the old email addresses. Link: https://lkml.kernel.org/r/8b962f57-4d98-cbb0-cd82-b6ba456733e8@gentwo.org Signed-off-by: Christoph Lameter <cl@gentwo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/damon/core: introduce damos quota goal metrics for memory node utilizationSeongJae Park
Patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory". Utilizing DAMON for memory tiering usually requires manual tuning and/or tedious controls. Let it self-tune hotness and coldness thresholds for promotion and demotion aiming high utilization of high memory tiers, by introducing new DAMOS quota goal metrics representing the used and the free memory ratios of specific NUMA nodes. And introduce a sample DAMON module that demonstrates how the new feature can be used for memory tiering use cases. Backgrounds =========== A type of tiered memory system exposes the memory tiers as NUMA nodes. A straightforward pages placement strategy for such systems is placing access-hot and cold pages on upper and lower tiers, reespectively, pursuing higher utilization of upper tiers. Since access temperature can be dynamic, periodically finding and migrating hot pages and cold pages to proper tiers (promoting and demoting) is also required. Linux kernel provides several features for such dynamic and transparent pages placement. Page Faults and LRU ------------------- One widely known way is using NUMA balancing in tiering mode (a.k.a NUMAB-2) and reclaim-based demotion features. In the setup, NUMAB-2 finds hot pages using access check-purpose page faults (a.k.a prot_none) and promote those inside each process' context, until there is no more pages to promote, or the upper tier is filled up and memory pressure happens. In the latter case, LRU-based reclaim logic wakes up as a response to the memory pressure and demotes cold pages to lower tiers in asynchronous (kswapd) and/or synchronous ways (direct reclaim). DAMON ----- Yet another available solution is using DAMOS with migrate_hot and migrate_cold DAMOS actions for promotions and demotions, respectively. To make it optimum, users need to specify aggressiveness and access temperature thresholds for promotions and demotions in a good balance that results in high utilization of upper tiers. The number of parameters is not small, and optimum parameter values depend on characteristics of the underlying hardware and the workload. As a result, it often requires manual, time consuming and repetitive tuning of the DAMOS schemes for given workloads and systems combinations. Self-tuned DAMON-based Memory Tiering ===================================== To solve such manual tuning problems, DAMOS provides aim-oriented feedback-driven quotas self-tuning. Using the feature, we design a self-tuned DAMON-based memory tiering for general multi-tier memory systems. For each memory tier node, if it has a lower tier, run a DAMOS scheme that demotes cold pages of the node, auto-tuning the aggressiveness aiming an amount of free space of the node. The free space is for keeping the headroom that avoids significant memory pressure during upper tier memory usage spike, and promoting hot pages from the lower tier. For each memory tier node, if it has an upper tier, run a DAMOS scheme that promotes hot pages of the current node to the upper tier, auto-tuning the aggressiveness aiming a high utilization ratio of the upper tier. The target ratio is to ensure higher tiers are utilized as much as possible. It should match with the headroom for demotion scheme, but have slight overlap, to ensure promotion and demotion are not entirely stopped. The aim-oriented aggressiveness auto-tuning of DAMOS is already available. Hence, to make such tiering solution implementation, only new quota goal metrics for utilization and free space ratio of specific NUMA node need to be developed. Discussions =========== The design imposes below discussion points. Expected Behaviors ------------------ The system will let upper tier memory node accommodates as many hot data as possible. If total amount of the data is less than the top tier memory's promotion/demotion target utilization, entire data will be just placed on the top tier. Promotion scheme will do nothing since there is no data to promote. Demotion scheme will also do nothing since the free space ratio of the top tier is higher than the goal. Only if the amount of data is larger than the top tier's utilization ratio, demotion scheme will demote cold pages and ensure the headroom free space. Since the promotion and demotion schemes for a single node has small overlap at their target utilization and free space goals, promotions and demotions will continue working with a moderate aggressiveness level. It will keep all data is placed on access hotness under dynamic access pattern, while minimizing the migration overhead. In any case, each node will keep headroom free space and as many upper tiers are utilized as possible. Ease of Use ----------- Users still need to set the target utilization and free space ratio, but it will be easier to set. We argue 99.7 % utilization and 0.5 % free space ratios can be good default values. It can be easily adjusted based on desired headroom size of given use case. Users are also still required to answer the minimum coldness and hotness thresholds. Together with monitoring intervals auto-tuning[2], DAMON will always show meaningful amount of hot and cold memory. And DAMOS quota's prioritization mechanism will make good decision as long as the source information is that colorful. Hence, users can very naively set the minimum criterias. We believe any access observation and no access observation within last one aggregation interval is enough for minimum hot and cold regions criterias. General Tiered Memory Setup Applicability ----------------------------------------- The design can be applied to any number of tiers having any performance characteristics, as long as they can be hierarchical. Hence, applying the system to different tiered memory system will be straightforward. Note that this assumes only single CPU NUMA node case. Because today's DAMON is not aware of which CPU made each access, applying this on systems having multiple CPU NUMA nodes can be complicated. We are planning to extend DAMON for the use case, but that's out of the scope of this patch series. How To Use ---------- Users can implement the auto-tuned DAMON-based memory tiering using DAMON sysfs interface. It can be easily done using DAMON user-space tool like user-space tool. Below evaluation results section shows an example DAMON user-space tool command for that. For wider and simpler deployment, having a kernel module that sets up and run the DAMOS schemes via DAMON kernel API can be useful. The module can enable the memory tiering at boot time via kernel command line parameter or at run time with single command. This patch series implements a sample DAMON kernel module that shows how such module can be implemented. Comparison To Page Faults and LRU-based Approaches -------------------------------------------------- The existing page faults based promotion (NUMAB-2) does hot pages detection and migration in the process context. When there are many pages to promote, it can block the progress of the application's real works. DAMOS works in asynchronous worker thread, so it doesn't block the real works. NUMAB-2 doesn't provide a way to control aggressiveness of promotion other than the maximum amount of pages to promote per given time widnow. If hot pages are found, promotions can happen in the upper-bound speed, regardless of upper tier's memory pressure. If the maximum speed is not well set for the given workload, it can result in slow promotion or unnecessary memory pressure. Self-tuned DAMON-based memory tiering alleviates the problem by adjusting the speed based on current utilization of the upper tier. LRU-based demotion can be triggered in both asynchronous (kswapd) and synchronous (direct reclaim) ways. Other than the way of finding cold pages, asynchronous LRU-based demotion and DAMON-based demotion has no big difference. DAMON-based demotion can make a better balancing with DAMON-based promotion, though. The LRU-based demotion can do better than DAMON-based demotion when the tier is having significant memory pressure. It would be wise to use DAMON-based demotion as a proactive and primary one, but utilizing LRU-based demotions together as a fast backup solution. Evaluation ========== In short, under a setup that requires fast and frequent promotions, self-tuned DAMON-based memory tiering's hot pages promotion improves performance about 4.42 %. We believe this shows self-tuned DAMON-based promotion's effectiveness. Meanwhile, NUMAB-2's hot pages promotion degrades the performance about 7.34 %. We suspect the degradation is mostly due to NUMAB-2's synchronous nature that can block the application's progress, which highlights the advantage of DAMON-based solution's asynchronous nature. Note that the test was done with the RFC version of this patch series. We don't run it again since this patch series got no meaningful change after the RFC, while the test takes pretty long time. Setup ----- Hardware. Use a machine that equips 250 GiB DRAM memory tier and 50 GiB CXL memory tier. The tiers are exposed as NUMA nodes 0 and 1, respectively. Kernel. Use Linux kernel v6.13 that modified as following. Add all DAMON patches that available on mm tree of 2025-03-15, and this patch series. Also modify it to ignore mempolicy() system calls, to avoid bad effects from application's traditional NUMA systems assumed optimizations. Workload. Use a modified version of Taobench benchmark[3] that available on DCPerf benchmark suite. It represents an in-memory caching workload. We set its 'memsize', 'warmup_time', and 'test_time' parameter as 340 GiB, 2,500 seconds and 1,440 seconds. The parameters are chosen to ensure the workload uses more than DRAM memory tier. Its RSS under the parameter grows to 270 GiB within the warmup time. It turned out the workload has a very static access pattrn. Only about 13 % of the RSS is frequently accessed from the beginning to end. Hence promotion shows no meaningful performance difference regardless of different design and implementations. We therefore modify the kernel to periodically demote up to 10 GiB hot pages and promote up to 10 GiB cold pages once per minute. The intention is to simulate periodic access pattern changes. The hotness and coldness threshold is very naively set so that it is more like random access pattern change rather than strict hot/cold pages exchange. This is why we call the workload as "modified". It is implemented as two DAMOS schemes each running on an asynchronous thread. It can be reproduced with DAMON user-space tool like below. # ./damo start \ --ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \ --damos_action migrate_hot 1 \ --damos_quota_interval 60s --damos_quota_space 10G \ --ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \ --damos_action migrate_cold 0 \ --damos_quota_interval 60s --damos_quota_space 10G \ --nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1 System configurations. Use below variant system configurations. - Baseline. No memory tiering features are turned on. - Numab_tiering. On the baseline, enable NUMAB-2 and relcaim-based demotion. In detail, following command is executed: echo 2 > /proc/sys/kernel/numa_balancing; echo 1 > /sys/kernel/mm/numa/demotion_enabled; echo 7 > /proc/sys/vm/zone_reclaim_mode - DAMON_tiering. On the baseline, utilize self-tuned DAMON-based memory tiering implementation via DAMON user-space tool. It utilizes two kernel threads, namely promotion thread and demotion thread. Demotion thread monitors access pattern of DRAM node using DAMON with auto-tuned monitoring intervals aiming 4% DAMON-observed access ratio, and demote coldest pages up to 200 MiB per second aiming 0.5% free space of DRAM node. Promotion thread monitors CXL node using same intervals auto-tuning, and promote hot pages in same way but aiming for 99.7% utilization of DRAM node. Because DAMON provides only best-effort accuracy, add young page DAMOS filters to allow only and reject all young pages at promoting and demoting, respectively. It can be reproduced with DAMON user-space tool like below. # ./damo start \ --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \ --damos_action migrate_cold 1 --damos_access_rate 0% 0% \ --damos_apply_interval 1s \ --damos_quota_interval 1s --damos_quota_space 200MB \ --damos_quota_goal node_mem_free_bp 0.5% 0 \ --damos_filter reject young \ --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \ --damos_action migrate_hot 0 --damos_access_rate 5% max \ --damos_apply_interval 1s \ --damos_quota_interval 1s --damos_quota_space 200MB \ --damos_quota_goal node_mem_used_bp 99.7% 0 \ --damos_filter allow young \ --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \ --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1 Measurment Results ------------------ On each system configuration, run the modified version of Taobench and collect 'score'. 'score' is a metric that calculated and provided by Taobench to represents the performance of the run on the system. To handle the measurement errors, repeat the measurement five times. The results are as below. Config Score Stdev (%) Normalized Baseline 1.6165 0.0319 1.9764 1.0000 Numab_tiering 1.4976 0.0452 3.0209 0.9264 DAMON_tiering 1.6881 0.0249 1.4767 1.0443 'Config' column shows the system config of the measurement. 'Score' column shows the 'score' measurement in average of the five runs on the system config. 'Stdev' column shows the standsard deviation of the five measurements of the scores. '(%)' column shows the 'Stdev' to 'Score' ratio in percentage. Finally, 'Normalized' column shows the averaged score values of the configs that normalized to that of 'Baseline'. The periodic hot pages demotion and cold pages promotion that was conducted to simulate dynamic access pattern was started from the beginning of the workload. It resulted in the DRAM tier utilization always under the watermark, and hence no real demotion was happened for all test runs. This means the above results show no difference between LRU-based and DAMON-based demotions. Only difference between NUMAB-2 and DAMON-based promotions are represented on the results. Numab_tiering config degraded the performance about 7.36 %. We suspect this happened because NUMAB-2's synchronous promotion was blocking the Taobench's real work progress. DAMON_tiering config improved the performance about 4.43 %. We believe this shows effectiveness of DAMON-based promotion that didn't block Taobench's real work progress due to its asynchronous nature. Also this means DAMON's monitoring results are accurate enough to provide visible amount of improvement. Evaluation Limitations ---------------------- As mentioned above, this evaluation shows only comparison of promotion mechanisms. DAMON-based tiering is recommended to be used together with reclaim-based demotion as a faster backup under significant memory pressure, though. From some perspective, the modified version of Taobench may seems making the picture distorted too much. It would be better to evaluate with more realistic workload, or more finely tuned micro benchmarks. Patch Sequence ============== The first patch (patch 1) implements two new quota goal metrics on core layer and expose it to DAMON core kernel API. The second and third ones (patches 2 and 3) further link it to DAMON sysfs interface. Three following patches (patches 4-6) document the new feature and sysfs file on design, usage, and ABI documents. The final one (patch 7) implements a working version of a self-tuned DAMON-based memory tiering solution in an incomplete but easy to understand form as a kernel module under samples/damon/ directory. References ========== [1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/ [2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org [3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md This patch (of 7): Used and free space ratios for specific NUMA nodes can be useful inputs for NUMA-specific DAMOS schemes' aggressiveness self-tuning feedback loop. Implement DAMOS quota goal metrics for such self-tuned schemes. Link: https://lkml.kernel.org/r/20250420194030.75838-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250420194030.75838-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-13Merge 6.15-rc6 into usb-nextGreg Kroah-Hartman
We need the USB fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-13Merge 6.15-rc6 into char-misc-nextGreg Kroah-Hartman
We need the iio/hyperv fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-12Merge branch 'for-6.16/tsm-mr' into tsm-nextDan Williams
Merge measurement-register infrastructure for v6.16. Resolve conflicts with the establishment of drivers/virt/coco/guest/ for cross-vendor common TSM functionality. Address a mis-merge with a fixup from Lukas: Link: http://lore.kernel.org/20250509134031.70559-1-lukas.bulwahn@redhat.com
2025-05-12scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES bufferSteve Siwinski
The REPORT ZONES buffer size is currently limited by the HBA's maximum segment count to ensure the buffer can be mapped. However, the block layer further limits the number of iovec entries to 1024 when allocating a bio. To avoid allocation of buffers too large to be mapped, further restrict the maximum buffer size to BIO_MAX_INLINE_VECS. Replace the UIO_MAXIOV symbolic name with the more contextually appropriate BIO_MAX_INLINE_VECS. Fixes: b091ac616846 ("sd_zbc: Fix report zones buffer allocation") Cc: stable@vger.kernel.org Signed-off-by: Steve Siwinski <ssiwinski@atto.com> Link: https://lore.kernel.org/r/20250508200122.243129-1-ssiwinski@atto.com Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-05-12scsi: sd: Remove the stream_status member from scsi_stream_status_headerChristoph Hellwig
Having a variable length array at the end of scsi_stream_status_header only causes problems. Remove it and switch sd_is_perm_stream(), which is the only place that currently uses it, to use the scsi_stream_status directly following it in the local buf structure. Besides being a much better data structure design, this also avoids a -Wflex-array-member-not-at-end warning. Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505060640.3398500-1-hch@lst.de Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-05-12netlink: fix policy dump for int with validation callbackJakub Kicinski
Recent devlink change added validation of an integer value via NLA_POLICY_VALIDATE_FN, for sparse enums. Handle this in policy dump. We can't extract any info out of the callback, so report only the type. Fixes: 429ac6211494 ("devlink: define enum for attr types of dynamic attributes") Reported-by: syzbot+01eb26848144516e7f0a@syzkaller.appspotmail.com Link: https://patch.msgid.link/20250509212751.1905149-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12Merge branch 'for-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux Tony Nguyen says: ==================== Prepare for Intel IPU E2000 (GEN3) This is the first part in introducing RDMA support for idpf. ---------------------------------------------------------------- Tatyana Nikolova says: To align with review comments, the patch series introducing RDMA RoCEv2 support for the Intel Infrastructure Processing Unit (IPU) E2000 line of products is going to be submitted in three parts: 1. Modify ice to use specific and common IIDC definitions and pass a core device info to irdma. 2. Add RDMA support to idpf and modify idpf to use specific and common IIDC definitions and pass a core device info to irdma. 3. Add RDMA RoCEv2 support for the E2000 products, referred to as GEN3 to irdma. This first part is a 5 patch series based on the original "iidc/ice/irdma: Update IDC to support multiple consumers" patch to allow for multiple CORE PCI drivers, using the auxbus. Patches: 1) Move header file to new name for clarity and replace ice specific DSCP define with a kernel equivalent one in irdma 2) Unify naming convention 3) Separate header file into common and driver specific info 4) Replace ice specific DSCP define with a kernel equivalent one in ice 5) Implement core device info struct and update drivers to use it ---------------------------------------------------------------- v1: https://lore.kernel.org/20250505212037.2092288-1-anthony.l.nguyen@intel.com IWL reviews: [v5] https://lore.kernel.org/20250416021549.606-1-tatyana.e.nikolova@intel.com [v4] https://lore.kernel.org/20250225050428.2166-1-tatyana.e.nikolova@intel.com [v3] https://lore.kernel.org/20250207194931.1569-1-tatyana.e.nikolova@intel.com [v2] https://lore.kernel.org/20240824031924.421-1-tatyana.e.nikolova@intel.com [v1] https://lore.kernel.org/20240724233917.704-1-tatyana.e.nikolova@intel.com * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: iidc/ice/irdma: Update IDC to support multiple consumers ice: Replace ice specific DSCP mapping num with a kernel define iidc/ice/irdma: Break iidc.h into two headers iidc/ice/irdma: Rename to iidc_* convention iidc/ice/irdma: Rename IDC header file ==================== Link: https://patch.msgid.link/20250509200712.2911060-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12helpers: make few bpf helpers publicMykyta Yatsenko
Make bpf_dynptr_slice_rdwr, bpf_dynptr_check_off_len and __bpf_dynptr_write available outside of the helpers.c by adding their prototypes into linux/include/bpf.h. bpf_dynptr_check_off_len() implementation is moved to header and made inline explicitly, as small function should typically be inlined. These functions are going to be used from bpf_trace.c in the next patch of this series. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20250512205348.191079-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12soc: qcom: llcc-qcom: Add support for SM8750Melody Olvera
Add system cache table and configs for SM8750 SoCs. Signed-off-by: Melody Olvera <melody.olvera@oss.qualcomm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250512-sm8750_llcc_master-v5-3-d78dca6282a5@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2025-05-12drm: nova-drm: add initial driver skeletonDanilo Krummrich
Add the initial nova-drm driver skeleton. nova-drm is connected to nova-core through the auxiliary bus and implements the DRM parts of the nova driver stack. For now, it implements the fundamental DRM abstractions, i.e. creates a DRM device and registers it, exposing a three sample IOCTLs. DRM_IOCTL_NOVA_GETPARAM - provides the PCI bar size from the bar that maps the GPUs VRAM from nova-core DRM_IOCTL_NOVA_GEM_CREATE - creates a new dummy DRM GEM object and returns a handle DRM_IOCTL_NOVA_GEM_INFO - provides metadata for the DRM GEM object behind a given handle I implemented a small userspace test suite [1] that utilizes this interface. Link: https://gitlab.freedesktop.org/dakr/drm-test [1] Reviewed-by: Maxime Ripard <mripard@kernel.org> Acked-by: Dave Airlie <airlied@redhat.com> Link: https://lore.kernel.org/r/20250424160452.8070-3-dakr@kernel.org [ Kconfig: depend on DRM=y rather than just DRM. - Danilo ] Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2025-05-12io_uring: drain based on allocates reqsPavel Begunkov
Don't rely on CQ sequence numbers for draining, as it has become messy and needs cq_extra adjustments. Instead, base it on the number of allocated requests and only allow flushing when all requests are in the drain list. As a result, cq_extra is gone, no overhead for its accounting in aux cqe posting, less bloating as it was inlined before, and it's in general simpler than trying to track where we should bump it and where it should be put back like in cases of overflow. Also, it'll likely help with cleaning and unifying some of the CQ posting helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com [axboe: fold in fix from link2] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12ACPI: MRRM: Minimal parse of ACPI MRRM tableTony Luck
The resctrl file system code needs to know how many region tags are supported. Parse the ACPI MRRM table and save the max_mem_region value. Provide a function for resctrl to collect that value. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20250505173819.419271-2-tony.luck@intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-12ACPICA: Update copyright yearSaket Dumbre
ACPICA commit 45253be18b3f37d46cd0072aa3f8a0a21a70e0a4 Changes needed by acpisrc to update copyright year when building for release. Link: https://github.com/acpica/acpica/commit/45253be1 Signed-off-by: Saket Dumbre <saket.dumbre@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-12ACPICA: Logfile: Changes for version 20250404Saket Dumbre
ACPICA commit 52de9040740c562a35244de19d21c0313964c874 Link: https://github.com/acpica/acpica/commit/52de9040 Signed-off-by: Saket Dumbre <saket.dumbre@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2251077.Icojqenx9y@rjwysocki.net
2025-05-12ACPICA: Replace strncpy() with memcpy()Ahmed Salem
ACPICA commit 83019b471e1902151e67c588014ba2d09fa099a3 strncpy() is deprecated for NUL-terminated destination buffers[1]. Use memcpy() for length-bounded destinations. Link: https://github.com/KSPP/linux/issues/90 [1] Link: https://github.com/acpica/acpica/commit/83019b47 Signed-off-by: Ahmed Salem <x0rw3ll@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/1910878.atdPhlSkOF@rjwysocki.net
2025-05-12ACPICA: Apply ACPI_NONSTRING in more placesAhmed Salem
ACPICA commit 1035a3d453f7dd49a235a59ee84ebda9d2d2f41b Add ACPI_NONSTRING for destination char arrays without a terminating NUL character. This is a follow-up to commit 35ad99236f3a ("ACPICA: Apply ACPI_NONSTRING") where not all instances received the same treatment, in preparation for replacing strncpy() calls with memcpy() Link: https://github.com/acpica/acpica/commit/1035a3d4 Signed-off-by: Ahmed Salem <x0rw3ll@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3833065.MHq7AAxBmi@rjwysocki.net
2025-05-12ACPICA: Avoid sequence overread in call to strncmp()Ahmed Salem
ACPICA commit 8b83a8d88dfec59ea147fad35fc6deea8859c58c ap_get_table_length() checks if tables are valid by calling ap_is_valid_header(). The latter then calls ACPI_VALIDATE_RSDP_SIG(Table->Signature). ap_is_valid_header() accepts struct acpi_table_header as an argument, so the signature size is always fixed to 4 bytes. The problem is when the string comparison is between ACPI-defined table signature and ACPI_SIG_RSDP. Common ACPI table header specifies the Signature field to be 4 bytes long[1], with the exception of the RSDP structure whose signature is 8 bytes long "RSD PTR " (including the trailing blank character)[2]. Calling strncmp(sig, rsdp_sig, 8) would then result in a sequence overread[3] as sig would be smaller (4 bytes) than the specified bound (8 bytes). As a workaround, pass the bound conditionally based on the size of the signature being passed. Link: https://uefi.org/specs/ACPI/6.5_A/05_ACPI_Software_Programming_Model.html#system-description-table-header [1] Link: https://uefi.org/specs/ACPI/6.5_A/05_ACPI_Software_Programming_Model.html#root-system-description-pointer-rsdp-structure [2] Link: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wstringop-overread [3] Link: https://github.com/acpica/acpica/commit/8b83a8d8 Signed-off-by: Ahmed Salem <x0rw3ll@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2248233.Mh6RI2rZIc@rjwysocki.net
2025-05-12ACPICA: actbl2.h: ACPI 6.5: RAS2: Rename structure and field names of the ↵Shiju Jose
RAS2 table ACPICA commit 2c8a38f747de9d977491a494faf0dfaf799b777b Rename the structure and field names of the RAS2 table to shorten them and avoid long lines in the ACPI RAS2 drivers. 1. struct acpi_ras2_shared_memory to struct acpi_ras2_shmem 2. In struct acpi_ras2_shared_memory: fields, - set_capabilities[16] to set_caps[16] - num_parameter_blocks to num_param_blks - set_capabilities_status to set_caps_status 3. struct acpi_ras2_patrol_scrub_parameter to struct acpi_ras2_patrol_scrub_param 4. In struct acpi_ras2_patrol_scrub_parameter: fields, - patrol_scrub_command to command - requested_address_range to req_addr_range - actual_address_range to actl_addr_range Link: https://github.com/acpica/acpica/commit/2c8a38f7 Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/1942053.CQOukoFCf9@rjwysocki.net
2025-05-12ACPICA: Introduce ACPI_NONSTRINGKees Cook
ACPICA commit 878823ca20f1987cba0c9d4c1056be0d117ea4fe In order to distinguish character arrays from C Strings (i.e. strings with a terminating NUL character), add support for the "nonstring" attribute provided by GCC. (A better name might be "ACPI_NONCSTRING", but that's the attribute name, so stick to the existing naming convention.) GCC 15's -Wunterminated-string-initialization will warn about truncation of the NUL byte for string initializers unless the destination is marked with "nonstring". Prepare for applying this attribute to the project. Link: https://github.com/acpica/acpica/commit/878823ca Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/1841930.VLH7GnMWUR@rjwysocki.net Signed-off-by: Kees Cook <kees@kernel.org> [ rjw: Pick up the tag from Kees ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-12ACPICA: actbl2.h: ERDT: Add typedef and other definitionsTony Luck
ACPICA commit dddd9270531d74af523afa68515d8aae6a18bbe0 The ERDT table (and its many subtables) enumerate capabilities and methods for Intel Resource Director Technology to monitor and control L3 cache allocation and memory bandwidth by CPU cores and IO devices. Structure defined in the Intel Resource Director Technology (RDT) Architecture specification downloadable from www.intel.com/sdm Link: https://github.com/acpica/acpica/commit/dddd9270 Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3296755.5fSG56mABF@rjwysocki.net
2025-05-12ACPICA: infrastructure: Add new DMT_BUF types and shorten a long nameTony Luck
ACPICA commit b8713f71b4023a0396fe61503bbbf5226e5eed1b Some ERDT subtables have 11 and 24 byte reserved fields. Add the ACPI_DMT_BUF11 and ACPI_DMT_BUF24 types to describe these reserved fields in struct acpi_dmtable_info structures. Shorten the ACPI_SUBTABLE_HEADER_16 name to ACPI_SUBTBL_HDR Link: https://github.com/acpica/acpica/commit/b8713f71 Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3643286.iIbC2pHGDl@rjwysocki.net
2025-05-12ACPICA: MRRM: Some cleanupsTony Luck
ACPICA commit 022e2e4169841f429dbda677a4780830bf4c2177 1) Added source specification to MRRM table comment in actbl2.h 2) Shorten typedef from ACPI_TABLE_MRRM_MEM_RANGE_ENTRY to struct acpi_mrrm_mem_range_entry 3) Add new typedefs to source/tools/acpisrc/astable.c 4) Fix cut and paste errors in acpi_dm_table_info_mrrm0[] definition 5) Fix indent and source code style errors in actbl2.h 6) The base/length fields in the memory range structure are system memory addresses, not "MMIO". Update the acpi_dm_table_info_mrrm0[] strings. 7) Add main/sub table comments to acpi_dm_dump_mrrm() and dt_compile_mrrm() Link: https://github.com/acpica/acpica/commit/022e2e41 Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2018955.PYKUYFuaPT@rjwysocki.net
2025-05-12ACPICA: actbl2: Add definitions for RIMTSunil V L
ACPICA commit 73c32bc89cad64ab19c1231a202361e917e6823c RISC-V IO Mapping Table (RIMT) is a new static table defined for RISC-V to communicate IOMMU information to the OS. The specification for RIMT is available at [1]. Add structure definitions for RIMT. Link: https://github.com/riscv-non-isa/riscv-acpi-rimt [1] Link: https://github.com/acpica/acpica/commit/73c32bc8 Signed-off-by: Sunil V L <sunilvl@ventanamicro.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/10665648.nUPlyArG6x@rjwysocki.net
2025-05-12ACPICA: actbl2.h: MRRM: Add typedef and other definitionsTony Luck
ACPICA commit 04fd53b2647b9f6f98cfca551383689cb3b59362 The MRRM table describes association between physical address ranges and "region numbers". Structure defined in the Intel Resource Director Technology (RDT) Architecture specification downloadable from www.intel.com/sdm Link: https://github.com/acpica/acpica/commit/04fd53b2 Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3372188.44csPzL39Z@rjwysocki.net