summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2025-03-17fs/dax: properly refcount fs dax pagesAlistair Popple
Currently fs dax pages are considered free when the refcount drops to one and their refcounts are not increased when mapped via PTEs or decreased when unmapped. This requires special logic in mm paths to detect that these pages should not be properly refcounted, and to detect when the refcount drops to one instead of zero. On the other hand get_user_pages(), etc. will properly refcount fs dax pages by taking a reference and dropping it when the page is unpinned. Tracking this special behaviour requires extra PTE bits (eg. pte_devmap) and introduces rules that are potentially confusing and specific to FS DAX pages. To fix this, and to possibly allow removal of the special PTE bits in future, convert the fs dax page refcounts to be zero based and instead take a reference on the page each time it is mapped as is currently the case for normal pages. This may also allow a future clean-up to remove the pgmap refcounting that is currently done in mm/gup.c. Link: https://lkml.kernel.org/r/c7d886ad7468a20452ef6e0ddab6cfe220874e7c.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/huge_memory: add vmf_insert_folio_pmd()Alistair Popple
Currently DAX folio/page reference counts are managed differently to normal pages. To allow these to be managed the same as normal pages introduce vmf_insert_folio_pmd. This will map the entire PMD-sized folio and take references as it would for a normally mapped page. This is distinct from the current mechanism, vmf_insert_pfn_pmd, which simply inserts a special devmap PMD entry into the page table without holding a reference to the page for the mapping. It is not currently useful to implement a more generic vmf_insert_folio() which selects the correct behaviour based on folio_order(). This is because PTE faults require only a subpage of the folio to be PTE mapped rather than the entire folio. It would be possible to add this context somewhere but callers already need to handle PTE faults and PMD faults separately so a more generic function is not useful. Link: https://lkml.kernel.org/r/7bf92a2e68225d13ea368d53bbfee327314d1c40.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Wiliams <dan.j.williams@intel.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/huge_memory: add vmf_insert_folio_pud()Alistair Popple
Currently DAX folio/page reference counts are managed differently to normal pages. To allow these to be managed the same as normal pages introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio and take references as it would for a normally mapped page. This is distinct from the current mechanism, vmf_insert_pfn_pud, which simply inserts a special devmap PUD entry into the page table without holding a reference to the page for the mapping. Link: https://lkml.kernel.org/r/649a1ef91d556593948351e94f51ef73a14f6794.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/rmap: add support for PUD sized mappings to rmapAlistair Popple
The rmap doesn't currently support adding a PUD mapping of a folio. This patch adds support for entire PUD mappings of folios, primarily to allow for more standard refcounting of device DAX folios. Currently DAX is the only user of this and it doesn't require support for partially mapped PUD-sized folios so we don't support for that for now. Link: https://lkml.kernel.org/r/248582c07896e30627d1aeaeebc6949cfd91b851.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/memory: add vmf_insert_page_mkwrite()Alistair Popple
Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This creates a special devmap PTE entry for the pfn but does not take a reference on the underlying struct page for the mapping. This is because DAX page refcounts are treated specially, as indicated by the presence of a devmap entry. To allow DAX page refcounts to be managed the same as normal page refcounts introduce vmf_insert_page_mkwrite(). This will take a reference on the underlying page much the same as vmf_insert_page, except it also permits upgrading an existing mapping to be writable if requested/possible. Link: https://lkml.kernel.org/r/4ce3aa984c060f370105e0bfef1035869578be47.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Wiliams <dan.j.williams@intel.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/memory: enhance insert_page_into_pte_locked() to create writable mappingsAlistair Popple
In preparation for using insert_page() for DAX, enhance insert_page_into_pte_locked() to handle establishing writable mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE which bypasses the typical set_pte_range() in finish_fault. Link: https://lkml.kernel.org/r/f7354fd9c2f5d0c2fa321733039f9f87e791023e.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: allow compound zone device pagesAlistair Popple
Zone device pages are used to represent various type of device memory managed by device drivers. Currently compound zone device pages are not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only user of higher order zone device pages and have their own page reference counting. A future change will unify FS DAX reference counting with normal page reference counting rules and remove the special FS DAX reference counting. Supporting that requires compound zone device pages. Supporting compound zone device pages requires compound_head() to distinguish between head and tail pages whilst still preserving the special struct page fields that are specific to zone device pages. A tail page is distinguished by having bit zero being set in page->compound_head, with the remaining bits pointing to the head page. For zone device pages page->compound_head is shared with page->pgmap. The page->pgmap field must be common to all pages within a folio, even if the folio spans memory sections. Therefore pgmap is the same for both head and tail pages and can be moved into the folio and we can use the standard scheme to find compound_head from a tail page. Link: https://lkml.kernel.org/r/67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mm_init: move p2pdma page refcount initialisation to p2pdmaAlistair Popple
Currently ZONE_DEVICE page reference counts are initialised by core memory management code in __init_zone_device_page() as part of the memremap() call which driver modules make to obtain ZONE_DEVICE pages. This initialises page refcounts to 1 before returning them to the driver. This was presumably done because it drivers had a reference of sorts on the page. It also ensured the page could always be mapped with vm_insert_page() for example and would never get freed (ie. have a zero refcount), freeing drivers of manipulating page reference counts. However it complicates figuring out whether or not a page is free from the mm perspective because it is no longer possible to just look at the refcount. Instead the page type must be known and if GUP is used a secondary pgmap reference is also sometimes needed. To simplify this it is desirable to remove the page reference count for the driver, so core mm can just use the refcount without having to account for page type or do other types of tracking. This is possible because drivers can always assume the page is valid as core kernel will never offline or remove the struct page. This means it is now up to drivers to initialise the page refcount as required. P2PDMA uses vm_insert_page() to map the page, and that requires a non-zero reference count when initialising the page so set that when the page is first mapped. Link: https://lkml.kernel.org/r/6aedb0ac2886dcc4503cb705273db5b3863a0b66.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/gup: remove redundant check for PCI P2PDMA pageAlistair Popple
PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the check in __gup_device_huge() is redundant. Remove it Link: https://lkml.kernel.org/r/260e3dcfaf05ff1c734a49698ed4332b5dae04c2.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Wiliams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17fs/dax: always remove DAX page-cache entries when breaking layoutsAlistair Popple
Prior to any truncation operations file systems call dax_break_mapping() to ensure pages in the range are not under going DMA. Later DAX page-cache entries will be removed by truncate_folio_batch_exceptionals() in the generic page-cache code. However this makes it possible for folios to be removed from the page-cache even though they are still DMA busy if the file-system hasn't called dax_break_mapping(). It also means they can never be waited on in future because FS DAX will lose track of them once the page-cache entry has been deleted. Instead it is better to delete the FS DAX entry when the file-system calls dax_break_mapping() as part of it's truncate operation. This ensures only idle pages can be removed from the FS DAX page-cache and makes it easy to detect if a file-system hasn't called dax_break_mapping() prior to a truncate operation. Link: https://lkml.kernel.org/r/3be6115eaaa8d28fee37fcba3287be4f226a7d24.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: shmem: factor out the within_size logic into a new helperBaolin Wang
Factor out the within_size logic into a new helper to remove duplicate code. Link: https://lkml.kernel.org/r/527dea9d7e32fe6b94c7fe00df2c126203017911.1738918357.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: shmem: change the return value of shmem_find_swap_entries()Baolin Wang
The shmem_find_swap_entries() originally returned the index corresponding to the swap entry, but no callers used this return value. It should return the number of entries that were found like other functions, which can be used by the callers. No functional changes. Link: https://lkml.kernel.org/r/070489b5946b8379b2a2d25f78115cef167cd145.1738918357.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: shmem: remove duplicate error validationBaolin Wang
Remove duplicate error code checks for 'start' and 'end', as the get_order_from_str() will only return -EINVAL if the cmdline string is configured incorrectly. Link: https://lkml.kernel.org/r/dfadaba4c8b24c5ae1467fe8b6744b654c65ec91.1738918357.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: shmem: remove 'fadvise()' commentsBaolin Wang
Similar to commit 255ff62d1586 ("docs: tmpfs: drop 'fadvise()' from the documentation"), fadvise() has no HUGEPAGE advise currently. Remove the confusing fadvise() comments. Link: https://lkml.kernel.org/r/fae702b9775f58b55b45be5eaad22d8586d0290a.1738918357.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: shmem: drop the unused macroBaolin Wang
Patch series "Some trivial cleanups for shmem". Patch 1 - Patch 5 do some trivial cleanups and refactoring for shmem. Patch 6 adds myself as shmem reviewer. This patch (of 6): Drop the unused 'BLOCKS_PER_PAGE' macro. Link: https://lkml.kernel.org/r/cover.1738918357.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/69264cee1d938442477e657004e4924f8a5c4dd4.1738918357.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zpool: remove zpool_malloc_support_movable()Yosry Ahmed
zpool_malloc_support_movable() always returns true for zsmalloc, the only remaining zpool driver. Remove it and set the gfp flags in zswap_compress() accordingly. Opportunistically use GFP_NOWAIT instead of __GFP_NOWARN | __GFP_KSWAPD_RECLAIM for conciseness as they are equivalent. Link: https://lkml.kernel.org/r/20250305061134.4105762-6-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zsmalloc: remove object mapping APIs and per-CPU map areasYosry Ahmed
zs_map_object() and zs_unmap_object() are no longer used, remove them. Since these are the only users of per-CPU mapping_areas, remove them and the associated CPU hotplug callbacks too. [yosry.ahmed@linux.dev: update the docs] Link: https://lkml.kernel.org/r/Z8ier-ZZp8T6MOTH@google.com Link: https://lkml.kernel.org/r/20250305061134.4105762-5-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zpool: remove object mapping APIsYosry Ahmed
zpool_map_handle(), zpool_unmap_handle(), and zpool_can_sleep_mapped() are no longer used. Remove them with the underlying driver callbacks. Link: https://lkml.kernel.org/r/20250305061134.4105762-4-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zswap: use object read/write APIs instead of object mapping APIsYosry Ahmed
Use the new object read/write APIs instead of mapping APIs. On compress side, zpool_obj_write() is more concise and provides exactly what zswap needs to write the compressed object to the zpool, instead of map->copy->unmap. On the decompress side, zpool_obj_read_begin() is sleepable, which allows avoiding the memcpy() for zsmalloc and slightly simplifying the code by: - Avoiding checking if the zpool driver is sleepable, reducing special cases and shrinking the huge comment. - Having a single zpool_obj_read_end() call rather than multiple conditional zpool_unmap_handle() calls. The !virt_addr_valid() case can be removed in the future if the crypto API supports kmap addresses or by using kmap_to_page(), completely eliminating the memcpy() path in zswap_decompress(). This a step toward that. In that spirit, opportunistically make the comment more specific about the kmap case instead of generic non-linear addresses. This is the only case that needs to be handled in practice, and the generic comment makes it seem like a bigger problem that it actually is. Link: https://lkml.kernel.org/r/20250305061134.4105762-3-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: zpool: add interfaces for object read/write APIsYosry Ahmed
Patch series "Switch zswap to object read/write APIs". This patch series updates zswap to use the new object read/write APIs defined by zsmalloc in [1], and remove the old object mapping APIs and the related code from zpool and zsmalloc. This patch (of 5): Zsmalloc introduced new APIs to read/write objects besides mapping them. Add the necessary zpool interfaces. Link: https://lkml.kernel.org/r/20250305061134.4105762-1-yosry.ahmed@linux.dev Link: https://lkml.kernel.org/r/20250305061134.4105762-2-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/shrinker: fix name consistency issue in shrinker_debugfs_rename()Liu Ye
After calling debugfs_change_name function, the return value should be checked and the old name restored. If debugfs_change_name fails, the new name memory should be freed. The effect is that the shrinker->name is not consistent with the name displayed in debugfs. Link: https://lkml.kernel.org/r/20250305071759.661055-1-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by:Qi Zheng <zhengqi.arch@bytedance.co Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/paddr: respect ops_filters_default_rejectSeongJae Park
Use damos->ops_filters_default_reject, which is set based on the installed filters' behaviors, from physical address space DAMON operations set. Link: https://lkml.kernel.org/r/20250304211913.53574-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: set damos_filter default allowance behavior based on ↵SeongJae Park
installed filters Decide whether to allow or reject by default on core and opertions layer handled filters evaluation stages. It is decided as the opposite of the last installed filter's behavior. If there is no filter at all, allow by default. If there is any operations layer handled filters, core layer's filtering stage sets allowing as the default behavior regardless of the last filter of core layer-handling ones, since the last filter of core layer handled filters in the case is not really the last filter of the entire filtering stage. Also, make the core layer's DAMOS filters handling stage uses the newly set behavior field. [sj@kernel.org: setup damos->{core,ops}_filters_default_reject for initial start] Link: https://lkml.kernel.org/r/20250315222610.35245-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250304211913.53574-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/paddr: support only damos->ops_filtersSeongJae Park
DAMON physical address space operation set implementation (paddr) started handling both damos->filters and damos->ops_filters to avoid breakage during the change for the ->ops_filters setup. Now the change is done, so paddr's support of ->filters is only a waste that can safely be dropped. Remove it. Link: https://lkml.kernel.org/r/20250304211913.53574-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: put ops-handled filters to damos->ops_filtersSeongJae Park
damos->ops_filters has introduced to be used for all operations layer handled filters. But DAMON kernel API callers can put any type of DAMOS filters to any of damos->filters and damos->ops_filters. DAMON user-space ABI users have no way to use ->ops_filters at all. Update damos_add_filter(), which should be used by API callers to install DAMOS filters, to add filters to ->filters and ->ops_filters depending on their handling layer. The change forces both API callers and ABI users to use proper lists since ABI users use the API internally. Link: https://lkml.kernel.org/r/20250304211913.53574-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: support committing ops_filtersSeongJae Park
DAMON kernel API callers should use damon_commit_ctx() to install DAMON parameters including DAMOS filters. But damos_commit_ops_filters(), which is called by damon_commit_ctx() for filters installing, is not handling damos->ops_filters. Hence, no DAMON kernel API caller can use damos->ops_filters. Do the committing of the ops_filters to make it usable. Link: https://lkml.kernel.org/r/20250304211913.53574-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/paddr: support ops_filtersSeongJae Park
DAMON keeps all DAMOS filters in damos->filters. Upcoming changes will make it to use damos->ops_filters for all operations layer handled DAMOS filters, though. DAMON physical address space operations set implementation (paddr) is not ready for the changes, since it handles only damos->filters. To avoid any breakage during the upcoming changes, make paddr to handle both lists. After the change is made, ->filters support on paddr can be safely removed. Link: https://lkml.kernel.org/r/20250304211913.53574-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: introduce damos->ops_filtersSeongJae Park
Patch series "mm/damon: make allow filters after reject filters useful and intuitive". DAMOS filters do allow or reject elements of memory for given DAMOS scheme only if those match the filter criterias. For elements that don't match any DAMOS filter, 'allowing' is the default behavior. This makes allow-filters that don't have any reject-filter after them meaningless sources of overhead. The decision was made to keep the behavior consistent with that before the introduction of allow-filters. This, however, makes usage of DAMOS filters confusing and inefficient. It is more intuitive and still consistent behavior to reject by default unless there is no filter at all or the last filter is a reject filter. Update the filtering logic in the way and update documents to clarify the behavior. Note that this is changing the old behavior. But the old behavior for the problematic filter combination was definitely confusing, inefficient and anyway useless. Also, the behavior has relatively recently introduced. It is difficult to anticipate any user that depends on the behavior. Hence this is not a user-breaking behavior change but an obvious improvement. This patch (of 9): DAMOS filters can be categorized into two groups depending on which layer they are handled, namely core layer and ops layer. The groups are important because the filtering behavior depends on evaluation sequence of filters, and core layer-handled filters are evaluated before operations layer-handled ones. The behavior is clearly documented, but the implementation is bit inefficient and complicated. All filters are maintained in a single list (damos->filters) in mix. Filters evaluation logics in core layer and operations layer iterates all the filters on the list, while skipping filters that should be not handled by the layer of the logic. It is inefficient. Making future extensions having differentiations for filters of different handling layers will also be complicated. Add a new list that will be used for having all operations layer-handled DAMOS filters to DAMOS scheme data structure. Also add the support of its initialization and basic traversal functions. Link: https://lkml.kernel.org/r/20250304211913.53574-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250304211913.53574-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17writeback: fix calculations in trace_balance_dirty_pages() for cgwbTang Yizhou
In the commit dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") of the cgroup writeback backpressure propagation patchset, Tejun made some adaptations to trace_balance_dirty_pages() for cgroup writeback. However, this adaptation was incomplete and Tejun missed further adaptation in the subsequent patches. In the cgroup writeback scenario, if sdtc in balance_dirty_pages() is assigned to mdtc, then upon entering trace_balance_dirty_pages(), __entry->limit should be assigned based on the dirty_limit of the corresponding memcg's wb_domain, rather than global_wb_domain. To address this issue and simplify the implementation, introduce a 'limit' field in struct dirty_throttle_control to store the hard_limit value computed in wb_position_ratio() by calling hard_dirty_limit(). This field will then be used in trace_balance_dirty_pages() to assign the value to __entry->limit. Link: https://lkml.kernel.org/r/20250304110318.159567-4-yizhou.tang@shopee.com Fixes: dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17writeback: let trace_balance_dirty_pages() take struct dtc as parameterTang Yizhou
Patch series "Fix calculations in trace_balance_dirty_pages() for cgwb", v2. In my experiment, I found that the output of trace_balance_dirty_pages() in the cgroup writeback scenario was strange because trace_balance_dirty_pages() always uses global_wb_domain.dirty_limit for related calculations instead of the dirty_limit of the corresponding memcg's wb_domain. The basic idea of the fix is to store the hard dirty limit value computed in wb_position_ratio() into struct dirty_throttle_control and use it for calculations in trace_balance_dirty_pages(). This patch (of 3): Currently, trace_balance_dirty_pages() already has 12 parameters. In the patch #3, I initially attempted to introduce an additional parameter. However, in include/linux/trace_events.h, bpf_trace_run12() only supports up to 12 parameters and bpf_trace_run13() does not exist. To reduce the number of parameters in trace_balance_dirty_pages(), we can make it accept a pointer to struct dirty_throttle_control as a parameter. To achieve this, we need to move the definition of struct dirty_throttle_control from mm/page-writeback.c to include/linux/writeback.h. Link: https://lkml.kernel.org/r/20250304110318.159567-1-yizhou.tang@shopee.com Link: https://lkml.kernel.org/r/20250304110318.159567-2-yizhou.tang@shopee.com Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Tang Yizhou <yizhou.tang@shopee.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: hugetlb: log time needed to allocate hugepagesThomas Prescher
Having this information allows users to easily tune the hugepages_node_threads parameter. Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-3-7db8c6dc0453@cyberus-technology.de Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: hugetlb: add hugetlb_alloc_threads cmdline optionThomas Prescher
Add a command line option that enables control of how many threads should be used to allocate huge pages. [akpm@linux-foundation.org: tidy up a comment] Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-2-7db8c6dc0453@cyberus-technology.de Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: hugetlb: improve parallel huge page allocation timeThomas Prescher
Patch series "Add a command line option that enables control of how many threads should be used to allocate huge pages", v2. Allocating huge pages can take a very long time on servers with terabytes of memory even when they are allocated at boot time where the allocation happens in parallel. Before this series, the kernel used a hard coded value of 2 threads per NUMA node for these allocations. This value might have been good enough in the past but it is not sufficient to fully utilize newer systems. This series changes the default so the kernel uses 25% of the available hardware threads for these allocations. In addition, we allow the user that wish to micro-optimize the allocation time to override this value via a new kernel parameter. We tested this on 2 generations of Xeon CPUs and the results show a big improvement of the overall allocation time. +-----------------------+-------+-------+-------+-------+-------+ | threads | 8 | 16 | 32 | 64 | 128 | +-----------------------+-------+-------+-------+-------+-------+ | skylake 144 cpus | 44s | 22s | 16s | 19s | 20s | | cascade lake 192 cpus | 39s | 20s | 11s | 10s | 9s | +-----------------------+-------+-------+-------+-------+-------+ On skylake, we see an improvment of 2.75x when using 32 threads, on cascade lake we can get even better at 4.3x when we use 128 threads. This speedup is quite significant and users of large machines like these should have the option to make the machines boot as fast as possible. This patch (of 3): Before this patch, the kernel currently used a hard coded value of 2 threads per NUMA node for these allocations. This patch changes this policy and the kernel now uses 25% of the available hardware threads for the allocations. Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-0-7db8c6dc0453@cyberus-technology.de Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-1-7db8c6dc0453@cyberus-technology.de Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17memcg: bypass root memcg check for skmem chargingShakeel Butt
The root memcg is never associated with a socket in mem_cgroup_sk_alloc, so there is no need to check if the given memcg is root for the skmem charging code path. Link: https://lkml.kernel.org/r/20250228022354.2624249-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17page_counter: track failcnt only for legacy cgroupsShakeel Butt
Currently page_counter tracks failcnt for counters used by v1 and v2 controllers. However failcnt is only exported for v1 deployment and thus there is no need to maintain it in v2. The oom report does expose failcnt for memory and swap in v2 but v2 already maintains MEMCG_MAX and MEMCG_SWAP_MAX event counters which can be used. Link: https://lkml.kernel.org/r/20250228075808.207484-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17memcg: don't call propagate_protected_usage() for v1Shakeel Butt
Patch series "page_counter cleanup and size reduction". Commit c6f53ed8f213a ("mm, memcg: cg2 memory{.swap,}.peak write handlers") accidently increased the size of struct page_counter. This series rearrange the fields to reduce its size and also has some cleanups. This patch (of 3): Memcg-v1 does not support memory protection (min/low) and thus there is no need to track protected memory usage for it. Link: https://lkml.kernel.org/r/20250228075808.207484-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20250228075808.207484-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/page_alloc: clarify should_claim_block() commentaryBrendan Jackman
There's lots of text here but it's a little hard to follow, this is an attempt to break it up and align its structure more closely with the code. Reword the top-level function comment to just explain what question the function answers from the point of view of the caller. Break up the internal logic into different sections that can have their own commentary describing why that part of the rationale is present. Note the page_group_by_mobility_disabled logic is not explained in the commentary, that is outside the scope of this patch... Link: https://lkml.kernel.org/r/20250228-clarify-steal-v4-2-cb2ef1a4e610@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/page_alloc: clarify terminology in migratetype fallback codeBrendan Jackman
Patch series "mm/page_alloc: Some clarifications for migratetype fallback", v4. A couple of patches to try and make the code easier to follow. This patch (of 2): This code is rather confusing because: 1. "Steal" is sometimes used to refer to the general concept of allocating from a from a block of a fallback migratetype (steal_suitable_fallback()) but sometimes it refers specifically to converting a whole block's migratetype (can_steal_fallback()). 2. can_steal_fallback() sounds as though it's answering the question "am I functionally permitted to allocate from that other type" but in fact it is encoding a heuristic preference. 3. The same piece of data has different names in different places: can_steal vs whole_block. This reinforces point 2 because it looks like the different names reflect a shift in intent from "am I allowed to steal" to "do I want to steal", but no such shift exists. Fix 1. by avoiding the term "steal" in ambiguous contexts. Start using the term "claim" to refer to the special case of stealing the entire block. Fix 2. by using "should" instead of "can", and also rename its parameters and add some commentary to make it more explicit what they mean. Fix 3. by adopting the new "claim" terminology universally for this set of variables. Link: https://lkml.kernel.org/r/20250228-clarify-steal-v4-0-cb2ef1a4e610@google.com Link: https://lkml.kernel.org/r/20250228-clarify-steal-v4-1-cb2ef1a4e610@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: implement a command to update auto-tuned monitoring intervalsSeongJae Park
DAMON kernel API callers can show auto-tuned sampling and aggregation intervals from the monmitoring attributes data structure. That can be useful for debugging or tuning of the feature. DAMON user-space ABI users has no way to see that, though. Implement a new DAMON sysfs interface command, namely 'update_tuned_intervals', for the purpose. If the command is written to the kdamond state file, the tuned sampling and aggregation intervals will be updated to the corresponding sysfs interface files. Link: https://lkml.kernel.org/r/20250303221726.484227-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: commit intervals tuning goalSeongJae Park
Connect DAMON sysfs interface for sampling and aggregation intervals auto-tuning with DAMON core API, so that users can really use the feature using the sysfs files. Link: https://lkml.kernel.org/r/20250303221726.484227-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/sysfs: implement intervals tuning goal directorySeongJae Park
Implement DAMON sysfs interface directory and its files for setting DAMON sampling and aggregation intervals auto-tuning goal. Link: https://lkml.kernel.org/r/20250303221726.484227-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: implement intervals auto-tuningSeongJae Park
Implement the DAMON sampling and aggregation intervals auto-tuning mechanism as briefly described on 'struct damon_intervals_goal'. The core part for deciding the direction and amount of the changes is implemented reusing the feedback loop function which is being used for DAMOS quotas auto-tuning. Unlike the DAMOS quotas auto-tuning use case, limit the maximum decreasing amount after the adjustment to 50% of the current value, though. This is because the intervals have no good merits at rapid reductions since it could unnecessarily increase the monitoring overhead. Link: https://lkml.kernel.org/r/20250303221726.484227-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon: add data structure for monitoring intervals auto-tuningSeongJae Park
Patch series "mm/damon: auto-tune aggregation interval". DAMON requires time-consuming and repetitive aggregation interval tuning. Introduce a feature for automating it using a feedback loop that aims an amount of observed access events, like auto-exposing cameras. Background: Access Frequency Monitoring and Aggregation Interval ================================================================ DAMON checks if each memory element (damon_region) is accessed or not for every user-specified time interval called 'sampling interval'. It aggregates the check intervals on per-element counter called 'nr_accesses'. DAMON users can read the counters to get the access temperature of a given element. The counters are reset for every another user-specified time interval called 'aggregation interval'. This can be illustrated as DAMON continuously capturing a snapshot of access events that happen and captured within the last aggregation interval. This implies the aggregation interval plays a key role for the quality of the snapshots, like the camera exposure time. If it is too short, the amount of access events that happened and captured for each snapshot is small, so each snapshot will show no many interesting things but just a cold and dark world with hopefuly one pale blue dot or two. If it is too long, too many events are aggregated in a single shot, so each snapshot will look like world of flames, or Muspellheim. It will be difficult to find practical insights in both cases. Problem: Time Consuming and Repetitive Tuning ============================================= The appropriate length of the aggregation interval depends on how frequently the system and workloads are making access events that DAMON can observe. Hence, users have to tune the interval with excessive amount of tests with the target system and workloads. If the system and workloads are changed, the tuning should be done again. If the characteristic of the workloads is dynamic, it becomes more challenging. It is therefore time-consuming and repetitive. The tuning challenge mainly stems from the wrong question. It is not asking users what quality of monitoring results they want, but how DAMON should operate for their hidden goal. To make the right answer, users need to fully understand DAMON's mechanisms and the characteristics of their workloads. Users shouldn't be asked to understand the underlying mechanism. Understanding the characteristics of the workloads shouldn't be the role of users but DAMON. Aim-oriented Feedback-driven Auto-Tuning ========================================= Fortunately, the appropriate length of the aggregation interval can be inferred using a feedback loop. If the current snapshots are showing no much intresting information, in other words, if it shows only rare access events, increasing the aggregation interval helps, and vice versa. We tested this theory on a few real-world workloads, and documented one of the experience with an official DAMON monitoring intervals tuning guideline. Since it is a simple theory that requires repeatable tries, it can be a good job for machines. Based on the guideline's theory, we design an automation of aggregation interval tuning, in a way similar to that of camera auto-exposure feature. It defines the amount of interesting information as the ratio of DAMON-observed access events that DAMON actually observed to theoretical maximum amount of it within each snapshot. Events are accounted in byte and sampling attempts granularity. For example, let's say there is a region of 'X' bytes size. DAMON tried access check smapling for the region 'Y' times in total for a given aggregation. Among the 'Y' attempts, 'Z' times it shown positive results. Then, the theoritical maximum number of access events for the region is 'X * Y'. And the number of access events that DAMON has observed for the region is 'X * Z'. The abount of the interesting information is '(X * Z / X * Y)'. Note that each snapshot would have multiple regions. Users can set an arbitrary value of the ratio as their target. Once the target is set, the automation periodically measures the current value of the ratio and increase or decrease the aggregation interval if the ratio value is lower or higher than the target. The amount of the change is proportion to the distance between the current adn the target values. To avoid auto-tuning goes too long way, let users set the minimum and the maximum aggregation interval times. Changing only aggregation interval while sampling interval is kept makes the maximum level of access frequency in each snapshot, or discernment of regions inconsistent. Also, unnecessarily short sampling interval causes meaningless monitoring overhed. The automation therefore adjusts the sampling interval together with aggregation interval, while keeping the ratio between the two intervals. Users can set the ratio, or the discernment. Discussion ========== The modified question (aimed amount of access events, or lights, in each snapshot) is easy to answer by both the users and the kernel. If users are interested in finding more cold regions, the value should be lower, and vice versa. If users have no idea, kernel can suggest a fair default value based on some theories and experiments. For example, based on the Pareto principle (80/20 rule), we could expect 20% target ratio will capture 80% of real access events. Since 80% might be too high, applying the rule once again, 4% (20% * 20%) may capture about 56% (80% * 80%) of real access events. Sampling to aggregation intervals ratio and min/max aggregation intervals are also arguably easy to answer. What users want is discernment of regions for efficient system operation, for examples, X amount of colder regions or Y amount of warmer regions, not exactly how many times each cache line is accessed in nanoseconds degree. The appropriate min/max aggregation interval can relatively naively set, and may better to set for aimed monitoring overhead. Since sampling interval is directly deciding the overhead, setting it based on the sampling interval can be easy. With my experiences, I'd argue the intervals ratio 0.05, and 5 milliseconds to 20 seconds sampling interval range (100 milliseconds to 400 seconds aggregation interval) can be a good default suggestion. Evaluation ========== On a machine running a real world server workload, I ran DAMON to monitor its physical address space for about 23 hours, with this feature turned on. We set it to tune sampling interval in a range from 5 milliseconds to 10 seconds, aiming 4 % DAMON-observed access ratio per three aggregation intervals. The exact command I used is as below. damo start --monitoring_intervals_goal 4% 3 5ms 10s --damos_action stat During the test run, DAMON continuously updated sampling and aggregation intervals as designed, within the given range. For all the time, DAMON was able to find the intervals that meets the target access events ratio in the given intervals range (sampling interval between 5 milliseconds and 10 seconds). For most of the time, tuned sampling interval was converged in 300-400 milliseconds. It made only small amount of changes within the range. The average of the tuned sampling interval during the test was about 380 milliseconds. The workload periodically gets less load and decreases its CPU usage. Presumably this also caused it making less memory access events. Reactively to such event,s DAMON also increased the intervals as expected. It was still able to find the optimum interval that satisfying the target access ratio within the given intervals range. Usually it was converged to about 5 seconds. Once the workload gets normal amount of load again, DAMON reactively reduced the intervals to the normal range. I collected and visualized DAMON's monitoring results on the server a few times. Every time the visualized access pattern looked not biased to only cold or hot pages but diverse and balanced. Let me show some of the snapshots that I collected at the nearly end of the test (after about 23 hours have passed since starting DAMON on the server). The recency histogram looks as below. Please note that this visualization shows only a very coarse grained information. For more details about the visualization format, please refer to DAMON user-space tool documentation[1]. # ./damo report access --style recency-sz-hist --tried_regions_of 0 0 0 --access_rate 0 0 <last accessed time (us)> <total size> [-19 h 7 m 45.514 s, -17 h 12 m 58.963 s) 6.198 GiB |**** | [-17 h 12 m 58.963 s, -15 h 18 m 12.412 s) 0 B | | [-15 h 18 m 12.412 s, -13 h 23 m 25.860 s) 0 B | | [-13 h 23 m 25.860 s, -11 h 28 m 39.309 s) 0 B | | [-11 h 28 m 39.309 s, -9 h 33 m 52.757 s) 0 B | | [-9 h 33 m 52.757 s, -7 h 39 m 6.206 s) 0 B | | [-7 h 39 m 6.206 s, -5 h 44 m 19.654 s) 0 B | | [-5 h 44 m 19.654 s, -3 h 49 m 33.103 s) 0 B | | [-3 h 49 m 33.103 s, -1 h 54 m 46.551 s) 0 B | | [-1 h 54 m 46.551 s, -0 ns) 16.967 GiB |********* | [-0 ns, --6886551440000 ns) 38.835 GiB |********************| memory bw estimate: 9.425 GiB per second total size: 62.000 GiB It shows about 38 GiB of memory was accessed at least once within last aggregation interval (given ~300 milliseconds tuned sampling interval, this is about six seconds). This is about 61 % of the total memory. In other words, DAMON found warmest 61 % memory of the system. The number is particularly interesting given our Pareto principle based theory for the tuning goal value. We set it as 20 % of 20 % (4 %), thinking it would capture 80 % of 80 % (64 %) real access events. And it foudn 61 % hot memory, or working set. Nevertheless, to make the theory clearer, much more discussion and tests would be needed. At the moment, nonetheless, we can say making the target value higher helps finding more hot memory regions. The histogram also shows an amount of cold memory. About 17 GiB memory of the system has not accessed at least for last aggregation interval (about six seconds), and at most for about last two hours. The real longest unaccessed time of the 17 GiB memory was about 19 minutes, though. This is a limitation of this visualization format. It further found very cold 6 GiB memory. It has not accessed at least for last 17 hours and at most 19 hours. What about hot memory distribution? To see this, I capture and visualize the snapshot in access temperature histogram. Again, please refer to the DAMON user-space tool documentation[1] for the format and what access temperature mean. Both the visualization and metric shows only very coarse grained and limited information. The resulting histogram look like below. # ./damo report access --style temperature-sz-hist --tried_regions_of 0 0 0 <temperature> <total size> [-6,840,763,776,000, -5,501,580,939,800) 6.198 GiB |*** | [-5,501,580,939,800, -4,162,398,103,600) 0 B | | [-4,162,398,103,600, -2,823,215,267,400) 0 B | | [-2,823,215,267,400, -1,484,032,431,200) 0 B | | [-1,484,032,431,200, -144,849,595,000) 0 B | | [-144,849,595,000, 1,194,333,241,200) 55.802 GiB |********************| [1,194,333,241,200, 2,533,516,077,400) 4.000 KiB |* | [2,533,516,077,400, 3,872,698,913,600) 4.000 KiB |* | [3,872,698,913,600, 5,211,881,749,800) 8.000 KiB |* | [5,211,881,749,800, 6,551,064,586,000) 12.000 KiB |* | [6,551,064,586,000, 7,890,247,422,200) 4.000 KiB |* | memory bw estimate: 5.178 GiB per second total size: 62.000 GiB We can see most of the memory is in similar access temperature range, and definitely some pages are extremely hot. To see the picture in more detail, let's capture and visualize the snapshot per DAMON-region, sorted by their access temperature. The total number of the regions was about 300. Due to the limited space, I'm showing only a few parts of the output here. # ./damo report access --style hot --tried_regions_of 0 0 0 heatmap: 00000000888888889999999888888888888888888888888888888888888888888888888888888888 # min/max temperatures: -6,827,258,184,000, 17,589,052,500, column size: 793.600 MiB |999999999999999999999999999999999999999| 4.000 KiB access 100 % 18 h 9 m 43.918 s |999999999999999999999999999999999999999| 8.000 KiB access 100 % 17 h 56 m 5.351 s |999999999999999999999999999999999999999| 4.000 KiB access 100 % 15 h 24 m 19.634 s |999999999999999999999999999999999999999| 4.000 KiB access 100 % 14 h 10 m 55.606 s |999999999999999999999999999999999999999| 4.000 KiB access 100 % 11 h 34 m 18.993 s [...] |99999999999999999999999999999| 8.000 KiB access 100 % 1 m 27.945 s |11111111111111111111111111111| 80.000 KiB access 15 % 1 m 21.180 s |00000000000000000000000000000| 24.000 KiB access 5 % 1 m 21.180 s |00000000000000000000000000000| 5.919 GiB access 10 % 1 m 14.415 s |99999999999999999999999999999| 12.000 KiB access 100 % 1 m 7.650 s [...] |0| 4.000 KiB access 5 % 0 ns |0| 12.000 KiB access 5 % 0 ns |0| 188.000 KiB access 0 % 0 ns |0| 24.000 KiB access 0 % 0 ns |0| 48.000 KiB access 0 % 0 ns [...] |0000000000000000000000000000000| 8.000 KiB access 0 % 6 m 45.901 s |00000000000000000000000000000000| 36.000 KiB access 0 % 7 m 26.491 s |00000000000000000000000000000000| 4.000 KiB access 0 % 12 m 37.682 s |000000000000000000000000000000000| 8.000 KiB access 0 % 18 m 9.168 s |000000000000000000000000000000000| 16.000 KiB access 0 % 19 m 3.288 s |0000000000000000000000000000000000000000| 6.198 GiB access 0 % 18 h 57 m 52.582 s memory bw estimate: 8.798 GiB per second total size: 62.000 GiB We can see DAMON found small and extremely hot regions that accessed for all access check sampling (once per about 300 milliseconds) for more than 10 hours. The access temperature rapidly decreases. DAMON was also able to find small and big regions that not accessed for up to about 19 minutes. It even found an outlier cold region of 6 GiB that not accessed for about 19 hours. It is unclear what the outlier region is, as of this writing. For the testing, DAMON was consuming about 0.1% of single CPU time. This is again expected results, since DAMON was using about 370 milliseconds sampling interval in most case. # ps -p $kdamond_pid -o %cpu %CPU 0.1 I also ran similar tests against kernel build workload and an in-memory cache workload benchmark[2]. Detialed results including tuned intervals and captured access pattern were of course different sicne those depend on the workloads. But the auto-tuning feature was always working as expected like the above results for the real world workload. To wrap up, with intervals auto-tuning feature, DAMON was able to capture access pattern snapshots of a quality on a real world server workload. The auto-tuning feature was able to adaptively react to the dynamic access patterns of the workload and reliably provide consistent monitoring results without manual human interventions. Also, the auto-tuning made DAMON consumes only necessary amount of resource for the required quality. References ========== [1] https://github.com/damonitor/damo/blob/next/USAGE.md#access-report-styles [2] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md This patch (of 8): Add data structures for DAMON sampling and aggregation intervals automatic tuning that aims specific amount of DAMON-observed access events per snapshot. In more detail, define the data structure for the tuning goal, link it to the monitoring attributes data structure so that DAMON kernel API callers can make the request, and update parameters setup DAMON function to respect the new parameter. Link: https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250303221726.484227-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/list_lru: make the case where mlru is NULL as unlikelyZeng Jingxiang
In the following memcg_list_lru_alloc() function, mlru here is almost always NULL, so in most cases this should save a function call, mark mlru as unlikely to optimize the code, and reusing the mlru for the next attempt when the tree insertion fails. do { xas_lock_irqsave(&xas, flags); if (!xas_load(&xas) && !css_is_dying(&pos->css)) { xas_store(&xas, mlru); if (!xas_error(&xas)) mlru = NULL; } xas_unlock_irqrestore(&xas, flags); } while (xas_nomem(&xas, GFP_KERNEL)); > if (mlru) kfree(mlru); Link: https://lkml.kernel.org/r/20250227082223.1173847-1-jingxiangzeng.cas@gmail.com Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412290924.UTP7GH2Z-lkp@intel.com/ Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jingxiang Zeng <linuszeng@tencent.com> Cc: Kairui Song <kasong@tencent.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: rename GENERIC_PTDUMP and PTDUMP_COREAnshuman Khandual
Platforms subscribe into generic ptdump implementation via GENERIC_PTDUMP. But generic ptdump gets enabled via PTDUMP_CORE. These configs combination is confusing as they sound very similar and does not differentiate between platform's feature subscription and feature enablement for ptdump. Rename the configs as ARCH_HAS_PTDUMP and PTDUMP making it more clear and improve readability. Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: make DEBUG_WX depdendent on GENERIC_PTDUMPAnshuman Khandual
DEBUG_WX selects PTDUMP_CORE without even ensuring that the given platform implements GENERIC_PTDUMP. This problem has been latent until now, as all the platforms subscribing ARCH_HAS_DEBUG_WX also subscribe GENERIC_PTDUMP. Link: https://lkml.kernel.org/r/20250226122404.1927473-5-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mmu_notifier: use MMU_NOTIFY_CLEAR in remove_device_exclusive_entry()David Hildenbrand
Let's limit the use of MMU_NOTIFY_EXCLUSIVE to the case where we convert a present PTE to device-exclusive. For the other case, we can simply use MMU_NOTIFY_CLEAR, because it really is clearing the device-exclusive entry first, to then install the present entry. Update the documentation of MMU_NOTIFY_EXCLUSIVE, to document the single use case more thoroughly. If ever required, we could add a separate MMU_NOTIFY_CLEAR_EXCLUSIVE; for now using MMU_NOTIFY_CLEAR seems to be sufficient. Link: https://lkml.kernel.org/r/20250226132257.2826043-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/memory: document restore_exclusive_pte()David Hildenbrand
Let's document how this function is to be used, and why the folio lock is involved. Link: https://lkml.kernel.org/r/20250226132257.2826043-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/memory: pass folio and pte to restore_exclusive_pte()David Hildenbrand
Let's pass the folio and the pte to restore_exclusive_pte(), so we can avoid repeated page_folio() and ptep_get(). To do that, pass the pte to try_restore_exclusive_pte() and use a folio in there already. While at it, just avoid the "swp_entry_t entry" variable in try_restore_exclusive_pte() and add a folio-locked check to restore_exclusive_pte(). Link: https://lkml.kernel.org/r/20250226132257.2826043-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/memory: remove PageAnonExclusive sanity-check in restore_exclusive_pte()David Hildenbrand
In commit b832a354d787 ("mm/memory: page_add_anon_rmap() -> folio_add_anon_rmap_pte()") we accidentally changed the sanity check to essentially ignore anonymous folio by mis-placing the "!" ... but we really always only get anonymous folios in restore_exclusive_pte(). However, in the meantime we removed the separate "writable device-exclusive entries" and always detect if the PTE can be writable using can_change_pte_writable() -- which also consults PageAnonExclusive. So let's just get rid of this sanity check completely. Link: https://lkml.kernel.org/r/20250226132257.2826043-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>