diff options
author | Alistair Popple <apopple@nvidia.com> | 2025-02-28 14:31:14 +1100 |
---|---|---|
committer | Andrew Morton <akpm@linux-foundation.org> | 2025-03-17 22:06:41 -0700 |
commit | 38607c62b34b46317c46d5baf1df03ac6e48a1c6 (patch) | |
tree | 8c0d14aa6284b39fee26c611469272e399b5e54c /mm/memremap.c | |
parent | 653d7825c149932f254e0cd22153ccc945e7e545 (diff) |
fs/dax: properly refcount fs dax pages
Currently fs dax pages are considered free when the refcount drops to one
and their refcounts are not increased when mapped via PTEs or decreased
when unmapped. This requires special logic in mm paths to detect that
these pages should not be properly refcounted, and to detect when the
refcount drops to one instead of zero.
On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is unpinned.
Tracking this special behaviour requires extra PTE bits (eg. pte_devmap)
and introduces rules that are potentially confusing and specific to FS DAX
pages. To fix this, and to possibly allow removal of the special PTE bits
in future, convert the fs dax page refcounts to be zero based and instead
take a reference on the page each time it is mapped as is currently the
case for normal pages.
This may also allow a future clean-up to remove the pgmap refcounting that
is currently done in mm/gup.c.
Link: https://lkml.kernel.org/r/c7d886ad7468a20452ef6e0ddab6cfe220874e7c.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Asahi Lina <lina@asahilina.net>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linmiaohe <linmiaohe@huawei.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'mm/memremap.c')
-rw-r--r-- | mm/memremap.c | 47 |
1 files changed, 22 insertions, 25 deletions
diff --git a/mm/memremap.c b/mm/memremap.c index 68099af9df4c..9a8879bf1ae4 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -458,8 +458,13 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap); void free_zone_device_folio(struct folio *folio) { - if (WARN_ON_ONCE(!folio->pgmap->ops || - !folio->pgmap->ops->page_free)) + struct dev_pagemap *pgmap = folio->pgmap; + + if (WARN_ON_ONCE(!pgmap->ops)) + return; + + if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX && + !pgmap->ops->page_free)) return; mem_cgroup_uncharge(folio); @@ -484,26 +489,36 @@ void free_zone_device_folio(struct folio *folio) * For other types of ZONE_DEVICE pages, migration is either * handled differently or not done at all, so there is no need * to clear folio->mapping. + * + * FS DAX pages clear the mapping when the folio->share count hits + * zero which indicating the page has been removed from the file + * system mapping. */ - folio->mapping = NULL; - folio->pgmap->ops->page_free(folio_page(folio, 0)); + if (pgmap->type != MEMORY_DEVICE_FS_DAX) + folio->mapping = NULL; - switch (folio->pgmap->type) { + switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: case MEMORY_DEVICE_COHERENT: - put_dev_pagemap(folio->pgmap); + pgmap->ops->page_free(folio_page(folio, 0)); + put_dev_pagemap(pgmap); break; - case MEMORY_DEVICE_FS_DAX: case MEMORY_DEVICE_GENERIC: /* * Reset the refcount to 1 to prepare for handing out the page * again. */ + pgmap->ops->page_free(folio_page(folio, 0)); folio_set_count(folio, 1); break; + case MEMORY_DEVICE_FS_DAX: + wake_up_var(&folio->page); + break; + case MEMORY_DEVICE_PCI_P2PDMA: + pgmap->ops->page_free(folio_page(folio, 0)); break; } } @@ -519,21 +534,3 @@ void zone_device_page_init(struct page *page) lock_page(page); } EXPORT_SYMBOL_GPL(zone_device_page_init); - -#ifdef CONFIG_FS_DAX -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs) -{ - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX) - return false; - - /* - * fsdax page refcounts are 1-based, rather than 0-based: if - * refcount is 1, then the page is free and the refcount is - * stable because nobody holds a reference on the page. - */ - if (folio_ref_sub_return(folio, refs) == 1) - wake_up_var(&folio->_refcount); - return true; -} -EXPORT_SYMBOL(__put_devmap_managed_folio_refs); -#endif /* CONFIG_FS_DAX */ |