[PATCH 10/12] fs/dax: Properly refcount fs dax pages

Fri Oct 25 10:52:50 AEDT 2024

Alistair Popple wrote:
[..]
> >
> > Was there a discussion I missed about why the conversion to typical
> > folios allows the page->share accounting to be dropped.
> 
> The problem with keeping it is we now treat DAX pages as "normal"
> pages according to vm_normal_page(). As such we use the normal paths
> for unmapping pages.
> 
> Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
> aka PAGE_MAPPING_ANON which causes folio_test_anon(), PageAnon(),
> etc. to return true leading to all sorts of issues in at least the
> unmap paths.

Oh, I missed that PAGE_MAPPING_DAX_SHARED aliases with
PAGE_MAPPING_ANON.

> There hasn't been a previous discussion on this, but given this is
> only used to print warnings it seemed easier to get rid of it. I
> probably should have called that out more clearly in the commit
> message though.
> 
> > I assume this is because the page->mapping validation was dropped, which
> > I think might be useful to keep at least for one development cycle to
> > make sure this conversion is not triggering any of the old warnings.
> >
> > Otherwise, the ->share field of 'struct page' can also be cleaned up.
> 
> Yes, we should also clean up the ->share field, unless you have an
> alternate suggestion to solve the above issue.

kmalloc mininimum alignment is 8, so there is room to do this?

---

diff --git a/fs/dax.c b/fs/dax.c
index c62acd2812f8..a70f081c32cb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -322,7 +322,7 @@ static unsigned long dax_end_pfn(void *entry)
 
 static inline bool dax_page_is_shared(struct page *page)
 {
-	return page->mapping == PAGE_MAPPING_DAX_SHARED;
+	return folio_test_dax_shared(page_folio(page));
 }
 
 /*
@@ -331,14 +331,14 @@ static inline bool dax_page_is_shared(struct page *page)
  */
 static inline void dax_page_share_get(struct page *page)
 {
-	if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
+	if (!dax_page_is_shared(page)) {
 		/*
 		 * Reset the index if the page was already mapped
 		 * regularly before.
 		 */
 		if (page->mapping)
 			page->share = 1;
-		page->mapping = PAGE_MAPPING_DAX_SHARED;
+		page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
 	}
 	page->share++;
 }
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1b3a76710487..21b355999ce0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -666,13 +666,14 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
 #define PAGE_MAPPING_ANON	0x1
 #define PAGE_MAPPING_MOVABLE	0x2
 #define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+/* to be removed once typical page refcounting for dax proves stable */
+#define PAGE_MAPPING_DAX_SHARED	0x4
+#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | PAGE_MAPPING_DAX_SHARED)
 
 /*
  * Different with flags above, this flag is used only for fsdax mode.  It
  * indicates that this page->mapping is now under reflink case.
  */
-#define PAGE_MAPPING_DAX_SHARED	((void *)0x1)
 
 static __always_inline bool folio_mapping_flags(const struct folio *folio)
 {
@@ -689,6 +690,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
 	return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
 }
 
+static __always_inline bool folio_test_dax_shared(const struct folio *folio)
+{
+	return ((unsigned long)folio->mapping & PAGE_MAPPING_DAX_SHARED) != 0;
+}
+
 static __always_inline bool PageAnon(const struct page *page)
 {
 	return folio_test_anon(page_folio(page));
---

...and keep the validation around at least for one post conversion
development cycle?

> > It does have implications for the dax dma-idle tracking thought, see
> > below.
> >
> >>  {
> >> -	unsigned long pfn;
> >> +	unsigned long order = dax_entry_order(entry);
> >> +	struct folio *folio = dax_to_folio(entry);
> >>  
> >> -	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >> +	if (!dax_entry_size(entry))
> >>  		return;
> >>  
> >> -	for_each_mapped_pfn(entry, pfn) {
> >> -		struct page *page = pfn_to_page(pfn);
> >> -
> >> -		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> >> -		if (dax_page_is_shared(page)) {
> >> -			/* keep the shared flag if this page is still shared */
> >> -			if (dax_page_share_put(page) > 0)
> >> -				continue;
> >> -		} else
> >> -			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> >> -		page->mapping = NULL;
> >> -		page->index = 0;
> >> -	}
> >> +	/*
> >> +	 * We don't hold a reference for the DAX pagecache entry for the
> >> +	 * page. But we need to initialise the folio so we can hand it
> >> +	 * out. Nothing else should have a reference either.
> >> +	 */
> >> +	WARN_ON_ONCE(folio_ref_count(folio));
> >
> > Per above I would feel more comfortable if we kept the paranoia around
> > to ensure that all the pages in this folio have dropped all references
> > and cleared ->mapping and ->index.
> >
> > That paranoia can be placed behind a CONFIG_DEBUB_VM check, and we can
> > delete in a follow-on development cycle, but in the meantime it helps to
> > prove the correctness of the conversion.
> 
> I'm ok with paranoia, but as noted above the issue is that at a minimum
> page->mapping (and probably index) now needs to be valid for any code
> that might walk the page tables.

A quick look seems to say the confusion is limited to aliasing
PAGE_MAPPING_ANON.

> > [..]
> >> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> >>  	struct inode *inode = iter->inode;
> >>  	unsigned long vaddr = vmf->address;
> >>  	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
> >> +	struct page *page = pfn_t_to_page(pfn);
> >>  	vm_fault_t ret;
> >>  
> >>  	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
> >>  
> >> -	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> >> +	page_ref_inc(page);
> >> +	ret = dax_insert_pfn(vmf, pfn, false);
> >> +	put_page(page);
> >
> > Per above I think it is problematic to have pages live in the system
> > without a refcount.
> 
> I'm a bit confused by this - the pages have a reference taken on them
> when they are mapped. They only live in the system without a refcount
> when the mm considers them free (except for the bit between getting
> created in dax_associate_entry() and actually getting mapped but as
> noted I will fix that).
> 
> > One scenario where this might be needed is invalidate_inode_pages() vs
> > DMA. The invaldation should pause and wait for DMA pins to be dropped
> > before the mapping xarray is cleaned up and the dax folio is marked
> > free.
> 
> I'm not really following this scenario, or at least how it relates to
> the comment above. If the page is pinned for DMA it will have taken a
> refcount on it and so the page won't be considered free/idle per
> dax_wait_page_idle() or any of the other mm code.

[ tl;dr: I think we're ok, analysis below, but I did talk myself into
the proposed dax_busy_page() changes indeed being broken and needing to
remain checking for refcount > 1, not > 0 ]

It's not the mm code I am worried about. It's the filesystem block
allocator staying in-sync with the allocation state of the page.

fs/dax.c is charged with converting idle storage blocks to pfns to
mapped folios. Once they are mapped, DMA can pin the folio, but nothing
in fs/dax.c pins the mapping. In the pagecache case the page reference
is sufficient to keep the DMA-busy page from being reused. In the dax
case something needs to arrange for DMA to be idle before
dax_delete_mapping_entry().

However, looking at XFS it indeed makes that guarantee. First it does
xfs_break_dax_layouts() then it does truncate_inode_pages() =>
dax_delete_mapping_entry().

It follows that that the DMA-idle condition still needs to look for the
case where the refcount is > 1 rather than 0 since refcount == 1 is the
page-mapped-but-DMA-idle condition.

[..]
> >> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >>  	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> >>  	bool write = iter->flags & IOMAP_WRITE;
> >>  	unsigned long entry_flags = pmd ? DAX_PMD : 0;
> >> -	int err = 0;
> >> +	int ret, err = 0;
> >>  	pfn_t pfn;
> >>  	void *kaddr;
> >> +	struct page *page;
> >>  
> >>  	if (!pmd && vmf->cow_page)
> >>  		return dax_fault_cow_page(vmf, iter);
> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >>  	if (dax_fault_is_synchronous(iter, vmf->vma))
> >>  		return dax_fault_synchronous_pfnp(pfnp, pfn);
> >>  
> >> -	/* insert PMD pfn */
> >> +	page = pfn_t_to_page(pfn);
> >
> > I think this is clearer if dax_insert_entry() returns folios with an
> > elevated refrence count that is dropped when the folio is invalidated
> > out of the mapping.
> 
> I presume this comment is for the next line:
> 
> +	page_ref_inc(page);
>  
> I can move that into dax_insert_entry(), but we would still need to
> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
> transition when the page is unmapped and therefore
> freed. Alternatively we can make it so vmf_insert_*() don't take
> references on the page, and instead ownership of the reference is
> transfered to the mapping. Personally I prefered having those
> functions take their own reference but let me know what you think.

Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
then the page was never allocated because it was never mapped. What
happens with the code as proposed is that put_page() triggers page-free
semantics on vmf_insert_XXX() failures, right?

There is no need to invoke the page-free / final-put path on
vmf_insert_XXX() error because the storage-block / pfn never actually
transitioned into a page / folio.

> > [..]
> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
> >>  	lock_page(page);
> >>  }
> >>  EXPORT_SYMBOL_GPL(zone_device_page_init);
> >> -
> >> -#ifdef CONFIG_FS_DAX
> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> >> -{
> >> -	if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> >> -		return false;
> >> -
> >> -	/*
> >> -	 * fsdax page refcounts are 1-based, rather than 0-based: if
> >> -	 * refcount is 1, then the page is free and the refcount is
> >> -	 * stable because nobody holds a reference on the page.
> >> -	 */
> >> -	if (folio_ref_sub_return(folio, refs) == 1)
> >> -		wake_up_var(&folio->_refcount);
> >> -	return true;
> >
> > It follow from the refcount disvussion above that I think there is an
> > argument to still keep this wakeup based on the 2->1 transitition.
> > pagecache pages are refcount==1 when they are dma-idle but still
> > allocated. To keep the same semantics for dax a dax_folio would have an
> > elevated refcount whenever it is referenced by mapping entry.
> 
> I'm not sold on keeping it as it doesn't seem to offer any benefit
> IMHO. I know both Jason and Christoph were keen to see it go so it be
> good to get their feedback too. Also one of the primary goals of this
> series was to refcount the page normally so we could remove the whole
> "page is free with a refcount of 1" semantics.

The page is still free at refcount 0, no argument there. But, by
introducing a new "page refcount is elevated while mapped" (as it
should), it follows that "page is DMA idle at refcount == 1", right?
Otherwise, the current assumption that fileystems can have
dax_layout_busy_page_range() poll on the state of the pfn in the mapping
is broken because page refcount == 0 also means no page to mapping
association.