[PATCH 10/12] fs/dax: Properly refcount fs dax pages

Fri Oct 25 13:46:30 AEDT 2024

Dan Williams <dan.j.williams at intel.com> writes:

> Alistair Popple wrote:
> [..]
>> >
>> > Was there a discussion I missed about why the conversion to typical
>> > folios allows the page->share accounting to be dropped.
>> 
>> The problem with keeping it is we now treat DAX pages as "normal"
>> pages according to vm_normal_page(). As such we use the normal paths
>> for unmapping pages.
>> 
>> Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
>> aka PAGE_MAPPING_ANON which causes folio_test_anon(), PageAnon(),
>> etc. to return true leading to all sorts of issues in at least the
>> unmap paths.
>
> Oh, I missed that PAGE_MAPPING_DAX_SHARED aliases with
> PAGE_MAPPING_ANON.
>
>> There hasn't been a previous discussion on this, but given this is
>> only used to print warnings it seemed easier to get rid of it. I
>> probably should have called that out more clearly in the commit
>> message though.
>> 
>> > I assume this is because the page->mapping validation was dropped, which
>> > I think might be useful to keep at least for one development cycle to
>> > make sure this conversion is not triggering any of the old warnings.
>> >
>> > Otherwise, the ->share field of 'struct page' can also be cleaned up.
>> 
>> Yes, we should also clean up the ->share field, unless you have an
>> alternate suggestion to solve the above issue.
>
> kmalloc mininimum alignment is 8, so there is room to do this?

Oh right, given the aliasing I had assumed there wasn't room.

> ---
> diff --git a/fs/dax.c b/fs/dax.c
> index c62acd2812f8..a70f081c32cb 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -322,7 +322,7 @@ static unsigned long dax_end_pfn(void *entry)
>  
>  static inline bool dax_page_is_shared(struct page *page)
>  {
> -	return page->mapping == PAGE_MAPPING_DAX_SHARED;
> +	return folio_test_dax_shared(page_folio(page));
>  }
>  
>  /*
> @@ -331,14 +331,14 @@ static inline bool dax_page_is_shared(struct page *page)
>   */
>  static inline void dax_page_share_get(struct page *page)
>  {
> -	if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
> +	if (!dax_page_is_shared(page)) {
>  		/*
>  		 * Reset the index if the page was already mapped
>  		 * regularly before.
>  		 */
>  		if (page->mapping)
>  			page->share = 1;
> -		page->mapping = PAGE_MAPPING_DAX_SHARED;
> +		page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
>  	}
>  	page->share++;
>  }
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1b3a76710487..21b355999ce0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -666,13 +666,14 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
>  #define PAGE_MAPPING_ANON	0x1
>  #define PAGE_MAPPING_MOVABLE	0x2
>  #define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
> -#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
> +/* to be removed once typical page refcounting for dax proves stable */
> +#define PAGE_MAPPING_DAX_SHARED	0x4
> +#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | PAGE_MAPPING_DAX_SHARED)
>  
>  /*
>   * Different with flags above, this flag is used only for fsdax mode.  It
>   * indicates that this page->mapping is now under reflink case.
>   */
> -#define PAGE_MAPPING_DAX_SHARED	((void *)0x1)
>  
>  static __always_inline bool folio_mapping_flags(const struct folio *folio)
>  {
> @@ -689,6 +690,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
>  	return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
>  }
>  
> +static __always_inline bool folio_test_dax_shared(const struct folio *folio)
> +{
> +	return ((unsigned long)folio->mapping & PAGE_MAPPING_DAX_SHARED) != 0;
> +}
> +
>  static __always_inline bool PageAnon(const struct page *page)
>  {
>  	return folio_test_anon(page_folio(page));
> ---
>
> ...and keep the validation around at least for one post conversion
> development cycle?

Looks reasonable, will add that back for at least a development
cycle. In reality it will probably stay forever and I will add a comment
to the PAGE_MAPPING_DAX_SHARED definition saying it can be easily
removed if more flags are needed.

>> > It does have implications for the dax dma-idle tracking thought, see
>> > below.
>> >
>> >>  {
>> >> -	unsigned long pfn;
>> >> +	unsigned long order = dax_entry_order(entry);
>> >> +	struct folio *folio = dax_to_folio(entry);
>> >>  
>> >> -	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> >> +	if (!dax_entry_size(entry))
>> >>  		return;
>> >>  
>> >> -	for_each_mapped_pfn(entry, pfn) {
>> >> -		struct page *page = pfn_to_page(pfn);
>> >> -
>> >> -		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>> >> -		if (dax_page_is_shared(page)) {
>> >> -			/* keep the shared flag if this page is still shared */
>> >> -			if (dax_page_share_put(page) > 0)
>> >> -				continue;
>> >> -		} else
>> >> -			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>> >> -		page->mapping = NULL;
>> >> -		page->index = 0;
>> >> -	}
>> >> +	/*
>> >> +	 * We don't hold a reference for the DAX pagecache entry for the
>> >> +	 * page. But we need to initialise the folio so we can hand it
>> >> +	 * out. Nothing else should have a reference either.
>> >> +	 */
>> >> +	WARN_ON_ONCE(folio_ref_count(folio));
>> >
>> > Per above I would feel more comfortable if we kept the paranoia around
>> > to ensure that all the pages in this folio have dropped all references
>> > and cleared ->mapping and ->index.
>> >
>> > That paranoia can be placed behind a CONFIG_DEBUB_VM check, and we can
>> > delete in a follow-on development cycle, but in the meantime it helps to
>> > prove the correctness of the conversion.
>> 
>> I'm ok with paranoia, but as noted above the issue is that at a minimum
>> page->mapping (and probably index) now needs to be valid for any code
>> that might walk the page tables.
>
> A quick look seems to say the confusion is limited to aliasing
> PAGE_MAPPING_ANON.

Correct. Looks like we can solve that though.

>> > [..]
>> >> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>> >>  	struct inode *inode = iter->inode;
>> >>  	unsigned long vaddr = vmf->address;
>> >>  	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
>> >> +	struct page *page = pfn_t_to_page(pfn);
>> >>  	vm_fault_t ret;
>> >>  
>> >>  	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
>> >>  
>> >> -	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>> >> +	page_ref_inc(page);
>> >> +	ret = dax_insert_pfn(vmf, pfn, false);
>> >> +	put_page(page);
>> >
>> > Per above I think it is problematic to have pages live in the system
>> > without a refcount.
>> 
>> I'm a bit confused by this - the pages have a reference taken on them
>> when they are mapped. They only live in the system without a refcount
>> when the mm considers them free (except for the bit between getting
>> created in dax_associate_entry() and actually getting mapped but as
>> noted I will fix that).
>> 
>> > One scenario where this might be needed is invalidate_inode_pages() vs
>> > DMA. The invaldation should pause and wait for DMA pins to be dropped
>> > before the mapping xarray is cleaned up and the dax folio is marked
>> > free.
>> 
>> I'm not really following this scenario, or at least how it relates to
>> the comment above. If the page is pinned for DMA it will have taken a
>> refcount on it and so the page won't be considered free/idle per
>> dax_wait_page_idle() or any of the other mm code.
>
> [ tl;dr: I think we're ok, analysis below, but I did talk myself into
> the proposed dax_busy_page() changes indeed being broken and needing to
> remain checking for refcount > 1, not > 0 ]
>
> It's not the mm code I am worried about. It's the filesystem block
> allocator staying in-sync with the allocation state of the page.
>
> fs/dax.c is charged with converting idle storage blocks to pfns to
> mapped folios. Once they are mapped, DMA can pin the folio, but nothing
> in fs/dax.c pins the mapping. In the pagecache case the page reference
> is sufficient to keep the DMA-busy page from being reused. In the dax
> case something needs to arrange for DMA to be idle before
> dax_delete_mapping_entry().

Ok. How does that work today? My current mental model is that something
has to call dax_layout_busy_page() whilst holding the correct locks to
prevent a new mapping being established prior to calling
dax_delete_mapping_entry(). Is that correct?

> However, looking at XFS it indeed makes that guarantee. First it does
> xfs_break_dax_layouts() then it does truncate_inode_pages() =>
> dax_delete_mapping_entry().
>
> It follows that that the DMA-idle condition still needs to look for the
> case where the refcount is > 1 rather than 0 since refcount == 1 is the
> page-mapped-but-DMA-idle condition.

Sorry, but I'm still not following this line of reasoning. If the
refcount == 1 the page is either mapped xor DMA-busy. So a refcount >= 1
is enough to conclude that the page cannot be reused because it is
either being accessed from userspace via a CPU mapping or from some
device DMA or some other in kernel user.

The current proposal is that dax_busy_page() returns true if refcount >=
1, and dax_wait_page_idle() will wait until the refcount ==
0. dax_busy_page() will try and force the refcount == 0 by unmapping it,
but obviously can't force other pinners to release their reference hence
the need to wait. Callers should already be holding locks to ensure new
mappings can't be established and hence can't become DMA-busy after the
unmap.

> [..]
>> >> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> >>  	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
>> >>  	bool write = iter->flags & IOMAP_WRITE;
>> >>  	unsigned long entry_flags = pmd ? DAX_PMD : 0;
>> >> -	int err = 0;
>> >> +	int ret, err = 0;
>> >>  	pfn_t pfn;
>> >>  	void *kaddr;
>> >> +	struct page *page;
>> >>  
>> >>  	if (!pmd && vmf->cow_page)
>> >>  		return dax_fault_cow_page(vmf, iter);
>> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> >>  	if (dax_fault_is_synchronous(iter, vmf->vma))
>> >>  		return dax_fault_synchronous_pfnp(pfnp, pfn);
>> >>  
>> >> -	/* insert PMD pfn */
>> >> +	page = pfn_t_to_page(pfn);
>> >
>> > I think this is clearer if dax_insert_entry() returns folios with an
>> > elevated refrence count that is dropped when the folio is invalidated
>> > out of the mapping.
>> 
>> I presume this comment is for the next line:
>> 
>> +	page_ref_inc(page);
>>  
>> I can move that into dax_insert_entry(), but we would still need to
>> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
>> transition when the page is unmapped and therefore
>> freed. Alternatively we can make it so vmf_insert_*() don't take
>> references on the page, and instead ownership of the reference is
>> transfered to the mapping. Personally I prefered having those
>> functions take their own reference but let me know what you think.
>
> Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
> then the page was never allocated because it was never mapped. What
> happens with the code as proposed is that put_page() triggers page-free
> semantics on vmf_insert_XXX() failures, right?

Right. And actually that means I can't move the page_ref_inc(page) into
what will be called dax_create_folio(), because an entry may have been
created previously that had a failed vmf_insert_XXX() which will
therefore have a zero refcount folio associated with it.

But I think that model is wrong. I think the model needs to be the page
gets allocated when the entry is first created (ie. when
dax_create_folio() is called). A subsequent free (ether due to
vmf_insert_XXX() failing or the page being unmapped or becoming
DMA-idle) should then delete the entry.

I think that makes the semantics around dax_busy_page() nicer as well -
no need for the truncate to have a special path to call
dax_delete_mapping_entry().

> There is no need to invoke the page-free / final-put path on
> vmf_insert_XXX() error because the storage-block / pfn never actually
> transitioned into a page / folio.

It's not mapping a page/folio that transitions a pfn into a page/folio
it is the allocation of the folio that happens in dax_create_folio()
(aka. dax_associate_new_entry()). So we need to delete the entry (as
noted above I don't do that currently) if the insertion fails.

>> > [..]
>> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
>> >>  	lock_page(page);
>> >>  }
>> >>  EXPORT_SYMBOL_GPL(zone_device_page_init);
>> >> -
>> >> -#ifdef CONFIG_FS_DAX
>> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
>> >> -{
>> >> -	if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
>> >> -		return false;
>> >> -
>> >> -	/*
>> >> -	 * fsdax page refcounts are 1-based, rather than 0-based: if
>> >> -	 * refcount is 1, then the page is free and the refcount is
>> >> -	 * stable because nobody holds a reference on the page.
>> >> -	 */
>> >> -	if (folio_ref_sub_return(folio, refs) == 1)
>> >> -		wake_up_var(&folio->_refcount);
>> >> -	return true;
>> >
>> > It follow from the refcount disvussion above that I think there is an
>> > argument to still keep this wakeup based on the 2->1 transitition.
>> > pagecache pages are refcount==1 when they are dma-idle but still
>> > allocated. To keep the same semantics for dax a dax_folio would have an
>> > elevated refcount whenever it is referenced by mapping entry.
>> 
>> I'm not sold on keeping it as it doesn't seem to offer any benefit
>> IMHO. I know both Jason and Christoph were keen to see it go so it be
>> good to get their feedback too. Also one of the primary goals of this
>> series was to refcount the page normally so we could remove the whole
>> "page is free with a refcount of 1" semantics.
>
> The page is still free at refcount 0, no argument there. But, by
> introducing a new "page refcount is elevated while mapped" (as it
> should), it follows that "page is DMA idle at refcount == 1", right?

No. The page is either mapped xor DMA-busy - ie. not free. If we want
(need?) to tell the difference we can use folio_maybe_dma_pinned(),
assuming the driver doing DMA has called pin_user_pages() as it should.

That said I'm not sure why we care about the distinction between
DMA-idle and mapped? If the page is not free from the mm perspective the
block can't be reallocated by the filesystem.

> Otherwise, the current assumption that fileystems can have
> dax_layout_busy_page_range() poll on the state of the pfn in the mapping
> is broken because page refcount == 0 also means no page to mapping
> association.

And also means nothing from the mm (userspace mapping, DMA-busy, etc.)
is using the page so the page isn't busy and is free to be reallocated
right?

 - Alistair