[PATCH] mm/mmu_notifier: avoid double notification when it is useless

Jerome Glisse jglisse at redhat.com
Wed Oct 4 11:15:59 AEDT 2017


On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> > +Case A is obvious you do not want to take the risk for the device to write to
> > +a page that might now be use by some completely different task.
> 
> used
> 
> > +is true ven if the thread doing the page table update is preempted right after
> 
> even
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 90731e3b7e58..5706252b828a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >  		goto out_free_pages;
> >  	VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +	/*
> > +	 * Leave pmd empty until pte is filled note we must notify here as
> > +	 * concurrent CPU thread might write to new page before the call to
> > +	 * mmu_notifier_invalidate_range_end() happen which can lead to a
> 
> happens
> 
> > +	 * device seeing memory write in different order than CPU.
> > +	 *
> > +	 * See Documentation/vm/mmu_notifier.txt
> > +	 */
> >  	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -	/* leave pmd empty until pte is filled */
> >  
> 
> Here we can change the following mmu_notifier_invalidate_range_end to
> skip calling ->invalidate_range. It could be called
> mmu_notifier_invalidate_range_only_end, or other suggestions
> welcome. Otherwise we'll repeat the call for nothing.
> 
> We need it inside the PT lock for the ordering issue, but we don't
> need to run it twice.
> 
> Same in do_huge_pmd_wp_page, wp_page_copy and
> migrate_vma_insert_page. Every time *clear_flush_notify is used
> mmu_notifier_invalidate_range_only_end should be called after it,
> instead of mmu_notifier_invalidate_range_end.
> 
> I think optimizing that part too, fits in the context of this patchset
> (if not in the same patch), because the objective is still the same:
> to remove unnecessary ->invalidate_range calls.

Yes, you are right, good idea. I will respin with that too (and with the
various typos you noted, thank you for that). I can do 2 patches or 1, I
don't mind either way. I will probably do 2 at first, and they can be folded
into 1 if people prefer just one.
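
Roughly the pattern you describe, as an illustrative sketch only (not the
actual respin; mmu_notifier_invalidate_range_only_end() is just the name
proposed above, assumed here to behave like
mmu_notifier_invalidate_range_end() minus the extra ->invalidate_range
call, and the page table locking is omitted):

static void example_replace_huge_pmd(struct vm_area_struct *vma,
				     unsigned long haddr, pmd_t *pmd)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start = haddr;
	unsigned long end = haddr + HPAGE_PMD_SIZE;

	mmu_notifier_invalidate_range_start(mm, start, end);

	/*
	 * With the page table lock held by the caller in the real code:
	 * this already calls ->invalidate_range, which is the
	 * ordering-critical notification for devices mirroring the CPU
	 * page table.
	 */
	pmdp_huge_clear_flush_notify(vma, haddr, pmd);

	/* ... install the new ptes/pmd here ... */

	/*
	 * Do not repeat ->invalidate_range here; only fire
	 * ->invalidate_range_end for listeners such as KVM.
	 */
	mmu_notifier_invalidate_range_only_end(mm, start, end);
}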


> 
> > +				 * No need to notify as we downgrading page
> 
> are
> 
> > +				 * table protection not changing it to point
> > +				 * to a new page.
> > +	 			 *
> 
> > +		 * No need to notify as we downgrading page table to read only
> 
> are
> 
> > +	 * No need to notify as we replacing a read only page with another
> 
> are
> 
> > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			if (pte_soft_dirty(pteval))
> >  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -		} else
> > +		} else {
> > +			/*
> > +			 * We should not need to notify here as we reach this
> > +			 * case only from freeze_page() itself only call from
> > +			 * split_huge_page_to_list() so everything below must
> > +			 * be true:
> > +			 *   - page is not anonymous
> > +			 *   - page is locked
> > +			 *
> > +			 * So as it is a shared page and it is locked, it can
> > +			 * not be remove from the page cache and replace by
> > +			 * a new page before mmu_notifier_invalidate_range_end
> > +			 * so no concurrent thread might update its page table
> > +			 * to point at new page while a device still is using
> > +			 * this page.
> > +			 *
> > +			 * But we can not assume that new user of try_to_unmap
> > +			 * will have that in mind so just to be safe here call
> > +			 * mmu_notifier_invalidate_range()
> > +			 *
> > +			 * See Documentation/vm/mmu_notifier.txt
> > +			 */
> >  			dec_mm_counter(mm, mm_counter_file(page));
> > +			mmu_notifier_invalidate_range(mm, address,
> > +						      address + PAGE_SIZE);
> > +		}
> >  discard:
> > +		/*
> > +		 * No need to call mmu_notifier_invalidate_range() as we are
> > +		 * either replacing a present pte with non present one (either
> > +		 * a swap or special one). We handling the clearing pte case
> > +		 * above.
> > +		 *
> > +		 * See Documentation/vm/mmu_notifier.txt
> > +		 */
> >  		page_remove_rmap(subpage, PageHuge(page));
> >  		put_page(page);
> > -		mmu_notifier_invalidate_range(mm, address,
> > -					      address + PAGE_SIZE);
> >  	}
> >  
> >  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> 
> That is the path that unmaps filebacked pages (btw, not necessarily
> shared unlike comment says, they can be private but still filebacked).

I was more referring to the fact that they are in the page cache and thus,
given the current conditions, they can not be migrated to a new page behind
our back. But yes, it can be a private mapping; I will fix the comment.
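
A possible rewording of that comment, keeping the intent above (the respin
may phrase it differently):

			/*
			 * This is a locked, file-backed page: it sits in the
			 * page cache and, because it is locked, it cannot be
			 * removed from the page cache and replaced by a new
			 * page before mmu_notifier_invalidate_range_end(),
			 * so no concurrent thread can update its page table
			 * to point at a new page while a device is still
			 * using this one.  The mapping may be either shared
			 * or private.
			 */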


> I'd like some more explanation about the inner working of "that new
> user" as per comment above.
> 
> It would be enough to drop mmu_notifier_invalidate_range from above
> without adding it to the filebacked case. The above gives higher prio
> to the hypothetical and uncertain future case, than to the current
> real filebacked case that doesn't need ->invalidate_range inside the
> PT lock, or do you see something that might already need such
> ->invalidate_range?

No, I don't see any new user today that might need such an invalidate, but
I was trying to be extra cautious, as I have a tendency to assume that
someone might write a patch that uses try_to_unmap() without going through
all the comments in the function and thus possibly use it in an unexpected
way from the mmu_notifier callback point of view. I am fine with putting
the burden on new users to get it right and adding an extra warning in the
function description to warn people in a sensible way.
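
Something along these lines in the try_to_unmap() description is the kind
of warning I have in mind (only a sketch, not final wording):

/*
 * Note for new callers: on the file-backed path the secondary TLBs are
 * only guaranteed to be invalidated once the caller's
 * mmu_notifier_invalidate_range_end() runs.  If a caller needs a device
 * to stop using the old page as soon as the pte is cleared (for
 * instance because the page will be freed or replaced under the page
 * table lock), it must itself call mmu_notifier_invalidate_range() (or
 * use a *_clear_flush_notify() helper) while holding the page table
 * lock.
 */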


> I certainly like the patch. I expect ->invalidate_range users will
> like the slight higher complexity in order to eliminate unnecessary
> invalidates that are slowing them down unnecessarily. At the same time
> this is zero risk (because a noop) for all other MMU notifier users
> (those that don't share the primary MMU pagetables, like KVM).

I have another patchset to restore the change_pte optimization for KVM that
I want to post too. I will need to do some benchmarking first to make sure
it actually helps, even in a small way.
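
For completeness, the reason this is zero risk for KVM-style users: such a
user does not mirror the primary MMU page tables and so leaves
->invalidate_range unset, and mmu_notifier_invalidate_range() skips
notifiers with a NULL ->invalidate_range hook.  A rough sketch, with the
example_* callbacks as placeholders:

static const struct mmu_notifier_ops example_kvm_like_ops = {
	/* Relies on the start/end pair and change_pte ... */
	.invalidate_range_start	= example_invalidate_range_start,
	.invalidate_range_end	= example_invalidate_range_end,
	.change_pte		= example_change_pte,
	/*
	 * ... but .invalidate_range is intentionally left NULL, so
	 * adding or removing ->invalidate_range calls never reaches
	 * this notifier.
	 */
};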

Thank you for looking into this patch, I will repost with your suggestions
soon.

Jérôme

