[PATCH 1/2] mm/mmu_notifier: avoid double notification when it is useless v2

Fri Oct 20 03:58:23 AEDT 2017

On Thu, Oct 19, 2017 at 09:53:11PM +1100, Balbir Singh wrote:
> On Thu, Oct 19, 2017 at 2:28 PM, Jerome Glisse <jglisse at redhat.com> wrote:
> > On Thu, Oct 19, 2017 at 02:04:26PM +1100, Balbir Singh wrote:
> >> On Mon, 16 Oct 2017 23:10:02 -0400
> >> jglisse at redhat.com wrote:
> >>
> >> > From: Jérôme Glisse <jglisse at redhat.com>
> >> >
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() as we are
> >> > +            * downgrading page table protection not changing it to point
> >> > +            * to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> >             if (pmdp) {
> >> >  #ifdef CONFIG_FS_DAX_PMD
> >> >                     pmd_t pmd;
> >> > @@ -628,7 +635,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >> >                     pmd = pmd_wrprotect(pmd);
> >> >                     pmd = pmd_mkclean(pmd);
> >> >                     set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> >>
> >> Could the secondary TLB still see the mapping as dirty and propagate the dirty bit back?
> >
> > I am assuming hardware does sane thing of setting the dirty bit only
> > when walking the CPU page table when device does a write fault ie
> > once the device get a write TLB entry the dirty is set by the IOMMU
> > when walking the page table before returning the lookup result to the
> > device and that it won't be set again latter (ie propagated back
> > latter).
> >
> 
> The other possibility is that the hardware things the page is writable
> and already
> marked dirty. It allows writes and does not set the dirty bit?

I thought about this some more and the patch can not regress anything
that is not broken today. So if we assume that device can propagate
dirty bit because it can cache the write protection than all current
code is broken for two reasons:

First one is current code clear pte entry, build a new pte value with
write protection and update pte entry with new pte value. So any PASID/
ATS platform that allows device to cache the write bit and set dirty
bit anytime after that can race during that window and you would loose
the dirty bit of the device. That is not that bad as you are gonna
propagate the dirty bit to the struct page.

Second one is if the dirty bit is propagated back to the new write
protected pte. Quick look at code it seems that when we zap pte or
or mkclean we don't check that the pte has write permission but only
care about the dirty bit. So it should not have any bad consequence.

After this patch only the second window is bigger and thus more likely
to happen. But nothing sinister should happen from that.

> 
> > I should probably have spell that out and maybe some of the ATS/PASID
> > implementer did not do that.
> >
> >>
> >> >  unlock_pmd:
> >> >                     spin_unlock(ptl);
> >> >  #endif
> >> > @@ -643,7 +649,6 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
> >> >                     pte = pte_wrprotect(pte);
> >> >                     pte = pte_mkclean(pte);
> >> >                     set_pte_at(vma->vm_mm, address, ptep, pte);
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> >>
> >> Ditto
> >>
> >> >  unlock_pte:
> >> >                     pte_unmap_unlock(ptep, ptl);
> >> >             }
> >> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> > index 6866e8126982..49c925c96b8a 100644
> >> > --- a/include/linux/mmu_notifier.h
> >> > +++ b/include/linux/mmu_notifier.h
> >> > @@ -155,7 +155,8 @@ struct mmu_notifier_ops {
> >> >      * shared page-tables, it not necessary to implement the
> >> >      * invalidate_range_start()/end() notifiers, as
> >> >      * invalidate_range() alread catches the points in time when an
> >> > -    * external TLB range needs to be flushed.
> >> > +    * external TLB range needs to be flushed. For more in depth
> >> > +    * discussion on this see Documentation/vm/mmu_notifier.txt
> >> >      *
> >> >      * The invalidate_range() function is called under the ptl
> >> >      * spin-lock and not allowed to sleep.
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> > index c037d3d34950..ff5bc647b51d 100644
> >> > --- a/mm/huge_memory.c
> >> > +++ b/mm/huge_memory.c
> >> > @@ -1186,8 +1186,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >> >             goto out_free_pages;
> >> >     VM_BUG_ON_PAGE(!PageHead(page), page);
> >> >
> >> > +   /*
> >> > +    * Leave pmd empty until pte is filled note we must notify here as
> >> > +    * concurrent CPU thread might write to new page before the call to
> >> > +    * mmu_notifier_invalidate_range_end() happens which can lead to a
> >> > +    * device seeing memory write in different order than CPU.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> >     pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> >> > -   /* leave pmd empty until pte is filled */
> >> >
> >> >     pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, vmf->pmd);
> >> >     pmd_populate(vma->vm_mm, &_pmd, pgtable);
> >> > @@ -2026,8 +2033,15 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >> >     pmd_t _pmd;
> >> >     int i;
> >> >
> >> > -   /* leave pmd empty until pte is filled */
> >> > -   pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> >> > +   /*
> >> > +    * Leave pmd empty until pte is filled note that it is fine to delay
> >> > +    * notification until mmu_notifier_invalidate_range_end() as we are
> >> > +    * replacing a zero pmd write protected page with a zero pte write
> >> > +    * protected page.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> > +   pmdp_huge_clear_flush(vma, haddr, pmd);
> >>
> >> Shouldn't the secondary TLB know if the page size changed?
> >
> > It should not matter, we are talking virtual to physical on behalf
> > of a device against a process address space. So the hardware should
> > not care about the page size.
> >
> 
> Does that not indicate how much the device can access? Could it try
> to access more than what is mapped?

Assuming device has huge TLB and 2MB huge page with 4K small page.
You are going from one 1 TLB covering a 2MB zero page to 512 TLB
each covering 4K. Both case is read only and both case are pointing
to same data (ie zero).

It is fine to delay the TLB invalidate on the device to the call of
mmu_notifier_invalidate_range_end(). The device will keep using the
huge TLB for a little longer but both CPU and device are looking at
same data.

Now if there is a racing thread that replace one of the 512 zeor page
after the split but before mmu_notifier_invalidate_range_end() that
code path would call mmu_notifier_invalidate_range() before changing
the pte to point to something else. Which should shoot down the device
TLB (it would be a serious device bug if this did not work).

> 
> > Moreover if any of the new 512 (assuming 2MB huge and 4K pages) zero
> > 4K pages is replace by something new then a device TLB shootdown will
> > happen before the new page is set.
> >
> > Only issue i can think of is if the IOMMU TLB (if there is one) or
> > the device TLB (you do expect that there is one) does not invalidate
> > TLB entry if the TLB shootdown is smaller than the TLB entry. That
> > would be idiotic but yes i know hardware bug.
> >
> >
> >>
> >> >
> >> >     pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> >> >     pmd_populate(mm, &_pmd, pgtable);
> >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> > index 1768efa4c501..63a63f1b536c 100644
> >> > --- a/mm/hugetlb.c
> >> > +++ b/mm/hugetlb.c
> >> > @@ -3254,9 +3254,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >> >                     set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
> >> >             } else {
> >> >                     if (cow) {
> >> > +                           /*
> >> > +                            * No need to notify as we are downgrading page
> >> > +                            * table protection not changing it to point
> >> > +                            * to a new page.
> >> > +                            *
> >> > +                            * See Documentation/vm/mmu_notifier.txt
> >> > +                            */
> >> >                             huge_ptep_set_wrprotect(src, addr, src_pte);
> >>
> >> OK.. so we could get write faults on write accesses from the device.
> >>
> >> > -                           mmu_notifier_invalidate_range(src, mmun_start,
> >> > -                                                              mmun_end);
> >> >                     }
> >> >                     entry = huge_ptep_get(src_pte);
> >> >                     ptepage = pte_page(entry);
> >> > @@ -4288,7 +4293,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >> >      * and that page table be reused and filled with junk.
> >> >      */
> >> >     flush_hugetlb_tlb_range(vma, start, end);
> >> > -   mmu_notifier_invalidate_range(mm, start, end);
> >> > +   /*
> >> > +    * No need to call mmu_notifier_invalidate_range() we are downgrading
> >> > +    * page table protection not changing it to point to a new page.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> >     i_mmap_unlock_write(vma->vm_file->f_mapping);
> >> >     mmu_notifier_invalidate_range_end(mm, start, end);
> >> >
> >> > diff --git a/mm/ksm.c b/mm/ksm.c
> >> > index 6cb60f46cce5..be8f4576f842 100644
> >> > --- a/mm/ksm.c
> >> > +++ b/mm/ksm.c
> >> > @@ -1052,8 +1052,13 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >> >              * So we clear the pte and flush the tlb before the check
> >> >              * this assure us that no O_DIRECT can happen after the check
> >> >              * or in the middle of the check.
> >> > +            *
> >> > +            * No need to notify as we are downgrading page table to read
> >> > +            * only not changing it to point to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> >              */
> >> > -           entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
> >> > +           entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> >> >             /*
> >> >              * Check that no O_DIRECT or similar I/O is in progress on the
> >> >              * page
> >> > @@ -1136,7 +1141,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >> >     }
> >> >
> >> >     flush_cache_page(vma, addr, pte_pfn(*ptep));
> >> > -   ptep_clear_flush_notify(vma, addr, ptep);
> >> > +   /*
> >> > +    * No need to notify as we are replacing a read only page with another
> >> > +    * read only page with the same content.
> >> > +    *
> >> > +    * See Documentation/vm/mmu_notifier.txt
> >> > +    */
> >> > +   ptep_clear_flush(vma, addr, ptep);
> >> >     set_pte_at_notify(mm, addr, ptep, newpte);
> >> >
> >> >     page_remove_rmap(page, false);
> >> > diff --git a/mm/rmap.c b/mm/rmap.c
> >> > index 061826278520..6b5a0f219ac0 100644
> >> > --- a/mm/rmap.c
> >> > +++ b/mm/rmap.c
> >> > @@ -937,10 +937,15 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >> >  #endif
> >> >             }
> >> >
> >> > -           if (ret) {
> >> > -                   mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() as we are
> >> > +            * downgrading page table protection not changing it to point
> >> > +            * to a new page.
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> > +           if (ret)
> >> >                     (*cleaned)++;
> >> > -           }
> >> >     }
> >> >
> >> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> >> > @@ -1424,6 +1429,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
> >> > +                   /*
> >> > +                    * No need to invalidate here it will synchronize on
> >> > +                    * against the special swap migration pte.
> >> > +                    */
> >> >                     goto discard;
> >> >             }
> >> >
> >> > @@ -1481,6 +1490,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                      * will take care of the rest.
> >> >                      */
> >> >                     dec_mm_counter(mm, mm_counter(page));
> >> > +                   /* We have to invalidate as we cleared the pte */
> >> > +                   mmu_notifier_invalidate_range(mm, address,
> >> > +                                                 address + PAGE_SIZE);
> >> >             } else if (IS_ENABLED(CONFIG_MIGRATION) &&
> >> >                             (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
> >> >                     swp_entry_t entry;
> >> > @@ -1496,6 +1508,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> >> > +                   /*
> >> > +                    * No need to invalidate here it will synchronize on
> >> > +                    * against the special swap migration pte.
> >> > +                    */
> >> >             } else if (PageAnon(page)) {
> >> >                     swp_entry_t entry = { .val = page_private(subpage) };
> >> >                     pte_t swp_pte;
> >> > @@ -1507,6 +1523,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                             WARN_ON_ONCE(1);
> >> >                             ret = false;
> >> >                             /* We have to invalidate as we cleared the pte */
> >> > +                           mmu_notifier_invalidate_range(mm, address,
> >> > +                                                   address + PAGE_SIZE);
> >> >                             page_vma_mapped_walk_done(&pvmw);
> >> >                             break;
> >> >                     }
> >> > @@ -1514,6 +1532,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     /* MADV_FREE page check */
> >> >                     if (!PageSwapBacked(page)) {
> >> >                             if (!PageDirty(page)) {
> >> > +                                   /* Invalidate as we cleared the pte */
> >> > +                                   mmu_notifier_invalidate_range(mm,
> >> > +                                           address, address + PAGE_SIZE);
> >> >                                     dec_mm_counter(mm, MM_ANONPAGES);
> >> >                                     goto discard;
> >> >                             }
> >> > @@ -1547,13 +1568,39 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> >                     if (pte_soft_dirty(pteval))
> >> >                             swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> >                     set_pte_at(mm, address, pvmw.pte, swp_pte);
> >> > -           } else
> >> > +                   /* Invalidate as we cleared the pte */
> >> > +                   mmu_notifier_invalidate_range(mm, address,
> >> > +                                                 address + PAGE_SIZE);
> >> > +           } else {
> >> > +                   /*
> >> > +                    * We should not need to notify here as we reach this
> >> > +                    * case only from freeze_page() itself only call from
> >> > +                    * split_huge_page_to_list() so everything below must
> >> > +                    * be true:
> >> > +                    *   - page is not anonymous
> >> > +                    *   - page is locked
> >> > +                    *
> >> > +                    * So as it is a locked file back page thus it can not
> >> > +                    * be remove from the page cache and replace by a new
> >> > +                    * page before mmu_notifier_invalidate_range_end so no
> >> > +                    * concurrent thread might update its page table to
> >> > +                    * point at new page while a device still is using this
> >> > +                    * page.
> >> > +                    *
> >> > +                    * See Documentation/vm/mmu_notifier.txt
> >> > +                    */
> >> >                     dec_mm_counter(mm, mm_counter_file(page));
> >> > +           }
> >> >  discard:
> >> > +           /*
> >> > +            * No need to call mmu_notifier_invalidate_range() it has be
> >> > +            * done above for all cases requiring it to happen under page
> >> > +            * table lock before mmu_notifier_invalidate_range_end()
> >> > +            *
> >> > +            * See Documentation/vm/mmu_notifier.txt
> >> > +            */
> >> >             page_remove_rmap(subpage, PageHuge(page));
> >> >             put_page(page);
> >> > -           mmu_notifier_invalidate_range(mm, address,
> >> > -                                         address + PAGE_SIZE);
> >> >     }
> >> >
> >> >     mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> >>
> >> Looking at the patchset, I understand the efficiency, but I am concerned
> >> with correctness.
> >
> > I am fine in holding this off from reaching Linus but only way to flush this
> > issues out if any is to have this patch in linux-next or somewhere were they
> > get a chance of being tested.
> >
> 
> Yep, I would like to see some additional testing around npu and get Alistair
> Popple to comment as well

I think this patch is fine. The only one race window that it might make
bigger should have no bad consequences.

> 
> > Note that the second patch is always safe. I agree that this one might
> > not be if hardware implementation is idiotic (well that would be my
> > opinion and any opinion/point of view can be challenge :))
> 
> 
> You mean the only_end variant that avoids shootdown after pmd/pte changes
> that avoid the _start/_end and have just the only_end variant? That seemed
> reasonable to me, but I've not tested it or evaluated it in depth

Yes, patch 2/2 in this serie is definitly fine. It invalidate the device
TLB right after clearing pte entry and avoid latter unecessary invalidation
of same TLB.

Jérôme