[PATCH -V10 00/15] THP support for PPC64

Aneesh Kumar K.V aneesh.kumar at linux.vnet.ibm.com
Wed Jun 19 03:54:30 EST 2013


Benjamin Herrenschmidt <benh at au1.ibm.com> writes:

> On Wed, 2013-06-05 at 20:58 +0530, Aneesh Kumar K.V wrote:
>> This is the second patchset needed to support THP on ppc64. Some of
>> the changes included in this series are tricky in that they change
>> the powerpc linux page table walk subtly. We also overload a few of
>> the pte flags for ptes at the PMD level (huge page PTEs).
>> 
>> The related mm/ changes are already merged to Andrew's -mm tree.
>
> [Andrea, question for you near the end ]
>
> So I'm trying to understand how you handle races between hash_page
> and collapse.
>
> The generic collapse code does:
>
> 	_pmd = pmdp_clear_flush(vma, address, pmd);
>
> Which expects the architecture to essentially have stopped any
> concurrent walk by the time it returns.
>
> Your implementation of the above does this:
>
> 		pmd = *pmdp;
> 		pmd_clear(pmdp);
> 		/*
> 		 * Now invalidate the hpte entries in the range
> 		 * covered by pmd. This make sure we take a
> 		 * fault and will find the pmd as none, which will
> 		 * result in a major fault which takes mmap_sem and
> 		 * hence wait for collapse to complete. Without this
> 		 * the __collapse_huge_page_copy can result in copying
> 		 * the old content.
> 		 */
> 		flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
>
> So we clear the pmd after making a copy of it. This will eventually
> prevent a tablewalk but only eventually, once that store becomes visible
> to other processors, which may take a while. Then you proceed to flush
> the hash table for all the underlying PTEs.
>
> So at this point, hash_page might *still* see the old pmd. Unless I
> missed something, you did nothing that will prevent that (the only way
> to lock against hash_page is really an IPI & wait, or to take the PTEs
> busy and make them !present or something). So as far as I can tell,
> a concurrent hash_page can still sneak into the hash some "small"
> entries after you have supposedly flushed them.
>

We are depending on the pmd being none. But as you said, that doesn't
take care of an already in-progress hash_page. As per the discussion, I
am listing below the option that uses synchronize_sched(). The other
option we have is to clear _PAGE_USER.

commit f69f11ba6a957aac81ea8096b244005c450a2059
Author: Aneesh Kumar K.V <aneesh.kumar at linux.vnet.ibm.com>
Date:   Tue Jun 18 19:17:17 2013 +0530

    powerpc/THP: Wait for all hash_page calls to finish before invalidating HPTE entries
    
    When we collapse normal pages to a huge page, we first clear the pmd, then invalidate
    all the HPTE entries. The assumption here is that any low-level page fault will see the
    pmd as none and take the slow path, which waits on mmap_sem. But we could very well be
    in hash_page with a local ptep pointer value. Such a hash_page can result in adding new
    HPTE entries for normal subpages/small pages. That means we could be modifying the
    page content as we copy it to a huge page. Fix this by waiting for hash_page to finish
    after marking the pmd none and before invalidating HPTE entries. We use the heavyweight
    synchronize_sched(). This should be ok, as we do this in the background khugepaged
    thread and not in application context. Also, if we find collapse slow, we can increase
    the scan rate.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar at linux.vnet.ibm.com>

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index bbecac4..92b733e 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -532,6 +532,7 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		       pmd_t *pmdp)
 {
 	pmd_t pmd;
+	struct mm_struct *mm = vma->vm_mm;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	if (pmd_trans_huge(*pmdp)) {
@@ -542,6 +543,16 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		 */
 		pmd = *pmdp;
 		pmd_clear(pmdp);
+		spin_unlock(&mm->page_table_lock);
+		/*
+		 * Wait for all pending hash_page calls to finish. We can do
+		 * this by waiting for a context switch to happen on every
+		 * cpu. Any new hash_page after this will see the pmd as none
+		 * and fall back to the code that takes mmap_sem, and hence
+		 * will block until collapse finishes.
+		 */
+		synchronize_sched();
+		spin_lock(&mm->page_table_lock);
 		/*
 		 * Now invalidate the hpte entries in the range
 		 * covered by pmd. This make sure we take a


