[PATCH -V10 00/15] THP support for PPC64

Sun Jun 16 12:00:07 EST 2013

On Wed, 2013-06-05 at 20:58 +0530, Aneesh Kumar K.V wrote:
> This is the second patchset needed to support THP on ppc64. Some of
> the changes
> included in this series are tricky in that it changes the powerpc
> linux page table
> walk subtly. We also overload few of the pte flags for ptes at PMD
> level (huge
> page PTEs).
> 
> The related mm/ changes are already merged to Andrew's -mm tree.

[Andrea, question for you near the end ]

So I'm trying to understand how you handle races between hash_page
and collapse.

The generic collapse code does:

	_pmd = pmdp_clear_flush(vma, address, pmd);

Which expects the architecture to essentially have stopped any
concurrent walk by the time it returns.

Your implementation of the above does this:

		pmd = *pmdp;
		pmd_clear(pmdp);
		/*
		 * Now invalidate the hpte entries in the range
		 * covered by pmd. This make sure we take a
		 * fault and will find the pmd as none, which will
		 * result in a major fault which takes mmap_sem and
		 * hence wait for collapse to complete. Without this
		 * the __collapse_huge_page_copy can result in copying
		 * the old content.
		 */
		flush_tlb_pmd_range(vma->vm_mm, &pmd, address);

So we clear the pmd after making a copy of it. This will eventually
prevent a tablewalk but only eventually, once that store becomes visible
to other processors, which may take a while. Then you proceed to flush
the hash table for all the underlying PTEs.

So at this point, hash_page might *still* see the old pmd. Unless I
missed something, you did nothing that will prevent that (the only way
to lock against hash_page is really an IPI & wait or to take the PTE's
busy and make them !present or something). So as far as I can tell,
a concurrent hash_page can still sneak into the hash some "small"
entries after you have supposedly flushed them.

In addition, my reading of __collapse_huge_page_isolate() is that it
expects page_young() to be stable at that point, while because of the
above, a concurrent hash_page() might still be setting _PAGE_ACCESSED
(and _PAGE_DIRTY).

So it might be that you have a sneaky way to perform the synchronization
that I have missed :-) But so far I haven't seen it....

Also, a more general question from Andrea. Looking at the code, I was
originally thinking that there is a similar race with dirty. But then
I noticed that the collapse code doesn't look at dirty at all on the sub
pages, it just ignores it. That stroke me as broken until I also noticed
that you seem to always make the THPs dirty....

Is there a reason for that rather than harvesting dirty in the sub pages
and making the THP's dirty the logical OR of the small pages one ?

I understand that anonymous memory is often either zero or dirty, but
I suppose it can be clear with content as well as a result of swap out
and back in, no ? Or is there other reasons why THPs must be dirty
always ?

Cheers,
Ben.