[PATCH -V10 00/15] THP support for PPC64

Wed Jun 19 04:46:20 EST 2013

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> On Sun, 2013-06-16 at 13:37 +1000, Benjamin Herrenschmidt wrote:
>> On Sun, 2013-06-16 at 12:00 +1000, Benjamin Herrenschmidt wrote:
>> > So at this point, hash_page might *still* see the old pmd. Unless I
>> > missed something, you did nothing that will prevent that (the only way
>> > to lock against hash_page is really an IPI & wait or to take the PTE's
>> > busy and make them !present or something). So as far as I can tell,
>> > a concurrent hash_page can still sneak into the hash some "small"
>> > entries after you have supposedly flushed them.
>> 
>> Note that the _PAGE_PRESENT bit is removed eventually ... but much
>> later, in __collapse_huge_page_copy() which will also flush the hash, so
>> at least we will remove a stale hash entry that would have been added by
>> the race above I suppose...  but:
>> 
>>  - _PAGE_ACCESSED can still potentially be set after it was supposed to
>> be stable
>> 
>>  - The clearing happens *after* copy_user_highpage(), ie, unless I
>> missed something here, we potentially still have something writing to
>> the 4k page while it's being copied, which is BAD.
>> 
>> Now, let me know if I did miss something here :-)
>
> An additional issue is that this all collides a bit with Alexey's work
> to support TCEs in real mode in KVM, which is necessary to have "usable"
> PCI pass-through.
>
> Look at patches http://patchwork.ozlabs.org/patch/248920/ and followup,
> he basically walks the page tables here in a slightly different way than
> Paul does in H_ENTER. It's more like gup_fast. It will need to handle
> concurrent split/collapse etc... as well which it doesn't right now.
>
> I'm considering merging Alexei stuff first (provided I don't find major
> problems with it), then you can provide a new THP series on top of it.
>
> While at it, also fix:
>
>  - Some of your patches are bug fixes (like the one about subpage
> protection). They need to be either merged in the main patch or put
> before the patch that enables THP.

We can move them before. One of the reason I would like to keep them as
seperate patches is to document the changes better in commit messages. 

>
>  - I haven't completely yet considered the impact of the demotion of
> segments, but neither do you :-) IE. Under some circumstances, we can
> demote entire segments from 64K HW pages to 4K HW pages in the SLB. For
> example if a driver (such as HCA) sets the 4K_PFN bit in a PTE, this
> will happen at hashing time. I don't think your code deals with that at
> all, am I correct ? It *might* be that the right approach with those is:
>

But will that by anonymous memory ? ie, will we find them suitable for
THP allocation ?

>   * If you find a THP in hash_page and the segment size is 4k, fault
>
>   * In do_page_fault, re-check for that condition (or maybe we can make
> hash_page return a specific bit that gets ORed into the error_code into
> do_page_fault ?) and split huge pages there.
>
> But that's just an idea off the top of my mind, there might be a better
> way. Of course this needs to be tested.
>
> BTW. For the subpage protection, similarily, you need to make sure you
> properly map the entire segment as "no THP", not just the range
> passed-in by the user.

Can you explain that more, why should the entire segment be marked no THP ?
The segment can work with 4K base page size and we still be able to
allocate a hugepage in that segment.

-aneesh