mmotm threatens ppc preemption again

Jeremy Fitzhardinge jeremy at goop.org
Wed Mar 23 00:34:24 EST 2011


On 03/21/2011 10:52 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2011-03-21 at 11:24 +0000, Jeremy Fitzhardinge wrote:
>> I'm very sorry about that, I didn't realize power was also using that
>> interface.  Unfortunately, the "no preemption" definition was an error,
>> and had to be changed to match the pre-existing locking rules.
>>
>> Could you implement a "flush batched pte updates on context switch"
>> scheme similar to what x86 does? 
> Well, we already do that for -rt & co.
>
> However, we have another issue which is the reason we used those
> lazy_mmu hooks to do our flushing.
>
> Our PTEs eventually get faulted into a hash table which is what the real
> MMU uses. We must never (ever) allow that hash table to contain a
> duplicate entry for a given virtual address.
>
> When we do a batch, we remove things from the linux PTE, and keep a
> reference in our batch structure, and only update the hash table at the
> end of the batch.

Wouldn't implicitly ending a batch on context switch get the same effect?
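Roughly what I have in mind, as a very rough sketch (ppc_batch_active()
and ppc_flush_pte_batch() are invented names, since I don't know what
your batch structure actually looks like):

extern int ppc_batch_active(void);	/* hypothetical: is a batch pending on this CPU? */
extern void ppc_flush_pte_batch(void);	/* hypothetical: push queued invalidations to the hash */

/*
 * Called from the context-switch path, before the incoming task can
 * take any faults, in the spirit of x86's arch_leave_lazy_mmu_mode().
 */
static inline void flush_lazy_mmu_on_switch(void)
{
	if (ppc_batch_active())
		ppc_flush_pte_batch();
}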

> That means that we must not allow a hash fault to populate the hash with
> a "new" PTE value before the old one has been flushed out (which is
> possible if they differ in protection attributes, for example). For
> that to hold, we must basically not allow a page fault to re-populate
> a PTE invalidated by a batch before that batch has completed.

Kernel ptes are not generally populated on fault though, unless there's
something different on power?  On x86 it can happen when syncing a
process's kernel pmd with the init_mm one, but that shouldn't happen in
the middle of an update since you'd deadlock anyway.  If a particular
kernel subsystem has its own locks to manage the ptes for a kernel
mapping, then that should prevent any nested updates within a batch,
shouldn't it?
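For example (a made-up driver, just to illustrate; my_mapping_lock and
my_set_pte_cb don't exist anywhere), this is what I mean by a subsystem
serialising its own updates to a kernel mapping:

#include <linux/mm.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(my_mapping_lock);	/* hypothetical driver-private lock */

/* Callback matching apply_to_page_range()'s pte_fn_t. */
static int my_set_pte_cb(pte_t *ptep, pgtable_t token, unsigned long addr,
			 void *data)
{
	/* hypothetical: install whatever pte value the driver wants here */
	return 0;
}

static int my_update_mapping(unsigned long addr, unsigned long size)
{
	int ret;

	/* All updates to this kernel mapping go through this one lock,
	 * so no second batch can touch the same ptes concurrently. */
	mutex_lock(&my_mapping_lock);
	ret = apply_to_page_range(&init_mm, addr, size, my_set_pte_cb, NULL);
	mutex_unlock(&my_mapping_lock);
	return ret;
}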

> That translates to batches must only happen within a PTE lock section.

Well, in that case, I guess your best bet is to disable batching for
kernel pagetable updates.  These apply_to_page_range() changes are the
first time anyone has tried to batch kernel pagetable updates (otherwise
you would have seen this problem earlier), so not batching them won't be
a regression for you.
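Concretely, something along these lines at the point where an
invalidation would normally be queued (hash_flush_one() and batch_add()
are stand-in names for whatever your real hash flush and batch-queueing
functions are):

#include <linux/mm.h>

extern void hash_flush_one(struct mm_struct *mm, unsigned long addr, pte_t pte);	/* hypothetical */
extern void batch_add(struct mm_struct *mm, unsigned long addr, pte_t pte);		/* hypothetical */

void note_pte_update(struct mm_struct *mm, unsigned long addr, pte_t pte)
{
	if (mm == &init_mm) {
		/* Kernel pagetable update: flush the hash synchronously,
		 * never defer it into the per-CPU batch. */
		hash_flush_one(mm, addr, pte);
		return;
	}
	/* User mappings keep the existing batching behaviour. */
	batch_add(mm, addr, pte);
}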

I'm not sure what the proper fix to get batching working in your case
would be, but the assumption that there's a pte lock for kernel ptes is
not valid.
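To illustrate why: this is roughly (from memory, not a verbatim quote)
how apply_to_pte_range() in mm/memory.c decides on locking; the
take_pte() wrapper is made up, but the point is that the init_mm side
never takes a pte lock at all:

#include <linux/mm.h>

static pte_t *take_pte(struct mm_struct *mm, pmd_t *pmd,
		       unsigned long addr, spinlock_t **ptlp)
{
	if (mm == &init_mm) {
		*ptlp = NULL;		/* kernel mapping: no pte lock taken */
		return pte_alloc_kernel(pmd, addr);
	}
	/* user mapping: returns with *ptlp held */
	return pte_alloc_map_lock(mm, pmd, addr, ptlp);
}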

    J
