mmotm threatens ppc preemption again

Jeremy Fitzhardinge jeremy at goop.org
Mon Mar 21 22:24:33 EST 2011


On 03/20/2011 11:53 PM, Benjamin Herrenschmidt wrote:
> On Sat, 2011-03-19 at 21:11 -0700, Hugh Dickins wrote:
>> As I warned a few weeks ago, Jeremy has vmalloc apply_to_pte_range
>> patches in mmotm, which again assault PowerPC's expectations, and
>> cause lots of noise with CONFIG_PREEMPT=y CONFIG_DEBUG_PREEMPT=y.
>>
>> This time in vmalloc as well as vfree; and Peter's fix to the last
>> lot, which went into 2.6.38, doesn't protect against these ones.
>> Here's what I now see when I swapon and swapoff:
> Right. And we said from day one we had the HARD WIRED assumption that
> arch_enter/leave_lazy_mmu_mode() was ALWAYS going to be called within
> a PTE lock section, and we did get reassurance that it was going to
> remain so.
>
> So why is it ok for them to change those and break us like that ?

In general, the pagetable locking rules are that all *usermode* pte
updates must be done under a pte lock, but kernel-mode ones need not
be; those generally have some kind of ad-hoc per-subsystem locking
where needed, which may or may not disable preemption.

Originally, arch_enter/leave_lazy_mmu_mode did require preemption to
be disabled for the whole time, but that was incompatible with the
locking rules above, and resulted in preemption being disabled for
long periods when using lazy mode, which wouldn't normally happen.
This raised a number of complaints.

To address this, I changed the x86 implementation to handle
preemption in lazy mode by dropping out of lazy mode at context-switch
time (recording the fact that we were in lazy mode with a TIF flag,
and re-entering lazy mode at the next context switch).


> Seriously, this is going out of control. If we can't even rely on
> fundamental locking assumptions in the VM to remain reasonably stable
> or at least get some amount of -care- from who changes them as to
> whether they break others and work with us to fix them, wtf ?
>
> I don't know what the right way to fix that is. We have an absolute
> requirement that the batching we start within a lazy MMU section
> is complete and flushed before any other PTE in that section can be
> touched by anything else. Do we -at least- keep that guarantee ?
>
> If yes, then maybe preempt_disable/enable() around
> arch_enter/leave_lazy_mmu_mode() in apply_to_pte_range() would do... 
>
> Or maybe I should just prevent any batching  of init_mm :-(

I'm very sorry about that, I didn't realize power was also using that
interface.  Unfortunately, the "no preemption" definition was an error,
and had to be changed to match the pre-existing locking rules.

Could you implement a similar "flush batched pte updates on context
switch" scheme, as x86 does?

    J


More information about the Linuxppc-dev mailing list