[PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

Shan Hai haishan.bai at gmail.com
Fri Jul 15 19:08:12 EST 2011

On 07/15/2011 04:44 PM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
>> On 07/15/2011 04:20 PM, Peter Zijlstra wrote:
>>> On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
>>>> The following test case could reveal a bug in the futex_lock_pi()
>>>> BUG: On FUTEX_LOCK_PI, there is an infinite loop in futex_lock_pi()
>>>>           on the PowerPC e500 core.
>>>> Cause: The Linux kernel on the e500 core has no write permission on
>>>>           the COW page; refer to the head comment of the following test code.
>>>> ftrace on test case:
>>>> [000]   353.990181: futex_lock_pi_atomic<-futex_lock_pi
>>>> [000]   353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
>>>> [snip]
>>>> [000]   353.990191: do_page_fault<-handle_page_fault
>>>> [000]   353.990192: bad_page_fault<-handle_page_fault
>>>> [000]   353.990193: search_exception_tables<-bad_page_fault
>>>> [snip]
>>>> [000]   353.990199: get_user_pages<-fault_in_user_writeable
>>>> [snip]
>>>> [000]   353.990208: mark_page_accessed<-follow_page
>>>> [000]   353.990222: futex_lock_pi_atomic<-futex_lock_pi
>>>> [snip]
>>>> [000]   353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
>>>> [ a loop occurs here ]
>>> But but but but, that get_user_pages(.write=1, .force=0) should result
>>> in a COW break, getting our own writable page.
>>> What is this e500 thing smoking that this doesn't work?
>> A page can be made read-only for the kernel (supervisor in the powerpc
>> literature) on the e500, and that's what the kernel does. The SW (supervisor
>> write) bit in the TLB entry must be set to grant the kernel write
>> permission on a page.
>> Furthermore, the SW bit is set according to the DIRTY flag of the PTE.
>> PTE.DIRTY is set in do_page_fault(), but futex_lock_pi() disables
>> page faults, so PTE.DIRTY can never be set, and neither can the SW bit.
>> An unbreakable COW occurs and an infinite loop follows.
> I'm fairly sure fault_in_user_writeable() has PF enabled as it takes
> mmap_sem, and pagefault_disable() is akin to preempt_disable() on mainline.
> Also get_user_pages() fully expects to be able to schedule, and in fact
> can call the full pf handler path all by its lonesome self.

The whole scenario should be:
- the child process takes a page fault on its first access to
     the lock and gets its own writable page, but the page is *clean*
     because that access was only to check the status of the lock.
     I am sorry for the "unbreakable COW" wording above.
- futex_lock_pi() is invoked because of the lock contention, and
     futex_atomic_cmpxchg_inatomic() tries to take the lock. It finds
     the lock free, so it tries to write to the lock to reserve it;
     a page fault occurs because the page is read-only for the
     kernel (e500 specific), and -EFAULT is returned to the caller
- fault_in_user_writeable() tries to fix the fault, but from the
     get_user_pages() point of view everything is OK because the
     COW was already broken, so futex_lock_pi_atomic() is retried
- futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
     another write-protection page fault
- infinite loop
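
The steps above can be sketched as a small user-space model. This is only
an illustration, not kernel code: struct model_pte, the model_* functions
and the mark_dirty knob are all hypothetical names, and the model assumes,
as described in this thread, that the kernel's write permission (SW)
simply follows the PTE dirty flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified PTE state; not the real e500 PTE layout. */
struct model_pte {
	bool user_write;	/* COW already broken, user mapping writable */
	bool dirty;		/* assumed: the SW bit is derived from this */
};

/* Models futex_atomic_cmpxchg_inatomic(): a kernel write that cannot
 * take the normal page fault path. */
static int model_cmpxchg(const struct model_pte *pte, int *lock, int newval)
{
	if (!pte->dirty)	/* no SW permission: write protection fault */
		return -EFAULT;
	*lock = newval;
	return 0;
}

/* Models fault_in_user_writeable(): with the COW already broken there
 * is nothing left to do.  mark_dirty is a hypothetical knob showing
 * what changes if the dirty flag were set on this path. */
static void model_fault_in(struct model_pte *pte, bool mark_dirty)
{
	if (!pte->user_write)
		pte->user_write = true;	/* would break the COW here */
	if (mark_dirty)
		pte->dirty = true;
}

/* Models the retry loop in futex_lock_pi().  Returns the number of
 * attempts needed, or -1 if the lock was not taken within max_tries
 * (standing in for the infinite loop). */
static int model_lock_pi(struct model_pte *pte, int *lock,
			 int max_tries, bool mark_dirty)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		if (model_cmpxchg(pte, lock, 1) == 0)
			return tries;
		model_fault_in(pte, mark_dirty);
	}
	return -1;
}
```

Starting from a writable but clean page (user_write = true, dirty = false),
model_lock_pi() with mark_dirty = false never succeeds no matter how large
max_tries is, which matches the loop in the ftrace output. With
mark_dirty = true the second attempt succeeds.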

Shan Hai

