[PATCH] [2.4] Remove page_table_lock in hash_page for kernel faults

Paul Mackerras paulus at au1.ibm.com
Thu Sep 25 13:05:23 EST 2003

Olof Johansson writes:

> Attached patch removes the page_table_lock for vmalloc and io-mapped
> regions.
> We've seen cases where someone would call vmalloc(), which takes the
> page_table_lock without disabling interrupts, we get an interrupt, and the
> interrupt code will fault -> hash_page() -> deadlock. It seems to be more
> common when the driver is a module (i.e. in the vmalloc region).
> The tricky part is to make the update atomic. I essentially copied Paul
> MacKerras 32-bit code where it checks access and sets the ACCESSED/DIRTY
> bits. ppc(32) has it all in assembler, I kept most of it in C to keep
> changes smaller. Rewriting selected parts into larger asm-blocks is a
> todo-item for the future (it'd shorten path length a bit).
> We've given this patch a few days of testing on one of the large SPECweb
> servers, where we'd otherwise see the deadlock every now and then, and we
> haven't seen any problems with it. But more eyes on the code couldn't
> hurt.
> Comments/questions anyone?

The basic idea is good but I am worried about the fact that we update
the PTE (the linux PTE) twice, once to set accessed/dirty bits and
once to update the HPTE present/slot-number bits.  Since we haven't
taken the mm->page_table_lock, some other cpu could have zeroed out
the PTE in the meantime.  We then have a race when it goes to do the
corresponding tlb_flush call, which should find and wipe out the

This is going to take a bit more thought, I think.  A first step would
be to define a compare-and-swap operation for the linux PTE (called,
say, CAS_PTE).  We would first read the PTE and check permissions and
set the accessed and/or dirty bits.  Then we do CAS_PTE to update the
pte provided it hasn't changed - if it has we go back and read it
again.  Then we update the hash table and finally do another CAS_PTE
to set the hashtable present/slot-number bits.

Also, I am mildly surprised that we aren't checking _PAGE_USER.  I
guess that's OK although it will mean that an attempted access by a
user process to a page that it doesn't have the right to access will
still result in _PAGE_ACCESSED (and possibly _PAGE_DIRTY) getting
set.  Setting _PAGE_ACCESSED is probably benign but setting
_PAGE_DIRTY could be more serious.

Another problem is that if shared_mem_area is true we seem to be going
ahead and putting in a HPTE even if _PAGE_PRESENT is false, which is
bad.  Not even the kernel can access a page for which _PAGE_PRESENT is
false.  If _PAGE_PRESENT is false, the other bits in the pte (except
for the HPTE present/slot-number bits) can be used for other purposes
such as storing a swap entry.   That is probably more generically a
bug in the shared memory area code though.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

More information about the Linuxppc64-dev mailing list