[PATCH V2 04/68] powerpc/mm: Use big endian page table for book3s 64
Anton Blanchard
anton at samba.org
Mon May 30 09:08:33 AEST 2016
Hi Ben,
> That is surprising, do we have any idea what specifically increases
> the overhead so significantly? Does gcc know about ldbrx/stdbrx? I
> notice in our io.h for example we still do manual ld/std + swap
> because old processors didn't know these, we should fix that for
> CONFIG_POWER8 (or is it POWER7 that brought these?).
The futex overhead seems to come from __get_user_pages_fast():
ld r11,0(r6)
...
rldicl r8,r11,32,32
rotlwi r28,r11,24
rlwimi r28,r11,8,8,15
rotlwi r6,r8,24
rlwimi r28,r11,8,24,31
rlwimi r6,r8,8,8,15
rlwimi r6,r8,8,24,31
rldicr r28,r28,32,31
or r28,r28,r6
cmpdi cr7,r28,0
beq cr7,2428
That's a whole lot of work just to check if a pte is zero. I assume
the reason gcc can't replace this with a single byte-reversed load
(ldbrx) is that we access the pte via the READ_ONCE() macro: that
expands to a volatile access, and gcc won't combine a volatile load
with the byte swap that follows it.
I see the same issue in unmap_page_range(), __hash_page_64K() and
handle_mm_fault().
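Roughly, the C being compiled looks like this (a simplified sketch,
not the exact gup code):

	pte_t pte = READ_ONCE(*ptep);	/* volatile load of the BE pte */

	if (!pte_val(pte))		/* pte_val() hides a be64_to_cpu() */
		return 0;

Because the load is volatile, gcc keeps it separate from the swap, so
we get the rotlwi/rlwimi sequence above instead of a single ldbrx. A
hypothetical helper (name made up, just to illustrate) could do the
byte-reversed load in inline asm so gcc never sees a separate swap:

	static inline unsigned long be64_load_swapped(__be64 *p)
	{
		unsigned long ret;

		/* ldbrx: load doubleword byte-reversed, indexed form */
		__asm__ __volatile__("ldbrx %0,0,%1"
				     : "=r" (ret)
				     : "r" (p), "m" (*p));
		return ret;
	}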
The other issue I see is when we access a pte via larx/stcx: there is
no byte-reversed form of those instructions, so we have no choice but
to byte swap the value manually. I see that in __hash_page_64K()
(a simplified C sketch of the pattern follows the listing):
rldicl r28,r30,32,32
rotlwi r0,r30,24
rlwimi r0,r30,8,8,15
rotlwi r10,r28,24
rlwimi r0,r30,8,24,31
rlwimi r10,r28,8,8,15
rlwimi r10,r28,8,24,31
rldicr r0,r0,32,31
or r0,r0,r10
hwsync
ldarx r12,0,r6
cmpd r12,r11
bne- c00000000004fad0
stdcx. r0,0,r6
bne- c00000000004fab8
hwsync
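In C the pattern is roughly the sketch below (simplified, not the
real __hash_page_64K() source; the helper name is made up). The
expected and new values have to be converted to big endian in
registers before the reservation loop, and the old value swapped back
on the way out (hwsync in the listing is just the plain sync
mnemonic):

	static inline unsigned long pte_cmpxchg_be(__be64 *ptep,
						   unsigned long old,
						   unsigned long new)
	{
		__be64 prev;
		__be64 be_old = cpu_to_be64(old);
		__be64 be_new = cpu_to_be64(new);	/* the rotlwi/rlwimi above */

		__asm__ __volatile__(
		"	sync\n"
		"1:	ldarx	%0,0,%3\n"
		"	cmpd	%0,%1\n"
		"	bne-	2f\n"
		"	stdcx.	%2,0,%3\n"
		"	bne-	1b\n"
		"2:	sync\n"
			: "=&r" (prev)
			: "r" (be_old), "r" (be_new), "r" (ptep)
			: "cc", "memory");

		return be64_to_cpu(prev);
	}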
Anton