[PATCH 2/7] powerpc/mm: 64-bit 4k: use a PMD-based virtual page table

Scott Wood scottwood at freescale.com
Tue May 24 04:54:33 EST 2011


On Sat, 21 May 2011 08:15:36 +1000
Benjamin Herrenschmidt <benh at kernel.crashing.org> wrote:

> On Fri, 2011-05-20 at 15:57 -0500, Scott Wood wrote:
> 
> > I see a 2% cost going from virtual pmd to full 4-level walk in the
> > benchmark mentioned above (some type of sort), and just under 3% in
> > page-stride lat_mem_rd from lmbench.
> > 
> > OTOH, the virtual pmd approach still leaves the possibility of taking a
> > bunch of virtual page table misses if non-localized accesses happen over a
> > very large chunk of address space (tens of GiB), and we'd have one fewer
> > type of TLB miss to worry about complexity-wise with a straight table walk.
> > 
> > Let me know what you'd prefer.
> 
> I'm tempted to kill the virtual linear feature altogether.. it didn't
> buy us that much. Have you looked if you can snatch back some of those
> cycles with hand tuning of the level walker ?

Those numbers are already after a bit of that (I pulled the pgd load up
before normal_tlb_miss, and did some other reordering).  I'm not sure how
much more can be squeezed out with such techniques, at least on e5500.

Hmm, in the normal miss case we know we're in the first EXTLB level,
right?  So we could cut out a load/mfspr by subtracting the EXTLB offset
from r12 to get the PACA (that load's latency is pretty well buried, but
maybe we could replace it with loading pgd, and replace it again later if
it turns out to be a kernel region).  Maybe move pgd into the first EXTLB
level, so it sits in the same cache line as the state-save data.  The PACA
cache line containing pgd is probably pretty hot in normal kernel code, but
not so much during a long stretch of userspace plus TLB misses (other than
misses for pgd itself).
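For what it's worth, the "subtract the offset from r12" trick is just the
usual recover-the-containing-struct arithmetic.  A minimal C sketch, with
an entirely hypothetical PACA layout (the real struct paca and the EXTLB
save-area sizes are not what is shown here):

```c
#include <stddef.h>
#include <assert.h>

/* hypothetical layout: exception save levels embedded in the per-CPU
 * PACA; field names and sizes are illustrative assumptions only */
struct paca {
	void *pgd;           /* top-level page table pointer */
	char extlb[3][64];   /* EXTLB state-save levels */
};

/* Given a pointer into the first EXTLB level (what r12 would hold in
 * the miss handler), recover the PACA base by subtracting the fixed
 * offset of the save area -- no extra load or mfspr needed. */
static struct paca *paca_from_extlb(void *extlb_ptr)
{
	return (struct paca *)((char *)extlb_ptr -
			       offsetof(struct paca, extlb));
}
```

In the handler this would be a single subi on r12 rather than a C call,
but the address arithmetic is the same.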

> Would it work/help to have a simple cache of the last pmd & address and
> compare just that ?

Maybe.

It would still slow down the case where you miss that cache -- not by as
much as a virtual page table miss (and it wouldn't compete for TLB entries
with actual user pages), but it would happen more often, since you'd only be
able to cache one pmd.
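In C terms, the idea is a one-entry cache keyed on the pmd-aligned region,
checked before falling back to the full walk.  A minimal sketch, assuming a
2 MiB pmd span for 4k pages; the struct, field names, and the full_walk
callback are hypothetical, not the actual handler:

```c
#include <stdint.h>
#include <stddef.h>

#define PMD_SHIFT 21                        /* 2 MiB per pmd, 4k pages */
#define PMD_MASK  (~((1UL << PMD_SHIFT) - 1))

/* hypothetical one-entry pmd cache, e.g. kept in a hot PACA line */
struct pmd_cache {
	uintptr_t va_base;   /* pmd-aligned address the entry covers */
	uint64_t *pmd;       /* cached pmd pointer, NULL if invalid */
};

/* Fast path: one mask-and-compare against the cached region.
 * Slow path: do the full multi-level walk and refill the cache. */
static uint64_t *pmd_lookup(struct pmd_cache *pc, uintptr_t addr,
			    uint64_t *(*full_walk)(uintptr_t))
{
	if (pc->pmd && (addr & PMD_MASK) == pc->va_base)
		return pc->pmd;              /* hit: skip the walk */
	pc->pmd = full_walk(addr);           /* miss: 4-level walk */
	pc->va_base = addr & PMD_MASK;
	return pc->pmd;
}
```

The failure mode described above falls out directly: any access pattern
that alternates between two pmd regions thrashes the single entry and pays
the compare on top of the walk every time.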

> Maybe in a SPRG or a known cache hot location like
> the PACA in a line that we already load anyways ?

A cache access is faster than a SPRG access on our chips (plus we
don't have many to spare, especially if we want to avoid swapping SPRG4-7 on
guest entry/exit in KVM), so I'd favor putting it in the PACA.

I'll try this stuff out and see what helps.

-Scott



More information about the Linuxppc-dev mailing list