kernel mapping
Frank Rowand
frank_rowand at mvista.com
Wed Jan 17 11:04:58 EST 2001
Dan Malek wrote:
>
> Frank Rowand wrote:
>
I have only a small amount of performance instrumentation and
measurement data.  Some of what I have to say is based on observation,
inference, and conjecture...
> > I pinned some IO ranges as a convenience when I was first porting
> > to the 405gp but plan to remove those pins.
>
> Those are actually performance advantages, and I am doing that
> on some 8xx applications. The difference now is we don't have
> to actually allocate specific "pinned" entries, the large mapping
> will just happen as part of the TLB reload.
The IO ranges that I pinned were each just a single 4k page (except the
64k "page" for PCI IO space, which shouldn't be accessed much except
during PCI device initialization).  So the only performance advantage I
gained was from avoiding TLB misses, not from large pages.
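For anyone curious what such a pin looks like, here is a minimal
sketch (not the actual code from my port; the field layouts follow
the 405 core manual, but the macro names, the function, and the slot
choice are mine) of writing one 4K IO mapping into a fixed TLB slot
with tlbwe:

#define TLB_EPN_MASK    0xfffffc00      /* effective page number, TLBHI */
#define TLB_PAGESZ_4K   (1 << 7)        /* SIZE field = 1 (4K page)     */
#define TLB_VALID       0x00000040      /* V bit in the TLBHI word      */
#define TLB_RPN_MASK    0xfffffc00      /* real page number, TLBLO      */
#define TLB_WR          0x00000100      /* writable                     */
#define TLB_I           0x00000004      /* cache inhibited              */
#define TLB_G           0x00000001      /* guarded (no speculation)     */

/* Assumes the PID register is 0 so the entry matches in any context. */
static void pin_io_mapping(unsigned long va, unsigned long pa, int slot)
{
        unsigned long hi = (va & TLB_EPN_MASK) | TLB_PAGESZ_4K | TLB_VALID;
        unsigned long lo = (pa & TLB_RPN_MASK) | TLB_WR | TLB_I | TLB_G;

        asm volatile("tlbwe %0,%2,0\n\t"   /* word 0: EPN, size, valid */
                     "tlbwe %1,%2,1\n\t"   /* word 1: RPN, perms, WIMG */
                     "isync"
                     : : "r" (hi), "r" (lo), "r" (slot) : "memory");
}

As long as the TLB reload code never replaces that slot, accesses
through the mapping never miss.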
> > I think that's a good idea. If you do so, please provide a way to
> > force an entry to be locked in the tlb.
>
> Nope. I don't want to do that. Then you have to make processor
> specific trade offs, or incur high management overhead like the
> 405 does now. For example, some of the processors allow a fixed
> number of locked entries, but you have to trade off what you will
> put there against losing TLB entries. Or, you do like the 405
> does and create a "software" locking, losing the use of some
> very functional TLB management instructions.
The 405 core (and thus the many processors based on it) has a 64-entry
TLB.  While debugging via a JTAG debugger I have observed that the
TLB very quickly fills with entries for the current context; it is
extremely rare to see leftover entries for a different context.
From this, I infer that the TLB is not large enough to hold a working
set.  (If I were still working as a performance geek, I would find this
an interesting area to instrument.)  Locking a few kernel entries in
the TLB means that the majority of the kernel's working set _is_ in
the TLB at all times.  Here is a simple measurement of TLB misses
(under a simple load of copying NFS mounted files around, etc.):
dtlb misses:   34679326   <--- data tlb
itlb misses:   33075725   <--- instruction tlb
d + i misses:  67755051

ktlb misses:     233683   <--- kernel addresses
utlb misses:   67521368   <--- user space addresses
k + u misses:  67755051
If you want to repeat the measurement with other workloads, just
cat /proc/ppc_htab in my kernel to get the above data.
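The counters behind those numbers are simple.  A hedged sketch of the
bookkeeping (the names here are mine; in the real kernel the counting
has to live in the assembly TLB miss handlers, but the classification
is the same) looks like:

/* A miss counts as a "kernel" miss when the faulting effective
 * address is at or above KERNELBASE (0xc0000000 on these kernels). */
#define KERNELBASE      0xc0000000UL

unsigned long dtlb_misses, itlb_misses;
unsigned long ktlb_misses, utlb_misses;

static void count_tlb_miss(unsigned long ea, int is_data)
{
        if (is_data)
                dtlb_misses++;          /* data side */
        else
                itlb_misses++;          /* instruction side */

        if (ea >= KERNELBASE)
                ktlb_misses++;          /* kernel address */
        else
                utlb_misses++;          /* user space address */
}

The /proc/ppc_htab read routine then just prints the four counters and
the two sums.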
For the 405, the only TLB management instruction I sacrificed was
tlbia (invalidate the entire TLB), which I would have used to
implement PPC4xx_tlb_flush_all().  That is used by flush_tlb_all(),
which is only called from:
ppc_htab_write()
mmu_context_overflow()
vmfree_area_pages()
vmalloc_area_pages()
flush_all_zero_pkmaps()
That doesn't seem like much of a sacrifice for such a large gain.
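To make the cost concrete, here is a hedged sketch (the function and
the NR_PINNED_ENTRIES count are hypothetical, not my actual code) of
what replaces tlbia once some slots are software-locked: a loop that
rewrites each unlocked slot with the valid bit clear.

#define PPC405_TLB_SIZE         64
#define NR_PINNED_ENTRIES       4   /* hypothetical locked slot count */

static void flush_unpinned_tlb_entries(void)
{
        int slot;

        /* Writing a TLBHI word of 0 clears the V bit, invalidating
         * the entry; the pinned low slots are left untouched. */
        for (slot = NR_PINNED_ENTRIES; slot < PPC405_TLB_SIZE; slot++)
                asm volatile("tlbwe %0,%1,0" : : "r" (0), "r" (slot));
        asm volatile("sync; isync" : : : "memory");
}

A 60-iteration loop instead of one tlbia, on paths that are almost
never taken.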
> By not locking entries and using large page table entries you don't
> need to have processor unique configurations that are cumbersome
> or unworkable on lesser featured processors. You also let the
> system operation find the best distribution of TLB entries. Yes,
The TLB is not large enough to accumulate a working set, so system
operation never finds the best distribution of TLB entries.
> there is a clearly visible latency concern with loading TLBs, but
> considering the amount of context we are switching these days a
> single large page TLB miss is insignificant.
It will be nice to have large page TLB support implemented.
-Frank
--
Frank Rowand <frank_rowand at mvista.com>
MontaVista Software, Inc
** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/