[2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code

Sun Feb 8 18:46:13 EST 2004

Hi Olof,

> I've spent some time cleaning up the PCI DMA mapping code in 2.6. The
> patch is large (117KB, 4000 lines), so I won't post it here. It can be
> found at:

Very nice! I've thrown this onto a few machines here and am stressing
it with random IO benchmarks. It's something we desperately needed done.

> I also replaced the old buddy-style allocator for TCE ranges with a
> simpler bitmap allocator. Time and benchmarking will tell if it's
> efficient enough, but it's fairly well abstracted and can easily be
> replaced

Agreed. I suspect (as with our SLB allocation code) we will only know
once the big IO benchmarks have beaten on it. We should get Jose, Rick
and Nancy onto it as soon as possible.
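
For anyone following along, a first-fit bitmap range allocator can be
sketched roughly as below. This is a standalone userspace illustration
and not the code from the patch: TBL_ENTRIES, tce_map and the
tce_range_alloc/tce_range_free names are all made up for the example,
and there is no locking or allocation hint.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical table size, in TCE entries (one bit per entry). */
#define TBL_ENTRIES	1024UL
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

static unsigned long tce_map[(TBL_ENTRIES + BITS_PER_WORD - 1) / BITS_PER_WORD];

/* Return 1 if entry i is in use, 0 if it is free. */
static int tce_test(unsigned long i)
{
	return (tce_map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

/* Mark entry i as in use (val != 0) or free (val == 0). */
static void tce_set(unsigned long i, int val)
{
	unsigned long mask = 1UL << (i % BITS_PER_WORD);

	if (val)
		tce_map[i / BITS_PER_WORD] |= mask;
	else
		tce_map[i / BITS_PER_WORD] &= ~mask;
}

/*
 * First fit: find npages contiguous free entries, mark them in use and
 * return the index of the first one, or -1 if no such range exists.
 */
static long tce_range_alloc(unsigned long npages)
{
	unsigned long start = 0, i;

	if (npages == 0 || npages > TBL_ENTRIES)
		return -1;

	while (start + npages <= TBL_ENTRIES) {
		for (i = 0; i < npages; i++)
			if (tce_test(start + i))
				break;

		if (i == npages) {
			for (i = 0; i < npages; i++)
				tce_set(start + i, 1);
			return (long)start;
		}

		/* Entry start + i is busy, restart the search just past it. */
		start += i + 1;
	}

	return -1;
}

/* Release a range previously returned by tce_range_alloc(). */
static void tce_range_free(unsigned long start, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		tce_set(start + i, 0);
}
```

The search is linear in the table size, which is the part the big IO
benchmarks would tell us about. A high order request is just a larger
npages here, so there is no separate free list per order as with a
buddy allocator.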

Some things to think about:

- Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
  64bit slot)

- Likelihood of large (multiple page) non-SG requests. The e1000 comes
  to mind here: it has an MTU of 16kB, so it could do a pci_map_single
  of that size.

- Peak TCE usage. After chasing Emulex TCE starvation you guys would
  know the figures for this better than I do.

- Whether virtual merging makes sense. Virtual merging will place more
  pressure on our TCE allocation code (because we will end up asking for
  many more high order TCE allocations). It's also more complex; I'd
  prefer to avoid it unless we do see a performance advantage.

- We currently allocate a 2GB window in PCI space for TCEs. This is 4MB
  worth of TCE tables (512K 8-byte entries, one per 4kB page).
  Unfortunately we have to allocate an 8MB window on POWER4 boxes
  because firmware sets up some chip inits to cover the TCE region. If
  we allocate less and let normal memory get into this region, our
  performance grinds to a halt. (It has to do with the way TCE
  coherency is done on POWER4.)

  Allocating a 2GB region unconditionally is also wrong: I have a
  nighthawk node that has a 3GB IO hole, and yes, there is PCI memory
  allocated at 1GB and above (see below). We get away with it by luck
  with the current code, but it's going to hit us when we switch to your
  new code.

  If we allocate 4MB on POWER3, 8MB on POWER4 and check the OF
  properties to get the correct range we should be good.

- False sharing between the CPU and host bridge. We store a few things
  to a TCE cacheline (e.g. for an SG list) then initiate IO. The IO
  device requests the first address, and the host bridge realises it
  must do a TCE lookup. It then caches this cacheline.

  Meanwhile the CPU is setting up another request. It stores to the same
  cacheline, which forces the cacheline in the host bridge to be flushed.
  The IO for the first SG list still hasn't completed, so the host
  bridge has to refetch the cacheline.

  I think the answer here is to allocate an SG list within a cacheline
  then move on to the next cacheline for the next request. As suggested
  by davem, we should convert the network drivers over to using SG lists.

- TCE table bypass, DAC.
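
To make the false sharing suggestion above a bit more concrete, here is
a rough sketch of how the allocator could pick where the next request
starts. The 128 byte cacheline and 8 byte TCE entry sizes are
assumptions for the example, and tce_line_align/tce_next_hint are
hypothetical names, not functions in the patch.

```c
/* Assumed sizes for illustration: 8 byte TCEs, 128 byte cachelines. */
#define TCE_ENTRY_SIZE	8UL
#define CACHELINE_BYTES	128UL
#define TCES_PER_LINE	(CACHELINE_BYTES / TCE_ENTRY_SIZE)

/* Round a TCE index up to the first entry of a cacheline. */
static unsigned long tce_line_align(unsigned long index)
{
	return (index + TCES_PER_LINE - 1) & ~(TCES_PER_LINE - 1);
}

/*
 * Given a request that occupied entries [start, start + npages), return
 * the index where the search for the next request should begin, so that
 * two requests never share a TCE cacheline.
 */
static unsigned long tce_next_hint(unsigned long start, unsigned long npages)
{
	return tce_line_align(start + npages);
}
```

With these sizes a 3 entry SG list starting at index 0 leaves entries
3 to 15 unused and the next request starts at index 16. That trades
some TCE space for not having the CPU store into a line the host bridge
has cached.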

Anton

  Bus  0, device  12, function  0:
    SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 3).
      IRQ 17.
      Master Capable.  Latency=74.  Min Gnt=17.Max Lat=64.
      I/O at 0x7ffc00 [0x7ffcff].
      Non-prefetchable 32 bit memory at 0x40101000 [0x401010ff].
      Non-prefetchable 32 bit memory at 0x40100000 [0x40100fff].

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/