[2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
Anton Blanchard
anton at samba.org
Sun Feb 8 18:46:13 EST 2004
Hi Olof,
> I've spent some time cleaning up the PCI DMA mapping code in 2.6. The
> patch is large (117KB, 4000 lines), so I won't post it here. It can be
> found at:
Very nice! I've thrown this onto a few machines here and am stressing
it with random IO benchmarks. It's something we desperately needed done.
> I also replaced the old buddy-style allocator for TCE ranges with a
> simpler bitmap allocator. Time and benchmarking will tell if it's
> efficient enough, but it's fairly well abstracted and can easily be
> replaced
Agreed. I suspect (as with our SLB allocation code) we will only know
once the big IO benchmarks have beaten on it. We should get Jose, Rick
and Nancy onto it as soon as possible.
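For anyone following along who hasn't read the patch yet, the allocator
keeps one bit per TCE and searches the bitmap for a free run of the
requested length. Something in the spirit of this standalone toy (not
Olof's actual code; the names, the 512 entry table and the open coded
bit helpers are made up so it compiles on its own):

/* Toy first-fit bitmap allocator for TCE-like page ranges.
 * Plain userspace C so it can be compiled and poked at directly;
 * the real thing lives in the kernel and uses the kernel bitops. */
#include <stdio.h>

#define TCE_ENTRIES	512			/* pretend window size, in pages */

static unsigned char tce_map[TCE_ENTRIES / 8];	/* one bit per TCE */

static int tce_test(int i)   { return tce_map[i / 8] >> (i % 8) & 1; }
static void tce_set(int i)   { tce_map[i / 8] |= 1 << (i % 8); }
static void tce_clear(int i) { tce_map[i / 8] &= ~(1 << (i % 8)); }

/* Find npages consecutive free entries, mark them used, return the index. */
static long tce_alloc(unsigned int npages)
{
	unsigned int start, i;

	for (start = 0; start + npages <= TCE_ENTRIES; start++) {
		for (i = 0; i < npages && !tce_test(start + i); i++)
			;
		if (i == npages) {
			for (i = 0; i < npages; i++)
				tce_set(start + i);
			return start;
		}
	}
	return -1;				/* table exhausted */
}

static void tce_free(long start, unsigned int npages)
{
	while (npages--)
		tce_clear(start++);
}

int main(void)
{
	long a = tce_alloc(4), b = tce_alloc(2);

	printf("a=%ld b=%ld\n", a, b);		/* a=0 b=4 */
	tce_free(a, 4);
	printf("again=%ld\n", tce_alloc(3));	/* reuses the freed hole: 0 */
	return 0;
}

The linear scan is exactly where I expect high order allocations
(virtual merging, large pci_map_single requests) to hurt, which is why
the benchmarking matters.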
Some things to think about:
- Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
64bit slot)
- Likelihood of large (multiple page) non-SG requests. The e1000 comes
to mind here; it has an MTU of 16kB, so it could do a pci_map_single of
that size (there is a sketch of this after the list).
- Peak TCE usage. After chasing the Emulex TCE starvation problems, you
guys would know the figures for this better than I do.
- Whether virtual merging makes sense. Virtual merging will place more
pressure on our TCE allocation code (because we will end up asking for
many more high order TCE allocations). It's also more complex; I'd
prefer to avoid it unless we do see a performance advantage.
- We currently allocate a 2GB window in PCI space for TCEs. This is 4MB
worth of TCE tables (2GB / 4kB pages = 512K entries, 8 bytes each).
Unfortunately we have to allocate an 8MB window on POWER4 boxes because
firmware sets up some chip inits to cover the TCE region. If we
allocate less and let normal memory get into this region, our
performance grinds to a halt. (It's to do with the way TCE coherency is
done on POWER4.)
Allocating a 2GB region unconditionally is also wrong: I have a
nighthawk node that has a 3GB IO hole, and yes, there is PCI memory
allocated at 1GB and above (see below). We get away with it by luck with
the current code, but it is going to hit when we switch to your new code.
If we allocate 4MB on POWER3, 8MB on POWER4, and check the OF properties
to get the correct range, we should be good.
- False sharing between the CPU and the host bridge. We store a few things
to a TCE cacheline (eg for an SG list) then initiate IO. The IO device
requests the first address, and the host bridge realises it must do a TCE
lookup, so it caches this cacheline.
Meanwhile the CPU is setting up another request. It stores to the same
cacheline, which forces the cacheline in the host bridge to be flushed.
The bridge still hasn't completed the first SG list, so it has to refetch
it.
I think the answer here is to allocate an SG list within a cacheline and
then move on to the next cacheline for the next request (sketched after
this list). As suggested by davem we should convert the network drivers
over to using SG lists.
- TCE table bypass, DAC.
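To make the multi-page non-SG point concrete, a jumbo frame receive
buffer mapped in one go would look something like this with the 2.6 DMA
API (a sketch only; the 16kB size and the names around the call are
assumptions, not actual e1000 code):

/* Map one large, physically contiguous receive buffer for DMA.
 * Behind pci_map_single() this ties up RX_BUF_LEN/PAGE_SIZE (four,
 * with 4kB pages) consecutive TCEs until pci_unmap_single() runs. */
#include <linux/pci.h>

#define RX_BUF_LEN	16384		/* assumed 16kB jumbo-frame buffer */

static dma_addr_t map_rx_buffer(struct pci_dev *pdev, void *buf)
{
	return pci_map_single(pdev, buf, RX_BUF_LEN, PCI_DMA_FROMDEVICE);
}

Every in-flight buffer of that size then needs four contiguous entries
from the allocator, which is the kind of pressure the bitmap scan will
see under load.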
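And for the false sharing item, what I have in mind is simply rounding
each request's TCE allocation out to a cacheline so the next request
starts on a fresh line. A sketch on top of the toy allocator above
(hypothetical names and constants; POWER4 has 128 byte cachelines and
8 byte TCEs, so 16 entries per line):

#define TCE_ENTRY_SIZE	8			/* bytes per TCE */
#define CACHELINE_SIZE	128			/* POWER4 cacheline */
#define TCES_PER_LINE	(CACHELINE_SIZE / TCE_ENTRY_SIZE)

/* Allocate TCEs for one request, padded so no two requests ever share
 * a cacheline of the TCE table: the bridge can keep a line cached for
 * an in-flight SG list without our stores for the next request forcing
 * it to be flushed and refetched. */
static long tce_alloc_request(unsigned int npages)
{
	unsigned int want =
		(npages + TCES_PER_LINE - 1) & ~(TCES_PER_LINE - 1);

	return tce_alloc(want);
}

The cost is up to 15 wasted entries per request, which feeds straight
back into the peak TCE usage question above.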
Anton
Bus 0, device 12, function 0:
SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 3).
IRQ 17.
Master Capable. Latency=74. Min Gnt=17.Max Lat=64.
I/O at 0x7ffc00 [0x7ffcff].
Non-prefetchable 32 bit memory at 0x40101000 [0x401010ff].
Non-prefetchable 32 bit memory at 0x40100000 [0x40100fff].