[2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
olof at austin.ibm.com
Mon Feb 9 13:17:33 EST 2004
Anton,
Thanks for the feedback. Comments to a few of the items are below.
-Olof
On Sun, 8 Feb 2004, Anton Blanchard wrote:
> > I also replaced the old buddy-style allocator for TCE ranges with a
> > simpler bitmap allocator. Time and benchmarking will tell if it's
> > efficient enough, but it's fairly well abstracted and can easily be
> > replaced
>
> Agreed. I suspect (as with our SLB allocation code) we will only know
> once the big IO benchmarks have beaten on it. We should get Jose, Rick
> and Nancy onto it as soon as possible.
>
> Some things to think about:
>
> - Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
> 64bit slot)
Not sure I follow this one. Are you saying we need to guarantee a minimum?
On SMP we just carve the space reserved by the PHB into equal chunks per
slot; on LPAR we use the ibm,dma-window property to size the space.
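Roughly, as a toy sketch only (the argument list and the cell layout assumed
for ibm,dma-window here are just for illustration, not the actual code):

/*
 * Toy sketch only: how the per-slot TCE space gets sized in the two
 * cases above.  The cell positions assumed for ibm,dma-window are
 * illustrative, not the real layout handling.
 */
unsigned long long tce_space_for_slot(int is_lpar,
                                      unsigned long long phb_window,
                                      int nr_slots,
                                      const unsigned int *dma_window)
{
        if (!is_lpar)
                /* SMP: equal chunk of the PHB's reserved window per slot */
                return phb_window / nr_slots;

        /*
         * LPAR: size comes straight from the ibm,dma-window property;
         * assume here that it lives in the last two 32-bit cells.
         */
        return ((unsigned long long)dma_window[3] << 32) | dma_window[4];
}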
> - Likelihood of large (multiple page) non-SG requests. The e1000 comes
> to mind here; it has an MTU of 16kB so could do a pci_map_single of
> that size.
Yes, the behaviour here would be interesting. It'll still only be 4-page
allocations so I don't expect any trouble. Allocations closer to 16 pages
would be more likely to fail due to fragmentation.
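To put numbers on it (a standalone illustration, assuming the usual 4kB
IOMMU page size; the helper is made up for the example):

#include <stdio.h>

#define IOMMU_PAGE_SHIFT 12                     /* 4kB TCE pages */
#define IOMMU_PAGE_SIZE  (1UL << IOMMU_PAGE_SHIFT)
#define IOMMU_PAGE_MASK  (IOMMU_PAGE_SIZE - 1)

/* Number of TCE entries a single mapping of 'len' bytes at 'vaddr' needs,
 * counting the partial pages at either end. */
static unsigned long tce_pages(unsigned long vaddr, unsigned long len)
{
        return ((vaddr & IOMMU_PAGE_MASK) + len + IOMMU_PAGE_MASK)
                >> IOMMU_PAGE_SHIFT;
}

int main(void)
{
        /* A 16kB e1000-style buffer: 4 entries if page aligned, 5 if not. */
        printf("%lu\n", tce_pages(0x10000, 16 * 1024));  /* -> 4 */
        printf("%lu\n", tce_pages(0x10200, 16 * 1024));  /* -> 5 */
        return 0;
}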
> - Peak TCE usage. After chasing emulex TCE starvation you guys would
> know the figures for this better than I.
Good point. I'll follow up on this.
[...]
> - We currently allocate a 2GB window in PCI space for TCEs. This is 4MB
> worth of TCE tables. Unfortunately we have to allocate an 8MB window
> on POWER4 boxes because firmware sets up some chip inits to cover the
> TCE region. If we allocate less and let normal memory get into this
> region, our performance grinds to a halt. (It's to do with the way
> TCE coherency is done on POWER4).
>
> Allocating a 2GB region unconditionally is also wrong; I have a
> nighthawk node that has a 3GB IO hole, and yes there is PCI memory
> allocated at 1GB and above (see below). We get away with it by luck with
> the current code but it's going to hit when we switch to your new code.
>
> If we allocate 4MB on POWER3, 8MB on POWER4 and check the OF
> properties to get the correct range we should be good.
Can we blindly base this on architecture, i.e. assume all POWER4-based
systems will be fine with 8MB and POWER3-based ones with 4MB?
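For reference, the arithmetic behind the 4MB/8MB figures, plus a sketch of
the selection (the is_power4 flag is just illustrative, not the actual test
we'd use):

/*
 * Where the 4MB/8MB figures come from:
 *
 *   2GB window / 4kB per mapped page = 512K TCE entries
 *   512K entries * 8 bytes per TCE   = 4MB of table
 *
 * On POWER4 that is doubled to 8MB so the chip inits set up by
 * firmware never end up covering ordinary memory.
 */
#define TCE_WINDOW_SIZE (2ULL << 30)    /* 2GB of PCI DMA space */
#define TCE_PAGE_SHIFT  12              /* 4kB per mapped page  */
#define TCE_ENTRY_SIZE  8               /* one TCE is 8 bytes   */

unsigned long long tce_table_bytes(int is_power4)
{
        unsigned long long bytes;

        bytes = (TCE_WINDOW_SIZE >> TCE_PAGE_SHIFT) * TCE_ENTRY_SIZE; /* 4MB */
        if (is_power4)
                bytes *= 2;                                           /* 8MB */
        return bytes;
}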
> - False sharing between the CPU and host bridge. We store a few things
> to a TCE cacheline (eg for an SG list) then initiate IO. The IO device
> requests the first address, the host bridge realises it must do a TCE
> lookup. It then caches this cacheline.
>
> Meantime the CPU is setting up another request. It stores to the same
> cacheline which forces the cacheline in the host bridge to be flushed.
> It still hasn't completed the first SG list, so it has to refetch it.
>
> I think the answer here is to allocate an SG list within a cacheline
> then move onto the next cacheline for the next request. As suggested
> by davem we should convert the network drivers over to using SG lists.
There's room for improvement here, but I did a simple first implementation
that tries to honor the PHB cache lines when allocating:
The "next" hint will always be bumped up to a new cacheline, and SG list
allocations will use their own hint instead of the next pointer. The
result should be that SG lists are packed in cache lines, while other
allocations should always move between lines.
The downside comes when the table gets full, but at least the load should be
evenly spread between cache lines. I.e. while we'll have sharing, it should
be evenly distributed in round-robin fashion. The next-allocation hint is
explicitly NOT updated at free time to accommodate this.
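A toy model of the hint handling, just to make it concrete (the names, table
size and 128-byte PHB line size below are made up for illustration, not the
patch itself):

#include <string.h>

#define TCE_ENTRY_SIZE    8                      /* bytes per TCE          */
#define PHB_CACHELINE     128                    /* host bridge line size  */
#define TCES_PER_LINE     (PHB_CACHELINE / TCE_ENTRY_SIZE)
#define TABLE_ENTRIES     4096                   /* toy table size         */

static unsigned char tce_used[TABLE_ENTRIES];    /* 1 = entry in use       */
static unsigned long hint_single;                /* bumped per allocation  */
static unsigned long hint_sg;                    /* packs SG lists         */

/* Find 'npages' free consecutive entries starting at *hint, wrapping once. */
static long tce_range_alloc(unsigned long *hint, unsigned long npages)
{
        unsigned long start = *hint % TABLE_ENTRIES;
        unsigned long i, n, pass;

        for (pass = 0; pass < 2; pass++, start = 0) {
                for (i = start; i + npages <= TABLE_ENTRIES; i++) {
                        for (n = 0; n < npages && !tce_used[i + n]; n++)
                                ;
                        if (n == npages) {
                                memset(&tce_used[i], 1, npages);
                                *hint = i + npages;
                                return i;
                        }
                        i += n;         /* skip past the used entry */
                }
        }
        return -1;                      /* table full */
}

long tce_alloc_single(unsigned long npages)
{
        long entry = tce_range_alloc(&hint_single, npages);

        /* round the hint up so the next caller starts on a fresh line */
        if (entry >= 0)
                hint_single = (hint_single + TCES_PER_LINE - 1)
                              & ~(unsigned long)(TCES_PER_LINE - 1);
        return entry;
}

long tce_alloc_sg(unsigned long npages)
{
        /* SG entries share hint_sg and so stay packed within lines */
        return tce_range_alloc(&hint_sg, npages);
}

void tce_free(long entry, unsigned long npages)
{
        /* hints are deliberately left alone, as described above */
        memset(&tce_used[entry], 0, npages);
}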
> - TCE table bypass, DAC.
Yes, table bypass will be useful for machines with less than 2GB of memory
as well (G5s in particular). Ben has a ppc_pci_md (or similar) with function
pointers to the pci_*map* functions; we might need something like that for
boot/runtime selection of allocation behaviour.
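Something along these lines, purely hypothetical (none of the names below
are real kernel symbols):

#include <stddef.h>

/* Hypothetical stand-in for dma_addr_t. */
typedef unsigned long dma_addr_t_ex;

struct pci_dma_ops_ex {
        dma_addr_t_ex (*map_single)(void *vaddr, size_t size);
        void          (*unmap_single)(dma_addr_t_ex dma, size_t size);
};

/* Direct ("bypass") path: DMA address is just the physical address. */
static dma_addr_t_ex direct_map(void *vaddr, size_t size)
{
        (void)size;
        return (dma_addr_t_ex)vaddr;    /* pretend virt == phys in the toy */
}
static void direct_unmap(dma_addr_t_ex dma, size_t size)
{
        (void)dma; (void)size;
}

/* TCE path would allocate table entries; stubbed out here. */
static dma_addr_t_ex tce_map(void *vaddr, size_t size)
{
        (void)vaddr; (void)size;
        return 0;                       /* would return a mapped bus address */
}
static void tce_unmap(dma_addr_t_ex dma, size_t size)
{
        (void)dma; (void)size;
}

static struct pci_dma_ops_ex direct_dma_ops = { direct_map, direct_unmap };
static struct pci_dma_ops_ex tce_dma_ops    = { tce_map, tce_unmap };

static struct pci_dma_ops_ex *pci_dma_ops;

/* Chosen once at boot, e.g. from total memory and bridge capabilities. */
void pci_dma_init(int machine_needs_iommu)
{
        pci_dma_ops = machine_needs_iommu ? &tce_dma_ops : &direct_dma_ops;
}

/* Callers then dispatch through the selected ops. */
dma_addr_t_ex pci_map_single_ex(void *vaddr, size_t size)
{
        return pci_dma_ops->map_single(vaddr, size);
}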
-Olof
** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/