Generic IOMMU pooled allocator

cascardo at linux.vnet.ibm.com
Thu Mar 26 11:43:42 AEDT 2015


On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> From: Benjamin Herrenschmidt <benh at kernel.crashing.org>
> Date: Tue, 24 Mar 2015 13:08:10 +1100
> 
> > For the large pool, we don't keep a hint so we don't know it's
> > wrapped, in fact we purposefully don't use a hint to limit
> > fragmentation on it, but then, it should be used rarely enough that
> > flushing always is, I suspect, a good option.
> 
> I can't think of any use case where the largepool would be hit a lot
> at all.

Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
driver mapped a whole 64KiB page, it would hit the largepool.
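
For context, the pool selection boils down to comparing the mapping
size, in IOMMU pages, against a fixed threshold. The toy sketch below
models that arithmetic; the threshold of 15 IOMMU pages is what I
recall from iommu_range_alloc() on powerpc, so treat it as an
assumption rather than a quote of the current code.

/*
 * Standalone model of the pool-selection arithmetic.  The 15-page
 * cut-off is assumed, not copied from the tree.
 */
#include <stdio.h>

#define LARGEALLOC_THRESHOLD	15	/* assumed cut-off, in IOMMU pages */

static int hits_largepool(unsigned long bytes, unsigned int iommu_page_shift)
{
	unsigned long iommu_page_size = 1UL << iommu_page_shift;
	unsigned long npages =
		(bytes + iommu_page_size - 1) >> iommu_page_shift;

	return npages > LARGEALLOC_THRESHOLD;
}

int main(void)
{
	/* 64KiB mapping with 4KiB IOMMU pages: 16 pages -> largepool. */
	printf("64KiB map, 4KiB IOMMU pages:  largepool=%d\n",
	       hits_largepool(64 * 1024, 12));
	/* Same mapping with 64KiB IOMMU pages: 1 page -> normal pools. */
	printf("64KiB map, 64KiB IOMMU pages: largepool=%d\n",
	       hits_largepool(64 * 1024, 16));
	return 0;
}

With 64KiB IOMMU pages the same mapping fits in a single IOMMU page,
which is part of why the dynamic IOMMU_PAGE_SIZE I mention below
changes the picture.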

I have suspected for some time that, after Anton's work on the pools,
the large-mappings optimization would throw away the benefit of using
the 4 pools, since some drivers would always hit the largepool.

Of course, drivers that map entire pages, when not buggy, are already
optimized to avoid calling dma_map all the time. I worked on that for
mlx4_en, and I would expect that its receive side would always hit the
largepool.

So, I decided to experiment and count the number of times that
largealloc is true versus false.
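
The counting itself was nothing fancy. A minimal sketch of the kind of
instrumentation I mean is below; the counter names and the helper are
made up, and reading the values back (printk, debugfs, whatever) is
left out:

#include <linux/atomic.h>
#include <linux/types.h>

/* Crude counters for how often each side of the decision is taken. */
static atomic64_t largepool_hits = ATOMIC64_INIT(0);
static atomic64_t smallpool_hits = ATOMIC64_INIT(0);

/* Called right after largealloc is decided in the allocator. */
static inline void count_pool_hit(bool largealloc)
{
	if (largealloc)
		atomic64_inc(&largepool_hits);
	else
		atomic64_inc(&smallpool_hits);
}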

On the transmit side, or when using ICMP, I didn't notice many large
allocations with qlge or cxgb4.

However, when using large TCP send/recv (I used uperf with 64KB
writes/reads), I noticed that on the transmit side largealloc is not
used, but on the receive side cxgb4 uses largealloc almost
exclusively, while qlge shows roughly a 1/1 ratio of largealloc to
non-largealloc mappings. When turning GRO off, that ratio is closer to
1/10, which still means a fair amount of largealloc use in that
scenario.

I confess my experiments are not complete. I would like to test a
couple of other drivers as well, including mlx4_en and bnx2x, and to
test with small packet sizes. I suspected that the MTU could make a
difference, but with ICMP, an MTU of 9000 and a payload of 8000 bytes,
I didn't notice any significant largepool usage with either qlge or
cxgb4.

Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in
the latest code, with plans to use 64KiB in some situations; Alexey or
Ben should have more details.

But I believe that on the receive side, all drivers should map entire
pages, using an allocation strategy similar to mlx4_en's, in order to
avoid DMA mapping all the time. Some believe that is bad for latency
and prefer to call something like skb_alloc for every packet received,
but I haven't seen any hard numbers, and I don't know why such an
allocator couldn't be made as good as something like the SLAB/SLUB
allocator. Maybe there is a jitter problem, since the allocator has to
go out and get new pages and map them once in a while, but I don't see
why that would not be a problem with SLAB/SLUB as well. Calling
dma_map is even worse with the current implementation; it is just that
some architectures do no work at all when dma_map/unmap is called.
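
To make that concrete, here is a minimal sketch of the page-recycling
idea, along the lines of what mlx4_en does on receive. The structure
and names are mine, not the driver's, and the refcounting a real
driver needs is simplified away:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* One RX buffer source: a page that is DMA-mapped once and carved up. */
struct rx_page_cache {
	struct page	*page;
	dma_addr_t	dma;
	unsigned int	offset;		/* next free byte within the page */
};

/*
 * Hand out a buf_size slice of the already-mapped page; only when the
 * page is exhausted do we allocate and dma_map a fresh one, so the
 * IOMMU is touched once per page rather than once per packet.
 * Returns 0 and the bus address in *dma_out, or -ENOMEM.
 */
static int rx_cache_get(struct device *dev, struct rx_page_cache *c,
			unsigned int buf_size, dma_addr_t *dma_out)
{
	if (!c->page || c->offset + buf_size > PAGE_SIZE) {
		struct page *page = alloc_page(GFP_ATOMIC);
		dma_addr_t dma;

		if (!page)
			return -ENOMEM;
		dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
		if (dma_mapping_error(dev, dma)) {
			__free_page(page);
			return -ENOMEM;
		}
		/*
		 * Simplification: a real driver refcounts the old page
		 * and unmaps it only once the stack has consumed every
		 * fragment carved out of it; here we just drop it.
		 */
		if (c->page) {
			dma_unmap_page(dev, c->dma, PAGE_SIZE,
				       DMA_FROM_DEVICE);
			put_page(c->page);
		}
		c->page = page;
		c->dma = dma;
		c->offset = 0;
	}

	*dma_out = c->dma + c->offset;
	c->offset += buf_size;
	return 0;
}

Note that this is exactly the pattern that makes the receive side hit
the largepool when IOMMU_PAGE_SIZE is 4KiB: every dma_map_page call
above covers a whole 64KiB page at once.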

Hope that helps in considering the best strategy for DMA space
allocation at this point.

Regards.
Cascardo.


