[PATCH] mlx4_en: map entire pages to increase throughput

Thadeu Lima de Souza Cascardo cascardo at linux.vnet.ibm.com
Tue Jul 17 05:06:12 EST 2012


On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
> On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote:
> >In its receive path, mlx4_en driver maps each page chunk that it pushes
> >to the hardware and unmaps it when pushing it up the stack. This limits
> >throughput to about 3Gbps on a Power7 8-core machine.
> 
> That seems rather extraordinarily low - Power7 is supposed to be a
> rather high performance CPU.  The last time I noticed O(3Gbit/s) on
> 10G for bulk transfer was before the advent of LRO/GRO - that was in
> the x86 space though.  Is mapping really that expensive with Power7?
> 

Copying linuxppc-dev and Anton here. But I can tell you that we see
lock contention when doing the mappings on the same adapter (there is
one map table per device). Anton has sent some patches that improve
that *a lot*.

However, at 1500 MTU, mlx4_en was doing two unmaps and two maps per
packet. The problem is not the CPU power needed to do the mappings, but
the lock contention: the CPUs end up spending more than 30% of their
time on spin locks.
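
To make that concrete, here is a minimal sketch of the whole-page
approach using the generic DMA API. The rx_page structure and the
rx_page_* helpers are made-up names for illustration, not the driver's;
the real code also has to deal with per-MTU fragment sizes, ring
bookkeeping and the unmap on the completion path.

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* One receive page, mapped once for its whole lifetime. */
struct rx_page {
	struct page *page;
	dma_addr_t dma;		/* DMA address of the whole page */
	unsigned int offset;	/* start of the next unused fragment */
};

/* Allocate and map a full page with a single dma_map_page() call. */
static int rx_page_alloc(struct device *dev, struct rx_page *rp)
{
	rp->page = alloc_page(GFP_ATOMIC);
	if (!rp->page)
		return -ENOMEM;

	rp->dma = dma_map_page(dev, rp->page, 0, PAGE_SIZE,
			       DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, rp->dma)) {
		__free_page(rp->page);
		return -ENOMEM;
	}
	rp->offset = 0;
	return 0;
}

/*
 * Carve the next fragment out of the page and return the DMA address
 * to post in the receive descriptor.  No per-fragment dma_map_page()
 * call, hence no trip through the contended map-table lock.  Returns
 * false once the page is exhausted and a new one must be set up.
 */
static bool rx_page_get_frag(struct rx_page *rp, unsigned int frag_size,
			     dma_addr_t *addr)
{
	if (rp->offset + frag_size > PAGE_SIZE)
		return false;

	*addr = rp->dma + rp->offset;
	rp->offset += frag_size;
	return true;
}

/* Called once every fragment of the page has gone up the stack. */
static void rx_page_release(struct device *dev, struct rx_page *rp)
{
	dma_unmap_page(dev, rp->dma, PAGE_SIZE, DMA_FROM_DEVICE);
}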

> 
> >One solution is to map the entire allocated page at once. However, this
> >requires that we keep track of every page fragment we give to a
> >descriptor. We also need to work with the discipline that all fragments will
> >be released (in the sense that they will not be reused by the driver
> >anymore) in the order they are allocated to the driver.
> >
> >This requires that we don't reuse any fragments, every single one of
> >them must be reallocated. We do that by releasing all the fragments that
> >are processed and only after finished processing the descriptors, we
> >start the refill.
> >
> >We also must somehow guarantee that we either refill all fragments in a
> >descriptor or none at all, without handing out a page fragment that we
> >would then have to take back. Otherwise, we would break the discipline
> >of only releasing the fragments in the order they were allocated.
> >
> >This has passed page allocation fault injections (restricted to the
> >driver by using required-start and required-end) and device hotplug
> >while 16 TCP streams were able to deliver more than 9Gbps.
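
To illustrate the all-or-nothing refill described above, here is a
rough sketch building on the rx_page helpers from the earlier snippet.
Again, the names (rx_desc_pages, refill_desc, FRAGS_PER_DESC) are
invented for illustration and this is not the actual patch.

#define FRAGS_PER_DESC 2	/* e.g. two fragments per descriptor */

struct rx_desc_pages {
	struct rx_page pages[FRAGS_PER_DESC];
};

/*
 * Refill a descriptor all-or-nothing: if any allocation fails, undo
 * the pages already set up for this descriptor rather than posting a
 * partially filled descriptor, which would later force a fragment to
 * be released out of order.
 */
static int refill_desc(struct device *dev, struct rx_desc_pages *d)
{
	int i;

	for (i = 0; i < FRAGS_PER_DESC; i++) {
		if (rx_page_alloc(dev, &d->pages[i]))
			goto undo;
	}
	return 0;

undo:
	while (--i >= 0) {
		dma_unmap_page(dev, d->pages[i].dma, PAGE_SIZE,
			       DMA_FROM_DEVICE);
		__free_page(d->pages[i].page);
	}
	return -ENOMEM;
}

The receive loop would then release everything it has finished with
first, and only afterwards walk the ring calling refill_desc(), so a
failed refill leaves a descriptor empty instead of breaking the
release order.
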
> 
> What is the effect on packet-per-second performance?  (eg aggregate,
> burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
> 

I used uperf with TCP_NODELAY and 16 threads sending 64000-byte writes
from another machine for 60 seconds.

I get 5898 ops/s (3.02 Gb/s) without the patch versus 18022 ops/s
(9.23 Gb/s) with the patch.

Best regards.
Cascardo.


> rick jones
> 


