[RFC 0/1] lro: Generic Large Receive Offload for TCP traffic

Thu Jul 26 03:17:54 EST 2007

Hi,

I've ported myri10ge to use the new LRO interface.  I have attached a
preliminary patch to myri10ge.  I'm very pleased to note that the
performance is on par with my own LRO used by our out-of-tree driver
(except when using mixed MTUs; see the performance data below).

As I expected, actually porting our driver to use the LRO interface
gave me a far better understanding of the interface, and allowed for
better feedback.  I have attached a patch to the LRO code which
addresses some of the issues I mention below.

Please find below a performance summary, as well as my comments
on the LRO code, and patches to myri10ge and inet_lro. Both patches
are Signed-off-by: Andrew J. Gallatin <gallatin at myri.com>

Cheers,

Drew

===================
Performance:
===================

Here is a performance summary taken on my very low-end 2.0GHz AMD
Athlon(tm) 64 X2 Dual Core Processor 3800+ running 2.6.23-rc1 and
receiving a netperf TCP_SENDFILE test from an identical sender (which
was running 2.6.22 and our 1.3.1 "out of tree" driver).  The netserver
process was bound to a different core than the interrupt handler.  The
data reported is the median of five 60-second netperf tests.  The
following settings were in /etc/sysctl.conf on both machines:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_timestamps = 0

RX Performance for Sender MTU=1500, Receiver MTU=1500 expressed as
"x Gb/s, y %CPU receiver utilization":

rxbuf(1)  1.3.1(2)   inet_lro   no LRO
--------  --------   --------   --------
4K pg     8.9 78%    8.8 77%    3.7 89%
8K pg     9.2 77%    9.1 77%    3.7 89%
16K pg    9.4 73%    9.4 73%    3.8 89%
32K pg    9.4 72%    9.4 72%    3.9 89%
skb       N/A N/A    5.5 90%    4.1 92%

RX Performance for Sender MTU=1500, Receiver MTU=9000 expressed as
"x Gb/s, y %CPU receiver utilization":

rxbuf(1)  1.3.1(2)   inet_lro   no LRO
--------  --------   --------   --------
4K pg     8.9 78%    7.3 79%    3.7 89%
8K pg     9.2 77%    7.6 79%    3.7 89%
16K pg    9.4 73%    8.0 78%    3.8 89%
32K pg    9.4 72%    8.2 79%    3.9 89%
skb       N/A N/A    4.9 92%    4.1 92%

RX Performance for Sender MTU=9000, Receiver MTU=9000 expressed as
"x Gb/s, y %CPU receiver utilization":

rxbuf(1)  1.3.1(2)   inet_lro   no LRO
--------  --------   --------   --------
4K pg     9.9 63%    9.6 66%    8.3 71%
8K pg     9.9 60%    9.9 63%    8.4 72%
16K pg    9.9 55%    9.9 55%    8.7 70%
32K pg    9.9 53%    9.9 53%    8.9 67%
skb       N/A N/A    9.9 68%    8.7 72%

(1) "xK pg" means the driver was configured to adjust the receive page
size using MYRI10GE_ALLOC_ORDER.  "skb" means an internal variant
of our driver which receives into skbs rather than pages was used.

(2) "1.3.1" is our latest out-of-tree driver, which uses the
myri10ge-specific frags-based LRO code previously submitted and rejected.

===================
Code review / comments:
===================

1) Checksum information for CHECKSUM_COMPLETE drivers.

Our NIC passes partial checksums to our driver.  In the current code,
it seems impossible for page based CHECKSUM_COMPLETE drivers to behave
correctly in the case of "rejected" frames.  E.g., there is no way
to pass the partial checksum to the LRO module so that it gets
set in the skb header and passed up the stack.

Further, there seems to be no (easy) way to use CHECKSUM_COMPLETE
on an aggregated packet at LRO flush time.  By the time a packet
is aggregated, the partial checksum from the first segment is
out of date.

I think it would make sense to require that a CHECKSUM_COMPLETE style
driver verify the checksum in its get_frag_header / get_skb_header
callback.  This allows the LRO code to unconditionally set
CHECKSUM_UNNECESSARY.

The attached patch modifies the code to do this.
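
To make the idea concrete, here is a minimal userspace sketch of the
check such a callback could perform.  It is not the code from the
attached patch.  The helper names (csum_add_buf, csum_fold16,
tcp_csum_ok) are made up for illustration, and it assumes the NIC's
partial checksum is a 16-bit ones-complement sum covering exactly the
TCP header and payload.  Adding the IPv4 pseudo-header to that sum must
fold to 0xffff for a valid segment:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate a ones-complement sum over buf, read as big-endian 16-bit words. */
static uint32_t csum_add_buf(const uint8_t *buf, size_t len, uint32_t sum)
{
    size_t i;

    assert(len <= 0xffff);  /* keeps the 32-bit accumulator from overflowing */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;  /* odd trailing byte, zero padded */
    return sum;
}

/* Fold the carries of a 32-bit accumulator back into 16 bits. */
static uint16_t csum_fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * hw_sum: partial checksum as reported by the NIC (assumed to cover the
 * TCP header and payload only).  saddr/daddr are the IPv4 addresses as
 * numeric values (0xc0a80001 == 192.168.0.1).  Returns 1 if the TCP
 * checksum verifies, 0 otherwise.
 */
static int tcp_csum_ok(uint32_t saddr, uint32_t daddr, uint16_t tcp_len,
                       uint32_t hw_sum)
{
    uint32_t sum = hw_sum;

    sum += (saddr >> 16) + (saddr & 0xffff);
    sum += (daddr >> 16) + (daddr & 0xffff);
    sum += 6;          /* IPPROTO_TCP */
    sum += tcp_len;    /* TCP header + payload length */
    return csum_fold16(sum) == 0xffff;
}
```

A callback that performs this kind of check and rejects the frame on
failure is what would let the LRO code mark every aggregated packet
CHECKSUM_UNNECESSARY.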

2) Non-accelerated VLAN tags

Our firmware currently does not do vlan tag insertion
and removal.  This causes a problem in __lro_proc_segment()
where the tcp and ip headers are set up to point into the
newly created skb.  A frame containing an unstripped vlan
tag will cause the headers to be garbage.

The attached patch modifies __lro_proc_segment() to offset
those pointers by VLAN_HLEN when required.
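
As an illustration of the offset involved (a userspace sketch, not the
attached patch), the function below finds where the IP header starts in
a raw Ethernet frame.  ip_header_offset is a made-up name; the constants
are defined locally with the same values as the kernel's ETH_HLEN,
VLAN_HLEN and ETH_P_8021Q:

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN     14      /* dst MAC + src MAC + ethertype */
#define VLAN_HLEN    4       /* 802.1Q tag: TPID + TCI */
#define ETH_P_8021Q  0x8100  /* ethertype announcing a VLAN tag */

/*
 * Return the offset of the IP header from the start of the frame,
 * skipping one unstripped VLAN tag if present.  Returns 0 if the frame
 * is too short to hold the headers it claims to carry.
 */
static size_t ip_header_offset(const uint8_t *frame, size_t len)
{
    size_t off = ETH_HLEN;
    uint16_t ethertype;

    if (len < ETH_HLEN)
        return 0;

    ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    if (ethertype == ETH_P_8021Q) {
        off += VLAN_HLEN;   /* headers sit 4 bytes further in */
        if (len < off)
            return 0;
    }
    return off;
}
```

Without the extra VLAN_HLEN, the IP and TCP header pointers for a tagged
frame would land 4 bytes early, inside the tag, which is the "garbage"
case described above.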

3) Padded frames.

I may be missing something, but I don't see where you
either strip padding from frames or reject padded frames.
(see the pskb_trim_rcsum() in net/ipv4/ip_input.c:ip_rcv())

I did not add such a feature as I was confused about the intended
use of len/true_size.

Also, trimming is a pain when dealing with pure frags (without a
containing skb).  We have code in our out-of-kernel driver to deal
with it which you are welcome to use.
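
For reference, the amount of padding can be derived from the IP total
length field.  The following is a userspace sketch only (it is neither
the out-of-kernel driver code mentioned above nor the kernel's
pskb_trim_rcsum()); eth_padding_bytes is a hypothetical helper, and it
assumes an untagged IPv4 frame with the FCS already removed:

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN      14  /* Ethernet header length */
#define IP_MIN_HLEN   20  /* IPv4 header without options */

/*
 * Return the number of trailing pad bytes in an untagged IPv4 frame,
 * i.e. the bytes beyond ETH_HLEN + IP total length.  Returns 0 when the
 * frame is too short to contain an IPv4 header or carries no padding.
 */
static size_t eth_padding_bytes(const uint8_t *frame, size_t frame_len)
{
    const uint8_t *iph;
    size_t ip_tot_len, expected;

    if (frame_len < ETH_HLEN + IP_MIN_HLEN)
        return 0;

    iph = frame + ETH_HLEN;
    ip_tot_len = ((size_t)iph[2] << 8) | iph[3];  /* tot_len, big endian */
    expected = ETH_HLEN + ip_tot_len;

    return frame_len > expected ? frame_len - expected : 0;
}
```

A pure ACK is the typical case: 20 bytes of IP plus 20 bytes of TCP give
a 54-byte frame, which is padded to the 60-byte Ethernet minimum, so 6
bytes would have to be trimmed or the frame rejected.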

4) LRO_MIN_PG_HLEN (== 80)

This confuses me.  Can you please explain what you're trying to do?
Because of this, I kept getting crashes in the skb_pull() done by
eth_type_trans() because I was passing segments which were 60 bytes
and the skb->data_len of the skb constructed by lro_gen_skb() was -20.
I added my own code to bump the length to a magic 80 bytes, and the
panics disappeared.  This may cause data corruption because of
#3 above!

5) NAPI/non-NAPI

The LRO code assumes the underlying driver uses NAPI, and calls
netif_receive_skb() rather than netif_rx().  Perhaps there should be a
field in the lro_mgr struct to specify napi / non-napi.

6) The checks for when to stop aggregating and flush in
    __lro_proc_{segment|skb} need some improvement.

The skb variant currently uses a pure count (max_aggr).  In order to
keep the resulting aggregated frame below 64KB, the underlying driver
computes max_aggr as 0xffff / MTU.  This reduces the effectiveness of
LRO on mixed MTU networks.  E.g., this causes packets coming from a
source with a 1500b MTU to be flushed after only 7 frames when using a
9000b MTU on the receiver, rather than after 43 frames.  As you can
see from the differences in inet_lro's performance in the tables
above, this is a real problem.
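
The 7 versus 43 figures follow directly from the integer division.  A
small sketch of the arithmetic (max_aggr_for_mtu is a name chosen here
for illustration, not a function in the driver):

```c
#include <assert.h>

/* Count-only limit as described above: full-MTU frames that fit below 64KB. */
static unsigned int max_aggr_for_mtu(unsigned int mtu)
{
    assert(mtu > 0);
    return 0xffff / mtu;   /* integer division, remainder discarded */
}
```

With a 9000b receiver MTU this gives 65535 / 9000 = 7, so at most 7
frames are aggregated even when each one carries only 1500 bytes; with a
1500b MTU it gives 65535 / 1500 = 43.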

I believe that a hybrid byte / max_aggr model should be used.
__lro_proc_segment() takes this approach.  Note that there is a subtle
bug in that the maximum aggregated bytes should not be 0xffff.
Rather, one must leave room for the next frame to arrive by setting
the max aggregated bytes to 0xffff - dev->mtu.  This is masked
by the driver computing max_aggr as above.
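
A minimal sketch of such a hybrid check is shown below.  It is an
illustration under stated assumptions, not the code in inet_lro or in
the attached patch: struct lro_limits and lro_should_flush are
hypothetical names, and the max_aggr value used in the example is
arbitrary.

```c
#include <assert.h>

/* Hypothetical per-device limits for this sketch. */
struct lro_limits {
    unsigned int max_aggr;  /* cap on the number of aggregated frames */
    unsigned int mtu;       /* receiver MTU (dev->mtu in the text above) */
};

/*
 * Return 1 if the current aggregate should be flushed, 0 if another
 * frame may still be merged into it.
 */
static int lro_should_flush(unsigned int pkt_count, unsigned int aggr_bytes,
                            const struct lro_limits *lim)
{
    assert(lim->mtu < 0xffff);

    if (pkt_count >= lim->max_aggr)
        return 1;
    /* Leave room for one more full-MTU frame below the 64KB limit. */
    if (aggr_bytes > 0xffff - lim->mtu)
        return 1;
    return 0;
}
```

With a 9000b receiver MTU and a max_aggr of 64, frames from a 1500b
sender keep aggregating well past the 7-frame mark, and the byte limit
of 0xffff - 9000 = 56535 is what eventually triggers the flush.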

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: inet_lro.diff
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20070725/68bc55b6/attachment.asc>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: myri10ge_lro.diff
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20070725/68bc55b6/attachment.txt>