[PATCH/RFC] 64 bit csum_partial_copy_generic

Thu Sep 11 23:45:05 EST 2008

> The current 64 bit csum_partial_copy_generic function is based on  
> the 32 bit version and never was optimized for 64 bit.  This patch  
> takes the 64 bit memcpy and adapts it to also do the sum.  It has  
> been tested on a variety of input sizes and alignments on Power5  
> and Power6 processors.  It gives correct output for all cases  
> tested.  It also runs 20-55% faster than the implementation it  
> replaces depending on size, alignment, and processor.
>
> I think there is still some room for improvement in the unaligned  
> case, but given that it is much faster than what we have now I  
> figured I'd send it out.

Did you consider the other alternative?  If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one).  Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.
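
To make that concrete, here is a rough C sketch of the idea, not the
actual routine (which would be PowerPC assembly and would also do the
copy).  The names sum32_unrolled and fold64 are made up for
illustration, and the sketch assumes the length is a whole number of
32-bit words and that the buffer is well under 16GB.  Each 32-bit word
is added into one of four independent 64-bit accumulators with a plain
add, so there is no carry chain between iterations; the carries are
folded back in only once, at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: sum nwords 32-bit words from buf into 64-bit
 * accumulators using ordinary (non-carrying) adds.  Four accumulators
 * keep the adds independent of each other.  No carry out of 64 bits
 * can occur until 2^32 words (16GB) have been summed.
 */
static uint64_t sum32_unrolled(const unsigned char *buf, size_t nwords)
{
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    uint32_t w[4];
    size_t i = 0;

    /* Unrolled body: four words per iteration, one per accumulator. */
    for (; i + 4 <= nwords; i += 4) {
        memcpy(w, buf + 4 * i, sizeof(w));  /* alignment-safe load */
        a0 += w[0];
        a1 += w[1];
        a2 += w[2];
        a3 += w[3];
    }

    /* Leftover words (fewer than four). */
    for (; i < nwords; i++) {
        memcpy(&w[0], buf + 4 * i, sizeof(w[0]));
        a0 += w[0];
    }

    return a0 + a1 + a2 + a3;
}

/*
 * Hypothetical helper: fold a 64-bit sum down to a 16-bit
 * ones' complement sum by adding the high half into the low half
 * (end-around carry), twice at each width to absorb the carry.
 */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 33 bits */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* now fits in 32 bits */
    sum = (sum & 0xffffu) + (sum >> 16);      /* at most 17 bits */
    sum = (sum & 0xffffu) + (sum >> 16);      /* now fits in 16 bits */
    return (uint16_t)sum;
}
```

A caller would compute fold64(sum32_unrolled(buf, len / 4)) and handle
any trailing bytes separately.  Whether this beats the adde loop on
Power5/Power6 is something only a measurement can tell.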

Or maybe the bottleneck here is purely the memory bandwidth?

> Signed-off-by: Joel Schopp<jschopp at austin.ibm.com>

You missed a space there.

Segher