[PATCH/RFC] 64 bit csum_partial_copy_generic
Segher Boessenkool
segher at kernel.crashing.org
Thu Sep 11 23:45:05 EST 2008
> The current 64 bit csum_partial_copy_generic function is based on
> the 32 bit version and never was optimized for 64 bit. This patch
> takes the 64 bit memcpy and adapts it to also do the sum. It has
> been tested on a variety of input sizes and alignments on Power5
> and Power6 processors. It gives correct output for all cases
> tested. It also runs 20-55% faster than the implementation it
> replaces depending on size, alignment, and processor.
>
> I think there is still some room for improvement in the unaligned
> case, but given that it is much faster than what we have now I
> figured I'd send it out.
Did you consider the other alternative? If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one). Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.
Or maybe the bottleneck here is purely the memory bandwidth?
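
To make the suggestion above concrete, here is a rough sketch in C rather than PowerPC assembly. It is not the patch's code. The names csum_copy_words and fold64 are made up for illustration, and it assumes word-aligned buffers and a length given in whole 32-bit words. It copies while summing 32-bit chunks into four independent 64-bit accumulators with plain adds, then folds the result to 16 bits at the end. The final one's complement inversion is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a 64-bit sum into a 16-bit one's complement sum
 * by repeatedly adding the high half back into the low half. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 33 bits now */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* fits in 32 bits */
    sum = (sum & 0xffffu) + (sum >> 16);      /* at most 17 bits now */
    sum = (sum & 0xffffu) + (sum >> 16);      /* fits in 16 bits */
    return (uint16_t)sum;
}

/* Hypothetical sketch: copy nwords 32-bit words from src to dst and return
 * their unfolded 64-bit sum. Assumes word alignment and less than 16GB of
 * data, so none of the plain 64-bit adds can carry out. */
static uint64_t csum_copy_words(uint32_t *dst, const uint32_t *src,
                                size_t nwords)
{
    /* Four accumulators, so the adds do not depend on each other the way
     * a chain of carrying adds would. */
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main loop, unrolled by four. */
    for (; i + 4 <= nwords; i += 4) {
        uint32_t a = src[i], b = src[i + 1];
        uint32_t c = src[i + 2], d = src[i + 3];

        dst[i] = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;

        s0 += a;
        s1 += b;
        s2 += c;
        s3 += d;
    }

    /* Leftover words, one at a time. */
    for (; i < nwords; i++) {
        dst[i] = src[i];
        s0 += src[i];
    }

    return s0 + s1 + s2 + s3;
}
```

Whether this beats an adde chain on a given processor would have to be measured; if the loop is limited by memory bandwidth, the extra accumulators buy nothing.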
> Signed-off-by: Joel Schopp<jschopp at austin.ibm.com>
You missed a space there.
Segher