csum_partial() and csum_partial_copy_generic() are badly optimized?

Mon Nov 18 02:17:41 EST 2002

> On Saturday, November 16, 2002, at 02:16  AM, Joakim Tjernlund wrote:
>
> >> The comment is probably correct.  The reason the instruction has
> >> (effectively) zero overhead is that most PowerPCs have a feature which
> >> "folds" predicted-taken branches out of the instruction stream before
> >> they are dispatched.  This effectively makes the branch cost 0 cycles,
> >> as it does not occupy integer execution resources as it would on other
> >> possible microarchitectures.
> >>
> > Hmm, I am on an mpc860 and I get big performance improvements if I apply
> > unrolling. Consider the standard CRC32 function:
> > while(len--) {
> >         result = (result << 8 | *data++) ^ crctab[result >> 24];
> > }
> > If I apply manual unrolling or compile with -funroll-loops I get
> > a >20% performance increase. Is this a special case or is
> > the mpc860 doing a bad job?
>
> Don't forget about gcc.
>
> In the code you originally were talking about, the PPC CTR register and
> bdnz instruction were used to implement the loop counter.  bdnz puts
> all the loop overhead (counter decrement, test, and branch) into one
> instruction.  Since that instruction is a branch, it can be folded, and
> thus have 0 overhead.  CTR and the instructions which operate on it
> (such as bdnz) were put into the PPC architecture mainly as an
> optimization opportunity for loops where the loop variable is not used
> inside the loop body.

Do you mean the loop variable is not USED inside the body, or not MODIFIED?
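
To make the question concrete, here is a minimal sketch of the two loop shapes I have in mind. The function names sum_counted and sum_indexed are made up for illustration and are not from the kernel source. In the first, the counter is touched only by the loop control. In the second, the index is read in the body but never written there. Whether gcc picks CTR and bdnz for either one is not something this sketch can show; that depends on the compiler version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: the trip counter n is only decremented and
 * tested by the loop control; the body never reads or writes it. */
static uint32_t sum_counted(const unsigned char *p, size_t n)
{
    uint32_t s = 0;
    while (n--)
        s += *p++;
    return s;
}

/* Hypothetical example: the index i is read in the body (p[i]) but is
 * never modified there; only the loop header changes it. */
static uint32_t sum_indexed(const unsigned char *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}
```

Both functions return the same sum for the same buffer; they differ only in how the loop variable appears in the body.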

>
> There is no guarantee that gcc will always use CTR, even for such
> obvious candidates as the crc32 loop.  gcc is simply not that great at
> PPC optimization, especially at low optimization levels.  I've just
> been playing with gcc 2.95.4 on YDL 2.3 and Apple's gcc 3.1 on OSX
> 10.2.2 (these versions are merely what I happen to have installed on
> machines that are handy).  Here's a summary of when gcc will compile
> that crc32 loop with use of CTR and bdnz (note that -O3 or above
> automatically turn on -funroll-loops, so I saw no point in testing
> those levels):
>
>            -O1    -O2    -O1 -funroll-loops    -O2 -funroll-loops
> 2.95.4    no     no     no                    no
> 3.1       no     yes    yes                   yes

Hmm, looks like I should upgrade gcc to 3.1 or possibly 3.2. However,
I think that gcc >=3.0 has changed the ABI for C++, which is bad for me.

Is 2.95.x still maintained? Maybe this optimization could be added
to that branch.

>
> If gcc isn't generating a CTR loop to start out with, the crc32 code
> will benefit more from unrolling than it should.
>
> I did a bit of crude performance testing, and with gcc 3.1 there is no
> difference in the cache-hot performance of the crc32 loop when
> switching between -O2 and -O2 -funroll-loops.
>
> Now, that _was_ on a 7455.  You would see some difference on a 860,
> because gcc 3.1 did find something else to optimize.  Here's the loop
> body for -O2:
>
> L48:
>          ; basic block 2
>          lbz r9,0(r3)    ; * data
>          rlwinm r0,r30,10,22,29  ;  result
>          lwzx r11,r10,r0 ;  crctab
>          addi r3,r3,1    ;  data,  data
>          slwi r0,r30,8   ;  result
>          or r0,r0,r9
>          xor r30,r0,r11  ;  result
>          bdnz L48
>
> With -O2 -funroll-loops, gcc copies this loop body four times, and
> transforms the addi (increments 'data' ptr by 1) and lbz (loads *data)
> instructions.  addi is hoisted from the loop body and instead the
> pointer is incremented by 4 at the end of the unrolled loop.  Each lbz
> copy has the appropriate offset (0, 1, 2, 3) to simulate the original
> pointer incrementing.  The net effect is that for every four iterations
> of the original loop we execute three fewer addi instructions.

I see.
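
For my own notes, here is a rough C-level sketch of that transformation as I understand it: four copies of the body using fixed offsets 0..3, with a single pointer increment of 4 per group. This is hand-written, not gcc output. The names crc_init, crc_simple and crc_unrolled4 are made up, the table setup (MSB-first, polynomial 0x04C11DB7) is an assumption since crctab was never shown in this thread, and handling the leftover bytes in a trailing loop is just one possible choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed table: MSB-first CRC-32, polynomial 0x04C11DB7.
 * The thread does not show how crctab is really built. */
static uint32_t crctab[256];

static void crc_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i << 24;
        for (int k = 0; k < 8; k++)
            c = (c & 0x80000000u) ? (c << 1) ^ 0x04C11DB7u : (c << 1);
        crctab[i] = c;
    }
}

/* The loop from the thread: one byte and one pointer bump per iteration. */
static uint32_t crc_simple(uint32_t result, const unsigned char *data, size_t len)
{
    while (len--)
        result = (result << 8 | *data++) ^ crctab[result >> 24];
    return result;
}

/* Hand-unrolled by four: loads use offsets 0..3 and the pointer is
 * advanced once per group, so three increments per group go away. */
static uint32_t crc_unrolled4(uint32_t result, const unsigned char *data, size_t len)
{
    size_t groups = len / 4;
    size_t tail = len % 4;

    while (groups--) {
        result = (result << 8 | data[0]) ^ crctab[result >> 24];
        result = (result << 8 | data[1]) ^ crctab[result >> 24];
        result = (result << 8 | data[2]) ^ crctab[result >> 24];
        result = (result << 8 | data[3]) ^ crctab[result >> 24];
        data += 4;
    }
    /* Leftover 0..3 bytes go through the original one-byte loop. */
    while (tail--)
        result = (result << 8 | *data++) ^ crctab[result >> 24];
    return result;
}
```

Call crc_init() once before either function. Both variants should return identical results for any length, so the only difference is the number of pointer updates executed.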

> The 7455 has a lot of integer execution units and can probably do the
> extra adds for free, but the 860 is basically a 601, and I think the
> 601 had just one IU, so the adds will not be free.

Probably.

>
> But if you go back to that original loop in csum_partial() etc., I
> don't see any opportunity to perform similar optimizations.  So I doubt
> very much that unrolling that loop would have any benefit even on the
> 860.

You are probably right. Thanks for your effort to clear this up for me.

          Jocke

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/