Apple Job Posting and Good News for LinuxPPC developers

Gabriel Paubert paubert at iram.es
Tue Apr 6 02:06:40 EST 1999




On 2 Apr 1999, Holger Bettag wrote:

> > - a 64 bit version (means >32 address pins, perhaps ~40 or so, 64 is
> >   obviously overkill right now and I don't ask for it)
> > 
> I have heard rumours that the "Max" core has provisions to physically address
> more than 4GB of memory (via the MMU's segment registers). Processes would
> still be limited to a 4GB logical address space, though.

This would be virtual and already exists: the segment registers already
map the 32 bit effective address into a 52 bit virtual address space.
But the hash table format defined for 32 bit processors only provides
for a 4GB physical address space, and extending it is far too messy.
Only a 64 bit PPC can be expected to have more than 32 address pins.
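
To make the distinction concrete, here is a minimal sketch (simplified,
not actual kernel code) of how the classic 32 bit PPC MMU forms its
virtual address: the top 4 bits of the effective address select one of
the 16 segment registers, whose 24 bit VSID is concatenated with the
remaining 28 bits to give a 52 bit *virtual* address. That address is
then hashed into the hash table, whose PTEs still deliver only a 32 bit
physical address:

 #include <stdint.h>

 /* sr[16] stands in for the 16 segment registers */
 uint64_t ea_to_va(uint32_t ea, const uint32_t sr[16])
 {
         uint32_t vsid = sr[ea >> 28] & 0x00ffffff;  /* 24 bit VSID */
         return ((uint64_t)vsid << 28) | (ea & 0x0fffffff); /* 52 bit VA */
 }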

> > - more superscalar (not 2+branch, but at least 4 way)
> > 
> The 604e is four-way superscalar, but it has slightly lower integer performance
> per clock cycle than the 750. Apparently you can't extract significantly more
> parallelism than 2 instructions per clock from real-world code, so the 604e's
> abilities are wasted, while the 750's exceptionally elegant branch handling
> leads to measurable benefits over the 604e's more traditional (though very
> sophisticated) branch prediction.

It depends on your definition of real world code. Floating point
intensive scientific code is often extremely well behaved in this
respect and could use more than 2 instructions/clock (as long as you
don't call the math library for scalar transcendental operations).
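
A toy example of what I mean (hypothetical, not taken from the FFT code
shown below): in a loop like this, the four multiply/adds are mutually
independent, so a wider machine could issue several of them per clock
together with the loads and the loop control:

 void daxpy4(double *y, const double *x, double a, int n)
 {
         /* unrolled by 4: no dependency chain between the four fmadds */
         for (int i = 0; i + 4 <= n; i += 4) {
                 y[i]     += a * x[i];
                 y[i + 1] += a * x[i + 1];
                 y[i + 2] += a * x[i + 2];
                 y[i + 3] += a * x[i + 3];
         }
 }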

> > - longer in flight instruction queue
> > 
> Compared to the 750, "Max" has additional reservation stations and a longer
> completion queue. Motorola's first estimates were 10% more integer performance
> per clock cycle than the 750, but this probably includes the effect of
> the larger L2.
> 
> BTW, a larger number of unresolved "in flight" instructions would jeopardize
> the processor's unique (among superscalar CPUs) ability to directly execute
> branches without predicting them, because the outcome of branch conditions
> would more often still be "in flight", too.

On PPC, with 8 CR fields, you can often prepare condition codes a long
time in advance, especially in scientific code. I've checked the code
generated by GCC, and I often have 20 to 40 instructions between the CR
setting and the actual branch (which is highly predictable, BTW).
Besides this, the innermost loop is terminated by a bdnz when the
iteration count can be computed beforehand. In this last case GCC does
a bad job of optimizing the outer loop level, but it's not that
critical:

Loop not using ctr:
 104:   7c 03 f0 40     cmplw   r3,r30
 108:   7c 78 1b 78     mr      r24,r3
 10c:   40 80 00 9c     bge     1a8 <fftrtr+0x1a8>
 110:   54 69 10 3a     rlwinm  r9,r3,2,0,29
 114:   7c 09 b0 2e     lwzx    r0,r9,r22
 118:   38 63 00 01     addi    r3,r3,1
 11c:   7c 03 f0 40     cmplw   r3,r30
 ... 
[snipped 33 instructions]
 ...
 1a4:   41 80 ff 70     blt     114 <fftrtr+0x114>
 

Nested loops with bdnz on inner loop:
 298:   7f 03 c3 78     mr      r3,r24
 29c:   7c 03 f0 40     cmplw   r3,r30
 2a0:   40 80 01 04     bge     3a4 <fftrtr+0x3a4>
[Fair enough, 0 iteration count may actually happen, although it would be
better to use r24 instead of r3 for cmplw]
 2a4:   54 64 10 3a     rlwinm  r4,r3,2,0,29
 2a8:   7c 04 b0 2e     lwzx    r0,r4,r22
 2ac:   57 26 18 38     rlwinm  r6,r25,3,0,28
 2b0:   54 00 18 38     rlwinm  r0,r0,3,0,28
 2b4:   7c fd 02 14     add     r7,r29,r0
 2b8:   7d 07 32 14     add     r8,r7,r6
 2bc:   35 fa ff ff     addic.  r15,r26,-1
 2c0:   7d e9 03 a6     mtctr   r15
 2c4:   41 82 00 d0     beq     394 <fftrtr+0x394>
[Actually here the iteration count can never be zero but the compiler
can't be blamed for not knowing it]
 2c8:   7e 0c 83 78     mr      r12,r16
 2cc:   38 a0 00 00     li      r5,0
 2d0:   7d 9b 60 50     subf    r12,r27,r12
 ...
[snipped 47 instructions]
 ...
 390:   42 00 ff 40     bdnz    2d0 <fftrtr+0x2d0>
 394:   38 63 00 01     addi    r3,r3,1
 398:   7c 03 f0 40     cmplw   r3,r30
 39c:   38 84 00 04     addi    r4,r4,4
[Quite bad placement: neither r3 nor r30 is touched in the inner loop]
 3a0:   41 80 ff 08     blt     2a8 <fftrtr+0x2a8>

I don't ask for 40 or 70 in-flight instructions like the PPro/K7, but
something closer to the 604 (15-20). 6 is very little, especially
because it blocks the processor after divides (and the FPU is also
blocked, but that's the next paragraph).

And BTW, I know that a lot of code is not as predictable as scientific
code. I've written an Intel 486SX emulator, which is the only code
right now able to properly initialize my S3 video card, and there the
branch density is significantly higher; indeed, instruction dispatch
and addressing mode decoding can probably be considered a worst case in
this respect. It's true that this code would not benefit a lot from
much higher parallelism, but I consider it non-representative: it's a
byte-per-byte interpreter which, at 24kB of code+data, is small enough
to be put in early boot code/Flash/whatever, and runs at a sufficient
speed for this job on a 200MHz 603e (thanks to the fact that it does
not thrash the cache).
 
However, even in this code, a branch immediately following the
CR-setting instruction is the exception; a gap of 5 instructions is
common. For indirect branches, the ctr is typically set quite a few
instructions before the branch (and there are other conditional
branches in between anyway).
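
The dispatch loop has roughly this shape (a hypothetical sketch of the
technique, not the actual emulator source). With all 256 opcode cases
present, GCC compiles the switch into a jump table on PPC, i.e. an
mtctr issued well before the final bctr:

 #include <stdint.h>

 struct x86 {
         uint32_t eip;
         uint8_t  mem[1 << 20];  /* 1MB real mode address space */
 };

 static void step(struct x86 *c)
 {
         /* fetch one opcode byte, then dispatch through a jump table */
         uint8_t op = c->mem[c->eip++ & 0xfffff];
         switch (op) {
         case 0x90:      /* NOP */
                 break;
         case 0xf4:      /* HLT: stop the emulation loop */
                 break;
         default:
                 /* decode the addressing mode, execute, etc. */
                 break;
         }
 }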

If you want to see the code, it is at:
ftp://vcorr1.iram.es/pub/linux-2.2/mvme2600.generic-patch-2.2.4
in the arch/ppc/prepboot directory. 

> > - 2 FPU: not sure, but divides seem to kill current PPC, one
> >   unit which can do all instructions (including sqrt) + one mult/add only
> >   would be great. OTOH, on FFT benchmarks which interest me also (not a
> >   single div), PPC are excellent. 
> > 
> AFAIK, the divide algorithm used in PowerPC FPUs was selected specifically
> to re-use as much of the multiplier as possible; i.e. saving transistors
> was the foremost goal.
> 
> With today's silicon structure sizes, it would probably be a very good idea
> to have a separate divider. Possibly even one that uses a faster algorithm
> (like estimation plus refinement as is done explicitly in AltiVec).

At least we agree on that one. BTW, some time ago I wrote a square root
routine which is much faster than the one currently implemented in
glibc. Anyone want to test it? I was not able to prove mathematically
that it is fully IEEE compliant, but I suspect it is (and I can't test
all 2^53 possibilities). The code is free of divides and takes
advantage of the intermediate 106 bit precision of the fused
multiply/add instructions.
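
The general idea is the classic estimate-plus-refinement scheme. This
is a minimal sketch of the technique, not my actual routine: it ignores
zero/negative/infinite/NaN inputs, fma() stands in for the PPC fused
multiply/add, and the bit-twiddled initial guess stands in for frsqrte:

 #include <math.h>
 #include <stdint.h>
 #include <string.h>

 static double sqrt_nodiv(double x)
 {
         uint64_t i;
         double y, r;

         /* crude initial estimate of 1/sqrt(x) from the exponent bits */
         memcpy(&i, &x, sizeof i);
         i = 0x5fe6eb50c7b537a9ULL - (i >> 1);
         memcpy(&y, &i, sizeof y);

         /* Newton-Raphson: y <- y*(1.5 - 0.5*x*y*y), no divides anywhere */
         for (int k = 0; k < 4; k++)
                 y = y * fma(-0.5 * x * y, y, 1.5);

         /* sqrt(x) = x*(1/sqrt(x)); the fused multiply/add computes the
            residual x - r*r without intermediate rounding, which is where
            the extra precision for the final correction comes from */
         r = x * y;
         return fma(0.5 * y, fma(-r, r, x), r);  /* r + y/2*(x - r*r) */
 }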

> > - 2 LSU to feed all these units: hey Pentium and PPro can do 2 memory
> >   accesses per clock since they came out. They need it because they
> >   require many load/store to compensate for the small number of registers 
> >   but I'd expect it to be beneficial even on PPC with large and fast 
> >   backside L2 caches. Power2 has had 2 LSU since the beginning AFAICT.   
> >
> AFAIK, Pentium and PPro/PII/PIII do not have the equivalent of two full-blown
> LSUs. In the best case, they can do either two loads or one load plus one store,
> but not two store operations. Furthermore, their L1 cache is not really
> dual-ported, only dual-banked, so that two accesses can only be carried out
> in parallel if they hit different banks.

The Pentium has (it can even perform 2 stack pushes of register and/or
immediate values per clock). The P6 core (PPro/PII...) is 1 load + 1
store. Of course it's banked, but I would not care very much if it
were implemented as in Power2, with special instructions which allow
loading and storing 2 FPRs simultaneously, even with alignment
restrictions: with AltiVec, the HW to perform 16-byte loads and stores
is already there. I know it's more complex to implement when you need
to access 2 different FPRs, one of which may be ready and not the
other, but it's great for complex arithmetic.
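
To see why complex arithmetic is the interesting case, consider a
complex multiply-accumulate over split re/im arrays (a hypothetical
sketch): each iteration costs six loads and two stores against only
four multiply/add class FPU operations, so with a single LSU the FPU
sits idle half the time, while paired FPR loads/stores or a second LSU
would roughly balance the two:

 void cmac(double *zr, double *zi,
           const double *ar, const double *ai,
           const double *br, const double *bi, int n)
 {
         for (int i = 0; i < n; i++) {
                 double xr = ar[i], xi = ai[i];  /* 2 loads */
                 double yr = br[i], yi = bi[i];  /* 2 loads */
                 /* 2 more loads + 2 stores for zr[i]/zi[i], with only
                    4 fmadd-class operations in between */
                 zr[i] += xr * yr - xi * yi;
                 zi[i] += xr * yi + xi * yr;
         }
 }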

> And finally, if you ever have an algorithm where loads or stores are the
> bottleneck, the performance will be limited by main memory bandwidth anyway,
> regardless of L1 bandwidth (unless you are register-starved, of course, but
> that will almost never happen with 32 GPRs + 32 FPRs (+ 32VRs)).

No, I want to saturate L2 BW, which is going to be in the 10 GB/s range
soon (500 MHz @ 128 bits is 8 GB/s; double that for 256 bits). With L2
caches larger than 1 MB, my apps basically do not thrash the L2 cache
(and are even careful not to cause too many writebacks to it, which
could halve the effective BW), provided it has enough associativity
(4-way at least).
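
The "careful" part is ordinary cache blocking; for example (a
hypothetical sketch, with the tile size chosen so that two 128x128
tiles of doubles, 256kB in total, sit comfortably in a 1 MB L2):

 #define BLOCK 128

 void transpose_blocked(double *dst, const double *src, int n)
 {
         /* walk the matrix in BLOCK x BLOCK tiles so that each tile is
            read and written exactly once while it is resident in L2,
            instead of streaming whole rows and columns through it */
         for (int ii = 0; ii < n; ii += BLOCK)
                 for (int jj = 0; jj < n; jj += BLOCK)
                         for (int i = ii; i < ii + BLOCK && i < n; i++)
                                 for (int j = jj; j < jj + BLOCK && j < n; j++)
                                         dst[j * n + i] = src[i * n + j];
 }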

Besides that, 2 LSUs would help in saving/restoring registers on
procedure entry/exit (especially on 64 bit PPC, since there is no
lmd/stmd). But this is a side effect: if you're worried about the
overhead of procedure entry/exit, you've got other problems to take
care of first.

	Gabriel.



