[Linuxppc-users] xsadddp throughput on Power9

Nicolas König koenigni at student.ethz.ch
Wed Mar 20 12:54:12 AEDT 2019


Hi Bill, Segher,

Thanks for digging this up; this is really interesting new information!

But it still doesn't quite solve the puzzle: the throughput of just the 
xsadddp instructions is 2.27 instr/clk (the throughput of all 
instructions including the branch is 2.44 for the case of 12 xsadddp), 
which is more than the 2.0 we would expect for vector instructions. 
Also, since the instruction isn't tuple-restricted, each superslice can 
dispatch 3 xsadddp instr/clk, and since each slice can accept at most 2 
instructions from dispatch, both the primary and the supplementary 
dispatch port of each slice must be able to handle one xsadddp 
instruction per cycle, and only a single slice ever handles a given 
xsadddp. This means that, in contrast to vector instructions, for which 
two slices handle the writeback, one slice has to write back both the 
upper and the lower half of the vector register.
The writeback network seems to be able to handle at least 6 
writebacks/clk (the sustained throughput of mtvsrd is 3 instr/clk), so 
if the writeback network were the bottleneck, we would expect the same 
throughput for xsadddp as for mtvsrd, 3 instr/clk. Since we don't get 
that, there must still be something else.
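
(As a sanity check on those two numbers: with 12 xsadddp plus one bdnz 
per iteration, i.e. 13 instructions, an overall rate of 2.44 instr/clk 
corresponds to roughly 2.44 * 12/13 ≈ 2.25 xsadddp/clk, which is in the 
same ballpark as the 2.27 measured for xsadddp alone.)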

Thanks again for your help and awesome find :)
     Nicolas

P.S.: This time I'm not going to accidentally drop the mailing list, as 
hard as hitting reply-all might be :D
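
P.P.S.: In case anyone wants to reproduce the numbers, here is a minimal 
sketch of one way to wrap the loop for measurement (just a sketch: it 
assumes GCC on ppc64le Linux with VSX, the file/function names and the 
iteration count are arbitrary, and it counts cycles and instructions 
externally via perf stat instead of reading PMC6 directly):

/* xsadddp_bench.c -- minimal sketch, not a polished benchmark.
 * Build:   gcc -O2 -mvsx xsadddp_bench.c -o xsadddp_bench
 * Measure: perf stat -e cycles,instructions ./xsadddp_bench
 */
static double xsadddp_loop(unsigned long iters)
{
    double a = 1.0, d;

    __asm__ volatile(
        "mtctr %2\n\t"                   /* iteration count into CTR      */
        "1:\n\t"
        ".rept 12\n\t"
        "xsadddp %x0, %x1, %x1\n\t"      /* same pattern as vs2,vs1,vs1   */
        ".endr\n\t"
        "bdnz 1b"                        /* decrement CTR, branch if != 0 */
        : "=&wa"(d)                      /* early-clobber so the dest     */
        : "wa"(a), "r"(iters)            /* never aliases the source      */
        : "ctr");
    return d;
}

int main(void)
{
    return xsadddp_loop(100000000UL) > 0.0 ? 0 : 1;   /* ~1.2e9 adds */
}

Pinning it to one core (e.g. taskset -c 0) and dividing the two perf 
counters should roughly reproduce the 2.44 instr/clk above.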

On 3/18/19 4:20 PM, Bill Schmidt wrote:
> 
> 
> On 3/7/19 2:16 PM, Bill Schmidt wrote:
>> On 3/7/19 1:24 PM, Bill Schmidt wrote:
>>> Hi Nicolas,
>>>
>>> On 3/6/19 4:35 AM, Nicolas Koenig wrote:
>>>> Hello world,
>>>>
>>>> After asking this question on another mailing list, I was redirected 
>>>> to this list. I hope someone on here will be able to help me :)
>>>>
>>>> While running a few benchmarks, I noticed that the following code 
>>>> (with SMT disabled) only manages about 2.25 xsadddp instr/clk 
>>>> (measured via pmc6) instead of the expected 4:
>>>>
>>>> loop:
>>>>     .rept 12
>>>>         xsadddp %vs2, %vs1, %vs1
>>>>     .endr
>>>>     bdnz loop
>>>>
>>>> From what I can gather, the bottleneck shouldn't be the history 
>>>> buffers. Since there are no long latency operations, FIN->COMP 
>>>> shouldn't take more than 12 cycles (the size of the secondary HB for 
>>>> FPSCR, the smallest relevant one). The primary HB and the issue 
>>>> queue shouldn't overflow either, since xsadddp takes 7 cycles from 
>>>> issue to finish and they can accommodate 20 and 13 entries 
>>>> respectively, with one instruction only using one of each. It doesn't 
>>>> stall on writeback ports either, because there are only 4 results in 
>>>> any one clock and 4 writeback ports (the decrement of the bdnz 
>>>> instruction is handled in the branch slice without involving the 
>>>> writeback network).
>>>>
>>>> Has anyone here any idea where the bottleneck might be?
>>> Donald Stence was kind enough to answer this question for me.  Here is his note,
>>> which indicates this is actually performing better than you think!
>>>
>>> Hi Bill,
>>>      P9's design has it combine 64-bit execution units from two slices for processing a single 128-bit op.
>>>      Therefore, it can only issue two 128-bit ops per cycle, a theoretical max.
>> Hrm, it is pointed out to me that this is xsadddp, not xvadddp, so I don't think we have an answer yet.
> 
> Segher discovered that xsadddp has the same limitation because of its internal
> implementation on P9.  Per the P9 UM:
> 
> "Because the binary floating-point registers (FPRs) are mapped to the vector-scalar registers 0 - 31 in the
> Power ISA, the rightmost doubleword is updated with zero whenever a binary or decimal floating-point
> instruction writes the target FPR. This behavior applies to any binary or decimal floating-point instruction
> that writes an FPR, not just loads."
> 
> Thanks for letting us know about this!
> 
> Bill
> 
>> Sorry,
>> Bill
>>>   
>>>      Because the Dispatch rate is higher than the Issue rate of 2 xsadddp's per cycle, the Issue Queue
>>>      slots become full within just a few cycles, resulting in Dispatch holds (nothing gets Dispatched
>>>      for a cycle because there are no available Issue slots to place more ops into).
>>>   
>>>      The branch overlaps and actually pushes the IPC up from just 2 ops/cycle.
>>>   
>>>      Thanks,
>>>   
>>> Donald Stence
>>> IBM PSP - P10 Technical Lead
>>>
>>> Cheers,
>>> Bill
>>>>
>>>> Thanks in advance
>>>>     Nicolas