[Linuxppc-users] xsadddp throughput on Power9

Tue Mar 19 02:20:22 AEDT 2019

On 3/7/19 2:16 PM, Bill Schmidt wrote:
> On 3/7/19 1:24 PM, Bill Schmidt wrote:
>> Hi Nicolas,
>>
>> On 3/6/19 4:35 AM, Nicolas Koenig wrote:
>>> Hello world,
>>>
>>> After asking this question on another mailing list, I was redirected
>>> to this list. I hope someone on here will be able to help me :)
>>>
>>> While running a few benchmarks, I noticed that the following code
>>> (with SMT disabled) only manages about 2.25 xsadddp instr/clk
>>> (measured via pmc6) instead of the expected 4:
>>>
>>> loop:
>>>     .rept 12
>>>         xsadddp %vs2, %vs1, %vs1
>>>     .endr
>>>     bdnz loop
>>>
>>> From what I can gather, the bottleneck shouldn't be the history
>>> buffers. Since there are no long latency operations, FIN->COMP
>>> shouldn't take more than 12 cycles (the size of the secondary HB for
>>> FPSCR, the smallest relevant one). The primary HB and the issue
>>> queue shouldn't overflow either, since xsadddp takes 7 cycles from
>>> issue to finish and they can accommodate 20 and 13 entries
>>> respectively with one instruction only using one of each. It doesn't
>>> stall on writeback ports either, because there are only 4 results in
>>> any one clock and 4 writeback ports (the decrement of the bdnz
>>> instruction is handled in the branch slice without involving the
>>> writeback network).
>>>
>>> Has anyone here any idea where the bottleneck might be?
>> Donald Stence was kind enough to answer this question for me.  Here is his note,
>> which indicates this is actually performing better than you think!
>>
>> Hi Bill,
>>     By design, P9 combines the 64-bit execution units of two slices to process a single 128-bit op.
>>     Therefore, it can issue at most two 128-bit ops per cycle (the theoretical maximum).
> Hrm, it is pointed out to me that this is xsadddp, not xvadddp, so I don't think we have an answer yet.

Segher discovered that xsadddp has the same limitation because of its internal
implementation on P9.  Per the POWER9 Processor User's Manual (P9 UM):

"Because the binary floating-point registers (FPRs) are mapped to the vector-scalar registers 0 - 31 in the
Power ISA, the rightmost doubleword is updated with zero whenever a binary or decimal floating-point
instruction writes the target FPR. This behavior applies to any binary or decimal floating-point instruction
that writes an FPR, not just loads."
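
A rough sanity check of the numbers in this thread, as a sketch only. It assumes
that xsadddp is limited to two issues per cycle (the 128-bit limit described in
Donald's note below), that the adds are the only bottleneck, and that the bdnz
overlaps with them at no extra cost. The function name and its parameters are
hypothetical and exist purely for illustration; this is not a model of the real
P9 pipeline.

```python
from fractions import Fraction


def loop_ipc(adds_per_iter=12, branches_per_iter=1, adds_per_cycle=2):
    """Upper-bound throughput for the benchmark loop under a simple issue limit.

    Assumption: the adds alone determine the cycle count, and the branch
    issues in parallel without costing an extra cycle.
    """
    # Cycles needed to issue one loop iteration's worth of adds.
    cycles = Fraction(adds_per_iter, adds_per_cycle)
    # Adds completed per cycle, and all instructions (adds + branch) per cycle.
    adds_ipc = Fraction(adds_per_iter) / cycles
    total_ipc = Fraction(adds_per_iter + branches_per_iter) / cycles
    return adds_ipc, total_ipc


adds_ipc, total_ipc = loop_ipc()
print(f"xsadddp per cycle:          {float(adds_ipc):.2f}")   # 2.00
print(f"all instructions per cycle: {float(total_ipc):.2f}")  # 2.17
```

Under these assumptions the 12 adds take 6 cycles, so the loop retires 13
instructions in 6 cycles. That is in the neighborhood of, though not identical
to, the roughly 2.25 reported at the top of the thread, and well short of the
4 per cycle originally expected.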

Thanks for letting us know about this!

Bill

>
> Sorry,
> Bill
>>  
>>     The Dispatch rate is higher than the Issue rate of 2 xsadddp's per cycle, so the Issue Queue
>>     slots become full within just a few cycles, which results in Dispatch holds (nothing gets Dispatched
>>     for a cycle because there are no available Issue slots to place more ops into).
>>  
>>     The branch overlaps with the adds and actually pushes the IPC above just 2 ops/cycle.
>>  
>>     Thanks,
>>  
>> Donald Stence
>> IBM PSP - P10 Technical Lead
>>  
>> Cheers,
>> Bill
>>>
>>> Thanks in advance
>>>     Nicolas
>>> _______________________________________________
>>> Linuxppc-users mailing list
>>> Linuxppc-users at lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-users
>>>
