[Linuxppc-users] xsadddp throughput on Power9

Fri Mar 8 10:28:15 AEDT 2019


On 3/8/19 12:10 AM, Nicolas König wrote:
> 
> 
> On 3/7/19 9:16 PM, Bill Schmidt wrote:
>> On 3/7/19 1:24 PM, Bill Schmidt wrote:
>>> Hi Nicolas,
>>>
>>> On 3/6/19 4:35 AM, Nicolas Koenig wrote:
>>>> Hello world,
>>>>
>>>> After asking this question on another mailing list, I was redirected 
>>>> to this list. I hope someone on here will be able to help me :)
>>>>
>>>> While running a few benchmarks, I noticed that the following code 
>>>> (with SMT disabled) only manages about 2.25 xsadddp instr/clk 
>>>> (measured via pmc6) instead of the expected 4:
>>>>
>>>> loop:
>>>>     .rept 12
>>>>         xsadddp %vs2, %vs1, %vs1
>>>>     .endr
>>>>     bdnz loop
>>>>
>>>> From what I can gather, the bottleneck shouldn't be the history 
>>>> buffers. Since there are no long latency operations, FIN->COMP 
>>>> shouldn't take more than 12 cycles (the size of the secondary HB for 
>>>> FPSCR, the smallest relevant one). The primary HB and the issue 
>>>> queue shouldn't overflow either, since xsadddp takes 7 cycles from 
>>>> issue to finish and they can accommodate 20 and 13 entries 
>>>> respectively, with one instruction only using one of each. It doesn't 
>>>> stall on writeback ports either, because there are only 4 results in 
>>>> any one clock and 4 writeback ports (the decrement of the bdnz 
>>>> instruction is handled in the branch slice without involving the 
>>>> writeback network).
>>>>
>>>> Has anyone here any idea where the bottleneck might be?
>>> Donald Stence was kind enough to answer this question for me.  Here 
>>> is his note,
>>> which indicates this is actually performing better than you think!
>>>
>>> Hi Bill,
>>>      P9's design has it combine 64-bit execution units from 
>>> two slices for processing a single 128-bit op.
>>>      Therefore, it can only issue two 128-bit ops per cycle, 
>>> a theoretical max.
>>
>> Hrm, it is pointed out to me that this is xsadddp, not xvadddp, so I 
>> don't think we have an answer yet.
>>
> 
> Hi Bill,
> 
> Thanks for looking into this :) I ran a test with xvadddp as well; it 
> yielded 1.72 xvadddp instr/cycle, while running it with clzd and addi 

s/clzd/cntlzd
(I chose these two instructions because addi goes to the ALU pipe and 
cntlzd goes to the ALU2 pipe)

> both resulted in the expected 4 non-branch instructions/clk (3.85 and 
> 3.95 to be precise).
> 
> I attached the function I used for measuring the throughput of xsadddp. 
> I hope that helps a bit.
> 
>      Nicolas
> 
>> Sorry,
>> Bill
>>
>>>      The Dispatch rate is higher than the Issue rate of 2 xsadddp's
>>>      per cycle, so the Issue Queue slots become full within just a
>>>      few cycles, which results in Dispatch holds (nothing gets
>>>      Dispatched for a cycle because there are no available Issue
>>>      slots to place more ops into).
>>>      The branch overlaps and actually pushes the IPC up from just 2 
>>> ops/cycle.
>>>      Thanks,
>>> Donald Stence
>>> IBM PSP - P10 Technical Lead
>>>
>>> Cheers,
>>> Bill
>>>>
>>>> Thanks in advance
>>>>     Nicolas
>>>> _______________________________________________
>>>> Linuxppc-users mailing list
>>>> Linuxppc-users at lists.ozlabs.org
>>>> https://lists.ozlabs.org/listinfo/linuxppc-users
>>>>
>>>
>>>
>>
>>
>>
> 
>
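
The arithmetic in Donald Stence's note quoted above can be sketched in a few lines of Python. This is only an illustrative model, not a description of the real Power9 pipeline: it assumes the loop is limited solely by a fixed issue rate for the floating-point ops and that the bdnz issues in parallel at no extra cost. The function name `loop_ipc` and its parameters are made up for this sketch. As Bill points out in the thread, it is not established that the two-per-cycle limit applies to the scalar xsadddp at all.

```python
from fractions import Fraction


def loop_ipc(fp_ops_per_iter, fp_issue_per_cycle, branches_per_iter=1):
    """Return (fp_ipc, total_ipc) for a loop bounded only by FP issue rate.

    Assumption for this sketch: the branch overlaps with the FP ops and
    adds no cycles of its own, so one iteration takes exactly
    fp_ops_per_iter / fp_issue_per_cycle cycles.
    """
    cycles_per_iter = Fraction(fp_ops_per_iter, fp_issue_per_cycle)
    fp_ipc = Fraction(fp_ops_per_iter) / cycles_per_iter
    total_ipc = Fraction(fp_ops_per_iter + branches_per_iter) / cycles_per_iter
    return fp_ipc, total_ipc


if __name__ == "__main__":
    # 12 FP ops plus one bdnz per iteration, as in the loop from the thread.
    for issue_rate in (2, 4):
        fp_ipc, total_ipc = loop_ipc(12, issue_rate)
        print(f"issue limit {issue_rate}/clk: "
              f"FP ops/clk = {float(fp_ipc):.2f}, "
              f"all instr/clk = {float(total_ipc):.2f}")
```

With an issue limit of 2, this model gives 2 FP ops per cycle and 13/6 (about 2.17) instructions per cycle once the branch is counted. The figures measured in the thread (2.25 for xsadddp, 1.72 for xvadddp) are not reproduced exactly by such a simple model.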