[Linuxppc-users] xsadddp throughput on Power9
Nicolas König
koenigni at student.ethz.ch
Fri Mar 8 10:28:15 AEDT 2019
On 3/8/19 12:10 AM, Nicolas König wrote:
>
>
> On 3/7/19 9:16 PM, Bill Schmidt wrote:
>> On 3/7/19 1:24 PM, Bill Schmidt wrote:
>>> Hi Nicolas,
>>>
>>> On 3/6/19 4:35 AM, Nicolas Koenig wrote:
>>>> Hello world,
>>>>
>>>> After asking this question on another mailing list, I was redirected
>>>> to this list. I hope someone on here will be able to help me :)
>>>>
>>>> While running a few benchmarks, I noticed that the following code
>>>> (with SMT disabled) only manages about 2.25 xsadddp instr/clk
>>>> (measured via pmc6) instead of the expected 4:
>>>>
>>>> loop:
>>>> .rept 12
>>>> xsadddp %vs2, %vs1, %vs1
>>>> .endr
>>>> bdnz loop
>>>>
>>>> From what I can gather, the bottleneck shouldn't be the history
>>>> buffers. Since there are no long latency operations, FIN->COMP
>>>> shouldn't take more than 12 cycles (the size of the secondary HB for
>>>> FPSCR, the smallest relevant one). The primary HB and the issue
>>>> queue shouldn't overflow either, since xsadddp takes 7 cycles from
>>>> issue to finish and they can accommodate 20 and 13 entries
>>>> respectively, with one instruction only using one of each. It doesn't
>>>> stall on writeback ports either, because there are only 4 results in
>>>> any one clock and 4 writeback ports (the decrement of the bdnz
>>>> instruction is handled in the branch slice without involving the
>>>> writeback network).
>>>>
>>>> Has anyone here any idea where the bottleneck might be?
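As a rough cross-check of the figures quoted above, here is a minimal Python sketch of the arithmetic. It only restates numbers from the post (12 xsadddp per iteration, 7 cycles from issue to finish, 4 expected and 2.25 measured per clock). The split across 4 slices and the helper names are assumptions made for illustration, not something the thread confirms.

```python
# Back-of-the-envelope numbers for the loop above (12 xsadddp + 1 bdnz).
XSADDDP_PER_ITER = 12   # from the .rept 12 block
EXPECTED_RATE = 4.0     # xsadddp per clock the poster expected
MEASURED_RATE = 2.25    # xsadddp per clock reported via pmc6
ISSUE_TO_FINISH = 7     # cycles from issue to finish, as stated in the post
SLICES = 4              # assumption: work spreads evenly over 4 slices


def cycles_per_iteration(instr_per_iter, instr_per_clk):
    """Clocks needed for one loop iteration at a given sustained rate."""
    return instr_per_iter / instr_per_clk


def in_flight(instr_per_clk, latency_cycles):
    """Little's law: average occupancy = throughput * latency."""
    return instr_per_clk * latency_cycles


expected_cycles = cycles_per_iteration(XSADDDP_PER_ITER, EXPECTED_RATE)
measured_cycles = cycles_per_iteration(XSADDDP_PER_ITER, MEASURED_RATE)
per_slice_in_flight = in_flight(EXPECTED_RATE / SLICES, ISSUE_TO_FINISH)

print(f"expected cycles/iteration: {expected_cycles:.2f}")
print(f"measured cycles/iteration: {measured_cycles:.2f}")
print(f"in flight per slice at the expected rate: {per_slice_in_flight:.0f}")
```

Under these assumptions each slice would hold about 7 instructions between issue and finish, which is below both the 13 and the 20 entries mentioned above, so the sketch agrees with the reasoning in the post rather than explaining the shortfall.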
>>> Donald Stence was kind enough to answer this question for me. Here
>>> is his note, which indicates this is actually performing better than
>>> you think!
>>>
>>> Hi Bill,
>>> P9's design has it combine 64-bit execution units from
>>> two slices for processing a single 128-bit op.
>>> Therefore, it can only issue two 128-bit ops per cycle,
>>> a theoretical max.
>>
>> Hrm, it is pointed out to me that this is xsadddp, not xvadddp, so I
>> don't think we have an answer yet.
>>
>
> Hi Bill,
>
> Thanks for looking into this :) I ran a test with xvadddp as well; it
> yielded 1.72 xvadddp instr/cycle, while running it with clzd and addi
s/clzd/cntlzd
(I chose these two instructions because addi goes to the ALU pipe and
cntlzd goes to the ALU2 pipe)
> both resulted in the expected 4 non-branch instructions/clk (3.85 and
> 3.95 to be precise).
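To put the measured rates side by side, here is a small Python sketch that expresses each one as a fraction of the expected 4 instructions per clock. The rates are the ones reported in this thread. Pairing 3.85 with cntlzd and 3.95 with addi assumes the values were listed in the same order as the instructions, which the message does not state explicitly.

```python
PEAK_PER_CLK = 4.0  # expected non-branch instructions per clock

# Measured rates from the thread, in instructions per clock.
measured = {
    "xsadddp": 2.25,
    "xvadddp": 1.72,
    "cntlzd": 3.85,  # assumption: same order as the instructions were named
    "addi": 3.95,    # assumption: same order as the instructions were named
}


def utilization(rate, peak=PEAK_PER_CLK):
    """Fraction of the expected peak rate that was actually reached."""
    return rate / peak


report = {name: utilization(rate) for name, rate in measured.items()}

for name, frac in report.items():
    print(f"{name:8s} {measured[name]:.2f} instr/clk  {frac:6.1%} of peak")
```

The integer instructions land close to the peak, while both VSX adds reach roughly half of it, which suggests the limit is specific to the floating-point path rather than to dispatch in general.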
>
> I attached the function I used for measuring the throughput of xsadddp.
> I hope that helps a bit.
>
> Nicolas
>
>> Sorry,
>> Bill
>>
>>> Because the Dispatch rate is higher than the Issue rate of 2
>>> xsadddp's per cycle, the Issue Queue slots will become full
>>> within just a few cycles, which results in Dispatch holds
>>> (nothing gets Dispatched for a cycle because there are no
>>> available Issue slots to place more ops into).
>>> The branch overlaps and actually pushes the IPC up from just 2
>>> ops/cycle.
>>> Thanks,
>>> Donald Stence
>>> IBM PSP - P10 Technical Lead
>>>
>>> Cheers,
>>> Bill
>>>>
>>>> Thanks in advance
>>>> Nicolas
>>>> _______________________________________________
>>>> Linuxppc-users mailing list
>>>> Linuxppc-users at lists.ozlabs.org
>>>> https://lists.ozlabs.org/listinfo/linuxppc-users
>>>>