[Linuxppc-users] xsadddp throughput on Power9
Nicolas König
koenigni at student.ethz.ch
Fri Mar 8 10:10:48 AEDT 2019
On 3/7/19 9:16 PM, Bill Schmidt wrote:
> On 3/7/19 1:24 PM, Bill Schmidt wrote:
>> Hi Nicolas,
>>
>> On 3/6/19 4:35 AM, Nicolas Koenig wrote:
>>> Hello world,
>>>
>>> After asking this question on another mailing list, I was redirected
>>> to this list. I hope someone on here will be able to help me :)
>>>
>>> While running a few benchmarks, I noticed that the following code
>>> (with SMT disabled) only manages about 2.25 xsadddp instr/clk
>>> (measured via pmc6) instead of the expected 4:
>>>
>>> loop:
>>> .rept 12
>>> xsadddp %vs2, %vs1, %vs1
>>> .endr
>>> bdnz loop
>>>
>>> From what I can gather, the bottleneck shouldn't be the history
>>> buffers. Since there are no long latency operations, FIN->COMP
>>> shouldn't take more than 12 cycles (the size of the secondary HB for
>>> FPSCR, the smallest relevant one). The primary HB and the issue queue
>>> shouldn't overflow either, since xsadddp takes 7 cycles from issue to
>>> finish and they can accommodate 20 and 13 entries respectively, with one
>>> instruction only using one of each. It doesn't stall on writeback
>>> ports either, because there are only 4 results in any one clock and 4
>>> writeback ports (the decrement of the bdnz instruction is handled in
>>> the branch slice without involving the writeback network).
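One way to read the occupancy argument above is as an application of Little's law: entries in flight = throughput x latency. The sketch below only restates that arithmetic with the figures from the message (7 cycles, 20 and 13 entries). It assumes, which the message does not state, that the capacities are per slice and that each slice sustains one xsadddp per clock; the function and constant names are made up for illustration.

```python
def entries_in_flight(rate_per_clk, latency_clk):
    """Little's law: average occupancy = arrival rate * time spent in flight."""
    return rate_per_clk * latency_clk

# Figures quoted in the message.
LATENCY_CLK = 7            # xsadddp, issue to finish
PRIMARY_HB_ENTRIES = 20
ISSUE_QUEUE_ENTRIES = 13

# Assumption (not from the message): capacities are per slice, 1 instr/clk/slice.
occupancy = entries_in_flight(1, LATENCY_CLK)

print("entries in flight per slice:", occupancy)
print("fits primary HB:", occupancy <= PRIMARY_HB_ENTRIES)
print("fits issue queue:", occupancy <= ISSUE_QUEUE_ENTRIES)
```

Under that per-slice reading, the steady-state occupancy stays below both capacities, which is the conclusion the paragraph draws.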
>>>
>>> Has anyone here any idea where the bottleneck might be?
>> Donald Stence was kind enough to answer this question for me. Here is his note,
>> which indicates this is actually performing better than you think!
>>
>> Hi Bill,
>> P9's design has it combine 64-bit execution units from two slices for processing a single 128-bit op.
>> Therefore, it can only issue two 128-bit ops per cycle, a theoretical max.
>
> Hrm, it is pointed out to me that this is xsadddp, not xvadddp, so I don't think we have an answer yet.
>
Hi Bill,
Thanks for looking into this :) I ran a test with xvadddp as well; it
yielded 1.72 xvadddp instr/cycle, while running the same loop with clzd
and with addi both resulted in the expected 4 non-branch instructions/clk
(3.85 and 3.95 to be precise).
I attached the function I used for measuring the throughput of xsadddp.
I hope that helps a bit.
Nicolas
> Sorry,
> Bill
>
>>
>> Because the Dispatch rate is higher than the Issue rate of 2 xsadddp's per cycle, the Issue Queue
>> slots become full within just a few cycles, which results in Dispatch holds (nothing gets Dispatched
>> for a cycle because there are no available Issue slots to place more ops into).
>>
>> The branch overlaps and actually pushes the IPC up from just 2 ops/cycle.
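The arithmetic behind this remark can be sketched as follows, taking the note's model at face value: the 12 floating-point ops of one loop iteration issue at 2 per cycle, and the bdnz overlaps with them at no extra cost. The helper name and its parameters are hypothetical; only the 12-instruction loop body and the 2-per-cycle limit come from the thread.

```python
from fractions import Fraction

def loop_ipc(fp_ops, fp_issue_per_clk, branches=1):
    """IPC of one loop iteration when only the FP ops limit the cycle count.

    Assumption from the note: the branch overlaps with the FP ops,
    so it adds an instruction but no cycles.
    """
    cycles = Fraction(fp_ops, fp_issue_per_clk)
    return Fraction(fp_ops + branches) / cycles

ipc = loop_ipc(fp_ops=12, fp_issue_per_clk=2)
print(ipc, float(ipc))  # 13 instructions over 6 cycles
```

In this model the overall IPC rises slightly above 2 because of the branch, while the xsadddp rate itself stays capped at 2 per cycle.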
>>
>> Thanks,
>>
>> Donald Stence
>> IBM PSP - P10 Technical Lead
>>
>> Cheers,
>> Bill
>>>
>>> Thanks in advance
>>> Nicolas
>>> _______________________________________________
>>> Linuxppc-users mailing list
>>> Linuxppc-users at lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-users
>>>
-------------- next part --------------
.abiversion 2
#ifndef NUM_INSTR
#define NUM_INSTR 12
#endif
.macro nopalign length
.p2alignl \length, 0x60000000
.endm
.macro .nopalign length
nopalign \length
.endm
.macro double_cast_div vt, ra, rb, via
mtvsrd \vt, \rb
xscvsxddp \vt, \vt
mtvsrd \via, \ra
xscvsxddp \via, \via
xsdivdp \vt, \vt, \via
.endm
.macro init_vsx_r_r v, r
mtvsrd \v, \r
xscvsxddp \v, \v
xxspltd \v, \v, 0
.endm
.macro function name, globl=0
.if \globl
.globl \name
.endif
.type \name, @function
.align 4
\name:
.endm
.macro .function name, globl=0
function \name, \globl
.endm
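/* through_xsadddp: %r3 = number of loop iterations; result is left in %vs1. */
.function through_xsadddp, 1
mtctr %r3
li %r4, 1
/* %vs1 = 1.0 in both doublewords, used as the xsadddp operand. */
init_vsx_r_r %vs1, %r4
/* Read SPR 776 (the pmc6 counter mentioned in the message) before the loop. */
mfspr %r5, 776
.nopalign 4
xsadddp_loop:
/* NUM_INSTR independent adds: same sources, result discarded in %vs2. */
.rept NUM_INSTR
xsadddp %vs2, %vs1, %vs1
.endr
bdnz xsadddp_loop
.nopalign 4
/* Read the counter again; %r5 = counter after - counter before. */
mfspr %r6, 776
sub %r5, %r6, %r5
/* %r3 = iterations * NUM_INSTR = total xsadddp executed. */
mulli %r3, %r3, NUM_INSTR
/* %vs1 = (double) total instructions / (double) counter delta. */
double_cast_div %vs1, %r5, %r3, %vs2
blr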
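
The final arithmetic of the routine above (total instructions divided by the counter delta) can be mirrored in a few lines of Python, for checking numbers by hand. This is only a sketch of the calculation, not of the measurement itself: the function name and the sample counter values are made up, and it assumes the counter does not wrap during the run.

```python
NUM_INSTR = 12  # same default as the attachment

def instr_per_count(iterations, counter_before, counter_after, num_instr=NUM_INSTR):
    """Mirror of the routine's tail: (iterations * num_instr) / counter delta."""
    total_instr = iterations * num_instr        # mulli %r3, %r3, NUM_INSTR
    delta = counter_after - counter_before      # sub %r5, %r6, %r5
    return total_instr / delta                  # double_cast_div

# Hypothetical counter readings, chosen so that the ratio is easy to verify.
print(instr_per_count(1000, 500, 3500))
```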