[Linuxppc-users] xsadddp throughput on Power9

Fri Mar 8 10:10:48 AEDT 2019

On 3/7/19 9:16 PM, Bill Schmidt wrote:
> On 3/7/19 1:24 PM, Bill Schmidt wrote:
>> Hi Nicolas,
>>
>> On 3/6/19 4:35 AM, Nicolas Koenig wrote:
>>> Hello world,
>>>
>>> After asking this question on another mailing list, I was redirected 
>>> to this list. I hope someone on here will be able to help me :)
>>>
>>> While running a few benchmarks, I noticed that the following code 
>>> (with SMT disabled) only manages about 2.25 xsadddp instr/clk 
>>> (measured via pmc6) instead of the expected 4:
>>>
>>> loop:
>>>     .rept 12
>>>         xsadddp %vs2, %vs1, %vs1
>>>     .endr
>>>     bdnz loop
>>>
>>> From what I can gather, the bottleneck shouldn't be the history 
>>> buffers. Since there are no long latency operations, FIN->COMP 
>>> shouldn't take more than 12 cycles (the size of the secondary HB for 
>>> FPSCR, the smallest relevant one). The primary HB and the issue queue 
>>> shouldn't overflow either, since xsadddp takes 7 cycles from issue to 
>>> finish and they can accommodate 20 and 13 entries respectively with one 
>>> instruction only using one of each. It doesn't stall on writeback 
>>> ports either, because there are only 4 results in any one clock and 4 
>>> writeback ports (the decrement of the bdnz instruction is handled in 
>>> the branch slice without involving the writeback network).
>>>
>>> Has anyone here any idea where the bottleneck might be?
>> Donald Stence was kind enough to answer this question for me.  Here is his note,
>> which indicates this is actually performing better than you think!
>>
>> Hi Bill,
>>      P9's design has it combine 64-bit execution units from two slices for processing a single 128-bit op.
>>      Therefore, it can only issue two 128-bit ops per cycle, a theoretical max.
> 
> Hrm, it is pointed out to me that this is xsadddp, not xvadddp, so I don't think we have an answer yet.
> 

Hi Bill,

Thanks for looking into this :) I ran a test with xvadddp as well; it 
yielded 1.72 xvadddp instr/cycle, while running it with clzd and addi 
both resulted in the expected 4 non-branch instructions/clk (3.85 and 
3.95 to be precise).

I attached the function I used for measuring the throughput of xsadddp. 
I hope that helps a bit.

     Nicolas

> Sorry,
> Bill
> 
>>   
>>      Because the Dispatch rate is higher than the Issue rate of 2 xsadddp's per cycle, the Issue Queue
>>      slots become full within just a few cycles, which results in Dispatch holds (nothing gets Dispatched
>>      for a cycle because there are no available Issue slots to place more ops into).
>>   
>>      The branch overlaps and actually pushes the IPC up from just 2 ops/cycle.
>>   
>>      Thanks,
>>   
>> Donald Stence
>> IBM PSP - P10 Technical Lead
>>
>> Cheers,
>> Bill
>>>
>>> Thanks in advance
>>>     Nicolas
>>> _______________________________________________
>>> Linuxppc-users mailing list
>>> Linuxppc-users at lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-users
>>>
-------------- next part --------------
.abiversion 2

#ifndef NUM_INSTR
#define NUM_INSTR 12
#endif

.macro nopalign length
  .p2alignl \length, 0x60000000
.endm

.macro .nopalign length
  nopalign \length
.endm

.macro double_cast_div vt, ra, rb, via
	mtvsrd \vt, \rb
	xscvsxddp \vt, \vt
	mtvsrd \via, \ra
	xscvsxddp \via, \via
	xsdivdp \vt, \vt, \via
.endm

.macro init_vsx_r_r v, r
	mtvsrd \v, \r
	xscvsxddp \v, \v
	xxspltd \v, \v, 0
.endm

.macro function name, globl=0
	.if \globl
  		.globl \name
 	.endif
	.type \name, @function
	.align 4
	\name:
.endm

.macro .function name, globl=0
  function \name, \globl
.endm

.function through_xsadddp, 1
  mtctr %r3 
  li %r4, 1
  init_vsx_r_r %vs1, %r4
  mfspr %r5, 776
.nopalign 4
xsadddp_loop:
.rept NUM_INSTR
  xsadddp %vs2, %vs1, %vs1
.endr
  bdnz xsadddp_loop
.nopalign 4
  mfspr %r6, 776
  sub %r5, %r6, %r5
  mulli %r3, %r3, NUM_INSTR
  double_cast_div %vs1, %r5, %r3, %vs2  
  blr
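
For readers less familiar with the VSX conversion macros, the value that through_xsadddp leaves in %vs1 can be restated in C roughly as below. This is only a sketch of the arithmetic in the routine's tail: the function name and parameter names are made up for illustration, and it assumes SPR 776 (PMC6) is counting cycles, as the post implies.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the tail of through_xsadddp.
 *   iterations : loop trip count originally passed in r3
 *   num_instr  : NUM_INSTR, the number of xsadddp per loop body
 *   start, end : the two "mfspr ..., 776" readings (r5 and r6)
 * Assumption: the counter behind SPR 776 counts cycles. */
double instr_per_count(int64_t iterations, int64_t num_instr,
                       uint64_t start, uint64_t end)
{
    int64_t delta = (int64_t)(end - start);  /* sub   %r5, %r6, %r5        */
    int64_t total = iterations * num_instr;  /* mulli %r3, %r3, NUM_INSTR  */

    /* double_cast_div: convert both integers to double, then divide
     * the instruction count by the counter delta. */
    return (double)total / (double)delta;
}
```

With made-up readings, instr_per_count(1000, 12, 500, 4500) evaluates to 3.0. Note that the numerator counts only the NUM_INSTR instructions in the .rept body, so the bdnz is not included in the reported rate.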