[Linuxppc-users] 256 bit aligned variables

Wed Jun 28 00:08:05 AEST 2017

Hi,

Thanks for your answers.
I have some new questions below.

On 23/06/2017 03:07, Steve Munroe wrote:
>
> "Linuxppc-users" 
> <linuxppc-users-bounces+sjmunroe=us.ibm.com at lists.ozlabs.org> wrote on 
> 06/22/2017 01:19:15 AM:
>
> > From: Frederic DUMOULIN <frederic.dumoulin at mipsology.com>
> > To: linuxppc-users at lists.ozlabs.org
> > Date: 06/22/2017 05:02 PM
> > Subject: [Linuxppc-users] 256 bit aligned variables
> > Sent by: "Linuxppc-users" <linuxppc-users-bounces
> > +sjmunroe=us.ibm.com at lists.ozlabs.org>
> >
> > Hi,
> >
> > I'd like to manage 256 bit aligned variables.
> > So I use gcc's vector_size attribute like in the following example:
> >
> PowerISA does not have 256-bit vectors, yet.
>
> It does have 128-bit vectors, with lots of vector registers (64 of 
> them) and multiple vector units executing in parallel.
>
> POWER8
> - 4 load units, 2 store units (these are paired for vector)
> - Two symmetric fixed-point units (FXU)
> - Four floating-point units (FPU), implemented as two 2-way SIMD
>   operations for double- and single-precision
> - Two VMX execution units capable of executing simple FX, permute,
>   complex FX, and 4-way SIMD single-precision floating-point operations
> - One Crypto unit
>
> Two 16-byte loads and one 16-byte store operation are supported for 
> VMX and VSX operations per cycle.
> And
> Two Vector Double operations per cycle (4 X double FMAs per cycle)
> And
> Two Vector float / permute / Integer / Logic operations per cycle.
>
> Think RISC not CISC.
>
> > #include <stdint.h>
> >
> > typedef long long  uint256_t __attribute__ ((vector_size (8 * sizeof (uint32_t))));
> >
> > int main (void)
> > {
> >      volatile uint256_t inVar256, outVar256;
> >      outVar256 = inVar256;
> >      return 0;
> > }
> >
>
> GCC has native support for vector_size (16). Larger multiples are 
> handled as arrays of vector_size(16).
>
> As such there is no real advantage for alignment greater than 16-bytes.
>
We do not need the 256-bit alignment for performance; we use it to 
simplify address decoding in the FPGA that manages the local bus on our 
PCIe board.

So, if I understood correctly, a 256-bit access (1 address + 8 words) 
cannot be guaranteed on ppc, but 128-bit accesses (1 address + 4 words) 
are possible. Is that correct?
Will that be the case if I manipulate uint128_t variables declared with:
typedef long long  uint128_t __attribute__ ((vector_size (4 * sizeof (uint32_t))));
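
To make the question concrete, here is a minimal sketch, assuming GCC's vector extensions. It shows that a vector_size (16) type is 16 bytes wide and 16-byte aligned, and that 32-byte (256-bit) address alignment can be requested separately with the aligned attribute. The names u32x4, block256_t, is_aligned and copy_block256 are hypothetical and do not come from this thread. The sketch only describes memory layout; whether each 16-byte store reaches the PCIe bus as a single transaction depends on the hardware and is not something the compiler guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-bit vector type: four 32-bit lanes (GCC vector extension). */
typedef uint32_t u32x4 __attribute__ ((vector_size (4 * sizeof (uint32_t))));

/* Hypothetical 256-bit block: two 128-bit halves, forced to 32-byte alignment. */
typedef struct {
    u32x4 half[2];
} __attribute__ ((aligned (32))) block256_t;

/* Layout checks evaluated at compile time. */
_Static_assert (sizeof (u32x4) == 16, "u32x4 is 128 bits wide");
_Static_assert (_Alignof (u32x4) == 16, "u32x4 is 16-byte aligned");
_Static_assert (sizeof (block256_t) == 32, "block256_t is 256 bits wide");
_Static_assert (_Alignof (block256_t) == 32, "block256_t is 32-byte aligned");

/* Returns 1 if p is aligned to 'align' bytes (align must be a power of two). */
static int is_aligned (const volatile void *p, size_t align)
{
    return ((uintptr_t) p & (align - 1)) == 0;
}

/* Copies one 256-bit block as two separate 128-bit vector assignments. */
static void copy_block256 (volatile block256_t *dst,
                           const volatile block256_t *src)
{
    dst->half[0] = src->half[0];
    dst->half[1] = src->half[1];
}
```

Declaring a variable as block256_t places it on a 32-byte boundary, and copy_block256 () moves it as two 16-byte halves, which matches how the compiler split the 256-bit vector in the disassembly quoted below.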
>
>
> Also with the inherent instruction parallelism of the POWER8, you are 
> already getting effective 256-bits of vector operations per cycle.
>
>
> > On x86_64, I use -mavx so gcc generates AVX instructions.
> > Disassembling the object file, I've got:
> >      outVar256 = inVar256;
> >    13:    c5 fd 6f 45 b0           vmovdqa -0x50(%rbp),%ymm0
> >    18:    c5 fd 7f 45 d0           vmovdqa %ymm0,-0x30(%rbp)
> >
> > On ppc (cross-compiling with: /opt/at10.0/bin/powerpc64le-linux-gnu-gcc
> > -I. -g -O -mcpu=powerpc64le -mvsx -c main.c -o main.o), I've got:
> >      outVar256 = inVar256;
> >    1c:    40 00 40 39     li      r10,64
> >    20:    98 56 89 7d     lxvd2x  vs12,r9,r10
> >    24:    10 00 40 39     li      r10,16
> >    28:    50 00 00 39     li      r8,80
> >    2c:    98 46 09 7c     lxvd2x  vs0,r9,r8
> >    30:    98 4f 80 7d     stxvd2x vs12,0,r9
> >    34:    98 57 09 7c     stxvd2x vs0,r9,r10
> >    38:    98 4e 80 7d     lxvd2x  vs12,0,r9
> >    3c:    98 56 09 7c     lxvd2x  vs0,r9,r10
> >    40:    20 00 40 39     li      r10,32
> >    44:    98 57 89 7d     stxvd2x vs12,r9,r10
> >    48:    30 00 40 39     li      r10,48
> >    4c:    98 57 09 7c     stxvd2x vs0,r9,r10
> >
>
> Again think RISC not CISC. The compiler broke the 256-bit vector into 
> pairs of 128-bit vector operations.
>
> > I'm new on ppc and not familiar with ppc instructions but it doesn't
> > seem to be SIMD instructions.
>
> Those (lxvd2x/stxvd2x) are 128-bit SIMD load/store operations.
>
> > How can I generate 256 bit aligned instructions ?
> >
> You don't need more than 128-bit alignment and you will get effective 
> 256-bit vector operation without this.
>
> > For more details, the goal is to drive a PCIe board and to generate 256
> > bit aligned memcpy from host memory to PCIe bus.
> >
> You should look at the GLIBC memcpy_power8 implementation for example 
> of how to maximize memory bandwidth
>
For performance, I use write-combining: I remap the PCIe address space 
with ioremap_wc () in my kernel device driver.
Is there a similar mechanism on ppc?

Thanks,
Fred

>
> Steven J. Munroe
> Linux on Power Toolchain Architect
> IBM Corporation, Linux Technology Center
>
