[Linuxppc-users] 256 bit aligned variables

Wed Jun 28 01:21:08 AEST 2017

Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center

"Linuxppc-users" <linuxppc-users-bounces
+sjmunroe=us.ibm.com at lists.ozlabs.org> wrote on 06/27/2017 09:08:05 AM:

> From: Frederic DUMOULIN <frederic.dumoulin at mipsology.com>
> To: linuxppc-users at lists.ozlabs.org
> Date: 06/27/2017 09:10 AM
> Subject: Re: [Linuxppc-users] 256 bit aligned variables
> Sent by: "Linuxppc-users" <linuxppc-users-bounces
> +sjmunroe=us.ibm.com at lists.ozlabs.org>
>
> Hi,
>
> Thanks for your answers.
> I've got new questions below.

> On 23/06/2017 03:07, Steve Munroe wrote:
> "Linuxppc-users" <linuxppc-users-bounces
> +sjmunroe=us.ibm.com at lists.ozlabs.org>
> wrote on 06/22/2017 01:19:15 AM:
>
> > From: Frederic DUMOULIN <frederic.dumoulin at mipsology.com>
> > To: linuxppc-users at lists.ozlabs.org
> > Date: 06/22/2017 05:02 PM
> > Subject: [Linuxppc-users] 256 bit aligned variables
> > Sent by: "Linuxppc-users" <linuxppc-users-bounces
> > +sjmunroe=us.ibm.com at lists.ozlabs.org>
> >
> > Hi,
> >
> > I'd like to manage 256 bit aligned variables.
> > So I use gcc's vector_size attribute like in the following example:
> >
> PowerISA does not have 256-bit vectors, yet.
>
> It does have 128-bit vectors, with lots of vector registers (64 of
> them) and multiple vector units executing in parallel.
>
> POWER8
> - 4 load units, 2 store units (these are paired for vector)
> - Two symmetric fixed-point units (FXU)
> - Four floating-point units (FPU), implemented as two 2-way SIMD
>   operations for double- and single-precision
> - Two VMX execution units capable of executing simple FX, permute,
>   complex FX, and 4-way SIMD single-precision floating-point operations
> - One Crypto unit
>
> Two 16-byte loads and one 16-byte store operation are supported for
> VMX and VSX operations per cycle.
> And
> Two Vector Double operations per cycle (4 X double FMAs per cycle)
> And
> Two Vector float / permute / Integer / Logic operations per cycle.
>
> Think RISC not CISC.
>
> > #include <stdint.h>
> >
> > typedef long long  uint256_t __attribute__ ((vector_size (8 * sizeof
> > (uint32_t))));
> > int main (void)
> > {
> >      volatile uint256_t inVar256, outVar256;
> >      outVar256 = inVar256;
> > }
> >
>
> GCC has native support for vector_size (16). Larger multiples are
> handled as arrays of vector_size(16).
>
> As such there is no real advantage to alignment greater than 16 bytes.
> The 256 bit alignment is not used for performance reasons; we use it
> to simplify the address decoding in the FPGA in charge of the
> management of the local bus on our PCIe board.
>
> So, if I understood correctly, it's not possible to guarantee a 256 bit
> access (1 address + 8 words) on ppc, but it's possible to have 128
> bit accesses (1 address + 4 words). Is that correct?
> Will that be the case if I manipulate uint128_t variables, with
> typedef long long  uint128_t __attribute__ ((vector_size (4 * sizeof
> (uint32_t))));
>
You can use the aligned attribute, as in:

aligned (alignment)
      This attribute specifies a minimum alignment for the variable or
      structure field, measured in bytes. For example, the declaration:

      int x __attribute__ ((aligned (32))) = 0;

This works for static (BSS) storage. I am not sure about automatic (stack)
storage, as the ABI only requires (aligned (16)).
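
As a minimal sketch of the static-storage case (the names dma_buf,
is_aligned and static_buf_is_32_byte_aligned are made up for illustration,
they are not from any library), something like the following could be used
to confirm at run time that the attribute was honored:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-byte (256-bit) buffer in static storage.
   The name is illustrative only. */
static uint32_t dma_buf[8] __attribute__ ((aligned (32)));

/* Returns 1 if ptr is a multiple of `alignment` bytes, else 0.
   `alignment` must be a nonzero power of two. */
static int is_aligned (const void *ptr, uintptr_t alignment)
{
    assert (alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t) ptr & (alignment - 1)) == 0;
}

/* Checks that the static buffer starts on a 32-byte boundary. */
static int static_buf_is_32_byte_aligned (void)
{
    return is_aligned (dma_buf, 32);
}
```

The same is_aligned check could be applied to an automatic variable to see
what a given compiler and ABI actually do on the stack.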

But if you are working in IO-space you are likely to be aligning the pointer
yourself. If the pointer is aligned the accesses will be aligned.
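
A rough sketch of aligning a pointer by hand, assuming the alignment is a
power of two (align_up and align_ptr_up are hypothetical helper names, not
an existing API):

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of `alignment`.
   `alignment` must be a nonzero power of two. */
static uintptr_t align_up (uintptr_t addr, uintptr_t alignment)
{
    assert (alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Pointer flavor of the same rounding. The caller must make sure the
   underlying region has at least (alignment - 1) bytes of slack. */
static void *align_ptr_up (void *ptr, uintptr_t alignment)
{
    return (void *) align_up ((uintptr_t) ptr, alignment);
}
```

For a 256-bit boundary one would pass 32 as the alignment. Whether the bus
then sees a single 32-byte transaction is a separate question that this
does not answer.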

> Also with the inherent instruction parallelism of the POWER8, you
> are already getting effective 256-bits of vector operations per cycle.
>
Yes, multiple vector instructions can be dispatched and issued per cycle,
but the details of read/write ports at the various cache levels are
specific to the processor.

And also cache-coherent and cache-inhibited storage behave differently.

Best to read the relevant sections of the "POWER8 Processor User’s Manual
for the Single-Chip Module" yourself.

The information portal: https://www-355.ibm.com/systems/power/openpower/

Under power8:
https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=POWER8

and

Power8 with NVidia:
https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=POWER8_with_NVIDIA_NVLink

There are similar links for POWER9 and the PowerISA under this portal.
>
> > On x86_64, I use -mavx so gcc generates AVX instructions.
> > Disassembling the object file, I've got:
> >      outVar256 = inVar256;
> >    13:    c5 fd 6f 45 b0           vmovdqa -0x50(%rbp),%ymm0
> >    18:    c5 fd 7f 45 d0           vmovdqa %ymm0,-0x30(%rbp)
> >
> > On ppc (cross-compiling with: /opt/at10.0/bin/powerpc64le-linux-gnu-gcc
> > -I. -g -O -mcpu=powerpc64le -mvsx -c main.c -o main.o), I've got:
> >      outVar256 = inVar256;
> >    1c:    40 00 40 39     li      r10,64
> >    20:    98 56 89 7d     lxvd2x  vs12,r9,r10
> >    24:    10 00 40 39     li      r10,16
> >    28:    50 00 00 39     li      r8,80
> >    2c:    98 46 09 7c     lxvd2x  vs0,r9,r8
> >    30:    98 4f 80 7d     stxvd2x vs12,0,r9
> >    34:    98 57 09 7c     stxvd2x vs0,r9,r10
> >    38:    98 4e 80 7d     lxvd2x  vs12,0,r9
> >    3c:    98 56 09 7c     lxvd2x  vs0,r9,r10
> >    40:    20 00 40 39     li      r10,32
> >    44:    98 57 89 7d     stxvd2x vs12,r9,r10
> >    48:    30 00 40 39     li      r10,48
> >    4c:    98 57 09 7c     stxvd2x vs0,r9,r10
> >
>
> Again think RISC not CISC. The compiler broke the 256-bit vector
> into pairs of 128-bit vector operations.
>
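
To illustrate the split described above, here is a sketch that writes the
256-bit copy explicitly as two 128-bit vector copies using the GCC
vector_size (16) extension. The names v2di, pair256_t, copy256 and
copy256_check are made up for this example. On POWER8 with VSX one would
expect each half to become a 128-bit vector load/store, but that is an
assumption to verify in the disassembly, not a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit vector of two 64-bit elements (GCC vector extension). */
typedef long long v2di __attribute__ ((vector_size (16)));

/* Hypothetical 256-bit container built from two 128-bit halves. */
typedef struct { v2di lo, hi; } pair256_t;

/* Copy 256 bits as two 128-bit vector loads followed by two stores. */
static void copy256 (volatile pair256_t *dst, const volatile pair256_t *src)
{
    /* Each half must sit on a 16-byte boundary for a vector access. */
    assert (((uintptr_t) src & 15) == 0 && ((uintptr_t) dst & 15) == 0);
    v2di lo = src->lo;
    v2di hi = src->hi;
    dst->lo = lo;
    dst->hi = hi;
}

/* Small round-trip check: returns 1 if all four elements were copied. */
static int copy256_check (void)
{
    pair256_t in = { { 1, 2 }, { 3, 4 } };
    pair256_t out = { { 0, 0 }, { 0, 0 } };
    copy256 (&out, &in);
    return out.lo[0] == 1 && out.lo[1] == 2
        && out.hi[0] == 3 && out.hi[1] == 4;
}
```
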
> > I'm new to ppc and not familiar with ppc instructions, but these don't
> > seem to be SIMD instructions.
>
> Those (lxvd2x/stxvd2x) are 128-bit SIMD load/store operations.
>
> > How can I generate 256 bit aligned instructions ?
> >
> You don't need more than 128-bit alignment and you will get
> effective 256-bit vector operations without this.
>
> > For more details, the goal is to drive a PCIe board and to generate 256
> > bit aligned memcpy from host memory to PCIe bus.
> >
> You should look at the GLIBC memcpy_power8 implementation for
> an example of how to maximize memory bandwidth.
> For performance, I use write-combining by remapping the PCIe
> address space with ioremap_wc () in my kernel device driver.
> Is there any similar mechanism on ppc?
>
>
> Thanks,
> Fred

>
> Steven J. Munroe
> Linux on Power Toolchain Architect
> IBM Corporation, Linux Technology Center
