[Linuxppc-users] 256 bit aligned variables

Fri Jun 23 11:07:47 AEST 2017

"Linuxppc-users" <linuxppc-users-bounces
+sjmunroe=us.ibm.com at lists.ozlabs.org> wrote on 06/22/2017 01:19:15 AM:

> From: Frederic DUMOULIN <frederic.dumoulin at mipsology.com>
> To: linuxppc-users at lists.ozlabs.org
> Date: 06/22/2017 05:02 PM
> Subject: [Linuxppc-users] 256 bit aligned variables
> Sent by: "Linuxppc-users" <linuxppc-users-bounces
> +sjmunroe=us.ibm.com at lists.ozlabs.org>
>
> Hi,
>
> I'd like to manage 256 bit aligned variables.
> So I use gcc's vector_size attribute like in the following example:
>
PowerISA does not have 256-bit vectors, yet.

It does have 128-bit vectors, with lots of vector registers (64 of them)
and multiple vector units executing in parallel.

POWER8
- 4 load units, 2 store units (these are paired for vector)
- Two symmetric fixed-point units (FXU)
- Four floating-point units (FPU), implemented as two 2-way SIMD operations
  for double- and single-precision
- Two VMX execution units capable of executing simple FX, permute, complex
  FX, and 4-way SIMD single-precision floating-point operations
- One Crypto unit

Two 16-byte loads and one 16-byte store operation are supported for VMX and
VSX operations per cycle.
And
Two Vector Double operations per cycle (4 X double FMAs per cycle)
And
Two Vector float / permute / Integer / Logic operations per cycle.

Think RISC not CISC.

> typedef long long  uint256_t __attribute__ ((vector_size (8 * sizeof
> (uint32_t))));
> int main (void)
> {
>      volatile uint256_t inVar256, outVar256;
>      outVar256 = inVar256;
> }
>

GCC has native support for vector_size (16). Larger multiples are handled
as arrays of vector_size(16).

As such there is no real advantage to alignment greater than 16 bytes.
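
To make that concrete, here is a minimal sketch of a 256-bit value held as
two vector_size (16) halves. The names v2du, u256_pair and copy_u256 are
made up for illustration; they are not from GCC or any library. It only
assumes the GCC vector extension, so it should build on other targets too.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit vector of two 64-bit lanes; on POWER8 this fits one VSX register. */
typedef uint64_t v2du __attribute__ ((vector_size (16)));

/* Hypothetical 256-bit value kept as two 128-bit halves. */
typedef struct {
    v2du lo;
    v2du hi;
} u256_pair;

/* The pair is 32 bytes in size. */
static_assert (sizeof (u256_pair) == 32, "u256_pair should be 32 bytes");

/* Copy a 256-bit value as two independent 128-bit moves. */
static inline void copy_u256 (u256_pair *dst, const u256_pair *src)
{
    dst->lo = src->lo;   /* first 128-bit load/store  */
    dst->hi = src->hi;   /* second 128-bit load/store */
}
```

The two assignments do not depend on each other, so the core is free to
issue them in parallel.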

Also, with the inherent instruction parallelism of the POWER8, you are
already getting effectively 256 bits of vector operations per cycle.

> On x86_64, I use -mavx so gcc generates AVX instructions.
> Disassembling the object file, I've got:
>      outVar256 = inVar256;
>    13:    c5 fd 6f 45 b0           vmovdqa -0x50(%rbp),%ymm0
>    18:    c5 fd 7f 45 d0           vmovdqa %ymm0,-0x30(%rbp)
>
> On ppc (cross-compiling with: /opt/at10.0/bin/powerpc64le-linux-gnu-gcc
> -I. -g -O -mcpu=powerpc64le -mvsx -c main.c -o main.o), I've got:
>      outVar256 = inVar256;
>    1c:    40 00 40 39     li      r10,64
>    20:    98 56 89 7d     lxvd2x  vs12,r9,r10
>    24:    10 00 40 39     li      r10,16
>    28:    50 00 00 39     li      r8,80
>    2c:    98 46 09 7c     lxvd2x  vs0,r9,r8
>    30:    98 4f 80 7d     stxvd2x vs12,0,r9
>    34:    98 57 09 7c     stxvd2x vs0,r9,r10
>    38:    98 4e 80 7d     lxvd2x  vs12,0,r9
>    3c:    98 56 09 7c     lxvd2x  vs0,r9,r10
>    40:    20 00 40 39     li      r10,32
>    44:    98 57 89 7d     stxvd2x vs12,r9,r10
>    48:    30 00 40 39     li      r10,48
>    4c:    98 57 09 7c     stxvd2x vs0,r9,r10
>

Again think RISC not CISC. The compiler broke the 256-bit vector into pairs
of 128-bit vector operations.

> I'm new on ppc and not familiar with ppc instructions but it doesn't
> seem to be SIMD instructions.

Those (lxvd2x/stxvd2x) are 128-bit SIMD load/store operations.

> How can I generate 256 bit aligned instructions ?
>
You don't need more than 128-bit alignment, and you will get effectively
256-bit vector operations without it.
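
If you want to guarantee the 16-byte alignment for a heap buffer, something
along these lines should do. This is only a sketch: is_aligned_16 and
alloc_vec_buffer are hypothetical helper names, and it assumes a C11
library that provides aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if p sits on a 16-byte (128-bit) boundary. */
static inline int is_aligned_16 (const void *p)
{
    return ((uintptr_t) p & 15u) == 0;
}

/* Allocate at least nbytes, 16-byte aligned. The size is rounded up to a
   multiple of 16 because aligned_alloc may require that. Release the
   buffer with free(). */
void *alloc_vec_buffer (size_t nbytes)
{
    assert (nbytes <= SIZE_MAX - 15u);          /* rounding must not overflow */
    size_t rounded = (nbytes + 15u) & ~(size_t) 15u;
    return aligned_alloc (16, rounded);
}
```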

> For more details, the goal is to drive a PCIe board and to generate 256
> bit aligned memcpy from host memory to PCIe bus.
>
You should look at the GLIBC memcpy_power8 implementation for an example of
how to maximize memory bandwidth.
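
As a rough illustration of the general idea only (this is NOT the GLIBC
code, and it is not tuned), a copy loop can move 32 bytes per iteration as
two independent 16-byte vector loads followed by two stores. The names
v2du_u and copy_32b_blocks are invented for this sketch, and the source and
destination must not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-byte vector type that may alias any object and needs no alignment,
   so it can be used for loads/stores at arbitrary addresses. */
typedef uint64_t v2du_u
    __attribute__ ((vector_size (16), aligned (1), may_alias));

/* Copy n bytes from src to dst (non-overlapping), 32 bytes per iteration. */
void copy_32b_blocks (void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert (n == 0 || (dst != NULL && src != NULL));

    while (n >= 32) {
        /* Two independent 128-bit loads ... */
        v2du_u a = *(const v2du_u *) s;
        v2du_u b = *(const v2du_u *) (s + 16);
        /* ... then two 128-bit stores. */
        *(v2du_u *) d = a;
        *(v2du_u *) (d + 16) = b;
        s += 32;
        d += 32;
        n -= 32;
    }

    /* Byte-wise tail for the remaining 0..31 bytes. */
    while (n--)
        *d++ = *s++;
}
```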

Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center