[Linuxppc-users] 256 bit aligned variables
Steve Munroe
sjmunroe at us.ibm.com
Fri Jun 23 11:07:47 AEST 2017
"Linuxppc-users" <linuxppc-users-bounces
+sjmunroe=us.ibm.com at lists.ozlabs.org> wrote on 06/22/2017 01:19:15 AM:
> From: Frederic DUMOULIN <frederic.dumoulin at mipsology.com>
> To: linuxppc-users at lists.ozlabs.org
> Date: 06/22/2017 05:02 PM
> Subject: [Linuxppc-users] 256 bit aligned variables
> Sent by: "Linuxppc-users" <linuxppc-users-bounces
> +sjmunroe=us.ibm.com at lists.ozlabs.org>
>
> Hi,
>
> I'd like to manage 256 bit aligned variables.
> So I use gcc's vector_size attribute like in the following example:
>
PowerISA does not have 256-bit vectors, yet.
It does have 128-bit vectors, with lots of vector registers (64 of them)
and multiple vector units executing in parallel.
POWER8
- 4 load units, 2 store units (these are paired for vector)
- Two symmetric fixed-point units (FXU)
- Four floating-point units (FPU), implemented as two 2-way SIMD operations
  for double- and single-precision
- Two VMX execution units capable of executing simple FX, permute, complex
  FX, and 4-way SIMD single-precision floating-point operations
- One Crypto unit
Two 16-byte loads and one 16-byte store operation are supported for VMX and
VSX operations per cycle.
And
Two Vector Double operations per cycle (4 X double FMAs per cycle)
And
Two Vector float / permute / Integer / Logic operations per cycle.
Think RISC not CISC.
> typedef long long uint256_t __attribute__ ((vector_size (8 * sizeof
> (uint32_t))));
> int main (void)
> {
> volatile uint256_t inVar256, outVar256;
> outVar256 = inVar256;
> }
>
GCC has native support for vector_size (16). Larger multiples are handled
as arrays of vector_size(16).
As such, there is no real advantage to alignment greater than 16 bytes.
Also with the inherent instruction parallelism of the POWER8, you are
already getting effective 256-bits of vector operations per cycle.
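
To illustrate the point about larger sizes being handled as arrays of
vector_size(16), here is a minimal sketch that spells the 256-bit value out
explicitly as two 128-bit halves. The names v2du, u256_pair_t and u256_copy
are made up for this example; only the vector_size attribute comes from the
discussion above. The type is 32 bytes wide but only needs 16-byte
alignment, and the copy is simply two 128-bit moves.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One native 128-bit vector of two 64-bit lanes (GCC vector extension). */
typedef uint64_t v2du __attribute__((vector_size(16)));

/* Hypothetical 256-bit container: two 128-bit halves, 16-byte aligned. */
typedef struct {
    v2du lo;
    v2du hi;
} u256_pair_t;

/* Copy 256 bits as two independent 128-bit vector moves. */
static void u256_copy(u256_pair_t *dst, const u256_pair_t *src)
{
    assert(dst != NULL && src != NULL);
    dst->lo = src->lo;
    dst->hi = src->hi;
}
```

On a VSX target the compiler would be expected to turn each half into one
vector load and one vector store, much like the lxvd2x/stxvd2x pairs in the
disassembly quoted below, though the exact instructions depend on the
compiler version and options.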
> On x86_64, I use -mavx so gcc generates AVX instructions.
> Disassembling the object file, I've got:
> outVar256 = inVar256;
> 13: c5 fd 6f 45 b0 vmovdqa -0x50(%rbp),%ymm0
> 18: c5 fd 7f 45 d0 vmovdqa %ymm0,-0x30(%rbp)
>
> On ppc (cross-compiling with: /opt/at10.0/bin/powerpc64le-linux-gnu-gcc
> -I. -g -O -mcpu=powerpc64le -mvsx -c main.c -o main.o), I've got:
> outVar256 = inVar256;
> 1c: 40 00 40 39 li r10,64
> 20: 98 56 89 7d lxvd2x vs12,r9,r10
> 24: 10 00 40 39 li r10,16
> 28: 50 00 00 39 li r8,80
> 2c: 98 46 09 7c lxvd2x vs0,r9,r8
> 30: 98 4f 80 7d stxvd2x vs12,0,r9
> 34: 98 57 09 7c stxvd2x vs0,r9,r10
> 38: 98 4e 80 7d lxvd2x vs12,0,r9
> 3c: 98 56 09 7c lxvd2x vs0,r9,r10
> 40: 20 00 40 39 li r10,32
> 44: 98 57 89 7d stxvd2x vs12,r9,r10
> 48: 30 00 40 39 li r10,48
> 4c: 98 57 09 7c stxvd2x vs0,r9,r10
>
Again think RISC not CISC. The compiler broke the 256-bit vector into pairs
of 128-bit vector operations.
> I'm new on ppc and not familiar with ppc instructions but it doesn't
> seem to be SIMD instructions.
Those (lxvd2x/stxvd2x) are 128-bit SIMD load/store operations.
> How can I generate 256 bit aligned instructions ?
>
You don't need more than 128-bit alignment, and you will get effective
256-bit vector operation without it.
> For more details, the goal is to drive a PCIe board and to generate 256
> bit aligned memcpy from host memory to PCIe bus.
>
You should look at the GLIBC memcpy_power8 implementation for an example of
how to maximize memory bandwidth.
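
As a rough sketch only (this is not the GLIBC code, and copy_32b_blocks is
a hypothetical helper name), a copy loop that moves 32 bytes per iteration
as two 128-bit vector loads followed by two 128-bit vector stores could look
like the following. It assumes both pointers are 16-byte aligned and the
length is a multiple of 32; a real memcpy also has to handle unaligned heads
and tails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 128-bit vector type; may_alias lets it overlay arbitrary buffers. */
typedef uint64_t v2du_any __attribute__((vector_size(16), may_alias));

/* Hypothetical helper: copy n bytes in 32-byte blocks.
 * Assumptions: dst and src are 16-byte aligned, do not overlap,
 * and n is a multiple of 32. */
static void copy_32b_blocks(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert((n % 32) == 0);

    v2du_any *d = (v2du_any *)dst;
    const v2du_any *s = (const v2du_any *)src;

    for (size_t i = 0; i < n / 16; i += 2) {
        /* Issue both loads first so they can proceed in parallel. */
        v2du_any lo = s[i];
        v2du_any hi = s[i + 1];
        d[i] = lo;
        d[i + 1] = hi;
    }
}
```

Grouping the two loads ahead of the two stores gives the hardware two
independent 16-byte accesses to issue together, which is the "effective
256-bit" behaviour described above.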
Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center