[Linuxppc-users] 256 bit aligned variables
Steve Munroe
sjmunroe at us.ibm.com
Wed Jun 28 01:21:08 AEST 2017
Steven J. Munroe
Linux on Power Toolchain Architect
IBM Corporation, Linux Technology Center
"Linuxppc-users" <linuxppc-users-bounces
+sjmunroe=us.ibm.com at lists.ozlabs.org> wrote on 06/27/2017 09:08:05 AM:
> From: Frederic DUMOULIN <frederic.dumoulin at mipsology.com>
> To: linuxppc-users at lists.ozlabs.org
> Date: 06/27/2017 09:10 AM
> Subject: Re: [Linuxppc-users] 256 bit aligned variables
> Sent by: "Linuxppc-users" <linuxppc-users-bounces
> +sjmunroe=us.ibm.com at lists.ozlabs.org>
>
> Hi,
>
> Thanks for your answers.
> I've got new questions below.
> On 23/06/2017 03:07, Steve Munroe wrote:
> "Linuxppc-users" <linuxppc-users-bounces
> +sjmunroe=us.ibm.com at lists.ozlabs.org>
> wrote on 06/22/2017 01:19:15 AM:
>
> > From: Frederic DUMOULIN <frederic.dumoulin at mipsology.com>
> > To: linuxppc-users at lists.ozlabs.org
> > Date: 06/22/2017 05:02 PM
> > Subject: [Linuxppc-users] 256 bit aligned variables
> > Sent by: "Linuxppc-users" <linuxppc-users-bounces
> > +sjmunroe=us.ibm.com at lists.ozlabs.org>
> >
> > Hi,
> >
> > I'd like to manage 256 bit aligned variables.
> > So I use gcc's vector_size attribute like in the following example:
> >
> PowerISA does not have 256-bit vectors, yet.
>
> It does have 128-bit vectors, with lots of vector registers (64 of
> them) and multiple vector units executing in parallel.
>
> POWER8
> - 4 load units, 2 store units (these are paired for vector)
> - Two symmetric fixed-point units (FXU)
> - Four floating-point units (FPU), implemented as two 2-way SIMD
>   operations for double- and single-precision
> - Two VMX execution units capable of executing simple FX, permute,
>   complex FX, and 4-way SIMD single-precision floating-point operations
> - One Crypto unit
>
> Two 16-byte loads and one 16-byte store operation are supported for
> VMX and VSX operations per cycle.
> And
> Two Vector Double operations per cycle (4 X double FMAs per cycle)
> And
> Two Vector float / permute / Integer / Logic operations per cycle.
>
> Think RISC not CISC.
>
> > typedef long long uint256_t __attribute__ ((vector_size (8 * sizeof
> > (uint32_t))));
> > int main (void)
> > {
> > volatile uint256_t inVar256, outVar256;
> > outVar256 = inVar256;
> > }
> >
>
> GCC has native support for vector_size (16). Larger multiples are
> handled as arrays of vector_size(16).
>
> As such there is no real advantage to alignment greater than 16 bytes.
>
> The 256-bit alignment is not there for performance; we use it
> to simplify the address decoding in the FPGA in charge of the
> management of the local bus on our PCIe board.
>
> So, if I understood correctly, it's not possible to guarantee a 256-bit
> access (1 address + 8 words) on ppc, but it's possible to have 128-bit
> accesses (1 address + 4 words). Is that correct?
> Will that be the case if I manipulate uint128_t variables, with
> typedef long long uint128_t __attribute__ ((vector_size (4 * sizeof
> (uint32_t))));
>
You can use the aligned attribute, as in:
aligned (alignment)
This attribute specifies a minimum alignment for the variable or
structure field, measured in bytes. For example, the declaration:
int x __attribute__ ((aligned (32))) = 0;
This works for static (BSS) storage. I am not sure about automatic (stack)
storage, as the ABI only requires (aligned (16)).
But if you are working in I/O space you are likely to be aligning the pointer
yourself. If the pointer is aligned, the accesses will be aligned.
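
To make the two cases above concrete, here is a minimal sketch in C, assuming GCC and a 32-byte (256-bit) boundary. The names static_buf, aligned_static_buf, is_aligned_32 and align_up_32 are made up for illustration and are not from any library. It only shows how to request and check alignment; it says nothing about how the resulting stores are grouped on the bus.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_ALIGN 32u   /* 256 bits, the boundary assumed in this sketch */

/* Static storage: request 32-byte alignment with the aligned attribute. */
static uint32_t static_buf[8] __attribute__ ((aligned (32)));

/* Return nonzero if the address is a multiple of 32 bytes. */
static int is_aligned_32 (const volatile void *p)
{
  return ((uintptr_t) p & (BUS_ALIGN - 1u)) == 0;
}

/* Round a pointer up to the next 32-byte boundary (no change if it is
   already aligned). The caller must own at least 31 spare bytes past p. */
static void *align_up_32 (void *p)
{
  uintptr_t addr = (uintptr_t) p;
  uintptr_t mask = ~(uintptr_t) (BUS_ALIGN - 1u);
  return (void *) ((addr + (BUS_ALIGN - 1u)) & mask);
}

/* Hand out the static buffer, checking the alignment we asked for. */
static uint32_t *aligned_static_buf (void)
{
  assert (is_aligned_32 (static_buf));
  return static_buf;
}
```

For a buffer you do not declare yourself (heap, or a mapped window), over-allocate by 31 bytes and pass the raw pointer through align_up_32 before using it. A mapping that starts on a page boundary already satisfies the 32-byte requirement at its base.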
> Also with the inherent instruction parallelism of the POWER8, you
> are already getting effective 256-bits of vector operations per cycle.
>
Yes, multiple vector instructions can be dispatched and issued per cycle,
but the details of the read/write ports at the various cache levels are
specific to the processor.
Cache-coherent and cache-inhibited storage also behave differently.
Best to read the relevant sections of the "POWER8 Processor User’s Manual
for the Single-Chip Module" yourself.
The information portal: https://www-355.ibm.com/systems/power/openpower/
Under power8:
https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=POWER8
and
Power8 with NVidia:
https://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=POWER8_with_NVIDIA_NVLink
There are similar links for POWER9 and the PowerISA under this portal.
>
> > On x86_64, I use -mavx so gcc generates AVX instructions.
> > Disassembling the object file, I've got:
> > outVar256 = inVar256;
> > 13: c5 fd 6f 45 b0 vmovdqa -0x50(%rbp),%ymm0
> > 18: c5 fd 7f 45 d0 vmovdqa %ymm0,-0x30(%rbp)
> >
> > On ppc (cross-compiling with: /opt/at10.0/bin/powerpc64le-linux-gnu-gcc
> > -I. -g -O -mcpu=powerpc64le -mvsx -c main.c -o main.o), I've got:
> > outVar256 = inVar256;
> > 1c: 40 00 40 39 li r10,64
> > 20: 98 56 89 7d lxvd2x vs12,r9,r10
> > 24: 10 00 40 39 li r10,16
> > 28: 50 00 00 39 li r8,80
> > 2c: 98 46 09 7c lxvd2x vs0,r9,r8
> > 30: 98 4f 80 7d stxvd2x vs12,0,r9
> > 34: 98 57 09 7c stxvd2x vs0,r9,r10
> > 38: 98 4e 80 7d lxvd2x vs12,0,r9
> > 3c: 98 56 09 7c lxvd2x vs0,r9,r10
> > 40: 20 00 40 39 li r10,32
> > 44: 98 57 89 7d stxvd2x vs12,r9,r10
> > 48: 30 00 40 39 li r10,48
> > 4c: 98 57 09 7c stxvd2x vs0,r9,r10
> >
>
> Again think RISC not CISC. The compiler broke the 256-bit vector
> into pairs of 128-bit vector operations.
>
> > I'm new to ppc and not familiar with ppc instructions, but these don't
> > seem to be SIMD instructions.
>
> Those (lxvd2x/stxvd2x) are 128-bit SIMD load/store operations.
>
> > How can I generate 256 bit aligned instructions ?
> >
> You don't need more than 128-bit alignment and you will get
> effective 256-bit vector operation without this.
>
> > For more details, the goal is to drive a PCIe board and to generate 256
> > bit aligned memcpy from host memory to PCIe bus.
> >
> You should look at the GLIBC memcpy_power8 implementation for an
> example of how to maximize memory bandwidth.
>
> For performance, I use write-combining by remapping the PCIe
> address space with ioremap_wc () in my kernel device driver.
> Is there any similar mechanism?
>
>
> Thanks,
> Fred
>
> Steven J. Munroe
> Linux on Power Toolchain Architect
> IBM Corporation, Linux Technology Center