<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Hi,<br>

    <br>

    Thanks for your answers.<br>

    I've got new questions below.<br>

    <br>

    <div class="moz-cite-prefix">On 23/06/2017 03:07, Steve Munroe

      wrote:<br>

    </div>

    <blockquote

cite="mid:OF452B1E27.61A28981-ON86258147.007C1589-86258148.000634B2@notes.na.collabserv.com"

      type="cite">

      <p><tt><font size="2">"Linuxppc-users"

            <a class="moz-txt-link-rfc2396E" href="mailto:linuxppc-users-bounces+sjmunroe=us.ibm.com@lists.ozlabs.org">&lt;linuxppc-users-bounces+sjmunroe=us.ibm.com@lists.ozlabs.org&gt;</a>

            wrote on 06/22/2017 01:19:15 AM:<br>

            <br>

            > From: Frederic DUMOULIN

            <a class="moz-txt-link-rfc2396E" href="mailto:frederic.dumoulin@mipsology.com">&lt;frederic.dumoulin@mipsology.com&gt;</a></font></tt><br>

        <tt><font size="2">> To: <a class="moz-txt-link-abbreviated" href="mailto:linuxppc-users@lists.ozlabs.org">linuxppc-users@lists.ozlabs.org</a></font></tt><br>

        <tt><font size="2">> Date: 06/22/2017 05:02 PM</font></tt><br>

        <tt><font size="2">> Subject: [Linuxppc-users] 256 bit

            aligned variables</font></tt><br>

        <tt><font size="2">> Sent by: "Linuxppc-users"

            &lt;linuxppc-users-bounces<br>

            > +sjmunroe=us.ibm.com@lists.ozlabs.org&gt;</font></tt><br>

        <tt><font size="2">> <br>

            > Hi,<br>

            > <br>

            > I'd like to manage 256 bit aligned variables.<br>

            > So I use gcc's vector_size attribute like in the

            following example:<br>

            > </font></tt><br>

        <tt><font size="2">PowerISA does not have 256-bit vectors, yet.</font></tt><br>

        <br>

        <tt><font size="2">It does have 128-bit vectors, with lots of

            vector registers (64 of them) and multiple vector units

            executing in parallel.</font></tt><br>

        <br>

        <tt><font size="2">POWER8 </font></tt><br>

        <tt><font size="2">- 4 load units, 2 store units (these are

            paired for vector)</font></tt><br>

        <tt><font size="2">- Two symmetric fixed-point units (FXU)</font></tt><br>

        <tt><font size="2">- Four floating-point units (FPU),

            implemented as two 2-way SIMD operations for double- and

            single-precision.</font></tt><br>

        <tt><font size="2">- Two VMX execution units capable of

            executing simple FX, permute, complex FX, and 4-way SIMD

            single-precision floating-point operations</font></tt><br>

        <tt><font size="2">- One Crypto unit</font></tt><br>

        <br>

        <tt><font size="2">Two 16-byte loads and one 16-byte store

            operation are supported for VMX and VSX operations per

            cycle.</font></tt><br>

        <tt><font size="2">And</font></tt><br>

        <tt><font size="2">Two Vector Double operations per cycle (4 X

            double FMAs per cycle)</font></tt><br>

        <tt><font size="2">And</font></tt><br>

        <tt><font size="2">Two Vector float / permute / Integer / Logic

            operations per cycle.</font></tt><br>

        <br>

        <tt><font size="2">Think RISC not CISC. </font></tt><br>

        <tt><font size="2"><br>

            > typedef long long  uint256_t __attribute__

            ((vector_size (8 * sizeof <br>

            > (uint32_t))));<br>

            > int main (void)<br>

            > {<br>

            >      volatile uint256_t inVar256, outVar256;<br>

            >      outVar256 = inVar256;<br>

            > }<br>

            > </font></tt><br>

        <br>

        <tt><font size="2">GCC has native support for vector_size (16).

            Larger multiples are handled as arrays of vector_size(16).</font></tt><br>

        <br>

        <tt><font size="2">As such there is no real advantage to

            alignment greater than 16 bytes. </font></tt><br>

      </p>

    </blockquote>

    The 256-bit alignment is not there for performance; we use it

    to simplify the address decoding in the FPGA that manages the

    local bus on our PCIe board.<br>
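    <br>
    A minimal sketch of the arithmetic behind this, assuming the FPGA decodes the local bus in 32-byte (256-bit) blocks. The names BLOCK_BYTES, is_block_aligned, block_index and block_offset are made up for illustration only:<br>

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed block size: 256 bits = 32 bytes. */
#define BLOCK_BYTES 32u

/* An address is block-aligned when its low 5 bits are all zero. */
static bool is_block_aligned (uintptr_t addr)
{
    return (addr & (BLOCK_BYTES - 1u)) == 0;
}

/* With aligned accesses the decoder only needs the upper bits. */
static uintptr_t block_index (uintptr_t addr)
{
    return addr / BLOCK_BYTES;
}

/* Byte offset inside the block; always 0 for an aligned address. */
static uintptr_t block_offset (uintptr_t addr)
{
    return addr & (BLOCK_BYTES - 1u);
}
```

    With every access starting on a block boundary, the offset is always zero and the decoder can ignore the low five address bits.<br>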

    <br>

    So, if I understood correctly, a single 256-bit access (1 address +

    8 words) cannot be guaranteed on ppc, but 128-bit accesses (1

    address + 4 words) are possible. Is that correct?<br>

    Would that be the case if I used uint128_t variables, declared with <tt>typedef

      long long uint128_t __attribute__ ((vector_size (4 * sizeof

      (uint32_t))));</tt>?
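    <br>
    To make the question concrete, here is a minimal sketch that keeps the 256-bit alignment while using only 128-bit vector types. The names u32x4_t, block256_t and copy_block256 are hypothetical. It assumes GCC's vector_size and aligned attributes and a C11 compiler. Whether each assignment reaches the PCIe bus as a single 128-bit transaction is not guaranteed by the compiler or by this sketch:<br>

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit vector of four 32-bit words (GCC vector extension). */
typedef uint32_t u32x4_t __attribute__ ((vector_size (4 * sizeof (uint32_t))));

/* Hypothetical 256-bit block: two 128-bit halves on a 32-byte boundary. */
typedef struct {
    u32x4_t half[2];
} __attribute__ ((aligned (32))) block256_t;

/* Compile-time checks on size and alignment. */
static_assert (sizeof (block256_t) == 32, "block must be exactly 256 bits");
static_assert (_Alignof (block256_t) == 32, "block must be 256-bit aligned");

/* Copy one block as two separate 128-bit vector assignments. */
static void copy_block256 (volatile block256_t *dst,
                           const volatile block256_t *src)
{
    dst->half[0] = src->half[0];
    dst->half[1] = src->half[1];
}
```

    The alignment attribute only controls where the block is placed in memory; the access width is still decided by the instructions the compiler selects.<br>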

    <blockquote

cite="mid:OF452B1E27.61A28981-ON86258147.007C1589-86258148.000634B2@notes.na.collabserv.com"

      type="cite">

      <p><br>

        <tt><font size="2">Also with the inherent instruction

            parallelism of the POWER8, you are already getting effective

            256-bits of vector operations per cycle.</font></tt><br>

        <br>

        <tt><font size="2"><br>

            > On x86_64, I use -mavx so gcc generates AVX

            instructions.<br>

            > Disassembling the object file, I've got:<br>

            >      outVar256 = inVar256;<br>

            >    13:    c5 fd 6f 45 b0           vmovdqa

            -0x50(%rbp),%ymm0<br>

            >    18:    c5 fd 7f 45 d0           vmovdqa

            %ymm0,-0x30(%rbp)<br>

            > <br>

            > On ppc (cross-compiling with:

            /opt/at10.0/bin/powerpc64le-linux-gnu-gcc <br>

            > -I. -g -O -mcpu=powerpc64le -mvsx -c main.c -o main.o),

            I've got:<br>

            >      outVar256 = inVar256;<br>

            >    1c:    40 00 40 39     li      r10,64<br>

            >    20:    98 56 89 7d     lxvd2x  vs12,r9,r10<br>

            >    24:    10 00 40 39     li      r10,16<br>

            >    28:    50 00 00 39     li      r8,80<br>

            >    2c:    98 46 09 7c     lxvd2x  vs0,r9,r8<br>

            >    30:    98 4f 80 7d     stxvd2x vs12,0,r9<br>

            >    34:    98 57 09 7c     stxvd2x vs0,r9,r10<br>

            >    38:    98 4e 80 7d     lxvd2x  vs12,0,r9<br>

            >    3c:    98 56 09 7c     lxvd2x  vs0,r9,r10<br>

            >    40:    20 00 40 39     li      r10,32<br>

            >    44:    98 57 89 7d     stxvd2x vs12,r9,r10<br>

            >    48:    30 00 40 39     li      r10,48<br>

            >    4c:    98 57 09 7c     stxvd2x vs0,r9,r10<br>

            > </font></tt><br>

        <br>

        <tt><font size="2">Again think RISC not CISC. The compiler broke

            the 256-bit vector into pairs of 128-bit vector operations.</font></tt><br>

        <tt><font size="2"><br>

            > I'm new on ppc and not familiar with ppc instructions

            but it doesn't <br>

            > seem to be SIMD instructions.</font></tt><br>

        <br>

        <tt><font size="2">Those (lxvd2x/stxvd2x) are 128-bit SIMD

            load/store operations.</font></tt><br>

        <tt><font size="2"><br>

            > How can I generate 256 bit aligned instructions ?<br>

            > </font></tt><br>

        <tt><font size="2">You don't need more than 128-bit alignment

            and you will get effective 256-bit vector operation without

            this.</font></tt><br>

        <tt><font size="2"><br>

            > For more details, the goal is to drive a PCIe board and

            to generate 256 <br>

            > bit aligned memcpy from host memory to PCIe bus.<br>

            > </font></tt><br>

        <tt><font size="2">You should look at the GLIBC memcpy_power8

            implementation for an example of how to maximize memory

            bandwidth.</font></tt><br>

      </p>

    </blockquote>

    <tt><font size="2">For performance, I use write-combining by

        remapping the PCIe address space with ioremap_wc() in my kernel

        device driver.<br>

        Is there a similar mechanism on ppc?<br>

        <br>

        <br>

        Thanks,<br>

        Fred<br>

        <br>

      </font></tt>

    <blockquote

cite="mid:OF452B1E27.61A28981-ON86258147.007C1589-86258148.000634B2@notes.na.collabserv.com"

      type="cite">

      <p><br>

        <font size="2">Steven J. Munroe<br>

          Linux on Power Toolchain Architect<br>

          IBM Corporation, Linux Technology Center<br>

        </font><br>

      </p>

    </blockquote>

    <br>

  </body>

</html>