<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi,<br>
<br>
Thanks for your answers.<br>
I have some follow-up questions below.<br>
<br>
<div class="moz-cite-prefix">On 23/06/2017 03:07, Steve Munroe
wrote:<br>
</div>
<blockquote
cite="mid:OF452B1E27.61A28981-ON86258147.007C1589-86258148.000634B2@notes.na.collabserv.com"
type="cite">
<p><tt><font size="2">"Linuxppc-users"
<a class="moz-txt-link-rfc2396E" href="mailto:linuxppc-users-bounces+sjmunroe=us.ibm.com@lists.ozlabs.org"><linuxppc-users-bounces+sjmunroe=us.ibm.com@lists.ozlabs.org></a>
wrote on 06/22/2017 01:19:15 AM:<br>
<br>
> From: Frederic DUMOULIN
<a class="moz-txt-link-rfc2396E" href="mailto:frederic.dumoulin@mipsology.com"><frederic.dumoulin@mipsology.com></a></font></tt><br>
<tt><font size="2">> To: <a class="moz-txt-link-abbreviated" href="mailto:linuxppc-users@lists.ozlabs.org">linuxppc-users@lists.ozlabs.org</a></font></tt><br>
<tt><font size="2">> Date: 06/22/2017 05:02 PM</font></tt><br>
<tt><font size="2">> Subject: [Linuxppc-users] 256 bit
aligned variables</font></tt><br>
<tt><font size="2">> Sent by: "Linuxppc-users"
<linuxppc-users-bounces<br>
> +sjmunroe=us.ibm.com@lists.ozlabs.org></font></tt><br>
<tt><font size="2">> <br>
> Hi,<br>
> <br>
> I'd like to manage 256 bit aligned variables.<br>
> So I use gcc's vector_size attribute like in the
following example:<br>
> </font></tt><br>
<tt><font size="2">PowerISA does not have 256-bit vectors, yet.</font></tt><br>
<br>
<tt><font size="2">It does have 128-bit vectors, with lots of
vector registers (64 of them) and multiple vector units
executing in parallel.</font></tt><br>
<br>
<tt><font size="2">POWER8 </font></tt><br>
<tt><font size="2">- 4 load units, 2 store units (these are
paired for vector)</font></tt><br>
<tt><font size="2">- Two symmetric fixed-point units (FXU)</font></tt><br>
<tt><font size="2">- Four floating-point units (FPU),
implemented as two 2-way SIMD operations for double- and
single-precision.</font></tt><br>
<tt><font size="2">- Two VMX execution units capable of
executing simple FX, permute, complex FX, and 4-way SIMD
single-precision floating-point operations</font></tt><br>
<tt><font size="2">- One Crypto unit</font></tt><br>
<br>
<tt><font size="2">Two 16-byte loads and one 16-byte store
operation are supported for VMX and VSX operations per
cycle.</font></tt><br>
<tt><font size="2">And</font></tt><br>
<tt><font size="2">Two Vector Double operations per cycle (4 X
double FMAs per cycle)</font></tt><br>
<tt><font size="2">And</font></tt><br>
<tt><font size="2">Two Vector float / permute / Integer / Logic
operations per cycle.</font></tt><br>
<br>
<tt><font size="2">Think RISC not CISC. </font></tt><br>
<tt><font size="2"><br>
> typedef long long uint256_t __attribute__
((vector_size (8 * sizeof <br>
> (uint32_t))));<br>
> int main (void)<br>
> {<br>
> volatile uint256_t inVar256, outVar256;<br>
> outVar256 = inVar256;<br>
> }<br>
> </font></tt><br>
<br>
<tt><font size="2">GCC has native support for vector_size (16).
Larger multiples are handled as arrays of vector_size(16).</font></tt><br>
<br>
<tt><font size="2">As such there is no real advantage to
alignment greater than 16 bytes. </font></tt><br>
</p>
</blockquote>
The 256-bit alignment is not for performance. We use it to
simplify address decoding in the FPGA that manages the local bus
on our PCIe board.<br>
<br>
So, if I understood correctly, it's not possible to guarantee a
256-bit access (1 address + 8 words) on ppc, but 128-bit accesses
(1 address + 4 words) are possible. Is that correct?<br>
Will that be the case if I use uint128_t variables, declared as
follows?<br>
<tt>typedef long long uint128_t __attribute__ ((vector_size (4 * sizeof
(uint32_t))));</tt>
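<br>
<br>
To make the question concrete, here is a minimal sketch of the 128-bit
case I have in mind. The helper names copy128 and is_aligned16 are only
illustrative, not from any library. I am assuming, without having
confirmed it, that gcc compiles the copy to a single lxvd2x/stxvd2x
pair; whether that reaches the PCIe bus as one 128-bit access is exactly
what I am asking.<br>

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit vector type, declared as in the question above. */
typedef long long uint128_t __attribute__ ((vector_size (4 * sizeof (uint32_t))));

/* The type is expected to be exactly 16 bytes wide. */
static_assert (sizeof (uint128_t) == 16, "uint128_t must be 16 bytes");

/* Illustrative helper: copy one 128-bit value through volatile pointers,
   so the compiler cannot drop the load or the store. */
static void copy128 (volatile uint128_t *dst, const volatile uint128_t *src)
{
    *dst = *src;
}

/* Illustrative helper: returns 1 if p is 16-byte aligned, 0 otherwise. */
static int is_aligned16 (const volatile void *p)
{
    return ((uintptr_t) p & 15u) == 0;
}
```

The idea would be to check is_aligned16 () on both pointers before
calling copy128 () on the mapped region.<br>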
<blockquote
cite="mid:OF452B1E27.61A28981-ON86258147.007C1589-86258148.000634B2@notes.na.collabserv.com"
type="cite">
<p><br>
      <tt><font size="2">Also, with the inherent instruction
          parallelism of the POWER8, you are already getting effectively
          256 bits of vector operations per cycle.</font></tt><br>
<br>
<tt><font size="2"><br>
> On x86_64, I use -mavx so gcc generates AVX
instructions.<br>
> Disassembling the object file, I've got:<br>
> outVar256 = inVar256;<br>
> 13: c5 fd 6f 45 b0 vmovdqa
-0x50(%rbp),%ymm0<br>
> 18: c5 fd 7f 45 d0 vmovdqa
%ymm0,-0x30(%rbp)<br>
> <br>
> On ppc (cross-compiling with:
/opt/at10.0/bin/powerpc64le-linux-gnu-gcc <br>
> -I. -g -O -mcpu=powerpc64le -mvsx -c main.c -o main.o),
I've got:<br>
> outVar256 = inVar256;<br>
> 1c: 40 00 40 39 li r10,64<br>
> 20: 98 56 89 7d lxvd2x vs12,r9,r10<br>
> 24: 10 00 40 39 li r10,16<br>
> 28: 50 00 00 39 li r8,80<br>
> 2c: 98 46 09 7c lxvd2x vs0,r9,r8<br>
> 30: 98 4f 80 7d stxvd2x vs12,0,r9<br>
> 34: 98 57 09 7c stxvd2x vs0,r9,r10<br>
> 38: 98 4e 80 7d lxvd2x vs12,0,r9<br>
> 3c: 98 56 09 7c lxvd2x vs0,r9,r10<br>
> 40: 20 00 40 39 li r10,32<br>
> 44: 98 57 89 7d stxvd2x vs12,r9,r10<br>
> 48: 30 00 40 39 li r10,48<br>
> 4c: 98 57 09 7c stxvd2x vs0,r9,r10<br>
> </font></tt><br>
<br>
<tt><font size="2">Again think RISC not CISC. The compiler broke
the 256-bit vector into pairs of 128-bit vector operations.</font></tt><br>
<tt><font size="2"><br>
> I'm new on ppc and not familiar with ppc instructions
but it doesn't <br>
> seem to be SIMD instructions.</font></tt><br>
<br>
<tt><font size="2">Those (lxvd2x/stxvd2x) are 128-bit SIMD
load/store operations.</font></tt><br>
<tt><font size="2"><br>
> How can I generate 256 bit aligned instructions ?<br>
> </font></tt><br>
<tt><font size="2">You don't need more than 128-bit alignment,
and you will get effective 256-bit vector operation without
it.</font></tt><br>
<tt><font size="2"><br>
> For more details, the goal is to drive a PCIe board and
to generate 256 <br>
> bit aligned memcpy from host memory to PCIe bus.<br>
> </font></tt><br>
<tt><font size="2">You should look at the GLIBC memcpy_power8
implementation for an example of how to maximize memory
bandwidth.</font></tt><br>
</p>
</blockquote>
<tt><font size="2">For performance, I use write-combining by
remapping the PCIe address space with ioremap_wc() in my kernel
device driver.<br>
Is there a similar mechanism?<br>
<br>
<br>
Thanks,<br>
Fred<br>
<br>
</font></tt>
<blockquote
cite="mid:OF452B1E27.61A28981-ON86258147.007C1589-86258148.000634B2@notes.na.collabserv.com"
type="cite">
<p><br>
<font size="2">Steven J. Munroe<br>
Linux on Power Toolchain Architect<br>
IBM Corporation, Linux Technology Center<br>
</font><br>
</p>
</blockquote>
<br>
</body>
</html>