<html><body><p>One thing that occurred to me was that since the __kernel_datapage_offset is located immediately before the function, the offset is small and you could get rid of the addi and just fold it into the lwz.<br><br><br>Aaron Sawdey, Ph.D. sawdey@us.ibm.com<br>050-2/C113 (507) 253-7520 TL: 553-7520 home: 507/263-0782<br>IBM Linux Technology Center - PPC Toolchain<br><br><img width="16" height="16" src="cid:1__=09BBF45ADFF6678E8f9e8a93df938690918c09B@" border="0" alt="Inactive hide details for Denis Kirjanov ---09/23/2015 01:22:39 PM---On 9/23/15, Michael Neuling <mikey@neuling.org> wrote: > T"><font color="#424282">Denis Kirjanov ---09/23/2015 01:22:39 PM---On 9/23/15, Michael Neuling <mikey@neuling.org> wrote: > The 32 and 64 bit variants of __get_datapag</font><br><br><font size="2" color="#5F5F5F">From: </font><font size="2">Denis Kirjanov <kda@linux-powerpc.org></font><br><font size="2" color="#5F5F5F">To: </font><font size="2">Michael Neuling <mikey@neuling.org></font><br><font size="2" color="#5F5F5F">Cc: </font><font size="2">Michael Ellerman <michael@ellerman.id.au>, benh <benh@kernel.crashing.org>, Aaron Sawdey/Rochester/IBM@IBMUS, linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, Anton Blanchard <anton@samba.org>, Steve Munroe/Rochester/IBM@IBMUS</font><br><font size="2" color="#5F5F5F">Date: </font><font size="2">09/23/2015 01:22 PM</font><br><font size="2" color="#5F5F5F">Subject: </font><font size="2">Re: [PATCH] powerpc/vdso: Avoid link stack corruption in __get_datapage()</font><br><hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br><br><br><tt>On 9/23/15, Michael Neuling <mikey@neuling.org> wrote:<br>> The 32 and 64 bit variants of __get_datapage() use a "bcl; mflr" to<br>> determine the loaded address of the VDSO. The current version of these<br>> attempt to use the special bcl variant which avoids pushing to the<br>> link stack.<br>><br>> Unfortunately it uses bcl+8 rather than the required bcl+4. Hence the<br>> current code results in link stack corruption and the resulting<br>> performance degradation (due to branch mis-prediction).<br>><br>> This patch moves us to bcl+4 by moving __kernel_datapage_offset<br>> out of __get_datapage().<br>><br>> With this patch, running the below benchmark we get a bump in<br>> performance on POWER8 for gettimeofday() (which uses<br>> __get_datapage()).<br>><br>> 64bit gets ~4% improvement:<br>> Without patch:<br>> # ./tb<br>> time = 0.180321<br>> With patch:<br>> # ./tb<br>> time = 0.187408<br>><br>> 32bit gets ~9% improvement:<br>> Without patch:<br>> # ./tb<br>> time = 0.276551<br>> With patch:<br>> # ./tb<br>> time = 0.252767<br><br>I've got the following results on POWER7 64bit<br>without the patch:<br># ./tb<br>time = 0.263337<br># ./tb<br>time = 0.251273<br># ./tb<br>time = 0.258453<br># ./tb<br>time = 0.260189<br><br>with the patch:<br># ./tb<br>time = 0.241517<br># ./tb<br>time = 0.241973<br># ./tb<br>time = 0.239365<br># ./tb<br>time = 0.240109<br><br><br><br>><br>> Testcase tb.c (stolen from Anton)<br>> /* gcc -O2 tb.c -o tb */<br>> #include <sys/time.h><br>> #include <stdio.h><br>><br>> int main()<br>> {<br>> int i;<br>><br>> struct timeval tv_start, tv_end;<br>><br>> gettimeofday(&tv_start, NULL);<br>><br>> for(i = 0; i < 10000000; i++) {<br>> gettimeofday(&tv_end, NULL);<br>> }<br>><br>> printf("time = %.6f\n", tv_end.tv_sec - tv_start.tv_sec + (tv_end.tv_usec<br>> - tv_start.tv_usec) * 1e-6);<br>><br>> return 0;<br>> }<br>><br>> Signed-off-by: Michael Neuling <mikey@neuling.org><br>> Reported-by: Aaron Sawdey <sawdey@us.ibm.com><br>><br>> diff --git a/arch/powerpc/kernel/vdso32/datapage.S<br>> b/arch/powerpc/kernel/vdso32/datapage.S<br>> index dc21e89..59cf5f4 100644<br>> --- a/arch/powerpc/kernel/vdso32/datapage.S<br>> +++ b/arch/powerpc/kernel/vdso32/datapage.S<br>> @@ -16,6 +16,10 @@<br>> #include <asm/vdso.h><br>><br>> .text<br>> + .global __kernel_datapage_offset;<br>> +__kernel_datapage_offset:<br>> + .long 0<br>> +<br>> V_FUNCTION_BEGIN(__get_datapage)<br>> .cfi_startproc<br>> /* We don't want that exposed or overridable as we want other objects<br>> @@ -27,13 +31,11 @@ V_FUNCTION_BEGIN(__get_datapage)<br>> mflr r0<br>> .cfi_register lr,r0<br>><br>> - bcl 20,31,1f<br>> - .global __kernel_datapage_offset;<br>> -__kernel_datapage_offset:<br>> - .long 0<br>> -1:<br>> + bcl 20,31,data_page_branch<br>> +data_page_branch:<br>> mflr r3<br>> mtlr r0<br>> + addi r3, r3, __kernel_datapage_offset-data_page_branch<br>> lwz r0,0(r3)<br>> add r3,r0,r3<br>> blr<br>> diff --git a/arch/powerpc/kernel/vdso64/datapage.S<br>> b/arch/powerpc/kernel/vdso64/datapage.S<br>> index 79796de..2f01c4a 100644<br>> --- a/arch/powerpc/kernel/vdso64/datapage.S<br>> +++ b/arch/powerpc/kernel/vdso64/datapage.S<br>> @@ -16,6 +16,10 @@<br>> #include <asm/vdso.h><br>><br>> .text<br>> +.global __kernel_datapage_offset;<br>> +__kernel_datapage_offset:<br>> + .long 0<br>> +<br>> V_FUNCTION_BEGIN(__get_datapage)<br>> .cfi_startproc<br>> /* We don't want that exposed or overridable as we want other objects<br>> @@ -27,13 +31,11 @@ V_FUNCTION_BEGIN(__get_datapage)<br>> mflr r0<br>> .cfi_register lr,r0<br>><br>> - bcl 20,31,1f<br>> - .global __kernel_datapage_offset;<br>> -__kernel_datapage_offset:<br>> - .long 0<br>> -1:<br>> + bcl 20,31,data_page_branch<br>> +data_page_branch:<br>> mflr r3<br>> mtlr r0<br>> + addi r3, r3, __kernel_datapage_offset-data_page_branch<br>> lwz r0,0(r3)<br>> add r3,r0,r3<br>> blr<br>><br>><br>> _______________________________________________<br>> Linuxppc-dev mailing list<br>> Linuxppc-dev@lists.ozlabs.org<br>> </tt><tt><a href="https://lists.ozlabs.org/listinfo/linuxppc-dev">https://lists.ozlabs.org/listinfo/linuxppc-dev</a></tt><tt><br><br></tt><br><br><BR>
</body></html>