[PATCH V2 0/8] FP/VEC/VSX switching optimisations

Cyril Bur cyrilbur at gmail.com
Fri Jan 15 16:04:06 AEDT 2016


Cover-letter for V1 of the series is at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-November/136350.html

Version one of this series used a cmpb instruction in handcrafted assembly,
which it turns out is not supported on older POWER machines. Michael
suggested replacing it with crandc, which works fine. Testing also showed
no difference in performance between using cmpb and crandc.

The primary objective is improving the syscall hot path. While the gut
feeling may be that avoiding C is quicker, it may also be the case that C is
not significantly slower. If C is not slower, using C would provide a
distinct readability and maintainability advantage.
I have benchmarked a few possible scenarios:
1. Always calling into C.
2. Testing for the common case in assembly before calling into C.
3. Using crandc in the full assembly check.

All benchmarks are the average of 50 runs of Anton's context switch
benchmark http://www.ozlabs.org/~anton/junkcode/context_switch2.c with
the kernel and ramdisk run under QEMU/KVM on a POWER8.
To cover all cases, a variety of flags were passed to the benchmark to
see the effect of only touching a subset of the 'math' register space.

The absolute numbers are in context switches per second and can vary greatly
depending on how the kernel is run (virt/powernv/ramdisk/disk). As such,
the units aren't very relevant here, as we're interested in the speedup.
The most interesting number here is the %speedup over the previous
scenario. In this case 100% means there was no difference, therefore <100%
indicates a decrease in performance and >100% an increase.
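
As a quick illustration of how the %speedup column can be derived (a
minimal sketch; the helper name speedup_percent is made up for this
example and is not part of the benchmark), each entry is a scenario's
average expressed as a percentage of the previous scenario's average for
the same flags:

```python
def speedup_percent(new_avg, old_avg):
    """Return new_avg as a percentage of old_avg (100.0 means no change)."""
    return 100.0 * new_avg / old_avg


# 'none' row: scenario 2 average against scenario 1 average,
# both copied from the tables in this mail.
none_2_vs_1 = speedup_percent(2058003.64, 2059785.00)
print(f"{none_2_vs_1:.2f}")  # prints 99.91

# 'fp vector' row: scenario 2 average against scenario 1 average.
fp_vector_2_vs_1 = speedup_percent(1668912.96, 1640951.76)
print(f"{fp_vector_2_vs_1:.2f}")  # prints 101.70
```

Because this is a ratio of two averages, the unit (context switches per
second) cancels out, which is why the absolute numbers matter less than
the percentage.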

For 1 - Always calling into C
         Flags |  Average   |  Stddev  |
========================================
          none | 2059785.00 | 14217.64 |
            fp | 1766297.65 | 10576.64 |
    fp altivec | 1636125.04 | 5693.84  |
     fp vector | 1640951.76 | 13141.93 |
       altivec | 1815133.80 | 10450.46 |
altivec vector | 1636438.60 | 5475.12  |
        vector | 1639628.16 | 11456.06 |
           all | 1629516.32 | 7785.36  |



For 2 - Common case checking in asm before calling into C
         Flags |  Average   |  Stddev  | %speedup vs 1 |
========================================================
          none | 2058003.64 | 20464.22 | 99.91         |
            fp | 1757245.80 | 14455.45 | 99.49         |
    fp altivec | 1658240.12 | 6318.41  | 101.35        |
     fp vector | 1668912.96 | 9451.47  | 101.70        |
       altivec | 1815223.96 | 4819.82  | 100.00        |
altivec vector | 1648805.32 | 15100.50 | 100.76        |
        vector | 1663654.68 | 13814.79 | 101.47        |
           all | 1644884.04 | 11315.74 | 100.94        |



For 3 - Full checking in ASM using crandc instead of cmpb
         Flags |  Average   |  Stddev  | %speedup vs 2 |
========================================================
          none | 2066930.52 | 19426.46 | 100.43        |
            fp | 1781653.24 | 7744.55  | 101.39        |
    fp altivec | 1653125.84 | 6727.36  | 99.69         |
     fp vector | 1656011.04 | 11678.56 | 99.23         |
       altivec | 1824934.72 | 16842.19 | 100.53        |
altivec vector | 1649486.92 | 3219.14  | 100.04        |
        vector | 1662420.20 | 9609.34  | 99.93         |
           all | 1647933.64 | 11121.22 | 100.19        |

From these numbers it appears that reducing the calls to C in the common
case is beneficial, possibly up to a 1.5% speedup over always calling C. The
benefit of the more complicated asm checking does appear to be very slight,
fractions of a percent at best. On balance it may prove wise to use
option 2: there are much bigger fish to fry in terms of performance, and the
complexity of the assembly is not worth a small fraction of one percent
improvement at this stage.
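
One rough way to summarise the two %speedup columns is to average them
across the flag combinations. This is only a sketch: weighting every flag
combination equally is an assumption made for this example, it ignores the
standard deviations, and it was not part of the original measurements. The
values are copied from the tables for scenarios 2 and 3.

```python
from statistics import mean

# %speedup columns copied from the tables, in row order:
# none, fp, fp altivec, fp vector, altivec, altivec vector, vector, all
speedup_2_vs_1 = [99.91, 99.49, 101.35, 101.70, 100.00, 100.76, 101.47, 100.94]
speedup_3_vs_2 = [100.43, 101.39, 99.69, 99.23, 100.53, 100.04, 99.93, 100.19]

# Unweighted arithmetic mean across the eight flag combinations.
mean_2_vs_1 = mean(speedup_2_vs_1)
mean_3_vs_2 = mean(speedup_3_vs_2)

print(f"scenario 2 vs 1: {mean_2_vs_1:.2f}%")  # prints scenario 2 vs 1: 100.70%
print(f"scenario 3 vs 2: {mean_3_vs_2:.2f}%")  # prints scenario 3 vs 2: 100.18%
```

Under that assumption, the common-case check gains a bit under one percent
on average over always calling C, while the full asm check adds only a
fraction of a percent on top.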

Version 2 of this series also addresses some comments from Mikey Neuling on
the tests, such as adding .gitignore and forcing 64-bit compiles of the
tests as they use 64-bit only instructions.


Cyril Bur (8):
  selftests/powerpc: Test the preservation of FPU and VMX regs across
    syscall
  selftests/powerpc: Test preservation of FPU and VMX regs across
    preemption
  selftests/powerpc: Test FPU and VMX regs in signal ucontext
  powerpc: Explicitly disable math features when copying thread
  powerpc: Restore FPU/VEC/VSX if previously used
  powerpc: Add the ability to save FPU without giving it up
  powerpc: Add the ability to save Altivec without giving it up
  powerpc: Add the ability to save VSX without giving it up

 arch/powerpc/include/asm/processor.h               |   2 +
 arch/powerpc/include/asm/switch_to.h               |   5 +-
 arch/powerpc/kernel/asm-offsets.c                  |   2 +
 arch/powerpc/kernel/entry_64.S                     |  21 +-
 arch/powerpc/kernel/fpu.S                          |  25 +--
 arch/powerpc/kernel/ppc_ksyms.c                    |   4 -
 arch/powerpc/kernel/process.c                      | 144 +++++++++++--
 arch/powerpc/kernel/vector.S                       |  45 +---
 tools/testing/selftests/powerpc/Makefile           |   3 +-
 tools/testing/selftests/powerpc/basic_asm.h        |  26 +++
 tools/testing/selftests/powerpc/math/.gitignore    |   6 +
 tools/testing/selftests/powerpc/math/Makefile      |  19 ++
 tools/testing/selftests/powerpc/math/fpu_asm.S     | 195 ++++++++++++++++++
 tools/testing/selftests/powerpc/math/fpu_preempt.c | 113 ++++++++++
 tools/testing/selftests/powerpc/math/fpu_signal.c  | 135 ++++++++++++
 tools/testing/selftests/powerpc/math/fpu_syscall.c |  90 ++++++++
 tools/testing/selftests/powerpc/math/vmx_asm.S     | 229 +++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_preempt.c | 113 ++++++++++
 tools/testing/selftests/powerpc/math/vmx_signal.c  | 138 +++++++++++++
 tools/testing/selftests/powerpc/math/vmx_syscall.c |  92 +++++++++
 20 files changed, 1326 insertions(+), 81 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/basic_asm.h
 create mode 100644 tools/testing/selftests/powerpc/math/.gitignore
 create mode 100644 tools/testing/selftests/powerpc/math/Makefile
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c

-- 
2.7.0


