[PATCH 0/5] powerpc: Implement masked user access

Sun Jul 6 06:15:57 AEST 2025

Hi!

On Sat, Jul 05, 2025 at 07:33:32PM +0100, David Laight wrote:
> On Thu, 26 Jun 2025 17:01:48 -0500
> Segher Boessenkool <segher at kernel.crashing.org> wrote:
> > On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> ...
> > I have no idea why you think power9 has it while older CPUs do not.  In
> > the GCC source code we have this comment:
> >   /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
> >      altivec is a win so enable it.  */
> > and in fact we do not enable it for ISA 2.06 (p8) either, probably for

2.07 I meant of course.  Sigh.

> > a similar reason.
> 
> Odd, I'd have thought that replacing a conditional branch with a
> conditional move would pretty much always be a win.
> Unless, of course, you only consider benchmark loops where the
> branch predictor is 100% accurate.

The isel machine instruction is super expensive on p8: it is marked as
first in an instruction group, and has latency 5 for the GPR sources,
and 8 for the CR field source.

On p7 it wasn't great either: it was actually converted to a branch
sequence internally!

On p8 there are bc+8 optimisations done by the core as well: conditional
branches that skip one insn are faster than equivalent isel insns!

Since p9 it is a lot better :-)
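
To make the trade-off concrete, here is a minimal sketch in C of the two
ways such a select can be written.  The function names select_plain and
select_mask are made up for illustration.  Which machine code results is
up to the compiler and its CPU tuning: the first form may become a short
forward branch or an isel, and the compiler is free to rewrite the
second form as well.

```c
#include <stdint.h>

/* Plain conditional.  The compiler chooses the implementation, for
   example a conditional branch over one insn or a conditional-select
   insn such as isel, depending on the CPU it is tuning for.  */
uint64_t select_plain(int cond, uint64_t a, uint64_t b)
{
	return cond ? a : b;
}

/* Branch-free as written: turn the condition into an all-ones or
   all-zeroes mask and blend the two inputs with it.  */
uint64_t select_mask(int cond, uint64_t a, uint64_t b)
{
	uint64_t mask = -(uint64_t)(cond != 0);	/* ~0 if cond, else 0 */

	return (a & mask) | (b & ~mask);
}
```

Both functions return a when cond is non-zero and b otherwise; only the
generated code may differ, so comparing the assembler output for
different tuning options is the way to see what a given compiler
actually does.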

> OTOH isn't altivec 'simd' instructions?

AltiVec is the old Motorola marketing name for what is called the
"Vector Facility" in the architecture, and which at IBM is still called
VMX, the name it was developed under ("Vector Multimedia Extension").

Since p7 (ISA 2.06, 2010) there also is the Vector-Scalar Extension
Facility, VSX, which adds another 32 vector registers.  The traditional
floating point registers are physically the same registers as those
(but they use only the first half of each vector reg).  Many new VSX
instructions can do simple floating point stuff on all 64 VSX
registers, either just on the first lane ("scalar") or on all lanes
("vector").

This does largely mean that all floating point is now stored in IEEE DP
format internally.  Older cores usually used some close to 70-bit format
internally, which in olden times actually allowed making the cores
faster; only when a value was stored to memory was it actually
converted to IEEE format (but of course it was always rounded correctly,
etc.)

> They pretty much only help for loops with lots of iterations.
> I don't know about ppc, but I've seen gcc make a real 'pigs breakfast'
> of loop vectorisation on x86.

For PowerPC (or Power, the more modern name) of course we also have our
fair share of problems with vectorisation.  It does help that we were
the first architecture used by GCC that had a serious Vector thing: for
example, the C syntax extension for Vector literals is taken from the
old extensions in the AltiVec PIM, but using curly brackets {} instead
of round brackets ().
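
As a small illustration of that curly-bracket syntax, here is a sketch
using GCC's generic vector extension.  The type name v4si and the
function name vector_add_sum are chosen for this example only, and the
code assumes a compiler that implements the extension (GCC or Clang).

```c
/* Four 32-bit ints packed into one 16-byte vector.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Add two vector literals element-wise and sum the four lanes.  */
int vector_add_sum(void)
{
	v4si a = { 1, 2, 3, 4 };	/* vector literal, curly brackets */
	v4si b = { 10, 20, 30, 40 };
	v4si c = a + b;			/* one add on all four lanes */

	return c[0] + c[1] + c[2] + c[3];
}
```

Whether the add ends up as a single vector instruction depends on the
target and the options in use; without vector hardware enabled the
compiler falls back to scalar code.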

> For the Linux kernel, which (as Linus keeps reminding people) tends
> to run 'cold cache', you probably want conditional moves in order
> to avoid mis-predicted branches and non-linear execution, but
> don't want loop vectorisation because the setup and end cases
> cost too much compared to the gain for each iteration.

You are best off using what GCC gives you, usually.  It is very well
tuned, both the generic and the machine-specific code :-)

The kernel of course disables all Vector and FP stuff: essentially it
disables use of any of the associated registers, and that's pretty much
the end of it ;-)

(The reason for that is that using those registers in the kernel would
make task switches more expensive: long ago all task switches, but
nowadays still user<->kernel switches.)

Segher