z constraint in powerpc inline assembly ?
David Laight
David.Laight at ACULAB.COM
Fri Jan 17 02:54:58 AEDT 2020
From: Christophe Leroy
> Sent: 16 January 2020 06:12
>
> I'm trying to see if we could enhance TCP checksum calculations by
> splitting inline assembly blocks to give GCC the opportunity to mix it
> with other stuff, but I'm getting difficulties with the carry.
If you are trying to 'loop carry' the 'carry flag' with 'add with carry'
instructions you'll almost certainly need to write the loop in asm.
Since the loop itself is simple, this probably doesn't matter.
However a loop of 'add with carry' instructions may not be the
fastest code by any means.
Because the carry flag is needed for every 'adc' you can't do more
than one adc per clock.
This limits you to 8 bytes/clock on a 64bit system - even one
that can schedule multiple memory reads and lots of instructions
every clock.
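
To make the dependency visible, here is a rough C model of such a
loop. It is only a sketch: sum_adc_model() is a made-up name, and a
compiler is not guaranteed to turn this into 'adc' instructions (which
is why the real thing ends up in asm). Each iteration needs the carry
produced by the previous one, so the adds cannot overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an 'add with carry' loop over 64-bit words.
 * 'carry' stands in for the CPU carry flag. */
static uint64_t sum_adc_model(const uint64_t *w, size_t n)
{
    uint64_t sum = 0;
    uint64_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t t = sum + carry;      /* consume the previous carry */
        uint64_t c1 = t < carry;       /* carry out of that add */
        sum = t + w[i];
        carry = c1 | (sum < w[i]);     /* at most one of the two is set */
    }

    /* Fold the last carry back in (end-around carry). */
    sum += carry;
    if (sum < carry)
        sum++;
    return sum;
}
```

One 8-byte word per iteration, one iteration per clock at best, gives
the 8 bytes/clock ceiling.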
I don't know ppc, but on x86 you don't even get 1 adc per clock
until very recent (Haswell I think) cpus.
Sandy/Ivy bridge will do so if you add to alternate registers.
For earlier cpus it is actually difficult to beat the 4 bytes/clock
you get by adding 32bit values to a 64bit register in C code.
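
Something like the following is what I mean by that. It is a minimal
sketch, not the kernel's code: csum32() and fold64() are names invented
here, the length is assumed to be a multiple of 4, and the result is
the un-inverted 16-bit sum in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 64-bit sum down to 16 bits,
 * adding the carries back in at each step. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 33 bits */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* now fits in 32 bits */
    sum = (sum & 0xffff) + (sum >> 16);       /* at most 17 bits */
    sum = (sum & 0xffff) + (sum >> 16);       /* now fits in 16 bits */
    return (uint16_t)sum;
}

/* Add 32-bit words into a 64-bit accumulator. The carries collect in
 * the high half, so no carry flag is needed inside the loop.
 * Assumes len is a multiple of 4 and well below 16GB, so the
 * accumulator cannot overflow. */
static uint16_t csum32(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;

    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w);   /* avoids unaligned access */
        sum += w;
    }
    return fold64(sum);
}
```

The adds still depend on each other through 'sum', but only through
an ordinary register rather than the carry flag.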
One possibility is to do a normal add then shift the carry
into a separate register.
After 64 words use 'popcnt' to sum the carry bits.
With 2 accumulators (and carry shifts) you'd need to
break the loop every 1024 bytes.
This should beat 8 bytes/clock if you can execute more than
1 memory read, one add and one shift each clock.
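
In C the idea looks roughly like this. Treat it as a sketch under
assumptions: sum_deferred_carry() is a made-up name,
__builtin_popcountll() is the GCC/clang builtin (other compilers need
something else), and this is not the code I timed. It uses a single
accumulator, so it breaks every 64 words (512 bytes); a second
accumulator with its own carry register gives the 1024 bytes above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: sum 64-bit words with plain adds, shifting each
 * carry-out into 'bits' and counting them with popcount once every
 * 64 words. Returns the 64-bit sum with the carries folded back in. */
static uint64_t sum_deferred_carry(const void *buf, size_t nwords)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;       /* low 64 bits of the running sum */
    uint64_t carries = 0;   /* number of carry-outs so far */

    while (nwords) {
        size_t n = nwords < 64 ? nwords : 64;
        uint64_t bits = 0;  /* one bit per add in this chunk */

        for (size_t i = 0; i < n; i++) {
            uint64_t w;
            memcpy(&w, p, sizeof w);
            p += sizeof w;
            sum += w;                         /* normal add */
            bits = (bits << 1) | (sum < w);   /* shift in the carry */
        }
        carries += (uint64_t)__builtin_popcountll(bits);
        nwords -= n;
    }

    /* Add the carries back in (end-around carry). */
    sum += carries;
    if (sum < carries)
        sum++;
    return sum;
}
```

The add no longer waits on the carry flag from the previous add; the
carry is just data that gets summed later.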
I've not tried this on an old x86 cpu - which would need a software
'popcnt'. It got close to 8 bytes/clock on Ivy bridge.
It almost certainly beats the 4 bytes/clock of the current x86-64
code on such systems.
David