libfreevec benchmarks

Sun Aug 24 18:03:56 EST 2008

On Friday 22 August 2008 20:44:07 Ryan S. Arnold wrote:
> Do you have FSF (Free Software Foundation) copyright assignment yet?

Copyright assignment is not the issue; if there were interest in the first 
place, that would never have deterred me.

> How've you implemented the optimizations?

Scalar for small sizes, AltiVec for larger (>16 bytes, depending on the 
routine).
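
A minimal sketch of that kind of size-based dispatch, in portable C so it 
compiles anywhere. This is not libfreevec's code: the name sketch_memcpy and 
the SKETCH_VEC_BYTES threshold are hypothetical, and the 16-byte block copy 
only stands in for where a real AltiVec version would use vector loads and 
stores (and deal with alignment).

```c
#include <stddef.h>
#include <string.h>

/* One AltiVec register holds 128 bits = 16 bytes. The threshold below is
   an assumption for illustration; the real cut-off depends on the routine. */
#define SKETCH_VEC_BYTES 16

/* Hypothetical sketch: scalar copy for small sizes, block copy for larger. */
static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n > SKETCH_VEC_BYTES) {
        /* Large path: move whole 16-byte blocks. A real AltiVec version
           would use vector load/store instructions here instead. */
        while (n >= SKETCH_VEC_BYTES) {
            unsigned char block[SKETCH_VEC_BYTES];
            memcpy(block, s, SKETCH_VEC_BYTES);
            memcpy(d, block, SKETCH_VEC_BYTES);
            d += SKETCH_VEC_BYTES;
            s += SKETCH_VEC_BYTES;
            n -= SKETCH_VEC_BYTES;
        }
    }

    /* Small path and leftover tail: plain scalar byte copy. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Calling sketch_memcpy(dst, src, 5) takes only the scalar loop, while 
sketch_memcpy(dst, src, 50) copies three 16-byte blocks and then a 2-byte 
scalar tail.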

> Optimizations for individual architectures should follow the powerpc-cpu
> precedent for providing these routines, e.g.
>
> sysdeps/powerpc/powerpc32/power6/memcpy.S
> sysdeps/powerpc/powerpc64/power6/memcpy.S

That's the idea I got, but so far my understanding is that only 64-bit 
PowerPC/POWER CPUs are supported. What about 32-bit CPUs? libfreevec isn't 
ported to 64-bit yet (though I will finish that soon). Would it be enough to 
have one dir like, eg:

sysdeps/powerpc/powerpc32/altivec/

or would I have to refer to specific CPU models, eg 74xx, and use Implies for 
the rest?

> Today, if glibc is configured with --with-cpu=970 it will actually
> default to the power optimizations for the string routines, as indicated
> by the sysdeps/powerpc/powerpc[32|64]/970/Implies files.  It'd be worth
> verifying that your baseline glibc runs are against existing optimized
> versions of glibc.  If they're not then this is a fault of the distro
> you're testing on.

Well, I used Debian Lenny and OpenSuse 11.0 (using glibc 2.7 and glibc 2.8 
resp.). If it doesn't work as intended, then these are two popular distros 
with a broken glibc, which I would think is not very likely.

> I'm not aware of the status of some of the embedded PowerPC processors
> with-regard to powerpc-cpu optimizations.

Would the G4 and 8610 fall under the "embedded" PowerPC category?

> Our research found that for some tasks on some PowerPC processors the
> expense of reserving the floating point pipeline for vector operations
> exceeds the benefit of using vector insns for the task.

Well, I would advise *strongly* against that, except for specific cases, and 
not for OS-wide functions. For example, a popular 3D application such as 
Blender (or the Mesa 3D library) does a lot of memory copying along with 
lots of FPU math. If you use the FPU for plain memcpy/etc stuff, you 
essentially prevent the app from using it for the important stuff, ie math, 
and in the end you lose performance. On the other hand, the AltiVec unit 
remains unused all the time, and it's certainly more capable and more generic 
than the FPU for most of this stuff, not to mention that inside the same app 
the issue of context switching becomes unimportant.

> Generally our optimizations tend to favor data an average of 12 bytes
> with 1000 byte max.  We also favor aligned data and use the existing
> implementation as a model as a baseline for where we try to keep
> unaligned data performance from dropping below.

Please check the graphs of most libfreevec functions for sizes of 12-1000 
bytes. Apart from strlen(), which is the only glibc function that performs 
better overall than its libfreevec counterpart, most other functions offer 
the same performance for sizes up to 48/96 bytes, but beyond that 
libfreevec's performance increases dramatically due to the use of the vector 
unit.

> This research would be a good candidate for selectively replacing some
> of the existing libm functionality.  Do these results hold for all
> permutations of long double support?  Do they hold for x86/x86_64 as
> well as PowerPC?  I would suggest against a massive patch to libc-alpha
> and would instead recommend selective, individual replacement of
> fundamental routines to start with accompanied by exhaustive profile
> data.  You have to show that you're dedicated to maintenance of these
> routines and you can't overwhelm the reviewers with massive patches.

For the moment, my focus is on 32-bit floats only, but the algorithm is the 
same even for 64-bit/128-bit floating point numbers. It will just use more 
terms. And yes, as I said, it doesn't use AltiVec and is totally 
cross-platform (just plain C), and the code is very short too. I tested the 
code on an Athlon X2 as well and I get even better performance than on the 
PowerPC CPUs. For some reason, glibc (and FreeBSD libc for that matter, as I 
had a look around) use very complex source trees for no good reason. The 
implementation of a sinf(), for example, is no more than 20 lines of C.
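
To give an idea of the scale involved, here is a minimal sketch of a short, 
plain-C single-precision sine. It is not the code discussed above, which is 
not shown in this message: the name sketch_sinf, the range reduction, and 
the choice of a Taylor polynomial with five terms are all assumptions made 
for illustration. It is only meant for moderate arguments and makes no claim 
about accuracy or speed relative to glibc.

```c
/* Hypothetical sketch: sin(x) for moderate |x|, in plain C with no libm.
   Step 1: write x = k*pi + r with r in [-pi/2, pi/2].
   Step 2: evaluate an odd Taylor polynomial in r (Horner form).
   Step 3: apply sin(k*pi + r) = (-1)^k * sin(r). */
static float sketch_sinf(float x)
{
    const float pi = 3.14159265358979f;

    /* Round x/pi to the nearest integer (valid while it fits in an int). */
    int k = (int)(x / pi + (x >= 0.0f ? 0.5f : -0.5f));
    float r = x - (float)k * pi;
    float r2 = r * r;

    /* r - r^3/3! + r^5/5! - r^7/7! + r^9/9!; a double-precision
       version would simply carry more terms. */
    float p = r * (1.0f + r2 * (-1.0f / 6.0f
                 + r2 * (1.0f / 120.0f
                 + r2 * (-1.0f / 5040.0f
                 + r2 * (1.0f / 362880.0f)))));

    /* Odd k flips the sign. */
    return (k & 1) ? -p : p;
}
```

The truncation error of this polynomial is largest at |r| = pi/2, where the 
first dropped term, r^11/11!, is on the order of 4e-6.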

As for commitment, well, I've been working on this stuff since 2004 (with a 
~2y break because of other obligations, army, family, baby, etc :), and 
unless IBM/Freescale choose to dump AltiVec altogether, I don't see myself 
stopping work on it. To tell you the truth, the promotion of the vector unit 
by both companies has been a disappointment, in my eyes at least, so I might 
just as well switch platforms... But that won't happen yet anyway.

> Any submission to GLIBC is going to require that you and your code
> follow the GLIBC process or it'll probably be ignored.  You can engage
> me directly via CC and I can help you understand how to integrate the
> code but I can't give you a free pass or do the work for you.

I never asked for that. However, it's more important to me to show first that 
the code is worth including, and then, *if* it's proven worthy, we can worry 
about stuff like copyright assignment, etc.

> The new libc-help mailing list was also created as a place for people to
> learn the process and get the patches in a state where they're ready to
> be submitted to libc-alpha.

I will take a look, thanks for that info.

Konstantinos Margaritis
Codex
http://www.codex.gr