[Cbe-oss-dev] initial performance comparison
Luke Browning
lukebr at linux.vnet.ibm.com
Fri Nov 2 03:56:26 EST 2007
Here is a bit of performance analysis showing
that the new scheduler is roughly 10X more efficient
when the system is overcommitted. The new scheduler
is also better even when running just a single job.
regards, Luke
====================================
Single Job
====================================
Executing the following command:
time ./matrix_mul -i 200 -m 512 -s 16
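As a minimal sketch (only the single command above is from the original
run; the loop below is an assumption, with ./matrix_mul built in the
current directory), the three trials reported below can be reproduced
back to back with:

#!/bin/sh
# Run the single-job benchmark three times in a row,
# matching the three timing samples reported below.
for i in 1 2 3; do
    time ./matrix_mul -i 200 -m 512 -s 16
done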
======================================
New spufs scheduler had these numbers:
real 0m2.488s
user 0m0.281s
sys 0m0.154s
real 0m2.489s
user 0m0.279s
sys 0m0.156s
real 0m2.487s
user 0m0.280s
sys 0m0.154s
See the old scheduler numbers below. Note that the elapsed time above
is roughly half that of the old scheduler, user time is also
substantially lower, and even system time is about 30% less.
======================================
Old spufs scheduler:
real 0m4.063s
user 0m0.456s
sys 0m0.230s
real 0m4.567s
user 0m0.457s
sys 0m0.215s
real 0m3.926s
user 0m0.439s
sys 0m0.211s
The reduction in system time I attribute to streamlining
and rearranging the logic in the loop. I also eliminated
a global lock reference and some calls to check_signal().
======================================
Overcommitted examples
======================================
Running the following from a shell script:
time ./matrix_mul -i 200 -m 512 -s 16 &
time ./matrix_mul -i 200 -m 512 -s 16 &
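As a sketch only (the actual driver was just the two background commands
above; the N argument and the loop are assumptions for illustration), the
over-commit test can be parameterized for N concurrent jobs:

#!/bin/sh
# Launch N concurrent copies of the benchmark in the background,
# then wait for all of them to finish.  N=2 matches the two-job
# results below; N=3 matches the three-job results further down.
N=${1:-2}
i=0
while [ "$i" -lt "$N" ]; do
    time ./matrix_mul -i 200 -m 512 -s 16 &
    i=$((i + 1))
done
wait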
New spufs scheduler:
real 0m18.318s 0m18.425s (each column is a separate job)
user 0m0.190s 0m0.291s
sys 0m0.134s 0m0.203s
real 0m14.463s 0m14.603s
user 0m0.239s 0m0.410s
sys 0m0.140s 0m0.176s
real 0m15.465s 0m15.844s
user 0m0.232s 0m0.504s
sys 0m0.187s 0m0.190s
Note that system time increases only slightly in the new scheduler:
0.154-0.156s (new, single job) vs. 0.134-0.203s (new, two jobs).
Old spufs scheduler:
real 1m14.541s 1m14.542s
user 0m0.464s 0m0.460s
sys 0m11.086s 0m10.466s
real 1m12.540s 1m12.621s
user 0m0.446s 0m0.451s
sys 0m11.419s 0m10.112s
real 1m3.298s 1m3.478s
user 0m0.467s 0m0.465s
sys 0m8.107s 0m9.942s
Note the explosion in sys time: roughly 10 seconds per job! This is
probably caused by the extra context switching that I eliminated in the
new scheduler. The secondary impact of this context switching
is that user code must wait longer to run.
Parallel applications like matrix_mul suffer the most, as user
code must wait to synchronize SPE output.
The new scheduler scales much better. It is roughly 10
times as efficient (sys time comparison) with multiple jobs.
=======================================
Running three instances of the job.
new spufs scheduler:
real 0m33.682s 0m33.682s 0m33.683s
user 0m0.379s 0m0.438s 0m0.364s
sys 0m0.171s 0m0.140s 0m0.186s
real 0m22.858s 0m28.083s 0m28.101s
user 0m0.278s 0m0.496s 0m0.506s
sys 0m0.199s 0m0.297s 0m0.162s
real 0m30.015s 0m31.840s 0m32.277s
user 0m0.545s 0m0.283s 0m0.540s
sys 0m0.255s 0m0.258s 0m0.192s
Note that adding a third job did not significantly increase
system overhead. It went up from 0.134-0.203s to 0.140-0.297s,
which is not statistically significant, particularly when
you throw out the best and worst numbers. That yields
0.162-0.258s, showing that the new scheduler is very
deterministic and scales well.
old spufs scheduler:
real 1m14.541s 1m14.542s XXX (system hang)
user 0m0.464s 0m0.460s XXX
sys 0m11.086s 0m10.466s XXX
The three-job run never fully completed (the system hung), but 2 of the
3 jobs did finish, with essentially the same numbers as the two-job case.
The new spufs scheduler is again roughly 10X more efficient with 3 jobs.
regards, Luke