Benchmarking on POWER5 server...

Michael Neuling mikey at
Mon Feb 6 15:05:16 EST 2006


We have a 16-thread POWER5 machine (2 NUMA nodes, 2 chips per node,
2 cores per chip, 2 threads per core) with 16 GB of memory.  I've done
some benchmarking by compiling the Linux kernel and measuring how
several different options affect the compile time.  All tests were run
with distcc off.  This machine has reasonably good disk I/O, so other
large machines may get different results.  Tests were performed with
ccontrol 0.8.4 with tonyb's semaphore fix and ccache 2.4.  For each
config I did a few runs and picked the two that seemed to represent the
most consistent times (in lay speak: highly unscientific).
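
The runs above were picked by eye.  If you want that choice to be
repeatable, one option is to keep the pair of runs whose times lie
closest together.  This is only a sketch: most_consistent_pair is a
hypothetical helper, not part of ccontrol, and the sample numbers in it
are illustrative rather than measurements from this machine.

```python
from itertools import combinations


def most_consistent_pair(times):
    """Return the two run times (seconds) that lie closest together.

    Hypothetical helper: a repeatable stand-in for picking runs by eye.
    """
    if len(times) < 2:
        raise ValueError("need at least two runs")
    # Compare every pair of runs and keep the pair with the smallest spread.
    a, b = min(combinations(times, 2), key=lambda p: abs(p[0] - p[1]))
    return tuple(sorted((a, b)))


if __name__ == "__main__":
    # Illustrative wall-clock times only, not results from this post.
    print(most_consistent_pair([48.7, 52.3, 46.4, 47.0]))
```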

Results are below (2 runs per config shown; columns are run 1 and run 2):

 cpus = 24, warm ccache
  real    0m48.724s     0m46.442s
  user    5m30.411s     5m30.315s
  sys     1m14.379s     1m14.932s

 cpus = 16, warm ccache
  real    0m48.594s     0m47.191s
  user    5m25.503s     5m22.890s
  sys     1m13.584s     1m13.033s

 cpus = 8, warm ccache
  real    0m55.128s     0m57.276s
  user    4m50.494s     4m50.390s
  sys     1m8.749s      1m8.806s

 cpus = 4, warm ccache
  real    1m19.286s     1m14.782s
  user    4m33.085s     4m30.333s
  sys     1m4.092s      1m3.434s

 cpus = 16, no ccache
  real    1m46.624s     1m46.504s
  user    21m11.198s    21m13.730s
  sys     1m38.683s     1m38.648s

 cpus = 16, cold ccache (rm -rf ~/.ccache)
  real    1m52.340s     1m52.960s
  user    22m47.736s    22m50.005s
  sys     2m0.261s      2m0.187s

 cpus = 16, cold ccache (rm -rf ~/.ccache), with CCACHE_NOSTATS set
  real    1m53.605s     1m54.222s
  user    22m51.339s    22m41.690s
  sys     2m0.929s      2m0.490s
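
To compare configs it helps to turn the XmY.YYYs values into seconds
and average the two runs.  A minimal sketch, assuming the value format
used in the table above; parse_time, mean_seconds and REAL are
hypothetical names, and REAL simply copies the real (wall-clock) rows.

```python
import re

# Matches values such as "0m48.724s" or "21m11.198s".
_TIME_RE = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def parse_time(text):
    """Convert a value such as '1m46.624s' to seconds."""
    m = _TIME_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised time: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def mean_seconds(runs):
    """Average a list of time strings, in seconds."""
    return sum(parse_time(r) for r in runs) / len(runs)


# Real (wall-clock) times copied from the results above.
REAL = {
    "cpus=24 warm": ["0m48.724s", "0m46.442s"],
    "cpus=16 warm": ["0m48.594s", "0m47.191s"],
    "cpus=8 warm": ["0m55.128s", "0m57.276s"],
    "cpus=4 warm": ["1m19.286s", "1m14.782s"],
    "cpus=16 none": ["1m46.624s", "1m46.504s"],
    "cpus=16 cold": ["1m52.340s", "1m52.960s"],
}

if __name__ == "__main__":
    for config, runs in REAL.items():
        print(f"{config:14s} {mean_seconds(runs):7.2f} s")
```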

Some brief analysis:
 Setting cpus > 16 doesn't help.
  - that gives more processes than hardware threads
 Setting cpus = 16 rather than 8 helps a bit (about 47-49s vs 55-57s real).
  - so set up one process per hardware thread rather than per core
 Setting cpus < 8 slows you down a lot (about 75-79s real with cpus = 4).
  - that is fewer processes than cores
 So it's best to keep all the threads busy.
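
The topology above works out to 8 cores and 16 hardware threads, and
the analysis boils down to comparing the cpus setting against those two
numbers.  A rough sketch of that rule; classify is a hypothetical
helper, and settings between the tested points (4, 8, 16, 24) are
extrapolated rather than measured.

```python
import os

# Topology described in the post: 2 NUMA nodes x 2 chips x 2 cores x 2 threads.
NODES, CHIPS_PER_NODE, CORES_PER_CHIP, THREADS_PER_CORE = 2, 2, 2, 2
CORES = NODES * CHIPS_PER_NODE * CORES_PER_CHIP  # 8
THREADS = CORES * THREADS_PER_CORE  # 16


def classify(cpus, cores=CORES, threads=THREADS):
    """Describe a cpus setting relative to the machine (hypothetical helper).

    Untested values between the measured points are extrapolations.
    """
    if cpus > threads:
        return "more processes than hardware threads: no further gain"
    if cpus == threads:
        return "one process per hardware thread: all threads busy"
    if cpus >= cores:
        return "at least one process per core, some threads idle: a bit slower"
    return "fewer processes than cores: much slower"


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs, i.e. hardware threads on an SMT box.
    print("logical CPUs on this host:", os.cpu_count())
    for cpus in (24, 16, 8, 4):
        print(cpus, classify(cpus))
```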

 ccache helps a lot when the cache is warm: with cpus = 16 the build
 drops from about 1m47s real with no ccache to about 48s.
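
The size of the warm-cache win is a simple ratio of mean wall-clock
times at cpus = 16, which comes out at roughly 2.2x.  The values below
are the table's real times converted to seconds; mean and speedup are
hypothetical helper names.

```python
# Wall-clock (real) times for cpus = 16, converted from the table to seconds.
no_ccache = [106.624, 106.504]  # 1m46.624s, 1m46.504s
warm_ccache = [48.594, 47.191]  # 0m48.594s, 0m47.191s


def mean(xs):
    return sum(xs) / len(xs)


def speedup(baseline, improved):
    """Ratio of mean baseline time to mean improved time (>1 means faster)."""
    return mean(baseline) / mean(improved)


if __name__ == "__main__":
    print(f"warm ccache speedup: {speedup(no_ccache, warm_ccache):.2f}x")
```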

 ccache does cost something to maintain the cache when it is cold:
 with cpus = 16, a cold-cache build took about 1m53s real (2m0s sys)
 versus about 1m47s real (1m39s sys) with no ccache.  With so many
 processes present (ccontrol spawns 20 x cpus), I thought this could
 be due to the overhead of locking the ccache stats file.  Hence, I did
 a run with CCACHE_NOSTATS set, but it didn't improve the performance.
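
The cold-cache cost can be put as a percentage over the no-ccache
baseline: roughly 6% in real time and about 22% in sys time at
cpus = 16.  The sketch below does that arithmetic and also builds an
environment with CCACHE_NOSTATS set, which ccache 2.4 checks to skip
updating its statistics files.  overhead_pct and nostats_env are
hypothetical helpers; how you launch the build with that environment
is left to you.

```python
import os

# cpus = 16 timings from the table above: mean of the two runs, in seconds.
REAL = {
    "none": (106.624 + 106.504) / 2,
    "cold": (112.340 + 112.960) / 2,
    "cold_nostats": (113.605 + 114.222) / 2,
}
SYS = {
    "none": (98.683 + 98.648) / 2,
    "cold": (120.261 + 120.187) / 2,
    "cold_nostats": (120.929 + 120.490) / 2,
}


def overhead_pct(baseline, other):
    """Extra time of `other` relative to `baseline`, in percent."""
    return 100.0 * (other - baseline) / baseline


def nostats_env(base=None):
    """Copy of the environment with CCACHE_NOSTATS set (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    env["CCACHE_NOSTATS"] = "1"  # ccache only checks that the variable is set
    return env


if __name__ == "__main__":
    for key in ("cold", "cold_nostats"):
        print(f"{key:13s} real +{overhead_pct(REAL['none'], REAL[key]):.1f}%"
              f"  sys +{overhead_pct(SYS['none'], SYS[key]):.1f}%")
```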

More information about the ccontrol mailing list