Help with finding memory read performance problem

Fri Sep 17 06:12:20 EST 2010

For our code we needed a fast memory compare of 5 buffers.  I've implemented
said routine in asm and it works fine and is very fast in the test bench. 
However when integrated with the app it is much less performant and we 
are trying to figure out why.

The app in question gets the 5 4MB buffers in the kernel via kmalloc and
then uses them for DMA.  No other methods are being called for the memory
besides kmalloc.  This is an embedded system on the 460EX, so there is no
drive, only RAM.  Within the user code mmap is called on these buffers
physical address and they are given to the compare routine.  The result 
is slow.  If I allocate buffers in user space then the performance is
excellent.  

Next I implemented my compare routine within a kernel module so that it
was using the kernel virtual addresses for each of the buffers. I did 
not see any change between this and the mmap approach.  

For comparison sake, using the kernel memory is about 19s whereas user
memory is about 11s for the same size / configuration of buffer.  In the
test bench the algorithm is about 8s.  The processor is not doing any
other intensive tasks during these tests and the times are repeatable.

Is something happening to mmap'd memory that causes the access to it to
be slow?  Is there a way to speed that up?  Why are the kernel memory
access slower than user memory?  

What is the best overall approach?  Is it to DMA into user memory and
then run the routines there?  Is kmalloc not the best approach for 
kernel DMA memory?

This is on linux 2.6.31.5 on 460EX

thanks
ayman