Worst case performance of up()
Benjamin Herrenschmidt
benh at kernel.crashing.org
Sat Nov 25 07:45:24 EST 2006
On Fri, 2006-11-24 at 16:21 +0000, Adrian Cox wrote:
> First the background: I've been investigating poor performance of a
> Firewire capture application, running on a dual-7450 board with a 2.6.17
> kernel. The kernel is based on a slightly earlier version of the
> mpc7448hpc2 board port, using arch/powerpc, which I've not yet updated
> to reflect the changes made when the board support entered the
> mainstream kernel.
>
> The application runs smoothly on a single processor. On the dual
> processor machine, the application sometimes suffers a drop in
> frame-rate, simultaneous with high CPU usage by the Firewire kernel
> thread.
>
> Further investigation reveals that the kernel thread spends most of the
> time in one line: up(&fi->complete_sem) in __queue_complete_req() in
> drivers/ieee1394/raw1394.c. It seems that whenever the userspace thread
> calling raw1394_read() is scheduled on the opposite CPU to the kernel
> thread, the kernel thread takes much longer to execute up() - typically
> 10000 times longer.
>
> Does anybody have any ideas what could make up() take so long in this
> circumstance? I'd expect cache transfers to make the operation about 100
> times slower, but this looks like repeated cache ping-pong between the
> two CPUs.
Is it hung in up() (toplevel) or __up() (low level)?
The former is mostly just an atomic_add_return, which boils down to:
static __inline__ int atomic_add_return(int a, atomic_t *v)
{
        int t;

        __asm__ __volatile__(
        LWSYNC_ON_SMP
"1:     lwarx   %0,0,%2         # atomic_add_return\n\
        add     %0,%1,%0\n"
        PPC405_ERR77(0,%2)
"       stwcx.  %0,0,%2 \n\
        bne-    1b"
        ISYNC_ON_SMP
        : "=&r" (t)
        : "r" (a), "r" (&v->counter)
        : "cc", "memory");

        return t;
}
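For completeness, the toplevel up() is not much more than that increment;
roughly (from memory of the 2.6-era include/asm-powerpc/semaphore.h, so
treat it as a sketch), with __up() being the separate slow path that only
runs when somebody is actually sleeping on the semaphore:

        /* Sketch of the toplevel up(): the common case is just the atomic
         * increment above; __up() (the wakeup slow path) is only entered
         * when the count was <= 0, i.e. when somebody is waiting. */
        static inline void up(struct semaphore *sem)
        {
                if (unlikely(atomic_inc_return(&sem->count) <= 0))
                        __up(sem);
        }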
So yes, on SMP, you get an additional sync and isync in there, though
I'm surprised that you hit a code path where that would make such a big
difference (unless you are really up'ing a zillion times per sec).
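For the record, those barrier macros expand to roughly this on a 32-bit
SMP build (again from memory of the 2.6-era include/asm-powerpc/synch.h,
so take it as a sketch; 64-bit parts get the lighter lwsync instead of
the full sync):

        /* Rough sketch: full sync before the loop, isync after it,
         * and nothing at all on UP builds. */
        #ifdef CONFIG_SMP
        #define LWSYNC_ON_SMP   "\tsync\n"
        #define ISYNC_ON_SMP    "\n\tisync\n"
        #else
        #define LWSYNC_ON_SMP
        #define ISYNC_ON_SMP
        #endif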
Have you tried some oprofile runs to catch the exact instruction where
the cycles appear to be wasted?
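Failing that, a cruder way to confirm it really is the up() itself would
be to wrap the call in timebase reads and log the worst case, something
along these lines (hypothetical instrumentation, not in the driver --
timed_up() and worst_up_ticks are just names I made up for the sketch):

        #include <linux/kernel.h>       /* printk */
        #include <asm/semaphore.h>      /* struct semaphore, up() */
        #include <asm/timex.h>          /* get_cycles() reads the timebase */

        /* Hypothetical wrapper around the suspect up() in
         * __queue_complete_req(): logs the worst latency seen,
         * in timebase ticks. */
        static cycles_t worst_up_ticks;

        static void timed_up(struct semaphore *sem)
        {
                cycles_t t0 = get_cycles();
                cycles_t delta;

                up(sem);

                delta = get_cycles() - t0;
                if (delta > worst_up_ticks) {
                        worst_up_ticks = delta;
                        printk(KERN_DEBUG "up() took %lu timebase ticks\n",
                               (unsigned long)delta);
                }
        }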
Maybe there is some contention on the reservation (though it would be a
bit strange to have contention on an up...), or somehow the semaphore
ends up sharing a cache line with something else. That would cause a
performance problem.
Have you tried moving the semaphore away from whatever other data might
be manipulated at the same time? Into its own cache line, maybe?
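Something along these lines would rule that out quickly (purely
hypothetical layout -- I haven't checked how the real struct file_info in
raw1394 is arranged, hence the made-up name and fields):

        #include <linux/cache.h>        /* ____cacheline_aligned_in_smp */
        #include <linux/list.h>
        #include <linux/spinlock.h>
        #include <asm/semaphore.h>

        /* Hypothetical layout: start the semaphore on its own cache line
         * and push the next member onto the following line, so stores to
         * neighbouring fields from the other CPU can't keep stealing the
         * line (and the lwarx reservation) out from under the up(). */
        struct file_info_sketch {
                struct list_head req_pending;
                struct list_head req_complete;
                spinlock_t reqlists_lock;

                struct semaphore complete_sem ____cacheline_aligned_in_smp;

                struct list_head other_lists ____cacheline_aligned_in_smp;
        };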
Cheers,
Ben.