[PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

Sowmini Varadhan sowmini.varadhan at oracle.com
Tue Mar 31 08:15:25 AEDT 2015


On (03/30/15 09:01), Sowmini Varadhan wrote:
> 
> So I tried looking at the code, and perhaps there is some arch-specific
> subtlety here that I am missing, but where does spin_lock itself
> do the cpu_relax? afaict, LOCK_CONTENDED() itself does not have this.

To answer my own question:
I'd missed CONFIG_LOCK_STAT (which David Ahern pointed out to me);
the above is only true when CONFIG_LOCK_STAT is enabled.
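
For anyone else tracing this: with CONFIG_LOCK_STAT the lock path goes
through the LOCK_CONTENDED() wrapper, which tries the lock first and
records the contention before blocking; without LOCK_STAT it collapses
to a plain lock(). Roughly (quoting from memory of trees around this
time; check include/linux/lockdep.h in your own tree):

	#ifdef CONFIG_LOCK_STAT
	#define LOCK_CONTENDED(_lock, try, lock)			\
	do {								\
		if (!try(_lock)) {					\
			lock_contended(&(_lock)->dep_map, _RET_IP_);	\
			lock(_lock);					\
		}							\
		lock_acquired(&(_lock)->dep_map, _RET_IP_);		\
	} while (0)
	#else
	#define LOCK_CONTENDED(_lock, try, lock) \
		lock(_lock)
	#endif

The cpu_relax() busy-wait itself lives further down, in the arch-level
spinlock implementation, not in this macro.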

In any case, I ran some experiments today: I was running
iperf [http://en.wikipedia.org/wiki/Iperf] over ixgbe, which
is where I'd noticed the original perf issues on sparc. I was
running iperf2 (which is more aggressively threaded than iperf3) with
8, 10, 16, and 20 threads, and with TSO turned off. In each case, I
made sure that I was able to reach 9.X Gbps (this is a 10 Gbps link).

I don't see any significant difference in the perf profile between the
spin_trylock and the spin_lock versions (other than, of course, the
change in lock contention for the trylock version). I also looked at
the perf-profiled cache-misses (these work out to about 1400M for 10
threads, with or without the trylock).

I'm still waiting for some of the IB folks to try out the spin_lock
version (they had also seen significant perf improvements from
breaking down the monolithic lock into multiple pools, so their
workload is also sensitive to this).

But as such, it looks like it doesn't matter much whether you use
the trylock to find the first available pool or block on the spin_lock.
I'll let folks on this list vote on this one (assuming the IB tests also
come out without a significant variation between the two locking choices).
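
For concreteness, the two candidates look roughly like this (a sketch
only: the pool count, field names, and the starting-pool choice below
are made up for illustration and are not taken from the actual patch):

	#include <linux/spinlock.h>
	#include <linux/smp.h>

	#define NPOOLS 16	/* illustrative; the patch sizes this differently */

	struct iommu_pool {
		spinlock_t lock;
		/* ... per-pool allocation state ... */
	};

	static struct iommu_pool pools[NPOOLS];

	/* Option 1: pick a pool and block until its lock is free. */
	static struct iommu_pool *get_pool_blocking(void)
	{
		struct iommu_pool *p = &pools[raw_smp_processor_id() % NPOOLS];

		spin_lock(&p->lock);
		return p;
	}

	/*
	 * Option 2: trylock, hopping to the next pool on contention.
	 * (A real implementation would bound this loop.)
	 */
	static struct iommu_pool *get_pool_trylock(void)
	{
		unsigned int i = raw_smp_processor_id() % NPOOLS;

		while (!spin_trylock(&pools[i].lock))
			i = (i + 1) % NPOOLS;
		return &pools[i];
	}

Either way the contention domain is a single pool instead of the whole
table, which is where the original win came from; the difference is
only in whether a contended CPU waits or hops.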

--Sowmini

