[Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Alexander van Heukelum heukelum at fastmail.fm
Mon Nov 10 02:16:45 EST 2008


On Tue, 4 Nov 2008 21:44:00 +0100, "Ingo Molnar" <mingo at elte.hu> said:
> * Alexander van Heukelum <heukelum at fastmail.fm> wrote:
> > On Tue, 4 Nov 2008 18:05:01 +0100, "Andi Kleen" <andi at firstfloor.org>
> > said:
> > > > not taking into account the cost of cs reading (which I
> > > > don't suspect to be that expensive apart from writing,
> > > 
> > > GDT accesses have an implied LOCK prefix. Especially
> > > on some older CPUs that could be slow.
> > > 
> > > I don't know if it's a problem or not but it would need
> > > some careful benchmarking on different systems to make sure interrupt 
> > > latencies are not impacted.
> 
> That's not a real issue on anything produced in this decade as we have 
> had per CPU GDTs in Linux for about a decade as well.
> 
> It's only an issue on ancient CPUs that export all their LOCKed cycles 
> to the bus. Pentium and older or so. The PPro got it right already.
> 
> What matters is what i said before: the actual raw cycle count before 
> and after the patch, on the two main classes of CPUs, and the amount 
> of icache we can save.
> 
> > That's good to know. I assume this LOCKed bus cycle only occurs if 
> > the (hidden) segment information is not cached in some way? How many 
> > segments are typically cached? In particular, does it optimize 
> > switching between two segments?
> > 
> > > Another reason I would be also careful with this patch is that it 
> > > will likely trigger slow paths in JITs like qemu/vmware/etc.
> > 
> > Software can be fixed ;).
> 
> Yes, and things like vmware were never a reason to hinder Linux.
> 
> > > Also code segment switching is likely not something that current 
> > > and future micro architectures will spend a lot of time 
> > > optimizing.
> > >
> > > I'm not sure that risk is worth the small improvement in code 
> > > size.
> > 
> > I think it is worth exploring a bit more. I feel it should be a 
> > neutral change worst-case performance-wise, but I really think the 
> > new code is more readable/understandable.
> 
> It's all measurable, so the vague "risk" mentioned above can be 
> dispelled via hard numbers.
> 
> > > An alternative BTW to having all the stubs in the executable would 
> > > be to just dynamically generate them when the interrupt is set up. 
> > > Then you would only have the stubs around for the interrupts which 
> > > are actually used.
> > 
> > I was trying to simplify things, not make it even less transparent 
> > ;).
> 
> yep, the complexity of dynamic stubs is the last thing we need here.
> 
> And as hpa's comments point out, compressing the rather stupid irq 
> stubs might be a third option that looks promising as well.
> 
> 	Ingo

Hi all,

I have spent some time trying to find out how expensive the
segment-switching patch is. I only have one computer available at
the moment: a "Sempron 2400+", a 32-bit-only machine.

The first test measures the runtime of "hackbench 10" executed in
a loop; the reported value is the average of more than 100 runs.
Timings were taken for two separate boots of the system.
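
For the record, the timing loop amounts to something like the C
harness below. The run count, the use of system() and the error
estimate are illustrative only, not the exact setup used for the
numbers further down:

/* Illustrative only: time "hackbench 10" RUNS times and report the
 * mean and the standard error of the mean.
 * Build with: gcc -O2 -o bench bench.c -lrt -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define RUNS 100

int main(void)
{
	struct timespec t0, t1;
	double dt, sum = 0.0, sumsq = 0.0, mean, sem;
	int i;

	for (i = 0; i < RUNS; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (system("hackbench 10 > /dev/null"))
			return 1;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		dt = (t1.tv_sec - t0.tv_sec) +
		     (t1.tv_nsec - t0.tv_nsec) * 1e-9;
		sum += dt;
		sumsq += dt * dt;
	}
	mean = sum / RUNS;
	sem = sqrt((sumsq - sum * sum / RUNS) / (RUNS - 1) / RUNS);
	printf("hackbench 10: %.3f s +- %.3f s (%d runs)\n",
	       mean, sem, RUNS);
	return 0;
}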

The second test measures the time difference in ticks between two
(almost) consecutive rdtsc instructions and builds a histogram of
those differences. The (cumulative) histograms are plotted for an
otherwise idle system and for a system running "hackbench 10" in a
loop. The histogram is plotted on a log-log scale: the horizontal
axis shows the length of the latency, and the vertical axis shows
the number of occurrences of a measured time difference _greater
than_ the value on the horizontal axis.
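
The rdtsc test is conceptually something like the sketch below; the
bin layout, sample count and output format are mine and only meant
to show the idea:

/* Illustrative sketch of the rdtsc latency test: read the TSC in a
 * tight loop, bin the difference between consecutive readings by its
 * highest set bit, and print cumulative counts.  Large deltas mean
 * the loop was interrupted (interrupts, scheduling), so the tail of
 * the histogram shows how often and for how long we were away. */
#include <stdio.h>
#include <stdint.h>

#define BINS	32
#define SAMPLES	(1UL << 26)

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	static unsigned long hist[BINS];
	uint64_t prev, now, delta;
	unsigned long i, cum;
	int b;

	prev = rdtsc();
	for (i = 0; i < SAMPLES; i++) {
		now = rdtsc();
		delta = now - prev;
		prev = now;
		/* bin by the position of the highest set bit */
		for (b = 0; b < BINS - 1 && (delta >> (b + 1)); b++)
			;
		hist[b]++;
	}

	/* cumulative: number of gaps of at least 2^b ticks */
	for (cum = 0, b = BINS - 1; b >= 0; b--) {
		cum += hist[b];
		printf("%10llu ticks or more: %lu\n",
		       (unsigned long long)1 << b, cum);
	}
	return 0;
}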

before:
hackbench 10: 1.698(2)s 1.712(1)s
hackbench 10 + rdtsctest: hackbench time 1.736(2)s 1.712(3)s

after:
hackbench 10: 1.775(1)s, 1.740(1)s
hackbench 10 + rdtsctest: hackbench time 1.734(1)s 1.723(1)s

In general: after applying the patch, latencies show up more often
in the rdtsc test. The patch also seems to slow down hackbench by a
small percentage (roughly 2-5% on the plain hackbench runs above).
Looking at the latency histograms I believe this is a real effect,
but I could not do enough boots/runs to establish it with certainty
from the runtimes alone.

At least for this PC, hpa's suggested cleanup of the stub table
looks like the right way to go for now... A second option would be
to get rid of the stub table entirely by assigning each important
vector a unique handler and making sure those handlers do not rely
on the vector number at all.
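
To make that second option more concrete, here is a toy user-space
model (nothing to do with the real IDT setup code) of the difference
between dispatching through a common handler that needs the vector
number and wiring each used vector directly to its own handler:

/* Toy illustration only -- not kernel code.  In the stub-table
 * scheme every vector funnels into a common dispatcher that is keyed
 * by the vector number; in the alternative each used vector gets a
 * dedicated handler and the vector number is never consulted. */
#include <stdio.h>

#define NR_VECTORS 256

static void (*vector_table[NR_VECTORS])(void);

/* stub-table style: generic dispatch, needs the vector number */
static void common_dispatch(int vector)
{
	if (vector_table[vector])
		vector_table[vector]();
}

/* unique-handler style: each handler already knows what it serves */
static void timer_handler(void) { puts("timer tick"); }
static void nic_handler(void)   { puts("nic irq"); }

int main(void)
{
	vector_table[0xef] = timer_handler;	/* hypothetical vector */
	common_dispatch(0xef);			/* needs "0xef" here */

	nic_handler();	/* direct wiring: no vector number involved */
	return 0;
}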

Greetings,
    Alexander
-- 
  Alexander van Heukelum
  heukelum at fastmail.fm


[Attachment: latency.png (7304 bytes) - cumulative rdtsc latency
histograms:
<http://lists.ozlabs.org/pipermail/lguest/attachments/20081109/d5f134f9/attachment.png>]

