[Lguest] [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

Sun Nov 30 02:45:16 EST 2008

On Fri, Nov 28, 2008 at 09:48:10PM +0100, Alexander van Heukelum wrote:
> On Thu, Nov 27, 2008 at 12:13:43PM +0200, Avi Kivity wrote:
> > H. Peter Anvin wrote:
> > >
> > >>I suspect we could get it down to three bytes, by sharing the last 
> > >>byte of the four-byte call sequence with the first byte of the next:
> > >>
> > >> 66 e8 ff 66 e8 fc 66 e8 f9 66 e8 f6 ...
> > >>
> > >>Every three bytes a new stub begins; it's a four-byte call to offset 
> > >>0x6703 relative to the beginning of the first stub.
> > >>
> > >>Can anyone better 24 bits/stub?
> > >
> > >On the entirely silly level...
> > >
> > >CC xx
> > 
> > Nice.  Can actually go to zero, by pointing the IDT at (unmapped_area + 
> > vector), and deducing the vector in the page fault handler from cr2.
> 
> Hi all,
> 
> We started the discussion with doing away with the whole jump
> array entirely, by changing the value of the CS index in the
> IDT. This needs the GDT to be extended with 256 entries, but an
> entire page (space for 512 entries) was already reserved anyhow!
> I think there is still some problem with the patch I sent due to
> some code depending on certain values of the CS index, but the
> system I've benchmarked on seemed to behave.
> 
> I did a set of benchmarks on an 8-way Xeon in 64-bit mode. The
> system was loaded with an instance of bonnie++ pinned to processor
> 0, and all 8 processors were running a program doing (almost)
> adjacent rdtsc's. Bonnie++ causes interrupts and the latencies
> due to these show up as larger time intervals. Complete runs of
> bonnie++ in fast mode were sampled this way for a current -rc6
> kernel and an -rc6 kernel plus my patch. The total sampling time
> was 30 minutes for each run. Per kernel I did one run as a warm-up
> and another two runs to measure the latencies. The results for
> measured latencies between 5 and 1000 microseconds are shown in
> the attached graph. Above 1000 microseconds there is only one big
> contribution: at 40000 microseconds ;). The surface below the graph
> is a measure of time.
> 
> Observations (for this test load!):
> 
> Near 200, 250 and 350 microseconds, the peaks shift to longer
> latencies for the cs-changing code by about 10 microseconds,
> but the total time spent is pretty much constant.
> 
> The highest latencies for the cs-changing code are near 600
> and 650 microseconds. The highest latencies for the current
> code are near 800 and 850 microseconds.
> 
> The total surface of the graphs between 5 and 1000 microseconds
> is within an error estimate of 1% equal for both cases, and is
> about 0.69% of the total time.
> 
> Most time is spent measuring 'latencies' of less than 5 micro-
> seconds, since bonnie++ is taking only about 5% cpu time on a
> single cpu most of the time, and only up to 50% on a single cpu
> during a short time in the file creation benchmark.

I now did the benchmarks for the same -rc6 with hpa's 4-byte stubs
too. Same machine. It's significantly better than the other two
options in terms of speed. It takes about 7% less cpu to handle
the interrupts. (0.64% cpu instead of 0.69%.) I have to run now,
I'll let interpreting the histogram to someone else ;).

Greetings,
	Alexander

-------------- next part --------------
A non-text attachment was scrubbed...
Name: load.png
Type: image/png
Size: 11850 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/lguest/attachments/20081129/ee72ea63/attachment.png>