kvm PCI assignment & VFIO ramblings

Avi Kivity avi at redhat.com
Tue Aug 2 23:39:36 EST 2011


On 08/02/2011 03:58 PM, Benjamin Herrenschmidt wrote:
> >  >
> >  >  What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> >  >  right ?).
> >
> >  (16 or 25)
>
> 25 levels ? You mean 25 loads to get to a translation ? And you get any
> kind of performance out of that ? :-)
>

Aggressive partial translation caching.  Even then, performance does 
suffer on memory-intensive workloads.  The fix was transparent 
hugepages: they make the page table walks much faster since the tables 
are fully cached, the partial translation caches become more effective, 
and the TLB itself becomes more effective.  On some workloads, THP on 
both guest and host was faster than no-THP on bare metal.
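
(Back-of-the-envelope, for anyone counting along -- this is just the 
usual worst-case arithmetic, not a statement about any particular 
implementation: with g guest paging levels and h host levels, each of 
the g guest PTE fetches needs a full h-level host walk plus the fetch 
itself, and the final guest-physical address needs one more host walk, 
so

    loads = g*(h+1) + h = (g+1)*(h+1) - 1

which for g = h = 4 gives 24 page-table loads, or 25 memory references 
counting the final data access.)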

> >  >
> >  >  Not sure what you mean... the guest calls h-calls for every iommu page
> >  >  mapping/unmapping, yes. So the performance of these is critical. So yes,
> >  >  we'll eventually do it in kernel. We just haven't yet.
> >
> >  I see.  x86 traditionally doesn't do it for every request.  We had some
> >  proposals to do a pviommu that does map every request, but none reached
> >  maturity.
>
> It's quite performance critical; you don't want to go anywhere near a
> full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
> straight off the interrupt handlers, with the CPU still basically
> operating in guest context with HV permission. That is, basically do
> the permission check and translation and whack the HW iommu
> immediately. If for some reason one step fails (!present PTE or
> something like that), we'd then fall back to an exit to Linux to handle
> it in a more "common" environment where we can handle page faults etc...
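
For concreteness, my rough reading of that fast path, as a sketch only 
-- every helper name below is invented for illustration, this isn't 
actual POWER code:

long h_put_tce_rm(struct kvm_vcpu *vcpu, unsigned long liobn,
                  unsigned long ioba, unsigned long tce)
{
        /* rm_find_table() is a hypothetical real-mode-safe lookup */
        struct iommu_table *tbl = rm_find_table(vcpu->kvm, liobn);
        unsigned long hpa;

        if (!tbl)
                return H_TOO_HARD;      /* punt to a full exit into Linux */

        /* permission check + guest -> host translation; MMU is off,
         * so no faulting allowed here (rm_translate() is hypothetical) */
        if (rm_translate(vcpu->kvm, tce, &hpa))
                return H_TOO_HARD;      /* e.g. !present PTE */

        /* whack the HW iommu: write the TCE entry directly
         * (rm_write_tce() and TCE_PERM_BITS are made up as well) */
        rm_write_tce(tbl, ioba, hpa | (tce & TCE_PERM_BITS));
        return H_SUCCESS;
}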

I guess we can hack some kind of private interface, though I'd hoped to 
avoid it (and so far we have succeeded - we can even get vfio to inject 
interrupts into kvm from the kernel without either side knowing anything 
about the other).
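
The trick is a plain eventfd in the middle; roughly (userspace sketch, 
error handling mostly omitted, and the vfio side of the plumbing only 
hinted at since that interface is still in flux):

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Userspace creates an eventfd, registers it with kvm via KVM_IRQFD so
 * that signalling it injects 'gsi' into the guest, and separately hands
 * the same fd to vfio so the host driver signals it when the assigned
 * device fires its MSI.  Neither kernel side knows about the other.
 */
static int wire_msi_to_guest(int vm_fd, unsigned int gsi)
{
        int efd = eventfd(0, 0);
        struct kvm_irqfd irqfd = { .fd = efd, .gsi = gsi };

        if (efd < 0 || ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0)
                return -1;

        /* ... hand efd to the vfio device so it gets signalled on MSI ... */
        return efd;
}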

> >  >  >   Does the BAR value contain the segment base address?  Or is that added
> >  >  >   later?
> >  >
> >  >  It's a shared address space. With a basic configuration on p7ioc, for
> >  >  example, we have MMIO going from 3G to 4G (PCI-side addresses). BARs
> >  >  contain the normal PCI address there, but that 1G is divided into 128
> >  >  segments of equal size which can separately be assigned to PE#s.
> >  >
> >  >  So BARs are allocated by firmware or the kernel PCI code so that devices
> >  >  in different PEs don't share segments.
> >
> >  Okay, and config space virtualization ensures that the guest can't remap?
>
> Well, so it depends :-)
>
> With KVM we currently use whatever config space virtualization you do,
> and so we somewhat rely on this, but it's not very foolproof.
>
> I believe pHyp doesn't even bother filtering config space. As I said in
> another note, you can't trust adapters anyway. Plenty of them (video
> cards come to mind) have ways to get to their own config space via MMIO
> registers, for example.

Yes, we've seen that.

> So what pHyp does is always create PEs (aka groups) that are below a
> bridge. With PCIe, mostly everything is below a bridge, so that's easy,
> but it does mean that you always have all functions of a device in the
> same PE (and thus in the same partition). SR-IOV is an exception to
> this rule, since in that case the HW is designed to be trusted.
>
> That way, being behind a bridge, the bridge windows define what can be
> forwarded to the device, and thus the system is immune to the guest
> putting crap into the BARs: a BAR can't be remapped to overlap a
> neighbouring device.
>
> Note that the bridge itself isn't visible to the guest, so yes, config
> space is -somewhat- virtualized; typically pHyp makes every pass-through
> PE look like a separate PCI host bridge with the devices below it.

I think I see, yes.
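
Just to make sure I have the p7ioc numbers right (illustrative only -- 
the segment-to-PE assignment itself is whatever firmware set up):

#define MMIO_BASE       0xc0000000UL    /* 3G, PCI-side address */
#define MMIO_SIZE       0x40000000UL    /* the 1G MMIO window */
#define NR_SEGMENTS     128
#define SEGMENT_SIZE    (MMIO_SIZE / NR_SEGMENTS)   /* 8MB per segment */

/* Which segment -- and therefore, via the firmware-programmed
 * segment->PE# table, which PE -- a PCI MMIO address falls into.
 */
static inline unsigned int mmio_segment(unsigned long pci_addr)
{
        return (pci_addr - MMIO_BASE) / SEGMENT_SIZE;
}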

-- 
error compiling committee.c: too many arguments to function


