kvm PCI assignment & VFIO ramblings
Avi Kivity
avi at redhat.com
Tue Aug 2 19:12:02 EST 2011
On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> >
> > I have a feeling you'll be getting the same capabilities sooner or
> > later, or you won't be able to make use of S/R IOV VFs.
>
> I'm not sure what you mean. We can do SR-IOV just fine (well, with some
> limitations due to constraints with how our MMIO segmenting works and
> indeed some of those are being lifted in our future chipsets but
> overall, it works).
Don't those limitations include "all VFs must be assigned to the same
guest"?
PCI on x86 has function granularity; SR-IOV reduces this to VF
granularity. But I thought power has partition or group granularity,
which is much coarser?
> In -theory-, one could do the grouping dynamically with some kind of API
> for us as well. However the constraints are such that it's not
> practical. Filtering on RID is based on number of bits to match in the
> bus number and whether to match the dev and fn. So it's not arbitrary
> (but works fine for SR-IOV).
>
> The MMIO segmentation is a bit special too. There is a single MMIO
> region in 32-bit space (size is configurable but that's not very
> practical so for now we stick it to 1G) which is evenly divided into N
> segments (where N is the number of PE# supported by the host bridge,
> typically 128 with the current bridges).
>
> Each segment goes through a remapping table to select the actual PE# (so
> large BARs use consecutive segments mapped to the same PE#).
>
> For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> regions which act as some kind of "accordions", they are evenly divided
> into segments in different PE# and there's several of them which we can
> "move around" and typically use to map VF BARs.
So, SR-IOV VFs *don't* have the group limitation? Sorry, I'm deluged by
technical details with no ppc background to fit them into; I can't say
I'm making any sense of this.
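
A rough sketch of the mechanism Ben describes, for anyone else without
the POWER background (all names here are illustrative, not the real
firmware or kernel code): the RID filter matches some number of
high-order bus bits plus, optionally, the exact dev/fn, and the 1G M32
window is cut into equal segments that a remap table assigns to PE#s.

/*
 * Illustrative sketch only: RID filtering (bus-number prefix match
 * plus optional dev/fn match) and the M32 window split into equal
 * segments remapped to PE#s.  Numbers follow the thread: 1G window,
 * 128 PE#s per host bridge.
 */
#include <stdint.h>
#include <stdbool.h>

#define M32_BASE     0xC0000000u          /* 3G, PCI-side address   */
#define M32_SIZE     0x40000000u          /* 1G window              */
#define NUM_PE       128                  /* PE#s per host bridge   */
#define SEG_SIZE     (M32_SIZE / NUM_PE)  /* 8MB per segment        */

struct rid_filter {
    uint8_t  bus;          /* bus number to compare against          */
    uint8_t  bus_bits;     /* how many high-order bus bits to match  */
    bool     match_devfn;  /* also require an exact dev/fn match?    */
    uint8_t  devfn;
    uint8_t  pe;           /* PE# this filter maps to                */
};

/* RID = bus:dev.fn packed as bus<<8 | devfn */
static bool rid_matches(const struct rid_filter *f, uint16_t rid)
{
    uint8_t bus = rid >> 8, devfn = rid & 0xff;
    uint8_t mask = (uint8_t)(0xff << (8 - f->bus_bits));

    if ((bus & mask) != (f->bus & mask))
        return false;
    if (f->match_devfn && devfn != f->devfn)
        return false;
    return true;
}

/* Segment remap table: one PE# per 8MB segment of the M32 window.
 * A device with large BARs simply owns several consecutive entries. */
static uint8_t m32_segment_to_pe[NUM_PE];

static int pe_for_mmio(uint32_t pci_addr)
{
    if (pci_addr < M32_BASE || pci_addr >= M32_BASE + M32_SIZE)
        return -1;                       /* outside the M32 window  */
    return m32_segment_to_pe[(pci_addr - M32_BASE) / SEG_SIZE];
}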
> > >
> > > VFIO here is basically designed for one and only one thing: expose the
> > > entire guest physical address space to the device more/less 1:1.
> >
> > A single level iommu cannot be exposed to guests. Well, it can be
> > exposed as an iommu that does not provide per-device mapping.
>
> Well, x86 ones can't maybe but on POWER we can and must thanks to our
> essentially paravirt model :-) Even if it wasn't and we used trapping
> of accesses to the table, it would work because in practice, even with
> filtering, what we end up having is a per-device (or rather per-PE#)
> table.
>
> > A two level iommu can be emulated and exposed to the guest. See
> > http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
>
> What you mean by 2-level is two passes through two trees (i.e. 6 or 8
> levels, right?).
(16 or 25)
> We don't have that and probably never will. But again, because
> we have a paravirt interface to the iommu, it's less of an issue.
Well, then, I guess we need an additional interface to expose that to
the guest.
> > > This means:
> > >
> > > - It only works with iommu's that provide complete DMA address spaces
> > > to devices. Won't work with a single 'segmented' address space like we
> > > have on POWER.
> > >
> > > - It requires the guest to be pinned. Pass-through -> no more swap
> >
> > Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > and then the requirement can be removed.
>
> No. -Some- newer devices will. Out of these, a bunch will have so many
> bugs in them that they're not usable. Some never will. It's a mess really
> and I wouldn't design my stuff based on those premises just yet. Making it
> possible to support it for sure, having it in mind, but not making it
> the foundation on which the whole API is designed.
The API is not designed around pinning. It's a side effect of how the
IOMMU works. If your IOMMU only maps pages which are under active DMA,
then it would only pin those pages.
But I see what you mean: the API is designed around up-front
specification of all guest memory.
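
To make the up-front model concrete, here is a minimal sketch of
mapping all of guest RAM 1:1 through VFIO, using the type1 API as it
looks in mainline today (the interface was still being reworked at the
time of this thread). The kernel pins every page it maps, hence
"pass-through -> no more swap".

/*
 * Minimal sketch of the "map all guest memory up front" model.
 * Mapping every guest page means every guest page gets pinned.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_guest_ram(int container_fd, void *hva, uint64_t gpa,
                         uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uint64_t)(uintptr_t)hva;  /* userspace mapping of guest RAM */
    map.iova  = gpa;                       /* 1:1 with guest physical        */
    map.size  = size;

    /* The kernel pins the backing pages for the lifetime of the mapping. */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}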
> > > - It doesn't work for POWER server anyways because of our need to
> > > provide a paravirt iommu interface to the guest since that's how pHyp
> > > works today and how existing OSes expect to operate.
> >
> > Then you need to provide that same interface, and implement it using the
> > real iommu.
>
> Yes. Working on it. It's not very practical due to how VFIO interacts in
> terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> almost entirely real-mode for performance reasons.
The original kvm device assignment code was (and is) part of kvm
itself. We're trying to move to vfio to allow sharing with non-kvm
users, but it does reduce flexibility. We can have an internal vfio-kvm
interface to update mappings in real time.
> > > - Performance sucks of course, the vfio map ioctl wasn't meant for that
> > > and has quite a bit of overhead. However we'll want to do the paravirt
> > > call directly in the kernel eventually ...
> >
> > Does the guest iomap each request? Why?
>
> Not sure what you mean... the guest calls h-calls for every iommu page
> mapping/unmapping, yes. So the performance of these is critical. So yes,
> we'll eventually do it in kernel. We just haven't yet.
I see. x86 traditionally doesn't do it for every request. We had some
proposals to do a pviommu that does map every request, but none reached
maturity.
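
A rough sketch of why the hcall cost matters under the PAPR paravirt
IOMMU: the guest issues one H_PUT_TCE per IOMMU page it maps. The
helper names below are made up; the real guest-side code lives in
arch/powerpc/platforms/pseries/iommu.c.

#define TCE_PAGE_SHIFT   12
#define TCE_PAGE_SIZE    (1UL << TCE_PAGE_SHIFT)
#define TCE_READ         0x1UL   /* illustrative permission bits */
#define TCE_WRITE        0x2UL

/* One hypercall per TCE entry: liobn identifies the table (i.e. the
 * PE's DMA window), ioba is the offset into it, tce is RPN + perms.
 * Stubbed out here; the real thing is a single guest->hv trap. */
static long h_put_tce(unsigned long liobn, unsigned long ioba,
                      unsigned long tce)
{
    (void)liobn; (void)ioba; (void)tce;
    return 0;
}

static long guest_iommu_map(unsigned long liobn, unsigned long ioba,
                            unsigned long ram_addr, unsigned long npages)
{
    unsigned long i;

    for (i = 0; i < npages; i++) {
        unsigned long tce = (ram_addr + i * TCE_PAGE_SIZE) |
                            TCE_READ | TCE_WRITE;
        long rc = h_put_tce(liobn, ioba + i * TCE_PAGE_SIZE, tce);
        if (rc)
            return rc;   /* every page costs a guest->hv round trip */
    }
    return 0;
}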
> >
> > So, you have interrupt redirection? That is, MSI-x table values encode
> > the vcpu, not pcpu?
>
> Not exactly. The MSI-X address is a real PCI address to an MSI port and
> the value is a real interrupt number in the PIC.
>
> However, the MSI port filters by RID (using the same matching as PE#) to
> ensure that only allowed devices can write to it, and the PIC has a
> matching PE# information to ensure that only allowed devices can trigger
> the interrupt.
>
> As for the guest knowing what values to put in there (what port address
> and interrupt source numbers to use), this is part of the paravirt APIs.
>
> So the paravirt APIs handles the configuration and the HW ensures that
> the guest cannot do anything else than what it's allowed to.
Okay, this is something that x86 doesn't have. Strange that it can
filter DMA at a fine granularity but not MSI, which is practically the
same thing.
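
For reference, what the guest ends up writing is an ordinary MSI-X
table entry; the address/data it writes are the real MSI port address
and interrupt source number handed out by the paravirt API, and the RID
filtering happens in hardware. A sketch (the 16-byte entry layout is
standard PCI; the helper names are invented):

#include <stdint.h>

struct msix_entry_mmio {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t vector_ctrl;   /* bit 0 = per-vector mask */
};

static void write_msix_entry(volatile struct msix_entry_mmio *e,
                             uint64_t msi_port_addr, uint32_t source_num)
{
    e->vector_ctrl |= 1;             /* mask while updating          */
    e->addr_lo = (uint32_t)msi_port_addr;
    e->addr_hi = (uint32_t)(msi_port_addr >> 32);
    e->data    = source_num;         /* interrupt number in the PIC  */
    e->vector_ctrl &= ~1u;           /* unmask                       */
}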
> >
> > Does the BAR value contain the segment base address? Or is that added
> > later?
>
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided in 128
> segments of equal size which can separately be assigned to PE#'s.
>
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.
Okay, and config space virtualization ensures that the guest can't remap?
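
One simplified way to picture the allocation constraint: with the 1G
window split into 128 x 8MB segments, firmware or the kernel can round
every PE's MMIO footprint up to a whole number of segments so no
segment is ever shared across PEs. The code below is only an
illustration of that rounding, not the real allocator.

#include <stdint.h>

#define SEG_SIZE  (0x40000000u / 128)   /* 8MB per segment */

/* Round a PE's total BAR footprint up to segment granularity, so the
 * next PE starts on a fresh segment and two PEs can never end up
 * overlapping the same segment. */
static uint32_t pe_mmio_footprint(uint32_t bar_bytes_total)
{
    return (bar_bytes_total + SEG_SIZE - 1) & ~(SEG_SIZE - 1);
}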
--
error compiling committee.c: too many arguments to function