kvm PCI assignment & VFIO ramblings
Alexander Graf
agraf at suse.de
Wed Aug 24 13:40:39 EST 2011
On 23.08.2011, at 18:51, Benjamin Herrenschmidt wrote:
>
>>> For us the most simple and logical approach (which is also what pHyp
>>> uses and what Linux handles well) is really to expose a given PCI host
>>> bridge per group to the guest. Believe it or not, it makes things
>>> easier :-)
>>
>> I'm all for easier. Why does exposing the bridge use fewer bus numbers
>> than emulating a bridge?
>
> Because a host bridge doesn't look like a PCI-to-PCI bridge at all for
> us. It's an entirely separate domain with its own bus number space
> (unlike most x86 setups).
>
> In fact we have some problems afaik in qemu today with the concept of
> PCI domains, for example, I think qemu has assumptions about a single
> shared IO space domain which isn't true for us (each PCI host bridge
> provides a distinct IO space domain starting at 0). We'll have to fix
> that, but it's not a huge deal.
>
> So for each "group" we'd expose in the guest an entire separate PCI
> domain space with its own IO, MMIO etc... spaces, handed off from a
> single device-tree "host bridge" which doesn't itself appear in the
> config space, doesn't need any emulation of any config space etc...
>
>> On x86, I want to maintain that our default assignment is at the device
>> level. A user should be able to pick single or multiple devices from
>> across several groups and have them all show up as individual,
>> hotpluggable devices on bus 0 in the guest. Not surprisingly, we've
>> also seen cases where users try to attach a bridge to the guest,
>> assuming they'll get all the devices below the bridge, so I'd be in
>> favor of making this "just work" if possible too, though we may have to
>> prevent hotplug of those.
>>
>> Given the device requirement on x86 and since everything is a PCI device
>> on x86, I'd like to keep a qemu command line something like -device
>> vfio,host=00:19.0. I assume that some of the iommu properties, such as
>> dma window size/address, will be query-able through an architecture
>> specific (or general if possible) ioctl on the vfio group fd. I hope
>> that will help the specification, but I don't fully understand what all
>> remains. Thanks,
>
> Well, for iommu there's a couple of different issues here but yes,
> basically on one side we'll have some kind of ioctl to know what segment
> of the device(s) DMA address space is assigned to the group and we'll
> need to represent that to the guest via a device-tree property in some
> kind of "parent" node of all the devices in that group.
>
> We -might- be able to implement some kind of hotplug of individual
> devices of a group under such a PHB (PCI Host Bridge), I don't know for
> sure yet, some of that PAPR stuff is pretty arcane, but basically, for
> all intents and purposes, we really want a group to be represented as a
> PHB in the guest.
>
> We cannot arbitrarily have individual devices of separate groups be
> represented in the guest as siblings on a single simulated PCI bus.

So would it make sense for you to go the same route that we need to go
on embedded Power, with a separate VFIO-style interface that simply
exports memory ranges and IRQ bindings, but doesn't know anything about
PCI? For e500, we'll be using something like that to pass through a full
PCI bus into the system.
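
To make the question concrete, here is a rough sketch of what a
PCI-agnostic description of a passed-through device might carry: just a
list of memory ranges and a list of IRQ bindings. Nothing below is an
existing kernel or qemu interface; the struct and function names
(passthru_region, passthru_irq, passthru_device and the two helpers) are
made up purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags describing what kind of window a region is. */
#define PASSTHRU_REGION_MMIO 0x1u
#define PASSTHRU_REGION_IO   0x2u

/* Hypothetical: one host memory range exported to the guest. */
struct passthru_region {
    uint64_t phys_addr;   /* host physical base address */
    uint64_t size;        /* length of the range in bytes */
    uint32_t flags;       /* PASSTHRU_REGION_* */
};

/* Hypothetical: one interrupt binding, host line to guest line. */
struct passthru_irq {
    uint32_t host_irq;
    uint32_t guest_irq;
};

/* Hypothetical: everything the interface knows about a device.
 * Note that there is no PCI config space or BAR concept here. */
struct passthru_device {
    const struct passthru_region *regions;
    size_t nr_regions;
    const struct passthru_irq *irqs;
    size_t nr_irqs;
};

/* Return the region that fully contains [addr, addr + len), or NULL.
 * The comparison is written so that addr + len cannot overflow. */
static const struct passthru_region *
passthru_find_region(const struct passthru_device *dev,
                     uint64_t addr, uint64_t len)
{
    for (size_t i = 0; i < dev->nr_regions; i++) {
        const struct passthru_region *r = &dev->regions[i];

        if (len != 0 && len <= r->size && addr >= r->phys_addr &&
            addr - r->phys_addr <= r->size - len)
            return r;
    }
    return NULL;
}

/* Map a host IRQ to its guest IRQ, or return -1 if it is not bound. */
static int64_t
passthru_guest_irq(const struct passthru_device *dev, uint32_t host_irq)
{
    for (size_t i = 0; i < dev->nr_irqs; i++) {
        if (dev->irqs[i].host_irq == host_irq)
            return (int64_t)dev->irqs[i].guest_irq;
    }
    return -1;
}
```

With a description like this, a whole PCI host bridge would just be one
more "device" whose regions happen to be its IO, MMIO and config windows.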
Alex