kvm PCI assignment & VFIO ramblings

Joerg Roedel joro at 8bytes.org
Fri Aug 5 23:44:46 EST 2011


On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote:

> Right. In fact to try to clarify the problem for everybody, I think we
> can distinguish two different classes of "constraints" that can
> influence the grouping of devices:
> 
>  1- Hard constraints. These are typically devices using the same RID or
> where the RID cannot be reliably guaranteed (the later is the case with
> some PCIe-PCIX bridges which will take ownership of "some" transactions
> such as split but not all). Devices like that must be in the same
> domain. This is where PowerPC adds to what x86 does today the concept
> that the domains are pre-existing, since we use the RID for error
> isolation & MMIO segmenting as well. so we need to create those domains
> at boot time.

Domains (in the iommu-sense) are created at boot time on x86 today.
Every device needs at least a domain to provide dma-mapping
functionality to the drivers. So all the grouping is done too at
boot-time. This is specific to the iommu-drivers today but can be
generalized I think.

>  2- Softer constraints. Those constraints derive from the fact that not
> applying them risks enabling the guest to create side effects outside of
> its "sandbox". To some extent, there can be "degrees" of badness between
> the various things that can cause such constraints. Examples are shared
> LSIs (since trusting DisINTx can be chancy, see earlier discussions),
> potentially any set of functions in the same device can be problematic
> due to the possibility to get backdoor access to the BARs etc...

Hmm, there is no sane way to handle such constraints in a safe way,
right? We can either blacklist devices which are know to have such
backdoors or we just ignore the problem.

> Now, what I derive from the discussion we've had so far, is that we need
> to find a proper fix for #1, but Alex and Avi seem to prefer that #2
> remains a matter of libvirt/user doing the right thing (basically
> keeping a loaded gun aimed at the user's foot with a very very very
> sweet trigger but heh, let's not start a flamewar here :-)
> 
> So let's try to find a proper solution for #1 now, and leave #2 alone
> for the time being.

Yes, and the solution for #1 should be entirely in the kernel. The
question is how to do that. Probably the most sane way is to introduce a
concept of device ownership. The ownership can either be a kernel driver
or a userspace process. Giving ownership of a device to userspace is
only possible if all devices in the same group are unbound from its
respective drivers. This is a very intrusive concept, no idea if it
has a chance of acceptance :-)
But the advantage is clearly that this allows better semantics in the
IOMMU drivers and a more stable handover of devices from host drivers to
kvm guests.

> Maybe the right option is for x86 to move toward pre-existing domains
> like powerpc does, or maybe we can just expose some kind of ID.

As I said, the domains are created a iommu driver initialization time
(usually boot time). But the groups are internal to the iommu drivers
and not visible somewhere else.

> Ah you started answering to my above questions :-)
> 
> We could do what you propose. It depends what we want to do with
> domains. Practically speaking, we could make domains pre-existing (with
> the ability to group several PEs into larger domains) or we could keep
> the concepts different, possibly with the limitation that on powerpc, a
> domain == a PE.
> 
> I suppose we -could- make arbitrary domains on ppc as well by making the
> various PE's iommu's in HW point to the same in-memory table, but that's
> a bit nasty in practice due to the way we manage those, and it would to
> some extent increase the risk of a failing device/driver stomping on
> another one and thus taking it down with itself. IE. isolation of errors
> is an important feature for us.

These arbitrary domains exist in the iommu-api. It would be good to
emulate them on Power too. Can't you put a PE into an isolated
error-domain when something goes wrong with it? This should provide the
same isolation as before.
What you derive the group number from is your business :-) On x86 it is
certainly the best to use the RID these devices share together with the
PCI segment number.

Regards,

	Joerg



More information about the Linuxppc-dev mailing list