kvm PCI assignment & VFIO ramblings

Thu Aug 25 01:07:46 EST 2011

On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote:
> On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote:
> > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote:
> 
> > > Handling it through fds is a good idea. This makes sure that everything
> > > belongs to one process. I am not really sure yet if we go the way to
> > > just bind plain groups together or if we create meta-groups. The
> > > meta-groups thing seems somewhat cleaner, though.
> > 
> > I'm leaning towards binding because we need to make it dynamic, but I
> > don't really have a good picture of the lifecycle of a meta-group.
> 
> In my view the life-cycle of the meta-group is a subrange of the
> qemu-instance's life-cycle.

I guess I mean the lifecycle of a super-group that's actually exposed as
a new group in sysfs.  Who creates it?  How?  How are groups dynamically
added and removed from the super-group?  The group merging makes sense
to me because it's largely just an optimization: qemu will try to merge
groups.  If it works, great.  If not, it manages them separately.
When all the devices from a group are unplugged, unmerge the group if
necessary.
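
To make the merge-as-optimization lifecycle above concrete, here is a
rough model of it.  This is only an illustrative sketch: `Group`,
`GroupManager`, the `mergeable` flag and the "domain" lists are all
hypothetical names invented for the example, not VFIO or qemu
interfaces, and the real merge criteria would be decided by the kernel.

```python
class Group:
    """Hypothetical stand-in for one device group (not a real VFIO object)."""

    def __init__(self, name, mergeable=True):
        self.name = name
        # Assumption: a simple flag models whether a merge would succeed.
        self.mergeable = mergeable
        self.devices = set()  # devices from this group currently in use


class GroupManager:
    """Models a user (e.g. qemu) that merges groups opportunistically."""

    def __init__(self):
        # Each entry is a list of groups managed together as one unit.
        self.domains = []

    def add_group(self, group):
        """Try to merge into an existing domain; fall back to a separate one.

        Returns True if the group was merged, False if managed separately.
        """
        for domain in self.domains:
            if group.mergeable and all(g.mergeable for g in domain):
                domain.append(group)
                return True
        # Merge failed or nothing to merge with: manage it on its own.
        self.domains.append([group])
        return False

    def unplug(self, group, device):
        """Drop a device; unmerge the group once its last device is gone."""
        group.devices.discard(device)
        if group.devices:
            return
        for domain in self.domains:
            if group in domain:
                domain.remove(group)
                if not domain:
                    self.domains.remove(domain)
                return
```

The point of the sketch is that a failed merge is not an error path: the
caller simply ends up with one more separately managed unit.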

> > > Putting the process to sleep (which would be uninterruptible) seems bad.
> > > The process would sleep until the guest releases the device-group, which
> > > can take days or months.
> > > The best thing (and the most intrusive :-) ) is to change PCI core to
> > > allow unbindings to fail, I think. But this probably further complicates
> > > the way to upstream VFIO...
> > 
> > Yes, it's not ideal but I think it's sufficient for now and if we later
> > get support for returning an error from release, we can set a timeout
> > after notifying the user to make use of that.  Thanks,
> 
> Ben had the idea of just forcing a hard-unplug of this device from the
> guest. That's probably the best way to deal with that, I think. VFIO
> sends a notification to qemu that the device is gone and qemu informs
> the guest in some way about it.

We need to try the polite method of attempting to hot unplug the device
from qemu first, which the current vfio code already implements.  We can
then escalate if it doesn't respond.  The current code calls abort in
qemu if the guest doesn't respond, but I agree we should also be
enforcing this at the kernel interface.  I think the problem with the
hard-unplug is that we don't have a good revoke mechanism for the mmio
mmaps.  Thanks,

Alex
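
The polite-then-escalate sequence described above might be modelled
roughly as below.  This is a sketch under stated assumptions, not the
existing vfio or qemu code: `release_device` and its three callbacks
are hypothetical, and the timeout and polling values are arbitrary.
What the forced step should actually do is left open here, since that
is exactly the unresolved part (there is no good revoke mechanism for
the mmio mmaps).

```python
import time


def release_device(request_unplug, is_released, force_release,
                   timeout_s=1.0, poll_s=0.01):
    """Ask nicely first, then escalate if the guest does not respond.

    All three callbacks are hypothetical hooks supplied by the caller:
      request_unplug() - start a normal hot unplug request
      is_released()    - report whether the guest has let go of the device
      force_release()  - whatever escalation is chosen after the timeout
    """
    request_unplug()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_released():
            return "released"
        time.sleep(poll_s)
    # The guest did not respond in time, so escalate.
    force_release()
    return "forced"
```

For example, calling it with an `is_released` callback that always
returns False exercises the escalation path once `timeout_s` expires.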