[PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users

David Gibson david at gibson.dropbear.id.au
Fri Sep 23 17:12:08 AEST 2016


On Wed, Sep 21, 2016 at 04:56:52PM +1000, Alexey Kardashevskiy wrote:
> On 07/09/16 19:09, Alexey Kardashevskiy wrote:
> > On 29/08/16 23:27, David Gibson wrote:
> >> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
> >>> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
> >>>> On 17/08/16 13:17, David Gibson wrote:
> >>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
> >>>>>> On Fri, 12 Aug 2016 15:46:01 +1000
> >>>>>> David Gibson <david at gibson.dropbear.id.au> wrote:
> >>>>>>
> >>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> >>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
> >>>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >>>>>>>>   
> >>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote:  
> >>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
> >>>>>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >>>>>>>>>>     
> >>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:    
> >>>>>>>>>>>> On Wed,  3 Aug 2016 18:40:55 +1000
> >>>>>>>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >>>>>>>>>>>>       
> >>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
> >>>>>>>>>>>>> memory while there are external users such as KVM.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> >>>>>>>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>>>>>>>>>>>>  include/linux/vfio.h                |  6 ++++++
> >>>>>>>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>>>>>>>>>>> index d1d70e0..baf6a9c 100644
> >>>>>>>>>>>>> --- a/drivers/vfio/vfio.c
> >>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c
> >>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>>>>>>>>>>>>  
> >>>>>>>>>>>>>  /**
> >>>>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
> >>>>>>>>>>>>> + * dynamically.
> >>>>>>>>>>>>> + *
> >>>>>>>>>>>>> + */
> >>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	struct vfio_container *container = filep->private_data;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	if (filep->f_op != &vfio_fops)
> >>>>>>>>>>>>> +		return ERR_PTR(-EINVAL);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	vfio_container_get(container);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	return container;
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	vfio_container_put(container);
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	return container->iommu_data;
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +/**
> >>>>>>>>>>>>>   * Sub-module support
> >>>>>>>>>>>>>   */
> >>>>>>>>>>>>>  /*
> >>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>> index 3594ad3..fceea3d 100644
> >>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>>>>>>>>>>>>  	.detach_group	= tce_iommu_detach_group,
> >>>>>>>>>>>>>  };
> >>>>>>>>>>>>>  
> >>>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >>>>>>>>>>>>> +		u64 offset)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> +	struct tce_container *container = iommu_data;
> >>>>>>>>>>>>> +	struct iommu_table *tbl = NULL;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >>>>>>>>>>>>> +		return NULL;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	iommu_table_get(tbl);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +	return tbl;
> >>>>>>>>>>>>> +}
> >>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>>  static int __init tce_iommu_init(void)
> >>>>>>>>>>>>>  {
> >>>>>>>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>>>>>>>>>>>>  MODULE_LICENSE("GPL v2");
> >>>>>>>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> >>>>>>>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> >>>>>>>>>>>>> -
> >>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>>>>>>>>> index 0ecae0b..1c2138a 100644
> >>>>>>>>>>>>> --- a/include/linux/vfio.h
> >>>>>>>>>>>>> +++ b/include/linux/vfio.h
> >>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>>>>>>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>>>>>>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> >>>>>>>>>>>>>  					  unsigned long arg);
> >>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> >>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
> >>>>>>>>>>>>> +		struct vfio_container *container);
> >>>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >>>>>>>>>>>>> +		void *iommu_data, u64 offset);
> >>>>>>>>>>>>>  
> >>>>>>>>>>>>>  /*
> >>>>>>>>>>>>>   * Sub-module helpers      
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think you need to take a closer look of the lifecycle of a container,
> >>>>>>>>>>>> having a reference means the container itself won't go away, but only
> >>>>>>>>>>>> having a group set within that container holds the actual IOMMU
> >>>>>>>>>>>> references.  container->iommu_data is going to be NULL once the
> >>>>>>>>>>>> groups are lost.  Thanks,      
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> The container owns the iommu tables, and this is what I care about here,
> >>>>>>>>>>> groups attached or not - that is handled separately via the IOMMU group
> >>>>>>>>>>> list in a specific iommu_table struct; these groups get detached from
> >>>>>>>>>>> the iommu_table when they are removed from a container.
> >>>>>>>>>>
> >>>>>>>>>> The container doesn't own anything, the container is privileged by the
> >>>>>>>>>> groups being attached to it.  When groups are closed, they detach from
> >>>>>>>>>> the container and once the container group list is empty the iommu
> >>>>>>>>>> backend is released and iommu_data is NULL.  A container reference
> >>>>>>>>>> doesn't give you what you're looking for.  It implies nothing about the
> >>>>>>>>>> iommu backend.    
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Well, the backend is a part of the container, and since the backend owns
> >>>>>>>>> the tables, the container owns them too.
> >>>>>>>>
> >>>>>>>> The IOMMU backend is accessed through the container, but that backend
> >>>>>>>> is privileged by the groups it contains.  Once those groups are gone,
> >>>>>>>> the IOMMU backend is released, regardless of whatever reference you
> >>>>>>>> have to the container itself such as you're attempting to do here.  In
> >>>>>>>> that sense, the container does not own those tables.  
> >>>>>>>
> >>>>>>> So, the thing is that what KVM fundamentally needs is a handle on the
> >>>>>>> container.  KVM is essentially modelling the DMA address space of a
> >>>>>>> single guest bus, and the container is what's attached to that.
> >>>>>>>
> >>>>>>> The first part of the problem is that KVM wants to basically invoke
> >>>>>>> vfio_dma_map() operations without bouncing via qemu.  Because
> >>>>>>> vfio_dma_map() works on the container level, that's the handle that
> >>>>>>> KVM needs to hold.
> >>>>>>>
> >>>>>>> The second part of the problem is that in order to reduce overhead
> >>>>>>> further, we want to operate in real mode, which means bypassing most
> >>>>>>> of the usual VFIO structure and going directly(ish) from the KVM
> >>>>>>> hcall emulation to the IOMMU backend behind VFIO.  This complicates
> >>>>>>> matters a fair bit.  Because it is, explicitly, a performance hack,
> >>>>>>> some degree of ugliness is probably inevitable.
> >>>>>>>
> >>>>>>> Alexey - actually implementing this in two stages might make this
> >>>>>>> clearer.  The first stage wouldn't allow real mode, and would call
> >>>>>>> through the same vfio_dma_map() path as qemu calls through now.  The
> >>>>>>> second stage would then put in place the necessary hacks to add real
> >>>>>>> mode support.
> >>>>>>>
> >>>>>>>>> The problem I am trying to solve here is deciding when KVM may release the
> >>>>>>>>> iommu_table objects.
> >>>>>>>>>
> >>>>>>>>> A "set" ioctl() to KVM-spapr-tce-table (or KVM itself, it does not really
> >>>>>>>>> matter) makes a link between the KVM-spapr-tce-table and a container, and
> >>>>>>>>> KVM can start using the tables (taking references on them).
> >>>>>>>>>
> >>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> >>>>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
> >>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
> >>>>>>>>> notified that it can release the iommu_tables because the
> >>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> >>>>>>>>> still used by emulated devices or other containers).
> >>>>>>>>>
> >>>>>>>>> So it was suggested that we could do such an "unset" somehow later by,
> >>>>>>>>> for example, on every "set" checking if some of the currently attached
> >>>>>>>>> containers are no longer used - and this is where being able to know if there
> >>>>>>>>> is no backend helps - KVM remembers a container pointer and can check this
> >>>>>>>>> via vfio_container_get_iommu_data_ext().
> >>>>>>>>>
> >>>>>>>>> The other option would be changing vfio_container_get_ext() to take a
> >>>>>>>>> callback+opaque which the container would call when it destroys iommu_data.
> >>>>>>>>> This looks more intrusive, and it is not obvious how to make it right -
> >>>>>>>>> container would have to keep track of all registered external users and
> >>>>>>>>> vfio_container_put_ext() would have to pass the same callback+opaque to
> >>>>>>>>> unregister the exact external user.  
> >>>>>>>>
> >>>>>>>> I'm not in favor of anything resembling the code above or extensions
> >>>>>>>> beyond it, the container is the wrong place to do this.
> >>>>>>>>   
> >>>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
> >>>>>>>>> released until KVM-spapr-tce-table is destroyed.  
> >>>>>>>>
> >>>>>>>> See above, holding a file pointer to the container doesn't do squat.
> >>>>>>>> The groups that are held by the container empower the IOMMU backend,
> >>>>>>>> references to the container itself don't matter.  Those references will
> >>>>>>>> not maintain the IOMMU data.
> >>>>>>>>    
> >>>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hot unplug (closing its fd
> >>>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
> >>>>>>>>> still may be devices using this KVM-spapr-tce-table.
> >>>>>>>>>
> >>>>>>>>> What obvious and nice solution am I missing here? Thanks.  
> >>>>>>>>
> >>>>>>>> The interactions with the IOMMU backend that seem relevant are
> >>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> >>>>>>>> device is also used to tell kvm about groups as they come and go and
> >>>>>>>> has a way to check extensions, and thus properties of the IOMMU
> >>>>>>>> backend.  All of these are available for your {ab}use.  Thanks,  
> >>>>>>>
> >>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
> >>>>>>> a really bad fit.  As noted above, fundamentally it's a container we
> >>>>>>> need to attach to the kvm-spapr-tce-table object, since what that
> >>>>>>> represents is a guest bus DMA address space, and by definition all the
> >>>>>>> groups in a container must have the same DMA address space.
> >>>>>>
> >>>>>> That's all fine and good, but the point remains that a reference to the
> >>>>>> container is no assurance of the iommu state.  The iommu state is
> >>>>>> maintained by the user and the groups attached to the container.  If
> >>>>>> the groups are removed, your container reference no longer has any iommu
> >>>>>> backing and iommu_data is worthless.  The user can do this as well by
> >>>>>> un-setting the iommu.  I understand what you're trying to do, it's just
> >>>>>> wrong.  Thanks,
> >>>>>
> >>>>> I'm trying to figure out how to do this right, and it's not at all
> >>>>> obvious.  The container may be wrong, but that doesn't make the
> >>>>> KVM-VFIO device any more useful.  Attempting to do this at the group
> >>>>> level is at least as wrong for the reasons I've mentioned elsewhere.
> >>>>>
> >>>>
> >>>> I could create a new fd, one per iommu_table; the fd would reference the
> >>>> iommu_table (not touching an iommu_table_group or a container). The VFIO
> >>>> SPAPR TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (the ioctl
> >>>> which creates windows), or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET;
> >>>> then I'd pass this new fd to KVM or the KVM-spapr-tce-table to hook them up.
> >>>> To release the reference, the KVM-spapr-tce-table would have an "unset"
> >>>> ioctl(), and/or on every "set" I would check whether all attached tables
> >>>> have at least one iommu_table_group attached; if none, release the table.
> >>>>
> >>>> This would make no change to generic VFIO code and very little change in
> >>>> SPAPR TCE backend. Would that be acceptable, or is it horrible again? Thanks.
> >>>
> >>>
> >>> Ping?
> >>
> >> I'm still in Toronto after KVM Forum.  I had a detailed discussion
> >> about this with Alex W, which I'll write up once I get back.
> >>
> >> The short version is that Alex more-or-less convinced me that we do
> >> need to go back to doing this with an interface based on linking
> >> groups to LIOBNs.  That leads to an interface that's kind of weird and
> >> has some fairly counter-intuitive properties, but in the end it works
> >> out better than doing it with containers.
> >>
> > 
> > 
> > 
> > 
> > Soooo? :)
> 
> 
> When can I expect a full version of how to do this in-kernel thingy?
> Thanks.

When I can dig myself out from under other things in my queue.  Which
turns out to be now.

Ok... here's hoping I can remember enough of the conclusions I came to
with Alex W.

User <-> KVM interface
----------------------

This needs to take an LIOBN and a group fd and associate (or
disassociate) them.  This should be possible to do by adding each
group to the vfio-kvm device as on x86, then setting an attribute on
the device to mark the associated LIOBN.
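
As a rough userspace sketch: KVM_DEV_VFIO_GROUP, KVM_DEV_VFIO_GROUP_ADD
and KVM_SET_DEVICE_ATTR exist today, but the LIOBN attribute
(KVM_DEV_VFIO_GROUP_SET_LIOBN) and its payload struct are names made up
purely for illustration.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical attribute and payload - NOT in linux/kvm.h today */
#define KVM_DEV_VFIO_GROUP_SET_LIOBN	4

struct kvm_vfio_group_liobn {
	uint32_t	groupfd;	/* fd of the vfio group */
	uint32_t	liobn;		/* guest LIOBN to associate */
};

/* @vfio_dev_fd is the vfio-kvm pseudo device fd from KVM_CREATE_DEVICE */
static int kvm_vfio_group_set_liobn(int vfio_dev_fd, int group_fd,
				    uint32_t liobn)
{
	struct kvm_vfio_group_liobn arg = {
		.groupfd = group_fd,
		.liobn = liobn,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_SET_LIOBN,
		.addr = (uint64_t)(unsigned long)&arg,
	};

	/* the group must already be added via KVM_DEV_VFIO_GROUP_ADD */
	return ioctl(vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}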

Attaching different (overlapping) LIOBNs to different groups in the
same container is boundedly undefined (i.e. it mustn't break the host,
but can do anything to the guest).

KVM <-> VFIO (in kernel) interface
----------------------------------

You'll need a special function that takes a vfio group fd and returns
a reference to an iommu_table object.  It would also return an error
if the group isn't backed by the spapr_tce iommu driver (which covers
any call made on a non-ppc host).  This should probably also increment
the iommu table's ref count (on success).
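
A minimal sketch of that contract - the name and signature below are
invented for illustration, not existing API:

/*
 * Resolve @group_filep to its container; fail with ERR_PTR(-EINVAL)
 * unless the container is backed by the spapr_tce iommu driver (which
 * covers non-ppc hosts, where that driver doesn't exist).  Then find
 * the iommu_table covering @offset in the bus address space, take a
 * reference on it and return it.  The caller pairs this with the
 * (assumed) iommu_table_put() counterpart.
 */
extern struct iommu_table *vfio_spapr_tce_table_get(struct file *group_filep,
						    u64 offset);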

Implementation notes
--------------------

When a device in a new group is hotplugged, qemu would need to add the
group to the container, *then* tell KVM to attach the group to the
correct LIOBN(s).

KVM would add the group to a list for that LIOBN.  It would call the
vfio hook to get the associated iommu table.  If there's an error,
then it's unable to enable acceleration, and would either return an
error immediately or ensure that later attempts to H_PUT_TCE are
punted to qemu.

Assuming it is able to accelerate, it would add the iommu table to a
list of iommu tables associated with the LIOBN.  It will need to
de-dupe here, since with multiple groups per container you'd expect
multiple groups to map to the same iommu table.
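
Roughly, in code (every name below is an assumption for the sake of
the sketch, not an existing kernel structure):

struct kvmppc_spapr_tce_iommu_table {
	struct list_head	next;
	struct iommu_table	*tbl;	/* reference held via the VFIO hook */
};

/*
 * Attach @tbl to the LIOBN's list unless it is already there - two
 * groups in one container resolve to the same iommu_table.  On the
 * duplicate path the caller drops the extra reference it got from
 * the VFIO hook.
 */
static int kvm_spapr_tce_attach_table(struct kvmppc_spapr_tce_table *stt,
				      struct iommu_table *tbl)
{
	struct kvmppc_spapr_tce_iommu_table *stit;

	list_for_each_entry(stit, &stt->iommu_tables, next)
		if (stit->tbl == tbl)
			return -EALREADY;	/* de-duped */

	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
	if (!stit)
		return -ENOMEM;

	stit->tbl = tbl;
	list_add_rcu(&stit->next, &stt->iommu_tables);

	return 0;
}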

H_PUT_TCE would walk the list of attached iommu tables and update them
using the ppc kernel iommu interfaces.
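
In terms of the sketch above, the walk might look like this;
iommu_tce_xchg() is meant to be the existing ppc helper that atomically
replaces a TCE entry, and translating the guest TCE into @hpa/@dir is
elided:

static long kvm_spapr_tce_update(struct kvmppc_spapr_tce_table *stt,
				 unsigned long entry, unsigned long hpa,
				 enum dma_data_direction dir)
{
	struct kvmppc_spapr_tce_iommu_table *stit;

	list_for_each_entry(stit, &stt->iommu_tables, next) {
		unsigned long cur_hpa = hpa;
		enum dma_data_direction cur_dir = dir;

		if (iommu_tce_xchg(stit->tbl, entry, &cur_hpa, &cur_dir))
			return H_HARDWARE;	/* or punt the hcall to qemu */
	}

	return H_SUCCESS;
}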

When a group is removed from a LIOBN, KVM would need to recalculate
the list of iommu tables, in case that was the last group attached to
the table.  It would need to decrement the refcount on the iommu table
and, obviously, make sure everything is synchronized with the
real-mode H_PUT_TCE.
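
Continuing the sketch, the detach side could look something like this
(iommu_table_put() is the assumed counterpart of the get taken by the
VFIO hook):

static void kvm_spapr_tce_detach_table(struct kvmppc_spapr_tce_table *stt,
				       struct iommu_table *tbl)
{
	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;

	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
		if (stit->tbl != tbl)
			continue;
		list_del_rcu(&stit->next);
		synchronize_rcu();	/* vs the real-mode H_PUT_TCE walk */
		iommu_table_put(stit->tbl);
		kfree(stit);
	}
}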

When a group is hot-unplugged, it's qemu's responsibility to tell KVM
that the group is no longer associated with the LIOBN, before it
removes the group from the container.  If it doesn't, there may be a
stale iommu table attached to the LIOBN.  That could certainly mess up
DMA in the guest for other devices, but shouldn't damage the host:
the group now belongs to the host again, but because the group was
detached from the container, the HW is no longer using the container's
iommu table (which KVM is touching) to actually serve the group.

If all the groups are unplugged, so the container becomes quiescent,
KVM's refcount(s) on the iommu table stop it from going away.  It
won't be looked at by the hardware any more, so updates will be
useless, but again that's only a problem for the guest, not the host.


Hope that covers it.

Alex, please let me know if I missed something from our discussion.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson