[PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users

David Gibson david at gibson.dropbear.id.au
Wed Aug 17 13:17:25 AEST 2016


On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
> On Fri, 12 Aug 2016 15:46:01 +1000
> David Gibson <david at gibson.dropbear.id.au> wrote:
> 
> > On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> > > On Wed, 10 Aug 2016 15:37:17 +1000
> > > Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> > >   
> > > > On 09/08/16 22:16, Alex Williamson wrote:  
> > > > > On Tue, 9 Aug 2016 15:19:39 +1000
> > > > > Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> > > > >     
> > > > >> On 09/08/16 02:43, Alex Williamson wrote:    
> > > > >>> On Wed,  3 Aug 2016 18:40:55 +1000
> > > > >>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> > > > >>>       
> > > > >>>> This exports helpers which are needed to keep a VFIO container in
> > > > >>>> memory while there are external users such as KVM.
> > > > >>>>
> > > > >>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> > > > >>>> ---
> > > > >>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
> > > > >>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> > > > >>>>  include/linux/vfio.h                |  6 ++++++
> > > > >>>>  3 files changed, 51 insertions(+), 1 deletion(-)
> > > > >>>>
> > > > >>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > > >>>> index d1d70e0..baf6a9c 100644
> > > > >>>> --- a/drivers/vfio/vfio.c
> > > > >>>> +++ b/drivers/vfio/vfio.c
> > > > >>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> > > > >>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> > > > >>>>  
> > > > >>>>  /**
> > > > >>>> + * External user API for containers, exported by symbols to be linked
> > > > >>>> + * dynamically.
> > > > >>>> + *
> > > > >>>> + */
> > > > >>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> > > > >>>> +{
> > > > >>>> +	struct vfio_container *container = filep->private_data;
> > > > >>>> +
> > > > >>>> +	if (filep->f_op != &vfio_fops)
> > > > >>>> +		return ERR_PTR(-EINVAL);
> > > > >>>> +
> > > > >>>> +	vfio_container_get(container);
> > > > >>>> +
> > > > >>>> +	return container;
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> > > > >>>> +
> > > > >>>> +void vfio_container_put_ext(struct vfio_container *container)
> > > > >>>> +{
> > > > >>>> +	vfio_container_put(container);
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> > > > >>>> +
> > > > >>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> > > > >>>> +{
> > > > >>>> +	return container->iommu_data;
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> > > > >>>> +
> > > > >>>> +/**
> > > > >>>>   * Sub-module support
> > > > >>>>   */
> > > > >>>>  /*
> > > > >>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > > >>>> index 3594ad3..fceea3d 100644
> > > > >>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > > > >>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > > >>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> > > > >>>>  	.detach_group	= tce_iommu_detach_group,
> > > > >>>>  };
> > > > >>>>  
> > > > >>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> > > > >>>> +		u64 offset)
> > > > >>>> +{
> > > > >>>> +	struct tce_container *container = iommu_data;
> > > > >>>> +	struct iommu_table *tbl = NULL;
> > > > >>>> +
> > > > >>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
> > > > >>>> +		return NULL;
> > > > >>>> +
> > > > >>>> +	iommu_table_get(tbl);
> > > > >>>> +
> > > > >>>> +	return tbl;
> > > > >>>> +}
> > > > >>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> > > > >>>> +
> > > > >>>>  static int __init tce_iommu_init(void)
> > > > >>>>  {
> > > > >>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> > > > >>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> > > > >>>>  MODULE_LICENSE("GPL v2");
> > > > >>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
> > > > >>>>  MODULE_DESCRIPTION(DRIVER_DESC);
> > > > >>>> -
> > > > >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > > >>>> index 0ecae0b..1c2138a 100644
> > > > >>>> --- a/include/linux/vfio.h
> > > > >>>> +++ b/include/linux/vfio.h
> > > > >>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> > > > >>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
> > > > >>>>  extern long vfio_external_check_extension(struct vfio_group *group,
> > > > >>>>  					  unsigned long arg);
> > > > >>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> > > > >>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> > > > >>>> +extern void *vfio_container_get_iommu_data_ext(
> > > > >>>> +		struct vfio_container *container);
> > > > >>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> > > > >>>> +		void *iommu_data, u64 offset);
> > > > >>>>  
> > > > >>>>  /*
> > > > >>>>   * Sub-module helpers      
> > > > >>>
> > > > >>>
> > > > >>> I think you need to take a closer look of the lifecycle of a container,
> > > > >>> having a reference means the container itself won't go away, but only
> > > > >>> having a group set within that container holds the actual IOMMU
> > > > >>> references.  container->iommu_data is going to be NULL once the
> > > > >>> groups are lost.  Thanks,      
> > > > >>
> > > > >>
> > > > >> Container owns the iommu tables and this is what I care about here, groups
> > > > >> attached or not - this is handled separately via IOMMU group list in a
> > > > >> specific iommu_table struct, these groups get detached from iommu_table
> > > > >> when they are removed from a container.    
> > > > > 
> > > > > The container doesn't own anything, the container is privileged by the
> > > > > groups being attached to it.  When groups are closed, they detach from
> > > > > the container and once the container group list is empty the iommu
> > > > > backend is released and iommu_data is NULL.  A container reference
> > > > > doesn't give you what you're looking for.  It implies nothing about the
> > > > > iommu backend.    
> > > > 
> > > > 
> > > > Well. Backend is a part of a container and since a backend owns tables, a
> > > > container owns them too.  
> > > 
> > > The IOMMU backend is accessed through the container, but that backend
> > > is privileged by the groups it contains.  Once those groups are gone,
> > > the IOMMU backend is released, regardless of whatever reference you
> > > have to the container itself such as you're attempting to do here.  In
> > > that sense, the container does not own those tables.  
> > 
> > So, the thing is that what KVM fundamentally needs is a handle on the
> > container.  KVM is essentially modelling the DMA address space of a
> > single guest bus, and the container is what's attached to that.
> > 
> > The first part of the problem is that KVM wants to basically invoke
> > vfio_dma_map() operations without bouncing via qemu.  Because
> > vfio_dma_map() works on the container level, that's the handle that
> > KVM needs to hold.
> > 
> > The second part of the problem is that in order to reduce overhead
> > further, we want to operate in real mode, which means bypassing most
> > of the usual VFIO structure and going directly(ish) from the KVM
> > hcall emulation to the IOMMU backend behind VFIO.  This complicates
> > matters a fair bit.  Because it is, explicitly, a performance hack,
> > some degree of ugliness is probably inevitable.
> > 
> > Alexey - actually implementing this in two stages might make this
> > clearer.  The first stage wouldn't allow real mode, and would call
> > through the same vfio_dma_map() path as qemu calls through now.  The
> > second stage would then put in place the necessary hacks to add real
> > mode support.
> > 
> > > > The problem I am trying to solve here is when KVM may release the
> > > > iommu_table objects.
> > > > 
> > > > "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> > > > matter) makes a link between KVM-spapr-tce-table and container and KVM can
> > > > start using tables (with referencing them).
> > > > 
> > > > First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> > > > from region_del() and this works if QEMU removes a window. However if QEMU
> > > > removes a vfio-pci device, region_del() is not called and KVM does not get
> > > > notified that it can release the iommu_table's because the
> > > > KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> > > > still used by emulated devices or other containers).
> > > > 
> > > > So it was suggested that we could do such "unset" somehow later assuming,
> > > > for example, on every "set" I could check if some of currently attached
> > > > containers are no more used - and this is where being able to know if there
> > > > is no backend helps - KVM remembers a container pointer and can check this
> > > > via vfio_container_get_iommu_data_ext().
> > > > 
> > > > The other option would be changing vfio_container_get_ext() to take a
> > > > callback+opaque which container would call when it destroys iommu_data.
> > > > This looks more intrusive and not very intuitive how to make it right -
> > > > container would have to keep track of all registered external users and
> > > > vfio_container_put_ext() would have to pass the same callback+opaque to
> > > > unregister the exact external user.  
> > > 
> > > I'm not in favor of anything resembling the code above or extensions
> > > beyond it, the container is the wrong place to do this.
> > >   
> > > > Or I could store container file* in KVM. Then iommu_data would never be
> > > > released until KVM-spapr-tce-table is destroyed.  
> > > 
> > > See above, holding a file pointer to the container doesn't do squat.
> > > The groups that are held by the container empower the IOMMU backend,
> > > references to the container itself don't matter.  Those references will
> > > not maintain the IOMMU data.
> > >    
> > > > Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> > > > would "unset" container from KVM-spapr-tce-table) is not an option as there
> > > > still may be devices using this KVM-spapr-tce-table.
> > > > 
> > > > What obvious and nice solution am I missing here? Thanks.  
> > > 
> > > The interactions with the IOMMU backend that seem relevant are
> > > vfio_iommu_drivers_ops.{detach_group,release}.  The kvm-vfio pseudo
> > > device is also used to tell kvm about groups as they come and go and
> > > has a way to check extensions, and thus properties of the IOMMU
> > > backend.  All of these are available for your {ab}use.  Thanks,  
> > 
> > So, Alexey started trying to do this via the KVM-VFIO device, but it's
> > a really bad fit.  As noted above, fundamentally it's a container we
> > need to attach to the kvm-spapr-tce-table object, since what that
> > represents is a guest bus DMA address space, and by definition all the
> > groups in a container must have the same DMA address space.
> 
> That's all fine and good, but the point remains that a reference to the
> container is no assurance of the iommu state.  The iommu state is
> maintained by the user and the groups attached to the container.  If
> the groups are removed, your container reference no long has any iommu
> backing and iommu_data is worthless.  The user can do this as well by
> un-setting the iommu.  I understand what you're trying to do, it's just
> wrong.  Thanks,

I'm trying to figure out how to do this right, and it's not at all
obvious.  The container may be wrong, but that doesn't have the
KVM-VFIO device any more useful.  Attempting to do this at the group
level is at least as wrong for the reasons I've mentioned elsewhere.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20160817/731152c4/attachment-0001.sig>


More information about the Linuxppc-dev mailing list