[PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
Alexey Kardashevskiy
aik at ozlabs.ru
Wed Sep 7 19:09:55 AEST 2016
On 29/08/16 23:27, David Gibson wrote:
> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote:
>> On 18/08/16 10:22, Alexey Kardashevskiy wrote:
>>> On 17/08/16 13:17, David Gibson wrote:
>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote:
>>>>> On Fri, 12 Aug 2016 15:46:01 +1000
>>>>> David Gibson <david at gibson.dropbear.id.au> wrote:
>>>>>
>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000
>>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
>>>>>>>
>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote:
>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
>>>>>>>>>
>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote:
>>>>>>>>>>> On Wed, 3 Aug 2016 18:40:55 +1000
>>>>>>>>>>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>>>>>>> memory while there are external users such as KVM.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/vfio/vfio.c | 30 ++++++++++++++++++++++++++++++
>>>>>>>>>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>>>>>> include/linux/vfio.h | 6 ++++++
>>>>>>>>>>>> 3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>>>>>> EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>>>>>>
>>>>>>>>>>>> /**
>>>>>>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>>>>>>> + * dynamically.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + */
>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct vfio_container *container = filep->private_data;
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (filep->f_op != &vfio_fops)
>>>>>>>>>>>> + return ERR_PTR(-EINVAL);
>>>>>>>>>>>> +
>>>>>>>>>>>> + vfio_container_get(container);
>>>>>>>>>>>> +
>>>>>>>>>>>> + return container;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + vfio_container_put(container);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + return container->iommu_data;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> * Sub-module support
>>>>>>>>>>>> */
>>>>>>>>>>>> /*
>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>>>>>> .detach_group = tce_iommu_detach_group,
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>>>>>>> + u64 offset)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct tce_container *container = iommu_data;
>>>>>>>>>>>> + struct iommu_table *tbl = NULL;
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>>>>>>> + return NULL;
>>>>>>>>>>>> +
>>>>>>>>>>>> + iommu_table_get(tbl);
>>>>>>>>>>>> +
>>>>>>>>>>>> + return tbl;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>>>>>>> +
>>>>>>>>>>>> static int __init tce_iommu_init(void)
>>>>>>>>>>>> {
>>>>>>>>>>>> return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>>>>>> MODULE_LICENSE("GPL v2");
>>>>>>>>>>>> MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>>>>>> MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>>>>>>> -
>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>>>>>>> --- a/include/linux/vfio.h
>>>>>>>>>>>> +++ b/include/linux/vfio.h
>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>>>>>> extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>>>>>> extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>>>>>> unsigned long arg);
>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>>>>>>> + struct vfio_container *container);
>>>>>>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>>>>>>> + void *iommu_data, u64 offset);
>>>>>>>>>>>>
>>>>>>>>>>>> /*
>>>>>>>>>>>> * Sub-module helpers
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think you need to take a closer look at the lifecycle of a container,
>>>>>>>>>>> having a reference means the container itself won't go away, but only
>>>>>>>>>>> having a group set within that container holds the actual IOMMU
>>>>>>>>>>> references. container->iommu_data is going to be NULL once the
>>>>>>>>>>> groups are lost. Thanks,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The container owns the iommu tables, and that is what I care about here,
>>>>>>>>>> whether groups are attached or not - groups are handled separately via the
>>>>>>>>>> IOMMU group list in a specific iommu_table struct, and they get detached
>>>>>>>>>> from the iommu_table when they are removed from a container.
>>>>>>>>>
>>>>>>>>> The container doesn't own anything, the container is privileged by the
>>>>>>>>> groups being attached to it. When groups are closed, they detach from
>>>>>>>>> the container and once the container group list is empty the iommu
>>>>>>>>> backend is released and iommu_data is NULL. A container reference
>>>>>>>>> doesn't give you what you're looking for. It implies nothing about the
>>>>>>>>> iommu backend.
>>>>>>>>
>>>>>>>>
>>>>>>>> Well. The backend is part of the container, and since the backend owns
>>>>>>>> the tables, the container owns them too.
>>>>>>>
>>>>>>> The IOMMU backend is accessed through the container, but that backend
>>>>>>> is privileged by the groups it contains. Once those groups are gone,
>>>>>>> the IOMMU backend is released, regardless of whatever reference you
>>>>>>> have to the container itself such as you're attempting to do here. In
>>>>>>> that sense, the container does not own those tables.
>>>>>>
>>>>>> So, the thing is that what KVM fundamentally needs is a handle on the
>>>>>> container. KVM is essentially modelling the DMA address space of a
>>>>>> single guest bus, and the container is what's attached to that.
>>>>>>
>>>>>> The first part of the problem is that KVM wants to basically invoke
>>>>>> vfio_dma_map() operations without bouncing via qemu. Because
>>>>>> vfio_dma_map() works on the container level, that's the handle that
>>>>>> KVM needs to hold.
>>>>>>
>>>>>> The second part of the problem is that in order to reduce overhead
>>>>>> further, we want to operate in real mode, which means bypassing most
>>>>>> of the usual VFIO structure and going directly(ish) from the KVM
>>>>>> hcall emulation to the IOMMU backend behind VFIO. This complicates
>>>>>> matters a fair bit. Because it is, explicitly, a performance hack,
>>>>>> some degree of ugliness is probably inevitable.
>>>>>>
>>>>>> Alexey - actually implementing this in two stages might make this
>>>>>> clearer. The first stage wouldn't allow real mode, and would call
>>>>>> through the same vfio_dma_map() path as qemu calls through now. The
>>>>>> second stage would then put in place the necessary hacks to add real
>>>>>> mode support.
>>>>>>
>>>>>>>> The problem I am trying to solve here is when KVM may release the
>>>>>>>> iommu_table objects.
>>>>>>>>
>>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>>>>>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>>>>>>> start using tables (with referencing them).
>>>>>>>>
>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>>>>>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>>>>>>> notified that it can release the iommu_table's because the
>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>>>>>>> still used by emulated devices or other containers).
>>>>>>>>
>>>>>>>> So it was suggested that we could do such an "unset" somehow later -
>>>>>>>> for example, on every "set" I could check whether some of the currently
>>>>>>>> attached containers are no longer used - and this is where being able
>>>>>>>> to tell that there is no backend helps: KVM remembers a container
>>>>>>>> pointer and can check it via vfio_container_get_iommu_data_ext().
>>>>>>>>
>>>>>>>> The other option would be changing vfio_container_get_ext() to take a
>>>>>>>> callback+opaque which the container would call when it destroys
>>>>>>>> iommu_data. This looks more intrusive, and it is not obvious how to
>>>>>>>> make it right - the container would have to keep track of all
>>>>>>>> registered external users, and vfio_container_put_ext() would have to
>>>>>>>> pass the same callback+opaque to unregister the exact external user.
>>>>>>>
>>>>>>> I'm not in favor of anything resembling the code above or extensions
>>>>>>> beyond it, the container is the wrong place to do this.
>>>>>>>
>>>>>>>> Or I could store container file* in KVM. Then iommu_data would never be
>>>>>>>> released until KVM-spapr-tce-table is destroyed.
>>>>>>>
>>>>>>> See above, holding a file pointer to the container doesn't do squat.
>>>>>>> The groups that are held by the container empower the IOMMU backend,
>>>>>>> references to the container itself don't matter. Those references will
>>>>>>> not maintain the IOMMU data.
>>>>>>>
>>>>>>>> Recreating the KVM-spapr-tce-table on every vfio-pci hot-unplug
>>>>>>>> (closing its fd would "unset" the container from the
>>>>>>>> KVM-spapr-tce-table) is not an option, as there may still be devices
>>>>>>>> using this KVM-spapr-tce-table.
>>>>>>>>
>>>>>>>> What obvious and nice solution am I missing here? Thanks.
>>>>>>>
>>>>>>> The interactions with the IOMMU backend that seem relevant are
>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}. The kvm-vfio pseudo
>>>>>>> device is also used to tell kvm about groups as they come and go and
>>>>>>> has a way to check extensions, and thus properties of the IOMMU
>>>>>>> backend. All of these are available for your {ab}use. Thanks,
>>>>>>
>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but it's
>>>>>> a really bad fit. As noted above, fundamentally it's a container we
>>>>>> need to attach to the kvm-spapr-tce-table object, since what that
>>>>>> represents is a guest bus DMA address space, and by definition all the
>>>>>> groups in a container must have the same DMA address space.
>>>>>
>>>>> That's all fine and good, but the point remains that a reference to the
>>>>> container is no assurance of the iommu state. The iommu state is
>>>>> maintained by the user and the groups attached to the container. If
>>>>> the groups are removed, your container reference no longer has any iommu
>>>>> backing and iommu_data is worthless. The user can do this as well by
>>>>> un-setting the iommu. I understand what you're trying to do, it's just
>>>>> wrong. Thanks,
>>>>
>>>> I'm trying to figure out how to do this right, and it's not at all
>>>> obvious. The container may be wrong, but that doesn't make the
>>>> KVM-VFIO device any more useful. Attempting to do this at the group
>>>> level is at least as wrong for the reasons I've mentioned elsewhere.
>>>>
>>>
>>> I could create a new fd, one per iommu_table; the fd would reference the
>>> iommu_table (not touching the iommu_table_group or the container). The
>>> VFIO SPAPR TCE backend would return it from VFIO_IOMMU_SPAPR_TCE_CREATE
>>> (the ioctl which creates windows), or I could add
>>> VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; then I'd pass this new fd to KVM
>>> or the KVM-spapr-tce-table to hook them up. To release the reference,
>>> the KVM-spapr-tce-table would have an "unset" ioctl(), and/or on every
>>> "set" I would check whether all attached tables have at least one
>>> iommu_table_group attached and, if none, release the table.
>>>
>>> This would make no change to generic VFIO code and very little change in
>>> SPAPR TCE backend. Would that be acceptable or it is horrible again? Thanks.
>>
>>
>> Ping?
>
> I'm still in Toronto after KVM Forum. I had a detailed discussion
> about this with Alex W, which I'll write up once I get back.
>
> The short version is that Alex more-or-less convinced me that we do
> need to go back to doing this with an interface based on linking
> groups to LIOBNs. That leads to an interface that's kind of weird and
> has some fairly counter-intuitive properties, but in the end it works
> out better than doing it with containers.
>
Soooo? :)
--
Alexey