[PATCH kernel 9/9] KVM: PPC: VFIO device: support SPAPR TCE

Thu Jun 9 16:47:59 AEST 2016

On 23/03/16 14:03, David Gibson wrote:
> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
>> Uff, lost cc: list. Added back. Some comments below.
>>
>>
>> On 03/21/2016 04:19 PM, David Gibson wrote:
>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>>>> On March 15, 2016 17:29:26 David Gibson <david at gibson.dropbear.id.au> wrote:
>>>>
>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we need
>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>>>
>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>>>> is added which accepts:
>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>>>> TCE table;
>>>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>>>
>>>>>>>>>> Before notifying KVM about new link, this check the group for being
>>>>>>>>>> registered with KVM device in order to release them at unexpected KVM
>>>>>>>>>> finish.
>>>>>>>>>>
>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
>>>>>>>>>> space.
>>>>>>>>>>
>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to let it
>>>>>>>>>> link to a KVM module.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>>>  virt/kvm/vfio.c                            | 106
>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>>>
>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>
>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in which
>>>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>>>
>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>>>> tracking
>>>>>>>>>> +	kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> +	for the VFIO group.
>>>>>>>>>>
>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>> -for the VFIO group.
>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>>>>>>>>>> +	kvm_device_attr.addr points to a struct:
>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>> +			__u32	argsz;
>>>>>>>>>> +			__s32	fd;
>>>>>>>>>> +			__u32	liobn;
>>>>>>>>>> +			__u8	pad[4];
>>>>>>>>>> +			__u64	start_addr;
>>>>>>>>>> +		};
>>>>>>>>>> +		where
>>>>>>>>>> +		@argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>>>> +		@fd is a file descriptor for a VFIO group;
>>>>>>>>>> +		@liobn is a logical bus id to be associated with the group;
>>>>>>>>>> +		@start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>>>
>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 32bit
>>>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>>>  	select KVM
>>>>>>>>>>  	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>>>  	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>>>> +	select KVM_VFIO if VFIO
>>>>>>>>>>  	---help---
>>>>>>>>>>  	  Support running unmodified book3s_64 and book3s_32 guest kernels
>>>>>>>>>>  	  in virtual machines on book3s_64 host processors.
>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>>>  KVM := ../../../virt/kvm
>>>>>>>>>>
>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>>>> -		$(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>>>> +		$(KVM)/eventfd.o
>>>>>>>>>
>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I added
>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>>>> itself should be available always.
>>>>>>>>
>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a module.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
>>>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>>>  	book3s_xics.o
>>>>>>>>>>
>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>>>> +	$(KVM)/vfio.o \
>>>>>>>>>> +
>>>>>>>>>>  kvm-book3s_64-module-objs += \
>>>>>>>>>>  	$(KVM)/kvm_main.o \
>>>>>>>>>>  	$(KVM)/eventfd.o \
>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>>>> *kvm, long ext)
>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE:
>>>>>>>>>>  	case KVM_CAP_SPAPR_TCE_64:
>>>>>>>>>> +	case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>>>  	case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>>>  	case KVM_CAP_PPC_RTAS:
>>>>>>>>>>  	case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP			1
>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL			2
>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN	3
>>>>>>>>>>
>>>>>>>>>>  enum kvm_device_type {
>>>>>>>>>>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>>>  	KVM_DEV_TYPE_MAX,
>>>>>>>>>>  };
>>>>>>>>>>
>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>> +	__u32	argsz;
>>>>>>>>>> +	__s32	fd;
>>>>>>>>>> +	__u32	liobn;
>>>>>>>>>> +	__u8	pad[4];
>>>>>>>>>> +	__u64	start_addr;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>>  /*
>>>>>>>>>>   * ioctls for VM fds
>>>>>>>>>>   */
>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>>>  #include <linux/vfio.h>
>>>>>>>>>>  #include "vfio.h"
>>>>>>>>>>
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>>  struct kvm_vfio_group {
>>>>>>>>>>  	struct list_head node;
>>>>>>>>>>  	struct vfio_group *vfio_group;
>>>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>>>  	symbol_put(vfio_group_put_external_user);
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
>>>>>>>>>> +{
>>>>>>>>>> +	int (*fn)(struct vfio_group *);
>>>>>>>>>> +	int ret = -1;
>>>>>>>>>
>>>>>>>>> Should this be -ESOMETHING?
>>>>>>>>>
>>>>>>>>>> +	fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>>>> +	if (!fn)
>>>>>>>>>> +		return ret;
>>>>>>>>>> +
>>>>>>>>>> +	ret = fn(vfio_group);
>>>>>>>>>> +
>>>>>>>>>> +	symbol_put(vfio_external_user_iommu_id);
>>>>>>>>>> +
>>>>>>>>>> +	return ret;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
>>>>>>>>>>  {
>>>>>>>>>>  	long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>>>> kvm_device *dev)
>>>>>>>>>>  	mutex_unlock(&kv->lock);
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>>>> +		struct vfio_group *vfio_group)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>>>> function?
>>>>>>>>
>>>>>>>> Having less patches which touch different maintainers areas is better. I
>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can in
>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view of TCE
>>>>>>>> table".
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +{
>>>>>>>>>> +	int group_id;
>>>>>>>>>> +	struct iommu_group *grp;
>>>>>>>>>> +
>>>>>>>>>> +	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>>>> +	grp = iommu_group_get_by_id(group_id);
>>>>>>>>>> +	if (grp) {
>>>>>>>>>> +		kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>>>> +		iommu_group_put(grp);
>>>>>>>>>> +	}
>>>>>>>>>> +}
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>>>>>>>>>>  {
>>>>>>>>>>  	struct kvm_vfio *kv = dev->private;
>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>  				continue;
>>>>>>>>>>
>>>>>>>>>>  			list_del(&kvg->node);
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>
>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at the
>>>>>>>>> callsite.
>>>>>>>>
>>>>>>>> It is questionable. A x86 reader may decide that
>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>>>> confused.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +			kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>>>> +					kvg->vfio_group);
>>>>>>>>>> +#endif
>>>>>>>>>>  			kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>>>  			kfree(kvg);
>>>>>>>>>>  			ret = 0;
>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct kvm_device
>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>  		kvm_vfio_update_coherency(dev);
>>>>>>>>>>
>>>>>>>>>>  		return ret;
>>>>>>>>>> +
>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>>>> +		struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>>>> +		unsigned long minsz;
>>>>>>>>>> +		struct kvm_vfio *kv = dev->private;
>>>>>>>>>> +		struct vfio_group *vfio_group;
>>>>>>>>>> +		struct kvm_vfio_group *kvg;
>>>>>>>>>> +		struct fd f;
>>>>>>>>>> +
>>>>>>>>>> +		minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>>>> +				start_addr);
>>>>>>>>>> +
>>>>>>>>>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>>>> +			return -EFAULT;
>>>>>>>>>> +
>>>>>>>>>> +		if (param.argsz < minsz)
>>>>>>>>>> +			return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> +		f = fdget(param.fd);
>>>>>>>>>> +		if (!f.file)
>>>>>>>>>> +			return -EBADF;
>>>>>>>>>> +
>>>>>>>>>> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>>>> +		fdput(f);
>>>>>>>>>> +
>>>>>>>>>> +		if (IS_ERR(vfio_group))
>>>>>>>>>> +			return PTR_ERR(vfio_group);
>>>>>>>>>> +
>>>>>>>>>> +		ret = -ENOENT;
>>>>>>>>>
>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  It's
>>>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>>>> IOMMU types.
>>>>>>>>
>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type is a
>>>>>>>> VFIO container property, not a group property and here (KVM) we only have
>>>>>>>> groups.
>>>>>>>
>>>>>>> Which, as mentioned previously, is broken.
>>>>>>
>>>>>> Which I am failing to follow you on this.
>>>>>>
>>>>>> What I am trying to achieve here is pretty much referencing a group so it
>>>>>> cannot be reused. Plus LIOBNs.
>>>>>
>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>>>> (guest) LIOBN affects a group it must affect all groups in the
>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>>>> to group.
>>>>
>>>> I can see your point but i don't see how to proceed now, I'm totally stuck.
>>>> Pass container fd and then implement new api to lock containers somehow and
>>>
>>> I'm not really understanding what the question is about locking containers.
>>>
>>>> enumerate groups when updating TCE table (including real mode)?
>>>
>>> Why do you need to enumerate groups?  The groups within the container
>>> share a TCE table, so can't you just update that once?
>>
>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
>> invalidate) register address, it is still per PE but this does not matter
>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
>> the picture.
> 
> True, you'll need to enumerate the groups for invalidates.  But you
> need that already, right.
> 
>>>> Plus new API when we remove a group from a container as the result of guest
>>>> PCI hot unplug?
>>>
>>> I assume you mean a kernel internal API, since it shouldn't need
>>> anything else visible to userspace.  Won't this happen naturally?
>>> When the group is removed from the container, it will get its own TCE
>>> table instead of the previously shared one.
>>>
>>>>>> Passing a container fd does not make much
>>>>>> sense here as the VFIO device would walk through groups, reference them and
>>>>>> that is it, there is no locking on VFIO containters and so far there was no
>>>>>> need to teach KVM about containers.
>>>>>>
>>>>>> What do I miss now?
>>>>>
>>>>> Referencing the groups is essentially just a useful side effect.  The
>>>>> important functionality is informing VFIO of the guest LIOBNs; and
>>>>> LIOBNs map to containers, not groups.
>>>>
>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>>>> can be one or many or none containers per liobn.
>>>
>>> Ah, true.
>>
>> So I need to add new kernel API for KVM to get table(s) from VFIO
>> container(s). And invent some locking mechanism to prevent table(s) (or
>> associated container(s)) from going away, like
>> vfio_group_get_external_user/vfio_group_put_external_user but for
>> containers. Correct?
> 
> Well, the container is attached to an fd, so if you get a reference on
> the file* that should do it.

I am still trying to think of how to implement this suggestion.

I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right
interface as it knows nothing about KVM. There is VFIO-KVM device but it
does not have idea about containers.

So I have to:

Wenever a container is created or removed, notify the VFIO-KVM device by
passing there a container fd. ok.

Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to
what LIOBN so the realmode handlers could do the job. The real mode TCE
handlers get LIOBN, find a guest view table and update it. Now I want to
update the hardware table which is iommu_table attached to LIOBN.

I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container fd.

Now VFIO-KVM device needs to extract iommu_table's from the container.
These iommu_table pointers are stored in "struct tce_container" which is
local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I
cannot export and use that.

The other way to go would be adding API to VFIO to enumerate IOMMU groups
in a container and use iommu_table pointers stored in iommu_table_group of
each group (in fact the very first group will be enough as multiple groups
in a container share the table). Adding vfio_container_get_groups() when
only first one is needed is quite tricky in terms of maintainers approvals.

So what would be the right course of action? Thanks.

-- 
Alexey