[PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

Alex Williamson alex.williamson at redhat.com
Thu Mar 16 03:39:04 AEDT 2017


On Thu, 16 Mar 2017 00:21:07 +1100
Alexey Kardashevskiy <aik at ozlabs.ru> wrote:

> On 15/03/17 08:05, Alex Williamson wrote:
> > On Fri, 10 Mar 2017 14:53:37 +1100
> > Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >   
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> >> without passing them to user space which saves time on switching
> >> to user space and back.
> >>
> >> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> >> KVM tries to handle a TCE request in the real mode, if failed
> >> it passes the request to the virtual mode to complete the operation.
> >> If it a virtual mode handler fails, the request is passed to
> >> the user space; this is not expected to happen though.
> >>
> >> To avoid dealing with page use counters (which is tricky in real mode),
> >> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> >> to pre-register the userspace memory. The very first TCE request will
> >> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> >> of the TCE table (iommu_table::it_userspace) is not allocated till
> >> the very first mapping happens and we cannot call vmalloc in real mode.
> >>
> >> If we fail to update a hardware IOMMU table unexpected reason, we just
> >> clear it and move on as there is nothing really we can do about it -
> >> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> >> will be mirrored automatically to the hardware and there is no interface
> >> to report to the guest about possible failures.
> >>
> >> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> >> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> >> and associates a physical IOMMU table with the SPAPR TCE table (which
> >> is a guest view of the hardware IOMMU table). The iommu_table object
> >> is cached and referenced so we do not have to look up for it in real mode.
> >>
> >> This does not implement the UNSET counterpart as there is no use for it -
> >> once the acceleration is enabled, the existing userspace won't
> >> disable it unless a VFIO container is destroyed; this adds necessary
> >> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> >>
> >> As this creates a descriptor per IOMMU table-LIOBN couple (called
> >> kvmppc_spapr_tce_iommu_table), it is possible to have several
> >> descriptors with the same iommu_table (hardware IOMMU table) attached
> >> to the same LIOBN; we do not remove duplicates though as
> >> iommu_table_ops::exchange not just update a TCE entry (which is
> >> shared among IOMMU groups) but also invalidates the TCE cache
> >> (one per IOMMU group).
> >>
> >> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> >> space.
> >>
> >> This adds real mode version of WARN_ON_ONCE() as the generic version
> >> causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
> >> returns in the code, this also adds a check for already existing
> >> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> >>
> >> This finally makes use of vfio_external_user_iommu_id() which was
> >> introduced quite some time ago and was considered for removal.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> >> ---
> >> Changes:
> >> v8:
> >> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> >> to handle them
> >> * changed vmalloc_to_phys() callers to return H_HARDWARE
> >> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> >> and added a comment about this in the code
> >> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> >> and do WARN_ON
> >> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> >> have all vmalloc_to_phys() callsites covered
> >>
> >> v7:
> >> * added realmode-friendly WARN_ON_ONCE_RM
> >>
> >> v6:
> >> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> >> * moved kvmppc_gpa_to_ua() to TCE validation
> >>
> >> v5:
> >> * changed error codes in multiple places
> >> * added bunch of WARN_ON() in places which should not really happen
> >> * adde a check that an iommu table is not attached already to LIOBN
> >> * dropped explicit calls to iommu_tce_clear_param_check/
> >> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> >> call them anyway (since the previous patch)
> >> * if we fail to update a hardware IOMMU table for unexpected reason,
> >> this just clears the entry
> >>
> >> v4:
> >> * added note to the commit log about allowing multiple updates of
> >> the same IOMMU table;
> >> * instead of checking for if any memory was preregistered, this
> >> returns H_TOO_HARD if a specific page was not;
> >> * fixed comments from v3 about error handling in many places;
> >> * simplified TCE handlers and merged IOMMU parts inline - for example,
> >> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> >> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> >> the first attached table only (makes the code simpler);
> >>
> >> v3:
> >> * simplified not to use VFIO group notifiers
> >> * reworked cleanup, should be cleaner/simpler now
> >>
> >> v2:
> >> * reworked to use new VFIO notifiers
> >> * now same iommu_table may appear in the list several times, to be fixed later
> >> ---
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/kvm_host.h        |   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
> >>  include/uapi/linux/kvm.h                   |   8 +
> >>  arch/powerpc/kvm/book3s_64_vio.c           | 323 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 201 +++++++++++++++++-
> >>  arch/powerpc/kvm/powerpc.c                 |   2 +
> >>  virt/kvm/vfio.c                            |  60 ++++++
> >>  8 files changed, 623 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> >> index ef51740c67ca..f95d867168ea 100644
> >> --- a/Documentation/virtual/kvm/devices/vfio.txt
> >> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> >> @@ -16,7 +16,25 @@ Groups:
> >>  
> >>  KVM_DEV_VFIO_GROUP attributes:
> >>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> >> +	kvm_device_attr.addr points to an int32_t file descriptor
> >> +	for the VFIO group.
> >> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> >> +	allocated by sPAPR KVM.
> >> +	kvm_device_attr.addr points to a struct:
> >>  
> >> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> >> -for the VFIO group.
> >> +	struct kvm_vfio_spapr_tce {
> >> +		__u32	argsz;
> >> +		__u32	flags;
> >> +		__s32	groupfd;
> >> +		__s32	tablefd;
> >> +	};
> >> +
> >> +	where
> >> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;  
> > 
> > kvm_vfio_spapr_tce_liobn does not exist.  s/_liobn//?  
> 
> Correct.
> 
> 
> >   
> >> +	@flags are not supported now, must be zero;  
> > 
> > We do this argsz/flags thing on vfio ioctls because ioctls are a bit
> > more of a restricted resource.  We don't want to burn through them so
> > we make them expandable.  I don't know that we have that restriction
> > here and the ADD/DEL support certainly doesn't include it.  Maybe this
> > isn't necessary?  
> 
> 
> It is not but since I am going to have padding there, I thought I give it a
> name. Totally pointless and "u8 padding[4]" is better?

Or since these are not ioctls, we simply consider them to have a very
specific and limited purpose.  If we have a need to expand on that,
define a new one that either builds on the existing ones, as
GROUP_SET_SPAPR_TCE builds on GROUP_ADD, or supersedes them.  For
ioctls, it's more costly if we abandon one, here it doesn't seem like a
big deal.  Thanks,

Alex


More information about the Linuxppc-dev mailing list