[PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand

David Gibson david at gibson.dropbear.id.au
Thu Jun 14 22:35:25 AEST 2018


On Thu, Jun 14, 2018 at 04:35:18PM +1000, Alexey Kardashevskiy wrote:
> On 12/6/18 2:17 pm, David Gibson wrote:
> > On Fri, Jun 08, 2018 at 03:46:33PM +1000, Alexey Kardashevskiy wrote:
> >> At the moment we allocate the entire TCE table, twice (hardware part and
> >> userspace translation cache). This normally works as we normally have
> >> contigous memory and the guest will map entire RAM for 64bit DMA.
> >>
> >> However if we have sparse RAM (one example is a memory device), then
> >> we will allocate TCEs which will never be used as the guest only maps
> >> actual memory for DMA. If it is a single level TCE table, there is nothing
> >> we can really do but if it a multilevel table, we can skip allocating
> >> TCEs we know we won't need.
> >>
> >> This adds ability to allocate only first level, saving memory.
> >>
> >> This changes iommu_table::free() to avoid allocating of an extra level;
> >> iommu_table::set() will do this when needed.
> >>
> >> This adds @alloc parameter to iommu_table::exchange() to tell the callback
> >> if it can allocate an extra level; the flag is set to "false" for
> >> the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
> >> H_TOO_HARD.
> >>
> >> This still requires the entire table to be counted in mm::locked_vm.
> >>
> >> To be conservative, this only does on-demand allocation when
> >> the usespace cache table is requested which is the case of VFIO.
> >>
> >> The example math for a system replicating a powernv setup with NVLink2
> >> in a guest:
> >> 16GB RAM mapped at 0x0
> >> 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
> >>
> >> the table to cover that all with 64K pages takes:
> >> (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> >>
> >> If we allocate only necessary TCE levels, we will only need:
> >> (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
> >> levels).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> >> ---
> >>  arch/powerpc/include/asm/iommu.h              |  7 ++-
> >>  arch/powerpc/platforms/powernv/pci.h          |  6 ++-
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c           |  4 +-
> >>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 ++++++++++++++++++++-------
> >>  arch/powerpc/platforms/powernv/pci-ioda.c     |  8 ++--
> >>  drivers/vfio/vfio_iommu_spapr_tce.c           |  2 +-
> >>  6 files changed, 69 insertions(+), 27 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >> index 4bdcf22..daa3ee5 100644
> >> --- a/arch/powerpc/include/asm/iommu.h
> >> +++ b/arch/powerpc/include/asm/iommu.h
> >> @@ -70,7 +70,7 @@ struct iommu_table_ops {
> >>  			unsigned long *hpa,
> >>  			enum dma_data_direction *direction);
> >>  
> >> -	__be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
> >> +	__be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
> >>  #endif
> >>  	void (*clear)(struct iommu_table *tbl,
> >>  			long index, long npages);
> >> @@ -122,10 +122,13 @@ struct iommu_table {
> >>  	__be64 *it_userspace; /* userspace view of the table */
> >>  	struct iommu_table_ops *it_ops;
> >>  	struct kref    it_kref;
> >> +	int it_nid;
> >>  };
> >>  
> >> +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
> >> +		((tbl)->it_ops->useraddrptr((tbl), (entry), false))
> > 
> > Is real mode really the only case where you want to inhibit new
> > allocations?  I would have thought some paths would be read-only and
> > you wouldn't want to allocate, even in virtual mode.
> 
> There are paths when I do not want allocation but I can figure out that
> from dma direction flag, for example, I am cleaning up the table and I do
> not want any extra  allocation to happen there and they do happen because I
> made a mistake so I'll repost. Other than that, this @alloc flag is for
> real mode only.

Hm, ok.

Something just occurred to me: IIRC, going from the vmalloc() to the
explicit table structure was to avoid allocating pages for memory
regions that aren't there due to sparseness.  But.. won't you get that
with vmalloc() anyway, if the pages for the gaps never get
instantiated?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20180614/586a0507/attachment.sig>


More information about the Linuxppc-dev mailing list