[PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container

Fri Oct 21 11:21:34 AEDT 2016

On Thu, Oct 20, 2016 at 06:31:21PM +1100, Nicholas Piggin wrote:
> On Thu, 20 Oct 2016 14:03:49 +1100
> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> 
> > In some situations the userspace memory context may live longer than
> > the userspace process itself so if we need to do proper memory context
> > cleanup, we better cache @mm and use it later when the process is gone
> > (@current or @current->mm is NULL).
> > 
> > This references mm and stores the pointer in the container; this is done
> > when a container is just created so checking for !current->mm in other
> > places becomes pointless.
> > 
> > This replaces current->mm with container->mm everywhere except debug
> > prints.
> > 
> > This adds a check that current->mm is the same as the one stored in
> > the container to prevent userspace from registering memory in other
> > processes.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> > ---
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 127 ++++++++++++++++++++----------------
> >  1 file changed, 71 insertions(+), 56 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > index d0c38b2..6b0b121 100644
> > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > @@ -31,49 +31,46 @@
> 
> Does it make sense to move the rest of these hunks into patch 2?
> I think they're similarly just moving the mm reference into callers.
> 
> 
> >  static void tce_iommu_detach_group(void *iommu_data,
> >  		struct iommu_group *iommu_group);
> >  
> > -static long try_increment_locked_vm(long npages)
> > +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> >  {
> >  	long ret = 0, locked, lock_limit;
> >  
> > -	if (!current || !current->mm)
> > -		return -ESRCH; /* process exited */
> > -
> >  	if (!npages)
> >  		return 0;
> >  
> > -	down_write(&current->mm->mmap_sem);
> > -	locked = current->mm->locked_vm + npages;
> > +	down_write(&mm->mmap_sem);
> > +	locked = mm->locked_vm + npages;
> >  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >  	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> >  		ret = -ENOMEM;
> >  	else
> > -		current->mm->locked_vm += npages;
> > +		mm->locked_vm += npages;
> >  
> >  	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> >  			npages << PAGE_SHIFT,
> > -			current->mm->locked_vm << PAGE_SHIFT,
> > +			mm->locked_vm << PAGE_SHIFT,
> >  			rlimit(RLIMIT_MEMLOCK),
> >  			ret ? " - exceeded" : "");
> >  
> > -	up_write(&current->mm->mmap_sem);
> > +	up_write(&mm->mmap_sem);
> >  
> >  	return ret;
> >  }
> >  
> > -static void decrement_locked_vm(long npages)
> > +static void decrement_locked_vm(struct mm_struct *mm, long npages)
> >  {
> > -	if (!current || !current->mm || !npages)
> > +	if (!mm || !npages)
> >  		return; /* process exited */
> 
> I know you're trying to be defensive and change as little logic as possible,
> but some cases should be an error, and I think some of the "process exited"
> comments were wrong anyway.
> 
> Maybe pull the !mm test into the caller and make it WARN_ON?
> 
> 
> > @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg)
> >  		return ERR_PTR(-EINVAL);
> >  	}
> >  
> > +	if (!current->mm)
> > +		return ERR_PTR(-ESRCH); /* process exited */
> 
> A userspace thread in the kernel can't have its mm disappear, unless you
> are actually in the exit code. !current->mm is more like a test for a kernel
> thread.
> 
> 
> > +
> >  	container = kzalloc(sizeof(*container), GFP_KERNEL);
> >  	if (!container)
> >  		return ERR_PTR(-ENOMEM);
> > @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg)
> >  
> >  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> >  
> > +	container->mm = current->mm;
> > +	atomic_inc(&container->mm->mm_count);
> > +
> >  	return container;
> 
> It's a nitpick if you respin the patch, but I guess it would better be
> described as a reference than a cache of the object. "have tce_container
> take a reference to mm_struct".
> 
> 
> > @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container *container,
> >  	unsigned long hpa;
> >  	enum dma_data_direction dirtmp;
> >  
> > +	if (container->mm != current->mm)
> > +		return -ESRCH;
> 
> Good, is this condition now enforced on all entrypoints that use
> container->mm (except the final teardown)? (The mlock/rlimit stuff,
> as we talked about before, doesn't make sense if not).

Right.  I don't know that it's actually dangerous, but i think it
would be needlessly weird for one process to be able to manipulate
another process's mm via the container fd.  So all the entry points
that are directly called from userspace (basically, the ioctl()s)
should verify that current->mm matches container->mm (except the one
which initiallizes container->mm, obviously).

One other concern.  If I follow the logic correctly, if a process
created a container, passed the fd to another process then exited, the
container fd held by the other process would keep the original
process's mm alive indefinitely.  I'm not sure if that's a problem.
Nick?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20161021/bdadb2b3/attachment-0001.sig>