[PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container

Fri Oct 21 12:47:37 AEDT 2016

On Fri, 21 Oct 2016 11:21:34 +1100
David Gibson <david at gibson.dropbear.id.au> wrote:

> On Thu, Oct 20, 2016 at 06:31:21PM +1100, Nicholas Piggin wrote:
> > On Thu, 20 Oct 2016 14:03:49 +1100
> > Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >   
> > > In some situations the userspace memory context may live longer than
> > > the userspace process itself so if we need to do proper memory context
> > > cleanup, we better cache @mm and use it later when the process is gone
> > > (@current or @current->mm is NULL).
> > > 
> > > This references mm and stores the pointer in the container; this is done
> > > when a container is just created so checking for !current->mm in other
> > > places becomes pointless.
> > > 
> > > This replaces current->mm with container->mm everywhere except debug
> > > prints.
> > > 
> > > This adds a check that current->mm is the same as the one stored in
> > > the container to prevent userspace from registering memory in other
> > > processes.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> > > ---
> > >  drivers/vfio/vfio_iommu_spapr_tce.c | 127 ++++++++++++++++++++----------------
> > >  1 file changed, 71 insertions(+), 56 deletions(-)
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > index d0c38b2..6b0b121 100644
> > > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > @@ -31,49 +31,46 @@  
> > 
> > Does it make sense to move the rest of these hunks into patch 2?
> > I think they're similarly just moving the mm reference into callers.
> > 
> >   
> > >  static void tce_iommu_detach_group(void *iommu_data,
> > >  		struct iommu_group *iommu_group);
> > >  
> > > -static long try_increment_locked_vm(long npages)
> > > +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> > >  {
> > >  	long ret = 0, locked, lock_limit;
> > >  
> > > -	if (!current || !current->mm)
> > > -		return -ESRCH; /* process exited */
> > > -
> > >  	if (!npages)
> > >  		return 0;
> > >  
> > > -	down_write(&current->mm->mmap_sem);
> > > -	locked = current->mm->locked_vm + npages;
> > > +	down_write(&mm->mmap_sem);
> > > +	locked = mm->locked_vm + npages;
> > >  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >  	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> > >  		ret = -ENOMEM;
> > >  	else
> > > -		current->mm->locked_vm += npages;
> > > +		mm->locked_vm += npages;
> > >  
> > >  	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> > >  			npages << PAGE_SHIFT,
> > > -			current->mm->locked_vm << PAGE_SHIFT,
> > > +			mm->locked_vm << PAGE_SHIFT,
> > >  			rlimit(RLIMIT_MEMLOCK),
> > >  			ret ? " - exceeded" : "");
> > >  
> > > -	up_write(&current->mm->mmap_sem);
> > > +	up_write(&mm->mmap_sem);
> > >  
> > >  	return ret;
> > >  }
> > >  
> > > -static void decrement_locked_vm(long npages)
> > > +static void decrement_locked_vm(struct mm_struct *mm, long npages)
> > >  {
> > > -	if (!current || !current->mm || !npages)
> > > +	if (!mm || !npages)
> > >  		return; /* process exited */  
> > 
> > I know you're trying to be defensive and change as little logic as possible,
> > but some cases should be an error, and I think some of the "process exited"
> > comments were wrong anyway.
> > 
> > Maybe pull the !mm test into the caller and make it WARN_ON?
> > 
> >   
> > > @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg)
> > >  		return ERR_PTR(-EINVAL);
> > >  	}
> > >  
> > > +	if (!current->mm)
> > > +		return ERR_PTR(-ESRCH); /* process exited */  
> > 
> > A userspace thread in the kernel can't have its mm disappear, unless you
> > are actually in the exit code. !current->mm is more like a test for a kernel
> > thread.
> > 
> >   
> > > +
> > >  	container = kzalloc(sizeof(*container), GFP_KERNEL);
> > >  	if (!container)
> > >  		return ERR_PTR(-ENOMEM);
> > > @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg)
> > >  
> > >  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> > >  
> > > +	container->mm = current->mm;
> > > +	atomic_inc(&container->mm->mm_count);
> > > +
> > >  	return container;  
> > 
> > It's a nitpick if you respin the patch, but I guess it would better be
> > described as a reference than a cache of the object. "have tce_container
> > take a reference to mm_struct".
> > 
> >   
> > > @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container *container,
> > >  	unsigned long hpa;
> > >  	enum dma_data_direction dirtmp;
> > >  
> > > +	if (container->mm != current->mm)
> > > +		return -ESRCH;  
> > 
> > Good, is this condition now enforced on all entrypoints that use
> > container->mm (except the final teardown)? (The mlock/rlimit stuff,
> > as we talked about before, doesn't make sense if not).  
> 
> Right.  I don't know that it's actually dangerous, but I think it
> would be needlessly weird for one process to be able to manipulate
> another process's mm via the container fd.  So all the entry points
> that are directly called from userspace (basically, the ioctl()s)
> should verify that current->mm matches container->mm (except the one
> which initializes container->mm, obviously).
> 
> One other concern.  If I follow the logic correctly, if a process
> created a container, passed the fd to another process then exited, the
> container fd held by the other process would keep the original
> process's mm alive indefinitely.  I'm not sure if that's a problem.
> Nick?
> 

Keeping the mm alive indefinitely is okay. When a process exits, it
tears down its user mappings, but the mm struct itself stays around
for as long as anything else holds a reference to it. Kernel threads,
for example, can pick up such an mm and keep it alive, which is what
was causing the problem in the first place.
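
To illustrate the two lifetimes, here is a toy userspace model (not
kernel code). The mm_model struct and the mm_model_*() helpers are made
up for this sketch; they only mimic how mm_users controls the user
mappings while mm_count controls the struct, with the container taking
an mm_count reference the way the atomic_inc() in tce_iommu_open() does.

```c
/* Toy userspace model of mm_users vs. mm_count; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct mm_model {
	int mm_users;        /* users of the address space (the process) */
	int mm_count;        /* references to the struct itself */
	bool mappings_live;  /* user mappings not yet torn down */
	bool struct_live;    /* struct not yet freed */
};

static void mm_model_init(struct mm_model *mm)
{
	mm->mm_users = 1;
	mm->mm_count = 1;    /* all mm_users together hold one mm_count ref */
	mm->mappings_live = true;
	mm->struct_live = true;
}

/* Pin the struct only, as atomic_inc(&mm->mm_count) does in the patch. */
static void mm_model_grab(struct mm_model *mm)
{
	mm->mm_count++;
}

/* Drop a struct reference; the last one "frees" the struct. */
static void mm_model_drop(struct mm_model *mm)
{
	if (--mm->mm_count == 0)
		mm->struct_live = false;
}

/* Drop an address-space user; the last one tears down the mappings. */
static void mm_model_put(struct mm_model *mm)
{
	if (--mm->mm_users == 0) {
		mm->mappings_live = false;
		mm_model_drop(mm);
	}
}
```

In this model, calling mm_model_grab() for the container and then
mm_model_put() for the exiting process leaves the mappings gone but the
struct still live; only the container's final mm_model_drop() releases
it.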

Thanks,
Nick