[PATCH 0/2] iommu: Remove iommu_sva_ops::mm_exit()

Jean-Philippe Brucker jean-philippe at linaro.org
Fri Apr 10 00:50:58 AEST 2020


On Thu, Apr 09, 2020 at 07:14:24AM -0700, Jacob Pan wrote:
> On Thu, 9 Apr 2020 08:39:05 +0200
> Jean-Philippe Brucker <jean-philippe at linaro.org> wrote:
> 
> > On Wed, Apr 08, 2020 at 04:48:02PM -0700, Jacob Pan wrote:
> > > On Wed, 8 Apr 2020 19:32:18 -0300
> > > Jason Gunthorpe <jgg at ziepe.ca> wrote:
> > >   
> > > > On Wed, Apr 08, 2020 at 02:35:52PM -0700, Jacob Pan wrote:  
> > > > > > On Wed, Apr 08, 2020 at 11:35:52AM -0700, Jacob Pan wrote:    
> > > > > > > Hi Jean,
> > > > > > > 
> > > > > > > On Wed,  8 Apr 2020 16:04:25 +0200
> > > > > > > Jean-Philippe Brucker <jean-philippe at linaro.org> wrote:
> > > > > > >       
> > > > > > > > The IOMMU SVA API currently requires device drivers to
> > > > > > > > implement an mm_exit() callback, which stops device jobs
> > > > > > > > that do DMA. This function is called in the release() MMU
> > > > > > > > notifier, when an address space that is shared with a
> > > > > > > > device exits.
> > > > > > > > 
> > > > > > > > > It has been noted several times during discussions about
> > > > > > > > SVA that cancelling DMA jobs can be slow and complex, and
> > > > > > > > doing it in the release() notifier might cause
> > > > > > > > synchronization issues (patch 2 has more background).
> > > > > > > > Device drivers must in any case call unbind() to remove
> > > > > > > > their bond, after stopping DMA from a more favorable
> > > > > > > > context (release of a file descriptor).
> > > > > > > > 
> > > > > > > > So after mm exits, rather than notifying device drivers,
> > > > > > > > we can hold on to the PASID until unbind(), ask IOMMU
> > > > > > > > drivers to silently abort DMA and Page Requests in the
> > > > > > > > meantime. This change should relieve the mmput()
> > > > > > > > path.      
> > > > > > >
> > > > > > > I assume mm is destroyed after all the FDs are closed      
> > > > > > 
> > > > > > FDs do not hold a mmget(), but they may hold a mmgrab(), ie
> > > > > > anything using mmu_notifiers has to hold a grab until the
> > > > > > notifier is destroyed, which is often triggered by FD close.
> > > > > >     
> > > > > Sorry, I don't get this. Are you saying we have to hold a
> > > > > mmgrab() between svm_bind/mmu_notifier_register and
> > > > > svm_unbind/mmu_notifier_unregister?    
> > > > 
> > > > Yes. This is done automatically for the caller inside the
> > > > mmu_notifier implementation. We now even store the mm_struct
> > > > pointer inside the notifier.
> > > > 
> > > > Once a notifier is registered the mm_struct remains valid memory
> > > > until the notifier is unregistered.
> > > >   
> > > > > > Isn't the idea of mmu_notifier to avoid holding the mm
> > > > > reference and rely on the notifier to tell us when mm is going
> > > > > away?    
> > > > 
> > > > The notifier only holds a mmgrab(), not a mmget() - this allows
> > > > exit_mmap to proceed, but the mm_struct memory remains.
> > > > 
> > > > This is also probably why it is a bad idea to tie the lifetime of
> > > > > something like a pasid to the mmdrop as an evil user could cause a
> > > > large number of mm structs to be released but not freed, probably
> > > > defeating cgroup limits and so forth (not sure)
> > > >   
> > > > > It seems both Intel and AMD iommu drivers don't hold mmgrab
> > > > > after mmu_notifier_register.    
> > > > 
> > > > It is done internally to the implementation.
> > > >   
> > > > > > > So the exit_mmap() -> release() may happen before the FDs are
> > > > > > > destroyed, but the final mmdrop() will happen during some FD
> > > > > > > release.
> > > > > 
> > > > > Do you mean mmdrop() is after FD release?     
> > > > 
> > > > Yes, it will be done by the mmu_notifier_unregister(), which
> > > > should be called during FD release if the iommu lifetime is
> > > > linked to some FD. 
> > > > > > If so, unbind called in FD release should take care of
> > > > > everything, i.e. stops DMA, clear PASID context on IOMMU, flush
> > > > > PRS queue etc.    
> > > > 
> > > > Yes, this is the proper way, when the DMA is stopped and no use
> > > > of the PASID remains then you can drop the mmu notifier and
> > > > release the PASID entirely. If that is linked to the lifetime of
> > > > the FD then forget completely about the mm_struct lifetime, it
> > > > doesn't matter.. 
> > > Got everything above, thanks a lot.
> > > 
> > > > If everything is in order with the FD close, why do we need to
> > > "ask IOMMU drivers to silently abort DMA and Page Requests in the
> > > meantime." in mm_exit notifier? This will be done orderly in unbind
> > > anyway.  
> > 
> > When the process is killed, mm release can happen before fds are
> > released. If you look at do_exit() in kernel/exit.c:
> > 
> > 	exit_mm()
> > 	  mmput()
> > 	   -> mmu release notifier  
> > 	...
> > 	exit_files()
> > 	  close_files()
> > 	    fput()
> > 	exit_task_work()
> > 	  __fput()
> > 	   -> unbind()  
> > 
> So unbind is coming anyway, the difference in handling in mmu release
> notifier is whether we silently drop DMA fault vs. reporting fault?

What I meant is, between the mmu release notifier and unbind(), we can't print
any error from a DMA fault to dmesg, because an mm exit is easily triggered
by userspace. Look at the lifetime of the bond:

bind()
 |
 : Here any DMA fault is handled by mm, and on error we don't print
 : anything to dmesg. Userspace can easily trigger faults by issuing DMA
 : on unmapped buffers.
 |
mm exit -> clear pgd, invalidate IOTLBs
 |
 : Here the PASID descriptor doesn't have the pgd anymore, but we don't
 : print out any error to dmesg either. DMA is likely still running but
 : any fault has to be ignored.
 :
 : We also can't free the PASID yet, since transactions are still coming
 : in with this PASID.
 |
unbind() -> clear context descriptor, release PASID and mmu notifier
 |
 : Here the PASID descriptor is clear. If DMA is still running the device
 : driver really messed up and we have to print out any fault.

For that middle state I had to introduce a new pasid descriptor state in
the SMMU driver, to avoid reporting errors between mm exit and unbind().
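
The three phases above can be modelled as a small state machine. This is only
a user-space sketch to illustrate the reporting policy, not the SMMU driver
code: the names bond_state, struct bond, bond_bind(), bond_mm_exit(),
bond_unbind() and bond_fault_should_log() are all hypothetical and do not
exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bond states, mirroring the lifetime diagram above. */
enum bond_state {
	BOND_BOUND,	/* after bind(): faults are handled by the mm */
	BOND_MM_EXITED,	/* after mm exit: pgd cleared, PASID still held */
	BOND_UNBOUND,	/* after unbind(): context descriptor cleared */
};

struct bond {
	enum bond_state state;
	bool pasid_allocated;
};

/* bind(): allocate the PASID and install the pgd. */
static void bond_bind(struct bond *b)
{
	b->state = BOND_BOUND;
	b->pasid_allocated = true;
}

/* mm exit: the pgd goes away but the PASID must stay allocated,
 * since transactions may still arrive with it. */
static void bond_mm_exit(struct bond *b)
{
	assert(b->state == BOND_BOUND);
	b->state = BOND_MM_EXITED;
}

/* unbind(): valid both before and after mm exit; releases the PASID. */
static void bond_unbind(struct bond *b)
{
	assert(b->state != BOND_UNBOUND);
	b->state = BOND_UNBOUND;
	b->pasid_allocated = false;
}

/* Only faults arriving after unbind() indicate a driver bug and are
 * worth logging; before that, userspace can trigger them at will. */
static bool bond_fault_should_log(const struct bond *b)
{
	return b->state == BOND_UNBOUND;
}
```

In this model, a fault is silent in both BOND_BOUND and BOND_MM_EXITED, and
the PASID is only released by bond_unbind(), whichever order mm exit and
unbind() happen in.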

Thanks,
Jean

> If a process crashes during unbind, something already went seriously
> wrong, DMA fault is expected.
> I think having some error indication is useful, compared to "silently
> drop"
> 
> Thanks,
> 
> Jacob
> 
> > Thanks,
> > Jean
> > 
> > >   
> > > > > Enforcing unbind upon FD close might be a precarious path,
> > > > > perhaps that is why we have to deal with out of order
> > > > > situation?    
> > > > 
> > > > How so? You just put it in the FD release function :)
> > > >   
> > > I was thinking some driver may choose to defer unbind in some
> > > workqueue etc.
> > >   
> > > > > > > In VT-d, because of enqcmd and lazy PASID free we plan to
> > > > > > > hold on to the PASID until mmdrop.
> > > > > > > https://lore.kernel.org/patchwork/patch/1217762/      
> > > > > > 
> > > > > > Why? The bind already gets a mmu_notifier which has refcounts
> > > > > > and the right lifetime for PASID.. This code could already be
> > > > > > simplified by using the mmu_notifier_get()/put() stuff.
> > > > > >     
> > > > > Yes, I guess mmu_notifier_get()/put() is new :)
> > > > > +Fenghua    
> > > > 
> > > > I was going to convert the intel code when I did many other
> > > > drivers, but it was a bit too complex..
> > > > 
> > > > But the approach is straightforward. Get rid of the mm search
> > > > > list and use mmu_notifier_get(). This returns a singleton notifier
> > > > for the mm_struct and handles refcounting/etc
> > > > 
> > > > > Use mmu_notifier_put() during an unbind, it will call back to
> > > > free_notifier() to do the final frees (ie this is where the pasid
> > > > should go away)
> > > > 
> > > > For the SVM_FLAG_PRIVATE_PASID continue to use
> > > > mmu_notifier_register, however this can now be mixed with
> > > > mmu_notifier_put() so the cleanup is the same. A separate ops
> > > > static struct is needed to create a unique key though
> > > > 
> > > > Jason  
> > > 
> > > [Jacob Pan]  
> 
> [Jacob Pan]
