[PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

Sat Jan 21 06:23:42 AEDT 2023

* Suren Baghdasaryan <surenb at google.com> [230120 12:50]:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <willy at infradead.org> wrote:
> >
> > On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <Liam.Howlett at oracle.com> wrote:
> > > >
> > > > * Matthew Wilcox <willy at infradead.org> [230120 11:50]:
> > > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <surenb at google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <mhocko at suse.com> wrote:
> > > > > > > >
> > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <mhocko at suse.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > > > >
> > > > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > > > that this is not really optimal.
> > > > > > > > > >
> > > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > > > in housands of callbacks to be scheduled without a good reason.
> > > > > > > > > >
> > > > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > > > >
> > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > > > not sound too appealing either.
> > > > > > > >
> > > > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > > >
> > > > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > > > shrinker.
> > > > > >
> > > > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > > > vm_area_free to batch the destruction and all other cases call
> > > > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > > > floods.
> > > > >
> > > > > ... or have two different call_rcu functions; one for munmap() and
> > > > > one for exit.  It'd be nice to use kmem_cache_free_bulk().
> > > >
> > > > Do we even need a call_rcu on exit?  At the point of freeing the VMAs we
> > > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > > > Once we have obtained the write lock again, I think it's safe to say we
> > > > can just go ahead and free the VMAs directly.
> > >
> > > I think that would be still racy if the page fault handler found that
> > > VMA under read-RCU protection but did not lock it yet (no locks are
> > > held yet). If it's preempted, the VMA can be freed and destroyed from
> > > under it without RCU grace period.
> >
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs.  Right?
> 
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?
> 
> lock_vma_under_rcu() receives mm as a parameter, so I guess it's
> implied that the caller should either mmget() it or operate on
> current->mm, so no need to document this requirement?

It is also implied by the vma->vm_mm link.  Otherwise any RCU holder of
the VMA could have an unsafe pointer.  In fact, if this isn't true then
we need to change the callers to take the ref count to avoid just this
scenario.