[PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

Michal Hocko mhocko at suse.com
Fri Jan 20 19:57:05 AEDT 2023


On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <paulmck at kernel.org> wrote:
> > [...]
> > > > There are a couple of possibilities here.
> > > >
> > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > and invocation of the corresponding callback was taking multiple seconds,
> > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > order to save power by batching RCU work over multiple call_rcu()
> > > > invocations.  If this is causing a problem for a given call site, the
> > > > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > > > to the old-school non-laziness, but can of course consume more power.
> > > 
> > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > at the time I was profiling this issue.
> > > Lazy RCU would be a great option to replace this patch but
> > > unfortunately it's not the default behavior, so I would still have to
> > > implement this batching in case lazy RCU is not enabled.
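
Just so we are talking about the same thing, here is a minimal sketch of
what opting such a call site out of the lazy batching with
call_rcu_hurry() would look like. The vma field and cache names are
only illustrative, not taken from the actual series:

#include <linux/mm_types.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

static struct kmem_cache *vm_area_cachep;	/* slab for vm_area_struct */

/* Final freeing runs from the RCU callback once a grace period has elapsed. */
static void vm_area_free_rcu_cb(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head,
					struct vm_area_struct, vm_rcu);

	kmem_cache_free(vm_area_cachep, vma);
}

void vm_area_free(struct vm_area_struct *vma)
{
	/*
	 * Under CONFIG_RCU_LAZY this invocation may be batched for a long
	 * time before the callback runs ...
	 */
	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);

	/*
	 * ... whereas call_rcu_hurry() requests the old non-lazy behavior
	 * for this particular callback, at some cost in power:
	 *
	 *	call_rcu_hurry(&vma->vm_rcu, vm_area_free_rcu_cb);
	 */
}
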
> > > 
> > > >
> > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > and the invocation of the corresponding callback in kernels built with
> > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > the added delay.  The reason for this delay is the use of a separate
> > > > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > > > lock contention on the main ->cblist.  This is not needed in old-school
> > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > > (including most datacenter kernels) because in that case the callbacks
> > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > that there is no need for locks.
> > > 
> > > I believe this is the reason in my profiled case.
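
To put the per-jiffy limit in the context of this path: a process
exiting with a few hundred vmas queues that many callbacks from a single
CPU almost at once, which is far above the 16-per-jiffy threshold on a
HZ=1000 kernel (16K callbacks/s / 1000 jiffies/s = 16 per jiffy), so an
offloaded CPU ends up taking the ->nocb_bypass path. A rough sketch of
the call pattern, reusing the illustrative callback from above:

/*
 * Illustrative only: each vma torn down at exit enqueues one more RCU
 * callback on the exiting CPU.
 */
static void free_vmas_on_exit(struct vm_area_struct **vmas, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		call_rcu(&vmas[i]->vm_rcu, vm_area_free_rcu_cb);
}
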
> > > 
> > > >
> > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> > > 
> > > I don't think I've seen such a case.
> > > Thanks for clarifications, Paul!
> > 
> > Thanks for the explanation, Paul. I have to say this has caught me by
> > surprise. There are just not enough details about the benchmark to
> > understand what is going on, but I find it rather surprising that
> > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > which is the callback. My naive understanding has been that call_rcu is
> > a really fast way to defer the execution to an RCU-safe context to do
> > the final cleanup.
> 
> If I am following along correctly (ha!), then your "induce a higher
> overhead" should be something like "induce a higher to-kfree() latency".

Yes, this is expected.

> Of course, there already is a higher latency-to-kfree via call_rcu()
> than via a direct call to kfree(), and callback-offload CPUs that are
> being flooded with callbacks raise that latency a jiffy or so more in
> order to avoid lock contention.
> 
> If this becomes a problem, the callback-offloading code can be a bit
> smarter about avoiding lock contention, but I need to see a real problem
> before I make that change.  But if there is a real problem I will of
> course fix it.

I believe Suren's claim is that the call_rcu itself is really visible in
the exit_mmap case. The time-to-free of the actual vmas shouldn't really
be material for that path. If the freeing happens much later there could
be some side effects from increased memory consumption, but that should
be marginal. How fast exit_mmap really is should only depend on the
direct calls from that path.
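
For the record, the kind of throttling the patch title refers to would
look roughly like the sketch below: collect freed vmas locally and hand
a whole batch to a single call_rcu(). This is only a hypothetical
illustration; the vm_free_list member, the batch structure and the
threshold are all made up and not taken from the actual patch:

#include <linux/list.h>

#define VM_AREA_FREE_BATCH	32

/* Hypothetical container handed to RCU once per batch. */
struct vma_free_batch {
	struct rcu_head rcu;
	struct list_head vmas;
	int count;
};

static void vm_area_free_batch_cb(struct rcu_head *head)
{
	struct vma_free_batch *b = container_of(head,
					struct vma_free_batch, rcu);
	struct vm_area_struct *vma, *next;

	/* One grace period, one callback, the whole batch freed here. */
	list_for_each_entry_safe(vma, next, &b->vmas, vm_free_list)
		kmem_cache_free(vm_area_cachep, vma);
	kfree(b);
}

/*
 * Queue @vma on @b; the caller allocates a fresh batch once this one
 * has been handed off to call_rcu().
 */
static void vm_area_free_deferred(struct vma_free_batch *b,
				  struct vm_area_struct *vma)
{
	list_add(&vma->vm_free_list, &b->vmas);
	if (++b->count >= VM_AREA_FREE_BATCH)
		call_rcu(&b->rcu, vm_area_free_batch_cb);
}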

But I guess we need some specific numbers from Suren to be sure what is
going on here.

Thanks!
-- 
Michal Hocko
SUSE Labs

