RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

Jonathan Cameron Jonathan.Cameron at huawei.com
Mon Jul 31 21:09:08 AEST 2017


On Wed, 26 Jul 2017 16:15:05 -0700
"Paul E. McKenney" <paulmck at linux.vnet.ibm.com> wrote:

> On Wed, Jul 26, 2017 at 03:45:40PM -0700, David Miller wrote:
> > From: "Paul E. McKenney" <paulmck at linux.vnet.ibm.com>
> > Date: Wed, 26 Jul 2017 15:36:58 -0700
> >   
> > > And without CONFIG_SOFTLOCKUP_DETECTOR, I see five runs of 24 with RCU
> > > CPU stall warnings.  So it seems likely that CONFIG_SOFTLOCKUP_DETECTOR
> > > really is having an effect.  
> > 
> > Thanks for all of the info Paul, I'll digest this and scan over the
> > code myself.
> > 
> > Just out of curiosity, what x86 idle method is your machine using?
> > The mwait one or the one which simply uses 'halt'?  The mwait variant
> > might mask this bug, and halt would be a lot closer to how sparc64 and
> > Jonathan's system operates.  
> 
> My kernel builds with CONFIG_INTEL_IDLE=n, which I believe means that
> I am not using the mwait one.  Here is a grep for IDLE in my .config:
> 
> 	CONFIG_NO_HZ_IDLE=y
> 	CONFIG_GENERIC_SMP_IDLE_THREAD=y
> 	# CONFIG_IDLE_PAGE_TRACKING is not set
> 	CONFIG_ACPI_PROCESSOR_IDLE=y
> 	CONFIG_CPU_IDLE=y
> 	# CONFIG_CPU_IDLE_GOV_LADDER is not set
> 	CONFIG_CPU_IDLE_GOV_MENU=y
> 	# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
> 	# CONFIG_INTEL_IDLE is not set
> 
> > On sparc64 the cpu yield we do in the idle loop sleeps the cpu.  Its
> > local TICK register keeps advancing, and the local timer therefore
> > will still trigger.  Also, any externally generated interrupts
> > (including cross calls) will wake up the cpu as well.
> > 
> > The tick-sched code is really tricky wrt. NO_HZ even in the NO_HZ_IDLE
> > case.  One of my running theories is that we miss scheduling a tick
> > due to a race.  That would be consistent with the behavior we see
> > in the RCU dumps, I think.  
> 
> But wouldn't you have to miss a -lot- of ticks to get an RCU CPU stall
> warning?  By default, your grace period needs to extend for more than
> 21 seconds (more than one-third of a -minute-) to get one.  Or do
> you mean that the ticks get shut off now and forever, as opposed to
> just losing one of them?
> 
> > Anyways, just a theory, and that's why I keep mentioning that commit
> > about the revert of the revert (specifically
> > 411fe24e6b7c283c3a1911450cdba6dd3aaea56e).
> > 
> > :-)  
> 
> I am running an overnight test in preparation for attempting to push
> some fixes for regressions into 4.12, but will try reverting this
> and enabling CONFIG_HZ_PERIODIC tomorrow.
> 
> Jonathan, might the commit that Dave points out above be what reduces
> the probability of occurrence as you test older releases?

I just got around to trying this out of curiosity.  Superficially, it did
appear to make the issue harder to hit (it took over 30 minutes to trigger),
but the issue otherwise looks much the same with or without that patch.

The next thing on my list is to disable hrtimers entirely and see what
happens.

Jonathan
> 
> 							Thanx, Paul
> 


