rcu_sched self-detected stall on CPU

Paul E. McKenney paulmck at kernel.org
Tue Apr 12 23:36:55 AEST 2022


On Tue, Apr 12, 2022 at 04:53:06PM +1000, Michael Ellerman wrote:
> "Paul E. McKenney" <paulmck at kernel.org> writes:
> > On Sun, Apr 10, 2022 at 09:33:43PM +1000, Michael Ellerman wrote:
> >> Zhouyi Zhou <zhouzhouyi at gmail.com> writes:
> >> > On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney <paulmck at kernel.org> wrote:
> >> >> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
> >> >> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman <mpe at ellerman.id.au> wrote:
> >> ...
> >> >> > > I haven't seen it in my testing. But using Miguel's config I can
> >> >> > > reproduce it seemingly on every boot.
> >> >> > >
> >> >> > > For me it bisects to:
> >> >> > >
> >> >> > >   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >> >> > >
> >> >> > > Which seems plausible.
> >> >> > I also bisect to 35de589cb879 ("powerpc/time: improve decrementer
> >> >> > clockevent processing")
> >> ...
> >> >>
> >> >> > > Reverting that on mainline makes the bug go away.
> >> 
> >> >> > I also revert that on the mainline, and am currently doing a pressure
> >> >> > test (by repeatedly invoking qemu and checking the console.log) on PPC
> >> >> > VM in Oregon State University.
> >> 
> >> > After 306 rounds of stress test on mainline without triggering the bug
> >> > (last for 4 hours and 27 minutes), I think the bug is indeed caused by
> >> > 35de589cb879 ("powerpc/time: improve decrementer clockevent
> >> > processing") and stop the test for now.
> >> 
> >> Thanks for testing, that's pretty conclusive.
> >> 
> >> I'm not inclined to actually revert it yet.
> >> 
> >> We need to understand if there's actually a bug in the patch, or if it's
> >> just exposing some existing bug/bad behavior we have. The fact that it
> >> only appears with CONFIG_HIGH_RES_TIMERS=n is suspicious.
> >> 
> >> Do we have some code that inadvertently relies on something enabled by
> >> HIGH_RES_TIMERS=y, or do we have a bug that is hidden by HIGH_RES_TIMERS=y ?
> >
> > For whatever it is worth, moderate rcutorture runs to completion without
> > errors with CONFIG_HIGH_RES_TIMERS=n on 64-bit x86.
> 
> Thanks for testing that, I don't have any big x86 machines to test on :)
> 
> > Also for whatever it is worth, I don't know of anything other than
> > microcontrollers or the larger IoT devices that would want their kernels
> > built with CONFIG_HIGH_RES_TIMERS=n.  Which might be a failure of
> > imagination on my part, but so it goes.
> 
> Yeah I agree, like I said before I wasn't even aware you could turn it
> off. So I think we'll definitely add a select HIGH_RES_TIMERS in future,
> but first I need to work out why we are seeing stalls with it disabled.

Good point, and fair enough!

							Thanx, Paul


More information about the Linuxppc-dev mailing list