rcu_sched self-detected stall on CPU

Michael Ellerman mpe at ellerman.id.au
Sun Apr 10 21:33:43 AEST 2022


Zhouyi Zhou <zhouzhouyi at gmail.com> writes:
> On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney <paulmck at kernel.org> wrote:
>> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
>> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman <mpe at ellerman.id.au> wrote:
...
>> > > I haven't seen it in my testing. But using Miguel's config I can
>> > > reproduce it seemingly on every boot.
>> > >
>> > > For me it bisects to:
>> > >
>> > >   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
>> > >
>> > > Which seems plausible.
>> > I also bisected to 35de589cb879 ("powerpc/time: improve decrementer
>> > clockevent processing")
...
>>
>> > > Reverting that on mainline makes the bug go away.

>> > I also reverted that on mainline, and am currently running a stress
>> > test (by repeatedly invoking qemu and checking the console.log) on a
>> > PPC VM at Oregon State University.

> After 306 rounds of the stress test on mainline without triggering the
> bug (lasting 4 hours and 27 minutes), I think the bug is indeed caused
> by 35de589cb879 ("powerpc/time: improve decrementer clockevent
> processing"), so I have stopped the test for now.

Thanks for testing, that's pretty conclusive.

I'm not inclined to actually revert it yet.

We need to understand if there's actually a bug in the patch, or if it's
just exposing some existing bug/bad behavior we have. The fact that it
only appears with CONFIG_HIGH_RES_TIMERS=n is suspicious.

Do we have some code that inadvertently relies on something enabled by
HIGH_RES_TIMERS=y, or do we have a bug that is hidden by HIGH_RES_TIMERS=y?

cheers

