rcu_sched self-detected stall on CPU

Zhouyi Zhou zhouzhouyi at gmail.com
Sat Apr 9 00:25:20 AEST 2022


On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney <paulmck at kernel.org> wrote:
>
> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman <mpe at ellerman.id.au> wrote:
> > >
> > > "Paul E. McKenney" <paulmck at kernel.org> writes:
> > > > On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> > > >> Hi
> > > >>
> > > >> I can reproduce it in a ppc virtual cloud server provided by Oregon
> > > >> State University.  Following is what I do:
> > > >> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> > > >> -o linux-5.18-rc1.tar.gz
> > > >> 2) tar zxf linux-5.18-rc1.tar.gz
> > > >> 3) cp config linux-5.18-rc1/.config
> > > >> 4) cd linux-5.18-rc1
> > > >> 5) make vmlinux -j 8
> > > >> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> > > >> -smp 2 (QEMU 4.2.1)
> > > >> 7) after 12 rounds, the bug got reproduced:
> > > >> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> > > >
> > > > Just to make sure, are you both seeing the same thing?  Last I knew,
> > > > Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> > > > built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
> > > > I miss something?
> > > >
> > > > Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> > > > kthread slept for three milliseconds, but did not wake up for more than
> > > > 20 seconds.  This kthread would normally have awakened on CPU 1, but
> > > > CPU 1 looks to me to be very unhealthy, as can be seen in your console
> > > > output below (but maybe my idea of what is healthy for powerpc systems
> > > > is outdated).  Please see also the inline annotations.
> > > >
> > > > Thoughts from the PPC guys?
> > >
> > > I haven't seen it in my testing. But using Miguel's config I can
> > > reproduce it seemingly on every boot.
> > >
> > > For me it bisects to:
> > >
> > >   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> > >
> > > Which seems plausible.
> > I also bisected to 35de589cb879 ("powerpc/time: improve decrementer
> > clockevent processing")
>
> Very good!  Thank you all!!!
You are very welcome ;-) and thank you all!!!!
>
>                                                         Thanx, Paul
>
> > > Reverting that on mainline makes the bug go away.
> > I also reverted that on mainline, and am currently running a stress
> > test (by repeatedly invoking qemu and checking the console.log) on the
> > PPC VM at Oregon State University.
After 306 rounds of stress testing on mainline with that commit
reverted, none of which triggered the bug (the run lasted 4 hours and
27 minutes), I think the bug is indeed caused by
35de589cb879 ("powerpc/time: improve decrementer clockevent
processing"), so I am stopping the test for now.
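
For anyone who wants to repeat this kind of test, the loop can be
sketched roughly as below. This is only a minimal sketch, not the exact
script used for the run above. The QEMU command line is the one from
step 6 of the recipe quoted earlier; the stall marker (taken from the
subject line), the per-boot timeout, and the function names are
assumptions made for illustration.

```python
import subprocess
import sys

# Assumption: a failing boot prints the warning named in the subject line.
STALL_MARKER = "rcu_sched self-detected stall on CPU"

# Command line from step 6 of the reproduction recipe.
QEMU_CMD = [
    "qemu-system-ppc64", "-kernel", "vmlinux",
    "-nographic", "-vga", "none", "-no-reboot", "-smp", "2",
]


def log_has_stall(console_text):
    """Return True if the console output contains the stall warning."""
    return STALL_MARKER in console_text


def run_round(timeout_s=60):
    """Boot the kernel once and return whatever the console printed.

    timeout_s is a hypothetical per-boot limit; QEMU is killed when it expires.
    """
    try:
        done = subprocess.run(
            QEMU_CMD,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
        raw = done.stdout
    except subprocess.TimeoutExpired as exc:
        # Keep the partial console output captured before the timeout.
        raw = exc.output or b""
    return raw.decode("utf-8", errors="replace")


def stress(rounds, boot=run_round):
    """Boot up to `rounds` times; return the 1-based round that stalled, or None."""
    for n in range(1, rounds + 1):
        if log_has_stall(boot()):
            return n
    return None


if __name__ == "__main__":
    hit = stress(int(sys.argv[1]) if len(sys.argv) > 1 else 306)
    print("stall in round %d" % hit if hit else "no stall seen")
```

Run it from the directory that holds vmlinux, passing the number of
rounds as the only argument. The `boot` parameter exists only so the
loop can be exercised with canned console text instead of a real QEMU
boot.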

Thanks ;-)
Zhouyi
> > >
> > > I don't see an obvious bug in the diff, but I could be wrong, or the old
> > > code was papering over an existing bug?
> > >
> > > I'll try and work out what it is about Miguel's config that exposes
> > > this vs our defconfig, that might give us a clue.
> > Great job!
> > >
> > > cheers
> > Thanks
> > Zhouyi

