Low-res tick handler device not going to ONESHOT_STOPPED when tick is stopped (was: rcu_sched self-detected stall on CPU)

Fri Apr 15 03:15:06 AEST 2022

On Wed, Apr 13, 2022 at 04:10:02PM +1000, Nicholas Piggin wrote:
> Oops, fixed subject...
> 
> Excerpts from Nicholas Piggin's message of April 13, 2022 3:11 pm:
> > +Daniel, Thomas, Viresh
> > 
> > Subject: Re: rcu_sched self-detected stall on CPU
> > 
> > Excerpts from Michael Ellerman's message of April 9, 2022 12:42 am:
> >> Michael Ellerman <mpe at ellerman.id.au> writes:
> >>> "Paul E. McKenney" <paulmck at kernel.org> writes:
> >>>> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> >>>>> Hi
> >>>>> 
> >>>>> I can reproduce it in a ppc virtual cloud server provided by Oregon
> >>>>> State University.  Following is what I do:
> >>>>> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> >>>>> -o linux-5.18-rc1.tar.gz
> >>>>> 2) tar zxf linux-5.18-rc1.tar.gz
> >>>>> 3) cp config linux-5.18-rc1/.config
> >>>>> 4) cd linux-5.18-rc1
> >>>>> 5) make vmlinux -j 8
> >>>>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> >>>>> -smp 2 (QEMU 4.2.1)
> >>>>> 7) after 12 rounds, the bug got reproduced:
> >>>>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> >>>>
> >>>> Just to make sure, are you both seeing the same thing?  Last I knew,
> >>>> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> >>>> built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
> >>>> I miss something?
> >>>>
> >>>> Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> >>>> kthread slept for three milliseconds, but did not wake up for more than
> >>>> 20 seconds.  This kthread would normally have awakened on CPU 1, but
> >>>> CPU 1 looks to me to be very unhealthy, as can be seen in your console
> >>>> output below (but maybe my idea of what is healthy for powerpc systems
> >>>> is outdated).  Please see also the inline annotations.
> >>>>
> >>>> Thoughts from the PPC guys?
> >>>
> >>> I haven't seen it in my testing. But using Miguel's config I can
> >>> reproduce it seemingly on every boot.
> >>>
> >>> For me it bisects to:
> >>>
> >>>   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >>>
> >>> Which seems plausible.
> >>>
> >>> Reverting that on mainline makes the bug go away.
> >>>
> >>> I don't see an obvious bug in the diff, but I could be wrong, or the old
> >>> code was papering over an existing bug?
> >>>
> >>> I'll try and work out what it is about Miguel's config that exposes
> >>> this vs our defconfig, that might give us a clue.
> >> 
> >> It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.
> >> 
> >> I can reproduce just with:
> >> 
> >>   $ make ppc64le_guest_defconfig
> >>   $ ./scripts/config -d HIGH_RES_TIMERS
> >> 
> >> We have no defconfigs that disable HIGH_RES_TIMERS, I didn't even
> >> realise you could disable it TBH :)
> >> 
> >> The Rust CI has it disabled because I copied that from the x86 defconfig
> >> they were using back when I added the Rust support. I think that was
> >> meant to be a stripped down fast config for CI, but the result is it's
> >> just using a badly tested combination which is not helpful.
> >> 
> >> So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
> >> can debug this further without blocking them.
> > 
> > So we traced the problem down to possibly a misunderstanding between 
> > decrementer clock event device and core code.
> > 
> > The decrementer is only oneshot*ish*. It actually needs to either be 
> > reprogrammed or shut down otherwise it just continues to cause 
> > interrupts.
> > 
> > Before commit 35de589cb879, it was sort of two-shot. The initial 
> > interrupt at the programmed time would set its internal next_tb variable 
> > to ~0 and call the ->event_handler(). If that did not set_next_event or 
> > stop the timer, the interrupt will fire again immediately, notice 
> > next_tb is ~0, and only then stop the decrementer interrupt.
> > 
> > So that was already kind of ugly, this patch just turned it into a hang.
> > 
> > The problem happens when the tick is stopped with an event still 
> > pending, then tick_nohz_handler() is called, but it bails out because 
> > tick_stopped == 1 so the device never gets programmed again, and so it 
> > keeps firing.
> > 
> > How to fix it? Before commit a7cba02deced, powerpc's decrementer was 
> > really oneshot, but we would like to avoid doing that because it requires 
> > additional programming of the hardware on each timer interrupt. We have 
> > the ONESHOT_STOPPED state which seems to be just about what we want.
> > 
> > Did the ONESHOT_STOPPED patch just miss this case, or is there a reason 
> > we don't stop it here? This patch seems to fix the hang (not heavily
> > tested though).

This looks plausible to me based on my interactions with ticks, but it
would be good to have someone who understands that code better than I
do to look it over.

							Thanx, Paul

> > Thanks,
> > Nick
> > 
> > ---
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 2d76c91b85de..7e13a55b6b71 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1364,9 +1364,11 @@ static void tick_nohz_handler(struct clock_event_device *dev)
> >  	tick_sched_do_timer(ts, now);
> >  	tick_sched_handle(ts, regs);
> >  
> > -	/* No need to reprogram if we are running tickless  */
> > -	if (unlikely(ts->tick_stopped))
> > +	if (unlikely(ts->tick_stopped)) {
> > +		/* If we are tickless, change the clock event to stopped */
> > +		tick_program_event(KTIME_MAX, 1);
> >  		return;
> > +	}
> >  
> >  	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> >  	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
> >