[PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

Michael Roth mdroth at linux.vnet.ibm.com
Wed Aug 5 14:37:32 AEST 2020


Quoting Michael Ellerman (2020-08-04 22:07:08)
> Greg Kurz <groug at kaod.org> writes:
> > On Tue, 04 Aug 2020 23:35:10 +1000
> > Michael Ellerman <mpe at ellerman.id.au> wrote:
> >> There is a bit of history to this code, but not in a good way :)
> >> 
> >> Michael Roth <mdroth at linux.vnet.ibm.com> writes:
> >> > For a power9 KVM guest with XIVE enabled, running a test loop
> >> > where we hotplug 384 vcpus and then unplug them, the following traces
> >> > can be seen (generally within a few loops) either from the unplugged
> >> > vcpu:
> >> >
> >> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >> >   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
> >> ...
> >> >
> >> > At that point the worker thread assumes the unplugged CPU is in some
> >> > unknown/dead state and proceeds with the cleanup, causing the race with
> >> > the XIVE cleanup code executed by the unplugged CPU.
> >> >
> >> > Fix this by inserting an msleep() after each RTAS call to avoid
> >> 
> >> We previously had an msleep(), but it was removed:
> >> 
> >>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> >
> > Ah, I hadn't seen that one...
> >
> >> > pseries_cpu_die() returning prematurely, and double the number of
> >> > attempts so we wait at least a total of 5 seconds. While this isn't an
> >> > ideal solution, it mirrors how we dealt with a comparable issue for
> >> > cede_offline mode in the past (940ce422a3).
> >> 
> >> Thiago tried to fix this previously but there was a bit of discussion
> >> that didn't quite resolve:
> >> 
> >>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
> >
> > Yeah, it appears that the motivation at the time was to make the "Querying DEAD?"
> > messages disappear and to avoid potentially concurrent calls to rtas-stop-self,
> > which is prohibited by PAPR... not to fix actual crashes.
> 
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.
> 
> >> Spinning forever seems like a bad idea, but as has been demonstrated at
> >> least twice now, continuing when we don't know the state of the other
> >> CPU can lead to straight up crashes.
> >> 
> >> So I think I'm persuaded that it's preferable to have the kernel stuck
> >> spinning rather than oopsing.
> >> 
> >
> > +1
> >
> >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >> first instinct is no, if we're stuck here for 20s a stack trace would be
> >> good. But then we will probably hit that on some big and/or heavily
> >> loaded machine.
> >> 
> >> So possibly we should call cond_resched() but have some custom logic in
> >> the loop to print a warning if we are stuck for more than some
> >> sufficiently long amount of time.
> >
> > How long should that be ?
> 
> Yeah good question.
> 
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
> 
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.
> 
> Maybe a minute or two?

Hmm, so I took a stab at this: I called cond_resched() after every
5 seconds of polling and printed a warning at the same time. (FWIW,
that doesn't seem to trigger any warnings on a loaded 96-core mihawk
system using KVM running the 384-vcpu unplug loop.)

But it sounds like that's not quite what you had in mind. How frequently
do you think we should call cond_resched()? Maybe after every 25
iterations of polling smp_query_cpu_stopped(), to keep the behavior
somewhat similar to the original?

I'll let the current patch run on the mihawk system overnight in the
meantime so we at least have that data point, but it would be good to
know what things look like on a large pHyp machine.

Thanks!

> 
> >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> >> 
> >> This is not public.
> >
> > I'll have a look at changing that.
> 
> Thanks.
> 
> cheers
