[PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

Michael Ellerman mpe at ellerman.id.au
Wed Aug 5 13:07:08 AEST 2020


Greg Kurz <groug at kaod.org> writes:
> On Tue, 04 Aug 2020 23:35:10 +1000
> Michael Ellerman <mpe at ellerman.id.au> wrote:
>> There is a bit of history to this code, but not in a good way :)
>> 
>> Michael Roth <mdroth at linux.vnet.ibm.com> writes:
>> > For a power9 KVM guest with XIVE enabled, running a test loop
>> > where we hotplug 384 vcpus and then unplug them, the following traces
>> > can be seen (generally within a few loops) either from the unplugged
>> > vcpu:
>> >
>> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>> >   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
>> ...
>> >
>> > At that point the worker thread assumes the unplugged CPU is in some
>> > unknown/dead state and proceeds with the cleanup, causing the race with
>> > the XIVE cleanup code executed by the unplugged CPU.
>> >
>> > Fix this by inserting an msleep() after each RTAS call to avoid
>> 
>> We previously had an msleep(), but it was removed:
>> 
>>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>
> Ah, I hadn't seen that one...
>
>> > pseries_cpu_die() returning prematurely, and double the number of
>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>> > ideal solution, it is similar to how we handled a comparable issue for
>> > cede_offline mode in the past (940ce422a3).
>> 
>> Thiago tried to fix this previously but there was a bit of discussion
>> that didn't quite resolve:
>> 
>>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
>
> Yeah it appears that the motivation at the time was to make the "Querying DEAD?"
> messages disappear and to avoid potentially concurrent calls to rtas-stop-self,
> which is prohibited by PAPR... not fixing actual crashes.

I'm pretty sure at one point we were triggering crashes *in* RTAS via
this path; I think that got resolved.

>> Spinning forever seems like a bad idea, but as has been demonstrated at
>> least twice now, continuing when we don't know the state of the other
>> CPU can lead to straight up crashes.
>> 
>> So I think I'm persuaded that it's preferable to have the kernel stuck
>> spinning rather than oopsing.
>> 
>
> +1
>
>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> first instinct is no; if we're stuck here for 20s a stack trace would be
>> good. But then we will probably hit that on some big and/or heavily
>> loaded machine.
>> 
>> So possibly we should call cond_resched() but have some custom logic in
>> the loop to print a warning if we are stuck for more than some
>> sufficiently long amount of time.
>
> How long should that be?

Yeah good question.

I guess step one would be seeing how long it can take on the 384 vcpu
machine. And we can probably test on some other big machines.

Hopefully Nathan can give us some idea of how long he's seen it take on
large systems? I know he was concerned about the 20s timeout of the
softlockup detector.

Maybe a minute or two?

>> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
>> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
>> 
>> This is not public.
>
> I'll have a look at changing that.

Thanks.

cheers
