[PATCH] powerpc/pseries: Disable CPU hotplug across migrations

Nathan Fontenot nfont at linux.vnet.ibm.com
Tue Sep 25 00:30:18 AEST 2018


On 09/24/2018 03:56 AM, Gautham R Shenoy wrote:
> Hi Michael,
> 
> On Mon, Sep 24, 2018 at 05:00:42PM +1000, Michael Ellerman wrote:
>> Nathan Fontenot <nfont at linux.vnet.ibm.com> writes:
>>> On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
>>>> Hi Nathan,
>>>> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
>>>> <nfont at linux.vnet.ibm.com> wrote:
>>>>>
>>>>> When performing partition migrations all present CPUs must be online
>>>>> as all present CPUs must make the H_JOIN call as part of the migration
>>>>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
>>>>> to make the rtas call to perform the migration to the destination system.
>>>>>
>>>>> During testing of migration and changing the SMT state we have found
>>>>> instances where CPUs are offlined, as part of the SMT state change,
>>>>> before they make the H_JOIN call. This results in a hung system where
>>>>> every CPU is either in H_JOIN or offline.
>>>>>
>>>>> To prevent this this patch disables CPU hotplug during the migration
>>>>> process.
>>>>>
>>>>> Signed-off-by: Nathan Fontenot <nfont at linux.vnet.ibm.com>
>>>>> ---
>>>>>  arch/powerpc/kernel/rtas.c |    2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>>>> index 8afd146bc9c7..2c7ed31c736e 100644
>>>>> --- a/arch/powerpc/kernel/rtas.c
>>>>> +++ b/arch/powerpc/kernel/rtas.c
>>>>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>>>                 goto out;
>>>>>         }
>>>>>
>>>>> +       cpu_hotplug_disable();
>>>>
>>>> So, some of the onlined CPUs ( via
>>>> rtas_online_cpus_mask(offline_mask);) can go still offline,
>>>> if the userspace issues an offline command, just before we execute
>>>> cpu_hotplug_disable().
>>>>
>>>> So we are narrowing down the race, but it still exists. Am I missing something ?
>>>
>>> You're correct, this narrows the window in which a CPU can go offline.
>>>
>>> In testing with this patch we have not been able to re-create the failure but
>>> there is still a small window.
>>
>> Well let's close it.
>>
>> We just need to check that all present CPUs are online after we've
>> called cpu_hotplug_disable() don't we?
> 
> Yes. However, we cannot use the cpu_up() API to bring the offline CPUs
> online, since will return with an -EBUSY if CPU-Hotplug has been
> disabled. _cpu_up() works, but it is (understandably) a static
> function in kernel/cpu.c
> 
> So, we might need a new APIs along the lines of
> disable_nonboot_cpus()/enable_nonboot_cpus() 
> that is currently being used by the suspend subsystem, only that we
> would need the APIs to
>       - Disable hotplug and online all the CPUs in an atomic
>       fashion. Would be good if the API returns the cpumask of CPUs
>       which were offline, which were brought online by this API.
> 
>       - Restore the state of the machine by offlining the CPUs which
>       we brought online, and enable hotplug again. 
>       

There is already code in the LPM path that saves a cpu mask of the offline
cpus prior to bringing them all online so we can offline them again after
the migration.

The missing piece to fully close the window is an API that will allow us to
online cpus while cpu hotplug is disabled.

Since we have not been able to re-create the failure with this patch would
it be ok to pull in this patch while other options are explored?

-Nathan

>>
>> cheers
>>



More information about the Linuxppc-dev mailing list