[PATCH] powerpc/pseries: Disable CPU hotplug across migrations

Nathan Fontenot nfont at linux.vnet.ibm.com
Fri Sep 21 01:03:40 AEST 2018


On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
> Hi Nathan,
> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
> <nfont at linux.vnet.ibm.com> wrote:
>>
>> When performing partition migrations all present CPUs must be online
>> as all present CPUs must make the H_JOIN call as part of the migration
>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
>> to make the rtas call to perform the migration to the destination system.
>>
>> During testing of migration and changing the SMT state we have found
>> instances where CPUs are offlined, as part of the SMT state change,
>> before they make the H_JOIN call. This results in a hung system where
>> every CPU is either in H_JOIN or offline.
>>
>> To prevent this this patch disables CPU hotplug during the migration
>> process.
>>
>> Signed-off-by: Nathan Fontenot <nfont at linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/rtas.c |    2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>> index 8afd146bc9c7..2c7ed31c736e 100644
>> --- a/arch/powerpc/kernel/rtas.c
>> +++ b/arch/powerpc/kernel/rtas.c
>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>                 goto out;
>>         }
>>
>> +       cpu_hotplug_disable();
> 
> So, some of the onlined CPUs ( via
> rtas_online_cpus_mask(offline_mask);) can go still offline,
> if the userspace issues an offline command, just before we execute
> cpu_hotplug_disable().
> 
> So we are narrowing down the race, but it still exists. Am I missing something ?

You're correct, this narrows the window in which a CPU can go offline.

In testing with this patch we have not been able to re-create the failure but
there is still a small window.

-Nathan

> 
>>         stop_topology_update();
>>
>>         /* Call function on all CPUs.  One of us will make the
>> @@ -995,6 +996,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>                 printk(KERN_ERR "Error doing global join\n");
>>
>>         start_topology_update();
>> +       cpu_hotplug_enable();
>>
>>         /* Take down CPUs not online prior to suspend */
>>         cpuret = rtas_offline_cpus_mask(offline_mask);
>>
> 
> 



More information about the Linuxppc-dev mailing list