[PATCH] powerpc/pseries: Disable CPU hotplug across migrations

Gautham R Shenoy ego at linux.vnet.ibm.com
Mon Sep 24 18:56:06 AEST 2018


Hi Michael,

On Mon, Sep 24, 2018 at 05:00:42PM +1000, Michael Ellerman wrote:
> Nathan Fontenot <nfont at linux.vnet.ibm.com> writes:
> > On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
> >> Hi Nathan,
> >> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
> >> <nfont at linux.vnet.ibm.com> wrote:
> >>>
> >>> When performing partition migrations all present CPUs must be online
> >>> as all present CPUs must make the H_JOIN call as part of the migration
> >>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
> >>> to make the rtas call to perform the migration to the destination system.
> >>>
> >>> During testing of migration and changing the SMT state we have found
> >>> instances where CPUs are offlined, as part of the SMT state change,
> >>> before they make the H_JOIN call. This results in a hung system where
> >>> every CPU is either in H_JOIN or offline.
> >>>
> >>> To prevent this, this patch disables CPU hotplug during the migration
> >>> process.
> >>>
> >>> Signed-off-by: Nathan Fontenot <nfont at linux.vnet.ibm.com>
> >>> ---
> >>>  arch/powerpc/kernel/rtas.c |    2 ++
> >>>  1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> >>> index 8afd146bc9c7..2c7ed31c736e 100644
> >>> --- a/arch/powerpc/kernel/rtas.c
> >>> +++ b/arch/powerpc/kernel/rtas.c
> >>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
> >>>                 goto out;
> >>>         }
> >>>
> >>> +       cpu_hotplug_disable();
> >> 
> >> So, some of the onlined CPUs (via
> >> rtas_online_cpus_mask(offline_mask);) can still go offline
> >> if userspace issues an offline command just before we execute
> >> cpu_hotplug_disable().
> >> 
> >> So we are narrowing down the race, but it still exists. Am I missing something ?
> >
> > You're correct, this narrows the window in which a CPU can go offline.
> >
> > In testing with this patch we have not been able to re-create the failure but
> > there is still a small window.
> 
> Well let's close it.
> 
> We just need to check that all present CPUs are online after we've
> called cpu_hotplug_disable() don't we?

Yes. However, we cannot use the cpu_up() API to bring the offline CPUs
online, since it will return -EBUSY if CPU hotplug has been
disabled. _cpu_up() works, but it is (understandably) a static
function in kernel/cpu.c.

So, we might need new APIs along the lines of
disable_nonboot_cpus()/enable_nonboot_cpus(), which are currently used
by the suspend subsystem, except that we would need the APIs to
      - Disable hotplug and online all the CPUs in an atomic
      fashion. It would be good if the API returned the cpumask of
      the CPUs that were offline and were brought online by this API.

      - Restore the state of the machine by offlining the CPUs which
      we brought online, and enable hotplug again.
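
To make the two steps above concrete, here is a rough userspace model of
the semantics I have in mind. It is only a sketch: every name in it
(model_cpu_set_online(), model_disable_hotplug_online_all(),
model_restore_and_enable_hotplug()) is hypothetical and not an existing
kernel interface, a 64-bit word stands in for a cpumask, and a pthread
mutex stands in for the hotplug lock.

```c
/*
 * Userspace model only. All function names here are hypothetical; they
 * are not existing kernel APIs. A uint64_t stands in for a cpumask and
 * a pthread mutex stands in for the CPU hotplug lock.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t present_mask = 0xffULL;	/* 8 present CPUs */
static uint64_t online_mask  = 0x0fULL;	/* CPUs 4-7 start offline */
static bool hotplug_disabled;

/* Models a userspace online/offline request; fails while hotplug is disabled. */
static int model_cpu_set_online(unsigned int cpu, bool online)
{
	int ret = 0;

	pthread_mutex_lock(&hotplug_lock);
	if (hotplug_disabled || cpu >= 64 || !(present_mask & (1ULL << cpu)))
		ret = -1;	/* stands in for -EBUSY / -EINVAL */
	else if (online)
		online_mask |= 1ULL << cpu;
	else
		online_mask &= ~(1ULL << cpu);
	pthread_mutex_unlock(&hotplug_lock);
	return ret;
}

/*
 * Disable hotplug and online every present CPU under a single lock hold,
 * so that no offline request can slip in between the two steps.
 * Returns the mask of CPUs that this call brought online.
 */
static uint64_t model_disable_hotplug_online_all(void)
{
	uint64_t brought_up;

	pthread_mutex_lock(&hotplug_lock);
	hotplug_disabled = true;
	brought_up = present_mask & ~online_mask;
	online_mask |= brought_up;
	assert(online_mask == present_mask);
	pthread_mutex_unlock(&hotplug_lock);
	return brought_up;
}

/* Offline only the CPUs that we onlined, then allow hotplug again. */
static void model_restore_and_enable_hotplug(uint64_t brought_up)
{
	pthread_mutex_lock(&hotplug_lock);
	online_mask &= ~brought_up;
	hotplug_disabled = false;
	pthread_mutex_unlock(&hotplug_lock);
}
```

The point of the model is that the "disable" and "online all" steps
happen under one lock hold, which is what closes the window, and that
the returned mask lets the caller put the machine back exactly as it
found it.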
      
> 
> cheers
> 


