[PATCH] powerpc/rtas: Fix a potential race between CPU-Offline & Migration

Gautham R Shenoy ego at linux.vnet.ibm.com
Fri Sep 28 17:02:44 AEST 2018


Hi Nathan,

On Thu, Sep 27, 2018 at 12:31:34PM -0500, Nathan Fontenot wrote:
> On 09/27/2018 11:51 AM, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" <ego at linux.vnet.ibm.com>
> > 
> > Live Partition Migrations require all the present CPUs to execute the
> > H_JOIN call, and hence rtas_ibm_suspend_me() onlines any offline CPUs
> > before initiating the migration for this purpose.
> > 
> > The commit 85a88cabad57
> > ("powerpc/pseries: Disable CPU hotplug across migrations")
> > disables any CPU-hotplug operations once all the offline CPUs are
> > brought online to prevent any further state change. Once the
> > CPU-Hotplug operation is disabled, the code assumes that all the CPUs
> > are online.
> > 
> > However, there is a minor window in rtas_ibm_suspend_me() between
> > onlining the offline CPUs and disabling CPU-Hotplug when a concurrent
> > CPU-offline operations initiated by the userspace can succeed thereby
> > nullifying the the aformentioned assumption. In this unlikely case
> > these offlined CPUs will not call H_JOIN, resulting in a system hang.
> > 
> > Fix this by verifying that all the present CPUs are actually online
> > after CPU-Hotplug has been disabled, failing which we return from
> > rtas_ibm_suspend_me() with -EBUSY.
> 
> Would we also want to havr the ability to re-try onlining all of the cpus
> before failing the migration?

Given that we haven't been able to hit issue in practice after your
fix to disable CPU Hotplug after migrations, it indicates that the
race-window, if it is not merely a theoretical one, is extremely
narrow. So, this current patch addresses the safety aspect, as in,
should someone manage to exploit this narrow race-window, it ensures
that the system doesn't go to a hang state.

Having the ability to retry onlining all the CPUs is only required for
progress of LPM in this rarest of cases. We should add the code to
retry onlining the CPUs if the consequence of failing an LPM is high,
even in this rarest of case. Otherwise IMHO we should be ok not adding
the additional code.

> 
> This would involve a bigger code change as the current code to online all
> CPUs would work in its current form.
> 
> -Nathan
> 
> > 
> > Cc: Nathan Fontenot <nfont at linux.vnet.ibm.com>
> > Cc: Tyrel Datwyler <tyreld at linux.vnet.ibm.com>
> > Suggested-by: Michael Ellerman <mpe at ellerman.id.au>
> > Signed-off-by: Gautham R. Shenoy <ego at linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/rtas.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> > index 2c7ed31..27f6fd3 100644
> > --- a/arch/powerpc/kernel/rtas.c
> > +++ b/arch/powerpc/kernel/rtas.c
> > @@ -982,6 +982,16 @@ int rtas_ibm_suspend_me(u64 handle)
> >  	}
> > 
> >  	cpu_hotplug_disable();
> > +
> > +	/* Check if we raced with a CPU-Offline Operation */
> > +	if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
> > +		pr_err("%s: Raced against a concurrent CPU-Offline\n",
> > +		       __func__);
> > +		atomic_set(&data.error, -EBUSY);
> > +		cpu_hotplug_enable();
> > +		goto out;
> > +	}
> > +
> >  	stop_topology_update();
> > 
> >  	/* Call function on all CPUs.  One of us will make the
> > 



More information about the Linuxppc-dev mailing list