[PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration

Tue Feb 5 21:24:51 AEDT 2019

Michael Bringmann <mwb at linux.vnet.ibm.com> writes:
> See below.
>
> On 1/31/19 3:53 PM, Michael Bringmann wrote:
>> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>>> Michael Bringmann <mwb at linux.vnet.ibm.com> writes:
>>>> This patch is to check for cede'ed CPUs during LPM.  Some extreme
>>>> tests encountered a problem ehere Linux has put some threads to
>>>> sleep (possibly to save energy or something), LPM was attempted,
>>>> and the Linux kernel didn't awaken the sleeping threads, but issued
>>>> the H_JOIN for the active threads.  Since the sleeping threads
>>>> are not awake, they can not issue the expected H_JOIN, and the
>>>> partition would never suspend.  This patch wakes the sleeping
>>>> threads back up.
>>>
>>> I'm don't think this is the right solution.
>>>
>>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>>> to every CPU, and that should wake all CPUs up from CEDE.
>>>
>>> If that's not happening then there is a bug somewhere, and we need to
>>> work out where.
>
> From Pete Heyrman:
>     Both sending IPI or H_PROD will awaken a logical processors that has ceded.
>     When you have logical proc doing cede and one logical proc doing prod or IPI
>     you have a race condition that the prod/IPI can proceed the cede request.
>     If you use prod, the hypervisor takes care of the synchronization by ignoring
>     a cede request if it was preceeded by a prod.  With IPI the interrupt is
>     delivered which could then be followed by a cede so OS would need to provide
>     synchronization.
>
> Shouldn't this answer your concerns about race conditions and the suitability
> of using H_PROD?

No sorry it doesn't.

Assuming the other CPU is idle it will just continually do CEDE in a
loop, sending it a PROD will just wake it up once and then it will CEDE
again. That first CEDE might return immediately on seeing the PROD, but
then the kernel will just CEDE again because it has nothing to do.

In contrast the IPI we send wakes up the other CPU and tells it to run a
function, rtas_percpu_suspend_me(), which does the H_JOIN directly.

I still don't understand how the original bug ever even happened. That's
what I want to know.

The way we do the joining and suspend seems like it could be simpler,
there's a bunch of atomic flags and __rtas_suspend_last_cpu() seems to
duplicate much of __rtas_suspend_cpu(). It seems more likely we have a
bug in there somewhere.

cheers