[PATCH] powerpc/pseries: Fix cpu hotplug

Fri Nov 28 21:04:46 EST 2008

  Hi Nathan,

On Thu, 27 Nov 2008 18:14:33 -0600 Nathan Lynch <ntl at pobox.com> wrote:

> Hi, I have some questions about this patch.
> 
> Sebastien Dugue wrote:
> > 
> >   Currently, pseries_cpu_die() calls msleep() while polling RTAS for
> > the status of the dying cpu.
> > 
> >   However if the cpu that is going down also happens to be the one doing
> > the tick then we're hosed as the tick_do_timer_cpu 'baton' is only passed
> > later on in tick_shutdown() when _cpu_down() does the CPU_DEAD notification.
> > Therefore jiffies won't be updated anymore.
> 
> I confess unfamiliarity with the tick/timer code, but this sounds like
> something that should be addressed earlier in the process of taking
> down a CPU.

  Maybe you're right, at least the tick_do_timer_cpu should be changed earlier
in the down process, but I'm not sure where we can do that.

> 
> 
> >   This patch replaces that msleep() with a cpu_relax() to make sure we're
> > not going to schedule at that point.
> 
> This is a significant change in behavior.  With the msleep(), we poll
> for at least five seconds before giving up; with the cpu_relax(), the
> period will almost certainly be much shorter and we're likely to give
> up too soon in some circumstances.

  Right, I realized a bit late that that would indeed change the
behaviour. On my test box (2 Power6) the msleep call is hit in ~10 % of the
cases and only loop once, but that may not be the case for all the pSeries
out there where a longer delay might be needed.

>  Could be addressed by using
> mdelay(), but...

  Yep, would be better. I'm still wondering why the hang is not systematic
when offlining the tick_do_timer_cpu, I must have missed something in
my analysis and there might be a race somewhere I failed to identify.

> 
> It's just not clear to me how busy-waiting in the __cpu_die() path is
> a legitimate fix.  Is sleeping in this path forbidden now?

  In that case, I think it is if the cpu going down is also doing the
tick.

>  (I notice
> at least native_cpu_die() in x86 does msleep(), btw.)

  Right, as most other arches do, but the thing is that on those, you
cannot offline CPU0, whereas on power (and maybe otheres too) you can.

> 
> As it can take several milliseconds for RTAS to report a CPU
> offline, and the maximum latency of the operation is unspecified, it
> seems inappropriate to tie up the waiting CPU this way.

  Agreed.

> 
> 
> >   With this patch my test box survives a 100k iterations hotplug stress
> > test on _all_ cpus, whereas without it, it quickly dies after ~50 iterations.
> 
> What is the failure (e.g. stack trace, kernel messages)?

  No stack trace, no kernel messages, nothing :( When that happens, the
cpus get stuck in idle and won't reschedule at all.

  I verified that the decrementer is still ticking, hrtimers are still
running but regular timers are stuck. Changing the tick_do_timer_cpu manually
under xmon resurects the box.

  Will have to look at this a bit more.

  Thanks,

  Sebastien.