[PATCH] powerpc/smp: Wait until secondaries are active & online

Stewart Smith stewart at linux.vnet.ibm.com
Thu Feb 26 10:13:16 AEDT 2015


Michael Ellerman <mpe at ellerman.id.au> writes:

> Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>
>   BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
>   CPU: 0
>   Comm: watchdog/130
>
> The problem is that we aren't ensuring the CPU active bit is set for the
> secondary before allowing the master to continue on. The master unparks
> the secondary CPU's kthreads and the scheduler looks for a CPU to run
> on. It calls select_task_rq() and realises the suggested CPU is not in
> the cpus_allowed mask. It then ends up in select_fallback_rq(), and
> since the active bit isnt't set we choose some other CPU to run on.
>
> This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug
> vs. set_cpus_allowed_ptr()", which changed from setting active before
> online to setting active after online. However that was in turn fixing a
> bug where other code assumed an active CPU was also online, so we can't
> just revert that fix.
>
> The simplest fix is just to spin waiting for both active & online to be
> set. We already have a barrier prior to set_cpu_online() (which also
> sets active), to ensure all other setup is completed before online &
> active are set.
>
> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
> Signed-off-by: Michael Ellerman <mpe at ellerman.id.au>
> Signed-off-by: Anton Blanchard <anton at samba.org>

By building a gcov enabled skiboot, which makes OPAL_START_CPU a whole
bunch slower (because gcov), I could really *really* reliably reproduce
this. With this patch, I cannot.

Tested-by: Stewart Smith <stewart at linux.vnet.ibm.com>



More information about the Linuxppc-dev mailing list