[PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

Tue Aug 11 15:39:01 AEST 2020

Quoting Nathan Lynch (2020-08-07 02:05:09)
> Hi everyone,
> 
> Michael Ellerman <mpe at ellerman.id.au> writes:
> > Greg Kurz <groug at kaod.org> writes:
> >> On Tue, 04 Aug 2020 23:35:10 +1000
> >> Michael Ellerman <mpe at ellerman.id.au> wrote:
> >>> Spinning forever seems like a bad idea, but as has been demonstrated at
> >>> least twice now, continuing when we don't know the state of the other
> >>> CPU can lead to straight up crashes.
> >>> 
> >>> So I think I'm persuaded that it's preferable to have the kernel stuck
> >>> spinning rather than oopsing.
> >>> 
> >>
> >> +1
> >>
> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >>> first instinct is no, if we're stuck here for 20s a stack trace would be
> >>> good. But then we will probably hit that on some big and/or heavily
> >>> loaded machine.
> >>> 
> >>> So possibly we should call cond_resched() but have some custom logic in
> >>> the loop to print a warning if we are stuck for more than some
> >>> sufficiently long amount of time.
> >>
> >> How long should that be ?
> >
> > Yeah good question.
> >
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> >
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> 
> Maybe I'm not quite clear what this is referring to, but I don't think
> stop-self/query-stopped-state latency increases with processor count, at
> least not on PowerVM. And IIRC I was concerned with the earlier patch's
> potential for causing the softlockup watchdog to rightly complain by
> polling the stopped state without ever scheduling away.
> 
> The fact that smp_query_cpu_stopped() kind of collapses the two distinct
> results from the query-cpu-stopped-state RTAS call into one return value
> may make it harder than necessary to reason about the questions around
> cond_resched() and whether to warn.
> 
> Sorry to pull this stunt but I have had some code sitting in a neglected
> branch that I think gets the logic around this right.
> 
> What we should have is a simple C wrapper for the RTAS call that reflects the
> architected inputs and outputs:
> 
> ================================================================
> (-- rtas.c --)
> 
> /**
>  * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
>  * @hwcpu: Identifies the processor thread to be queried.
>  * @status: Pointer to status, valid only on success.
>  *
>  * Determine whether the given processor thread is in the stopped
>  * state.  If successful and @status is non-NULL, the thread's status
>  * is stored to @status.
>  *
>  * Return:
>  * * 0   - Success
>  * * -1  - Hardware error
>  * * -2  - Busy, try again later
>  */
> int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
> {
>        unsigned int cpu_status;
>        int token;
>        int fwrc;
> 
>        token = rtas_token("query-cpu-stopped-state");
> 
>        fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
>        if (fwrc != 0)
>                goto out;
> 
>        if (status != NULL)
>                *status = cpu_status;
> out:
>        return fwrc;
> }
> ================================================================
> 
> 
> And then a utility function that waits for the remote thread to enter
> stopped state, with higher-level logic for rescheduling and warning. The
> fact that smp_query_cpu_stopped() currently does not handle a -2/busy
> status is a bug, fixed below by using rtas_busy_delay(). Note the
> justification for the explicit cond_resched() in the outer loop:
> 
> ================================================================
> (-- rtas.h --)
> 
> /* query-cpu-stopped-state CPU_status */
> #define RTAS_QCSS_STATUS_STOPPED     0
> #define RTAS_QCSS_STATUS_IN_PROGRESS 1
> #define RTAS_QCSS_STATUS_NOT_STOPPED 2
> 
> (-- pseries/hotplug-cpu.c --)
> 
> /**
>  * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
>  */
> static void wait_for_cpu_stopped(unsigned int cpu)
> {
>        unsigned int status;
>        unsigned int hwcpu;
> 
>        hwcpu = get_hard_smp_processor_id(cpu);
> 
>        do {
>                int fwrc;
> 
>                /*
>                 * rtas_busy_delay() will yield only if RTAS returns a
>                 * busy status. Since query-cpu-stopped-state can
>                 * yield RTAS_QCSS_STATUS_IN_PROGRESS or
>                 * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
>                 * period before the target thread stops, we must take
>                 * care to explicitly reschedule while polling.
>                 */
>                cond_resched();
> 
>                do {
>                        fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
>                } while (rtas_busy_delay(fwrc));
> 
>                if (fwrc == 0)
>                        continue;
> 
>                pr_err_ratelimited("query-cpu-stopped-state for "
>                                   "thread 0x%x returned %d\n",
>                                   hwcpu, fwrc);
>                goto out;
> 
>        } while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
>                 status == RTAS_QCSS_STATUS_IN_PROGRESS);
> 
>        if (status != RTAS_QCSS_STATUS_STOPPED) {
>                pr_err_ratelimited("query-cpu-stopped-state yielded unknown "
>                                   "status %d for thread 0x%x\n",
>                                   status, hwcpu);
>        }
> out:
>        return;
> }
> 
> [...]
> 
> static void pseries_cpu_die(unsigned int cpu)
> {
>        wait_for_cpu_stopped(cpu);
>        paca_ptrs[cpu]->cpu_start = 0;
> }
> ================================================================
> 
> wait_for_cpu_stopped() should be able to accommodate a time-based
> warning if necessary, but speaking as a likely recipient of any bug
> reports that would arise here, I'm not convinced of the need and I
> don't know what a good value would be. It's relatively easy to sample
> the stack of a task that's apparently failing to make progress, plus I
> probably would use 'perf probe' or similar to report the inputs and
> outputs for the RTAS call.

I think if we make the timeout sufficiently high like 2 minutes or so
it wouldn't hurt and if we did seem them it would probably point to an
actual bug. But I don't have a strong feeling either way.

> 
> I'm happy to make this a proper submission after I can clean it up and
> retest it, or Michael R. is welcome to appropriate it, assuming it's
> acceptable.
> 

I've given it a shot with this patch and it seems to be holding up in
testing. If we don't think the ~2 minutes warning message is needed I
can clean it up to post:

https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e

I'd likely break the refactoring patches out to a separate patch under
Nathan's name since it fixes a separate bug potentially.