[PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Michael Roth
mdroth at linux.vnet.ibm.com
Tue Aug 11 15:39:01 AEST 2020
Quoting Nathan Lynch (2020-08-07 02:05:09)
> Hi everyone,
>
> Michael Ellerman <mpe at ellerman.id.au> writes:
> > Greg Kurz <groug at kaod.org> writes:
> >> On Tue, 04 Aug 2020 23:35:10 +1000
> >> Michael Ellerman <mpe at ellerman.id.au> wrote:
> >>> Spinning forever seems like a bad idea, but as has been demonstrated at
> >>> least twice now, continuing when we don't know the state of the other
> >>> CPU can lead to straight up crashes.
> >>>
> >>> So I think I'm persuaded that it's preferable to have the kernel stuck
> >>> spinning rather than oopsing.
> >>>
> >>
> >> +1
> >>
> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >>> first instinct is no, if we're stuck here for 20s a stack trace would be
> >>> good. But then we will probably hit that on some big and/or heavily
> >>> loaded machine.
> >>>
> >>> So possibly we should call cond_resched() but have some custom logic in
> >>> the loop to print a warning if we are stuck for more than some
> >>> sufficiently long amount of time.
> >>
> >> How long should that be ?
> >
> > Yeah good question.
> >
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> >
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
>
> Maybe I'm not quite clear what this is referring to, but I don't think
> stop-self/query-stopped-state latency increases with processor count, at
> least not on PowerVM. And IIRC I was concerned with the earlier patch's
> potential for causing the softlockup watchdog to rightly complain by
> polling the stopped state without ever scheduling away.
>
> The fact that smp_query_cpu_stopped() kind of collapses the two distinct
> results from the query-cpu-stopped-state RTAS call into one return value
> may make it harder than necessary to reason about the questions around
> cond_resched() and whether to warn.
>
> Sorry to pull this stunt but I have had some code sitting in a neglected
> branch that I think gets the logic around this right.
>
> What we should have is a simple C wrapper for the RTAS call that reflects the
> architected inputs and outputs:
>
> ================================================================
> (-- rtas.c --)
>
> /**
> * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
> * @hwcpu: Identifies the processor thread to be queried.
> * @status: Pointer to status, valid only on success.
> *
> * Determine whether the given processor thread is in the stopped
> * state. If successful and @status is non-NULL, the thread's status
> * is stored to @status.
> *
> * Return:
> * * 0 - Success
> * * -1 - Hardware error
> * * -2 - Busy, try again later
> */
> int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
> {
> unsigned int cpu_status;
> int token;
> int fwrc;
>
> token = rtas_token("query-cpu-stopped-state");
>
> fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
> if (fwrc != 0)
> goto out;
>
> if (status != NULL)
> *status = cpu_status;
> out:
> return fwrc;
> }
> ================================================================
>
>
> And then a utility function that waits for the remote thread to enter
> stopped state, with higher-level logic for rescheduling and warning. The
> fact that smp_query_cpu_stopped() currently does not handle a -2/busy
> status is a bug, fixed below by using rtas_busy_delay(). Note the
> justification for the explicit cond_resched() in the outer loop:
>
> ================================================================
> (-- rtas.h --)
>
> /* query-cpu-stopped-state CPU_status */
> #define RTAS_QCSS_STATUS_STOPPED 0
> #define RTAS_QCSS_STATUS_IN_PROGRESS 1
> #define RTAS_QCSS_STATUS_NOT_STOPPED 2
>
> (-- pseries/hotplug-cpu.c --)
>
> /**
> * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
> */
> static void wait_for_cpu_stopped(unsigned int cpu)
> {
> unsigned int status;
> unsigned int hwcpu;
>
> hwcpu = get_hard_smp_processor_id(cpu);
>
> do {
> int fwrc;
>
> /*
> * rtas_busy_delay() will yield only if RTAS returns a
> * busy status. Since query-cpu-stopped-state can
> * yield RTAS_QCSS_STATUS_IN_PROGRESS or
> * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
> * period before the target thread stops, we must take
> * care to explicitly reschedule while polling.
> */
> cond_resched();
>
> do {
> fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
> } while (rtas_busy_delay(fwrc));
>
> if (fwrc == 0)
> continue;
>
> pr_err_ratelimited("query-cpu-stopped-state for "
> "thread 0x%x returned %d\n",
> hwcpu, fwrc);
> goto out;
>
> } while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
> status == RTAS_QCSS_STATUS_IN_PROGRESS);
>
> if (status != RTAS_QCSS_STATUS_STOPPED) {
> pr_err_ratelimited("query-cpu-stopped-state yielded unknown "
> "status %d for thread 0x%x\n",
> status, hwcpu);
> }
> out:
> return;
> }
>
> [...]
>
> static void pseries_cpu_die(unsigned int cpu)
> {
> wait_for_cpu_stopped(cpu);
> paca_ptrs[cpu]->cpu_start = 0;
> }
> ================================================================
>
> wait_for_cpu_stopped() should be able to accommodate a time-based
> warning if necessary, but speaking as a likely recipient of any bug
> reports that would arise here, I'm not convinced of the need and I
> don't know what a good value would be. It's relatively easy to sample
> the stack of a task that's apparently failing to make progress, plus I
> probably would use 'perf probe' or similar to report the inputs and
> outputs for the RTAS call.
I think if we make the timeout sufficiently high like 2 minutes or so
it wouldn't hurt and if we did seem them it would probably point to an
actual bug. But I don't have a strong feeling either way.
>
> I'm happy to make this a proper submission after I can clean it up and
> retest it, or Michael R. is welcome to appropriate it, assuming it's
> acceptable.
>
I've given it a shot with this patch and it seems to be holding up in
testing. If we don't think the ~2 minutes warning message is needed I
can clean it up to post:
https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e
I'd likely break the refactoring patches out to a separate patch under
Nathan's name since it fixes a separate bug potentially.
More information about the Linuxppc-dev
mailing list