[PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

Sat May 20 03:23:06 AEST 2023

Hi,

On Mon, May 8, 2023 at 8:52 AM Doug Anderson <dianders at chromium.org> wrote:
>
> Hmmm, but I don't think you really need "all-to-all" checking to get
> the stacktraces you want, do you? Each CPU can be "watching" exactly
> one other CPU, but then when we actually lock up we could check all of
> them and dump stacks on all the ones that are locked up. I think this
> would be a fairly easy improvement for the buddy system. I'll leave it
> out for now just to keep things simple for the initial landing, but it
> wouldn't be hard to add. Then I think the two SMP systems  (buddy vs.
> all-to-all) would be equivalent in terms of functionality?

FWIW, I take back my "this would be fairly easy" comment. :-P ...or,
at least I'll acknowledge that the easy way has some tradeoffs. It
wouldn't be trivially easy to just snoop on the data of the other
buddies because the watching processors aren't necessarily
synchronized with each other.

That being said, if someone really wanted to report on other locked
CPUs before doing a panic() and was willing to delay the panic, it
probably wouldn't be too hard to put in a mode where the CPU that
detects the first lockup could do some extra work to look for lockups.
Maybe it could send a normal IPI to other CPUs and see if they respond
or maybe it could take over monitoring all CPUs and wait one extra
period.

In any case, I'm not planning on implementing this now, but at least
wanted to document thoughts. ;-)

> With my simplistic solution
> of just allowing the buddy detector to be enabled in parallel with a
> perf-based detector then we wouldn't have this level of coordination,
> but I'll assume that's OK for the initial landing.

I dug into this more as well and I also wanted to note that, at least
for now, I'm not going to include support to turn on both the buddy
and perf lockup detectors in the common core. In order to do this and
not have them stomp on each other then I think we need extra
coordination or two copies of the interrupt count / saved interrupt
count and, at least at this point in time, it doesn't seem worth it
for a halfway solution. From everything I've heard there is a push on
many x86 machines to get off the perf lockup detector anyway to free
up the resources. Someone could look at adding this complexity later.

-Doug