BMC Health Watchdog

Wed Mar 28 02:08:52 AEDT 2018

On Fri, Mar 23, 2018 at 11:51 PM, Patrick Venture <venture at google.com> wrote:
> I've been on a BMC health kick lately.  Mostly because it's something
> I think we all want and have different requirements.
>
> Outside of exporting the BMC health information, which could be done
> from the same watchdog daemon, -- what requirements has everybody run
> into?

I've seen some pretty interesting things in the past.
- Take a timestamp, sleep for X seconds, take another timestamp and
compare; if the elapsed time is off by more than some margin, reboot
the BMC (this catches general time skew issues on the BMC)
- Periodically attempt to open, write, close, and delete a test file
on every writeable, non-flash-based filesystem; if any operation
fails, reboot the BMC
- Depending on filesystem, look for bad blocks or write limits, log
event if past some threshold
- Verify free space on writeable filesystems, run cleanup if possible,
otherwise log event
- Look for "bad" services, restart (yes, "bad" is vague here, but
usually revolves around fd leaks or memory leaks or lack of response
on critical interfaces)
- Custom checks (specific devices like FSI in system) - look at
errors, recover if possible
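
As a rough illustration of the first two checks above, here is a
minimal Python sketch. The function names, the injectable clock and
sleep parameters, and the ".bmc-health-" file prefix are all made up
for this example; it is not taken from any existing BMC daemon, and
the reboot decision itself is left to the caller.

```python
import os
import tempfile
import time


def clock_skew(interval_s, clock=time.time, sleep=time.sleep):
    """Return how far the measured elapsed time deviates from the sleep.

    clock and sleep are injectable (hypothetical parameters) so the
    check can be exercised without actually waiting.
    """
    start = clock()
    sleep(interval_s)
    return abs((clock() - start) - interval_s)


def fs_write_ok(directory):
    """Open, write, close, and delete a test file in directory.

    Returns False if any step raises an OS-level error.
    """
    try:
        fd, path = tempfile.mkstemp(prefix=".bmc-health-", dir=directory)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)
        finally:
            # Always clean up the probe file, even if the write failed.
            os.close(fd)
            os.unlink(path)
        return True
    except OSError:
        return False
```

A caller would compare the clock_skew() result against its own margin
and run fs_write_ok() against each writeable mount point it cares
about.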

The complexity is usually in making sure you don't end up in a
situation where you just continuously reboot your BMC and prevent
debugging of the actual issue.  You also need to ensure you don't
just keep logging the same event over and over.
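
One way to approach both problems, sketched in Python under some
assumptions: the reboot history is kept in a JSON file on storage that
survives a reboot, and repeated events are suppressed for a quiet
period in memory. The class names, the thresholds, and the state file
are hypothetical and only meant to show the idea.

```python
import json
import time


class RebootGuard:
    """Allow at most max_reboots within window_s seconds.

    The history lives in state_path, which is assumed to be on
    storage that persists across BMC reboots.
    """

    def __init__(self, state_path, max_reboots=3, window_s=3600,
                 clock=time.time):
        self.state_path = state_path
        self.max_reboots = max_reboots
        self.window_s = window_s
        self.clock = clock

    def _load(self):
        try:
            with open(self.state_path) as f:
                return json.load(f)
        except (OSError, ValueError):
            # Missing or corrupt state is treated as "no history".
            return []

    def allow_reboot(self):
        now = self.clock()
        recent = [t for t in self._load() if now - t < self.window_s]
        if len(recent) >= self.max_reboots:
            return False  # stay up so the issue can be debugged
        recent.append(now)
        with open(self.state_path, "w") as f:
            json.dump(recent, f)
        return True


class EventDeduper:
    """Suppress repeats of the same event key for quiet_s seconds."""

    def __init__(self, quiet_s=600, clock=time.time):
        self.quiet_s = quiet_s
        self.clock = clock
        self.last = {}

    def should_log(self, key):
        now = self.clock()
        last = self.last.get(key)
        if last is not None and now - last < self.quiet_s:
            return False
        self.last[key] = now
        return True
```

A health check would call allow_reboot() before triggering a reboot
and should_log() before emitting an event, and fall back to just
leaving the BMC up when either one returns False.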

>
> IIRC, the facebook presentation at the summit indicated they were
> tracking BMC health -- thoughts?
>
> Patrick