BMC Health Watchdog

Patrick Venture venture at google.com
Wed Mar 28 02:42:05 AEDT 2018


On Tue, Mar 27, 2018 at 8:08 AM, Andrew Geissler <geissonator at gmail.com> wrote:
> On Fri, Mar 23, 2018 at 11:51 PM, Patrick Venture <venture at google.com> wrote:
>> I've been on a BMC health kick lately.  Mostly because it's something
>> I think we all want and have different requirements.
>>
>> Outside of exporting the BMC health information, which could be done
>> from the same watchdog daemon, what requirements has everybody run
>> into?
>
> I've seen some pretty interesting things in the past.
> - Take a time stamp, sleep for X seconds, take time stamp and compare,
> if it's off by some margin, reboot BMC (identifying some general time
> skew issue with BMC)
> - Periodically attempt to open, write, close, delete a test file on
> all non-flash based writeable fs's, if operation fails reboot BMC
> - Depending on filesystem, look for bad blocks or write limits, log
> event if past some threshold
> - Verify filesystem space in writeable fs's, run cleanup if possible,
> otherwise log event
> - Look for "bad" services, restart (yes, "bad" is vague here, but
> usually revolves around fd leaks or memory leaks or lack of response
> on critical interfaces)
> - Custom checks (specific devices like FSI in system) - look at
> errors, recover if possible
>

So there's a Linux software watchdog daemon that handles most of that,
if not all, via configuration:
https://linux.die.net/man/8/watchdog

https://layers.openembedded.org/layerindex/recipe/122/
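
For anything that ends up needing a custom check, here is a rough Python
sketch of the first two items on Andrew's list: the sleep-and-compare time
skew check and the open/write/close/delete test. The function names, the
default margin, and the injectable clock are my own assumptions for
illustration, not anything from an existing daemon.

```python
import os
import tempfile
import time


def clock_skew_ok(sleep_s=1.0, margin_s=0.5, wall=time.time, sleep=time.sleep):
    """Take a timestamp, sleep, take another, and compare.

    Returns False if the wall clock moved by more or less than
    sleep_s +/- margin_s. wall and sleep are injectable for testing.
    """
    start = wall()
    sleep(sleep_s)
    elapsed = wall() - start
    return abs(elapsed - sleep_s) <= margin_s


def fs_write_ok(directory):
    """Open, write, close, and delete a test file in directory.

    Returns False on any OS-level failure (read-only fs, no space, ...).
    """
    try:
        # NamedTemporaryFile removes the file when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".health-") as f:
            f.write(b"ok")
            f.flush()
            os.fsync(f.fileno())  # push the write through to the device
        return True
    except OSError:
        return False
```

What to do on failure (log an event versus reboot the BMC) is left to the
caller, since that policy is where the requirements seem to differ.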

> The complexity is usually with being sure to prevent situations where
> you just continuously reboot your BMC and prevent debugging of the
> actual issue.  Also, to ensure you don't just keep logging the same
> event over and over.

That's true. A single bug could put the BMCs in the fleet into what
amounts to a local DoS attack on themselves.
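
One way to guard against both problems is to cap watchdog-initiated reboots
within a time window and to suppress repeats of the same event for a quiet
period. A minimal Python sketch, assuming the reboot history lives in a
small JSON file on persistent storage; the names, path handling, and
thresholds are hypothetical:

```python
import json
import time


def reboot_allowed(state_path, max_reboots=3, window_s=3600.0, now=None):
    """Return True (and record the attempt) if fewer than max_reboots
    watchdog reboots were recorded in the last window_s seconds."""
    now = time.time() if now is None else now
    try:
        with open(state_path) as f:
            history = json.load(f)
    except (OSError, ValueError):
        history = []  # missing or corrupt state: start fresh
    recent = [t for t in history if now - t < window_s]
    if len(recent) >= max_reboots:
        return False  # stay up so the actual issue can be debugged
    recent.append(now)
    with open(state_path, "w") as f:
        json.dump(recent, f)
    return True


class EventDeduper:
    """Suppress repeats of the same event key within quiet_s seconds."""

    def __init__(self, quiet_s=600.0):
        self.quiet_s = quiet_s
        self._last = {}

    def should_log(self, key, now):
        last = self._last.get(key)
        if last is not None and now - last < self.quiet_s:
            return False
        self._last[key] = now
        return True
```

The history uses wall-clock time because it has to survive the reboot, so
it is only as trustworthy as the BMC's clock.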

>
>>
>> IIRC, the facebook presentation at the summit indicated they were
>> tracking BMC health -- thoughts?
>>
>> Patrick

