Critical BMC process failure recovery

Tue Oct 20 06:53:11 AEDT 2020

Greetings,

I've started initial investigation into two IBM requirements:

- Reboot the BMC if a "critical" process fails and cannot recover
- Limit the number of times the BMC reboots for recovery
  - Limit should be configurable, e.g. 3 resets within 5 minutes
  - If the limit is reached, display an error on the panel (if one is
    available) and halt the BMC.
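
To make the limit idea concrete, here is a rough sketch of a sliding-window
check. Nothing here is existing code: RebootLimiter and record_and_check are
made-up names, and I'm assuming "3 resets within 5 minutes" means three
recovery reboots are allowed and the next failure inside the window halts.

```python
from collections import deque


class RebootLimiter:
    """Sliding-window limit on recovery reboots (hypothetical helper)."""

    def __init__(self, max_reboots=3, window_seconds=300):
        # Both values are meant to be configurable; the defaults mirror the
        # "3 resets within 5 minutes" example.
        self.max_reboots = max_reboots
        self.window_seconds = window_seconds
        self.history = deque()  # timestamps (seconds) of recent recovery reboots

    def record_and_check(self, now):
        """Decide what to do for a failure at time `now` (in seconds).

        Returns "reboot" if a recovery reboot is still allowed (and records
        it), or "halt" if the limit for the current window has been reached.
        """
        # Forget reboots that have aged out of the window.
        while self.history and now - self.history[0] >= self.window_seconds:
            self.history.popleft()
        if len(self.history) >= self.max_reboots:
            return "halt"
        self.history.append(now)
        return "reboot"
```

On a real BMC the history would have to survive the reboot itself (for
example in persistent storage), and the timestamps would need a time source
that is still meaningful after the reset. The sketch leaves both out.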

The goal here is to have the BMC try to get itself back into a working state
by rebooting itself.

This same reboot logic and limits would also apply to kernel panics and/or
BMC hardware watchdog expirations.

Some thoughts that have been thrown around internally:

- Spend more time ensuring code doesn't fail vs. handling it failing
- Put all BMC code into a single application so it's all or nothing (vs.
  trying to pick and choose specific applications and dealing with all of
  the intricacies of restarting individual ones)
- Rebooting the BMC and getting the proper ordering of service starts is
  sometimes easier than testing every individual service restart for recovery
  paths

"Critical" processes would be things like mapper or dbus-broker. There's
definitely a grey area though with other services, so we'd need some
guidelines for defining them and a way for the meta layers to mark
whichever services they want as critical.

So anyway, just throwing this out there to see if anyone has any input
or is looking for something similar.

At a high level, I'd probably start by looking into utilizing systemd as much
as possible: "FailureAction=reboot-force" in the critical services, plus
something that monitors for these types of reboots and enforces the reboot
limits.
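
As a rough illustration of the systemd side, the sketch below renders a
drop-in that sets FailureAction= for each unit a layer considers critical.
The helper names, the drop-in file name, and the unit list are placeholders
I made up for the example ("mapper.service" in particular is a stand-in, not
necessarily the real unit name). It only builds the text and path; it does
not write anything.

```python
def render_dropin(action="reboot-force"):
    """Return the text of a systemd drop-in that sets FailureAction=.

    FailureAction= belongs in the [Unit] section on current systemd.
    """
    return "[Unit]\nFailureAction={}\n".format(action)


def dropin_path(unit, name="10-critical.conf"):
    """Return the path where a drop-in for `unit` would be installed.

    The file name is a hypothetical choice; the <unit>.d directory layout is
    the standard systemd drop-in convention.
    """
    return "/etc/systemd/system/{}.d/{}".format(unit, name)


# Placeholder list: a meta layer would supply its own set of critical units.
CRITICAL_UNITS = ["mapper.service", "dbus-broker.service"]

if __name__ == "__main__":
    for unit in CRITICAL_UNITS:
        print(dropin_path(unit))
        print(render_dropin())
```

Generating drop-ins rather than editing the unit files directly is one way
to let each meta layer pick its own critical services without touching the
upstream units, but that is a design guess on my part.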

Andrew