Critical BMC process failure recovery

Tue Oct 20 13:58:48 AEDT 2020

Hi Andrew,

Intel-BMC/openbmc has a watchdog config for every service, so that if a
service fails, the BMC is reset using the watchdog. See the related
config and script below.

https://github.com/Intel-BMC/openbmc/blob/intel/meta-openbmc-mods/meta-common/classes/systemd-watchdog.bbclass
https://github.com/Intel-BMC/openbmc/blob/intel/meta-openbmc-mods/meta-common/recipes-phosphor/watchdog/system-watchdog/watchdog-reset.sh

It probably meets most of the requirements.
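
For the reboot limit you describe (e.g. 3 resets within 5 minutes), the
check itself could look roughly like the sketch below. This is only an
illustration of a sliding-window counter, not what watchdog-reset.sh
does; the RebootLimiter class and its parameters are hypothetical names.
A real version would also need to store the timestamps somewhere that
survives a BMC reboot and use a clock that is valid across reboots,
which is left out here.

```python
import time
from collections import deque


class RebootLimiter:
    """Sliding-window limit on recovery reboots (hypothetical helper)."""

    def __init__(self, max_reboots=3, window_seconds=300.0):
        self.max_reboots = max_reboots        # e.g. 3 resets ...
        self.window_seconds = window_seconds  # ... within 5 minutes
        self._stamps = deque()                # times of recent reboots

    def allow_reboot(self, now=None):
        """Return True and record the reboot if still under the limit."""
        # Assumption: in-memory state and a monotonic clock, for
        # illustration only; neither survives an actual reboot.
        now = time.monotonic() if now is None else now
        # Forget reboots that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_reboots:
            return False  # limit reached: caller should report and halt
        self._stamps.append(now)
        return True


limiter = RebootLimiter(max_reboots=3, window_seconds=300.0)
decisions = [limiter.allow_reboot(now=t) for t in (0, 60, 120, 180)]
```

With the defaults above, the first three requests are allowed and the
fourth (at t=180) is refused, which is where the "display error and
halt" path would kick in.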

On Tue, Oct 20, 2020 at 3:54 AM Andrew Geissler <geissonator at gmail.com> wrote:
>
> Greetings,
>
> I've started initial investigation into two IBM requirements:
>
> - Reboot the BMC if a "critical" process fails and cannot recover
> - Limit the number of times the BMC reboots for recovery
>   - Limit should be configurable, e.g. 3 resets within 5 minutes
>   - If limit reached, display error to panel (if one available) and halt
>     the BMC.
>
> The goal here is to have the BMC try and get itself back into a working state
> via a reboot of itself.
>
> This same reboot logic and limits would also apply to kernel panics and/or
> BMC hardware watchdog expirations.
>
> Some thoughts that have been thrown around internally:
>
> - Spend more time ensuring code doesn't fail vs. handling its failures
> - Put all BMC code into a single application so it's all or nothing (vs.
>   trying to pick and choose specific applications and dealing with all of
>   the intricacies of restarting individual ones)
> - Rebooting the BMC and getting the proper ordering of service starts is
>   sometimes easier than testing every individual service restart for recovery
>   paths
>
> "Critical" processes would be things like mapper or dbus-broker. There's
> definitely a grey area though with other services so we'd need some
> guidelines around defining them and allow the meta layers to have a way
> to deem whichever they want critical.
>
> So anyway, just throwing this out there to see if anyone has any input
> or is looking for something similar.
>
> High level, I'd probably start looking into utilizing systemd as much as
> possible. "FailureAction=reboot-force" in the critical services and something
> that monitors for these types of reboots and enforces the reboot limits.
>
> Andrew

-- 
BRs,
Lei YU