Critical BMC process failure recovery
Andrew Geissler
geissonator at gmail.com
Fri Oct 23 02:41:33 AEDT 2020
> On Oct 19, 2020, at 4:35 PM, Neil Bradley <Neil_Bradley at phoenix.com> wrote:
>
> Hey Andrew!
>
> At least initially, the requirements don't really seem like requirements - they seem like someone's idea of what a solution would be. For example, why reset 3 times? Why not 10? Or 2? It seems completely arbitrary.
Hey Neil. I was starting from the requirements of our previous
closed-source system. The processes that cause a reset and the number
of times we reset should definitely be configurable.
> If the BMC resets twice in a row, there's no reason to think it would be OK the 3rd time. It's kinda like how people have been known to do 4-5 firmware updates to "fix" a problem and it "still doesn't work". 😉
Yeah, history has shown that if one reboot doesn’t fix it then you’re
probably out of luck. But…it is up to the system owner to
configure whatever they like.
>
> If the ultimate goal is availability, then there's more nuance to the discussion to be had. Let's assume the goal is "highest availability possible".
>
> With that in mind, defining what "failure" is gets to be a bit more convoluted. Back when we did the CMM code for the Intel modular server, we had a several-pronged approach:
>
> 1) Run procmon - Look for any service that is supposed to be running (but isn't) and restart it and/or its process dependency tree.
> 2) Create a monitor (either a standalone program or a script) that periodically connects to the various services available - IPMI, web, KVM, etc.... - think of it like a functional "ping". A bit more involved, as this master control program (Tron reference 😉 ) would have to speak sentiently to each service to gauge how alive it is. There have been plenty of situations where a BMC is otherwise healthy but one service wasn't working, and it's overkill to have a 30-45 second outage while the BMC restarts.
This sounds like it fits in with https://github.com/openbmc/phosphor-health-monitor
That to me is the next level of process health and recovery, but initially here
I was just looking at the broad question of “what do we do if a service has been
restarted x number of times, is still in a failed state, and is critical to the basic
functionality of the BMC”. To me the only options are to try a reboot
of the BMC, or to log an error and indicate the BMC is in an unstable
state.
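
As a rough sketch of the systemd side of that (the unit name, limits, and
path below are purely illustrative placeholders, not a settled proposal),
a critical service could carry a drop-in along these lines:

  # Hypothetical drop-in for a critical service, e.g.
  # /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/10-critical.conf
  [Unit]
  # Allow a few restart attempts within a window before giving up.
  StartLimitIntervalSec=300
  StartLimitBurst=3
  # If the unit still ends up in the failed state, force a BMC reboot.
  FailureAction=reboot-force

  [Service]
  Restart=on-failure
  RestartSec=5

Something separate would still need to count those forced reboots and stop
the cycle once the configured BMC reboot limit is hit.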
>
> -----Original Message-----
> From: openbmc <openbmc-bounces+neil_bradley=phoenix.com at lists.ozlabs.org> On Behalf Of Andrew Geissler
> Sent: Monday, October 19, 2020 12:53 PM
> To: OpenBMC Maillist <openbmc at lists.ozlabs.org>
> Subject: Critical BMC process failure recovery
>
> Greetings,
>
> I've started initial investigation into two IBM requirements:
>
> - Reboot the BMC if a "critical" process fails and cannot recover
> - Limit the number of times the BMC reboots for recovery
>   - The limit should be configurable, e.g. 3 resets within 5 minutes
>   - If the limit is reached, display an error on the panel (if one is
>     available) and halt the BMC.
>
> The goal here is to have the BMC try to get itself back into a working state by rebooting itself.
>
> This same reboot logic and limits would also apply to kernel panics and/or BMC hardware watchdog expirations.
>
> Some thoughts that have been thrown around internally:
>
> - Spend more time ensuring code doesn't fail vs. handling failures
> - Put all BMC code into a single application so it's all or nothing (vs.
>   trying to pick and choose specific applications and dealing with all of
>   the intricacies of restarting individual ones)
> - Rebooting the BMC and getting the proper ordering of service starts is
>   sometimes easier than testing every individual service restart for recovery
>   paths
>
> "Critical" processes would be things like mapper or dbus-broker. There's definitely a grey area though with other services so we'd need some guidelines around defining them and allow the meta layers to have a way to deem whichever they want critical.
>
> So anyway, just throwing this out there to see if anyone has any input or is looking for something similar.
>
> At a high level, I'd probably start by utilizing systemd as much as possible: "FailureAction=reboot-force" in the critical services, and something that monitors for these types of reboots and enforces the reboot limits.
>
> Andrew
>
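
For the "something that monitors for these types of reboots and enforces the
reboot limits" piece above, one hypothetical shape (neither the unit name nor
the helper script below exists today; this is only a sketch of the idea) would
be a oneshot service that runs early in boot, bumps a reboot counter kept in
persistent storage, and halts further recovery reboots once the limit is
exceeded:

  # Hypothetical unit, e.g. /lib/systemd/system/bmc-reboot-limit.service
  [Unit]
  Description=Enforce the BMC recovery reboot limit
  Before=multi-user.target

  [Service]
  Type=oneshot
  RemainAfterExit=yes
  # The (hypothetical) helper would increment a counter in persistent storage,
  # clear it after a period of stability, and once the limit is exceeded log an
  # error, update the panel if one is present, and disable further recovery
  # reboots instead of letting the BMC loop.
  ExecStart=/usr/bin/bmc-reboot-limit-check

  [Install]
  WantedBy=multi-user.target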