Critical BMC process failure recovery
Andrew Geissler
geissonator at gmail.com
Fri Oct 23 02:41:33 AEDT 2020
> On Oct 19, 2020, at 4:35 PM, Neil Bradley <Neil_Bradley at phoenix.com> wrote:
>
> Hey Andrew!
>
> At least initially, the requirements don't really seem like requirements - they seem like someone's idea of what a solution would be. For example, why reset 3 times? Why not 10? Or 2? It seems completely arbitrary.
Hey Neil. I was starting from the requirements of our previous
closed-source system. The processes that cause a reset and the number
of times we reset should definitely be configurable.
> If the BMC resets twice in a row, there's no reason to think it would be OK the 3rd time. It's kinda like how people have been known to do 4-5 firmware updates to "fix" a problem and it "still doesn't work". 😉
Yeah, history has shown that if one reboot doesn’t fix it then you’re
probably out of luck. But…it is up to the system owner to
configure whatever they like.
>
> If the ultimate goal is availability, then there's more nuance to the discussion to be had. Let's assume the goal is "highest availability possible".
>
> With that in mind, defining what "failure" is gets to be a bit more convoluted. Back when we did the CMM code for the Intel modular server, we had a several-pronged approach:
>
> 1) Run procmon - Look for any service that is supposed to be running (but isn't) and restart it and/or its process dependency tree.
> 2) Create a monitor (either a standalone program or a script) that periodically connects to the various services available - IPMI, web, KVM, etc.... - think of it like a functional "ping". A bit more involved, as this master control program (Tron reference 😉 ) would have to speak sentiently to each service to gauge how alive it is. There have been plenty of situations where a BMC is otherwise healthy but one service wasn't working, and it's overkill to have a 30-45 second outage while the BMC restarts.
This sounds like it fits in with https://github.com/openbmc/phosphor-health-monitor
That to me is the next level of process health and recovery, but initially here
I was just looking at the broad question of “what do we do if a service has been
restarted x number of times, is still in a failed state, and is critical to the basic
functionality of the BMC”. To me the only options are to try a reboot
of the BMC, or to log an error and indicate the BMC is in an unstable
state.
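
As a rough sketch of the systemd side of that (the unit name, limits, and
path below are purely illustrative placeholders, not a settled proposal),
a critical service could carry a drop-in along these lines:

  # Hypothetical drop-in for a critical service, e.g.
  # /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/10-critical.conf
  [Unit]
  # Allow a few restart attempts within a window before giving up.
  StartLimitIntervalSec=300
  StartLimitBurst=3
  # If the unit still ends up in the failed state, force a BMC reboot.
  FailureAction=reboot-force

  [Service]
  Restart=on-failure
  RestartSec=5

Something separate would still need to count those forced reboots and stop
the cycle once the configured BMC reboot limit is hit.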
>
> -----Original Message-----
> From: openbmc <openbmc-bounces+neil_bradley=phoenix.com at lists.ozlabs.org> On Behalf Of Andrew Geissler
> Sent: Monday, October 19, 2020 12:53 PM
> To: OpenBMC Maillist <openbmc at lists.ozlabs.org>
> Subject: Critical BMC process failure recovery
>
> Greetings,
>
> I've started initial investigation into two IBM requirements:
>
> - Reboot the BMC if a "critical" process fails and cannot recover
> - Limit the number of times the BMC reboots for recovery
>   - The limit should be configurable, e.g. 3 resets within 5 minutes
>   - If the limit is reached, display an error on the panel (if one is
>     available) and halt the BMC.
>
> The goal here is to have the BMC try to get itself back into a working state by rebooting itself.
>
> This same reboot logic and limits would also apply to kernel panics and/or BMC hardware watchdog expirations.
>
> Some thoughts that have been thrown around internally:
>
> - Spend more time ensuring code doesn't fail vs. handling failures
> - Put all BMC code into a single application so it's all or nothing (vs.
>   trying to pick and choose specific applications and dealing with all of
>   the intricacies of restarting individual ones)
> - Rebooting the BMC and getting the proper ordering of service starts is
>   sometimes easier than testing every individual service restart for recovery
>   paths
>
> "Critical" processes would be things like mapper or dbus-broker. There's definitely a grey area though with other services so we'd need some guidelines around defining them and allow the meta layers to have a way to deem whichever they want critical.
>
> So anyway, just throwing this out there to see if anyone has any input or is looking for something similar.
>
> At a high level, I'd probably start by utilizing systemd as much as possible: "FailureAction=reboot-force" in the critical services, and something that monitors for these types of reboots and enforces the reboot limits.
>
> Andrew
>
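
For the "something that monitors for these types of reboots and enforces the
reboot limits" piece above, one hypothetical shape (neither the unit name nor
the helper script below exists today; this is only a sketch of the idea) would
be a oneshot service that runs early in boot, bumps a reboot counter kept in
persistent storage, and halts further recovery reboots once the limit is
exceeded:

  # Hypothetical unit, e.g. /lib/systemd/system/bmc-reboot-limit.service
  [Unit]
  Description=Enforce the BMC recovery reboot limit
  Before=multi-user.target

  [Service]
  Type=oneshot
  RemainAfterExit=yes
  # The (hypothetical) helper would increment a counter in persistent storage,
  # clear it after a period of stability, and once the limit is exceeded log an
  # error, update the panel if one is present, and disable further recovery
  # reboots instead of letting the BMC loop.
  ExecStart=/usr/bin/bmc-reboot-limit-check

  [Install]
  WantedBy=multi-user.target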