Preventing a system power on before BMC Ready

Wed May 3 10:48:35 AEST 2023

On Tue, May 2, 2023 at 1:49 PM Andrew Geissler <geissonator at gmail.com> wrote:
>
> About once a month a bug arrives internally where someone has powered on the
> host without waiting for the BMC to reach its Ready state. Our systems for a
> variety of reasons require the BMC to be at Ready before initiating a system
> power on.
>
> The defects are usually returned as user error in that users are supposed to
> know to wait. Our Redfish clients (including the web UI) know to not allow a
> power on operation until Ready. Recently however we had a bug where our external
> Redfish client allowed a power on before Ready. That client is event driven once
> connected to the BMC and because they never got an event about an unexpected BMC
> reboot, they allowed a power on before Ready when the BMC came back up. Granted
> there is only about a 30s window where we have a problem here, but as we all
> know, when there's a window, someone finds it.
>
> That got us brainstorming about some possible solutions:
> - Write some code in bmcweb to send a “bmc state change event” anytime bmcweb
>   comes up to ensure listening clients know “something” has happened
> - Add an optional compile option to bmcweb (or PSM/x86-power-control) to require
>   BMC Ready before issuing chassis or system POST requests (return error if not
>   at Ready)

PSM or x86-power-control mods would be my preference. bmcweb should not be
in charge of business logic. If the system shouldn't allow power on until
the BMC is in the Ready state, then the daemons that handle power on need to
have that as a constraint; otherwise you'd have the same problem if a user
tried from IPMI.
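
To make that concrete, here is a rough sketch of the check living in the
power-control layer, so Redfish and IPMI callers both hit it. It is plain
Python rather than real PSM or x86-power-control code, and every name in it
(BmcState, PowerControl, request_power_on, PowerOnRejected) is made up for
illustration.

```python
from enum import Enum


class BmcState(Enum):
    # Hypothetical states; the real daemons expose their own state values.
    NOT_READY = "NotReady"
    READY = "Ready"


class PowerOnRejected(Exception):
    """Raised when a power on is requested before the BMC is at Ready."""


class PowerControl:
    """Stand-in for the daemon that owns host power transitions."""

    def __init__(self, get_bmc_state):
        # get_bmc_state is a callable returning the current BmcState.
        self._get_bmc_state = get_bmc_state
        self.host_on = False

    def request_power_on(self):
        # The constraint sits here, below every front end, so it holds
        # no matter which interface the request came in on.
        if self._get_bmc_state() is not BmcState.READY:
            raise PowerOnRejected("BMC is not at Ready; power on refused")
        self.host_on = True
```

Each front end would then only translate PowerOnRejected into its own error
format instead of re-implementing the check.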

> - Queue up the power on request and execute it once we reach BMC Ready (not sure
>   what type of response that would be to Redfish clients or what error path
>   looks like if we never reach Ready?)

Redfish has async tasks for this exact use case, and we already have code
to do them. Alternatively, you could just return an error saying that the
operation is not possible, along with a Retry-After header telling the
user when to retry their request. We already do this in a few of the
update APIs.
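
A minimal sketch of the error-plus-Retry-After option, again in plain Python
and not bmcweb code. The function name, the 503 status choice, the 204 on
success, and the body layout are my assumptions; the 30 second default just
mirrors the window mentioned above.

```python
import json


def power_on_response(bmc_ready, retry_after_s=30):
    """Return (status, headers, body) for a power-on request.

    Hypothetical helper: a real handler would forward the request to the
    power-control daemon when the BMC is Ready.
    """
    if bmc_ready:
        # Request accepted; nothing to report back in this sketch.
        return 204, {}, ""
    # Not Ready yet: refuse and tell the client when to try again.
    headers = {"Retry-After": str(retry_after_s)}
    body = json.dumps({
        "error": {
            "message": "BMC is not at Ready. Retry in "
                       f"{retry_after_s} seconds."
        }
    })
    return 503, headers, body
```

A client that honors Retry-After would sleep for that many seconds and then
resend the same POST.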

> - Find a way in the client to better detect an unexpected bmc reboot (heartbeat
>   of some sort)
> - Push bmcweb further in the startup to BMC Ready, ensuring clients can't talk
>   to the BMC until it's near Ready state

For your use case, if it is possible, that last option is probably the
easiest and most client friendly, so long as you can handle the case where
the BMC never reaches Ready.

>
> Thoughts?
> Andrew
-- 
-Ed