Redundant BMCs
Andrew Geissler
geissonator at gmail.com
Thu Dec 14 07:43:37 AEDT 2023
Greetings,
We at IBM are looking at implementing a server with redundant BMCs. The idea of
redundant BMCs is that if one fails (software or hardware related), the other
BMC takes over and there is no impact to the owner of the server (enterprise,
high availability market). One BMC is the "Active" BMC and the other is the
"Passive".
At a high level, you have 2 or more chassis in a single server. 2 of those
chassis have BMCs running OpenBMC. The BMCs negotiate on startup which one will
be the Active BMC and which one will be the Passive. Both BMCs have full access
to the server hardware (fans, power supplies, VPD chips, ...) but only one can
access the hardware at a time (via a hardware mux).
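
To make the startup negotiation concrete, here is a minimal Python sketch of one
possible tie-break rule. Nothing here is decided: the BmcState fields (position,
healthy, was_active) and the ordering of the rules are assumptions for
illustration only. The one property it tries to show is that both BMCs can run
the same function and reach opposite answers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Role(Enum):
    ACTIVE = "Active"
    PASSIVE = "Passive"


@dataclass(frozen=True)
class BmcState:
    position: int     # hypothetical: unique chassis position of this BMC
    healthy: bool     # hypothetical: result of a local self-check
    was_active: bool  # hypothetical: role remembered from the previous boot


def _rank(state: BmcState) -> Tuple[bool, bool, int]:
    # Lower tuple wins: healthy first, then the previous Active,
    # then the lowest position as a final deterministic tie-break.
    return (not state.healthy, not state.was_active, state.position)


def negotiate(local: BmcState, peer: Optional[BmcState]) -> Role:
    """Decide the local role; both sides run this with swapped arguments."""
    if peer is None:
        # Assumption: an unreachable peer means we take the Active role.
        # A real design would need a guard against split-brain here.
        return Role.ACTIVE
    return Role.ACTIVE if _rank(local) < _rank(peer) else Role.PASSIVE
```

Because positions are unique, the two ranks never compare equal, so exactly one
side ends up Active when both BMCs can see each other.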
The Passive BMC will be running a subset of OpenBMC services. Because it needs
to support firmware update and other basic features, it will have bmcweb
running. Other services, like fan or power control, would not be running on the
Passive.
The Active BMC will utilize bmcweb aggregation to provide basic information
about the Passive BMC. Server management can only occur via the Active BMC.
As the user changes settings (BIOS, certificates, system policy, ...) via the
Active BMC, we need to ensure we replicate these settings over to the Passive.
We've done a bit of initial exploration into using corosync/pacemaker. It has
some potential but also feels a bit heavy for what we need. The thought is that
a role change, where the Passive BMC becomes the Active BMC and the Active
becomes the Passive, is mostly driven by our external software managers. There's
potential for some cases where the BMCs themselves drive the role changes, but
most of our use cases are situations where something in the BMC hardware (or its
connections to the server) has failed and the BIOS firmware or Redfish
management client directs the Passive BMC to become the Active.
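
As a rough illustration of an externally directed role change, the Python sketch
below models the swap as a single function. The Bmc and Server types, the
mux_owner field, and the ordering (demote, move the mux, promote) are all
hypothetical; the only point is that there is never a moment with two Actives.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Role(Enum):
    ACTIVE = "Active"
    PASSIVE = "Passive"


@dataclass
class Bmc:
    name: str
    role: Role
    responsive: bool = True  # hypothetical: can this BMC still be reached?


@dataclass
class Server:
    bmcs: Dict[str, Bmc]
    mux_owner: str  # hypothetical: name of the BMC the hardware mux points at


def direct_role_change(server: Server, new_active: str) -> None:
    """Handle a 'become Active' directive from an external manager."""
    target = server.bmcs[new_active]
    if target.role is Role.ACTIVE:
        return  # already Active, nothing to do
    if not target.responsive:
        raise RuntimeError(f"{new_active} cannot take over: not responsive")
    # Demote first so that two BMCs are never Active at the same time.
    for bmc in server.bmcs.values():
        if bmc.role is Role.ACTIVE:
            bmc.role = Role.PASSIVE
    server.mux_owner = new_active  # hand the hardware over
    target.role = Role.ACTIVE
```

In this model the old Active is simply recorded as Passive even if it has
failed; whether a failed BMC needs a separate state is an open question.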
A roll-our-own data synchronization daemon (utilizing rsync) to monitor for file
changes, with some basic rules on when to sync (immediately, or at sync points),
doesn't seem all that bad, but there are probably a lot of unknown pitfalls that
something like corosync/pacemaker already handles.
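
For discussion, here is a minimal Python sketch of the rule side of such a
daemon. It is not a design: the paths in RULES are placeholders, the peer name
is made up, and the file-change monitoring itself (inotify or similar) is left
out, so something else is assumed to call notify_changed(). It only shows the
split between "immediate" and "sync point" replication on top of rsync.

```python
import subprocess
from enum import Enum
from typing import Callable, Dict, List, Optional, Set


class Policy(Enum):
    IMMEDIATE = "immediate"    # replicate as soon as the change is seen
    SYNC_POINT = "sync_point"  # batch until the next sync point


# Hypothetical rules: path prefix -> policy. These paths are placeholders.
RULES: Dict[str, Policy] = {
    "/var/lib/example/certs/": Policy.IMMEDIATE,
    "/var/lib/example/bios/": Policy.SYNC_POINT,
}


def policy_for(path: str, rules: Dict[str, Policy] = RULES) -> Optional[Policy]:
    """Return the policy of the longest matching prefix, or None if untracked."""
    matches = [prefix for prefix in rules if path.startswith(prefix)]
    return rules[max(matches, key=len)] if matches else None


def run_rsync(cmd: List[str]) -> None:
    subprocess.run(cmd, check=True)  # raises if rsync exits non-zero


class Replicator:
    def __init__(self, peer: str,
                 runner: Callable[[List[str]], object] = run_rsync):
        self.peer = peer      # hypothetical host name of the Passive BMC
        self.runner = runner  # injectable so the logic can be tested
        self.pending: Set[str] = set()

    def _push(self, paths: List[str]) -> None:
        # -a preserves permissions and times; --relative keeps the full
        # source path on the receiver, so the destination is the peer's root.
        self.runner(["rsync", "-a", "--relative", *paths, f"{self.peer}:/"])

    def notify_changed(self, path: str) -> None:
        policy = policy_for(path)
        if policy is Policy.IMMEDIATE:
            self._push([path])
        elif policy is Policy.SYNC_POINT:
            self.pending.add(path)
        # Untracked paths are ignored.

    def sync_point(self) -> None:
        if self.pending:
            self._push(sorted(self.pending))
            self.pending.clear()  # only reached if the push did not raise
```

Things this sketch does not attempt, and where corosync/pacemaker may already
have answers: conflict handling after a role change, partial transfers, and
knowing whether the Passive actually applied what it received.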
Just throwing this out there in case anyone is also working on this or has any
opinions on direction here.
Thanks,
Andrew