OpenBMC on RCS platforms

Ed Tanous ed at tanous.net
Sat Apr 24 03:23:23 AEST 2021


On Fri, Apr 23, 2021 at 7:36 AM Timothy Pearson
<tpearson at raptorengineering.com> wrote:
>

First off, this is great feedback, and despite some of my comments
below, I do really appreciate you putting it out there.

> All,
>
> I'm reaching out after some internal discussion on how we can better integrate our platforms with the OpenBMC project.  As many of you may know, we have been using OpenBMC in our lineup of OpenPOWER-based server and desktop products, with a number of custom patches on top to better serve our target markets.
>
> While we have had fairly good success with OpenBMC in the server / datacenter space, reception has been lukewarm at best in the desktop space.  This is not too surprising, given OpenBMC's historical focus on datacenter applications, but it is also becoming an expensive technical and PR pain point for us as the years go by.  To make matters worse, we're still shielding our desktop / workstation customer base to some degree from certain design decisions that persist in upstream OpenBMC, and we'd like to open discussion on all of these topics to see if a resolution can be found with minimal wasted effort from all sides.
>
> Roughly speaking, we see issues in OpenBMC in 5 main areas:
>
>
> == Fan control ==
>
> Out of all of the various pain points we've dealt with over the years, this has proven the most costly and is responsible on its own for the lack of RCS platforms upstream in OpenBMC.
>
> To be perfectly frank, OpenBMC's current fan control subsystem is a technical embarrassment, and not up to the high quality seen elsewhere in the project.

Which fan control subsystem are you referring to?  Phosphor-fans or
phosphor-pid-control?

>  Worse, this multi-daemon DBUS-interconnected Rube Goldberg contraption has somehow managed to persist over the past 4+ years, likely because it reached a complexity level where it is both tightly integrated with the rest of the OpenBMC system and extremely difficult to understand, and is therefore equally difficult to replace.  Furthering the lack of progress is the fact that it is mostly "working" for datacenter applications, so there may be a "don't touch what isn't broken" mentality in play.

I'm not really sure I agree with that.  If someone came forward with
a design for "we should replace dbus with X", had good technical
foundations for why X was better, and was willing to put in the
monumental effort to do the work, I know that I personally wouldn't
be opposed.  For the record, I agree with you about the complexity
here, but most of the ideas I've heard for making it better amount to
"throw everything out and start over."  If that's what you want to
do, by all means do so, but I don't think the community is willing to
redo the untold hours of engineering effort spent over the years the
project has existed.

FWIW, u-bmc was a project that took the existing kernel, threw out all
the userspace and started over.  From my view outside the project,
they seem to have failed to gain traction, and only support a couple
of platforms.

>  From a technical perspective, it is indirected to a sufficient level as to be nearly incomprehensible to most people, with the source spread across multiple different projects and repositories, yet somehow it remains rigid / fragile enough to not support basic features like runtime (or even post-compile) fan configuration for a given server.

With respect, this statement is incorrect.  On an entity-manager
enabled system running phosphor-pid-control, all of the fan control
parameters are fully modifiable at runtime, either from within the
system (through dbus) or out of band through the Redfish OEMManager
API.  If you haven't ported your systems to entity-manager yet, quite
a few people are doing exactly that at the moment and discussing it
on Discord basically every day; I'm sure they would be able to give
you some direction on where to start moving your systems over.
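
To make the dbus side of that concrete, here is a rough sketch of
what changing a single PID coefficient at runtime can look like with
sdbusplus.  The service, object path, interface, and property names
below are illustrative placeholders rather than something copied from
a real platform; use busctl to discover the actual configuration
objects entity-manager exposes on your system, and note that,
depending on the version, the daemon may need its configuration
reloaded before it picks the change up.

    // Illustrative sketch only: set one PID coefficient on an
    // entity-manager style configuration object over dbus.  The names
    // below are placeholders; find the real ones with busctl.
    #include <sdbusplus/bus.hpp>

    #include <variant>

    int main()
    {
        constexpr auto service = "xyz.openbmc_project.EntityManager";
        constexpr auto path = "/xyz/openbmc_project/inventory/system/"
                              "board/Example_Board/Example_Fan_Pid";
        constexpr auto intf = "xyz.openbmc_project.Configuration.Pid";

        auto bus = sdbusplus::bus::new_default();

        // Plain org.freedesktop.DBus.Properties.Set call.
        auto method = bus.new_method_call(
            service, path, "org.freedesktop.DBus.Properties", "Set");
        method.append(intf, "PCoefficient", std::variant<double>(-0.05));
        bus.call(method);

        return 0;
    }

The Redfish OEMManager interface exposes the same knobs out of band,
which is what a GUI would end up talking to.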

>
> What we need is a much simpler, more robust fan control daemon.  Ideally this would be one self-contained process, not multiple interconnected processes where a single failure causes the entire system to go into safe mode.

In phosphor-pid-control, the failure modes are configurable per zone,
and include things like N failures before failsafe, or an adjusted
fan floor on failsafe.  If what's there doesn't meet your needs, I'm
sure we can discuss adding something else (I know there's at least
one feature in review in this area that you might check out on
gerrit).

>
> Our requirements:
> 1.) True PID control with tunable constants.  Trying to do things with PWM/temp maps alone may have made sense in the resource-constrained environments common in the 1970s, but it makes no sense on modern, powerful BMC silicon with hard floating point instructions.  Even the stock fan daemon implements a sort of bespoke integrator-by-another-name, but without the P and D components it does a terrible job outside of a constant-temperature datacenter environment.

phosphor-pid-control implements PI-based fan control.  If you really
wanted to add D, it would be an easy addition, but in practice most
server control loops have enough noise, and a low enough loop
bandwidth, that a D component isn't useful, so it was omitted from
the initial version.
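
For reference, the core of a PI loop is tiny; the sketch below is a
simplified illustration of the concept, not the phosphor-pid-control
source (which, as I recall, also layers things like feed-forward and
slew limiting on top).

    // Simplified PI step, for illustration only.
    struct PiController
    {
        double kp;             // proportional gain
        double ki;             // integral gain
        double outMin;         // e.g. minimum PWM percent
        double outMax;         // e.g. maximum PWM percent
        double integral = 0.0;

        double step(double setpoint, double measured, double dt)
        {
            const double error = setpoint - measured;
            integral += error * dt;

            double out = kp * error + ki * integral;

            // Clamp the output and back the integral off when
            // saturated so it doesn't wind up while the fans are
            // already pegged at a limit.
            if (out > outMax)
            {
                out = outMax;
                integral -= error * dt;
            }
            else if (out < outMin)
            {
                out = outMin;
                integral -= error * dt;
            }
            return out;
        }
    };

A D term would be one more line on the derivative of the error, but
filtering it well enough to survive sensor noise is usually more work
than the benefit justifies.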

> 2.) Tunable PID constants, tunable temperature thresholds, tunable min/max fan speeds, and arbitrary linkage between temp inputs (zones) and fan outputs (also zoned).

All of this exists in phosphor-pid-control.  Example:
https://github.com/openbmc/entity-manager/blob/a5a716dadfbf97b601577276cc699af8f662beeb/configurations/WFT%20Baseboard.json#L1100

> 3.) Configurable zones -- both temperature and PWMs, as well as installed / not installed fans and temperature sensors.

Also exists in phosphor-pid-control.  Example:
https://github.com/openbmc/entity-manager/blob/ec98491a00c5dcffae6be362e483380c807f234c/configurations/R2000%20Chassis.json#L411

> 4.) Configurable failure behavior.  A single failed or uninstalled chassis fan should NOT cause the entire platform to go into failsafe mode!

Also exists in phosphor-pid-control.  Example of allowing single rotor
failures to not cause the system to hit failsafe:
https://github.com/openbmc/entity-manager/blob/ec98491a00c5dcffae6be362e483380c807f234c/configurations/R1000%20Chassis.json#L303
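
Conceptually, that policy boils down to counting failed inputs and
comparing against a per-zone allowance.  A rough sketch of the idea
(illustrative only, not the actual implementation; on a real system
the behavior comes from the JSON configuration linked above):

    // Illustrative sketch of a per-zone failure policy: only fall
    // back to failsafe once more than allowedFailures inputs have
    // gone bad.
    #include <cstddef>
    #include <vector>

    struct ZoneFailurePolicy
    {
        std::size_t allowedFailures; // e.g. 1 tolerates a single dead rotor
        double failsafePercent;      // PWM percent to drive in failsafe

        bool inFailsafe(const std::vector<bool>& inputFailed) const
        {
            std::size_t failed = 0;
            for (bool f : inputFailed)
            {
                if (f)
                {
                    ++failed;
                }
            }
            return failed > allowedFailures;
        }
    };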

> 5.) A decent GUI to configure all of this, and the ability to export / import the settings.

Doesn't exist, but considering we already have the Redfish API for
this, it should be relatively easy to build within webui-vue.  With
that said, I've had this on my "great idea for an intern project"
list for some time now.  If you have engineers to spare (or you're
interested in implementing this yourself), feel free to hop on
Discord and I can help get you ramped up on getting this started and
on how those interfaces work.

>
> To be fair, we've only been able to implement 1, 2, 3, and 4 above at compile time -- we don't have the runtime configuration options due to the way the fan systems work in OpenBMC right now, and the sheer amount of work needed to overhaul the GUI in the out-of-tree situation we remain stuck in.  With all that said, however, we point out that our competition, especially on x86 platforms, has all of these features and more, all neatly contained in a nice user-friendly point+click GUI.  OpenBMC should be able to easily match or exceed that functionality, but for some reason it seems stuck in datacenter-only mode with archaic hardcoded tables and constants.
>
> == Local firmware updates ==
>
> This is right behind fan control in terms of cost and PR damage for us vs. competing platforms.  While OpenBMC's firmware update support is very well tuned for datacenter operations (we use a simple SSH + pflash method on our large clusters, for example), it's absolutely terrible for desktop and workstation applications where a second PC is not guaranteed to be available, and where wired Ethernet even exists, DHCP is either non-existent or provided by a consumer cable box.  Some method of flashing -- and recovering -- the BMC and host firmware right from the local machine is badly needed, especially for the WiFi-only environments we're starting to see more of in the wild.  Ideally this would be a command line tool / library such that we can integrate it with our bootloader or a GUI as desired.

You might check Intel's OpenBMC fork; I believe they had u-boot
patches to do this that you might consider upstreaming, or working
with them to upstream them.

>
> == BMC boot time ==
>
> This is self explanatory.  Other vendors' solutions allow the host to be powered on within seconds of power application from the wall, and even our own Kestrel soft BMC allows the host to begin booting less than 10 seconds after power is applied.  Several *minutes* for OpenBMC to reach a point where it can even start to boot the host is a major issue outside of datacenter applications.

While this is great information to have, it's a little disingenuous,
given that we've significantly reduced the boot time in the last few
years with things like dropping Python and porting the mapper to a
compiled language.  We can always do better, but unless you have
concrete ideas on how we can continue reducing this, there's very
little OpenBMC can do.

>
> == Host boot status indications ==
>
> Any ODM that makes server products has had to deal with the psychological "dead server effect", where lack of visible progress during boot causes spurious callouts / RMAs.  It's even worse on desktop, especially if server-type hardware is used inside the machine.  We've worked around this a few times with our "IPL observer" services, and really do need this functionality in OpenBMC.  The current version we have is both front panel lights and a progress bar on the BMC boot monitor (VGA/HDMI), and this is something we're willing to contribute upstream.

For some reason I thought we already had code to allow the BMC to post
a splash screen ahead of processor boot, but I'm not recalling what it
was called, as I've never had this requirement myself.

>
> == IPMI / BMC permissions ==
>
> An item that's come up recently is that, at least on our older OpenBMC versions, there's a complete disconnect between the BMC's shell user database and the IPMI user database.  Resetting the BMC root password isn't possible from IPMI on the host, and setting up IPMI doesn't seem possible from the BMC shell.  If IPMI support is something OpenBMC provides alongside Redfish, it needs to be better integrated -- we're dealing with multiple locked-out BMC issues at the moment at various customer sites, and the recovery method is painful at best when it should be as simple as an ipmitool command from the host terminal.

I thought this was fixed long ago.  User passwords and user accounts
are common between redfish, ipmi, and ssh.  Do you think you could try
a more recent build and see if this is still an issue for you?

>
>
> If there is interest, I'd suggest we all work on getting some semblance of a modern fan control system and the boot status indication framework into upstream OpenBMC.  This would allow Raptor to start upstreaming base support for RCS product lines without risking severe regressions in user pain points like noisy fans -- perceived high noise levels are always a great way to kill sales of office products, and as a result the fan control functionality is something we're quite sensitive about.  The main problem is that with the existing fan control system's tentacles snaking everywhere including the UI, this will need to be a concerted effort by multiple organizations including the maintainers of the UI and the other ODMs currently using the existing fan control functionality.  We're willing to make yet another attempt *if* there's enough buy-in from the various stakeholders to ensure a prompt merge and update of the other components.

I'd really prefer you look at what already exists.  I think most of
your concerns are covered in phosphor-pid-control today, and if they
aren't, I suspect we can add new parts to the control loop where
needed.

>
> Finally, some of you may also be aware of our Kestrel project [1], which eschews the typical BMC ASICs, Linux, and OpenBMC itself.  I'd like to point out that this is not a direct competitor to OpenBMC, it is designed specifically for certain target applications with unique requirements surrounding overall size, functionality, speed, auditability, transparency, etc.  Why we have gone to those lengths will become apparent later this year, but suffice it to say we're considering Kestrel to be used in spaces where OpenBMC is not practical and vice versa.  In fact, we'd like to see OpenBMC run on the Kestrel SoCs (or a derivative thereof) at some point in the future, once the performance concerns above are sufficiently mitigated to make that practical.
>
> [1] https://gitlab.raptorengineering.com/kestrel-collaboration/kestrel-litex/litex-boards/-/blob/master/README.md

