OpenBMC on RCS platforms

Tue Apr 27 07:42:16 AEST 2021

Timothy Pearson <tpearson at raptorengineering.com> wrote:
>----- Original Message -----
>> From: "Patrick Williams" <patrick at stwcx.xyz>
>> To: "Timothy Pearson" <tpearson at raptorengineering.com>
>> Cc: "openbmc" <openbmc at lists.ozlabs.org>
>> Sent: Friday, April 23, 2021 12:11:26 PM
>> Subject: Re: OpenBMC on RCS platforms
>
>> On Fri, Apr 23, 2021 at 09:30:00AM -0500, Timothy Pearson wrote:
>>> All,
>>> 
>>> I'm reaching out after some internal discussion on how we can
>>> better integrate
>>> our platforms with the OpenBMC project.  As many of you may know,
>>> we have been
>>> using OpenBMC in our lineup of OpenPOWER-based server and desktop
>>> products,
>>> with a number of custom patches on top to better serve our target
>>> markets.
>> 
>> Hi Timothy,
>> 
>> Good to hear from your team again and hope there is some ways we
>> can
>> work together on solving some of these issues.
>> 
>>> Roughly speaking, we see issues in OpenBMC in 5 main areas:
>> 
>> We might want to fork this into 5 different discussion threads
>> and/or
>> design documents, but let's see how this goes...
>> 

[ some issues trimmed, including fan ]

>>> == Local firmware updates ==
>>> 
>>> This is right behind fan control in terms of cost and PR damage
>>> for us vs.
>>> competing platforms.  While OpenBMC's firmware update support is
>>> very well
>>> tuned for datacenter operations (we use a simple SSH + pflash
>>> method on our
>>> large clusters, for example) it's absolutely terrible for desktop
>>> and
>>> workstation applications where a second PC is not guaranteed to be
>>> available,
>>> and where wired Ethernet even exists DHCP is either non-existent
>>> or provided by
>>> a consumer cable box.  Some method of flashing -- and recovering
>>> -- the BMC and
>>> host firmware right from the local machine is badly needed,
>>> especially for the
>>> WiFi-only environments we're starting to see more of in the wild.
>>> Ideally this
>>> would be a command line tool / library such that we can integrate
>>> it with our
>>> bootloader or a GUI as desired.
>> 
>> This sounds to me pretty easily obtainable and what I have in mind
>> is
>> actually a valid data center use case for many of us.  When all
>> else
>> fails, you should be able to use a USB key to update the system
>> (assuming the image you're updating with is trusted for whatever
>> your
>> system determines is trust-worthy).  I'm pretty sure our OCP
>> systems can
>> be updated with a magic combination of a USB-key and an OCP debug
>> card(*).  I don't think that is currently implemented on
>> openbmc/openbmc,
>> but it is on our list of pending features.
>> 
>> For your specific users, the OCP debug card is probably not a good
>> requirement, but you could likely automate the update whenever a
>> USB-key
>> plus text file is added?  (I'm just brainstorming how you'd know to
>> kick
>> it off).  The current software update code probably isn't too far
>> off
>> from being able to facilitate this for you.
>> 
>> https://www.opencompute.org/documents/facebook-ocp-debug-card-with-lcd-spec_v1p0
>
>At first glance, that's another overly complex solution for a simple
>problem that would cause a degraded user experience vs. other
>platforms.
>

I have to agree: it is both overly complex and probably not useful,
in that it's just a port interface for control.

>We have an 800Mhz Linux-based computer with 512MB of RAM, serial and
>video out support already integrated into every one of our products.
>It can receive data via PCIe and via USB from an active host.  Why
>isn't there a mechanism to send a signed container to it over one of
>these existing channels for self-update?
>
>A potential user story looks like this:
>
>=====
>
>I want to update the firmware on my Blackbird desktop to fix a
>problem I'm having with a new control widget I've plugged in.  To
>make things more interesting, I'm on an oil rig in the Gulf, and the
>desktop only connects via intermittent WiFi.  Spare parts are weeks
>away, and I have next to no electronic diagnostic equipment available
>to me.  There's one or two USB ports I can normally use because I
>have administrative privileges, but I was able to grab the upgrade
>file over WiFi instead, saving myself some time cleaning accumulated
>gunk out of the ports.
>
>I can update my <large vendor> standard PC firmware just by running a
>tool on Windows, but the Blackbird was selected because it controls a
>critical process that needed to be malware-resistant.
>
>Fortunately, OpenBMC implemented a quality firmware update process.
>I just need to launch a GUI tool with host administrative privileges,
>select the upgrade file, and queue an upgrade to happen when I reboot
>the machine.  I queue the update, start the reboot, and stick around
>to see the upgrade progress on the screen while it's booting back up.
> Because I can see the status on the screen, I know what is happening
>and don't pull the power plug due to only seeing a black screen and
>power LED for 10 minutes.  Finally, the machine loads the OS and I
>verify the new control widget is working properly.
>
>=====
>
>Is there a technical / architectural reason this can't be done, or
>some other reason it's a bad idea?
>

I ended up writing this twice or thrice.  Also, what I call
phosphor-initfs is actually the package obmc-phosphor-initfs.bb
found in meta-phosphor/recipes-phosphor/initrdscripts/.

There are two issues.  One is that there is no graphics
library or console code for the aspeed bmc.  (I understand a 
text rendering library was added for boot monitoring.)  But 
if you are starting from the host up, then use the host to 
drive the GUI and just establish a command session (network, 
USB to host, or serial).  

The biggest limitation is that we use squashfs for the file 
system for space efficiency.  This is a read-only filesystem 
that contains references between different pieces and is loaded
and decompressed by the kernel on demand.  That means you cannot
be running on the copy in flash while trying to update
that same copy in the flash.

If you have space for two copies then you can update the
second copy while the primary is online.  This is supported
in the UBI and eMMC layouts upstream.

If you only have flash space for one copy then you have to
arrange for something more limited.  Either way you are 
subject to bricking on an interrupted flash unless you do
something exotic like repurposing the host chip as a backup
BMC during the process.  But if it's just the feedback you
need, then the upstream code has help that isn't in the 
Redfish flow.

====
Once

The "static" mtd layout with phosphor-initfs has support 
for both loading the static flash content into RAM, allowing 
the update to occur with full services running, and, as a 
backup on shutdown, applying the update on bmc reboot 
by switching back to the initramfs and performing the flash 
from there.  The status of the latter update is only visible 
on the console, which might be hidden on an internal serial 
cable by default.

Unfortunately the "prepare for update" method that was in 
the original update instructions and tells the BMC init 
"hey, load all this content into ram, so that you can write 
over the flash" got lost in the "we must be limited to what 
Redfish can support".  The code is still in the low-level 
scripts but the fancy REST API is missing.  Also, with the 
addition of code verification, the actual flash progress 
was hidden.

The phosphor-initfs scripts also allow a new filesystem 
image to be downloaded over the network if you wish to test.
This doesn't have signature checking code, and it can be
disabled by build options.

All of the options to phosphor-initfs can be set by u-boot 
environment variables (one of which is cleared by a systemd
unit each boot, one that is not) and by the kernel command 
line.

Note: I strongly suggest not using image-bmc (for the whole
flash) as this erases the entire flash (although we try to
write back the u-boot environment).  Instead use image-kernel, 
image-rofs, etc. to allow the prior rwfs and u-boot to persist.
Some bad assertions may have migrated into the code-update 
REST endpoints and we should accept patches.

Bottom Line:

Put the BMC in maintenance mode and you can update the image
while the stack is running.  You can then use ssh to 
display the flash progress.  If you need a fancy GUI and 
not the internal serial then use the host, or write the 
rest of the graphics stack.

If you need the reliable backout then you need space for 
a second image, even if it's smaller due to being emergency
services only.

PS:  There were some flashes we tried early that had 
horrible erase times -- over 20 minutes for a full
erase.  Check the specs for the parts you provide vs 
others in the market, the better ones erase in a few
minutes.

PPS:  The reason we added UBI was its feature to use
the whole flash for wear leveling (minus the bootloader
that is outside the UBI partition).

=======================================
Twice: Going back to the scenario again

>I just need to launch a GUI tool with host administrative privileges,
>select the upgrade file, and queue an upgrade to happen when I reboot
>the machine.  I queue the update, start the reboot, and stick around
>to see the upgrade progress on the screen while it's booting back up.
> Because I can see the status on the screen, I know what is happening
>and don't pull the power plug due to only seeing a black screen and
>power LED for 10 minutes.  Finally, the machine loads the OS and I
>verify the new control widget is working properly.

If the GUI is on the host, with today's stock phosphor-initfs, you need
1) a connection from the host to the bmc
   ethernet, serial, usb ethernet etc  
   (to copy files from host to BMC RAM and to monitor command output)

2) hardware ability to reboot the bmc with the host surviving
 - all userspace has to be replaced with those on the filesystem in RAM
 - can be shortened slightly by preloading the image in the BMC before
   shutting down services if the current kernel is compatible.  This can
   be the old or new image.

 - or -

 Boot the host for GUI support with the BMC in an optimized
 update mode.

  This can be before or after the file is downloaded to the
  host.

3) Once the bmc is running from a squashfs in RAM (and if you want
to clean the rwfs overlay, persist on clean reboot/shutdown mode),

- copy the image to the bmc 
- validate as required (preferably somewhere under /run)
- move image-rofs, kernel, etc. as needed to /run/initramfs
- run /run/initramfs/update 
    (which checks the fs is not obviously mounted,
     runs flashcp, which has status on stdout,
     moves files successfully written,
     and then writes selected overlay content back to rwfs)
- check the images were all written
- reboot
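
As a rough illustration of step 3, here is a minimal host-side sketch
that only plans the commands and does not run them.  Everything beyond
/run/initramfs, /run/initramfs/update and the image-* names is an
assumption made for the example: the function name, the scratch
directory, the use of ssh/scp as the transport, and the optional
caller-supplied validation command are all hypothetical.  It is not
the upstream tooling.

```python
from pathlib import PurePosixPath

INITRAMFS_DIR = "/run/initramfs"  # where the update script looks for images


def plan_update(bmc_host, image_paths,
                scratch_dir="/run/update-staging",  # hypothetical scratch dir under /run
                validate_argv=None):
    """Return the ordered argv lists for a per-partition update (dry run)."""
    if not image_paths:
        raise ValueError("no images given")
    names = [PurePosixPath(p).name for p in image_paths]
    if "image-bmc" in names:
        # The whole-flash image erases everything; refuse it in this sketch.
        raise ValueError("use image-kernel, image-rofs, etc. instead of image-bmc")

    steps = [
        ["ssh", bmc_host, "mkdir", "-p", scratch_dir],
        # copy the images to BMC RAM
        ["scp", *image_paths, f"{bmc_host}:{scratch_dir}/"],
    ]
    if validate_argv:
        # validation is site specific, so the caller supplies the command
        steps.append(["ssh", bmc_host, *validate_argv])
    for name in names:
        # move validated images to where the update script expects them
        steps.append(["ssh", bmc_host, "mv",
                      f"{scratch_dir}/{name}", f"{INITRAMFS_DIR}/{name}"])
    steps.append(["ssh", bmc_host, f"{INITRAMFS_DIR}/update"])   # flashcp status on stdout
    steps.append(["ssh", bmc_host, "ls", "-l", INITRAMFS_DIR])   # operator inspects the result
    steps.append(["ssh", bmc_host, "reboot"])
    return steps


if __name__ == "__main__":
    for argv in plan_update("root@bmc", ["/tmp/image-kernel", "/tmp/image-rofs"]):
        print(" ".join(argv))
```

Returning argv lists rather than running anything keeps the sketch
safe to try on a host; a GUI front end could show each step and stream
the output of the update command as its progress display.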

=================
Option Three:
This might be a better experience but needs some software work
to enable kexec on the 2500.   

Transfer the FS and kernel to the BMC RAM, and kexec the kernel
(note: the patches on the list are for the 2600; the 2500 needs
testing and maybe a bit of coding).  Optionally this can contain 
the virt pnor image too.  After the BMC boots from the system in 
RAM, boot the host from the vpnor image in RAM, then use the host 
to drive the GUI to acknowledge and initiate the flash as desired.

The hooks are in phosphor-initfs to flash the image after the 
host is up, and to boot with the image in RAM.  

As an alternative to kexec, if the new file system supports the 
old BMC kernel then the shutdown script can easily be edited to
restart the exec script with the images in /run.  Alternatively, 
if the new kernel supports the old user space then it can be 
flashed first, then on the next boot the prior case applies as
it is the updated kernel.  Note: I did this flow several times
in development but decided not to put code in the shutdown 
script because it's a script that is executed from /run/initramfs
and can easily be edited there when an alternative flow is required
(there are comments that show where to edit).

>>> == BMC boot time ==
>>> 
>>> This is self explanatory.  Other vendors' solutions allow the host
>>> to be powered
>>> on within seconds of power application from the wall, and even our
>>> own Kestrel
>>> soft BMC allows the host to begin booting less than 10 seconds
>>> after power is
>>> applied.  Several *minutes* for OpenBMC to reach a point where it
>>> can even
>>> start to boot the host is a major issue outside of datacenter
>>> applications.
>> 
>> Some of this is, to me, an artifact of the Power architecture and
>> not an
>> artifact of OpenBMC explicitly.  On x86 systems we have a little
>> code in
>> u-boot that wiggles a GPIO and gets the Host power sequence going
>> while
>> the BMC is booting up.  This overlaps quite a bit of the memory
>> testing
>> of the Host with the BMC boot time.  The "well-known proprietary
>> BMC"
>> also does this same trick.
>
>I think we're talking about two different well know proprietary BMCs,
>but that's not important for this discussion other than no, the one I
>have in mind doesn't resort to such tricks.  What it does do is start
>up its core services rapidly enough where this isn't a problem, and
>lets the rest of the BMC stack start up at its own pace later on.
> 
>> Power requires the BMC to be up in order to serve out the virtual
>> PNOR,
>> from my recollection.  It seems like this could be solved in other
>> ways,
>> such as a SPI-mux on a physical SPI-NOR so that the BMC can take
>> the NOR
>> at specific times during update but otherwise it is given to the
>> host
>> CPUs.  This is exactly what we do on x86 systems.
>
>Ouch.  So on x86 boxen you might actually have two "BMCs" -- the
>proprietary one inside the CPU that starts in seconds and provides
>base services like SPI Flash mapping to CPU address space, and the
>external OpenBMC one that can run in parallel without interfering
>with host start.  Adding a mux is then a hack needed on top, since
>you can't really communicate with the proprietary stack in the
>required manner.
>

I'd say their cpu doesn't require the bmc to boot; it also means
they trust their system to not melt without bmc monitoring.

>For systems like POWER that lack the proprietary internal "BMC", I
>guess there are a few ways we could address the problem:
>
>1.) Speed up OpenBMC load -- this sounds like it would end up being
>completely supported by one or two vendors alone, and subject to
>breakage from the other vendors that simply don't have any concerns
>around OpenBMC start time since their platforms aren't visibly
>affected by it.  It's also unlikely to come into the desired sub-10s
>range.
>
>2.) Split the BMC into "essential" and "nice to have" services, much
>like the other platforms.  Painful, as it now requires even more
>parts on the mainboard.
>
>3.) Keep the single BMC device, but split it into two software
>stacks, one that can load nearly instantly and start providing
>essential services, and another than can load more slowly.  This
>would effectively require two separate CPUs inside the BMC, which we
>actually do have in the AST2500.  I haven't done any digging though
>to see if the second CPU is powerful enough to implement the HIOMAP
>protocol at speed.
>
>> Having said all of that, there is certainly some performance
>> improvements that can be done, but nobody has taken up the torch on
>> it.
>> A big low-hanging fruit in my mind is the file system compression
>> being
>> xz or gzip is very computationally intensive.  I did some work,
>> with
>> Nick Terrell, to switch to zstd on our systems for both the kernel
>> initramfs and UBI and saw significant boot time improvements.  The
>> upstream enablement for this appears to have landed as of v5.9 so
>> we
>> could certainly start enabling it here now.
>> 
>>
>[link mangled by a URL-rewriting mail filter; the surviving fragment
>decodes to
>linux-kbuild/20200730190841.2071656-7-nickrterrell@gmail.com/]
>> 

In addition to compression options there are tradeoffs on how much is 
copied to ram vs how much is read from the flash possibly repeatedly.
If you add secure boot the time goes up.
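
To get a feel for the compression side of that tradeoff, here is a
small sketch that times decompression of the same payload with gzip
and xz using only the Python standard library.  The function name and
the payload are made up for the example, zstd is left out because
older Python standard libraries do not include it, and numbers from a
desktop CPU say little about an AST2500, so treat the output as 
illustrative only.

```python
import gzip
import lzma
import time


def time_decompress(payload, repeats=3):
    """Compress payload once per codec, then time the decompression."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    results = {}
    for name, codec in (("gzip", gzip), ("xz", lzma)):
        blob = codec.compress(payload)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            restored = codec.decompress(blob)
            best = min(best, time.perf_counter() - start)
        if restored != payload:
            raise RuntimeError(f"{name} round trip failed")
        results[name] = {"compressed_bytes": len(blob), "best_seconds": best}
    return results


if __name__ == "__main__":
    sample = bytes(range(256)) * 4096  # 1 MiB of highly compressible data
    for name, stats in time_decompress(sample).items():
        print(name, stats)
```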

>>> == Host boot status indications ==
>>> 
>>> Any ODM that makes server products has had to deal with the
>>> psychological "dead
>>> server effect", where lack of visible progress during boot causes
>>> spurious
>>> callouts / RMAs.  It's even worse on desktop, especially if
>>> server-type
>>> hardware is used inside the machine.  We've worked around this a
>>> few times with
>>> our "IPL observer" services, and really do need this functionality
>>> in OpenBMC.
>>> The current version we have is both front panel lights and a
>>> progress bar on
>>> the BMC boot monitor (VGA/HDMI), and this is something we're
>>> willing to
>>> contribute upstream.
>> 
>> Great!  Let's get that merged!
>
>Sounds good!  The files aren't too complex:
>
>[links mangled by a URL-rewriting mail filter; the surviving path
>fragments decode to
>.../git/blackbird-skeleton/tree/pyiplobserver and
>.../git/blackbird-skeleton/tree/pyiplledmonitor]
>
>Is the skeleton repository the best place for a merge request?

Hmm, as prototype code in Python, maybe.  I don't think many current
systems ship Python.  Also, upstream Yocto removed all support for 
Python 2.  

In addition I see a mix of "copy the data" and "transform the data"
in the same script, such as 

updateIPLLeds(self, initial_start, status_changed)

with 
            # Show major ISTEP on LED bank
            # On Talos we only have three LEDs plus a fourth indicator modification 
            # bit, but the major ISTEPs range from 2 to 21
            # Try to condense that down to something more readily displayable

[ After some thought, it's OK for this to be in the output code, as 
it's formatting the data for the display. ]
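
To make the "condense" comment concrete, here is one possible mapping,
kept separate from any monitoring logic.  It is not the code from the
script under discussion: the scaling rule and the function names are
assumptions for illustration.  Twenty major ISTEP values (2 to 21) do
not fit into the sixteen states of three LEDs plus a modifier bit, so 
any such mapping is lossy.

```python
MAJOR_ISTEP_MIN = 2
MAJOR_ISTEP_MAX = 21


def istep_bucket(major_istep):
    """Scale a major ISTEP (2..21) onto 0..15; out-of-range input is clamped."""
    clamped = min(max(major_istep, MAJOR_ISTEP_MIN), MAJOR_ISTEP_MAX)
    span = MAJOR_ISTEP_MAX - MAJOR_ISTEP_MIN
    return (clamped - MAJOR_ISTEP_MIN) * 15 // span


def istep_to_leds(major_istep):
    """Return ((led2, led1, led0), modifier) for display on the LED bank."""
    bucket = istep_bucket(major_istep)
    modifier = bool(bucket & 0b1000)           # fourth "modification" indicator
    leds = tuple(bool(bucket & (1 << bit)) for bit in (2, 1, 0))
    return leds, modifier


if __name__ == "__main__":
    for step in range(MAJOR_ISTEP_MIN, MAJOR_ISTEP_MAX + 1):
        print(step, istep_to_leds(step))
```

Because the function only formats a value it is handed, it can sit in
a display back-end without knowing where the ISTEP came from.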

The upstream post interface logs the post codes, and display is
a separate function.  The ipl_status_monitor seems to mix monitoring 
the port 80 snoops with other logic to determine the system state 
(e.g. is the host up?).

Also, both scripts extensively use popen to handle device communication
and some communication to other services (kill to post code).

>
>> I do think some others have support for a 7-seg display with the
>> postcodes going to it already.  I think this is along those same
>> lines.
>> It might just be another back-end for our existing post code daemon
>> to
>> replicate them to the VGA and/or blink morse code on an LED.
>
>OK, so this is what we ran into before.  Where is this support
>in-tree, and do we need to reimplement our system to match what
>already exists (by extension, extending the other vendor code since
>our observer is more detailed in terms of status etc.), or would we
>be allowed to provide a competing solution to this other support,
>letting ODMs pick which one they wanted?
>

Our upstream code is at https://github.com/openbmc/phosphor-host-postd
for the snoop readers and the LED segment drivers, and the history 
and D-Bus owner is https://github.com/openbmc/phosphor-post-code-manager.

To catalog the source of the host and bmc there is
https://github.com/openbmc/phosphor-state-manager/blob/master/obmcutil

In addition to phosphor-misc for "one file projects" there is 
openbmc-tools for handy tools which may be more developer focused.

>>> == IPMI / BMC permissions ==
>>> 
>>> An item that's come up recently is that, at least on our older
>>> OpenBMC versions,
>>> there's a complete disconnect between the BMC's shell user
>>> database and the
>>> IPMI user database.  

Mostly true, in part because the IPMI password for RMCP+ must be
stored on the BMC (reversibly encrypted in our implementation).
Note that improper storage of this was the subject of one or more CVEs.

In addition it has a limit of 20 characters in a password and 8
users.

>>> Resetting the BMC root password isn't possible from IPMI
>>> on the host, and setting up IPMI doesn't seem possible from the
>>> BMC shell.  If

In our current code we have pam hooks that save the password 
during a change if the user is in the ipmi group and the 
password is short enough (otherwise an error is returned).
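
The decision that hook makes can be sketched as follows.  This is not
the pam module itself: the function and exception names are made up,
and measuring the 20-character limit in UTF-8 bytes is an assumption 
for the example.

```python
IPMI_GROUP = "ipmi"
IPMI_MAX_PASSWORD_LEN = 20  # limit mentioned above


class PasswordTooLongForIpmi(ValueError):
    """Raised when an ipmi-group user picks a password IPMI cannot hold."""


def should_store_for_ipmi(user_groups, new_password):
    """Return True if the password should also be saved for IPMI use.

    Users outside the ipmi group are left alone; ipmi users with an
    over-long password get an error instead of a silent mismatch.
    """
    if IPMI_GROUP not in user_groups:
        return False
    if len(new_password.encode("utf-8")) > IPMI_MAX_PASSWORD_LEN:
        raise PasswordTooLongForIpmi(
            f"IPMI passwords are limited to {IPMI_MAX_PASSWORD_LEN} characters")
    return True


if __name__ == "__main__":
    print(should_store_for_ipmi({"ipmi", "redfish"}, "short-enough"))
    print(should_store_for_ipmi({"redfish"}, "x" * 40))
```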

>>> IPMI support is something OpenBMC provides alongside Redfish, it
>>> needs to be
>>> better integrated -- we're dealing with multiple locked-out BMC
>>> issues at the
>>> moment at various customer sites, and the recovery method is
>>> painful at best
>>> when it should be as simple as an ipmitool command from the host
>>> terminal.
>> 
>> I suspect most of this is a matter of IPMI command support and/or
>> enabling
>> those commands to the host IPMI path.  Most of us are fairly
>> untrusting
>> of IPMI (and the Host itself), so there hasn't been work to do
>> anything
>> here.  As long as whatever you're proposing can be disabled for
>> models
>> where we distrust the Host, it seems like these would be accepted
>> as
>> well.

Our current Redfish has multiple users and can enable and 
disable users to have ipmi access and set their password.

Of course this just moves the goal posts to the Redfish 
admin login.  But in addition to mTLS certificate-based 
trust (which should be customized to the customer), 
Redfish has the concept of host firmware and OS logins,
including a binding for EFI to specify adapter path and
network in addition to read-once magic EFI variables.  I 
know OpenPOWER boxes don't have EFI but the information
could be exposed in a similar fashion.  As far as I know 
we have not yet implemented these users in our Redfish
server.  

Or designate a physical jumper to tell the BMC to install
a known password.  Where's that turbo button again? :-)

milton