BIOS upgrade from BMC

Thu Mar 12 00:50:14 AEDT 2020

Hi everyone

I finally figured it out. By "it" I mean the problem that reading from and
writing to the BIOS flash ROM from an AST2500 was unreliable. The cause was
that Linux on the BMC didn't cope with me changing the flash image while
the BMC was running. I was using a Dediprog EM100 flash emulator, which
makes it possible to swap the image safely. The fix is either to not do
that (which wouldn't happen in production anyway), or to explicitly unbind
and then bind the mtd again before using it after changing the flash
contents. Yeah!
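
For reference, a minimal sketch of the unbind + bind step in Python. It
assumes the flash is handled by the aspeed-smc platform driver and that the
controller shows up as 1e630000.spi. Both names are assumptions (they are
not taken from this thread), so check /sys/bus/platform/drivers on the
actual board first.

```python
import time
from pathlib import Path

# Assumed names, not confirmed in this thread: verify them on the target BMC.
DRIVER_DIR = "/sys/bus/platform/drivers/aspeed-smc"
DEVICE_ID = "1e630000.spi"


def rebind_spi(driver_dir=DRIVER_DIR, device_id=DEVICE_ID, settle_s=0.5):
    """Unbind, then bind the SPI controller so the kernel re-probes the flash."""
    driver = Path(driver_dir)
    # Writing the device id to "unbind" detaches the driver from the device.
    (driver / "unbind").write_text(device_id)
    # Short pause before re-probing; the value is a guess, not a requirement.
    time.sleep(settle_s)
    # Writing the same id to "bind" attaches the driver again.
    (driver / "bind").write_text(device_id)
```

Calling rebind_spi() after swapping the emulator image needs root, and the
write to "unbind" raises OSError if the device is not currently bound.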

Oskar.

On Thu, Jan 30, 2020 at 7:22 PM Patrick Williams <patrick at stwcx.xyz> wrote:

> On Thu, Jan 30, 2020 at 11:30:10AM -0500, Oskar Senft wrote:
> > Hi Patrick
> >
> > Here are some thoughts:
> >
> > >   1.  Power off host server.
> > >
> > > I will admit I don't know the Intel architecture well enough yet, but
> > > is powering off the server prior to BIOS update actually required?  Is the
> > > BIOS NOR chip always mapped into a physical address and used, or is the
> > > BIOS at some point loaded and resident?  Are there stable points where
> > > it is safe to perform an update?  Can we monitor POST code to know when
> > > the BIOS is completed?
> > >
> > There are two issues:
> >
> >    - The host may access the BIOS SPI flash at any time by making BIOS
> >    calls. UEFI variables are such an example. The problem is that the
> >    BIOS code that executes these requests does not handle cases at all
> >    where the BIOS SPI flash becomes inaccessible. This results in an
> >    immediate crash of the host.
> >    - With ME in operational mode, we cannot guarantee that ME would not
> >    attempt to read/write from the SPI flash while the host is running.
> >    I'm not sure if it's possible to put ME into recovery mode WHILE the
> >    host is running or if the host needs to be shut down for that.
> >
> > My understanding is that the only way to safely write the BIOS SPI flash
> > from the BMC is to shut the host down and put ME into recovery mode.
> > Alternatively, hold the host in full reset via RSMRST.
>
> Good to know, thanks.
>
> > > >   2.  Set ME/NM (Management engine or Node manager in x86) to recovery mode
> > >
> > > Is this specific to the BIOS update path or is this something we should
> > > do whenever the Host is powered off?  In either case I guess you can
> > > make it a dependency on the systemd unit file, but it seems like it
> > > would be nice if it were able to be generically applied to all power
> > > on/off paths.
> > >
> > This question opens a can of worms. There are people who say that ME
> > should always be run in recovery mode ...
>
> Hah.  I think it is worth answering if the ME provides any useful
> function when the server is powered off though.  I don't know, but it
> would potentially simplify the BIOS update flow if Host Off => ME in
> Recovery.
>
> > > >   3.  Flip GPIO to access SPI flash used by host.
> > > >   4.  Bind spi driver to access flash
> > >
> > > This is another thing that seems like we could do generically on all
> > > power on / power off paths?  Any time the host isn't running we can hit
> > > the GPIO to put ownership at the BMC.  Is there any disadvantage to
> > > that?
> > >
> > Yes. You cannot turn the host on via a power button if the PCH cannot
> > access the SPI flash. You'd have to catch that signal in the BMC and do
> > the right thing.
> >
> > What's the advantage of having the BIOS SPI flash always connected to the
> > BMC when the host is off? That seems to be making things more complicated
> > to me.
>
> It was just another simplification.  Usually we have special user
> utilities to steal the flash to the BMC and we have this logic in BIOS
> update path.  Again, if Host Off => BIOS SPI owned by BMC, it simplifies
> / eliminates logic.
>
> > > Is the GPIO something unique to Facebook's machines or do most other
> > > Intel machines have the same requirements?
> > >
> >
> > I'm not sure if it was explained what the GPIO does:
> > Since the SPI flash can only have one master, a "mux" (it's really a
> > digital switch, or a pair of digital switches) connects the SPI flash
> > either to the PCH for access by the ME / host or to the BMC. The GPIO or
> > pair of GPIOs is used to control the mux / bus switches.
> >
> > If the SPI flash is connected to the BMC, the ME / host cannot access it
> > at all. As it turns out, the PCH needs to be able to read the SPI flash
> > to be able to "turn on" the host.
>
> Yep, I'm aware of the mux (on Facebook systems).  I wasn't sure if this
> was common or typical Intel architecture feature or something we
> specifically had on our Facebook systems.
>
> > >
> > > >   5.  Flashcp image to device.
> > >
> > > I don't think `flashcp` is used today, or at least not in my
> > > recollection of the previous Witherspoon implementation.  Is there any
> > > advantage to it over `dd` to the raw mtdblock device?
> > >
> > I'm new to this, too, and found this explanation:
> >
> > https://unix.stackexchange.com/questions/274217/how-is-erasing-mtd-with-dd-if-dev-zero-different-from-flash-eraseall
> >
> > This question was asked in the context of erase, but it applies to
> > writes, too.
>
> The stackexchange here is referring to /dev/mtdN devices and not
> /dev/mtdblockN devices (and I agree for plain-mtd).  mtdblock
> specifically has the extra logic to deal with erasing and writing in
> pages as appropriate.
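
To illustrate why the erase handling matters, here is a small Python
simulation of NOR-style flash, where programming can only clear bits and
only an erase sets them back to 1. The 4 KiB erase size and the names
NorFlash and block_write are assumptions made up for this sketch. It shows
the general read-modify-erase-program idea, not the kernel's actual mtdblock
code.

```python
ERASE_SIZE = 4096  # assumed erase-block size; real parts vary


class NorFlash:
    """Toy model: program() can only clear bits, erase_block() sets them."""

    def __init__(self, size):
        self.data = bytearray(b"\xff" * size)

    def program(self, offset, buf):
        # Programming ANDs new data into the cells (1 -> 0 only).
        for i, b in enumerate(buf):
            self.data[offset + i] &= b

    def erase_block(self, offset):
        # Erasing resets the whole containing block to 0xFF.
        start = offset - offset % ERASE_SIZE
        self.data[start:start + ERASE_SIZE] = b"\xff" * ERASE_SIZE


def block_write(flash, offset, buf):
    """Write buf at offset using read-modify-erase-program per erase block."""
    end = offset + len(buf)
    pos = offset - offset % ERASE_SIZE
    while pos < end:
        # Read the current block and patch in the part of buf that lands here.
        block = bytearray(flash.data[pos:pos + ERASE_SIZE])
        lo, hi = max(offset, pos), min(end, pos + ERASE_SIZE)
        block[lo - pos:hi - pos] = buf[lo - offset:hi - offset]
        flash.erase_block(pos)
        flash.program(pos, block)
        pos += ERASE_SIZE
```

Programming 0x0f and then 0xf0 to the same byte without an erase leaves
0x00 in this model, while block_write() ends up with the intended value.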
>
> > > >   9.  Power on server.
> > >
> > > Doesn't seem like "power on" should be a side-effect of a BIOS update.
> > > Is this intended to be "go back to the previous power state"?
> > >
> > +1
> >
> > Having said all that, I was experimenting with pretty much the same flow
> > but ended up with unreliable writes with individual bit flips. I'm pretty
> > sure the HW is fine, since the original (AMI) stock firmware that comes
> > with the board can do it just fine. This is with an Aspeed AST2500, a
> > C620 PCH and a Dediprog EM100 SPI flash emulator.
> >
> > I had even tried to change the SPI flash clock from the Aspeed down to
> > the minimum, with no change :-/ I already hooked up a logic analyzer to see
> > what's going on but haven't had a chance to investigate yet. Any ideas?
>
> Sorry, I've got nothing except maybe the original code retries a bunch
> to get past random flips?  If you are seeing bit-flips even with the
> Dediprog, are you sure the bus is any good?  Did you solder on headers
> to be able to affix the Dediprog?  That might have changed the
> capacitance enough to affect SPI activity.
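
One way to characterise such bit flips is to read the flash back and
compare it byte by byte with the image that was written. The Python sketch
below does this on two byte strings. The function names are made up for
this example, and how the read-back data is obtained (for instance from the
mtd device) is left open.

```python
def diff_bits(expected: bytes, actual: bytes):
    """Return (offset, xor_mask) for every byte that differs."""
    if len(expected) != len(actual):
        raise ValueError("images must have the same length")
    return [
        (i, e ^ a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


def summarize(diffs):
    """Count differing bytes and the total number of flipped bits."""
    flipped = sum(bin(mask).count("1") for _, mask in diffs)
    return {"bytes": len(diffs), "bits": flipped}
```

The xor mask shows which bits differ in each byte, so a pattern such as
always the same bit position or always 1 -> 0 becomes visible quickly.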
>
> > Oskar.
>
> --
> Patrick Williams
>