<div dir="ltr">Hi everyone<div><br></div><div>I finally figured it out. By "it" I mean the problem that reading from and writing to the BIOS flash ROM from an AST2500 was unreliable. The cause was that Linux on the BMC didn't like it when I changed the flash image while the BMC was running. I was using a Dediprog EM100 flash emulator, which allows doing that safely. The fix is either to not do that (which wouldn't happen in production anyway), or to explicitly unbind + bind the mtd before using it again after changing the flash contents. Yeah!</div><div><br></div><div>Oskar.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 30, 2020 at 7:22 PM Patrick Williams &lt;<a href="mailto:patrick@stwcx.xyz">patrick@stwcx.xyz</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, Jan 30, 2020 at 11:30:10AM -0500, Oskar Senft wrote:<br>

> Hi Patrick<br>

> <br>

> Here are some thoughts:<br>

> <br>

> >   1.  Power off host server.<br>

> ><br>

> > I will admit I don't know the Intel architecture well enough yet, but is<br>

> > powering off the server prior to BIOS update actually required?  Is the<br>

> > BIOS NOR chip always mapped into a physical address and used, or is the<br>

> > BIOS at some point loaded and resident?  Are there stable points where<br>

> > it is safe to perform an update?  Can we monitor POST code to know when<br>

> > the BIOS is completed?<br>

> ><br>

> There are two issues:<br>

> <br>

>    - The host may access the BIOS SPI flash at any time by making BIOS<br>

>    calls. UEFI variables are such an example. The problem is that the BIOS<br>

>    code that executes these requests does not handle at all the case where the<br>

>    BIOS SPI flash becomes inaccessible. This results in an immediate crash of<br>

>    the host.<br>

>    - With ME in operational mode, we cannot guarantee that ME would not<br>

>    attempt to read/write from the SPI flash while the host is running. I'm not<br>

>    sure if it's possible to put ME into recovery mode WHILE the host is<br>

>    running or if the host needs to be shut down for that.<br>

> <br>

> My understanding is that the only way to safely write the BIOS SPI flash<br>

> from the BMC is to shut the host down and put ME into recovery mode.<br>

> Alternatively, hold the host in full reset via RSMRST.<br>

<br>

Good to know, thanks.<br>

<br>

> > >   2.  Set ME/NM (Management engine or Node manager in x86) to recovery<br>

> > mode<br>

> ><br>

> > Is this specific to the BIOS update path or is this something we should<br>

> > do whenever the Host is powered off?  In either case I guess you can<br>

> > make it a dependency on the systemd unit file, but it seems like it<br>

> > would be nice if it were able to be generically applied to all power<br>

> > on/off paths.<br>

> ><br>

> This question opens a can of worms. There are people who say that ME should<br>

> always be run in recovery mode ...<br>

<br>

Hah.  I think it is worth answering whether the ME provides any useful<br>

function when the server is powered off though.  I don't know, but it<br>

would potentially simplify the BIOS update flow if Host Off => ME in<br>

Recovery.<br>

<br>

> > >   3.  Flip GPIO to access SPI flash used by host.<br>

> > >   4.  Bind spi driver to access flash<br>

> ><br>

> > This is another thing that seems like we could do generically on all<br>

> > power on / power off paths?  Any time the host isn't running we can hit<br>

> > the GPIO to put ownership at the BMC.  Is there any disadvantage to<br>

> > that?<br>

> ><br>

> Yes. You cannot turn the host on via a power button if the PCH cannot<br>

> access the SPI flash. You'd have to catch that signal in the BMC and do the<br>

> right thing.<br>

> <br>

> What's the advantage of having the BIOS SPI flash always connected to the<br>

> BMC when the host is off? That seems to be making things more complicated<br>

> to me.<br>

<br>

It was just another simplification.  Usually we have special user<br>

utilities to steal the flash for the BMC and we have this logic in the BIOS<br>

update path.  Again, if Host Off => BIOS SPI owned by BMC, it simplifies<br>

/ eliminates logic.<br>

<br>

> > Is the GPIO something unique to Facebook's machines or do most other<br>

> > Intel machines have the same requirements?<br>

> ><br>

> <br>

> I'm not sure if it was explained what the GPIO does:<br>

> Since the SPI flash can only have one master, a "mux" (it's really a<br>

> digital switch, or a pair of digital switches) connects the SPI flash either<br>

> to the PCH for access by the ME / host or to the BMC. The GPIO or pair of<br>

> GPIOs is used to control the mux / bus switches.<br>

> <br>

> If the SPI flash is connected to the BMC, the ME / host cannot access it at<br>

> all. As it turns out, the PCH needs to be able to read the SPI flash to be<br>

> able to "turn on" the host.<br>

<br>

Yep, I'm aware of the mux (on Facebook systems).  I wasn't sure if this<br>

was a common or typical Intel architecture feature or something we<br>

specifically had on our Facebook systems.<br>

<br>

> ><br>

> > >   5.  Flashcp image to device.<br>

> ><br>

> > I don't think `flashcp` is used today, or at least not in my<br>

> > recollection of the previous Witherspoon implementation.  Is there any<br>

> > advantage to it over `dd` to the raw mtdblock device?<br>

> ><br>

> I'm new to this, too, and found this explanation:<br>

> <a href="https://unix.stackexchange.com/questions/274217/how-is-erasing-mtd-with-dd-if-dev-zero-different-from-flash-eraseall" rel="noreferrer" target="_blank">https://unix.stackexchange.com/questions/274217/how-is-erasing-mtd-with-dd-if-dev-zero-different-from-flash-eraseall</a><br>

> <br>

> This question was asked in the context of erase, but it applies to writes,<br>

> too.<br>

<br>

The Stack Exchange answer here is referring to /dev/mtdN devices and not<br>

/dev/mtdblockN devices (and I agree for plain-mtd).  mtdblock<br>

specifically has the extra logic to deal with erasing and writing in<br>

pages as appropriate.<br>
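
To illustrate why a plain write to a raw NOR device is not enough, here is a minimal Python sketch of the flash semantics involved. It is a toy in-memory model, not kernel code: the NorFlash class, write_with_erase and the 4 KiB ERASE_BLOCK size are all made up for illustration. Programming can only clear bits (1 to 0), so a block has to be erased back to 0xFF before new data can be stored, and a read-modify-write of the whole erase block is roughly the kind of extra logic a block layer has to provide.<br>

```python
ERASE_BLOCK = 4096  # assumed erase-block size, for illustration only


class NorFlash:
    """Toy in-memory model of NOR flash (hypothetical, not a real driver)."""

    def __init__(self, size: int) -> None:
        # Erased NOR flash reads back as all ones.
        self.data = bytearray(b"\xff" * size)

    def program(self, offset: int, payload: bytes) -> None:
        # Programming can only clear bits, so the result is old AND new.
        for i, value in enumerate(payload):
            self.data[offset + i] = self.data[offset + i] & value

    def erase_block(self, offset: int) -> None:
        # Erasing resets the whole containing block to 0xFF.
        start = offset - offset % ERASE_BLOCK
        self.data[start:start + ERASE_BLOCK] = b"\xff" * ERASE_BLOCK


def write_with_erase(flash: NorFlash, offset: int, payload: bytes) -> None:
    """Read-modify-write every erase block that the payload touches."""
    end = offset + len(payload)
    first = offset - offset % ERASE_BLOCK
    for block in range(first, end, ERASE_BLOCK):
        # Keep a copy of the block so neighbouring data survives the erase.
        image = bytearray(flash.data[block:block + ERASE_BLOCK])
        lo = max(offset, block)
        hi = min(end, block + ERASE_BLOCK)
        image[lo - block:hi - block] = payload[lo - offset:hi - offset]
        flash.erase_block(block)
        flash.program(block, bytes(image))
```

Calling program() twice on the same offset without an erase in between leaves the bitwise AND of both values behind, which is the failure mode the linked answer describes for raw devices.<br>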

<br>

> > >   9.  Power on server.<br>

> ><br>

> > Doesn't seem like "power on" should be a side-effect of a BIOS update.<br>

> > Is this intended to be "go back to the previous power state"?<br>

> ><br>

> +1<br>

> <br>

> Having said all that, I was experimenting with pretty much the same flow<br>

> but ended up with unreliable writes with individual bit flips. I'm pretty<br>

> sure the HW is fine, since the original (AMI) stock firmware that comes<br>

> with the board can do it just fine. This is with an Aspeed AST2500, a C620<br>

> PCH and a Dediprog EM100 SPI flash emulator.<br>

> <br>

> I had even tried to change the SPI flash clock from the Aspeed down to the<br>

> minimum, with no change :-/ I already hooked up a logic analyzer to see<br>

> what's going on but haven't had a chance to investigate yet. Any ideas?<br>

<br>

Sorry, I've got nothing except maybe the original code retries a bunch<br>

to get past random flips?  If you are seeing bit-flips even with the<br>

Dediprog, are you sure the bus is any good?  Did you solder on headers<br>

to be able to affix the Dediprog?  That might have changed the<br>

capacitance enough to affect SPI activity.<br>
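
When chasing individual bit flips like the ones described above, it can help to compare the image that was written against what is read back and to list exactly which bits differ. The following is a minimal Python sketch of such a check. The function names are invented for this example, and how the two byte strings are obtained (reading the image file and the flash device) is left out.<br>

```python
def bit_flips(expected: bytes, actual: bytes) -> list[tuple[int, int]]:
    """Return (offset, xor_mask) for every byte that differs.

    Each set bit in xor_mask marks one flipped bit at that offset.
    """
    if len(expected) != len(actual):
        raise ValueError("images must have the same length")
    flips = []
    for offset, (want, got) in enumerate(zip(expected, actual)):
        mask = want ^ got
        if mask:
            flips.append((offset, mask))
    return flips


def count_flipped_bits(flips: list[tuple[int, int]]) -> int:
    """Total number of differing bits across all reported bytes."""
    return sum(bin(mask).count("1") for _, mask in flips)
```

If the reported offsets or masks change between repeated reads of the same contents, that points at the read path or the bus. Differences that stay the same from read to read suggest that the data stored in the flash is actually wrong.<br>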

<br>

> Oskar.<br>

<br>

-- <br>

Patrick Williams<br>

</blockquote></div>
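
The unbind + bind step mentioned at the top of this message could be scripted roughly as follows. This is a minimal Python sketch using the generic Linux sysfs driver bind/unbind files. The driver directory and device name in the usage comment (aspeed-smc, 1e630000.spi) are assumptions for an AST2500 system and depend on the kernel and device tree, so check what is present under /sys/bus/platform/drivers on the actual BMC. The function name rebind_device is made up for this example.<br>

```python
from pathlib import Path


def rebind_device(device: str, driver_dir: Path) -> None:
    """Unbind and then re-bind a device so the driver probes the flash again.

    driver_dir is the sysfs directory of the driver, which contains the
    write-only 'unbind' and 'bind' files.
    """
    (driver_dir / "unbind").write_text(device)
    (driver_dir / "bind").write_text(device)


# Example call (assumed names, needs root on the BMC):
# rebind_device("1e630000.spi", Path("/sys/bus/platform/drivers/aspeed-smc"))
```

Taking the directory as a parameter keeps the helper independent of any particular driver name and makes it possible to try it against a scratch directory first.<br>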