<div dir="ltr">Hi everyone<div><br></div><div>I finally figured it out. By "it" I mean the problem that reading from and writing to the BIOS flash ROM from an AST2500 was unreliable. The cause was that Linux on the BMC didn't like it when I changed the flash image while the BMC was running. I was using a Dediprog EM100 flash emulator, which allows doing that safely. The fix is to either not do that (which wouldn't happen in production anyway), or to explicitly unbind + bind the mtd before using it again after changing the flash contents. Yeah!</div><div><br></div><div>Oskar.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 30, 2020 at 7:22 PM Patrick Williams <<a href="mailto:patrick@stwcx.xyz">patrick@stwcx.xyz</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, Jan 30, 2020 at 11:30:10AM -0500, Oskar Senft wrote:<br>
> Hi Patrick<br>
> <br>
> Here are some thoughts:<br>
> <br>
> > 1. Power off host server.<br>
> ><br>
> > I will admit I don't know the Intel architecture well enough yet, but is<br>
> > powering off the server prior to BIOS update actually required? Is the<br>
> > BIOS NOR chip always mapped into a physical address and used, or is the<br>
> > BIOS at some point loaded and resident? Are there stable points where<br>
> > it is safe to perform an update? Can we monitor POST codes to know when<br>
> > the BIOS has completed?<br>
> ><br>
> There are two issues:<br>
> <br>
> - The host may access the BIOS SPI flash at any time by making BIOS<br>
> calls. UEFI variables are such an example. The problem is that the BIOS<br>
> code that executes these requests does not handle the case at all where the<br>
> BIOS SPI flash becomes inaccessible. This results in an immediate crash of<br>
> the host.<br>
> - With ME in operational mode, we cannot guarantee that ME would not<br>
> attempt to read/write from the SPI flash while the host is running. I'm not<br>
> sure if it's possible to put ME into recovery mode WHILE the host is<br>
> running or if the host needs to be shut down for that.<br>
> <br>
> My understanding is that the only way to safely write the BIOS SPI flash<br>
> from the BMC is to shut the host down and put ME into recovery mode.<br>
> Alternatively, hold the host in full reset via RSMRST.<br>
<br>
Good to know, thanks.<br>
<br>
> > > 2. Set ME/NM (Management Engine or Node Manager on x86) to recovery<br>
> > > mode<br>
> ><br>
> > Is this specific to the BIOS update path or is this something we should<br>
> > do whenever the Host is powered off? In either case I guess you can<br>
> > make it a dependency on the systemd unit file, but it seems like it<br>
> > would be nice if it were able to be generically applied to all power<br>
> > on/off paths.<br>
> ><br>
> This question opens a can of worms. There are people who say that ME should<br>
> always be run in recovery mode ...<br>
<br>
Hah. I think it is worth answering if the ME provides any useful<br>
function when the server is powered off though. I don't know, but it<br>
would potentially simplify the BIOS update flow if Host Off => ME in<br>
Recovery.<br>
<br>
> > > 3. Flip GPIO to access SPI flash used by host.<br>
> > > 4. Bind spi driver to access flash<br>
> ><br>
> > This is another thing that seems like we could do generically on all<br>
> > power on / power off paths? Any time the host isn't running we can hit<br>
> > the GPIO to put ownership at the BMC. Is there any disadvantage to<br>
> > that?<br>
> ><br>
> Yes. You cannot turn the host on via a power button if the PCH cannot<br>
> access the SPI flash. You'd have to catch that signal in the BMC and do the<br>
> right thing.<br>
> <br>
> What's the advantage of having the BIOS SPI flash always connected to the<br>
> BMC when the host is off? That seems to be making things more complicated<br>
> to me.<br>
<br>
It was just another simplification. Usually we have special user<br>
utilities to steal the flash to the BMC and we have this logic in BIOS<br>
update path. Again, if Host Off => BIOS SPI owned by BMC, it simplifies<br>
/ eliminates logic.<br>
<br>
> > Is the GPIO something unique to Facebook's machines or do most other<br>
> > Intel machines have the same requirements?<br>
> ><br>
> <br>
> I'm not sure if it was explained what the GPIO does:<br>
> Since the SPI flash can only have one master, a "mux" (it's really a<br>
> digital switch, or a pair of digital switches) connects the SPI flash either<br>
> to the PCH for access by the ME / host or to the BMC. The GPIO or pair of<br>
> GPIOs is used to control the mux / bus switches.<br>
> <br>
> If the SPI flash is connected to the BMC, the ME / host cannot access it at<br>
> all. As it turns out, the PCH needs to be able to read the SPI flash to be<br>
> able to "turn on" the host.<br>
<br>
Yep, I'm aware of the mux (on Facebook systems). I wasn't sure if this<br>
was a common or typical Intel architecture feature or something we<br>
specifically had on our Facebook systems.<br>
<br>
> ><br>
> > > 5. Flashcp image to device.<br>
> ><br>
> > I don't think `flashcp` is used today, or at least not in my<br>
> > recollection of the previous Witherspoon implementation. Is there any<br>
> > advantage to it over `dd` to the raw mtdblock device?<br>
> ><br>
> I'm new to this, too, and found this explanation:<br>
> <a href="https://unix.stackexchange.com/questions/274217/how-is-erasing-mtd-with-dd-if-dev-zero-different-from-flash-eraseall" rel="noreferrer" target="_blank">https://unix.stackexchange.com/questions/274217/how-is-erasing-mtd-with-dd-if-dev-zero-different-from-flash-eraseall</a><br>
> <br>
> This question was asked in the context of erase, but it applies to writes,<br>
> too.<br>
<br>
The stackexchange here is referring to /dev/mtdN devices and not<br>
/dev/mtdblockN devices (and I agree for plain-mtd). mtdblock<br>
specifically has the extra logic to deal with erasing and writing in<br>
pages as appropriate.<br>
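<br>
To illustrate why the erase step matters on a raw NOR device, here is a toy model in Python. It is only a sketch of the flash behaviour (programming can clear bits but never set them), not the kernel's mtd or mtdblock code; the names program and erase are made up for this example.<br>
<pre>
```python
ERASED = 0xFF  # an erased NOR byte has all bits set


def program(cells, data):
    """Program data over existing cells: bits can only go from 1 to 0."""
    return bytes(o & n for o, n in zip(cells, data))


def erase(cells):
    """Erase a block: every byte returns to 0xFF."""
    return bytes([ERASED]) * len(cells)


old = bytes([0x0F, 0xF0])  # stale contents already in the flash
new = bytes([0xAA, 0x55])  # image we want to write

# Writing without erasing leaves old AND new, not new.
assert program(old, new) == bytes([0x0A, 0x50])
# Erasing first gives the intended contents.
assert program(erase(old), new) == new
```
</pre>
A plain write to unerased flash therefore ends up with wrong bits wherever the old contents had a 0, which is why a tool that erases before writing (or a layer that does it for you) is needed.<br>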
<br>
> > > 9. Power on server.<br>
> ><br>
> > Doesn't seem like "power on" should be a side-effect of a BIOS update.<br>
> > Is this intended to be "go back to the previous power state"?<br>
> ><br>
> +1<br>
> <br>
> Having said all that, I was experimenting with pretty much the same flow<br>
> but ended up with unreliable writes with individual bit flips. I'm pretty<br>
> sure the HW is fine, since the original (AMI) stock firmware that comes<br>
> with the board can do it just fine. This is with an Aspeed AST2500, a C620<br>
> PCH and a Dediprog EM100 SPI flash emulator.<br>
> <br>
> I had even tried to change the SPI flash clock from the Aspeed down to the<br>
> minimum, with no change :-/ I already hooked up a logic analyzer to see<br>
> what's going on but haven't had a chance to investigate yet. Any ideas?<br>
<br>
Sorry, I've got nothing except maybe the original code retries a bunch<br>
to get past random flips? If you are seeing bit-flips even with the<br>
Dediprog, are you sure the bus is any good? Did you solder on headers<br>
to be able to affix the Dediprog? That might have changed the<br>
capacitance enough to affect SPI activity.<br>
<br>
> Oskar.<br>
<br>
-- <br>
Patrick Williams<br>
</blockquote></div>
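<div><br></div><div>For reference, a minimal sketch of the unbind + bind step in Python, assuming the flash sits behind a platform driver that exposes the usual sysfs bind and unbind files. The helper name rebind is made up for this example, and the driver directory and device name shown in the comments are assumptions for an AST2500, not something confirmed in this thread; check /sys/bus/platform/drivers/ on your BMC for the real names.</div>
<pre>
```python
import os

# Assumed values for an AST2500; verify them on the target BMC.
# DRIVER_DIR = "/sys/bus/platform/drivers/aspeed-smc"
# DEVICE = "1e630000.spi"


def rebind(driver_dir, device):
    """Unbind, then bind again, by writing the device name to sysfs."""
    for action in ("unbind", "bind"):
        with open(os.path.join(driver_dir, action), "w") as f:
            f.write(device)
```
</pre>
<div>On the BMC this would be called as rebind(DRIVER_DIR, DEVICE) as root after the flash contents were changed externally, so that the flash is probed again before the next read or write.</div>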