[Skiboot] [PATCH] mbox: Harden against BMC daemon errors

Thu Mar 22 16:54:06 AEDT 2018

On Thu, 2018-03-22 at 15:15 +1000, Nicholas Piggin wrote:
> On Thu, 22 Mar 2018 15:31:46 +1100
> Cyril Bur <cyril.bur at au1.ibm.com> wrote:
> 
> > On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:
> > > Thanks!
> > > 
> > > On Thu, 22 Mar 2018 14:32:35 +1100
> > > Cyril Bur <cyril.bur at au1.ibm.com> wrote:
> > >   
> > > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > > already detects these conditions.
> > > > 
> > > > Currently skiboot warns quite strongly about the occurrence of these
> > > > problems. The problem for skiboot is that it doesn't take any action.
> > > > Initially I wanting to avoid putting policy like this into skiboot but
> > > > since these bugs aren't going away and skiboot barfing is leading to
> > > > lockups and ultimately the host going down something needs to be done.
> > > > 
> > > > I propose that when we detect the problem we fail the mbox call and punt
> > > > the problem back up to Linux. I don't like it but at least it will cause
> > > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > > is supposed to detect this or what it can even do but this is better
> > > > than a crash.
> > > > 
> > > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > > be marginally more difficult with this patch. This is because skiboot
> > > > will now only print one warning about the zero sized window rather than
> > > > continuously spitting it out.
> > > > 
> > > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > > > Signed-off-by: Cyril Bur <cyril.bur at au1.ibm.com>
> > > > ---
> > > >  libflash/mbox-flash.c | 11 ++++++++---
> > > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > > index 4a3c53f5..70f43f36 100644
> > > > --- a/libflash/mbox-flash.c
> > > > +++ b/libflash/mbox-flash.c
> > > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > > >  {
> > > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > > -	while (mbox_flash->busy && timeout_sec) {
> > > > +	while (mbox_flash->busy && timeout_sec > last) {
> > > >  		long now = tb_to_secs(mftb());
> > > >  		if (now - start > last) {
> > > > -			timeout_sec--;
> > > > -			last = now - start;
> > > >  			if (last < timeout_sec / 2)
> > > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > > >  			else
> > > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > > +			last++;
> > > >  		}
> > > >  		/*
> > > >  		 * Both functions are important.  
> > > 
> > > This looks like another change got pulled in. Was this intentional?
> > > It doesn't seem like it changes the logic.
> > >   
> > 
> > Yeah I messed with it a bit, thought I'd throw that in here, no change.
> > I was wanting to start some discussion around exactly what you're
> > saying so that's good.
> > 
> > > While I'm looking at this code... wow, 30 seconds is a long time to
> > > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > > outside crashing, hardware failure handling, and genuine offline
> > > operations (e.g., flashing firmware) and we should be able to count
> > > the milliseconds in firmware.
> > > 
> > > If this is only for writing flash, well then a few seconds seems fine.
> > > What's the minimum reasonable value? Something under 10 seconds would
> > > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> > >   
> > 
> > Well as discussed before, all bets are off now since it seems the BMC
> > won't tell us of problems in any kind of timely manner. I could move
> > that hunk into a patch
> 
> If it could be split and this one fleshed out, that would be good.
> 
> > that just hardcodes 3 seconds everywhere (i'm
> > glad we put effort into putting it into the mbox protocol so that the
> > BMC wouldn't do it)
> 
> That would be good if we can, or even lower. What's the consequence
> of timing out here in high level terms?  Will it brick your machine,
> or just some tool just error out?
> 
> And what's the response time of the BMC look like in worst conditions?
> Basically we want the point where waiting for longer will be worse
> than failing. 
> 

Well, I went and typed reboot on spoon2-bmc 70 seconds later I was at a
login prompt.

I'll leave that there.

Three seconds is really only going to save us if something loads the
BMC up at exactly the wrong time and it hickups (which does happen...).

I'll pull it out and send that patch because... what other choice do we
have?

> > 
> > > Is there anything more useful you can tell the administrator in the
> > > error message? Should they ensure the BMC is running? Power cycle it?
> > >   
> > 
> > Could add a mention that they need to powercycle their BMC but if we've
> > gone that bad...
> 
> If there is something concise and reasonable we can put in the log
> to help the admin out, that's always good.
> 
> > who knows what state the host is in...
> > 
> > >   
> > > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > > +		/*
> > > > +		 * In practice skiboot gets stuck and this eventually
> > > > +		 * brings down the host. Just fail pass the error back
> > > > +		 * up and hope someone makes a good decision
> > > > +		 */
> > > > +		return MBOX_R_SYSTEM_ERROR;
> > > >  	}
> > > >  
> > > >  	return rc;  
> > > 
> > > What call chains lead here? Will the callers handle the error properly?
> > > (e.g., will this give something like EIO writing to /dev/nvram?)
> > >   
> > 
> > /dev/nvram is a bit of a special case since skiboot buffers it. A
> > failing attempt to write it back would silently fail (as far as linux
> > is concerned).
> > 
> > /dev/mtd by the powernv_flash driver would give EIO (or some other
> > failure), pflash does get a failed read when trying to open the FFS
> > header from host userspace. As tested by Pridhiviraj.
> 
> If there's a friendlier message you could add here too. I mean if it's
> an expected condition with the firmware stack... Can you say
> 
> "Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."
> 

Where xxx fix your BMC mbox daemon... this really is a "should never
happen case".

The example I use all the time to illustrate this is: 
skiboot tells Linux there is 64mb of flash.
powernv_flash driver creates a mtd device (all good).
Newer ubuntu by default (I wouldn't but they're totally entitled to)
seems to probe that device for what I assume are filesystem special
blocks, sometimes the probe is at a block in the "flash" where the
daemon isn't expecting anyone to read because the flash layout header
indicates there's nothing there.
So it returns a "valid" (when it could have returned an invalid) window
of size zero.

The BMC daemon should just give a window of "empty" flash or, worse
case, say that the window is invalid - both make sense... both work,
both don't result in crashes.

> Thanks,
> Nick
>