[Skiboot] [PATCH] mbox: Harden against BMC daemon errors

Thu Mar 22 17:07:45 AEDT 2018

On Thu, 2018-03-22 at 16:02 +1000, Nicholas Piggin wrote:
> On Thu, 22 Mar 2018 16:54:06 +1100
> Cyril Bur <cyril.bur at au1.ibm.com> wrote:
> 
> > On Thu, 2018-03-22 at 15:15 +1000, Nicholas Piggin wrote:
> > > On Thu, 22 Mar 2018 15:31:46 +1100
> > > Cyril Bur <cyril.bur at au1.ibm.com> wrote:
> > >   
> > > > On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:  
> > > > > Thanks!
> > > > > 
> > > > > On Thu, 22 Mar 2018 14:32:35 +1100
> > > > > Cyril Bur <cyril.bur at au1.ibm.com> wrote:
> > > > >     
> > > > > > Bugs present in the BMC daemon mean that skiboot gets presented with
> > > > > > mbox windows of size zero. These windows cannot be valid and skiboot
> > > > > > already detects these conditions.
> > > > > > 
> > > > > > Currently skiboot warns quite strongly about the occurrence of these
> > > > > > problems. The problem for skiboot is that it doesn't take any action.
> > > > > > Initially I wanting to avoid putting policy like this into skiboot but
> > > > > > since these bugs aren't going away and skiboot barfing is leading to
> > > > > > lockups and ultimately the host going down something needs to be done.
> > > > > > 
> > > > > > I propose that when we detect the problem we fail the mbox call and punt
> > > > > > the problem back up to Linux. I don't like it but at least it will cause
> > > > > > errors to cascade and won't bring the host down. I'm not sure how Linux
> > > > > > is supposed to detect this or what it can even do but this is better
> > > > > > than a crash.
> > > > > > 
> > > > > > Diagnosing a failure to boot if skiboot its self fails to read flash may
> > > > > > be marginally more difficult with this patch. This is because skiboot
> > > > > > will now only print one warning about the zero sized window rather than
> > > > > > continuously spitting it out.
> > > > > > 
> > > > > > Reported-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > > > > > Tested-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > > > > > Signed-off-by: Cyril Bur <cyril.bur at au1.ibm.com>
> > > > > > ---
> > > > > >  libflash/mbox-flash.c | 11 ++++++++---
> > > > > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > > > > > index 4a3c53f5..70f43f36 100644
> > > > > > --- a/libflash/mbox-flash.c
> > > > > > +++ b/libflash/mbox-flash.c
> > > > > > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > > > > >  {
> > > > > >  	unsigned long last = 1, start = tb_to_secs(mftb());
> > > > > >  	prlog(PR_TRACE, "Waiting for BMC\n");
> > > > > > -	while (mbox_flash->busy && timeout_sec) {
> > > > > > +	while (mbox_flash->busy && timeout_sec > last) {
> > > > > >  		long now = tb_to_secs(mftb());
> > > > > >  		if (now - start > last) {
> > > > > > -			timeout_sec--;
> > > > > > -			last = now - start;
> > > > > >  			if (last < timeout_sec / 2)
> > > > > >  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > > > > >  			else
> > > > > >  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > > > > > +			last++;
> > > > > >  		}
> > > > > >  		/*
> > > > > >  		 * Both functions are important.    
> > > > > 
> > > > > This looks like another change got pulled in. Was this intentional?
> > > > > It doesn't seem like it changes the logic.
> > > > >     
> > > > 
> > > > Yeah I messed with it a bit, thought I'd throw that in here, no change.
> > > > I was wanting to start some discussion around exactly what you're
> > > > saying so that's good.
> > > >   
> > > > > While I'm looking at this code... wow, 30 seconds is a long time to
> > > > > wait in firmware with interrupts disabled. 3 seconds is too. Anything
> > > > > outside crashing, hardware failure handling, and genuine offline
> > > > > operations (e.g., flashing firmware) and we should be able to count
> > > > > the milliseconds in firmware.
> > > > > 
> > > > > If this is only for writing flash, well then a few seconds seems fine.
> > > > > What's the minimum reasonable value? Something under 10 seconds would
> > > > > be good so it doesn't trip the hard lockup watchdog. 3 seconds?
> > > > >     
> > > > 
> > > > Well as discussed before, all bets are off now since it seems the BMC
> > > > won't tell us of problems in any kind of timely manner. I could move
> > > > that hunk into a patch  
> > > 
> > > If it could be split and this one fleshed out, that would be good.
> > >   
> > > > that just hardcodes 3 seconds everywhere (i'm
> > > > glad we put effort into putting it into the mbox protocol so that the
> > > > BMC wouldn't do it)  
> > > 
> > > That would be good if we can, or even lower. What's the consequence
> > > of timing out here in high level terms?  Will it brick your machine,
> > > or just some tool just error out?
> > > 
> > > And what's the response time of the BMC look like in worst conditions?
> > > Basically we want the point where waiting for longer will be worse
> > > than failing. 
> > >   
> > 
> > Well, I went and typed reboot on spoon2-bmc 70 seconds later I was at a
> > login prompt.
> > 
> > I'll leave that there.
> > 
> > Three seconds is really only going to save us if something loads the
> > BMC up at exactly the wrong time and it hickups (which does happen...).
> 
> Right. We can never cope with BMC reboot unless the OPAL API is
> completely async, and the mbox protocol is sufficiently stateless
> or recoverable so it will do something useful across reboots.
> 
> Which we should aim for *eventually*. I'll take 3 seconds for now
> so the watchdog won't kill the host :) 
> 

Ack

> > 
> > I'll pull it out and send that patch because... what other choice do we
> > have?
> > 
> > > >   
> > > > > Is there anything more useful you can tell the administrator in the
> > > > > error message? Should they ensure the BMC is running? Power cycle it?
> > > > >     
> > > > 
> > > > Could add a mention that they need to powercycle their BMC but if we've
> > > > gone that bad...  
> > > 
> > > If there is something concise and reasonable we can put in the log
> > > to help the admin out, that's always good.
> > >   
> > > > who knows what state the host is in...
> > > >   
> > > > >     
> > > > > > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > > > > >  		prlog(PR_ERR, "Move window is indicating size zero!\n");
> > > > > >  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > > > > >  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > > > > > +		/*
> > > > > > +		 * In practice skiboot gets stuck and this eventually
> > > > > > +		 * brings down the host. Just fail pass the error back
> > > > > > +		 * up and hope someone makes a good decision
> > > > > > +		 */
> > > > > > +		return MBOX_R_SYSTEM_ERROR;
> > > > > >  	}
> > > > > >  
> > > > > >  	return rc;    
> > > > > 
> > > > > What call chains lead here? Will the callers handle the error properly?
> > > > > (e.g., will this give something like EIO writing to /dev/nvram?)
> > > > >     
> > > > 
> > > > /dev/nvram is a bit of a special case since skiboot buffers it. A
> > > > failing attempt to write it back would silently fail (as far as linux
> > > > is concerned).
> > > > 
> > > > /dev/mtd by the powernv_flash driver would give EIO (or some other
> > > > failure), pflash does get a failed read when trying to open the FFS
> > > > header from host userspace. As tested by Pridhiviraj.  
> > > 
> > > If there's a friendlier message you could add here too. I mean if it's
> > > an expected condition with the firmware stack... Can you say
> > > 
> > > "Flash read failed due to BMC mbox daemon indicating 0 size window! Do xxx to fix it."
> > >   
> > 
> > Where xxx fix your BMC mbox daemon... this really is a "should never
> > happen case".
> > 
> > The example I use all the time to illustrate this is: 
> > skiboot tells Linux there is 64mb of flash.
> > powernv_flash driver creates a mtd device (all good).
> > Newer ubuntu by default (I wouldn't but they're totally entitled to)
> > seems to probe that device for what I assume are filesystem special
> > blocks, sometimes the probe is at a block in the "flash" where the
> > daemon isn't expecting anyone to read because the flash layout header
> > indicates there's nothing there.
> > So it returns a "valid" (when it could have returned an invalid) window
> > of size zero.
> > 
> > The BMC daemon should just give a window of "empty" flash or, worse
> > case, say that the window is invalid - both make sense... both work,
> > both don't result in crashes.
> 
> Okay, I think EIO from the /dev/ IO is pretty appropriate in that case.
> Does there even need to even be a user visible message?

The only reason there is a big barf about zero sized windows is that it
is hard to debug then skiboot either locks up (as it does today) or as
will happen now, read/writes start failing. I see your point and I
don't disagree that perhaps those could be debug level prints with a
single line "Please update your BMC!" (because I'm assuming these bugs
will get fixed) at error level or something.

> 
> Thanks,
> Nick
>