[Skiboot] [PATCH] mbox: Harden against BMC daemon errors
Cyril Bur
cyril.bur at au1.ibm.com
Thu Mar 22 15:31:46 AEDT 2018
On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:
> Thanks!
>
> On Thu, 22 Mar 2018 14:32:35 +1100
> Cyril Bur <cyril.bur at au1.ibm.com> wrote:
>
> > Bugs present in the BMC daemon mean that skiboot gets presented with
> > mbox windows of size zero. These windows cannot be valid and skiboot
> > already detects these conditions.
> >
> > Currently skiboot warns quite strongly about the occurrence of these
> > problems. The problem for skiboot is that it doesn't take any action.
> > Initially I wanted to avoid putting policy like this into skiboot, but
> > since these bugs aren't going away, and skiboot barfing is leading to
> > lockups and ultimately to the host going down, something needs to be done.
> >
> > I propose that when we detect the problem we fail the mbox call and punt
> > the problem back up to Linux. I don't like it but at least it will cause
> > errors to cascade and won't bring the host down. I'm not sure how Linux
> > is supposed to detect this or what it can even do but this is better
> > than a crash.
> >
> > Diagnosing a failure to boot if skiboot itself fails to read flash may
> > be marginally more difficult with this patch. This is because skiboot
> > will now only print one warning about the zero sized window rather than
> > continuously spitting it out.
> >
> > Reported-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > Tested-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > Signed-off-by: Cyril Bur <cyril.bur at au1.ibm.com>
> > ---
> > libflash/mbox-flash.c | 11 ++++++++---
> > 1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > index 4a3c53f5..70f43f36 100644
> > --- a/libflash/mbox-flash.c
> > +++ b/libflash/mbox-flash.c
> > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > {
> > unsigned long last = 1, start = tb_to_secs(mftb());
> > prlog(PR_TRACE, "Waiting for BMC\n");
> > - while (mbox_flash->busy && timeout_sec) {
> > + while (mbox_flash->busy && timeout_sec > last) {
> > long now = tb_to_secs(mftb());
> > if (now - start > last) {
> > - timeout_sec--;
> > - last = now - start;
> > if (last < timeout_sec / 2)
> > prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > else
> > prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > + last++;
> > }
> > /*
> > * Both functions are important.
>
> This looks like another change got pulled in. Was this intentional?
> It doesn't seem like it changes the logic.
>
Yeah, I messed with it a bit and thought I'd throw that in here; no
change in behaviour. I wanted to start some discussion around exactly
what you're saying, so that's good.
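For reference, here's a standalone sketch of the reshaped loop (not the
actual wait_for_bmc(); time_now_secs() stands in for tb_to_secs(mftb())
and printf() for prlog()). It's only meant to show that `last` now
counts elapsed whole seconds up to timeout_sec instead of decrementing
a remaining budget:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Stand-in for tb_to_secs(mftb()) in skiboot. */
static unsigned long time_now_secs(void)
{
	return (unsigned long)time(NULL);
}

/* Sketch only: `last` counts whole elapsed seconds and the loop
 * gives up once it reaches timeout_sec. */
static void wait_sketch(volatile bool *busy, unsigned int timeout_sec)
{
	unsigned long last = 1, start = time_now_secs();

	while (*busy && timeout_sec > last) {
		unsigned long now = time_now_secs();

		if (now - start > last) {
			/* Quiet early on, loud once past the halfway mark. */
			if (last < timeout_sec / 2)
				printf("Been waiting for the BMC for %lu secs\n", last);
			else
				printf("BMC NOT RESPONDING %lu second wait\n", last);
			last++;
		}
		/* The real loop also polls the BMC for a response here. */
	}
}

int main(void)
{
	bool busy = false;	/* pretend the BMC already answered */

	wait_sketch(&busy, 3);
	printf("done waiting\n");
	return 0;
}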
> While I'm looking at this code... wow, 30 seconds is a long time to
> wait in firmware with interrupts disabled; 3 seconds is too. For
> anything outside crashing, hardware failure handling, and genuine
> offline operations (e.g., flashing firmware), we should be able to
> count the milliseconds spent in firmware.
>
> If this is only for writing flash, well then a few seconds seems fine.
> What's the minimum reasonable value? Something under 10 seconds would
> be good so it doesn't trip the hard lockup watchdog. 3 seconds?
>
Well, as discussed before, all bets are off now since it seems the BMC
won't tell us about problems in any kind of timely manner. I could move
that hunk into a patch that just hardcodes 3 seconds everywhere (I'm
glad we put effort into putting it into the mbox protocol so that the
BMC wouldn't do it).
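Roughly, a sketch of what that could look like (the constant name and
the do_bmc_wait() stub are made up for illustration; only
wait_for_bmc()'s timeout parameter comes from the hunk above):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical single timeout for every BMC wait. */
#define MBOX_BMC_TIMEOUT_SECS	3	/* well under a ~10s hard lockup watchdog */

/* Stand-in for wait_for_bmc(mbox_flash, timeout_sec). */
static int do_bmc_wait(bool *busy, unsigned int timeout_sec)
{
	(void)timeout_sec;	/* the real code polls until this many seconds elapse */
	return *busy ? -1 : 0;
}

int main(void)
{
	bool busy = false;

	/* Every call site passes the same fixed value. */
	if (do_bmc_wait(&busy, MBOX_BMC_TIMEOUT_SECS))
		fprintf(stderr, "BMC not responding after %d seconds\n",
			MBOX_BMC_TIMEOUT_SECS);
	return 0;
}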
> Is there anything more useful you can tell the administrator in the
> error message? Should they ensure the BMC is running? Power cycle it?
>
I could add a mention that they need to power cycle their BMC, but if
things have gone that bad... who knows what state the host is in...
>
> > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > prlog(PR_ERR, "Move window is indicating size zero!\n");
> > prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > + /*
> > + * In practice skiboot gets stuck and this eventually
> > + * brings down the host. Just fail, pass the error back
> > + * up, and hope someone makes a good decision.
> > + */
> > + return MBOX_R_SYSTEM_ERROR;
> > }
> >
> > return rc;
>
> What call chains lead here? Will the callers handle the error properly?
> (e.g., will this give something like EIO writing to /dev/nvram?)
>
/dev/nvram is a bit of a special case since skiboot buffers it: a
failed attempt to write it back would fail silently (as far as Linux
is concerned).
/dev/mtd via the powernv_flash driver would give EIO (or some other
failure), and pflash does get a failed read when trying to open the
FFS header from host userspace, as tested by Pridhiviraj.
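Roughly, the propagation path being described looks like the sketch
below; every helper name is made up (this isn't the real skiboot or
powernv_flash code), it just shows a failed window move cascading up
until the host driver hands userspace an EIO:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Skiboot side (hypothetical): the BMC offered a zero-sized window. */
static int window_move_sketch(bool bmc_gave_zero_window)
{
	if (bmc_gave_zero_window)
		return -1;	/* fail instead of spinning forever */
	return 0;
}

static int flash_read_sketch(void)
{
	/* The error simply cascades; no retries, no busy loop. */
	return window_move_sketch(true);
}

/* Host driver side (hypothetical): flash-layer failure becomes an errno. */
static int mtd_read_sketch(void)
{
	return flash_read_sketch() ? -EIO : 0;
}

int main(void)
{
	/* Userspace (e.g. pflash reading the FFS header) just sees EIO. */
	if (mtd_read_sketch() == -EIO)
		fprintf(stderr, "mtd read failed with EIO\n");
	return 0;
}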
Cyril
> Thanks,
> Nick
>