[Skiboot] [PATCH] mbox: Harden against BMC daemon errors
Cyril Bur
cyril.bur at au1.ibm.com
Thu Mar 22 15:31:46 AEDT 2018
On Thu, 2018-03-22 at 14:01 +1000, Nicholas Piggin wrote:
> Thanks!
>
> On Thu, 22 Mar 2018 14:32:35 +1100
> Cyril Bur <cyril.bur at au1.ibm.com> wrote:
>
> > Bugs present in the BMC daemon mean that skiboot gets presented with
> > mbox windows of size zero. These windows cannot be valid and skiboot
> > already detects these conditions.
> >
> > Currently skiboot warns quite strongly about the occurrence of these
> > problems. The problem for skiboot is that it doesn't take any action.
> > Initially I wanted to avoid putting policy like this into skiboot, but
> > since these bugs aren't going away, and skiboot barfing is leading to
> > lockups and ultimately to the host going down, something needs to be done.
> >
> > I propose that when we detect the problem we fail the mbox call and punt
> > the problem back up to Linux. I don't like it but at least it will cause
> > errors to cascade and won't bring the host down. I'm not sure how Linux
> > is supposed to detect this or what it can even do but this is better
> > than a crash.
> >
> > Diagnosing a failure to boot if skiboot itself fails to read flash may
> > be marginally more difficult with this patch. This is because skiboot
> > will now only print one warning about the zero sized window rather than
> > continuously spitting it out.
> >
> > Reported-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > Tested-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> > Signed-off-by: Cyril Bur <cyril.bur at au1.ibm.com>
> > ---
> > libflash/mbox-flash.c | 11 ++++++++---
> > 1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> > index 4a3c53f5..70f43f36 100644
> > --- a/libflash/mbox-flash.c
> > +++ b/libflash/mbox-flash.c
> > @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
> > {
> > unsigned long last = 1, start = tb_to_secs(mftb());
> > prlog(PR_TRACE, "Waiting for BMC\n");
> > - while (mbox_flash->busy && timeout_sec) {
> > + while (mbox_flash->busy && timeout_sec > last) {
> > long now = tb_to_secs(mftb());
> > if (now - start > last) {
> > - timeout_sec--;
> > - last = now - start;
> > if (last < timeout_sec / 2)
> > prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
> > else
> > prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> > + last++;
> > }
> > /*
> > * Both functions are important.
>
> This looks like another change got pulled in. Was this intentional?
> It doesn't seem like it changes the logic.
>
Yeah, I messed with it a bit and thought I'd throw that in here; no
change in behaviour. I wanted to start some discussion around exactly
what you're saying, so that's good.
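For reference, here's a standalone sketch of the reshaped loop (not the
actual wait_for_bmc(); time_now_secs() stands in for tb_to_secs(mftb())
and printf() for prlog()). It's only meant to show that `last` now
counts elapsed whole seconds up to timeout_sec instead of decrementing
a remaining budget:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Stand-in for tb_to_secs(mftb()) in skiboot. */
static unsigned long time_now_secs(void)
{
	return (unsigned long)time(NULL);
}

/* Sketch only: `last` counts whole elapsed seconds and the loop
 * gives up once it reaches timeout_sec. */
static void wait_sketch(volatile bool *busy, unsigned int timeout_sec)
{
	unsigned long last = 1, start = time_now_secs();

	while (*busy && timeout_sec > last) {
		unsigned long now = time_now_secs();

		if (now - start > last) {
			/* Quiet early on, loud once past the halfway mark. */
			if (last < timeout_sec / 2)
				printf("Been waiting for the BMC for %lu secs\n", last);
			else
				printf("BMC NOT RESPONDING %lu second wait\n", last);
			last++;
		}
		/* The real loop also polls the BMC for a response here. */
	}
}

int main(void)
{
	bool busy = false;	/* pretend the BMC already answered */

	wait_sketch(&busy, 3);
	printf("done waiting\n");
	return 0;
}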
> While I'm looking at this code... wow, 30 seconds is a long time to
> wait in firmware with interrupts disabled; 3 seconds is too. For
> anything outside crashing, hardware failure handling, and genuine
> offline operations (e.g., flashing firmware), we should be able to
> count the milliseconds spent in firmware.
>
> If this is only for writing flash, well then a few seconds seems fine.
> What's the minimum reasonable value? Something under 10 seconds would
> be good so it doesn't trip the hard lockup watchdog. 3 seconds?
>
Well, as discussed before, all bets are off now since it seems the BMC
won't tell us about problems in any kind of timely manner. I could move
that hunk into a patch that just hardcodes 3 seconds everywhere (I'm
glad we put effort into putting it into the mbox protocol so that the
BMC wouldn't do it).
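Roughly, a sketch of what that could look like (the constant name and
the do_bmc_wait() stub are made up for illustration; only
wait_for_bmc()'s timeout parameter comes from the hunk above):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical single timeout for every BMC wait. */
#define MBOX_BMC_TIMEOUT_SECS	3	/* well under a ~10s hard lockup watchdog */

/* Stand-in for wait_for_bmc(mbox_flash, timeout_sec). */
static int do_bmc_wait(bool *busy, unsigned int timeout_sec)
{
	(void)timeout_sec;	/* the real code polls until this many seconds elapse */
	return *busy ? -1 : 0;
}

int main(void)
{
	bool busy = false;

	/* Every call site passes the same fixed value. */
	if (do_bmc_wait(&busy, MBOX_BMC_TIMEOUT_SECS))
		fprintf(stderr, "BMC not responding after %d seconds\n",
			MBOX_BMC_TIMEOUT_SECS);
	return 0;
}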
> Is there anything more useful you can tell the administrator in the
> error message? Should they ensure the BMC is running? Power cycle it?
>
I could add a mention that they need to power cycle their BMC, but if
things have gone that bad... who knows what state the host is in...
>
> > @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
> > prlog(PR_ERR, "Move window is indicating size zero!\n");
> > prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
> > prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> > + /*
> > + * In practice skiboot gets stuck and this eventually
> > + * brings down the host. Just fail, pass the error back
> > + * up, and hope someone makes a good decision.
> > + */
> > + return MBOX_R_SYSTEM_ERROR;
> > }
> >
> > return rc;
>
> What call chains lead here? Will the callers handle the error properly?
> (e.g., will this give something like EIO writing to /dev/nvram?)
>
/dev/nvram is a bit of a special case since skiboot buffers it: a
failed attempt to write it back would fail silently (as far as Linux
is concerned).
/dev/mtd via the powernv_flash driver would give EIO (or some other
failure), and pflash does get a failed read when trying to open the
FFS header from host userspace, as tested by Pridhiviraj.
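Roughly, the propagation path being described looks like the sketch
below; every helper name is made up (this isn't the real skiboot or
powernv_flash code), it just shows a failed window move cascading up
until the host driver hands userspace an EIO:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Skiboot side (hypothetical): the BMC offered a zero-sized window. */
static int window_move_sketch(bool bmc_gave_zero_window)
{
	if (bmc_gave_zero_window)
		return -1;	/* fail instead of spinning forever */
	return 0;
}

static int flash_read_sketch(void)
{
	/* The error simply cascades; no retries, no busy loop. */
	return window_move_sketch(true);
}

/* Host driver side (hypothetical): flash-layer failure becomes an errno. */
static int mtd_read_sketch(void)
{
	return flash_read_sketch() ? -EIO : 0;
}

int main(void)
{
	/* Userspace (e.g. pflash reading the FFS header) just sees EIO. */
	if (mtd_read_sketch() == -EIO)
		fprintf(stderr, "mtd read failed with EIO\n");
	return 0;
}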
Cyril
> Thanks,
> Nick
>