[Skiboot] [PATCH] mbox: Harden against BMC daemon errors

Thu Mar 22 15:01:55 AEDT 2018

Thanks!

On Thu, 22 Mar 2018 14:32:35 +1100
Cyril Bur <cyril.bur at au1.ibm.com> wrote:

> Bugs present in the BMC daemon mean that skiboot gets presented with
> mbox windows of size zero. These windows cannot be valid and skiboot
> already detects these conditions.
> 
> Currently skiboot warns quite strongly about the occurrence of these
> problems. The problem for skiboot is that it doesn't take any action.
> Initially I wanting to avoid putting policy like this into skiboot but
> since these bugs aren't going away and skiboot barfing is leading to
> lockups and ultimately the host going down something needs to be done.
> 
> I propose that when we detect the problem we fail the mbox call and punt
> the problem back up to Linux. I don't like it but at least it will cause
> errors to cascade and won't bring the host down. I'm not sure how Linux
> is supposed to detect this or what it can even do but this is better
> than a crash.
> 
> Diagnosing a failure to boot if skiboot its self fails to read flash may
> be marginally more difficult with this patch. This is because skiboot
> will now only print one warning about the zero sized window rather than
> continuously spitting it out.
> 
> Reported-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> Tested-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> Signed-off-by: Cyril Bur <cyril.bur at au1.ibm.com>
> ---
>  libflash/mbox-flash.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
> index 4a3c53f5..70f43f36 100644
> --- a/libflash/mbox-flash.c
> +++ b/libflash/mbox-flash.c
> @@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
>  {
>  	unsigned long last = 1, start = tb_to_secs(mftb());
>  	prlog(PR_TRACE, "Waiting for BMC\n");
> -	while (mbox_flash->busy && timeout_sec) {
> +	while (mbox_flash->busy && timeout_sec > last) {
>  		long now = tb_to_secs(mftb());
>  		if (now - start > last) {
> -			timeout_sec--;
> -			last = now - start;
>  			if (last < timeout_sec / 2)
>  				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
>  			else
>  				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
> +			last++;
>  		}
>  		/*
>  		 * Both functions are important.

This looks like another change got pulled in. Was this intentional?
It doesn't seem like it changes the logic.

While I'm looking at this code... wow, 30 seconds is a long time to
wait in firmware with interrupts disabled. 3 seconds is too. Anything
outside crashing, hardware failure handling, and genuine offline
operations (e.g., flashing firmware) and we should be able to count
the milliseconds in firmware.

If this is only for writing flash, well then a few seconds seems fine.
What's the minimum reasonable value? Something under 10 seconds would
be good so it doesn't trip the hard lockup watchdog. 3 seconds?

Is there anything more useful you can tell the administrator in the
error message? Should they ensure the BMC is running? Power cycle it?

> @@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
>  		prlog(PR_ERR, "Move window is indicating size zero!\n");
>  		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
>  		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
> +		/*
> +		 * In practice skiboot gets stuck and this eventually
> +		 * brings down the host. Just fail pass the error back
> +		 * up and hope someone makes a good decision
> +		 */
> +		return MBOX_R_SYSTEM_ERROR;
>  	}
>  
>  	return rc;

What call chains lead here? Will the callers handle the error properly?
(e.g., will this give something like EIO writing to /dev/nvram?)

Thanks,
Nick