[Skiboot] [PATCH] mbox: Harden against BMC daemon errors

Cyril Bur cyril.bur at au1.ibm.com
Thu Mar 22 14:32:35 AEDT 2018


Bugs present in the BMC daemon mean that skiboot gets presented with
mbox windows of size zero. These windows cannot be valid and skiboot
already detects these conditions.

Currently skiboot warns quite strongly about the occurrence of these
problems. The problem for skiboot is that it doesn't take any action.
Initially I wanted to avoid putting policy like this into skiboot, but
since these bugs aren't going away and skiboot barfing is leading to
lockups and ultimately the host going down, something needs to be done.

I propose that when we detect the problem we fail the mbox call and punt
the problem back up to Linux. I don't like it but at least it will cause
errors to cascade and won't bring the host down. I'm not sure how Linux
is supposed to detect this or what it can even do but this is better
than a crash.
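
As a rough illustration of the proposal (not the skiboot code itself), the
idea is that a zero-sized window is treated as a hard failure of the call
rather than something to warn about and carry on with. Every name below is
hypothetical, and the return codes are placeholders standing in for the real
MBOX_R_* values:

```c
#include <stdint.h>

/* Placeholder return codes for this sketch only; the real MBOX_R_*
 * values are defined in skiboot and are not reproduced here. */
#define SKETCH_R_SUCCESS      0
#define SKETCH_R_SYSTEM_ERROR 1

/* Hypothetical, cut-down view of a window as reported by the BMC. */
struct sketch_window {
	uint32_t cur_pos;
	uint32_t size;
};

/*
 * Validate the window the BMC handed back. A size of zero can never be
 * valid, so fail the call and let the caller propagate the error
 * instead of continuing with an unusable window.
 */
static int sketch_window_move(const struct sketch_window *win,
			      uint64_t pos, uint64_t len)
{
	uint64_t end = (uint64_t)win->cur_pos + win->size;

	if (win->size == 0)
		return SKETCH_R_SYSTEM_ERROR;

	/* The request must start inside the window the BMC gave us. */
	if (len == 0 || pos < win->cur_pos || pos >= end)
		return SKETCH_R_SYSTEM_ERROR;

	return SKETCH_R_SUCCESS;
}
```

A caller would check the return value and pass any failure straight back up,
which is what lets the error cascade to Linux rather than wedge skiboot.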

Diagnosing a failure to boot if skiboot itself fails to read flash may
be marginally more difficult with this patch. This is because skiboot
will now only print one warning about the zero-sized window rather than
continuously spitting it out.

Reported-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
Signed-off-by: Cyril Bur <cyril.bur at au1.ibm.com>
---
 libflash/mbox-flash.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/libflash/mbox-flash.c b/libflash/mbox-flash.c
index 4a3c53f5..70f43f36 100644
--- a/libflash/mbox-flash.c
+++ b/libflash/mbox-flash.c
@@ -334,15 +334,14 @@ static int wait_for_bmc(struct mbox_flash_data *mbox_flash, unsigned int timeout
 {
 	unsigned long last = 1, start = tb_to_secs(mftb());
 	prlog(PR_TRACE, "Waiting for BMC\n");
-	while (mbox_flash->busy && timeout_sec) {
+	while (mbox_flash->busy && timeout_sec > last) {
 		long now = tb_to_secs(mftb());
 		if (now - start > last) {
-			timeout_sec--;
-			last = now - start;
 			if (last < timeout_sec / 2)
 				prlog(PR_TRACE, "Been waiting for the BMC for %lu secs\n", last);
 			else
 				prlog(PR_ERR, "BMC NOT RESPONDING %lu second wait\n", last);
+			last++;
 		}
 		/*
 		 * Both functions are important.
@@ -709,6 +708,12 @@ static int mbox_window_move(struct mbox_flash_data *mbox_flash,
 		prlog(PR_ERR, "Move window is indicating size zero!\n");
 		prlog(PR_ERR, "pos: 0x%" PRIx64 ", len: 0x%" PRIx64 "\n", pos, len);
 		prlog(PR_ERR, "win pos: 0x%08x win size: 0x%08x\n", win->cur_pos, win->size);
+		/*
+		 * In practice skiboot gets stuck and this eventually
+		 * brings down the host. Just fail, pass the error back
+		 * up and hope someone makes a good decision
+		 */
+		return MBOX_R_SYSTEM_ERROR;
 	}
 
 	return rc;
-- 
2.16.2
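
For readers following the wait_for_bmc() hunk: after the change the timeout
value is left untouched and only the elapsed-seconds counter advances, so the
loop ends once that counter reaches the timeout. The sketch below mirrors
that shape under loud assumptions: the clock is a fake that advances one
"second" per read, the busy flag is simulated, the logging is omitted, and
all names are hypothetical rather than taken from skiboot.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the timebase: each read advances one second. */
struct sketch_clock {
	unsigned long secs;
};

static unsigned long sketch_now(struct sketch_clock *clk)
{
	return clk->secs++;
}

/* Simulated device: it stays busy until the clock reaches done_at. */
static bool sketch_busy(const struct sketch_clock *clk, unsigned long done_at)
{
	return clk->secs < done_at;
}

/*
 * Wait while the device is busy, giving up once the elapsed-seconds
 * counter catches up with timeout_sec. Returns 0 if the device stopped
 * being busy and -1 if the wait timed out.
 */
static int sketch_wait(struct sketch_clock *clk, unsigned long done_at,
		       unsigned int timeout_sec)
{
	unsigned long last = 1, start = sketch_now(clk);

	while (sketch_busy(clk, done_at) && timeout_sec > last) {
		unsigned long now = sketch_now(clk);

		if (now - start > last) {
			/* A progress or error message would be logged here. */
			last++;
		}
	}

	return sketch_busy(clk, done_at) ? -1 : 0;
}
```

Because timeout_sec is no longer decremented, a comparison such as
last < timeout_sec / 2 is made against a fixed threshold for the whole wait.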


