[Skiboot] [PATCH 1/2] eSEL: Make sure PANIC logs are sent to BMC before calling assert

Fri Mar 6 00:23:19 AEDT 2020

Disclaimer: I'm far from being acquainted with the code (actually first 
time looking into it), but I did spent some time trying to figure it out..

On 3/4/2020 5:58 AM, Vasant Hegde wrote:
> eSEL logs are split into multiple smaller chunks and sent to BMC.
> We use ipmi_queue_msg_sync() interface for sending OPAL_ERROR_PANIC
> severity events to BMC. But callback handler (ipmi_cmd_done()) clears
> 'sync_msg' after getting response to first chunk as its not aware that
> we have more data to send.
> 
> So in assert()/checkstop path we may endup checkstoping system before
> error log is sent to BMC completely. We will miss useful error log.
> 
> This patch introduces new wait loop in ipmi_elog_commit(). It will wait
> until error log is sent to BMC. I think this is safe because even if
> something goes wrong (like BMC reset) we will hit timeout and eventually
> we will come out of this loop.

FWIW, I couldn't find a better way to do this as well

> Alternatively we can add additional check in ipmi_cmd_done() path. But
> I don't wanted to make this path aware of message type.
> 
>   /* OEM SEL fields */
>   #define SEL_OEM_ID_0		0x55
> @@ -434,10 +435,22 @@ int ipmi_elog_commit(struct errorlog *elog_buf)
> 
>   	msg->error = ipmi_elog_error;
>   	msg->req_size = 0;
> -	if (elog_buf->event_severity == OPAL_ERROR_PANIC)
> +	if (elog_buf->event_severity == OPAL_ERROR_PANIC) {
>   		ipmi_queue_msg_sync(msg);
> -	else
> +
> +		/*
> +		 * eSEL logs are split into multiple smaller chunks and sent
> +		 * to BMC. Lets wait until we finish sending all the chunks
> +		 * to BMC.
> +		 */
> +		while (ipmi_sel_panic_msg.busy != false) {
> +			if (msg->backend->poll)
> +				msg->backend->poll();
I think this works for msg-> because PANIC msg are never de-allocated, 
even if ipmi_sel_free_msg(msg) may have raced with the conditionals 
above it.

But is there a chance there could be a race between ipmi_sel_free_msg() 
setting ->busy to false and another caller/thread immediately setting it 
back to true on a new elog_commit()->ipmi_sel_alloc_msg()? That could 
also mean that this new PANIC is still not expired on the ipmi/bt stack 
(i.e., in theory could cause this loop never to finish if there's always 
a new caller with PANIC waiting to be scheduled, so classify that under 
"nitpicking")

> +			time_wait_ms(10);
> +		}
> +	} else {
>   		ipmi_queue_msg(msg);
> +	}
> 
>   	return 0;
>   }
> 

Thanks,

  -Klaus

-- 
Klaus Heinrich Kiwi <klaus at linux.vnet.ibm.com>