[Skiboot] [PATCH] FSP/CONSOLE: Workaround for unresponsive ipmi daemon

Vasant Hegde hegdevasant at linux.vnet.ibm.com
Thu Jun 8 16:54:39 AEST 2017


On 06/07/2017 12:20 PM, Vasant Hegde wrote:
> We use a TCE-mapped area to write data to the console. The console
> header (fsp_serbuf_hdr) is modified by both FSP and OPAL: OPAL updates
> the next_in pointer in fsp_serbuf_hdr and FSP updates the next_out pointer.
>
> The kernel makes the opal_console_write() OPAL call to write data to the
> console. OPAL writes the data to the TCE-mapped area and sends an MBOX
> command to FSP. If the console buffer becomes full and we still have data
> to write, we keep waiting until FSP reads data.
>
> In some corner cases, FSP is active but not responding to console MBOX
> messages (due to a buggy IPMI daemon) while the kernel is writing heavily
> to the console, so the console buffer eventually becomes full. At this
> point OPAL starts returning OPAL_BUSY_EVENT to the kernel, and the kernel
> keeps retrying. This causes kernel soft lockups. In extreme cases, when
> every CPU is trying to write to the console, the user cannot ssh in and
> concludes that the system has hung.
>
> If we reset FSP or restart the IPMI daemon on FSP, the system recovers
> and everything returns to normal.
>
> This patch works around the above issue by returning OPAL_HARDWARE
> when the console is full. A side effect is that we may end up dropping
> the latest console data, but dropping console data is better than a hung system.
>
> An alternative approach is to drop old data from the console buffer to
> make space for new data. But under normal conditions only FSP can update
> the 'next_out' pointer, and if we touch that pointer we may introduce
> other race conditions. Hence we decided to just drop new console write requests.
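
For readers following the thread, the behaviour described above can be sketched roughly as below. This is not the patch itself, only a minimal illustration under stated assumptions: the struct, the tiny buffer size, the `drop_when_full` switch and all `con_sketch_*` / `CON_*` names are hypothetical, and the return-code values are what I believe skiboot's opal-api.h defines (please verify against the tree). The producer only ever advances next_in; next_out is left to the consumer, which is why old data is not discarded.

```c
#include <stddef.h>
#include <stdint.h>

/* Return codes; values believed to match opal-api.h (assumption). */
#define CON_SUCCESS      0
#define CON_HARDWARE    -6
#define CON_BUSY_EVENT  -12

#define CON_SKETCH_SIZE 8u	/* hypothetical, tiny for illustration */

/* Hypothetical stand-in for a console ring buffer; not fsp_serbuf_hdr. */
struct con_sketch {
	uint16_t next_in;	/* advanced only by the producer (OPAL) */
	uint16_t next_out;	/* advanced only by the consumer (FSP) */
	uint8_t data[CON_SKETCH_SIZE];
	int drop_when_full;	/* hypothetical: 1 = workaround behaviour */
};

/* Free bytes; one slot stays empty so "full" and "empty" differ. */
static size_t con_sketch_space(const struct con_sketch *c)
{
	return (c->next_out + CON_SKETCH_SIZE - c->next_in - 1) %
		CON_SKETCH_SIZE;
}

/*
 * Write up to *len bytes and store the number accepted back in *len.
 * When the buffer is full, either ask the caller to retry
 * (CON_BUSY_EVENT) or reject the new data (CON_HARDWARE).
 */
static int64_t con_sketch_write(struct con_sketch *c, const uint8_t *buf,
				size_t *len)
{
	size_t space = con_sketch_space(c);
	size_t n = *len < space ? *len : space;
	size_t i;

	if (space == 0) {
		*len = 0;
		return c->drop_when_full ? CON_HARDWARE : CON_BUSY_EVENT;
	}

	for (i = 0; i < n; i++) {
		c->data[c->next_in] = buf[i];
		c->next_in = (uint16_t)((c->next_in + 1) % CON_SKETCH_SIZE);
	}
	*len = n;
	return CON_SUCCESS;
}
```

With `drop_when_full` set, a caller that gets CON_HARDWARE gives up on that write instead of spinning, which is the trade-off discussed above: the newest console data is lost, but the writer does not loop forever on a consumer that never drains the buffer.
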

Stewart,

We have to backport this patch to the 860.30 release as well.

I think it will apply cleanly on the 830.30 branch. Let me know if you want
me to send a backported patch for the 860.30 series.

-Vasant



