[Skiboot] [PATCH v2] FSP/CONSOLE: Workaround for unresponsive ipmi daemon

Stewart Smith stewart at linux.vnet.ibm.com
Wed Jun 14 17:00:09 AEST 2017


Vasant Hegde <hegdevasant at linux.vnet.ibm.com> writes:
> We use a TCE-mapped area to write data to the console. The console
> header (fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
> the next_in pointer in fsp_serbuf_hdr and FSP updates the next_out
> pointer).
>
> The kernel makes the opal_console_write() OPAL call to write data to
> the console. OPAL writes the data to the TCE-mapped area and sends an
> MBOX command to FSP. If our console buffer becomes full and we have
> more data to write, we keep waiting until FSP reads data.
>
> In some corner cases, where FSP is active but not responding to
> console MBOX messages (due to a buggy IPMI daemon) and the kernel is
> writing heavily to the console, our console buffer eventually becomes
> full. At this point OPAL starts returning OPAL_BUSY_EVENT to the
> kernel, and the kernel keeps retrying. This causes kernel soft
> lockups. In the extreme case where every CPU is trying to write to the
> console, the user cannot ssh in and thinks the system has hung.
>
> If we reset FSP or restart the IPMI daemon on FSP, the system recovers
> and everything returns to normal.
>
> This patch works around the above issue by returning OPAL_HARDWARE
> when the console is full. A side effect of this patch is that we may
> end up dropping the latest console data, but it is better to drop
> console data than to hang the system.
>
> An alternative approach is to drop old data from the console buffer to
> make space for new data. But under normal conditions only FSP can
> update the 'next_out' pointer, and if we touch that pointer we may
> introduce other race conditions. Hence we decided to just drop the new
> console write request.
>
> Signed-off-by: Vasant Hegde <hegdevasant at linux.vnet.ibm.com>
> Acked-by: Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com>
> ---
> @Vaidy, Stewart,
>   As suggested, I've added an error log message. As Vaidy suggested, it may
>   not be a good idea to reset FSP, hence I'm not initiating a Host Initiated
>   Reset.
>
>   Also I've retained Vaidy's Ack from V1.
> -Vasant

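For anyone reading this in the archive later, the behaviour described above
can be sketched roughly as below. This is a minimal illustration, not the
code from the patch: struct ser_ring, ring_space(), ring_write(), ring_drain()
and RING_SIZE are hypothetical names, the return code values are assumed
rather than taken from the headers, and the real fsp_serbuf_hdr has more
fields than shown. It only demonstrates the split ownership of next_in and
next_out and the "fail the new write when full" policy.

```c
#include <stddef.h>
#include <stdint.h>

/* Return codes: names as used in the patch description; the numeric
 * values are assumptions made for this sketch. */
#define OPAL_SUCCESS     0
#define OPAL_HARDWARE   (-6)
#define OPAL_BUSY_EVENT (-12)   /* what a full buffer used to return */

#define RING_SIZE 16            /* hypothetical; small so "full" is easy to hit */

/* Hypothetical stand-in for the shared console header plus its buffer. */
struct ser_ring {
    uint16_t next_in;           /* producer index, advanced by OPAL only */
    uint16_t next_out;          /* consumer index, advanced by FSP only */
    uint8_t  data[RING_SIZE];
};

/* Free bytes. One slot is kept empty so next_in == next_out means "empty". */
static size_t ring_space(const struct ser_ring *r)
{
    return (size_t)(r->next_out + RING_SIZE - r->next_in - 1) % RING_SIZE;
}

/*
 * Producer (OPAL) side. Returns the number of bytes accepted, or
 * OPAL_HARDWARE when the ring is full. The new data is dropped and
 * next_out is never touched, so the producer cannot race with the consumer.
 */
static int ring_write(struct ser_ring *r, const uint8_t *buf, size_t len)
{
    size_t space = ring_space(r);

    if (len > 0 && space == 0)
        return OPAL_HARDWARE;   /* previously: OPAL_BUSY_EVENT, retried forever */

    if (len > space)
        len = space;            /* partial write; caller resubmits the rest */

    for (size_t i = 0; i < len; i++) {
        r->data[r->next_in] = buf[i];
        r->next_in = (uint16_t)((r->next_in + 1) % RING_SIZE);
    }
    return (int)len;
}

/* Consumer (FSP) side stand-in: discards up to max bytes, returns the count. */
static size_t ring_drain(struct ser_ring *r, size_t max)
{
    size_t avail = (size_t)(r->next_in + RING_SIZE - r->next_out) % RING_SIZE;

    if (max > avail)
        max = avail;
    r->next_out = (uint16_t)((r->next_out + max) % RING_SIZE);
    return max;
}
```

With a consumer that never calls ring_drain(), the old behaviour would leave
the caller retrying indefinitely; in this sketch it gets a hard error instead
and can move on, at the cost of losing that console output.
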
Okay... let's see how this goes in practice (it's certainly the
simplest solution). It's managed to survive a bunch of
op-test-framework tests, which is more than can be said for some service
processors' console implementations.

Merged to master as of c8a7535f3539c79955645e6b3714b367a994b1e9
and 5.4.x as of 316f99bdb4e0911c2d3970a8ca23f30101dba57a
-- 
Stewart Smith
OPAL Architect, IBM.


