[Skiboot] [PATCH v2] FSP/CONSOLE: Workaround for unresponsive ipmi daemon
stewart at linux.vnet.ibm.com
Wed Jun 14 17:00:09 AEST 2017
Vasant Hegde <hegdevasant at linux.vnet.ibm.com> writes:
> We use a TCE-mapped area to write data to the console. The console
> header (fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
> the next_in pointer in fsp_serbuf_hdr and FSP updates the next_out
> pointer).
> The kernel makes the opal_console_write() OPAL call to write data to
> the console. OPAL writes the data to the TCE-mapped area and sends an
> MBOX command to FSP. If our console becomes full and we have data to
> write to it, we keep waiting until FSP reads data.
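
To make the next_in/next_out split concrete, here is a minimal sketch of a single-producer, single-consumer ring buffer of the kind described above. It is not the skiboot code: struct sketch_serbuf, SKETCH_BUF_SIZE and the sketch_* functions are hypothetical names, and the real fsp_serbuf_hdr layout and buffer size are not given in this mail. It only shows why a consumer that stops advancing next_out leaves the producer with no space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BUF_SIZE 16u /* hypothetical size, kept small for illustration */

/* Hypothetical stand-in for fsp_serbuf_hdr plus its data area. */
struct sketch_serbuf {
	uint16_t next_in;  /* advanced only by the producer (OPAL's role) */
	uint16_t next_out; /* advanced only by the consumer (FSP's role) */
	uint8_t data[SKETCH_BUF_SIZE];
};

/*
 * Bytes the producer may still write. One slot is always left empty so
 * that next_in == next_out means "empty" and never "full".
 */
size_t sketch_space(const struct sketch_serbuf *sb)
{
	assert(sb->next_in < SKETCH_BUF_SIZE && sb->next_out < SKETCH_BUF_SIZE);
	return (sb->next_out + SKETCH_BUF_SIZE - sb->next_in - 1) % SKETCH_BUF_SIZE;
}

/* Producer side: copy as much as fits, touching next_in only. */
size_t sketch_write(struct sketch_serbuf *sb, const uint8_t *buf, size_t len)
{
	size_t space = sketch_space(sb);
	size_t n = len < space ? len : space;

	for (size_t i = 0; i < n; i++) {
		sb->data[sb->next_in] = buf[i];
		sb->next_in = (uint16_t)((sb->next_in + 1) % SKETCH_BUF_SIZE);
	}
	return n;
}

/* Consumer side: drain up to len bytes, touching next_out only. */
size_t sketch_read(struct sketch_serbuf *sb, uint8_t *buf, size_t len)
{
	size_t n = 0;

	while (n < len && sb->next_out != sb->next_in) {
		buf[n++] = sb->data[sb->next_out];
		sb->next_out = (uint16_t)((sb->next_out + 1) % SKETCH_BUF_SIZE);
	}
	return n;
}
```

In this sketch, once sketch_space() reaches zero, sketch_write() accepts nothing until sketch_read() runs, which mirrors the producer waiting on FSP.
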
> In some corner cases, where FSP is active but not responding to
> console MBOX messages (due to a buggy IPMI daemon) and the kernel is
> writing heavily to the console, our console buffer eventually
> becomes full. At this point OPAL starts returning OPAL_BUSY_EVENT to
> the kernel, and the kernel keeps retrying. This causes kernel soft
> lockups. In extreme cases, when every CPU is trying to write to the
> console, the user cannot ssh in and thinks the system has hung.
> If we reset FSP or restart the IPMI daemon on FSP, the system recovers
> and everything returns to normal.
> This patch works around the above issue by returning OPAL_HARDWARE
> when the console is full. A side effect of this patch is that we may
> end up dropping the latest console data. But it is better to drop
> console data than to hang the system.
> An alternative approach is to drop old data from the console buffer to
> make space for new data. But under normal conditions only FSP can
> update the 'next_out' pointer, and if we touch that pointer it may
> introduce other race conditions. Hence we decided to just drop the new
> console write request.
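
The behavioural change can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the merged patch: struct sketch_console, sketch_console_write() and the workaround flag are hypothetical, the SKETCH_OPAL_* values are assumed to match skiboot's opal-api.h, and the real write path may accept partial writes or apply other conditions before giving up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes; values assumed to match skiboot's opal-api.h. */
#define SKETCH_OPAL_SUCCESS    0
#define SKETCH_OPAL_HARDWARE   (-6)
#define SKETCH_OPAL_BUSY_EVENT (-12)

/* Hypothetical console state, reduced to what the sketch needs. */
struct sketch_console {
	size_t space;     /* free bytes left in the console buffer */
	bool full_logged; /* true once the current "full" episode is logged */
	unsigned logs;    /* number of error-log entries raised */
	unsigned dropped; /* number of write requests dropped */
};

/*
 * Write path. With workaround == false the caller is told to retry
 * (the behaviour that led to soft lockups); with workaround == true
 * the new request is dropped and a hardware error is returned.
 * next_out is never touched here, so the consumer keeps sole ownership.
 */
int sketch_console_write(struct sketch_console *con, size_t len, bool workaround)
{
	assert(con != NULL);

	if (len <= con->space) {
		con->space -= len;
		con->full_logged = false; /* buffer accepted data again */
		return SKETCH_OPAL_SUCCESS;
	}

	if (!workaround)
		return SKETCH_OPAL_BUSY_EVENT; /* caller retries indefinitely */

	if (!con->full_logged) {
		con->logs++; /* stand-in for raising one error log entry */
		con->full_logged = true;
	}
	con->dropped++; /* the newest data is lost, old data is kept */
	return SKETCH_OPAL_HARDWARE;
}
```

Dropping the new request rather than old data keeps the producer from ever modifying next_out, which is the race the description above wants to avoid.
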
> Signed-off-by: Vasant Hegde <hegdevasant at linux.vnet.ibm.com>
> Acked-by: Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com>
> @Vaidy, Stewart,
> As suggested, I've added an error log message. As Vaidy suggested, it
> may not be a good idea to reset FSP, so I'm not initiating a
> host-initiated reset.
> Also, I've retained Vaidy's Ack from V1.
Okay... let's see how this goes in practice (it's certainly
the simplest solution). It's managed to survive a bunch of
op-test-framework tests, which is more than can be said for some service
processors' console implementations.
Merged to master as of c8a7535f3539c79955645e6b3714b367a994b1e9
and 5.4.x as of 316f99bdb4e0911c2d3970a8ca23f30101dba57a
OPAL Architect, IBM.