[Skiboot] [PATCH] FSP/CONSOLE: Workaround for unresponsive ipmi daemon

Vasant Hegde hegdevasant at linux.vnet.ibm.com
Fri Jun 9 15:19:19 AEST 2017


On 06/09/2017 10:40 AM, Stewart Smith wrote:
> Vasant Hegde <hegdevasant at linux.vnet.ibm.com> writes:
>
>> We use a TCE-mapped area to write data to the console. The console
>> header (fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
>> the next_in pointer in fsp_serbuf_hdr and FSP updates the next_out
>> pointer).
>>
>> The kernel makes the opal_console_write() OPAL call to write data to
>> the console. OPAL writes the data to the TCE-mapped area and sends an
>> MBOX command to FSP. If the console buffer becomes full while we still
>> have data to write, we keep waiting until FSP reads data.
>>
>> In some corner cases, where FSP is active but not responding to
>> console MBOX messages (due to a buggy IPMI daemon) and the kernel is
>> writing heavily to the console, our console buffer eventually becomes
>> full. At this point OPAL starts returning OPAL_BUSY_EVENT to the
>> kernel, and the kernel keeps retrying. This causes kernel soft
>> lockups. In extreme cases, when every CPU is trying to write to the
>> console, the user cannot ssh in and thinks the system has hung.
>>
>> If we reset FSP or restart the IPMI daemon on FSP, the system recovers
>> and everything returns to normal.
>>
>> This patch works around the above issue by returning OPAL_HARDWARE
>> when the console is full. A side effect of this patch is that we may
>> end up dropping the latest console data. But it is better to drop
>> console data than to hang the system.
>>
>> An alternative approach is to drop old data from the console buffer to
>> make space for new data. But under normal conditions only FSP updates
>> the 'next_out' pointer, and if we touch that pointer we may introduce
>> other race conditions. Hence we decided to just drop the new console
>> write request.
>>
>> Signed-off-by: Vasant Hegde <hegdevasant at linux.vnet.ibm.com>
>> ---
>>  hw/fsp/fsp-console.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/fsp/fsp-console.c b/hw/fsp/fsp-console.c
>> index fd67b20..a7b0281 100644
>> --- a/hw/fsp/fsp-console.c
>> +++ b/hw/fsp/fsp-console.c
>> @@ -610,7 +610,7 @@ static int64_t fsp_console_write(int64_t term_number, int64_t *length,
>>  	*length = written;
>>  	unlock(&fsp_con_lock);
>>
>> -	return written ? OPAL_SUCCESS : OPAL_BUSY_EVENT;
>> +	return written ? OPAL_SUCCESS : OPAL_HARDWARE;
>>  }
>>
>>  static int64_t fsp_console_write_buffer_space(int64_t term_number,
>
> I've been thinking about this problem a bit... and I'm not quite
> convinced that this is the best solution. This would have us start
> dropping console output fairly soon after the FSP slows down or stops
> responding for a bit.....

We have a 128K buffer, so a slow response from FSP is fine. We hit this issue
only when FSP stops responding to console messages.
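
To make the buffer-full condition concrete, here is a minimal sketch of a
ring buffer with one producer (OPAL, moving next_in) and one consumer (FSP,
moving next_out). It is a simplified model, not the skiboot code: the struct
layout, the sketch_* names and the 16-byte size are assumptions chosen for
illustration (the real buffer is 128K).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; not the real fsp_serbuf_hdr layout. */
#define SKETCH_BUF_SIZE 16u

struct sketch_serbuf {
	uint16_t next_in;	/* advanced only by the producer (OPAL) */
	uint16_t next_out;	/* advanced only by the consumer (FSP) */
	uint8_t data[SKETCH_BUF_SIZE];
};

/* Free bytes. One slot stays unused so next_in == next_out means "empty". */
static size_t sketch_space(const struct sketch_serbuf *sb)
{
	return (sb->next_out + SKETCH_BUF_SIZE - sb->next_in - 1) %
		SKETCH_BUF_SIZE;
}

/* Copy as much as fits and return the byte count; 0 means "buffer full". */
static size_t sketch_write(struct sketch_serbuf *sb, const uint8_t *buf,
			   size_t len)
{
	size_t space = sketch_space(sb);
	size_t n = len < space ? len : space;
	size_t i;

	assert(sb->next_in < SKETCH_BUF_SIZE);
	for (i = 0; i < n; i++) {
		sb->data[sb->next_in] = buf[i];
		sb->next_in = (sb->next_in + 1) % SKETCH_BUF_SIZE;
	}
	return n;
}
```

As long as the consumer keeps moving next_out, even slowly, sketch_write()
keeps accepting data. It only returns 0 on every call once next_out stops
moving altogether, which is the case the patch is about.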

>
> What about an approach where we start returning OPAL_HARDWARE if we haven't
> seen any progress from the FSP in, say, a second or something? (and

We may still have space in the buffer, right? So let's use it until it becomes full.

> probably also log an error log and/or do a HIR?)

That's a good option: log an error and start a Host Initiated Reset (HIR).
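
A rough sketch of how the two ideas could combine: keep using the buffer
until it is full, keep returning a "busy" code while FSP has only been stuck
for a short time, and switch to a hardware error (plus a reset request) once
it has made no progress for about a second. Everything here is hypothetical:
the sketch_* names, the local return codes, the 1000 ms timeout and the
reset_requested flag stand in for whatever the real code would use. It is
not a tested skiboot change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the OPAL return codes; values are sketch-only. */
enum sketch_rc { SKETCH_SUCCESS, SKETCH_BUSY_EVENT, SKETCH_HARDWARE };

#define SKETCH_STALL_TIMEOUT_MS 1000u	/* assumed "a second or something" */

struct sketch_stall {
	uint16_t last_next_out;	/* consumer pointer seen on the last call */
	uint64_t wait_start_ms;	/* when the buffer was first found full */
	bool waiting;		/* true while we wait on a full buffer */
	bool reset_requested;	/* where real code would log and start HIR */
};

/* Decide what a console write should return after 'written' bytes went in. */
static enum sketch_rc sketch_write_status(struct sketch_stall *st,
					  uint16_t next_out, uint64_t now_ms,
					  size_t written)
{
	assert(st != NULL);

	/* Any sign of progress restarts the stall clock. */
	if (written > 0 || next_out != st->last_next_out) {
		st->waiting = false;
		st->last_next_out = next_out;
	}
	if (written > 0)
		return SKETCH_SUCCESS;

	/* Buffer is full: start timing from the first failed write. */
	if (!st->waiting) {
		st->waiting = true;
		st->wait_start_ms = now_ms;
	}
	if (now_ms - st->wait_start_ms < SKETCH_STALL_TIMEOUT_MS)
		return SKETCH_BUSY_EVENT;	/* caller may retry */

	st->reset_requested = true;		/* FSP looks stuck */
	return SKETCH_HARDWARE;			/* caller drops the data */
}
```

The stall clock starts at the first write that finds the buffer full rather
than at the last time next_out moved, so an idle console followed by a burst
of output is not mistaken for an unresponsive FSP.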

-Vasant


