[Skiboot] [PATCH] FSP/CONSOLE: Workaround for unresponsive ipmi daemon
Vasant Hegde
hegdevasant at linux.vnet.ibm.com
Fri Jun 9 15:19:19 AEST 2017
On 06/09/2017 10:40 AM, Stewart Smith wrote:
> Vasant Hegde <hegdevasant at linux.vnet.ibm.com> writes:
>
>> We use TCE mapped area to write data to console. Console header
>> (fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
>> next_in pointer in fsp_serbuf_hdr and FSP updates next_out pointer).
>>
>> Kernel makes opal_console_write() OPAL call to write data to console.
>> OPAL writes data to the TCE mapped area and sends an MBOX command to the FSP.
>> If our console becomes full and we have data to write to console,
>> we keep on waiting until FSP reads data.
>>
>> In some corner cases, where FSP is active but not responding to
>> console MBOX message (due to buggy IPMI) and we have heavy console
>> write happening from kernel, then eventually our console buffer
>> becomes full. At this point OPAL starts sending OPAL_BUSY_EVENT to
>> kernel. Kernel will keep on retrying. This is creating kernel soft
>> lockups. In extreme cases, when every CPU is trying to write to the
>> console, the user cannot ssh in and thinks the system has hung.
>>
>> If we reset FSP or restart IPMI daemon on FSP, system recovers and
>> everything becomes normal.
>>
>> This patch adds a workaround for the above issue by returning OPAL_HARDWARE
>> when the console is full. A side effect of this patch is that we may end up
>> dropping the latest console data. But it is better to drop console data than
>> to hang the system.
>>
>> An alternative approach is to drop old data from the console buffer to make
>> space for new data. But under normal conditions only the FSP can update the
>> 'next_out' pointer, and if we touch that pointer it may introduce other
>> race conditions. Hence we decided to just drop the new console write request.
>>
>> Signed-off-by: Vasant Hegde <hegdevasant at linux.vnet.ibm.com>
>> ---
>> hw/fsp/fsp-console.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/fsp/fsp-console.c b/hw/fsp/fsp-console.c
>> index fd67b20..a7b0281 100644
>> --- a/hw/fsp/fsp-console.c
>> +++ b/hw/fsp/fsp-console.c
>> @@ -610,7 +610,7 @@ static int64_t fsp_console_write(int64_t term_number, int64_t *length,
>>  	*length = written;
>>  	unlock(&fsp_con_lock);
>>
>> -	return written ? OPAL_SUCCESS : OPAL_BUSY_EVENT;
>> +	return written ? OPAL_SUCCESS : OPAL_HARDWARE;
>>  }
>>
>>  static int64_t fsp_console_write_buffer_space(int64_t term_number,
>
> I've been thinking about this problem a bit... and I'm not quite
> convinced that this is the best solution. This would have us start
> dropping console output fairly soon after the FSP slows down or stops
> responding for a bit.....
We have a 128K buffer. A slow response from the FSP is fine. We hit this issue
only when the FSP stops responding to console messages.
>
> What about an approach where we start returning OPAL_HARDWARE if we haven't
> seen any progress from the FSP in, say, a second or something? (and
We may still have space in the buffer, right? So let's use it until it becomes full.
> probably also log an error log and/or do a HIR?)
That's a good option: log an error and start a Host Initiated Reset (HIR).
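
A rough sketch of how the two ideas might combine: keep accepting data while
the buffer has room, return a busy status while the FSP is merely slow, and
switch to a hardware error only once the buffer is full and 'next_out' has not
moved for about a second. This is not the skiboot implementation. Every name
here (sk_console, sk_write_status, the SK_* codes, the 1000 ms timeout) is
hypothetical; the SK_* values are stand-ins for the OPAL return codes, and the
caller is assumed to supply a millisecond timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in status codes for this sketch (hypothetical names). */
#define SK_SUCCESS      0
#define SK_HARDWARE    -6
#define SK_BUSY_EVENT  -12

/* Assumed stall threshold: "a second or something". */
#define SK_STALL_TIMEOUT_MS 1000

struct sk_console {
	uint32_t last_next_out;    /* next_out value seen on the previous call */
	uint64_t last_progress_ms; /* time at which next_out last changed */
	bool stalled;              /* set once the FSP is considered stuck */
};

/*
 * Decide what to return from a console write.
 *   next_out: current FSP read pointer from the buffer header
 *   now_ms:   caller-supplied monotonic time in milliseconds
 *   written:  number of bytes the write path managed to queue
 */
static int64_t sk_write_status(struct sk_console *c, uint32_t next_out,
			       uint64_t now_ms, uint32_t written)
{
	/* Any movement of next_out counts as progress from the FSP. */
	if (next_out != c->last_next_out) {
		c->last_next_out = next_out;
		c->last_progress_ms = now_ms;
		c->stalled = false;
	}

	/* There was still room in the buffer, so use it. */
	if (written)
		return SK_SUCCESS;

	/* Buffer full and no progress for too long: give up on this write. */
	if (now_ms - c->last_progress_ms >= SK_STALL_TIMEOUT_MS) {
		c->stalled = true; /* a real caller might log an error or start a HIR here */
		return SK_HARDWARE;
	}

	/* Buffer full but the FSP was active recently: ask the caller to retry. */
	return SK_BUSY_EVENT;
}
```

With this shape a slow FSP only ever produces retries, and data is dropped only
after the full buffer has sat untouched for the whole timeout.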
-Vasant