[Skiboot] [PATCH 2/2] core/hmi: Do not call find_core_checkstop_reason() unconditionally

Mon Dec 4 21:54:41 AEDT 2017

On 12/04/2017 11:31 AM, Balbir Singh wrote:
> On Fri, Dec 1, 2017 at 7:05 PM, Mahesh Jagannath Salgaonkar
> <mahesh at linux.vnet.ibm.com> wrote:
>> On 11/30/2017 06:20 PM, Balbir Singh wrote:
>>> We tend to call find_core_checkstop_reason() even if the NPU/NX/CAPI
>>> HMI handling paths found and accepted the reason for the local checkstop.
>>
>> Yes, this is because there is a possibility of cascading checkstops
>> where bits from multiple FIRs (NPU/NX/CAPI/CORE) may get fired at once.
>> Hence we need to make sure that we detect reasons from all FIRs
>> including core FIR.
>>
> 
> Cascading as in one event for all the triggered FIR's - each causing a
> malfunction
> error? Have you ever seen this case, several cascaded FIR bits causing just
> one HMI malfunction error?

Nope. I have never come across such a case but there is a possibility. I
remember HW folks mentioning this when we initially implemented HMI
handling.

> 
>>> We also trash hmi_evt in the call.
>>
>> Every find_*_checkstop*() caller fills and queues up the hmi_evt to be
>> sent to kernel. Once it is queued, the hmi_evt is free to use by next
>> caller.
>>
> 
> I agree on the queuing bits,
> 
> The code looks a bit odd to me
> 
> find_core_checkstop_reason() unconditionally initializes
> 
>         hmi_evt->severity = OpalHMI_SEV_FATAL;
>         hmi_evt->type = OpalHMI_ERROR_MALFUNC_ALERT;
>         hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_CORE;
> 
> Then in decode_malfunction we do
> 
>         /*
>          * If we fail to find checkstop reason, send an unknown HMI event.
>          */
>         if (!event_generated) {
>                 hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_UNKNOWN;
>                 hmi_evt->u.xstop_error.xstop_reason = 0;
>                 queue_hmi_event(hmi_evt, false);
>         }
> 
> Do we want to leave the severity as FATAL from there? Moreover the
> code organization
> needs revisiting

That was the original intent for unknown malfunction alert since we
considered all malfunction alerts as fatal unrecovered HW errors. Hence
the code uses the hmi_evt that was initialized in
find_core_checkstop_reason() and just change the xstop_type to Unknown.
I agree code need to be more readable. We can re-initialize the hmi_evt
again in (!event_generated) case to remove the confusion.

Unknown malf alert means that opal handler has failed to detect the HW
error reason and we don't know whether it is good idea for kernel to
continue. Hence we leave the severity as FATAL and disposition to
unrecovered. This also means hmi handler need to be enhanced to add the
missing functionality.

Do you think we should mark it as a warning and send it as recovered event ?

Thanks,
-Mahesh.