[Skiboot] [PATCH v3 0/4] opal: Check NX FIRs to find the reason for Malfunction Alert.

Tue Mar 17 11:08:06 AEDT 2015

On 2015-03-10 10:56, Mahesh Jagannath Salgaonkar wrote:
> On 03/09/2015 07:49 PM, Dan Streetman wrote:
>> 
>> 
>> On 2015-03-05 00:04, Mahesh J Salgaonkar wrote:
>>> This patch series enhances HMI event structure to accommodate
>>> CORE/NX check stop error information and bumps up the HMI event
>>> version to V2.
>>> 
>>> Changes in V3:
>>> - New patch 2/4 adds the documentation for HMI event struct in
>>>   doc/opal-api/opal-messages.txt
>>> - Added BUILD_ASSERT() call.
>>> 
>>> Changes in V2:
>>> - Introduced changes to include information about NX checkstop.
>>> - Added PIR and chip_id fields to HMI event structure.
>>> - Improved the logic to detect and report UNKNOWN event if we fail
>>>   to gather checkstop reason.
>>> - New patch 3/3 to identify and report reason for NX checkstop.
>> 
>> So, I'm unclear as to whether NX errors/hangs should be handled here
>> in skiboot, or by the kernel driver (or other OS driver).  My WIP
>> kernel driver does currently check the NX engine FIR, and clear any
>> errors signaled, but I haven't added error recovery yet.
> 
> This patch is intended to scan and detect only those NX errors which
> are reported to the hypervisor through HMI as Malfunction Alert.
> Currently this patch does not attempt any recovery. It just detects and
> reports them to the host kernel as un-recovered with a summary of error
> info (through the HMI event structure). Currently the kernel invokes
> panic for these errors.
> 
> If we already have a kernel driver that does NX error recovery, then we
> should look at diverting these notifications to your kernel driver for
> recovery instead of kernel panic.

Yeah, since the kernel will need to handle either the NX error or the
HMI for anything useful to happen, I think my preference is to just let
the kernel NX driver monitor the NX registers directly (through OPAL
calls, as benh suggested) and correct NX errors directly.  Going
through HMI would just be extra work for me.

So unless I hear otherwise, I'll plan to add NX error detection and 
handling to the kernel driver, and assume opal isn't monitoring/handling 
the NX error/status registers.
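
For illustration only, a minimal sketch of the check-and-clear step
discussed in this thread. Everything here is an assumption: the FIR is
simulated in memory, and nx_fir_sim, nx_fir_read, nx_fir_write_clear and
nx_fir_check_and_clear are hypothetical names, not skiboot or kernel
APIs. A real driver would reach the registers through OPAL xscom calls.
The write-to-clear behavior follows the description of the FIR Data
Clear Register quoted further down in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for the NX FIR (assumption, not real HW). */
struct nx_fir_sim {
	uint64_t fir;	/* latched error bits; 0 means no error signaled */
};

/* Simulated read of the FIR data register. */
static uint64_t nx_fir_read(const struct nx_fir_sim *nx)
{
	return nx->fir;
}

/*
 * Simulated write to the FIR data-clear register: every bit set in
 * 'bits' clears the corresponding latched FIR bit.
 */
static void nx_fir_write_clear(struct nx_fir_sim *nx, uint64_t bits)
{
	nx->fir &= ~bits;
}

/*
 * Read the FIR and, if any error bits are latched, write exactly those
 * bits back to the clear register. Returns the error bits that were
 * seen so the caller can decide on recovery.
 */
uint64_t nx_fir_check_and_clear(struct nx_fir_sim *nx)
{
	uint64_t errs;

	assert(nx != NULL);

	errs = nx_fir_read(nx);
	if (errs)
		nx_fir_write_clear(nx, errs);

	return errs;
}
```

Clearing only the bits that were actually read, rather than writing
all ones, avoids discarding an error that latches between the read and
the clear.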

> 
>> 
>> Are you planning to add NX error recovery and/or NX channel hang
>> recovery?  I also don't see you clearing FIR errors (i.e. writing the
>> set bits to the DMA & Engine FIR Data Clear Register).
> 
> Nope.
> 
>> 
>> Should error detection and recovery (including clearing error bits in
>> the NX FIR Data Register) be done by the OS (kernel) driver?  Or
>> should the kernel driver not read/write any NX xscom registers and
>> leave that entirely to skiboot?
> 
> Not all NX errors are reported to hypervisor through HMI. We may still
> need error detection and recovery.
> 
>> 
>> I had planned on the kernel driver handling NX FIR error detection and
>> recovery, as well as monitoring each channel for hangs and
>> implementing channel request abort.  However, if you plan to add that
>> to skiboot (or if I should add it to skiboot), let me know.
> 
> I have no plan to add NX error recovery. I think doing recovery at one
> single place is a good idea. In my opinion the kernel driver would be
> the best place. I may be wrong. Adding monitoring logic in skiboot
> would make things a bit difficult unless error detection is interrupt
> driven. For the errors routed through HMI, we can divert those to your
> kernel driver which would then do the recovery.
> 
>> 
>> One specific thing we need to avoid is having both skiboot and the
>> kernel driver checking and/or modifying the NX registers.
> 
> Agree.
> 
> Thanks,
> -Mahesh.