[Skiboot] [PATCH v3 0/4] opal: Check NX FIRs to find the reason for Malfunction Alert.
Mahesh Jagannath Salgaonkar
mahesh at linux.vnet.ibm.com
Wed Mar 11 01:56:12 AEDT 2015
On 03/09/2015 07:49 PM, Dan Streetman wrote:
>
>
> On 2015-03-05 00:04, Mahesh J Salgaonkar wrote:
>> This patch series enhances HMI event structure to accommodate CORE/NX
>> checkstop error information and bumps up the HMI event version to V2.
>>
>> Changes in V3:
>> - New patch 2/4 adds the documentation for HMI event struct in
>> doc/opal-api/opal-messages.txt
>> - Added BUILD_ASSERT() call.
>>
>> Changes in V2:
>> - Introduced changes to include information about NX checkstop.
>> - Added PIR and chip_id fields to HMI event structure.
>> - Improved the logic to detect and report UNKNOWN event if we fail to
>> gather checkstop reason.
>> - New patch 3/3 to identify and report reason for NX checkstop.
>
> So, I'm unclear as to whether NX errors/hangs should be handled here in skiboot,
> or by the kernel driver (or other OS driver). My WIP kernel driver does
> currently check the NX engine FIR, and clear any errors signaled, but I
> haven't added error recovery yet.
This patch is intended to scan for and detect only those NX errors that are
reported to the hypervisor through HMI as a Malfunction Alert. Currently this
patch does not attempt any recovery. It just detects them and reports them to
the host kernel as unrecovered, with a summary of the error info (through the
HMI event structure). Currently the kernel panics on these errors.
If we already have a kernel driver that does NX error recovery, then we
should look at diverting these notifications to your kernel driver for
recovery instead of a kernel panic.
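
As a rough illustration of the detection-only behaviour described above, here
is a minimal sketch. Every name in it (struct hmi_event, nx_fir_to_event, the
enum values) is hypothetical and is not the code or the struct layout from
this series. It only shows the idea: look at the FIR value, fill in a
version 2 event, mark it unrecovered, and never write to the register.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; these are not the skiboot definitions. */
enum hmi_disposition { HMI_RECOVERED = 0, HMI_NOT_RECOVERED = 1 };
enum xstop_type { XSTOP_UNKNOWN = 0, XSTOP_CORE = 1, XSTOP_NX = 2 };

struct hmi_event {
	uint8_t  version;      /* 2 = layout that carries checkstop info */
	uint8_t  disposition;  /* enum hmi_disposition */
	uint8_t  xstop_type;   /* enum xstop_type */
	uint32_t chip_id;      /* chip whose NX unit was inspected */
	uint64_t xstop_reason; /* raw FIR bits, passed on as the summary */
};

/*
 * Detection only: turn an already-read NX FIR value into an event.
 * Nothing is cleared and no recovery is attempted. Returns true when
 * the FIR shows an error; otherwise the event is reported as UNKNOWN.
 */
static bool nx_fir_to_event(uint32_t chip_id, uint64_t fir,
			    struct hmi_event *evt)
{
	memset(evt, 0, sizeof(*evt));
	evt->version = 2;
	evt->disposition = HMI_NOT_RECOVERED;
	evt->chip_id = chip_id;

	if (!fir) {
		/* Could not find a reason in the FIR. */
		evt->xstop_type = XSTOP_UNKNOWN;
		return false;
	}

	evt->xstop_type = XSTOP_NX;
	evt->xstop_reason = fir;
	return true;
}
```

A caller would read the FIR for each chip, pass the value in, and queue the
resulting event for the host kernel.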
>
> Are you planning to add NX error recovery and/or NX channel hang
> recovery? I also don't see you clearing FIR errors (i.e. writing the
> set bits to the DMA & Engine FIR Data Clear Register).
Nope.
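
For reference, the clearing step Dan mentions (writing the set bits to the
FIR Data Clear Register) could be sketched as below. This is a simulation
only: struct nx_fir_regs and both functions are hypothetical, and the
"write 1 to clear" behaviour is an assumption taken from the description in
the question, not from the hardware documentation.

```c
#include <stdint.h>

/* Simulated register state; a real driver would use xscom accessors. */
struct nx_fir_regs {
	uint64_t fir;	/* error bits latched by the hardware */
};

/*
 * Assumed semantics of the clear register: every 1 bit written
 * clears the matching FIR bit, 0 bits leave the FIR untouched.
 */
static void nx_fir_write_clear(struct nx_fir_regs *r, uint64_t bits)
{
	r->fir &= ~bits;
}

/*
 * Read the latched errors, then clear exactly those bits.
 * Returns the bits that were acknowledged (0 if none were set).
 */
static uint64_t nx_fir_ack(struct nx_fir_regs *r)
{
	uint64_t set = r->fir;

	if (set)
		nx_fir_write_clear(r, set);
	return set;
}
```

Writing back only the bits that were read, rather than all ones, means that
under the assumed semantics a bit latched after the read stays set for the
next pass.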
>
> Should error detection and recovery (including clearing error bits in
> the NX FIR Data Register) be done by the OS (kernel) driver? Or should
> the kernel driver not read/write any NX xscom registers and leave that
> entirely to skiboot?
Not all NX errors are reported to the hypervisor through HMI. We may still
need separate error detection and recovery for those.
>
> I had planned on the kernel driver handling NX FIR error detection and
> recovery, as well as monitoring each channel for hangs and implementing
> channel request abort. However, if you plan to add that to skiboot (or
> if I should add it to skiboot), let me know.
I have no plan to add NX error recovery. I think doing recovery in one
single place is a good idea. In my opinion the kernel driver would be the
best place, though I may be wrong. Adding monitoring logic in skiboot would
make things a bit difficult unless error detection is interrupt driven. For
the errors routed through HMI, we can divert those to your kernel driver,
which would then do the recovery.
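
One possible shape for that diversion, sketched with made-up names
(nx_register_recovery, nx_handle_hmi and the demo hooks are all hypothetical,
not an existing kernel or skiboot interface): the HMI path hands the error to
a recovery hook if a driver has registered one, and otherwise falls back to
the current behaviour.

```c
#include <stddef.h>
#include <stdint.h>

enum nx_action {
	NX_ACTION_PANIC,		/* no driver registered: current behaviour */
	NX_ACTION_DRIVER_RECOVERED,	/* driver hook reported success */
	NX_ACTION_DRIVER_FAILED,	/* driver hook reported failure */
};

/* Hypothetical recovery hook; returns 0 on success. */
typedef int (*nx_recover_fn)(uint32_t chip_id, uint64_t reason);

static nx_recover_fn nx_recover_hook;	/* NULL until a driver registers */

static void nx_register_recovery(nx_recover_fn fn)
{
	nx_recover_hook = fn;
}

/* Decide what to do with an NX error that arrived through HMI. */
static enum nx_action nx_handle_hmi(uint32_t chip_id, uint64_t reason)
{
	if (!nx_recover_hook)
		return NX_ACTION_PANIC;

	if (nx_recover_hook(chip_id, reason) == 0)
		return NX_ACTION_DRIVER_RECOVERED;
	return NX_ACTION_DRIVER_FAILED;
}

/* Demo hooks standing in for a real driver's recovery routine. */
static int demo_recover_ok(uint32_t chip_id, uint64_t reason)
{
	(void)chip_id;
	(void)reason;
	return 0;
}

static int demo_recover_fail(uint32_t chip_id, uint64_t reason)
{
	(void)chip_id;
	(void)reason;
	return -1;
}
```

With a single hook there is exactly one place that performs recovery, which
fits the goal of not having two components touch the NX registers.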
>
> One specific thing we need to avoid is having both skiboot and the
> kernel driver checking and/or modifying the NX registers.
Agreed.
Thanks,
-Mahesh.