[Skiboot] [PATCH v3 0/4] opal: Check NX FIRs to find the reason for Malfunction Alert.
ddstreet at us.ibm.com
Tue Mar 10 01:19:35 AEDT 2015
On 2015-03-05 00:04, Mahesh J Salgaonkar wrote:
> This patch series enhances HMI event structure to accommodate CORE/NX
> stop error information and bumps up the HMI event version to V2.
> Changes in V3:
> - New patch 2/4 adds the documentation for HMI event struct in
> - Added BUILD_ASSERT() call.
> Changes in V2:
> - Introduced changes to include information about NX checkstop.
> - Added PIR and chip_id fields to HMI event structure.
> - Improved the logic to detect and report UNKNOWN event if we fail to
> find the checkstop reason.
> - New patch 3/3 to identify and report reason for NX checkstop.
So, I'm unclear as to whether NX errors/hangs should be handled here in skiboot,
or by the kernel driver (or other OS driver). My WIP kernel driver does
currently check the NX engine FIR, and clear any errors signaled, but I
haven't added error recovery yet.
Are you planning to add NX error recovery and/or NX channel hang
recovery? I also don't see you clearing FIR errors (i.e. writing the
set bits to the DMA & Engine FIR Data Clear Register).
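For illustration, the read-then-clear sequence described above might look
like the sketch below. It is not taken from the patch series or from any
driver. The register names, the in-memory register file, and the
reg_read()/reg_write() helpers are all hypothetical stand-ins for real
xscom accesses, and the write-1-to-clear behaviour of the clear register
is an assumption based on the description above.

```c
#include <stdint.h>

/* Hypothetical register indices, for illustration only. */
enum { NX_FIR_DATA = 0, NX_FIR_DATA_CLEAR = 1, NX_REG_COUNT = 2 };

/* Simulated register file standing in for the real xscom space. */
static uint64_t fake_regs[NX_REG_COUNT];

/* Hypothetical accessor: read a simulated register. */
static uint64_t reg_read(int reg)
{
	return fake_regs[reg];
}

/* Hypothetical accessor: write a simulated register. */
static void reg_write(int reg, uint64_t val)
{
	if (reg == NX_FIR_DATA_CLEAR)
		/* Assumed semantics: each 1 bit written clears that FIR bit. */
		fake_regs[NX_FIR_DATA] &= ~val;
	else
		fake_regs[reg] = val;
}

/*
 * Read the FIR and write back exactly the bits that were observed set,
 * so that an error raised after the read is not cleared unseen.
 * Returns the error bits that were found (0 if none).
 */
static uint64_t nx_fir_check_and_clear(void)
{
	uint64_t fir = reg_read(NX_FIR_DATA);

	if (fir)
		reg_write(NX_FIR_DATA_CLEAR, fir);
	return fir;
}
```

Whichever component owns the FIR would call something like
nx_fir_check_and_clear() once per error notification and act on the
returned bits before attempting recovery.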
Should error detection and recovery (including clearing error bits in
the NX FIR Data Register) be done by the OS (kernel) driver? Or should
the kernel driver not read/write any NX xscom registers and leave that
entirely to skiboot?
I had planned on the kernel driver handling NX FIR error detection and
recovery, as well as monitoring each channel for hangs and implementing
channel request abort. However, if you plan to add that to skiboot (or
if I should add it to skiboot), let me know.
One specific thing we need to avoid is having both skiboot and the
kernel driver checking and/or modifying the NX registers.
> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> Mahesh Salgaonkar (4):
> opal: Enhance HMI event structure to accommodate checkstop info.
> opal: Update doc/opal-api/opal-messages.txt
> opal: Check Core FIRs to find the reason for Malfunction Alert.
> opal: Check NX FIRs to find the reason for Malfunction Alert.
> core/hmi.c | 215
> doc/opal-api/opal-messages.txt | 46 ++++++++-
> include/opal.h | 61 +++++++++++
> 3 files changed, 314 insertions(+), 8 deletions(-)