[Skiboot] [PATCH v3 0/4] opal: Check NX FIRs to find the reason for Malfunction Alert.

Dan Streetman ddstreet at us.ibm.com
Tue Mar 10 01:19:35 AEDT 2015

On 2015-03-05 00:04, Mahesh J Salgaonkar wrote:
> This patch series enhances HMI event structure to accommodate CORE/NX
> checkstop error information and bumps up the HMI event version to V2.
> Changes in V3:
> - New patch 2/4 adds the documentation for HMI event struct in
>   doc/opal-api/opal-messages.txt
> - Added BUILD_ASSERT() call.
> Changes in V2:
> - Introduced changes to include information about NX checkstop.
> - Added PIR and chip_id fields to HMI event structure.
> - Improved the logic to detect and report UNKNOWN event if we fail to
>   gather checkstop reason.
> - New patch 3/3 to identify and report reason for NX checkstop.

So, I'm unclear as to whether NX errors/hangs should be handled here in
skiboot or by the kernel driver (or other OS driver).  My WIP kernel driver
currently checks the NX engine FIR and clears any errors signaled, but I
haven't added error recovery yet.

Are you planning to add NX error recovery and/or NX channel hang 
recovery?  I also don't see you clearing FIR errors (i.e. writing the 
set bits to the DMA & Engine FIR Data Clear Register).
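
To make the clearing step concrete, here is a rough sketch of what I mean 
by "read the FIR, then write the set bits to the clear register".  It is 
only a model: the struct and function names are hypothetical, the register 
is simulated in memory rather than accessed over xscom, and it assumes the 
clear register is write-1-to-clear as described above (the real hardware 
semantics may differ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for the NX DMA & Engine FIR.
 * Real code would use xscom reads/writes instead of a struct. */
struct nx_fir_model {
	uint64_t fir_data;	/* latched error bits */
};

/* Models a write to the FIR Data Clear register.
 * ASSUMPTION: each bit set in 'mask' clears the matching FIR bit. */
static void nx_fir_write_clear(struct nx_fir_model *nx, uint64_t mask)
{
	assert(nx != NULL);
	nx->fir_data &= ~mask;
}

/* Read the FIR, clear exactly the bits that were observed, and return
 * them so the caller can report or act on the errors. */
static uint64_t nx_fir_check_and_clear(struct nx_fir_model *nx)
{
	uint64_t errors;

	assert(nx != NULL);
	errors = nx->fir_data;
	if (errors)
		nx_fir_write_clear(nx, errors);
	return errors;
}
```

Clearing only the bits that were actually read avoids wiping an error 
that latches between the read and the clear.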

Should error detection and recovery (including clearing error bits in 
the NX FIR Data Register) be done by the OS (kernel) driver?  Or should 
the kernel driver not read/write any NX xscom registers and leave that 
entirely to skiboot?

I had planned on the kernel driver handling NX FIR error detection and 
recovery, as well as monitoring each channel for hangs and implementing 
channel request abort.  However, if you plan to add that to skiboot (or 
if I should add it to skiboot), let me know.

One specific thing we need to avoid is having both skiboot and the 
kernel driver checking and/or modifying the NX registers.

> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> ---
> Mahesh Salgaonkar (4):
>       opal: Enhance HMI event structure to accommodate checkstop info.
>       opal: Update doc/opal-api/opal-messages.txt
>       opal: Check Core FIRs to find the reason for Malfunction Alert.
>       opal: Check NX FIRs to find the reason for Malfunction Alert.
>  core/hmi.c                     |  215 +++++++++++++++++++++++++++++++++++++++-
>  doc/opal-api/opal-messages.txt |   46 ++++++++-
>  include/opal.h                 |   61 +++++++++++
>  3 files changed, 314 insertions(+), 8 deletions(-)