[OpenPower-Firmware] Error reporting to sysadmins from firmware

Sun Apr 10 07:27:55 AEST 2016

Following up on the FFDC discussion I started on the openpower-firmware
list, let's take a tour of where we are today from a firmware PoV:

I think there are three main categories of things we have to care about
communicating to a sysadmin:
1) previous runtime fatal errors
  - e.g. checkstop, kernel panic.
2) boot time errors/warnings
  - e.g. "couldn't set up X bit of hardware" (e.g. one of your cores is
  broken, but we can still boot)
3) runtime errors/warnings
  - e.g. "an NX unit just went away", "processor recovery done", "core
  checkstop"

Currently, we have the following solutions:

1) previous runtime fatal errors:

   *IF* you are running the opal_errd application, you will get notified
    of PELs and sysdumps via syslog, and can use the opal-elog-parse
    utility to view an ASCII version of the PEL, opal-dump-parse will
    extract parts of the dump, and the gard utility will let you list
    GARDed parts.

   a) PELs (POWER specific)
      up to 16kB of binary format.
      104 page specification
      Common to PowerVM and PAPR, although tools are *not* currently
      shared.
      Even when printed to ASCII, generally incomprehensible to any
      single human.

   b) SELs (BMC/IPMI specific, but not POWER specific)
      very small logs (bytes)
      We throw them at a BMC and you can retrieve them over IPMI from there.
      4 page spec

   c) sysdumps (POWER specific)
      archive of blocks of system memory (megabytes)

   d) GARD records
      *should* be hand-in-hand with a PEL.
      Documents a piece of hardware that is disabled due to a previous
      error condition.
      *SHOULD* only exist in the event of actually bad hardware (reality
      however....)

2) Boot time errors / warnings

   For both of these, the only way to get at a textual representation
   (that you can hopefully understand) is using POWER specific utilities.

   a) PELs
      If error occurs in Hostboot, this is exclusively (along with GARD)
      how you're going to find out about it.
      May also come from skiboot.

   b) GARD
      To the user, it will seem like this is a boot time error, but it
      is *technically* a "previous boot attempt" error, and we booted
      now because the faulty HW was disabled.

3) Runtime errors / warnings

   For userspace or kernel issues, these end up in syslog (kernel goes
   via kernel log). However, for our firmware/hardware related issues:

   a) PRD
      Appears in kernel syslog *IF* opal-prd is running

   b) Kernel/skiboot handled events

      End up in linux kernel log, in plain text.
      A concise explanation of the event.
      If particular event isn't known to kernel, you may get "unhandled
      HMI" as the text, but you'll get HMER/TFMR, severity, and
      recovered/not recovered.

So... if you're running opal-prd and opal_errd and looking at the
resulting logs to then go and look at the specific OPAL errors (that you
can then decode) along with any parts that may be GARDed out, you'll
have an idea of what went wrong (when combined with the kernel log).

Out of band (e.g. from FSP/BMC) you *may* get a sysdump, you *may* get a
SEL, you *may* get a PEL - and this *will* vary from machine to machine
depending on sophistication and implementation of service processor.

Obviously, this is not one unified thing to help Linux distros capture,
or sysadmins look at.

In my not so humble opinion, I think everything should eventually end up
as human understandable text in the kernel log. This may be in
*addition* to other methods of gathering information, but fundamentally,
the existing mechanism used by every linux sysadmin should work.

Thoughts?

Currently, we have ~12kLoC dedicated to parsing these POWER specific
things, and that's not nearly complete enough to give us (as firmware
devs) something that we're comfortable reading and identifying what's
going on.

So, how should we approach an implementation of getting things to a
string?
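
To make the question a bit more concrete, here is a rough sketch of what
"getting things to a string" could look like. Everything in it is an
assumption made up for illustration: the event codes, the FirmwareEvent
record, the event_to_string() helper and the output format are all
hypothetical, and none of it is taken from skiboot, Hostboot or the PEL
spec. The only idea it shows is a lookup table of known events, with a
fallback that still prints severity, recovered/not recovered and the raw
register values when the event isn't known.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical event codes mapped to plain-text explanations.
# Illustrative only; not real firmware identifiers.
KNOWN_EVENTS = {
    "NX_UNAVAILABLE": "an NX unit just went away",
    "PROC_RECOVERY_DONE": "processor recovery done",
    "CORE_CHECKSTOP": "core checkstop",
}


@dataclass
class FirmwareEvent:
    """Hypothetical in-memory form of one firmware event."""
    code: str
    severity: str                         # e.g. "warning", "error", "fatal"
    recovered: bool
    raw: Optional[Dict[str, int]] = None  # register name -> value


def event_to_string(ev: FirmwareEvent) -> str:
    """Render one event as a single line of human-readable text."""
    text = KNOWN_EVENTS.get(ev.code)
    if text is None:
        # Unknown event: say so, but keep the raw register values so
        # that somebody can still decode it later.
        regs = " ".join(
            f"{name}={value:#018x}"
            for name, value in sorted((ev.raw or {}).items())
        )
        text = f"unhandled event {ev.code}"
        if regs:
            text += f" ({regs})"
    state = "recovered" if ev.recovered else "not recovered"
    return f"firmware: {ev.severity}: {text} [{state}]"


if __name__ == "__main__":
    print(event_to_string(FirmwareEvent("CORE_CHECKSTOP", "fatal", False)))
    print(event_to_string(FirmwareEvent("XYZ", "warning", True, {"HMER": 0x1})))
```

The design choice worth arguing about is the fallback path: a line that
says "unhandled" but still carries the raw values is less friendly than
a real explanation, but it lands in the same log as everything else.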

-- 
Stewart Smith
OPAL Architect, IBM.