[OpenPower-Firmware] Error reporting to sysadmins from firmware
Stewart Smith
stewart at linux.vnet.ibm.com
Sun Apr 10 07:27:55 AEST 2016
Following up on the FFDC discussion I started on the openpower-firmware list,
let's take a tour of where we are today from a firmware PoV:
I think there are three main categories of things we have to care about
communicating to a sysadmin:
1) previous runtime fatal errors
- e.g. checkstop, kernel panic.
2) boot time errors/warnings
- e.g. "couldn't set up X bit of hardware" (e.g. one of your cores is
broken, but we can still boot)
3) runtime errors/warnings
- e.g. "an NX unit just went away", "processor recovery done", "core
checkstop"
Currently, we have the following solutions:
1) previous runtime fatal errors:
*IF* you are running the opal_errd application, you will get notified
of PELs and sysdumps via syslog, and can use the opal-elog-parse
utility to view an ASCII version of the PEL, opal-dump-parse will
extract parts of the dump, and the gard utility will let you list
GARDed parts.
a) PELs (POWER specific)
up to 16kB in a binary format.
104 page specification
Common to PowerVM and PAPR, although tools are *not* currently
shared.
Even when printed to ASCII, generally incomprehensible to any
single human.
b) SELs (BMC/IPMI specific, but not POWER specific)
very small logs (bytes)
We throw them at a BMC and you can retrieve them over IPMI from there.
4 page spec
c) sysdumps (POWER specific)
archive of blocks of system memory (megabytes)
d) GARD records
*should* be hand-in-hand with a PEL.
Documents a piece of hardware that is disabled due to a previous
error condition.
*SHOULD* only exist in the event of actually bad hardware (reality
however....)
2) Boot time errors / warnings
For both of these, the only way to get at a textual representation
(that you can hopefully understand) is using POWER specific utilities.
a) PELs
If an error occurs in Hostboot, this is exclusively (along with GARD)
how you're going to find out about it.
May also come from skiboot.
b) GARD
To the user, it will seem like this is a boot time error, but it
is *technically* a "previous boot attempt" error, and we booted
now because the faulty HW was disabled.
3) Runtime errors / warnings
For userspace or kernel issues, these end up in syslog (kernel goes
via kernel log). However, for our firmware/hardware related issues:
a) PRD
Appears in kernel syslog *IF* opal-prd is running
b) Kernel/skiboot handled events
End up in the Linux kernel log, in plain text.
A concise explanation of the event.
If a particular event isn't known to the kernel, you may get "unhandled
HMI" as the text, but you'll get HMER/TFMR, severity, and
recovered/not recovered.
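
As a rough illustration of what a sysadmin's script has to do with these plain-text events today, here is a minimal Python sketch that picks HMI-related lines out of kernel log text. The sample lines are made up for illustration and do not reproduce the kernel's actual message format; only the "unhandled HMI" phrase comes from the description above.

```python
def find_hmi_lines(log_text):
    """Return log lines mentioning an HMI, marking the unhandled ones.

    This is a plain substring match, so it can over-match; a real tool
    would need to know the kernel's actual message format.
    """
    results = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if "hmi" not in lowered:
            continue
        results.append({
            "line": line.strip(),
            "unhandled": "unhandled hmi" in lowered,
        })
    return results


# Hypothetical sample text, NOT real kernel output.
SAMPLE_LOG = """\
[  10.000000] example: processor recovery done (HMI)
[  11.000000] example: unhandled HMI
[  12.000000] example: unrelated message
"""

hits = find_hmi_lines(SAMPLE_LOG)
for hit in hits:
    print(("UNHANDLED " if hit["unhandled"] else "handled   ") + hit["line"])
```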
So... if you're running opal-prd and opal_errd and looking at the
resulting logs to then go and look at the specific OPAL errors (that you
can then decode) along with any parts that may be GARDed out, you'll
have an idea of what went wrong (when combined with the kernel log).
Out of band (e.g. from FSP/BMC) you *may* get a sysdump, you *may* get a
SEL, you *may* get a PEL - and this *will* vary from machine to machine
depending on sophistication and implementation of service processor.
Obviously, this is not one unified thing to help Linux distros capture,
or sysadmins look at.
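
To make the fragmentation concrete, the in-band paths described above can be written down as a small table. This is only a sketch of my reading of the list: the dictionary layout and the function name are illustrative, SELs are left out because they are retrieved out of band via the BMC, and whether the gard utility depends on a daemon is an assumption here.

```python
# Where each kind of report surfaces in-band, per the summary above.
# "daemon" is what must be running; "viewer" is where or how you read it.
SOURCES = {
    "PEL": {"daemon": "opal_errd", "viewer": "opal-elog-parse"},
    "sysdump": {"daemon": "opal_errd", "viewer": "opal-dump-parse"},
    # Assumption: listing GARD records needs only the gard utility.
    "GARD record": {"daemon": None, "viewer": "gard"},
    "PRD event": {"daemon": "opal-prd", "viewer": "kernel log"},
    "kernel/skiboot event": {"daemon": None, "viewer": "kernel log"},
}


def required_tools(sources=SOURCES):
    """Return every distinct daemon and viewer needed for full coverage."""
    tools = set()
    for info in sources.values():
        for key in ("daemon", "viewer"):
            if info[key] is not None:
                tools.add(info[key])
    return sorted(tools)


print(required_tools())
```

Only two of the five entries end up somewhere a generic Linux tool already looks.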
In my not so humble opinion, I think everything should eventually end up
as human understandable text in the kernel log. This may be in
*addition* to other methods of gathering information, but fundamentally,
the existing mechanism used by every Linux sysadmin should work.
Thoughts?
Currently, we have ~12kLoC dedicated to parsing these POWER specific
things, and that's not nearly complete enough to give us (as firmware devs)
something that we're comfortable reading and identifying what's going
on.
So, how should we approach an implementation of getting things to a
string?
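
As a starting point for discussion, here is a minimal Python sketch of the "everything becomes one line of text" idea. The FirmwareEvent type, its fields, and the output format are all hypothetical; the only things taken from the above are that an event has a severity and a recovered/not recovered status. A real implementation would live in firmware or the kernel, not in Python.

```python
from dataclasses import dataclass


@dataclass
class FirmwareEvent:
    """Hypothetical, simplified event record (not a PEL or SEL layout)."""
    source: str        # who reported it, e.g. "skiboot", "hostboot", "PRD"
    severity: str      # e.g. "info", "warning", "error"
    description: str   # concise human-readable explanation
    recovered: bool    # True if the condition was recovered


def to_log_line(event):
    """Render an event as a single human-readable log line."""
    status = "recovered" if event.recovered else "not recovered"
    return f"{event.source}: {event.severity}: {event.description} ({status})"


example = FirmwareEvent(
    source="PRD",
    severity="warning",
    description="an NX unit just went away",
    recovered=False,
)
print(to_log_line(example))
```

The point of the sketch is that the string is produced where the knowledge lives, so nothing downstream needs a POWER specific parser to read it.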
--
Stewart Smith
OPAL Architect, IBM.