[Skiboot] Error reporting to sysadmins from firmware

Tue Apr 12 15:26:04 AEST 2016

On Sat, Apr 09, 2016 at 11:27:55AM -1000, Stewart Smith wrote:

...

> So... if you're running opal_prd and opal_errd and looking at the
> resulting logs to then go and look at the specific OPAL errors (that you
> can then decode) along with any parts that may be GARDed out, you'll
> have an idea of what went wrong (when combined with kernel log).
> 
> Out of band (e.g. from FSP/BMC) you *may* get a sysdump, you *may* get a
> SEL, you *may* get a PEL - and this *will* vary from machine to machine
> depending on sophistication and implementation of service processor.
> 
> Obviously, this is not one unified thing to help Linux distros capture,
> or sysadmins look at.
> 
> In my not so humble opinion, I think everything should eventually end up
> as human understandable text in the kernel log. This may be in
> *addition* to other methods of gathering information, but fundamentally,
> the existing mechanism used by every linux sysadmin should work.
> 
> Thoughts?

FWIW, work is underway to refactor the OPAL error logging code so that
OPAL-generated errors are pushed up to the Linux instance as they
happen. These are in PEL format, and the 'summary' with severity will be
logged as a readable line in syslog.
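
Roughly, the logging step could look like the sketch below (Python,
purely illustrative, not the actual code). The severity names, the
"OPAL [...]" line layout, and the helpers format_summary() and
log_summary() are assumptions made up for this example; real PEL
severity values differ.

```python
import syslog
from typing import Tuple

# Hypothetical severity names; the real PEL severity values differ.
SEVERITY_TO_PRIORITY = {
    "informational": syslog.LOG_INFO,
    "recoverable": syslog.LOG_WARNING,
    "unrecoverable": syslog.LOG_ERR,
}


def format_summary(severity: str, summary: str) -> Tuple[int, str]:
    """Map a severity name to a syslog priority and build a one-line message."""
    # Unknown severities fall back to LOG_NOTICE rather than being dropped.
    priority = SEVERITY_TO_PRIORITY.get(severity, syslog.LOG_NOTICE)
    # Collapse all whitespace so the summary always stays on a single line.
    line = "OPAL [%s]: %s" % (severity, " ".join(summary.split()))
    return priority, line


def log_summary(severity: str, summary: str) -> None:
    """Write the one-line summary to syslog at the mapped priority."""
    priority, line = format_summary(severity, summary)
    syslog.syslog(priority, line)
```

Keeping the formatting separate from the syslog call makes the line
layout easy to check without writing to the system log.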

> Currently, we have ~12kLoC dedicated to parsing these POWER specific
> things, and that's not nearly complete enough to give us (as firmware devs)
> something that we're comfortable reading and identifying what's going
> on.
> 
> So, how should we approach an implementation of getting things to a
> string?

For the traditional FSP-based systems, we have a set of
tools/infrastructure that works fairly well to put a readable one-line
summary into syslog.

For the OP boxes, OPAL errors get surfaced to the Linux instance and the
summary will be logged. Additionally, we should pull up the BMC logs and
put a readable summary into syslog. That is WIP.

There are some challenges around whether we will get all of the BMC
logs, detecting duplicates, etc., but there are workable solutions for
those. Of course, nothing can be done if the BMC logs are purged out of
band using IPMI, before the host gets to see them.

The current thought is to key off the IPMI message sequence number
and/or the time to determine whether there are new BMC logs to pull
and, if so, get them to the host via in-band IPMI and extract the
summary.
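
As a rough sketch of that idea (Python, illustrative only): the host
remembers a high-water mark of the newest entry it has seen and picks
up only entries beyond it, dropping duplicates along the way. The
BmcLogEntry record, its fields, and select_new_entries() are
hypothetical names, and ordering by (timestamp, sequence number) is
just one possible reading of "sequence number and/or the time". It
assumes BMC timestamps do not go backwards. Fetching the entries over
in-band IPMI is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class BmcLogEntry:
    # Hypothetical record; real BMC log entries carry more fields.
    seq: int        # sequence number reported for the entry
    timestamp: int  # seconds since the epoch, as reported by the BMC
    summary: str    # one-line, human-readable summary


Mark = Tuple[int, int]  # (timestamp, seq) of the newest entry already seen


def select_new_entries(
    entries: Iterable[BmcLogEntry], last_seen: Mark
) -> Tuple[List[BmcLogEntry], Mark]:
    """Return unseen entries in order, plus the updated high-water mark."""
    seen = set()
    fresh = []
    for entry in sorted(entries, key=lambda e: (e.timestamp, e.seq)):
        key = (entry.timestamp, entry.seq)
        # Skip anything at or before the mark, and duplicates in this batch.
        if key <= last_seen or key in seen:
            continue
        seen.add(key)
        fresh.append(entry)
    # If nothing new arrived, the mark stays where it was.
    new_mark = (fresh[-1].timestamp, fresh[-1].seq) if fresh else last_seen
    return fresh, new_mark
```

Comparing on the timestamp first means entries are still picked up if
the sequence numbers restart, as long as the BMC clock is sane.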

We are also looking at the PRD code to see if there is a small tweak to
get HBRT errors directly on the host. Watch this space...

Ananth