[OpenPower-Firmware] Future of First Failure Data Capture (FFDC)

Sun Apr 10 07:33:24 AEST 2016

Hi all,

One of the topics that came up in the OpenPower Foundation System
Software Workgroup face2face at the recent OpenPower Summit was how we
go into the future with First Failure Data Capture (FFDC) on OpenPower.

For those unfamiliar with the term, FFDC is IBM's name for collecting
everything that could possibly be useful when a system experiences a
failure, and packaging it up to put somewhere. It is used both
internally and to gather information from end user machines in order
to diagnose a fault.

A lot of history exists around high-end systems, where the failure of a
single system is a big deal.

Historically, IBM owned the stack end-to-end: Processor, Hypervisor and
OS. Now, this (obviously) isn't the case, so we must fit into existing
systems well.

When something goes wrong, people are going to be diagnosing it from
either the service processor or the host OS (that is, the bare metal
operating system).

Most likely that will be the host OS, as this provides far more
flexibility in gathering data from a system - *especially* at runtime.
While IBM Enterprise systems have hardware that enables large-scale data
dumps from the host when everything goes bad (host OS crash or other
non-recoverable error), we're generally more limited on OpenPower
machines (and even if we did have that hardware hooked up, there's
always secure and trusted boot around the corner to make things a bit
more fun).

As people come from the x86 Linux world, they'll already be looking at
the kernel log and syslog, as this is where many errors show up (or,
indeed, on the majority of systems, the *only* place you'll get
anything meaningful, as access to the service processor, if there is
one, is not super common).

We also hope that OS and userspace errors are a lot more common for end
users than firmware/hardware errors, so they're more likely to be used
to that interface.

As such, we likely want solutions that integrate with things like ABRT,
kdump, kernel logs, system logs and the like.

We want to make it easy for an OS vendor to know when to pass the crash
report to firmware/hardware support (and vice versa) and avoid ping-pong
of support requests.

Opaque blobs are a problem as, well, taking myself as an example, there
is *NO* way I'm handing anybody an opaque blob of debug data unless I'm
both pretty sure myself that it doesn't contain anything
personal/sensitive *and* I personally know the person I'm handing it
to. Other users are likely even more paranoid than I.

Thoughts?

-- 
Stewart Smith
OPAL Architect, IBM.