[OpenPower-Firmware] Future of First Failure Data Capture (FFDC)
Ananth N Mavinakayanahalli
ananth at in.ibm.com
Tue Apr 12 15:35:08 AEST 2016
On Sat, Apr 09, 2016 at 11:33:24AM -1000, Stewart Smith wrote:
> Most likely from the host OS, as this provides far more flexibility in
> gathering data from a system - *especially* at runtime. While IBM
> Enterprise systems have hardware that enables large scale data dump from
> the host when everything goes bad (host OS crash or other
> non-recoverable error), we're generally more limited on OpenPower
> machines (and even if we did have that hardware hooked up, there's
> always secure and trusted boot around the corner to make things a bit
> more fun).
Kdump is functional for the host OS on OP systems too.
...
> As such, we likely want solutions that integrate with things like ABRT,
> kdump, kernel logs, system logs and the like.
>
> We want to make it easy for an OS vendor to know when to pass the crash
> report to firmware/hardware support (and vice versa) and avoid ping-pong
> of support requests.
>
> Opaque blobs are a problem as, well, taking myself as an example, there
> is *NO* way I'm handing anybody an opaque blob of debug data unless I'm
> both pretty sure myself that it doesn't contain anything
> personal/sensitive *and* I personally know the person I'm handing it
> to. Other users are likely even more paranoid than I.
I am not aware of anybody more paranoid than the System z folks. We
worked on a library called EPPIC, which, in conjunction with a C-like
scripting language, can 'scrub' the makedumpfile-filtered kernel crash
dump of any sensitive information. These are pre-canned scripts that
automatically cleanse the dump before it's sent out for service. We could
reuse the infrastructure without much difficulty.
> Thoughts?
What we are currently missing is a mechanism to gather the OPAL logs
from the time of an assert. One mechanism we are working on is to ride
on the FIR-based trigger that currently exists for OCC to gather all the
FIRs in the event of an unrecoverable error.
The problem, however, is where to store this information. There is very
limited space on flash; the full OPAL log is 1MB. We are looking at
capturing the last 16KB of the OPAL console log to flash. This is
WIP...
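
To make the "last 16KB" idea concrete, here is a minimal sketch. It is not
the actual firmware code. It assumes the console log lives in a circular
buffer with a write position and a "wrapped" flag; the names ring_tail,
write_pos, wrapped and TAIL_SIZE are hypothetical and used only for
illustration.

```python
# Illustrative only: assumes a circular console buffer, which may not match
# how the firmware actually lays out its log.
TAIL_SIZE = 16 * 1024  # "16KB" interpreted here as 16 KiB (assumption)


def ring_tail(buf: bytes, write_pos: int, wrapped: bool,
              limit: int = TAIL_SIZE) -> bytes:
    """Return up to `limit` of the most recent bytes, oldest first."""
    if not 0 <= write_pos <= len(buf):
        raise ValueError("write_pos outside buffer")
    if limit <= 0:
        return b""
    if wrapped:
        # Once wrapped, the oldest byte sits at write_pos and the newest
        # byte sits just before it, so rotate into chronological order.
        ordered = buf[write_pos:] + buf[:write_pos]
    else:
        # Not wrapped yet: only the bytes before write_pos are valid.
        ordered = buf[:write_pos]
    # Keep only the tail; this is the part that would be written to flash.
    return ordered[-limit:]
```

With a full 1MB buffer this keeps 1/64th of the log, so the most recent
messages leading up to the assert survive while older history is dropped.
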
With P9 and PPEs and possibly bigger flashes, the situation may get
better, but for now, this is the plan. Any thoughts?
Ananth