[OpenPower-Firmware] Future of First Failure Data Capture (FFDC)

Stewart Smith stewart at linux.vnet.ibm.com
Mon May 9 10:38:24 AEST 2016


Ananth N Mavinakayanahalli <ananth at in.ibm.com> writes:
> Kdump is functional for the host OS on OP systems too.

I don't think Ubuntu enables kdump though? It seems that we get more
reports on Ubuntu than RHEL.

>> As such, we likely want solutions that integrate with things like ABRT,
>> kdump, kernel logs, system logs and the like.
>> 
>> We want to make it easy for an OS vendor to know when to pass the crash
>> report to firmware/hardware support (and vice versa) and avoid ping-pong
>> of support requests.
>> 
>> Opaque blobs are a problem as, well, taking myself as an example, there
>> is *NO* way I'm handing anybody an opaque blob of debug data unless I'm
>> both pretty sure myself that it doesn't contain anything
>> personal/sensitive *and* I personally know the person I'm handing it
>> to. Other users are likely even more paranoid than I.
>
> I am not aware of anybody more paranoid than the System z folks. We
> worked on a library called EPPIC, which, in conjunction with a C-like
> scripting language, can 'scrub' the makedumpfile filtered kernel crash
> dump of any sensitive information. These are pre-canned scripts that
> automatically cleanse the dump before it's sent out for service. We could
> reuse the infrastructure without much difficulty.

I wonder if we could get this integrated into ABRT and then somehow hook
into that to get it to come to us first?

>> Thoughts?
>
> What we are currently missing is a mechanism to gather the OPAL logs
> from the time of an assert. One mechanism we are working on is to ride
> on the FIR based trigger that exists currently for OCC to gather all the
> FIRs in the event of an unrecoverable error.
>
> The problem, however, is where to store this information. There is very
> limited space on flash; the full OPAL log is 1MB. We are looking at
> capturing the last 16KB of the OPAL console log into the flash. This is
> WIP...
>
> With P9 and PPEs and possibly bigger flashes, the situation may get
> better, but for now, this is the plan. Any thoughts?

Even 16k would be useful. We don't want to spend too much time saving
things off when instead we could have rebooted and had the system back
in service.
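
As a rough illustration of how cheap "keep only the last 16k" is, here is
a minimal host-side sketch in Python. It is not the firmware
implementation being discussed; the names tail_bytes and capture_log_tail
are hypothetical, and the path in the usage note assumes a Linux host
that exposes the OPAL log at /sys/firmware/opal/msglog.

```python
TAIL_SIZE = 16 * 1024  # 16 KiB, the size discussed above


def tail_bytes(data: bytes, limit: int = TAIL_SIZE) -> bytes:
    """Return at most the last `limit` bytes of `data`."""
    if limit <= 0:
        return b""
    return data[-limit:]


def capture_log_tail(path: str, limit: int = TAIL_SIZE) -> bytes:
    """Read a log file and keep only its tail (hypothetical helper).

    The whole file is read and then sliced, which avoids relying on
    seek support for special files such as sysfs attributes.
    """
    with open(path, "rb") as f:
        return tail_bytes(f.read(), limit)
```

Usage would be something like capture_log_tail("/sys/firmware/opal/msglog"),
assuming that file exists on the system in question.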

Once we have gathered it all, though, we then need to dump it somewhere
on the system without using up all the disk space, and in a way that
lets us actually get hold of the data when we need to debug.
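
To make the "not use up all the disk space" part concrete, here is a
minimal sketch in Python of capping a dump directory at a byte budget by
deleting the oldest files first. The function name prune_dumps and the
idea of a single flat dump directory are assumptions for illustration
only, not an existing tool.

```python
from pathlib import Path


def prune_dumps(dump_dir: str, max_bytes: int) -> list:
    """Delete the oldest files in dump_dir until the total size is
    at most max_bytes. Returns the names of the deleted files."""
    # Oldest first, by modification time.
    files = sorted(
        (p for p in Path(dump_dir).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    total = sum(p.stat().st_size for p in files)
    removed = []
    for p in files:
        if total <= max_bytes:
            break
        total -= p.stat().st_size
        p.unlink()
        removed.append(p.name)
    return removed
```

Running something like this after each new dump is written keeps the
newest data around while bounding disk usage.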

When we have platform-specific utilities, people tend not to install/run
them (I can't count how many times I've seen opal_prd/opal_errd not
installed, let alone running).


-- 
Stewart Smith
OPAL Architect, IBM.


