checkstop processing

Tue Nov 14 17:01:49 AEDT 2017

Oliver <oohall at gmail.com> writes:
> On Tue, Nov 14, 2017 at 2:42 PM, Joel Stanley <joel at jms.id.au> wrote:
>> On Tue, Nov 14, 2017 at 8:04 AM, Sergey Kachkin <s.kachkin at gmail.com> wrote:
>>> Hi all,
>>>
>>> I'm investigating checkstop processing and looking for a way to isolate
>>> a faulty component with OpenBMC.
>>> So far the SEL logs available via REST are not really helpful.
>>>
>>> Is there any data source in OpenBMC for troubleshooting checkstops?
>>>
>>> I guess eSEL binary data parsed with eSEL.pl would be more informative, but
>>> do we have any procedure to grab the binary SEL data and parse it with the
>>> latest OpenBMC?
>>>
>>> Currently it seems that IPL checkstop analysis is not really working. I mean
>>> that the faulty component is not deconfigured on the next boot and the gard
>>> list is empty.
>>> This can easily be reproduced by injecting an error manually via putscom.
>>
>> I think you've identified an area that would be great for improvement.
>>
>> I'd like to expand the scope beyond just checkstop to other boot
>> failures: I've tried to boot machines recently that have failed to
>> even start hostboot, and I haven't known what has failed.
>>
>> A tool that inspects recent error logs, and the state of the SBE would
>> be useful. We can leverage libpdbg to talk to the host.
>
> The SBE stores some state information in cfam 2809 that we can use to
> find out the current istep. I think we can also dump the SBE trace
> buffer out of PIB memory on non-secure systems. Parsing the trace
> buffer requires the tracehash file from the SBE build, but we can
> probably add that to the squashfs file for the host firmware.

This would be ideal to put in a sensor for boot progress.
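
As a rough sketch of what such a sensor could decode: the snippet below
splits a 32-bit word read from cfam 2809 into boot state and istep fields.
The bit layout is an assumption on my part and is not confirmed anywhere in
this thread (IBM bit numbering, bit 0 = MSB: bit 0 booted flag, bits 4-7
previous state, bits 8-11 current state, bits 12-19 major istep, bits 20-25
minor istep). Check it against the SBE source before relying on it. The
names SbeMsg and decode_sbe_msg are hypothetical, and the raw value would
have to come from something like a cfam read via pdbg.

```python
from typing import NamedTuple


class SbeMsg(NamedTuple):
    """Decoded view of the SBE messaging word (layout is an assumption)."""
    booted: bool
    prev_state: int
    curr_state: int
    major_istep: int
    minor_istep: int


def _field(value: int, start: int, width: int) -> int:
    """Extract `width` bits starting at IBM bit `start` (bit 0 = MSB)
    from a 32-bit word."""
    shift = 32 - start - width
    return (value >> shift) & ((1 << width) - 1)


def decode_sbe_msg(raw: int) -> SbeMsg:
    """Split a raw 32-bit cfam 2809 value into fields.

    ASSUMED layout, not taken from the thread:
      bit 0       booted flag
      bits 4-7    previous state
      bits 8-11   current state
      bits 12-19  major istep
      bits 20-25  minor istep
    """
    if not 0 <= raw <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit value")
    return SbeMsg(
        booted=bool(_field(raw, 0, 1)),
        prev_state=_field(raw, 4, 4),
        curr_state=_field(raw, 8, 4),
        major_istep=_field(raw, 12, 8),
        minor_istep=_field(raw, 20, 6),
    )
```

A boot progress sensor could poll the register and publish
"major_istep.minor_istep" whenever the decoded value changes.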

-- 
Stewart Smith
OPAL Architect, IBM.