[OpenPower-Firmware] [POWER8] OCC Firdata over IPMI
Artem Senichev
artemsen at gmail.com
Thu May 23 22:46:35 AEST 2019
On Wed, May 22, 2019 at 10:44 PM Douglas Gilbert <dgilbert at us.ibm.com> wrote:
>
> Hi Artem,
>
> The OCC FIRDATA collection is only run if there were system errors that require FIRDATA to be collected.
> You said that sometimes data written in the FIRDATA partition has ECC errors.
> Does this mean that sometimes the OCC FIRDATA does not have ECC errors or is that just when the OCC has not written any FIRDATA at all?
> ECC errors would indicate that at least some data got written to PNOR.
1. We have an internal stable release of OpenPOWER without HIOMAP
(op-build 2.2 based).
Very rarely (I have seen it only 2 or 3 times), when there is a hardware
problem with CPU0, we get FIRDATA with an ECC error.
Unfortunately, we don't have saved logs or dumps of the partition. Our
QA department just replaces CPU0, so I can't catch the error: I would
need a broken CPU installed in a non-production server.
Manually injected errors via putscom work as expected: FIRDATA contains
no ECC errors and is filled with the correct data.
2. The latest version of the OpenPOWER firmware we use (master branch) has
HIOMAP support.
There are a lot of problems with SPI in OpenBMC, and HIOMAP solves most of them.
But now any checkstop that writes to FIRDATA ends up with an ECC
error in the last QWORD:
a. L3 Directory Read UE Checkstop Error:
putscom -c 0x0 0x1101080D 0x0A00000000000000
Uncorrectable error offset=0x0cf9, data=0x0000000040b000ff (ecc 0x00!=0xcc)
Part of FIRDATA dump:
00000ce0 00 14 00 00 00 00 0c 00 04 00 29 00 00 00 24 00 |..........)...$.|
00000cf0 c7 77 80 0b 00 04 00 30 03 00 00 00 00 40 b0 00 |.w.....0.....@..|
00000d00 ff 00 ff ff ff ff ff ff ff ff 00 ff ff ff ff ff |................|
b. L2 Directory Read UE Checkstop Error:
putscom -c 0x0 0x1101280C 0xA000000000000000
Uncorrectable error offset=0x01f8, data=0x7fffffffff800000 (ecc 0x00!=0xe2)
Part of FIRDATA dump:
000001e0 00 00 00 00 00 62 09 01 18 43 02 80 00 00 59 00 |.....b...C....Y.|
000001f0 00 00 00 09 04 00 0d 8d 7f ff ff ff ff 80 00 00 |................|
00000200 00 ff ff ff ff ff ff ff ff 00 ff ff ff ff ff ff |................|
c. Memory Buffer UE Checkstop Error:
putscom -c 0x80000000 0x02011440 0x0000000000010000
Correctable error offset=0x0efa, data=0x00000f07c000ffff
Part of FIRDATA dump:
00000ee0 cc cc cc f0 00 00 00 48 02 01 09 c3 08 1f f2 00 |.......H........|
00000ef0 e6 4c 00 00 00 02 01 0c 03 c4 00 00 0f 07 c0 00 |.L..............|
00000f00 ff ff 00 ff ff ff ff ff ff ff ff 00 ff ff ff ff |................|
In this case (correctable error) hostboot prints an error but continues to boot.
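To make the dumps above easier to read: the reported offsets line up with a 9-byte ECC layout (8 data bytes followed by 1 ECC byte). Below is a minimal Python sketch that pulls the failing block out of the dump from case (a). It assumes that 8+1 layout; the helper names (parse_hexdump, ecc_block) are made up for illustration and are not part of hostboot or any other tool.

```python
# Sketch: locate the 9-byte ECC block (8 data bytes + 1 ECC byte) at the
# offset reported by hostboot, using the hexdump lines from case (a).
ECC_BLOCK = 9    # assumed layout: 8 data bytes followed by 1 ECC byte
DATA_BYTES = 8


def parse_hexdump(lines):
    """Return {address: byte value} for 'hexdump -C' style lines."""
    mem = {}
    for line in lines:
        # Drop the ASCII column, keep the address and the hex bytes.
        fields = line.split("|")[0].split()
        base = int(fields[0], 16)
        for i, tok in enumerate(fields[1:]):
            mem[base + i] = int(tok, 16)
    return mem


def ecc_block(mem, offset):
    """Split the block starting at `offset` into (data word, stored ECC byte)."""
    if offset % ECC_BLOCK:
        raise ValueError("offset is not aligned to a 9-byte ECC block")
    data = bytes(mem[offset + i] for i in range(DATA_BYTES))
    return int.from_bytes(data, "big"), mem[offset + DATA_BYTES]


dump_a = [
    "00000ce0 00 14 00 00 00 00 0c 00 04 00 29 00 00 00 24 00 |..........)...$.|",
    "00000cf0 c7 77 80 0b 00 04 00 30 03 00 00 00 00 40 b0 00 |.w.....0.....@..|",
    "00000d00 ff 00 ff ff ff ff ff ff ff ff 00 ff ff ff ff ff |................|",
]

mem = parse_hexdump(dump_a)
data, ecc = ecc_block(mem, 0x0CF9)
print(f"block #{0x0CF9 // ECC_BLOCK}: data=0x{data:016x} stored_ecc=0x{ecc:02x}")
```

The same arithmetic holds for the other two cases (0x01f8 = 9 * 56, 0x0efa = 9 * 426). In all three dumps the stored ECC byte of the failing block is 0x00, the same value that follows each run of eight 0xff bytes in the rest of the dump.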
> In the trace you provided It looks like there is a timeout waiting for a response over IPMI. Since you said you added a call to getInfo() to OCC main.c, can I assume this trace was taken when the OCC goes active at runtime? That's not really when the OCC is suppose to be using the IPMI bus and I wonder if its OK to do so.
I have added a call to the 'getInfo' function just before the main thread
loop starts:
https://github.com/artemsen/occ/blob/5ae7e7c76f7f3e522678d4a211fe67c7d9183a6a/src/occ/main.c#L806
--
Regards,
Artem Senichev
Software Engineer, YADRO.