The Power9 host booting problem with OpenBMC kernel 5.7.x
Alexander A. Filippov
a.filippov at yadro.com
Wed Aug 12 04:33:14 AEST 2020
On Tue, Aug 11, 2020 at 06:12:30AM +0000, Joel Stanley wrote:
> On Mon, 10 Aug 2020 at 18:48, Alexander A. Filippov
> <a.filippov at yadro.com> wrote:
> > Since the kernel in OpenBMC was updated to 5.7.x we have a problem with the P9
> > hosts booting.
> > On a host with one Power9 CPU the failure happens while Petitboot is trying
> > to initialize the network, and it leads to host restarts.
> > On a host with two Power9 CPUs the same failure happens during OS boot. It
> > increases boot time, but in the end the host OS starts completely.
> Oh no. I have spent some time testing the 5.7 tree primarily on
> Tacoma, our ast2600/p9 platform. We saw some strange systemd failures,
> where services such as udevd and journald would be killed by systemd's
> watchdog functionality. I did some preliminary debugging but didn't
> find a root cause.
> I have since published a 5.8 based tree that does not suffer from this
> issue. Could you give that a spin on your hardware and see if it
> recreates your issue?
With kernel 5.8 the host still does not boot.
I've checked on both machines and they show very different results:
- On the machine with two CPUs the issue still reproduces.
I see no difference in either the behavior or the logs.
- On the machine with one CPU the boot fails because of a PNOR flash error.
It looks like this:
[ 16:23:27 ] --== Welcome to Hostboot hostboot-9865ef9/hbicore.bin ==--
[ 16:23:27 ]
[ 16:23:27 ] 5.31049|secure|SecureROM valid - enabling functionality
[ 16:23:30 ] 8.00820|Booting from SBE side 0 on master proc=00050000
[ 16:23:30 ] 8.04587|ISTEP 6. 5 - host_init_fsi
[ 16:23:30 ] 8.21815|ISTEP 6. 6 - host_set_ipl_parms
[ 16:23:30 ] 8.40171|ISTEP 6. 7 - host_discover_targets
[ 16:23:32 ] 9.55142|HWAS|PRESENT> DIMM=A0A0000000000000
[ 16:23:32 ] 9.55144|HWAS|PRESENT> Proc=8000000000000000
[ 16:23:32 ] 9.55145|HWAS|PRESENT> Core=33FFC30000000000
[ 16:23:33 ] 10.38865|ISTEP 6. 8 - host_update_master_tpm
[ 16:23:33 ] 10.41071|SECURE|Security Access Bit> 0x0000000000000000
[ 16:23:33 ] 10.41072|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
[ 16:23:33 ] 10.41089|ISTEP 6. 9 - host_gard
[ 16:23:33 ] 10.68154|HWAS|FUNCTIONAL> DIMM=A0A0000000000000
[ 16:23:33 ] 10.68156|HWAS|FUNCTIONAL> Proc=8000000000000000
[ 16:23:33 ] 10.68157|HWAS|FUNCTIONAL> Core=33FFC30000000000
[ 16:23:33 ] 10.68776|ISTEP 6.11 - host_start_occ_xstop_handler
[ 16:23:34 ] 11.10376|ECC error in PNOR flash in section offset 0x030DF600
[ 16:23:34 ]
[ 16:23:34 ] 11.10387|System shutting down with error status 0x60F
[ 16:24:52 ]
[ 16:24:52 ]
[ 16:24:52 ] --== Welcome to SBE - CommitId[0xc58e8fd0] ==--
After that the PNOR flash is corrupted, and every further boot attempt stops
at the stage 'SBE starting hostboot'.
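
For anyone comparing console captures from several boot attempts, here is a
minimal sketch of a filter that pulls the error lines and the last ISTEP
reached out of a log in the format shown above. The function names and the
marker list are my own assumptions for illustration, not part of Hostboot or
any OpenBMC tool; the markers are simply the two error strings from this log.

```python
import re

# Matches console lines of the form "[ HH:MM:SS ] <seconds>|<message>",
# as seen in the log above. Lines carrying only a timestamp are skipped.
LINE_RE = re.compile(r"^\[\s*(\d{2}:\d{2}:\d{2})\s*\]\s*(\d+\.\d+)\|(.*)$")

# Assumption: these substrings are taken from the one log in this message.
# Other failures may print different text, so extend the tuple as needed.
ERROR_MARKERS = (
    "ECC error in PNOR flash",
    "System shutting down with error status",
)


def find_hostboot_errors(lines):
    """Return (wall_time, hostboot_seconds, message) for each error line."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        wall_time, seconds, message = match.groups()
        if any(marker in message for marker in ERROR_MARKERS):
            hits.append((wall_time, float(seconds), message))
    return hits


def last_istep(lines):
    """Return the message of the last ISTEP line seen, or None."""
    last = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(3).startswith("ISTEP"):
            last = match.group(3)
    return last
```

Feeding it the lines of a saved console log (for example
`open(path).read().splitlines()`, where `path` is wherever the capture was
stored) gives the failing messages together with the ISTEP that preceded them.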
I've noticed that kernel 5.8 detects the flash chip incorrectly:
as mx25l51245g instead of mx66l51235f.
This happens on both machines, and I don't understand why it causes problems
on only one of them.
After restoring the previous firmware and power cycling, both machines work fine.
> > So, I have two questions:
> > - Could you please check if Romulus is also affected by this issue?
> > - Do you have any idea what is going wrong?
> I'll fire up a romulus and see if it reproduces.
> My guess is it's something to do with the timekeeping, irq or rcu
> code. All areas of complexity!
> Thanks for the report.
> > I've attached the tarball with full logs.
> > - poopsy is a system with two Power9 CPU
> > - whoopsy is a system with one Power9 CPU
> > --
> > Regards,
> > Alexander