The Power9 host booting problem with OpenBMC kernel 5.7.x

Alexander A. Filippov a.filippov at yadro.com
Wed Aug 12 04:33:14 AEST 2020


On Tue, Aug 11, 2020 at 06:12:30AM +0000, Joel Stanley wrote:
> On Mon, 10 Aug 2020 at 18:48, Alexander A. Filippov
> <a.filippov at yadro.com> wrote:
> >
> > Since the kernel in OpenBMC was updated to 5.7.x we have a problem with the P9
> > hosts booting.
> > On the host with one Power9 CPU the failure happens while Petitboot is trying
> > to initialize the network, and it causes the host to restart.
> > On the host with two Power9 CPUs the same failure happens during OS boot. It
> > increases boot time, but in the end the host OS starts completely.
> 
> Oh no. I have spent some time testing the 5.7 tree primarily on
> Tacoma, our ast2600/p9 platform. We saw some strange systemd failures,
> where services such as udevd and journald would be killed by systemd's
> watchdog functionality. I did some preliminary debugging but didn't
> find a root cause.
> 
> I have since published a 5.8 based tree that does not suffer from this
> issue. Could you give that a spin on your hardware and see if it
> recreates your issue?
> 
>  https://gerrit.openbmc-project.xyz/c/openbmc/meta-aspeed/+/35315
> 

With kernel 5.8 the host still does not boot.
I've checked on both machines and they show very different results:
 - On the machine with two CPUs the issue still reproduces.
   I see no difference in either the behavior or the logs.
 - On the machine with one CPU the failure is now caused by the PNOR flash.
   It looks like this:

[ 16:23:27 ] --== Welcome to Hostboot hostboot-9865ef9/hbicore.bin ==--
[ 16:23:27 ] 
[ 16:23:27 ]   5.31049|secure|SecureROM valid - enabling functionality
[ 16:23:30 ]   8.00820|Booting from SBE side 0 on master proc=00050000
[ 16:23:30 ]   8.04587|ISTEP  6. 5 - host_init_fsi
[ 16:23:30 ]   8.21815|ISTEP  6. 6 - host_set_ipl_parms
[ 16:23:30 ]   8.40171|ISTEP  6. 7 - host_discover_targets
[ 16:23:32 ]   9.55142|HWAS|PRESENT> DIMM[03]=A0A0000000000000
[ 16:23:32 ]   9.55144|HWAS|PRESENT> Proc[05]=8000000000000000
[ 16:23:32 ]   9.55145|HWAS|PRESENT> Core[07]=33FFC30000000000
[ 16:23:33 ]  10.38865|ISTEP  6. 8 - host_update_master_tpm
[ 16:23:33 ]  10.41071|SECURE|Security Access Bit> 0x0000000000000000
[ 16:23:33 ]  10.41072|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
[ 16:23:33 ]  10.41089|ISTEP  6. 9 - host_gard
[ 16:23:33 ]  10.68154|HWAS|FUNCTIONAL> DIMM[03]=A0A0000000000000
[ 16:23:33 ]  10.68156|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
[ 16:23:33 ]  10.68157|HWAS|FUNCTIONAL> Core[07]=33FFC30000000000
[ 16:23:33 ]  10.68776|ISTEP  6.11 - host_start_occ_xstop_handler
[ 16:23:34 ]  11.10376|ECC error in PNOR flash in section offset 0x030DF600
[ 16:23:34 ] 
[ 16:23:34 ]  11.10387|System shutting down with error status 0x60F
[ 16:24:52 ] 
[ 16:24:52 ] 
[ 16:24:52 ] --== Welcome to SBE - CommitId[0xc58e8fd0] ==--


   After that the PNOR flash is corrupted, and every subsequent boot attempt
   stops at the 'SBE starting hostboot' stage.
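
For anyone comparing console captures from several boots, the two failure
lines in the log above can be picked out mechanically. The following is only
a minimal sketch: the function name find_pnor_failures is hypothetical, and
the patterns are taken from the two lines quoted above, so other Hostboot
builds may word these messages differently.

```python
import re

# Patterns copied from the console lines quoted above, e.g.
# "[ 16:23:34 ]  11.10376|ECC error in PNOR flash in section offset 0x030DF600"
ECC_RE = re.compile(
    r"ECC error in PNOR flash in section offset (0x[0-9A-Fa-f]+)")
SHUTDOWN_RE = re.compile(
    r"System shutting down with error status (0x[0-9A-Fa-f]+)")


def find_pnor_failures(console_text):
    """Return (ecc_offsets, shutdown_statuses) found in a console capture.

    Both lists hold integers parsed from the hexadecimal values in the log.
    Hypothetical helper, not part of any OpenBMC or Hostboot tooling.
    """
    ecc_offsets = [int(m.group(1), 16)
                   for m in ECC_RE.finditer(console_text)]
    statuses = [int(m.group(1), 16)
                for m in SHUTDOWN_RE.finditer(console_text)]
    return ecc_offsets, statuses
```

Passing the text of a saved console log to find_pnor_failures() returns the
flash offsets and shutdown status codes it contains; two empty lists mean
neither message was seen in that capture.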

I've noticed that kernel 5.8 detects the flash chip incorrectly:
as mx25l51245g instead of mx66l51235f.
It happens on both machines, and I don't understand why it causes problems
on only one of them.
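
To compare what each kernel reports, the detected part name can be pulled out
of the BMC's dmesg output. This is a rough sketch only: it assumes the
spi-nor probe message has the form "<device>: <part> (<size> Kbytes)", and the
helper names detected_flash_parts and unexpected_parts are hypothetical. The
exact message format may differ between kernel versions.

```python
import re

# Assumed probe message format, for example:
#   "aspeed-smc 1e630000.spi: mx25l51245g (65536 Kbytes)"
# This line is illustrative, not copied from the attached logs.
PART_RE = re.compile(r":\s*(\w+)\s+\((\d+) Kbytes\)")


def detected_flash_parts(dmesg_text):
    """Return a list of (part_name, size_in_kbytes) found in dmesg text."""
    return [(m.group(1), int(m.group(2)))
            for m in PART_RE.finditer(dmesg_text)]


def unexpected_parts(dmesg_text, expected_names):
    """Return detected part names that are not in expected_names."""
    return [name for name, _size in detected_flash_parts(dmesg_text)
            if name not in expected_names]
```

Calling unexpected_parts() with the saved dmesg text and the set of part
names actually fitted on the board (here that would include mx66l51235f)
lists any chip the kernel identified as something else.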

After restoring the previous firmware and power cycling, both machines work fine.

> > So, I have two questions:
> > - Could you please check if Romulus is also affected by this issue?
> > - Do you have any idea what is going wrong?
> 
> I'll fire up a romulus and see if it reproduces.
> 
> My guess is it's something to do with the timekeeping, irq or rcu
> code. All areas of complexity!
> 
> Thanks for the report.
> 
> Cheers,
> 
> Joel
> 
> > I've attached the tarball with full logs.
> > - poopsy is a system with two Power9 CPUs
> > - whoopsy is a system with one Power9 CPU
> >
> > --
> > Regards,
> > Alexander

--
Regards,
Alexander

