SQUASHFS errors and OpenBMC hang

Patrick Williams patrick at stwcx.xyz
Tue Sep 1 22:35:06 AEST 2020


On Sat, Aug 29, 2020 at 12:40:31AM +0000, Kun Zhao wrote:
> Hi Team,
> 
> I’ve been validating OpenBMC on our POC system for a while, but starting two weeks ago the BMC filesystem sometimes reports failures, and after that the BMC will sometimes hang after running for a while. It started happening on one system and then on another. We tried re-flashing with a programmer and still see this issue. We tried flashing back to the very first known-good OpenBMC image we built and still see the same symptoms. It seems like a SPI ROM failure, but when we flash back the POC system's original 3rd-party BMC firmware there is no such issue at all. Has anyone met similar issues before?

Yeah, this does look like a bad SPI NOR.  Have you tried flashing a
fresh image to the NOR and then reading it back to confirm all the bits
keep their values?  It is possible that the corruption is also hitting
the 3rd-party BMC's code, just in a less-important location.
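
A rough sketch of that check (this assumes the whole chip shows up as
/dev/mtd0 and nothing else writes to flash while you test; if the BMC
is running from this NOR, do it from the programmer instead):

$ flashcp -v fresh-image.bin /dev/mtd0
$ head -c "$(stat -c%s fresh-image.bin)" /dev/mtd0 > /tmp/readback.bin
$ cmp fresh-image.bin /tmp/readback.bin && echo "read-back matches"

If the compare fails intermittently, that points at the chip (or the
SPI signal integrity) rather than the software.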

> [ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}

I'm surprised to see anyone using jffs2.  Don't we generally use ubifs
in OpenBMC?  Is there a reason you've chosen to use jffs2?

I don't necessarily think jffs2 will be better or worse in this
particular scenario, but we've seen lots of upgrade issues over the
years with jffs2.
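
For reference, if you want to try the UBI layout, my recollection is
that it is just a distro feature in your machine config (treat this as
a sketch and check meta-phosphor for the current spelling):

    # conf/machine/<yourmachine>.conf
    # Select the UBI-based flash layout instead of the static MTD one.
    DISTRO_FEATURES += "obmc-ubi-fs"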

> The BMC debug console shows the same SQUASHFS error as above. By checking filesystem usage we can see the rwfs usage keep increasing, like this:
> 
> root@dgx:~# df
> Filesystem      1K-blocks   Used  Available Use% Mounted on
> dev                212904      0     212904   0% /dev
> tmpfs              246728  20172     226556   8% /run
> /dev/mtdblock4      22656  22656          0 100% /run/initramfs/ro
> /dev/mtdblock5       4096    880       3216  21% /run/initramfs/rw
> cow                  4096    880       3216  21% /
> tmpfs              246728      8     246720   0% /dev/shm
> tmpfs              246728      0     246728   0% /sys/fs/cgroup
> tmpfs              246728      0     246728   0% /tmp
> tmpfs              246728      8     246720   0% /var/volatile
> 
> and we can see more and more ipmid coredump files,

This implies to me that we need to adjust the systemd recovery for
ipmid.  We shouldn't just keep re-launching the same process over and
over after a coredump.  Systemd has some thresholding capability.
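
Something along these lines in the ipmid service unit would stop the
respawn loop (a sketch; the thresholds are arbitrary and the directive
names assume a reasonably recent systemd):

    [Unit]
    # Give up if the service fails 5 times within 30 seconds.
    StartLimitIntervalSec=30
    StartLimitBurst=5

    [Service]
    Restart=on-failure
    RestartSec=2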

> I found the following actions could trigger this failure,
> 
> 
>   1.  do an SSH login to the BMC debug console remotely; it shows this error when the failure is triggered:
> $ ssh root@<bmc ip>
> ssh_exchange_identification: read: Connection reset by peer
> 
> 
>   2.  set the BMC MAC address with fw_setenv in the BMC debug console, reboot the BMC, and run 'ip a' (roughly as sketched below).
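
For completeness, I'd expect step 2 to look roughly like this (assuming
the MAC lives in the usual 'ethaddr' U-Boot variable; adjust the
variable name and address for your board):

$ fw_setenv ethaddr 00:11:22:33:44:55
$ reboot
# ...after the BMC comes back up, log in again...
$ ip a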

I have no idea why these actions would trigger SPI NOR issues.  They
don't seem connected on the surface.

> The code is based on upstream commit 5ddb5fa99ec259 on the master branch.
> The flash layout definition is the default openbmc-flash-layout.dtsi.
> The SPI ROM is a Macronix MX25L25635F.
> 
> Some questions,
> 
>   1.  Any SPI lock feature enabled in OpenBMC?
>   2.  If yes, do I have to unlock u-boot-env partition before fw_setenv?

There is not, to my knowledge, a software SPI lock.  Some machines have
a 'golden' NOR which they enable by, in hardware, setting the
write-protect input pin on the SPI NOR (with a strapping resistor).
Does your machine use this mechanism?  If so, it is possible that you're
booting onto the 'wrong' NOR flash in some conditions and a reboot
resets the chip-select logic in the SPI controller.  (Usually, you have
the watchdog configured to automatically swap the chip-select after some
number of boot failures.)
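
If this is an ASPEED AST2400/AST2500 design, you can sanity-check which
boot source you are running from.  The register below is my reading of
the AST2500 datasheet (WDT2 timeout status register at 0x1e785030, with
bit 1 set meaning the second boot source is selected), so verify it
against your SoC's manual before trusting it:

$ devmem 0x1e785030
# bit 1 set in the result => running from the alternate boot source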

-- 
Patrick Williams