SQUASHFS errors and OpenBMC hang

Kun Zhao zkxz at hotmail.com
Sat Aug 29 10:40:31 AEST 2020


Hi Team,

I have been validating OpenBMC on our POC system for a while, but starting about two weeks ago the BMC filesystem sometimes reports failures, and after that the BMC will sometimes hang after running for a while. It started happening on one system and then on another. I tried re-flashing with a programmer and still see this issue. I also tried flashing back to the very first known-good OpenBMC image we built and still see the same symptoms. It looks like a SPI ROM failure, but when I flash back the POC system's original 3rd-party BMC firmware, there is no such issue at all. Has anyone met similar issues before?

There are two symptoms:

#1,

The BMC debug console shows this error:

[ 4242.029061] SQUASHFS error: xz decompression failed, data probably corrupt
[ 4242.035970] SQUASHFS error: squashfs_read_data failed to read block 0xce5cb0
[ 4242.043159] SQUASHFS error: Unable to read data cache entry [ce5cb0]
[ 4242.049627] SQUASHFS error: Unable to read page, block ce5cb0, size da44
[ 4242.056386] SQUASHFS error: Unable to read data cache entry [ce5cb0]

After rebooting, the BMC may show that error again and then stop while reading the rootfs, with the following errors:

[ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}
[ 3.383951] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0a60. {1985,e002,15000044,98f7fb1d}
[ 3.394949] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e09e4. {1985,e002,15000044,98f7fb1d}
[ 3.405958] jffs2: notice: (78) check_node_data: wrong data CRC in data node at 0x003e0af0: read 0x5ab53bf4, calculated 0xb6f14204.
[ 3.417873] jffs2: warning: (78) jffs2_do_read_inode_internal: no data nodes found for ino #8
[ 3.426478] jffs2: Returned error for crccheck of ino #8. Expect badness...
[ 3.492939] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0bc8. {1985,e002,15000044,98f7fb1d}
[ 3.503923] jffs2: warning: (78) jffs2_do_read_inode_internal: no data nodes found for ino #9
[ 3.512462] jffs2: Returned error for crccheck of ino #9. Expect badness...

After that, the BMC either enters recovery mode or hangs.

#2,

The BMC debug console shows the same SQUASHFS error as above. Checking filesystem usage, we can see rwfs usage keep increasing like this:

root@dgx:~# df
Filesystem      1K-blocks   Used  Available  Use%  Mounted on
dev                212904      0     212904    0%  /dev
tmpfs              246728  20172     226556    8%  /run
/dev/mtdblock4      22656  22656          0  100%  /run/initramfs/ro
/dev/mtdblock5       4096    880       3216   21%  /run/initramfs/rw
cow                  4096    880       3216   21%  /
tmpfs              246728      8     246720    0%  /dev/shm
tmpfs              246728      0     246728    0%  /sys/fs/cgroup
tmpfs              246728      0     246728    0%  /tmp
tmpfs              246728      8     246720    0%  /var/volatile

and we can see more and more ipmid coredump files:

root@dgx:~# ls -al /run/initramfs/rw/cow/var/lib/systemd/coredump/
drwxr-xr-x 2 root root 0 Aug 21 16:04 .
-rw-r----- 1 root root 57344 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5710.1598025874000000000000.xzaba143da6d9b5571
-rw-r----- 1 root root 655360 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5710.1598025874000000000000ba58c927628d3950
-rw-r----- 1 root root 0 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5713.1598025880000000000000.xzee8c94e72fc5b173
-rw-r----- 1 root root 655360 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5713.159802588000000000000089ee90c2a557ac1c
drwxr-xr-x 6 root root 0 Jan 1 1970 ..
-rw-r----- 1 root root 92492 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5630.1598025699000000000000.xz
-rw-r----- 1 root root 92572 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5641.1598025723000000000000.xz
-rw-r----- 1 root root 92652 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5645.1598025728000000000000.xz
-rw-r----- 1 root root 92476 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5651.1598025754000000000000.xz

Checking the journal logs, I found that ipmid failed to access files like /usr/share/ipmi-providers/channel_config.json, so it seems ipmid is also a victim of the filesystem failure.
After a while, the BMC just hangs.
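For reference, this is roughly how the crash loop can be confirmed from the debug console (a sketch; it assumes ipmid runs under the upstream phosphor-ipmi-host service, and the grep pattern is just an example filter, not an exact log message):

```shell
# Count journal lines mentioning ipmid (case-insensitive).
journalctl --no-pager | grep -ci ipmid

# Check the service state; phosphor-ipmi-host is the upstream
# OpenBMC service name for ipmid (adjust if your image differs).
systemctl status phosphor-ipmi-host

# Count accumulated coredump files filling up the rwfs.
ls /run/initramfs/rw/cow/var/lib/systemd/coredump/ | wc -l
```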


Some recovery methods are available, but the success rate is very low:


  *   Leave the BMC alone for some time and it may come back to work, but not always.
  *   Rebooting the BMC or AC cycling sometimes makes it work, but not always.


I found the following actions can trigger this failure:


  1.  Do an SSH login to the BMC debug console remotely; when triggered, it shows this error:
$ ssh root@<bmc ip>
ssh_exchange_identification: read: Connection reset by peer


  2.  Set the BMC MAC address with fw_setenv in the BMC debug console, reboot the BMC, and run 'ip a'.
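A sketch of the commands involved in that second trigger (the MAC address below is a placeholder, and ethaddr is the conventional U-Boot variable name for it; this writes to the u-boot-env MTD partition):

```shell
# Write the MAC into the U-Boot environment (placeholder address).
fw_setenv ethaddr 00:11:22:33:44:55
# Read it back to verify the write landed.
fw_printenv ethaddr
reboot
# After the BMC comes back, check that the interface picked it up.
ip a
```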


The code is based on upstream commit 5ddb5fa99ec259 on master branch.
The flash layout definition is the default openbmc-flash-layout.dtsi.
The SPI ROM is a Macronix MX25L25635F.

Some questions:

  1.  Is any SPI lock feature enabled in OpenBMC?
  2.  If yes, do I have to unlock the u-boot-env partition before running fw_setenv?
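In case it is relevant to question 2, this is how I would check for a software write-lock from Linux, assuming mtd-utils is present in the image (the mtd number must be taken from /proc/mtd; /dev/mtd2 below is only a guess based on the default layout):

```shell
# Find which MTD device holds the U-Boot environment.
grep u-boot-env /proc/mtd

# Hypothetical: if that partition is software write-protected,
# mtd-utils' flash_unlock clears the lock. Replace mtd2 with the
# number reported above.
flash_unlock /dev/mtd2
```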


Thanks.

Best regards,

Kun Zhao
/*
  zkxz at hotmail.com
*/
