SQUASHFS errors and OpenBMC hang

Kun Zhao zkxz at hotmail.com
Thu Sep 3 08:46:26 AEST 2020


On 9/1/20 5:35 AM, Patrick Williams wrote:
> On Sat, Aug 29, 2020 at 12:40:31AM +0000, Kun Zhao wrote:
>> Hi Team,
>>
>> I have been validating OpenBMC on our POC system for a while, but starting about 2 weeks ago the BMC filesystem sometimes reports failures, and after that the BMC sometimes hangs after running for a while. It started on one system and then appeared on another. I tried re-flashing with a programmer and still see the issue. I also tried flashing back to the very first known-good OpenBMC image we built and still see the same symptoms. It looks like a SPI ROM failure, but when I flash back the POC system's original 3rd-party BMC firmware, there is no such issue at all. Has anyone seen similar issues before?
> Yeah, this does look like a bad SPI NOR. 
Thank you for the comments, Patrick. I think so too. My only confusion is that the POC system's original 3rd-party BMC firmware doesn't have any issue, and it also uses jffs2.
>  Have you tried flashing on a
> fresh image to the NOR and then reading it back to confirm all the bits
> keep their values?  It is possible that the corruption is hitting the
> other BMC code in a less-important location.

I suspected that, too. So I burned my image to the NOR, booted it, and then read it back. The only differences are that the u-boot-env and rwfs partitions in the read-back image now have contents, which is expected, and no data crossed any partition boundary.
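
The read-back check described above can be sketched as a per-partition comparison. This is only a minimal sketch: the partition offsets and sizes below are an assumption based on the default 32 MiB static layout and should be verified against openbmc-flash-layout.dtsi in your tree, and diff_partitions is a hypothetical helper name, not part of any OpenBMC tool.

```python
from typing import Dict, Tuple

# name -> (offset, size) in bytes.
# ASSUMPTION: values for the default 32 MiB static layout; verify them
# against openbmc-flash-layout.dtsi before relying on the result.
LAYOUT: Dict[str, Tuple[int, int]] = {
    "u-boot":     (0x0000000, 0x0060000),
    "u-boot-env": (0x0060000, 0x0020000),
    "kernel":     (0x0080000, 0x0440000),
    "rofs":       (0x04C0000, 0x1740000),
    "rwfs":       (0x1C00000, 0x0400000),
}


def diff_partitions(flashed: bytes, readback: bytes,
                    layout: Dict[str, Tuple[int, int]] = LAYOUT) -> Dict[str, int]:
    """Count the differing bytes in each partition of two flash images."""
    result = {}
    for name, (offset, size) in layout.items():
        # Pad short images with 0xFF, the erased state of NOR flash.
        a = flashed[offset:offset + size].ljust(size, b"\xff")
        b = readback[offset:offset + size].ljust(size, b"\xff")
        result[name] = sum(x != y for x, y in zip(a, b))
    return result


# Example use (file names are placeholders):
#   with open("flashed.bin", "rb") as f: flashed = f.read()
#   with open("readback.bin", "rb") as f: readback = f.read()
#   for name, count in diff_partitions(flashed, readback).items():
#       print(name, count)
```

After a boot, non-zero counts are expected only for u-boot-env and rwfs. A non-zero count in u-boot, kernel, or rofs would suggest that bits are not keeping their values.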

I also tried moving the rofs/rwfs positions, making them bigger or smaller, reducing the kernel partition size, and leaving 64KB neutral zones between partitions. None of these changes helped.

>
>> [ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}
> I'm surprised to see anyone using jffs2.  Don't we generally use ubifs
> in OpenBMC?  Is there a reason you've chosen to use jffs2?
I just use the default settings based on ast2500-evb for our POC. But thanks for the hint. I'm trying to enable ubifs now.
>
> I don't necessarily think jffs2 will be better or worse in this
> particular scenario but we've seen lots of upgrade issues over the years
> with jffs2.
>
>> The BMC debug console shows the same SQUASHFS error as above. Checking filesystem usage, we can see rwfs usage keeps increasing, like this:
>>
>> root@dgx:~# df
>> Filesystem      1K-blocks   Used  Available  Use%  Mounted on
>> dev                212904      0     212904    0%  /dev
>> tmpfs              246728  20172     226556    8%  /run
>> /dev/mtdblock4      22656  22656          0  100%  /run/initramfs/ro
>> /dev/mtdblock5       4096    880       3216   21%  /run/initramfs/rw
>> cow                  4096    880       3216   21%  /
>> tmpfs              246728      8     246720    0%  /dev/shm
>> tmpfs              246728      0     246728    0%  /sys/fs/cgroup
>> tmpfs              246728      0     246728    0%  /tmp
>> tmpfs              246728      8     246720    0%  /var/volatile
>>
>> and can see more and more ipmid coredump files,
> This implies to me that we need to adjust the systemd recovery for
> ipmid.  We shouldn't just keep re-launching the same process over and
> over after a coredump.  Systemd has some thresholding capability.
Can I disable the coredump for ipmid?
>> I found the following actions could trigger this failure,
>>
>>
>>   1.  Log in to the BMC debug console remotely over SSH; when the failure is triggered, it shows this error:
>> $ ssh root@<bmc ip>
>> ssh_exchange_identification: read: Connection reset by peer
>>
>>
>>   2.  Set the BMC MAC address with fw_setenv in the BMC debug console, reboot the BMC, and do 'ip -a'.
> I have no idea why this procedure would solve SPI NOR issues.  It
> doesn't seem connected on the surface.
They don't solve the issue; they trigger the errors that get printed on the BMC debug console. I think the reason is that some files on rwfs or in u-boot-env are read or written when we do them.
>> The code is based on upstream commit 5ddb5fa99ec259 on master branch.
>> The flash layout definition is the default openbmc-flash-layout.dtsi.
>> The SPI ROM is Macronix MX25L25635F
>>
>> Some questions,
>>
>>   1.  Any SPI lock feature enabled in OpenBMC?
>>   2.  If yes, do I have to unlock u-boot-env partition before fw_setenv?
> There is not, to my knowledge, a software SPI lock.  Some machines have
> a 'golden' NOR which they enable by, in hardware, setting the
> write-protect input pin on the SPI NOR (with a strapping resistor).
> Does your machine do this mechanism?  If so, it is possible that you're
> booting onto the 'wrong' NOR flash in some conditions and a reboot
> resets the chip-select logic in the SPI controller.  (Usually, you have
> the watchdog configured to automatically swap the chip-select after some
> number of boot failures.)
>
No, we have only one NOR flash in the system. By SPI lock feature I mean the NOR flash chip's software Block Protection function, which can enable or disable write protection for particular blocks of the BMC code, not the HW W/P pin.
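
To make clear which lock is meant, here is a minimal sketch that decodes the status register byte of the flash (as returned by the RDSR command, opcode 0x05). The bit positions are an assumption taken from my reading of the Macronix MX25L25635F datasheet and should be checked against it; decode_status and block_protect_enabled are hypothetical helpers, and how the byte is read from the chip is left out.

```python
from typing import Dict, Union


def decode_status(sr: int) -> Dict[str, Union[bool, int]]:
    """Split a SPI NOR status register byte into its fields.

    ASSUMPTION: bit layout per the MX25L25635F datasheet:
    bit 0 WIP, bit 1 WEL, bits 2..5 BP0..BP3, bit 6 QE, bit 7 SRWD.
    """
    return {
        "wip": bool(sr & 0x01),   # write in progress
        "wel": bool(sr & 0x02),   # write enable latch
        "bp": (sr >> 2) & 0x0F,   # block protection level BP3..BP0
        "qe": bool(sr & 0x40),    # quad enable
        "srwd": bool(sr & 0x80),  # status register write disable
    }


def block_protect_enabled(sr: int) -> bool:
    """True if any software block protection bit is set."""
    return decode_status(sr)["bp"] != 0
```

If block_protect_enabled returns True for the value read from the chip, some blocks are software write-protected regardless of the state of the W/P pin strapping.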

