<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:244657727;
mso-list-type:hybrid;
mso-list-template-ids:-870125908 67698689 67698691 67698693 67698689 67698691 67698693 67698689 67698691 67698693;}
@list l0:level1
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l0:level2
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level3
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l0:level4
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l0:level5
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level6
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l0:level7
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l0:level8
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level9
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l1
{mso-list-id:1152989048;
mso-list-type:hybrid;
mso-list-template-ids:1200374366 -1 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l1:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l1:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2
{mso-list-id:1555698536;
mso-list-type:hybrid;
mso-list-template-ids:1146099230 -1 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l2:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l2:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l2:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">Hi Team,</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I have been validating OpenBMC on our POC system for a while. Starting about 2 weeks ago, the BMC filesystem began to report failures intermittently, and afterwards the BMC sometimes hangs after running for a while. It started on
one system and then appeared on another. I re-flashed with a programmer and still see the issue. I also flashed back to the very first known-good OpenBMC image we built and still see the same symptoms. It looks like a SPI ROM failure, but when I flash back the POC system's
original 3<sup>rd</sup>-party BMC firmware, the issue does not occur at all. Has anyone seen a similar issue before?
</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">There are 2 symptoms,</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><u>#1</u>,</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The BMC debug console shows this error:</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 4242.029061] SQUASHFS error: xz decompression failed, data probably corrupt<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 4242.035970] SQUASHFS error: squashfs_read_data failed to read block 0xce5cb0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 4242.043159] SQUASHFS error: Unable to read data cache entry [ce5cb0]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 4242.049627] SQUASHFS error: Unable to read page, block ce5cb0, size da44<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 4242.056386] SQUASHFS error: Unable to read data cache entry [ce5cb0]<o:p></o:p></span></p>
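<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">To relate the failing SquashFS block to a location on the SPI ROM, the offset arithmetic can be sketched in Python as below. The partition base addresses are assumptions taken from the default 32 MiB layout in openbmc-flash-layout.dtsi and should be confirmed against /proc/mtd on the running BMC. ASSUMED_PARTITION_BASE and absolute_flash_offset are hypothetical helper names, not part of OpenBMC.</p>
<pre>
```python
# Partition bases are ASSUMED from the default 32 MiB OpenBMC flash layout;
# confirm them against /proc/mtd before relying on the result.
ASSUMED_PARTITION_BASE = {
    "rofs": 0x04C0000,
    "rwfs": 0x1C00000,
}


def absolute_flash_offset(partition, offset_in_partition, bases=ASSUMED_PARTITION_BASE):
    """Translate a partition-relative byte offset into an offset from the start of the flash."""
    return bases[partition] + offset_in_partition


# Block address (0xce5cb0) and size (0xda44) reported by the SQUASHFS errors above.
block_start = absolute_flash_offset("rofs", 0xCE5CB0)
block_end = block_start + 0xDA44  # one past the last byte of the compressed block

print(f"failing SquashFS block spans flash offsets {block_start:#x}..{block_end - 1:#x}")
```
</pre>
<p class="MsoNormal">With the absolute range known, the same region can be read back with the programmer and compared against the image that was flashed.</p>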
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">After rebooting, the BMC may show that error again and then stop while reading the rootfs, with the following errors:</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.372932] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.383951] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0a60. {1985,e002,15000044,98f7fb1d}<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.394949] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e09e4. {1985,e002,15000044,98f7fb1d}<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.405958] jffs2: notice: (78) check_node_data: wrong data CRC in data node at 0x003e0af0: read 0x5ab53bf4, calculated 0xb6f14204.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.417873] jffs2: warning: (78) jffs2_do_read_inode_internal: no data nodes found for ino #8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.426478] jffs2: Returned error for crccheck of ino #8. Expect badness...<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.492939] jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0bc8. {1985,e002,15000044,98f7fb1d}<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.503923] jffs2: warning: (78) jffs2_do_read_inode_internal: no data nodes found for ino #9<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">[ 3.512462] jffs2: Returned error for crccheck of ino #9. Expect badness...<o:p></o:p></span></p>
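<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">A minimal sketch for collecting the failing offsets from the jffs2 messages above and grouping them by erase block, to see whether the CRC failures cluster in one region of rwfs. The 64 KiB erase block size is an assumption (check the erasesize column of /proc/mtd), and failing_offsets and failures_per_erase_block are hypothetical helper names.</p>
<pre>
```python
import re
from collections import Counter

# Kernel messages copied from the boot log above (timestamps omitted).
LOG = """\
jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0aa4. {1985,e002,0000004a,78280c2e}
jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0a60. {1985,e002,15000044,98f7fb1d}
jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e09e4. {1985,e002,15000044,98f7fb1d}
jffs2: notice: (78) check_node_data: wrong data CRC in data node at 0x003e0af0: read 0x5ab53bf4, calculated 0xb6f14204.
jffs2: notice: (78) jffs2_get_inode_nodes: Node header CRC failed at 0x3e0bc8. {1985,e002,15000044,98f7fb1d}
"""

# ASSUMPTION: 64 KiB erase blocks; the real value is in /proc/mtd.
ERASE_BLOCK_SIZE = 0x10000

# Matches the offset that follows either kind of CRC failure message.
OFFSET_RE = re.compile(r"(?:CRC failed at|data node at) (0x[0-9a-fA-F]+)")


def failing_offsets(log_text):
    """Return the partition-relative offsets named in jffs2 CRC failure messages."""
    return [int(match.group(1), 16) for match in OFFSET_RE.finditer(log_text)]


def failures_per_erase_block(offsets, erase_block_size=ERASE_BLOCK_SIZE):
    """Count failures per erase block, keyed by the block's starting offset."""
    return Counter(off // erase_block_size * erase_block_size for off in offsets)


offsets = failing_offsets(LOG)
for base, count in sorted(failures_per_erase_block(offsets).items()):
    print(f"erase block at {base:#x}: {count} CRC failure(s)")
```
</pre>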
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">After that, the BMC either enters recovery mode or hangs.</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><u>#2</u>,</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The BMC debug console shows the same SQUASHFS error as above. Checking filesystem usage shows that rwfs usage keeps increasing:</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-family:Consolas">root@dgx:~# df<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">Filesystem 1K-blocks Used Available Use% Mounted on<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">dev 212904 0 212904 0% /dev<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">tmpfs 246728 20172 226556 8% /run<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">/dev/mtdblock4 22656 22656 0 100% /run/initramfs/ro<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">/dev/mtdblock5 4096 880 3216 21% /run/initramfs/rw<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">cow 4096 880 3216 21% /<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">tmpfs 246728 8 246720 0% /dev/shm<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">tmpfs 246728 0 246728 0% /sys/fs/cgroup<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">tmpfs 246728 0 246728 0% /tmp<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">tmpfs 246728 8 246720 0% /var/volatile<o:p></o:p></span></p>
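<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">To quantify how fast rwfs fills up, two df captures taken some time apart can be compared. A minimal sketch follows. It assumes the six-column df output shown above, and used_kib_by_mount and growth_kib are hypothetical helper names.</p>
<pre>
```python
# Part of the df capture shown above.
DF_CAPTURE = """\
Filesystem 1K-blocks Used Available Use% Mounted on
dev 212904 0 212904 0% /dev
tmpfs 246728 20172 226556 8% /run
/dev/mtdblock4 22656 22656 0 100% /run/initramfs/ro
/dev/mtdblock5 4096 880 3216 21% /run/initramfs/rw
cow 4096 880 3216 21% /
"""


def used_kib_by_mount(df_text):
    """Map each mount point in df output to its Used column (1K-blocks)."""
    used = {}
    for line in df_text.splitlines():
        fields = line.split()
        # Data rows have six fields with a numeric Used column; this skips the header.
        if len(fields) == 6 and fields[2].isdigit():
            used[fields[5]] = int(fields[2])
    return used


def growth_kib(before, after, mount):
    """Increase in used space on a mount point between two parsed captures."""
    return after[mount] - before[mount]


usage = used_kib_by_mount(DF_CAPTURE)
print("rwfs used KiB:", usage["/run/initramfs/rw"])
```
</pre>
<p class="MsoNormal">Calling growth_kib with the parsed result of an earlier and a later capture gives the increase for a mount point such as /run/initramfs/rw.</p>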
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">More and more ipmid coredump files also appear:</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="font-family:Consolas">root@dgx:~# ls -al /run/initramfs/rw/cow/var/lib/systemd/coredump/<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">drwxr-xr-x 2 root root 0 Aug 21 16:04 .<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 57344 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5710.1598025874000000000000.xzaba143da6d9b5571<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 655360 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5710.1598025874000000000000ba58c927628d3950<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 0 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5713.1598025880000000000000.xzee8c94e72fc5b173<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 655360 Aug 21 16:04 .#core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5713.159802588000000000000089ee90c2a557ac1c<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">drwxr-xr-x 6 root root 0 Jan 1 1970 ..<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 92492 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5630.1598025699000000000000.xz<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 92572 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5641.1598025723000000000000.xz<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 92652 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5645.1598025728000000000000.xz<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">rw-r---- 1 root root 92476 Aug 21 16:02 core.ipmid.0.86cd480e19db45ee9417b2d0af1a443c.5651.1598025754000000000000.xz<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The journal logs show that ipmid failed to access files such as /usr/share/ipmi-providers/channel_config.json, so ipmid seems to be another victim of the filesystem failure.</p>
<p class="MsoNormal">After a while, the BMC just hangs.</p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Some recovery methods are available, but their success rate is very low:</p>
<p class="MsoNormal"><o:p> </o:p></p>
<ul style="margin-top:0in" type="disc">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Leave the BMC alone for some time. It sometimes comes back to work, but not always.</li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1">Reboot the BMC or do an AC cycle. This sometimes makes it work, but not always.</li></ul>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I found the following actions could trigger this failure,</p>
<p class="MsoNormal"><o:p> </o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l2 level1 lfo2">Log in to the BMC debug console remotely over SSH. When the failure is triggered, it shows this error:</li></ol>
<p class="MsoNormal"><span style="font-family:Consolas">$ ssh root@&lt;bmc ip&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:Consolas">ssh_exchange_identification: read: Connection reset by peer<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<ol style="margin-top:0in" start="2" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l2 level1 lfo2">Set the BMC MAC address with fw_setenv in the BMC debug console, reboot the BMC, and run 'ip -a'.</li></ol>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The code is based on upstream commit <b>5ddb5fa99ec259 </b>on master branch.<o:p></o:p></p>
<p class="MsoNormal">The flash layout definition is the default <b>openbmc-flash-layout.dtsi</b>.<o:p></o:p></p>
<p class="MsoNormal">The SPI ROM is a <b>Macronix MX25L25635F</b>.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Some questions:<o:p></o:p></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">Is any SPI lock feature enabled in OpenBMC?<o:p></o:p></li><li class="MsoListParagraph" style="margin-left:0in;mso-list:l1 level1 lfo3">If yes, do I have to unlock the u-boot-env partition before running fw_setenv?<o:p></o:p></li></ol>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Best regards,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Kun Zhao<o:p></o:p></p>
<p class="MsoNormal">/*<o:p></o:p></p>
<p class="MsoNormal"> <a href="mailto:zkxz@hotmail.com">zkxz@hotmail.com</a><o:p></o:p></p>
<p class="MsoNormal">*/<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>