Wedge400 (AST2520) OpenBMC stuck at reboot
Tao Ren
rentao.bupt at gmail.com
Tue Sep 27 17:04:32 AEST 2022
Hi Chin-Ting,
Thank you for spending time on this problem!
Could you please share the git repo URL of the kernel-5.15 SDK you are
referring to? Do you see any critical SDK kernel patches that are not
upstreamed yet, but could potentially help to solve the reboot hang
issue?
BTW, below is my kernel tree, which is derived from Joel's kernel tree,
dev-5.15 branch:
https://github.com/facebook/openbmc-linux/tree/dev-5.15
Cheers,
Tao
On Mon, Sep 26, 2022 at 07:28:27AM +0000, Chin-Ting Kuo wrote:
> Hi Tao,
>
> This problem cannot be reproduced on our AST2500 EVB with our kernel-5.15 SDK image.
> We ran a reboot stress test for three days, about 1,980 reboots in total.
>
> Thanks.
>
> Best Wishes,
> Chin-Ting
>
> > -----Original Message-----
> > From: openbmc
> > <openbmc-bounces+chin-ting_kuo=aspeedtech.com at lists.ozlabs.org> On
> > Behalf Of Tao Ren
> > Sent: Friday, September 23, 2022 6:54 AM
> > To: Konstantin Klubnichkin <kitsok at yandex-team.ru>; Ryan Chen
> > <ryan_chen at aspeedtech.com>
> > Cc: openbmc at lists.ozlabs.org
> > Subject: Re: Wedge400 (AST2520) OpenBMC stuck at reboot
> >
> > Hi Konstantin,
> >
> > Thanks for sharing. The watchdog control logic in the script is similar to
> > aspeed_wdt_restart(), but the good part is that the system is still reachable
> > if the watchdog cannot reset the system successfully.
> >
> > Hi Ryan,
> >
> > Have you ever seen the problem in your environment? Looks like it is affecting
> > multiple ASPEED platforms. Any suggestions?
> >
> > BTW, I'm running Linux 5.15 in Wedge400 AST2520A2 OpenBMC.
> >
> >
> > Cheers,
> >
> > Tao
> >
> > On Thu, Sep 22, 2022 at 10:59:12AM +0300, Konstantin Klubnichkin wrote:
> > > Hello!
> > >
> > > I've faced this issue.
> > > Finally my solution was to modify shutdown script:
> > >
> > > ======================================================
> > > # tcsattr(tty, TIOCDRAIN, mode) to drain tty messages to console
> > > test -t 1 && stty cooked 0<&1
> > >
> > > echo "Syncing..."
> > > sync || :
> > > sync || :
> > > sync || :
> > >
> > > echo "Stopping WDTs..."
> > > rev=$(ast_getrev || :)
> > > if [ "$rev" = "G5" ]; then
> > >     devmem 0x1e78500c 32 0 || :
> > >     devmem 0x1e78502c 32 0 || :
> > >     devmem 0x1e78504c 32 0 || :
> > > fi
> > > if [ "$rev" = "G6" ]; then
> > >     devmem 0x1e78500c 32 0 || :
> > >     devmem 0x1e78504c 32 0 || :
> > >     devmem 0x1e78508c 32 0 || :
> > >     devmem 0x1e7850cc 32 0 || :
> > >     devmem 0x1e78510c 32 0 || :
> > >     devmem 0x1e78514c 32 0 || :
> > >     devmem 0x1e78518c 32 0 || :
> > >     devmem 0x1e7851cc 32 0 || :
> > > fi
> > >
> > > sled_hb_hb || :
> > >
> > > echo "Setting up WDT1 for ARM reboot"
> > > # Set timeout to 5 seconds
> > > devmem 0x1e785004 32 0x4c4b40 || :
> > > # Load counter reload value to counter register
> > > devmem 0x1e785008 32 0x4755 || :
> > > # Enable WDT1, reset ARM core only, use first flash (AST2500 only),
> > > # disable interrupt, use 1MHz clock (AST2500 only)
> > > devmem 0x1e78500c 32 0x53 || :
> > >
> > > echo -n "WDT1CR " || :
> > > devmem 0x1e78500c || :
> > >
> > > echo "Last heart beats following..."
> > >
> > > while true; do
> > >     echo "KNOCK knock..."
> > >     sleep 1
> > > done
> > >
> > > echo "WARNING!!!! ZOMBIE ATTACK!!!"
> > >
> > > # Execute the command systemd told us to ...
> > > if test -d /oldroot && test "$1"
> > > then
> > >     if test "$1" = kexec
> > >     then
> > >         $1 -f -e
> > >     else
> > >         $1 -f
> > >     fi
> > > fi
> > > ======================================================
> > >
> > > 22.09.2022, 01:09, "Tao Ren" <rentao.bupt at gmail.com>:
> > > > Hi there,
> > > >
> > > > Recently I noticed a few Wedge400 (AST2520A2) units stuck after "reboot"
> > > > command. It's hard to reproduce (affecting ~1 out of 1,000 units), but
> > > > once it happens, I have to power cycle the chassis to recover OpenBMC.
> > > >
> > > > I checked aspeed_wdt.c and manually played with watchdog registers, but
> > > > everything looks normal to me. Did anyone hit the similar error before?
> > > > Any suggestions which area I should look into?
> > > >
> > > > Below are the last few lines of logs before OpenBMC hangs:
> > > >
> > > > bmc-oob login:
> > > > INIT: Sending processes configured via /etc/inittab the TERM signal
> > > > Stopping OpenBSD Secure Shell server: sshdstopped /usr/sbin/sshd (pid 7397 1189)
> > > > Stopping ntpd: done
> > > > stopping rsyslogd ... done
> > > > Stopping random number generator daemon.
> > > > Deconfiguring network interfaces... done.
> > > > Sending all processes the TERM signal...
> > > > rackmond[1747]: Got request exit[ 528.383133] watchdog: watchdog0: watchdog did not stop!
> > > > Sending all processes the KILL signal...
> > > > Unmounting remote filesystems...
> > > > Deactivating swap...
> > > > Unmounting local filesystems...
> > > > Rebooting... [ 529.725009] reboot: Restarting system
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Tao
> > >
> > > --
> > > Best regards,
> > > Konstantin Klubnichkin,
> > > lead firmware engineer,
> > > server hardware R&D group,
> > > Yandex Moscow office.
> > > tel: +7-903-510-33-33