Wedge400 (AST2520) OpenBMC stuck at reboot
Chin-Ting Kuo
chin-ting_kuo at aspeedtech.com
Mon Sep 26 17:28:27 AEST 2022
Hi Tao,
We could not reproduce this problem on our AST2500 EVB with our kernel 5.15 SDK image.
We ran a reboot stress test for three days, about 1,980 reboots in total.
Thanks.
Best Wishes,
Chin-Ting
> -----Original Message-----
> From: openbmc
> <openbmc-bounces+chin-ting_kuo=aspeedtech.com at lists.ozlabs.org> On
> Behalf Of Tao Ren
> Sent: Friday, September 23, 2022 6:54 AM
> To: Konstantin Klubnichkin <kitsok at yandex-team.ru>; Ryan Chen
> <ryan_chen at aspeedtech.com>
> Cc: openbmc at lists.ozlabs.org
> Subject: Re: Wedge400 (AST2520) OpenBMC stuck at reboot
>
> Hi Konstantin,
>
> Thanks for sharing. The watchdog control logic in the script is similar to
> aspeed_wdt_restart(), but the good part is that the system is still reachable
> if the watchdog fails to reset it.
>
> Hi Ryan,
>
> Have you ever seen this problem in your environment? It looks like it is
> affecting multiple ASPEED platforms. Any suggestions?
>
> BTW, I'm running Linux 5.15 on Wedge400 (AST2520A2) OpenBMC.
>
>
> Cheers,
>
> Tao
>
> On Thu, Sep 22, 2022 at 10:59:12AM +0300, Konstantin Klubnichkin wrote:
> > Hello!
> >
> > I've faced this issue.
> > Finally my solution was to modify shutdown script:
> >
> > ======================================================
> > # tcsattr(tty, TIOCDRAIN, mode) to drain tty messages to console
> > test -t 1 && stty cooked 0<&1
> >
> > echo "Syncing..."
> > sync || :
> > sync || :
> > sync || :
> >
> > echo "Stopping WDTs..."
> > rev=$(ast_getrev || :)
> > if [ "$rev" = "G5" ]; then
> >     devmem 0x1e78500c 32 0 || :
> >     devmem 0x1e78502c 32 0 || :
> >     devmem 0x1e78504c 32 0 || :
> > fi
> > if [ "$rev" = "G6" ]; then
> >     devmem 0x1e78500c 32 0 || :
> >     devmem 0x1e78504c 32 0 || :
> >     devmem 0x1e78508c 32 0 || :
> >     devmem 0x1e7850cc 32 0 || :
> >     devmem 0x1e78510c 32 0 || :
> >     devmem 0x1e78514c 32 0 || :
> >     devmem 0x1e78518c 32 0 || :
> >     devmem 0x1e7851cc 32 0 || :
> > fi
> >
> > sled_hb_hb || :
> >
> > echo "Setting up WDT1 for ARM reboot"
> > # Set timeout to 5 seconds
> > devmem 0x1e785004 32 0x4c4b40 || :
> > # Load counter reload value to counter register
> > devmem 0x1e785008 32 0x4755 || :
> > # Enable WDT1, reset ARM core only, use first flash (AST2500 only),
> > # disable interrupt, use 1MHz clock (AST2500 only)
> > devmem 0x1e78500c 32 0x53 || :
> >
> > echo -n "WDT1CR " || :
> > devmem 0x1e78500c || :
> >
> > echo "Last heart beats following..."
> >
> > while true; do
> >     echo "KNOCK knock..."
> >     sleep 1
> > done
> >
> > echo "WARNING!!!! ZOMBIE ATTACK!!!"
> >
> > # Execute the command systemd told us to ...
> > if test -d /oldroot && test "$1"
> > then
> >     if test "$1" = kexec
> >     then
> >         $1 -f -e
> >     else
> >         $1 -f
> >     fi
> > fi
> > ======================================================
> >
> > 22.09.2022, 01:09, "Tao Ren" <rentao.bupt at gmail.com>:
> > > Hi there,
> > >
> > > Recently I noticed a few Wedge400 (AST2520A2) units stuck after "reboot"
> > > command. It's hard to reproduce (affecting ~1 out of 1,000 units), but
> > > once it happens, I have to power cycle the chassis to recover OpenBMC.
> > >
> > > I checked aspeed_wdt.c and manually played with watchdog registers, but
> > > everything looks normal to me. Did anyone hit the similar error before?
> > > Any suggestions which area I should look into?
> > >
> > > Below are the last few lines of logs before OpenBMC hangs:
> > >
> > > bmc-oob login:
> > > INIT: Sending processes configured via /etc/inittab the TERM signal
> > > Stopping OpenBSD Secure Shell server: sshdstopped /usr/sbin/sshd (pid 7397 1189)
> > > Stopping ntpd: done
> > > stopping rsyslogd ... done
> > > Stopping random number generator daemon.
> > > Deconfiguring network interfaces... done.
> > > Sending all processes the TERM signal...
> > > rackmond[1747]: Got request exit[ 528.383133] watchdog: watchdog0: watchdog did not stop!
> > > Sending all processes the KILL signal...
> > > Unmounting remote filesystems...
> > > Deactivating swap...
> > > Unmounting local filesystems...
> > > Rebooting... [ 529.725009] reboot: Restarting system
> > >
> > >
> > > Cheers,
> > >
> > > Tao
> >
> > --
> > Best regards,
> > Konstantin Klubnichkin,
> > lead firmware engineer,
> > server hardware R&D group,
> > Yandex Moscow office.
> > tel: +7-903-510-33-33
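
For anyone following the quoted shutdown script, the register values it writes can be sanity-checked with plain shell arithmetic. The sketch below is not part of Konstantin's script and touches no hardware (no devmem calls). The function names are hypothetical, and the instance counts and strides (3 watchdogs 0x20 apart on G5, 8 watchdogs 0x40 apart on G6) are only inferred from the addresses in the script, so please verify them against the datasheet.

```shell
#!/bin/sh
# Watchdog block base and control-register offset, read off the
# addresses used in the quoted script (assumption, not datasheet-checked).
WDT_BASE=0x1e785000
WDT_CTRL_OFF=0x0c

# Print one control-register address per watchdog instance.
# G5: 3 instances, stride 0x20. G6: 8 instances, stride 0x40.
wdt_ctrl_addrs() {
    case "$1" in
        G5) count=3; stride=0x20 ;;
        G6) count=8; stride=0x40 ;;
        *)  return 1 ;;
    esac
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '0x%08x\n' $(($WDT_BASE + i * $stride + $WDT_CTRL_OFF))
        i=$((i + 1))
    done
}

# Convert a reload value to seconds, assuming the 1 MHz clock
# that the script's comment says is selected.
wdt_timeout_seconds() {
    echo $(($1 / 1000000))
}

# List which of the low 8 bits are set in a control word.
wdt_ctrl_bits() {
    bit=0
    while [ "$bit" -lt 8 ]; do
        if [ $((($1 >> bit) & 1)) -eq 1 ]; then
            echo "$bit"
        fi
        bit=$((bit + 1))
    done
}
```

Calling `wdt_timeout_seconds 0x4c4b40` confirms the 5-second timeout in the script's comment, and `wdt_ctrl_bits 0x53` lists bits 0, 1, 4 and 6. My reading, based on the script's comment and the bit definitions in aspeed_wdt.c, is that these are enable, reset-on-timeout, the 1 MHz clock select, and the ARM-only reset mode, with the interrupt bit (2) and the second-flash bit (7) left clear.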