[PATCH] Firmware update using the update_flash -f <filename> results to soft lockup BUG

Michael Neuling mikey at neuling.org
Fri Jul 8 11:51:09 EST 2011


In message <4E155F15.1030805 at linux.vnet.ibm.com> you wrote:
> On Thursday 07 July 2011 11:42 AM, Michael Neuling wrote:
> > In message <4E1543B6.9060800 at linux.vnet.ibm.com> you wrote:
> >    
> >> Hi ,
> >>
> >> Problem Description:
> >> Firmware update using the update_flash -f <filename> results in a
> >> soft lockup BUG
> >> FLASH: preparing saved firmware image for flash
> >> FLASH: flash image is 50141296 bytes
> >> FLASH: performing flash and reboot
> >> FLASH: this will take several minutes.  Do not power off!
> >> BUG: soft lockup - CPU#1 stuck for 67s! [events/1:36]
> >>
> >> Steps to reproduce:
> >> 1. Check the firmware information on the machine (using ASM or lsmcode)
> >> 2. Update the system firmware with the update_flash command
> >> update_flash -f 01FL350_039_038.img
> >> info: Temporary side will be updated with a newer or
> >> identical image
> >>
> >> Projected Flash Update Results:
> >> Current T Image: FL350_039
> >> Current P Image: FL350_039
> >> New T Image:     FL350_039
> >> New P Image:     FL350_039
> >> Flash image ready...rebooting the system...
> >>
> >> Broadcast message from root at abc
> >> (/dev/hvc0) at 5:25 ...
> >>
> >> The system is going down for reboot NOW!
> >> [root at abc /]# Stopping rhsmcertd[  OK  ]
> >> Stopping atd: [  OK  ]
> >> Stopping cups: [  OK  ]
> >> Stopping abrt daemon: [  OK  ]
> >> Stopping sshd: [  OK  ]
> >> Shutting down postfix: [  OK  ]
> >> Stopping rtas_errd (platform error handling) daemon: [  OK  ]
> >> Stopping crond: [  OK  ]
> >> Stopping automount: [  OK  ]
> >> Stopping HAL daemon: [  OK  ]
> >> Stopping iprdump: [  OK  ]
> >> Killing mdmonitor: [  OK  ]]
> >> Stopping system message bus: [  OK  ]
> >> Stopping rpcbind: [  OK  ]
> >> Stopping auditd: [  OK  ]
> >> Shutting down interface eth0:  [  OK  ]
> >> Shutting down loopback interface:  [  OK  ]
> >> ip6tables: Flushing firewall rules: [  OK  ]
> >> ip6tables: Setting chains to policy ACCEPT: filter [  OK  ]
> >> ip6tables: Unloading modules: [  OK  ]
> >> iptables: Flushing firewall rules: [  OK  ]
> >> iptables: Setting chains to policy ACCEPT: filter [  OK  ]
> >> iptables: Unloading modules: [  OK  ]
> >> Sending all processes the TERM signal... [  OK  ]
> >> Sending all processes the KILL signal... [  OK  ]
> >> Saving random seed:  [  OK  ]
> >> Turning off swap:  [  OK  ]
> >> Turning off quotas:  [  OK  ]
> >> Unmounting pipe file systems:  [  OK  ]
> >> Unmounting file systems:  [  OK  ]
> >> init: Re-executing /sbin/init
> >> Please stand by while rebooting the system...
> >> Restarting system.
> >> FLASH: preparing saved firmware image for flash
> >> FLASH: flash image is 50141296 bytes
> >> FLASH: performing flash and reboot
> >> FLASH: this will take several minutes.  Do not power off!
> >> BUG: soft lockup - CPU#1 stuck for 67s! [events/1:36]
> >>
> >> This is solved by the following patch
> >>      
> > Can you please explain how it fixes it?
> >    
> The flash update is conducted with an RTAS call. The RTAS calls are
> serialized by lock_rtas() which uses a spin_lock.
> 
> Then there is rtasd, which keeps scanning for the RTAS events generated
> on the machine. This is done via the workqueue mechanism.
> rtas_event_scan() also uses an RTAS call to scan the events, so it
> takes lock_rtas() before it issues the request.
>
> The flash update is an operation which takes a long time, and hence
> while it is in progress, anybody else who wants to make an RTAS call
> has to wait until the update is completed. In this case,
> rtas_event_scan() kicks in to check for events and waits a long time
> on the spin_lock, which gives us the soft lockup.
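
The contention described above can be pictured with a toy, single-threaded
model. This is only a sketch, not kernel code: time advances in whole
seconds, there are no real locks or CPUs, and the names rtas_model,
lock_held, spin_seconds and soft_lockup, along with any threshold value
passed in, are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Toy model of one long RTAS call (the flash) holding the RTAS lock.
 * All names and numbers here are assumptions made for illustration. */
struct rtas_model {
    int flash_start;    /* second at which the flash call takes the lock */
    int flash_duration; /* seconds the flash call keeps the lock */
};

/* Is the modelled lock held by the flash call at second t? */
bool lock_held(const struct rtas_model *m, int t)
{
    return t >= m->flash_start && t < m->flash_start + m->flash_duration;
}

/* Seconds an event scan arriving at scan_start spins before it can
 * take the lock. A spinning waiter does nothing else in the meantime. */
int spin_seconds(const struct rtas_model *m, int scan_start)
{
    int t = scan_start;

    while (lock_held(m, t))
        t++;
    return t - scan_start;
}

/* Would a watchdog with the given threshold (in seconds) complain? */
bool soft_lockup(const struct rtas_model *m, int scan_start, int threshold)
{
    return spin_seconds(m, scan_start) > threshold;
}
```

In this model, a hypothetical 300-second flash starting at second 0 makes a
scan that arrives at second 10 spin for 290 seconds, which is past an
assumed 60-second watchdog threshold. A 5-second call would not be.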

What other RTAS calls are going on at this point?  It worries me we are
stopping a CPU that's doing RTAS calls.  Your solution would seem to be
papering over a more serious problem.

> Before the RTAS firmware update starts, all other CPUs should be
> stopped, which means no other CPU should be in lock_rtas(). We do not
> want other CPUs to execute while the FW update is in progress, and the
> system will be rebooted anyway after the update.

Mikey

