Hardware Watchdog Device in pSeries?

Linas Vepstas linas at austin.ibm.com
Thu Oct 14 08:53:16 EST 2004


On Wed, Oct 13, 2004 at 05:32:10PM -0500, Joel Schopp was heard to remark:
> 
> >I think Alan's point of view is from the other side of the table:
> >why should someone buy 12 pci-card watchdogs, one for each partition,
> >chewing up 12 pci slots, when the pSeries is already capable of doing
> >watchdog functions?   To add insult to injury, the sysadmin now needs
> >to duct-tape each of the watchdog cards to some sort of kill-switch,
> >to reboot a dead partition.  The kill-switch needs to then ssh to 
> >the fsp or the hmc to start the reboot.  So it gets pretty byzantine
> >for something that could have been 'simple' and built-in.  Never mind
> >that the reliability goes down:  the kill switch could fail, the 
> >pci watchdog card could fail (or get EEH'ed out), causing a reboot 
> >when no reboot was necessary, etc. 
> 
> I will miss the old school hardware watchdog.  If I'd had a vote I would 
> have voted to keep it.  But since it is not a democracy I can only add a 
> couple points to this argument.
> 
> First, if people really care about reliability that much they will be 
> running with hot spares in a HA environment.  In that case there are 
> already external monitors that activate the spare on any sign of problems.

Yes, well, Alan is the guy who designs and builds these systems :)
He's trying to figure out how to hook them up to the pSeries.  
You can't just cut the power, like you can for PCs :)

http://www.linux-ha.org

> Second, this can all be done from the HMC.  The HMC is perfectly capable 
> of determining the partition is hung (LED error codes, heartbeat 
> timeouts).  It is also perfectly capable of rebooting a partition.  I am 
> not aware that there is a way to put the two together right now, so that 
> the HMC automatically reboots the partition if it hangs, but it would 
> certainly be an easy feature to add to the HMC.

The HMC is a natural place for this.  One of Alan's complaints
is that (non-pSeries) HMCs tend to be semi-proprietary and mostly 
unarchitected, with a wide variation from one model to another.
The dependence on Java for core functions also makes them untrustworthy.
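
To make the "put the two together" idea concrete, here is a rough sketch
of the heartbeat-timeout logic in Python.  It is only an illustration:
the HeartbeatMonitor class, the partition names, and the reboot callback
are all hypothetical, and nothing here is an actual HMC or FSP interface.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per partition and reboots ones that go quiet."""

    def __init__(self, timeout_s: float, reboot: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        # Hypothetical hook: a real one would ask the HMC/FSP to restart
        # the partition.
        self.reboot = reboot
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, partition: str) -> None:
        """Record that a partition has just checked in."""
        self.last_seen[partition] = self.clock()

    def check(self) -> List[str]:
        """Reboot every partition whose last heartbeat is older than the timeout."""
        now = self.clock()
        hung = [p for p, t in self.last_seen.items()
                if now - t > self.timeout_s]
        for p in hung:
            self.reboot(p)
            # Restart the timer so the partition is not rebooted again
            # while it is still coming back up.
            self.last_seen[p] = now
        return hung
```

Each partition would call heartbeat() periodically and the monitor would
call check() on a timer.  The clock is injectable so the timeout logic
can be exercised without real waiting.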

--linas



More information about the Linuxppc64-dev mailing list