Hardware Watchdog Device in pSeries?
linas at austin.ibm.com
Thu Oct 14 08:53:16 EST 2004
On Wed, Oct 13, 2004 at 05:32:10PM -0500, Joel Schopp was heard to remark:
> >I think Alan's point of view is from the other side of the table:
> >why should someone buy 12 pci-card watchdogs, one for each partition,
> >chewing up 12 pci slots, when the pSeries is already capable of doing
> >watchdog functions? To add insult to injury, the sysadmin now needs
> >to duct-tape each of the watchdog cards to some sort of kill-switch,
> >to reboot a dead partition. The kill-switch needs to then ssh to
> >the fsp or the hmc to start the reboot. So it gets pretty byzantine
> >for something that could have been 'simple' and built-in. Never mind
> >that the reliability goes down: the kill switch could fail, the
> >pci watchdog card could fail (or get EEH'ed out), causing a reboot
> >when no reboot was necessary, etc.
> I will miss the old school hardware watchdog. If I'd had a vote I would
> have voted to keep it. But since it is not a democracy I can only add a
> couple points to this argument.
> First, if people really care about reliability that much they will be
> running with hot spares in a HA environment. In that case there are
> already external monitors that activate the spare on any sign of problems.
Yes, well, Alan is the guy who designs and builds these systems :)
He's trying to figure out how to hook them up to the pSeries.
You can't just cut the power, like you can for PCs :)
> Second, this can all be done from the HMC. The HMC is perfectly capable
> of determining the partition is hung (LED error codes, heartbeat
> timeouts). It is also perfectly capable of rebooting a partition. I am
> not aware that there is a way to put the two together right now, so that
> the HMC automatically reboots the partition if it hangs, but it would
> certainly be an easy feature to add to the HMC.
The HMC is a natural place for this. One of Alan's complaints
is that (non-pSeries) HMCs tend to be semi-proprietary and mostly
unarchitected, with a wide variation from one model to another.
The dependence on Java for core functions also makes them untrustworthy.
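
For illustration only, here is a minimal sketch of the "put the two together"
idea from the quoted text: notice a heartbeat timeout, then trigger a reboot.
Nothing in it is a real HMC or FSP interface. The class name, the partition
names, and the reboot callback are all hypothetical stand-ins, and a real
monitor would also want to consult the LED error codes.

```python
import time
from typing import Callable, Dict, List


class HeartbeatWatchdog:
    """Hypothetical monitor: reboots partitions whose heartbeat goes silent."""

    def __init__(self, timeout_s: float, reboot: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s          # allowed silence before acting
        self.reboot = reboot                # assumed hook; not a real HMC call
        self.clock = clock                  # injectable so it can be tested
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, partition: str) -> None:
        """Record that the partition is alive right now."""
        self.last_seen[partition] = self.clock()

    def check(self) -> List[str]:
        """Reboot every partition that has been silent too long; return them."""
        now = self.clock()
        hung = [p for p, seen in self.last_seen.items()
                if now - seen > self.timeout_s]
        for p in hung:
            self.reboot(p)
            # Restart the timer so the partition gets a full timeout to come
            # back up instead of being rebooted again on the next poll.
            self.last_seen[p] = now
        return hung
```

A caller would invoke heartbeat() whenever a partition reports in and poll
check() periodically. Note that this only moves the single point of failure
into the monitor itself, which is the same reliability worry raised above
about external kill-switches.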