Hardware Watchdog Device in pSeries?
jschopp at austin.ibm.com
Thu Oct 14 08:32:10 EST 2004
> I think Alan's point of view is from the other side of the table:
> why should someone buy 12 pci-card watchdogs, one for each partition,
> chewing up 12 pci slots, when the pSeries is already capable of doing
> watchdog functions? To add insult to injury, the sysadmin now needs
> to duct-tape each of the watchdog cards to some sort of kill-switch,
> to reboot a dead partition. The kill-switch needs to then ssh to
> the fsp or the hmc to start the reboot. So it gets pretty byzantine
> for something that could have been 'simple' and built-in. Never mind
> that the reliability goes down: the kill switch could fail, the
> pci watchdog card could fail (or get EEH'ed out), causing a reboot
> when no reboot was necessary, etc.
I will miss the old-school hardware watchdog. If I'd had a vote, I would
have voted to keep it. But since it is not a democracy, I can only add a
couple of points to this argument.
First, if people really care about reliability that much, they will be
running with hot spares in an HA environment. In that case there are
already external monitors that activate the spare on any sign of problems.
Second, this can all be done from the HMC. The HMC is perfectly capable
of determining the partition is hung (LED error codes, heartbeat
timeouts). It is also perfectly capable of rebooting a partition. I am
not aware that there is a way to put the two together right now, so that
the HMC automatically reboots the partition if it hangs, but it would
certainly be an easy feature to add to the HMC.
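
As a rough illustration of putting the two together, here is a minimal
sketch of a heartbeat-timeout monitor that triggers a reboot action. It
is not HMC code: the class name, the partition names, and the reboot
callback are all hypothetical stand-ins, and the clock is injectable
only so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per partition and reboots on timeout.

    Hypothetical sketch: 'reboot' stands in for whatever mechanism the
    management console would actually use to restart a partition.
    """

    def __init__(self, timeout_s: float, reboot: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.reboot = reboot
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, partition: str) -> None:
        # Record that the partition reported in just now.
        self.last_seen[partition] = self.clock()

    def check(self) -> List[str]:
        # Any partition silent for longer than the timeout is treated as hung.
        now = self.clock()
        hung = [p for p, t in self.last_seen.items()
                if now - t > self.timeout_s]
        for p in hung:
            self.reboot(p)
            # Restart the timer so the next check does not reboot it again
            # before it has had a chance to come back up.
            self.last_seen[p] = now
        return hung
```

The monitor would call check() periodically; each partition calls
heartbeat() (or has it called on its behalf) while it is healthy.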