Hardware Watchdog Device in pSeries?
alanr at unix.sh
Thu Oct 14 07:30:02 EST 2004
Mike Strosaker wrote:
> Linas Vepstas wrote:
>> I might have volunteered to hack this up real quick, were it not for
>> Mike Strosaker's correction, that the surveillance features were taken
>> out of Power5.
>> Anyone on this list know why?
> I sent the reason I got from the hardware RAS folks to this list a while
> ago. Luckily, it's still in my sent mail folder:
> "Because of the virtualization layer and partitioning, the surveillance
> requirement was moved to PHYP<->SP. Apparently, this was a hotly
> contested issue among the platform design folks (especially considering
> partitioned power4 systems still have OS<->SP surveillance). I think
> the logic is: If an OS goes down, it's not likely a server problem,
> hence no need to monitor from the server side.
> At least the platform gets notified of panics via os-term. I gather
> that some user space tools are expected to monitor for deadlocks/hangs
> (maybe clustering tools). "
This is about half-right.
There is one particular circumstance which can ONLY be monitored from a
If the OS hangs, then nothing but a hardware timer can bring the machine
out of its hung state. Hangs do NOT panic (by definition), and can't be
reliably detected any other way.
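
To make the hardware-timer idea concrete, here is a minimal sketch of the user-space side, assuming the platform exposes the timer through the generic Linux watchdog character device (/dev/watchdog). Whether any pSeries driver provides that device is exactly the open question in this thread, so the path is an assumption, and feed_watchdog and run are hypothetical names chosen for illustration. Each write resets the hardware countdown; if the OS hangs, the writes stop, and the timer expires and resets the machine without any software involvement.

```python
import time


def feed_watchdog(dev, beats, interval_s, sleep=time.sleep):
    """Write one heartbeat byte per interval to an already-open watchdog device.

    `dev` is any binary file-like object; on a real system it would be the
    open /dev/watchdog device (an assumption, see the text above).
    """
    for _ in range(beats):
        dev.write(b"\0")  # any write resets the hardware countdown
        dev.flush()
        sleep(interval_s)  # must stay well below the hardware timeout


def run(path="/dev/watchdog", beats=10, interval_s=5.0):
    # Opening the device arms the timer. buffering=0 so every write reaches
    # the driver immediately instead of sitting in a user-space buffer.
    with open(path, "wb", buffering=0) as dev:
        feed_watchdog(dev, beats, interval_s)
        # "Magic close": drivers that support it disarm the timer when the
        # last byte written before close is 'V'. Others keep counting.
        dev.write(b"V")
```

Note that nothing in this loop detects a hang. The detection lives entirely in the hardware: the loop's only job is to keep proving that the OS is still scheduling processes.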
In highly available systems (like telecom systems), hardware level monitors
are required. Leaving it out sends the message that "availability isn't
The normal way to build a highly available system is to have layers (or a
hierarchy) of watchers.
At the bottom is the hardware monitor.
Above that is an application monitor.
Above that are resource monitors.
But there are certain kinds of faults which cannot be caught without this
bottom-layer monitor.
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce
More information about the Linuxppc64-dev mailing list