Hardware Watchdog Device in pSeries?
alanr at unix.sh
Fri Oct 15 03:34:48 EST 2004
Linas Vepstas wrote:
> Hi Alan,
> Long emails confuse me ...
> On Wed, Oct 13, 2004 at 10:41:26PM -0600, Alan Robertson was heard to remark:
>>Linas Vepstas wrote:
>>>why should someone buy 12 pci-card watchdogs, one for each partition,
>>>chewing up 12 pci slots, when the pSeries is already capable of doing
>> It looks really Rube Goldberg-ish (to say the least).
>>The hardware watchdog timer is a 3rd party
>>monitoring system, and therefore is likely to be reliable when the thing it
>>is watching is sick -
> Not sure where you're going with this; are you saying that
> 3rd-party watchdog PCI cards, one for each partition, is a
> good idea, or a bad idea?
> Would you rather have the OS monitoring done with
> (a) watchdog PCI cards,
> (b) with 'surveillance' done by firmware/hypervisor,
> (c) or with some other method?
I would prefer (b). Because the software and address spaces of the
firmware/hypervisor are separate from those of the OS, it is effectively a
third-party reset mechanism. The test I would use is: does failure of the
thing being monitored cause, or correlate with, failure in the thing doing
the monitoring? Here the answer is "no", so it qualifies as a third-party reset.
I don't have a (c) method in mind that would work in this environment.
Evaluating (a) and (b):

(a) watchdog PCI cards:
     + is third party
     - is complex and hard to configure all around
       (think about configuring those cards with passwords,
       ssh, IP addresses, partition names, and so on; also
       think about how many things could break and keep
       this from working)
     - difficult to support
     - doesn't scale well in any obvious way
     - is relatively expensive for the customer (adds several hundred
       dollars per partition, maybe as much as $1K)
     - difficult to bring into existence (compared to (b))
     - is ugly, kludgy, and Rube Goldberg-ish

(b) firmware/hypervisor surveillance:
     + is third party
     + is relatively simple compared to (a) (i.e., more reliable)
     + requires little or no special configuration to make it work
     + shows off the advantages of the pSeries architecture
     + adds no cost to the customer's solution
     + is comparatively easy to bring into existence (compared to (a))
     + is a natural and clean solution
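For flavor, here is a minimal sketch (not from the original mail) of what the
OS-side half of an option-(b)-style setup could look like if the firmware timer
were exposed through the generic Linux watchdog character device: writing any
byte pets the timer, and writing 'V' before close is the "magic close" that
disarms it for an orderly shutdown. The device path, interval, and function
names are all illustrative assumptions.

```python
import os
import time

def pet_watchdog_loop(dev_path="/dev/watchdog", interval_s=30,
                      should_run=lambda: True):
    """Periodically pet a Linux watchdog device; if the OS hangs and
    this loop stops running, the timer expires and the partition is
    reset by the third party (firmware/hypervisor or card)."""
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        while should_run():
            os.write(fd, b"\x00")   # any write resets the countdown
            time.sleep(interval_s)
    finally:
        # "Magic close": writing 'V' tells the driver this is an
        # orderly shutdown, so the timer is disarmed rather than fired.
        os.write(fd, b"V")
        os.close(fd)
```

Note that with (b) there is nothing per-partition to configure here beyond the
petting interval; the reset path itself lives below the OS.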
>> The bootstrap loader should work much the
> I guess I didn't get this exposition either.
---- OK -- as I said, this is an improvement over the above, but not
absolutely critical. I'll try explaining it again and see if giving a
shorter answer helps -------
> Although its nice to
> know that boot was successful, I see boot as a whole lot less
> important than monitoring the system once its gone 'online'. The boot
> sequence can be monitored much more loosely, with a whole-lot less
> complexity. The hypervisor knows when the OS boot sequence starts.
> If the OS hasn't completely booted after, say, 10 minutes, then it
> can call a human to look at the problem. I don't see why one needs
> to heartbeat once a second during boot; that's hard to do and seems
I didn't say anything about once a second. It could be once every 30
seconds, or even every 5 minutes. That gives you lots of time, and you then
only have to heartbeat in a couple of select places, and while in loops
waiting for human input. These aren't so much periodic heartbeats as
they are progress reports: if you stop making progress, you get reset.
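The progress-report idea above can be sketched in a few lines. This is my
illustration, not anything from the mail or from pSeries firmware: the OS side
calls report_progress() at a few select places, and the monitor side
periodically calls check(), firing a reset only when no progress has been
reported within the (generous) timeout.

```python
import time

class ProgressWatchdog:
    """Third-party monitor: fires a reset if the monitored system
    stops reporting progress. Names and intervals are hypothetical;
    the mail suggests something like 30 s to 5 min, not 1 s."""

    def __init__(self, timeout_s, reset_fn):
        self.timeout_s = timeout_s
        self.reset_fn = reset_fn          # e.g., hypervisor reset call
        self.last_report = time.monotonic()

    def report_progress(self):
        # Called by the booting OS at a few select places, and inside
        # loops that wait for human input.
        self.last_report = time.monotonic()

    def check(self):
        # Run periodically on the monitoring side.
        if time.monotonic() - self.last_report > self.timeout_s:
            self.reset_fn()
            return True    # reset fired: progress stopped
        return False
```

The point is that the timeout measures progress, not liveness at a fixed beat,
so the booting OS needs only a handful of instrumentation points.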
> By contrast, I'd expect to turn on the once-per-second
> heartbeat just before the system goes 'online' or 'critical'.
This change decreases MTTR. MTTR has an effect on system availability -
even in a redundant HA cluster - since MTTR determines the probability of
"simultaneous" failures from which the HA system cannot recover.
Calling a human is slow and often expensive (particularly on an emergency
basis). It takes minutes to hours and may result in an extra service
charge from someone (depending on who gets the call, what time it is, and
what arrangements are made, etc.).
A system which doesn't boot isn't providing service. If service isn't
being provided, it doesn't matter why it's not being provided (OS, dump,
bootstrap, BIOS, etc.). The OS is not the only possible cause of failure;
it is by far the most likely of these, but all software has bugs, and
hardware has transient failures as well as permanent ones.
A system with these capabilities will continue trying to provide service
in the presence of (transient) errors until it succeeds or exceeds some
retry threshold, at which point a human needs to intervene and fix whatever's
wrong.
This is essentially autonomic computing for the boot process.
With this architecture, the system will come up and provide
service, or it is broken so badly that retrying won't
help and a human really is needed.
Otherwise, no recovery will be performed for errors which keep
the system from coming up (after a crash or otherwise)
and some outages may be unnecessarily prolonged.
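The retry policy described above reduces to a very small loop. This is purely
my sketch of the stated policy, with hypothetical names for the boot attempt,
the retry threshold, and the "summon a human" action:

```python
def autonomic_boot(try_boot, max_retries, call_human):
    """Keep retrying through transient failures until the system
    boots, or the retry threshold is exceeded, meaning the fault
    is persistent and a human really is needed."""
    for attempt in range(1, max_retries + 1):
        if try_boot():
            return True      # system is up and providing service
    call_human()             # retrying won't help; escalate
    return False
```

Either the system comes up on its own, or it is broken badly enough that a
human would have been needed anyway; no outage is prolonged just waiting for
someone to notice.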
If your availability is poor, this will make zero difference. If your
availability is very good, this helps a little. And, when your
availability is very good, it's hard to find things that help even a little...
Of course, being able to say "autonomic computing wired into the lowest
levels of the system" probably has marketing value beyond the small amount
of improved availability it provides ;-)
[[If this system is running the air traffic control system while I'm in the
air, I vote for adding this feature ;-)]].
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce
More information about the Linuxppc64-dev mailing list