Hardware Watchdog Device in pSeries?

Alan Robertson alanr at unix.sh
Fri Oct 15 03:34:48 EST 2004


Linas Vepstas wrote:
> Hi Alan,
> 
> Long emails confuse me ...
> 
> On Wed, Oct 13, 2004 at 10:41:26PM -0600, Alan Robertson was heard to remark:
> 
>>Linas Vepstas wrote:
>>
>>>why should someone buy 12 pci-card watchdogs, one for each partition,
>>>chewing up 12 pci slots, when the pSeries is already capable of doing
>>
>> It looks really Rube Goldberg-ish (to say the least).
> 
> 
> [...]
> 
>>The hardware watchdog timer is a 3rd party 
>>monitoring system, and therefore is likely to be reliable when the thing it 
>>is watching is sick - 
> 
> 
> 
> Not sure where you're going with this; are you saying that 
> 3rd-party watchdog PCI cards, one for each partition, is a 
> good idea, or a bad idea?  
> 
> Would you rather have the OS monitoring done with 
> (a) watchdog PCI cards,
> (b) with 'surveillance' done by firmware/hypervisor, 
> (c) or with some other method?

I would prefer (b).  Because the firmware/hypervisor runs in its own 
software and address space, separate from the OS, it is effectively a 
third-party reset mechanism.  The test I would use is:  does failure of 
the thing being monitored cause, or correlate with, failure in the thing 
doing the monitoring?  Here the answer is "no" -- therefore it's a 
third-party reset.


I don't have a (c) method in mind that would work in this environment.

Evaluating (a) and (b):

Method (a):
	+ is third party
	- is complex and hard to configure all around
		(think about configuring those cards with passwords
			and ssh, and IP addresses and partition names
			and so on - also think about how many things
			could break and keep this from working).
	- difficult to support
	- doesn't scale well in any obvious way
	- is relatively expensive for the customer (adds several hundred
		dollars for each partition - maybe as much as $1K)
	- difficult to bring into existence (compared to (b))
	- is ugly, kludgy, and Rube Goldberg-ish.

Method (b):
	+ is third party
	+ is relatively simple when compared to (a) (i.e., more reliable)
	+ requires little/no special configuration to make it work
	+ Shows off the advantages of pSeries architecture
	+ adds no cost to the customer's solution
	+ is easy to bring into existence (compared to (a))
	+ is a natural and clean solution.

>>	The bootstrap loader should work much the
> 
> 
> I guess I didn't get this exposition either. 

---- OK -- as I said, this is an improvement over the above, but not
	absolutely critical.  I'll try explaining it again and see
	whether a shorter answer helps -------

> Although it's nice to
> know that boot was successful,  I see boot as a whole lot less 
> important than monitoring the system once it's gone 'online'.  The boot
> sequence can be monitored much more loosely, with a whole lot less
> complexity.  The hypervisor knows when the OS boot sequence starts.
> If the OS hasn't completely booted after, say, 10 minutes, then it
> can call a human to look at the problem.  I don't see why one needs
> to heartbeat once a second during boot; that's hard to do and seems
> unnecessary. 

I didn't say anything about once a second.  It could be once every 30 
seconds - or even every 5 minutes.  That gives you lots of time, and you 
then only have to heartbeat in a couple of select places, and in loops 
waiting for human input.  These aren't so much periodic heartbeats as 
they are progress reports.  If you stop making progress, you get reset.
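The progress-report idea can be sketched as a monitor that watches a 
timestamp rather than demanding a fixed-rate heartbeat: the monitored 
code checkpoints at a few select places, and the monitor only fires when 
no progress has been made within the deadline.  A toy illustration 
(the `reset` callback stands in for whatever the hypervisor would 
actually do):

```python
import threading
import time

class ProgressWatchdog:
    """Fire a reset only when the monitored code stops making progress."""

    def __init__(self, deadline, reset):
        self.deadline = deadline          # seconds allowed between reports
        self.reset = reset                # action to take on a stall
        self.last = time.monotonic()
        self.lock = threading.Lock()
        self.stopped = threading.Event()

    def report_progress(self):
        """Called at a few select checkpoints, not on a fixed schedule."""
        with self.lock:
            self.last = time.monotonic()

    def watch(self):
        # Poll well below the deadline so a stall is noticed promptly.
        while not self.stopped.wait(self.deadline / 10):
            with self.lock:
                stalled = time.monotonic() - self.last > self.deadline
            if stalled:
                self.reset()
                return
```

As long as checkpoints keep arriving inside the deadline, nothing 
happens; only a genuine stall triggers the reset.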

> By contrast, I'd expect to turn on the once-per-second
> heartbeat just before the system goes 'online' or 'critical'.

Resetting automatically instead of calling a human decreases MTTR.  MTTR 
affects system availability - even in a redundant HA cluster - since MTTR 
determines the probability of "simultaneous" failures from which the HA 
system cannot recover.
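To make that concrete, a rough back-of-envelope (illustrative numbers, 
not measurements): in a two-node cluster the unrecoverable case is the 
second node failing while the first is still being repaired, so the 
double-failure unavailability scales with the *square* of MTTR/MTBF:

```python
def cluster_unavailability(mtbf_hours, mttr_hours):
    """Rough two-node model: probability both nodes are down at once.

    Single-node unavailability is MTTR / (MTBF + MTTR); assuming
    independent failures, the pair is down roughly the square of that.
    """
    u = mttr_hours / (mtbf_hours + mttr_hours)
    return u * u

# Illustrative numbers only: assume an MTBF of 10,000 hours.
slow = cluster_unavailability(10_000, 4.0)    # human-driven recovery, ~4 h
fast = cluster_unavailability(10_000, 0.5)    # automatic reset, ~30 min
# Cutting MTTR by 8x cuts double-failure unavailability by roughly 64x.
```

That quadratic payoff is why shaving MTTR matters even when single-node 
availability already looks very good.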

Calling a human is slow and often expensive (particularly on an emergency 
basis).   It takes minutes to hours and may result in an extra service 
charge from someone (depending on who gets the call, what time it is, and 
what arrangements are made, etc.).

A system which doesn't boot isn't providing service.  If service isn't 
being provided, it doesn't matter why it's not being provided (OS, dump, 
bootstrap, BIOS, etc.).  The OS is not the only possible cause of 
failure; it is by far the most likely one, but all software has bugs, 
and hardware has transient failures as well as permanent ones.

A system with these capabilities will continue trying to provide service 
in the presence of (transient) errors until it either succeeds or exceeds 
some retry threshold - at which point a human needs to intervene and fix 
whatever's wrong.
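That retry policy is simple enough to sketch in a few lines (the 
`try_boot` hook is hypothetical, standing in for one attempt to bring 
the system up):

```python
def bring_up(try_boot, max_retries=5):
    """Retry the boot until it succeeds or the threshold is exceeded.

    Returns True when the system comes up; False means retrying won't
    help and a human really is needed.
    """
    for attempt in range(1, max_retries + 1):
        if try_boot():
            return True      # transient error cleared; service restored
    return False             # exceeded the retry threshold
```

A transient failure (bad dump, flaky hardware) clears after a retry or 
two; a permanent one exhausts the threshold and escalates to a human.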

This is essentially autonomic computing for the boot process.

In short:
	With this architecture, the system will come up and provide
		service, or it is broken so badly that retrying won't
		help and a human really is needed.

	Otherwise, no recovery will be performed for errors which keep
		the system from coming up (after a crash or otherwise)
		and some outages may be unnecessarily prolonged.

If your availability is poor, this will make zero difference.  If your 
availability is very good, this helps a little.  And, when your 
availability is very good, it's hard to find things that help even a little...

Of course, being able to say "autonomic computing wired into the lowest 
levels of the system" probably has marketing value beyond the small amount 
of improved availability it provides ;-)

[[If this system is running the air traffic control system while I'm in the 
air, I vote for adding this feature ;-)]].


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce


