Hardware Watchdog Device in pSeries?

Alan Robertson alanr at unix.sh
Thu Oct 14 14:41:26 EST 2004


Linas Vepstas wrote:
> Hi,
> 
> On Wed, Oct 13, 2004 at 03:30:02PM -0600, Alan Robertson was heard to remark:
> 
>>Mike Strosaker wrote:
>>
>>>Linas Vepstas wrote:
>>>
>>>
>>>>I might have volunteered to hack this up real quick, were it not for
>>>>Mike Strosaker's correction, that the surveillance features were taken
>>>>out of Power5.  
>>>>Anyone on this list know why?
>>>>
>>>
>>>I sent the reason I got from the hardware RAS folks to this list a while 
>>>back.
>>>Luckily, it's still in my sent mail folder:
>>>
>>>"Because of the virtualization layer and partitioning, the surveillance
>>>requirement was moved to PHYP<->SP.  Apparently, this was a hotly
>>>contested issue among the platform design folks (especially considering 
>>>that
>>>partitioned power4 systems still have OS<->SP surveillance).  I think 
>>>the logic
>>>is: If an OS goes down, it's not likely a server problem, hence no 
>>>requirement
>>>to monitor from the server side.
>>>
>>>At least the platform gets notified of panics via os-term.  I gather
>>>that some user space tools are expected to monitor for deadlocks/hangs
>>>(maybe clustering tools). "
>>
>>This is about half-right.
>>
>>There is one particular circumstance which can ONLY be monitored from a 
>>hardware-level monitor.
>>
>>OS hangs.
> 
> 
> Heh. I think I can clarify, after talking to the firmware folks.
> 
> The core thinking behind the "platform architecture" was to make
> sure that the underlying hardware, i.e. the "platform", wasn't hung.
> They were not concerned about the OS itself; they assumed that OSes
> have their own independent mechanisms for detecting hung-ness.
> 
> From the platform point of view, they are concerned that they'll
> have a machine with a dozen different partitions on it (a dozen 
> different OSes), and a hardware hang will take down all twelve.
> So they've got the hypervisor and service processor monitoring
> each other, keeping things humming.  If just one partition goes
> down due to a kernel hang/crash, well, that's too bad, but it's
> not the end of the world from the platform point of view.

And this is a great set of goals as far as it goes.  But it is not sufficient 
when you look at the platform as something which actually delivers services, 
not just something which runs the hypervisor.

[[I guess I forgot to say that in addition to being the architect for IBM's 
OSS Linux strategy and product, I worked for 21 years for Bell Labs on 
highly reliable telecommunications systems before this.  So, I have some 
reasonable knowledge of how these kinds of things work in well-tested, 
well-proven systems.  Typically, telephone systems are considered extremely 
reliable - because they follow a well-proven discipline of design.  The 
international telephone system is in effect the world's largest 
ultra-reliable computer.  And, it has been since back when telephone 
switches were made with discrete transistors - largely because of good HA 
system design]]


> I think Alan's point of view is from the other side of the table:
> why should someone buy 12 pci-card watchdogs, one for each partition,
> chewing up 12 pci slots, when the pSeries is already capable of doing
> watchdog functions?   To add insult to injury, the sysadmin now needs
> to duct-tape each of the watchdog cards to some sort of kill-switch,
> to reboot a dead partition.  The kill-switch needs to then ssh to 
> the fsp or the hmc to start the reboot.  So it gets pretty byzantine
> for something that could have been 'simple' and built-in.  Never mind
> that the reliability goes down:  the kill switch could fail, the 
> pci watchdog card could fail (or get EEH'ed out), causing a reboot 
> when no reboot was necessary, etc. 


Linas is right about the cost and complexity of the monitoring cards and 
the whole system.  In addition, if we're trying to position pSeries as a 
premium, highly reliable system that is better than the competition, it just 
doesn't send the right message to tell a customer that this is what they 
have to do.  It looks really Rube Goldberg-ish (to say the least).

In addition, from a technical perspective, there is a basic principle in HA 
systems which is being ignored here...


	A sick system cannot reliably monitor itself.


If you're relying on a system which you believe may be sick to monitor 
itself, it cannot do so reliably under all circumstances - it's sick, and 
therefore not reliable, by definition.  Crazy people may not think they're 
insane ;-).  The hardware watchdog timer is a 3rd-party monitoring system, 
and is therefore likely to remain reliable when the thing it is watching is 
sick - because its sanity is uncorrelated with the failure of the thing it 
is watching.

For example, if a programming error in the kernel makes you halt or loop 
with interrupts disabled, you're screwed with no way out.  On mainframes I 
think this is called a disabled wait state.  Of course, there are more 
complex ways to get into trouble, but hopefully one example makes the point.
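To make that concrete, here is a deliberately destructive sketch of such a 
fault, written as a toy kernel module against a current Linux tree (the 
module and its name are purely illustrative, not anything from this thread). 
Nothing running inside that OS instance can break the loop; only an external 
hardware timer can force a reset.

    /*
     * wedge.c - illustrative only: loading this wedges the current CPU
     * with interrupts off.  No code in this OS instance can recover;
     * only an external (hardware) watchdog can reset the machine.
     */
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/irqflags.h>

    static int __init wedge_init(void)
    {
            local_irq_disable();    /* interrupts off on this CPU */
            for (;;)
                    ;               /* spin forever - a "disabled wait" */
            return 0;               /* never reached */
    }
    module_init(wedge_init);
    MODULE_LICENSE("GPL");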

This is the point of the hierarchy of monitoring I described before.  This 
is very much standard operating procedure for reliable systems in the 
telecom industry (and many others).  In fact, such a watchdog timer is a 
requirement for Carrier Grade Linux (CGL).

Here is the standard way which highly available systems are architected to 
work -- and it's consistent with 35-year industry practice in telephony 
systems, the formal CGL requirements, and the architecture of the Linux-HA 
system.

	The hardware watchdog timer times out when it doesn't get
		a heartbeat in the allotted time.  (duhhh!)

	Just before loading the BIOS, the watchdog timer should be set
		for some "reasonable" amount of time (like a few
		seconds) for the BIOS to load and begin executing.

	The BIOS should set the timer for a reasonable
		time for the bootstrap program to load.  It must tickle
		it periodically while waiting for input from humans.*

	The bootstrap loader should work much the
		same way.  Before it jumps to the OS, it should set
		the timer for a reasonable amount of time for
		the OS to take over the tickling.*

	When it first comes up, the OS takes over and tickles the
		watchdog timer.

	When the HA monitoring subsystem comes up, it takes over
		and tickles the watchdog timer (see the sketch after
		this list).

	As HA-aware processes start up, they tickle individual watchdog
		timers maintained by the HA monitoring subsystem (apphbd).
		If they die or hang, they are restarted by the Recovery
		Manager.  As a special case, apphbd will restart the
		recovery manager as described below.

	The recovery manager registers with the HA monitoring subsystem
		and receives notification of insane or dead
		processes.  If they're insane it kills them.
		When they die, it restarts them.
		If the recovery manager dies (or goes insane), then
		apphbd will (kill and) restart the recovery manager.**

	When the system panics, then the watchdog timer needs to be
		tickled while waiting for human input, and while making
		progress taking a dump. [but only when actually
		making progress].

	When the OS jumps back into the BIOS for any reason
		then the timer is reset to some value suitable
		for the BIOS to take over and start tickling it.
		(~ same as the original value).
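
As a concrete illustration of the "OS / HA subsystem takes over the 
tickling" steps above, here is a minimal sketch using the generic Linux 
watchdog device interface (/dev/watchdog, WDIOC_SETTIMEOUT, WDIOC_KEEPALIVE). 
The device name and the 60-second timeout are placeholder choices, and 
whether pSeries ends up exposing its timer this way is exactly the open 
question in this thread.

    /* wd_tickle.c - minimal sketch of an OS-level watchdog tickler using
     * the standard Linux watchdog character-device interface. */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/watchdog.h>

    int main(void)
    {
            int timeout = 60;       /* seconds before the hardware resets us */
            int fd = open("/dev/watchdog", O_WRONLY);

            if (fd < 0) {
                    perror("open /dev/watchdog");
                    return 1;
            }
            ioctl(fd, WDIOC_SETTIMEOUT, &timeout);  /* not every driver honors this */

            for (;;) {
                    ioctl(fd, WDIOC_KEEPALIVE, 0);  /* "tickle" the timer */
                    sleep(timeout / 3);             /* stay well inside the window */
            }
            /* never reached; writing 'V' before close() would disarm drivers
             * that support the "magic close" feature */
    }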

Now if the BIOS or OS or bootstrap loader, or dump process craps out and 
hangs, or the hard disk can't boot, or a peripheral hangs the bus, then 
this watchdog timer will trigger, and the system will be reset - and you'll 
get a chance to try it again.  [[If you fail too often in too short a 
period of time, then "phone home" or cry "uncle" or sit and cry if you 
like.  Or, you can just keep persisting...]]

Later on, when the HA monitoring system is running, if it (or the scheduler or 
other piece of the OS) craps out and the HA monitoring system doesn't (or 
isn't able to) tickle this watchdog timer - for whatever reason - then 
everything will reboot just like it should.

Notice how many different kinds of errors this one single timer can detect 
and recover from - and how many of them cannot easily be recovered from at 
all without it.  Note how handy it is in designing the system to know that 
your underlying hardware has this capability built-in.  It eliminates a lot 
of complexity from several pieces of software, and does a better job too!

Without this timer, you can't easily design a truly reliable system.  (and 
maybe not at all).

<pedantic-mode>
The lowest level monitor should be the simplest and most reliable.  It 
monitors the OS.  The driver for this in the kernel should also be solid 
and no-frills.  The base-level HA monitoring system (which monitors 
processes for their health) should also be as simple as possible. 
Complexity is the enemy of reliability.  If any of these components fail, 
then the system will be rebooted unnecessarily.  This is a BadThing(TM).

Now, to use this "right", any subsystem that tickles the timer at the next 
higher level should periodically schedule something to evaluate its internal 
sanity (data structure consistency or queue lengths or whatever), and tickle 
the watchdog timer only when it passes whatever its internal sanity measure 
is.
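
A sketch of what that might look like for the process that owns the hardware 
timer - the self-check here is purely a placeholder; a real apphbd-style 
daemon would substitute its own invariants:

    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/watchdog.h>

    /* Placeholder self-check: a real daemon might verify data-structure
     * invariants, queue lengths, or that its worker threads have made
     * progress since the last pass. */
    static int internal_sanity_ok(void)
    {
            return 1;
    }

    int main(void)
    {
            int fd = open("/dev/watchdog", O_WRONLY);

            if (fd < 0)
                    return 1;
            for (;;) {
                    if (internal_sanity_ok())
                            ioctl(fd, WDIOC_KEEPALIVE, 0);  /* tickle only on a passing check */
                    /* if the check keeps failing we simply stop tickling,
                     * and the hardware eventually resets the system */
                    sleep(10);
            }
    }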

Then, if you go into an infinite loop, or doubt your own sanity long 
enough, someone else will eventually do something about it - you'll be 
killed and restarted (if a process) -- or rebooted (if you're the HA 
process monitor, or the BIOS, or bootstrap loader or OS).

Of course, this doesn't *replace* external monitoring (see the note above 
about declaring oneself sick), but it is a good orthogonal measure, and 
simpler to implement for subsystems with limited external interfaces - like 
the bootstrap loader.
</pedantic-mode>

* = Note that these layers may have to deal with bootstrap loaders and/or 
OSes which won't tickle the watchdog timer - so they have to shut it off 
(or set it really long) when booting a layer under them which isn't 
watchdog-aware.
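
For the Linux side of that handoff, the generic watchdog interface already 
has the needed knobs; a hedged sketch follows (driver support for each 
option varies, and the function name is just illustrative):

    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/watchdog.h>

    /* Before handing control to a layer that will not tickle the timer,
     * either stretch the timeout or disarm the watchdog entirely. */
    int prepare_handoff(int fd, int next_layer_is_watchdog_aware)
    {
            if (next_layer_is_watchdog_aware) {
                    int generous = 300;     /* seconds for the next layer to take over */
                    return ioctl(fd, WDIOC_SETTIMEOUT, &generous);
            }

            int off = WDIOS_DISABLECARD;
            if (ioctl(fd, WDIOC_SETOPTIONS, &off) == 0)     /* disarm, if the driver allows it */
                    return 0;

            write(fd, "V", 1);      /* "magic close": disarm on close() */
            return close(fd);
    }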

** = The reason the recovery manager is not part of the apphbd process in 
our design is that the apphbd process should be as simple as it can be - 
because its death or insanity would trigger a system restart.  Putting the 
recovery manager in a separate process lessens the likelihood of an 
unnecessary system restart.  This is not a necessity, but I believe it to be 
a good design choice - after all, it was my design choice ;-)

It is certainly true that we don't have to implement all these things 
today, or at all, but with the hardware watchdog timer, they're possible. 
And, without it, they're not.

Even without implementing all these extra HA features, the hardware watchdog 
still monitors the OS more reliably than the OS can monitor itself.  So, I 
think this is a very worthwhile feature for the platform to have.

Hope this helps!


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce
