Hardware Watchdog Device in pSeries?
Alan Robertson
alanr at unix.sh
Thu Oct 14 14:41:26 EST 2004
Linas Vepstas wrote:
> Hi,
>
> On Wed, Oct 13, 2004 at 03:30:02PM -0600, Alan Robertson was heard to remark:
>
>>Mike Strosaker wrote:
>>
>>>Linas Vepstas wrote:
>>>
>>>
>>>>I might have volunteered to hack this up real quick, were it not for
>>>>Mike Strosaker's correction, that the surveillance features were taken
>>>>out of Power5.
>>>>Anyone on this list know why?
>>>>
>>>
>>>I sent the reason I got from the hardware RAS folks to this list a while
>>>back.
>>>Luckily, it's still in my sent mail folder:
>>>
>>>"Because of the virtualization layer and partitioning, the surveillance
>>>requirement was moved to PHYP<->SP. Apparently, this was a hotly
>>>contested issue among the platform design folks (especially considering
>>>that
>>>partitioned power4 systems still have OS<->SP surveillance). I think
>>>the logic
>>>is: If an OS goes down, it's not likely a server problem, hence no
>>>requirement
>>>to monitor from the server side.
>>>
>>>At least the platform gets notified of panics via os-term. I gather
>>>that some user space tools are expected to monitor for deadlocks/hangs
>>>(maybe clustering tools). "
>>
>>This is about half-right.
>>
>>There is one particular circumstance which can ONLY be monitored from a
>>hardware-level monitor.
>>
>>OS hangs.
>
>
> Heh. I think I can clarify, after talking to the firmware folks.
>
> The core thinking behind the "platform architecture" was to make
> sure that the underlying hardware, i.e. the "platform", wasn't hung.
> They were not concerned about the OS itself; they assumed that OS'es
> have their own independent mechanisms for detecting hung-ness.
>
> From the platform point of view, they are concerned that they'll
> have a machine with a dozen different partitions on it (a dozen
> different OS'es), and a hardware hang will take down all twelve.
> So they've got the hypervisor and service processor monitoring
> each other, keeping things humming. If just one partition goes
> down due to a kernel hang/crash, well, that's too bad, but it's
> not the end of the world from the platform point of view.
And this is a great set of goals as far as it goes. But it's not sufficient
when looking at the platform as something which actually delivers services,
not just runs the hypervisor.
[[I guess I forgot to say that in addition to being the architect for IBM's
OSS Linux strategy and product, I worked for 21 years for Bell Labs on
highly reliable telecommunications systems before this. So, I have some
reasonable knowledge of how these kinds of things work in well-tested,
well-proven systems. Typically, telephone systems are considered extremely
reliable - because they follow a well-proven discipline of design. The
international telephone system is in effect the world's largest
ultra-reliable computer. And, it has been since back when telephone
switches were made with discrete transistors - largely because of good HA
system design]]
> I think Alan's point of view is from the other side of the table:
> why should someone buy 12 pci-card watchdogs, one for each partition,
> chewing up 12 pci slots, when the pSeries is already capable of doing
> watchdog functions? To add insult to injury, the sysadmin now needs
> to duct-tape each of the watchdog cards to some sort of kill-switch,
> to reboot a dead partition. The kill-switch needs to then ssh to
> the fsp or the hmc to start the reboot. So it gets pretty byzantine
> for something that could have been 'simple' and built-in. Never mind
> that the reliability goes down: the kill switch could fail, the
> pci watchdog card could fail (or get EEH'ed out), causing a reboot
> when no reboot was necessary, etc.
Linas is right about the cost and complexity of the monitoring cards and
the whole system. In addition, if we're trying to see pSeries as a premium
highly-reliable system better than the competition, it just doesn't send
the right message if you tell a customer that this is what they have to do.
It looks really Rube Goldberg-ish (to say the least).
In addition, from a technical perspective, there is a basic principle in HA
systems which is being ignored here...
A sick system cannot reliably monitor itself.
If you're relying on a system which you believe to be sick to monitor
itself, it will be unable to do this reliably under all circumstances -
it's sick, and therefore not reliable -- by definition. Crazy people may
not think they're insane ;-). The hardware watchdog timer is a 3rd party
monitoring system, and therefore is likely to be reliable when the thing it
is watching is sick - because its sanity is uncorrelated to the failure of
the thing it is watching.
For example, if, due to a programming error in the kernel, you halt or loop
with interrupts disabled, you're screwed with no way out. On mainframes I
think this is called a disabled wait state. Of course, there are more
complex ways to do this, but hopefully one example makes the point.
This is the point of the hierarchy of monitoring I described before. This
is very much standard operating procedure for reliable systems in the
telecom industry (and many others). In fact, such a watchdog timer is a
requirement for Carrier Grade Linux (CGL).
Here is the standard way in which highly available systems are architected
to work -- and it's consistent with 35 years of industry practice in
telephony systems, the formal CGL requirements, and the architecture of the
Linux-HA system.
- The hardware watchdog timer times out when it doesn't get a heartbeat in
  the allotted time. (duhhh!)
- Just before loading the BIOS, the watchdog timer should be set for some
  "reasonable" amount of time (like a few seconds) for the BIOS to load and
  begin executing.
- The BIOS should set the timer for a reasonable time for the bootstrap
  program to load. It must tickle it periodically while waiting for input
  from humans.*
- The bootstrap loader should work much the same way. Before it jumps to
  the OS, it should set the timer for a reasonable amount of time for the
  OS to take over the tickling.*
- When it first comes up, the OS takes over and tickles the watchdog timer.
- When the HA monitoring subsystem comes up, it takes over and tickles the
  watchdog timer.
- As HA-aware processes start up, they tickle individual watchdog timers
  maintained by the HA monitoring subsystem (apphbd). If they die or hang,
  they are restarted by the Recovery Manager. As a special case, apphbd
  will restart the recovery manager as described below.
- The recovery manager registers with the HA monitoring subsystem and
  receives notification of insane or dead processes. If they're insane, it
  kills them. When they die, it restarts them.
- If the recovery manager dies (or goes insane), then apphbd will (kill
  and) restart the recovery manager.**
- When the system panics, the watchdog timer needs to be tickled while
  waiting for human input, and while making progress taking a dump [but
  only when actually making progress].
- When the OS jumps back into the BIOS for any reason, the timer is reset
  to some value suitable for the BIOS to take over and start tickling it
  (~ same as the original value).
Now if the BIOS or OS or bootstrap loader, or dump process craps out and
hangs, or the hard disk can't boot, or a peripheral hangs the bus, then
this watchdog timer will trigger, and the system will be reset - and you'll
get a chance to try it again. [[If you fail too often in too short a
period of time, then "phone home" or cry "uncle" or sit and cry if you
like. Or, you can just keep persisting...]]
Later on, when the HA monitoring system is running, if it (or the scheduler
or another piece of the OS) craps out and the HA monitoring system doesn't
(or isn't able to) tickle this watchdog timer - for whatever reason - then
everything will reboot just like it should.
Notice how many different kinds of errors this one single timer can detect
and recover from - and how many of them cannot easily be recovered from at
all without it. Note how handy it is in designing the system to know that
your underlying hardware has this capability built-in. It eliminates a lot
of complexity from several pieces of software, and does a better job too!
Without this timer, you can't easily design a truly reliable system (and
maybe can't design one at all).
<pedantic-mode>
The lowest level monitor should be the simplest and most reliable. It
monitors the OS. The driver for this in the kernel should also be solid
and no-frills. The base-level HA monitoring system (which monitors
processes for their health) should also be as simple as possible.
Complexity is the enemy of reliability. If any of these components fail,
then the system will be rebooted unnecessarily. This is a BadThing(TM).
Now, to use this "right", any subsystem tickling the timer for the next
higher level should periodically schedule something to evaluate its
internal sanity (data-structure consistency, queue lengths, or whatever),
and tickle the watchdog timer only when that internal sanity check passes.
Then, if you go into an infinite loop, or doubt your own sanity long
enough, someone else will eventually do something about it - you'll be
killed and restarted (if a process) -- or rebooted (if you're the HA
process monitor, or the BIOS, or bootstrap loader or OS).
Of course, this doesn't *replace* external monitoring (see the note above
about declaring oneself sick), but it is a good orthogonal measure, and
simpler to implement for subsystems with limited external interfaces - like
the bootstrap loader.
</pedantic-mode>
* = Note that these layers may have to deal with bootstrap loaders and/or
OSes which won't tickle the watchdog timer - so they have to shut it off
(or set it really long) when booting a layer under them which isn't
watchdog-aware.
** = The reason why the recovery manager is not part of the apphbd process
in our design is that the apphbd process should be as simple as it can
be - because its death or insanity would trigger a system restart.
Putting the recovery manager in a separate process lessens the likelihood
of an unnecessary system restart. This is not a necessity, but I believe
it to be a good design choice - after all, it was my design choice ;-)
It is certainly true that we don't have to implement all these things
today, or at all, but with the hardware watchdog timer, they're possible.
And, without it, they're not.
Even without implementing all these extra HA features, the hardware
watchdog still monitors the OS more reliably than the OS can monitor
itself. So, I think this is a very worthwhile feature for the platform to
have.
Hope this helps!
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce