Hardware Watchdog Device in pSeries?

Thu Oct 14 08:12:54 EST 2004

Hi,

On Wed, Oct 13, 2004 at 03:30:02PM -0600, Alan Robertson was heard to remark:
> Mike Strosaker wrote:
> >Linas Vepstas wrote:
> >
> >>I might have volunteered to hack this up real quick, were it not for
> >>Mike Strosaker's correction, that the surveillance features were taken
> >>out of Power5.  
> >>Anyone on this list know why?
> >>
> >
> >I sent the reason I got from the hardware RAS folks to this list a while 
> >back.
> >Luckily, it's still in my sent mail folder:
> >
> >"Because of the virtualization layer and partitioning, the surveillance
> >requirement was moved to PHYP<->SP.  Apparently, this was a hotly
> >contested issue among the platform design folks (especially considering
> >that partitioned power4 systems still have OS<->SP surveillance).  I
> >think the logic is: If an OS goes down, it's not likely a server
> >problem, hence no requirement to monitor from the server side.
> >
> >At least the platform gets notified of panics via os-term.  I gather
> >that some user space tools are expected to monitor for deadlocks/hangs
> >(maybe clustering tools)."
> 
> This is about half-right.
> 
> There is one particular circumstance which can ONLY be monitored from a 
> hardware-level monitor.
> 
> OS hangs.

Heh. I think I can clarify, after talking to the firmware folks.

The core thinking behind the "platform architecture" was to make
sure that the underlying hardware, i.e. the "platform", wasn't hung.
They were not concerned about the OS itself; they assumed that OSes
have their own independent mechanisms for detecting hung-ness.

From the platform point of view, they are concerned that they'll
have a machine with a dozen different partitions on it (a dozen
different OSes), and a hardware hang will take down all twelve.
So they've got the hypervisor and service processor monitoring
each other, keeping things humming.  If just one partition goes
down due to a kernel hang/crash, well, that's too bad, but it's
not the end of the world from the platform point of view.

I think Alan's point of view is from the other side of the table:
why should someone buy 12 PCI-card watchdogs, one for each partition,
chewing up 12 PCI slots, when the pSeries is already capable of doing
watchdog functions?  To add insult to injury, the sysadmin now needs
to duct-tape each of the watchdog cards to some sort of kill-switch,
to reboot a dead partition.  The kill-switch then needs to ssh to
the FSP or the HMC to start the reboot.  So it gets pretty byzantine
for something that could have been 'simple' and built-in.  Never mind
that the reliability goes down:  the kill-switch could fail, the
PCI watchdog card could fail (or get EEH'ed out), causing a reboot
when no reboot was necessary, etc.

--linas