Is anyone using Critical Interrupts on PPC440 in 2.4?

Fri Sep 24 03:46:51 EST 2004

Sorry if this has been covered before - I for one could make use of that
old list tarball if it were available.

I am interested in using the interrupts controlled by MSR[CE],
specifically to help debug Watchdog timeout and PCI hang issues and
perhaps ultimately recover from them.  I have been able to achieve some
level of operation, but a quick browse of 2.6.8 makes me very nervous
that I may be re-inventing the wheel: it now includes the code to
properly load a 32-bit MSR_KERNEL setting, but does not appear to
actually use either of the MSR[CE]-controlled vectors (Watchdog,
Critical Input).
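
For readers wondering why a "32-bit MSR_KERNEL" load matters here, the
following is a minimal sketch, not the kernel's actual code.  It assumes
the usual Book E mask values (MSR_CE = 0x00020000, MSR_ME = 0x00001000)
and uses hypothetical helper names.  The point is that once MSR[CE] is
part of the value, it no longer fits the signed 16-bit immediate of a
single "li" and has to be built from a high and a low half, as a
"lis"/"ori" pair would do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed Book E MSR masks; check them against your kernel headers. */
#define MSR_CE 0x00020000u /* critical interrupt enable */
#define MSR_ME 0x00001000u /* machine check enable */

/* "li" sign-extends a 16-bit immediate, so only 0..0x7FFF load unchanged. */
static bool fits_in_li(uint32_t value)
{
    return value <= 0x7FFFu;
}

/* Upper half, as "lis" would place it in the top 16 bits. */
static uint32_t hi16(uint32_t value)
{
    return value >> 16;
}

/* Lower half, as "ori" would OR it in (zero-extended). */
static uint32_t lo16(uint32_t value)
{
    return value & 0xFFFFu;
}

/* Models the result of "lis r,hi ; ori r,r,lo". */
static uint32_t lis_ori(uint32_t hi, uint32_t lo)
{
    return (hi << 16) | lo;
}
```

With these assumed values, fits_in_li(MSR_ME) holds but
fits_in_li(MSR_ME | MSR_CE) does not, while
lis_ori(hi16(v), lo16(v)) reproduces any 32-bit v.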

- Are there patches available that use either Critical Input or
Watchdog for something other than "unknown"?  I would not want to do
anything that conflicts too badly with, for instance, some low-latency
interrupt scheme.

- Has a back-port of the updated 32-bit MSR_KERNEL handling to 2.4 been
done already?  I did this myself before realizing it was already done
in 2.6.

- Is there any version that has normal IRQ handling smart enough to
leave critical interrupts alone?  I had to resort to the rather ugly
trick of registering a dummy handler on IRQ 31 (UIC1 CI cascade) to keep
it from getting disabled.

- Has anyone ever considered a patch that replaces the normal HW
exception handling with an immediate panic?  For an embedded system,
continuing to run after one or more user-space processes have been
terminated does not seem to be optimal behavior.

- Are patches being accepted for 2.4, or is everything 2.6 now?

What I am up to:

Situation:

The board I am working on is subject to random reboots due to Watchdog
timeouts.  Some of these are due to driver bugs (500 ms udelay() calls,
waiting forever for things that never happen, etc.), and others are due
to PCI devices that insist on claiming split transactions and then
waiting several seconds before responding.  These situations are of
course very rare and almost impossible to reproduce, so a JTAG debugger
is out of the question.

Proposed solution:

Install an interrupt handler that captures the register dump and stack
trace whenever a watchdog or PCI error occurs.  The watchdog vector is
controlled by MSR[CE], so there is no choice there, and the PCI errors
could happen during interrupt top halves or with interrupts disabled.
Both situations therefore pretty much require the use of Critical
Exceptions.
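
As a rough illustration of the capture step, here is a minimal
user-space sketch, not kernel code.  The struct layout and the function
names crit_capture() and crit_format() are hypothetical.  It assumes the
handler is handed the interrupted PC and MSR (on Book E these are saved
in CSRR0/CSRR1 for critical interrupts) plus LR and the stack pointer,
and it stores them in a static record so that nothing has to be
allocated or locked in the critical path.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of the interrupted context. */
struct crit_snapshot {
    uint32_t pc;  /* interrupted address (CSRR0 on Book E) */
    uint32_t msr; /* interrupted MSR (CSRR1 on Book E) */
    uint32_t lr;  /* link register */
    uint32_t sp;  /* stack pointer (r1) */
};

/* Static storage: no allocation is needed when the error fires. */
static struct crit_snapshot last_snapshot;
static int snapshot_valid;

/* Copy the interrupted state into the static record. */
static void crit_capture(const struct crit_snapshot *regs)
{
    last_snapshot = *regs;
    snapshot_valid = 1;
}

/* Format the record into buf; returns -1 if nothing was captured. */
static int crit_format(char *buf, size_t len)
{
    if (!snapshot_valid)
        return -1;
    return snprintf(buf, len, "PC=%08x MSR=%08x LR=%08x SP=%08x",
                    (unsigned)last_snapshot.pc, (unsigned)last_snapshot.msr,
                    (unsigned)last_snapshot.lr, (unsigned)last_snapshot.sp);
}
```

A real handler would also walk the stack from the saved SP to produce
the trace; that part is omitted here.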

The PCI error capture is not completely redundant, because some of my
boards have a bridge that can be configured to time out and abort the
offending transactions.  Doing this without the handler, however, just
makes my problems harder to debug by allowing the driver to continue
for who knows how long before crashing.

For now I just capture the trace and then panic the kernel (cheap and
easy HA solution) to "recover".  Longer term I may try to develop a
mechanism to allow drivers to register for some sort of notification and
recover more gracefully.
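
To make the longer-term idea concrete, here is a minimal sketch of such
a registration mechanism, written as plain user-space C rather than
against any existing kernel interface.  Every name in it (hang_register,
hang_notify, the HANG_* constants, example_pci_recover) is hypothetical.
Drivers add a callback to a chain; when an error fires, the chain is
walked, and if no callback claims the event the caller falls back to
the current behavior of panicking.

```c
#include <stddef.h>

/* Hypothetical event sources and handler return codes. */
#define HANG_SRC_WATCHDOG 0u
#define HANG_SRC_PCI      1u
#define HANG_UNHANDLED    0
#define HANG_HANDLED      1

typedef int (*hang_handler_fn)(unsigned int source, void *ctx);

struct hang_handler {
    hang_handler_fn fn;
    void *ctx;
    struct hang_handler *next;
};

static struct hang_handler *hang_chain;

/* Push a handler onto the front of the chain. */
static void hang_register(struct hang_handler *h)
{
    h->next = hang_chain;
    hang_chain = h;
}

/* Returns 1 if some handler recovered, 0 if the caller should panic. */
static int hang_notify(unsigned int source)
{
    struct hang_handler *h;

    for (h = hang_chain; h != NULL; h = h->next)
        if (h->fn(source, h->ctx) == HANG_HANDLED)
            return 1;
    return 0;
}

/* Example driver callback: claims PCI errors and counts them in *ctx. */
static int example_pci_recover(unsigned int source, void *ctx)
{
    if (source != HANG_SRC_PCI)
        return HANG_UNHANDLED;
    ++*(int *)ctx;
    return HANG_HANDLED;
}
```

In a real critical-interrupt context the chain would have to be safe to
walk without taking locks, which this sketch does not address.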

David
