DMA memory

Thu Mar 31 05:26:53 EST 2005

On Wed, Mar 30, 2005 at 12:25:50PM -0500, Nathan Glasser was heard to remark:
> >I presume the truncated RTAS blob is due to some RH 3.0 bug; is there a
> >chance you can try with a newer RH 3, or RH4 or kernel.org, so as to get
> >the detailed report?
> 
> The RH3.0 is a patch version already (2.4.21-20.EL). But no, there's no chance
> of trying anything else.

:( 

> >The RTAS message is supposed to be a good bit longer; among other things
> >it will sometimes contain a raw dump of the pci controller state.  If I 
> >had that, I *might* be able to decode the details of what the pci
> >controller didn't like (including the faulting address, if that's what
> >it is.).  
> 
> I had installed and enabled (apparently temporarily) some error logging thing
> I had been pointed last week, it seemed to have caused some extra info to
> appear in /var/log/platform:

Hmm, even that is very short. On decode, it says:

   Residual error from previous boot
   Date/Time: 20050325 16023600

which is darned uninformative.

There's some other bug that is causing the full error not to be logged.

> I had run some more tests this week which also caused crashes, but this
> is the entire file.

You can avoid crashes by editing "arch/ppc64/kernel/eeh.c" and
commenting out the call to eeh_panic().   That might help you with your
debug efforts.

The philosophy of panicking on error comes from the theory that it's
better to panic than it is to corrupt data.  Think "banking", a
traditional IBM customer segment, to understand the origin of this 
theory.

FWIW, new kernels include code that attempts to pci-reset the device
after fielding one of these pci errors.  A generic all-architecture 
implementation for this is being discussed on LKML now, as the new
PCI-E chipsets do something similar.  Of course, resetting the hardware
because there's a software bug won't help driver developers very much
... 

Even with the full RTAS message, fully decoded, figuring out what went
wrong is hard.  If you can't find the bug easily, I am afraid that
you'll be reduced to staring at PCI bus analyzer traces, which is how
many of these bugs are found ... :(

Assuming that you have full access to the device you're coding for, your
best bet is to add debug code to its firmware, and have it tell you 
where it plans to DMA to; compare that to where you thought it would
be going.
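
For the "compare" step, something like the following might do.  It is
only a rough sketch, not code from any real driver: struct dma_window
and dma_target_ok() are names I made up for illustration, and it
assumes the driver records the bus address and length of each mapping
it hands to the device, and that the firmware can report the address
and length it is about to DMA to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one DMA mapping that the driver set up. */
struct dma_window {
    uint64_t base;  /* bus address handed to the device */
    uint64_t len;   /* length of the mapping, in bytes */
};

/*
 * Return true if the transfer [addr, addr + len) that the firmware
 * reports lies entirely inside the mapping the driver expected.
 */
static bool dma_target_ok(const struct dma_window *w,
                          uint64_t addr, uint64_t len)
{
    if (len == 0 || len > w->len)
        return false;
    if (addr < w->base)
        return false;
    /* Compare offsets rather than end addresses, so that
     * addr + len cannot overflow. */
    return addr - w->base <= w->len - len;
}
```

Any transfer for which this returns false is a candidate for the
access that the pci controller objected to.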

--linas