eeh bug

Linas Vepstas linas at austin.ibm.com
Fri May 18 02:44:38 EST 2007


On Thu, May 17, 2007 at 02:59:06PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2007-05-17 at 14:46 +1000, Benjamin Herrenschmidt wrote:
> > 
> > When an RTAS PCI config space call returns all f's, we do an eeh error
> > check by calling eeh_dn_check_failure(pdn->node, NULL);
> > 
> > The problem is that second argument... NULL for the pci_dev *. It looks
> > like the EEH code will try to printk pci_name of that and later on
> > dereference it within eehd, thus causing an oops.
> 
> Ok, so I just added a
> 
> 	if (dev == NULL)
> 		dev = pdn->pcidev;
> 
> to eeh_dn_check_failure(), and that fixes one of the NULL dereferences
> (the name printing), but I get another one a bit later, in
> pci_find_capability called from eeh_slot_error_detail called from
> handle_eeh_events. (Probably in gather_pci_data).

OK, clearly I have been sloppy. The initial EEH design used pci_dev
for everything; as time went on, I realized that the device node
made a better fit for what needed to be manipulated. So the code
migrated in that direction, but not consistently; it tried to
keep allegiance to both ways of identifying a slot.
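
The fallback Ben describes could be sketched roughly as below. This is
a standalone illustration, not the kernel code: the structs are cut-down
stand-ins for the real pci_dev, device_node and pci_dn, and
eeh_resolve_dev() and eeh_safe_name() are hypothetical helper names
made up for the example.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; NOT the real definitions. */
struct pci_dev { const char *name; };
struct device_node { const char *full_name; };
struct pci_dn {
	struct device_node *node;
	struct pci_dev *pcidev;	/* NULL for a slot with nothing plugged in */
};

/* Hypothetical helper: use the caller's pci_dev if there is one,
 * otherwise fall back to the one cached in the pci_dn (may be NULL). */
static struct pci_dev *eeh_resolve_dev(struct pci_dn *pdn, struct pci_dev *dev)
{
	if (dev == NULL && pdn != NULL)
		dev = pdn->pcidev;
	return dev;
}

/* Hypothetical helper: a printable name that never dereferences NULL.
 * Prefers the pci_dev name, then the device node path. */
static const char *eeh_safe_name(struct pci_dn *pdn, struct pci_dev *dev)
{
	dev = eeh_resolve_dev(pdn, dev);
	if (dev != NULL && dev->name != NULL)
		return dev->name;
	if (pdn != NULL && pdn->node != NULL && pdn->node->full_name != NULL)
		return pdn->node->full_name;
	return "<unknown>";
}
```

The point is only that every consumer has to cope with the fallback
itself coming back NULL, since an empty slot has a device node but no
pci_dev.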

> One thing that looks suspicious is that just before that I see:
> 
> EEH: of node=/pci/@8000000200000d3/pci@2,4
> 
> Which is not a device but the bridge above it... 

That's the "partition endpoint", which is what the firmware wants. 
There's some ambiguity, as older systems with EADS and newer
direct-attached P5IOC slots have different relationships between
the "partition endpoint", the device, the slot, the bridge and 
PHB; which of these are equivalent and which are subordinate
can be confusing.

> we should probably not use
> pci_find_capability in that code anyway and implement our own version
> using RTAS in case we don't have a pci_dev around, don't you think ?

I'll take a look. Usually, the only time there's no pci_dev is when
it's a slot with no device plugged into it; these can still receive
EEH errors during config space I/O to the bridge (I presume the
justification is that aluminum scrap could short out a PCI connector,
or something like that). In all other cases there's a pci_dev, which
is why the bug slipped by.
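
For reference, a capability lookup that doesn't need a pci_dev might
look something like the sketch below. It walks the standard PCI
capability list through a byte-read callback. The offsets and the
status bit are the standard PCI ones; cfg_read8_fn, find_capability()
and mock_read8() are hypothetical names for this example. A real
version would back the callback with the RTAS config read keyed off
the device node; here it just reads a byte array so the walk can be
tried standalone.

```c
#include <stdint.h>

/* Standard PCI config-space offsets and bits. */
#define CFG_STATUS		0x06	/* low byte of the status register */
#define CFG_STATUS_CAP_LIST	0x10	/* capability list present */
#define CFG_CAP_PTR		0x34	/* offset of first capability */
#define CAP_WALK_TTL		48	/* guard against looping lists */

/* Hypothetical accessor: read one byte of config space, 0 on success. */
typedef int (*cfg_read8_fn)(void *ctx, int offset, uint8_t *val);

/* Return the config-space offset of capability cap_id, or 0 if absent. */
static int find_capability(cfg_read8_fn rd, void *ctx, uint8_t cap_id)
{
	uint8_t status, pos, id;
	int ttl = CAP_WALK_TTL;

	if (rd(ctx, CFG_STATUS, &status) != 0 || !(status & CFG_STATUS_CAP_LIST))
		return 0;
	if (rd(ctx, CFG_CAP_PTR, &pos) != 0)
		return 0;

	while (ttl-- > 0 && pos >= 0x40) {
		pos &= 0xFC;			/* entries are dword aligned */
		if (rd(ctx, pos, &id) != 0 || id == 0xff)
			break;			/* read failed or all f's */
		if (id == cap_id)
			return pos;
		if (rd(ctx, pos + 1, &pos) != 0)	/* next pointer */
			break;
	}
	return 0;
}

/* Mock accessor for testing: ctx points at a 256-byte config image. */
static int mock_read8(void *ctx, int offset, uint8_t *val)
{
	const uint8_t *cfg = ctx;

	if (offset < 0 || offset > 255)
		return -1;
	*val = cfg[offset];
	return 0;
}
```

Stopping on an id of 0xff matters here, since an all-f's read is
exactly what a frozen slot hands back.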

--linas



