eeh bug
Linas Vepstas
linas at austin.ibm.com
Fri May 18 02:44:38 EST 2007
On Thu, May 17, 2007 at 02:59:06PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2007-05-17 at 14:46 +1000, Benjamin Herrenschmidt wrote:
> >
> > When an RTAS PCI config space call returns all f's, we do an eeh error
> > check by calling eeh_dn_check_failure(pdn->node, NULL);
> >
> > The problem is that second argument... NULL for the pci_dev *. It looks
> > like the EEH code will try to printk pci_name of that and later on
> > dereference it within eehd, thus causing an oops.
>
> Ok, so I just added a
>
> 	if (dev == NULL)
> 		dev = pdn->pcidev;
>
> to eeh_dn_check_failure(), and that fixes one of the NULL derefs (name
> printing), but I get another one a bit later, in pci_find_capability
> called from eeh_slot_error_detail called from handle_eeh_events.
> (Probably in gather_pci_data).
OK, clearly I have been sloppy. The initial EEH design used pci_dev
for everything; as time went on, I realized that the device node
was a better fit for what needed to be manipulated. So the code
migrated in that direction, but not unambiguously; it tried to
keep allegiance to both ways of identifying a slot.
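
For illustration only, the fallback quoted above (use pdn->pcidev when
the caller passes a NULL pci_dev) can be sketched as a standalone C
fragment. The structures here are mock stand-ins holding only the fields
used, not the kernel definitions, and the helper names eeh_resolve_dev()
and eeh_dev_name() are hypothetical. Note that the result can still be
NULL for an empty slot, so anything that prints or dereferences it needs
its own guard.

```c
#include <stddef.h>

/* Mock stand-ins for the kernel structures (assumption: only the
 * fields needed for this sketch are modelled). */
struct pci_dev {
	const char *name;
};

struct pci_dn {
	struct pci_dev *pcidev;	/* NULL when nothing is plugged into the slot */
};

/* Hypothetical helper: prefer the caller's pci_dev, otherwise fall
 * back to the one recorded in the pci_dn. May still return NULL. */
static struct pci_dev *eeh_resolve_dev(struct pci_dn *pdn, struct pci_dev *dev)
{
	if (dev == NULL && pdn != NULL)
		dev = pdn->pcidev;
	return dev;
}

/* Hypothetical helper: a name that is safe to print even when no
 * pci_dev exists. */
static const char *eeh_dev_name(const struct pci_dev *dev)
{
	return dev ? dev->name : "<null>";
}
```

The fallback on its own fixes the name printing only when pdn->pcidev is
set; later code paths that take the pci_dev need the same kind of check.
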
> One thing that looks suspicious is that just before that I see:
>
> EEH: of node=/pci/@8000000200000d3/pci@2,4
>
> Which is not a device but the bridge above it...
That's the "partition endpoint", which is what the firmware wants.
There's some ambiguity, as older systems with EADS and newer
direct-attached P5IOC slots have different relationships between
the "partition endpoint", the device, the slot, the bridge and
PHB; which of these are equivalent and which are subordinate
can be confusing.
> we should probably not use
> pci_find_capability in that code anyway and implement our own version
> using RTAS in case we don't have a pci_dev around, don't you think ?
I'll take a look. Usually, the only time there's no pci_dev is when it's
a slot with no device plugged into it; such slots can still receive EEH
errors during config space I/O to the bridge (I presume the justification
is the case where aluminum scrap shorts out a PCI connector or something
like that). In all other cases, there's a pci_dev, which is why the
bug slipped by.
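
As a rough sketch of the suggestion quoted above (finding a capability
without a pci_dev), the capability list can be walked through any
function that reads config space bytes. The callback type cfg_read8_fn
and the names find_capability() and fake_read8() are hypothetical; in
the kernel the reads would presumably go through the RTAS config
accessors keyed by device node, which is not shown here. The register
offsets are the standard PCI ones.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS		0x06	/* low byte of the status register */
#define PCI_STATUS_CAP_LIST	0x10	/* capability list present */
#define PCI_CAPABILITY_LIST	0x34	/* offset of first capability */
#define PCI_CAP_LIST_ID		0	/* capability ID byte */
#define PCI_CAP_LIST_NEXT	1	/* pointer to next capability */
#define MAX_CAP_HOPS		48	/* guard against looping lists */

/* Hypothetical config-space reader: returns 0 on success. */
typedef int (*cfg_read8_fn)(void *ctx, unsigned int off, uint8_t *val);

/* Walk the capability list using only byte reads; returns the offset
 * of the capability, or 0 if it is absent or a read fails. */
static int find_capability(cfg_read8_fn read8, void *ctx, uint8_t cap_id)
{
	uint8_t status, pos, id;
	int hops = MAX_CAP_HOPS;

	if (read8(ctx, PCI_STATUS, &status) != 0)
		return 0;
	if (!(status & PCI_STATUS_CAP_LIST))
		return 0;
	if (read8(ctx, PCI_CAPABILITY_LIST, &pos) != 0)
		return 0;

	while (hops-- > 0 && pos >= 0x40) {
		pos &= 0xFC;	/* capability pointers are dword aligned */
		if (read8(ctx, pos + PCI_CAP_LIST_ID, &id) != 0)
			return 0;
		if (id == 0xFF)	/* all f's: device gone or slot frozen */
			return 0;
		if (id == cap_id)
			return pos;
		if (read8(ctx, pos + PCI_CAP_LIST_NEXT, &pos) != 0)
			return 0;
	}
	return 0;
}

/* Test double: ctx points at a 256-byte fake config space. */
static int fake_read8(void *ctx, unsigned int off, uint8_t *val)
{
	if (off >= 256)
		return -1;
	*val = ((uint8_t *)ctx)[off];
	return 0;
}
```

Because the walker takes a reader callback rather than a pci_dev, the
same loop would serve an empty slot's bridge, provided some reader
exists for it.
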
--linas