[PATCH] Hold reference to device_node during EEH event handling

Michael Ellerman michael at ellerman.id.au
Fri Jul 17 10:36:13 EST 2009


On Thu, 2009-07-16 at 09:33 -0700, Mike Mason wrote:
> Michael Ellerman wrote:
> > On Wed, 2009-07-15 at 14:43 -0700, Mike Mason wrote:
> >> This patch increments the device_node reference counter when an EEH
> >> error occurs and decrements the counter when the event has been
> >> handled.  This is to prevent the device_node from being released until
> >> eeh_event_handler() has had a chance to deal with the event.  We've
> >> seen cases where the device_node is released too soon when an EEH
> >> event occurs during a dlpar remove, causing the event handler to
> >> attempt to access bad memory locations.
> >>
> >> Please review and let me know of any concerns.
> > 
> > Taking a reference sounds sane, but ...
> > 
> >> Signed-off-by: Mike Mason <mmlnx at us.ibm.com> 
> >>
> >> --- a/arch/powerpc/platforms/pseries/eeh_event.c	2008-10-09 15:13:53.000000000 -0700
> >> +++ b/arch/powerpc/platforms/pseries/eeh_event.c	2009-07-14 14:14:00.000000000 -0700
> >> @@ -75,6 +75,14 @@ static int eeh_event_handler(void * dumm
> >>  	if (event == NULL)
> >>  		return 0;
> >>  
> >> +	/* EEH holds a reference to the device_node, so if it
> >> +	 * equals 1 it's no longer valid and the event should
> >> +	 * be ignored */
> >> +	if (atomic_read(&event->dn->kref.refcount) == 1) {
> >> +		of_node_put(event->dn);
> >> +		return 0;
> >> +	}
> > 
> > That's really gross :)
> 
> Agreed.  I'll look for another way to determine if device is gone and
> the event should be ignored.  Suggestions are welcome :-)

Benh and I had a quick chat about it, and we wondered whether what you
should really be doing is taking a reference to the pci device (perhaps
in addition to the device node).

@@ -140,7 +149,7 @@ int eeh_send_failure_event (struct devic
        if (dev)
                pci_dev_get(dev);
 
-       event->dn = dn;
+       event->dn = of_node_get(dn);
        event->dev = dev;

pci devs are refcounted too (see pci_dev_get()), so taking a reference
there would be the "right" thing to do. Otherwise there's no guarantee
the device still exists later, unless there's some other trick in the
EEH code.
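To make the get/put pairing concrete, here is a rough userspace model of
the idea, not kernel code. The names obj_get(), obj_put(),
send_failure_event() and handle_event() are made up for illustration and
only stand in for of_node_get()/pci_dev_get() and friends; the counter is
a plain int because the sketch is single-threaded.

```c
/* Userspace model of "the event holds its own references". */
#include <assert.h>
#include <stdlib.h>

struct obj {
	int refcount;	/* plain int: single-threaded sketch only */
	int *freed;	/* set to 1 when the last reference is dropped */
};

static struct obj *obj_new(int *freed)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->refcount = 1;	/* the creator owns the first reference */
	o->freed = freed;
	*freed = 0;
	return o;
}

/* Take a reference; returns the object so it can be used in an assignment. */
static struct obj *obj_get(struct obj *o)
{
	if (o)
		o->refcount++;
	return o;
}

/* Drop a reference; the object is released when the count reaches zero. */
static void obj_put(struct obj *o)
{
	if (!o)
		return;
	assert(o->refcount > 0);
	if (--o->refcount == 0) {
		*o->freed = 1;
		free(o);
	}
}

struct event {
	struct obj *dn;		/* stands in for the device_node */
	struct obj *dev;	/* stands in for the pci device */
};

/* Queueing side: pin both objects for the lifetime of the event. */
static struct event *send_failure_event(struct obj *dn, struct obj *dev)
{
	struct event *ev = malloc(sizeof(*ev));

	if (!ev)
		return NULL;
	ev->dn = obj_get(dn);
	ev->dev = obj_get(dev);
	return ev;
}

/* Handler side: both pointers are valid here; drop them when done. */
static void handle_event(struct event *ev)
{
	/* ... recovery work would go here ... */
	obj_put(ev->dev);
	obj_put(ev->dn);
	free(ev);
}
```

The point is only that a removal which drops the owner's references
between send_failure_event() and handle_event() can't free either object
out from under the handler, and that the handler never needs to peek at
the raw count.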

Taking a reference would presumably block a concurrent hotunplug until
you'd processed the EEH event and dropped your reference. That might be
OK, or you could add a hotplug notifier to the EEH code and drop the
reference there and mark the event as handled or something.
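A rough userspace sketch of that second option, again not kernel code:
on_device_removed() is a hypothetical stand-in for whatever hotplug
notifier hook would be used, and the stale flag is an invented field. A
real version would need locking between the notifier and the event
handler, which is left out here.

```c
/* Userspace model: a removal callback releases the event's reference early. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_devs;	/* number of devices not yet freed, for inspection */

struct dev {
	int refcount;	/* plain int: single-threaded sketch only */
};

static struct dev *dev_new(void)
{
	struct dev *d = malloc(sizeof(*d));

	if (!d)
		return NULL;
	d->refcount = 1;
	live_devs++;
	return d;
}

static struct dev *dev_get(struct dev *d)
{
	if (d)
		d->refcount++;
	return d;
}

static void dev_put(struct dev *d)
{
	if (!d)
		return;
	assert(d->refcount > 0);
	if (--d->refcount == 0) {
		live_devs--;
		free(d);
	}
}

struct event {
	struct dev *dev;
	bool stale;	/* set once the device has been removed */
};

static struct event *queue_event(struct dev *d)
{
	struct event *ev = malloc(sizeof(*ev));

	if (!ev)
		return NULL;
	ev->dev = dev_get(d);
	ev->stale = false;
	return ev;
}

/* Hypothetical removal hook: drop our reference so removal can finish. */
static void on_device_removed(struct event *ev, struct dev *d)
{
	if (ev->dev != d)
		return;
	dev_put(ev->dev);
	ev->dev = NULL;
	ev->stale = true;
}

/* Returns true if recovery ran, false if the event was skipped as stale. */
static bool handle_event(struct event *ev)
{
	bool handled = !ev->stale;

	if (handled) {
		/* ... recovery work would go here ... */
		dev_put(ev->dev);
	}
	free(ev);
	return handled;
}
```

With this shape the unplug isn't held up by a pending event, and the
handler skips stale events based on an explicit flag.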

All of that with the caveat that I don't really know the EEH or hotplug
code :D

cheers


