[Powerpc / eHEA] Circular dependency with 2.6.29-rc6

Thu Feb 26 05:24:09 EST 2009

On Wed, 2009-02-25 at 18:07 +0100, Jan-Bernd Themann wrote:
> Hi,
> 
> yes, sorry for the funny wrapping... and thanks for your quick answer!
> 
> Peter Zijlstra wrote:
> > On Wed, 2009-02-25 at 16:05 +0100, Jan-Bernd Themann wrote:
> >
> >   
> >> - When "open" is called for a registered network device, port->port_lock
> >> is taken first,
> >>   then ehea_fw_handles.lock
> >> - When "open" is left these locks are released in a proper way (inverse
> >> order)
> >>     
> >
> > So this has:
> >
> >   port->port_lock
> >     ehea_fw_handles.lock
> >
> > This would be the case that is generating the warning.
> >
> >   
> >> - In addition: ehea_fw_handles.lock is held by the function
> >> "driver_probe_device"
> >>   that registers all available network devices (register_netdev)
> >> - When multiple network devices are registered, it is possible that
> >> "open" is
> >>   called on an already registered network device while further
> >> netdevices are still registered
> >>   in "driver_probe_device". ---> "open" will take port->port_lock, but
> >> won't get ehea_fw_handles.lock
> >>     
> >
> > Right, so here you have 
> >
> >   ehea_fw_handles.lock
> >     port->port_lock
> >
> > Overlay these two cases and you have AB-BA deadlocks.
> >
> >   
> The thing here is that I did not see that "open" is called from this
> "probe" function. This probably happens indirectly, as each new device
> causes a notifier chain to be called --> if I got it right, a userspace
> tool then triggers the "open". In that case the open would run in another
> task/thread, and thus when the kernel preempts that task/thread the probe
> function would continue and free the lock.
> 
> Let's assume that it is actually possible that "open" is called in the
> same context as "probe": wouldn't that mean that we actually need to hit
> a deadlock? (probe holds the lock all the time.) We have never observed a
> deadlock so far.

That's the brilliant bit about lockdep: it can observe potential
deadlocks without ever hitting them :-)

> Is there a way to find out if all these locks are actually taken in the
> same context (kthread, tasklet...)?

They don't need to happen in the same context. Suppose a kthread (1)
does the probe and some user task (2) does the open:

    1 - probe                    2 - open

lock(ehea_fw_handles.lock)

			lock(port->port_lock)

lock(port->port_lock) <-- waiting for 2

			lock(ehea_fw_handles.lock) <-- waiting for 1

Which is the classic AB-BA deadlock scenario.

Hitting it will be very unlikely, as this probe thing is a very rare
event, but that doesn't mean it cannot happen.
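
To make the interleaving above concrete, here is a small user-space
sketch. It is a toy model only, not kernel code: the owner table, the
task/lock enums and the model_*() / replay_*() helpers are all made up
for illustration. It replays the four steps from the diagram and checks
whether each task ends up blocked on a lock the other one owns.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: each lock just records its owner. */
enum { FW_HANDLES_LOCK, PORT_LOCK, NR_LOCKS };
enum { NOBODY = 0, PROBE_TASK = 1, OPEN_TASK = 2, NR_TASKS = 3 };

static int owner[NR_LOCKS];        /* task holding each lock, or NOBODY */
static int waiting_for[NR_TASKS];  /* lock a task is blocked on, or -1 */

static void reset_model(void)
{
    for (int l = 0; l < NR_LOCKS; l++)
        owner[l] = NOBODY;
    for (int t = 0; t < NR_TASKS; t++)
        waiting_for[t] = -1;
}

/* Try to take a lock; on contention the task is recorded as blocked. */
static bool model_lock(int task, int lock)
{
    assert(lock >= 0 && lock < NR_LOCKS);
    if (owner[lock] == NOBODY) {
        owner[lock] = task;
        return true;
    }
    waiting_for[task] = lock;
    return false;
}

/* Deadlock: both tasks are blocked on a lock the other one owns. */
static bool model_deadlocked(void)
{
    int lp = waiting_for[PROBE_TASK];
    int lo = waiting_for[OPEN_TASK];

    return lp >= 0 && lo >= 0 &&
           owner[lp] == OPEN_TASK && owner[lo] == PROBE_TASK;
}

/* Replay the interleaving from the diagram: probe A->B, open B->A. */
static bool replay_ab_ba(void)
{
    reset_model();
    model_lock(PROBE_TASK, FW_HANDLES_LOCK);
    model_lock(OPEN_TASK, PORT_LOCK);
    model_lock(PROBE_TASK, PORT_LOCK);       /* blocks, waiting for 2 */
    model_lock(OPEN_TASK, FW_HANDLES_LOCK);  /* blocks, waiting for 1 */
    return model_deadlocked();
}

/* Same locks, but both tasks take fw_handles first, then port_lock. */
static bool replay_consistent_order(void)
{
    reset_model();
    model_lock(PROBE_TASK, FW_HANDLES_LOCK);
    model_lock(OPEN_TASK, FW_HANDLES_LOCK);  /* blocks, but holds nothing */
    model_lock(PROBE_TASK, PORT_LOCK);       /* succeeds, probe can finish */
    return model_deadlocked();
}
```

In this model replay_ab_ba() reports a deadlock while
replay_consistent_order() does not: with a single agreed order, the
blocked task holds nothing the other one needs.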

Now, if you can guarantee that the port object locked by probe and the
one locked by open are _never_ the same one, then we can say this is a
false positive and work on teaching lockdep about that.

> >> - However, ehea_fw_handles.lock is freed once all netdevices are registered.
> >> - When the second netdevice is registered in "driver_probe_device", it
> >> will also try to get
> >>   the port->port_lock (which in fact is a different one, as there is one
> >> per netdevice).
> >> - Does the mutex debug mechanism distinguish between the different
> >> port->port_lock instances?
> >>     
> >
> > Not unless you tell it to.
> >   
> > Are you really sure the port->port_lock instances in this AB-BA
> > scenario are never the same? The above explanation didn't convince me
> > (also very hard to read due to funny wrapping).
> >   
> I'm not sure, especially as I just ran the same test with just one port
> and we still get the warning. But having two instances of port accessing
> the locks does not look like a problem to me, as they allocate and free
> the locks properly (right order).

The initial probe will establish the A->B order; the subsequent open
will attempt B->A, at which point lockdep will warn.
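
For illustration, the idea of recording lock order and warning on an
inversion can be sketched in user space roughly like this. This is a
deliberately simplified toy, not how lockdep is actually implemented:
the class enum, struct toy_task and the toy_*() helpers are invented
names, and every port->port_lock instance is assumed to fall into one
class (which is what you get unless you annotate otherwise).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock-order tracker keyed by lock class; NOT lockdep's real code. */
enum { CLASS_FW_HANDLES, CLASS_PORT_LOCK, NR_CLASSES };
#define MAX_HELD 8

struct toy_task {
    int held[MAX_HELD];  /* classes currently held, in acquisition order */
    int depth;
};

/* seen_before[a][b]: some task once took class b while holding class a. */
static bool seen_before[NR_CLASSES][NR_CLASSES];

/* Is there a recorded dependency chain from class 'from' to class 'to'? */
static bool reaches(int from, int to, bool *visited)
{
    if (from == to)
        return true;
    visited[from] = true;
    for (int next = 0; next < NR_CLASSES; next++)
        if (seen_before[from][next] && !visited[next] &&
            reaches(next, to, visited))
            return true;
    return false;
}

/* Record an acquisition. Returns false if it inverts a known order. */
static bool toy_acquire(struct toy_task *t, int cls)
{
    bool ok = true;

    assert(t->depth < MAX_HELD);
    for (int i = 0; i < t->depth; i++) {
        bool visited[NR_CLASSES] = { false };

        /* Edge held[i] -> cls closes a cycle if cls already reaches held[i]. */
        if (reaches(cls, t->held[i], visited))
            ok = false;
        else
            seen_before[t->held[i]][cls] = true;
    }
    t->held[t->depth++] = cls;
    return ok;
}

static void toy_release_all(struct toy_task *t)
{
    t->depth = 0;
}

/* Probe runs to completion, then open runs: no contention at all. */
static bool probe_then_open_is_clean(void)
{
    struct toy_task probe = { .depth = 0 }, open_task = { .depth = 0 };
    bool ok = true;

    ok &= toy_acquire(&probe, CLASS_FW_HANDLES);
    ok &= toy_acquire(&probe, CLASS_PORT_LOCK);      /* records A -> B */
    toy_release_all(&probe);

    ok &= toy_acquire(&open_task, CLASS_PORT_LOCK);
    ok &= toy_acquire(&open_task, CLASS_FW_HANDLES); /* B -> A: inversion */
    toy_release_all(&open_task);
    return ok;
}
```

Note that probe_then_open_is_clean() returns false even though the two
sequences never overlap in time and nothing ever blocks. Because the
tracker only sees classes, it would flag the same thing with one port
or with many.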