[Skiboot] FSP code calling pollers with locks held

Benjamin Herrenschmidt benh at kernel.crashing.org
Wed Feb 18 20:49:07 AEDT 2015


On Wed, 2015-02-18 at 17:46 +1100, Stewart Smith wrote:
> Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:
> > So we can start fixing those cases
> >
> > Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>
> > ---
> >  core/opal.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/core/opal.c b/core/opal.c
> > index fc18db3..a49d022 100644
> > --- a/core/opal.c
> > +++ b/core/opal.c
> > @@ -284,6 +284,11 @@ void opal_run_pollers(void)
> >  {
> >  	struct opal_poll_entry *poll_ent;
> >  
> > +	if (this_cpu()->lock_depth) {
> > +		prlog(PR_ERR, "Running pollers with lock held !\n");
> > +		backtrace();
> > +	}
> > +
> 
> Merged and then added a limit, because otherwise we could get stuck in a
> loop printing backtraces for approximately forever on some FSP machines.
> 
> Basically, FSP pretty heavily calls pollers with locks held.
> 
>  S: 0000000031a03b40 R: 0000000030017eb4   .opal_run_pollers+0x44.
>  S: 0000000031a03bc0 R: 00000000300449ec   .fsp_sync_msg+0x7c.
>  S: 0000000031a03c50 R: 000000003004f6e0   .op_display+0x110.
>  S: 0000000031a03d00 R: 000000003004797c   .fsp_console_preinit+0xe4.
>  S: 0000000031a03d80 R: 000000003005650c   .ibm_fsp_init+0xac.
> 
> I've fixed this one, but we're going to have to chase these a bit.
> 
> I like the warning though, it shows us places where we're likely to have
> bugs... and we should certainly fix them.
> 
> Interestingly enough, I modified the above code path back in January due
> to a bug hit when it was called from a poller.
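A rate limit like the one Stewart describes could look roughly like the
sketch below. The cap value and the names (POLLER_WARN_LIMIT,
warn_pollers_locked) are made up for illustration and are not the actual
skiboot change:

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative sketch only: caps how many times the "pollers with lock
 * held" warning fires, so a hot path calling pollers in a loop cannot
 * spam backtraces forever. */
#define POLLER_WARN_LIMIT 4

static int poller_warn_count;

/* Returns 1 if a warning was emitted, 0 otherwise. */
static int warn_pollers_locked(int lock_depth)
{
	if (!lock_depth || poller_warn_count >= POLLER_WARN_LIMIT)
		return 0;
	poller_warn_count++;
	fprintf(stderr, "Running pollers with lock held !\n");
	/* backtrace() would go here */
	return 1;
}
```

After POLLER_WARN_LIMIT warnings, further unsafe calls stay silent, which
is the behavior the limit is meant to provide.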

Yes, I wouldn't recommend merging my patch the day before a release, but
I figured that if we really want that mess sorted, putting that annoying
warning in will provide motivation :-)

I want to completely kill calling pollers with locks held, for various
reasons. We've had a number of re-entrancy bugs with pollers already;
they are a fragile construct, and opal_run_pollers should essentially
never be called outside of fairly top-of-stack situations, and almost
never after boot (apart from opal_poll_events/opal_handle_interrupts of
course).
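To make the constraint concrete, here is a minimal self-contained model
of the guard from the patch above. A plain counter stands in for
this_cpu()->lock_depth, and lock()/unlock() stand in for skiboot's real
lock primitives; the names are illustrative only:

```c
#include <assert.h>
#include <stdio.h>

/* Minimal model of the patch above: lock_depth stands in for
 * this_cpu()->lock_depth, maintained here by toy lock()/unlock()
 * helpers rather than skiboot's real primitives. */
static int lock_depth;

static void lock(void)   { lock_depth++; }
static void unlock(void) { lock_depth--; }

/* Returns 1 if pollers were invoked while a lock was held
 * (the real code warns, backtraces, and continues). */
static int run_pollers(void)
{
	if (lock_depth) {
		fprintf(stderr, "Running pollers with lock held !\n");
		return 1;
	}
	/* ... walk the registered poller list here ... */
	return 0;
}
```

The point of the guard is exactly this: any call reaching run_pollers()
with a non-zero lock depth is flagged as a latent re-entrancy bug.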

When I introduce the worker mechanism (a kind of thread switching), the
context switch will only happen if no lock is held. I don't intend to
implement the equivalent of Linux's sleeping mutexes, for multiple
reasons, in part to force us to use less "lazy" constructs and
essentially write more robust code.
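The rule for that (not yet written) worker mechanism could be as simple
as the sketch below; maybe_switch_worker() and its globals are
hypothetical names for illustration, not a real skiboot API:

```c
#include <assert.h>

/* Hypothetical sketch of the rule described above: a cooperative
 * switch point that simply refuses to run while any lock is held. */
static int lock_depth;	/* stand-in for this_cpu()->lock_depth */
static int switches;	/* counts successful context switches */

static void maybe_switch_worker(void)
{
	if (lock_depth)
		return;		/* never context-switch with a lock held */
	switches++;		/* ... pick and resume another worker here ... */
}
```

Because the switch is refused rather than blocked on, there is no need
for a sleeping-mutex equivalent: code that holds a lock simply keeps
running until it drops it.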

Cheers,
Ben.
