[Skiboot] FSP code calling pollers with locks held

Benjamin Herrenschmidt benh at kernel.crashing.org
Wed Feb 18 20:49:07 AEDT 2015


On Wed, 2015-02-18 at 17:46 +1100, Stewart Smith wrote:
> Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:
> > So we can start fixing those cases
> >
> > Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>
> > ---
> >  core/opal.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/core/opal.c b/core/opal.c
> > index fc18db3..a49d022 100644
> > --- a/core/opal.c
> > +++ b/core/opal.c
> > @@ -284,6 +284,11 @@ void opal_run_pollers(void)
> >  {
> >  	struct opal_poll_entry *poll_ent;
> >  
> > +	if (this_cpu()->lock_depth) {
> > +		prlog(PR_ERR, "Running pollers with lock held !\n");
> > +		backtrace();
> > +	}
> > +
> 
> Merged and then added a limit, because otherwise we could get stuck in a
> loop printing backtraces for approximately forever on some FSP machines.
> 
> Basically, FSP pretty heavily calls pollers with locks held.
> 
>  S: 0000000031a03b40 R: 0000000030017eb4   .opal_run_pollers+0x44.
>  S: 0000000031a03bc0 R: 00000000300449ec   .fsp_sync_msg+0x7c.
>  S: 0000000031a03c50 R: 000000003004f6e0   .op_display+0x110.
>  S: 0000000031a03d00 R: 000000003004797c   .fsp_console_preinit+0xe4.
>  S: 0000000031a03d80 R: 000000003005650c   .ibm_fsp_init+0xac.
> 
> I've fixed this one, but we're going to have to chase these a bit.
> 
> I like the warning though, it shows us places where we're likely to have
> bugs... and we should certainly fix them.
> 
> Interestingly enough, I modified the above code path back in January due
> to a bug hit when it was called from a poller.
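A rate limit like the one Stewart describes could look roughly like the
sketch below. The cap value and the names (POLLER_WARN_LIMIT,
warn_pollers_locked) are made up for illustration and are not the actual
skiboot change:

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative sketch only: caps how many times the "pollers with lock
 * held" warning fires, so a hot path calling pollers in a loop cannot
 * spam backtraces forever. */
#define POLLER_WARN_LIMIT 4

static int poller_warn_count;

/* Returns 1 if a warning was emitted, 0 otherwise. */
static int warn_pollers_locked(int lock_depth)
{
	if (!lock_depth || poller_warn_count >= POLLER_WARN_LIMIT)
		return 0;
	poller_warn_count++;
	fprintf(stderr, "Running pollers with lock held !\n");
	/* backtrace() would go here */
	return 1;
}
```

After POLLER_WARN_LIMIT warnings, further unsafe calls stay silent, which
is the behavior the limit is meant to provide.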

Yes, I wouldn't recommend merging my patch the day before a release, but
I figured that if we really want that mess sorted, putting that annoying
warning in will provide motivation :-)

I want to completely kill calling pollers with locks held, for various
reasons. We've had a number of re-entrancy bugs with pollers already;
they are a fragile construct, and opal_run_pollers should essentially
never be called outside of fairly top-of-stack situations, and almost
never after boot (apart from opal_poll_events/opal_handle_interrupts of
course).
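To make the constraint concrete, here is a minimal self-contained model
of the guard from the patch above. A plain counter stands in for
this_cpu()->lock_depth, and lock()/unlock() stand in for skiboot's real
lock primitives; the names are illustrative only:

```c
#include <assert.h>
#include <stdio.h>

/* Minimal model of the patch above: lock_depth stands in for
 * this_cpu()->lock_depth, maintained here by toy lock()/unlock()
 * helpers rather than skiboot's real primitives. */
static int lock_depth;

static void lock(void)   { lock_depth++; }
static void unlock(void) { lock_depth--; }

/* Returns 1 if pollers were invoked while a lock was held
 * (the real code warns, backtraces, and continues). */
static int run_pollers(void)
{
	if (lock_depth) {
		fprintf(stderr, "Running pollers with lock held !\n");
		return 1;
	}
	/* ... walk the registered poller list here ... */
	return 0;
}
```

The point of the guard is exactly this: any call reaching run_pollers()
with a non-zero lock depth is flagged as a latent re-entrancy bug.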

When I introduce the worker mechanism (a kind of thread switching), the
context switch will only happen if no lock is held. I don't intend to
implement the equivalent of Linux's sleeping mutexes, for multiple
reasons, in part to force us to use less "lazy" constructs and
essentially write more robust code.
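The rule for that (not yet written) worker mechanism could be as simple
as the sketch below; maybe_switch_worker() and its globals are
hypothetical names for illustration, not a real skiboot API:

```c
#include <assert.h>

/* Hypothetical sketch of the rule described above: a cooperative
 * switch point that simply refuses to run while any lock is held. */
static int lock_depth;	/* stand-in for this_cpu()->lock_depth */
static int switches;	/* counts successful context switches */

static void maybe_switch_worker(void)
{
	if (lock_depth)
		return;		/* never context-switch with a lock held */
	switches++;		/* ... pick and resume another worker here ... */
}
```

Because the switch is refused rather than blocked on, there is no need
for a sleeping-mutex equivalent: code that holds a lock simply keeps
running until it drops it.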

Cheers,
Ben.
