[PATCH v2 1/2] powerpc/eeh: fix pseries_eeh_configure_bridge()

Wed Apr 22 13:30:24 AEST 2020

On Tue, Apr 21, 2020 at 06:33:36PM -0500, Nathan Lynch wrote:
> Sam Bobroff <sbobroff at linux.ibm.com> writes:
> > If a device is hot unplugged during EEH recovery, it's possible for the
> > RTAS call to ibm,configure-pe in pseries_eeh_configure_bridge() to
> > return parameter error (-3). Negative return values are not checked
> > for, and this leads to an infinite loop.
> >
> > Fix this by correctly bailing out on negative values.
> >
> > Signed-off-by: Sam Bobroff <sbobroff at linux.ibm.com>
> > ---
> >  arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > index 893ba3f562c4..c4ef03bec0de 100644
> > --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> > +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > @@ -605,7 +605,7 @@ static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
> >  				config_addr, BUID_HI(pe->phb->buid),
> >  				BUID_LO(pe->phb->buid));
> >  
> > -		if (!ret)
> > +		if (ret <= 0)
> >  			return ret;
> 
> Note that this returns the firmware error value (e.g. -3 parameter
> error) without converting it to a Linux errno. Nothing checks the error
> value of this function as best I can tell, but -EINVAL would be better
> than an implicit -ESRCH here.

Right, it's never used but I agree. I'll change it for v3.
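
A minimal sketch of that conversion, written as a standalone userspace
C fragment rather than kernel code. The helper name is hypothetical and
not part of the patch. Mapping -3 to -EINVAL follows the suggestion
above; the -EIO fallback for any other negative status is an assumption
made only for illustration.

```c
#include <errno.h>

/*
 * Hypothetical helper (not from the patch): translate the status
 * returned by ibm,configure-pe into a Linux errno.
 */
static int configure_pe_status_to_errno(int rtas_status)
{
	/* -3 is the firmware "parameter error" discussed in this thread. */
	if (rtas_status == -3)
		return -EINVAL;

	/* Assumption: report any other firmware failure as a generic I/O error. */
	if (rtas_status < 0)
		return -EIO;

	/* 0 (success) and positive delay statuses are passed through unchanged. */
	return rtas_status;
}
```

With something like this the caller never sees -3, which would otherwise
read as -ESRCH to anything that treats the value as an errno.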

> And while this will behave correctly, the pr_warn() at the end of
> pseries_eeh_configure_bridge() hints that someone had the intention
> that this code should log a message on such an error:
> 
> static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
> {
> 	int config_addr;
> 	int ret;
> 	/* Waiting 0.2s maximum before skipping configuration */
> 	int max_wait = 200;
> 
> 	/* Figure out the PE address */
> 	config_addr = pe->config_addr;
> 	if (pe->addr)
> 		config_addr = pe->addr;
> 
> 	while (max_wait > 0) {
> 		ret = rtas_call(ibm_configure_pe, 3, 1, NULL,
> 				config_addr, BUID_HI(pe->phb->buid),
> 				BUID_LO(pe->phb->buid));
> 
> 		if (!ret)
> 			return ret;
> 
> 		/*
> 		 * If RTAS returns a delay value that's above 100ms, cut it
> 		 * down to 100ms in case firmware made a mistake.  For more
> 		 * on how these delay values work see rtas_busy_delay_time
> 		 */
> 		if (ret > RTAS_EXTENDED_DELAY_MIN+2 &&
> 		    ret <= RTAS_EXTENDED_DELAY_MAX)
> 			ret = RTAS_EXTENDED_DELAY_MIN+2;
> 
> 		max_wait -= rtas_busy_delay_time(ret);
> 
> 		if (max_wait < 0)
> 			break;
> 
> 		rtas_busy_delay(ret);
> 	}
> 
> 	pr_warn("%s: Unable to configure bridge PHB#%x-PE#%x (%d)\n",
> 		__func__, pe->phb->global_number, pe->addr, ret);
> 	return ret;
> }
> 
> So perhaps the error path should be made to break out of the loop
> instead of returning. Or is the parameter error result simply
> uninteresting in this scenario?

Sounds reasonable to me, and given that the only way I know to trigger
the error path (see the commit message) is not going to be common, I
think a message is a good idea. (And, as one of the people likely to
debug a future issue here, I'll probably appreciate it.)
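
A rough userspace sketch of the loop shape being discussed, where a
negative status breaks out to the warning instead of returning early.
Everything prefixed stub_ or fake_ is a hypothetical stand-in for the
kernel pieces (rtas_call(), rtas_busy_delay_time(), pr_warn()), the
9900..9905 delay range is assumed to match RTAS_EXTENDED_DELAY_MIN/MAX,
and treating every negative status as -EINVAL is a simplification. This
is not the v3 patch.

```c
#include <errno.h>
#include <stdio.h>

#define EXT_DELAY_MIN 9900	/* assumed value of RTAS_EXTENDED_DELAY_MIN */
#define EXT_DELAY_MAX 9905	/* assumed value of RTAS_EXTENDED_DELAY_MAX */

static int fake_status;		/* status the stubbed firmware call returns */
static int warnings_logged;	/* counts how often the warning path ran */

/* Stand-in for the rtas_call(ibm_configure_pe, ...) invocation. */
static int stub_configure_pe(void)
{
	return fake_status;
}

/* Simplified stand-in for rtas_busy_delay_time(): 10^n ms for 9900+n. */
static int stub_delay_ms(int status)
{
	int ms = 1;
	int order;

	if (status < EXT_DELAY_MIN || status > EXT_DELAY_MAX)
		return 0;	/* not a delay request */
	for (order = status - EXT_DELAY_MIN; order > 0; order--)
		ms *= 10;
	return ms;
}

static int configure_bridge_sketch(void)
{
	int max_wait = 200;	/* same 0.2s budget as the original */
	int delay_ms;
	int ret = 0;

	while (max_wait > 0) {
		ret = stub_configure_pe();
		if (!ret)
			return 0;
		if (ret < 0)
			break;	/* fall through to the warning below */

		/* Cap long delay requests at 100ms, as the original does. */
		if (ret > EXT_DELAY_MIN + 2 && ret <= EXT_DELAY_MAX)
			ret = EXT_DELAY_MIN + 2;

		delay_ms = stub_delay_ms(ret);
		if (delay_ms == 0)
			break;	/* sketch-only guard against spinning on an unknown status */

		max_wait -= delay_ms;
		if (max_wait < 0)
			break;
		/* The kernel code would sleep here via rtas_busy_delay(). */
	}

	warnings_logged++;
	fprintf(stderr, "%s: Unable to configure bridge (%d)\n", __func__, ret);

	/* Negative firmware status becomes an errno; a timeout returns the last status. */
	return ret < 0 ? -EINVAL : ret;
}
```

Setting fake_status to -3 and calling configure_bridge_sketch() reaches
the warning once and returns -EINVAL, rather than looping or returning
silently.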

Cheers,
Sam.