[PATCH linux dev-4.13 2/4] fsi/occ: Add Retries on checksum errors

Tue May 22 00:48:30 AEST 2018

On Mon, 2018-05-21 at 14:56 +0930, Andrew Jeffery wrote:
> On Fri, 18 May 2018, at 11:04, Benjamin Herrenschmidt wrote:
> > Similarily to the new retry on SBE fifo errors, this adds
> > retries if the data we obtain from the OCC has a bad checksum.
> > 
> > Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>
> > ---
> >  drivers/fsi/fsi-occ.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/fsi/fsi-occ.c b/drivers/fsi/fsi-occ.c
> > index f4b2df7a3084..7a5afa78fb6b 100644
> > --- a/drivers/fsi/fsi-occ.c
> > +++ b/drivers/fsi/fsi-occ.c
> > @@ -652,7 +652,7 @@ static void occ_worker(struct work_struct *work)
> >  	struct occ_client *client;
> >  	struct occ *occ = container_of(work, struct occ, work);
> >  	struct device *sbefifo = occ->sbefifo;
> > -
> > +	int retries = 0;
> >  again:
> >  	if (occ->cancel)
> >  		return;
> > @@ -720,7 +720,10 @@ static void occ_worker(struct work_struct *work)
> >  	xfr->resp_data_length = resp_data_length + 7;
> >  
> >  	rc = occ_verify_checksum(resp, resp_data_length);
> > -
> > +	if (rc) {
> > +		if (retries++ < OCC_COMMAND_RETRIES)
> > +			goto again;
> > +	}
> 
> How should this interact with the OCC error handling mentioned in my
> reply on the previous patch? I feel like a checksum error is a bit of
> a grey area - probably caused by the transport, but possibly due to
> OCC firmware bugs as well?

Would it hurt to retry in any case ?

>  If it's the former then retrying independent of the OCC error
> handling protocol is probably okay, but if we're trying to catch the
> latter then maybe we should let this be handled as part of the OCC
> error handling code?
>
> Eddie?
> 
> Ben: Did you actually hit cases where this path was triggered? There
> was the corruption issue with simultaneous LPC cycles that turned out
> to be issues around level-shifters and synchronisers, was that it?

Yes, and I had cases where the CRC4 didn't "catch" the errors. The
retry fixed it. Now with the FSI layer being much more reliable, it
might be that all that retry stuff I added is no longer necessary, so I
won't be fighting for it, though I did find the upper layer error
handling to be somewhat lacking in efficacy...

I plan to do a deep dive on the rest of the OCC driver this week
regardless. I don't like a few things about it, such as the 2 layers
between fsi-occ and sbe_p9, that should be just one (sadly this change
will break the userspace binding code...).

I'll see if I can figure out how that error hanlding works.

Cheers,
Ben.

> >  done:
> >  	mutex_unlock(&occ->occ_lock);
> >  
> > @@ -732,6 +735,7 @@ static void occ_worker(struct work_struct *work)
> >  	clear_bit(XFR_IN_PROGRESS, &xfr->flags);
> >  	list_del(&xfr->link);
> >  	empty = list_empty(&occ->xfrs);
> > +	retries = 0;
> >  
> >  	spin_unlock_irqrestore(&occ->list_lock, flags);
> >  
> > -- 
> > 2.17.0
> >