[PATCH linux dev-4.13 2/4] fsi/occ: Add Retries on checksum errors
Benjamin Herrenschmidt
benh at kernel.crashing.org
Tue May 22 00:48:30 AEST 2018
On Mon, 2018-05-21 at 14:56 +0930, Andrew Jeffery wrote:
> On Fri, 18 May 2018, at 11:04, Benjamin Herrenschmidt wrote:
> > Similarily to the new retry on SBE fifo errors, this adds
> > retries if the data we obtain from the OCC has a bad checksum.
> >
> > Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>
> > ---
> > drivers/fsi/fsi-occ.c | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/fsi/fsi-occ.c b/drivers/fsi/fsi-occ.c
> > index f4b2df7a3084..7a5afa78fb6b 100644
> > --- a/drivers/fsi/fsi-occ.c
> > +++ b/drivers/fsi/fsi-occ.c
> > @@ -652,7 +652,7 @@ static void occ_worker(struct work_struct *work)
> > struct occ_client *client;
> > struct occ *occ = container_of(work, struct occ, work);
> > struct device *sbefifo = occ->sbefifo;
> > -
> > + int retries = 0;
> > again:
> > if (occ->cancel)
> > return;
> > @@ -720,7 +720,10 @@ static void occ_worker(struct work_struct *work)
> > xfr->resp_data_length = resp_data_length + 7;
> >
> > rc = occ_verify_checksum(resp, resp_data_length);
> > -
> > + if (rc) {
> > + if (retries++ < OCC_COMMAND_RETRIES)
> > + goto again;
> > + }
>
> How should this interact with the OCC error handling mentioned in my
> reply on the previous patch? I feel like a checksum error is a bit of
> a grey area - probably caused by the transport, but possibly due to
> OCC firmware bugs as well?
Would it hurt to retry in any case ?
> If it's the former then retrying independent of the OCC error
> handling protocol is probably okay, but if we're trying to catch the
> latter then maybe we should let this be handled as part of the OCC
> error handling code?
>
> Eddie?
>
> Ben: Did you actually hit cases where this path was triggered? There
> was the corruption issue with simultaneous LPC cycles that turned out
> to be issues around level-shifters and synchronisers, was that it?
Yes, and I had cases where the CRC4 didn't "catch" the errors. The
retry fixed it. Now with the FSI layer being much more reliable, it
might be that all that retry stuff I added is no longer necessary, so I
won't be fighting for it, though I did find the upper layer error
handling to be somewhat lacking in efficacy...
I plan to do a deep dive on the rest of the OCC driver this week
regardless. I don't like a few things about it, such as the 2 layers
between fsi-occ and sbe_p9, that should be just one (sadly this change
will break the userspace binding code...).
I'll see if I can figure out how that error hanlding works.
Cheers,
Ben.
> > done:
> > mutex_unlock(&occ->occ_lock);
> >
> > @@ -732,6 +735,7 @@ static void occ_worker(struct work_struct *work)
> > clear_bit(XFR_IN_PROGRESS, &xfr->flags);
> > list_del(&xfr->link);
> > empty = list_empty(&occ->xfrs);
> > + retries = 0;
> >
> > spin_unlock_irqrestore(&occ->list_lock, flags);
> >
> > --
> > 2.17.0
> >
More information about the openbmc
mailing list