[Skiboot] [PATCH] xscom: Don't log xscom errors caused bu OPAL calls

Fri Dec 6 20:26:12 AEDT 2019

* Oliver O'Halloran <oohall at gmail.com> [2019-12-06 13:51:46]:

> On Fri, Dec 6, 2019 at 2:32 AM Gautham R Shenoy <ego at linux.vnet.ibm.com> wrote:
> >
> > Hello Oliver,
> >
> >
> > On Fri, Dec 06, 2019 at 12:14:34AM +1100, Oliver O'Halloran wrote:
> > > The XSCOM read/write OPAL calls are largely there to support running
> > > PRD in the OS. PRD itself handles submitting error logs (if needed)
> > > when a XSCOM operation fails so there's no need to send an error log
> > > from inside of OPAL.
> > >
> > > Cc: Gautham R. Shenoy <ego at linux.vnet.ibm.com>
> > > Signed-off-by: Oliver O'Halloran <oohall at gmail.com>

Reviewed-by: Vaidyanathan Srinivasan <svaidy at linux.ibm.com>

> >
> > This is much better, simpler version than what I posted.
> >
> > Only one question.
> > > ---
> > >  core/opal.c |  3 +++
> > >  hw/xscom.c  | 15 +++++++++++----
> > >  2 files changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/core/opal.c b/core/opal.c
> > > index da746e805a3f..922dc5abddca 100644
> > > --- a/core/opal.c
> > > +++ b/core/opal.c
> > > @@ -189,6 +189,9 @@ int64_t opal_exit_check(int64_t retval, struct stack_frame *eframe)
> > >                     "Spent %llu msecs in OPAL call %llu!\n",
> > >                     call_time, token);
> > >       }
> > > +
> > > +     this_cpu->current_token = 0;
> > > +
> > >       return retval;
> > >  }
> > >
> > > diff --git a/hw/xscom.c b/hw/xscom.c
> > > index f3f4e1039ecf..381cf2c4aaa4 100644
> > > --- a/hw/xscom.c
> > > +++ b/hw/xscom.c
> > > @@ -272,10 +272,17 @@ static int64_t xscom_handle_error(uint64_t hmer, uint32_t gcid, uint32_t pcb_add
> > >               break;
> > >       }
> > >
> > > -     /* XXX: Create error log entry ? */
> > > -     log_simple_error(&e_info(OPAL_RC_XSCOM_RW),
> > > -             "XSCOM: %s error gcid=0x%x pcb_addr=0x%x stat=0x%x\n",
> > > -             is_write ? "write" : "read", gcid, pcb_addr, stat);
> > > +     /*
> > > +      * If we're in an XSCOM opal call then squash the error
> > > +      * we assume that the caller (probably opal-prd) will
> > > +      * handle logging it
> > > +      */
> > > +     if (this_cpu()->current_token != OPAL_XSCOM_READ &&
> > > +         this_cpu()->current_token != OPAL_XSCOM_WRITE) {
> >
> >
> > Shouldn't we restrict to not creating PELs only when the error is
> > OPAL_XSCOM_ADDR_ERROR, as this was the only observable error when the
> > XSCOM read/writes initiated by HBRT were failing ?

Thank Oliver for the nice solution to check for OPAL call context.
This takes care of HBRT and userspace xscom commandline for which we
need not log an error.  Anything else calling xscom during boot will
still log an error which is very rare anyway.  

> I'm pretty sure HBRT will creates PEL for any (unexpected) xscom
> failure so we'd just be doubling up. The other error results we return
> are:
> 
> #define OPAL_XSCOM_BUSY         OPAL_BUSY
> #define OPAL_XSCOM_CHIPLET_OFF  OPAL_WRONG_STATE
> #define OPAL_XSCOM_PARTIAL_GOOD -25
> #define OPAL_XSCOM_ADDR_ERROR   -26
> #define OPAL_XSCOM_CLOCK_ERROR  -27
> #define OPAL_XSCOM_PARITY_ERROR -28
> #define OPAL_XSCOM_TIMEOUT      -29
> #define OPAL_XSCOM_CTR_OFFLINED -30
> 
> IMO these are all variations on the same theme. The caller did a scom
> to a register that isn't accessible for $reason so we might as well
> handle it consistently. If the error needs to be logged then the
> caller can deal with it.

This is fine now since the xscom code is stable and we have coded
error recovery procedures.  We originally needed error logging where
some xscom fails mysteriously affects subsequent xscoms and make them
fail as well. Now we will get prlog prints if there where retries.
We can revisit this logging area if we run into new problems.  

--Vaidy