[Skiboot] [PATCH 7/7] hmi: Add handling for NPU checkstops
Russell Currey
ruscur at russell.cc
Fri Mar 18 15:48:43 AEDT 2016
On Fri, 2016-03-18 at 11:29 +1100, Alistair Popple wrote:
> Russell,
>
> A couple of minor nit-picks below, probably worth fixing if doing a V2.
>
> - Alistair
>
> On Tue, 15 Mar 2016 18:33:57 Russell Currey wrote:
<snip>
> >
> > +static void find_npu_checkstop_reason(int flat_chip_id,
> > + struct OpalHMIEvent *hmi_evt,
> > + int *event_generated)
> > +{
> > + struct phb *phb;
> > + struct npu *p = NULL;
> > +
> > + int i;
> > +
> > + uint64_t npu_fir;
> > + uint64_t npu_fir_mask;
> > + uint64_t npu_fir_action0;
> > + uint64_t npu_fir_action1;
> > + uint64_t fatal_errors;
> > +
> > + /* Find the NPU on the chip associated with the HMI. */
> > + for (i = 0; i < MAX_PHB_ID; i++) {
> > + phb = pci_get_phb(i);
> > +
> > + /* If we've gone through all PHBs, we're done. */
> > + if (phb == NULL)
> > + break;
> This assumes there are no holes in the PHB ID's which might not be true
> depending on how the PHBs were registered (ie. if the opal_id was passed
> to
> pci_register_phb instead of OPAL_DYNAMIC_PHB_ID) or if we ever implement
> pci_unregister_phb.
>
> It might better to implement for_each_phb() or similar in pci.h that can
> walk
> all the PHBs.
Apparently PHBs aren't contiguous, will have to make some PCI changes.
Definitely a good idea to handle that in PCI land and not HMI land.
>
> >
> > + if (dt_node_is_compatible(phb->dt_node, "ibm,power8-
> > npu-
> pciex") &&
> >
> > + (dt_get_chip_id(phb->dt_node) == flat_chip_id)) {
> > + p = phb_to_npu(phb);
> > + break;
> This assumes each chip only has a single NPU, which is true today but if
> you're doing a v2 it might be worth a comment to remind us to update this
> if
> it ever changes...
Good idea.
>
> >
> > + }
> > + }
> > +
> > + /* If we didn't find a NPU on the chip, it's not our
> > checkstop. */
> > + if (p == NULL)
> > + return;
> > +
> > + /* Read all the registers necessary to find a checkstop
> > condition. */
> > + if (xscom_read(flat_chip_id,
> > + p->at_xscom + NX_FIR, &npu_fir) ||
> > + xscom_read(flat_chip_id,
> > + p->at_xscom + NX_FIR_MASK, &npu_fir_mask) ||
> > + xscom_read(flat_chip_id,
> > + p->at_xscom + NX_FIR_ACTION0, &npu_fir_action0)
> > ||
> > + xscom_read(flat_chip_id,
> > + p->at_xscom + NX_FIR_ACTION1,
> > &npu_fir_action1)) {
> > + prerror("HMI: Couldn't read NPU registers with
> > XSCOM\n");
> > + return;
> > + }
> > +
> > + fatal_errors = npu_fir & ~npu_fir_mask & npu_fir_action0 &
> npu_fir_action1;
> >
> > +
> > + /* If there's no errors, we don't need to do anything. */
> > + if (!fatal_errors)
> > + return;
> > +
> > + prlog(PR_DEBUG,
> > + "NPU: FIR %llx FIR mask %llx FIR ACTION0 %llx FIR
> > ACTION1
> %llx\n",
> >
> > + npu_fir, npu_fir_mask, npu_fir_action0,
> > npu_fir_action1);
> > +
> > + /* Set the NPU to fenced since it can't recover. */
> > + p->fenced = true;
> I assume we don't need a lock here because we never clear the flag?
Yeah, that's right. If fences are ever recoverable that would need to
change.
>
> >
> > +
> > + /* Set up the HMI event */
> > + hmi_evt->severity = OpalHMI_SEV_WARNING;
> > + hmi_evt->type = OpalHMI_ERROR_MALFUNC_ALERT;
> > + hmi_evt->u.xstop_error.xstop_type = CHECKSTOP_TYPE_NPU;
> > + hmi_evt->u.xstop_error.u.chip_id = flat_chip_id;
> > +
> > + /* The HMI is "recoverable" because it shouldn't crash the
> > system */
> > + queue_hmi_event(hmi_evt, 1);
> > + *event_generated = 1;
> > +}
> > +
> > static void decode_malfunction(struct OpalHMIEvent *hmi_evt)
> > {
> > int i;
> > @@ -450,8 +523,11 @@ static void decode_malfunction(struct
> > OpalHMIEvent
> *hmi_evt)
> >
> > queue_hmi_event(hmi_evt, recover);
> > event_generated = 1;
> > }
> > -
> > find_nx_checkstop_reason(i, hmi_evt,
> &event_generated);
> >
> > + /* Only check for NPU errors if we have a NPU.
> > */
> > + if (PVR_TYPE(mfspr(SPR_PVR)) ==
> > PVR_TYPE_P8NVL)
> > + find_npu_checkstop_reason(i, hmi_evt,
> > + &event_gener
> > ated);
> > }
> >
> > if (recover != -1) {
> > diff --git a/include/opal-api.h b/include/opal-api.h
> > index 369aa93..0b7b0bb 100644
> > --- a/include/opal-api.h
> > +++ b/include/opal-api.h
> > @@ -577,6 +577,7 @@ enum OpalHMI_XstopType {
> > CHECKSTOP_TYPE_UNKNOWN = 0,
> > CHECKSTOP_TYPE_CORE = 1,
> > CHECKSTOP_TYPE_NX = 2,
> > + CHECKSTOP_TYPE_NPU = 3
> > };
> >
> > enum OpalHMI_CoreXstopReason {
> >
More information about the Skiboot
mailing list