[Skiboot] [PATCH] opal/hmi: Display correct chip id while printing NPU FIRs.

Balbir Singh bsingharora at gmail.com
Tue Jun 5 10:45:14 AEST 2018


On Mon, Jun 4, 2018 at 4:09 PM, Mahesh Jagannath Salgaonkar
<mahesh at linux.vnet.ibm.com> wrote:
>
> On 06/01/2018 08:54 PM, Balbir Singh wrote:
> > On Thu, May 31, 2018 at 6:34 PM, Mahesh J Salgaonkar
> > <mahesh at linux.vnet.ibm.com> wrote:
> >> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> >>
> >> HMIs for NPU xstops are broadcasted to all chips. All cores on all the
> >> chips receive HMI. HMI handler correctly identifies and extracts the
> >> NPU FIR details from affected chip, but while printing FIR data it
> >> prints chip id and location code details of this_cpu()->chip_id which
> >> may not be correct. This patch fixes this issue.
> >>
> >
> > The core does not matter here, does it? It's the NPU, it's one per
> > socket/processor
> > If fir, mask and action match, this_cpu()->chip_id's location code
> > should be correct.
> > Am I missing something? Have you seen examples of this printing the wrong thing
> > or did you catch this via code-review?
>
> There is a BUG reported where submitter see same NPU FIR value shown for
> both chip ids. It creates a confusion as which NPU got an error. The
> code checks for pMisc Receive Malfunction Alert to find out correct chip
> id (flat_chip_id) where NPU got an error. We query NPU FIRs using
> flat_chip_id not this_cpu()->chip_id
>
> # putscom -c 0x0  0x5013C00 0x4000000000000000
> [84667.176936787,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc0] P:0
> FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
> [84667.187268490,3] HMI: NPU2: [Loc: UOPWR.786ECFA-Node0-Proc1] P:8
> FIR#0 FIR 0x4000000000000000 mask 0x009a48180f03ffff
>
> The second o/p above creates the confusion. It prints the same info but
> with different chip id and location code.

OK, so both cores see the HMI (it's probably broadcast?) and pass the
following test

xscom_read(this_cpu()->chip_id, malf_alert_scom, &malf_alert);
if (!malf_alert)
    return;

They also have bit i set in the malf_alert, since they seem to pass,
so we know that for that this_cpu()->chip_id, malf_alert did point to
chiplet i as potentially having an error due to the

if (malf_alert & PPC_BIT(i)) check

Anyway, I am fine with the changes, if you feel it's the right thing.

Balbir Singh.


More information about the Skiboot mailing list