[Skiboot] [PATCH] phb4: Don't probe a PHB if its garded

Thu Sep 13 18:50:04 AEST 2018

Oliver <oohall at gmail.com> writes:
> On Tue, Aug 21, 2018 at 8:14 PM, Vaibhav Jain <vaibhav at linux.ibm.com> wrote:
>> Presently phb4_probe_stack() causes an exception while trying to probe
>> a PHB if its garded. This causes skiboot to go into a reboot loop with
>> following exception log:
>>
>>  ***********************************************
>>  Fatal MCE at 000000003006ecd4   .probe_phb4+0x570
>>  CFAR : 00000000300b98a0
>>  <snip>
>>  Aborting!
>> CPU 0018 Backtrace:
>>  S: 0000000031cc37e0 R: 000000003001a51c   ._abort+0x4c
>>  S: 0000000031cc3860 R: 0000000030028170   .exception_entry+0x180
>>  S: 0000000031cc3a40 R: 0000000000001f10 *
>>  S: 0000000031cc3c20 R: 000000003006ecb0   .probe_phb4+0x54c
>>  S: 0000000031cc3e30 R: 0000000030014ca4   .main_cpu_entry+0x5b0
>>  S: 0000000031cc3f00 R: 0000000030002700   boot_entry+0x1b8
>>
>> This is caused as phb4_probe_stack() will ignore all xscom read/write
>> errors to enable PHB Bars and then tries to perform an mmio to read
>> PHB Version registers that cause the fatal MCE.
>>
>> We fix this by ignoring the PHB probe if the first xscom_write() to
>> populate the PHB Bar register fails, which indicates that there is
>> something wrong with the PHB.
>
> There's a fix for this in upstream hostboot too. Unfortunately, the HB
> version in op-build hasn't been bumped in months due to compiler
> issues so this is probably the best way to fix this bug.
>
> Reviewed-by: Oliver O'Halloran <oohall at gmail.com>
>
>> Signed-off-by: Vaibhav Jain <vaibhav at linux.ibm.com>
>
> This should probably be CCed to stable so it goes into the sable builds too.
>
>> ---
>>  hw/phb4.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/phb4.c b/hw/phb4.c
>> index d1245dce..89b0b859 100644
>> --- a/hw/phb4.c
>> +++ b/hw/phb4.c
>> @@ -5546,6 +5546,7 @@ static void phb4_probe_stack(struct dt_node *stk_node, uint32_t pec_index,
>>         char *path;
>>         uint64_t capp_ucode_base;
>>         unsigned int max_link_speed;
>> +       int rc;
>>
>>         gcid = dt_get_chip_id(stk_node);
>>         stk_index = dt_prop_get_u32(stk_node, "reg");
>> @@ -5567,9 +5568,17 @@ static void phb4_probe_stack(struct dt_node *stk_node, uint32_t pec_index,
>>
>>         /* Initialize PHB register BAR */
>>         phys_map_get(gcid, PHB4_REG_SPC, phb_num, &phb_bar, NULL);
>> -       xscom_write(gcid, nest_stack + XPEC_NEST_STK_PHB_REG_BAR, phb_bar << 8);
>> -       bar_en |= XPEC_NEST_STK_BAR_EN_PHB;
>> +       rc = xscom_write(gcid, nest_stack + XPEC_NEST_STK_PHB_REG_BAR,
>> +                        phb_bar << 8);
>> +
>> +       /* A scom error here probably indicates a defective/garded PHB */
>> +       if (rc != OPAL_SUCCESS) {
>> +               prerror("PHB[%d:%d] Unable to set PHB BAR. Error=%d\n",
>> +                     gcid, phb_num, rc);
>> +               return;
>
> Setting status = "broken" in the device-tree might also be a good
> idea.

Yeah, it would be. I'll take it in a follow-up patch though.

We really need to get some good degraded mode testing into op-test.

-- 
Stewart Smith
OPAL Architect, IBM.