[Skiboot] [PATCH] phb4: Don't probe a PHB if its garded
Stewart Smith
stewart at linux.ibm.com
Thu Sep 13 18:50:04 AEST 2018
Oliver <oohall at gmail.com> writes:
> On Tue, Aug 21, 2018 at 8:14 PM, Vaibhav Jain <vaibhav at linux.ibm.com> wrote:
>> Presently phb4_probe_stack() causes an exception while trying to probe
>> a PHB if its garded. This causes skiboot to go into a reboot loop with
>> following exception log:
>>
>> ***********************************************
>> Fatal MCE at 000000003006ecd4 .probe_phb4+0x570
>> CFAR : 00000000300b98a0
>> <snip>
>> Aborting!
>> CPU 0018 Backtrace:
>> S: 0000000031cc37e0 R: 000000003001a51c ._abort+0x4c
>> S: 0000000031cc3860 R: 0000000030028170 .exception_entry+0x180
>> S: 0000000031cc3a40 R: 0000000000001f10 *
>> S: 0000000031cc3c20 R: 000000003006ecb0 .probe_phb4+0x54c
>> S: 0000000031cc3e30 R: 0000000030014ca4 .main_cpu_entry+0x5b0
>> S: 0000000031cc3f00 R: 0000000030002700 boot_entry+0x1b8
>>
>> This is caused as phb4_probe_stack() will ignore all xscom read/write
>> errors to enable PHB Bars and then tries to perform an mmio to read
>> PHB Version registers that cause the fatal MCE.
>>
>> We fix this by ignoring the PHB probe if the first xscom_write() to
>> populate the PHB Bar register fails, which indicates that there is
>> something wrong with the PHB.
>
> There's a fix for this in upstream hostboot too. Unfortunately, the HB
> version in op-build hasn't been bumped in months due to compiler
> issues so this is probably the best way to fix this bug.
>
> Reviewed-by: Oliver O'Halloran <oohall at gmail.com>
>
>> Signed-off-by: Vaibhav Jain <vaibhav at linux.ibm.com>
>
> This should probably be CCed to stable so it goes into the sable builds too.
>
>> ---
>> hw/phb4.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/phb4.c b/hw/phb4.c
>> index d1245dce..89b0b859 100644
>> --- a/hw/phb4.c
>> +++ b/hw/phb4.c
>> @@ -5546,6 +5546,7 @@ static void phb4_probe_stack(struct dt_node *stk_node, uint32_t pec_index,
>> char *path;
>> uint64_t capp_ucode_base;
>> unsigned int max_link_speed;
>> + int rc;
>>
>> gcid = dt_get_chip_id(stk_node);
>> stk_index = dt_prop_get_u32(stk_node, "reg");
>> @@ -5567,9 +5568,17 @@ static void phb4_probe_stack(struct dt_node *stk_node, uint32_t pec_index,
>>
>> /* Initialize PHB register BAR */
>> phys_map_get(gcid, PHB4_REG_SPC, phb_num, &phb_bar, NULL);
>> - xscom_write(gcid, nest_stack + XPEC_NEST_STK_PHB_REG_BAR, phb_bar << 8);
>> - bar_en |= XPEC_NEST_STK_BAR_EN_PHB;
>> + rc = xscom_write(gcid, nest_stack + XPEC_NEST_STK_PHB_REG_BAR,
>> + phb_bar << 8);
>> +
>> + /* A scom error here probably indicates a defective/garded PHB */
>> + if (rc != OPAL_SUCCESS) {
>> + prerror("PHB[%d:%d] Unable to set PHB BAR. Error=%d\n",
>> + gcid, phb_num, rc);
>> + return;
>
> Setting status = "broken" in the device-tree might also be a good
> idea.
Yeah, it would be. I'll take it in a follow-up patch though.
We really need to get some good degraded mode testing into op-test.
--
Stewart Smith
OPAL Architect, IBM.
More information about the Skiboot
mailing list