[Skiboot] [PATCH] phb4: Reset pfir and nfir if new errors reported during ETU reset
Vasant Hegde
hegdevasant at linux.vnet.ibm.com
Fri Aug 31 21:21:45 AEST 2018
On 08/31/2018 11:30 AM, Oliver O'Halloran wrote:
> On Thu, 2018-08-30 at 18:27 +0530, Vaibhav Jain wrote:
>> During fast-reboot a PCI device can continue sending requests even
>> after ETU-Reset is asserted. This will cause new errors to be
>> reported
>> in ETU fir-registers and will result in values of variables
>> nfir_cache
>> and pfir_cache to be out of sync.
>>
>> Presently during step-2 of CRESET nfir_cache and pfir_cache values
>> are
>> used to bring the PHB out of reset state. However if these variables
>> are out of date the nfir/pfir registers are never reset completely
>> and
>> ETU still remains frozen.
>>
>> Hence this patch updates step-2 of phb4_creset to re-read the values
>> of
>> nfir/pfir registers to check if any new errors were reported after
>> ETU-reset was asserted, report these new errors and reset the
>> nfir/pfir registers. This should bring the ETU out of reset
>> successfully.
>
> Fun bug.
>
> We could avoid this by setting the link disable bit before we assert
> the ETU reset in stage 1. That way we shouldn't get any more PCIe
> traffic while the reset is in progress. We set LD later on when we
> call into phb4_freset() so it shouldn't hurt.
>
> Can you or Vasant give this a try with your repro case?
Yeah. Below change didn't fix the issue. I still see PHB errors and disks are
not getting detected.
/ # grep PHB /sys/firmware/opal/msglog |grep Error
[ 286.016664627,7] PHB#0033:00:00.0 Error -6 resetting
[ 286.016682002,7] PHB#0002:00:00.0 Error -6 resetting
[ 286.016690554,7] PHB#0001:00:00.0 Error -6 resetting
[ 286.016725126,7] PHB#0004:00:00.0 Error -6 resetting
[ 286.016736261,7] PHB#0003:00:00.0 Error -6 resetting
[ 286.016758274,7] PHB#0031:00:00.0 Error -6 resetting
[ 286.016763042,7] PHB#0030:00:00.0 Error -6 resetting
[ 286.017111640,7] PHB#0000:00:00.0 Error -6 resetting
[ 286.021380651,3] PHB#0002[0:2]: phb4_get_link_info: Error -6 getting link state
[ 286.021380646,3] PHB#0001[0:1]: phb4_get_link_info: Error -6 getting link state
[ 286.021384703,3] PHB#0033[8:3]: phb4_get_link_info: Error -6 getting link state
[ 286.021397387,3] PHB#0033:00:00.0 Error -6 querying link status
[ 286.021381928,3] PHB#0004[0:4]: phb4_get_link_info: Error -6 getting link state
[ 286.021381356,3] PHB#0003[0:3]: phb4_get_link_info: Error -6 getting link state
[ 286.021412009,3] PHB#0004:00:00.0 Error -6 querying link status
[ 286.021418175,3] PHB#0003:00:00.0 Error -6 querying link status
[ 286.021383000,3] PHB#0030[8:0]: phb4_get_link_info: Error -6 getting link state
[ 286.021383992,3] PHB#0002:00:00.0 Error -6 querying link status
[ 286.021388334,3] PHB#0001:00:00.0 Error -6 querying link status
[ 286.021440040,3] PHB#0030:00:00.0 Error -6 querying link status
[ 286.021383563,3] PHB#0031[8:1]: phb4_get_link_info: Error -6 getting link state
[ 286.021468511,3] PHB#0031:00:00.0 Error -6 querying link status
[ 286.021384405,3] PHB#0000[0:0]: phb4_get_link_info: Error -6 getting link state
[ 286.021523256,3] PHB#0000:00:00.0 Error -6 querying link status
-Vasant
>
> diff --git a/hw/phb4.c b/hw/phb4.c
> index 7cb8b89e5c20..6578aab09e2e 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3124,6 +3124,12 @@ static int64_t phb4_creset(struct pci_slot
> *slot)
> PHBDBG(p, "CRESET: Starts\n");
>
> phb4_prepare_link_change(slot, false);
> +
> + phb4_pcicfg_read16(&p->phb, 0, p->ecap +
> PCICAP_EXP_LCTL,
> + ®16);
> + reg16 |= PCICAP_EXP_LCTL_LINK_DIS;
> + phb4_pcicfg_write16(&p->phb, 0, p->ecap +
> PCICAP_EXP_LCTL,
> + reg16);
> /* Clear error inject register, preventing recursive
> errors */
> xscom_write(p->chip_id, p->pe_xscom + 0x2, 0x0);
>
>
>>
>> Signed-off-by: Vaibhav Jain <vaibhav at linux.ibm.com>
>> ---
>> hw/phb4.c | 27 +++++++++++++++++++++++++++
>> 1 file changed, 27 insertions(+)
>>
>> diff --git a/hw/phb4.c b/hw/phb4.c
>> index d1245dce..9c4b54b5 100644
>> --- a/hw/phb4.c
>> +++ b/hw/phb4.c
>> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot
>> *slot)
>> xscom_write(p->chip_id, p->pe_stk_xscom +
>> 0x1,
>> ~p->nfir_cache);
>>
>> + /* Re-read errors in PFIR and NFIR and reset
>> any new
>> + * error reported. This may happen as after
>> fundamental
>> + * reset was asserted in previous step the
>> device may
>> + * still be sending TLPs causing fence to be
>> raised.
>> + */
>> + xscom_read(p->chip_id, p->pci_stk_xscom +
>> + XPEC_PCI_STK_PCI_FIR, &p-
>>> pfir_cache);
>> + xscom_read(p->chip_id, p->pe_stk_xscom +
>> + XPEC_NEST_STK_PCI_NFIR, &p-
>>> nfir_cache);
>> +
>> + if (p->pfir_cache || p->nfir_cache) {
>> + PHBERR(p, "CRESET: PHB still fenced
>> !!\n");
>> + PHBERR(p, "PCI FIR=0x%016llx\n",
>> + p->pfir_cache);
>> + PHBERR(p, "NEST FIR=0x%016llx\n",
>> + p->nfir_cache);
>> +
>> + /* Dump other error registers */
>> + phb4_eeh_dump_regs(p);
>> +
>> + /* Reset the PHB errors */
>> + xscom_write(p->chip_id, p-
>>> pci_stk_xscom +
>> + XPEC_PCI_STK_PCI_FIR,
>> 0);
>> + xscom_write(p->chip_id, p-
>>> pe_stk_xscom +
>> + XPEC_NEST_STK_PCI_NFIR,
>> 0);
>
> Do we also need to call phb4_err_clear() here? If a TLP is causing an
> error during the reset I'd expect some of the lower-level error
> registers to be set, but those might be cleared during the reset.
>
>> + }
>> +
>> /* Clear PHB from reset */
>> xscom_write(p->chip_id,
>> p->pci_stk_xscom +
>> XPEC_PCI_STK_ETU_RESET, 0x0);
>
More information about the Skiboot
mailing list