[Skiboot] [PATCH] phb4: Reset pfir and nfir if new errors reported during ETU reset

Fri Aug 31 21:21:45 AEST 2018

On 08/31/2018 11:30 AM, Oliver O'Halloran wrote:
> On Thu, 2018-08-30 at 18:27 +0530, Vaibhav Jain wrote:
>> During fast-reboot a PCI device can continue sending requests even
>> after ETU-Reset is asserted. This will cause new errors to be
>> reported
>> in ETU fir-registers and will result in values of variables
>> nfir_cache
>> and pfir_cache to be out of sync.
>>
>> Presently during step-2 of CRESET nfir_cache and pfir_cache values
>> are
>> used to bring the PHB out of reset state. However if these variables
>> are out of date the nfir/pfir registers are never reset completely
>> and
>> ETU still remains frozen.
>>
>> Hence this patch updates step-2 of phb4_creset to re-read the values
>> of
>> nfir/pfir registers to check if any new errors were reported after
>> ETU-reset was asserted, report these new errors and reset the
>> nfir/pfir registers. This should bring the ETU out of reset
>> successfully.
> 
> Fun bug.
> 
> We could avoid this by setting the link disable bit before we assert
> the ETU reset in stage 1. That way we shouldn't get any more PCIe
> traffic while the reset is in progress. We set LD later on when we
> call into phb4_freset() so it shouldn't hurt.
> 
> Can you or Vasant give this a try with your repro case?

Yeah. Below change didn't fix the issue. I still see PHB errors and disks are 
not getting detected.

/ # grep PHB /sys/firmware/opal/msglog |grep Error
[  286.016664627,7] PHB#0033:00:00.0 Error -6 resetting
[  286.016682002,7] PHB#0002:00:00.0 Error -6 resetting
[  286.016690554,7] PHB#0001:00:00.0 Error -6 resetting
[  286.016725126,7] PHB#0004:00:00.0 Error -6 resetting
[  286.016736261,7] PHB#0003:00:00.0 Error -6 resetting
[  286.016758274,7] PHB#0031:00:00.0 Error -6 resetting
[  286.016763042,7] PHB#0030:00:00.0 Error -6 resetting
[  286.017111640,7] PHB#0000:00:00.0 Error -6 resetting
[  286.021380651,3] PHB#0002[0:2]: phb4_get_link_info: Error -6 getting link state
[  286.021380646,3] PHB#0001[0:1]: phb4_get_link_info: Error -6 getting link state
[  286.021384703,3] PHB#0033[8:3]: phb4_get_link_info: Error -6 getting link state
[  286.021397387,3] PHB#0033:00:00.0 Error -6 querying link status
[  286.021381928,3] PHB#0004[0:4]: phb4_get_link_info: Error -6 getting link state
[  286.021381356,3] PHB#0003[0:3]: phb4_get_link_info: Error -6 getting link state
[  286.021412009,3] PHB#0004:00:00.0 Error -6 querying link status
[  286.021418175,3] PHB#0003:00:00.0 Error -6 querying link status
[  286.021383000,3] PHB#0030[8:0]: phb4_get_link_info: Error -6 getting link state
[  286.021383992,3] PHB#0002:00:00.0 Error -6 querying link status
[  286.021388334,3] PHB#0001:00:00.0 Error -6 querying link status
[  286.021440040,3] PHB#0030:00:00.0 Error -6 querying link status
[  286.021383563,3] PHB#0031[8:1]: phb4_get_link_info: Error -6 getting link state
[  286.021468511,3] PHB#0031:00:00.0 Error -6 querying link status
[  286.021384405,3] PHB#0000[0:0]: phb4_get_link_info: Error -6 getting link state
[  286.021523256,3] PHB#0000:00:00.0 Error -6 querying link status

-Vasant

> 
> diff --git a/hw/phb4.c b/hw/phb4.c
> index 7cb8b89e5c20..6578aab09e2e 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3124,6 +3124,12 @@ static int64_t phb4_creset(struct pci_slot
> *slot)
>   		PHBDBG(p, "CRESET: Starts\n");
>   
>   		phb4_prepare_link_change(slot, false);
> +
> +		phb4_pcicfg_read16(&p->phb, 0, p->ecap +
> PCICAP_EXP_LCTL,
> +				   &reg16);
> +		reg16 |= PCICAP_EXP_LCTL_LINK_DIS;
> +		phb4_pcicfg_write16(&p->phb, 0, p->ecap +
> PCICAP_EXP_LCTL,
> +				    reg16);
>   		/* Clear error inject register, preventing recursive
> errors */
>   		xscom_write(p->chip_id, p->pe_xscom + 0x2, 0x0);
>   
> 
>>
>> Signed-off-by: Vaibhav Jain <vaibhav at linux.ibm.com>
>> ---
>>   hw/phb4.c | 27 +++++++++++++++++++++++++++
>>   1 file changed, 27 insertions(+)
>>
>> diff --git a/hw/phb4.c b/hw/phb4.c
>> index d1245dce..9c4b54b5 100644
>> --- a/hw/phb4.c
>> +++ b/hw/phb4.c
>> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot
>> *slot)
>>   			xscom_write(p->chip_id, p->pe_stk_xscom +
>> 0x1,
>>   				    ~p->nfir_cache);
>>   
>> +			/* Re-read errors in PFIR and NFIR and reset
>> any new
>> +			 * error reported. This may happen as after
>> fundamental
>> +			 * reset was asserted in previous step the
>> device may
>> +			 * still be sending TLPs causing fence to be
>> raised.
>> +			 */
>> +			xscom_read(p->chip_id, p->pci_stk_xscom +
>> +				   XPEC_PCI_STK_PCI_FIR, &p-
>>> pfir_cache);
>> +			xscom_read(p->chip_id, p->pe_stk_xscom +
>> +				   XPEC_NEST_STK_PCI_NFIR, &p-
>>> nfir_cache);
>> +
>> +			if (p->pfir_cache || p->nfir_cache) {
>> +				PHBERR(p, "CRESET: PHB still fenced
>> !!\n");
>> +				PHBERR(p, "PCI FIR=0x%016llx\n",
>> +				       p->pfir_cache);
>> +				PHBERR(p, "NEST FIR=0x%016llx\n",
>> +				       p->nfir_cache);
>> +
>> +				/* Dump other error registers */
>> +				phb4_eeh_dump_regs(p);
>> +
>> +				/* Reset the PHB errors */
>> +				xscom_write(p->chip_id, p-
>>> pci_stk_xscom +
>> +					    XPEC_PCI_STK_PCI_FIR,
>> 0);
>> +				xscom_write(p->chip_id, p-
>>> pe_stk_xscom +
>> +					    XPEC_NEST_STK_PCI_NFIR,
>> 0);
> 
> Do we also need to call phb4_err_clear() here? If a TLP is causing an
> error during the reset I'd expect some of the lower-level error
> registers to be set, but those might be cleared during the reset.
> 
>> +			}
>> +
>>   			/* Clear PHB from reset */
>>   			xscom_write(p->chip_id,
>>   				    p->pci_stk_xscom +
>> XPEC_PCI_STK_ETU_RESET, 0x0);
>