[Skiboot] [PATCH] phb4: Reset pfir and nfir if new errors reported during ETU reset

Fri Aug 31 16:00:01 AEST 2018

On Thu, 2018-08-30 at 18:27 +0530, Vaibhav Jain wrote:
> During fast-reboot a PCI device can continue sending requests even
> after ETU-Reset is asserted. This will cause new errors to be
> reported
> in ETU fir-registers and will result in values of variables
> nfir_cache
> and pfir_cache to be out of sync.
> 
> Presently during step-2 of CRESET nfir_cache and pfir_cache values
> are
> used to bring the PHB out of reset state. However if these variables
> are out of date the nfir/pfir registers are never reset completely
> and
> ETU still remains frozen.
> 
> Hence this patch updates step-2 of phb4_creset to re-read the values
> of
> nfir/pfir registers to check if any new errors were reported after
> ETU-reset was asserted, report these new errors and reset the
> nfir/pfir registers. This should bring the ETU out of reset
> successfully.

Fun bug.

We could avoid this by setting the link disable bit before we assert
the ETU reset in stage 1. That way we shouldn't get any more PCIe
traffic while the reset is in progress. We set LD later on when we
call into phb4_freset() so it shouldn't hurt.

Can you or Vasant give this a try with your repro case?

diff --git a/hw/phb4.c b/hw/phb4.c
index 7cb8b89e5c20..6578aab09e2e 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -3124,6 +3124,12 @@ static int64_t phb4_creset(struct pci_slot
*slot)
 		PHBDBG(p, "CRESET: Starts\n");
 
 		phb4_prepare_link_change(slot, false);
+
+		phb4_pcicfg_read16(&p->phb, 0, p->ecap +
PCICAP_EXP_LCTL,
+				   &reg16);
+		reg16 |= PCICAP_EXP_LCTL_LINK_DIS;
+		phb4_pcicfg_write16(&p->phb, 0, p->ecap +
PCICAP_EXP_LCTL,
+				    reg16);
 		/* Clear error inject register, preventing recursive
errors */
 		xscom_write(p->chip_id, p->pe_xscom + 0x2, 0x0);
 

> 
> Signed-off-by: Vaibhav Jain <vaibhav at linux.ibm.com>
> ---
>  hw/phb4.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/hw/phb4.c b/hw/phb4.c
> index d1245dce..9c4b54b5 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot
> *slot)
>  			xscom_write(p->chip_id, p->pe_stk_xscom +
> 0x1,
>  				    ~p->nfir_cache);
>  
> +			/* Re-read errors in PFIR and NFIR and reset
> any new
> +			 * error reported. This may happen as after
> fundamental
> +			 * reset was asserted in previous step the
> device may
> +			 * still be sending TLPs causing fence to be
> raised.
> +			 */
> +			xscom_read(p->chip_id, p->pci_stk_xscom +
> +				   XPEC_PCI_STK_PCI_FIR, &p-
> >pfir_cache);
> +			xscom_read(p->chip_id, p->pe_stk_xscom +
> +				   XPEC_NEST_STK_PCI_NFIR, &p-
> >nfir_cache);
> +
> +			if (p->pfir_cache || p->nfir_cache) {
> +				PHBERR(p, "CRESET: PHB still fenced
> !!\n");
> +				PHBERR(p, "PCI FIR=0x%016llx\n",
> +				       p->pfir_cache);
> +				PHBERR(p, "NEST FIR=0x%016llx\n",
> +				       p->nfir_cache);
> +
> +				/* Dump other error registers */
> +				phb4_eeh_dump_regs(p);
> +
> +				/* Reset the PHB errors */
> +				xscom_write(p->chip_id, p-
> >pci_stk_xscom +
> +					    XPEC_PCI_STK_PCI_FIR,
> 0);
> +				xscom_write(p->chip_id, p-
> >pe_stk_xscom +
> +					    XPEC_NEST_STK_PCI_NFIR,
> 0);

Do we also need to call phb4_err_clear() here? If a TLP is causing an
error during the reset I'd expect some of the lower-level error
registers to be set, but those might be cleared during the reset.

> +			}
> +
>  			/* Clear PHB from reset */
>  			xscom_write(p->chip_id,
>  				    p->pci_stk_xscom +
> XPEC_PCI_STK_ETU_RESET, 0x0);