[Skiboot] [PATCH] phb4: Reset pfir and nfir if new errors reported during ETU reset

Mon Sep 17 13:22:44 AEST 2018

We should probably just merge this, and if it turns out there are other
problems we can fix them later.

On Thu, Aug 30, 2018 at 10:57 PM, Vaibhav Jain <vaibhav at linux.ibm.com> wrote:
> During fast-reboot a PCI device can continue sending requests even
> after ETU-Reset is asserted. This will cause new errors to be reported
> in ETU fir-registers and will result in values of variables nfir_cache
> and pfir_cache to be out of sync.
>
> Presently during step-2 of CRESET nfir_cache and pfir_cache values are
> used to bring the PHB out of reset state. However if these variables
> are out of date the nfir/pfir registers are never reset completely and
> ETU still remains frozen.
>
> Hence this patch updates step-2 of phb4_creset to re-read the values of
> nfir/pfir registers to check if any new errors were reported after
> ETU-reset was asserted, report these new errors and reset the
> nfir/pfir registers. This should bring the ETU out of reset
> successfully.
>
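
To make the stale-cache problem described above concrete, here is a
minimal standalone sketch. It is a toy model, not skiboot code: the
struct and every function name below are hypothetical, and it assumes
the write to offset 0x1 behaves as an AND-mask write that only clears
the bits named in the cached value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one FIR; not a skiboot structure. */
struct fir_model {
	uint64_t bits;
};

/* Assumed AND-mask write: keeps only the bits that are set in mask. */
static void fir_write_and(struct fir_model *fir, uint64_t mask)
{
	fir->bits &= mask;
}

/* Plain write: replaces the whole register value. */
static void fir_write(struct fir_model *fir, uint64_t val)
{
	fir->bits = val;
}

/* Clear using only the cached snapshot; returns the bits left set. */
uint64_t clear_from_cache(struct fir_model *fir, uint64_t cache)
{
	fir_write_and(fir, ~cache);
	return fir->bits;
}

/* Clear from the cache, then re-read and zero anything still set. */
uint64_t clear_with_reread(struct fir_model *fir, uint64_t cache)
{
	fir_write_and(fir, ~cache);
	if (fir->bits)		/* an error arrived after the snapshot */
		fir_write(fir, 0);
	return fir->bits;
}

/* Walk through both cases with one old and one late error bit. */
void check_model(void)
{
	const uint64_t old_err = 1ull << 3;	/* present in the cache */
	const uint64_t new_err = 1ull << 7;	/* raised after the snapshot */
	struct fir_model fir = { old_err | new_err };

	/* Cache-only clear leaves the late error bit behind. */
	assert(clear_from_cache(&fir, old_err) == new_err);

	/* Re-reading and writing zero clears everything. */
	fir.bits = old_err | new_err;
	assert(clear_with_reread(&fir, old_err) == 0);
}
```

In this model the cache-only path leaves the late bit set, which
corresponds to the ETU staying frozen, while the re-read path ends
with the register fully cleared.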
> Signed-off-by: Vaibhav Jain <vaibhav at linux.ibm.com>
> ---
>  hw/phb4.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
>
> diff --git a/hw/phb4.c b/hw/phb4.c
> index d1245dce..9c4b54b5 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot *slot)
>                         xscom_write(p->chip_id, p->pe_stk_xscom + 0x1,
>                                     ~p->nfir_cache);
>
> +                       /* Re-read errors in PFIR and NFIR and reset any new
> +                        * error reported. This may happen as after fundamental
> +                        * reset was asserted in previous step the device may
> +                        * still be sending TLPs causing fence to be raised.
> +                        */
> +                       xscom_read(p->chip_id, p->pci_stk_xscom +
> +                                  XPEC_PCI_STK_PCI_FIR, &p->pfir_cache);
> +                       xscom_read(p->chip_id, p->pe_stk_xscom +
> +                                  XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache);
> +
> +                       if (p->pfir_cache || p->nfir_cache) {
> +                               PHBERR(p, "CRESET: PHB still fenced !!\n");
> +                               PHBERR(p, "PCI FIR=0x%016llx\n",
> +                                      p->pfir_cache);
> +                               PHBERR(p, "NEST FIR=0x%016llx\n",
> +                                      p->nfir_cache);

The AIB error log register is also useful here; can you dump that as well?
Something like this does the trick:

diff --git a/hw/phb4.c b/hw/phb4.c
index f463d20d15be..0f6d8bc5eb0b 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -3200,9 +3200,16 @@ static int64_t phb4_creset(struct pci_slot *slot)
     XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache);

  if (p->pfir_cache || p->nfir_cache) {
+ uint64_t err_aib;
+
+ xscom_read(p->chip_id, p->pci_stk_xscom +
+ XPEC_PCI_STK_PBAIB_ERR_REPORT,
+ &err_aib);
+
  PHBERR(p, "CRESET: PHB still fenced !!\n");
  PHBERR(p, "PCI FIR=0x%016llx\n",
         p->pfir_cache);
  PHBERR(p, "NEST FIR=0x%016llx\n",
         p->nfir_cache);
+ PHBERR(p, "AIB ERR=0x%016llx\n", err_aib);

> +                               /* Dump other error registers */
> +                               phb4_eeh_dump_regs(p);

At this point the ETU is still in reset, and trying to dump the error
registers floods the msglog with XSCOM errors. I think this is because
the HV indirect register access XSCOMs don't work while reset is
asserted, so this call should probably be removed.

Otherwise, Reviewed-by: Oliver O'Halloran <oohall at gmail.com>

> +
> +                               /* Reset the PHB errors */
> +                               xscom_write(p->chip_id, p->pci_stk_xscom +
> +                                           XPEC_PCI_STK_PCI_FIR, 0);
> +                               xscom_write(p->chip_id, p->pe_stk_xscom +
> +                                           XPEC_NEST_STK_PCI_NFIR, 0);
> +                       }
> +
>                         /* Clear PHB from reset */
>                         xscom_write(p->chip_id,
>                                     p->pci_stk_xscom + XPEC_PCI_STK_ETU_RESET, 0x0);
> --
> 2.17.1