[Skiboot] [PATCH] fast-reboot: move pci_reset error handling into fast-reboot code

Vaidyanathan Srinivasan svaidy at linux.vnet.ibm.com
Sat Feb 3 05:05:05 AEDT 2018


* Nicholas Piggin <npiggin at gmail.com> [2018-02-02 15:46:24]:

> pci_reset() currently does a platform reboot if it fails. It
> should not know about fast-reboot at this level, so instead have
> it return an error, and the fast reboot caller will do the
> platform reboot.
> 
> The code essentially does the same thing, but flexibility is
> improved. Ideally the fast reboot code should perform pci_reset
> and all such fail-able operations before the CPU resets itself
> and destroys its own stack. That's not the case now, but that
> should be the goal.
> 
> Signed-off-by: Nicholas Piggin <npiggin at gmail.com>

Acked-by: Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com>

> ---
>  core/fast-reboot.c | 13 ++++++++++++-
>  core/pci.c         | 11 +++++------
>  include/pci.h      |  2 +-
>  3 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/core/fast-reboot.c b/core/fast-reboot.c
> index 1c76c0891..382e781ae 100644
> --- a/core/fast-reboot.c
> +++ b/core/fast-reboot.c
> @@ -348,7 +348,18 @@ void __noreturn fast_reboot_entry(void)
>  	}
>  
>  	/* Remove all PCI devices */
> -	pci_reset();
> +	if (pci_reset()) {
> +		prlog(PR_NOTICE, "RESET: Fast reboot failed to reset PCI\n");
> +
> +		/*
> +		 * Can't return to caller here because we're past no-return.
> +		 * Attempt an IPL here which is what the caller would do.
> +		 */
> +		if (platform.cec_reboot)
> +			platform.cec_reboot();
> +		for (;;)
> +			;
> +	}

The code change improves error handling.


>  
>  	ipmi_set_fw_progress_sensor(IPMI_FW_PCI_INIT);
>  
> diff --git a/core/pci.c b/core/pci.c
> index 0809521f8..494a33a45 100644
> --- a/core/pci.c
> +++ b/core/pci.c
> @@ -1668,7 +1668,7 @@ static void __pci_reset(struct list_head *list)
>  	}
>  }
>  
> -void pci_reset(void)
> +int64_t pci_reset(void)
>  {
>  	unsigned int i;
>  	struct pci_slot *slot;
> @@ -1695,11 +1695,9 @@ void pci_reset(void)
>  				rc = slot->ops.run_sm(slot);
>  			}
>  			if (rc < 0) {
> -				PCIERR(phb, 0, "Complete reset failed, aborting"
> -				               "fast reboot (rc=%lld)\n", rc);
> -				if (platform.cec_reboot)
> -					platform.cec_reboot();
> -				while (true) {}
> +				PCIERR(phb, 0, "Complete reset failed "
> +				               "(rc=%lld)\n", rc);
> +				return rc;

However...

If we abort at the PCI link that fails and return at this point, we
are giving up too early. If subsequent links can get re-enabled then
fast-reboot can be made to work.

[  141.208697990,7] PHB#0003[0:3]: LINK: Link is stable
[  141.208763671,7] PHB#0003[0:3]: LINK: Card [15b3:1013] Degraded Retry:disabled
[  141.208832155,7] PHB#0003[0:3]: LINK: Speed Train:GEN3 PHB:GEN4 DEV:GEN3
[  141.208872220,7] PHB#0003[0:3]: LINK: Width Train:x08 PHB:x16 DEV:x16 *
[  141.214051722,3] PCI-SLOT-0000000000000003 Invalid state 00000000
[  141.215181558,3] PHB#0003:00:00.0 Complete reset failed (rc=-6)
[  141.216418848,5] RESET: Fast reboot failed to reset PCI
[  141.217638534,6] IPMI: sending chassis control request 0x03
[  141.217676559,6] BT: seq 0x1c netfn 0x00 cmd 0x02: Message sent to host
[  141.482616079,7] MBOX-FLASH: Adjusting the window
[  141.482676812,7] LPC-MBOX: Sending BMC interrupt
[  142.246839786,7] MBOX-FLASH: Adjusting the window
[  142.247414703,7] LPC-MBOX: Sending BMC interrupt
[  142.349862574,6] BT: seq 0x1c netfn 0x00 cmd 0x02: IPMI MSG done
[  143.015709854,6] STB: BOOTKERNEL verified
[  143.015768123,7] LPC-MBOX: Sending BMC interrupt
[  143.118218291,7] blocklevel_read: 0x0        0x31c03b48      0x30
[  143.118283317,7] blocklevel_raw_read: 0x0    0x31c03b48      0x30
[  143.118352801,7] MBOX-FLASH: Adjusting the window
[  143.118382662,7] LPC-MBOX: Sending BMC interrupt
[  143.220835377,7] FFS: Partition map size: 0x1000
[  143.221135902,7] blocklevel_read: 0x0        0x30a3f218      0x1000
[  143.221166425,7] blocklevel_raw_read: 0x0    0x30a3f218      0x1000
[  143.223081918,7] FLASH: No ROOTFS partition

On one of the failing system where one of the PHB link does not train,
I am able to skip by continuing this loop and successfully complete
the fast-reset.

In this specific case there are no devices attached to that PHB link.
We need a way to understand why the link reset fails and skip it if
this is not due to failing hardware and proceed with reset/train of
all other links and complete the loop.

diff --git a/core/pci.c b/core/pci.c
index 0809521f..f708b1a0 100644
--- a/core/pci.c
+++ b/core/pci.c
@@ -1695,11 +1695,8 @@ void pci_reset(void)
                                rc = slot->ops.run_sm(slot);
                        }
                        if (rc < 0) {
-                               PCIERR(phb, 0, "Complete reset failed, aborting"
+                               PCIERR(phb, 0, "Complete reset failed, CONTINUING "
                                               "fast reboot (rc=%lld)\n", rc);
-                               if (platform.cec_reboot)
-                                       platform.cec_reboot();
-                               while (true) {}
                        }

The above hack makes fast-reboot work well on one of the test system.
We need a way to avoid treating all PCI reset fails as fatal and fallback
to full IPL.

I am using fast-reboot for runtime PState (OCC) reload and related
tests and hence making fast-reboot work is less than ideal scenarios
will be very helpful :)

--Vaidy



More information about the Skiboot mailing list