[Skiboot] [PATCH] fast-reboot: move pci_reset error handling into fast-reboot code
Vaidyanathan Srinivasan
svaidy at linux.vnet.ibm.com
Sat Feb 3 05:05:05 AEDT 2018
* Nicholas Piggin <npiggin at gmail.com> [2018-02-02 15:46:24]:
> pci_reset() currently does a platform reboot if it fails. It
> should not know about fast-reboot at this level, so instead have
> it return an error, and the fast reboot caller will do the
> platform reboot.
>
> The code essentially does the same thing, but flexibility is
> improved. Ideally the fast reboot code should perform pci_reset
> and all such fail-able operations before the CPU resets itself
> and destroys its own stack. That's not the case now, but that
> should be the goal.
>
> Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
Acked-by: Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com>
> ---
> core/fast-reboot.c | 13 ++++++++++++-
> core/pci.c | 11 +++++------
> include/pci.h | 2 +-
> 3 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/core/fast-reboot.c b/core/fast-reboot.c
> index 1c76c0891..382e781ae 100644
> --- a/core/fast-reboot.c
> +++ b/core/fast-reboot.c
> @@ -348,7 +348,18 @@ void __noreturn fast_reboot_entry(void)
> }
>
> /* Remove all PCI devices */
> - pci_reset();
> + if (pci_reset()) {
> + prlog(PR_NOTICE, "RESET: Fast reboot failed to reset PCI\n");
> +
> + /*
> + * Can't return to caller here because we're past no-return.
> + * Attempt an IPL here which is what the caller would do.
> + */
> + if (platform.cec_reboot)
> + platform.cec_reboot();
> + for (;;)
> + ;
> + }
The code change improves error handling.
>
> ipmi_set_fw_progress_sensor(IPMI_FW_PCI_INIT);
>
> diff --git a/core/pci.c b/core/pci.c
> index 0809521f8..494a33a45 100644
> --- a/core/pci.c
> +++ b/core/pci.c
> @@ -1668,7 +1668,7 @@ static void __pci_reset(struct list_head *list)
> }
> }
>
> -void pci_reset(void)
> +int64_t pci_reset(void)
> {
> unsigned int i;
> struct pci_slot *slot;
> @@ -1695,11 +1695,9 @@ void pci_reset(void)
> rc = slot->ops.run_sm(slot);
> }
> if (rc < 0) {
> - PCIERR(phb, 0, "Complete reset failed, aborting"
> - "fast reboot (rc=%lld)\n", rc);
> - if (platform.cec_reboot)
> - platform.cec_reboot();
> - while (true) {}
> + PCIERR(phb, 0, "Complete reset failed "
> + "(rc=%lld)\n", rc);
> + return rc;
However...
If we abort at the PCI link that fails and return at this point, we
are giving up too early. If subsequent links can get re-enabled then
fast-reboot can be made to work.
[ 141.208697990,7] PHB#0003[0:3]: LINK: Link is stable
[ 141.208763671,7] PHB#0003[0:3]: LINK: Card [15b3:1013] Degraded Retry:disabled
[ 141.208832155,7] PHB#0003[0:3]: LINK: Speed Train:GEN3 PHB:GEN4 DEV:GEN3
[ 141.208872220,7] PHB#0003[0:3]: LINK: Width Train:x08 PHB:x16 DEV:x16 *
[ 141.214051722,3] PCI-SLOT-0000000000000003 Invalid state 00000000
[ 141.215181558,3] PHB#0003:00:00.0 Complete reset failed (rc=-6)
[ 141.216418848,5] RESET: Fast reboot failed to reset PCI
[ 141.217638534,6] IPMI: sending chassis control request 0x03
[ 141.217676559,6] BT: seq 0x1c netfn 0x00 cmd 0x02: Message sent to host
[ 141.482616079,7] MBOX-FLASH: Adjusting the window
[ 141.482676812,7] LPC-MBOX: Sending BMC interrupt
[ 142.246839786,7] MBOX-FLASH: Adjusting the window
[ 142.247414703,7] LPC-MBOX: Sending BMC interrupt
[ 142.349862574,6] BT: seq 0x1c netfn 0x00 cmd 0x02: IPMI MSG done
[ 143.015709854,6] STB: BOOTKERNEL verified
[ 143.015768123,7] LPC-MBOX: Sending BMC interrupt
[ 143.118218291,7] blocklevel_read: 0x0 0x31c03b48 0x30
[ 143.118283317,7] blocklevel_raw_read: 0x0 0x31c03b48 0x30
[ 143.118352801,7] MBOX-FLASH: Adjusting the window
[ 143.118382662,7] LPC-MBOX: Sending BMC interrupt
[ 143.220835377,7] FFS: Partition map size: 0x1000
[ 143.221135902,7] blocklevel_read: 0x0 0x30a3f218 0x1000
[ 143.221166425,7] blocklevel_raw_read: 0x0 0x30a3f218 0x1000
[ 143.223081918,7] FLASH: No ROOTFS partition
On one of the failing systems, where one of the PHB links does not train,
I am able to skip the failed link by continuing this loop, and the
fast-reset then completes successfully.
In this specific case there are no devices attached to that PHB link.
We need a way to understand why the link reset fails. If the failure is
not due to failing hardware, we should skip that link, proceed with
reset/train of all the other links, and complete the loop.
diff --git a/core/pci.c b/core/pci.c
index 0809521f..f708b1a0 100644
--- a/core/pci.c
+++ b/core/pci.c
@@ -1695,11 +1695,8 @@ void pci_reset(void)
rc = slot->ops.run_sm(slot);
}
if (rc < 0) {
- PCIERR(phb, 0, "Complete reset failed, aborting"
+ PCIERR(phb, 0, "Complete reset failed, CONTINUING "
"fast reboot (rc=%lld)\n", rc);
- if (platform.cec_reboot)
- platform.cec_reboot();
- while (true) {}
}
The above hack makes fast-reboot work well on one of the test systems.
We need a way to avoid treating every PCI reset failure as fatal and
falling back to a full IPL.
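As a rough illustration of that idea, here is a minimal, self-contained
sketch. It is not skiboot code: struct demo_link, demo_reset_all() and the
has_devices flag are hypothetical names, and "a failure on a link with no
devices attached is tolerable" is only an assumed policy. Deciding whether
a real failure is due to bad hardware would need more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PHB link; not a skiboot structure. */
struct demo_link {
	int64_t reset_rc;	/* result of the link reset, < 0 on failure */
	bool has_devices;	/* true if anything was attached to the link */
};

/*
 * Assumed policy: keep walking the links after a failure, tolerate
 * failures on empty links, and report an error only when a link with
 * devices behind it failed to reset.
 *
 * Returns 0 if fast reboot may proceed, otherwise the rc of the first
 * failed link that had devices. *skipped is set to the number of
 * failed links that were tolerated.
 */
static int64_t demo_reset_all(const struct demo_link *links, size_t n,
			      size_t *skipped)
{
	int64_t fatal_rc = 0;
	size_t i;

	assert(links != NULL || n == 0);
	assert(skipped != NULL);
	*skipped = 0;

	for (i = 0; i < n; i++) {
		if (links[i].reset_rc >= 0)
			continue;	/* link reset fine */

		if (!links[i].has_devices) {
			/* Empty link: count it and carry on with the rest. */
			(*skipped)++;
			continue;
		}

		/* Populated link failed: remember the first such error. */
		if (fatal_rc == 0)
			fatal_rc = links[i].reset_rc;
	}

	return fatal_rc;
}
```

With something along these lines, the caller in the fast-reboot path would
still fall back to a full IPL on a non-zero return, but a single untrained,
empty link like the one in the log above would no longer force that.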
I am using fast-reboot for runtime PState (OCC) reload and related
tests, and hence making fast-reboot work in less than ideal scenarios
will be very helpful :)
--Vaidy