[Skiboot] [PATCH] hw/phb3: improve handling of PHB init failure

Gavin Shan gwshan at linux.vnet.ibm.com
Sat Oct 22 01:09:51 AEDT 2016


On Fri, Oct 21, 2016 at 06:07:47PM +1100, Andrew Donnellan wrote:
>On 21/10/16 17:27, Gavin Shan wrote:
>>>diff --git a/hw/phb3.c b/hw/phb3.c
>>>index 1c09ffe..cdfc38a 100644
>>>--- a/hw/phb3.c
>>>+++ b/hw/phb3.c
>>>@@ -38,7 +38,7 @@
>>>/* Enable this to disable error interrupts for debug purposes */
>>>#undef DISABLE_ERR_INTS
>>>
>>>-static void phb3_init_hw(struct phb3 *p, bool first_init);
>>>+static int64_t phb3_init_hw(struct phb3 *p, bool first_init);
>>>
>>>#define PHBDBG(p, fmt, a...)	prlog(PR_DEBUG, "PHB#%04x: " fmt, \
>>>				      (p)->phb.opal_id, ## a)
>>>@@ -2481,6 +2481,7 @@ static int64_t phb3_creset(struct pci_slot *slot)
>>>{
>>>	struct phb3 *p = phb_to_phb3(slot->phb);
>>>	uint64_t cqsts, val;
>>>+	int64_t rc;
>>>
>>>	switch (slot->state) {
>>>	case PHB3_SLOT_NORMAL:
>>>@@ -2545,7 +2546,9 @@ static int64_t phb3_creset(struct pci_slot *slot)
>>>		 */
>>>		p->flags &= ~PHB3_AIB_FENCED;
>>>		p->flags &= ~PHB3_CAPP_RECOVERY;
>>>-		phb3_init_hw(p, false);
>>>+		rc = phb3_init_hw(p, false);
>>>+		if (rc)
>>>+			goto error;
>>
>>There are two users: recovery on fenced PHB or frozen PE behind root
>>bus or root port [A], kdump kernel [B]. For [A], the change looks ok.
>>For [B], the PHB has been seen by kernel but it is left in uninitialized
>>state. It's likely not working when kernel tries to do PCI probing,
>>memory transaction, DMA etc. if I'm correct enough.
>
>This should leave the PHB state in PHB3_STATE_BROKEN, so a lot of things will
>fail and return OPAL_HARDWARE.
>
>I'm not sure what the problem is here though... without this change, the PHB
>is *still* not initialised properly...
>

PHB3_STATE_BROKEN isn't enough. The kernel might not know the broken PHB yet,
meaning transactions might be initiated by kernel after complete reset failure.
Among the tranactions, PCI config is blocked by flag PHB3_STATE_BROKEN. MMIO
and DMA access cannot be blocked. It will cause error that will be catched by EEH
subsystem in kernel. It's likely what happens in scenario [B]. To EEH, it's
a dead PHB error and the PHB (including its subordinate devices) will be removed
from the system. However, we need a fenced PHB error so that EEH can try to
recover it before removing the entire PHB.

Actually, we have similar issue in scenario [A] with this patch applied. On
receiving fenced PHB error, EEH tries to do complete reset for 3 times if
needed. If we put the PHB into broken state in the first complete reset,
the other 2 complete resets won't be issued successfully.

The problem we have: PHB3_STATE_BROKEN is a permanent state. Once a PHB is
marked as broken, we cannot bring it back to normal. However, it's correct
to propagate the skiboot error to kernel, but it doesn't sound correct to
mark PHB broken, not in all cases at least.

Thanks,
Gavin



More information about the Skiboot mailing list