[Skiboot] [PATCH] hw/phb3: improve handling of PHB init failure

Andrew Donnellan andrew.donnellan at au1.ibm.com
Tue Oct 25 13:29:45 AEDT 2016


On 22/10/16 01:09, Gavin Shan wrote:
> PHB3_STATE_BROKEN isn't enough. The kernel might not know about the broken PHB
> yet, meaning it might still initiate transactions after the complete reset has
> failed. Of those transactions, PCI config accesses are blocked by the
> PHB3_STATE_BROKEN flag, but MMIO and DMA accesses cannot be blocked. They cause
> errors that are caught by the EEH subsystem in the kernel. That is likely what
> happens in scenario [B]. To EEH, it's a dead PHB error, and the PHB (including
> its subordinate devices) will be removed from the system. However, we need a
> fenced PHB error so that EEH can try to recover it before removing the entire PHB.
>
> Actually, we have a similar issue in scenario [A] with this patch applied. On
> receiving a fenced PHB error, EEH tries up to 3 complete resets if needed. If
> we put the PHB into the broken state during the first complete reset, the other
> 2 complete resets won't be issued successfully.
>
> The problem we have: PHB3_STATE_BROKEN is a permanent state. Once a PHB is
> marked as broken, we cannot bring it back to normal. It's correct to propagate
> the skiboot error to the kernel, but marking the PHB broken doesn't sound
> correct, at least not in all cases.

Is the *current* use of goto error; in the case of a pending transaction 
timeout in PHB3_SLOT_CRESET_WAIT_CQ correct?

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan at au1.ibm.com  IBM Australia Limited
