[PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.

Mahesh J Salgaonkar mahesh at linux.ibm.com
Tue Oct 19 04:28:08 AEDT 2021


On 2021-05-07 10:41:46 Fri, Oliver O'Halloran wrote:
> On Fri, May 7, 2021 at 3:43 AM Mahesh Salgaonkar <mahesh at linux.ibm.com> wrote:
> >
> > When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> > state as temporarily unavailable. In this case, per PAPR, rtas call
> > ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5)
> > and OS has to wait until that recovery is complete. During this state the
> > slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR
> > connector empty which leads to assumption that the device has been
> > hot-removed. This results into no EEH recovery on this device and it stays
> > in failed state forever.
> >
> > This patch fixes this issue by skipping slot presence check only if device
> > PE state is temporarily unavailable(5).
> >
> > Signed-off-by: Mahesh Salgaonkar <mahesh at linux.ibm.com>
> > ---
> > * snip*
> >
> >         /*
> >          * It should be corner case that the parent PE has been
> > diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> > index 3eff6a4888e79..a0913768f33de 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
> >                 return;
> >         }
> >
> > +       /*
> > +        * When PE's state is temporarily unavailable, the slot
> > +        * presence check returns as DR connector empty.
> 
> That sounds like a bug in either RTAS or the hotplug slot driver (or
> both). The presence check is there largely to filter out events that
> we can guarantee are not recoverable (i.e. surprise hot-unplug). In
> every other case (especially if we can't determine the state) we
> should be going down the recovery path. If the hotplug slot driver is
> incorrectly reporting the card has been removed then you should be
> fixing the slot driver.

Thanks Oliver for the comment.

So phyp fixed the issue where it was incorrectly reporting that the card
had been removed. After the phyp fix, the slot presence check
'get-sensor-state(dr-entity-sense)' returns the extended busy error (9902)
until the PHB is recovered by phyp. Once the PHB is recovered,
get-sensor-state() returns success with the correct presence status.

But now we have a different problem. The Linux rtas call interface
rtas_get_sensor() repeats the rtas call on an extended delay return code
(9902) until the return value is either success (0) or error (-1). This
causes the EEH handler to get stuck in the presence check
'rtas_get_sensor()' for ~6 seconds before it can notify the network
driver that an error has been detected and that it should stop any
active operations. With no I/O traffic this doesn't cause any issue and
EEH recovery works fine. However, with running I/O traffic, the network
driver continues its operation during these 6 seconds and hits a timeout
(netdev watchdog). On timeout, the network driver goes into FFDC capture
mode and the reset path, assuming the PCI device is in a fatal
condition. This causes EEH recovery to fail and sometimes leads to a
system hang or crash.

------------
[52732.244731] DEBUG: ibm_read_slot_reset_state2()
[52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
[52732.244798] DEBUG: in eeh_slot_presence_check
[52732.244804] DEBUG: error state check
[52732.244807] DEBUG: Is slot hotpluggable
[52732.244810] DEBUG: hotpluggable ops ?
[52732.244953] DEBUG: Calling ops->get_adapter_status
[52732.244958] DEBUG: calling rpaphp_get_sensor_state
[52736.564262] ------------[ cut here ]------------
[52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
[52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
[...]
[52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
[52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
------------
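
For illustration, here is a minimal user-space model of the retry loop
described above. It is a sketch, not the kernel code: struct fake_fw,
fake_get_sensor_state() and get_sensor_blocking() are hypothetical names,
the "sleep" is only accounted for rather than performed, and the count of
60 busy replies in the usage note is chosen only to match the ~6 seconds
reported. The delay decoding assumes the usual Linux convention that
status codes 9900..9905 ask for a wait of 10^(status - 9900) ms, so 9902
means 100 ms per retry.

```c
/* RTAS status codes as Linux on pseries interprets them (assumed here). */
#define RTAS_BUSY               (-2)
#define RTAS_EXTENDED_DELAY_MIN 9900
#define RTAS_EXTENDED_DELAY_MAX 9905

/* Wait in ms suggested by a busy/extended-delay status, 0 otherwise. */
static unsigned int busy_delay_ms(int status)
{
	unsigned int ms = 0;

	if (status == RTAS_BUSY) {
		ms = 1;
	} else if (status >= RTAS_EXTENDED_DELAY_MIN &&
		   status <= RTAS_EXTENDED_DELAY_MAX) {
		int order = status - RTAS_EXTENDED_DELAY_MIN;

		for (ms = 1; order > 0; order--)
			ms *= 10;
	}
	return ms;
}

/* Hypothetical firmware stub: answers 9902 busy_calls times, then success. */
struct fake_fw {
	int busy_calls;
};

static int fake_get_sensor_state(struct fake_fw *fw, int *state)
{
	if (fw->busy_calls > 0) {
		fw->busy_calls--;
		return 9902;
	}
	*state = 1;	/* illustrative value for "device present" */
	return 0;
}

/*
 * Blocking variant: keeps retrying while the firmware reports busy.
 * The time a real caller would spend sleeping is summed in *waited_ms.
 */
static int get_sensor_blocking(struct fake_fw *fw, int *state,
			       unsigned int *waited_ms)
{
	unsigned int ms;
	int rc;

	*waited_ms = 0;
	for (;;) {
		rc = fake_get_sensor_state(fw, state);
		ms = busy_delay_ms(rc);
		if (!ms)
			break;
		*waited_ms += ms;	/* the kernel would sleep here */
	}
	return rc;
}
```

With a stub that answers busy 60 times, get_sensor_blocking() accounts
for 6000 ms before returning, and nothing that comes after the presence
check in the caller can run during that time.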

I am working on ways to fix this and am looking at the two options
below. More ideas are welcome.

1. There is an alternate call, rtas_get_sensor_fast(), that does not use
rtas_busy_delay() and returns immediately with an error code. Using
rtas_get_sensor_fast() for the slot presence check fixes the above issue
and EEH recovery works fine. However, there is no provision in struct
hotplug_slot_ops for a quick adapter status check that could be used to
call rtas_get_sensor_fast().

2. Another option is to move the slot presence check to after the
network driver has been notified that an error was detected. This also
fixes the issue. However, I need to verify the hotplug case: if the slot
is empty, we need to inform the driver to resume and skip the recovery.
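
The two options can be sketched in user space as follows. This is only a
model under stated assumptions, not a proposed patch: enum presence,
classify_presence(), struct trace and handle_event_reordered() are
hypothetical names, and the dr-entity-sense values 0 (empty) and 1
(present) are assumed. For option 1, a single non-retrying call is
classified so that a busy or error status counts as "unknown" rather than
"empty". For option 2, the handler records the order of steps so that the
driver notification comes before the (possibly slow) presence check.

```c
#include <stddef.h>

enum presence { SLOT_EMPTY, SLOT_PRESENT, SLOT_UNKNOWN };

/*
 * Option 1 (model): classify the result of one non-retrying sensor call.
 * Any non-zero status (busy, extended delay, error) means the state
 * could not be determined.
 */
static enum presence classify_presence(int rc, int state)
{
	if (rc != 0)
		return SLOT_UNKNOWN;
	if (state == 0)		/* assumed value for an empty connector */
		return SLOT_EMPTY;
	if (state == 1)		/* assumed value for a present device */
		return SLOT_PRESENT;
	return SLOT_UNKNOWN;
}

enum step {
	STEP_NOTIFY_ERROR,	/* tell the driver an error was detected */
	STEP_PRESENCE_CHECK,	/* slot presence check, may be slow */
	STEP_RESUME_DRIVER,	/* slot empty: resume, no recovery */
	STEP_RECOVER,		/* go down the recovery path */
};

struct trace {
	enum step steps[4];
	size_t n;
};

static void record(struct trace *t, enum step s)
{
	if (t->n < sizeof(t->steps) / sizeof(t->steps[0]))
		t->steps[t->n++] = s;
}

/*
 * Option 2 (model): notify the driver first, then look at the presence
 * result. Only a definite "empty" skips recovery; "unknown" still
 * recovers.
 */
static void handle_event_reordered(enum presence p, struct trace *t)
{
	record(t, STEP_NOTIFY_ERROR);
	record(t, STEP_PRESENCE_CHECK);
	if (p == SLOT_EMPTY) {
		record(t, STEP_RESUME_DRIVER);
		return;
	}
	record(t, STEP_RECOVER);
}
```

In this model a 9902 status classifies as SLOT_UNKNOWN, so the handler
still takes the recovery path, and the empty-slot branch is the hotplug
case that still needs to be verified.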

Let me know what you think about the above options and whether there is
a better way to fix this.

Thanks,
-Mahesh.

