[Skiboot] [PATCH] fsp: return OPAL_BUSY_EVENT on failure sending FSP_CMD_POWERDOWN_NORM

Fri Oct 6 20:06:31 AEDT 2017

On Mon, Oct 02, 2017 at 01:08:25AM +0000, Stewart Smith wrote:
> We had a race condition between FSP Reset/Reload and powering down
> the system from the host:
> 
> Roughly:
> 
>   FSP                Host
>   ---                ----
>   Power on
>                      Power on
>   (inject EPOW)
>   (trigger FSP R/R)
>                      Processes EPOW event, starts shutting down
>                      calls OPAL_CEC_POWER_DOWN
>   (is still in R/R)
>                      gets OPAL_INTERNAL_ERROR, spins in opal_poll_events
>   (FSP comes back)
>                      spinning in opal_poll_events
>   (thinks host is running)
> 
> The call to OPAL_CEC_POWER_DOWN is only made once, because the
> reset/reload error path for fsp_sync_msg() returns -1, which we pass
> to the OS as OPAL_INTERNAL_ERROR. That is fine, except that our own API
> docs let us return OPAL_BUSY when trying again later may succeed, and
> they are ambiguous as to whether you should retry on
> OPAL_INTERNAL_ERROR.
> 
> For reference, the Linux code looks like this:
> >static void __noreturn pnv_power_off(void)
> >{
> >        long rc = OPAL_BUSY;
> >
> >        pnv_prepare_going_down();
> >
> >        while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
> >                rc = opal_cec_power_down(0);
> >                if (rc == OPAL_BUSY_EVENT)
> >                        opal_poll_events(NULL);
> >                else
> >                        mdelay(10);
> >        }
> >        for (;;)
> >                opal_poll_events(NULL);
> >}
> 
> Which means that *practically* our only option is to return OPAL_BUSY
> or OPAL_BUSY_EVENT.
> 
> We choose OPAL_BUSY_EVENT for FSP systems as we do want to ensure we're
> running pollers to communicate with the FSP and do the final bits of
> Reset/Reload handling before we power off the system.
> 
> Additionally, we really should update our documentation to point out
> all of these return codes and what action an OS should take.

Superb analysis...
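
To make the retry behaviour described above concrete, here is a minimal,
self-contained sketch in C. It is not the skiboot patch and does not call
any real firmware interface: fake_fsp_send_powerdown(),
fake_cec_power_down(), fake_poll_events() and simulate_power_off() are
hypothetical stand-ins, and the "FSP comes back after N poller runs" model
is an assumption made purely for illustration. The return code values are
the ones I believe opal-api.h uses. The host loop has the same shape as
the pnv_power_off() loop quoted above, except that it returns instead of
spinning forever.

```c
/* OPAL return codes (values assumed to match skiboot's opal-api.h). */
#define OPAL_SUCCESS          0
#define OPAL_BUSY            -2
#define OPAL_INTERNAL_ERROR -11
#define OPAL_BUSY_EVENT     -12

/* Hypothetical model: poller runs still needed before FSP R/R completes. */
static int rr_polls_remaining;

/*
 * Hypothetical stand-in for sending FSP_CMD_POWERDOWN_NORM via
 * fsp_sync_msg(): fails with -1 while the FSP is in reset/reload.
 */
static int fake_fsp_send_powerdown(void)
{
	return rr_polls_remaining > 0 ? -1 : 0;
}

/*
 * Firmware side of the idea: a failed send is reported as
 * OPAL_BUSY_EVENT (retry after running pollers) rather than
 * OPAL_INTERNAL_ERROR (which the host loop does not retry on).
 */
static long fake_cec_power_down(void)
{
	if (fake_fsp_send_powerdown() != 0)
		return OPAL_BUSY_EVENT;
	return OPAL_SUCCESS;
}

/* Hypothetical stand-in for opal_poll_events(): advances R/R handling. */
static void fake_poll_events(void)
{
	if (rr_polls_remaining > 0)
		rr_polls_remaining--;
}

/*
 * Host side: same retry condition as the quoted pnv_power_off() loop.
 * Returns how many power-down calls were made; the last return code is
 * stored in *final_rc.
 */
static int simulate_power_off(int polls_until_fsp_back, long *final_rc)
{
	long rc = OPAL_BUSY;
	int calls = 0;

	rr_polls_remaining = polls_until_fsp_back;
	while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
		rc = fake_cec_power_down();
		calls++;
		if (rc == OPAL_BUSY_EVENT)
			fake_poll_events();
		/* The real loop does mdelay(10) on OPAL_BUSY; omitted here. */
	}
	*final_rc = rc;
	return calls;
}
```

With this model, the host keeps retrying while the FSP is in R/R and the
power-down request goes through once the pollers have finished the R/R
handling. If fake_cec_power_down() returned OPAL_INTERNAL_ERROR instead,
the loop would exit after the first call, which is the hang in the
timeline above.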

> CC: stable
> Reported-by: Pridhiviraj Paidipeddi <ppaidipe at linux.vnet.ibm.com>
> Signed-off-by: Stewart Smith <stewart at linux.vnet.ibm.com>

Acked-by: Ananth N Mavinakayanahalli <ananth at linux.vnet.ibm.com>