[PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device

Sat May 24 00:36:55 EST 2014

On Fri, 2014-05-23 at 15:00 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2014-05-23 at 14:37 +1000, Gavin Shan wrote:
> > >There's no notification, the user needs to observe the return value and
> > >poll?  Should we be enabling an eventfd to notify the user of the state
> > >change?
> > >
> > 
> > Yes. The user needs to monitor the return value. We should have a notification,
> > but it's for later as we discussed :-)
> 
>  ../..
> 
> > >How does the guest learn about the error?  Does it need to?
> > 
> > When the guest detects 0xFF's from reading PCI config space or IO, it's going
> > to check the device (PE) state. If the device (PE) has been put into frozen
> > state, the recovery will be started.
> 
> Quick recap for Alex W (we discussed that with Alex G).
> 
> While a notification looks like a worthwhile addition in the long run, it
> is neither sufficient nor used today, and I prefer that we keep it as something
> to add later for these two main reasons:
> 
>  - First, the kernel itself isn't always notified. For example, if we implement
> on top of an RTAS backend (PR KVM under pHyp) or if we are on top of PowerNV but
> the error is a PHB "fence" (the entire PCI Host bridge gets fenced out in hardware
> due to an internal error), then we get no notification. Only polling of the
> hardware or firmware will tell us. Since we don't want to have a polling timer
> in the kernel, that means that the userspace client of VFIO (or alternatively
> the KVM guest) is the one that polls.
> 
>  - Second, this is how our primary user expects it: The primary (and only initial)
> user of this will be qemu/KVM for PAPR guests and they don't have a notification
> mechanism. Instead they query the EEH state after detecting an all 1's return from
> MMIO or config space. This is how PAPR specifies it so we are just implementing the
> spec here :-)
> 
> Because of these, I think we shouldn't worry too much about notification at
> this stage.

Ok, I was asking more about an error log that indicates what error
occurred to freeze the hardware so that the user can make a more
educated guess whether recovery is an option.  Given that you have cases
where there may be no notification and your guest/user already handles
this, the plan to start with polling makes sense.  Thanks,

Alex
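
For readers following the thread, the polling flow described above (read,
notice an all 1's value, then query the PE state, and start recovery only if
the PE is frozen) can be sketched roughly as below. This is a minimal
illustration under stated assumptions, not the code from the patch: the
names `pe_state`, `eeh_ops`, `eeh_read_checked`, `fake_dev`, `fake_read32`
and `fake_get_state` are all hypothetical, and the callbacks stand in for
whatever MMIO/config access and state query the real user of VFIO performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PE states for illustration; not taken from the patch. */
enum pe_state { PE_NORMAL, PE_FROZEN };

/* Hypothetical callbacks standing in for the real register read and
 * the real "query EEH/PE state" operation. */
struct eeh_ops {
	uint32_t (*read32)(void *ctx, uint32_t off);
	enum pe_state (*get_state)(void *ctx);
	void *ctx;
};

/*
 * Read a register. Only when the value is all 1's do we poll the PE
 * state. Returns true if the PE is frozen, i.e. recovery should start.
 */
static bool eeh_read_checked(const struct eeh_ops *ops, uint32_t off,
			     uint32_t *val)
{
	assert(ops && val);

	*val = ops->read32(ops->ctx, off);
	if (*val != UINT32_MAX)
		return false;	/* ordinary data, no state query needed */

	/* All 1's may be legitimate data, so the state query decides. */
	return ops->get_state(ops->ctx) == PE_FROZEN;
}

/* Fake device used only to exercise the sketch. */
struct fake_dev {
	bool frozen;
	uint32_t reg;
	int queries;	/* how many times the state was polled */
};

static uint32_t fake_read32(void *ctx, uint32_t off)
{
	struct fake_dev *d = ctx;

	(void)off;
	/* A frozen PE returns all 1's on reads. */
	return d->frozen ? UINT32_MAX : d->reg;
}

static enum pe_state fake_get_state(void *ctx)
{
	struct fake_dev *d = ctx;

	d->queries++;
	return d->frozen ? PE_FROZEN : PE_NORMAL;
}
```

A caller would fill in a `struct eeh_ops` with its own accessors and call
`eeh_read_checked()` wherever it reads from the device. Note that the state
is polled only on the all 1's path, so normal reads pay no extra cost, which
matches the point that nothing in the kernel runs a polling timer.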