Question: handling early hotplug interrupts

Wed Aug 30 07:55:00 AEST 2017

On Tue, 2017-08-29 at 17:43 -0300, Daniel Henrique Barboza wrote:
> Hi,
> 
> This is a scenario I've been facing when working on early device 
> hotplug in QEMU. When a device is added, an IRQ pulse is fired to warn 
> the guest of the event, then the kernel fetches it by calling 
> 'check_exception' and handles it. If the hotplug is done too early 
> (before SLOF, for example), the pulse is ignored and the hotplug event 
> is left unchecked in the events queue.
> 
> One solution would be to pulse the hotplug queue interrupt after CAS, 
> when we are sure that the hotplug queue is negotiated. However, this 
> panics the kernel with sig 11 kernel access of bad area, which suggests 
> that the kernel wasn't quite ready to handle it.

That's not right. This is a bug that needs fixing. The interrupt should
be masked at that point anyway, but still.

Tell us more about the crash (backtrace etc...); this definitely needs
fixing.

> In my experiments using upstream 4.13 I saw that there is a 'safe time' 
> to pulse the queue, sometime after CAS and before mounting the root fs, 
> but I wasn't able to pinpoint it. From QEMU's perspective, the last hcall 
> done (an h_set_mode) is still too early to pulse it and the kernel 
> panics. Looking at the kernel source I saw that the IRQ handling is 
> initiated quite early in the init process.
> 
> So my question (ok, actually 2 questions):
> 
> - Is my analysis correct? Is there an unsafe time to fire an IRQ pulse 
> before CAS that can break the kernel or am I overlooking/doing something 
> wrong?
> - Is there a reliable way to know when the kernel can safely handle the 
> hotplug interrupt?

So I don't think that's the right approach. Virtual interrupts are
edge-sensitive and we will potentially lose them if they occur early. I
think what needs to happen is:

 - Fix whatever's causing the above crash

and

 - The hotplug code should check for pending events (check_exception?)
   at boot time to enqueue whatever's there. It needs to do that after
   unmasking the interrupt and in a way that is protected from races with
   said interrupt.
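
To make the second point concrete, here is a minimal sketch of the ordering
(unmask first, then scan under a lock). It is a user-space C model, not
kernel code: every name in it (hp_source, hp_state, hp_fetch_event,
hp_drain, hp_irq_pulse, hp_boot_init) is hypothetical, hp_fetch_event only
stands in for a check_exception call, and the atomic_flag spin is a stand-in
for whatever locking the real hotplug code would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define HP_MAX_EVENTS 16

/* Hypothetical model of the hypervisor-side event queue (not a real API). */
struct hp_source {
    int    events[HP_MAX_EVENTS];
    size_t head;   /* next event to fetch */
    size_t tail;   /* next free slot */
};

/* Hypothetical guest-side hotplug state. */
struct hp_state {
    atomic_flag lock;          /* serializes the drain loop */
    bool        irq_unmasked;
    int         handled[HP_MAX_EVENTS];
    size_t      nhandled;
};

/* Hypervisor side: queue an event (the IRQ pulse is delivered separately). */
static void hp_source_post(struct hp_source *src, int event)
{
    assert(src->tail < HP_MAX_EVENTS);
    src->events[src->tail++] = event;
}

/* Stand-in for check_exception: fetch one pending event, if any. */
static bool hp_fetch_event(struct hp_source *src, int *event)
{
    if (src->head == src->tail)
        return false;
    *event = src->events[src->head++];
    return true;
}

/* Drain every pending event under the lock; returns how many were handled. */
static size_t hp_drain(struct hp_state *st, struct hp_source *src)
{
    size_t n = 0;
    int event;

    while (atomic_flag_test_and_set(&st->lock))
        ;  /* spin until the other drainer is done */
    while (hp_fetch_event(src, &event)) {
        assert(st->nhandled < HP_MAX_EVENTS);
        st->handled[st->nhandled++] = event;
        n++;
    }
    atomic_flag_clear(&st->lock);
    return n;
}

/* Edge-triggered pulse: simply lost while the interrupt is still masked. */
static size_t hp_irq_pulse(struct hp_state *st, struct hp_source *src)
{
    if (!st->irq_unmasked)
        return 0;
    return hp_drain(st, src);
}

/* Boot-time init: unmask first, then pick up events queued earlier. */
static size_t hp_boot_init(struct hp_state *st, struct hp_source *src)
{
    st->irq_unmasked = true;
    return hp_drain(st, src);
}
```

Because the interrupt path and the boot-time scan both go through hp_drain,
an event is handled exactly once whichever path gets there first. In a real
kernel the lock would also have to keep the interrupt from re-entering on
the same CPU while the scan holds it; the model above does not attempt that.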

Cheers,
Ben.

> 
> Thanks,
> 
> 
> Daniel