Question: handling early hotplug interrupts

Thu Aug 31 00:37:30 AEST 2017

On 08/30/2017 01:09 AM, Michael Ellerman wrote:
> Daniel Henrique Barboza <danielhb at linux.vnet.ibm.com> writes:
> 
>> Hi Ben,
>>
>> On 08/29/2017 06:55 PM, Benjamin Herrenschmidt wrote:
>>> On Tue, 2017-08-29 at 17:43 -0300, Daniel Henrique Barboza wrote:
>>>> Hi,
>>>>
>>>> This is a scenario I've been facing when working on early device
>>>> hotplug in QEMU. When a device is added, an IRQ pulse is fired to warn
>>>> the guest of the event, then the kernel fetches it by calling
>>>> 'check_exception' and handles it. If the hotplug is done too early
>>>> (before SLOF, for example), the pulse is ignored and the hotplug event
>>>> is left unchecked in the events queue.
>>>>
>>>> One solution would be to pulse the hotplug queue interrupt after CAS,
>>>> when we are sure that the hotplug queue is negotiated. However, this
>>>> panics the kernel with sig 11 kernel access of bad area, which suggests
>>>> that the kernel wasn't quite ready to handle it.
>>> That's not right. This is a bug that needs fixing. The interrupt should
>>> be masked anyway but still.
>>>
>>> Tell us more about the crash (backtrace etc...); this definitely needs
>>> fixing.
>>
>> This is the backtrace using a 4.13.0-rc3 guest:
>>
>> ---------
>> [    0.008913] Unable to handle kernel paging request for data at address 0x00000100
>> [    0.008989] Faulting instruction address: 0xc00000000012c318
>> [    0.009046] Oops: Kernel access of bad area, sig: 11 [#1]
>> [    0.009092] SMP NR_CPUS=1024
>> [    0.009092] NUMA
>> [    0.009128] pSeries
>> [    0.009173] Modules linked in:
>> [    0.009210] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc3+ #1
>> [    0.009268] task: c0000000feb02580 task.stack: c0000000fe108000
>> [    0.009325] NIP: c00000000012c318 LR: c00000000012c9c4 CTR: 0000000000000000
>> [    0.009394] REGS: c0000000fffef910 TRAP: 0380   Not tainted (4.13.0-rc3+)
>> [    0.009450] MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>
>> [    0.009454]   CR: 28000822  XER: 20000000
>> [    0.009554] CFAR: c00000000012c9c0 SOFTE: 0
>> [    0.009554] GPR00: c00000000012c9c4 c0000000fffefb90 c00000000141f100 0000000000000400
>> [    0.009554] GPR04: 0000000000000000 c0000000fe1851c0 0000000000000000 00000000fee60000
>> [    0.009554] GPR08: 0000000fffffffe1 0000000000000000 0000000000000001 0000000002001001
>> [    0.009554] GPR12: 0000000000000040 c00000000fd80000 c00000000000db58 0000000000000000
>> [    0.009554] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [    0.009554] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
>> [    0.009554] GPR24: 0000000000000002 0000000000000013 c0000000fe14bc00 0000000000000400
>> [    0.009554] GPR28: 0000000000000400 0000000000000000 c0000000fe1851c0 0000000000000001
>> [    0.010121] NIP [c00000000012c318] __queue_work+0x48/0x640
>> [    0.010168] LR [c00000000012c9c4] queue_work_on+0xb4/0xf0
>> [    0.010213] Call Trace:
>> [    0.010239] [c0000000fffefb90] [c00000000000db58] kernel_init+0x8/0x160 (unreliable)
>> [    0.010308] [c0000000fffefc70] [c00000000012c9c4] queue_work_on+0xb4/0xf0
>> [    0.010368] [c0000000fffefcb0] [c0000000000c4608] queue_hotplug_event+0xd8/0x150
>> [    0.010435] [c0000000fffefd00] [c0000000000c30d0] ras_hotplug_interrupt+0x140/0x190
>> [    0.010505] [c0000000fffefd90] [c00000000018c8b0] __handle_irq_event_percpu+0x90/0x310
>> [    0.010573] [c0000000fffefe50] [c00000000018cb6c] handle_irq_event_percpu+0x3c/0x90
>> [    0.010642] [c0000000fffefe90] [c00000000018cc24] handle_irq_event+0x64/0xc0
>> [    0.010710] [c0000000fffefec0] [c0000000001928b0] handle_fasteoi_irq+0xc0/0x230
>> [    0.010779] [c0000000fffefef0] [c00000000018ae14] generic_handle_irq+0x54/0x80
>> [    0.010847] [c0000000fffeff20] [c0000000000189f0] __do_irq+0x90/0x210
>> [    0.010904] [c0000000fffeff90] [c00000000002e730] call_do_irq+0x14/0x24
>> [    0.010961] [c0000000fe10b640] [c000000000018c10] do_IRQ+0xa0/0x130
>> [    0.011021] [c0000000fe10b6a0] [c000000000008c58] hardware_interrupt_common+0x158/0x160
>> [    0.011090] --- interrupt: 501 at __replay_interrupt+0x38/0x3c
>> [    0.011090]     LR = arch_local_irq_restore+0x74/0x90
>> [    0.011179] [c0000000fe10b990] [c0000000fe10b9e0] 0xc0000000fe10b9e0 (unreliable)
>> [    0.011249] [c0000000fe10b9b0] [c000000000b967fc] _raw_spin_unlock_irqrestore+0x4c/0xb0
>> [    0.011316] [c0000000fe10b9e0] [c00000000018ff50] __setup_irq+0x630/0x9e0
>> [    0.011374] [c0000000fe10ba90] [c00000000019054c] request_threaded_irq+0x13c/0x250
>> [    0.011441] [c0000000fe10baf0] [c0000000000c2cd0] request_event_sources_irqs+0x100/0x180
>> [    0.011511] [c0000000fe10bc10] [c000000000eceda8] __machine_initcall_pseries_init_ras_IRQ+0xc4/0x12c
>> [    0.011591] [c0000000fe10bc40] [c00000000000d8c8] do_one_initcall+0x68/0x1e0
>> [    0.011659] [c0000000fe10bd00] [c000000000eb4484] kernel_init_freeable+0x284/0x370
>> [    0.011725] [c0000000fe10bdc0] [c00000000000db7c] kernel_init+0x2c/0x160
>> [    0.011782] [c0000000fe10be30] [c00000000000bc9c] ret_from_kernel_thread+0x5c/0xc0
>> [    0.011848] Instruction dump:
>> [    0.011885] fbc1fff0 f8010010 f821ff21 7c7c1b78 7c9d2378 7cbe2b78 787b0020 60000000
>> [    0.011955] 60000000 892d028a 2fa90000 409e04bc <813d0100> 75290001 408204c0 3d2061c8
>> [    0.012026] ---[ end trace e0b4d36daf3f8b2a ]---
>> [    0.013850]
>> [    2.013962] Kernel panic - not syncing: Fatal exception in interrupt
>> -------------
>>
>> To reproduce it, what I did was to fire a pulse in the hotplug queue
>> right after CAS by hacking QEMU code.
> 
> That's not right after CAS, that's much later.
> 
> It appears the interrupt has been queued and we've taken it immediately
> on the first unmask of interrupts after registering the
> ras_hotplug_interrupt() IRQ.
> 
> That happens at subsys initcall time.
> 
> ras_hotplug_interrupt() calls queue_hotplug_event() which does:
> 
> 	queue_work(pseries_hp_wq, (struct work_struct *)work);
> 
> Where pseries_hp_wq is initialised in:
> 
>   static int __init pseries_dlpar_init(void)
>   {
>   	pseries_hp_wq = alloc_workqueue("pseries hotplug workqueue",
>   					WQ_UNBOUND, 1);
>   	return sysfs_create_file(kernel_kobj, &class_attr_dlpar.attr);
>   }
>   machine_device_initcall(pseries, pseries_dlpar_init);
> 
> 
> The ordering of subsys vs device init call is:
> 
>   #define subsys_initcall(fn)		__define_initcall(fn, 4)
>   #define fs_initcall(fn)			__define_initcall(fn, 5)
>   #define device_initcall(fn)		__define_initcall(fn, 6)
> 
> 
> So this is simply a case of the init calls being out of order.
> 
> We either need to create the pseries_hp_wq earlier, or register the
> event sources IRQs later. I'm not sure which is better.

Perhaps I'm erring on the side of caution, but I think registering
the IRQs later would be better. That would give the kernel more time
to come up, so it is better placed to handle a hotplug request.
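
To make the ordering problem concrete, here is a rough user-space sketch
(not kernel code, and not a proposed patch). It only models the sequence
Michael described: the IRQ is registered at subsys level (4), a pending
interrupt is taken immediately, and the workqueue is not allocated until
device level (6). Every fake_* name, simulate_boot() and
events_queued_last_boot() are made up for illustration; the only things
taken from the thread are the level numbers and the fact that the handler
queues work on pseries_hp_wq.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel workqueue; not the real struct. */
struct fake_wq { const char *name; };

static struct fake_wq hp_wq_storage = { "pseries hotplug workqueue" };
static struct fake_wq *hp_wq;   /* models pseries_hp_wq, NULL until allocated */
static int irq_pending;         /* models an event raised before the handler exists */
static int null_wq_hits;        /* times we would have queued work on a NULL wq */
static int events_queued;       /* times queueing would have succeeded */

/* Models queue_hotplug_event(): the real kernel oopses instead of counting. */
static void fake_queue_hotplug_event(void)
{
	if (hp_wq == NULL)
		null_wq_hits++;
	else
		events_queued++;
}

/* Models IRQ registration: a pending interrupt is delivered immediately. */
static int fake_init_ras_irq(void)
{
	if (irq_pending) {
		irq_pending = 0;
		fake_queue_hotplug_event();
	}
	return 0;
}

/* Models pseries_dlpar_init() allocating the workqueue. */
static int fake_dlpar_init(void)
{
	hp_wq = &hp_wq_storage;
	return 0;
}

/*
 * Run both fake initcalls in ascending level order, the way the kernel
 * walks its initcall levels, and return how many times the handler saw
 * a NULL workqueue. On a tie the IRQ registration runs first here; the
 * real kernel would use link order.
 */
int simulate_boot(int irq_level, int wq_level)
{
	int (*fns[2])(void) = { fake_init_ras_irq, fake_dlpar_init };
	int levels[2] = { irq_level, wq_level };

	assert(irq_level >= 0 && irq_level <= 7);
	assert(wq_level >= 0 && wq_level <= 7);

	hp_wq = NULL;
	irq_pending = 1;
	null_wq_hits = 0;
	events_queued = 0;

	for (int level = 0; level <= 7; level++)
		for (size_t i = 0; i < 2; i++)
			if (levels[i] == level)
				fns[i]();

	return null_wq_hits;
}

/* How many events were queued successfully in the last simulated boot. */
int events_queued_last_boot(void)
{
	return events_queued;
}
```

With the current ordering, simulate_boot(4, 6) returns 1, i.e. the handler
runs once with no workqueue. Either fix makes it return 0: creating the
workqueue earlier, e.g. simulate_boot(4, 3), or registering the IRQs
later, e.g. simulate_boot(7, 6). The levels 3 and 7 are picked only for
the example, not as a suggestion for where the real calls should go.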

-Nathan 
> 
> cheers
>