[PATCH v2 2/2] powerpc/pseries: Wait for completion of hotplug events during PRRN handling

Wed Aug 1 23:16:22 AEST 2018

John Allen <jallen at linux.ibm.com> writes:

> On Mon, Jul 23, 2018 at 11:41:24PM +1000, Michael Ellerman wrote:
>>John Allen <jallen at linux.ibm.com> writes:
>>
>>> While handling PRRN events, the time to handle the actual hotplug events
>>> dwarfs the time it takes to perform the device tree updates and queue the
>>> hotplug events. In the case that PRRN events are being queued continuously,
>>> hotplug events have been observed to be queued faster than the kernel can
>>> actually handle them. This patch avoids the problem by waiting for a
>>> hotplug request to complete before queueing more hotplug events.

Have you tested this patch in isolation, ie. not with patch 1?

>>So do we need the hotplug work queue at all? Can we just call
>>handle_dlpar_errorlog() directly?
>>
>>Or are we using the work queue to serialise things? And if so would a
>>mutex be better?
>
> Right, the workqueue is meant to serialize all hotplug events and it 
> gets used for more than just PRRN events. I believe the motivation for 
> using the workqueue over a mutex is that KVM guests initiate hotplug 
> events through the hotplug interrupt and can queue fairly large requests 
> meaning that in this scenario, waiting for a lock would block interrupts
> for a while.

OK, but that just means that path needs to schedule work to run later.

> Using the workqueue allows us to serialize hotplug events 
> from different sources in the same way without worrying about the 
> context in which the event is generated.

A lock would be so much simpler.

It looks like we have three callers of queue_hotplug_event(), the dlpar
code, the mobility code and the ras interrupt.

The dlpar code already waits synchronously:

  init_completion(&hotplug_done);
  queue_hotplug_event(hp_elog, &hotplug_done, &rc);
  wait_for_completion(&hotplug_done);

You're changing mobility to do the same (this patch), leaving only the
ras interrupt that actually queues work and returns.

So it really seems like a mutex would do the trick, and the ras
interrupt would be the only case that needs to schedule work for later.

cheers