[3/5] pseries: Create device hotplug entry point

Nathan Fontenot nfont at linux.vnet.ibm.com
Thu Sep 18 05:15:04 EST 2014


On 09/17/2014 02:07 AM, Michael Ellerman wrote:
> 
> On Mon, 2014-09-15 at 15:31 -0500, Nathan Fontenot wrote:
>> For pseries system the kernel will be notified of hotplug requests in
>> the form of rtas hotplug events. 
> 
> Can you flesh that design out a bit for me, I don't entirely get how it's going
> to work.
> 
> The kernel gets the rtas hotplug events (in rtasd.c) and spits them out to
> userspace, which then writes them back in ?
> 
>> This patch creates a common routine that can handle these requests in both
>> the PowerVM anbd PowerKVM environments, handle_dlpar_errorlog(). This also
>                 ^
>> creates the initial memory hotplug request handling stub.
>>
>> For PowerVM this patch also creates a new /proc file that the drmgr
>> command will use to write rtas hotplug events to.
> 
> Why is this different between phyp and KVM?
> 
>> For future PowerKVM handling the rtas check-exception code can pass
>> any rtas hotplug events received to handle_dlpar_errorlog().
> 
> Internally to the kernel you mean?
> 

Perhaps a better explanation of how things work today and where I see
them going is needed. I was trying to avoid a long explanation and I
don't think my shortened explanation worked. I'll include this in v2
of the patchset too.

The current hotplug (or dlpar) of devices (the process is generally the
same for memory, cpu, and pci) on PowerVM systems is initiated
from the HMC, which communicates the request to the partitions through
the RSCT framework. The RSCT framework then invokes the drmgr command.
The drmgr command performs the hotplug operation by doing some pieces,
such as most of the rtas calls and device tree parsing, in userspace
and make requests to the kernel to online/offline the device, update the
device tree and add/remove the device.

For PowerKVM the approach is to follow what is currently being done for
pci hotplug. A hotplug request is initiated from the host. QEMU then
sends an EPOW interrupt to the guest which causes the guest to make the
rtas,check-exception call. In QEMU, the rtas,check-exception call
returns a rtas hotplug event to the guest. I was using this same framework
to also enable memory (and next cpu) hotplug.

You are correct that the current pci hotplug path for PowerKVM involves
the kernel receiving the rtas event, passing it to rtas_errd in userspace,
and having rtas_errd invoke drmgr. The drmgr command then handles the request
as described above for PowerVM systems.

There is no need for this circuitous route, we should just handle the entire
hotplug of devices in the kernel. What I am hoping to do is to enable this
by moving the code to handle hotplug from drmgr into the kernel and 
provide a single path for handling hotplug for PowerVM and PowerKVM. To
make this work for PowerKVM we will update the kernel rtas code to
recognize rtas hotplug events returned from rtas,check-exception calls
and call handle_dlpar_errorlog(). The hotplug rtas event is never sent out
to userspace.

For PowerVM systems, I created the /proc/powerpc/dlpar file that a rtas
hotplug event can be written to and passed to handle_dlpar_errorlog().
There is no chance of updating how we receive hotplug requests on PowerVM
systems.

Hopefully that explains the design better.

>> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
>> index a2450b8..574ec73 100644
>> --- a/arch/powerpc/platforms/pseries/dlpar.c
>> +++ b/arch/powerpc/platforms/pseries/dlpar.c
>> @@ -16,7 +16,9 @@
>>  #include <linux/cpu.h>
>>  #include <linux/slab.h>
>>  #include <linux/of.h>
>> +#include <linux/proc_fs.h>
>>  #include "offline_states.h"
>> +#include "pseries.h"
>>  
>>  #include <asm/prom.h>
>>  #include <asm/machdep.h>
>> @@ -530,13 +532,72 @@ static ssize_t dlpar_cpu_release(const char *buf, size_t count)
>>  	return count;
>>  }
>>  
>> +#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
> 
> That is really confusing, but I think it's just a diff artifact?

Yes, diff artifact.

> 
>> +static int handle_dlpar_errorlog(struct rtas_error_log *error_log)
>> +{
>> +	struct pseries_errorlog *pseries_log;
>> +	struct pseries_hp_errorlog *hp_elog;
>> +	int rc = -EINVAL;
>> +
>> +	pseries_log = get_pseries_errorlog(error_log,
>> +					   PSERIES_ELOG_SECT_ID_HOTPLUG);
>> +	if (!pseries_log)
>> +		return rc;
>> +
>> +	hp_elog = (struct pseries_hp_errorlog *)pseries_log->data;
>> +	if (!hp_elog)
>> +		return rc;
> 
> I don't see how that can happen?
> 
> struct pseries_errorlog {
> 	__be16 id;			/* 0x00 2-byte ASCII section ID	*/
> 	__be16 length;			/* 0x02 Section length in bytes	*/
> 	uint8_t version;		/* 0x04 Section version		*/
> 	uint8_t subtype;		/* 0x05 Section subtype		*/
> 	__be16 creator_component;	/* 0x06 Creator component ID	*/
> 	uint8_t data[];			/* 0x08 Start of section data	*/
> };
> 
> Should you be checking for length == 0 instead ?
>

You are correct.
 
> Also I think the code will probably end up cleaner if you do the endian
> conversions immediately when you read the hp_elog, rather than passing it
> around in BE and having to remember to convert at all the usages.
> 

Agreed. I'll clean that up.

>> +	switch (hp_elog->resource) {
>> +	case PSERIES_HP_ELOG_RESOURCE_MEM:
>> +		rc = dlpar_memory(hp_elog);
>> +		break;
> 
> Please add:
> 
>   default:
> 	pr_warn_ratelimited("Unknown resource ..")
> 
> Or something.

can do.

> 
> 
>> +	}
>> +
>> +	return rc;
>> +}
>> +
>> +static ssize_t dlpar_write(struct file *file, const char __user *buf,
>> +			   size_t count, loff_t *offset)
>> +{
>> +	char *event_buf;
>> +	int rc;
>> +
>> +	event_buf = kmalloc(count + 1, GFP_KERNEL);
> 
> Why + 1 ? It's not null-terminated AFAICS.

I think that's just habit. You're correct, it's not needed.

> 
>> +	if (!event_buf)
>> +		return -ENOMEM;
>> +
>> +	rc = copy_from_user(event_buf, buf, count);
>> +	if (rc) {
>> +		kfree(event_buf);
>> +		return rc;
>> +	}
>> +
>> +	rc = handle_dlpar_errorlog((struct rtas_error_log *)event_buf);
> 
> If you start with a struct rtas_error_log * you shouldn't need any casts.

good point.

> 
>> +	kfree(event_buf);
>> +	return rc ? rc : count;
>> +}
>> +
>> +static const struct file_operations dlpar_fops = {
>> +	.write = dlpar_write,
>> +	.llseek = noop_llseek,
>> +};
>> +
>>  static int __init pseries_dlpar_init(void)
>>  {
>> +	struct proc_dir_entry *proc_ent;
>> +
>> +	proc_ent = proc_create("powerpc/dlpar", S_IWUSR, NULL, &dlpar_fops);
>> +	if (proc_ent)
>> +		proc_set_size(proc_ent, 0);
> 
>   else
> 	error message at least please
> 
> Why are we putting it in /proc, can't it go in /sys/kernel like the mobility
> stuff?

I have no personal preference, I'll move it to /sys/kernel and update the
creation handling.

> 
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index 24abc5c..0e60e15 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -20,6 +22,9 @@
>>  #include <asm/machdep.h>
>>  #include <asm/prom.h>
>>  #include <asm/sparsemem.h>
>> +#include <asm/rtas.h>
>> +
>> +DEFINE_MUTEX(dlpar_mem_mutex);
> 
> static ?

yes, that should have been static.

> 
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
>> index b94516b..28bd994 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -62,6 +63,15 @@ extern int dlpar_detach_node(struct device_node *);
>>  extern int dlpar_acquire_drc(u32);
>>  extern int dlpar_release_drc(u32);
>>  
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +extern int dlpar_memory(struct pseries_hp_errorlog *);
>> +#else
>> +static inline int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
>> +{
>> +	return -ENOTSUPP;
> 
> EOPNOTSUPP is a bit more standard.

ok.

Thanks for all the feedback.

-Nathan



More information about the Linuxppc-dev mailing list