[4/6] pseries: Add CPU dlpar remove functionality

Nathan Fontenot nfont at linux.vnet.ibm.com
Wed Jul 22 07:34:11 AEST 2015


On 07/21/2015 04:27 AM, Michael Ellerman wrote:
> On Mon, 2015-22-06 at 21:00:49 UTC, Nathan Fontenot wrote:
>> Add the ability to dlpar remove CPUs via hotplug rtas events, either by
>> specifying the drc-index of the CPU to remove or providing a count of cpus 
>> to remove.
>>
>> To accomplish we create a list of possible dr cpus and their drc indexes
> 
> What does "dr" mean? I know but not everyone does. If it's an acronymn it
> should be uppercase and spelt out at its first usage.
> 

Good point.

>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> index 7890b2f..49b7196 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> @@ -506,6 +507,207 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
>>  	return rc;
>>  }
>>  
>> +static struct device_node *cpu_drc_index_to_dn(struct device_node *parent,
>> +					       u32 drc_index)
>> +{
>> +	struct device_node *dn;
>> +	u32 my_index;
>> +	int rc;
>> +
>> +	for_each_child_of_node(parent, dn) {
>> +		if (of_node_cmp(dn->type, "cpu") != 0)
>> +			continue;
> 
> parent is always "/cpus", so I wonder if this should just be:
> 
> 	for_each_node_by_type(dn, "cpu") {
> 
> And then you don't need parent in here at all.
> 
> Or do we realistically expect to be looking for cpus under a node other than "/cpus" ?

This, combined with the comments below about iterating over the list
of dr_cpus, is going away in the next version of the patch. The 'parent'
node pointer is not passed around anymore.

> 
>> +		rc = of_property_read_u32(dn, "ibm,my-drc-index", &my_index);
>> +		if (rc)
>> +			continue;
>> +
>> +		if (my_index == drc_index)
>> +			break;
>> +	}
>> +
>> +	return dn;
>> +}
>> +
>> +static int dlpar_cpu_remove_by_index(struct device_node *parent,
>> +				     u32 drc_index)
>> +{
>> +	struct device_node *dn;
>> +	int rc;
>> +
>> +	dn = cpu_drc_index_to_dn(parent, drc_index);
>> +	if (!dn)
>> +		return -ENODEV;
>> +
>> +	rc = dlpar_cpu_remove(dn, drc_index);
>> +	of_node_put(dn);
>> +	return rc;
>> +}
>> +
>> +static int dlpar_cpus_possible(struct device_node *parent)
>> +{
>> +	int dr_cpus_possible;
>> +
>> +	/* The first u32 in the ibm,drc-indexes property is the numnber
>> +	 * of drc entries in the property, which is the possible number
>> +	 * number of dr capable cpus.
>> +	 */
>> +	of_property_read_u32(parent, "ibm,drc-indexes", &dr_cpus_possible);
>> +	return dr_cpus_possible;
>> +}
>> +
>> +struct dr_cpu {
>> +	u32     drc_index;
>> +	bool    present;
>> +	bool    modified;
>> +};
>> +
>> +static struct dr_cpu *get_dlpar_cpus(struct device_node *parent)
>> +{
>> +	struct device_node *dn;
>> +	struct property *prop;
>> +	struct dr_cpu *dr_cpus;
>> +	const __be32 *p;
>> +	u32 drc_index;
>> +	int dr_cpus_possible, index;
>> +	bool first;
>> +
>> +	dr_cpus_possible = dlpar_cpus_possible(parent);
>> +	dr_cpus = kcalloc(dr_cpus_possible, sizeof(*dr_cpus), GFP_KERNEL);
>> +	if (!dr_cpus)
>> +		return NULL;
>> +
>> +	first = true;
>> +	index = 0;
>> +	of_property_for_each_u32(parent, "ibm,drc-indexes", prop, p,
>> +				 drc_index) {
>> +		if (first) {
> 
> What about:
> 
> 		if (index == 0) {
> 
> Then you don't need first at all?
>

This block of code is getting re-written, no need to look for the
first node in the new version.
 
>> +			/* The first entry is the number of drc indexes in
>> +			 * the property, skip it.
>> +			 */
>> +			first = false;
>> +			continue;
>> +		}
>> +
>> +		dr_cpus[index].drc_index = drc_index;
>> +
>> +		dn = cpu_drc_index_to_dn(parent, drc_index);
> 
> The outer loop is iterating over all drc indexes (ie. one per cpu usually?),
> and then cpu_drc_index_to_dn() will iterate over all cpus.
> 
> So that's n^2 for n = nr_cpus which is not ideal.
> 
> I realise we can't do much better, but what we can do is limit the number of
> iterations of the outer loop. Because usually we will be here because we want
> to offline less than nr_cpus cpus.
> 
> ie. if I ask to offline 1 cpu on a 4096 cpu system we will still do up to 16M
> iterations in here, when all we needed to do was one iteration of the outer
> loop.
> 
> So I think this routine should take the number of cpus we're trying to remove
> and only iterate until it's found that many.
> 

Yeah, now that you point it out there is a lot of no good in that code.

Re-writing this with a new approach to only create lists of what is present.

>> +		if (dn) {
>> +			dr_cpus[index].present = true;
>> +			of_node_put(dn);
>> +		}
> 
> Why do we put non present ones in the dr_cpus array at all? It just means you
> have to skip them later? (in two separate places)
>
>> +
>> +		index++;
>> +	}
>> +
>> +	return dr_cpus;
>> +}
>> +
>> +static int dlpar_cpu_remove_by_count(struct device_node *parent,
>> +				     u32 cpus_to_remove)
>> +{
>> +	struct dr_cpu *dr_cpus;
>> +	int dr_cpus_removed = 0;
>> +	int dr_cpus_present = 0;
>> +	int dr_cpus_possible;
>> +	int i, rc;
>> +
>> +	pr_info("Attempting to hot-remove %d CPUs\n", cpus_to_remove);
>> +
>> +	dr_cpus = get_dlpar_cpus(parent);
> 
> So I think this should be:
> 
>> +	dr_cpus = get_dlpar_cpus(parent, cpus_to_remove);
> 
>> +	if (!dr_cpus) {
>> +		pr_info("Could not gather dr CPU info\n");
>> +		return -EINVAL;
>> +	}
> 
> And get_dlpar_cpus() should return cpus_to_remove worth of present cpus, or it
> should error, meaning the below then won't be needed:
>

When adding cpus by count we may need more than just cpus_to_remove worth
of present cpus. The goal was to provide all possibilities so we could
continue trying to satisfy the request even if one or more cpus fails
to remove.

>From this comment and comments below I think your approach is that we
should bail if any error occurs during cpu remove. Is this what we
should be doing?
 
>> +	dr_cpus_possible = dlpar_cpus_possible(parent);
>> +
>> +	for (i = 0; i < dr_cpus_possible; i++) {
>> +		if (dr_cpus[i].present)
>> +			dr_cpus_present++;
>> +	}
>> +
>> +	/* Validate the available CPUs to remove.
>> +	 * NOTE: we can't remove the last CPU.
>> +	 */
>> +	if (cpus_to_remove >= dr_cpus_present) {
>> +		pr_err("Insufficient CPUs (%d) to satisfy remove request\n",
>> +		       dr_cpus_present);
>> +		kfree(dr_cpus);
>> +		return -EINVAL;
>> +	}
>> +
> 
> Having only present cpus in dr_cpus should also simplify this loop because you
> don't need to skip non-present ones:

Fixed in updated code.

> 
>> +	for (i = 0; i < dr_cpus_possible; i++) {
>> +		if (dr_cpus_removed == cpus_to_remove)
>> +			break;
>> +
>> +		if (!dr_cpus[i].present)
>> +			continue;
>> +
>> +		rc = dlpar_cpu_remove_by_index(parent, dr_cpus[i].drc_index);
>> +		if (!rc) {
>> +			dr_cpus_removed++;
>> +			dr_cpus[i].modified = true;
>> +		}
> 
> else .. ?
> 
> Surely you should bail on the first error?
> 
> Definitely you should, because you're going to undo everything below anyway:

See comment above, I kept going here in the hopes that we could satisfy the
request even if a failure occurred.

> 
>> +	}
>> +
>> +	if (dr_cpus_removed != cpus_to_remove) {
>> +		pr_info("CPU hot-remove failed, adding any removed CPUs\n");
>> +
>> +		for (i = 0; i < dr_cpus_possible; i++) {
>> +			if (!dr_cpus[i].modified)
>> +				continue;
> 
> And if you bail on the first error above you shouldn't need modified, instead
> you just iterate in reverse from i using a new counter, eg. j.

Yep, I'll work that into the new code.

> 
>> +
>> +			rc = dlpar_cpu_add(parent, dr_cpus[i].drc_index);
>> +			if (rc)
>> +				pr_info("Failed to re-add CPU (%x)\n",
>> +					dr_cpus[i].drc_index);
>> +		}
>> +
>> +		rc = -EINVAL;
>> +	} else {
>> +		rc = 0;
>> +	}
>> +
>> +	kfree(dr_cpus);
>> +	return rc;
>> +}
>> +
>> +int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
>> +{
>> +	struct device_node *parent;
>> +	u32 count, drc_index;
>> +	int rc;
>> +
>> +	count = hp_elog->_drc_u.drc_count;
>> +	drc_index = hp_elog->_drc_u.drc_index;
> 
> Those both need be32_to_cpu().
> 
>    arch/powerpc/platforms/pseries/hotplug-cpu.c:750:15: warning: incorrect type in assignment (different base types)
>    arch/powerpc/platforms/pseries/hotplug-cpu.c:750:15:    expected unsigned int [unsigned] [usertype] count
>    arch/powerpc/platforms/pseries/hotplug-cpu.c:750:15:    got restricted __be32 [usertype] drc_count
>    arch/powerpc/platforms/pseries/hotplug-cpu.c:751:19: warning: incorrect type in assignment (different base types)
>    arch/powerpc/platforms/pseries/hotplug-cpu.c:751:19:    expected unsigned int [unsigned] [usertype] drc_index
>    arch/powerpc/platforms/pseries/hotplug-cpu.c:751:19:    got restricted __be32 [usertype] drc_index
>

Will fix.
 
> 
> Though looking closer I don't see where we ever pass or receive a
> pseries_hp_errorlog to or from firmware? So I'm a bit confused why we're
> bothering with the __be32 shenanigans. Hopefully I've just missed a detail
> somewhere.
> 

That patch is coming. For hotplug in KVM guests the pseries_hp_errorlog
is received when we call rtas_check_exception(). Currently these are
sent up to rtas_errd in userspace, When this partchset goes in I planned
on sending a patch to have cpu and memory hotplug requests handled entirely
in the kernel instead of going to userspace.

>> +	parent = of_find_node_by_path("/cpus");
>> +	if (!parent)
>> +		return -ENODEV;
>> +
>> +	lock_device_hotplug();
>> +
>> +	switch (hp_elog->action) {
>> +	case PSERIES_HP_ELOG_ACTION_REMOVE:
>> +		if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT)
>> +			rc = dlpar_cpu_remove_by_count(parent, count);
>> +		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
>> +			rc = dlpar_cpu_remove_by_index(parent, drc_index);
>> +		else
>> +			rc = -EINVAL;
>> +		break;
>> +	default:
>> +		pr_err("Invalid action (%d) specified\n", hp_elog->action);
>> +		rc = -EINVAL;
>> +		break;
>> +	}
>> +
>> +	unlock_device_hotplug();
>> +	of_node_put(parent);
>> +	return rc;
>> +}
>> +
>>  #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
>>  
>>  static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
>> index 8411c27..58892f1 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -66,11 +66,16 @@ extern int dlpar_release_drc(u32 drc_index);
>>  
>>  #ifdef CONFIG_MEMORY_HOTPLUG
>>  int dlpar_memory(struct pseries_hp_errorlog *hp_elog);
>> +int dlpar_cpu(struct pseries_hp_errorlog *hp_elog);
> 
> I don't think this should be under CONFIG_MEMORY_HOTPLUG ?
> 

No it should not.

Thanks for the review.

-Nathan



More information about the Linuxppc-dev mailing list