[PATCH v3 2/2] powerpc: add support for new hcall H_BEST_ENERGY

Michael Neuling mikey at neuling.org
Mon Jun 28 16:11:06 EST 2010



In message <20100628053252.GA12751 at dirshya.in.ibm.com> you wrote:
> * Michael Neuling <mikey at neuling.org> [2010-06-28 11:44:31]:
> 
> > Vaidy,
> > 
> > > 	Create sysfs interface to export data from H_BEST_ENERGY hcall
> > > 	that can be used by administrative tools on supported pseries
> > > 	platforms for energy management	optimizations.
> > > 
> > > 	/sys/device/system/cpu/pseries_(de)activate_hint_list and
> > > 	/sys/device/system/cpu/cpuN/pseries_(de)activate_hint will provide
> > > 	hints for activation and deactivation of cpus respectively.
> > > 
> > > 	Added new driver module
> > > 		arch/powerpc/platforms/pseries/pseries_energy.c
> > > 	under new config option CONFIG_PSERIES_ENERGY
> > 
> > Can you provide some documentation on how to use these hints and what
> > format they are provided from sysfs.  Looks like two separate interfaces
> > two the same thing (one a comma sep list and 1 per cpu, why do need
> > both?).  What is the difference between activate and deactivate, with
> > out me having to read PAPR :-) ??
> 
> Hi Mike,
> 
> Thanks for reviewing this patch series.  Sure, I can provide
> additional information.
> 
> These hints are abstract number given by the hypervisor based on
> the extended knowledge the hypervisor has regarding the current system
> topology and resource mappings.
> 
> The activate and the deactivate part is for the two distinct
> operations that we could do for energy savings.  When we have more
> capacity than required, we could deactivate few core to save energy.
> The choice of the core to deactivate will be based on
> /sys/devices/system/cpu/deactivate_hint_list.  The comma separated
> list of cpus (cores) will be the preferred choice.
> 
> Once the system has few deactivated cores, based on workload demand we
> may have to activate them to meet the demand.  In that case the
> /sys/devices/system/cpu/activate_hint_list will be used to prefer the
> core in-order among the deactivated cores.
> 
> In simple terms, activate_hint_list will be null until we deactivate
> few cores.  Then we could look at the corresponding list for
> activation or deactivation.

Can you put these details in the code and in the check-in comments.

> 
> Regarding your second point, there is a reason for both a list and
> per-cpu interface.  The list gives us a system wide list of cores in
> one shot for userspace to base their decision.  This will be the
> preferred interface for most cases.  On the other hand, per-cpu file
> /sys/device/system/cpu/cpuN/pseries_(de)activate_hint provide more
> information since it exports the hint value as such.
> 
> The idea is that the list interface will be used to get a suggested
> list of cores to manage, while the per-cpu value can be used to
> further get fine grain information on a per-core bases from the
> hypervisor.  This allows Linux to have access to all information that
> the hypervisor has offered through this hcall interface.

OK, I didn't realise that they contained different info.  Just more
reasons that this interface needs better documentation :-)

Overall, I'm mostly happy with the interface.  It's pretty light weight.

> > Other comments below.
> > 
> > > 
> > > Signed-off-by: Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com>
> > > ---
> > >  arch/powerpc/include/asm/hvcall.h               |    3 
> > >  arch/powerpc/platforms/pseries/Kconfig          |   10 +
> > >  arch/powerpc/platforms/pseries/Makefile         |    1 
> > >  arch/powerpc/platforms/pseries/pseries_energy.c |  258 +++++++++++++++++
++++
> > ++
> > >  4 files changed, 271 insertions(+), 1 deletions(-)
> > >  create mode 100644 arch/powerpc/platforms/pseries/pseries_energy.c
> > > 
> > > diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm
/hvc
> > all.h
> > > index 5119b7d..34b66e0 100644
> > > --- a/arch/powerpc/include/asm/hvcall.h
> > > +++ b/arch/powerpc/include/asm/hvcall.h
> > > @@ -231,7 +231,8 @@
> > >  #define H_GET_EM_PARMS		0x2B8
> > >  #define H_SET_MPP		0x2D0
> > >  #define H_GET_MPP		0x2D4
> > > -#define MAX_HCALL_OPCODE	H_GET_MPP
> > > +#define H_BEST_ENERGY		0x2F4
> > > +#define MAX_HCALL_OPCODE	H_BEST_ENERGY
> > > 
> > >  #ifndef __ASSEMBLY__
> > > 
> > > diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platfo
rms/
> > pseries/Kconfig
> > > index c667f0f..b3dd108 100644
> > > --- a/arch/powerpc/platforms/pseries/Kconfig
> > > +++ b/arch/powerpc/platforms/pseries/Kconfig
> > > @@ -33,6 +33,16 @@ config PSERIES_MSI
> > >         depends on PCI_MSI && EEH
> > >         default y
> > > 
> > > +config PSERIES_ENERGY
> > 
> > Probably need a less generic name.  PSERIES_ENERGY_MANAGEMENT?
> > PSERIES_ENERGY_HOTPLUG_HINTS?
> 
> PSERIES_ENERGY_MANAGEMENT may be good but too long for a config
> option.
> 
> The idea is to collect all energy management functions in this module
> as and when new features are introduced in the pseries platform.  This
> hcall interface is the first to be included, but going forward in
> future I do not propose to have different modules for other energy
> management related features.
> 
> The name is specific enough for IBM pseries platform and energy
> management functions and enablements.  Having less generic name below
> this level will make it difficult to add all varieties of energy
> management functions in future.

OK, I thought this might be the case but you never said.  Please say
something like "This adds CONFIG_PSERIES_ENERGY which will be used for
future power saving code" or some such.

> 
> > > +	tristate "pseries energy management capabilities driver"
> > > +	depends on PPC_PSERIES
> > > +	default y
> > > +	help
> > > +	  Provides interface to platform energy management capabilities
> > > +	  on supported PSERIES platforms.
> > > +	  Provides: /sys/devices/system/cpu/pseries_(de)activation_hint_list
> > > +	  and /sys/devices/system/cpu/cpuN/pseries_(de)activation_hint
> > > +
> > >  config SCANLOG
> > >  	tristate "Scanlog dump interface"
> > >  	depends on RTAS_PROC && PPC_PSERIES
> > > diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platf
orms
> > /pseries/Makefile
> > > index 3dbef30..32ae72e 100644
> > > --- a/arch/powerpc/platforms/pseries/Makefile
> > > +++ b/arch/powerpc/platforms/pseries/Makefile
> > > @@ -16,6 +16,7 @@ obj-$(CONFIG_EEH)	+= eeh.o eeh_cache.o eeh_driver
.o eeh_e
> > vent.o eeh_sysfs.o
> > >  obj-$(CONFIG_KEXEC)	+= kexec.o
> > >  obj-$(CONFIG_PCI)	+= pci.o pci_dlpar.o
> > >  obj-$(CONFIG_PSERIES_MSI)	+= msi.o
> > > +obj-$(CONFIG_PSERIES_ENERGY)	+= pseries_energy.o
> > > 
> > >  obj-$(CONFIG_HOTPLUG_CPU)	+= hotplug-cpu.o
> > >  obj-$(CONFIG_MEMORY_HOTPLUG)	+= hotplug-memory.o
> > > diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/power
pc/p
> > latforms/pseries/pseries_energy.c
> > > new file mode 100644
> > > index 0000000..9a936b1
> > > --- /dev/null
> > > +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> > > @@ -0,0 +1,258 @@
> > > +/*
> > > + * POWER platform energy management driver
> > > + * Copyright (C) 2010 IBM Corporation
> > > + *
> > > + * This program is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU General Public License
> > > + * version 2 as published by the Free Software Foundation.
> > > + *
> > > + * This pseries platform device driver provides access to
> > > + * platform energy management capabilities.
> > > + */
> > > +
> > > +#include <linux/module.h>
> > > +#include <linux/types.h>
> > > +#include <linux/errno.h>
> > > +#include <linux/init.h>
> > > +#include <linux/seq_file.h>
> > > +#include <linux/sysdev.h>
> > > +#include <linux/cpu.h>
> > > +#include <linux/of.h>
> > > +#include <asm/cputhreads.h>
> > > +#include <asm/page.h>
> > > +#include <asm/hvcall.h>
> > > +
> > > +
> > > +#define MODULE_VERS "1.0"
> > 
> > Argh, I hate module versions... but this one is less of an issue since
> > it doesn't seem to be being used anyway :-)
> > 
> > > +#define MODULE_NAME "pseries_energy"
> > 
> > Unused too.
> 
> Yes.  This keep the module template complete.  No overhead as yet :)
> We will certainly need the MODULE_VERS in future if we add/change
> sysfs interfaces.

Argh, change sysfs interfaces?!?  We don't even have the first one and
now we are going to change it?  :-)

> > > +
> > > +/* Helper Routines to convert between drc_index to cpu numbers */
> > > +
> > > +static u32 cpu_to_drc_index(int cpu)
> > > +{
> > > +	struct device_node *dn = NULL;
> > > +	const int *indexes;
> > > +	int i;
> > > +	dn = of_find_node_by_path("/cpus");
> > > +	if (dn == NULL)
> > > +		goto err;
> > 
> > Humm, I not sure this is really needed.  If you don't have /cpus you are
> > probably not going to boot.
> 
> Good suggestion.  I could add all these checks in module_init.   I was
> think if any of the functions being called is allocating memory and in
> case they fail, we need to abort.
> 
> I just reviewed and look like of_find_node_by_path() will not sleep or
> allocate any memory.  So if it succeeds once in module_init(), then it
> will never fail! 
> 
> 
> > > +	indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> > > +	if (indexes == NULL)
> > > +		goto err;
> > 
> > These checks should probably be moved to module init rather than /sfs
> > read time.  If they fail, don't load the module and print a warning.  
> > 
> > These HCALLS and device-tree entire aren't going to be dynamic.
> 
> Agreed.  Only cause of runtime failure is OOM.  If none of these
> allocate memory, moving these checks once at module_init() will be
> a good optimization.

Cool, thanks.  

I also noticed you are doing this per cpu, so it's got the potential to
really suck on our big machines.

> But still I am wondering if it is worth the risk.  These are not in
> hot paths and these are just quick null comparisons.  Also in most other
> call sites, we do check for return values.

I found a /proc file the other day that took 60sec to read on a big
machine.  The file wasn't a hot path, but 60sec was way too long.

Things get bad quickly on our big machines.  I'd prefer it in module
init.

> 
> > > +	/* Convert logical cpu number to core number */
> > > +	i = cpu_core_of_thread(cpu);
> > > +	/*
> > > +	 * The first element indexes[0] is the number of drc_indexes
> > > +	 * returned in the list.  Hence i+1 will get the drc_index
> > > +	 * corresponding to core number i.
> > > +	 */
> > > +	WARN_ON(i > indexes[0]);
> > > +	return indexes[i + 1];
> > > +err:
> > > +	printk(KERN_WARNING "cpu_to_drc_index(%d) failed", cpu);
> > > +	return 0;
> > > +}
> > > +
> > > +static int drc_index_to_cpu(u32 drc_index)
> > > +{
> > > +	struct device_node *dn = NULL;
> > > +	const int *indexes;
> > > +	int i, cpu;
> > > +	dn = of_find_node_by_path("/cpus");
> > > +	if (dn == NULL)
> > > +		goto err;
> > 
> > same here
> 
> agreed, comments mentioned above.
> 
> > > +	indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> > > +	if (indexes == NULL)
> > > +		goto err;
> > > +	/*
> > > +	 * First element in the array is the number of drc_indexes
> > > +	 * returned.  Search through the list to find the matching
> > > +	 * drc_index and get the core number
> > > +	 */
> > > +	for (i = 0; i < indexes[0]; i++) {
> > > +		if (indexes[i + 1] == drc_index)
> > > +			break;
> > > +	}
> > > +	/* Convert core number to logical cpu number */
> > > +	cpu = cpu_first_thread_of_core(i);
> > > +	return cpu;
> > > +err:
> > > +	printk(KERN_WARNING "drc_index_to_cpu(%d) failed", drc_index);
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * pseries hypervisor call H_BEST_ENERGY provides hints to OS on
> > > + * preferred logical cpus to activate or deactivate for optimized
> > > + * energy consumption.
> > > + */
> > > +
> > > +#define FLAGS_MODE1	0x004E200000080E01
> > > +#define FLAGS_MODE2	0x004E200000080401
> > > +#define FLAGS_ACTIVATE  0x100
> > > +
> > > +static ssize_t get_best_energy_list(char *page, int activate)
> > > +{
> > > +	int rc, cnt, i, cpu;
> > > +	unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> > > +	unsigned long flags = 0;
> > > +	u32 *buf_page;
> > > +	char *s = page;
> > > +
> > > +	buf_page = (u32 *) get_zeroed_page(GFP_KERNEL);
> > > +	if (!buf_page)
> > > +		return -ENOMEM;
> > > +
> > > +	flags = FLAGS_MODE1;
> > > +	if (activate)
> > > +		flags |= FLAGS_ACTIVATE;
> > > +
> > > +	rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags, 0, __pa(buf_page),
> > > +				0, 0, 0, 0, 0, 0);
> > > +	if (rc != H_SUCCESS) {
> > > +		free_page((unsigned long) buf_page);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	cnt = retbuf[0];
> > > +	for (i = 0; i < cnt; i++) {
> > > +		cpu = drc_index_to_cpu(buf_page[2*i+1]);
> > > +		if ((cpu_online(cpu) && !activate) ||
> > > +		    (!cpu_online(cpu) && activate))
> > > +			s += sprintf(s, "%d,", cpu);
> > > +	}
> > > +	if (s > page) { /* Something to show */
> > > +		s--; /* Suppress last comma */
> > > +		s += sprintf(s, "\n");
> > > +	}
> > > +
> > > +	free_page((unsigned long) buf_page);
> > > +	return s-page;
> > > +}
> > > +
> > > +static ssize_t get_best_energy_data(struct sys_device *dev,
> > > +					char *page, int activate)
> > > +{
> > > +	int rc;
> > > +	unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
> > > +	unsigned long flags = 0;
> > > +
> > > +	flags = FLAGS_MODE2;
> > > +	if (activate)
> > > +		flags |= FLAGS_ACTIVATE;
> > > +
> > > +	rc = plpar_hcall9(H_BEST_ENERGY, retbuf, flags,
> > > +				cpu_to_drc_index(dev->id),
> > > +				0, 0, 0, 0, 0, 0, 0);
> > > +
> > > +	if (rc != H_SUCCESS)
> > > +		return -EINVAL;
> > > +
> > > +	return sprintf(page, "%lu\n", retbuf[1] >> 32);
> > > +}
> > > +
> > > +/* Wrapper functions */
> > > +
> > > +static ssize_t cpu_activate_hint_list_show(struct sysdev_class *class,
> > > +			struct sysdev_class_attribute *attr, char *page)
> > > +{
> > > +	return get_best_energy_list(page, 1);
> > > +}
> > > +
> > > +static ssize_t cpu_deactivate_hint_list_show(struct sysdev_class *class,
> > > +			struct sysdev_class_attribute *attr, char *page)
> > > +{
> > > +	return get_best_energy_list(page, 0);
> > > +}
> > > +
> > > +static ssize_t percpu_activate_hint_show(struct sys_device *dev,
> > > +			struct sysdev_attribute *attr, char *page)
> > > +{
> > > +	return get_best_energy_data(dev, page, 1);
> > > +}
> > > +
> > > +static ssize_t percpu_deactivate_hint_show(struct sys_device *dev,
> > > +			struct sysdev_attribute *attr, char *page)
> > > +{
> > > +	return get_best_energy_data(dev, page, 0);
> > > +}
> > > +
> > > +/*
> > > + * Create sysfs interface:
> > > + * /sys/devices/system/cpu/pseries_activate_hint_list
> > > + * /sys/devices/system/cpu/pseries_deactivate_hint_list
> > > + * 	Comma separated list of cpus to activate or deactivate
> > > + * /sys/devices/system/cpu/cpuN/pseries_activate_hint
> > > + * /sys/devices/system/cpu/cpuN/pseries_deactivate_hint
> > > + *	Per-cpu value of the hint
> > 
> > Do we really need both interfaces?  Seems like awk could generate one
> > from the other in userspace?
> 
> Yes, it is possible, but will not scale.  Generating a list from set
> of per-cpu values is possible but will be too much overhead to build
> the list.  Having the list interface and deleting the per-cpu ones
> will reduce information available from hypervisor.  Having both the
> interface is a good balance between amount of information exported and
> quick access to a consolidated view.

OK.

"pseries_activate_hint" doesn't say what it's hinting about.  Can we
change the name to add it being a power save hint?  We might need to add
other hints later.  Performance hint etc..

> 
> > > + */
> > > +
> > > +struct sysdev_class_attribute attr_cpu_activate_hint_list =
> > > +		_SYSDEV_CLASS_ATTR(pseries_activate_hint_list, 0444,
> > > +		cpu_activate_hint_list_show, NULL);
> > > +
> > > +struct sysdev_class_attribute attr_cpu_deactivate_hint_list =
> > > +		_SYSDEV_CLASS_ATTR(pseries_deactivate_hint_list, 0444,
> > > +		cpu_deactivate_hint_list_show, NULL);
> > > +
> > > +struct sysdev_attribute attr_percpu_activate_hint =
> > > +		_SYSDEV_ATTR(pseries_activate_hint, 0444,
> > > +		percpu_activate_hint_show, NULL);
> > > +
> > > +struct sysdev_attribute attr_percpu_deactivate_hint =
> > > +		_SYSDEV_ATTR(pseries_deactivate_hint, 0444,
> > > +		percpu_deactivate_hint_show, NULL);
> > 
> > > +
> > > +static int __init pseries_energy_init(void)
> > > +{
> > > +	int cpu, err;
> > > +	struct sys_device *cpu_sys_dev;
> > > +
> > > +	/* Create the sysfs files */
> > > +	err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> > > +				&attr_cpu_activate_hint_list.attr);
> > > +	if (!err)
> > > +		err = sysfs_create_file(&cpu_sysdev_class.kset.kobj,
> > > +				&attr_cpu_deactivate_hint_list.attr);
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		cpu_sys_dev = get_cpu_sysdev(cpu);
> > > +		err = sysfs_create_file(&cpu_sys_dev->kobj,
> > > +				&attr_percpu_activate_hint.attr);
> > > +		if (err)
> > > +			break;
> > > +		err = sysfs_create_file(&cpu_sys_dev->kobj,
> > > +				&attr_percpu_deactivate_hint.attr);
> > > +		if (err)
> > > +			break;
> > > +	}
> > > +	return err;
> > > +
> > > +}
> > > +
> > > +static void __exit pseries_energy_cleanup(void)
> > > +{
> > > +	int cpu;
> > > +	struct sys_device *cpu_sys_dev;
> > > +
> > > +	/* Remove the sysfs files */
> > > +	sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> > > +				&attr_cpu_activate_hint_list.attr);
> > > +
> > > +	sysfs_remove_file(&cpu_sysdev_class.kset.kobj,
> > > +				&attr_cpu_deactivate_hint_list.attr);
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		cpu_sys_dev = get_cpu_sysdev(cpu);
> > > +		sysfs_remove_file(&cpu_sys_dev->kobj,
> > > +				&attr_percpu_activate_hint.attr);
> > > +		sysfs_remove_file(&cpu_sys_dev->kobj,
> > > +				&attr_percpu_deactivate_hint.attr);
> > > +	}
> > > +}
> > > +
> > > +module_init(pseries_energy_init);
> > > +module_exit(pseries_energy_cleanup);
> > > +MODULE_DESCRIPTION("Driver for pseries platform energy management");
> > 
> > Needs a less generic description. 
> 
> Explained above.
> 
> Thanks,
> Vaidy
> 


More information about the Linuxppc-dev mailing list