[PATCH v4] powernv, cpufreq: cpufreq driver for powernv platform
Gautham R Shenoy
ego at linux.vnet.ibm.com
Thu Mar 27 20:30:50 EST 2014
On Thu, Mar 27, 2014 at 12:09:53PM +0530, Viresh Kumar wrote:
> Cc'ing Rafael.
>
Thanks. It was a lapse on my part by not Cc'ing Rafael in the first
place.
> On 26 March 2014 22:25, Gautham R. Shenoy <ego at linux.vnet.ibm.com> wrote:
> > From: Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com>
> >
> > Backend driver to dynamically set voltage and frequency on
> > IBM POWER non-virtualized platforms. Power management SPRs
> > are used to set the required PState.
> >
> > This driver works in conjunction with cpufreq governors
> > like 'ondemand' to provide a demand based frequency and
> > voltage setting on IBM POWER non-virtualized platforms.
> >
> > PState table is obtained from OPAL v3 firmware through device
> > tree.
> >
> > powernv_cpufreq back-end driver would parse the relevant device-tree
> > nodes and initialise the cpufreq subsystem on powernv platform.
> >
> > The code was originally written by svaidy at linux.vnet.ibm.com. Over
> > time it was modified to accomodate bug-fixes as well as updates to the
> > the cpu-freq core. Relevant portions of the change logs corresponding
> > to those modifications are noted below:
> >
> > * The policy->cpus needs to be populated in a hotplug-invariant
> > manner instead of using cpu_sibling_mask() which varies with
> > cpu-hotplug. This is because the cpufreq core code copies this
> > content into policy->related_cpus mask which is should not vary on
>
> s/is /
Good catch! Will fix this
[...]
>
> > * On POWER systems, the CPU frequency is controlled at a core-level
> > and hence we need to serialize so that only one of the threads in
> > the core switches the core's frequency at a time. Introduce
> > per-core locking to enable finer-grained synchronization and
> > thereby enhance the speed and responsiveness of the cpufreq driver
> > to varying workload demands.
> >
> > The design of per-core locking is very simple and
> > straight-forward: we first define a Per-CPU lock and use the ones
> > that belongs to the first thread sibling of the core.
> >
> > cpu_first_thread_sibling() macro is used to find the *common*
> > lock for all thread siblings belonging to a core. [Authored by
> > srivatsa.bhat at linux.vnet.ibm.com]
>
> We don't need that after serialization patch of core is accepted. And it
> should be accepted soon, in one form or other.
As of now, I prefer this patch be based on code that is in the -next
tree. I'll get rid of the per-core locking once the serialization
patch of the core is accepted.
[...]
> > --- a/arch/powerpc/configs/pseries_defconfig
> > +++ b/arch/powerpc/configs/pseries_defconfig
> > @@ -353,3 +353,4 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
> > CONFIG_VIRTUALIZATION=y
> > CONFIG_KVM_BOOK3S_64=m
> > CONFIG_KVM_BOOK3S_64_HV=y
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> > diff --git a/arch/powerpc/configs/pseries_le_defconfig b/arch/powerpc/configs/pseries_le_defconfig
> > index 62771e0..47e6161 100644
> > --- a/arch/powerpc/configs/pseries_le_defconfig
> > +++ b/arch/powerpc/configs/pseries_le_defconfig
> > @@ -350,3 +350,4 @@ CONFIG_CRYPTO_LZO=m
> > # CONFIG_CRYPTO_ANSI_CPRNG is not set
> > CONFIG_CRYPTO_DEV_NX=y
> > CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
> > +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>
> don't know how Rafael want this, but probably this part could have gone
> through ppc tree..
>
That would mean multiple patches :-)
> > +
> > +#define pr_fmt(fmt) "powernv-cpufreq: " fmt
> > +
> > +#include <linux/module.h>
> > +#include <linux/cpufreq.h>
> > +#include <linux/smp.h>
> > +#include <linux/of.h>
> > +#include <asm/cputhreads.h>
>
> I thought I gave a comment on missing headers here?
Ok, so these seem to be the missing ones.
#include <linux/kernel.h>
#include <linux/percpu-defs.h>
#include <linux/mutex.h>
#include <linux/sysfs.h>
#include <linux/cpumask.h>
#include <asm/reg.h>
Have I missed anything else ?
> > +static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
> > +static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
>
> I though we don't need this anymore? Either Rafael will take my patch as
> is for the BOOST fixup or we will end up creating .isboost field in
> struct cpufreq_frequency_table and so you could have used .driver_data here.
>
I mentioned in my reply to your BOOST fixup patch that we would like
pstate_ids to be of the type "signed int", while the .driver_data is
an "unsigned int". If we end up using .driver_data, we would have to
be careful and not use it directly but reassign it to a signed int (or
typecast it) before using it.
Not just that, the pr_debugs in the core which are printed during
frequency transitions will then end up printing large positive values
(since it will interpret negative pstate_ids as unsigned int) in the
place of pstate_ids, which would not be particularly useful.
> > +struct powernv_pstate_info {
> > + int pstate_min_id;
> > + int pstate_max_id;
> > + int pstate_nominal_id;
> > + int nr_pstates;
> > +};
> > +static struct powernv_pstate_info powernv_pstate_info;
>
> Maybe write it as this (if you like :), as this is the only instance
> of this struct):
>
> Also, because 'powernv_pstate_info' is a local variable we can get rid of
> powerenv_ from its name and name it just pstate_info. That will make
> code shorter at many places and you may not be required to break
> lines into two at some places. If you wish :)
>
Ok fair enough!
> +static struct powernv_pstate_info {
> + int pstate_min_id;
> + int pstate_max_id;
> + int pstate_nominal_id;
> + int nr_pstates;
> +} powernv_pstate_info;
>
> > +/*
> > + * Initialize the freq table based on data obtained
> > + * from the firmware passed via device-tree
> > + */
> > +static int init_powernv_pstates(void)
> > +{
> > + struct device_node *power_mgt;
> > + int nr_pstates = 0;
> > + int pstate_min, pstate_max, pstate_nominal;
> > + const __be32 *pstate_ids, *pstate_freqs;
> > + int i;
>
> Can merge all the int definitions into a single line.
Too many ints to be incorporated in a single line. But will see if I
can beautify it :-)
> > + u32 len_ids, len_freqs;
> > +
> > + power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
> > + if (!power_mgt) {
> > + pr_warn("power-mgt node not found\n");
> > + return -ENODEV;
> > + }
> > +
> > + if (of_property_read_u32(power_mgt, "ibm,pstate-min", &pstate_min)) {
> > + pr_warn("ibm,pstate-min node not found\n");
> > + return -ENODEV;
> > + }
> > +
> > + if (of_property_read_u32(power_mgt, "ibm,pstate-max", &pstate_max)) {
> > + pr_warn("ibm,pstate-max node not found\n");
> > + return -ENODEV;
> > + }
>
> Why do you need to get these from DT? And not find that yourself here instead?
>
Well, I think we can obtain a more accurate value from the DT which
would have already computed it. But I'll let vaidy answer this.
> > + if (of_property_read_u32(power_mgt, "ibm,pstate-nominal",
> > + &pstate_nominal)) {
> > + pr_warn("ibm,pstate-nominal not found\n");
> > + return -ENODEV;
> > + }
> > + pr_info("cpufreq pstate min %d nominal %d max %d\n", pstate_min,
> > + pstate_nominal, pstate_max);
> > +
> > + pstate_ids = of_get_property(power_mgt, "ibm,pstate-ids", &len_ids);
> > + if (!pstate_ids) {
> > + pr_warn("ibm,pstate-ids not found\n");
> > + return -ENODEV;
> > + }
> > +
> > + pstate_freqs = of_get_property(power_mgt, "ibm,pstate-frequencies-mhz",
> > + &len_freqs);
> > + if (!pstate_freqs) {
> > + pr_warn("ibm,pstate-frequencies-mhz not found\n");
> > + return -ENODEV;
> > + }
> > +
> > + WARN_ON(len_ids != len_freqs);
> > + nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
> > + if (!nr_pstates) {
> > + WARN_ON(1);
> > + return -ENODEV;
> > + }
>
> Maybe like this:
>
> + if (WARN_ON(!nr_pstates))
> + return -ENODEV;
>
Thanks. This looks better.
> > + pr_debug("NR PStates %d\n", nr_pstates);
> > + for (i = 0; i < nr_pstates; i++) {
> > + u32 id = be32_to_cpu(pstate_ids[i]);
> > + u32 freq = be32_to_cpu(pstate_freqs[i]);
> > +
> > + pr_debug("PState id %d freq %d MHz\n", id, freq);
> > + powernv_freqs[i].driver_data = i;
>
> Looks like more than one comments aren't addressed :(
> You can use this field for your id. And even if you couldn't have
> done that, you don't need to initialize this field at all..
Ok, since we are better off not using it, we shouldn't be initializing
it.
>
> > + powernv_freqs[i].frequency = freq * 1000; /* kHz */
> > + powernv_pstate_ids[i] = id;
> > + }
> > + /* End of list marker entry */
> > + powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
> > +
> > + powernv_pstate_info.pstate_min_id = pstate_min;
> > + powernv_pstate_info.pstate_max_id = pstate_max;
> > + powernv_pstate_info.pstate_nominal_id = pstate_nominal;
> > + powernv_pstate_info.nr_pstates = nr_pstates;
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Returns the cpu frequency corresponding to the pstate_id.
> > + */
>
> Maybe:
>
> +/* Returns the cpu frequency corresponding to the pstate_id. */
>
Right, single line comment! Will fix this.
> > +static unsigned int pstate_id_to_freq(int pstate_id)
> > +{
> > + int i;
> > +
> > + i = powernv_pstate_info.pstate_max_id - pstate_id;
>
> It looks like these ids would always be contiguous? In that case
> it would be better if you can mention this property at the top of this
> file in some comment. So, that new people can understand things
> quickly.
>
Ok.
> > + BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
> > + WARN_ON(powernv_pstate_ids[i] != pstate_id);
>
> Do you really want it? We have already confirmed that 'i' is
> within limits.
"i" might be within limits. But I want to make sure that the pstate_id
corresponding to "i" is the same as the pstate_id that we
requested. Not that anything would have changed since the
initialization, but I get paranoid about these things from time to
time :-)
>
> > + return powernv_freqs[i].frequency;
> > +}
> > +
[...]
> > +
> > +/*
> > + * powernv_read_cpu_freq: Reads the current frequency on this cpu.
> > + *
> > + * Called via smp_call_function.
> > + *
> > + * Note: The caller of the smp_call_function should pass an argument of
> > + * the type 'struct powernv_smp_call_data *' along with this function.
> > + *
> > + * The current frequency on this cpu will be returned via
> > + * ((struct powernv_smp_call_data *)arg)->freq;
> > + */
> > +static void powernv_read_cpu_freq(void *arg)
> > +{
> > + unsigned long pmspr_val;
> > + s8 local_pstate_id;
> > + struct powernv_smp_call_data *freq_data;
> > +
> > + freq_data = (struct powernv_smp_call_data *)arg;
>
> don't need casting here ?
Doesn't harm having it there. Just like the #includes :-)
>
> > +
> > + pmspr_val = get_pmspr(SPRN_PMSR);
> > +
> > + /*
> > + * The local pstate id corresponds bits 48..55 in the PMSR.
> > + * Note: Watch out for the sign!
> > + */
> > + local_pstate_id = (pmspr_val >> 48) & 0xFF;
> > + freq_data->pstate_id = local_pstate_id;
> > + freq_data->freq = pstate_id_to_freq(freq_data->pstate_id);
> > +
> > + pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d kHz \n",
> > + smp_processor_id(), pmspr_val, freq_data->pstate_id,
>
> s/smp_processor_id/raw_smp_processor_id ?
No. This function is called via smp_call_function(). So we have
preempt_disable on and it is safe to use smp_processor_id.
>
> > + freq_data->freq);
> > +}
[...]
> > +/*
> > + * set_pstate: Sets the frequency on this cpu.
> > + *
> > + * This is called via an smp_call_function.
> > + *
> > + * The caller must ensure that freq_data is of the type
> > + * (struct powernv_smp_call_data *) and the pstate_id which needs to be set
> > + * on this cpu should be present in freq_data->pstate_id.
> > + */
> > +static void set_pstate(void *freq_data)
> > +{
> > + unsigned long val;
> > + unsigned long pstate_ul =
> > + ((struct powernv_smp_call_data *) freq_data)->pstate_id;
> > +
> > + val = get_pmspr(SPRN_PMCR);
> > + val = val & 0x0000ffffffffffffULL;
> > +
> > + pstate_ul = pstate_ul & 0xFF;
> > +
> > + /* Set both global(bits 56..63) and local(bits 48..55) PStates */
> > + val = val | (pstate_ul << 56) | (pstate_ul << 48);
> > +
> > + pr_debug("Setting cpu %d pmcr to %016lX\n", smp_processor_id(), val);
>
> s/smp_processor_id/raw_smp_processor_id ? At other places as well.
Even this function is called via smp_call_function(). So, we should
have preempt disabled.
>
> > + set_pmspr(SPRN_PMCR, val);
> > +}
> > +
> > +/*
> > + * powernv_set_freq: Sets the frequency corresponding to the cpufreq
> > + * table entry indexed by new_index on the cpus in the mask 'cpus'
>
> Rafael doesn't like CPUs to be written as cpus.. I got this comment long
> back :) (Applicable only for comments and logs)
Ah, ok. Will fix this.
>
> > + */
> > +static int powernv_set_freq(cpumask_var_t cpus, unsigned int new_index)
>
> Why do you want to keep this routine separately? Why not have a single routine
> powernv_cpufreq_target_index() ?
>
> > +{
> > + struct powernv_smp_call_data freq_data;
> > +
> > + freq_data.pstate_id = powernv_pstate_ids[new_index];
> > +
> > + /*
> > + * Use smp_call_function to send IPI and execute the
> > + * mtspr on target cpu. We could do that without IPI
> > + * if current CPU is within policy->cpus (core)
> > + */
> > + smp_call_function_any(cpus, set_pstate, &freq_data, 1);
>
> Not sure how smp_call_function_any() behaves but wouldn't it be
> a good optimization if you can check if raw_smp_processor_id()
> returns one of the CPUs from 'cpus'? And in that case don't
> shoot an IPI.
That's what smp_call_function_any() does.
>
> Same for the get part as well.
>
> But then I looked at the implementation of these routines and found
> they already have this optimization in :) .. But with overhead of few
> locks and disable_preempt() :(
>
Well, locks are taken when we are not running on this_cpu() so that we
can issue an IPI (or use some other optimized communication mechanism)
for executing set_pstate on one of the cpus in cpus. So this overhead
should be acceptable.
On the other hand if we are running on this_cpu(), we simply go ahead
and execute call which is what you're suggesting we do.
> > +}
> > +
> > +static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > + int base, i;
> > +
> > + base = cpu_first_thread_sibling(policy->cpu);
> > +
> > + for (i = 0; i < threads_per_core; i++)
> > + cpumask_set_cpu(base + i, policy->cpus);
> > +
> > + policy->cpuinfo.transition_latency = 25000;
> > + policy->cur = powernv_freqs[0].frequency;
>
> How can you be so sure? Also clock is doing this just after calling init()
> and so you can just remove it :)
You mean s/clock/core ?
You're right, the core sets policy->cur by invoking driver->get()
which in our case will read it from the PMSR and initialize it with
the correct value. So these lines can be removed.
>
> > + return cpufreq_table_validate_and_show(policy, powernv_freqs);
> > +}
> > +
> > +static int powernv_cpufreq_verify(struct cpufreq_policy *policy)
> > +{
> > + return cpufreq_generic_frequency_table_verify(policy);
> > +}
> > +
> > +static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
> > + unsigned int new_index)
> > +{
> > + int rc;
> > +
> > + lock_core_freq(policy->cpu);
> > + rc = powernv_set_freq(policy->cpus, new_index);
> > + unlock_core_freq(policy->cpu);
> > +
> > + return rc;
> > +}
> > +
> > +static struct cpufreq_driver powernv_cpufreq_driver = {
> > + .name = "powernv-cpufreq",
> > + .flags = CPUFREQ_CONST_LOOPS,
> > + .init = powernv_cpufreq_cpu_init,
> > + .verify = powernv_cpufreq_verify,
>
> Can do this instead:
> + .verify = cpufreq_generic_frequency_table_verify,
Ok.
>
Thanks for a detailed review again.
I'll send out the updated patch today incorporating the comments. It
shouldn't be hard since most of the comments do not affect the
functionality.
--
Thanks and Regards
gautham.
More information about the Linuxppc-dev
mailing list