[Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Arnd Bergmann arnd at arndb.de
Tue Jan 30 18:39:04 EST 2007


On Monday 29 January 2007 20:48, Maynard Johnson wrote:
> Subject: Add support to OProfile for profiling Cell BE SPUs
> 
> From: Maynard Johnson <maynardj at us.ibm.com>
> 
> This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
> to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
> was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
> code.
> 
> Signed-off-by: Carl Love <carll at us.ibm.com>
> Signed-off-by: Maynard Johnson <mpjohn at us.ibm.com>

I can't really say much about the common oprofile files that you are
touching; maybe someone from oprofile-list (Philippe?) should look over
them and ack/nack them.

> +#define number_of_online_nodes(nodes) {        \
> +        u32 cpu; u32 tmp;                      \
> +        nodes = 0;                             \
> +        for_each_online_cpu(cpu) {             \
> +                tmp = cbe_cpu_to_node(cpu) + 1;\
> +                if (tmp > nodes)               \
> +                        nodes++;               \
> +       }                                      \
> +}

I've been discussing with benh about a better way to do this. We should
change all the references to nodes and cpu numbers to something more
correct in the future, so we get rid of the assumption that each
numa node is a cell chip. It's probably best to leave your code as
is for now, but we need to remember this one when cleaning it up.

You should, however, convert this into an inline function
with the prototype

static inline int number_of_online_nodes(void)

instead of a macro.
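
Something along these lines would do (untested sketch, keeping your
counting logic exactly as-is):

	static inline int number_of_online_nodes(void)
	{
		u32 cpu;
		u32 tmp;
		int nodes = 0;

		for_each_online_cpu(cpu) {
			tmp = cbe_cpu_to_node(cpu) + 1;
			if (tmp > nodes)
				nodes++;
		}
		return nodes;
	}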

> +/* Defines used for sync_start */
> +#define SKIP_GENERIC_SYNC 0
> +#define SYNC_START_ERROR -1
> +#define DO_GENERIC_SYNC 1
> +
> +typedef struct vma_map
> +{
> +       struct vma_map *next;
> +       unsigned int vma;
> +       unsigned int size;
> +       unsigned int offset;
> +       unsigned int guard_ptr;
> +       unsigned int guard_val;
> +} vma_map_t;

please don't typedef structures.
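
That is, simply

	struct vma_map {
		struct vma_map *next;
		unsigned int vma;
		unsigned int size;
		unsigned int offset;
		unsigned int guard_ptr;
		unsigned int guard_val;
	};

and use 'struct vma_map *' in the prototypes below.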

> +/* The three functions below are for maintaining and accessing
> + * the vma-to-file offset map.
> + */
> +vma_map_t * create_vma_map(const struct spu * spu, u64 objectid);
> +unsigned int vma_map_lookup(vma_map_t *map, unsigned int vma,
> +                           const struct spu * aSpu);
> +void vma_map_free(struct vma_map *map);
> +
> +/*
> + * Entry point for SPU profiling.
> + * cycles_reset is the SPU_CYCLES count value specified by the user.
> + */
> +void start_spu_profiling(unsigned int cycles_reset);
> +
> +void stop_spu_profiling(void);
> +
> + 
> +/* add the necessary profiling hooks */
> +int spu_sync_start(void);
> +
> +/* remove the hooks */
> +int spu_sync_stop(void);
> + 
> +/* Record SPU program counter samples to the oprofile event buffer. */
> +void spu_sync_buffer(int spu_num, unsigned int * samples, 
> +                    int num_samples);
> +
> +#endif    // PR_UTIL_H 
> Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c
> ===================================================================
> --- /dev/null   1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c  2007-01-29 10:32:03.392788696 -0600
> @@ -0,0 +1,204 @@
> +/*
> + * Cell Broadband Engine OProfile Support
> + *
> + * (C) Copyright IBM Corporation 2006
> + *
> + * Authors: Maynard Johnson <maynardj at us.ibm.com>
> + *          Carl Love <carll at us.ibm.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/hrtimer.h>
> +#include <linux/smp.h>
> +#include <linux/slab.h>
> +#include <asm/cell-pmu.h>
> +#include "pr_util.h"
> +
> +#define TRACE_ARRAY_SIZE 1024
> +static u32 * samples;
> +static u32 * samples_per_node;
> +
> +static int spu_prof_running = 0;
> +static unsigned int profiling_interval = 0;
> +
> +extern int num_nodes;
> +extern unsigned int khzfreq;

You really can't have global variables with such generic names. For
static variables, it's less of a problem, since they are not in the
same name space, but for easier debugging, it's good to always have
the name of your module (e.g. spu_prof_) as a prefix to the identifier.

Of course, the best way would be to avoid global and static variables
entirely, but that's not always possible.

> +/*
> + * Oprofile setup functions
> + */
> +
> +#define NUM_SPU_BITS_TRBUF 16
> +#define SPUS_PER_TB_ENTRY   4
> +#define SPUS_PER_NODE       8
> +
> +/* 
> + * Collect the SPU program counter samples from the trace buffer.
> + * The global variable usage is as follows:
> + *    samples[<total-spus>][TRACE_ARRAY_SIZE] - array to store SPU PC samples
> + *           Assumption, the array will be all zeros on entry.
> + *    u32 samples_per_node[num_nodes] - array of how many valid samples per node
> + */
> +static void cell_spu_pc_collection(void)
> +{
> +       int cpu;
> +       int node;
> +       int spu;
> +       u32 trace_addr;
> +        /* the trace buffer is 128 bits */
> +       u64 trace_buffer[2];
> +       u64 spu_pc_lower;  
> +       u64 spu_pc_upper;
> +       u64 spu_mask;
> +       int entry, node_factor;
> +       // process the collected SPU PC for each node
> +       for_each_online_cpu(cpu) {
> +               if (cbe_get_hw_thread_id(cpu))
> +                       continue;
> +
> +               node = cbe_cpu_to_node(cpu);
> +               node_factor = node * SPUS_PER_NODE;
> +                /* number of valid entries for this node */
> +               entry = 0;
> +
> +               trace_addr = cbe_read_pm(cpu, trace_address);
> +               while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
> +               {
> +                       /* there is data in the trace buffer to process */
> +                       cbe_read_trace_buffer(cpu, trace_buffer);  
> +                       spu_mask = 0xFFFF000000000000;
> +
> +                       /* Each SPU PC is 16 bits; hence, four spus in each of 
> +                        * the two 64-bit buffer entries that make up the
> +                        * 128-bit trace_buffer entry.  Process the upper and
> +                        * lower 64-bit values simultaneously.
> +                        */
> +                       for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
> +                               spu_pc_lower = spu_mask & trace_buffer[0];
> +                               spu_pc_lower = spu_pc_lower >> (NUM_SPU_BITS_TRBUF
> +                                                               * (SPUS_PER_TB_ENTRY-spu-1));
> +
> +                               spu_pc_upper = spu_mask & trace_buffer[1];
> +                               spu_pc_upper = spu_pc_upper >> (NUM_SPU_BITS_TRBUF
> +                                                               * (SPUS_PER_TB_ENTRY-spu-1));
> +
> +                               spu_mask = spu_mask >> NUM_SPU_BITS_TRBUF;
> +
> +                               /* spu PC trace entry is upper 16 bits of the
> +                                * 18 bit SPU program counter 
> +                                */
> +                               spu_pc_lower = spu_pc_lower << 2;
> +                               spu_pc_upper = spu_pc_upper << 2;
> +
> +                               samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
> +                                       = (u32) spu_pc_lower;
> +                               samples[((node_factor + spu + SPUS_PER_TB_ENTRY) * TRACE_ARRAY_SIZE) + entry]
> +                                       = (u32) spu_pc_upper;
> +                       }
> +
> +                       entry++;
> +
> +                       if (entry >= TRACE_ARRAY_SIZE) 
> +                               /* spu_samples is full */
> +                               break;
> +
> +                       trace_addr = cbe_read_pm(cpu, trace_address);
> +               }
> +               samples_per_node[node] = entry;
> +       }
> +}

While I can't see anything technically wrong with this function, it would be
good to split it into smaller functions. Since you are nesting three
loops, it should be possible to make a separate function from one of the
inner loops without changing the actual logic behind it.
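
For example, the body of the while loop could become a helper of its
own, something like this (name made up, untested, logic unchanged):

	static void spu_pc_extract(u64 trace_buffer[2], int node_factor,
				   int entry)
	{
		u64 spu_mask = 0xFFFF000000000000ULL;
		u64 spu_pc_lower, spu_pc_upper;
		int spu;

		for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
			spu_pc_lower = (spu_mask & trace_buffer[0]) >>
				(NUM_SPU_BITS_TRBUF *
				 (SPUS_PER_TB_ENTRY - spu - 1));
			spu_pc_upper = (spu_mask & trace_buffer[1]) >>
				(NUM_SPU_BITS_TRBUF *
				 (SPUS_PER_TB_ENTRY - spu - 1));
			spu_mask >>= NUM_SPU_BITS_TRBUF;

			/* the trace entry holds the upper 16 bits of
			 * the 18-bit SPU program counter */
			samples[(node_factor + spu) * TRACE_ARRAY_SIZE
					+ entry] = (u32)(spu_pc_lower << 2);
			samples[(node_factor + spu + SPUS_PER_TB_ENTRY)
					* TRACE_ARRAY_SIZE + entry]
					= (u32)(spu_pc_upper << 2);
		}
	}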

> +
> +static int profile_spus(struct hrtimer * timer)
> +{
> +       ktime_t kt;
> +        int cpu, node, k, num_samples, spu_num;

whitespace damage

> +       
> +       if (!spu_prof_running)
> +               goto STOP;
> +
> +       cell_spu_pc_collection();
> +       for_each_online_cpu(cpu) {
> +               if (cbe_get_hw_thread_id(cpu))
> +                       continue;

Here, you enter the same top-level loop again, why not make it
	for_each_online_cpu(cpu) {
		if (cbe_get_hw_thread_id(cpu))
			continue;
		num_samples = cell_spu_pc_collection(cpu);
		...

> +       kt = ktime_set(0, profiling_interval);
> +       if (!spu_prof_running)
> +               goto STOP;
> +       hrtimer_forward(timer, timer->base->get_time(), kt);
> +       return HRTIMER_RESTART;

is hrtimer_forward really the right interface here? You are ignoring
the number of overruns anyway, so hrtimer_start(,,) sounds more
correct to me.

> +
> + STOP:

labels should be lowercase.

> +       printk(KERN_INFO "SPU_PROF: spu-prof timer ending\n");
> +       return HRTIMER_NORESTART;
> +}

> +void start_spu_profiling(unsigned int cycles_reset) {
> +
> +       ktime_t kt;
> +
> +        /* To calculate a timeout in nanoseconds, the basic
> +        * formula is ns = cycles_reset * (NSEC_PER_SEC / cpu frequency).
> +        * To avoid floating point math, we use the scale math
> +        * technique as described in linux/jiffies.h.  We use
> +        * a scale factor of SCALE_SHIFT,which provides 4 decimal places
> +        * of precision, which is close enough for the purpose at hand.
> +        */
> +
> +       /* Since cpufreq_quick_get returns frequency in kHz, we use
> +        * USEC_PER_SEC here vs NSEC_PER_SEC.
> +        */
> +       unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
> +       profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
> +       
> +       pr_debug("timer resolution: %lu\n", 
> +                TICK_NSEC);

Don't you need to adapt the profiling_interval at run time, when cpufreq
changes the core frequency? You should probably use
cpufreq_register_notifier() to update this.
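
Something like this might do (untested; assumes that updating khzfreq
and recomputing profiling_interval on a POSTCHANGE event is enough):

	#include <linux/cpufreq.h>

	static int spu_prof_cpufreq_notify(struct notifier_block *nb,
					   unsigned long val, void *data)
	{
		struct cpufreq_freqs *freqs = data;

		if (val == CPUFREQ_POSTCHANGE) {
			/* freqs->new is in kHz, like cpufreq_quick_get() */
			khzfreq = freqs->new;
			/* recompute profiling_interval here as well */
		}
		return 0;
	}

	static struct notifier_block spu_prof_cpufreq_nb = {
		.notifier_call = spu_prof_cpufreq_notify,
	};

	/* at init time */
	cpufreq_register_notifier(&spu_prof_cpufreq_nb,
				  CPUFREQ_TRANSITION_NOTIFIER);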

> +       kt = ktime_set(0, profiling_interval);
> +       hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_REL);
> +       timer.expires = kt;
> +       timer.function = profile_spus;
> +
> +        /* Allocate arrays for collecting SPU PC samples */
> +       samples = (u32 *) kzalloc(num_nodes * SPUS_PER_NODE * TRACE_ARRAY_SIZE * sizeof(u32), GFP_ATOMIC);

Try to avoid atomic allocations. I don't think you are in an atomic
context here at all, so you can just use GFP_KERNEL.

> +       samples_per_node = (u32 *) kzalloc(num_nodes * sizeof(u32), GFP_ATOMIC);

Since MAX_NUMNODES is small, it's probably more efficient to just allocate this
statically.
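
E.g. just

	static u32 samples_per_node[MAX_NUMNODES];

and drop the kzalloc()/kfree() for it.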

> +
> +       spu_prof_running = 1;
> +       hrtimer_start(&timer, kt, HRTIMER_REL);
> +}
>
> +
> +void stop_spu_profiling(void) 
> +{
> +
> +       hrtimer_cancel(&timer);
> +       kfree(samples);
> +       kfree(samples_per_node);
> +       pr_debug("SPU_PROF: stop_spu_profiling issued\n");
> +       spu_prof_running = 0;
> +}

shouldn't you set spu_prof_running = 0 before doing any of the other things?
It looks to me like you could otherwise get into a use-after-free
situation. If I'm wrong with that, you probably don't need spu_prof_running
at all ;-)

> +/* Container for caching information about an active SPU task.
> + *   :map -- pointer to a list of vma_maps
> + *   :spu -- the spu for this active SPU task
> + *   :list -- potentially could be used to contain the cached_infos
> + *            for inactive SPU tasks.

Documenting structures is good, but please use the common kerneldoc format
for it. There are a number of examples for this in include/linux/
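
E.g. something like:

	/**
	 * struct cached_info - cached information for an active SPU task
	 * @map:	pointer to a list of vma_maps
	 * @the_spu:	the spu for this active SPU task
	 * @cache_ref:	kref controlling this cached_info's lifetime
	 * @list:	could hold the cached_infos for inactive SPU tasks
	 */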

> + * 
> + * Ideally, we would like to be able to create the cached_info for
> + * an SPU task just one time -- when libspe first loads the SPU 
> + * binary file.  We would store the cached_info in a list.  Then, as
> + * SPU tasks are switched out and new ones switched in, the cached_info
> + * for inactive tasks would be kept, and the active one would be placed
> + * at the head of the list.  But this technique may not work with
> + * current spufs functionality since the spu used in bind_context may
> + * be a different spu than was used in a previous bind_context for a
> + * reactivated SPU task.  Additionally, a reactivated SPU task may be
> + * assigned to run on a different physical SPE.  We will investigate
> + * further if this can be done.
> + *
> + */

You should stuff a pointer to cached_info into struct spu_context,
e.g. 'void *profile_private'.
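
E.g. (illustrative only):

	/* in struct spu_context */
	void *profile_private;	/* cached_info, owned by oprofile */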

> +struct cached_info {
> +       vma_map_t * map;
> +       struct spu * the_spu;
> +       struct kref cache_ref;
> +       struct list_head list;
> +};

And replace the 'the_spu' member with a back pointer to the
spu_context if you need it.

> +
> +/* A data structure for cached information about active SPU tasks.
> + * Storage is dynamically allocated, sized as
> + * "number of active nodes multplied by 8". 
> + * The info_list[n] member holds 0 or more 
> + * 'struct cached_info' objects for SPU#=n. 
> + *
> + * As currently implemented, there will only ever be one cached_info 
> + * in the list for a given SPU.  If we can devise a way to maintain
> + * multiple cached_infos in our list, then it would make sense
> + * to also cache the dcookie representing the PPU task application.
> + * See above description of struct cached_info for more details.
> + */
> +struct spu_info_stacks {
> +       struct list_head * info_list;
> +};

Why do you store pointers to list_head structures? If you want to store
lists, you should have a list_head itself in here.

Why do you store them per spu in the first place? The physical spu
doesn't have any relevance to this at all; the only data that is
per spu is the sample data collected on a profiling interrupt,
which you can then copy into the per-context data on a context switch.


> +/* Looks for cached info for the passed spu.  If not found, the
> + * cached info is created for the passed spu.
> + * Returns 0 for success; otherwise, -1 for error.  
> + */ 
> +static int
> +prepare_cached_spu_info(struct spu * spu, unsigned int objectId)
> +{

see above, this should get the spu_context pointer as its argument,
not the spu.

> +       vma_map_t * new_map;
> +       unsigned long flags = 0;
> +       int retval = 0;
> +       /* spu->number is a system-wide value, not a per-node value. */
> +       struct cached_info * info = get_cached_info(spu->number);
> +       if (info == NULL) {

if you invert the logic to

	if (info)
		goto out;

then the bulk of your function doesn't need to get indented,
which helps readability.
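
I.e. roughly:

	static int
	prepare_cached_spu_info(struct spu *spu, unsigned int objectId)
	{
		struct cached_info *info = get_cached_info(spu->number);
		int retval = 0;

		if (info) {
			/* just checking whether we have it */
			put_cached_info(info);
			pr_debug("Found cached SPU info.\n");
			goto out;
		}

		/* create cached_info and add it to the list for SPU #<n> */
		...
	out:
		return retval;
	}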

> +               /* create cached_info and add it to the list for SPU #<n>.*/
> +               info = kzalloc(sizeof(struct cached_info), GFP_ATOMIC);

GFP_KERNEL

> +               if (!info) {
> +                       printk(KERN_ERR "SPU_PROF: "
> +                              "%s, line %d: create vma_map failed\n",
> +                              __FUNCTION__, __LINE__);
> +                       goto ERR_ALLOC;
> +               }
> +               new_map = create_vma_map(spu, objectId);
> +               if (!new_map) {
> +                       printk(KERN_ERR "SPU_PROF: "
> +                              "%s, line %d: create vma_map failed\n",
> +                              __FUNCTION__, __LINE__);
> +                       goto ERR_ALLOC;
> +               }
> +
> +               pr_debug("Created vma_map\n");
> +               info->map = new_map;
> +               info->the_spu = spu;
> +               kref_init(&info->cache_ref);
> +               spin_lock_irqsave(&cache_lock, flags);
> +               list_add(&info->list, &spu_info->info_list[spu->number]);
> +               spin_unlock_irqrestore(&cache_lock, flags);
> +               goto OUT;
> +       } else {
> +        /* Immediately put back reference to cached_info since we don't
> +        * really need it -- just checking whether we have it.
> +        */
> +               put_cached_info(info);
> +               pr_debug("Found cached SPU info.\n");
> +       }
> +       
> +ERR_ALLOC:
> +       retval = -1;
> +OUT:
> +       return retval;
> +}

> +/* Look up the dcookie for the task's first VM_EXECUTABLE mapping,
> + * which corresponds loosely to "application name". Also, determine
> + * the offset for the SPU ELF object.  If computed offset is 
> + * non-zero, it implies an embedded SPU object; otherwise, it's a
> + * separate SPU binary, in which case we retrieve its dcookie.
> + */
> +static unsigned long 
> +get_exec_dcookie_and_offset(
> +       struct spu * spu, unsigned int * offsetp,
> +       unsigned long * spu_bin_dcookie,
> +       unsigned int spu_ref)
> +{
> +        unsigned long cookie = 0;
> +       unsigned int my_offset = 0;
> +        struct vm_area_struct * vma;
> +       struct mm_struct * mm = spu->mm;

indenting

> +        if (!mm)
> +                goto OUT;
> +
> +        for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +                if (!vma->vm_file)
> +                        continue;
> +                if (!(vma->vm_flags & VM_EXECUTABLE))
> +                        continue;
> +                cookie = fast_get_dcookie(vma->vm_file->f_dentry,
> +                        vma->vm_file->f_vfsmnt);
> +               pr_debug("got dcookie for %s\n",
> +                         vma->vm_file->f_dentry->d_name.name);
> +                break;
> +        }
> +
> +       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +               if (vma->vm_start > spu_ref || vma->vm_end < spu_ref)
> +                       continue;
> +               my_offset = spu_ref - vma->vm_start;
> +               pr_debug("Found spu ELF at "
> +                         " %X for file %s\n", my_offset,
> +                         vma->vm_file->f_dentry->d_name.name);
> +               *offsetp = my_offset;
> +               if (my_offset == 0) {
> +                       if (!vma->vm_file) {
> +                               goto FAIL_NO_SPU_COOKIE;
> +                       }
> +                       *spu_bin_dcookie = fast_get_dcookie(
> +                               vma->vm_file->f_dentry,
> +                               vma->vm_file->f_vfsmnt);
> +                       pr_debug("got dcookie for %s\n",
> +                                 vma->vm_file->f_dentry->d_name.name);
> +               }
> +               break;
> +       }
> +       
> +OUT:
> +        return cookie;
> +
> +FAIL_NO_SPU_COOKIE:
> +       printk(KERN_ERR "SPU_PROF: "
> +              "%s, line %d: Cannot find dcookie for SPU binary\n",
> +              __FUNCTION__, __LINE__);
> +       goto OUT;
> +}
> +
> +
> +
> +/* This function finds or creates cached context information for the
> + * passed SPU and records SPU context information into the OProfile
> + * event buffer.
> + */
> +static int process_context_switch(struct spu * spu, unsigned int objectId)
> +{
> +       unsigned long flags = 0;
> +       int retval = 0;
> +       unsigned int offset = 0;
> +       unsigned long spu_cookie = 0, app_dcookie = 0;
> +       retval = prepare_cached_spu_info(spu, objectId);
> +       if (retval == -1) {
> +               goto OUT;
> +       }
> +        /* Get dcookie first because a mutex_lock is taken in that
> +        * code path, so interrupts must not be disabled.
> +        */
> +       app_dcookie = get_exec_dcookie_and_offset(spu, &offset, 
> +                                                 &spu_cookie, objectId);
> +
> +        /* Record context info in event buffer */
> +       spin_lock_irqsave(&buffer_lock, flags);
> +       add_event_entry(ESCAPE_CODE);
> +       add_event_entry(SPU_CTX_SWITCH_CODE);
> +       add_event_entry(spu->number);
> +       add_event_entry(spu->pid);
> +       add_event_entry(spu->tgid);
> +       add_event_entry(app_dcookie);
> +
> +       add_event_entry(ESCAPE_CODE);
> +       if (offset) {
> +         /* When offset is non-zero,  this means the SPU ELF was embedded;
> +          * otherwise, it was loaded from a separate binary file.  For the
> +          * embedded case, we record the offset of the SPU ELF into the PPU
> +          * executable; for the non-embedded case, we record a dcookie that
> +          * points to the location of the SPU binary that was loaded.
> +          */
> +               add_event_entry(SPU_OFFSET_CODE);
> +               add_event_entry(offset);
> +       } else {
> +               add_event_entry(SPU_COOKIE_CODE);
> +               add_event_entry(spu_cookie);
> +       }

I don't get it. What is the app_dcookie used for? If the spu binary is
embedded into a library, you are still missing the dcookie to that .so file,
because you return only an offset.

<nitpicking>

> +       unsigned long flags = 0;

no need to initialize

> +       struct spu * the_spu = (struct spu *) data;

no need for the cast

> +       pr_debug("SPU event notification arrived\n");
> +       if (val == 0){

if (!val)

> +       pr_debug("spu_sync_start -- running.\n");
> +OUT:

out:

> +       return ret;     
> +}

</nitpicking>



> @@ -480,7 +491,22 @@
>                struct op_system_config *sys, int num_ctrs)
>  {
>         int i, j, cpu;
> +       spu_cycle_reset = 0;
>  
> +       /* The cpufreq_quick_get function requires that cbe_cpufreq module
> +        * be loaded.  This function is not actually provided and exported
> +        * by cbe_cpufreq, but it relies on cbe_cpufreq to initialize kernel
> +        * data structures.  Since there's no way for depmod to realize
> +        * that our OProfile module depends on cbe_cpufreq, we currently
> +        * are letting the userspace tool, opcontrol, ensure that the
> +        * cbe_cpufreq module is loaded.
> +        */
> +       khzfreq = cpufreq_quick_get(smp_processor_id());

You should probably have a fallback in here in case the cpufreq module
is not loaded. There is a global variable ppc_proc_freq (in Hz) that
you can access.
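
E.g.:

	khzfreq = cpufreq_quick_get(smp_processor_id());
	if (!khzfreq)	/* cbe_cpufreq not loaded */
		khzfreq = ppc_proc_freq / 1000;	/* ppc_proc_freq is in Hz */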
>         ;
>  }
>  
> -static void cell_global_start(struct op_counter_config *ctr)
> +static int calculate_lfsr(int n)
> +{
> +#define size 24
> +       int i;
> +       unsigned int newlfsr0;
> +       unsigned int lfsr = 0xFFFFFF;
> +       unsigned int howmany = lfsr - n;
> +
> +       for (i = 2; i < howmany + 2; i++) {
> +               newlfsr0 = (((lfsr >> (size - 1 - 0)) & 1) ^
> +                           ((lfsr >> (size - 1 - 1)) & 1) ^
> +                           (((lfsr >> (size - 1 - 6)) & 1) ^
> +                            ((lfsr >> (size - 1 - 23)) & 1)));
> +
> +               lfsr >>= 1;
> +               lfsr = lfsr | (newlfsr0 << (size - 1));
> +       }
> +       return lfsr;
> +
> +}

I don't have the slightest idea what this code is about, but
it certainly looks inefficient to loop up to 16 million times to
compute a constant. Could you use a faster algorithm instead,
or at least add a comment about why you do it this way?
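
If nothing better comes to mind, even a precomputed table would bound
the loop, e.g. (untested sketch; trades 16kb of static data for at
most 4095 iterations per call, paying the full 16M steps once at init):

	#define LFSR_SIZE	24
	#define LFSR_INTERVAL	(1 << 12) /* one saved state per 4096 steps */
	static unsigned int lfsr_states[(1 << LFSR_SIZE) / LFSR_INTERVAL];

	static unsigned int lfsr_step(unsigned int lfsr)
	{
		unsigned int bit =
			(((lfsr >> (LFSR_SIZE - 1 - 0)) & 1) ^
			 ((lfsr >> (LFSR_SIZE - 1 - 1)) & 1) ^
			 ((lfsr >> (LFSR_SIZE - 1 - 6)) & 1) ^
			 ((lfsr >> (LFSR_SIZE - 1 - 23)) & 1));
		return (lfsr >> 1) | (bit << (LFSR_SIZE - 1));
	}

	static void init_lfsr_states(void)	/* call once at init */
	{
		unsigned int lfsr = 0xFFFFFF;
		int i, j;

		for (i = 0; i < ARRAY_SIZE(lfsr_states); i++) {
			lfsr_states[i] = lfsr;
			for (j = 0; j < LFSR_INTERVAL; j++)
				lfsr = lfsr_step(lfsr);
		}
	}

	static int calculate_lfsr(int n)
	{
		unsigned int steps = 0xFFFFFF - n;
		unsigned int lfsr = lfsr_states[steps / LFSR_INTERVAL];
		unsigned int i;

		for (i = 0; i < steps % LFSR_INTERVAL; i++)
			lfsr = lfsr_step(lfsr);
		return lfsr;
	}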

> +static void cell_global_stop(void)
> +{
> +       if (spu_cycle_reset) {
> +               cell_global_stop_spu();
> +       } else {
> +               cell_global_stop_ppu();
> +       }
> +
> +}

This looks weird as well. I suppose it's a limitation of the hardware
that you can only do either ppu or spu profiling. However, making that
dependent on whether the 'spu_cycle_reset' variable is set sounds
rather bogus.

I don't know what the best interface for choosing the target from
user space would be, but you probably also want to be able to
switch between them at run time.

	Arnd <><

