[Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Milton Miller miltonm at bga.com
Fri Feb 9 02:48:26 EST 2007


This is part 2, with specific comments to the existing patch.
I'm not subscribed to the list, so please cc.

Thanks,
milton

miltonm at bga.com  Milton Miller



On Feb 6, 2007, at 5:02 PM, Carl Love wrote:

> This is the first update to the patch previously posted by Maynard
> Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL".
>
> This repost fixes the line wrap issue that Ben mentioned.  Also the 
> kref
> handling for the cached info has been fixed and simplified.
>
> There are still a few items from the comments being discussed
> specifically how to profile the dynamic code for the SPFS context 
> switch
> code and how to deal with dynamic code stubs for library support.  Our
> proposal is to assign the samples from the SPFS and dynamic library 
> code
> to an anonymous sample bucket.  The support for properly handling the
> symbol extraction in these cases would be deferred to a later SDK.
>
> There is also a bug in profiling overlay code that we are 
> investigating.
>
>
> Subject: Add support to OProfile for profiling Cell BE SPUs
>
> From: Maynard Johnson <maynardj at us.ibm.com>
>
> This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
> to add in the SPU profiling capabilities.  In addition, a 'cell' 
> subdirectory
> was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
> code.
>
> Signed-off-by: Carl Love <carll at us.ibm.com>
> Signed-off-by: Maynard Johnson <mpjohn at us.ibm.com>
>

> Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h	2007-02-03 
> 15:56:01.094856152 -0600
> @@ -0,0 +1,78 @@
> + /*
> + * Cell Broadband Engine OProfile Support
> + *
>

not sure where the name pr_util comes from ...
how about cell_profile or profile or cell_prof

> Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ 
> linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c	2007-02-05 
> 09:32:25.708937424 -0600
> @@ -0,0 +1,203 @@
> +/*
> + * Cell Broadband Engine OProfile Support
> + *
> + * (C) Copyright IBM Corporation 2006
> + *
> + * Authors: Maynard Johnson <maynardj at us.ibm.com>
> + *          Carl Love <carll at us.ibm.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/hrtimer.h>
> +#include <linux/smp.h>
> +#include <linux/slab.h>
> +#include <asm/cell-pmu.h>
> +#include <asm/time.h>
> +#include "pr_util.h"
> +
> +#define TRACE_ARRAY_SIZE 1024
> +#define SCALE_SHIFT 14
> +
> +static u32 * samples;
> +
> +static int spu_prof_running = 0;
> +static unsigned int profiling_interval = 0;
> +
> +extern int spu_prof_num_nodes;
> +
> +
> +#define NUM_SPU_BITS_TRBUF 16
> +#define SPUS_PER_TB_ENTRY   4
> +#define SPUS_PER_NODE       8
> +
> +void set_profiling_frequency(unsigned int freq_khz, unsigned int 
> cycles_reset)
> +{
> +	unsigned long nsPerCyc;
> +	if (!freq_khz)
> +		freq_khz = ppc_proc_freq/1000;
> +
> +        /* To calculate a timeout in nanoseconds, the basic
> +	 * formula is ns = cycles_reset * (NSEC_PER_SEC / cpu frequency).
> +	 * To avoid floating point math, we use the scale math
> +	 * technique as described in linux/jiffies.h.  We use
> +	 * a scale factor of SCALE_SHIFT,which provides 4 decimal places
> +	 * of precision, which is close enough for the purpose at hand.
> +	 */
> +
> +	nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/freq_khz;
> +	profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
> +
> +}

see first concept
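
For anyone checking the math, here is a small stand-alone demonstration of
the scale-shift technique the patch uses (the 3.2 GHz clock and the reset
count below are illustrative values only, not from the patch; USEC_PER_SEC
is 1000000 in the kernel):

#include <stdio.h>

#define SCALE_SHIFT 14

int main(void)
{
	unsigned long long usec_per_sec = 1000000ULL;
	unsigned long long freq_khz = 3200000;	/* assume a 3.2 GHz clock */
	unsigned long long cycles_reset = 100000;

	/* ns per cycle = 10^6 / freq_khz, kept scaled by 2^14 so the
	 * fractional part survives the integer division
	 */
	unsigned long long ns_per_cyc = (usec_per_sec << SCALE_SHIFT) / freq_khz;
	unsigned long long interval = (ns_per_cyc * cycles_reset) >> SCALE_SHIFT;

	printf("%llu ns between samples\n", interval);	/* prints 31250 */
	return 0;
}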

> +
> +/*
> + * Extract SPU PC from trace buffer entry
> + */
> +static void spu_pc_extract(int cpu, int entry)
> +{
> +        /* the trace buffer is 128 bits */
> +	u64 trace_buffer[2];
> +	u64 spu_pc_lower;
> +	u64 spu_pc_upper;
> +	u64 spu_mask;
> +	int spu;
> +	int node_factor;
> +	
> +	spu_mask = 0xFFFF;

why not a #define now

> +	node_factor = cbe_cpu_to_node(cpu) * SPUS_PER_NODE;

hmm... this appears to be the base index for the portion of the array 
used on this node.

> +	
> +	/* Each SPU PC is 16 bits; hence, four spus in each of
> +	 * the two 64-bit buffer entries that make up the
> +	 * 128-bit trace_buffer entry.  Process the upper and
> +	 * lower 64-bit values simultaneously.
> +	 * trace[0] SPU PC contents are: 0 1 2 3
> +	 * trace[1] SPU PC contents are: 4 5 6 7
> +	 */
> +
> +	cbe_read_trace_buffer(cpu, trace_buffer);
> +
> +	for (spu = SPUS_PER_TB_ENTRY-1; spu >= 0; spu--) {
> +		spu_pc_lower = spu_mask & trace_buffer[0];
> +		trace_buffer[0] = trace_buffer[0] >> NUM_SPU_BITS_TRBUF;
> +
> +		spu_pc_upper = spu_mask & trace_buffer[1];
> +		trace_buffer[1] = trace_buffer[1] >> NUM_SPU_BITS_TRBUF;

OK, I like the constant shift now.
[optional] I would probably group the shift-rights together, and place
them either lower in the loop or as the third clause of the for().

> +		
> +		/* spu PC trace entry is upper 16 bits of the
> +		 * 18 bit SPU program counter
> +		 */
> +		spu_pc_lower = spu_pc_lower << 2;
> +		spu_pc_upper = spu_pc_upper << 2;
> +		
> +		samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
> +			= (u32) spu_pc_lower;
> +		samples[((node_factor + spu + SPUS_PER_TB_ENTRY)
> +			 * TRACE_ARRAY_SIZE) + entry] = (u32) spu_pc_upper;

What is all that calculation?

Don't cast the value to u32 when assigning it to the array.   For that
matter, spu_pc* should just be u32; the value is already small in this
patch (gcc should not complain when the value is ANDed with 0xffff, and
a shift of 2 will easily fit).

You are doing a lot of math here to calculate where to put
this one node's worth of data in the array.   It really wants
to be multi-dimensional, and should probably only contain one node's
worth of data.

so samples[spu][entry] = spu_pc_lower;
   samples[spu + SPUS_PER_TB_ENTRY][entry] = spu_pc_upper;

hmm... upper and lower seem to be confused here ... the
data is naturally big endian at the source, so upper should
be 0 and lower 1 .... or just make that spu_pc[2] and use 0
and 1 ... yes, a for (j = 0; j < WORDS_PER_ENTRY) loop
assigning to samples[spu + j * spus_per_word][entry]; the
compiler should unroll a fixed count of 2.  Something like
the sketch below.
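
A minimal sketch of that restructuring, combining the two suggestions
above (the node_samples parameter and the WORDS_PER_ENTRY name are mine,
not from the patch; the caller would pass in its own node's buffer):

#define WORDS_PER_ENTRY 2	/* two 64-bit words per 128-bit trace entry */

static void spu_pc_extract(int cpu, int entry,
			   u32 node_samples[SPUS_PER_NODE][TRACE_ARRAY_SIZE])
{
	u64 trace_buffer[WORDS_PER_ENTRY];
	u32 spu_pc;
	int spu, j;

	cbe_read_trace_buffer(cpu, trace_buffer);

	for (j = 0; j < WORDS_PER_ENTRY; j++) {
		/* word 0 holds SPUs 0-3, word 1 holds SPUs 4-7; the low
		 * 16 bits of each word are the last SPU of its group.
		 */
		for (spu = SPUS_PER_TB_ENTRY - 1; spu >= 0;
		     spu--, trace_buffer[j] >>= NUM_SPU_BITS_TRBUF) {
			/* trace holds the upper 16 of the 18-bit SPU PC */
			spu_pc = (trace_buffer[j] & 0xFFFF) << 2;
			node_samples[spu + j * SPUS_PER_TB_ENTRY][entry] = spu_pc;
		}
	}
}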



> +	}
> +}
> +
> +static int cell_spu_pc_collection(int cpu)
> +{
> +	u32 trace_addr;
> +	int entry;
> +
> +	/* process the collected SPU PC for the node */
> +
> +	entry = 0;
> +
> +	trace_addr = cbe_read_pm(cpu, trace_address);
> +	while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)

Where did the 0x400 come from?
Should this be (!(addr & EMPTY))?

> +	{
> +		/* there is data in the trace buffer to process */
> +		spu_pc_extract(cpu, entry);
> +
> +		entry++;
> +
> +		if (entry >= TRACE_ARRAY_SIZE)
> +			/* spu_samples is full */
> +			break;
> +
> +		trace_addr = cbe_read_pm(cpu, trace_address);
> +	}
> +	return(entry);

As I said the last time, I think this would be clearer:

	for (entry = 0; entry < TRACE_ARRAY_SIZE; entry++) {
		trace_addr = cbe_read_pm(cpu, trace_address);
		if (trace_addr & CBE_PM_TRACE_BUF_EMPTY)
			break;
		spu_pc_extract(cpu, entry, node_samples);
	}



> +}
> +

Also, Documentation/CodingStyle:
the opening brace goes on the same line as for / if / while;
no parentheses around the return value -- return is not a function.

> +
> +static int profile_spus(struct hrtimer * timer)
> +{
> +	ktime_t kt;
> +	int cpu, node, k, num_samples, spu_num;
> +	
> +	if (!spu_prof_running)
> +		goto stop;
> +
> +	for_each_online_cpu(cpu) {
> +		if (cbe_get_hw_thread_id(cpu))
> +			continue;
> +
> +		node = cbe_cpu_to_node(cpu);
> +
> +		num_samples = cell_spu_pc_collection(cpu);
> +
> +		if (num_samples == 0)
> +			continue;
> +
> +		for (k = 0; k < SPUS_PER_NODE; k++) {
> +			spu_num = k + (node * SPUS_PER_NODE);
> +			spu_sync_buffer(spu_num,
> +					samples + (spu_num * TRACE_ARRAY_SIZE),
> +					num_samples);
> +		}
> +	}

Looking here, you only process the entries in one cpu's partition
of the array at a time.   Seems like an opportunity to cut
the size of the array by NR_CPUS.

Also, please add some comments about the locking rules.   I doubt
the hardware would like two threads reading the trace,
and the sample buffer is global in your code.
[I see there are some locks in the task switch file]

> +	smp_wmb();
> +
> +	kt = ktime_set(0, profiling_interval);
> +	if (!spu_prof_running)
> +		goto stop;
> +	hrtimer_forward(timer, timer->base->get_time(), kt);
> +	return HRTIMER_RESTART;
> +
> + stop:
> +	printk(KERN_INFO "SPU_PROF: spu-prof timer ending\n");
> +	return HRTIMER_NORESTART;
> +}
>

[I don't know the hrtimer code well enough to comment; omitted]

> Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_task_sync.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ 
> linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_task_sync.c	2007-02-06 
> 16:43:27.832908640 -0600
> @@ -0,0 +1,425 @@
> +/*
> + * Cell Broadband Engine OProfile Support
> + *
> + * (C) Copyright IBM Corporation 2006
> + *
> + * Author: Maynard Johnson <maynardj at us.ibm.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +/* The purpose of this file is to handle SPU event task switching
> + * and to record SPU context information into the OProfile
> + * event buffer.
> + *
> + * Additionally, the spu_sync_buffer function is provided as a helper
> + * for recoding actual SPU program counter samples to the event 
> buffer.
> + */
>

The split of this file seems a bit strained.   I think I would move
the context caching and refcounting to the vma_map file.

I would also move the dcookie generation to that file.

I guess you have the actual glue for OProfile recording of task switches
here, so sending the actual sample stream from here isn't too bad a
split.  But I would not have looked for it in sync_task.

After the above split, rename this to profile_stream or something?
Yes, the task interface is still central.

[cut]
>
> +/*
> + * NOTE:  The caller is responsible for locking the
> + *	  cache_lock prior to calling this function.
> + */
> +static int release_cached_info(int spu_index)
> +{
> +	int index, end;
> +	if (spu_index == RELEASE_ALL) {
> +		end = num_spu_nodes;
> +		index = 0;
> +	} else {
> +	        if (spu_index >= num_spu_nodes) {
> +        	        printk(KERN_ERR "SPU_PROF: "
> +			       "%s, line %d: Invalid index %d into spu info cache\n",
> +               	               __FUNCTION__, __LINE__, spu_index);
> +	                goto out;
> +	        }

spaces vs tabs (use tabs)

> +		end = spu_index +1;
> +		index = spu_index;
> +	}
> +	for (; index < end; index++) {
> +		if (spu_info[index]) {
> +			kref_put(&spu_info[index]->cache_ref, destroy_cached_info);
> +			spu_info[index] = NULL;
> +		}
> +	}
> +
> +out:
> +	return 0;
> +}
>

[cut]

> +/* Look up the dcookie for the task's first VM_EXECUTABLE mapping,
> + * which corresponds loosely to "application name". Also, determine
> + * the offset for the SPU ELF object.  If computed offset is
> + * non-zero, it implies an embedded SPU object; otherwise, it's a
> + * separate SPU binary, in which case we retrieve it's dcookie.
> + */
> +static unsigned long
> +get_exec_dcookie_and_offset(
> +	struct spu * spu, unsigned int * offsetp,
> +	unsigned long * spu_bin_dcookie,
> +	unsigned int spu_ref)

closer to 80 chars / line please

> +{
> +	unsigned long cookie = 0;
> +	unsigned int my_offset = 0;
> +	struct vm_area_struct * vma;
> +	struct mm_struct * mm = spu->mm;
> +
> +	if (!mm)
> +		goto out;
> +
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		if (!vma->vm_file)
> +			continue;
> +		if (!(vma->vm_flags & VM_EXECUTABLE))
> +			continue;
> +		cookie = fast_get_dcookie(vma->vm_file->f_dentry,
> +					  vma->vm_file->f_vfsmnt);
> +		pr_debug("got dcookie for %s\n",
> +			 vma->vm_file->f_dentry->d_name.name);
> +		break;
> +	}
> +
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		if (vma->vm_start > spu_ref || vma->vm_end < spu_ref)
> +			continue;
> +		my_offset = spu_ref - vma->vm_start;
> +		pr_debug("Found spu ELF at "
> +			 " %X for file %s\n", my_offset,
> +			 vma->vm_file->f_dentry->d_name.name);
> +		*offsetp = my_offset;

I admit my browser is whitespace impaired, but does that
string really need to be split?

> +		if (my_offset == 0) {
> +			if (!vma->vm_file) {
> +				goto fail_no_spu_cookie;
> +			}
> +			*spu_bin_dcookie = fast_get_dcookie(
> +				vma->vm_file->f_dentry,
> +				vma->vm_file->f_vfsmnt);
> +			pr_debug("got dcookie for %s\n",
> +				 vma->vm_file->f_dentry->d_name.name);
> +		}
> +		break;			
> +	}
> +	
> + out:
> +	return cookie;
> +
> + fail_no_spu_cookie:
> +	printk(KERN_ERR "SPU_PROF: "
> +	       "%s, line %d: Cannot find dcookie for SPU binary\n",
> +	       __FUNCTION__, __LINE__);
> +	goto out;
>

spaces

> +}
> +
> +
> +
> +/* This function finds or creates cached context information for the
> + * passed SPU and records SPU context information into the OProfile
> + * event buffer.
> + */
> +static int process_context_switch(struct spu * spu, unsigned int 
> objectId)
> +{
> +	unsigned long flags;
> +	int retval = 0;
> +	unsigned int offset = 0;
> +	unsigned long spu_cookie = 0, app_dcookie = 0;
> +	retval = prepare_cached_spu_info(spu, objectId);
> +	if (retval == -1) {
> +		goto out;
> +	}
> +        /* Get dcookie first because a mutex_lock is taken in that

spaces

> +	 * code path, so interrupts must not be disabled.
> +	 */
> +	app_dcookie = get_exec_dcookie_and_offset(spu, &offset,
> +						  &spu_cookie, objectId);
> +
> +        /* Record context info in event buffer */

spaces

> +	spin_lock_irqsave(&buffer_lock, flags);
> +	add_event_entry(ESCAPE_CODE);
> +	add_event_entry(SPU_CTX_SWITCH_CODE);
> +	add_event_entry(spu->number);
> +	add_event_entry(spu->pid);
> +	add_event_entry(spu->tgid);
> +	add_event_entry(app_dcookie);
> +
> +	add_event_entry(ESCAPE_CODE);
> +	if (offset) {
> +	  /* When offset is non-zero,  this means the SPU ELF was embedded;
> +	   * otherwise, it was loaded from a separate binary file.  For the
> +	   * embedded case, we record the offset of the SPU ELF into the PPU
> +	   * executable; for the non-embedded case, we record a dcookie that
> +	   * points to the location of the SPU binary that was loaded.
> +	   */
> +		add_event_entry(SPU_OFFSET_CODE);
> +		add_event_entry(offset);
> +	} else {
> +		add_event_entry(SPU_COOKIE_CODE);
> +		add_event_entry(spu_cookie);
> +	


Why should these be inherently exclusive?

Why not compare the spu dcookie with app_cookie to decide whether to
include it, and record the offset whenever it is non-zero?

That way I can mmap an .ar of my spu programs, for instance.
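
A sketch of the non-exclusive recording (it assumes
get_exec_dcookie_and_offset is changed to always fetch the SPU dcookie,
and that the userspace side accepts both records):

	/* always record which file the SPU image came from ... */
	add_event_entry(ESCAPE_CODE);
	add_event_entry(SPU_COOKIE_CODE);
	add_event_entry(spu_cookie);	/* equals app_dcookie when embedded */

	/* ... and always record the offset into that file; it is simply
	 * zero for a standalone SPU binary mapped at its start.
	 */
	add_event_entry(ESCAPE_CODE);
	add_event_entry(SPU_OFFSET_CODE);
	add_event_entry(offset);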

> }
> +	spin_unlock_irqrestore(&buffer_lock, flags);
> +	smp_wmb();
> +out:
> +	return retval;
> +}
> +


> Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/vma_map.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/vma_map.c	2007-02-01 
> 17:21:46.944872872 -0600
> @@ -0,0 +1,229 @@
> + /*
> + * Cell Broadband Engine OProfile Support
> + *
> + * (C) Copyright IBM Corporation 2006
> + *
> + * Author: Maynard Johnson <maynardj at us.ibm.com>
>

As Christoph alluded to, vma is very commonly used to refer
to a vm_area_struct.   This file appears to be about
translating an SPU address to the ELF file offset
it came from.   How about map_saddr (for SPU address)
or map_spu.c?


> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +/* The code in this source file is responsible for generating
> + * vma-to-fileOffset maps for both overlay and non-overlay SPU
> + * applications.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/elf.h>
> +#include "pr_util.h"
> +
> +
> +void vma_map_free(struct vma_to_fileoffset_map *map)
> +{
> +	while (map) {
> +		struct vma_to_fileoffset_map *next = map->next;
> +		kfree(map);
> +		map = next;
> +	}
> +}
> +
> +unsigned int
> +vma_map_lookup(struct vma_to_fileoffset_map *map, unsigned int vma,
> +	       const struct spu * aSpu)
> +{
> +	u32 offset = -1;
> +	u32 ovly_grd;

((u32) (-1)) is the value for not found?

> +	for (; map; map = map->next) {
> +		if (vma < map->vma || vma >= map->vma + map->size)
> +			continue;
> +
> +		if (map->guard_ptr) {
> +			ovly_grd = *(u32 *)(aSpu->local_store + map->guard_ptr);
> +			if (ovly_grd != map->guard_val)
> +				continue;
> +		}
> +		break;
> +	}
> +
> +	if (likely(map != NULL)) {
> +		offset = vma - map->vma + map->offset;
> +	}
> +	return offset;
> +}
> +

Rather than break and re-compare, just calculate the offset and return
from the for loop.  It's not like you have ref counts to worry about.
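
Something like this (the same logic as the quoted function, just
computing the offset and returning from inside the loop):

unsigned int
vma_map_lookup(struct vma_to_fileoffset_map *map, unsigned int vma,
	       const struct spu *aSpu)
{
	u32 ovly_grd;

	for (; map; map = map->next) {
		if (vma < map->vma || vma >= map->vma + map->size)
			continue;

		if (map->guard_ptr) {
			ovly_grd = *(u32 *)(aSpu->local_store + map->guard_ptr);
			if (ovly_grd != map->guard_val)
				continue;
		}
		return vma - map->vma + map->offset;
	}

	return (u32) -1;	/* not found, as in the original */
}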

> +static struct vma_to_fileoffset_map *
> +vma_map_add(struct vma_to_fileoffset_map *map, unsigned int vma,
> +	    unsigned int size, unsigned int offset, unsigned int guard_ptr,
> +	    unsigned int guard_val)
> +{
> +	struct vma_to_fileoffset_map *new = kzalloc(sizeof(struct 
> vma_to_fileoffset_map), GFP_KERNEL);
> +	if (!new) {
> +		printk(KERN_ERR "SPU_PROF: %s, line %d: malloc failed\n",
> +		       __FUNCTION__, __LINE__);
> +		vma_map_free(map);
> +		return NULL;
> +	}
> +
> +	new->next = map;
> +	new->vma = vma;
> +	new->size = size;
> +	new->offset = offset;
> +	new->guard_ptr = guard_ptr;
> +	new->guard_val = guard_val;
> +
> +	return new;
> +}
> +
> +
> +/* Parse SPE ELF header and generate a list of vma_maps.
> + * A pointer to the first vma_map in the generated list
> + * of vma_maps is returned.  */
> +struct vma_to_fileoffset_map * create_vma_map(const struct spu * aSpu,
> +					      unsigned long spu_elf_start)
> +{
> +	static const unsigned char expected[EI_PAD] = {
> +		[EI_MAG0] = ELFMAG0,
> +		[EI_MAG1] = ELFMAG1,
> +		[EI_MAG2] = ELFMAG2,
> +		[EI_MAG3] = ELFMAG3,
> +		[EI_CLASS] = ELFCLASS32,
> +		[EI_DATA] = ELFDATA2MSB,
> +		[EI_VERSION] = EV_CURRENT,
> +		[EI_OSABI] = ELFOSABI_NONE
> +	};
> +
> +	struct vma_to_fileoffset_map *map = NULL;
> +	unsigned int overlay_tbl_offset = -1;
> +	unsigned long phdr_start, shdr_start;
> +	Elf32_Ehdr ehdr;
> +	Elf32_Phdr phdr;
> +	Elf32_Shdr shdr, shdr_str;
> +	Elf32_Sym sym;
> +	int i, j;
> +	char name[32];
> +
> +	unsigned int ovly_table_sym = 0;
> +	unsigned int ovly_buf_table_sym = 0;
> +	unsigned int ovly_table_end_sym = 0;
> +	unsigned int ovly_buf_table_end_sym = 0;
> +	unsigned long ovly_table;
> +	unsigned int n_ovlys;
> +
> +	struct {
> +		unsigned int vma;
> +		unsigned int size;
> +		unsigned int offset;
> +		unsigned int buf;
> +	} ovly;


Shouldn't this struct be in asm/elf.h?  Or perhaps elf-spe.h?

> +
> +	/* Get and validate ELF header.  */
> +
> +	copy_from_user(&ehdr, (void *) spu_elf_start, sizeof (ehdr));
> +	if (memcmp(ehdr.e_ident, expected, EI_PAD) != 0) {
> +		printk(KERN_ERR "SPU_PROF: "
> +		       "%s, line %d: Unexpected value parsing SPU ELF\n",
> +		       __FUNCTION__, __LINE__);
> +		return NULL;
> +	}

We have WARN_ON for this sort of thing.
Also, giving the line number with a common message is less useful than
saying which field is miscomparing.
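
For example (a sketch only; the exact message wording is mine):

	if (ehdr.e_type != ET_EXEC) {
		printk(KERN_ERR "SPU_PROF: %s: SPU image is not ET_EXEC "
		       "(e_type %d)\n", __FUNCTION__, ehdr.e_type);
		return NULL;
	}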

> +	if (ehdr.e_machine != 23) {

Use the ELF header constant here, please.

> +		printk(KERN_ERR "SPU_PROF: "
> +		       "%s, line %d: Unexpected value parsing SPU ELF\n",
> +		       __FUNCTION__,  __LINE__);
> +
> +		return NULL;
> +	}
> +	if (ehdr.e_type != ET_EXEC) {
> +		printk(KERN_ERR "SPU_PROF: "
> +		       "%s, line %d: Unexpected value parsing SPU ELF\n",
> +		       __FUNCTION__, __LINE__);
> +		return NULL;
> +	}
> +	phdr_start = spu_elf_start + ehdr.e_phoff;
> +	shdr_start = spu_elf_start + ehdr.e_shoff;
> +
> +	/* Traverse program headers.  */
> +	for (i = 0; i < ehdr.e_phnum; i++) {
> +		copy_from_user(&phdr, (void *) (phdr_start + i * sizeof(phdr)),
> +			       sizeof(phdr));
> +		if (phdr.p_type != PT_LOAD)
> +			continue;
> +		if (phdr.p_flags & (1 << 27))
> +			continue;
>


All the copy_from_user returns must be checked.
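
For example, something along these lines for each call (sketch only):

	if (copy_from_user(&phdr, (void *) (phdr_start + i * sizeof(phdr)),
			   sizeof(phdr))) {
		printk(KERN_ERR "SPU_PROF: %s: copy_from_user failed\n",
		       __FUNCTION__);
		vma_map_free(map);
		return NULL;
	}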


> +
> +		map = vma_map_add(map, phdr.p_vaddr, phdr.p_memsz,
> +				  phdr.p_offset, 0, 0);
> +		if (!map)
> +			return NULL;
> +	}
> +
> +	pr_debug("SPU_PROF: Created non-overlay maps\n");	
> +	/* Traverse section table and search for overlay-related symbols.  */
> +	for (i = 0; i < ehdr.e_shnum; i++) {
> +		copy_from_user(&shdr, (void *) (shdr_start + i * sizeof(shdr)),
> +			       sizeof(shdr));
> +		if (shdr.sh_type != SHT_SYMTAB)
> +			continue;
> +		if (shdr.sh_entsize != sizeof (sym))
> +			continue;
> +
> +		copy_from_user(&shdr_str,
> +			       (void *) (shdr_start + shdr.sh_link * sizeof(shdr)),
> +			       sizeof(shdr));
> +		if (shdr_str.sh_type != SHT_STRTAB)
> +			return NULL;
> +
> +		for (j = 0; j < shdr.sh_size / sizeof (sym); j++) {
> +			copy_from_user(&sym, (void *) (spu_elf_start +
> +						       shdr.sh_offset + j * sizeof (sym)),
> +				       sizeof (sym));
> +			copy_from_user(name, (void *) (spu_elf_start + shdr_str.sh_offset +
> +						       sym.st_name),
> +				       20);
> +			if (memcmp(name, "_ovly_table", 12) == 0)
> +				ovly_table_sym = sym.st_value;
> +			if (memcmp(name, "_ovly_buf_table", 16) == 0)
> +				ovly_buf_table_sym = sym.st_value;
> +			if (memcmp(name, "_ovly_table_end", 16) == 0)
> +				ovly_table_end_sym = sym.st_value;
> +			if (memcmp(name, "_ovly_buf_table_end", 20) == 0)
> +				ovly_buf_table_end_sym = sym.st_value;
> +		}
> +	}
> +
> +	/* If we don't have overlays, we're done.  */
> +	if (ovly_table_sym == 0 || ovly_buf_table_sym == 0
> +	    || ovly_table_end_sym == 0 || ovly_buf_table_end_sym == 0) {
> +		pr_debug("SPU_PROF: No overlay table found\n");
> +		return map;
> +	}
> +	else {
> +		pr_debug("SPU_PROF: Overlay table found\n");
> +	}
> +
> +	overlay_tbl_offset = vma_map_lookup(map, ovly_table_sym, aSpu);
> +	if (overlay_tbl_offset < 0) {
> +		printk(KERN_ERR "SPU_PROF: "
> +		       "%s, line %d: Error finding SPU overlay table\n",
> +		       __FUNCTION__, __LINE__);
> +		return NULL;
> +	}
> +	ovly_table = spu_elf_start + overlay_tbl_offset;
> +	n_ovlys = (ovly_table_end_sym - ovly_table_sym) / sizeof (ovly);
> +
> +	/* Traverse overlay table.  */
> +	for (i = 0; i < n_ovlys; i++) {
> +		copy_from_user(&ovly, (void *) (ovly_table + i * sizeof (ovly)),
> +			       sizeof (ovly));
> +		map = vma_map_add(map, ovly.vma, ovly.size, ovly.offset,
> +				   ovly_buf_table_sym + (ovly.buf - 1) * 4, i + 1);
>

not sure how those arguments for vma_map_add were arrived at.

> +		if (!map)
> +			return NULL;
> +	}
> +	
> +	return map;
> +}
>

[cut]
>  config OPROFILE
>  	tristate "OProfile system profiling (EXPERIMENTAL)"
> -	depends on PROFILING
> +	default m
> +	depends on SPU_FS && PROFILING
>  	help
>  	  OProfile is a profiling system capable of profiling the
>  	  whole system, include the kernel, kernel modules, libraries,

So you think my p6 needs SPU_FS to do oprofile?

Create an OPROFILE_CELL symbol that depends on both, default y.  If both
are selected, build the cell stuff based on that.  Besides, this is for
PPC_CELL_NATIVE according to the next Makefile.
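
A sketch of the Kconfig I have in mind (untested; module/built-in
interactions with SPU_FS would still need a look):

config OPROFILE_CELL
	bool "OProfile support for Cell BE SPUs"
	depends on PPC_CELL_NATIVE && OPROFILE && SPU_FS
	default y

and then the Makefile builds the cell/ objects only when
CONFIG_OPROFILE_CELL is set.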

> Index: linux-2.6.20-rc1/arch/powerpc/oprofile/Makefile
> ===================================================================
> --- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Makefile	2007-01-18 
> 16:43:14.429510072 -0600
> +++ linux-2.6.20-rc1/arch/powerpc/oprofile/Makefile	2007-02-01 
> 17:21:46.948872264 -0600
> @@ -11,7 +11,8 @@
>  		timer_int.o )
>
>  oprofile-y := $(DRIVER_OBJS) common.o backtrace.o
> -oprofile-$(CONFIG_PPC_CELL_NATIVE) += op_model_cell.o
> +oprofile-$(CONFIG_PPC_CELL_NATIVE) += op_model_cell.o \
> +					cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o
>  oprofile-$(CONFIG_PPC64) += op_model_rs64.o op_model_power4.o
>  oprofile-$(CONFIG_FSL_BOOKE) += op_model_fsl_booke.o
>  oprofile-$(CONFIG_6xx) += op_model_7450.o
> Index: linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c
> ===================================================================
> --- 
> linux-2.6.20-rc1.orig/arch/powerpc/oprofile/op_model_cell.c	2007-02-01 
> 17:21:38.388840624 -0600
> +++ linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c	2007-02-03 
> 15:59:38.555810464 -0600
> @@ -37,6 +37,16 @@
>  #include <asm/system.h>
>
>  #include "../platforms/cell/interrupt.h"
> +#include "cell/pr_util.h"
> +
> +/* spu_cycle_reset is the number of cycles between samples.
> + * This variable is used for SPU profiling and should ONLY be set
> + * at the beginning of cell_reg_setup; otherwise, it's read-only.
> + */
> +static unsigned int spu_cycle_reset = 0;
> +
> +#define NUM_SPUS_PER_NODE    8
> +#define SPU_CYCLES_EVENT_NUM 2        /*  event number for SPU_CYCLES 
> */
>
>  #define PPU_CYCLES_EVENT_NUM 1	/*  event number for CYCLES */
>  #define PPU_CYCLES_GRP_NUM   1  /* special group number for 
> identifying
> @@ -50,7 +60,6 @@
>  #define NUM_TRACE_BUS_WORDS 4
>  #define NUM_INPUT_BUS_WORDS 2
>
> -
>  struct pmc_cntrl_data {
>  	unsigned long vcntr;
>  	unsigned long evnts;
> @@ -140,12 +149,21 @@
>  /*
>   * Firmware interface functions
>   */
> +
>  static int
>  rtas_ibm_cbe_perftools(int subfunc, int passthru,
>  		       void *address, unsigned long length)
>  {
>  	u64 paddr = __pa(address);
>
> +	pm_rtas_token = rtas_token("ibm,cbe-perftools");
> +
> +	if (pm_rtas_token == RTAS_UNKNOWN_SERVICE) {
> +	  printk(KERN_ERR
> +		 "%s: rtas token ibm,cbe-perftools unknown\n",
> +		 __FUNCTION__);
> +	}
> +

You print the error but make the call anyway?
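
A sketch of what I would expect instead (the error value is just
illustrative):

	pm_rtas_token = rtas_token("ibm,cbe-perftools");
	if (pm_rtas_token == RTAS_UNKNOWN_SERVICE) {
		printk(KERN_ERR
		       "%s: rtas token ibm,cbe-perftools unknown\n",
		       __FUNCTION__);
		return -EIO;
	}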

>  	return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru,
>  			 paddr >> 32, paddr & 0xffffffff, length);
>  }
> @@ -486,7 +504,12 @@
>  	       struct op_system_config *sys, int num_ctrs)
>  {
>  	int i, j, cpu;
> +	spu_cycle_reset = 0;
>
> +	if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
> +		spu_cycle_reset = ctr[0].count;
> +		return;
> +	}

I don't have the context here, but it seems an odd place
for such a check.

[again, see my other post for comments about LFSR counting]


milton



