[Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Carl Love cel at us.ibm.com
Sat Feb 10 04:42:50 EST 2007


On Thu, 2007-02-08 at 09:48 -0600, Milton Miller wrote:
> This is part 2, with specific comments to the existing patch.
> I'm not subscribed to the list, so please cc.
> 
> Thanks,
> milton
> 
> miltonm at bga.com  Milton Miller
> 

> > +	int node_factor;
> > +	
> > +	spu_mask = 0xFFFF;

> why not a #define now

Yes, it should be a #define now.


> 
> > +	node_factor = cbe_cpu_to_node(cpu) * SPUS_PER_NODE;
> 
> hmm... this appears to be the base index for the portion of the array 
> used on this node.

Yes, it is part of the index calculation for accessing an element of an
array that has cpu, spu and entry dimensions.  It can be simplified
since the array no longer needs to be 3D.  It just needs to be 2D, as
noted later in your comments.

> 
> > +	
> > +	/* Each SPU PC is 16 bits; hence, four spus in each of
> > +	 * the two 64-bit buffer entries that make up the
> > +	 * 128-bit trace_buffer entry.  Process the upper and
> > +	 * lower 64-bit values simultaneously.
> > +	 * trace[0] SPU PC contents are: 0 1 2 3
> > +	 * trace[1] SPU PC contents are: 4 5 6 7
> > +	 */
> > +
> > +	cbe_read_trace_buffer(cpu, trace_buffer);
> > +
> > +	for (spu = SPUS_PER_TB_ENTRY-1; spu >= 0; spu--) {
> > +		spu_pc_lower = spu_mask & trace_buffer[0];
> > +		trace_buffer[0] = trace_buffer[0] >> NUM_SPU_BITS_TRBUF;
> > +
> > +		spu_pc_upper = spu_mask & trace_buffer[1];
> > +		trace_buffer[1] = trace_buffer[1] >> NUM_SPU_BITS_TRBUF;
> 
> ok like the constant shift now.
> [optional] I would probably group the shfit right together, and place
> them either lower in the loop or as the third clause of the for().
> 
> > +		
> > +		/* spu PC trace entry is upper 16 bits of the
> > +		 * 18 bit SPU program counter
> > +		 */
> > +		spu_pc_lower = spu_pc_lower << 2;
> > +		spu_pc_upper = spu_pc_upper << 2;
> > +		
> > +		samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
> > +			= (u32) spu_pc_lower;
> > +		samples[((node_factor + spu + SPUS_PER_TB_ENTRY)
> > +			 * TRACE_ARRAY_SIZE) + entry] = (u32) spu_pc_upper;
> 
> What is all that calcuation?
> 
> Don't cast the value to u32 when assigning it to the array.   For that
> matter, spu_pc* should just be u32, the value is already small in this
> patch (gcc should not compliain when the value is anded with 0xffff, and
> a shfit of 2 will easily fit).
> 
> You are doing a lot of math here to calculate where to put
> this one nodes worth of data in the array.   It really wants
> to be multi-dimensional, and probably only contain one nodes
> worth of data.

Yes, I didn't fix the array size after we restructured the code.  My
mistake. We pulled cell_spu_pc_collection() into the profile_spus()
function, which eliminated the outer loop.  You caught the fact that the
array is now much bigger than it needs to be.  We really just need two
dimensions, [spu in node][sample entry], rather than the three
dimensions we had before.  We will fix this.
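
Roughly, the reworked array would look something like the sketch below
(just the intent, not the final code; the constants are the ones already
in the patch):

static u32 samples[SPUS_PER_NODE][TRACE_ARRAY_SIZE];	/* one node's data */

The stores in spu_pc_extract() then reduce to samples[spu][entry] and
samples[spu + SPUS_PER_TB_ENTRY][entry], with no node_factor math.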

> 
> so samples[spu][entry] = spu_pc_lower;
>     samples[spu + SPUS_PER_ENTRY][entry] = spu_pc_upper
> 
> hmm... upper and lower seem to be confused here ... the
> data is naturarly big endian at the source, so upper should
> be 0 and lower 1 .... or just make that spu_pc[2], use 0
> and 1 ... yes, a for (j = 0; j < WORDS_PER_ENTRY] loop
> assingnig to samples[spu + j * spus_per_word][entry] --- the
> compiler should unrol a fixed count of 2.

I think you missed the layout here.  This has nothing to do with big or
little endian.  The trace buffer is a 128-bit wide, 1024-entry hardware
buffer.  It takes two 64-bit reads to get the entire 128 bits: one read
gives you the lower 64 bits of the trace buffer entry, the other read
gives you the upper 64 bits.  The SPU PCs are 18 bits wide, but the
lower 2 bits are always zero, so only the upper 16 bits of each 18-bit
PC are stored in the trace buffer.  The hardware packs the eight 16-bit
PC values into the 128-bit entry.  The spu_pc_lower variable holds the
program counter values for the lower 4 SPUs, 0 to 3.  The spu_pc_upper
variable holds the program counters for the upper 4 SPUs, 4 to 7.

In IBM notation the MSB of a word is bit 0, the LSB is n-1 where n is
the size of the word.

The cbe_read_trace_buffer() function puts trace buffer bits 0:63 into
trace[0] and bits 64:127 into trace[1].  So in trace[0], bits 0:15 are
SPU PC 0, bits 16:31 are SPU PC 1, bits 32:47 are SPU PC 2 and bits
48:63 are SPU PC 3.  The layout of trace[1] is the same except it holds
SPU PCs 4 through 7.

The trace[0] and trace[1] entries are masked and shifted to pull out
each of the SPU PCs.  We process the two trace words at the same time
in the loop: spu_pc_lower holds the SPU PC for one of the lower 4 SPUs
(0-3) while spu_pc_upper holds the SPU PC for one of the upper 4 SPUs
(4-7).  The point is that upper and lower here have nothing to do with
the endian storage of a value.

Note there is a trace buffer for each node.
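
To make the layout concrete, here is a rough sketch of the extraction
using the 2D array sketched above.  SPU_PC_MASK is a hypothetical name
for the #define that will replace the bare 0xFFFF; the other names are
the ones already in the patch, and this is not the final code:

#define SPU_PC_MASK 0xFFFF	/* hypothetical name; 16-bit SPU PC field */

static void spu_pc_extract(int cpu, int entry)
{
	u64 trace_buffer[2];
	u32 spu_pc_lower, spu_pc_upper;
	int spu;

	/* trace_buffer[0] holds SPU PCs 0-3, trace_buffer[1] holds 4-7;
	 * each iteration peels one 16-bit PC off the low end of each word.
	 */
	cbe_read_trace_buffer(cpu, trace_buffer);

	for (spu = SPUS_PER_TB_ENTRY - 1; spu >= 0; spu--) {
		spu_pc_lower = SPU_PC_MASK & trace_buffer[0];
		spu_pc_upper = SPU_PC_MASK & trace_buffer[1];
		trace_buffer[0] >>= NUM_SPU_BITS_TRBUF;
		trace_buffer[1] >>= NUM_SPU_BITS_TRBUF;

		/* the stored value is the upper 16 bits of the 18-bit PC,
		 * so restore the two low-order zero bits
		 */
		samples[spu][entry] = spu_pc_lower << 2;
		samples[spu + SPUS_PER_TB_ENTRY][entry] = spu_pc_upper << 2;
	}
}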

> 
> 
> 
> > +	}
> > +}
> > +
> > +static int cell_spu_pc_collection(int cpu)
> > +{
> > +	u32 trace_addr;
> > +	int entry;
> > +
> > +	/* process the collected SPU PC for the node */
> > +
> > +	entry = 0;
> > +
> > +	trace_addr = cbe_read_pm(cpu, trace_address);
> > +	while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
> where did the 0x400 come from?
> should this be (!(addr & EMPTY)) ?
 

The trace_address register bit definitions are:
0:9    - trace buffer address for next write 
10:19  - trace buffer address for next read
20     - flag, is 1 if the trace buffer is full
21     - flag, is 1 if the trace buffer is empty
22:31  - count of entries in the buffer

NOTE, MSB is 0, LSB is 31.  

If the buffer is empty, the empty bit is set.  So we need to mask out
just bit 21, which in conventional notation is 1 << (31 - 21), i.e. the
value 0x400.  If the result of masking trace_addr with
CBE_PM_TRACE_BUF_EMPTY is 0x400, then the buffer is empty.

So, yes, the condition ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
could be simplified to just !(trace_addr & CBE_PM_TRACE_BUF_EMPTY).

Note:  CBE_PM_TRACE_BUF_EMPTY is a #define for 0x400.

> > +	{
> > +		/* there is data in the trace buffer to process */
> > +		spu_pc_extract(cpu, entry);
> > +
> > +		entry++;
> > +
> > +		if (entry >= TRACE_ARRAY_SIZE)
> > +			/* spu_samples is full */
> > +			break;
> > +
> > +		trace_addr = cbe_read_pm(cpu, trace_address);
> > +	}
> > +	return(entry);
> 
> As I said the last time, I think this would be clearer:
> 
> for (entry = 0; entry < TRACE_ARRAY_SIZE; entry++) {
>      trace_addr = cbe_read_pm(...)
>      if (trace_addr & EMPTY)
> 		break;
>      spu_pc_extract(cpu, entry, node_samples);
> }
> 
> 
> 
> > +}
> > +

I really don't agree.  A for loop says that you expect to do the loop
TRACE_ARRAY_SIZE times, and then you have an if statement just in case
that assumption is wrong.

The reality is the trace buffer should never have TRACE_ARRAY_SIZE
entries.  If it does, then we have probably lost some samples.  The goal
was to set the kernel timer value such that the routine to empty the
trace buffer and process the samples is called before the trace buffer
gets more than half full.  That allows us some room if the routine gets
called a little late.  The average number of entries in the trace buffer
when it is emptied will depend on the frequency the user specified for
the hardware to take a sample.  Furthermore, the hardware is putting
values into the trace buffer independently of the kernel taking them
out.  It is entirely possible that the trace buffer will get an
additional entry after you enter the loop to pull them out.

What the code is trying to say is: while there are entries in the
hardware trace buffer, read them, process them and put them into the
samples array.  We have to check that we don't overflow the samples
array, which is why we make sure we didn't put more than
TRACE_ARRAY_SIZE entries into it.  That check should never be true, but
we need it so that if something really goes wrong we don't walk off the
end of the array.

I really don't think the for loop fits well in this context.  Sorry.
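
Spelled out, the loop I have in mind looks roughly like this (sketch
only, with the simplified empty-bit test from above folded in):

static int cell_spu_pc_collection(int cpu)
{
	u32 trace_addr;
	int entry = 0;

	/* Drain the hardware trace buffer until it reports empty, but
	 * never write past the end of the samples array.
	 */
	trace_addr = cbe_read_pm(cpu, trace_address);
	while (!(trace_addr & CBE_PM_TRACE_BUF_EMPTY)) {
		spu_pc_extract(cpu, entry);

		entry++;
		if (entry >= TRACE_ARRAY_SIZE)
			break;		/* samples full; should never happen */

		trace_addr = cbe_read_pm(cpu, trace_address);
	}
	return entry;
}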

> 
> Also, Documentation/CodingStyle
> opening brace on for / if / while
> no parenthis on return value -- return is not a function
> 
> > +
> > +static int profile_spus(struct hrtimer * timer)
> > +{
> > +	ktime_t kt;
> > +	int cpu, node, k, num_samples, spu_num;
> > +	
> > +	if (!spu_prof_running)
> > +		goto stop;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		if (cbe_get_hw_thread_id(cpu))
> > +			continue;
> > +
> > +		node = cbe_cpu_to_node(cpu);
> > +
> > +		num_samples = cell_spu_pc_collection(cpu);
> > +
> > +		if (num_samples == 0)
> > +			continue;
> > +
> > +		for (k = 0; k < SPUS_PER_NODE; k++) {
> > +			spu_num = k + (node * SPUS_PER_NODE);
> > +			spu_sync_buffer(spu_num,
> > +					samples + (spu_num * TRACE_ARRAY_SIZE),
> > +					num_samples);
> > +		}
> > +	}
> 
> Looking here, you only process the entrys in a cpus's partiton
> of the array at a time.   seems like an opertunity to cut
> the size of the array by NR_CPUS.
> 

Yes, as mentioned above, I should have changed the size of the array
when we restructured the code.

> Also, some comments about the locking rules.   I doubt
> the hardware would like two threads reading the trace,
> and the sample buffer is global in your code.
> [I see there are some locks in the task swtich file]
> 
> > +	smp_wmb();
> > +
> > +	kt = ktime_set(0, profiling_interval);
> > +	if (!spu_prof_running)
> > +		goto stop;
> > +	hrtimer_forward(timer, timer->base->get_time(), kt);
> > +	return HRTIMER_RESTART;
> > +
> > + stop:
> > +	printk(KERN_INFO "SPU_PROF: spu-prof timer ending\n");
> > +	return HRTIMER_NORESTART;
> > +}
> >

True.  There is only one CPU or thread doing the processing for each
node, and there is a trace buffer per node, so from that standpoint
there is no issue.  Generally speaking, the processing of the buffer
should be completed long before the hrtimer starts the routine to empty
the buffer again.  However, we probably need a lock to protect against
the corner case where the system is so busy that we didn't complete
processing the buffer within the profiling_interval time.

Good catch.
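
One possible shape for that, purely as a sketch (the lock and its name
are assumptions, not part of the current patch): take a lock with
trylock semantics in the timer callback and skip the pass if the
previous one has somehow not finished.

/* Hypothetical lock guarding the trace buffer drain and samples array;
 * a per-node lock would be finer grained.
 */
static DEFINE_SPINLOCK(oprof_spu_smpl_lock);

static int profile_spus(struct hrtimer *timer)
{
	ktime_t kt;
	int cpu, node, k, num_samples, spu_num;

	if (!spu_prof_running)
		goto stop;

	for_each_online_cpu(cpu) {
		if (cbe_get_hw_thread_id(cpu))
			continue;

		node = cbe_cpu_to_node(cpu);

		/* If the previous pass is still running, skip this one
		 * rather than race with it on the hardware trace buffer.
		 */
		if (!spin_trylock(&oprof_spu_smpl_lock))
			continue;

		num_samples = cell_spu_pc_collection(cpu);
		if (num_samples) {
			for (k = 0; k < SPUS_PER_NODE; k++) {
				spu_num = k + (node * SPUS_PER_NODE);
				/* samples[] now holds only this node's data */
				spu_sync_buffer(spu_num, samples[k],
						num_samples);
			}
		}
		spin_unlock(&oprof_spu_smpl_lock);
	}

	smp_wmb();

	kt = ktime_set(0, profiling_interval);
	if (!spu_prof_running)
		goto stop;
	hrtimer_forward(timer, timer->base->get_time(), kt);
	return HRTIMER_RESTART;

 stop:
	printk(KERN_INFO "SPU_PROF: spu-prof timer ending\n");
	return HRTIMER_NORESTART;
}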


[cut]
> > */
> >
> >  #define PPU_CYCLES_EVENT_NUM 1	/*  event number for CYCLES */
> >  #define PPU_CYCLES_GRP_NUM   1  /* special group number for 
> > identifying
> > @@ -50,7 +60,6 @@
> >  #define NUM_TRACE_BUS_WORDS 4
> >  #define NUM_INPUT_BUS_WORDS 2
> >
> > -
> >  struct pmc_cntrl_data {
> >  	unsigned long vcntr;
> >  	unsigned long evnts;
> > @@ -140,12 +149,21 @@
> >  /*
> >   * Firmware interface functions
> >   */
> > +
> >  static int
> >  rtas_ibm_cbe_perftools(int subfunc, int passthru,
> >  		       void *address, unsigned long length)
> >  {
> >  	u64 paddr = __pa(address);
> >
> > +	pm_rtas_token = rtas_token("ibm,cbe-perftools");
> > +
> > +	if (pm_rtas_token == RTAS_UNKNOWN_SERVICE) {
> > +	  printk(KERN_ERR
> > +		 "%s: rtas token ibm,cbe-perftools unknown\n",
> > +		 __FUNCTION__);
> > +	}
> > +
> 
> You print the error but call it anyways?

Yes, this needs fixing.

> 
> >  	return rtas_call(pm_rtas_token, 5, 1, NULL, subfunc, passthru,
> >  			 paddr >> 32, paddr & 0xffffffff, length);
> >  }
> > @@ -486,7 +504,12 @@
> >  	       struct op_system_config *sys, int num_ctrs)
> >  {
> >  	int i, j, cpu;
> > +	spu_cycle_reset = 0;
> >
> > +	if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
> > +		spu_cycle_reset = ctr[0].count;
> > +		return;
> > +	}
> 
> I don't have the context here, but it seems an odd place
> for such a check.
> 
> [ again, see other post for comments about lfsr counting ]
> 

We talked about this in the other email thread.  The event number gets
passed in ctr[i].event.  For SPU profiling there can only be one
specified event, so you look at ctr[0].event to see what the event
number is.  If it is SPU profiling, you get the number of events between
samples (in this case cycles) that the user specified.  That count is
put into spu_cycle_reset, which is later used to determine the initial
LFSR value for the hardware counter to restart at after it takes a
sample of the SPU program counters.
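
For what it's worth, the idea of mapping the count to an LFSR preload
can be sketched as below.  This is purely illustrative: the tap
positions, start state and helper names are made up for the example,
and the real Cell performance monitor LFSR polynomial is not
necessarily the one shown.

/* Illustrative 24-bit Fibonacci LFSR step (taps chosen for the example,
 * not taken from the Cell hardware documentation).
 */
static u32 lfsr_step(u32 lfsr)
{
	u32 bit = (lfsr ^ (lfsr >> 1) ^ (lfsr >> 2) ^ (lfsr >> 7)) & 1;

	return (lfsr >> 1) | (bit << 23);
}

/* Find the LFSR state that is 'count' steps before the end of the
 * sequence, so the hardware counter wraps (and a sample is taken)
 * after 'count' cycles.  Assumes count < 2^24 - 1.
 */
static u32 lfsr_preload(u32 count)
{
	u32 v = 1;				/* arbitrary non-zero start state */
	u32 i, steps = (1 << 24) - 1 - count;

	for (i = 0; i < steps; i++)
		v = lfsr_step(v);

	return v;
}

The brute-force iteration (up to 2^24 steps) is presumably part of why
doing this in the kernel versus user space came up in the other thread.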

> 
> milton
> 

I want to thank you for all of your input.  I know we don't agree on a
few points (moving the LFSR calculation to user space, and the for
loop), but I just wanted to make sure you know I really do appreciate
your input.  You have pointed out several things that do need fixing.
Thanks for your help.

                  Carl Love
