[Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

Milton Miller miltonm at bga.com
Sat Feb 10 06:36:55 EST 2007


On Feb 9, 2007, at 11:42 AM, Carl Love wrote:

> On Thu, 2007-02-08 at 09:48 -0600, Milton Miller wrote:
>> This is part 2, with specific comments to the existing patch.
>> I'm not subscribed to the list, so please cc.
>>
>> Thanks,
>> milton
>>
>> miltonm at bga.com  Milton Miller

>>
>> so samples[spu][entry] = spu_pc_lower;
>>     samples[spu + SPUS_PER_ENTRY][entry] = spu_pc_upper
>>
>> hmm... upper and lower seem to be confused here ... the
>> data is naturally big endian at the source, so upper should
>> be 0 and lower 1 .... or just make that spu_pc[2], use 0
>> and 1 ... yes, a for (j = 0; j < WORDS_PER_ENTRY) loop
>> assigning to samples[spu + j * SPUS_PER_WORD][entry] --- the
>> compiler should unroll a fixed count of 2.


>
> I think you missed the layout here.  This has nothing to do with big or
> little endian.

Actually, I think I have it exactly right, but you are missing
my point.  If you consider 0 to be the first and 7 to be
the last SPU, then the order of the SPUs in the trace array
is big endian.  I was referring to the order of the elements
in the two words.


> The trace buffer is a 128 bit wide, 1024 entry hardware
> buffer.  It takes two 64 bit reads to get the entire 128 bits.  Hence,
> one read gives you the lower 64 bits of the trace buffer.  The other
> read gives you the upper 64 bits of the trace buffer.  The SPU PCs are
> 18 bits wide but the lower 2 bits are always zero.  Only the upper 16
> bits of the 18 bit PC are stored in the trace buffer.  The hardware
> stores the eight 16 bit PC values into the 128 bit entry.  The
> spu_pc_lower variable holds the program counters for the lower 4 SPUs,
> 0 to 3.  The spu_pc_upper variable holds the program counters for the
> upper 4 SPUs, 4 to 7.
>
> In IBM notation the MSB of a word is bit 0, the LSB is n-1 where n is
> the size of the word.
>
> The cbe_read_trace_buffer() function puts trace buffer bits 0:63 into
> trace[0] and bits 64:127 into trace[1].  So the layout of trace[0] is:
> bits 0:15 are SPU PC 0, 16:31 are SPU PC 1, 32:47 are SPU PC 2, and
> bits 48:63 are SPU PC 3.  The layout of trace[1] is the same except it
> holds SPU PC 4 through 7.

Yes, and trace[0] would be considered the most significant part of the
trace word, which would normally be called the upper word.  If you
think of the SPU number as big endian it makes sense.  If you think of
it as little endian, then you have this upper vs. lower conflict.
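
To make that concrete, here is an untested sketch of the combined
loop I was proposing.  WORDS_PER_ENTRY, SPUS_PER_WORD, the u16 sample
type, and this spu_pc_extract signature are my assumptions, not the
patch's:

	/* untested sketch, not the patch: one loop covers both words */
	#define WORDS_PER_ENTRY	2	/* 128-bit entry = two 64-bit reads */
	#define SPUS_PER_WORD	4	/* four 16-bit PCs per 64-bit word */

	static void spu_pc_extract(u64 trace[WORDS_PER_ENTRY], int entry,
				   u16 samples[][TRACE_ARRAY_SIZE])
	{
		int j, spu;

		for (j = 0; j < WORDS_PER_ENTRY; j++)	/* gcc unrolls this */
			for (spu = 0; spu < SPUS_PER_WORD; spu++)
				/* SPU (j * 4 + spu) sits in IBM bits
				 * 16*spu:16*spu+15 of word j, counting
				 * down from the MSB */
				samples[j * SPUS_PER_WORD + spu][entry] =
					(trace[j] >> (48 - 16 * spu)) & 0xffff;
	}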

>
> The trace[0] and trace[1] entries are masked and shifted to pull out
> each of the SPU PCs.  We are processing the two trace entries at the
> same time in the loop.  The spu_pc_lower holds the SPU PC for one of
> the lower 4 SPUs (0-3) while spu_pc_upper holds the SPU PC for one of
> the upper SPUs (4-7).  The point here is that upper and lower have
> nothing to do with the endian storage of a value.
>
> Note there is a trace buffer for each node.

Yes.  I think my intuition was correct.  Hopefully I explained
where I think we talked past each other.

>>> +static int cell_spu_pc_collection(int cpu)
>>> +{
>>> +	u32 trace_addr;
>>> +	int entry;
>>> +
>>> +	/* process the collected SPU PC for the node */
>>> +
>>> +	entry = 0;
>>> +
>>> +	trace_addr = cbe_read_pm(cpu, trace_address);
>>> +	while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
>> where did the 0x400 come from?
>> should this be (!(addr & EMPTY)) ?
>
>
> The trace_address register bit definitions are:
> 0:9    - trace buffer address for next write
> 10:19  - trace buffer address for next read
> 20     - flag, is 1 if the trace buffer is full
> 21     - flag, is 1 if the trace buffer is empty
> 22:31  - count of entries in the buffer
>
> NOTE, MSB is 0, LSB is 31.
>
> If the buffer is empty, the empty bit is set.  So we need to mask out
> just bit 21 using the value 0x400. So, if the result of masking
> trace_addr with CBE_PM_TRACE_BUF_EMPTY is 0x400 then the buffer is
> empty.
>
> So, yes the condition ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
> could be simplified to just !(trace_addr & CBE_PM_TRACE_BUF_EMPTY).
>
> Note:  CBE_PM_TRACE_BUF_EMPTY is a #define for 0x400.

Ok, we agree.  I was mostly pointing out that the number
was magic, not that I thought it was functionally incorrect.

[ should we count the number of times full is set, since
it represents an unknown number of dropped samples? ]
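
An untested fragment of what I have in mind, using the bit layout you
gave above (bit 20 is the full flag, so in IBM numbering the mask is
1 << (31 - 20) = 0x800; CBE_PM_TRACE_BUF_FULL and the counter name are
my inventions):

	/* hypothetical names; mirrors the 0x400 empty bit one position up */
	#define CBE_PM_TRACE_BUF_FULL	0x00000800

	trace_addr = cbe_read_pm(cpu, trace_address);
	if (trace_addr & CBE_PM_TRACE_BUF_FULL)
		trace_buf_overflows++;	/* an unknown number of samples lost */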

[repeating for clarity]

>>>
>>> +	entry = 0;
>>> +
>>> +	trace_addr = cbe_read_pm(cpu, trace_address);
>>> +	while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
>
>>> +	{
>>> +		/* there is data in the trace buffer to process */
>>> +		spu_pc_extract(cpu, entry);
>>> +
>>> +		entry++;
>>> +
>>> +		if (entry >= TRACE_ARRAY_SIZE)
>>> +			/* spu_samples is full */
>>> +			break;
>>> +
>>> +		trace_addr = cbe_read_pm(cpu, trace_address);
>>> +	}
>>> +	return(entry);
>>
>> As I said the last time, I think this would be clearer:
>>
>> for (entry = 0; entry < TRACE_ARRAY_SIZE; entry++) {
>>      trace_addr = cbe_read_pm(...)
>>      if (trace_addr & EMPTY)
>> 		break;
>>      spu_pc_extract(cpu, entry, node_samples);
>> }



> I really don't agree.  A for loop says that you expect to do the loop
> TRACE_ARRAY_SIZE times.  Then you have an if statement just in case the
> assumption is wrong.

I chose to write the loop this way because it eliminates the
duplication of the line getting the trace entry, once before
the loop and once at the bottom.  It also consolidates the
initialization of the output index.

>
> The reality is the trace buffer should never have TRACE_ARRAY_SIZE
> entries.  If it does, then we have probably lost some samples.  The
> goal was to set the kernel timer value such that the routine to empty
> the trace buffer and process the samples should be called before the
> trace buffer gets more than half full.  That allows us some room if
> the routine gets called a little late.  The average number in the
> trace buffer when it is emptied will depend on the frequency the user
> specified for the hardware to take a sample.  Furthermore, the
> hardware is putting values into the trace buffer independent of the
> kernel taking them out.  It is entirely possible that the trace buffer
> will get an additional entry after you enter the loop to pull them out.

I understand.  I think a brief comment to this effect could
convey that.  The fact that you are sizing the kernel read
timeout for half of the array should be explicitly commented
when you calculate that timeout, for example:
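
(An untested sketch; profiling_interval and samples_per_sec are assumed
names, and 1024 is the hardware buffer depth you quoted above.)

	/* Size the hrtimer so that, at the user's sampling rate, the
	 * 1024-entry trace buffer is at most half full when the drain
	 * routine runs, leaving slack if we are scheduled late. */
	profiling_interval = (1024 / 2) * NSEC_PER_SEC / samples_per_sec;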

>
> What the code is trying to say is: while there are entries in the
> hardware trace buffer, read the trace buffer, process the entries, and
> put them into the samples array.  We have to check to make sure we
> don't overflow the samples array.  That is why we have to make sure we
> didn't put more than TRACE_ARRAY_SIZE entries into the samples array.
> This check should never be true.  But we have to have it in case
> something really goes wrong; we don't want to walk off the end of the
> array.
>
> I really don't think the for loop fits well in this context.  Sorry.

I was trying to eliminate the duplication of lines of code.  I think
the explicit break near the beginning of the loop does not hide
the expectation that the end is not reached, and I would accept an
explicit comment to that effect.

Your code structure emphasizes that you expect the source to underrun;
mine emphasizes where the data is stored after it is obtained.

Yours hides the unusual exit (break) near the bottom but not all the
way there, and takes more lines of code.  Written out in full, my
version would look like this:
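
(Untested; here the spu_pc_extract call matches the quoted patch
rather than my earlier three-argument sketch.)

	static int cell_spu_pc_collection(int cpu)
	{
		u32 trace_addr;
		int entry;

		/* The timer interval is sized so the buffer is at most
		 * half full on entry; reaching TRACE_ARRAY_SIZE means
		 * we fell behind and have probably dropped samples. */
		for (entry = 0; entry < TRACE_ARRAY_SIZE; entry++) {
			trace_addr = cbe_read_pm(cpu, trace_address);
			if (trace_addr & CBE_PM_TRACE_BUF_EMPTY)
				break;	/* hardware buffer drained */
			spu_pc_extract(cpu, entry);
		}
		return entry;
	}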


[ in regards to buffer locking ]
>> Also, some comments about the locking rules.  I doubt
>> the hardware would like two threads reading the trace,
>> and the sample buffer is global in your code.
>> [I see there are some locks in the task switch file]
>
> True.  There is only one CPU or thread doing the processing for each
> node.  There is a trace buffer per node, so from that standpoint there
> is no issue.  Generally speaking, the processing of the buffer should
> be completed long before the hrtimer starts the routine to empty the
> buffer again.  However, we probably need a lock to protect us against
> the corner case that the system is so busy that we didn't complete
> processing the buffer within the profiling_interval time.

I was thinking more of the spu context switch drain being interrupted
by the hrtimer.
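
Concretely, something along these lines would cover both the late-timer
corner case and the context-switch drain (the lock and function names
are hypothetical; a per-node lock would be the natural refinement):

	/* hypothetical sketch: serialize the hrtimer drain against the
	 * SPU context-switch drain; both read the same per-node
	 * hardware buffer and fill the same global samples array */
	static DEFINE_SPINLOCK(spu_trace_lock);

	static void drain_spu_trace(int cpu)
	{
		unsigned long flags;

		spin_lock_irqsave(&spu_trace_lock, flags);
		cell_spu_pc_collection(cpu);
		spin_unlock_irqrestore(&spu_trace_lock, flags);
	}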

>
> Good catch.
>
>
> [cut]

>>> @@ -486,7 +504,12 @@
>>>  	       struct op_system_config *sys, int num_ctrs)
>>>  {
>>>  	int i, j, cpu;
>>> +	spu_cycle_reset = 0;
>>>
>>> +	if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
>>> +		spu_cycle_reset = ctr[0].count;
>>> +		return;
>>> +	}
>>
>> I don't have the context here, but it seems an odd place
>> for such a check.
>>
>> [ again, see other post for comments about lfsr counting ]
>>
>
> We talked about this in the other email thread.  The event number gets
> passed in ctr[i].event.  For SPU profiling there can only be one
> specified event.  Hence you look into ctr[0].event to see what the
> event number is.  If it is SPU profiling, you get the number of events
> between samples (in this case cycles) the user specified.  The number
> of events is put into spu_cycle_reset.  That is later used to
> determine the initial LFSR value for the hardware counter to restart
> at after it takes a sample of the SPU program counters.
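
For the record, the shape of that seed calculation is roughly the
untested sketch below; the 24-bit width, tap mask, and terminal
pattern are placeholders, and the real values must come from the Cell
PMU documentation:

	#define LFSR_WIDTH	24		/* placeholder width */
	#define LFSR_TAPS	0x00e10000	/* placeholder tap mask */
	#define LFSR_TERMINAL	0x00ffffff	/* placeholder overflow state */

	static u32 lfsr_next(u32 v)
	{
		u32 bit = hweight32(v & LFSR_TAPS) & 1;	/* parity of taps */

		return ((v << 1) | bit) & ((1 << LFSR_WIDTH) - 1);
	}

	/* The seed that reaches the overflow pattern after "count"
	 * steps is the state (period - count) steps past that pattern,
	 * since a maximal-length LFSR repeats every 2^n - 1 steps. */
	static u32 lfsr_seed_for(u32 count)
	{
		u32 period = (1 << LFSR_WIDTH) - 1;
		u32 v = LFSR_TERMINAL;
		u32 i;

		for (i = 0; i < period - count; i++)
			v = lfsr_next(v);
		return v;
	}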

I understand the mechanism and context better now.  I'll continue the
representation-to-the-kernel debate in the other thread.  [I think my
reply to your latest note in that thread will be delayed until the
weekend.]

> I want to thank you for all of your input.  I know we don't agree on a
> few points (moving the LFSR calculation to user space, and the for
> loop) but I just wanted to make sure you know I really do appreciate
> your input.  You have pointed out several things that do need fixing.
> Thanks for your help.
>
>                   Carl Love


Thanks for the consideration.
milton



