[Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update

Wed Jan 31 10:31:09 EST 2007

On Tue, 2007-01-30 at 15:41 -0600, Maynard Johnson wrote:
[snip
> [snip]
> >>+       // process the collected SPU PC for each node
> >>+       for_each_online_cpu(cpu) {
> >>+               if (cbe_get_hw_thread_id(cpu))
> >>+                       continue;
> >>+
> >>+               node = cbe_cpu_to_node(cpu);
> >>+               node_factor = node * SPUS_PER_NODE;
> >>+                /* number of valid entries for this node */
> >>+               entry = 0;
> >>+
> >>+               trace_addr = cbe_read_pm(cpu, trace_address);
> >>+               while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
> >>+               {
> >>+                       /* there is data in the trace buffer to process */
> >>+                       cbe_read_trace_buffer(cpu, trace_buffer);  
> >>+                       spu_mask = 0xFFFF000000000000;
> >>+
> >>+                       /* Each SPU PC is 16 bits; hence, four spus in each of 
> >>+                        * the two 64-bit buffer entries that make up the
> >>+                        * 128-bit trace_buffer entry.  Process the upper and
> >>+                        * lower 64-bit values simultaneously.
> >>+                        */
> >>+                       for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
> >>+                               spu_pc_lower = spu_mask & trace_buffer[0];
> >>+                               spu_pc_lower = spu_pc_lower >> (NUM_SPU_BITS_TRBUF
> >>+                                                               * (SPUS_PER_TB_ENTRY-spu-1));
> >>+
> >>+                               spu_pc_upper = spu_mask & trace_buffer[1];
> >>+                               spu_pc_upper = spu_pc_upper >> (NUM_SPU_BITS_TRBUF
> >>+                                                               * (SPUS_PER_TB_ENTRY-spu-1));
> >>+
> >>+                               spu_mask = spu_mask >> NUM_SPU_BITS_TRBUF;
> >>+
> >>+                               /* spu PC trace entry is upper 16 bits of the
> >>+                                * 18 bit SPU program counter 
> >>+                                */
> >>+                               spu_pc_lower = spu_pc_lower << 2;
> >>+                               spu_pc_upper = spu_pc_upper << 2;
> >>+
> >>+                               samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
> >>+                                       = (u32) spu_pc_lower;
> >>+                               samples[((node_factor + spu + SPUS_PER_TB_ENTRY) * TRACE_ARRAY_SIZE) + entry]
> >>+                                       = (u32) spu_pc_upper;
> >>+                       }
> >>+
> >>+                       entry++;
> >>+
> >>+                       if (entry >= TRACE_ARRAY_SIZE) 
> >>+                               /* spu_samples is full */
> >>+                               break;
> >>+
> >>+                       trace_addr = cbe_read_pm(cpu, trace_address);
> >>+               }
> >>+               samples_per_node[node] = entry;
> >>+       }
> >>+}
> > 
> > 
> > While I can't see anything technically wrong with this function, it would be
> > good to split it into smaller functions. Since you are nesting three
> > loops, it should be possible to make a separate function from one of the
> > inner loops without changing the actual logic behind it.
> Will do.
> > 
> > 
> >>+
> >>+static int profile_spus(struct hrtimer * timer)
> >>+{
> >>+       ktime_t kt;
> >>+        int cpu, node, k, num_samples, spu_num;
> > 
> > 
> > whitespace damage
> fixed
> > 
> > 
> >>+       
> >>+       if (!spu_prof_running)
> >>+               goto STOP;
> >>+
> >>+       cell_spu_pc_collection();
> >>+       for_each_online_cpu(cpu) {
> >>+               if (cbe_get_hw_thread_id(cpu))
> >>+                       continue;
> > 
> > 
> > Here, you enter the same top-level loop again, why not make it
> > 	for_each_online_cpu(cpu) {
> > 		if (cbe_get_hw_thread_id(cpu))
> >                          continue;
> > 		num_samples = cell_spu_pc_collection(cpu);
> > 		...
> Yes, good suggestion.
I believe what you are asking here is why can't the
cell_spu_pc_collection() function put the data in to the samples array
for a given node, in the loop that then processes the samples array for
that node.  Yes, I believe that this can be done.  The only restriction
is that cell_spu_pc_collection() will have to extract the SPU program
counter data for all SPUs on that node.  This is due to the fact the
data for the 8 SPUs are all stored in a single entry of the hardware
trace buffer.  Once the hardware trace buffer is read, the hardware
advances the read pointer so there is no way to go back and re-read the
entry.

[snip]

[snip]
> >>+static int calculate_lfsr(int n)
> >>+{
> >>+#define size 24
> >>+       int i;
> >>+       unsigned int newlfsr0;
> >>+       unsigned int lfsr = 0xFFFFFF;
> >>+       unsigned int howmany = lfsr - n;
> >>+
> >>+       for (i = 2; i < howmany + 2; i++) {
> >>+               newlfsr0 = (((lfsr >> (size - 1 - 0)) & 1) ^
> >>+                           ((lfsr >> (size - 1 - 1)) & 1) ^
> >>+                           (((lfsr >> (size - 1 - 6)) & 1) ^
> >>+                            ((lfsr >> (size - 1 - 23)) & 1)));
> >>+
> >>+               lfsr >>= 1;
> >>+               lfsr = lfsr | (newlfsr0 << (size - 1));
> >>+       }
> >>+       return lfsr;
> >>+
> >>+}
> > 
> > 
> > I don't have the slightest idea what this code is about, but
> Me neither.  Carl, can you comment?
> > it certainly looks inefficient to loop 16 million times to
> > compute a constant. Could you use a faster algorithm instead,
> > or at least add a comment about why you do it this way?
> > 
> > 

An LFSR sequence is similar to a pseudo random number sequence. For a 24
bit LFSR sequence each number between 0 and 2^24 will occur once in the
sequence but not in a normal counting order.  The hardware uses the LFSR
sequence to count to since it is much simpler to implement in hardware
then a normal counter.  Unfortunately, the only way we know how to
figure out what the LFSR value that corresponds to the number in the
sequence that is N before the last value (0xFFFFFF) is to calculate the
previous value N times.  It is like trying to ask what is the pseudo
random number that is N before this pseudo random number?

I will add a short comment to the code that will summarize the above
paragraph.
[snip]