[Cbe-oss-dev] [PATCH 3/3] spufs context switch - fix interrupt routing

Thu Apr 24 11:42:10 EST 2008

On Thu, 2008-04-24 at 08:18 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2008-04-23 at 16:58 -0300, Luke Browning wrote:
> > > I also don't see how one would overwrite exception data as only hash
> > > misses can write there and there cannot be two pending at once. What do
> > > I miss here ?
> > 
> > From 770bc074cc4ef45c450eb172f994e8a1425a3666 Mon Sep 17 00:00:00 2001
> > From: Jeremy Kerr <jk at ozlabs.org>
> > Date: Fri, 4 Apr 2008 17:55:28 +1100
> > Subject: [PATCH] [POWERPC] cell: Fix lost interrupts due to fasteoi handler
> 
>  .../...
> 
> Well, that's orthogonal :-) That's a bug I found with Jeremy and that
> affects affinity setting for any interrupt, and needs to be fixed (which
> it was). It wasn't per-se incorrect to re-route the interrupt; there was
> a bug and it has been fixed.
> 
> > ---
> > Maybe I misunderstood but I thought Jeremy was saying that two processors
> > were being interrupted at virtually the same time and the second interrupt
> > was dropped, because the first one had not completed.  
> 
> Well, it's more like the second interrupt happens right after the first
> one, but on a different CPU. So close that the core hasn't yet cleared
> IRQ_INPROGRESS.
> 
> There can be 2 class 1 happening so close that they basically look like
> one interrupt to two processors, but that isn't a problem per-se, again.
> The register lock will make sure only one guy fetches things.
> 
> But due to the edge nature of the IIC messages, it's important that if
> the second one fires just after the first one's handler has read the
> pending bits, the handler gets called again as a new bit might have been
> set in between.
> 
> The bug was that we didn't do that, so if they were close enough -and-
> moved to a different CPU, we would "lose" the second one.
>
> > Jeremy's code changes are designed to make the interrupt handling
> > re-entrant by synchronizing the execution of the second level interrupt
> > handlers (slihs) spu_irq_class_0, spu_irq_class_1, and  spu_irq_class_2.
> 
> I don't think so. All the code change does is to properly take note that
> the interrupt re-occurred while marked IN_PROGRESS and re-call the
> handler when that was the case. That's it. Just make sure we don't lose
> any.

Well, maybe re-entrant is the wrong word :-).
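
As I read your description, the fix boils down to the pattern below.
This is only a user-space sketch under my own assumptions: the struct,
the flag names and the function names are all made up for illustration
and are not the genirq or IIC code.  It models a second message arriving
after the handler has already read the pending bits.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum { FLAG_INPROGRESS = 1 << 0, FLAG_PENDING = 1 << 1 };

struct model_irq {
    unsigned int flags;
    unsigned int hw_pending;  /* bits latched by the edge-style messages */
    unsigned int handled;     /* bits the handler has consumed so far */
    int handler_calls;
    bool inject_once;         /* simulate a second message mid-handler */
};

static void model_deliver(struct model_irq *irq, unsigned int bits);

static void model_handler(struct model_irq *irq)
{
    /* Read and clear the pending bits, as the first handler run would. */
    unsigned int bits = irq->hw_pending;
    irq->hw_pending = 0;
    irq->handler_calls++;

    /* A second message fires just after the read, "on another CPU". */
    if (irq->inject_once) {
        irq->inject_once = false;
        model_deliver(irq, 0x2);
    }
    irq->handled |= bits;
}

static void model_deliver(struct model_irq *irq, unsigned int bits)
{
    irq->hw_pending |= bits;

    /* Handler already running: take note instead of dropping it. */
    if (irq->flags & FLAG_INPROGRESS) {
        irq->flags |= FLAG_PENDING;
        return;
    }

    irq->flags |= FLAG_INPROGRESS;
    do {
        /* Re-call the handler for as long as something re-occurred. */
        irq->flags &= ~FLAG_PENDING;
        model_handler(irq);
    } while (irq->flags & FLAG_PENDING);
    irq->flags &= ~FLAG_INPROGRESS;
}
```

Delivering 0x1 with inject_once set runs the handler twice and both bits
end up in handled.  Without the FLAG_PENDING bookkeeping the 0x2 bit
would just sit in hw_pending until some later interrupt came along.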

> 
> > These callouts are made sequentially from the same cpu one after the
> > other almost immediately as each handler just records the current
> > exception data in the csa and performs a thread wakeup.  But the
> > exceptions are not really handled yet.  The controlling thread needs to
> > run, perform some virtual memory operations, and perform a dma restart
> > for the exception to be truly handled and this takes a lot of time as
> > the thread needs to be scheduled.  Therefore, if an spu is generating
> > multiple exceptions at virtually the same time, we have a problem as the
> > second call out will overwrite the exception data presented with the
> > first exception.
> 
> It shouldn't as it shouldn't be the same type of interrupt. Only the
> hash miss can write to the CSA, not the segment miss. The only case I
> know where two interrupts happen back to back is a segment miss followed
> by a hash miss for the same address.

I assume back to back interrupts happen all the time.  The difference in
this case is that we are re-routing interrupts within a few hundred
instructions of issuing the dma restart.  I assume we context switched
the spu context while it was segment faulting which caused the dma
restart to be deferred to context restore code within a few instructions
of the spu interrupt rerouting.  I assume that the SLB was not reloaded
by the context switch code, so the two faults were regenerated.  

> There should never be two hash misses unless there is a restart in
> between.
> 

Can you have a class 0 exception and a class 1 hash miss occur at the
same time?  If so, we still have the same issue.
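
To make the concern concrete, here is a minimal sketch.  The names are
hypothetical and this is not the spufs csa layout; it only assumes that
the handler records fault data and the thread consumes it later.  It
records the same two faults into one shared slot and into per-class
slots so the difference is visible.

```c
#include <stdint.h>

/* Hypothetical model; field and function names are illustrative only. */
struct fault_slot {
    uint64_t addr;
    int valid;
};

struct model_csa {
    struct fault_slot shared;        /* one slot used by every class */
    struct fault_slot per_class[3];  /* one slot per interrupt class */
};

/*
 * Handler side: record the fault data.  The real handler would also
 * wake the controlling thread, which consumes the data much later.
 */
static void record_fault(struct model_csa *csa, int irq_class, uint64_t addr)
{
    /* Shared slot: silently overwrites a fault not yet consumed. */
    csa->shared.addr = addr;
    csa->shared.valid = 1;

    /* Per-class slot: a class 0 and a class 1 fault can coexist. */
    csa->per_class[irq_class].addr = addr;
    csa->per_class[irq_class].valid = 1;
}
```

If a class 0 fault and then a class 1 fault are recorded before the
thread runs, the shared slot holds only the class 1 address, while the
per-class slots still hold both.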

Luke