[Cbe-oss-dev] [PATCH 2/3] spufs context switch - make SPU_CONTEXT_SWITCH_PENDING atomic

Luke Browning lukebr at linux.vnet.ibm.com
Wed Apr 23 08:49:06 EST 2008


On Tue, 2008-04-22 at 09:22 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2008-04-21 at 17:06 -0300, Luke Browning wrote:
> > On Thu, 2008-04-17 at 17:41 +1000, Jeremy Kerr wrote:
> > > Luke,
> > > 
> > > > The test_bit operation does not use atomic memory primitives
> > > 
> > > How so? It's a single load.
> > 
> > load is not enough in this case as the code needs to be MP safe.  Load
> > does not include the cache line reservation that occurs with the larx
> > instruction.  You could get a stale cache value.    
> 
> I fail to understand your logic here. A load and a larx are exactly
> equivalent -unless- you also attempt to modify with a stwcx. There is no
> ordering guarantee provided by either. A load is atomic, as a larx is
> and architecturally at least both can return "old" data.
> 

I meant for the larx to be used with the stcx.  They always go together.

I was just pointing out that the existing code can miss an update if the
read is done with a normal load, but my main point is that this critical
section should not be guarded by the bit alone; it needs a real lock.

Consider the old code:

if (!context_switch_pending)
	restart_dma

The context switch may take place between the if statement and the
restart_dma call.  This should be avoided.

New code:

lock
if (!context_switch_pending)
	restart_dma
~lock
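
Concretely, the restart path would end up looking something like the
sketch below.  This is just the idea, not the exact diff, and the
irqsave variant is an assumption on my part, since the restart can be
requested from process context as well as from the class 1 interrupt
path:

void spu_restart_dma(struct spu *spu)
{
	unsigned long flags;

	/*
	 * Take the existing register_lock around both the flag test
	 * and the MMIO restart, so that the context switch code cannot
	 * set SPU_CONTEXT_SWITCH_PENDING in between the two.
	 */
	spin_lock_irqsave(&spu->register_lock, flags);
	if (!test_bit(SPU_CONTEXT_SWITCH_PENDING, &spu->flags))
		out_be64(&spu->priv2->mfc_control_RW,
			 MFC_CNTL_RESTART_DMA_COMMAND);
	spin_unlock_irqrestore(&spu->register_lock, flags);
}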


> However, that isn't necessarily an issue in this context.
> 
> > > > so put this flag under the spu register_lock.  SPU exception handlers
> > > > must be tightly integrated into spu context switching, so that they
> > > > don't restart DMA while the MFC queue is being saved or restored.
> > > 
> > > Are you seeing a particular error case here?
> > 
> > Since the existing code is not atomic, mfc queue processing could be
> > restarted in the middle of a save sequence.  Not only would this cause
> > the wrong data to be saved, but it would put the hardware into a
> > different state from what was expected.    
> 
> I spent some time with Jeremy auditing that stuff (we found bugs and
> fixed them). I believe the current code is safe from that happening. The
> reason is that we disable all MFC interrupts after quiescing it and
> before saving the csa, and we use synchronize_irq() which guarantees
> that any previously received interrupt has completed execution of its
> handler.

synchronize_irq() is not atomic either! It suffers from the same problem
that I was discussing above.  It uses a simple load.  Here's an excerpt
from that routine.

	/*
	 * Wait until we're out of the critical section.  This might
	 * give the wrong answer due to the lack of memory barriers.
	 */
	while (desc->status & IRQ_INPROGRESS)
		cpu_relax();

Is synchronize_irq() even appropriate here?  It only checks that the
interrupt handler is not currently running.  The interrupt could still be
pending in the hardware, and the target processor could have interrupts
disabled, right?


> There is still a potential race if the previously issued interrupt
> hasn't reached the kernel yet but that is covered by setting the masks
> which are then tested by the interrupt handler.

I assume you are talking about the context switch pending flag here,
which I think can miss an update as noted above.  

> 
> Now there is one case we saw for which there was an issue, I don't know
> if Jeremy has addressed it yet, and your patch wouldn't fix it either, I
> believe.
> 
> The issue is that a pending fault condition is present when we quiesce.
> We quiesce the MFC, set the mask, save state etc... Later on, the
> context switch code re-enables interrupts for its own use (the loading
> and saving of the local store requires interrupts to service kernel hash
> faults). At that point, at least in Mambo, we've seen the MFC "remember"
> it had a pending fault for the previous context and re-issue that
> interrupt.
> 
No, I didn't solve this problem.  There is a step in the algorithm where
it is proposed that the O/S handle pending interrupts, but we didn't
implement this step.

> We can't use your flag for that as we do need to handle class 1
> interrupts for the kernel.

The flag is just intended to protect the saving of the mfc queue.  When
this is completed, the flag is cleared so that normal dma processing can
be performed as required by the algorithm.  
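
To make the pairing explicit, the save side would look roughly like the
sequence below (a sketch, assuming spu_save() runs in process context;
save_mfc_queues() is the existing helper in switch.c, and the other save
steps are omitted):

	/* Flag transitions take the same lock as the restart path. */
	spin_lock_irq(&spu->register_lock);
	set_bit(SPU_CONTEXT_SWITCH_PENDING, &spu->flags);
	spin_unlock_irq(&spu->register_lock);

	save_mfc_queues(csa, spu);	/* spu_restart_dma() backs off here */

	spin_lock_irq(&spu->register_lock);
	clear_bit(SPU_CONTEXT_SWITCH_PENDING, &spu->flags);
	spin_unlock_irq(&spu->register_lock);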

> 
> I think the solution that worked for us back then was to make sure we
> clear the MFC's DSISR when quiescing, though it remains to be verified
> that this works on HW and not only in mambo. 

I didn't see clearing of the MFC DSISR in Jeremy's patch.
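
For what it's worth, I would have expected something along these lines
in the quiesce path (just my guess at the change you are describing,
untested):

	/* Drop the stale fault condition while the MFC is quiesced so it
	 * cannot be re-reported once interrupts are re-enabled. */
	spu_mfc_dsisr_set(spu, 0ul);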

> 
> > 1) I think it makes the code cleaner and easier to debug.  This is
> > primarily a style issue but having the same protection apply to a field
> > regardless of where it is used in the algorithm makes it easier to
> > understand.  Eliminating this protection in the restore part of the
> > algorithm is an optimization.  I think there is an assumption in the
> > documented sequence that a stray dma will not
> 
> Bitops used as a synchronization facility are very error prone and in
> general don't clarify things much. Besides, we do need to handle kernel
> faults during context switches. I don't think your approach is buying us
> anything at this stage.

Hmm.  I am not proposing bit ops.  I am just extending the existing
serialization strategy to protect an additional critical section - the
restore part of the algorithm where we load the mfc queue.  More
importantly, I am proposing that we use a lock to protect the bit so
that it can be read reliably.  That is what the patch is all about.
  

> 
> > 2) a dma restart could occur in the restore sequence if the exception
> > was delayed long enough - probably around 80 microseconds.  The
> > exception would have to be related to pre-saved state.  This would
> > provide enough time for the context to be saved and unlocked, so that
> > the scheduler could find it and try to re-schedule it.
> 
> I'm not sure I understand the exact scenario you are talking about here.

Here is what I meant.  An interrupt is pending in the hardware and
spu_save is invoked and completes before the interrupt is delivered.
Then, spu_restore is invoked for the same context and the pending
interrupt is finally delivered. This could happen if the interrupt was
held pending in the hardware for ~100 usecs.  I am sure there are plenty
of interrupt-disabled critical sections that last that long.

Luke



