Kernel bug in 2.6.23...was: RE: How to debug a hung multi-core system....

Fri May 29 04:46:29 EST 2009

Kumar,

To follow up on our postings from late last week...
(to which I was expecting a response from you, but never got one)...

-----

We (well, mostly a very bright engineer who was very persistent) 
have found how the kernel TLB got corrupted.

We tracked down the problem to a programming bug in the DataStorage
exception handler for our kernel (2.6.23). We have looked at newer
kernels, and have noticed that this piece of processing has changed, 
but let me explain to you what happened (and the conditions that 
caused the problem on our MPC8572E, running SMP)...

If you follow the logic in this version of the kernel, it reads 
the SPRN_DEAR into register R10, and then does some operations 
(including a tlbsx operation (which uses R10)), and then attempts
to update the associated PTE entry.

Well, if you have REALLY bad luck, sometime between the time you 
took this exception and the time you try to update the PTE for this page, the 
other core has decided to invalidate this page's PTE. The good 
part is the kernel recognizes this unlucky case.

Unfortunately, in this 'bad luck' case, a kernel bug was 
introduced. The kernel uses R10 for some processing (puts
the physical address associated with this virtual page) and 
then branches up 'above' the tlbsx operation to try again 

...without restoring R10 to the SPRN_DEAR required by the tlbsx
operation...

This means that even though the kernel recognized this exceptional
problem, it NEVER did the right thing. Instead, the kernel would 
(attempt to) modify the TLB entry of the unlucky virtual address that 
corresponds to the physical address of the original DataStorage exception.

The only way we caught this is that we also had a second piece of 
'bad luck': that physical address mapped to the virtual address
of the kernel (0xC0000000). Thus, when it loops back to try again,
it gets the kernel page(s) from the tlbsx operation and modifies the
permissions on the kernel pages, causing an InstructionStorage 
exception (forever).

We fixed this in our kernel by restoring R10 to the SPRN_DEAR value
just before it loops back, something like this:

================================
              ....
	mtspr	SPRN_MAS1, r13
	tlbwe

	/* because we did NOT find in PTE */
	/* r10 was changed - so we need   */
	/* to re-load it here to work     */
	mfspr	r10, SPRN_DEAR		/* restore the faulting address */
	b	5b		/* Try again */
             ....
================================
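
To make the failure concrete, here is a toy Python model of the retry path.
It is only a sketch, not the kernel code. The function names, the 256MB size
assumed for the kernel mapping, and the two example addresses are made up
for illustration; only the 0xC0000000 kernel base and the reload of r10
from SPRN_DEAR come from the description above.

```python
# Toy model only: hypothetical names, NOT the 2.6.23 DataStorage handler.

KERNEL_BASE = 0xC0000000   # kernel mapping base (LT0 EPN in the dump below)
KERNEL_SIZE = 0x10000000   # assumed 256MB, for illustration only


def tlbsx(tlb, addr):
    """Return the name of the TLB entry covering addr, or None."""
    for name, (base, size) in tlb.items():
        if base <= addr < base + size:
            return name
    return None


def retry_lookup(tlb, dear, scratch, restore_r10):
    """Return the entry the handler would touch after branching back."""
    r10 = dear             # mfspr r10, SPRN_DEAR at handler entry
    r10 = scratch          # 'bad luck' path reuses r10 as scratch
    if restore_r10:
        r10 = dear         # the fix: reload DEAR before "b 5b"
    return tlbsx(tlb, r10) # tlbsx searches whatever r10 holds


tlb = {
    "kernel": (KERNEL_BASE, KERNEL_SIZE),
    "user_page": (0x10002000, 0x1000),
}
dear = 0x10002345          # hypothetical faulting user address
scratch = 0xC0001000       # hypothetical physical address left in r10

print(retry_lookup(tlb, dear, scratch, restore_r10=False))  # -> kernel
print(retry_lookup(tlb, dear, scratch, restore_r10=True))   # -> user_page
```

Without the reload, the retry searches for the scratch value and lands on
the kernel entry; with it, the retry searches for the faulting address again.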

That's the short and long of it...and 4 weeks of very stressful
problems...

I am wondering why nobody has found this problem before - are we the
first to be this unlucky? I am not sure that is a good thing!

Comments? Suggestions? What else should I be doing with this
information?

Tom Morrison
Principal Software Engineer
EMPIRIX 
20 Crosby Drive - Bedford, MA  01730
p: 781.266.3567 f: 781.266.3670 
email: tmorrison at empirix.com 
www.empirix.com

>> -----Original Message-----
>> From: Morrison, Tom
>> Sent: Thursday, May 21, 2009 11:24 AM
>> To: Morrison, Tom; Kumar Gala
>> Cc: linuxppc-dev at ozlabs.org; Young, Andrew; Brown, Jeff;
>> Geary Sean-R60898
>> Subject: RE: How to debug a hung multi-core system....
>> 
>> Just had a little conference with several co-workers...to go over
>> results
>> 
>> We think that LT0 (the one that maps the kernel) has been corrupted:
>> 
>>        Entry  EPN       RPN       TID  TMASK  WIMGE  TSIZ  U0:3  X0:1
>>        ---------------------------------------------------------------
>>        LT0    C0000000  00000000  00   0FF    04     9     0     0
>> 
>>        PID  TS  PROT  SHEN  UR  UW  UX  SR  SW  SX  TIDZ  VAL
>>        ---------------------------------------------------------------
>>        0    0   P     P     E   E   D   E   E   D   D     V
>> 
>> This is absolutely wrong - this is the TLB entry for the kernel - and as
>> you can see...it does NOT have execution privileges (and in fact the user
>> space HAS executive privileges for this area, the complete opposite of
>> what it should be)...
>> 
>> This is why it is stuck AT that instruction (can't even single step
>> from that location)..
>> 
>> (one of) The first problem(s) is how can/when did this TLB get
>> corrupted!
>> 
>> Tom