[PATCH 2.6.14] mm: 8xx MM fix for

Tue Nov 8 02:51:47 EST 2005

On Mon, Nov 07, 2005 at 08:16:18AM -0200, Marcelo Tosatti wrote:
> Joakim!
> 
> On Mon, Nov 07, 2005 at 03:32:52PM +0100, Joakim Tjernlund wrote:
> > Hi Marcelo
> > 
> > [SNIP] 
> > > The root of the problem is the set of changes to the 8xx TLB handlers
> > > introduced during v2.6. What happens is that the TLBMiss handlers load
> > > the zeroed pte into the TLB, causing the TLBError handler to be invoked
> > > (that's two TLB faults per pagefault), which then jumps to the generic
> > > MM code to set up the pte.
> > > 
> > > The bug is that the zeroed TLB entry is not invalidated (the same reason
> > > for the "dcbst" misbehaviour), resulting in infinite TLBError faults.
> > > 
> > > Dan, I wonder why we just don't go back to v2.4 behaviour.
> > 
> > This is one reason why it is the way it is:
> > http://ozlabs.org/pipermail/linuxppc-embedded/2005-January/016382.html
> > The details are a little fuzzy ATM, but I think the reason for the
> > current implementation was only that it was less intrusive to implement.
> 
> Ah, I see. I wonder if the bug is processor specific: we don't have such
> changes in our v2.4 tree and never experienced such a problem.
> 
> It should be pretty easy to hit it, right? (Instruction pagefaults should
> fail.)
> 
> Grigori, Tom, can you enlighten us about the issue at the URL above? How
> can it be triggered?

So after looking at the code in 2.6.14 and current git, I think the
above URL isn't relevant, unless there was a change I missed (which is
entirely possible) that reverted the patch there and fixed that issue in
a different manner.  But since I didn't figure that out until I had
finished researching it again, here is what I found:

Switching hats for a minute, this came from a bug a customer of
MontaVista found, so I can't give out the testcase :(

To repeat what Joakim said back then:
"I think I have figured this out. The first TLB misses that happen at
app startup are Data TLB misses. These will then hit the NULL L1 entry
and end up in do_page_fault() which will populate the L1 entry. But when
you have a very large app that spans more than one L1 entry (16 MB I
think) it may happen that you will have an I-TLB Miss first on one of the
L1 entries, which will make the I-TLB handler bail out to do_page_fault()
and the app crashes (SEGV)."

Looking at the patch again, what I don't see is why I talk about fudging
I-TLB Miss as being at 0x400 when it's I-TLB Error that we fudge as being
there.  I then get hung up on the slight difference between the two ("This
is because we check bit 4 of SRR1 in both cases, but in the case of an
I-TLB Miss, this bit is always set, and it only indicates a protection
fault on an I-TLB Error.").  So instead of 0x1300 jumping to the handler
at 0x400, we treat it like a regular exception so we know where we came
from.  Perhaps we missed fixing a case somewhere?
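
To make the SRR1 point concrete, here is a rough C sketch of the
distinction.  It is not kernel code: the helper names are made up for
illustration, 0x1100 as the I-TLB Miss vector is my assumption (only
0x1300 and 0x400 are mentioned above), and the mask simply applies the
usual PowerPC convention that bit 0 is the most significant bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Vectors for the two I-TLB exceptions. 0x1300 is from the text above;
 * 0x1100 for I-TLB Miss is an assumption, not stated in this thread. */
enum itlb_trap {
    TRAP_ITLB_MISS  = 0x1100,
    TRAP_ITLB_ERROR = 0x1300,
};

/* "Bit 4 of SRR1", counting bit 0 as the MSB of a 32-bit word. */
#define SRR1_BIT4 (UINT32_C(1) << (31 - 4))

/* Hypothetical: both exceptions funneled into one path that looks only
 * at SRR1. An I-TLB Miss always has the bit set, so every miss would be
 * reported as a protection fault. */
static bool prot_fault_srr1_only(uint32_t srr1)
{
    return (srr1 & SRR1_BIT4) != 0;
}

/* Hypothetical: keep track of which vector we came from, and trust the
 * bit only when the exception really was an I-TLB Error. */
static bool prot_fault_with_origin(enum itlb_trap trap, uint32_t srr1)
{
    assert(trap == TRAP_ITLB_MISS || trap == TRAP_ITLB_ERROR);
    if (trap == TRAP_ITLB_MISS)
        return false;   /* bit is always set here, carries no meaning */
    return (srr1 & SRR1_BIT4) != 0;
}
```

With the bit set in SRR1, prot_fault_srr1_only() reports a protection
fault for a plain miss, which would match the SEGV described above, while
prot_fault_with_origin() only does so for TRAP_ITLB_ERROR.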

-- 
Tom Rini
http://gate.crashing.org/~trini/