Apparent kernel bug with GDB on ppc405

Matt Mackall mpm at selenic.com
Thu Oct 25 08:32:50 EST 2007


On Wed, Oct 24, 2007 at 04:27:52PM -0600, Grant Likely wrote:
> On 10/24/07, Matt Mackall <mpm at selenic.com> wrote:
> > On Wed, Oct 24, 2007 at 03:42:16PM -0500, Matt Mackall wrote:
> > > On Wed, Oct 24, 2007 at 02:28:14PM -0600, Grant Likely wrote:
> > > > On 10/24/07, Matt Mackall <mpm at selenic.com> wrote:
> > > > > I'm trying to debug a trivial statically-linked hello world program on
> > > > > a Xilinx PPC 405 and I'm seeing the following behavior:
> > > > >
> > > > <snip>
> > > > >
> > > > > Any suggestions?
> > > >
> > > > http://thread.gmane.org/gmane.linux.ports.ppc.embedded/11202
> > > >
> > > > I was fighting with a similar problem almost 2 years ago.  Looks like
> > > > it might be related.  At some point the problem seemed to go away and
> > > > I never determined what the root cause was.  :-(
> > > >
> > > > I haven't been using gdb lately, so I don't know if it's the same
> > > > problem.  Nobody I had talked to had seen the issue on other 405
> > > > platforms.  It could very well be something virtex-specific.
> > >
> > > Could be the same problem, but I'm seeing only your symptom 3 so far.
> > >
> > > I've tried throwing some larger hammers at the problem. Flushing all
> > > of the dcache and icache (flush_dcache_all and
> > > flush_instruction_cache) isn't helping. But printk(".") does!
> >
> > Well there was one remaining cache - the TLB. This patch seems to make
> > things work, but don't ask me why:
> >
> > --- include/asm-ppc/cacheflush.h        (revision 10439)
> > +++ include/asm-ppc/cacheflush.h        (working copy)
> > @@ -11,6 +11,7 @@
> >  #define _PPC_CACHEFLUSH_H
> >
> >  #include <linux/mm.h>
> > +#include <asm/tlbflush.h>
> >
> >  /*
> >   * No cache flushing is required when address mappings are
> > @@ -35,10 +36,23 @@
> >  extern void flush_icache_user_range(struct vm_area_struct *vma,
> >                 struct page *page, unsigned long addr, int len);
> >
> >  #define copy_to_user_page(vma, page, vaddr, dst, src, len) \
> >  do { memcpy(dst, src, len); \
> >       flush_icache_user_range(vma, page, vaddr, len); \
> > +     _tlbia(); \
> >  } while (0)
> 
> Hmmm; thinking out loud here...
> 
> - so tlbia invalidates all TLB entries
> - When gdb inserts a breakpoint, the .text pages are mapped read-only,
> so the kernel does a copy-on-write so that gdb can modify the
> instruction.  The kernel also updates the page tables so that the test
> process now uses the new page.
> - This means that there are now 2 pages for that one section of
> executable code; the original and the one with the breakpoint.
> - However, the program is still in memory, and there is probably
> already a TLB entry pointing to the original page for that range of
> addresses.
> 
> Could it be that the kernel page tables are getting updated to the new
> page, but the active set of TLB entries is not getting updated?
> 
> If so, then printk(".") probably solves the problem simply because it
> touches enough pages in its execution path that the old TLB entry gets
> overwritten?  There are only 64 TLB entries, after all.
> 
> Thoughts?

Not completely implausible, but a) why isn't this seen on basically
every machine with a software-loaded TLB? b) why does -local- GDB, which is
presumably doing much less work than gdbserver + network stack, not fail?
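
For what it's worth, the stale-entry theory quoted above can be written
down as a toy userspace model. This is only a sketch of the hypothesis,
not kernel code: every toy_* name is made up for illustration, the
round-robin replacement policy and the 256-page address space are
assumptions, and only the 64-entry TLB size comes from the discussion.

```c
#include <assert.h>
#include <string.h>

#define TOY_TLB_ENTRIES 64   /* TLB size mentioned in the thread */
#define TOY_PAGES       256  /* arbitrary size of the toy address space */

struct toy_tlb_entry {
    int valid;
    unsigned long vpn;   /* virtual page number */
    unsigned long pfn;   /* physical frame number */
};

struct toy_mmu {
    struct toy_tlb_entry tlb[TOY_TLB_ENTRIES];
    unsigned int next;                    /* round-robin victim (assumed policy) */
    unsigned long page_table[TOY_PAGES];  /* vpn -> pfn */
};

/* Start with an empty TLB and an identity-mapped page table. */
static void toy_init(struct toy_mmu *m)
{
    memset(m, 0, sizeof(*m));
    for (unsigned long v = 0; v < TOY_PAGES; v++)
        m->page_table[v] = v;
}

/* Translate through the TLB; on a miss, reload from the page table. */
static unsigned long toy_translate(struct toy_mmu *m, unsigned long vpn)
{
    assert(vpn < TOY_PAGES);
    for (int i = 0; i < TOY_TLB_ENTRIES; i++)
        if (m->tlb[i].valid && m->tlb[i].vpn == vpn)
            return m->tlb[i].pfn;   /* hit: possibly a stale mapping */

    struct toy_tlb_entry *e = &m->tlb[m->next];
    m->next = (m->next + 1) % TOY_TLB_ENTRIES;
    e->valid = 1;
    e->vpn = vpn;
    e->pfn = m->page_table[vpn];
    return e->pfn;
}

/* Models the copy-on-write remap WITHOUT any TLB invalidate. */
static void toy_remap(struct toy_mmu *m, unsigned long vpn,
                      unsigned long new_pfn)
{
    assert(vpn < TOY_PAGES);
    m->page_table[vpn] = new_pfn;
}

/* Models the effect of tlbia: drop every cached translation. */
static void toy_tlbia(struct toy_mmu *m)
{
    for (int i = 0; i < TOY_TLB_ENTRIES; i++)
        m->tlb[i].valid = 0;
}
```

In this model, a toy_translate() call made right after toy_remap() still
returns the old frame. The new frame shows up only once 64 other pages
have been touched (the printk(".") effect, if the theory holds) or after
toy_tlbia() (the effect of the patch).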

-- 
Mathematics is the supreme nostalgia of our time.

