debug problems on ppc 83xx target due to changed struct task_struct

Sat Aug 13 01:47:37 AEST 2016

Thanks for the quick answer!

On 12/08/16 17:14, Dave Hansen wrote:
> On 08/12/2016 07:50 AM, Holger Brunck wrote:
>> When I try to debug our multithreaded userspace application with gdb I  get
>> stuck when trying to single step code.
> 
> Can you clarify "stuck"?  Like the instructions don't advance?  Have you
> been able to find a root cause for this?
> 

The behaviour differs slightly between kernel versions. My setup is a remote
debug session via gdbserver.

After connecting to gdbserver I set a breakpoint and run my program. When the
breakpoint is hit I try to single step. By "stuck" I mean that the connection
to gdbserver breaks and I can no longer control my debug session, while the
application does not continue either.

On kernel 4.2 I additionally got the following dump on my serial terminal:

------------[ cut here ]------------
WARNING: at
/opt/keymile/ws_root/git_repositories/prod/keyne/plat/kernel/gpl/kernel/sched/core.c:1975
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0-00003-g0478a57 #10
task: c04213d0 ti: c0434000 task.ti: c0434000
NIP: c003c4ac LR: c005d7f0 CTR: c005d7c8
REGS: c0435ce0 TRAP: 0700   Not tainted  (4.2.0-00003-g0478a57)
MSR: 00021032 <ME,IR,DR,RI>  CR: 22044228  XER: 20000000

GPR00: c005dfd8 c0435d90 c04213d0 cfba7a70 c042624c 00000000 00000001 00000000
GPR08: 00000001 00000001 00000007 ffffffff 42044228 eec349c0 00000000 00000000
GPR16: 0fe75f34 c0434000 0000000a c005d7c8 00000001 0000000a c0430000 c042624c
GPR24: 7ffc66b5 7ffc66b5 00000001 c0434000 0000000a c0426240 cfb81e90 c04261e0
NIP [c003c4ac] wake_up_process+0x10/0x20
LR [c005d7f0] hrtimer_wakeup+0x28/0x44
Call Trace:
[c0435d90] [c0426240] 0xc0426240 (unreliable)
[c0435da0] [c005dfd8] __hrtimer_run_queues.constprop.7+0x114/0x214
[c0435df0] [c005e334] hrtimer_interrupt+0xb8/0x29c
[c0435e40] [c0009c80] __timer_interrupt+0xb8/0x1c4
[c0435e60] [c000a03c] timer_interrupt+0x8c/0xb8
[c0435e90] [c000ece4] ret_from_except+0x0/0x14
--- interrupt: 901 at arch_cpu_idle+0x24/0x6c
    LR = arch_cpu_idle+0x24/0x6c
[c0435f50] [c0434000] 0xc0434000 (unreliable)
[c0435f60] [c0044cc0] cpu_startup_entry+0x138/0x1cc
[c0435fb0] [c03fdde0] start_kernel+0x32c/0x340
[c0435ff0] [00003438] 0x3438

This trace does not appear when I try the same with the latest kernel, 4.7, but
the behaviour is similar. The board is still reachable via telnet, but I need to
kill the gdbserver session manually to regain control of the initial serial
terminal. When I move the mentioned line of code, everything works fine.

>> Does anyone have an idea why the change in sched.h break my debug
>> usecase? Anyone out here who is debugging ppc83xx targets flawlessly
>> with a recent kernel?
> 
> Thanks for going to the trouble of bisecting this, btw!
> 
> I'd _suspect_ something very specific to your platform since this
> doesn't appear to affect even other ppc variants.
> 

Yes, I think so too. I ran the same test on an embedded ARM target and it works
fine, so it seems to be somehow related to ppc 83xx, which is a 32-bit target.
Multithreading and/or C++ code also seems to be needed to trigger it: I checked
with some simple code and single stepping works fine there.

It might also be that your code change simply exposes an error in the gdb/g++
environment.

> I wonder if making it cross a page boundary from some other structure
> causes this, or moving it relative to something else.  Could you try
> moving it to a few more places, or padding it by, say PAGE_SIZE on
> either side makes a difference?
> 

Yes, I can do some more tests at the beginning of next week. Moving this
definition within the structure is a good idea.

> Is there some assembly involved in your single-stepping, or some other
> code that assumes relative offsets between two pieces of 'task_struct'?
> 

No, at least not in the code we have written. I am not sure what the related
g++ libraries are doing.

Regards
Holger