debug problems on ppc 83xx target due to changed struct task_struct

Tue Aug 16 00:35:47 AEST 2016

On 12/08/16 18:09, Dave Hansen wrote:
> On 08/12/2016 08:47 AM, Holger Brunck wrote:
>> On 12/08/16 17:14, Dave Hansen wrote:
>>> On 08/12/2016 07:50 AM, Holger Brunck wrote:
>>>> When I try to debug our multithreaded userspace application with gdb I  get
>>>> stuck when trying to single step code.
>>>
>>> Can you clarify "stuck"?  Like the instructions don't advance?  Have you
>>> been able to find a root cause for this?
>>
>> the behaviour is slightly different on the kernel versions. So my setup is a
>> remote debug session via gdbserver.
>>
>> After connecting to the gdbserver I set a break point and start to run my
>> program. When hitting the breakpoint I try to single step. With stuck I mean
>> that the connection to the gdbserver is broken and I can't control my debug
>> session anymore while the application is not continuing.
> 
> Could you try debugging locally with gdb?  It would be nice to take all
> the stuff involved with remote debugging out of the picture.
> 

I tried this, but unfortunately the error only occurs while remote debugging.
Locally with gdb everything works fine. BTW, we double-checked with an 85xx ppc
target, which is also 32-bit, and it shows the same behaviour.

I also investigated how far the thread member can be moved within struct
task_struct before the problem appears, and it turns out to be like this (diff
against the 4.7 kernel):

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f..4868874 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1655,7 +1655,9 @@ struct task_struct {
        struct signal_struct *signal;
        struct sighand_struct *sighand;

+       // struct thread_struct thread;   // until here everything is fine
        sigset_t blocked, real_blocked;
+       struct thread_struct thread;      // from here it's broken
        sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */
        struct sigpending pending;

@@ -1919,7 +1921,6 @@ struct task_struct {
        struct task_struct *oom_reaper_list;
 #endif
 /* CPU-specific state of this task */
-       struct thread_struct thread;
 /*

So the boundary is in the area where signal information is stored, which makes
sense because this is heavily used during gdb debugging.

> Have you tried turning on a bunch of kernel debugging (SLAB/SLUB
> debugging, pagealloc debug, lockdep, etc...)?  If something is getting
> corrupted, those tend to catch it.
>

I switched on some memory debugging features but didn't get any suspicious
output. To make the situation even weirder, the error disappeared after I
enabled FTRACE in the kernel to trace some signal code.

> 
> Is the process still alive at the point that the remote debugger stops
> responding?  What is it doing at that point?
> 

The process is still alive. The state of the process, its threads and
gdbserver is as follows:

Bad case after a single step:
 73    73 TS       -   0  19   0  0.3 S    sigsuspend     gdbserver
 74    74 TS       -   0  19   0  0.0 tl+  ptrace_stop    infra_pbec83xx_
 74    77 IDL      0   -  19   0  0.0 tl+  ptrace_stop    TR_Task
 74    78 IDL      0   -  19   0  0.0 tl+  ptrace_stop    TR_Timeout
 74    79 TS       -   0  19   0  0.0 tl+  poll_schedule_ timed_msg
 74    80 IDL      0   -  19   0  0.0 tl+  ptrace_stop    stimuli
 74    81 TS       -  -5  24   0  0.0 t<l+ ptrace_stop    timer0Dflt
 74    82 TS       - -19  38   0  0.0 t<l+ futex_wait_que timerUpd0
 74    83 TS       - -19  38   0  0.0 t<l+ timerfd_read   timerClk
 74    84 TS       - -19  38   0  0.0 t<l+ ptrace_stop    b/beatWDogRefr

Good case after a single step:
 76    76 TS       -   0  19   0  4.0 S    poll_schedule_ gdbserver
 77    77 TS       -   0  19   0  0.0 tl   ptrace_stop    infra_pbec83xx_
 77    84 IDL      0   -  19   0  0.0 tl   ptrace_stop    TR_Task
 77    85 IDL      0   -  19   0  0.0 tl   ptrace_stop    TR_Timeout
 77    86 TS       -   0  19   0  0.0 tl   ptrace_stop    timed_msg
 77    87 IDL      0   -  19   0  0.0 tl   ptrace_stop    stimuli
 77    88 TS       -  -5  24   0  0.0 t<l  ptrace_stop    timer0Dflt
 77    89 TS       - -19  38   0  0.0 t<l  ptrace_stop    timerUpd0
 77    90 TS       - -19  38   0  0.0 t<l  ptrace_stop    timerClk
 77    91 TS       - -19  38   0  0.0 t<l  ptrace_stop    b/beatWDogRefr

So in the error case only some of the threads are in ptrace_stop, whereas all
of them should be after a single step in gdb. The problem therefore seems to
lie somewhere in the signal handling between the kernel and gdbserver.

Best regards
Holger Brunck