[RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

Tue May 3 01:52:41 AEST 2016

On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe at redhat.com> wrote:
> On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe at redhat.com> wrote:
>> >
>> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe at redhat.com> wrote:
>> > > >> I suppose we could try to rejigger the code so that rbp points to
>> > > >> pt_regs or similar.
>> > > >
>> > > > I think we should avoid doing something like that because it would break
>> > > > gdb and all the other unwinders who don't know about it.
>> > >
>> > > How so?
>> > >
>> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> > > the pt_regs.  Currently it points to something stale (which the
>> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> > > is the next thing on the stack, so just doing the section thing would
>> > > work.
>> >
>> > Yes, rbp is meaningless on the entry from user space.  But if an
>> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> > past the nested entry frame and keep going until it gets to the original
>> > entry.
>>
>> Yes.
>>
>> It would be nice if we could do better, though, and actually notice
>> the pt_regs and identify the entry.  For example, I'd love to see
>> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> crash.
>>
>> Also, I think that just following rbp links will lose the
>> actual function that took the page fault (or whatever function
>> pt_regs->ip actually points to).
>
> Hm.  I think we could fix all that in a more standard way.  Whenever a
> new pt_regs frame gets saved on entry, we could also create a new stack
> frame which points to a fake kernel_entry() function.  That would tell
> the unwinder there's a pt_regs frame without otherwise breaking frame
> pointers across the frame.
>
> Then I guess we wouldn't need my other solution of putting the idt
> entries in a special section.
>
> How does that sound?

Let me try to understand.

The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
points to (prev rbp, prev rip) on the stack, and you can follow the
chain back.  Right now, on a user access page fault or similar, we
have rbp (probably) pointing to the interrupted frame, and the
interrupted rip isn't saved anywhere that a naive unwinder can find
it.  (It's in pt_regs, but the rbp chain skips right over that.)
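
To make the rbp chain concrete, here is a minimal user-space sketch of a naive frame-pointer walk. It is only an illustration: struct frame and walk_frames() are made-up names, not existing kernel code, and the "stack" is whatever chain of structs the caller builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What "push %rbp; mov %rsp, %rbp" leaves at the address held in rbp. */
struct frame {
	const struct frame *prev_rbp;	/* saved caller rbp */
	uintptr_t ret_addr;		/* return address pushed by call */
};

/*
 * Naive unwinder (hypothetical helper): follow the rbp links and record
 * each return address.  Stops at a NULL link or when out[] is full and
 * returns the number of entries written.
 */
static size_t walk_frames(const struct frame *rbp, uintptr_t *out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	while (rbp != NULL && n < max) {
		out[n++] = rbp->ret_addr;
		rbp = rbp->prev_rbp;
	}
	return n;
}
```

Nothing in this loop ever looks at pt_regs, which is why the interrupted rip never shows up in the resulting trace.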

We could change the entry code so that an interrupt / idtentry does:

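    push pt_regs          /* pseudo-op: save the registers as pt_regs */
    push kernel_entry     /* fake return address marking the entry */
    push %rbp             /* save the interrupted rbp */
    mov %rsp, %rbp        /* rbp -> (prev rbp, kernel_entry) */
    call handler
    pop %rbp              /* restore the interrupted rbp */
    addq $8, %rsp         /* drop the kernel_entry slot */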

or similar.  That would make it appear that the actual C handler was
called by a dummy function "kernel_entry".  Now the unwinder would get
to kernel_entry, but it *still* wouldn't find its way to the calling
frame, so this only solves part of the problem.  We could at least teach
the unwinder how kernel_entry works and let it decode pt_regs to
continue unwinding.  This would be nice, and I think it could work.
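
A rough user-space sketch of what "teach the unwinder how kernel_entry works" could look like follows. Everything here is an assumption for illustration: KERNEL_ENTRY_MARKER stands in for the address of kernel_entry, struct regs_sketch holds only the two fields the walk needs (the real pt_regs has many more), and the layout assumes pt_regs sits directly above the marker frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the address of the fake kernel_entry function. */
#define KERNEL_ENTRY_MARKER ((uintptr_t)0xfeedf00d)

struct frame {
	const struct frame *prev_rbp;	/* saved caller rbp */
	uintptr_t ret_addr;		/* return address pushed by call */
};

/* Only the saved registers this sketch needs; not the real pt_regs. */
struct regs_sketch {
	uintptr_t bp;	/* rbp at the time of the interrupt */
	uintptr_t ip;	/* interrupted instruction pointer */
};

/* Assumed layout: the marker frame sits directly below the saved regs. */
struct entry_frame {
	struct frame frame;		/* ret_addr == KERNEL_ENTRY_MARKER */
	struct regs_sketch regs;
};

/*
 * Walk the rbp chain.  On a marker frame, report the interrupted ip from
 * the saved registers and resume the walk from the saved rbp.
 */
static size_t walk_with_entries(const struct frame *rbp, uintptr_t *out,
				size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	while (rbp != NULL && n < max) {
		if (rbp->ret_addr == KERNEL_ENTRY_MARKER) {
			const struct entry_frame *ef =
				(const struct entry_frame *)rbp;

			out[n++] = ef->regs.ip;
			rbp = (const struct frame *)ef->regs.bp;
		} else {
			out[n++] = rbp->ret_addr;
			rbp = rbp->prev_rbp;
		}
	}
	return n;
}
```

With this, the function that took the fault appears in the trace between the handler and the interrupted function's caller, rather than being skipped.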

I think I like this, except that, if it used a separate section, it
could potentially be faster, as, for each actual entry type, the
offset from the C handler frame to pt_regs is a foregone conclusion.
But this is pretty simple and performance is already abysmal in most
handlers.

There's an added benefit to using a separate section, though: we could
also annotate the calls with what type of entry they were so the
unwinder could print it out nicely.
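
As a sketch of the annotation idea, the unwinder could consult a table keyed by the return address of each entry call site. The table contents, addresses, and the names entry_sites, entry_kind() and format_entry() below are all invented for illustration; in a real implementation the table would presumably be emitted into the special section at build time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One record per annotated entry call site (hypothetical layout). */
struct entry_site {
	uintptr_t ret_addr;	/* address just after the 'call \do_sym' */
	const char *kind;	/* human-readable entry type */
};

/* Made-up addresses; a real table would come from the special section. */
static const struct entry_site entry_sites[] = {
	{ 0x3000, "page fault" },
	{ 0x3040, "general protection" },
};

/* Return the entry type for a return address, or NULL if not annotated. */
static const char *entry_kind(uintptr_t ret_addr)
{
	size_t i;

	for (i = 0; i < sizeof(entry_sites) / sizeof(entry_sites[0]); i++)
		if (entry_sites[i].ret_addr == ret_addr)
			return entry_sites[i].kind;
	return NULL;
}

/*
 * Format a line such as "page fault, RIP=0x2222" into buf.  Returns the
 * snprintf() result, or -1 if ret_addr is not an annotated entry site.
 */
static int format_entry(char *buf, size_t len, uintptr_t ret_addr,
			uintptr_t rip)
{
	const char *kind = entry_kind(ret_addr);

	assert(buf != NULL);
	if (kind == NULL)
		return -1;
	return snprintf(buf, len, "%s, RIP=%#lx", kind, (unsigned long)rip);
}
```

A linear search is used only to keep the sketch short; a sorted table with binary search would be the obvious choice if the table grew.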

I could be convinced either way.

>
>> Have you looked at my vdso unwinding test at all?  If we could do
>> something similar for the kernel, IMO it would make testing much more
>> pleasant.
>
> I found it, but I'm not sure what it would mean to do something similar
> for the kernel.  Do you mean doing something like an NMI sampling-based
> approach where we periodically do a random stack sanity check?

I was imagining something a little more strict: single-step
interesting parts of the kernel and make sure that each step unwinds
correctly.  That could detect missing frames and similar.