[RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking
Josh Poimboeuf
jpoimboe at redhat.com
Tue May 3 05:44:44 AEST 2016
On Mon, May 02, 2016 at 11:12:39AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe at redhat.com> wrote:
> > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe at redhat.com> wrote:
> >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe at redhat.com> wrote:
> >> >> >
> >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe at redhat.com> wrote:
> >> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> >> > > >> pt_regs or similar.
> >> >> > > >
> >> >> > > > I think we should avoid doing something like that because it would break
> >> >> > > > gdb and all the other unwinders who don't know about it.
> >> >> > >
> >> >> > > How so?
> >> >> > >
> >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that,
> >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> >> > > the pt_regs. Currently it points to something stale (which the
> >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also
> >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> >> > > is the next thing on the stack, so just doing the section thing would
> >> >> > > work.
> >> >> >
> >> >> > Yes, rbp is meaningless on the entry from user space. But if an
> >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk
> >> >> > past the nested entry frame and keep going until it gets to the original
> >> >> > entry.
> >> >>
> >> >> Yes.
> >> >>
> >> >> It would be nice if we could do better, though, and actually notice
> >> >> the pt_regs and identify the entry. For example, I'd love to see
> >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> >> crash.
> >> >>
> >> >> Also, I think that just following rbp links will lose the
> >> >> actual function that took the page fault (or whatever function
> >> >> pt_regs->ip actually points to).
> >> >
> >> > Hm. I think we could fix all that in a more standard way. Whenever a
> >> > new pt_regs frame gets saved on entry, we could also create a new stack
> >> > frame which points to a fake kernel_entry() function. That would tell
> >> > the unwinder there's a pt_regs frame without otherwise breaking frame
> >> > pointers across the frame.
> >> >
> >> > Then I guess we wouldn't need my other solution of putting the idt
> >> > entries in a special section.
> >> >
> >> > How does that sound?
> >>
> >> Let me try to understand.
> >>
> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp
> >> points to (prev rbp, prev rip) on the stack, and you can follow the
> >> chain back. Right now, on a user access page fault or similar, we
> >> have rbp (probably) pointing to the interrupted frame, and the
> >> interrupted rip isn't saved anywhere that a naive unwinder can find
> >> it. (It's in pt_regs, but the rbp chain skips right over that.)
> >>
> >> We could change the entry code so that an interrupt / idtentry does:
> >>
> >> push pt_regs
> >> push kernel_entry
> >> push %rbp
> >> mov %rsp, %rbp
> >> call handler
> >> pop %rbp
> >> addq $8, %rsp
> >>
> >> or similar. That would make it appear that the actual C handler was
> >> caused by a dummy function "kernel_entry". Now the unwinder would get
> >> to kernel_entry, but it *still* wouldn't find its way to the calling
> >> frame, which only solves part of the problem. We could at least teach
> >> the unwinder how kernel_entry works and let it decode pt_regs to
> >> continue unwinding. This would be nice, and I think it could work.
> >
> > Yeah, that's about what I had in mind.
>
> FWIW, I just tried this:
>
> static bool is_entry_text(unsigned long addr)
> {
> return addr >= (unsigned long)__entry_text_start &&
> addr < (unsigned long)__entry_text_end;
> }
>
> it works. So the entry code is already annotated reasonably well :)
>
> I just hacked it up here:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
>
> and it seems to work, at least for page faults. A better
> implementation would print out the entire contents of pt_regs so that
> people reading the stack trace will know the registers at the time of
> the exception, which might be helpful.
I still think we would need more specific annotations to do that
reliably: a call from entry code doesn't necessarily correlate with a
pt_regs frame.
> >> I think I like this, except that, if it used a separate section, it
> >> could potentially be faster, as, for each actual entry type, the
> >> offset from the C handler frame to pt_regs is a foregone conclusion.
> >
> > Hm, this I don't really follow. It's true that the unwinder can easily
> > find RIP from pt_regs, which will always be a known offset from the
> > kernel_entry pointer on the stack. But why would having the entry code
> > in a separate section make that faster?
>
> It doesn't make the unwinder faster -- it makes the entry code faster.
Oh, right. But I don't think a few extra frame pointer instructions are
much of an issue if you already have CONFIG_FRAME_POINTER enabled.
Anyway I'm not sure which way is better. I'll think about it.
> I hope your plans include rewriting the current stack unwinder
> completely. The thing in print_context_stack is (a)
> hard-to-understand and hard-to-modify crap and (b) is called in a loop
> from another file using totally ridiculous conventions.
I agree, that code is quite confusing. I haven't really thought about
how specifically it could be improved or replaced though.
Along those lines, I think it would be awesome if we could have an
arch-independent DWARF unwinder so that most of the stack dumping code
could be shared amongst all the arches.
--
Josh
More information about the Linuxppc-dev
mailing list