perf PPC: kernel panic with callchains and context switch events
dsahern at gmail.com
Tue Jul 26 01:38:09 EST 2011
On 07/24/2011 07:55 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2011-07-24 at 11:18 -0600, David Ahern wrote:
>> On 07/20/2011 03:57 PM, David Ahern wrote:
>>> I am hoping someone familiar with PPC can help understand a panic that
>>> is generated when capturing callchains with context switch events.
>>> Call trace is below. The short of it is that walking the callchain
>>> generates a page fault. To handle the page fault the mmap_sem is needed,
>>> but it is currently held by setup_arg_pages. setup_arg_pages calls
>>> shift_arg_pages with the mmap_sem held. shift_arg_pages then calls
>>> move_page_tables which has a cond_resched at the top of its for loop. If
>>> the cond_resched() is removed from move_page_tables everything works
>>> beautifully - no panics.
>>> So, the question: is it normal for walking the stack to trigger a page
>>> fault on PPC? The panic is not seen on x86 based systems.
>> Can anyone confirm whether page faults while walking the stack are
>> normal for PPC? We really want to use the context switch event with
>> callchains and need to understand whether this behavior is normal. Of
>> course if it is normal, a way to address the problem without a panic
>> will be needed.
> Now that leads to interesting discoveries :-) Becky, can you read all
> the way and let me know what you think ?
> So, trying to walk the user stack directly will potentially cause page
> faults if it's done by direct access. So if you're going to do it in a
> spot where you can't afford it, you need to pagefault_disable() I
> suppose. I think the problem with our existing code is that it's missing
> those around __get_user_inatomic().
> In fact, arguably, we don't want the hash code from modifying the hash
> either (or even hashing things in). Our 64-bit code handles it today in
> perf_callchain.c in a way that involves pretty much duplicating the
> functionality of __get_user_pages_fast() as used by x86 (see below), but
> as a fallback from a direct access which misses the pagefault_disable()
> as well.
> I think it comes from an old assumption that this would always be called
> from an nmi, and the explicit tracepoints broke that assumption.
> In fact we probably want to bump the NMI count, not just the IRQ count
> as pagefault_disable() does, to make sure we prevent hashing.
> x86 does things differently, using __get_user_pages_fast() (a variant of
> get_user_page_fast() that doesn't fallback to normal get_user_pages()).
> Now, we could do the same (use __gup_fast too), but I can see a
> potential issue with ppc 32-bit platforms that have 64-bit PTEs, since
> we could end up GUP'ing in the middle of the two accesses.
> Becky: I think gup_fast is generally broken on 32-bit with 64-bit PTE
> because of that, the problem isn't specific to perf backtraces, I'll
> propose a solution further down.
> Now, on x86, there is a similar problem with PAE, which is handled by
> - having gup disable IRQs
> - rely on the fact that to change from a valid value to another valid
> value, the PTE will first get invalidated, which requires an IPI
> and thus will be blocked by our interrupts being off
> We do the first part, but the second part will break if we use HW TLB
> invalidation broadcast (yet another reason why those are bad, I think I
> will write a blog entry about it one of these days).
> I think we can work around this while keeping our broadcast TLB
> invalidations by having the invalidation code also increment a global
> generation count (using the existing lock used by the invalidation code,
> all 32-bit platforms have such a lock).
> From there, gup_fast can be changed to, with proper ordering, check the
> generation count around the loading of the PTE and loop if it has
> changed, kind-of a seqlock.
> We also need the NMI count bump if we are going to try to keep the
> attempt at doing a direct access first for perfs.
> Becky, do you feel like giving that a shot or should I find another
> victim ? (Or even do it myself ... ) :-)
Did you have something in mind besides the patch Anton sent? We'll give
that one a try and see how it works. (Thanks, Anton!)
>> Linuxppc-dev mailing list
>> Linuxppc-dev at lists.ozlabs.org
More information about the Linuxppc-dev