[RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

Josh Poimboeuf jpoimboe at redhat.com
Thu Jun 23 02:30:11 AEST 2016


On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
> On May 23, 2016 7:28 PM, "Josh Poimboeuf" <jpoimboe at redhat.com> wrote:
> > > Maybe I'm coming around to liking this idea.
> >
> > Ok, good :-)
> >
> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > > printer, etc), unwinding across a kernel exception would look like:
> > >
> > >  - some_func
> > >  - some_other_func
> > >  - do_page_fault
> > >  - page_fault
> > >
> > > After page_fault, the next unwind step takes us to the faulting RIP
> > > (faulting_func) and reports that all GPRs are known.  It should
> > > probably learn this fact from DWARF if DWARF is available, instead of
> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > > may be incomplete.  A nice pretty printer could now print all the
> > > regs.
> > >
> > >  - faulting_func
> > >  - etc.
> > >
> > > For this to work, we don't actually need the unwinder to explicitly
> > > know where pt_regs is.
> >
> > That's true (but only for DWARF).
> >
> > > Food for thought, though: if user code does SYSENTER with TF set,
> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > > about it.  DWARF will handle it without any fanfare, but I don't know
> > > if it's going to cause trouble for you pre-DWARF.
> >
> > In this case it should see the stack pointer is past the pt_regs offset,
> > so it would just report it as an empty stack.
> 
> OK
> 
> >
> > > I'm also not sure it makes sense to apply this before the unwinder
> > > that can consume it is ready.  Maybe, if it would be consistent with
> > > your plans, it would make sense to rewrite the unwinder first, then
> > > apply this and teach live patching to use the new unwinder, and *then*
> > > add DWARF support?
> >
> > For the purposes of livepatch, the reliable unwinder needs to detect
> > whether an interrupt/exception pt_regs frame exists on a sleeping task
> > (or current).  This patch would allow us to do that.
> >
> > So my preferred order of doing things would be:
> >
> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> > 2) this patch for annotating pt_regs stack frames
> > 3) reliable unwinder, similar to what I already posted, except it relies
> >    on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
> >    the new inactive task frame format of #1
> > 4) livepatch consistency model which uses the reliable unwinder
> > 5) rewrite unwinder, and port all users to the new interface
> > 6) add DWARF unwinder
> >
> > 1-4 are pretty much already written, whereas 5 and 6 will take
> > considerably more work.
> 
> Fair enough.
> 
> >
> > > > +       /*
> > > > +        * Create a stack frame for the saved pt_regs.  This allows frame
> > > > +        * pointer based unwinders to find pt_regs on the stack.
> > > > +        */
> > > > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +       pushq   \regs
> > > > +       pushq   $pt_regs+1
> > >
> > > Why the +1?
> >
> > Some unwinders like gdb are smart enough to report the function which
> > contains the instruction *before* the return address.  Without the +1,
> > they would show the wrong function.
> 
> Lovely.  Want to add a comment?
> 
> >
> > > > +       pushq   %rbp
> > > > +       movq    %rsp, %rbp
> > > > +#endif
> > > > +       .endm
> > >
> > > I keep wanting this to be only two pushes and to fudge rbp to make it
> > > work, but I don't see a good way.  But let's call it
> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > > nested_frame or similar.
> >
> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
> 
> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
> too.  CREATE_PT_REGS_FRAME is probably fine.
> 
> > > > +
> > > > +/* fake function which allows stack unwinders to detect pt_regs frames */
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +ENTRY(pt_regs)
> > > > +       nop
> > > > +       nop
> > > > +ENDPROC(pt_regs)
> > > > +#endif /* CONFIG_FRAME_POINTER */
> > >
> > > Why is this two bytes long?  Is there some reason it has to be more
> > > than one byte?
> >
> > Similar to above, this is related to the need to support various
> > unwinders.  Whether the unwinder displays the ret_addr or the
> > instruction preceding it, either way the instruction needs to be inside
> > the function for the function to be reported.
> 
> OK.

Andy,

So I got a chance to look at this some more.  I'm thinking that to make
this feature more consistently useful, we shouldn't only annotate
pt_regs frames for calls to handlers; other calls should be annotated as
well: preempt_schedule_irq, CALL_enter_from_user_mode,
prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
etc.  That way, the unwinder will always be able to find pt_regs from an
interrupt/exception, even if starting from one of these other calls.

But then, things get ugly.  You have to either setup and tear down the
frame for every possible call, or do a higher-level setup/teardown
across multiple calls, which invalidates several assumptions in the
entry code about the location of pt_regs on the stack.

Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
make assumptions about the location of pt_regs.  And they're used by
both syscall and interrupt code.  So if we didn't create a frame pointer
header for syscalls, we'd basically need two versions of the macros: one
for irqs/exceptions and one for syscalls.

So I think the cleanest way to handle this is to always allocate two
extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
entry code can assume that pt_regs is at a constant location, and all
the above problems go away.  Another benefit is that we'd only need two
saves instead of three -- the pointer to pt_regs is no longer needed
since pt_regs is always immediately after the frame header.

I worked up a patch to implement this -- see below.  It writes the frame
pointer in all entry paths, including syscalls.  This helps keep the
code simple.

The downside is a small performance penalty: with getppid()-in-a-loop on
my laptop, the average syscall went from 52ns to 53ns, which is about a
2% slowdown.  But I doubt it would be measurable in a real-world
workload.

It looks like about half the slowdown is due to the extra stack
allocation (which presumably adds a little d-cache pressure on the stack
memory) and the other half is due to the stack writes.

I could remove the writes from the syscall path but it would only save
about half a ns, and it would make the code less robust.  Plus it's nice
to have the consistency of having *all* pt_regs frames annotated.

Thoughts?

----

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..0f6ccfc 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -88,30 +88,69 @@ For 32-bit we have the following conventions - kernel is built with
 #define RSP		19*8
 #define SS		20*8
 
-#define SIZEOF_PTREGS	21*8
 
-	.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
-	addq	$-(15*8+\addskip), %rsp
+#ifdef CONFIG_FRAME_POINTER
+
+#define PT_REGS_OFFSET 2*8
+
+	/*
+	 * Create an entry stack frame pointer header which corresponds to the
+	 * saved pt_regs.  This allows frame pointer based unwinders to find
+	 * pt_regs on the stack.  The frame pointer and the return address of a
+	 * fake function are stored immediately before the pt_regs.
+	 */
+	.macro SAVE_ENTRY_FRAME_POINTER
+	movq	%rbp,			0*8(%rsp)
+	movq	$entry_frame_ret,	1*8(%rsp)
+	movq	%rsp, %rbp
+	.endm
+
+	.macro RESTORE_ENTRY_FRAME_POINTER
+	movq	(%rsp), %rbp
+	.endm
+
+	.macro ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
+	subq	$PT_REGS_OFFSET, %rsp
+	SAVE_ENTRY_FRAME_POINTER
+	.endm
+
+#else /* !CONFIG_FRAME_POINTER */
+
+#define PT_REGS_OFFSET 0
+#define SAVE_ENTRY_FRAME_POINTER
+#define RESTORE_ENTRY_FRAME_POINTER
+#define ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
+
+#endif /* CONFIG_FRAME_POINTER */
+
+#define PT_REGS_SIZE		21*8
+#define ENTRY_FRAME_SIZE	(PT_REGS_OFFSET+PT_REGS_SIZE)
+
+#define TI_FLAGS(rsp) ASM_THREAD_INFO(TI_flags, rsp, ENTRY_FRAME_SIZE)
+#define PT_REGS(reg) reg+PT_REGS_OFFSET(%rsp)
+
+	.macro ALLOC_ENTRY_FRAME addskip=0
+	addq	$-(15*8+PT_REGS_OFFSET+\addskip), %rsp
 	.endm
 
 	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
 	.if \r11
-	movq %r11, 6*8+\offset(%rsp)
+	movq %r11, 6*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
 	.if \r8910
-	movq %r10, 7*8+\offset(%rsp)
-	movq %r9,  8*8+\offset(%rsp)
-	movq %r8,  9*8+\offset(%rsp)
+	movq %r10, 7*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r9,  8*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r8,  9*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
 	.if \rax
-	movq %rax, 10*8+\offset(%rsp)
+	movq %rax, 10*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
 	.if \rcx
-	movq %rcx, 11*8+\offset(%rsp)
+	movq %rcx, 11*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endif
-	movq %rdx, 12*8+\offset(%rsp)
-	movq %rsi, 13*8+\offset(%rsp)
-	movq %rdi, 14*8+\offset(%rsp)
+	movq %rdx, 12*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %rsi, 13*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %rdi, 14*8+PT_REGS_OFFSET+\offset(%rsp)
 	.endm
 	.macro SAVE_C_REGS offset=0
 	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
@@ -128,23 +167,39 @@ For 32-bit we have the following conventions - kernel is built with
 	.macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
 	.endm
+	.macro SAVE_C_REGS_EXCEPT_RAX
+	SAVE_C_REGS_HELPER rax=0
+	.endm
+
+	.macro SAVE_EXTRA_REGS offset=0 rbx=1
+	movq %r15, 0*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r14, 1*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r13, 2*8+PT_REGS_OFFSET+\offset(%rsp)
+	movq %r12, 3*8+PT_REGS_OFFSET+\offset(%rsp)
+#ifdef CONFIG_FRAME_POINTER
+	/* copy rbp value from entry frame header */
+	movq \offset(%rsp), %rbp
+	movq %rbp, 4*8+PT_REGS_OFFSET+\offset(%rsp)
+	leaq \offset(%rsp), %rbp
+#else
+	movq %rbp, 4*8+PT_REGS_OFFSET+\offset(%rsp)
+#endif
+	.if \rbx
+	movq %rbx, 5*8+PT_REGS_OFFSET+\offset(%rsp)
+	.endif
+	.endm
 
-	.macro SAVE_EXTRA_REGS offset=0
-	movq %r15, 0*8+\offset(%rsp)
-	movq %r14, 1*8+\offset(%rsp)
-	movq %r13, 2*8+\offset(%rsp)
-	movq %r12, 3*8+\offset(%rsp)
-	movq %rbp, 4*8+\offset(%rsp)
-	movq %rbx, 5*8+\offset(%rsp)
+	.macro SAVE_EXTRA_REGS_EXCEPT_RBX
+	SAVE_EXTRA_REGS rbx=0
 	.endm
 
 	.macro RESTORE_EXTRA_REGS offset=0
-	movq 0*8+\offset(%rsp), %r15
-	movq 1*8+\offset(%rsp), %r14
-	movq 2*8+\offset(%rsp), %r13
-	movq 3*8+\offset(%rsp), %r12
-	movq 4*8+\offset(%rsp), %rbp
-	movq 5*8+\offset(%rsp), %rbx
+	movq 0*8+PT_REGS_OFFSET+\offset(%rsp), %r15
+	movq 1*8+PT_REGS_OFFSET+\offset(%rsp), %r14
+	movq 2*8+PT_REGS_OFFSET+\offset(%rsp), %r13
+	movq 3*8+PT_REGS_OFFSET+\offset(%rsp), %r12
+	movq 4*8+PT_REGS_OFFSET+\offset(%rsp), %rbp
+	movq 5*8+PT_REGS_OFFSET+\offset(%rsp), %rbx
 	.endm
 
 	.macro ZERO_EXTRA_REGS
@@ -158,24 +213,24 @@ For 32-bit we have the following conventions - kernel is built with
 
 	.macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
 	.if \rstor_r11
-	movq 6*8(%rsp), %r11
+	movq 6*8+PT_REGS_OFFSET(%rsp), %r11
 	.endif
 	.if \rstor_r8910
-	movq 7*8(%rsp), %r10
-	movq 8*8(%rsp), %r9
-	movq 9*8(%rsp), %r8
+	movq 7*8+PT_REGS_OFFSET(%rsp), %r10
+	movq 8*8+PT_REGS_OFFSET(%rsp), %r9
+	movq 9*8+PT_REGS_OFFSET(%rsp), %r8
 	.endif
 	.if \rstor_rax
-	movq 10*8(%rsp), %rax
+	movq 10*8+PT_REGS_OFFSET(%rsp), %rax
 	.endif
 	.if \rstor_rcx
-	movq 11*8(%rsp), %rcx
+	movq 11*8+PT_REGS_OFFSET(%rsp), %rcx
 	.endif
 	.if \rstor_rdx
-	movq 12*8(%rsp), %rdx
+	movq 12*8+PT_REGS_OFFSET(%rsp), %rdx
 	.endif
-	movq 13*8(%rsp), %rsi
-	movq 14*8(%rsp), %rdi
+	movq 13*8+PT_REGS_OFFSET(%rsp), %rsi
+	movq 14*8+PT_REGS_OFFSET(%rsp), %rdi
 	.endm
 	.macro RESTORE_C_REGS
 	RESTORE_C_REGS_HELPER 1,1,1,1,1
@@ -193,8 +248,8 @@ For 32-bit we have the following conventions - kernel is built with
 	RESTORE_C_REGS_HELPER 1,0,0,1,1
 	.endm
 
-	.macro REMOVE_PT_GPREGS_FROM_STACK addskip=0
-	subq $-(15*8+\addskip), %rsp
+	.macro FREE_ENTRY_FRAME addskip=0
+	subq $-(15*8+PT_REGS_OFFSET+\addskip), %rsp
 	.endm
 
 	.macro icebp
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 23e764c..ff92759 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -55,7 +55,7 @@ ENDPROC(native_usergs_sysret64)
 
 .macro TRACE_IRQS_IRETQ
 #ifdef CONFIG_TRACE_IRQFLAGS
-	bt	$9, EFLAGS(%rsp)		/* interrupts off? */
+	bt	$9, PT_REGS(EFLAGS)		/* interrupts off? */
 	jnc	1f
 	TRACE_IRQS_ON
 1:
@@ -88,7 +88,7 @@ ENDPROC(native_usergs_sysret64)
 .endm
 
 .macro TRACE_IRQS_IRETQ_DEBUG
-	bt	$9, EFLAGS(%rsp)		/* interrupts off? */
+	bt	$9, PT_REGS(EFLAGS)		/* interrupts off? */
 	jnc	1f
 	TRACE_IRQS_ON_DEBUG
 1:
@@ -164,22 +164,17 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	pushq	$__USER_CS			/* pt_regs->cs */
 	pushq	%rcx				/* pt_regs->ip */
 	pushq	%rax				/* pt_regs->orig_ax */
-	pushq	%rdi				/* pt_regs->di */
-	pushq	%rsi				/* pt_regs->si */
-	pushq	%rdx				/* pt_regs->dx */
-	pushq	%rcx				/* pt_regs->cx */
-	pushq	$-ENOSYS			/* pt_regs->ax */
-	pushq	%r8				/* pt_regs->r8 */
-	pushq	%r9				/* pt_regs->r9 */
-	pushq	%r10				/* pt_regs->r10 */
-	pushq	%r11				/* pt_regs->r11 */
-	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
+
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
+	SAVE_C_REGS_EXCEPT_RAX
+	movq	$-ENOSYS, PT_REGS(RAX)
 
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
 	 */
-	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TI_FLAGS(%rsp)
 	jnz	entry_SYSCALL64_slow_path
 
 entry_SYSCALL_64_fastpath:
@@ -207,7 +202,7 @@ entry_SYSCALL_64_fastpath:
 	call	*sys_call_table(, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:
 
-	movq	%rax, RAX(%rsp)
+	movq	%rax, PT_REGS(RAX)
 1:
 
 	/*
@@ -217,15 +212,16 @@ entry_SYSCALL_64_fastpath:
 	 */
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	testl	$_TIF_ALLWORK_MASK, TI_FLAGS(%rsp)
 	jnz	1f
 
 	LOCKDEP_SYS_EXIT
 	TRACE_IRQS_ON		/* user mode is traced as IRQs on */
-	movq	RIP(%rsp), %rcx
-	movq	EFLAGS(%rsp), %r11
+	movq	PT_REGS(RIP), %rcx
+	movq	PT_REGS(EFLAGS), %r11
 	RESTORE_C_REGS_EXCEPT_RCX_R11
-	movq	RSP(%rsp), %rsp
+	RESTORE_ENTRY_FRAME_POINTER
+	movq	PT_REGS(RSP), %rsp
 	USERGS_SYSRET64
 
 1:
@@ -237,14 +233,14 @@ entry_SYSCALL_64_fastpath:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_EXTRA_REGS
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	jmp	return_from_SYSCALL_64
 
 entry_SYSCALL64_slow_path:
 	/* IRQs are off. */
 	SAVE_EXTRA_REGS
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
 return_from_SYSCALL_64:
@@ -255,8 +251,8 @@ return_from_SYSCALL_64:
 	 * Try to use SYSRET instead of IRET if we're returning to
 	 * a completely clean 64-bit userspace context.
 	 */
-	movq	RCX(%rsp), %rcx
-	movq	RIP(%rsp), %r11
+	movq	PT_REGS(RCX), %rcx
+	movq	PT_REGS(RIP), %r11
 	cmpq	%rcx, %r11			/* RCX == RIP */
 	jne	opportunistic_sysret_failed
 
@@ -283,8 +279,8 @@ return_from_SYSCALL_64:
 	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
 	jne	opportunistic_sysret_failed
 
-	movq	R11(%rsp), %r11
-	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
+	movq	PT_REGS(R11), %r11
+	cmpq	%r11, PT_REGS(EFLAGS)		/* R11 == RFLAGS */
 	jne	opportunistic_sysret_failed
 
 	/*
@@ -306,7 +302,7 @@ return_from_SYSCALL_64:
 
 	/* nothing to check for RSP */
 
-	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
+	cmpq	$__USER_DS, PT_REGS(SS)		/* SS must match SYSRET */
 	jne	opportunistic_sysret_failed
 
 	/*
@@ -316,7 +312,7 @@ return_from_SYSCALL_64:
 syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
 	RESTORE_C_REGS_EXCEPT_RCX_R11
-	movq	RSP(%rsp), %rsp
+	movq	PT_REGS(RSP), %rsp
 	USERGS_SYSRET64
 
 opportunistic_sysret_failed:
@@ -408,6 +404,7 @@ END(__switch_to_asm)
  * r12: kernel thread arg
  */
 ENTRY(ret_from_fork)
+	ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
 	movq	%rax, %rdi
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
@@ -415,7 +412,7 @@ ENTRY(ret_from_fork)
 	jnz	1f
 
 2:
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	SWAPGS
@@ -430,7 +427,7 @@ ENTRY(ret_from_fork)
 	 * calling do_execve().  Exit to userspace to complete the execve()
 	 * syscall.
 	 */
-	movq	$0, RAX(%rsp)
+	movq	$0, PT_REGS(RAX)
 	jmp	2b
 END(ret_from_fork)
 
@@ -460,11 +457,12 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
 
-	testb	$3, CS(%rsp)
+	testb	$3, PT_REGS(CS)
 	jz	1f
 
 	/*
@@ -500,6 +498,7 @@ END(irq_entries_start)
 	/* We entered an interrupt context - irqs are off: */
 	TRACE_IRQS_OFF
 
+	addq	$PT_REGS_OFFSET, %rdi
 	call	\func	/* rdi points to pt_regs */
 	.endm
 
@@ -521,12 +520,12 @@ ret_from_intr:
 	/* Restore saved previous stack */
 	popq	%rsp
 
-	testb	$3, CS(%rsp)
+	testb	$3, PT_REGS(CS)
 	jz	retint_kernel
 
 	/* Interrupt came from user space */
 GLOBAL(retint_user)
-	mov	%rsp,%rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
 	SWAPGS
@@ -537,7 +536,7 @@ retint_kernel:
 #ifdef CONFIG_PREEMPT
 	/* Interrupts are off */
 	/* Check if we need preemption */
-	bt	$9, EFLAGS(%rsp)		/* were interrupts off? */
+	bt	$9, PT_REGS(EFLAGS)		/* were interrupts off? */
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
@@ -558,7 +557,7 @@ GLOBAL(restore_regs_and_iret)
 	RESTORE_EXTRA_REGS
 restore_c_regs_and_iret:
 	RESTORE_C_REGS
-	REMOVE_PT_GPREGS_FROM_STACK 8
+	FREE_ENTRY_FRAME 8
 	INTERRUPT_RETURN
 
 ENTRY(native_iret)
@@ -699,11 +698,12 @@ ENTRY(\sym)
 	pushq	$-1				/* ORIG_RAX: no syscall to restart */
 	.endif
 
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 
 	.if \paranoid
 	.if \paranoid == 1
-	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
+	testb	$3, PT_REGS(CS)			/* If coming from userspace, switch stacks */
 	jnz	1f
 	.endif
 	call	paranoid_entry
@@ -720,11 +720,11 @@ ENTRY(\sym)
 	.endif
 	.endif
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
+	leaq	PT_REGS_OFFSET(%rsp), %rdi	/* pt_regs pointer */
 
 	.if \has_error_code
-	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
-	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
+	movq	PT_REGS(ORIG_RAX), %rsi		/* get error code */
+	movq	$-1, PT_REGS(ORIG_RAX)		/* no syscall to restart */
 	.else
 	xorl	%esi, %esi			/* no error code */
 	.endif
@@ -755,16 +755,15 @@ ENTRY(\sym)
 1:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
+	movq	%rsp, %rdi			/* stack frame + pt_regs */
+	call	sync_entry_frame
 	movq	%rax, %rsp			/* switch stack */
 
-	movq	%rsp, %rdi			/* pt_regs pointer */
+	leaq	PT_REGS_OFFSET(%rsp), %rdi	/* pt_regs pointer */
 
 	.if \has_error_code
-	movq	ORIG_RAX(%rsp), %rsi		/* get error code */
-	movq	$-1, ORIG_RAX(%rsp)		/* no syscall to restart */
+	movq	PT_REGS(ORIG_RAX), %rsi		/* get error code */
+	movq	$-1, PT_REGS(ORIG_RAX)		/* no syscall to restart */
 	.else
 	xorl	%esi, %esi			/* no error code */
 	.endif
@@ -922,7 +921,8 @@ ENTRY(xen_failsafe_callback)
 	movq	8(%rsp), %r11
 	addq	$0x30, %rsp
 	pushq	$-1 /* orig_ax = -1 => not a system call */
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
 	jmp	error_exit
@@ -1003,7 +1003,7 @@ paranoid_exit_no_swapgs:
 paranoid_exit_restore:
 	RESTORE_EXTRA_REGS
 	RESTORE_C_REGS
-	REMOVE_PT_GPREGS_FROM_STACK 8
+	FREE_ENTRY_FRAME 8
 	INTERRUPT_RETURN
 END(paranoid_exit)
 
@@ -1016,7 +1016,7 @@ ENTRY(error_entry)
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
 	xorl	%ebx, %ebx
-	testb	$3, CS+8(%rsp)
+	testb	$3, CS+8+PT_REGS_OFFSET(%rsp)
 	jz	.Lerror_kernelspace
 
 .Lerror_entry_from_usermode_swapgs:
@@ -1049,12 +1049,12 @@ ENTRY(error_entry)
 .Lerror_kernelspace:
 	incl	%ebx
 	leaq	native_irq_return_iret(%rip), %rcx
-	cmpq	%rcx, RIP+8(%rsp)
+	cmpq	%rcx, RIP+8+PT_REGS_OFFSET(%rsp)
 	je	.Lerror_bad_iret
 	movl	%ecx, %eax			/* zero extend */
-	cmpq	%rax, RIP+8(%rsp)
+	cmpq	%rax, RIP+8+PT_REGS_OFFSET(%rsp)
 	je	.Lbstep_iret
-	cmpq	$.Lgs_change, RIP+8(%rsp)
+	cmpq	$.Lgs_change, RIP+8+PT_REGS_OFFSET(%rsp)
 	jne	.Lerror_entry_done
 
 	/*
@@ -1066,7 +1066,7 @@ ENTRY(error_entry)
 
 .Lbstep_iret:
 	/* Fix truncated RIP */
-	movq	%rcx, RIP+8(%rsp)
+	movq	%rcx, RIP+8+PT_REGS_OFFSET(%rsp)
 	/* fall through */
 
 .Lerror_bad_iret:
@@ -1182,29 +1182,20 @@ ENTRY(nmi)
 	pushq	2*8(%rdx)	/* pt_regs->cs */
 	pushq	1*8(%rdx)	/* pt_regs->rip */
 	pushq   $-1		/* pt_regs->orig_ax */
-	pushq   %rdi		/* pt_regs->di */
-	pushq   %rsi		/* pt_regs->si */
-	pushq   (%rdx)		/* pt_regs->dx */
-	pushq   %rcx		/* pt_regs->cx */
-	pushq   %rax		/* pt_regs->ax */
-	pushq   %r8		/* pt_regs->r8 */
-	pushq   %r9		/* pt_regs->r9 */
-	pushq   %r10		/* pt_regs->r10 */
-	pushq   %r11		/* pt_regs->r11 */
-	pushq	%rbx		/* pt_regs->rbx */
-	pushq	%rbp		/* pt_regs->rbp */
-	pushq	%r12		/* pt_regs->r12 */
-	pushq	%r13		/* pt_regs->r13 */
-	pushq	%r14		/* pt_regs->r14 */
-	pushq	%r15		/* pt_regs->r15 */
 
-	/*
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
+	movq	(%rdx), %rdx
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
+
+/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
 	 * done with the NMI stack.
 	 */
 
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	movq	$-1, %rsi
 	call	do_nmi
 
@@ -1214,6 +1205,7 @@ ENTRY(nmi)
 	 * do_nmi doesn't modify pt_regs.
 	 */
 	SWAPGS
+	RESTORE_ENTRY_FRAME_POINTER
 	jmp	restore_c_regs_and_iret
 
 .Lnmi_from_kernel:
@@ -1405,7 +1397,8 @@ end_repeat_nmi:
 	 * frame to point back to repeat_nmi.
 	 */
 	pushq	$-1				/* ORIG_RAX: no syscall to restart */
-	ALLOC_PT_GPREGS_ON_STACK
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
 
 	/*
 	 * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
@@ -1417,7 +1410,7 @@ end_repeat_nmi:
 	call	paranoid_entry
 
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	movq	$-1, %rsi
 	call	do_nmi
 
@@ -1430,7 +1423,7 @@ nmi_restore:
 	RESTORE_C_REGS
 
 	/* Point RSP at the "iret" frame. */
-	REMOVE_PT_GPREGS_FROM_STACK 6*8
+	FREE_ENTRY_FRAME 6*8
 
 	/*
 	 * Clear "NMI executing".  Set DF first so that we can easily
@@ -1455,3 +1448,22 @@ ENTRY(ignore_sysret)
 	mov	$-ENOSYS, %eax
 	sysret
 END(ignore_sysret)
+
+#ifdef CONFIG_FRAME_POINTER
+/*
+ * This is a fake function which allows stack unwinders to detect entry stack
+ * frames.  The entry_frame_ret return address is stored on the stack after the
+ * frame pointer, immediately before pt_regs.
+ *
+ * Some unwinders like gdb are smart enough to report the function which
+ * contains the instruction *before* the return address on the stack.  More
+ * primitive unwinders like the kernel's will report the function containing
+ * the return address itself.  So the address needs to be in the middle of the
+ * function in order to satisfy them both.
+ */
+ENTRY(entry_frame)
+	nop
+GLOBAL(entry_frame_ret)
+	nop
+ENDPROC(entry_frame)
+#endif /* CONFIG_FRAME_POINTER */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index e1721da..31b4f63c 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -309,21 +309,16 @@ ENTRY(entry_INT80_compat)
 
 	/* Construct struct pt_regs on stack (iret frame is already on stack) */
 	pushq	%rax			/* pt_regs->orig_ax */
-	pushq	%rdi			/* pt_regs->di */
-	pushq	%rsi			/* pt_regs->si */
-	pushq	%rdx			/* pt_regs->dx */
-	pushq	%rcx			/* pt_regs->cx */
-	pushq	$-ENOSYS		/* pt_regs->ax */
-	pushq   $0			/* pt_regs->r8  = 0 */
-	pushq   $0			/* pt_regs->r9  = 0 */
-	pushq   $0			/* pt_regs->r10 = 0 */
-	pushq   $0			/* pt_regs->r11 = 0 */
-	pushq   %rbx                    /* pt_regs->rbx */
-	pushq   %rbp                    /* pt_regs->rbp */
-	pushq   %r12                    /* pt_regs->r12 */
-	pushq   %r13                    /* pt_regs->r13 */
-	pushq   %r14                    /* pt_regs->r14 */
-	pushq   %r15                    /* pt_regs->r15 */
+	ALLOC_ENTRY_FRAME
+	SAVE_ENTRY_FRAME_POINTER
+	movq	$-ENOSYS, %rax
+	xorq	%r8, %r8
+	xorq	%r9, %r9
+	xorq	%r10, %r10
+	xorq	%r11, %r11
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
+
 	cld
 
 	/*
@@ -332,7 +327,7 @@ ENTRY(entry_INT80_compat)
 	 */
 	TRACE_IRQS_OFF
 
-	movq	%rsp, %rdi
+	leaq	PT_REGS_OFFSET(%rsp), %rdi
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619..bba7ece 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -70,7 +70,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 5df831e..f3c7922 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -362,24 +362,15 @@ early_idt_handler_common:
 	incl early_recursion_flag(%rip)
 
 	/* The vector number is currently in the pt_regs->di slot. */
-	pushq %rsi				/* pt_regs->si */
-	movq 8(%rsp), %rsi			/* RSI = vector number */
-	movq %rdi, 8(%rsp)			/* pt_regs->di = RDI */
-	pushq %rdx				/* pt_regs->dx */
-	pushq %rcx				/* pt_regs->cx */
-	pushq %rax				/* pt_regs->ax */
-	pushq %r8				/* pt_regs->r8 */
-	pushq %r9				/* pt_regs->r9 */
-	pushq %r10				/* pt_regs->r10 */
-	pushq %r11				/* pt_regs->r11 */
-	pushq %rbx				/* pt_regs->bx */
-	pushq %rbp				/* pt_regs->bp */
-	pushq %r12				/* pt_regs->r12 */
-	pushq %r13				/* pt_regs->r13 */
-	pushq %r14				/* pt_regs->r14 */
-	pushq %r15				/* pt_regs->r15 */
-
-	cmpq $14,%rsi		/* Page fault? */
+	ALLOC_ENTRY_FRAME addskip=-8
+	SAVE_ENTRY_FRAME_POINTER
+	movq %rbx, PT_REGS(RBX)	/* save rbx */
+	movq PT_REGS(RDI), %rbx	/* rbx = vector number */
+
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS_EXCEPT_RBX
+
+	cmpq $14, %rbx		/* page fault? */
 	jnz 10f
 	GET_CR2_INTO(%rdi)	/* Can clobber any volatile register if pv */
 	call early_make_pgtable
@@ -387,7 +378,9 @@ early_idt_handler_common:
 	jz 20f			/* All good */
 
 10:
-	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
+	movq %rsp, %rdi
+	addq $PT_REGS_OFFSET, %rdi /* rdi = pt_regs */
+	movq %rbx, %rsi		   /* rsi = vector number */
 	call early_fixup_exception
 
 20:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d8..6954e74 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -514,22 +514,34 @@ exit:
 NOKPROBE_SYMBOL(do_int3);
 
 #ifdef CONFIG_X86_64
+
+struct entry_frame {
+#ifdef CONFIG_FRAME_POINTER
+	void *fp;
+	void *ret_addr;
+#endif
+	struct pt_regs regs;
+};
+
 /*
  * Help handler running on IST stack to switch off the IST stack if the
  * interrupted code was in user mode. The actual stack switch is done in
  * entry_64.S
  */
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
+asmlinkage __visible notrace
+struct entry_frame *sync_entry_frame(struct entry_frame *old)
 {
-	struct pt_regs *regs = task_pt_regs(current);
-	*regs = *eregs;
-	return regs;
+	struct entry_frame *new = container_of(task_pt_regs(current),
+					       struct entry_frame, regs);
+
+	*new = *old;
+	return new;
 }
-NOKPROBE_SYMBOL(sync_regs);
+NOKPROBE_SYMBOL(sync_entry_frame);
 
 struct bad_iret_stack {
 	void *error_entry_ret;
-	struct pt_regs regs;
+	struct entry_frame frame;
 };
 
 asmlinkage __visible notrace
@@ -544,15 +556,15 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	 */
 	struct bad_iret_stack *new_stack =
 		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+			     struct bad_iret_stack, frame.regs);
 
 	/* Copy the IRET target to the new stack. */
-	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
+	memmove(&new_stack->frame.regs.ip, (void *)s->frame.regs.sp, 5*8);
 
 	/* Copy the remainder of the stack from the current stack. */
-	memmove(new_stack, s, offsetof(struct bad_iret_stack, regs.ip));
+	memmove(new_stack, s, offsetof(struct bad_iret_stack, frame.regs.ip));
 
-	BUG_ON(!user_mode(&new_stack->regs));
+	BUG_ON(!user_mode(&new_stack->frame.regs));
 	return new_stack;
 }
 NOKPROBE_SYMBOL(fixup_bad_iret);


More information about the Linuxppc-dev mailing list