[RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line

Christophe Leroy christophe.leroy at csgroup.eu
Thu Dec 21 21:46:08 AEDT 2023



On 08/12/2023 at 17:30, Naveen N Rao wrote:
> Function profile sequence on powerpc includes two instructions at the
> beginning of each function:
> 
> 	mflr	r0
> 	bl	ftrace_caller
> 
> The call to ftrace_caller() gets nop'ed out during kernel boot and is
> patched in when ftrace is enabled.
> 
> There are two issues with this:
> 1. The 'mflr r0' instruction at the beginning of each function remains
>     even though ftrace is not being used.
> 2. When ftrace is activated, we return from ftrace_caller() with a bctr
>     instruction to preserve r0 and LR, resulting in the link stack
>     becoming unbalanced.
> 
> To address (1), we previously tried to nop out the 'mflr r0'
> instruction along with the call to ftrace_caller(), restoring it when
> enabling ftrace. But that required additional synchronization, slowing
> down ftrace activation. It also left an additional nop instruction at
> the beginning of each function, which wasn't desirable on 32-bit powerpc.
> 
> Instead of that, move the function profile sequence out-of-line leaving
> a single nop at function entry. On ftrace activation, the nop is changed
> to an unconditional branch to the out-of-line sequence that in turn
> calls ftrace_caller(). This removes the need for complex synchronization
> during ftrace activation and simplifies the code. More importantly, this
> improves performance of the kernel when ftrace is not in use.
> 
> To address (2), change the ftrace trampoline to return with a 'blr'
> instruction with the original return address in r0 intact. Then, an
> additional 'mtlr r0' instruction in the function profile sequence can
> move the correct return address back to LR.
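
To spell out the link stack issue for other readers: a 'bl' pushes a
predicted return address onto the link stack, and exiting the trampoline
with 'bctr' never pops that entry, so later 'blr's mispredict. My own
sketch of the before/after return path, going purely by the description
above (the register used by the old bctr return is illustrative):

	# before: indirect return from the ftrace trampoline
	mtctr	r3	# return site; register choice illustrative
	bctr		# does not pop the link stack entry pushed by 'bl'

	# after: plain return, pairing with the earlier 'bl ftrace_caller'
	blr		# original return address still intact in r0
	# ...back in the out-of-line sequence:
	mtlr	r0	# restore the caller's real return address
	b	func	# enter the function body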
> 
> With the above two changes, the function profile sequence now looks like
> the following:
> 
>   [func:		# GEP -- 64-bit powerpc, optional
> 	addis	r2,r12,imm1
> 	addi	r2,r2,imm2]
>    tramp:
> 	mflr	r0
> 	bl	ftrace_caller
> 	mtlr	r0
> 	b	func
> 	nop
> 	[nop]	# 64-bit powerpc only
>    func:		# LEP
> 	nop
> 
> On 32-bit powerpc, the ftrace mcount trampoline is now completely
> outside the function. This is also the case on 64-bit powerpc for
> functions that do not need a GEP. However, for functions that need a
> GEP, the additional instructions are inserted between the GEP and the
> LEP. Since we can only have a fixed number of instructions between GEP
> and LEP, we choose to emit 6 instructions. Four of those instructions
> are used for the function profile sequence and two instruction slots are
> reserved for implementing support for DYNAMIC_FTRACE_WITH_CALL_OPS. On
> 32-bit powerpc, we emit one additional nop for this purpose resulting in
> a total of 5 nops before function entry.
> 
> To enable ftrace, the nop at function entry is changed to an
> unconditional branch to 'tramp'. The call to ftrace_caller() may be
> updated to ftrace_regs_caller() depending on the registered ftrace ops.
> On 64-bit powerpc, we additionally change the instruction at 'tramp' to
> 'mflr r0' from an unconditional branch back to func+4. This is so that
> functions entered through the GEP can skip the function profile sequence
> unless ftrace is enabled.
> 
> With the context_switch microbenchmark on a P9 machine, there is a
> performance improvement of ~6% with this patch applied, going from 650k
> context switches to 690k context switches without ftrace enabled. With
> ftrace enabled, the performance was similar at 86k context switches.

I am wondering how significant that context_switch microbenchmark is.

I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the 
results:
# ./context_switch --no-fp
Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no 
vdso:no

On 885, I get the following results before and after your patch.

CONFIG_FTRACE not selected : 44.9k
CONFIG_FTRACE selected, before : 32.8k
CONFIG_FTRACE selected, after : 33.6k

All this is with CONFIG_INIT_STACK_ALL_ZERO, which is the default. But 
when I select CONFIG_INIT_STACK_NONE, the 'CONFIG_FTRACE not selected' 
result is only 34.4k.

On 8321:

CONFIG_FTRACE not selected : 100.3k
CONFIG_FTRACE selected, before : 72.5k
CONFIG_FTRACE selected, after : 116k

So the results look odd to me.

> 
> The downside of this approach is the increase in vmlinux size,
> especially on 32-bit powerpc. We now emit 3 additional instructions for
> each function (excluding the one or two instructions for supporting
> DYNAMIC_FTRACE_WITH_CALL_OPS). On 64-bit powerpc with the current
> implementation of -fpatchable-function-entry though, this is not
> avoidable since we are forced to emit 6 instructions between the GEP and
> the LEP even if we are to only support DYNAMIC_FTRACE_WITH_CALL_OPS.

The increase is almost 5% on the few 32-bit defconfigs I have tested. 
That's a lot.

On 32-bit powerpc, only the e500 has a link stack that could end up 
unbalanced. Could we keep the bctr and avoid the mtlr and the jump in 
the trampoline?

On the 8xx, a NOP is one cycle and a taken branch is two cycles, but the 
second cycle is a bubble that most of the time gets filled by subsequent 
operations. On the 83xx, branches and NOPs are supposed to be seamless.

So is that out-of-line trampoline really worth it? Maybe keeping the 
ftrace instructions at the beginning and just replacing the mflr with a 
jump when ftrace is off would help reduce the ftrace insns by one more 
instruction.

> 
> Signed-off-by: Naveen N Rao <naveen at kernel.org>
> ---

> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index 84f6ccd7de7a..9a54bb9e0dde 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -185,10 +185,21 @@ static inline unsigned long ppc_function_entry(void *func)
>   	 */
>   	if ((((*insn & OP_RT_RA_MASK) == ADDIS_R2_R12) ||
>   	     ((*insn & OP_RT_RA_MASK) == LIS_R2)) &&
> -	    ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2))
> +	    ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2)) {
> +#ifdef CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY

Can you replace this with IS_ENABLED()?

> +		/*
> +		 * Heuristic: look for the 'mtlr r0' instruction assuming ftrace init is done.
> +		 * If it is not found, look for two consecutive nops after the GEP.
> +		 * Longer term, we really should be parsing the symbol table to determine LEP.
> +		 */
> +		if ((*(insn+4) == PPC_RAW_MTLR(_R0)) ||
> +		    ((*(insn+2) == PPC_RAW_NOP() && *(insn+3) == PPC_RAW_NOP())))
> +			return (unsigned long)(insn + 8);
> +#endif
>   		return (unsigned long)(insn + 2);
> -	else
> +	} else {
>   		return (unsigned long)func;
> +	}
>   #elif defined(CONFIG_PPC64_ELF_ABI_V1)
>   	/*
>   	 * On PPC64 ABIv1 the function pointer actually points to the

> diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
> index 2956196c98ff..d3b4949142a8 100644
> --- a/arch/powerpc/kernel/trace/ftrace.c
> +++ b/arch/powerpc/kernel/trace/ftrace.c

> @@ -217,15 +274,62 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
>   {
>   	unsigned long addr, ip = rec->ip;
>   	ppc_inst_t old, new;
> -	int ret = 0;
> +	int i, ret = 0;
> +	u32 ftrace_mcount_tramp_insns[] = {
> +#ifdef CONFIG_PPC64
> +		PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET + MCOUNT_INSN_SIZE),
> +#else
> +		PPC_RAW_MFLR(_R0),
> +#endif

Can it be based on IS_ENABLED(CONFIG_PPC64) instead?

> +		PPC_RAW_BL(0), /* bl ftrace_caller */
> +		PPC_RAW_MTLR(_R0), /* also see update ppc_function_entry() */
> +		PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET - MCOUNT_INSN_SIZE * 2)
> +	};
> +

