[PATCH 2/2] powerpc/64s: reimplement book3s idle code in C

Gautham R Shenoy ego at linux.vnet.ibm.com
Fri Aug 10 00:45:04 AEST 2018


Hello Nicholas,
On Fri, Aug 03, 2018 at 02:13:50PM +1000, Nicholas Piggin wrote:
> Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
> specific HV idle code to the powernv platform code. Generic book3s
> assembly stubs are kept in common code and used only to save and
> restore the stack frame and non-volatile GPRs just before going to
> idle and just after waking up, which can return into C code.
> 
> Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
> way this stuff will cope with non-trivial new CPU implementation
> details, firmware changes, etc., without becoming unmaintainable.
> 
> This is not a strict translation to C code, there are some
> differences.
> 
> - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
>   but saves and restores them all explicitly.
> 
> - The optimisation where EC=ESL=0 idle modes did not have to save
>   GPRs or change MSR is restored, because it's now simple to do.
>   State loss modes that did not actually lose GPRs can use this
>   optimization too.
> 
> - KVM secondary entry code is now more of a call/return style
>   rather than spaghetti. nap_state_lost is not required because
>   KVM always returns via the NVGPR restoring path.
> 
> This seems pretty solid, so it needs more review and testing now. The
> KVM changes are pretty significant and complicated. POWER7 needs to
> be tested too.
> 
> Open question:
> - Why does P9 restore some of the PMU SPRs (but not others), and
>   P8 only zeroes them?

We restore MMCR0 (from the value saved on the stack), and MMCR1,
MMCR2, and MMCRA from the stop_sprs in the PACA. We saw that MMCRH and
MMCRC are cleared on both POWER8 and POWER9, hence we didn't restore
them. MMCRS is initialized by the KVM code.
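
For reference, a rough sketch of the pairing being described (a sketch
only; field names assumed from the old stop_sprs layout in paca.h, the
actual code differs in detail):

	/* idle entry: stash the PMU SPRs that will be restored */
	local_paca->stop_sprs.mmcr1 = mfspr(SPRN_MMCR1);
	local_paca->stop_sprs.mmcr2 = mfspr(SPRN_MMCR2);
	local_paca->stop_sprs.mmcra = mfspr(SPRN_MMCRA);

	/* state-loss wakeup: put them back */
	mtspr(SPRN_MMCR1, local_paca->stop_sprs.mmcr1);
	mtspr(SPRN_MMCR2, local_paca->stop_sprs.mmcr2);
	mtspr(SPRN_MMCRA, local_paca->stop_sprs.mmcra);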

Is there anything apart from these that needs to be restored?

> 
> Since RFC v1:
> - Now tested and working with POWER9 hash and radix.
> - KVM support added. This took a bit of work to untangle and might
>   still have some issues, but POWER9 seems to work including hash on
>   radix with dependent threads mode.
> - This snowballed a bit because of KVM and other details making it
>   not feasible to leave POWER7/8 code alone. That's only half done
>   at the moment.
> - So far this trades about 800 lines of asm for 500 of C. With POWER7/8
>   support done it might be another hundred or so lines of C.
> 
> Since RFC v2:
> - Fixed deep state SLB reloading
> - Now tested and working with POWER8.
> - Accounted for most feedback.
> 
> Since RFC v3:
> - Rebased to powerpc merge + idle state bugfix
> - Split SLB flush/restore code out and shared with MCE code (pseries
>   MCE patches can also use).
> - More testing on POWER8 including KVM with secondaries.
> - Performance testing looks good. EC=ESL=0 is about 5% faster, other
>   stop states look a little faster too.
> - Adjusted SPR saving to handle POWER7, haven't tested it.


This patch looks good to me.

A couple of comments below. 

> ---

[... snip ..]

> @@ -178,23 +177,30 @@ struct paca_struct {
>  #endif
> 
>  #ifdef CONFIG_PPC_POWERNV
> -	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> -	u32 *core_idle_state_ptr;
> -	u8 thread_idle_state;		/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
> -	/* Mask to indicate thread id in core */
> -	u8 thread_mask;
> -	/* Mask to denote subcore sibling threads */
> -	u8 subcore_sibling_mask;
> -	/* Flag to request this thread not to stop */
> -	atomic_t dont_stop;
> -	/* The PSSCR value that the kernel requested before going to stop */
> -	u64 requested_psscr;
> +	/* PowerNV idle fields */
> +	/* PNV_CORE_IDLE_* bits, all siblings work on thread 0 paca */
> +	unsigned long idle_state;
> +	union {
> +		/* P7/P8 specific fields */
> +		struct {
> +			/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
> +			u8 thread_idle_state;
> +			/* Mask to indicate thread id in core */
> +			u8 thread_mask;

This is no longer needed; we can get it from cpu_thread_in_core() in
the C code.

The only place we currently use this is to dump the value of
thread_mask from xmon; it isn't used anywhere in the idle entry code
itself.
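
For illustration, the patch already derives the thread bit this way in
power7_idle_insn():

	unsigned long thread = 1UL << cpu_thread_in_core(cpu);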

> +			/* Mask to denote subcore sibling threads */
> +			u8 subcore_sibling_mask;
> +		};
> 
> -	/*
> -	 * Save area for additional SPRs that need to be
> -	 * saved/restored during cpuidle stop.
> -	 */
> -	struct stop_sprs stop_sprs;
> +		/* P9 specific fields */
> +		struct {
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +			/* The PSSCR value that the kernel requested before going to stop */
> +			u64 requested_psscr;
> +			/* Flag to request this thread not to stop */
> +			atomic_t dont_stop;
> +#endif
> +		};
> +	};
>  #endif
[..snip..]

> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -136,8 +136,9 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
> 
>  #ifdef CONFIG_PPC_P7_NAP
>  EXC_COMMON_BEGIN(system_reset_idle_common)
> -	mfspr	r12,SPRN_SRR1
> -	b	pnv_powersave_wakeup
> +	mfspr	r3,SPRN_SRR1
> +	bltlr	cr3	/* no state loss, return to idle caller */

So, if we are in an ESL=EC=0 stop and we get an xscom SRESET, I guess
the expected value in SPRN_SRR1[46:47] will be SRR1_WS_NOLOSS.

In this case, though the LR corresponds to the caller of
isa3_idle_stop_noloss(), r3 would hold SPRN_SRR1 rather than the 0
that isa3_idle_stop_noloss() would otherwise have returned.

Do we use this value to service the NMI?
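
To make the question concrete, I imagine the caller would need
something like this (a hypothetical sketch on top of the patch's
helpers):

	srr1 = isa3_idle_stop_noloss(psscr);
	if (unlikely(srr1)) {
		/* 0x100 wakeup, e.g. xscom SRESET; wake reason in srr1 */
		WARN_ON((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS);
		/* ... presumably service the NMI using srr1 here? */
	}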

> +	b	idle_return_gpr_loss
>  #endif
> 
>  /*
> @@ -416,17 +417,17 @@ EXC_COMMON_BEGIN(machine_check_idle_common)
>  	 * Then decrement MCE nesting after finishing with the stack.
>  	 */
>  	ld	r3,_MSR(r1)
> +	ld	r4,_LINK(r1)
> 
>  	lhz	r11,PACA_IN_MCE(r13)
>  	subi	r11,r11,1
>  	sth	r11,PACA_IN_MCE(r13)
> 
> -	/* Turn off the RI bit because SRR1 is used by idle wakeup code. */
> -	/* Recoverability could be improved by reducing the use of SRR1. */
> -	li	r11,0
> -	mtmsrd	r11,1
> -
> -	b	pnv_powersave_wakeup_mce
> +	mtlr	r4
> +	rlwinm	r10,r3,47-31,30,31
> +	cmpwi	cr3,r10,2

Can we use SRR1_WS_GPRLOSS here, as well as in IDLETEST()?

> +	bltlr	cr3	/* no state loss, return to idle caller */
> +	b	idle_return_gpr_loss

[..snip..]

> +_GLOBAL(isa206_idle_insn_mayloss)
> +	std	r1,PACAR1(r13)
> +	mflr	r4
> +	mfcr	r5
> +	/* use stack red zone rather than a new frame */
> +	addi	r6,r1,-INT_FRAME_SIZE
> +	SAVE_GPR(2, r6)
> +	SAVE_NVGPRS(r6)
> +	std	r4,_LINK(r6)
> +	std	r5,_CCR(r6)
> +	cmpwi	r3,PNV_THREAD_NAP
> +	bne	1f
> +	IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)

We can have the following here, since we don't expect to return from
nap:

	b	.	/* catch bugs */

> +1:	cmpwi	r3,PNV_THREAD_SLEEP

[..snip..]

> +static unsigned long power7_idle_insn(unsigned long type)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	unsigned long thread = 1UL << cpu_thread_in_core(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;
> +	unsigned long srr1;
> +	bool full_winkle;
> +	struct p7_sprs sprs;
> +	bool sprs_saved = false;
> +	int rc;
> +
> +	memset(&sprs, 0, sizeof(sprs));
> +
> +	if (unlikely(type != PNV_THREAD_NAP)) {
> +		atomic_lock_thread_idle();
> +
> +		BUG_ON(!(*state & thread));
> +		*state &= ~thread;
> +
> +		if (power7_fastsleep_workaround_entry) {
> +			if ((*state & ((1 << threads_per_core) - 1)) == 0) {
> +				rc = opal_config_cpu_idle_state(
> +						OPAL_CONFIG_IDLE_FASTSLEEP,
> +						OPAL_CONFIG_IDLE_APPLY);
> +				BUG_ON(rc);
> +			}
> +		}
> +
> +		if (type == PNV_THREAD_WINKLE) {
> +			sprs.tscr	= mfspr(SPRN_TSCR);
> +			sprs.worc	= mfspr(SPRN_WORC);
> +
> +			sprs.sdr1	= mfspr(SPRN_SDR1);
> +			sprs.rpr	= mfspr(SPRN_RPR);
> +			sprs.amor	= mfspr(SPRN_AMOR);
> +
> +			sprs.lpcr	= mfspr(SPRN_LPCR);
> +			if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
> +				sprs.hfscr	= mfspr(SPRN_HFSCR);
> +				sprs.fscr	= mfspr(SPRN_FSCR);
> +			}
> +			sprs.purr	= mfspr(SPRN_PURR);
> +			sprs.spurr	= mfspr(SPRN_SPURR);
> +			sprs.dscr	= mfspr(SPRN_DSCR);
> +			sprs.wort	= mfspr(SPRN_WORT);
> +
> +			sprs_saved = true;
> +
> +			/*
> +			 * Increment winkle counter and set all winkle bits if
> +			 * all threads are winkling. This allows wakeup side to
> +			 * distinguish between fast sleep and winkle state
> +			 * loss. Fast sleep still has to resync the timebase so
> +			 * this may not be a really big win.
> +			 */
> +			*state += 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
> +			if ((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) >> PNV_CORE_IDLE_WINKLE_COUNT_SHIFT == threads_per_core)
> +				*state |= PNV_CORE_IDLE_THREAD_WINKLE_BITS;
> +			WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
> +		}
> +
> +		atomic_unlock_thread_idle();
> +	}
> +

Could we add the following for debug purposes? We are already dumping
thread_idle_state in xmon.

	paca->thread_idle_state = type;

> +	srr1 = isa206_idle_insn_mayloss(type);

And the following after the wakeup:

	paca->thread_idle_state = PNV_THREAD_RUNNING;

Apart from the debugging purpose, IIRC on some versions of POWER8,
SRR1[46:47] wasn't updated to 0b00 when we got an xscom SRESET while
in the running state. Thus the only way to distinguish whether we
entered 0x100 due to a nap/fastsleep/winkle wakeup, or due to an NMI
taken while the processor wasn't idle, was paca->thread_idle_state.
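
That is, something like this hypothetical check on the wakeup side:

	/* hypothetical: distinguish an idle wakeup from a stray NMI */
	if (local_paca->thread_idle_state == PNV_THREAD_RUNNING) {
		/* SRESET arrived while running, not a powersave wakeup */
	}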

> +
> +	WARN_ON_ONCE(!srr1);
> +	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
> +
> +	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
[..snip..]

> +static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;
> +	unsigned long srr1;
> +	unsigned long mmcr0 = 0;
> +	struct p9_sprs sprs;
> +	bool sprs_saved = false;
> +
> +	memset(&sprs, 0, sizeof(sprs));
> +
> +	if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
> +		BUG_ON(!mmu_on);
> +
> +		/*
> +		 * Wake synchronously. SRESET via xscom may still cause
> +		 * a 0x100 powersave wakeup with SRR1 reason!
> +		 */
> +		srr1 = isa3_idle_stop_noloss(psscr);
> +		if (likely(!srr1))
> +			return 0;
> +

We come here if we were woken up from an ESL=0 stop by an xscom
SRESET. Where would the SRESET be handled?

> +		/*
> +		 * Registers not saved, can't recover!
> +		 * This would be a hardware bug
> +		 */
> +		BUG_ON((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS);
> +
> +		goto out;
> +	}
> +


The patch looks good otherwise, especially the idle_book3s.S :-)

Reviewed-by: Gautham R. Shenoy <ego at linux.vnet.ibm.com>


--
Thanks and Regards
gautham.


