[PATCH 2/2] powerpc/64s: reimplement book3s idle code in C

Nicholas Piggin npiggin at gmail.com
Fri Aug 10 09:30:11 AEST 2018


On Thu, 9 Aug 2018 20:15:04 +0530
Gautham R Shenoy <ego at linux.vnet.ibm.com> wrote:

> Hello Nicholas,
> On Fri, Aug 03, 2018 at 02:13:50PM +1000, Nicholas Piggin wrote:
> > Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
> > specific HV idle code to the powernv platform code. Generic book3s
> > assembly stubs are kept in common code and used only to save and
> > restore the stack frame and non-volatile GPRs just before going to
> > idle and just after waking up, which can return into C code.
> > 
> > Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
> > way this stuff will cope with non-trivial new CPU implementation
> > details, firmware changes, etc., without becoming unmaintainable.
> > 
> > This is not a strict translation to C code, there are some
> > differences.
> > 
> > - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
> >   but saves and restores them all explicitly.
> > 
> > - The optimisation where EC=ESL=0 idle modes did not have to save
> >   GPRs or change MSR is restored, because it's now simple to do.
> >   State loss modes that did not actually lose GPRs can use this
> >   optimization too.
> > 
> > - KVM secondary entry code is now more of a call/return style
> >   rather than spaghetti. nap_state_lost is not required because
> >   KVM always returns via the NVGPR restoring path.
> > 
> > This seems pretty solid, so it needs more review and testing now. The
> > KVM changes are pretty significant and complicated. POWER7 needs to
> > be tested too.
> > 
> > Open question:
> > - Why does P9 restore some of the PMU SPRs (but not others), and
> >   P8 only zeroes them?  
> 
> We are restoring MMCR0 (from the value saved on the stack) and MMCR1,
> MMCR2, and MMCRA from the stop_sprs in the PACA. We saw that MMCRH and
> MMCRC are cleared on both POWER8 and POWER9, hence we didn't restore
> them. MMCRS is initialized by the KVM code.
> 
> Is there anything apart from these that needs to be restored?

No, I'm wondering why we restore those on POWER9? POWER8 does not
restore them, it only zeroes them. What is the difference with POWER9?

I will leave that in for now so we don't change too much with one patch,
but it would be nice to document the reasons for saving or clearing SPRs
a bit better.
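
To illustrate the kind of thing I mean (just a sketch, not something the
patch does today; the per-SPR comments are illustrative), the restore
path could spell out the policy per SPR:

	mtspr(SPRN_MMCR0, mmcr0);	/* saved on the stack before stopping */
	mtspr(SPRN_MMCR1, sprs.mmcr1);	/* lost in deep stop, saved in sprs */
	mtspr(SPRN_MMCR2, sprs.mmcr2);
	mtspr(SPRN_MMCRA, sprs.mmcra);
	/* MMCRC, MMCRH: cleared by the core on both P8 and P9, so not restored */
	/* MMCRS: (re)initialised by the KVM code, so not touched here */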

> 
> > 
> > Since RFC v1:
> > - Now tested and working with POWER9 hash and radix.
> > - KVM support added. This took a bit of work to untangle and might
> >   still have some issues, but POWER9 seems to work including hash on
> >   radix with dependent threads mode.
> > - This snowballed a bit because of KVM and other details making it
> >   not feasible to leave POWER7/8 code alone. That's only half done
> >   at the moment.
> > - So far this trades about 800 lines of asm for 500 of C. With POWER7/8
> >   support done it might be another hundred or so lines of C.
> > 
> > Since RFC v2:
> > - Fixed deep state SLB reloading
> > - Now tested and working with POWER8.
> > - Accounted for most feedback.
> > 
> > Since RFC v3:
> > - Rebased to powerpc merge + idle state bugfix
> > - Split SLB flush/restore code out and shared with MCE code (pseries
> >   MCE patches can also use).
> > - More testing on POWER8 including KVM with secondaries.
> > - Performance testing looks good. EC=ESL=0 is about 5% faster, other
> >   stop states look a little faster too.
> > - Adjusted SPR saving to handle POWER7, haven't tested it.
> 
> 
> This patch looks good to me.
> 
> A couple of comments below. 
> 
> > ---  
> 
> [... snip ..]
> 
> > @@ -178,23 +177,30 @@ struct paca_struct {
> >  #endif
> > 
> >  #ifdef CONFIG_PPC_POWERNV
> > -	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> > -	u32 *core_idle_state_ptr;
> > -	u8 thread_idle_state;		/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
> > -	/* Mask to indicate thread id in core */
> > -	u8 thread_mask;
> > -	/* Mask to denote subcore sibling threads */
> > -	u8 subcore_sibling_mask;
> > -	/* Flag to request this thread not to stop */
> > -	atomic_t dont_stop;
> > -	/* The PSSCR value that the kernel requested before going to stop */
> > -	u64 requested_psscr;
> > +	/* PowerNV idle fields */
> > +	/* PNV_CORE_IDLE_* bits, all siblings work on thread 0 paca */
> > +	unsigned long idle_state;
> > +	union {
> > +		/* P7/P8 specific fields */
> > +		struct {
> > +			/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
> > +			u8 thread_idle_state;
> > +			/* Mask to indicate thread id in core */
> > +			u8 thread_mask;  
> 
> This is no longer needed. We can get this from cpu_thread_in_core()
> in the C code.
> 
> The only place where we currently use this is to dump the value of
> thread_mask from xmon; it is not used anywhere else in the idle entry
> code.

Good catch, removed it.
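
For reference, the C code just computes the mask where it needs it, as
the quoted power7_idle_insn() further below already does:

	int cpu = raw_smp_processor_id();
	unsigned long thread = 1UL << cpu_thread_in_core(cpu);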

> > +			/* Mask to denote subcore sibling threads */
> > +			u8 subcore_sibling_mask;
> > +		};
> > 
> > -	/*
> > -	 * Save area for additional SPRs that need to be
> > -	 * saved/restored during cpuidle stop.
> > -	 */
> > -	struct stop_sprs stop_sprs;
> > +		/* P9 specific fields */
> > +		struct {
> > +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +			/* The PSSCR value that the kernel requested before going to stop */
> > +			u64 requested_psscr;
> > +			/* Flag to request this thread not to stop */
> > +			atomic_t dont_stop;
> > +#endif
> > +		};
> > +	};
> >  #endif  
> [..snip..]
> 
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -136,8 +136,9 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
> > 
> >  #ifdef CONFIG_PPC_P7_NAP
> >  EXC_COMMON_BEGIN(system_reset_idle_common)
> > -	mfspr	r12,SPRN_SRR1
> > -	b	pnv_powersave_wakeup
> > +	mfspr	r3,SPRN_SRR1
> > +	bltlr	cr3	/* no state loss, return to idle caller */  
> 
> So, if we are in an ESL=EC=0 stop, and we get an xscom SRESET,
> I guess the expected value in SPRN_SRR1[46:47] will be
> SRR1_WS_NOLOSS.

Yes it should, because ESL=0.

> 
> In this case, though the LR would correspond to the caller of
> isa3_idle_stop_noloss(), r3 would contain SPRN_SRR1 as opposed to the
> 0 that isa3_idle_stop_noloss() would otherwise have returned.
> 
> Do we use that value to service the NMI?

Yes. The system reset exception is cleared when the interrupt is
delivered, so we can't ignore it.

> 
> > +	b	idle_return_gpr_loss
> >  #endif
> > 
> >  /*
> > @@ -416,17 +417,17 @@ EXC_COMMON_BEGIN(machine_check_idle_common)
> >  	 * Then decrement MCE nesting after finishing with the stack.
> >  	 */
> >  	ld	r3,_MSR(r1)
> > +	ld	r4,_LINK(r1)
> > 
> >  	lhz	r11,PACA_IN_MCE(r13)
> >  	subi	r11,r11,1
> >  	sth	r11,PACA_IN_MCE(r13)
> > 
> > -	/* Turn off the RI bit because SRR1 is used by idle wakeup code. */
> > -	/* Recoverability could be improved by reducing the use of SRR1. */
> > -	li	r11,0
> > -	mtmsrd	r11,1
> > -
> > -	b	pnv_powersave_wakeup_mce
> > +	mtlr	r4
> > +	rlwinm	r10,r3,47-31,30,31
> > +	cmpwi	cr3,r10,2  
> 
> Can we use SRR1_WS_GPRLOSS here as well as in IDLETEST() ?

You mean

  cmpwi    cr3,r10,SRR1_WS_GPRLOSS >> (63-47)
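
(The rlwinm above has already shifted SRR1[46:47] into the low two bits
of r10, and SRR1_WS_GPRLOSS >> (63-47) == 0x00020000 >> 16 == 2,
assuming the usual reg.h value, so the comparison is unchanged, just
symbolic.)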

> 
> > +	bltlr	cr3	/* no state loss, return to idle caller */
> > +	b	idle_return_gpr_loss  
> 
> [..snip..]
> 
> > +_GLOBAL(isa206_idle_insn_mayloss)
> > +	std	r1,PACAR1(r13)
> > +	mflr	r4
> > +	mfcr	r5
> > +	/* use stack red zone rather than a new frame */
> > +	addi	r6,r1,-INT_FRAME_SIZE
> > +	SAVE_GPR(2, r6)
> > +	SAVE_NVGPRS(r6)
> > +	std	r4,_LINK(r6)
> > +	std	r5,_CCR(r6)
> > +	cmpwi	r3,PNV_THREAD_NAP
> > +	bne	1f
> > +	IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)  
> 
> We can have the following here, since we don't expect to return from
> nap. 
> 
>         b	. /* catch bugs */
> 
> > +1:	cmpwi	r3,PNV_THREAD_SLEEP  

Sure, added.

> [..snip..]
> 
> > +static unsigned long power7_idle_insn(unsigned long type)
> > +{
> > +	int cpu = raw_smp_processor_id();
> > +	int first = cpu_first_thread_sibling(cpu);
> > +	unsigned long thread = 1UL << cpu_thread_in_core(cpu);
> > +	unsigned long *state = &paca_ptrs[first]->idle_state;
> > +	unsigned long srr1;
> > +	bool full_winkle;
> > +	struct p7_sprs sprs;
> > +	bool sprs_saved = false;
> > +	int rc;
> > +
> > +	memset(&sprs, 0, sizeof(sprs));
> > +
> > +	if (unlikely(type != PNV_THREAD_NAP)) {
> > +		atomic_lock_thread_idle();
> > +
> > +		BUG_ON(!(*state & thread));
> > +		*state &= ~thread;
> > +
> > +		if (power7_fastsleep_workaround_entry) {
> > +			if ((*state & ((1 << threads_per_core) - 1)) == 0) {
> > +				rc = opal_config_cpu_idle_state(
> > +						OPAL_CONFIG_IDLE_FASTSLEEP,
> > +						OPAL_CONFIG_IDLE_APPLY);
> > +				BUG_ON(rc);
> > +			}
> > +		}
> > +
> > +		if (type == PNV_THREAD_WINKLE) {
> > +			sprs.tscr	= mfspr(SPRN_TSCR);
> > +			sprs.worc	= mfspr(SPRN_WORC);
> > +
> > +			sprs.sdr1	= mfspr(SPRN_SDR1);
> > +			sprs.rpr	= mfspr(SPRN_RPR);
> > +			sprs.amor	= mfspr(SPRN_AMOR);
> > +
> > +			sprs.lpcr	= mfspr(SPRN_LPCR);
> > +			if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
> > +				sprs.hfscr	= mfspr(SPRN_HFSCR);
> > +				sprs.fscr	= mfspr(SPRN_FSCR);
> > +			}
> > +			sprs.purr	= mfspr(SPRN_PURR);
> > +			sprs.spurr	= mfspr(SPRN_SPURR);
> > +			sprs.dscr	= mfspr(SPRN_DSCR);
> > +			sprs.wort	= mfspr(SPRN_WORT);
> > +
> > +			sprs_saved = true;
> > +
> > +			/*
> > +			 * Increment winkle counter and set all winkle bits if
> > +			 * all threads are winkling. This allows wakeup side to
> > +			 * distinguish between fast sleep and winkle state
> > +			 * loss. Fast sleep still has to resync the timebase so
> > +			 * this may not be a really big win.
> > +			 */
> > +			*state += 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
> > +			if ((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) >> PNV_CORE_IDLE_WINKLE_COUNT_SHIFT == threads_per_core)
> > +				*state |= PNV_CORE_IDLE_THREAD_WINKLE_BITS;
> > +			WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
> > +		}
> > +
> > +		atomic_unlock_thread_idle();
> > +	}
> > +  
> 
> Could we add the following for debug purposes? We are already dumping
> thread_idle_state in xmon.
> 
>       	paca->thread_idle_state = type; 
> 
> > +	srr1 = isa206_idle_insn_mayloss(type);  
> 
> And the following: 
>       paca->thread_idle_state = PNV_THREAD_RUNNING;

Yes I'll add that.
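
i.e. roughly (assuming local_paca is used for the per-thread field):

	local_paca->thread_idle_state = type;
	srr1 = isa206_idle_insn_mayloss(type);
	local_paca->thread_idle_state = PNV_THREAD_RUNNING;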

> 
> Apart from the debugging purpose, IIRC, on some versions of POWER8,
> SRR1[46:47] wasn't being updated to 0b00 when we got an xscom SRESET
> while in the running state. Thus, the only way to distinguish whether
> we entered 0x100 due to a nap/fastsleep/winkle wakeup or due to an NMI
> taken when the processor wasn't idle was paca->thread_idle_state.

We don't actually do anything with that today, do we? Is that one
of the reasons we don't use xscom SRESET on POWER8?

> 
> > +
> > +	WARN_ON_ONCE(!srr1);
> > +	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
> > +
> > +	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))  
> [..snip..]
> 
> > +static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> > +{
> > +	int cpu = raw_smp_processor_id();
> > +	int first = cpu_first_thread_sibling(cpu);
> > +	unsigned long *state = &paca_ptrs[first]->idle_state;
> > +	unsigned long srr1;
> > +	unsigned long mmcr0 = 0;
> > +	struct p9_sprs sprs;
> > +	bool sprs_saved = false;
> > +
> > +	memset(&sprs, 0, sizeof(sprs));
> > +
> > +	if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
> > +		BUG_ON(!mmu_on);
> > +
> > +		/*
> > +		 * Wake synchronously. SRESET via xscom may still cause
> > +		 * a 0x100 powersave wakeup with SRR1 reason!
> > +		 */
> > +		srr1 = isa3_idle_stop_noloss(psscr);
> > +		if (likely(!srr1))
> > +			return 0;
> > +  
> 
> We come here if we were woken up from an ESL=0 stop by an xscom
> SRESET. Where would the SRESET be handled?

In irq_set_pending_from_srr1(), the same as EC=1 states.
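
i.e. the wrapper ends up doing roughly this (simplified sketch, the irq
and runlatch bookkeeping is elided):

	void power9_idle_type(unsigned long stop_psscr_val,
			      unsigned long stop_psscr_mask)
	{
		unsigned long psscr, srr1;

		psscr = mfspr(SPRN_PSSCR);
		psscr = (psscr & ~stop_psscr_mask) | stop_psscr_val;

		srr1 = power9_idle_stop(psscr, true);

		/*
		 * Any SRR1 wake reason, including an xscom SRESET taken
		 * during an EC=ESL=0 stop, is folded in here, the same
		 * as for EC=1 states.
		 */
		irq_set_pending_from_srr1(srr1);
	}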

> 
> > +		/*
> > +		 * Registers not saved, can't recover!
> > +		 * This would be a hardware bug
> > +		 */
> > +		BUG_ON((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS);
> > +
> > +		goto out;
> > +	}
> > +  
> 
> 
> The patch looks good otherwise, especially the idle_book3s.S :-)
> 
> Reviewed-by: Gautham R. Shenoy <ego at linux.vnet.ibm.com>

Thank you for all the reviews

Thanks,
Nick

