[PATCH v4 29/46] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C

Nicholas Piggin npiggin at gmail.com
Fri Apr 2 18:58:46 AEDT 2021


Excerpts from Alexey Kardashevskiy's message of April 2, 2021 2:36 pm:
> 
> 
> On 01/04/2021 21:35, Nicholas Piggin wrote:
>> Excerpts from Alexey Kardashevskiy's message of April 1, 2021 3:30 pm:
>>>
>>>
>>> On 3/23/21 12:02 PM, Nicholas Piggin wrote:
>>>> Almost all logic is moved to C, by introducing a new in_guest mode that
>>>> selects and branches very early in the interrupt handler to the P9 exit
>>>> code.
>> 
>> [...]
>> 
>>>> +/*
>>>> + * kvmppc_p9_exit_hcall and kvmppc_p9_exit_interrupt are branched to from
>>>> + * above if the interrupt was taken for a guest that was entered via
>>>> + * kvmppc_p9_enter_guest().
>>>> + *
>>>> + * This code recovers the host stack and vcpu pointer, saves all GPRs and
>>>> + * CR, LR, CTR, XER as well as guest MSR and NIA into the VCPU, then re-
>>>> + * establishes the host stack and registers to return from  the
>>>> + * kvmppc_p9_enter_guest() function.
>>>
>>> What does "this code" refer to? If it is the asm below, then it does not
>>> save CTR, it is in the c code. Otherwise it is confusing (to me) :)
>> 
>> Yes you're right, CTR is saved in C.
>> 
>>>> + */
>>>> +.balign	IFETCH_ALIGN_BYTES
>>>> +kvmppc_p9_exit_hcall:
>>>> +	mfspr	r11,SPRN_SRR0
>>>> +	mfspr	r12,SPRN_SRR1
>>>> +	li	r10,0xc00
>>>> +	std	r10,HSTATE_SCRATCH0(r13)
>>>> +
>>>> +.balign	IFETCH_ALIGN_BYTES
>>>> +kvmppc_p9_exit_interrupt:
>> 
>> [...]
>> 
>>>> +static inline void slb_invalidate(unsigned int ih)
>>>> +{
>>>> +	asm volatile("slbia %0" :: "i"(ih));
>>>> +}
>>>
>>> This one is not used.
>> 
>> It gets used in a later patch, I guess I should move it there.
>> 
>> [...]
>> 
>>>> +int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	u64 *exsave;
>>>> +	unsigned long msr = mfmsr();
>>>> +	int trap;
>>>> +
>>>> +	start_timing(vcpu, &vcpu->arch.rm_entry);
>>>> +
>>>> +	vcpu->arch.ceded = 0;
>>>> +
>>>> +	WARN_ON_ONCE(vcpu->arch.shregs.msr & MSR_HV);
>>>> +	WARN_ON_ONCE(!(vcpu->arch.shregs.msr & MSR_ME));
>>>> +
>>>> +	mtspr(SPRN_HSRR0, vcpu->arch.regs.nip);
>>>> +	mtspr(SPRN_HSRR1, (vcpu->arch.shregs.msr & ~MSR_HV) | MSR_ME);
>>>> +
>>>> +	/*
>>>> +	 * On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
>>>> + * Interrupt (HDSI) the HDSISR is not updated at all.
>>>> +	 *
>>>> +	 * To work around this we put a canary value into the HDSISR before
>>>> +	 * returning to a guest and then check for this canary when we take a
>>>> +	 * HDSI. If we find the canary on a HDSI, we know the hardware didn't
>>>> +	 * update the HDSISR. In this case we return to the guest to retake the
>>>> + * HDSI, which should correctly update the HDSISR on the second HDSI
>>>> + * entry.
>>>> +	 *
>>>> +	 * Just do this on all p9 processors for now.
>>>> +	 */
>>>> +	mtspr(SPRN_HDSISR, HDSISR_CANARY);
>>>> +
>>>> +	accumulate_time(vcpu, &vcpu->arch.guest_time);
>>>> +
>>>> +	local_paca->kvm_hstate.in_guest = KVM_GUEST_MODE_GUEST_HV_FAST;
>>>> +	kvmppc_p9_enter_guest(vcpu);
>>>> +	// Radix host and guest means host never runs with guest MMU state
>>>> +	local_paca->kvm_hstate.in_guest = KVM_GUEST_MODE_NONE;
>>>> +
>>>> +	accumulate_time(vcpu, &vcpu->arch.rm_intr);
>>>> +
>>>> +	/* Get these from r11/12 and paca exsave */
>>>> +	vcpu->arch.shregs.srr0 = mfspr(SPRN_SRR0);
>>>> +	vcpu->arch.shregs.srr1 = mfspr(SPRN_SRR1);
>>>> +	vcpu->arch.shregs.dar = mfspr(SPRN_DAR);
>>>> +	vcpu->arch.shregs.dsisr = mfspr(SPRN_DSISR);
>>>> +
>>>> +	/* 0x2 bit for HSRR is only used by PR and P7/8 HV paths, clear it */
>>>> +	trap = local_paca->kvm_hstate.scratch0 & ~0x2;
>>>> +	if (likely(trap > BOOK3S_INTERRUPT_MACHINE_CHECK)) {
>>>> +		exsave = local_paca->exgen;
>>>> +	} else if (trap == BOOK3S_INTERRUPT_SYSTEM_RESET) {
>>>> +		exsave = local_paca->exnmi;
>>>> +	} else { /* trap == 0x200 */
>>>> +		exsave = local_paca->exmc;
>>>> +	}
>>>> +
>>>> +	vcpu->arch.regs.gpr[1] = local_paca->kvm_hstate.scratch1;
>>>> +	vcpu->arch.regs.gpr[3] = local_paca->kvm_hstate.scratch2;
>>>> +	vcpu->arch.regs.gpr[9] = exsave[EX_R9/sizeof(u64)];
>>>> +	vcpu->arch.regs.gpr[10] = exsave[EX_R10/sizeof(u64)];
>>>> +	vcpu->arch.regs.gpr[11] = exsave[EX_R11/sizeof(u64)];
>>>> +	vcpu->arch.regs.gpr[12] = exsave[EX_R12/sizeof(u64)];
>>>> +	vcpu->arch.regs.gpr[13] = exsave[EX_R13/sizeof(u64)];
>>>> +	vcpu->arch.ppr = exsave[EX_PPR/sizeof(u64)];
>>>> +	vcpu->arch.cfar = exsave[EX_CFAR/sizeof(u64)];
>>>> +	vcpu->arch.regs.ctr = exsave[EX_CTR/sizeof(u64)];
>>>> +
>>>> +	vcpu->arch.last_inst = KVM_INST_FETCH_FAILED;
>>>> +
>>>> +	if (unlikely(trap == BOOK3S_INTERRUPT_MACHINE_CHECK)) {
>>>> +		vcpu->arch.fault_dar = exsave[EX_DAR/sizeof(u64)];
>>>> +		vcpu->arch.fault_dsisr = exsave[EX_DSISR/sizeof(u64)];
>>>> +		kvmppc_realmode_machine_check(vcpu);
>>>> +
>>>> +	} else if (unlikely(trap == BOOK3S_INTERRUPT_HMI)) {
>>>> +		kvmppc_realmode_hmi_handler();
>>>> +
>>>> +	} else if (trap == BOOK3S_INTERRUPT_H_EMUL_ASSIST) {
>>>> +		vcpu->arch.emul_inst = mfspr(SPRN_HEIR);
>>>> +
>>>> +	} else if (trap == BOOK3S_INTERRUPT_H_DATA_STORAGE) {
>>>> +		vcpu->arch.fault_dar = exsave[EX_DAR/sizeof(u64)];
>>>> +		vcpu->arch.fault_dsisr = exsave[EX_DSISR/sizeof(u64)];
>>>> +		vcpu->arch.fault_gpa = mfspr(SPRN_ASDR);
>>>> +
>>>> +	} else if (trap == BOOK3S_INTERRUPT_H_INST_STORAGE) {
>>>> +		vcpu->arch.fault_gpa = mfspr(SPRN_ASDR);
>>>> +
>>>> +	} else if (trap == BOOK3S_INTERRUPT_H_FAC_UNAVAIL) {
>>>> +		vcpu->arch.hfscr = mfspr(SPRN_HFSCR);
>>>> +
>>>> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>>>> +	/*
>>>> +	 * Softpatch interrupt for transactional memory emulation cases
>>>> +	 * on POWER9 DD2.2.  This is early in the guest exit path - we
>>>> +	 * haven't saved registers or done a treclaim yet.
>>>> +	 */
>>>> +	} else if (trap == BOOK3S_INTERRUPT_HV_SOFTPATCH) {
>>>> +		vcpu->arch.emul_inst = mfspr(SPRN_HEIR);
>>>> +
>>>> +		/*
>>>> +		 * The cases we want to handle here are those where the guest
>>>> +		 * is in real suspend mode and is trying to transition to
>>>> +		 * transactional mode.
>>>> +		 */
>>>> +		if (local_paca->kvm_hstate.fake_suspend &&
>>>> +				(vcpu->arch.shregs.msr & MSR_TS_S)) {
>>>> +			if (kvmhv_p9_tm_emulation_early(vcpu)) {
>>>> +				/* Prevent it being handled again. */
>>>> +				trap = 0;
>>>> +			}
>>>> +		}
>>>> +#endif
>>>> +	}
>>>> +
>>>> +	radix_clear_slb();
>>>> +
>>>> +	__mtmsrd(msr, 0);
>>>
>>>
>>> The asm code only sets RI, but this potentially sets more bits including
>>> MSR_EE. Is it expected to be 0 when __kvmhv_vcpu_entry_p9() is called?
>> 
>> Yes.
>> 
>>>> +	mtspr(SPRN_CTRLT, 1);
>>>
>>> What is this for? ISA does not shed much light:
>>> ===
>>> 63 RUN This  bit  controls  an  external  I/O  pin.
>>> ===
>> 
>> I don't think it even does that these days. It interacts with the PMU.
>> I was looking whether it's feasible to move it into PMU code entirely,
>> but apparently some tool or something might sample it. I'm a bit
>> suspicious about that because an untrusted guest could be running and
>> claim not to be, so I don't know what said tool really achieves, but I'll
>> go through that fight another day.
>> 
>> But KVM has to set it to 1 at exit because the Linux host has it set to 1
>> except in CPU idle.
> 
> 
> Is this CTRLT setting a new thing, or does the asm do it too? I could not 
> spot it.

It's quite old actually. On earlier processors (maybe POWER6) you even had 
to read-modify-write, but on newer ones you can just store 1:

Guest exit:
        /* Save guest CTRL register, set runlatch to 1 */
        mfspr   r6,SPRN_CTRLF
        stw     r6,VCPU_CTRL(r9)
        andi.   r0,r6,1
        bne     4f
        ori     r6,r6,1
        mtspr   SPRN_CTRLT,r6
4:

entry:
        /* Restore state of CTRL run bit; assume 1 on entry */
        lwz     r5,VCPU_CTRL(r4)
        andi.   r5,r5,1
        bne     4f
        mfspr   r6,SPRN_CTRLF
        clrrdi  r6,r6,1
        mtspr   SPRN_CTRLT,r6

It used to light an indicator on the front of the system once upon a 
time, and I think on some processors (Cell maybe?) it actually controlled 
SMT threads in some way. But certainly in P9 it does almost nothing and
we'll probably try to phase it out.
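For anyone following the asm above in C: the run-latch handling boils down 
to the logic below. This is a self-contained sketch only, with the SPR 
stubbed out as a plain variable; the names ctrl_reg, save_guest_ctrl and 
restore_guest_ctrl are made up for illustration, not the kernel's actual 
identifiers.

```c
#include <stdint.h>

/*
 * Sketch of the CTRL run-latch save/restore shown in the asm above.
 * The SPR is modelled as a plain variable (real code uses mfspr/mtspr).
 */
static uint64_t ctrl_reg = 1;           /* host normally runs with RUN=1 */

/* Guest exit: save the guest's CTRL, then set the run latch for the host. */
static uint64_t save_guest_ctrl(void)
{
        uint64_t guest_ctrl = ctrl_reg;         /* mfspr  SPRN_CTRLF     */

        if (!(guest_ctrl & 1))
                ctrl_reg = guest_ctrl | 1;      /* mtspr  SPRN_CTRLT     */
        return guest_ctrl;                      /* stashed for re-entry  */
}

/* Guest entry: the run latch is assumed 1; clear it if the guest had 0. */
static void restore_guest_ctrl(uint64_t guest_ctrl)
{
        if (!(guest_ctrl & 1))
                ctrl_reg &= ~1ULL;              /* clrrdi + mtspr CTRLT  */
}
```

On newer processors the exit side can skip the read-modify-write and 
simply store 1 unconditionally, which is what the mtspr(SPRN_CTRLT, 1) 
in the new C path does.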

>>> The asm does "For hash guest, read the guest SLB and save it away", this
>>> code does not. Is this new fast-path-in-C only for radix-on-radix, or are
>>> hash VMs supported too?
>> 
>> That asm code does not run for "guest_exit_short_path" case (aka the
>> p9 path aka the fast path).
>> 
>> Upstream code only supports radix host and radix guest in this path.
>> The old path supports hash and radix. That's unchanged with this patch.
>> 
>> After the series, the new path supports all P9 modes (hash/hash,
>> radix/radix, and radix/hash), and the old path supports P7 and P8 only.
> 
> 
> Thanks for clarification. Besides that CTRLT, I checked if the new c 
> code matches the old asm code (which made diving into ISA incredible fun 
> :) ) so fwiw
> 
> Reviewed-by: Alexey Kardashevskiy <aik at ozlabs.ru>

Thanks for reviewing.

> I'd really like to see longer commit logs clarifying all intended 
> changes but it is probably just me.

I'm not sure what the best balance is; at some point the code is the more 
precise description. For this particular patch I probably do need to go 
over the changelog again and make sure it makes sense and covers 
everything. If there are specifics you find missing or changes you'd like,
I would definitely consider them.

Thanks,
Nick

