[RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

Tue Jan 10 14:11:50 EST 2012

On 10.01.2012, at 01:51, Scott Wood wrote:

> On 01/09/2012 11:46 AM, Alexander Graf wrote:
>> 
>> On 21.12.2011, at 02:34, Scott Wood wrote:
>> 
>>> Chips such as e500mc that implement category E.HV in Power ISA 2.06
>>> provide hardware virtualization features, including a new MSR mode for
>>> guest state.  The guest OS can perform many operations without trapping
>>> into the hypervisor, including transitions to and from guest userspace.
>>> 
>>> Since we can use SRR1[GS] to reliably tell whether an exception came from
>>> guest state, instead of messing around with IVPR, we use DO_KVM similarly
>>> to book3s.
>> 
>> Is there any benefit of using DO_KVM? I would assume that messing with IVPR is faster.
> 
> Using the GS bit to decide which handler to run means we won't get
> confused if a machine check or critical interrupt happens between
> entering/exiting the guest and updating IVPR (we could use the IS bit
> similarly in PR-mode).
> 
> This could be supplemented with IVPR (though that will add a few cycles
> to guest entry/exit) or some sort of runtime patching (would be more
> coarse-grained, active when any KVM guest exists) to avoid adding
> overhead to traps when KVM is not used, but I'd like to quantify that
> overhead first.  It should be much lower than what happens on book3s.

Hrm. Yeah, given that your DO_KVM handler is so much simpler, it might make sense to stick with that method. Benchmarks would be useful in the long run though.

> 
>>> Current issues include:
>>> - Machine checks from guest state are not routed to the host handler.
>>> - The guest can cause a host oops by executing an emulated instruction
>>>  in a page that lacks read permission.  Existing e500/4xx support has
>>>  the same problem.
>> 
>> We solve that in book3s pr by doing
>> 
>>  LAST_INST = <known bad value>;
>>  PACA->kvm_mode = <recover at next inst>;
>>  lwz(guest pc);
>>  do_more_stuff();
>> 
>> That way when an exception occurs at lwz() the DO_KVM handler checks that we're in kvm mode "recover" which does basically srr0+=4; rfi;.
> 
> I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
> treat it as a kernel fault (search exception table) -- but this works
> too and is a bit cleaner (could be other uses of external pid), at the
> expense of a couple extra instructions in the emulation path (but
> probably a slightly faster host TLB handler).
> 
> The check wouldn't go in DO_KVM, though, since on bookehv that only
> deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
> case here.

Yup, not sure where you'd put the check, as it'd slow down normal operation too. Hrm.

> 
>>> @@ -243,16 +324,20 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
>>> 	case BOOKE_IRQPRIO_AP_UNAVAIL:
>>> 	case BOOKE_IRQPRIO_ALIGNMENT:
>>> 		allowed = 1;
>>> -		msr_mask = MSR_CE|MSR_ME|MSR_DE;
>>> +		msr_mask = MSR_GS | MSR_CE | MSR_ME | MSR_DE;
>> 
>> No need to do this. You already force MSR_GS in set_msr();
> 
> OK.  This was here since before set_msr() started doing that. :-)
> 
>>> +	if (!current->thread.kvm_vcpu) {
>>> +		WARN(1, "no vcpu\n");
>>> +		return -EPERM;
>>> +	}
>> 
>> Huh?
> 
> Oops, leftover debugging.
> 
>>> +static int emulation_exit(struct kvm_run *run, struct kvm_vcpu *vcpu)
>>> +{
>>> +	enum emulation_result er;
>>> +
>>> +	er = kvmppc_emulate_instruction(run, vcpu);
>>> +	switch (er) {
>>> +	case EMULATE_DONE:
>>> +		/* don't overwrite subtypes, just account kvm_stats */
>>> +		kvmppc_account_exit_stat(vcpu, EMULATED_INST_EXITS);
>>> +		/* Future optimization: only reload non-volatiles if
>>> +		 * they were actually modified by emulation. */
>>> +		return RESUME_GUEST_NV;
>>> +
>>> +	case EMULATE_DO_DCR:
>>> +		run->exit_reason = KVM_EXIT_DCR;
>>> +		return RESUME_HOST;
>>> +
>>> +	case EMULATE_FAIL:
>>> +		/* XXX Deliver Program interrupt to guest. */
>>> +		printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
>>> +		       __func__, vcpu->arch.regs.nip, vcpu->arch.last_inst);
>> 
>> This should be throttled, otherwise the guest can spam our logs.
> 
> Yes it should, but I'm just moving the code here.

Yeah, only realized this later. Maybe next time (not for this patch set, next time you're sending something) just extract these mechanical parts, so it's easier to review the pieces where code actually changes :).

> 
>>> +		/* For debugging, encode the failing instruction and
>>> +		 * report it to userspace. */
>>> +		run->hw.hardware_exit_reason = ~0ULL << 32;
>>> +		run->hw.hardware_exit_reason |= vcpu->arch.last_inst;
>> 
>> 
>> I'm fairly sure you want to fix this :)
> 
> Likewise, that's what booke.c already does.  What should it do instead?

This is what book3s does:

                case EMULATE_FAIL:
                        printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
                               __func__, kvmppc_get_pc(vcpu), kvmppc_get_last_inst(vcpu));
                        kvmppc_core_queue_program(vcpu, flags);
                        r = RESUME_GUEST;

which also doesn't throttle the printk, but I think injecting a program fault into the guest is the most sensible thing to do if we don't know what the instruction is supposed to do. Best case we get an oops inside the guest telling us what broke :).

> 
>> /**
>>> * kvmppc_handle_exit
>>> *
>>> @@ -374,12 +530,39 @@ out:
>>> int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>>                       unsigned int exit_nr)
>>> {
>>> -	enum emulation_result er;
>>> 	int r = RESUME_HOST;
>>> 
>>> 	/* update before a new last_exit_type is rewritten */
>>> 	kvmppc_update_timing_stats(vcpu);
>>> 
>>> +	/*
>>> +	 * If we actually care, we could copy MSR, DEAR, and ESR to regs,
>>> +	 * insert an appropriate trap number, etc.
>>> +	 *
>>> +	 * Seems like a waste of cycles for something that should only matter
>>> +	 * to someone using sysrq-t/p or similar host kernel debug facility.
>>> +	 * We have other debug facilities to get that information from a
>>> +	 * guest through userspace.
>>> +	 */
>>> +	switch (exit_nr) {
>>> +	case BOOKE_INTERRUPT_EXTERNAL:
>>> +		do_IRQ(&vcpu->arch.regs);
>> 
>> Ah, so that's what you want to use regs for. So is having a pt_regs
>> struct that only contains useful register values in half its fields
>> any useful here? Or could we keep control of the registers ourselves,
>> enabling us to maybe one day optimize things more.
> 
> I think it contains enough to be useful for debugging code such as sysrq
> and tracers, and as noted in the comment we could copy the rest if we
> care enough.  MSR might be worth copying.
> 
> It will eventually be used for machine checks as well, which I'd like to
> hand reasonable register state to, at least for GPRs, LR, and PC.
> 
> If there's a good enough performance reason, we could just copy
> everything over for machine checks and pass NULL to do_IRQ (I think it
> can take this -- a dummy regs struct if not), but it seems premature at
> the moment unless the switch already causes measured performance loss
> (cache utilization?).

I'm definitely not concerned about performance, but complexity and uniqueness. With the pt_regs struct, we have a bunch of fields in the vcpu that are there, but unused. I find that situation pretty confusing.

So yes, I would definitely prefer to copy registers during MC and keep the registers where they are today - unless there are SPRs for them of course.

Imagine we'd one day want to share GPRs with user space through the kvm_run structure (see the s390 patches on the ML for this). I really wouldn't want to make pt_regs part of our userspace ABI.

> 
>>> @@ -387,30 +570,56 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
>>> 
>>> 	switch (exit_nr) {
>>> 	case BOOKE_INTERRUPT_MACHINE_CHECK:
>>> -		printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR));
>>> -		kvmppc_dump_vcpu(vcpu);
>>> -		r = RESUME_HOST;
>>> +		kvm_resched(vcpu);
>>> +		r = RESUME_GUEST;
>> 
>> huh?
> 
> Patch shuffling accident -- this belongs with a later patch that invokes
> the host machine check handler similar to what is done with do_IRQ().
> The host machine check handler needs some work first, though.
> 
>>> 		break;
>>> 
>>> 	case BOOKE_INTERRUPT_EXTERNAL:
>>> 		kvmppc_account_exit(vcpu, EXT_INTR_EXITS);
>>> -		if (need_resched())
>>> -			cond_resched();
>>> +		kvm_resched(vcpu);
>> 
>> Why are we explicit about the resched? On book3s I just call kvm_resched(vcpu) before the switch().
> 
> There are a few exit types where we don't currently do the resched -- if
> they're all bugs or don't-cares, we could move it out of the switch.
> 
> We probably should defer the check until after we've disabled
> interrupts, similar to signals -- even if we didn't exit for an
> interrupt, we could have received one after enabling them.

Yup. I just don't think you can call resched() with interrupts disabled, so a bit cleverness is probably required here.

> 
>>> +		if (kvm_is_visible_gfn(vcpu->kvm, gfn)) {
>>> +			/* The guest TLB had a mapping, but the shadow TLB
>>> +			 * didn't. This could be because:
>>> +			 * a) the entry is mapping the host kernel, or
>>> +			 * b) the guest used a large mapping which we're faking
>>> +			 * Either way, we need to satisfy the fault without
>>> +			 * invoking the guest. */
>>> +			kvmppc_mmu_map(vcpu, eaddr, gpaddr, gtlb_index);
>>> +		} else {
>>> +			/* Guest mapped and leaped at non-RAM! */
>>> +			kvmppc_booke_queue_irqprio(vcpu,
>>> +						   BOOKE_IRQPRIO_MACHINE_CHECK);
>> 
>> Are you sure? Couldn't this also be MMIO? That doesn't really improve the situation as executing from MMIO is tricky with the KVM model, but it's not necessarily bad. Oh well, I guess we'll have to do something and throwing an #MC isn't all that ugly.
> 
> I think I asked you about executing from MMIO once, and you said it
> wasn't supported even in straight QEMU.  Have things changed?

Yeah, I talked to Anthony about that part and apparently the QEMU design does support execution from MMIO. But don't worry about it for now. I don't think we'll really have guest OSs doing this. And if they do, we can worry about it then.

> 
>>> diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
>>> index 05d1d99..d53bcf2 100644
>>> --- a/arch/powerpc/kvm/booke.h
>>> +++ b/arch/powerpc/kvm/booke.h
>>> @@ -48,7 +48,20 @@
>>> #define BOOKE_IRQPRIO_PERFORMANCE_MONITOR 19
>>> /* Internal pseudo-irqprio for level triggered externals */
>>> #define BOOKE_IRQPRIO_EXTERNAL_LEVEL 20
>>> -#define BOOKE_IRQPRIO_MAX 20
>>> +#define BOOKE_IRQPRIO_DBELL 21
>>> +#define BOOKE_IRQPRIO_DBELL_CRIT 22
>>> +#define BOOKE_IRQPRIO_MAX 23
>> 
>> So was MAX wrong before or is it too big now?
> 
> MAX is just a marker for how many IRQPRIOs we have, not any sort of
> external limit.  This patch adds new IRQPRIOs, so MAX goes up.
> 
> The actual limit is the number of bits in a long.

Yes, and before the highest value was 20 with MAX being 20, now the highest value is 22 with MAX being 23. Either MAX == highest number or MAX == highest number + 1, but you're changing the semantics of MAX here. Maybe it was wrong before, I don't know, hence I'm asking :).

> 
>>> +	.if	\flags & NEED_EMU
>>> +	lwz	r9, VCPU_KVM(r4)
>> 
>> writing r9
>> 
>>> +	.endif
>>> +
>>> +#ifdef CONFIG_KVM_EXIT_TIMING
>>> +	/* save exit time */
>>> +1:	mfspr	r7, SPRN_TBRU
>>> +	mfspr	r8, SPRN_TBRL
>>> +	mfspr	r9, SPRN_TBRU
>> 
>> overwriting r9 again?
> 
> Oops.  It's RFC for a reason. :-)
> 
>>> +#ifndef CONFIG_64BIT
>> 
>> Double negation is always hard to read. Please reverse the ifdef :)
> 
> OK.
> 
>>> +lightweight_exit:
>>> +	PPC_STL	r2, HOST_R2(r1)
>>> +
>>> +	mfspr	r3, SPRN_PID
>>> +	stw	r3, VCPU_HOST_PID(r4)
>>> +	lwz	r3, VCPU_GUEST_PID(r4)
>>> +	mtspr	SPRN_PID, r3
>>> +
>>> +	/* Save vcpu pointer for the exception handlers
>>> +	 * must be done before loading guest r2.
>>> +	 */
>>> +//	SET_VCPU(r4)
>> 
>> hm?
> 
> Can just be removed, it's handled in booke's vcpu load/put.
> 
>>> +	lwz	r6, (VCPU_SHARED_MAS2 + 4)(r11)
>>> +#else
>>> +	ld	r6, (VCPU_SHARED_MAS2)(r11)
>>> +#endif
>>> +	lwz	r7, VCPU_SHARED_MAS7_3+4(r11)
>>> +	lwz	r8, VCPU_SHARED_MAS4(r11)
>>> +	mtspr	SPRN_MAS0, r3
>>> +	mtspr	SPRN_MAS1, r5
>>> +	mtspr	SPRN_MAS2, r6
>>> +	mtspr	SPRN_MAS3, r7
>>> +	mtspr	SPRN_MAS4, r8
>>> +	lwz	r3, VCPU_SHARED_MAS6(r11)
>>> +	lwz	r5, VCPU_SHARED_MAS7_3+0(r11)
>>> +	mtspr	SPRN_MAS6, r3
>>> +	mtspr	SPRN_MAS7, r5
>>> +	/* Disable MAS register updates via exception */
>>> +	mfspr	r3, SPRN_EPCR
>>> +	oris	r3, r3, SPRN_EPCR_DMIUH at h
>>> +	mtspr	SPRN_EPCR, r3
>> 
>> Shouldn't this happen before you set the MAS registers? :)
> 
> Yes (though we really shouldn't be getting a TLB miss here, at least on
> e500mc).

Yeah, but the way it's now it gives you a false feeling of security :)

> 
>>> +	/* Load some guest volatiles. */
>>> +	PPC_LL	r3, VCPU_LR(r4)
>>> +	PPC_LL	r5, VCPU_XER(r4)
>>> +	PPC_LL	r6, VCPU_CTR(r4)
>>> +	PPC_LL	r7, VCPU_CR(r4)
>>> +	PPC_LL	r8, VCPU_PC(r4)
>>> +#ifndef CONFIG_64BIT
>>> +	lwz	r9, (VCPU_SHARED_MSR + 4)(r11)
>>> +#else
>>> +	ld	r9, (VCPU_SHARED_MSR)(r11)
>>> +#endif
>>> +	PPC_LL	r0, VCPU_GPR(r0)(r4)
>>> +	PPC_LL	r1, VCPU_GPR(r1)(r4)
>>> +	PPC_LL	r2, VCPU_GPR(r2)(r4)
>>> +	PPC_LL	r10, VCPU_GPR(r10)(r4)
>>> +	PPC_LL	r11, VCPU_GPR(r11)(r4)
>>> +	PPC_LL	r12, VCPU_GPR(r12)(r4)
>>> +	PPC_LL	r13, VCPU_GPR(r13)(r4)
>>> +	mtlr	r3
>>> +	mtxer	r5
>>> +	mtctr	r6
>>> +	mtcr	r7
>>> +	mtsrr0	r8
>>> +	mtsrr1	r9
>> 
>> Are you sure this should be shared->msr, not shadow_msr?
> 
> Yes, we don't use shadow_msr on bookehv.  I'll add a comment in the
> struct definition as discussed in the other thread, as well as other
> areas where there are subtle differences between PR-mode and GS-mode.

Thanks!

Alex