[PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode

Alexander Graf agraf at suse.de
Tue May 17 20:17:50 EST 2011


On 16.05.2011, at 07:58, Paul Mackerras wrote:

> On Sun, May 15, 2011 at 11:58:12PM +0200, Alexander Graf wrote:
>> 
>> On 11.05.2011, at 12:44, Paul Mackerras wrote:
> 
>>> +#ifdef CONFIG_KVM_BOOK3S_NONHV
>> 
>> I really liked how you called the .c file _pr - why call it NONHV now?
> 
> I agree, CONFIG_KVM_BOOK3S_PR would be better, I'll change it.
> 
>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>> index 7412676..8dba5f6 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -149,6 +149,16 @@ struct paca_struct {
>>> #ifdef CONFIG_KVM_BOOK3S_HANDLER
>>> 	/* We use this to store guest state in */
>>> 	struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>> +	struct kvm_vcpu *kvm_vcpu;
>>> +	u64 dabr;
>>> +	u64 host_mmcr[3];
>>> +	u32 host_pmc[6];
>>> +	u64 host_purr;
>>> +	u64 host_spurr;
>>> +	u64 host_dscr;
>>> +	u64 dec_expires;
>> 
>> Hrm. I'd say either push those into shadow_vcpu for HV mode or get
>> rid of the shadow_vcpu reference. I'd probably prefer the former.
> 
> These are fields that are pieces of host state that we need to save
> and restore across execution of a guest; they don't apply to any
> specific guest or vcpu.  That's why I didn't put them in shadow_vcpu,
> which is specifically for one vcpu in one guest.  
> 
> Given that book3s_pr copies the shadow_vcpu into and out of the paca,
> I thought it best not to add fields to it that are only live while we
> are in the guest.  True, these fields only exist for book3s_hv, but if
> we later on make it possible to select book3s_pr vs. book3s_hv at
> runtime, we won't want to be copying these fields into and out of the
> paca when book3s_pr is active.
> 
> Maybe we need another struct, kvm_host_state or something like that,
> to save this sort of state.

Yeah, just put them into a different struct then. I don't want to clutter the PACA struct with kvm fields all over :).
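Something like this maybe - name and placement just a sketch, mirroring the fields you quoted above:

struct kvmppc_host_state {
	/* Host state we stash across a guest run; per-CPU, and never
	 * copied into or out of a vcpu the way shadow_vcpu is. */
	struct kvm_vcpu *kvm_vcpu;
	u64 dabr;
	u64 host_mmcr[3];
	u32 host_pmc[6];
	u64 host_purr;
	u64 host_spurr;
	u64 host_dscr;
	u64 dec_expires;
};

Then the paca only grows a single "struct kvmppc_host_state kvm_hstate;" under the HV config option.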

> 
>>> @@ -65,6 +98,7 @@ config KVM_440
>>> 	bool "KVM support for PowerPC 440 processors"
>>> 	depends on EXPERIMENTAL && 44x
>>> 	select KVM
>>> +	select KVM_MMIO
>> 
>> e500 should also select MMIO, no?
> 
> Good point, I'll fix that.
> 
>>> +long kvmppc_alloc_hpt(struct kvm *kvm)
>>> +{
>>> +	unsigned long hpt;
>>> +	unsigned long lpid;
>>> +
>>> +	hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|__GFP_NOWARN,
>>> +			       HPT_ORDER - PAGE_SHIFT);
>> 
>> This would end up failing quite often, no?
> 
> In practice it seems to be OK, possibly because the machines we're
> testing this on have plenty of memory.  Maybe we should get qemu to
> allocate the HPT using hugetlbfs so the memory will come from the
> reserved page pool.  It does need to be physically contiguous and
> aligned on a multiple of its size -- that's a hardware requirement.

Yes, I'd certainly prefer to see qemu allocate it. That'd also make things easier for migration later, since we'd still have access to the hpt. But then again - we can't really reuse the mappings there anyway, as they'll all be host mappings. Phew. Have you given that some thought yet? We can probably just ignore non-bolted entries - but the bolted ones need to be transferred.

Also, if we have qemu allocate the hpt memory, qemu can modify the mappings and thus break out. Bleks. It should definitely not be able to write to the hpt once kvm is actually using it.
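On the allocation side, something along these lines in qemu would do - purely a sketch, and assuming a single huge page is big enough to cover the HPT, so the physical contiguity and natural alignment fall out of the huge page itself:

#include <sys/mman.h>

/* Sketch: back the guest HPT with a huge page from the reserved pool;
 * a huge page is physically contiguous and size-aligned, which is
 * what the hardware requires for the HPT. */
static void *alloc_hpt(size_t hpt_size)
{
	void *hpt = mmap(NULL, hpt_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	return hpt == MAP_FAILED ? NULL : hpt;
}

The write protection is the harder part though - qemu owns that mapping and can always make it writable again, so any real protection would have to come from the kernel side.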

> 
>>> +	kvm->arch.sdr1 = __pa(hpt) | (HPT_ORDER - 18);
>>> +	kvm->arch.lpid = lpid;
>>> +	kvm->arch.host_sdr1 = mfspr(SPRN_SDR1);
>>> +	kvm->arch.host_lpid = mfspr(SPRN_LPID);
>>> +	kvm->arch.host_lpcr = mfspr(SPRN_LPCR);
>> 
>> How do these correlate with the guest's hv mmu? I'd like to keep the
>> code clean enough so we can potentially use it for PR mode as well :). 
> 
> The host SDR1 and LPID are different from the guest's.  That is, the
> guest has its own HPT which is quite separate from the host's.  The
> host values could be saved in global variables, though; there's no
> real need for each VM to have its own copy, except that doing it this
> way simplifies the low-level assembly code a little.

Well, my point is that I tried to keep the MMU implementation separate in the PR KVM code. What I'd like to see at the end of the day is a "guest" hv MMU implementation file that I could plug into PR kvm and have the MMU working with the exact same code as the HV case - only the backend would be different. Maybe there's some valid technical reason not to do it, but I haven't come across any yet :).
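Roughly what I have in mind - purely illustrative, none of these names exist:

/* Illustrative only: the "guest hv" MMU code (HPT in guest memory, SLB
 * handling, hcall-based HPTE updates) talks to a small backend
 * interface, so the same front end can sit on top of either a real
 * partition switch (HV) or a shadow HPT (PR). */
struct kvmppc_hv_mmu_backend {
	int  (*map_hpte)(struct kvm_vcpu *vcpu, long pte_index,
			 unsigned long hpte_v, unsigned long hpte_r);
	void (*invalidate_hpte)(struct kvm_vcpu *vcpu, long pte_index);
};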

> 
>>> +	/* First see what page size we have */
>>> +	psize = user_page_size(mem->userspace_addr);
>>> +	/* For now, only allow 16MB pages */
>> 
>> The reason to go for 16MB pages is because of the host mmu code, not
>> the guest hv mmu. So please at least #ifdef the code to HV so we
>> document that correlation. 
> 
> I'm not sure what you mean by that.  The reason for going for 16MB
> pages initially is for performance (this way the guest can use 16MB
> pages for its linear mapping) and to avoid having to deal with the
> pages being paged or swapped on the host side.  Could you explain the
> "because of the host mmu code" part of your comment further?

If you consider a split as I described above, an HV guest running on PR KVM doesn't have to be mapped linearly. But then again - it could. In fact, it might even make sense. Hrm. Very good point! We could also just flip SDR1 in PR mode and reuse the exact same code as the HV mode.

However, my point here was that when we don't flip SDR1 but go through a shadow hpt, we don't have to map it to 16MB pages.

> 
> What would adding #ifdef CONFIG_KVM_BOOK3S_64_HV achieve in a file
> whose name ends in _hv.c and which only gets compiled when
> CONFIG_KVM_BOOK3S_64_HV=y?
> 
>>> +int kvmppc_mmu_hv_init(void)
>>> +{
>>> +	if (!cpu_has_feature(CPU_FTR_HVMODE_206))
>>> +		return 0;
>> 
>> Return 0 for "it doesn't work" might not be the right exit code ;).
> 
> Good point, I'll make it -ENXIO or something.

Anything < 0 sounds good to me :). -EINVAL is probably the most sensible one here.
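I.e. just something like (sketch):

int kvmppc_mmu_hv_init(void)
{
	/* HV KVM needs a hypervisor-capable 2.06 CPU; refuse to init otherwise */
	if (!cpu_has_feature(CPU_FTR_HVMODE_206))
		return -EINVAL;

	/* ... rest of the init ... */
	return 0;
}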

> 
>>> +static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>>> +				struct kvmppc_pte *gpte, bool data)
>>> +{
>>> +	return -ENOENT;
>> 
>> Can't you just call the normal book3s_64 mmu code here? Without
>> xlate, doing ppc_ld doesn't work, which means that reading out the
>> faulting guest instruction breaks. We'd also need it for PR mode :). 
> 
> With book3s_hv we currently have no situations where we need to read
> out the faulting guest instruction.  
> 
> We could use the normal code here if we had the guest HPT mapped into
> the qemu address space, which it currently isn't -- but should be.  It
> hasn't been a priority to fix because with book3s_hv we currently have
> no situations where we need to read out the faulting guest
> instruction.

Makes sense. We might however need it in case we ever use this code in PR mode :). I don't fully remember whether you moved that code around, but when fetching the last instruction fails (which IIRC it can for you too), a manual load through this function gets issued.
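From memory, the PR-side pattern is roughly the following - helper name made up, signatures approximate:

/* Illustrative sketch: if fetching the faulting instruction directly
 * fails, translate the PC through the MMU's xlate op and read the
 * instruction word from guest memory by hand. */
static int try_load_last_inst(struct kvm_vcpu *vcpu, u32 *last_inst)
{
	struct kvmppc_pte pte;
	int r;

	r = vcpu->arch.mmu.xlate(vcpu, kvmppc_get_pc(vcpu), &pte, false);
	if (r)
		return r;
	return kvm_read_guest(vcpu->kvm, pte.raddr, last_inst,
			      sizeof(*last_inst));
}

So a working xlate is what would make that fallback possible on HV too.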

> 
>>> +void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
>>> +{
>>> +	vcpu->arch.pvr = pvr;
>>> +	kvmppc_mmu_book3s_hv_init(vcpu);
>> 
>> No need for you to do it depending on pvr. You can just do the mmu
>> initialization on normal init :).
> 
> OK, I'll do that.
> 
>>> +	case BOOK3S_INTERRUPT_PROGRAM:
>>> +	{
>>> +		ulong flags;
>>> +		/*
>>> +		 * Normally program interrupts are delivered directly
>>> +		 * to the guest by the hardware, but we can get here
>>> +		 * as a result of a hypervisor emulation interrupt
>>> +		 * (e40) getting turned into a 700 by BML RTAS.
>> 
>> Not sure I fully understand what's going on there. Mind to explain? :)
> 
> Recent versions of the architecture don't actually deliver a 0x700
> interrupt to the OS on an illegal instruction; instead they generate
> an 0xe40 interrupt to the hypervisor in case the hypervisor wants to
> emulate the instruction.  If the hypervisor doesn't want to do that
> it's supposed to synthesize a 0x700 interrupt to the OS.
> 
> When we're running this stuff under our BML (Bare Metal Linux)
> framework in the lab, we use a small RTAS implementation that gets
> loaded by the BML loader, and one of the things that this RTAS does is
> to patch the 0xe40 vector to make the 0xe40 interrupt come in via the
> 0x700 vector, in case the kernel you're running under BML hasn't been
> updated to have an 0xe40 handler.
> 
> I could just remove that case, in fact.

Yeah, the main issue I see here is that there's no generally available hardware that can run this code yet. Whatever the first hardware that actually can run it requires is what should be handled here.

> 
>>> +	case BOOK3S_INTERRUPT_SYSCALL:
>>> +	{
>>> +		/* hcall - punt to userspace */
>>> +		int i;
>>> +
>> 
>> Do we really want to accept sc from pr=1? I'd just reject them straight away.
> 
> Good idea, I'll do that.
> 
>>> +	case BOOK3S_INTERRUPT_H_DATA_STORAGE:
>>> +		vcpu->arch.dsisr = vcpu->arch.fault_dsisr;
>>> +		vcpu->arch.dear = vcpu->arch.fault_dar;
>>> +		kvmppc_inject_interrupt(vcpu, BOOK3S_INTERRUPT_DATA_STORAGE, 0);
>>> +		r = RESUME_GUEST;
>>> +		break;
>>> +	case BOOK3S_INTERRUPT_H_INST_STORAGE:
>>> +		kvmppc_inject_interrupt(vcpu, BOOK3S_INTERRUPT_INST_STORAGE,
>>> +					0x08000000);
>> 
>> What do these do? I thought the guest gets DSI and ISI directly?
> 
> It does for accesses with relocation (IR/DR) on, but because we have
> enabled VRMA mode (Virtualized Real Mode Area) in the LPCR, we get
> these interrupts to the hypervisor if the guest does a bad real-mode
> access.  If we also turned on Virtualized Partition Memory (VPM) mode
> in the LPCR, then all ISI/DSI in the guest come through to the
> hypervisor using these vectors in case the hypervisor wants to do any
> paging/swapping of guest memory.  I plan to do that later when we
> support using 4k/64k pages for guest memory.

I see - please put that explanation in a comment there ;).
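Something like this right above the two cases would do - wording lifted from your explanation:

	/*
	 * With only VRMA enabled in the LPCR, the guest takes ISI/DSI
	 * directly; we only get HDSI/HISI here for bad real-mode
	 * (IR/DR off) accesses, which we reflect back into the guest
	 * as ordinary storage interrupts.  Once VPM is enabled so the
	 * host can page/swap guest memory, all guest ISI/DSI will
	 * arrive through these vectors.
	 */
	case BOOK3S_INTERRUPT_H_DATA_STORAGE:
		vcpu->arch.dsisr = vcpu->arch.fault_dsisr;
		vcpu->arch.dear = vcpu->arch.fault_dar;
		kvmppc_inject_interrupt(vcpu, BOOK3S_INTERRUPT_DATA_STORAGE, 0);
		r = RESUME_GUEST;
		break;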

> 
>>> +	/* default to book3s_64 (power7) */
>>> +	vcpu->arch.pvr = 0x3f0200;
>> 
>> Wouldn't it make sense to simply default to the host pvr? Not sure -
>> maybe it's not :).
> 
> Sounds sensible, actually.  In fact the hypervisor can't spoof the PVR
> for the guest, that is, the guest can read the real PVR value and
> there's no way the hypervisor can stop it.

If it can't spoof it at all, then yes, use the host pvr.

In fact, thinking about this, how does userspace know which mode the kernel uses? It should be able to fail gracefully if we're trying to run a -M mac99 machine on this code ;).
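I.e. on vcpu init just do something like (sketch):

	/* The guest can read the real PVR regardless, so default to the host's */
	vcpu->arch.pvr = mfspr(SPRN_PVR);

And for the mode question, a capability userspace can query (exact name to be decided) would let qemu bail out cleanly when someone tries -M mac99 against an HV-only kernel.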

> 
>>> +	flush_fp_to_thread(current);
>> 
>> Do those with fine with preemption enabled?
> 
> Yes.  If a preemption happens, it will flush the FP/VMX/VSX registers
> out to the thread_struct, and then any of these explicit calls that
> happen after the preemption will do nothing.

Ah, the flushes themselves disable preemption during the flush :). Then it makes sense.
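For reference, flush_fp_to_thread is roughly this shape (paraphrased from memory, not the exact source):

void flush_fp_to_thread(struct task_struct *tsk)
{
	if (tsk->thread.regs) {
		preempt_disable();
		if (tsk->thread.regs->msr & MSR_FP)
			/* write the live FP state back into tsk->thread */
			giveup_fpu(tsk);
		preempt_enable();
	}
}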

> 
>>> +	/*
>>> +	 * Put whatever is in the decrementer into the
>>> +	 * hypervisor decrementer.
>>> +	 */
>>> +	mfspr	r8,SPRN_DEC
>>> +	mftb	r7
>>> +	mtspr	SPRN_HDEC,r8
>> 
>> Can't we just always use HDEC on the host? That's save us from all
>> the swapping here.
> 
> The problem is that there is only one HDEC per core, so using it would
> become complicated when the host is running in SMT4 or SMT2 mode.

I see.

> 
>>> +	extsw	r8,r8
>>> +	add	r8,r8,r7
>>> +	std	r8,PACA_KVM_DECEXP(r13)
>> 
>> Why is dec+hdec on vcpu_run decexp? What exactly does this store?
> 
> R7 here is the current (or very recent, anyway) timebase value, so
> this stores the timebase value at which the host decrementer would get
> to zero.
> 
>>> +	lwz	r3, PACA_HOST_PMC(r13)
>>> +	lwz	r4, PACA_HOST_PMC + 4(r13)
>>> +	lwz	r5, PACA_HOST_PMC + 4(r13)
>> 
>> copy&paste error?
> 
> Yes, thanks.
> 
>>> +.global kvmppc_handler_lowmem_trampoline
>>> +kvmppc_handler_lowmem_trampoline:
>>> +	cmpwi	r12,0x500
>>> +	beq	1f
>>> +	cmpwi	r12,0x980
>>> +	beq	1f
>> 
>> What?
>> 
>> 1) use the macros please
> 
> OK
> 
>> 2) why?
> 
> The external interrupt and hypervisor decrementer interrupt handlers
> expect the interrupted PC and MSR to be in HSRR0/1 rather than
> SRR0/1.  I could remove the case for 0x980 since we don't call the
> linux handler for HDEC interrupts any more.

Ah, makes sense. This also deserves a comment :)

> 
>>> +	/* Check if HDEC expires soon */
>>> +	mfspr	r3,SPRN_HDEC
>>> +	cmpwi	r3,10
>>> +	li	r12,0x980
>> 
>> define
> 
> OK.
> 
>>> +	mr	r9,r4
>>> +	blt	hdec_soon
>> 
>> Is it faster to do the check than to save the check and take the
>> odds? Also, maybe we should rather do the check in preemptible
>> code that could just directly pass the time slice on.
> 
> I do the check there because I was having problems where, if the HDEC
> goes negative before we do the partition switch, we would occasionally
> not get the HDEC interrupt at all until the next time HDEC went
> negative, ~ 8.4 seconds later.

Yikes - so HDEC is edge and doesn't even keep the interrupt line up? That sounds like a serious hardware limitation. What if you only use HDEC and it triggers while interrupts are off in a critical section? Is the hardware really that broken?

> 
>>> +	/* See if this is a leftover HDEC interrupt */
>>> +	cmpwi	r12,0x980
>> 
>> define
> 
> OK.
> 
>>> +	bne	2f
>>> +	mfspr	r3,SPRN_HDEC
>>> +	cmpwi	r3,0
>>> +	bge	ignore_hdec
>>> +2:
>>> +
>>> +	/* Check for mediated interrupts (could be done earlier really ...) */
>>> +	cmpwi	r12,0x500
>> 
>> define
> 
> OK.
> 
>>> +	bne+	1f
>>> +	ld	r5,VCPU_LPCR(r9)
>>> +	andi.	r0,r11,MSR_EE
>>> +	beq	1f
>>> +	andi.	r0,r5,LPCR_MER
>>> +	bne	bounce_ext_interrupt
>> 
>> So there's no need for the external check that directly goes into
>> the Linux handler code on full-fledged exits?
> 
> No, we still need that; ordinary external interrupts come to the
> hypervisor regardless of the guest's MSR.EE setting.
> 
> The MER bit (Mediated External Request) is a way for the hypervisor to
> know when the guest sets its MSR.EE bit.  If an event happens that
> means the host wants to give a guest vcpu an external interrupt, but
> the guest vcpu has MSR.EE = 0, then the host can't deliver the
> simulated external interrupt.  Instead it sets LPCR.MER, which has the
> effect that the hardware will deliver an external interrupt (to the
> hypervisor) when running in the guest and it has MSR.EE = 1.
> 
> So, when we get an external interrupt, LPCR.MER = 1 and MSR.EE = 1,
> we need to synthesize an external interrupt in the guest.  Doing it
> here means that we don't need to do the full partition switch out to
> the host and back again.

Ah, special functionality then. Please comment this in the code, so the unknowledgeable reader (me) knows what this is about.

> 
>>> +	/* Read the guest SLB and save it away */
>>> +	li	r6,0
>>> +	addi	r7,r9,VCPU_SLB
>>> +	li	r5,0
>>> +1:	slbmfee	r8,r6
>>> +	andis.	r0,r8,SLB_ESID_V@h
>>> +	beq	2f
>>> +	add	r8,r8,r6		/* put index in */
>>> +	slbmfev	r3,r6
>>> +	std	r8,VCPU_SLB_E(r7)
>>> +	std	r3,VCPU_SLB_V(r7)
>>> +	addi	r7,r7,VCPU_SLB_SIZE
>>> +	addi	r5,r5,1
>>> +2:	addi	r6,r6,1
>>> +	cmpwi	r6,32
>> 
>> I don't like how the 32 is hardcoded here. Better create a define
>> for it and use the same in the init code.
> 
> Sure.  In fact I probably should use vcpu->arch.slb_nr here.

and bctr? yup :)

> 
>>> +hdec_soon:
>>> +	/* Switch back to host partition */
>>> +	ld	r4,VCPU_KVM(r9)		/* pointer to struct kvm */
>>> +	ld	r6,KVM_HOST_SDR1(r4)
>>> +	lwz	r7,KVM_HOST_LPID(r4)
>>> +	li	r8,0x3ff		/* switch to reserved LPID */
>> 
>> is it reserved by ISA? Either way, hard-coding the constant without
>> a name is not nice :).
> 
> Actually, it just has to be an LPID value that isn't ever used for
> running a real guest (or the host).  I'll make a name for it.
> 
>>> +	lis	r8,0x7fff
>> 
>> INT_MAX@h might be more readable.
> 
> OK.
> 
>>> int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
>>> {
>>> -	return !(v->arch.shared->msr & MSR_WE) ||
>>> +#ifndef CONFIG_KVM_BOOK3S_64_HV
>>> +	return !(kvmppc_get_msr(v) & MSR_WE) ||
>>> 	       !!(v->arch.pending_exceptions);
>>> +#else
>>> +	return 1;
>> 
>> So what happens if the guest sets MSR_WE? It just stays in guest
>> context idling? That'd be pretty bad for scheduling on the host.
> 
> The MSR_WE bit doesn't exist on POWER7 (or any of POWER[4567], in
> fact).  If the guest wants to idle it calls the H_CEDE hcall.

Ah, interesting. Didn't know :).

> 
>>> @@ -184,12 +188,14 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> 	case KVM_CAP_PPC_IRQ_LEVEL:
>>> 	case KVM_CAP_ENABLE_CAP:
>>> 	case KVM_CAP_PPC_OSI:
>>> +#ifndef CONFIG_KVM_BOOK3S_64_HV
>> 
>> You also don't do OSI on HV :).
> 
> Good point, I'll fix that.
> 
>>> @@ -496,8 +510,11 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq)
>>> 	if (waitqueue_active(&vcpu->wq)) {
>>> 		wake_up_interruptible(&vcpu->wq);
>>> 		vcpu->stat.halt_wakeup++;
>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>> +	} else if (vcpu->cpu != -1) {
>>> +		smp_send_reschedule(vcpu->cpu);
>> 
>> Shouldn't this be done for non-HV too? The only reason we don't do
>> it yet is because we don't do SMP, no?
> 
> I didn't know why you didn't do it for non-HV, so I didn't change it
> for that case.  If you say it's OK, I'll change it (we'll need to set
> vcpu->cpu in the vcpu_load/unload code for book3s_pr too, then).

Sure, go ahead and set it.
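For book3s_pr it would then be something like this, folded into the existing hooks (sketch):

void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	/* ... existing load code ... */
	/* remember where we run so kvm_vcpu_ioctl_interrupt() can kick us */
	vcpu->cpu = cpu;
}

void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
{
	/* ... existing put code ... */
	vcpu->cpu = -1;
}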


Alex


