[PATCH 5/5 v4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
Alexander Graf
agraf at suse.de
Thu Jul 3 23:53:02 EST 2014
On 28.06.14 00:49, Mihai Caraman wrote:
> On book3e, KVM uses load external pid (lwepx) dedicated instruction to read
> guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI
> and LRAT), generated by loading a guest address, needs to be handled by KVM.
> These exceptions are generated in a substituted guest translation context
> (EPLC[EGS] = 1) from host context (MSR[GS] = 0).
>
> Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1),
> doing minimal checks on the fast path to avoid host performance degradation.
> lwepx exceptions originate from host state (MSR[GS] = 0) which implies
> additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking
> at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context
> Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious
> too intrusive for the host.
>
> Read guest last instruction from kvmppc_load_last_inst() by searching for the
> physical address and kmap it. This address the TODO for TLB eviction and
> execute-but-not-read entries, and allow us to get rid of lwepx until we are
> able to handle failures.
>
> A simple stress benchmark shows a 1% sys performance degradation compared with
> previous approach (lwepx without failure handling):
>
> time for i in `seq 1 10000`; do /bin/echo > /dev/null; done
>
> real 0m 8.85s
> user 0m 4.34s
> sys 0m 4.48s
>
> vs
>
> real 0m 8.84s
> user 0m 4.36s
> sys 0m 4.44s
>
> An alternative solution, to handle lwepx exceptions in KVM, is to temporary
> highjack the interrupt vector from host. Some cores share host IVOR registers
> between hardware threads, which is the case of FSL e6500, which impose additional
> synchronization logic for this solution to work. The optimization can be addressed
> later on top of this patch.
>
> Signed-off-by: Mihai Caraman <mihai.caraman at freescale.com>
> ---
> v4:
> - add switch and new function when getting last inst earlier
> - use enum instead of prev semnatic
> - get rid of mas0, optimize mas7_mas3
> - give more context in visible messages
> - check storage attributes mismatch on MMUv2
> - get rid of pfn_valid check
>
> v3:
> - reworked patch description
> - use unaltered kmap addr for kunmap
> - get last instruction before beeing preempted
>
> v2:
> - reworked patch description
> - used pr_* functions
> - addressed cosmetic feedback
>
> arch/powerpc/kvm/booke.c | 44 +++++++++++++++++
> arch/powerpc/kvm/bookehv_interrupts.S | 37 ++++----------
> arch/powerpc/kvm/e500_mmu_host.c | 91 +++++++++++++++++++++++++++++++++++
> 3 files changed, 144 insertions(+), 28 deletions(-)
>
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 34a42b9..843077b 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -869,6 +869,28 @@ static void kvmppc_restart_interrupt(struct kvm_vcpu *vcpu,
> }
> }
>
> +static int kvmppc_resume_inst_load(struct kvm_run *run, struct kvm_vcpu *vcpu,
> + enum emulation_result emulated, u32 last_inst)
> +{
> + switch (emulated) {
> + case EMULATE_AGAIN:
> + return RESUME_GUEST;
> +
> + case EMULATE_FAIL:
> + pr_debug("%s: load instruction from guest address %lx failed\n",
> + __func__, vcpu->arch.pc);
> + /* For debugging, encode the failing instruction and
> + * report it to userspace. */
> + run->hw.hardware_exit_reason = ~0ULL << 32;
> + run->hw.hardware_exit_reason |= last_inst;
> + kvmppc_core_queue_program(vcpu, ESR_PIL);
> + return RESUME_HOST;
> +
> + default:
> + BUG();
> + }
> +}
> +
> /**
> * kvmppc_handle_exit
> *
> @@ -880,6 +902,8 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
> int r = RESUME_HOST;
> int s;
> int idx;
> + u32 last_inst = KVM_INST_FETCH_FAILED;
> + enum emulation_result emulated = EMULATE_DONE;
>
> /* update before a new last_exit_type is rewritten */
> kvmppc_update_timing_stats(vcpu);
> @@ -887,6 +911,20 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
> /* restart interrupts if they were meant for the host */
> kvmppc_restart_interrupt(vcpu, exit_nr);
>
> + /*
> + * get last instruction before beeing preempted
> + * TODO: for e6500 check also BOOKE_INTERRUPT_LRAT_ERROR & ESR_DATA
> + */
> + switch (exit_nr) {
> + case BOOKE_INTERRUPT_DATA_STORAGE:
> + case BOOKE_INTERRUPT_DTLB_MISS:
> + case BOOKE_INTERRUPT_HV_PRIV:
> + emulated = kvmppc_get_last_inst(vcpu, false, &last_inst);
> + break;
> + default:
> + break;
> + }
> +
> local_irq_enable();
>
> trace_kvm_exit(exit_nr, vcpu);
> @@ -895,6 +933,11 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
> run->exit_reason = KVM_EXIT_UNKNOWN;
> run->ready_for_interrupt_injection = 1;
>
> + if (emulated != EMULATE_DONE) {
> + r = kvmppc_resume_inst_load(run, vcpu, emulated, last_inst);
> + goto out;
> + }
> +
> switch (exit_nr) {
> case BOOKE_INTERRUPT_MACHINE_CHECK:
> printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR));
> @@ -1184,6 +1227,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
> BUG();
> }
>
> +out:
> /*
> * To avoid clobbering exit_reason, only check for signals if we
> * aren't already exiting to userspace for some other reason.
> diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
> index 6ff4480..e000b39 100644
> --- a/arch/powerpc/kvm/bookehv_interrupts.S
> +++ b/arch/powerpc/kvm/bookehv_interrupts.S
> @@ -121,38 +121,14 @@
> 1:
>
> .if \flags & NEED_EMU
> - /*
> - * This assumes you have external PID support.
> - * To support a bookehv CPU without external PID, you'll
> - * need to look up the TLB entry and create a temporary mapping.
> - *
> - * FIXME: we don't currently handle if the lwepx faults. PR-mode
> - * booke doesn't handle it either. Since Linux doesn't use
> - * broadcast tlbivax anymore, the only way this should happen is
> - * if the guest maps its memory execute-but-not-read, or if we
> - * somehow take a TLB miss in the middle of this entry code and
> - * evict the relevant entry. On e500mc, all kernel lowmem is
> - * bolted into TLB1 large page mappings, and we don't use
> - * broadcast invalidates, so we should not take a TLB miss here.
> - *
> - * Later we'll need to deal with faults here. Disallowing guest
> - * mappings that are execute-but-not-read could be an option on
> - * e500mc, but not on chips with an LRAT if it is used.
> - */
> -
> - mfspr r3, SPRN_EPLC /* will already have correct ELPID and EGS */
> PPC_STL r15, VCPU_GPR(R15)(r4)
> PPC_STL r16, VCPU_GPR(R16)(r4)
> PPC_STL r17, VCPU_GPR(R17)(r4)
> PPC_STL r18, VCPU_GPR(R18)(r4)
> PPC_STL r19, VCPU_GPR(R19)(r4)
> - mr r8, r3
> PPC_STL r20, VCPU_GPR(R20)(r4)
> - rlwimi r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS
> PPC_STL r21, VCPU_GPR(R21)(r4)
> - rlwimi r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR
> PPC_STL r22, VCPU_GPR(R22)(r4)
> - rlwimi r8, r10, EPC_EPID_SHIFT, EPC_EPID
> PPC_STL r23, VCPU_GPR(R23)(r4)
> PPC_STL r24, VCPU_GPR(R24)(r4)
> PPC_STL r25, VCPU_GPR(R25)(r4)
> @@ -162,10 +138,15 @@
> PPC_STL r29, VCPU_GPR(R29)(r4)
> PPC_STL r30, VCPU_GPR(R30)(r4)
> PPC_STL r31, VCPU_GPR(R31)(r4)
> - mtspr SPRN_EPLC, r8
> - isync
> - lwepx r9, 0, r5
> - mtspr SPRN_EPLC, r3
> +
> + /*
> + * We don't use external PID support. lwepx faults would need to be
> + * handled by KVM and this implies aditional code in DO_KVM (for
> + * DTB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS] which
> + * is too intrusive for the host. Get last instuction in
> + * kvmppc_get_last_inst().
> + */
> + li r9, KVM_INST_FETCH_FAILED
> stw r9, VCPU_LAST_INST(r4)
> .endif
>
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index 4b4e8d6..57463e5 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -606,11 +606,102 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
> }
> }
>
> +#ifdef CONFIG_KVM_BOOKE_HV
> int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, enum instruction_type type,
> u32 *instr)
> {
> + gva_t geaddr;
> + hpa_t addr;
> + hfn_t pfn;
> + hva_t eaddr;
> + u32 mas1, mas2, mas3;
> + u64 mas7_mas3;
> + struct page *page;
> + unsigned int addr_space, psize_shift;
> + bool pr;
> + unsigned long flags;
> +
> + /* Search TLB for guest pc to get the real address */
> + geaddr = kvmppc_get_pc(vcpu);
> +
> + addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG;
> +
> + local_irq_save(flags);
> + mtspr(SPRN_MAS6, (vcpu->arch.pid << MAS6_SPID_SHIFT) | addr_space);
> + mtspr(SPRN_MAS5, MAS5_SGS | vcpu->kvm->arch.lpid);
> + asm volatile("tlbsx 0, %[geaddr]\n" : :
> + [geaddr] "r" (geaddr));
> + mtspr(SPRN_MAS5, 0);
> + mtspr(SPRN_MAS8, 0);
> + mas1 = mfspr(SPRN_MAS1);
> + mas2 = mfspr(SPRN_MAS2);
> + mas3 = mfspr(SPRN_MAS3);
> +#ifdef CONFIG_64BIT
> + mas7_mas3 = mfspr(SPRN_MAS7_MAS3);
> +#else
> + mas7_mas3 = ((u64)mfspr(SPRN_MAS7) << 32) | mas3;
> +#endif
> + local_irq_restore(flags);
> +
> + /*
> + * If the TLB entry for guest pc was evicted, return to the guest.
> + * There are high chances to find a valid TLB entry next time.
> + */
> + if (!(mas1 & MAS1_VALID))
> + return EMULATE_AGAIN;
> +
> + /*
> + * Another thread may rewrite the TLB entry in parallel, don't
> + * execute from the address if the execute permission is not set
> + */
> + pr = vcpu->arch.shared->msr & MSR_PR;
> + if (unlikely((pr && !(mas3 & MAS3_UX)) ||
> + (!pr && !(mas3 & MAS3_SX)))) {
> + pr_debug("%s: Instuction emulation from guest addres %08lx without execute permission\n",
> + __func__, geaddr);
> + return EMULATE_FAIL;
In this case how did we ever get here? Why can't we just evict the entry
and return EMULATE_AGAIN?
> + }
> +
> + /*
> + * The real address will be mapped by a cacheable, memory coherent,
> + * write-back page. Check for mismatches when LRAT is used.
> + */
> + if (has_feature(vcpu, VCPU_FTR_MMU_V2) &&
> + unlikely((mas2 & MAS2_I) || (mas2 & MAS2_W) || !(mas2 & MAS2_M))) {
> + pr_debug("%s: Instuction emulation from guest addres %08lx mismatches storage attributes\n",
> + __func__, geaddr);
> + return EMULATE_FAIL;
Hrm - do we really want to deal with injecting faults here? I'd say it's
ok to just end up in an endless EMULATE_AGAIN loop.
> + }
> +
> + /* Get page size */
> + psize_shift = MAS1_GET_TSIZE(mas1) + 10;
> +
> + /* Map a page and get guest's instruction */
> + addr = (mas7_mas3 & (~0ULL << psize_shift)) |
> + (geaddr & ((1ULL << psize_shift) - 1ULL));
> + pfn = addr >> PAGE_SHIFT;
> +
> + /* Guard us against emulation from devices area */
> + if (unlikely(!page_is_ram(pfn))) {
> + pr_debug("%s: Instruction emulation from non-RAM host addres %08llx is not supported\n",
> + __func__, addr);
> + return EMULATE_FAIL;
Same here :).
> + }
> +
> + page = pfn_to_page(pfn);
> + eaddr = (unsigned long)kmap_atomic(page);
> + *instr = *(u32 *)(eaddr | (addr & ~PAGE_MASK));
> + kunmap_atomic((u32 *)eaddr);
Doesn't kmap_atomic() have to be guarded somehow?
Alex
More information about the Linuxppc-dev
mailing list