[PATCH 27/27] KVM: PPC: Add Documentation about PV interface

Fri Jul 9 19:11:01 EST 2010

On Thu, Jul 1, 2010 at 4:13 PM, Alexander Graf <agraf at suse.de> wrote:
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>
> Signed-off-by: Alexander Graf <agraf at suse.de>
>
> ---
>
> v1 -> v2:
>
>  - clarify guest implementation
>  - clarify that privileged instructions still work
>  - explain safe MSR bits
>  - Fix dsisr patch description
>  - change hypervisor calls to use new register values
> ---
>  Documentation/kvm/ppc-pv.txt |  185 ++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 185 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/ppc-pv.txt
>
> diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
> new file mode 100644
> index 0000000..82de6c6
> --- /dev/null
> +++ b/Documentation/kvm/ppc-pv.txt
> @@ -0,0 +1,185 @@
> +The PPC KVM paravirtual interface
> +=================================
> +
> +The basic execution principle by which KVM on PowerPC works is to run all kernel
> +space code in PR=1 which is user space. This way we trap all privileged
> +instructions and can emulate them accordingly.
> +
> +Unfortunately, that is also its downfall. Quite a few privileged
> +instructions needlessly return us to the hypervisor even though they could
> +be handled differently.
> +
> +This is what the PPC PV interface helps with. It takes privileged instructions
> +and transforms them into unprivileged ones with some help from the hypervisor.
> +This cuts down virtualization costs by about 50% on some of my benchmarks.
> +
> +The code for that interface can be found in arch/powerpc/kernel/kvm*
> +
> +Querying for existence
> +======================
> +
> +To find out if we're running on KVM or not, we overlay the PVR register. Usually
> +the PVR register contains an id that identifies your CPU type. If, however, you
> +pass KVM_PVR_PARA in the register that you want the PVR result in, the register
> +still contains KVM_PVR_PARA after the mfpvr call.
> +
> +       LOAD_REG_IMM(r5, KVM_PVR_PARA)
> +       mfpvr   r5
> +       [r5 still contains KVM_PVR_PARA]
> +
> +Once you have determined that you are running under a PV-capable KVM, you can
> +use hypercalls as described below.
> +
> +PPC hypercalls
> +==============
> +
> +The only viable ways to reliably get from guest context to host context are:
> +
> +       1) Call an invalid instruction
> +       2) Call the "sc" instruction with a parameter to "sc"
> +       3) Call the "sc" instruction with parameters in GPRs
> +
> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
> +by valid instructions, rendering the interface broken.
> +
> +Method 2 also has drawbacks. If the parameter to "sc" is != 0, the spec is
> +rather unclear on whether the sc is targeted directly at the hypervisor or
> +the supervisor. It would also require that we read the syscall-issuing
> +instruction every time a syscall is issued, slowing down guest syscalls.
> +
> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and
> +KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these
> +magic values arrives from the guest's kernel mode, we take the syscall as a
> +hypercall.
> +
> +The parameters are as follows:
> +
> +       r0              KVM_SC_MAGIC_R0
> +       r3              KVM_SC_MAGIC_R3         Return code
> +       r4              Hypercall number
> +       r5              First parameter
> +       r6              Second parameter
> +       r7              Third parameter
> +       r8              Fourth parameter
> +
> +Hypercall definitions are shared in generic code, so the same hypercall numbers
> +apply for x86 and powerpc alike.
> +
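
To check my reading of method 3, here is a minimal C model of the register convention and of the test the host would apply to an incoming "sc". It is only a sketch: the EXAMPLE_* magic values are placeholders (the real KVM_SC_MAGIC_R0 and KVM_SC_MAGIC_R3 live in the kernel headers), and struct guest_regs, prepare_hypercall() and sc_is_hypercall() are names I made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values, NOT the real KVM_SC_MAGIC_R0 / KVM_SC_MAGIC_R3. */
#define EXAMPLE_SC_MAGIC_R0 0x11111111u
#define EXAMPLE_SC_MAGIC_R3 0x22222222u

/* Hypothetical snapshot of the guest state at the time of the "sc" trap. */
struct guest_regs {
    uint64_t gpr[32];
    bool     user_mode;  /* true if the guest was running its own user space */
};

/* Guest side: load the registers as listed in the parameter table. */
void prepare_hypercall(struct guest_regs *regs, uint64_t nr,
                       uint64_t p1, uint64_t p2, uint64_t p3, uint64_t p4)
{
    assert(regs != NULL);
    regs->gpr[0] = EXAMPLE_SC_MAGIC_R0;
    regs->gpr[3] = EXAMPLE_SC_MAGIC_R3;  /* overwritten with the return code */
    regs->gpr[4] = nr;                   /* hypercall number */
    regs->gpr[5] = p1;
    regs->gpr[6] = p2;
    regs->gpr[7] = p3;
    regs->gpr[8] = p4;
}

/* Host side: an "sc" counts as a hypercall only if it comes from the
 * guest's kernel mode and both magic values match. */
bool sc_is_hypercall(const struct guest_regs *regs)
{
    assert(regs != NULL);
    return !regs->user_mode &&
           regs->gpr[0] == EXAMPLE_SC_MAGIC_R0 &&
           regs->gpr[3] == EXAMPLE_SC_MAGIC_R3;
}
```

Anything that fails the check would presumably be delivered to the guest as an ordinary syscall.
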
> +The magic page
> +==============
> +
> +To enable communication between the hypervisor and guest there is a new shared
> +page that contains parts of supervisor visible register state. The guest can
> +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
> +
> +Once this hypercall has been issued, the guest always gets the magic page
> +mapped at the desired location in effective and physical address space. For
> +now, we always map the page to -4096. This way we can access it using
> +absolute load and store instructions. The following instruction reads the
> +first field of the magic page:
> +
> +       ld      rX, -4096(0)
> +
> +The interface is designed to be extensible should there be need later to add
> +additional registers to the magic page. If you add fields to the magic page,
> +also define a new hypercall feature to indicate that the host can give you more
> +registers. Make use of the additional features only if the host supports them.
> +
> +The magic page has the following layout as described in
> +arch/powerpc/include/asm/kvm_para.h:
> +
> +struct kvm_vcpu_arch_shared {
> +       __u64 scratch1;
> +       __u64 scratch2;
> +       __u64 scratch3;
> +       __u64 critical;         /* Guest may not get interrupts if == r1 */
> +       __u64 sprg0;
> +       __u64 sprg1;
> +       __u64 sprg2;
> +       __u64 sprg3;
> +       __u64 srr0;
> +       __u64 srr1;
> +       __u64 dar;
> +       __u64 msr;
> +       __u32 dsisr;
> +       __u32 int_pending;      /* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must only occur at the end. Struct fields are always 32
> +bit aligned.
> +
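
As a sanity check on the layout, the offsets can be pinned down at compile time. The sketch below copies the struct from the patch with __u64/__u32 swapped for their <stdint.h> equivalents so that it builds outside the kernel. magic_field_ea() is a hypothetical helper of mine that computes the effective address of a field, given the page is mapped at -4096.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order copied from the patch; the types are the userspace equivalents. */
struct kvm_vcpu_arch_shared {
    uint64_t scratch1;
    uint64_t scratch2;
    uint64_t scratch3;
    uint64_t critical;
    uint64_t sprg0;
    uint64_t sprg1;
    uint64_t sprg2;
    uint64_t sprg3;
    uint64_t srr0;
    uint64_t srr1;
    uint64_t dar;
    uint64_t msr;
    uint32_t dsisr;
    uint32_t int_pending;
};

/* New fields may only be appended, so these offsets must never change. */
static_assert(offsetof(struct kvm_vcpu_arch_shared, critical) == 24, "critical moved");
static_assert(offsetof(struct kvm_vcpu_arch_shared, msr) == 88, "msr moved");
static_assert(offsetof(struct kvm_vcpu_arch_shared, dsisr) == 96, "dsisr moved");
static_assert(offsetof(struct kvm_vcpu_arch_shared, int_pending) == 100, "int_pending moved");

/* Hypothetical helper: effective address of a field with the page at -4096. */
uint64_t magic_field_ea(size_t field_offset)
{
    return (uint64_t)-4096 + field_offset;
}
```

With these offsets, "ld rX, magic_page->msr" would correspond to a load from displacement -4096 + 88 relative to address 0.
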
> +MSR bits
> +========
> +
> +The MSR contains bits that require hypervisor intervention and bits that do
> +not require direct hypervisor intervention because they only get interpreted
> +when entering the guest or don't have any impact on the hypervisor's behavior.
> +
> +The following bits are safe to be set inside the guest:
> +
> +  MSR_EE
> +  MSR_RI
> +  MSR_CR
> +  MSR_ME
> +
> +If any other bit changes in the MSR, please still use mtmsr(d).
> +
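
If I read this section correctly, the guest-side decision boils down to a mask test on the bits that differ between the old and new MSR. A rough sketch follows; the function name is made up, and the bit values for EE, ME and RI are my assumption, taken from the usual Book3S definitions. MSR_CR is left out of the example mask because the text does not give its bit position.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions (usual Book3S values), not taken from the patch. */
#define EX_MSR_EE 0x8000u  /* external interrupt enable */
#define EX_MSR_ME 0x1000u  /* machine check enable */
#define EX_MSR_RI 0x0002u  /* recoverable interrupt */

/* Example mask only; MSR_CR is omitted, see the note above. */
#define EX_MSR_SAFE_BITS (EX_MSR_EE | EX_MSR_ME | EX_MSR_RI)

/* True if every bit that differs between old_msr and new_msr is in
 * safe_mask, i.e. the update could go straight to the magic page.
 * Otherwise the guest should fall back to a real mtmsr(d). */
bool msr_change_is_safe(uint64_t old_msr, uint64_t new_msr, uint64_t safe_mask)
{
    return ((old_msr ^ new_msr) & ~safe_mask) == 0;
}
```
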
> +Patched instructions
> +====================
> +
> +The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
> +respectively on 32 bit systems with an added offset of 4 to accommodate big
> +endianness.
> +
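
The offset of 4 follows from the byte order: in a big-endian 64-bit field the low 32 bits sit in the last four bytes. A small host-independent sketch of that, with helper names that are mine rather than from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 64-bit value in big-endian byte order, as "std" would. */
void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load a 32-bit big-endian value, as "lwz" would. */
uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Offset a 32 bit guest would use to reach the low word of a 64-bit field. */
size_t low_word_offset(size_t field_offset)
{
    return field_offset + 4;
}
```

A 32 bit guest reading at the unadjusted offset would get the high word instead, which is zero for any value that fits in 32 bits.
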
> +The following is a list of mappings the Linux kernel performs when running as
> +a guest. Implementing any of those mappings is optional, as the instruction
> +traps also act on the shared page. So calling privileged instructions still
> +works as before.
> +
> +From                   To
> +====                   ==
> +
> +mfmsr  rX              ld      rX, magic_page->msr
> +mfsprg rX, 0           ld      rX, magic_page->sprg0
> +mfsprg rX, 1           ld      rX, magic_page->sprg1
> +mfsprg rX, 2           ld      rX, magic_page->sprg2
> +mfsprg rX, 3           ld      rX, magic_page->sprg3
> +mfsrr0 rX              ld      rX, magic_page->srr0
> +mfsrr1 rX              ld      rX, magic_page->srr1
> +mfdar  rX              ld      rX, magic_page->dar
> +mfdsisr rX             lwz     rX, magic_page->dsisr
> +
> +mtmsr  rX              std     rX, magic_page->msr
> +mtsprg 0, rX           std     rX, magic_page->sprg0
> +mtsprg 1, rX           std     rX, magic_page->sprg1
> +mtsprg 2, rX           std     rX, magic_page->sprg2
> +mtsprg 3, rX           std     rX, magic_page->sprg3
> +mtsrr0 rX              std     rX, magic_page->srr0
> +mtsrr1 rX              std     rX, magic_page->srr1
> +mtdar  rX              std     rX, magic_page->dar
> +mtdsisr rX             stw     rX, magic_page->dsisr
> +
> +tlbsync                nop
> +
> +mtmsrd rX, 0           b       <special mtmsr section>
> +mtmsr  rX              b       <special mtmsr section>
> +
> +mtmsrd rX, 1           b       <special mtmsrd section>
> +
> +[BookE only]
> +wrteei [0|1]           b       <special wrteei section>
> +
> +
> +Some instructions require more logic to determine what's going on than a load
> +or store instruction can deliver. To enable patching of those, we keep some
> +RAM around into which we can translate instructions on the fly. What happens
> +is the following:
> +
> +       1) copy emulation code to memory
> +       2) patch that code to fit the emulated instruction
> +       3) patch that code to return to the original pc + 4
> +       4) patch the original instruction to branch to the new code
> +
> +That way we can inject an arbitrary amount of code as replacement for a single
> +instruction. This allows us to check for pending interrupts when setting EE=1
> +for example.
> +

Which patch in the series performs this mapping? Could you please point me to it?
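
For reference, this is how I understand the four patching steps quoted above, modelled in plain C on arrays of instruction words. It is only a sketch under my own assumptions: the template size and slot indices are hypothetical, addresses are simulated 32-bit values, and real code would also have to flush the instruction cache after patching. The branch encoding itself is the standard I-form "b" (primary opcode 18, AA=0, LK=0).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode "b target" for an instruction located at address pc. */
uint32_t ppc_encode_b(uint32_t pc, uint32_t target)
{
    int32_t off = (int32_t)(target - pc);

    assert((off & 3) == 0);                            /* word aligned */
    assert(off >= -0x02000000 && off < 0x02000000);    /* +/- 32 MB reach */
    return 0x48000000u | ((uint32_t)off & 0x03fffffcu);
}

/* Hypothetical template layout, for illustration only. */
#define TMPL_WORDS     3
#define TMPL_SLOT_INSN 0   /* word that receives instruction-specific bits */
#define TMPL_SLOT_RET  2   /* word that becomes the branch back */

void patch_insn(uint32_t *insn, uint32_t insn_addr,
                uint32_t *tramp, uint32_t tramp_addr,
                const uint32_t *tmpl, uint32_t fixup)
{
    /* 1) copy emulation code to memory */
    memcpy(tramp, tmpl, TMPL_WORDS * sizeof(uint32_t));
    /* 2) patch that code to fit the emulated instruction */
    tramp[TMPL_SLOT_INSN] |= fixup;
    /* 3) patch that code to return to the original pc + 4 */
    tramp[TMPL_SLOT_RET] = ppc_encode_b(tramp_addr + 4 * TMPL_SLOT_RET,
                                        insn_addr + 4);
    /* 4) patch the original instruction to branch to the new code */
    *insn = ppc_encode_b(insn_addr, tramp_addr);
}
```

If that matches what the series does, I would expect it to sit next to the code that scans the kernel text for the instructions in the table.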

> --
> 1.6.0.2
>

-- 
-mj