[PATCH 27/27] KVM: PPC: Add Documentation about PV interface

Sat Jul 3 03:59:39 EST 2010

[Resending...]

Please reconcile this with
http://www.linux-kvm.org/page/PowerPC_Hypercall_ABI, which has been
discussed in the (admittedly closed) Power.org embedded hypervisor
working group. Bear in mind that other hypervisors are already
implementing the documented ABI, so if you have concerns, you should
probably raise them with that audience...

-Hollis

On Thu, Jul 1, 2010 at 3:43 AM, Alexander Graf <agraf at suse.de> wrote:
>
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>
> Signed-off-by: Alexander Graf <agraf at suse.de>
>
> ---
>
> v1 -> v2:
>
>  - clarify guest implementation
>  - clarify that privileged instructions still work
>  - explain safe MSR bits
>  - Fix dsisr patch description
>  - change hypervisor calls to use new register values
> ---
>  Documentation/kvm/ppc-pv.txt |  185 ++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 185 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/ppc-pv.txt
>
> diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
> new file mode 100644
> index 0000000..82de6c6
> --- /dev/null
> +++ b/Documentation/kvm/ppc-pv.txt
> @@ -0,0 +1,185 @@
> +The PPC KVM paravirtual interface
> +=================================
> +
> +The basic execution principle by which KVM on PowerPC works is to run all kernel
> +space code in PR=1 which is user space. This way we trap all privileged
> +instructions and can emulate them accordingly.
> +
> +Unfortunately that is also the downfall. There are quite some privileged
> +instructions that needlessly return us to the hypervisor even though they
> +could be handled differently.
> +
> +This is what the PPC PV interface helps with. It takes privileged instructions
> +and transforms them into unprivileged ones with some help from the hypervisor.
> +This cuts down virtualization costs by about 50% on some of my benchmarks.
> +
> +The code for that interface can be found in arch/powerpc/kernel/kvm*
> +
> +Querying for existence
> +======================
> +
> +To find out if we're running on KVM or not, we overlay the PVR register. Usually
> +the PVR register contains an id that identifies your CPU type. If, however, you
> +pass KVM_PVR_PARA in the register that you want the PVR result in, the register
> +still contains KVM_PVR_PARA after the mfpvr call.
> +
> +       LOAD_REG_IMM(r5, KVM_PVR_PARA)
> +       mfpvr   r5
> +       [r5 still contains KVM_PVR_PARA]
> +
> +Once determined to run under a PV capable KVM, you can now use hypercalls as
> +described below.
> +
> +PPC hypercalls
> +==============
> +
> +The only viable ways to reliably get from guest context to host context are:
> +
> +       1) Call an invalid instruction
> +       2) Call the "sc" instruction with a parameter to "sc"
> +       3) Call the "sc" instruction with parameters in GPRs
> +
> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
> +by valid instructions, rendering the interface broken.
> +
> +Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is
> +rather unclear if the sc is targeted directly for the hypervisor or the
> +supervisor. It would also require that we read the syscall issuing instruction
> +every time a syscall is issued, slowing down guest syscalls.
> +
> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and
> +KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these
> +magic values arrives from the guest's kernel mode, we take the syscall as a
> +hypercall.
> +
> +The parameters are as follows:
> +
> +       r0              KVM_SC_MAGIC_R0
> +       r3              KVM_SC_MAGIC_R3         Return code
> +       r4              Hypercall number
> +       r5              First parameter
> +       r6              Second parameter
> +       r7              Third parameter
> +       r8              Fourth parameter
> +
> +Hypercall definitions are shared in generic code, so the same hypercall numbers
> +apply for x86 and powerpc alike.
> +
> +The magic page
> +==============
> +
> +To enable communication between the hypervisor and guest there is a new shared
> +page that contains parts of supervisor visible register state. The guest can
> +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
> +
> +With this hypercall issued the guest always gets the magic page mapped at the
> +desired location in effective and physical address space. For now, we always
> +map the page to -4096. This way we can access it using absolute load and store
> +functions. The following instruction reads the first field of the magic page:
> +
> +       ld      rX, -4096(0)
> +
> +The interface is designed to be extensible should there be need later to add
> +additional registers to the magic page. If you add fields to the magic page,
> +also define a new hypercall feature to indicate that the host can give you more
> +registers. Only if the host supports the additional features, make use of them.
> +
> +The magic page has the following layout as described in
> +arch/powerpc/include/asm/kvm_para.h:
> +
> +struct kvm_vcpu_arch_shared {
> +       __u64 scratch1;
> +       __u64 scratch2;
> +       __u64 scratch3;
> +       __u64 critical;         /* Guest may not get interrupts if == r1 */
> +       __u64 sprg0;
> +       __u64 sprg1;
> +       __u64 sprg2;
> +       __u64 sprg3;
> +       __u64 srr0;
> +       __u64 srr1;
> +       __u64 dar;
> +       __u64 msr;
> +       __u32 dsisr;
> +       __u32 int_pending;      /* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must only occur at the end. Struct fields are always 32
> +bit aligned.
> +
> +MSR bits
> +========
> +
> +The MSR contains bits that require hypervisor intervention and bits that do
> +not require direct hypervisor intervention because they only get interpreted
> +when entering the guest or don't have any impact on the hypervisor's behavior.
> +
> +The following bits are safe to be set inside the guest:
> +
> +  MSR_EE
> +  MSR_RI
> +  MSR_CR
> +  MSR_ME
> +
> +If any other bit changes in the MSR, please still use mtmsr(d).
> +
> +Patched instructions
> +====================
> +
> +The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions
> +respectively on 32 bit systems with an added offset of 4 to accomodate for big
> +endianness.
> +
> +The following is a list of mapping the Linux kernel performs when running as
> +guest. Implementing any of those mappings is optional, as the instruction traps
> +also act on the shared page. So calling privileged instructions still works as
> +before.
> +
> +From                   To
> +====                   ==
> +
> +mfmsr  rX              ld      rX, magic_page->msr
> +mfsprg rX, 0           ld      rX, magic_page->sprg0
> +mfsprg rX, 1           ld      rX, magic_page->sprg1
> +mfsprg rX, 2           ld      rX, magic_page->sprg2
> +mfsprg rX, 3           ld      rX, magic_page->sprg3
> +mfsrr0 rX              ld      rX, magic_page->srr0
> +mfsrr1 rX              ld      rX, magic_page->srr1
> +mfdar  rX              ld      rX, magic_page->dar
> +mfdsisr        rX              lwz     rX, magic_page->dsisr
> +
> +mtmsr  rX              std     rX, magic_page->msr
> +mtsprg 0, rX           std     rX, magic_page->sprg0
> +mtsprg 1, rX           std     rX, magic_page->sprg1
> +mtsprg 2, rX           std     rX, magic_page->sprg2
> +mtsprg 3, rX           std     rX, magic_page->sprg3
> +mtsrr0 rX              std     rX, magic_page->srr0
> +mtsrr1 rX              std     rX, magic_page->srr1
> +mtdar  rX              std     rX, magic_page->dar
> +mtdsisr        rX              stw     rX, magic_page->dsisr
> +
> +tlbsync                        nop
> +
> +mtmsrd rX, 0           b       <special mtmsr section>
> +mtmsr                  b       <special mtmsr section>
> +
> +mtmsrd rX, 1           b       <special mtmsrd section>
> +
> +[BookE only]
> +wrteei [0|1]           b       <special wrteei section>
> +
> +
> +Some instructions require more logic to determine what's going on than a load
> +or store instruction can deliver. To enable patching of those, we keep some
> +RAM around where we can live translate instructions to. What happens is the
> +following:
> +
> +       1) copy emulation code to memory
> +       2) patch that code to fit the emulated instruction
> +       3) patch that code to return to the original pc + 4
> +       4) patch the original instruction to branch to the new code
> +
> +That way we can inject an arbitrary amount of code as replacement for a single
> +instruction. This allows us to check for pending interrupts when setting EE=1
> +for example.
> +
> --
> 1.6.0.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html