[PATCH 27/27] KVM: PPC: Add Documentation about PV interface
MJ embd
mj.embd at gmail.com
Fri Jul 9 19:11:01 EST 2010
On Thu, Jul 1, 2010 at 4:13 PM, Alexander Graf <agraf at suse.de> wrote:
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>
> Signed-off-by: Alexander Graf <agraf at suse.de>
>
> ---
>
> v1 -> v2:
>
> - clarify guest implementation
> - clarify that privileged instructions still work
> - explain safe MSR bits
> - Fix dsisr patch description
> - change hypervisor calls to use new register values
> ---
> Documentation/kvm/ppc-pv.txt | 185 ++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 185 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/kvm/ppc-pv.txt
>
> diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
> new file mode 100644
> index 0000000..82de6c6
> --- /dev/null
> +++ b/Documentation/kvm/ppc-pv.txt
> @@ -0,0 +1,185 @@
> +The PPC KVM paravirtual interface
> +=================================
> +
> +KVM on PowerPC works by running all guest kernel space code with PR=1, that
> +is, in user mode. This way all privileged instructions trap and can be
> +emulated accordingly.
> +
> +Unfortunately, that is also its downside. Quite a few privileged instructions
> +needlessly trap to the hypervisor even though they could be handled
> +differently.
> +
> +This is what the PPC PV interface helps with. It takes privileged instructions
> +and transforms them into unprivileged ones with some help from the hypervisor.
> +This cuts down virtualization costs by about 50% on some of my benchmarks.
> +
> +The code for that interface can be found in arch/powerpc/kernel/kvm*
> +
> +Querying for existence
> +======================
> +
> +To find out whether we're running on KVM, we overlay the PVR. Usually the PVR
> +contains an id that identifies your CPU type. If, however, you pass
> +KVM_PVR_PARA in the register that you want the PVR result in, the register
> +still contains KVM_PVR_PARA after the mfpvr instruction.
> +
> + LOAD_REG_IMM(r5, KVM_PVR_PARA)
> + mfpvr r5
> + [r5 still contains KVM_PVR_PARA]
> +
> +Once you have determined that you are running under a PV capable KVM, you can
> +use hypercalls as described below.
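To check my reading of the detection step, here is a small C model of it. This is only a sketch: the real check is the two-instruction assembly sequence above, EXAMPLE_KVM_PVR_PARA is a placeholder (the actual KVM_PVR_PARA value is not given in this text), and model_mfpvr() and running_on_pv_kvm() are names I made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder only: the real KVM_PVR_PARA constant comes from the patch
 * series and is not quoted in this document. */
#define EXAMPLE_KVM_PVR_PARA 0x12345678u

/* Models "mfpvr rX". reg_in is what rX held before the instruction.
 * On a PV capable host the magic value is left in place; otherwise the
 * register receives the real processor version. */
static uint32_t model_mfpvr(uint32_t reg_in, uint32_t real_pvr, bool pv_host)
{
    if (pv_host && reg_in == EXAMPLE_KVM_PVR_PARA)
        return reg_in;
    return real_pvr;
}

/* Guest side probe, mirroring the LOAD_REG_IMM + mfpvr sequence. */
static bool running_on_pv_kvm(uint32_t real_pvr, bool pv_host)
{
    uint32_t r5;

    /* The probe is ambiguous if a real CPU reported the magic value. */
    assert(real_pvr != EXAMPLE_KVM_PVR_PARA);

    r5 = EXAMPLE_KVM_PVR_PARA;               /* LOAD_REG_IMM(r5, ...) */
    r5 = model_mfpvr(r5, real_pvr, pv_host); /* mfpvr r5 */
    return r5 == EXAMPLE_KVM_PVR_PARA;
}
```

The model also makes one property visible: the scheme relies on no real CPU ever reporting the magic value as its PVR.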
> +
> +PPC hypercalls
> +==============
> +
> +The only viable ways to reliably get from guest context to host context are:
> +
> + 1) Execute an invalid instruction
> + 2) Execute the "sc" instruction with a parameter to "sc"
> + 3) Execute the "sc" instruction with parameters in GPRs
> +
> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
> +by valid instructions, rendering the interface broken.
> +
> +Method 2 also has drawbacks. If the parameter to "sc" is != 0 the spec is
> +rather unclear about whether the sc is targeted at the hypervisor or the
> +supervisor. It would also require that we read the syscall issuing instruction
> +every time a syscall is issued, slowing down guest syscalls.
> +
> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and
> +KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these
> +magic values arrives from the guest's kernel mode, we take the syscall as a
> +hypercall.
> +
> +The parameters are as follows:
> +
> +	Register	In			Out
> +
> +	r0		KVM_SC_MAGIC_R0
> +	r3		KVM_SC_MAGIC_R3		Return code
> +	r4		Hypercall number
> +	r5		First parameter
> +	r6		Second parameter
> +	r7		Third parameter
> +	r8		Fourth parameter
> +
> +Hypercall definitions are shared in generic code, so the same hypercall numbers
> +apply for x86 and powerpc alike.
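As I understand the register convention, the host side decision looks roughly like the C sketch below. The magic values are placeholders (the real KVM_SC_MAGIC_R0 and KVM_SC_MAGIC_R3 are not given here), and struct guest_regs, sc_is_hypercall(), handle_sc() and example_handler() are hypothetical names, not the functions in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholders: the real magic constants are defined by the patch series. */
#define EXAMPLE_SC_MAGIC_R0 0x11111111u
#define EXAMPLE_SC_MAGIC_R3 0x22222222u

/* Hypothetical snapshot of guest state at the time of the "sc". */
struct guest_regs {
    uint64_t gpr[32];
    bool kernel_mode;   /* true if the guest was in its kernel mode */
};

/* An "sc" is a hypercall only when it comes from guest kernel mode and
 * carries both magic values in r0 and r3. */
static bool sc_is_hypercall(const struct guest_regs *regs)
{
    assert(regs != NULL);
    return regs->kernel_mode &&
           regs->gpr[0] == EXAMPLE_SC_MAGIC_R0 &&
           regs->gpr[3] == EXAMPLE_SC_MAGIC_R3;
}

/* Made-up handler: hypercall number 1 returns the sum of its four
 * parameters, anything else returns an all-ones error value. */
static uint64_t example_handler(uint64_t nr, const uint64_t args[4])
{
    if (nr == 1)
        return args[0] + args[1] + args[2] + args[3];
    return (uint64_t)-1;
}

/* Returns false for an ordinary guest syscall. Otherwise takes the
 * number from r4, the parameters from r5..r8, and puts the return
 * code in r3. */
static bool handle_sc(struct guest_regs *regs,
                      uint64_t (*handler)(uint64_t, const uint64_t[4]))
{
    if (!sc_is_hypercall(regs))
        return false;
    regs->gpr[3] = handler(regs->gpr[4], &regs->gpr[5]);
    return true;
}
```

Because only GPRs are inspected, the host never has to fetch the instruction word, which is the advantage over method 2.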
> +
> +The magic page
> +==============
> +
> +To enable communication between the hypervisor and guest there is a new shared
> +page that contains parts of supervisor visible register state. The guest can
> +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
> +
> +Once this hypercall has been issued, the guest always has the magic page
> +mapped at the desired location in effective and physical address space. For
> +now, we always map the page to -4096. This way we can access it using absolute
> +load and store instructions. The following instruction reads the first field
> +of the magic page:
> +
> + ld rX, -4096(0)
> +
> +The interface is designed to be extensible should there be a need later to add
> +additional registers to the magic page. If you add fields to the magic page,
> +also define a new hypercall feature to indicate that the host can give you more
> +registers. Make use of the additional features only if the host supports them.
> +
> +The magic page has the following layout as described in
> +arch/powerpc/include/asm/kvm_para.h:
> +
> +struct kvm_vcpu_arch_shared {
> + __u64 scratch1;
> + __u64 scratch2;
> + __u64 scratch3;
> + __u64 critical; /* Guest may not get interrupts if == r1 */
> + __u64 sprg0;
> + __u64 sprg1;
> + __u64 sprg2;
> + __u64 sprg3;
> + __u64 srr0;
> + __u64 srr1;
> + __u64 dar;
> + __u64 msr;
> + __u32 dsisr;
> + __u32 int_pending; /* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must occur only at the end. Struct fields are always
> +32-bit aligned.
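For reference, the field offsets that follow from this layout can be checked with a few lines of C. This is a sketch only: uint64_t/uint32_t stand in for __u64/__u32, and MAGIC_PAGE_BASE and magic_field_ea() are names I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the struct quoted above. */
struct kvm_vcpu_arch_shared {
    uint64_t scratch1;
    uint64_t scratch2;
    uint64_t scratch3;
    uint64_t critical;
    uint64_t sprg0;
    uint64_t sprg1;
    uint64_t sprg2;
    uint64_t sprg3;
    uint64_t srr0;
    uint64_t srr1;
    uint64_t dar;
    uint64_t msr;
    uint32_t dsisr;
    uint32_t int_pending;
};

/* The page is mapped at -4096, i.e. the last 4k of the address space. */
#define MAGIC_PAGE_BASE (-4096L)

/* Effective address (as a signed displacement from 0) of a field. */
static long magic_field_ea(size_t field_offset)
{
    assert(field_offset < sizeof(struct kvm_vcpu_arch_shared));
    return MAGIC_PAGE_BASE + (long)field_offset;
}

/* Compile-time checks of the layout: these offsets are ABI. */
static_assert(offsetof(struct kvm_vcpu_arch_shared, scratch1) == 0,
              "first field sits at the start of the page");
static_assert(offsetof(struct kvm_vcpu_arch_shared, msr) == 88,
              "msr follows eleven 64-bit fields");
static_assert(offsetof(struct kvm_vcpu_arch_shared, dsisr) == 96,
              "dsisr follows twelve 64-bit fields");
```

With these numbers, "ld rX, magic_page->msr" would become a load with displacement -4096 + 88 = -4008 from 0, and appending new fields leaves every existing offset unchanged.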
> +
> +MSR bits
> +========
> +
> +The MSR contains bits that require hypervisor intervention and bits that do
> +not require direct hypervisor intervention because they only get interpreted
> +when entering the guest or don't have any impact on the hypervisor's behavior.
> +
> +The following bits are safe to be set inside the guest:
> +
> + MSR_EE
> + MSR_RI
> + MSR_CR
> + MSR_ME
> +
> +If any other bit changes in the MSR, please still use mtmsr(d).
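If I read this rule correctly, a guest could decide between the shared page and a real mtmsr(d) with a check like the C sketch below. The EX_MSR_* values are the usual Linux bit positions for EE, ME and RI. MSR_CR is left out because its bit position is not given in this text, so the mask here is an assumption and is passed in as a parameter. msr_write_needs_hypervisor() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as commonly defined for PowerPC in Linux. MSR_CR from the
 * list above is deliberately omitted: its position is not stated here. */
#define EX_MSR_EE 0x8000u   /* external interrupt enable */
#define EX_MSR_ME 0x1000u   /* machine check enable */
#define EX_MSR_RI 0x0002u   /* recoverable interrupt */

#define EX_MSR_SAFE_BITS (EX_MSR_EE | EX_MSR_ME | EX_MSR_RI)

/* Returns true if going from old_msr to new_msr changes any bit outside
 * safe_mask, in which case the guest should still use mtmsr(d). */
static bool msr_write_needs_hypervisor(uint64_t old_msr, uint64_t new_msr,
                                       uint64_t safe_mask)
{
    uint64_t changed = old_msr ^ new_msr;

    assert(safe_mask != 0);
    return (changed & ~safe_mask) != 0;
}
```

Note that the comparison is on changed bits, not on set bits: an unsafe bit that is already set and stays set does not force a trap.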
> +
> +Patched instructions
> +====================
> +
> +The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
> +respectively on 32 bit systems with an added offset of 4 to accommodate for big
> +endianness.
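The +4 is easy to get wrong, so here is a small C sketch of why it is needed. It serialises the 64-bit field byte by byte in big-endian order, so it behaves the same on any build host. store_be64(), load_be32() and patched_offset() are illustrative helpers, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 64-bit value the way a big-endian "std" lays it out. */
static void store_be64(uint8_t *p, uint64_t v)
{
    int i;

    assert(p != NULL);
    for (i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load a 32-bit value the way a big-endian "lwz" reads it. */
static uint32_t load_be32(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Offset used by the patched instruction: a 32 bit guest must skip the
 * upper word of each 64-bit field to reach the low 32 bits. */
static size_t patched_offset(size_t field_offset, bool guest_is_32bit)
{
    return guest_is_32bit ? field_offset + 4 : field_offset;
}
```

On a big-endian layout the most significant word comes first, so an "lwz" at the unadjusted offset would read the upper half of the field.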
> +
> +The following is a list of the mappings the Linux kernel performs when running
> +as a guest. Implementing any of those mappings is optional, as the instruction
> +traps also act on the shared page. So calling privileged instructions still
> +works as before.
> +
> +From			To
> +====			==
> +
> +mfmsr	rX		ld	rX, magic_page->msr
> +mfsprg	rX, 0		ld	rX, magic_page->sprg0
> +mfsprg	rX, 1		ld	rX, magic_page->sprg1
> +mfsprg	rX, 2		ld	rX, magic_page->sprg2
> +mfsprg	rX, 3		ld	rX, magic_page->sprg3
> +mfsrr0	rX		ld	rX, magic_page->srr0
> +mfsrr1	rX		ld	rX, magic_page->srr1
> +mfdar	rX		ld	rX, magic_page->dar
> +mfdsisr	rX		lwz	rX, magic_page->dsisr
> +
> +mtmsr	rX		std	rX, magic_page->msr
> +mtsprg	0, rX		std	rX, magic_page->sprg0
> +mtsprg	1, rX		std	rX, magic_page->sprg1
> +mtsprg	2, rX		std	rX, magic_page->sprg2
> +mtsprg	3, rX		std	rX, magic_page->sprg3
> +mtsrr0	rX		std	rX, magic_page->srr0
> +mtsrr1	rX		std	rX, magic_page->srr1
> +mtdar	rX		std	rX, magic_page->dar
> +mtdsisr	rX		stw	rX, magic_page->dsisr
> +
> +tlbsync			nop
> +
> +mtmsrd	rX, 0		b	<special mtmsr section>
> +mtmsr	rX		b	<special mtmsr section>
> +
> +mtmsrd	rX, 1		b	<special mtmsrd section>
> +
> +[BookE only]
> +wrteei	[0|1]		b	<special wrteei section>
> +
> +
> +Some instructions require more logic to determine what's going on than a load
> +or store instruction can deliver. To enable patching of those, we keep some
> +RAM around into which we can live translate instructions. What happens is the
> +following:
> +
> + 1) copy emulation code to memory
> + 2) patch that code to fit the emulated instruction
> + 3) patch that code to return to the original pc + 4
> + 4) patch the original instruction to branch to the new code
> +
> +That way we can inject an arbitrary amount of code as replacement for a single
> +instruction. This allows us to check for pending interrupts when setting EE=1
> +for example.
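To make sure I follow the four steps, here is how I picture them in C, on a simulated memory of 32-bit words. This is a sketch under assumptions, not the code from the series: patch_instruction(), the template layout (last word reserved for the return branch) and the single fixup slot are all made up. The only real encoding used is the PowerPC relative branch "b" (primary opcode 18, AA=0, LK=0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode "b <pc + byte_offset>": opcode 18 in the top six bits plus a
 * signed, word aligned displacement in the LI field. */
static uint32_t ppc_encode_b(int32_t byte_offset)
{
    assert((byte_offset & 3) == 0);
    assert(byte_offset >= -0x02000000 && byte_offset <= 0x01fffffc);
    return 0x48000000u | ((uint32_t)byte_offset & 0x03fffffcu);
}

/* mem models guest memory as words; addresses are byte offsets into it.
 * tmpl is the emulation code, whose last word is a slot for the return
 * branch. fixup_idx/fixup_insn stand for whatever per-instruction
 * adjustment the template needs (e.g. the register number). */
static void patch_instruction(uint32_t *mem, uint32_t orig_addr,
                              uint32_t scratch_addr,
                              const uint32_t *tmpl, size_t tmpl_words,
                              size_t fixup_idx, uint32_t fixup_insn)
{
    uint32_t *dst = mem + scratch_addr / 4;
    uint32_t ret_addr;

    assert(tmpl_words >= 2 && fixup_idx < tmpl_words - 1);

    /* 1) copy emulation code to memory */
    memcpy(dst, tmpl, tmpl_words * sizeof(*tmpl));

    /* 2) patch that code to fit the emulated instruction */
    dst[fixup_idx] = fixup_insn;

    /* 3) patch that code to return to the original pc + 4 */
    ret_addr = scratch_addr + 4 * (uint32_t)(tmpl_words - 1);
    dst[tmpl_words - 1] =
        ppc_encode_b((int32_t)(orig_addr + 4) - (int32_t)ret_addr);

    /* 4) patch the original instruction to branch to the new code */
    mem[orig_addr / 4] =
        ppc_encode_b((int32_t)scratch_addr - (int32_t)orig_addr);
}
```

One consequence of using a plain "b" in both directions is that the scratch RAM has to lie within the branch range (roughly +/- 32MB) of every patched instruction.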
> +
Which patch in the series performs this mapping? Could you please point me to it?
> --
> 1.6.0.2
>
--
-mj