[RFC PATCH v2 5/6] KVM: PPC: Add support for nested PAPR guests
Nicholas Piggin
npiggin at gmail.com
Wed Jun 7 19:08:40 AEST 2023
On Mon Jun 5, 2023 at 4:48 PM AEST, Jordan Niethe wrote:
> A series of hcalls have been added to the PAPR which allow a regular
> guest partition to create and manage guest partitions of its own. Add
> support to KVM to utilize these hcalls to enable running nested guests.
>
> Overview of the new hcall usage:
>
> - L1 and L0 negotiate capabilities with
> H_GUEST_{G,S}ET_CAPABILITIES()
>
> - L1 requests the L0 create a L2 with
> H_GUEST_CREATE() and receives a handle to use in future hcalls
>
> - L1 requests the L0 create a L2 vCPU with
> H_GUEST_CREATE_VCPU()
>
> - L1 sets up the L2 using H_GUEST_SET and the
> H_GUEST_VCPU_RUN input buffer
>
> - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN()
>
> - L2 returns to L1 with an exit reason and L1 reads the
> H_GUEST_VCPU_RUN output buffer populated by the L0
>
> - L1 handles the exit using H_GET_STATE if necessary
>
> - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN
>
> - L1 frees the L2 in the L0 with H_GUEST_DELETE()
>
> Support for the new API is determined by trying
> H_GUEST_GET_CAPABILITIES. On a successful return, the new API will then
> be used.
>
> Use the vcpu register state setters for tracking modified guest state
> elements and copy the thread wide values into the H_GUEST_VCPU_RUN input
> buffer immediately before running a L2. The guest wide
> elements can not be added to the input buffer so send them with a
> separate H_GUEST_SET call if necessary.
>
> Make the vcpu register getter load the corresponding value from the real
> host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
> which values have already been loaded between H_GUEST_VCPU_RUN calls. If
> an element is present in the H_GUEST_VCPU_RUN output buffer it also does
> not need to be loaded again.
>
> There is existing support for running nested guests on KVM
> with powernv. However the interface used for this is not supported by
> other PAPR hosts. This existing API is still supported.
>
> Signed-off-by: Jordan Niethe <jpn at linux.vnet.ibm.com>
> ---
> v2:
> - Declare op structs as static
> - Use expressions in switch case with local variables
> - Do not use the PVR for the LOGICAL PVR ID
> - Handle emul_inst as now a double word
> - Use new GPR(), etc macros
> - Determine PAPR nested capabilities from cpu features
> ---
> arch/powerpc/include/asm/guest-state-buffer.h | 105 +-
> arch/powerpc/include/asm/hvcall.h | 30 +
> arch/powerpc/include/asm/kvm_book3s.h | 122 ++-
> arch/powerpc/include/asm/kvm_book3s_64.h | 6 +
> arch/powerpc/include/asm/kvm_host.h | 21 +
> arch/powerpc/include/asm/kvm_ppc.h | 64 +-
> arch/powerpc/include/asm/plpar_wrappers.h | 198 ++++
> arch/powerpc/kvm/Makefile | 1 +
> arch/powerpc/kvm/book3s_hv.c | 126 ++-
> arch/powerpc/kvm/book3s_hv.h | 74 +-
> arch/powerpc/kvm/book3s_hv_nested.c | 38 +-
> arch/powerpc/kvm/book3s_hv_papr.c | 940 ++++++++++++++++++
> arch/powerpc/kvm/emulate_loadstore.c | 4 +-
> arch/powerpc/kvm/guest-state-buffer.c | 49 +
> 14 files changed, 1684 insertions(+), 94 deletions(-)
> create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c
>
> diff --git a/arch/powerpc/include/asm/guest-state-buffer.h b/arch/powerpc/include/asm/guest-state-buffer.h
> index 65a840abf1bb..116126edd8e2 100644
> --- a/arch/powerpc/include/asm/guest-state-buffer.h
> +++ b/arch/powerpc/include/asm/guest-state-buffer.h
> @@ -5,6 +5,7 @@
> #ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
> #define _ASM_POWERPC_GUEST_STATE_BUFFER_H
>
> +#include "asm/hvcall.h"
> #include <linux/gfp.h>
> #include <linux/bitmap.h>
> #include <asm/plpar_wrappers.h>
> @@ -14,16 +15,16 @@
> **************************************************************************/
> #define GSID_BLANK 0x0000
>
> -#define GSID_HOST_STATE_SIZE 0x0001 /* Size of Hypervisor Internal Format VCPU state */
> -#define GSID_RUN_OUTPUT_MIN_SIZE 0x0002 /* Minimum size of the Run VCPU output buffer */
> -#define GSID_LOGICAL_PVR 0x0003 /* Logical PVR */
> -#define GSID_TB_OFFSET 0x0004 /* Timebase Offset */
> -#define GSID_PARTITION_TABLE 0x0005 /* Partition Scoped Page Table */
> -#define GSID_PROCESS_TABLE 0x0006 /* Process Table */
> +#define GSID_HOST_STATE_SIZE 0x0001
> +#define GSID_RUN_OUTPUT_MIN_SIZE 0x0002
> +#define GSID_LOGICAL_PVR 0x0003
> +#define GSID_TB_OFFSET 0x0004
> +#define GSID_PARTITION_TABLE 0x0005
> +#define GSID_PROCESS_TABLE 0x0006
You lost your comments.
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 0ca2d8b37b42..c5c57552b447 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -12,6 +12,7 @@
> #include <linux/types.h>
> #include <linux/kvm_host.h>
> #include <asm/kvm_book3s_asm.h>
> +#include <asm/guest-state-buffer.h>
>
> struct kvmppc_bat {
> u64 raw;
> @@ -316,6 +317,57 @@ long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
>
> void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
>
> +
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +
> +extern bool __kvmhv_on_papr;
> +
> +static inline bool kvmhv_on_papr(void)
> +{
> + return __kvmhv_on_papr;
> +}
It's a nitpick, but kvmhv_on_pseries() is because we're runnning KVM-HV
on a pseries guest kernel. Which is a papr guest kernel. So this kind of
doesn't make sense if you read it the same way.
kvmhv_nested_using_papr() or something like that might read a bit
better.
This could be a static key too.
> @@ -575,6 +593,7 @@ struct kvm_vcpu_arch {
> ulong dscr;
> ulong amr;
> ulong uamor;
> + ulong amor;
> ulong iamr;
> u32 ctrl;
> u32 dabrx;
This belongs somewhere else.
> @@ -829,6 +848,8 @@ struct kvm_vcpu_arch {
> u64 nested_hfscr; /* HFSCR that the L1 requested for the nested guest */
> u32 nested_vcpu_id;
> gpa_t nested_io_gpr;
> + /* For nested APIv2 guests*/
> + struct kvmhv_papr_host papr_host;
> #endif
This is not exactly a papr host. Might have to come up with a better
name especially if we implement a L0 things could get confusing.
> @@ -342,6 +343,203 @@ static inline long plpar_get_cpu_characteristics(struct h_cpu_char_result *p)
> return rc;
> }
>
> +static inline long plpar_guest_create(unsigned long flags, unsigned long *guest_id)
> +{
> + unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> + unsigned long token;
> + long rc;
> +
> + token = -1UL;
> + while (true) {
> + rc = plpar_hcall(H_GUEST_CREATE, retbuf, flags, token);
> + if (rc == H_SUCCESS) {
> + *guest_id = retbuf[0];
> + break;
> + }
> +
> + if (rc == H_BUSY) {
> + token = retbuf[0];
> + cpu_relax();
> + continue;
> + }
> +
> + if (H_IS_LONG_BUSY(rc)) {
> + token = retbuf[0];
> + mdelay(get_longbusy_msecs(rc));
All of these things need a non-sleeping delay? Can we sleep instead?
Or if not, might have to think about going back to the caller and it
can retry.
get/set state might be a bit inconvenient, although I don't expect
that should potentially take so long as guest and vcpu create/delete,
so at least those ones would be good if they're called while
preemptable.
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 521d84621422..f22ee582e209 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -383,6 +383,11 @@ static void kvmppc_core_vcpu_put_hv(struct kvm_vcpu *vcpu)
> spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
> }
>
> +static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 pvr)
> +{
> + vcpu->arch.pvr = pvr;
> +}
Didn't you lose this in a previous patch? I thought it must have moved
to a header but it reappears.
> +
> /* Dummy value used in computing PCR value below */
> #define PCR_ARCH_31 (PCR_ARCH_300 << 1)
>
> @@ -1262,13 +1267,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> return RESUME_HOST;
> break;
> #endif
> - case H_RANDOM:
> + case H_RANDOM: {
> unsigned long rand;
>
> if (!arch_get_random_seed_longs(&rand, 1))
> ret = H_HARDWARE;
> kvmppc_set_gpr(vcpu, 4, rand);
> break;
> + }
> case H_RPT_INVALIDATE:
> ret = kvmppc_h_rpt_invalidate(vcpu, kvmppc_get_gpr(vcpu, 4),
> kvmppc_get_gpr(vcpu, 5),
Compile fix for a previous patch.
> @@ -2921,14 +2927,21 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu *vcpu)
> vcpu->arch.shared_big_endian = false;
> #endif
> #endif
> - kvmppc_set_mmcr_hv(vcpu, 0, MMCR0_FC);
>
> + if (kvmhv_on_papr()) {
> + err = kvmhv_papr_vcpu_create(vcpu, &vcpu->arch.papr_host);
> + if (err < 0)
> + return err;
> + }
> +
> + kvmppc_set_mmcr_hv(vcpu, 0, MMCR0_FC);
> if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> kvmppc_set_mmcr_hv(vcpu, 0, kvmppc_get_mmcr_hv(vcpu, 0) | MMCR0_PMCCEXT);
> kvmppc_set_mmcra_hv(vcpu, MMCRA_BHRB_DISABLE);
> }
>
> kvmppc_set_ctrl_hv(vcpu, CTRL_RUNLATCH);
> + kvmppc_set_amor_hv(vcpu, ~0);
This AMOR thing should go somewhere else. Not actually sure why it needs
to be added to the vcpu since it's always ~0. Maybe just put that in a
#define somewhere and use that.
> @@ -4042,6 +4059,50 @@ static void vcpu_vpa_increment_dispatch(struct kvm_vcpu *vcpu)
> }
> }
>
> +static int kvmhv_vcpu_entry_papr(struct kvm_vcpu *vcpu, u64 time_limit, unsigned long lpcr, u64 *tb)
> +{
> + struct kvmhv_papr_host *ph;
> + unsigned long msr, i;
> + int trap;
> + long rc;
> +
> + ph = &vcpu->arch.papr_host;
> +
> + msr = mfmsr();
> + kvmppc_msr_hard_disable_set_facilities(vcpu, msr);
> + if (lazy_irq_pending())
> + return 0;
> +
> + kvmhv_papr_flush_vcpu(vcpu, time_limit);
> +
> + accumulate_time(vcpu, &vcpu->arch.in_guest);
> + rc = plpar_guest_run_vcpu(0, vcpu->kvm->arch.lpid, vcpu->vcpu_id,
> + &trap, &i);
> +
> + if (rc != H_SUCCESS) {
> + pr_err("KVM Guest Run VCPU hcall failed\n");
> + if (rc == H_INVALID_ELEMENT_ID)
> + pr_err("KVM: Guest Run VCPU invalid element id at %ld\n", i);
> + else if (rc == H_INVALID_ELEMENT_SIZE)
> + pr_err("KVM: Guest Run VCPU invalid element size at %ld\n", i);
> + else if (rc == H_INVALID_ELEMENT_VALUE)
> + pr_err("KVM: Guest Run VCPU invalid element value at %ld\n", i);
> + return 0;
> + }
This needs the proper error handling. Were you going to wait until I
sent that up for existing code?
> @@ -5119,6 +5183,7 @@ static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
> */
> void kvmppc_update_lpcr(struct kvm *kvm, unsigned long lpcr, unsigned long mask)
> {
> + struct kvm_vcpu *vcpu;
> long int i;
> u32 cores_done = 0;
>
> @@ -5139,6 +5204,12 @@ void kvmppc_update_lpcr(struct kvm *kvm, unsigned long lpcr, unsigned long mask)
> if (++cores_done >= kvm->arch.online_vcores)
> break;
> }
> +
> + if (kvmhv_on_papr()) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + kvmppc_set_lpcr_hv(vcpu, vcpu->arch.vcore->lpcr);
> + }
> + }
> }
vcpu define could go in that scope I guess.
> @@ -5405,15 +5476,43 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
>
> /* Allocate the guest's logical partition ID */
>
> - lpid = kvmppc_alloc_lpid();
> - if ((long)lpid < 0)
> - return -ENOMEM;
> - kvm->arch.lpid = lpid;
> + if (!kvmhv_on_papr()) {
> + lpid = kvmppc_alloc_lpid();
> + if ((long)lpid < 0)
> + return -ENOMEM;
> + kvm->arch.lpid = lpid;
> + }
>
> kvmppc_alloc_host_rm_ops();
>
> kvmhv_vm_nested_init(kvm);
>
> + if (kvmhv_on_papr()) {
> + long rc;
> + unsigned long guest_id;
> +
> + rc = plpar_guest_create(0, &guest_id);
> +
> + if (rc != H_SUCCESS)
> + pr_err("KVM: Create Guest hcall failed, rc=%ld\n", rc);
> +
> + switch (rc) {
> + case H_PARAMETER:
> + case H_FUNCTION:
> + case H_STATE:
> + return -EINVAL;
> + case H_NOT_ENOUGH_RESOURCES:
> + case H_ABORTED:
> + return -ENOMEM;
> + case H_AUTHORITY:
> + return -EPERM;
> + case H_NOT_AVAILABLE:
> + return -EBUSY;
> + }
> + kvm->arch.lpid = guest_id;
> + }
I wouldn't mind putting lpid/guest_id in different variables. guest_id
is 64-bit isn't it? LPIDR is 32. If nothing else that could cause
issues if the hypervisor does something clever with the token.
> @@ -5573,10 +5675,14 @@ static void kvmppc_core_destroy_vm_hv(struct kvm *kvm)
> kvm->arch.process_table = 0;
> if (kvm->arch.secure_guest)
> uv_svm_terminate(kvm->arch.lpid);
> - kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
> + if (!kvmhv_on_papr())
> + kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
> }
Would be nice to have a +ve test for the "existing" API. All we have to
do is think of a name for it.
Thanks,
Nick
More information about the Linuxppc-dev
mailing list