[PATCH v2 RFC 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
Cédric Le Goater
clg at kaod.org
Wed May 2 22:30:51 AEST 2018
On 05/01/2018 07:04 AM, Sam Bobroff wrote:
> From: Sam Bobroff <sam.bobroff at au1.ibm.com>
>
> It is not currently possible to create the full number of possible
> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
> threads per core than its core stride (or "VSMT mode"). This is
> because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
>
> To address this, "pack" the VCORE ID and XIVE offsets by using
> knowledge of the way the VCPU IDs will be used when there are fewer
> guest threads per core than the core stride. The primary thread of
> each core will always be used first. Then, if the guest uses more than
> one thread per core, these secondary threads will sequentially follow
> the primary in each core.
>
> So, the only way an ID above KVM_MAX_VCPUS can be seen is if the
> VCPUs are being spaced apart, so at least half of each core is empty
> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> into the second half of each core (4..7, in an 8-thread core).
>
> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> each core is being left empty, and we can map down into the second and
> fourth quarters of each core (2, 3 and 6, 7 in an 8-thread core).
>
> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> threads are being used and 7/8 of the core is empty, allowing use of
> the 1, 3, 5 and 7 thread slots.
>
> (Strides less than 8 are handled similarly.)
>
> This allows the VCORE ID or offset to be calculated quickly from the
> VCPU ID or XIVE server numbers, without access to the VCPU structure.
>
> Signed-off-by: Sam Bobroff <sam.bobroff at au1.ibm.com>
> ---
> Hello everyone,
>
> I've tested this on P8 and P9, in lots of combinations of host and guest
> threading modes and it has been fine but it does feel like a "tricky"
> approach, so I still feel somewhat wary about it.
>
> I've posted it as an RFC because I have not tested it with guest native-XIVE,
> and I suspect that it will take some work to support it.
> ====== v1 -> v2: ======
>
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
> * Corrected places in kvm/book3s_xive.c where IDs weren't packed.
> * Because kvmppc_pack_vcpu_id() is only called on P9, there is no need to test "emul_smt_mode > 1", so remove it.
> * Re-ordered block_offsets[] into a more nearly ascending order.
> * Added more detailed description of the packing algorithm.
>
> ====== v1: ======
>
> Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
>
> arch/powerpc/include/asm/kvm_book3s.h | 44 +++++++++++++++++++++++++++++++++++
> arch/powerpc/kvm/book3s_hv.c | 14 +++++++----
> arch/powerpc/kvm/book3s_xive.c | 19 +++++++++------
> 3 files changed, 66 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 376ae803b69c..a8d9d625e873 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -368,4 +368,48 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
> #define SPLIT_HACK_MASK 0xff000000
> #define SPLIT_HACK_OFFS 0xfb000000
>
> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
> + * (but not its actual threading mode, which is not available) to avoid
> + * collisions.
> + *
> + * The implementation leaves VCPU IDs from the range [0..KVM_MAX_VCPUS) (block
> + * 0) unchanged: if the guest is filling each VCORE completely then it will be
> + * using consecutive IDs and it will fill the space without any packing.
> + *
> + * For higher VCPU IDs, the packed ID is based on the VCPU ID modulo
> + * KVM_MAX_VCPUS (effectively masking off the top bits) and then an offset is
> + * added to avoid collisions.
> + *
> + * VCPU IDs in the range [KVM_MAX_VCPUS..(KVM_MAX_VCPUS*2)) (block 1) are only
> + * possible if the guest is leaving at least 1/2 of each VCORE empty, so IDs
> + * can be safely packed into the second half of each VCORE by adding an offset
> + * of (stride / 2).
> + *
> + * Similarly, if VCPU IDs in the range [(KVM_MAX_VCPUS*2)..(KVM_MAX_VCPUS*4))
> + * (blocks 2 and 3) are seen, the guest must be leaving at least 3/4 of each
> + * VCORE empty so packed IDs can be offset by (stride / 4) and (stride * 3 / 4).
> + *
> + * Finally, VCPU IDs from blocks 4..7 will only be seen if the guest is using a
> + * stride of 8 and 1 thread per core so the remaining offsets of 1, 3, 5 and 7
> + * must be free to use.
> + *
> + * (The offsets for each block are stored in block_offsets[], indexed by the
> + * block number if the stride is 8. For cases where the guest's stride is less
> + * than 8, we can re-use the block_offsets array by multiplying the block
> + * number by (MAX_SMT_THREADS / stride) to reach the correct entry.)
> + */
> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> +{
> + const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 3, 5, 7};
> + int stride = kvm->arch.emul_smt_mode;
> + int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> + u32 packed_id;
> +
> + BUG_ON(block >= MAX_SMT_THREADS);
> + packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> + BUG_ON(packed_id >= KVM_MAX_VCPUS);
> + return packed_id;
> +}
> +
> #endif /* __ASM_KVM_BOOK3S_H__ */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 9cb9448163c4..49165cc90051 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
> return threads_per_subcore;
> }
>
> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> {
> struct kvmppc_vcore *vcore;
>
> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> init_swait_queue_head(&vcore->wq);
> vcore->preempt_tb = TB_NIL;
> vcore->lpcr = kvm->arch.lpcr;
> - vcore->first_vcpuid = core * kvm->arch.smt_mode;
> + vcore->first_vcpuid = id;
> vcore->kvm = kvm;
> INIT_LIST_HEAD(&vcore->preempt_list);
>
> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> mutex_lock(&kvm->lock);
> vcore = NULL;
> err = -EINVAL;
> - core = id / kvm->arch.smt_mode;
> + if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> + BUG_ON(kvm->arch.smt_mode != 1);
> + core = kvmppc_pack_vcpu_id(kvm, id);
> + } else {
> + core = id / kvm->arch.smt_mode;
> + }
> if (core < KVM_MAX_VCORES) {
> vcore = kvm->arch.vcores[core];
> + BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> if (!vcore) {
> err = -ENOMEM;
> - vcore = kvmppc_vcore_create(kvm, core);
> + vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
> kvm->arch.vcores[core] = vcore;
> kvm->arch.online_vcores++;
> }
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index f9818d7d3381..dbd5887daf4a 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> return -EBUSY;
> }
Quick check:
I suppose that the numbers used in the QEMU/KVM ioctl calls are
the same as the ones used by the guest OS in the RTAS calls,
set-xive and get-xive, and in the hcalls, H_IPI and H_IPOLL.
cpu->vcpu_id under QEMU?
The xive part looks sane, but I think it would look even better
if we could get rid of the 'xive->vp_base + NUMBER' calculations
beforehand. These are shortcuts between CPU number spaces.
To fix that, we could add an 'act_vp_id' field under
kvmppc_xive_irq_state which would be assigned the 'vp_id'
of the selected target. And so, xive_select_target() needs some
changes.
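
Roughly something like this (a cut-down stand-in, not the real struct: only
number and act_server mirror existing kvmppc_xive_irq_state fields, the
act_vp_id field and its placement are my suggestion):

```c
#include <stdint.h>

/* Cut-down stand-in for kvmppc_xive_irq_state: cache the packed VP id
 * next to act_server so callers no longer open-code
 * 'xive->vp_base + NUMBER' themselves. act_vp_id would be filled in
 * by xive_select_target() when it picks a target. */
struct kvmppc_xive_irq_state_sketch {
	uint32_t number;	/* guest irq number */
	uint8_t  act_server;	/* targeted server number */
	uint32_t act_vp_id;	/* packed VP id of the selected target */
};
```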
C.
> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> +{
> + return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> +}
> +
> static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> struct kvmppc_xive_src_block *sb,
> struct kvmppc_xive_irq_state *state)
> @@ -362,7 +367,7 @@ static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> */
> if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
> xive_native_configure_irq(hw_num,
> - xive->vp_base + state->act_server,
> + xive_vp(xive, state->act_server),
> MASKED, state->number);
> /* set old_p so we can track if an H_EOI was done */
> state->old_p = true;
> @@ -418,7 +423,7 @@ static void xive_finish_unmask(struct kvmppc_xive *xive,
> */
> if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
> xive_native_configure_irq(hw_num,
> - xive->vp_base + state->act_server,
> + xive_vp(xive, state->act_server),
> state->act_priority, state->number);
> /* If an EOI is needed, do it here */
> if (!state->old_p)
> @@ -495,7 +500,7 @@ static int xive_target_interrupt(struct kvm *kvm,
> kvmppc_xive_select_irq(state, &hw_num, NULL);
>
> return xive_native_configure_irq(hw_num,
> - xive->vp_base + server,
> + xive_vp(xive, server),
> prio, state->number);
> }
>
> @@ -883,7 +888,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
> * which is fine for a never started interrupt.
> */
> xive_native_configure_irq(hw_irq,
> - xive->vp_base + state->act_server,
> + xive_vp(xive, state->act_server),
> state->act_priority, state->number);
>
> /*
> @@ -959,7 +964,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
>
> /* Reconfigure the IPI */
> xive_native_configure_irq(state->ipi_number,
> - xive->vp_base + state->act_server,
> + xive_vp(xive, state->act_server),
> state->act_priority, state->number);
>
> /*
> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> pr_devel("Duplicate !\n");
> return -EEXIST;
> }
> - if (cpu >= KVM_MAX_VCPUS) {
> + if (cpu >= KVM_MAX_VCPU_ID) {
> pr_devel("Out of bounds !\n");
> return -EINVAL;
> }
> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> xc->xive = xive;
> xc->vcpu = vcpu;
> xc->server_num = cpu;
> - xc->vp_id = xive->vp_base + cpu;
> + xc->vp_id = xive_vp(xive, cpu);
> xc->mfrr = 0xff;
> xc->valid = true;
>
>