[PATCH 13/13] kvm/powerpc: Allow book3s_hv guests to use SMT processor modes

Tue May 17 20:44:22 EST 2011

On Tue, May 17, 2011 at 10:21:56AM +0200, Alexander Graf wrote:
> 
> On 11.05.2011, at 12:46, Paul Mackerras wrote:
> 
> > -#define KVM_MAX_VCPUS 1
> > +#define KVM_MAX_VCPUS		NR_CPUS
> > +#define KVM_THREADS_PER_CORE	4
> 
> So what if POWER8 (or whatever it will be called) comes along with 8
> threads per core? Would that change the userspace interface?

The idea is that userspace queries the KVM_CAP_PPC_SMT capability and
the value it gets back is the number of vcpus per vcore.  It then
allocates vcpu numbers based on that.

If a CPU came along with more than 4 threads per core then we'd have
to change that define in the kernel, but that won't affect the
userspace API.
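
As a rough illustration of that allocation (a sketch only, not code from
the patch): assume userspace has already read the vcpus-per-vcore value
via KVM_CHECK_EXTENSION on KVM_CAP_PPC_SMT, and that it numbers vcpus
contiguously within each vcore.  The helper names vcpu_id_for() and
vcore_of() are hypothetical, and the contiguous layout is an assumption.

```c
#include <assert.h>

/*
 * Hypothetical helper: pick a vcpu number for hardware thread `thread`
 * of virtual core `vcore`.  `threads_per_vcore` stands for the value
 * userspace got back from KVM_CAP_PPC_SMT (assumed to be queried
 * elsewhere).  Returns -1 if the thread index does not fit in a vcore.
 */
static int vcpu_id_for(int vcore, int thread, int threads_per_vcore)
{
	assert(threads_per_vcore > 0);
	if (vcore < 0 || thread < 0 || thread >= threads_per_vcore)
		return -1;
	/* Assumed layout: vcpus of one vcore occupy consecutive numbers. */
	return vcore * threads_per_vcore + thread;
}

/* Inverse mapping under the same assumed layout. */
static int vcore_of(int vcpu_id, int threads_per_vcore)
{
	assert(threads_per_vcore > 0);
	return vcpu_id / threads_per_vcore;
}
```

With a value of 4, thread 2 of vcore 1 would get vcpu number 6; if a
later kernel reported 8, the same userspace code would just space the
numbers further apart.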

> > +	/* wait for secondary threads to get back to nap mode */
> > +	spin_lock(&vc->lock);
> > +	if (vc->nap_count < vc->n_woken)
> > +		kvmppc_wait_for_nap(vc);
> 
> So you're taking the vcore wide lock and wait for other CPUs to set
> themselves to nap? Not sure I fully understand this. Why would
> another thread want to go to nap mode when it's 100% busy?

It's more about waiting for the other hardware threads to have
finished writing their vcpu state to memory.  Currently those threads
then go to nap mode, but they could in fact poll for a bit instead,
so that name is possibly a bit misleading, I agree.
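
To make the handshake concrete, here is a small userspace model of it
(a sketch under assumptions, not the kernel code).  The real code does
this under vc->lock; the model swaps in C11 atomics instead, and the
names struct nap_model, secondary_done() and wait_for_secondaries() are
hypothetical.  Only nap_count and n_woken come from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct nap_model {
	int n_woken;          /* secondary threads woken for this run */
	atomic_int nap_count; /* secondaries that have saved their vcpu state */
};

/* Called by a secondary thread after it has written its vcpu state. */
static void secondary_done(struct nap_model *vc)
{
	/* Release ordering: the state stores are visible before the count. */
	int prev = atomic_fetch_add_explicit(&vc->nap_count, 1,
					     memory_order_release);
	assert(prev < vc->n_woken); /* never more finishers than woken */
	(void)prev;
}

/*
 * Called by the thread running the vcore.  Polls up to max_spins times
 * and reports whether every woken secondary has checked in.
 */
static bool wait_for_secondaries(struct nap_model *vc, long max_spins)
{
	for (long i = 0; i < max_spins; i++) {
		if (atomic_load_explicit(&vc->nap_count,
					 memory_order_acquire) >= vc->n_woken)
			return true;
	}
	return atomic_load_explicit(&vc->nap_count,
				    memory_order_acquire) >= vc->n_woken;
}
```

The point of the model is that the condition being waited for is "all
woken threads have saved their state", which is independent of whether
they nap or poll afterwards.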

> > +	cmpwi	r12,0x980
> > +	beq	40f
> > +	cmpwi	r3,0x100
> 
> good old use define comment :)

Yep, OK. :)

> Maybe I also missed the point here, but how does this correlate with
> Linux threads? Is each vcpu running in its own Linux thread? How
> does the scheduling happen? IIUC the host only sees a single thread
> per core and then distributes the vcpus to the respective host
> threads.

Each vcpu has its own Linux thread, but while the vcore is running,
all but one of them are sleeping.  The thing is that since the host is
running with each core single-threaded, one Linux thread is enough to
run 4 vcpus.  So when we decide we can run the vcore, the vcpu thread
that discovered that we can now run the vcore takes the responsibility
to run it.  That involves sending an IPI to the other hardware threads
to wake them up and get them to each run a vcpu.  Then the vcpu thread
that is running the vcore dives into the guest switch code itself.  It
synchronizes with the other threads and does the partition switch, and
then they all enter the guest.
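
The "one vcpu thread takes responsibility" step can be modelled roughly
as below.  This is a sketch only: the patch serializes this with the
vcore lock, whereas the model uses a C11 compare-and-swap, and the names
struct runner_model, try_become_runner() and finish_run() are
hypothetical.  The IPI to the other hardware threads and the partition
switch are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_RUNNER (-1)

struct runner_model {
	atomic_int runner; /* id of the vcpu thread running the vcore */
};

/*
 * Called by a vcpu thread that finds the vcore can run.  Exactly one
 * caller wins and becomes the runner; the others would go to sleep.
 */
static bool try_become_runner(struct runner_model *vc, int vcpu_id)
{
	int expected = NO_RUNNER;
	return atomic_compare_exchange_strong(&vc->runner, &expected, vcpu_id);
}

/* Called by the runner once all threads have left the guest. */
static void finish_run(struct runner_model *vc, int vcpu_id)
{
	int expected = vcpu_id;
	bool ok = atomic_compare_exchange_strong(&vc->runner, &expected,
						 NO_RUNNER);
	assert(ok); /* only the current runner may give the role back */
	(void)ok;
}
```

In the model, a thread that wins try_become_runner() would wake the
other hardware threads and enter the guest switch code; the losing
threads correspond to the sleeping vcpu threads described above.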

We thought about various schemes to cope with the hardware restriction
that all hardware threads in a core have to be in the same partition
(at least whenever the MMU is on).  This is the least messy scheme we
could come up with.  I'd be happy to discuss the alternatives if you
like.

Paul.