[PATCH 13/13] kvm/powerpc: Allow book3s_hv guests to use SMT processor modes

Alexander Graf agraf at suse.de
Tue May 17 21:36:26 EST 2011



Am 17.05.2011 um 12:44 schrieb Paul Mackerras <paulus at samba.org>:

> On Tue, May 17, 2011 at 10:21:56AM +0200, Alexander Graf wrote:
>> 
>> On 11.05.2011, at 12:46, Paul Mackerras wrote:
>> 
>>> -#define KVM_MAX_VCPUS 1
>>> +#define KVM_MAX_VCPUS        NR_CPUS
>>> +#define KVM_THREADS_PER_CORE    4
>> 
>> So what if POWER8 (or whatever it will be called) comes along with 8
>> threads per core? Would that change the userspace interface?
> 
> The idea is that userspace queries the KVM_CAP_PPC_SMT capability and
> the value it gets back is the number of vcpus per vcore.  It then
> allocates vcpu numbers based on that.
> 
> If a CPU came along with more than 4 threads per core then we'd have
> to change that define in the kernel, but that won't affect the
> userspace API.

Ah, I see :). That's exactly why documentation is so important - with proper documentation you wouldn't need to explain those things to me :)
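
For the archives, here's roughly what I'd expect userspace to do with
that capability - just a sketch, the smp_cpus/smt_threads parameters and
the packing helper are made up, not actual qemu code:

    /* sketch only: pack guest threads into threads_per_vcore blocks */
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void create_packed_vcpus(int kvm_fd, int vm_fd,
                                    int smp_cpus, int smt_threads)
    {
        int threads_per_vcore = ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                                      KVM_CAP_PPC_SMT);
        int i;

        if (threads_per_vcore <= 0)
            threads_per_vcore = 1;  /* host doesn't do SMT packing */

        for (i = 0; i < smp_cpus; i++) {
            /* vcpus of one guest core share one block of ids */
            int vcpu_id = (i / smt_threads) * threads_per_vcore
                          + (i % smt_threads);

            ioctl(vm_fd, KVM_CREATE_VCPU, vcpu_id);
        }
    }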

> 
>>> +    /* wait for secondary threads to get back to nap mode */
>>> +    spin_lock(&vc->lock);
>>> +    if (vc->nap_count < vc->n_woken)
>>> +        kvmppc_wait_for_nap(vc);
>> 
>> So you're taking the vcore wide lock and wait for other CPUs to set
>> themselves to nap? Not sure I fully understand this. Why would
>> another thread want to go to nap mode when it's 100% busy?
> 
> It's more about waiting for the other hardware threads to have
> finished writing their vcpu state to memory.  Currently those threads
> then go to nap mode, but they could in fact poll instead for a bit,
> so the name is possibly a bit misleading, I agree.

Just so I understand the scheme: one vcpu needs to go to MMU mode in KVM, so it sends IPIs to stop the other threads, and finally we return from this wait here?
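
In pseudo-code, I picture the exit path on the thread that runs the
vcore roughly like this (hand-waving, the helper names are invented,
only the nap_count/n_woken part is from the patch):

    /* pseudo-code, not the patch's actual exit path */
    kick_hw_threads(vc);            /* IPIs pull the other threads out */

    /* each hardware thread saves its vcpu state, then bumps nap_count */

    spin_lock(&vc->lock);
    if (vc->nap_count < vc->n_woken)
        kvmppc_wait_for_nap(vc);    /* wait until all state is in memory */
    spin_unlock(&vc->lock);
    /* now it's safe to touch the other vcpus' state from the host */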

> 
>>> +    cmpwi    r12,0x980
>>> +    beq    40f
>>> +    cmpwi    r3,0x100
>> 
>> good old "use a define" comment :)
> 
> Yep, OK. :)
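
To spell out what I mean: something like the below, assuming the
BOOK3S_INTERRUPT_* constants in asm/kvm_asm.h cover these vectors
(0x980 being the hypervisor decrementer, 0x100 the system reset):

    #define BOOK3S_INTERRUPT_SYSTEM_RESET     0x100
    #define BOOK3S_INTERRUPT_HV_DECREMENTER   0x980

    /* so the asm compares read as: */
    /*    cmpwi   r12, BOOK3S_INTERRUPT_HV_DECREMENTER */
    /*    beq     40f                                  */
    /*    cmpwi   r3, BOOK3S_INTERRUPT_SYSTEM_RESET    */
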
> 
>> Maybe I also missed the point here, but how does this correlate with
>> Linux threads? Is each vcpu running in its own Linux thread? How
>> does the scheduling happen? IIUC the host only sees a single thread
>> per core and then distributes the vcpus to the respective host
>> threads.
> 
> Each vcpu has its own Linux thread, but while the vcore is running,
> all but one of them are sleeping.  The thing is that since the host is
> running with each core single-threaded, one Linux thread is enough to
> run 4 vcpus.  So when we decide we can run the vcore, the vcpu thread
> that discovered that we can now run the vcore takes the responsibility
> to run it.  That involves sending an IPI to the other hardware threads
> to wake them up and get them to each run a vcpu.  Then the vcpu thread
> that is running the vcore dives into the guest switch code itself.  It
> synchronizes with the other threads and does the partition switch, and
> then they all enter the guest.
> 
> We thought about various schemes to cope with the hardware restriction
> that all hardware threads in a core have to be in the same partition
> (at least whenever the MMU is on).  This is the least messy scheme we
> could come up with.  I'd be happy to discuss the alternatives if you
> like.

Oh, I'm certainly fine with the scheme :). I would just like to understand it and see it documented somewhere, as it's slightly unintuitive.
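
Something as simple as this in Documentation/ (or a comment above the
run loop) would already get the idea across - rough pseudo-code, the
function names are invented, not taken from the patch:

    /* per-vcpu Linux thread, as I understand the scheme */
    static int vcpu_run_thread(struct kvm_vcpu *vcpu)
    {
        struct kvmppc_vcore *vc = vcpu->arch.vcore;

        mark_vcpu_runnable(vcpu, vc);

        if (vcore_already_has_runner(vc)) {
            /* another vcpu thread runs the whole core; just sleep */
            wait_until_vcore_exits(vcpu, vc);
        } else {
            /*
             * This Linux thread runs the whole virtual core: IPI the
             * napping hardware threads so each picks up one vcpu, do
             * the partition switch, enter the guest, and after the
             * exit wait for all threads to write their vcpu state
             * back (the nap_count wait discussed above).
             */
            run_whole_vcore(vc);
        }
        return 0;
    }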

Also, this scheme might confuse the host scheduler a bit, as it might migrate vcpu threads to other host CPUs even though it would be beneficial for cache usage to keep them local. But since the scheduler doesn't know about the correlation between the threads, it can't be clever about it.

Alex
