[RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
Shrikanth Hegde
sshegde at linux.ibm.com
Fri Oct 31 04:43:12 AEDT 2025
Hi Sean.
On 10/23/25 12:16 AM, Sean Christopherson wrote:
> On Tue, Oct 21, 2025, Shrikanth Hegde wrote:
>>
>> Hi Sean.
>> Thanks for taking time and going through the series.
>>
>> On 10/20/25 8:02 PM, Sean Christopherson wrote:
>>> On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
>>>> tl;dr
>>>>
>>>> This is a follow-up of [1] with a few fixes and addressing review comments.
>>>> Upgraded it to RFC PATCH from RFC.
>>>> Please review.
>>>>
>>>> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>>>>
>>>> v2 -> v3:
>>>> - Renamed to paravirt CPUs
>>>
>>> There are myriad uses of "paravirt" throughout Linux and related environments,
>>> and none of them mean "oversubscribed" or "contended". I assume Hillf's comments
>>> triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
>>> accurate; "paravirt" is wildly misleading.
>>
>> Name has been tricky. We want a positive-sounding name while conveying
>> that these CPUs are not to be used for now due to contention; they may
>> be used again once the contention goes away.
>
> I suspect part of the problem with naming is the all-or-nothing approach itself.
> There's a _lot_ of policy baked into that seemingly simple decision, and thus
> it's hard to describe with a human-friendly name.
>
open for suggestions :)
>>>> Open issues:
>>>>
>>>> - Derivation of hint from steal time is still a challenge. Some work is
>>>> underway to address it.
>>>>
>>>> - Consider kvm and other hypervisors and how they could derive the hint.
>>>> Need inputs from community.
>>>
>>> Bluntly, this series is never going to land, at least not in a form that's remotely
>>> close to what is proposed here. This is an incredibly simplistic way of handling
>>> overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
>>>
>>
>> Could you describe these complex scenarios?
>
> Any setup where "don't use this CPU" isn't a viable option, e.g. because all cores
> could be overcommitted at any given time, or is far, far too coarse-grained. Very
> few use cases can distill vCPU scheduling needs and policies into a single flag.
>
Okay, let me explain what the current thought process is.
S390 and pseries are the current main use cases.

On S390, the Z hypervisor provides a distinction among vCPUs: they are
marked as Vertical High, Vertical Medium or Vertical Low. When there is
steal time, it is recommended to use the Vertical Highs and avoid the
Vertical Lows. In such cases, using this infra, one can avoid scheduling
anything on the Vertical Low vCPUs. A performance benefit is observed
since there is less contention and the CPU cycles come mainly from the
Vertical Highs.
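To make that concrete, here is a rough userspace sketch (not part of the
series) that derives an avoid list from what I believe s390 exposes as the
per-CPU polarization attribute in sysfs; the path and strings are
assumptions on my part, please correct me if wrong.

/* Rough userspace sketch (not from the series): list CPUs to avoid on
 * s390 by reading the per-CPU polarization attribute. Assumes
 * /sys/devices/system/cpu/cpuN/polarization reports "vertical:low"
 * for Vertical Low CPUs.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char path[128], pol[32];
	long cpu, ncpus = sysconf(_SC_NPROCESSORS_CONF);

	for (cpu = 0; cpu < ncpus; cpu++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%ld/polarization", cpu);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(pol, sizeof(pol), f) &&
		    !strncmp(pol, "vertical:low", 12))
			printf("avoid cpu%ld (vertical low)\n", cpu);
		fclose(f);
	}
	return 0;
}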
On the PowerVM hypervisor, dispatch happens a full core at a time: all
SMT=8 siblings are always dispatched to the same physical core. That
means it is beneficial to schedule vCPU siblings together at the core
level. When there is contention for a pCPU, the full core is preempted,
i.e. all vCPUs belonging to that core are preempted. In such cases,
depending on the overcommit configuration and on the steal time, one
could limit the number of cores used by running on a limited set of
vCPUs. When done that way, we see better latency numbers and an increase
in throughput compared to out-of-box. The cover letter has those numbers.
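As a purely illustrative sketch (again, not the mechanism of the series),
a management layer could restrict a workload to a limited number of full
cores. It assumes SMT=8 and that the siblings of core c are CPUs
c*8..c*8+7, which is typical numbering but still an assumption here.

/* Hypothetical userspace sketch: restrict the current task to the first
 * 'ncores' full cores, assuming SMT=8 and that siblings of core c are
 * CPUs c*8 .. c*8+7.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int limit_to_cores(int ncores)
{
	cpu_set_t set;
	int core, thr;

	CPU_ZERO(&set);
	for (core = 0; core < ncores; core++)
		for (thr = 0; thr < 8; thr++)
			CPU_SET(core * 8 + thr, &set);

	/* pid 0 == current task */
	return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	if (limit_to_cores(2))
		perror("sched_setaffinity");
	return 0;
}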
Now, let's come to KVM with Linux running as the hypervisor. Correct me
if I am wrong: each vCPU in KVM is a task in the host, and when a vCPU is
running, that task is in the running state as well. When there is
overcommit and all vCPUs are running, there are more runnable tasks than
physical CPUs, so the host has to context switch and will preempt one
vCPU to run another. It can also preempt a vCPU to run some host task.
If we restrict the number of vCPUs on which the workload currently runs,
then the number of runnable tasks in the host also reduces and there is
less chance of host context switches. Since this avoids the overhead of
KVM context save/restore, the workload is likely to benefit.
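Just to illustrate that reasoning (this is not the proposed hint
derivation), a host-side sketch could compare the number of vCPU threads
against the number of online pCPUs; it assumes QEMU's "CPU <n>/KVM"
thread-naming convention, which is an assumption of this example.

/* Hypothetical host-side sketch: estimate vCPU overcommit by comparing
 * the number of QEMU vCPU threads against the number of online CPUs.
 * Assumes QEMU names vCPU threads "CPU <n>/KVM"; illustration only.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/sysinfo.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	long nvcpus = 0;

	while (proc && (de = readdir(proc))) {
		char path[600], comm[64];
		struct dirent *te;
		DIR *tasks;

		if (de->d_name[0] < '0' || de->d_name[0] > '9')
			continue;
		snprintf(path, sizeof(path), "/proc/%s/task", de->d_name);
		tasks = opendir(path);
		while (tasks && (te = readdir(tasks))) {
			FILE *f;

			if (te->d_name[0] < '0' || te->d_name[0] > '9')
				continue;
			snprintf(path, sizeof(path), "/proc/%s/task/%s/comm",
				 de->d_name, te->d_name);
			f = fopen(path, "r");
			if (!f)
				continue;
			/* Count threads that look like KVM vCPU threads. */
			if (fgets(comm, sizeof(comm), f) && strstr(comm, "/KVM"))
				nvcpus++;
			fclose(f);
		}
		if (tasks)
			closedir(tasks);
	}
	if (proc)
		closedir(proc);

	printf("vCPU threads: %ld, online CPUs: %d -> %s\n",
	       nvcpus, get_nprocs(),
	       nvcpus > get_nprocs() ? "overcommitted" : "not overcommitted");
	return 0;
}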
I guess it is possible to distinguish between host tasks and vCPUs
running as tasks. If so, the host can decide how many threads it can
optimally run and signal each guest depending on the configuration.

Currently this is kept arch dependent, since IMHO each hypervisor is in
the right place to make that decision. Not sure a one-size-fits-all
approach works here.

Another tricky point is what form this signal takes. It could be an
hcall, the VPA area, some shared memory region, or a BPF method similar
to the vCPU boosting patch series. There too, I think it is best to
leave it to the arch to specify how, the reason being that the BPF
method will not work for the PowerVM hypervisor.
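To keep the discussion concrete, below is a purely hypothetical
shared-memory layout for such a hint; the struct, field names and helper
are all invented here and not from the series, which deliberately leaves
the mechanism to each arch.

/* Purely hypothetical hypervisor->guest hint layout, for illustration
 * only. All names are invented; the series does not define this.
 */
#include <stdint.h>

#define PV_HINT_MAX_CPUS 2048

struct pv_cpu_hint {
	uint32_t seq;			/* bumped by hypervisor on update */
	uint32_t nr_cpus_to_use;	/* how many vCPUs the guest should use */
	uint64_t avoid_mask[PV_HINT_MAX_CPUS / 64];	/* 1 = avoid this vCPU */
};

static inline int pv_cpu_should_avoid(const struct pv_cpu_hint *h,
				      unsigned int cpu)
{
	return (h->avoid_mask[cpu / 64] >> (cpu % 64)) & 1;
}

int main(void)
{
	struct pv_cpu_hint hint = { .seq = 1, .nr_cpus_to_use = 4 };

	hint.avoid_mask[0] = 0xf0;	/* hypothetically: avoid vCPUs 4-7 */
	return pv_cpu_should_avoid(&hint, 5) ? 0 : 1;
}

The guest side would re-read while seq changes and fold avoid_mask into
whatever mask the scheduler uses for these CPUs.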
> E.g. if all CPUs in a system are being used to run vCPU tasks, all vCPUs are actively
> running, and the host has a non-vCPU task that _must_ run, then the host will need
> to preempt a vCPU task. Ideally, a paravirtualized scheduling system would allow
The host/hypervisor need not mark a vCPU as "not to be used" every single
time it preempts. It needs to do so only when there are more vCPU tasks
than physical CPUs and preemption is happening between vCPU tasks.
There would be corner cases, such as a single physical CPU with two KVM
guests each having one vCPU; then nothing much can be done.
> the host to make an informed decision when choosing which vCPU to preempt, e.g. to
> minimize disruption to the guest(s).