[RFC 0/2] Define a new approach to determine if an idle vCPU will be scheduled instantly or not

Parth Shah parth at linux.ibm.com
Thu Apr 1 22:59:20 AEDT 2021


Abstract:
=========
The Linux task scheduler tries to find an idle cpu for a wakee task
thereby lowering the wakeup latency as much as possible. The process
of determining if a cpu is idle or not has evolved over time.
Currently, a cpu is considered idle if
- there is no task running on or enqueued to the runqueue of the cpu, and
- while running inside a guest, the cpu has not been yielded to the
  hypervisor (both conditions are checked via available_idle_cpu(); a
  rough sketch follows this list).
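
For reference, available_idle_cpu() in kernel/sched/core.c is roughly
the following in v5.11:

int available_idle_cpu(int cpu)
{
	/* Someone is running or queued on this runqueue: not idle. */
	if (!idle_cpu(cpu))
		return 0;

	/* The vCPU behind this cpu has been yielded: not "available". */
	if (vcpu_is_preempted(cpu))
		return 0;

	return 1;
}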

While inside the guest, there is no way to deterministically predict
whether a vCPU that has been yielded/ceded to the hypervisor will be
scheduled back in promptly. Hence the scheduler currently considers
such a CEDEd vCPU as not "available" idle and would instead pick other
(busy) CPUs for waking up the wakee task.

In this patch-set, we try to further classify idle cpus as instantly
available or not. This is achieved by taking a hint from the
hypervisor, querying whether the vCPU will be scheduled instantly or
not. In most cases the scheduler prefers the prev_cpu of a waking task
unless it is busy. In this patchset, the hypervisor uses this
information to figure out whether the prev_cpu used by the task (of the
corresponding vCPU) is idle or not, and passes this information back to
the guest.

Interface:
===========
This patchset introduces a new hcall named H_IDLE_HINT for the guest
to query whether a vCPU can be dispatched quickly or not. This is
currently a crude interface meant to demonstrate the efficacy of the
method. We are looking for feedback on other mechanisms of obtaining a
hint from the hypervisor while the hint is still relevant.

That said, this patch series primarily tries to emphasize the possible
optimization of task wakeup latency and is open to any other
interface/architecture.
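
For illustration, a minimal guest-side wrapper for the hcall could look
like the sketch below. The H_IDLE_HINT token and its exact calling
convention are defined by patch 1; the use of the hardware CPU id as
the argument and the helper name vcpu_has_idle_hint() are assumptions
made only for this sketch.

/*
 * Illustrative only: ask the hypervisor whether the vCPU backing 'cpu'
 * would be dispatched instantly.  Returns true when the hint is 1.
 */
static bool vcpu_has_idle_hint(int cpu)
{
	unsigned long retbuf[PLPAR_HCALL_BUFSIZE] = { 0 };
	long rc;

	rc = plpar_hcall(H_IDLE_HINT, retbuf,
			 get_hard_smp_processor_id(cpu));

	return rc == H_SUCCESS && retbuf[0] == 1;
}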

Internal working:
=================
The code-flow of the current implementation is as follows (rough
sketches of both sides are given after this list):
- The GuestOS scheduler searches for an idle cpu using
  `available_idle_cpu()`, which also calls vcpu_is_preempted() to check
  whether the vCPU has been yielded or not.
- If the vCPU has been yielded, the GuestOS additionally makes the
  H_IDLE_HINT hcall to find out if the vCPU can be scheduled instantly
  or not.
- The hypervisor services the hcall by first finding the task (p)
  backing the vCPU, and returns 1 if task_cpu(p) is available-idle or
  is only running SCHED_IDLE tasks; otherwise it returns 0.
- The GuestOS takes this hint and considers the vCPU as idle if the
  hint from the hypervisor has value == 1.
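
Put together, the guest-side check described above could look roughly
like the sketch below. This is an illustration only; the actual change
lives in the scheduler wakeup path, and vcpu_has_idle_hint() is the
hypothetical wrapper sketched in the Interface section.

/*
 * Sketch: is 'cpu' usable as an instant wakeup target?  Either it is
 * available-idle, or it is idle but yielded and the hypervisor hints
 * that it would be dispatched instantly.
 */
static bool cpu_is_instant_wakeup_target(int cpu)
{
	if (available_idle_cpu(cpu))
		return true;

	if (idle_cpu(cpu) && vcpu_is_preempted(cpu) &&
	    vcpu_has_idle_hint(cpu))
		return true;

	return false;
}

The hypervisor-side service, as described above, might look roughly
like the following. The vCPU-to-task lookup via vcpu->pid and the use
of a sched_idle_cpu()-style check are assumptions made for this sketch,
not necessarily what the patch does.

/*
 * Sketch: return 1 if the task backing the vCPU sits on a CPU that is
 * available-idle or only running SCHED_IDLE tasks, else return 0.
 */
static long kvm_vcpu_idle_hint(struct kvm_vcpu *vcpu)
{
	struct task_struct *p;
	long hint = 0;
	int cpu;

	rcu_read_lock();
	p = get_pid_task(rcu_dereference(vcpu->pid), PIDTYPE_PID);
	rcu_read_unlock();
	if (!p)
		return 0;

	cpu = task_cpu(p);
	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
		hint = 1;

	put_task_struct(p);
	return hint;
}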


The patch-set is based on the v5.11 kernel.

Results:
========
- Baseline kernel = v5.11
- Patched kernel = v5.11 + this patch-set

Setup:
All the results are taken on an IBM POWER9 baremetal system running the
patched kernel. The system consists of 2 NUMA nodes with 22 cores per
socket in SMT-4 mode.

2 KVM guests are created sharing the same set of physical CPUs, and
each guest has an identical CPU topology of 40 CPUs: 10 cores with
SMT-4 support.


Scenarios:
----------
1. Under-commit case: Only one KVM guest is active at a time.

- Baseline (v5.11):
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 67
        75.0000th: 83
        90.0000th: 115
        95.0000th: 352
        *99.0000th: 2260 <-----
        99.5000th: 3580
        99.9000th: 7128
        min=0, max=9927
- With patch (v5.11 + patch):
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 100
        75.0000th: 113
        90.0000th: 328
        95.0000th: 360
        *99.0000th: 434 (-80%) <----
        99.5000th: 489
        99.9000th: 2324
        min=0, max=6054

We see a significant reduction in the tail latencies because, with the
patchset, the wakee task can be scheduled on a yielded/ceded idle CPU
instead of being woken up on a busy CPU.

2. Over-commit case: Both KVM guests share the same set of CPUs. One
guest creates noise using `schbench -m 10 -t 2 -r 3000` while only the
other guest is benchmarked.

- Baseline:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 73
        75.0000th: 89
        90.0000th: 115
        95.0000th: 166
        *99.0000th: 3084
        99.5000th: 4044
        99.9000th: 7656
        min=0, max=18448
- With patch:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 114
        75.0000th: 137
        90.0000th: 170
        95.0000th: 237
        *99.0000th: 2828
        99.5000th: 4168
        99.9000th: 7528
        min=0, max=15387


The results demonstrate that the proposed method of getting an idle
hint from the hypervisor to better find an idle cpu in the GuestOS is
very helpful in under-commit cases, due to the higher chance of finding
the previously used physical cpu idle.
The results also confirm that there is no regression in the over-commit
case, where the proposed methodology has little effect.

Additionally, more tests were carried out with different combinations
of schbench threads and different numbers of KVM guests. The results of
these tests further confirmed that there is no major regression in
workload performance.


Parth Shah (2):
  KVM:PPC: Add new hcall to provide hint if a vcpu task will be
    scheduled instantly.
  sched: Use H_IDLE_HINT hcall to find if a vCPU can be wakeup target

 arch/powerpc/include/asm/hvcall.h   |  3 ++-
 arch/powerpc/include/asm/paravirt.h | 21 +++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c        | 13 +++++++++++++
 arch/powerpc/kvm/trace_hv.h         |  1 +
 include/linux/kvm_host.h            |  1 +
 include/linux/sched.h               |  1 +
 kernel/sched/core.c                 | 13 +++++++++++++
 kernel/sched/fair.c                 | 12 ++++++++++++
 kernel/sched/sched.h                |  1 +
 virt/kvm/kvm_main.c                 | 17 +++++++++++++++++
 10 files changed, 80 insertions(+), 3 deletions(-)

-- 
2.26.2
