[RFC 0/2] Paravirtualize idle CPU wakeup optimization

Tue Jul 13 15:24:31 AEST 2021

This patch-set revises the earlier HCALL-based implementation, which can
be found at:
https://lore.kernel.org/linuxppc-dev/20210401115922.1524705-1-parth@linux.ibm.com/
Since the overhead of an HCALL is huge, this patch-set instead uses the
lppaca region to convey the idle-hint: the hypervisor keeps updating the
newly added idle_hint attribute in the VPA region of every vCPU of every
KVM guest, and the guest only has to read this attribute.

This implementation is not aimed at being a full-fledged solution, but
is rather a demonstration of para-virtualizing the task scheduler. The
hypervisor can provide better idle-hints about vCPU scheduling, and the
guest can use them for better scheduling decisions.

Abstract:
=========
The Linux task scheduler searches for an idle cpu for task wakeup in
order to lower the wakeup latency as much as possible. The process of
determining whether a cpu is idle has evolved over time.
Currently, in available_idle_cpu(), a cpu is considered idle if
- there are no tasks running or enqueued on the runqueue of the cpu, and
- the cpu is not preempted, i.e. while running inside a guest, the vCPU
  has not been yielded (determined via vcpu_is_preempted())
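
The two conditions above can be sketched as a small userspace model.
This is only an illustration under stated assumptions: struct cpu_model,
the cpus[] array and NR_CPUS are hypothetical stand-ins for the
runqueue and VPA state, and the helpers return bool here purely for
readability.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* hypothetical size, for this model only */

/* Hypothetical per-cpu state standing in for the runqueue and the VPA. */
struct cpu_model {
	int nr_running;		/* tasks running or enqueued on this cpu */
	bool preempted;		/* what vcpu_is_preempted() would report */
};

static struct cpu_model cpus[NR_CPUS];

/* Condition 1: nothing running or enqueued on the runqueue. */
static bool idle_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpus[cpu].nr_running == 0;
}

/* Condition 2 input: has the vCPU been yielded/ceded to the hypervisor? */
static bool vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpus[cpu].preempted;
}

/* A cpu is "available" idle only if it is idle and not preempted. */
static bool available_idle_cpu(int cpu)
{
	if (!idle_cpu(cpu))
		return false;
	if (vcpu_is_preempted(cpu))
		return false;
	return true;
}
```

In this model, an idle but yielded vCPU is rejected just like a busy
one, which is the behaviour the rest of this cover letter tries to
relax.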

While inside the guest, there is no way to deterministically predict
whether a vCPU that has been yielded/ceded to the hypervisor can be
got back immediately. Hence the scheduler currently considers such a
CEDEd vCPU as not "available" idle, and instead picks other busy CPUs
for waking up the wakee task.

In this patch-set we try to further classify idle cpus as instantly
available or not. This is achieved by taking a hint from the hypervisor,
i.e. by querying whether the vCPU will be scheduled instantly or not.
In most cases, the scheduler prefers the prev_cpu of a waking task
unless it is busy. In this patch-set, the hypervisor applies the same
idea: it figures out whether the prev_cpu used by the task of the
corresponding vCPU is idle or not, and passes this information to the
guest.

Interface:
===========
This patch-set introduces a new attribute in the lppaca structure, which
is shared by both the hypervisor and the guest. The new attribute, i.e.
idle_hint, is updated regularly by the hypervisor. When a particular cpu
goes into an idle-state, it updates the idle_hint of all the vCPUs of
all existing KVM guests whose prev_cpu == smp_processor_id(). It
similarly reverts the update when coming out of the idle-state.
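
As a rough illustration of this interface, the hypervisor-side update
can be modelled in plain C as below. This is a sketch and not the code
from the patches: struct vpa_model, struct vcpu_model and
update_idle_hint() are hypothetical names, and the real lppaca layout
and the way KVM iterates over vCPUs are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared VPA area; only the new field. */
struct vpa_model {
	uint8_t idle_hint;	/* 1: prev_cpu of this vCPU is idle, else 0 */
};

/* Hypothetical per-vCPU bookkeeping on the hypervisor side. */
struct vcpu_model {
	int prev_cpu;		/* physical cpu the vCPU task last ran on */
	struct vpa_model vpa;	/* region the guest is able to read */
};

/*
 * Set idle_hint to @hint for every vCPU whose task last ran on @pcpu.
 * Called with hint=1 when @pcpu enters idle and hint=0 when it exits.
 * Returns the number of vCPUs that were updated.
 */
static size_t update_idle_hint(struct vcpu_model *vcpus, size_t nr_vcpus,
			       int pcpu, uint8_t hint)
{
	size_t i, updated = 0;

	assert(vcpus != NULL || nr_vcpus == 0);
	assert(hint == 0 || hint == 1);

	for (i = 0; i < nr_vcpus; i++) {
		if (vcpus[i].prev_cpu != pcpu)
			continue;
		vcpus[i].vpa.idle_hint = hint;
		updated++;
	}
	return updated;
}
```

The model ignores concurrency; since the guest reads the field while
the hypervisor writes it, a real implementation would have to consider
how the shared byte is accessed from both sides.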

Internal working:
=================
The code-flow of the current implementation is as follows:
- In do_idle(), when entering an idle-state, walk through all vCPUs of
  all KVM guests, find those whose vCPU task's prev_cpu is the caller's
  cpu, and set idle_hint=1 in the lppaca region of such vCPUs.
- Similarly, set idle_hint=0 for such vCPUs when exiting the idle-state.
- The guest OS scheduler searches for an idle cpu using
  `available_idle_cpu()`, which also calls vcpu_is_preempted() to see
  whether the vCPU is yielded or not.
- If the vCPU is yielded, the guest OS additionally checks whether
  idle_hint is set to 1. If idle_hint==1, it considers the vCPU as
  non-preempted and uses it for scheduling a task.

The patch-set is based on the v5.13 kernel.

Results:
========
- Baseline kernel = v5.13
- Patched kernel = v5.13 + this patch-set

Setup:
All the results are taken on an IBM POWER9 baremetal system running the
patched kernel. This system consists of 2 NUMA nodes, with 22 cores per
socket in SMT-4 mode.

Each KVM guest has an identical cpu topology with 40 CPUs in total,
i.e. 10 cores with SMT-4 support.

Scenarios:
----------
1. Under-commit case: Only one KVM is active at a time.

- Baseline kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 86
        75.0000th: 406
        90.0000th: 497
        95.0000th: 541
        *99.0000th: 2572 <-----
        99.5000th: 3724
        99.9000th: 6904
        min=0, max=10007

- With Patched kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 386
        75.0000th: 470
        90.0000th: 529
        95.0000th: 589
        *99.0000th: 741 (-71%) <-----
        99.5000th: 841
        99.9000th: 1522
        min=0, max=6488

We see a significant reduction in the tail latencies (99th percentile:
2572 usec to 741 usec) due to being able to schedule on a yielded/ceded
idle CPU with the patch-set, instead of waking up the task on a busy
CPU.

2. Over-commit case: Multiple KVM guests sharing the same set of CPUs.

Two KVM guests running the baseline kernel are used to create noise
using `schbench -m 10 -t 2 -r 3000`, while only the remaining KVM guest
is benchmarked.

- Baseline kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 289
        75.0000th: 1074
        90.0000th: 7288
        95.0000th: 11248
        *99.0000th: 17760
        99.5000th: 21088
        99.9000th: 28896
        min=0, max=36640

- With Patched kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
        50.0000th: 281
        75.0000th: 445
        90.0000th: 4344
        95.0000th: 9168
        *99.0000th: 15824
        99.5000th: 19296
        99.9000th: 26656
        min=0, max=36455

The results demonstrate that the proposed method of getting an idle-hint
from the hypervisor to better find an idle cpu in the guest OS is very
helpful in under-commit cases, due to the higher chance of finding the
previously used physical cpu idle.
The results also confirm that there is no regression in the over-commit
case, where the proposed methodology does not have much effect.
Parth Shah (2):
  powerpc/book3s_hv: Add new idle-hint attribute in VPA region
  kernel/idle: Update and use idle-hint in VPA region

 arch/powerpc/include/asm/idle_hint.h | 28 +++++++++++++++++++++++
 arch/powerpc/include/asm/lppaca.h    |  3 ++-
 arch/powerpc/include/asm/paravirt.h  | 12 ++++++++--
 arch/powerpc/kvm/book3s.h            |  2 ++
 arch/powerpc/kvm/book3s_hv.c         | 34 ++++++++++++++++++++++++++++
 kernel/sched/idle.c                  |  3 +++
 6 files changed, 79 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/include/asm/idle_hint.h

-- 
2.26.3