[RFC 0/2] Paravirtualize idle CPU wakeup optimization
Parth Shah
parth at linux.ibm.com
Tue Jul 13 15:24:31 AEST 2021
This patch-set is a revision of the HCALL-based implementation, which
can be found at:
https://lore.kernel.org/linuxppc-dev/20210401115922.1524705-1-parth@linux.ibm.com/
Because the overhead of an HCALL is high, this patch-set instead uses
the lppaca region to convey the idle-hint: the hypervisor keeps updating
the newly added idle_hint attribute in the VPA region of each vCPU of
every KVM guest, and the guest only has to read this attribute.
This implementation is not intended as a full-fledged solution, but
rather as a demonstration of para-virtualizing the task scheduler. The
hypervisor can provide better idle-hints about vCPU scheduling, and the
guest can use them to make better scheduling decisions.
Abstract:
=========
The Linux task scheduler searches for an idle cpu for task wakeup in
order to lower the wakeup latency as much as possible. The process of
determining whether a cpu is idle has evolved over time.
Currently, in available_idle_cpu(), a cpu is considered idle if
- there is no task running on or enqueued to the runqueue of the cpu, and
- the cpu is not preempted, i.e. while running inside a guest, the cpu
has not been yielded (determined via vcpu_is_preempted())
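The two conditions above can be modelled in user-space C as follows.
This is a simplified sketch and not the kernel source: `struct cpu_model`
and the `model_*` helpers are hypothetical stand-ins for the runqueue
state and for the result of vcpu_is_preempted().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-cpu state the scheduler consults. */
struct cpu_model {
	unsigned int nr_running;	/* tasks running on or enqueued to the runqueue */
	bool vcpu_preempted;		/* what vcpu_is_preempted() would report */
};

/* Condition 1: nothing is running on or enqueued to this cpu. */
static inline bool model_idle_cpu(const struct cpu_model *c)
{
	return c->nr_running == 0;
}

/* Conditions 1 and 2: idle, and not yielded/preempted in the guest. */
static inline bool model_available_idle_cpu(const struct cpu_model *c)
{
	if (!model_idle_cpu(c))
		return false;
	if (c->vcpu_preempted)
		return false;
	return true;
}
```

In this model, a cpu with an empty runqueue that has been ceded to the
hypervisor is reported as not available, which is the behaviour the
rest of this cover letter sets out to refine.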
While inside the guest, there is no way to deterministically predict
whether a vCPU that has been yielded/ceded to the hypervisor can be
gotten back immediately. Hence the scheduler currently treats such a
ceded vCPU as not "available" idle and may instead pick another, busy
CPU for waking up the wakee task.
In this patch-set we try to further classify idle cpus as instantly
available or not. This is achieved by taking a hint from the hypervisor
about whether the vCPU will be scheduled instantly or not. In most
cases, the scheduler prefers the prev_cpu of a waking task unless it is
busy. In this patch-set, the hypervisor applies the same idea: it checks
whether the prev_cpu used by the task of the corresponding vCPU is idle,
and passes this information to the guest.
Interface:
==========
This patch-set introduces a new attribute in the lppaca structure, which
is shared by both the hypervisor and the guest. The new attribute,
idle_hint, is updated regularly by the hypervisor. When a particular
physical cpu goes into an idle-state, the hypervisor updates the
idle_hint of all the vCPUs of all existing KVM guests whose
prev_cpu == smp_processor_id(). It similarly reverts the update when the
cpu comes out of the idle-state.
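The hypervisor-side update described above could look roughly like the
following user-space sketch. The names `struct vcpu_model`,
`hv_set_idle_hint()`, `hv_idle_enter()` and `hv_idle_exit()` are
hypothetical, the one-byte type of idle_hint is an assumption, and a
flat array stands in for "all vCPUs of all KVM guests"; the actual
patches operate on the lppaca/VPA structures instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one vCPU as seen by the hypervisor. */
struct vcpu_model {
	int prev_cpu;		/* physical cpu the vCPU task last ran on */
	uint8_t idle_hint;	/* stands in for idle_hint in the VPA (assumed u8) */
};

/*
 * Write 'hint' for every vCPU whose prev_cpu matches this_cpu.
 * Returns the number of vCPUs that were updated.
 */
static inline size_t hv_set_idle_hint(struct vcpu_model *vcpus, size_t n,
				      int this_cpu, uint8_t hint)
{
	size_t updated = 0;

	for (size_t i = 0; i < n; i++) {
		if (vcpus[i].prev_cpu == this_cpu) {
			vcpus[i].idle_hint = hint;
			updated++;
		}
	}
	return updated;
}

/* Physical cpu enters idle: its previous vCPUs can be scheduled instantly. */
static inline size_t hv_idle_enter(struct vcpu_model *vcpus, size_t n, int this_cpu)
{
	return hv_set_idle_hint(vcpus, n, this_cpu, 1);
}

/* Physical cpu leaves idle: revert the hint. */
static inline size_t hv_idle_exit(struct vcpu_model *vcpus, size_t n, int this_cpu)
{
	return hv_set_idle_hint(vcpus, n, this_cpu, 0);
}
```

The walk is linear in the total number of vCPUs on every idle entry and
exit, which is one reason to read this as a demonstration rather than a
tuned implementation.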
Internal working:
=================
The code-flow of the current implementation is as follows:
- In do_idle(), when entering an idle-state, walk through all vCPUs of
all KVM guests, find those whose vCPU task has a prev_cpu equal to the
caller's cpu, and set idle_hint=1 in the lppaca region of such vCPUs.
- Similarly, set idle_hint=0 for such vCPUs when exiting the idle-state.
- The guest OS scheduler searches for an idle cpu using
`available_idle_cpu()`, which also calls vcpu_is_preempted() to check
whether the vCPU is yielded or not.
- If the vCPU is yielded, the guest OS additionally checks whether
idle_hint is set to 1. If idle_hint==1, the vCPU is considered
non-preempted and is used for scheduling a task.
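The guest-side half of the flow above can be sketched in user-space C as
follows. `struct guest_vcpu_model` and the `hinted_*` helpers are
hypothetical names used only for illustration; in the patches the check
lives behind vcpu_is_preempted() and reads idle_hint from the lppaca.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest-side view of one vCPU. */
struct guest_vcpu_model {
	unsigned int nr_running;	/* tasks running on or enqueued to the runqueue */
	bool yielded;			/* vCPU has been yielded/ceded to the hypervisor */
	uint8_t idle_hint;		/* value the hypervisor wrote (assumed 0 or 1) */
};

/*
 * A yielded vCPU counts as preempted unless the hypervisor has hinted
 * that its previous physical cpu is idle (idle_hint == 1).
 */
static inline bool hinted_vcpu_is_preempted(const struct guest_vcpu_model *v)
{
	if (!v->yielded)
		return false;
	return v->idle_hint != 1;
}

/* Idle in the guest and not considered preempted by the hinted check. */
static inline bool hinted_available_idle_cpu(const struct guest_vcpu_model *v)
{
	if (v->nr_running != 0)
		return false;
	return !hinted_vcpu_is_preempted(v);
}
```

With this check, a ceded vCPU with idle_hint==1 becomes a valid wakeup
target, while a ceded vCPU with idle_hint==0 is still skipped as before.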
The patch-set is based on the v5.13 kernel.
Results:
========
- Baseline kernel = v5.13
- Patched kernel = v5.13 + this patch-set
Setup:
All the results are taken on an IBM POWER9 bare-metal system running the
patched kernel. This system consists of 2 NUMA nodes, with 22 cores per
socket in SMT-4 mode.
Each KVM guest has an identical cpu topology with 40 CPUs in total, i.e.
10 cores with SMT-4 support.
Scenarios:
----------
1. Under-commit case: Only one KVM guest is active at a time.
- Baseline kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
50.0000th: 86
75.0000th: 406
90.0000th: 497
95.0000th: 541
*99.0000th: 2572 <-----
99.5000th: 3724
99.9000th: 6904
min=0, max=10007
- With Patched kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
50.0000th: 386
75.0000th: 470
90.0000th: 529
95.0000th: 589
*99.0000th: 741 (-71%) <-----
99.5000th: 841
99.9000th: 1522
min=0, max=6488
We see a significant reduction in the tail latencies (the 99th
percentile drops from 2572 usec to 741 usec) due to being able to
schedule on a yielded/ceded idle CPU with the patch-set instead of
waking up the task on a busy CPU.
2. Over-commit case: Multiple KVM guests sharing the same set of CPUs.
Two KVM guests running the baseline kernel are used for creating noise
using `schbench -m 10 -t 2 -r 3000`, while only the other KVM guest is
benchmarked.
- Baseline kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
50.0000th: 289
75.0000th: 1074
90.0000th: 7288
95.0000th: 11248
*99.0000th: 17760
99.5000th: 21088
99.9000th: 28896
min=0, max=36640
- With Patched kernel:
$> schbench -m 20 -t 2 -r 30
Latency percentiles (usec)
50.0000th: 281
75.0000th: 445
90.0000th: 4344
95.0000th: 9168
*99.0000th: 15824
99.5000th: 19296
99.9000th: 26656
min=0, max=36455
The results demonstrate that the proposed method of getting an idle-hint
from the hypervisor to better find an idle cpu in the guest OS is very
helpful in under-commit cases, due to the higher chance of finding the
previously used physical cpu idle.
The results also confirm that there is no regression in the over-commit
case, where the proposed methodology does not have much effect.
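For reference, the percentage quoted next to the 99th percentile is the
relative change against the baseline. A minimal helper to reproduce it
from the schbench numbers above (the function name is hypothetical):

```c
/* Relative change of 'patched' against 'baseline', in percent. */
static inline double relative_change_pct(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}
```

Applied to the 99th percentile values, relative_change_pct(2572, 741)
gives roughly -71% for the under-commit case, and
relative_change_pct(17760, 15824) gives roughly -11% for the over-commit
case.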
Parth Shah (2):
powerpc/book3s_hv: Add new idle-hint attribute in VPA region
kernel/idle: Update and use idle-hint in VPA region
arch/powerpc/include/asm/idle_hint.h | 28 +++++++++++++++++++++++
arch/powerpc/include/asm/lppaca.h | 3 ++-
arch/powerpc/include/asm/paravirt.h | 12 ++++++++--
arch/powerpc/kvm/book3s.h | 2 ++
arch/powerpc/kvm/book3s_hv.c | 34 ++++++++++++++++++++++++++++
kernel/sched/idle.c | 3 +++
6 files changed, 79 insertions(+), 3 deletions(-)
create mode 100644 arch/powerpc/include/asm/idle_hint.h
--
2.26.3