[PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt
Parth Shah
parth at linux.ibm.com
Mon Dec 9 19:26:25 AEDT 2019
Hi,
On 12/5/19 2:02 PM, Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), the scheduler avoids placing tasks on preempted vCPUs at wakeup.
> This leads to a wrong choice of CPU, which in turn leads to larger wakeup
> latencies. Eventually, it leads to performance regressions in
> latency-sensitive benchmarks like soltp, schbench etc.
>
> On PowerPC, vcpu_is_preempted() only looks at yield_count. If
> yield_count is odd, the vCPU is assumed to be preempted. However,
> yield_count is incremented whenever a vCPU enters the CEDE state, so any
> CPU that has entered CEDE is assumed to be preempted.
>
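A minimal userspace model of the parity check described above, assuming (as the quoted text states) that yield_count is bumped once when a vCPU enters CEDE and once more when it is dispatched again. The names `sim_vcpu`, `sim_cede` and `sim_dispatch` are hypothetical; in the kernel the counter lives in the lppaca as a big-endian field read through `be32_to_cpu()`, which this sketch sidesteps by keeping the counter in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU lppaca yield_count (host byte order). */
struct sim_vcpu {
	uint32_t yield_count;
};

/* Same test as the pre-patch vcpu_is_preempted(): an odd count means "not running". */
static bool sim_yield_count_says_preempted(const struct sim_vcpu *v)
{
	return (v->yield_count & 1) != 0;
}

/* Assumed model: entering CEDE moves the count to an odd value... */
static void sim_cede(struct sim_vcpu *v)
{
	v->yield_count++;
}

/* ...and being dispatched again moves it back to an even value. */
static void sim_dispatch(struct sim_vcpu *v)
{
	v->yield_count++;
}
```

Under this model an idle vCPU that has merely ceded is indistinguishable from one the hypervisor has taken away, which is why the wakeup path skips it.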
> Even if a vCPU of a dedicated LPAR is preempted or donated, the LPAR should
> have the right of first use, since it is supposed to own the vCPU.
>
> On a POWER9 system with 32 cores:
> # lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Thread(s) per core: 8
> Core(s) per socket: 1
> Socket(s): 16
> NUMA node(s): 2
> Model: 2.2 (pvr 004e 0202)
> Model name: POWER9 (architected), altivec supported
> Hypervisor vendor: pHyp
> Virtualization type: para
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 512K
> L3 cache: 10240K
> NUMA node0 CPU(s): 0-63
> NUMA node1 CPU(s): 64-127
>
> # perf stat -a -r 5 ./schbench
> v5.4                            v5.4 + patch
> Latency percentiles (usec)      Latency percentiles (usec)
>     50.0000th: 45                   50.0000th: 39
>     75.0000th: 62                   75.0000th: 53
>     90.0000th: 71                   90.0000th: 67
>     95.0000th: 77                   95.0000th: 76
>     *99.0000th: 91                  *99.0000th: 89
>     99.5000th: 707                  99.5000th: 93
>     99.9000th: 6920                 99.9000th: 118
>     min=0, max=10048                min=0, max=211
> Latency percentiles (usec)      Latency percentiles (usec)
>     50.0000th: 45                   50.0000th: 34
>     75.0000th: 61                   75.0000th: 45
>     90.0000th: 72                   90.0000th: 53
>     95.0000th: 79                   95.0000th: 56
>     *99.0000th: 691                 *99.0000th: 61
>     99.5000th: 3972                 99.5000th: 63
>     99.9000th: 8368                 99.9000th: 78
>     min=0, max=16606                min=0, max=228
> Latency percentiles (usec)      Latency percentiles (usec)
>     50.0000th: 45                   50.0000th: 34
>     75.0000th: 61                   75.0000th: 45
>     90.0000th: 71                   90.0000th: 53
>     95.0000th: 77                   95.0000th: 57
>     *99.0000th: 106                 *99.0000th: 63
>     99.5000th: 2364                 99.5000th: 68
>     99.9000th: 7480                 99.9000th: 100
>     min=0, max=10001                min=0, max=134
> Latency percentiles (usec)      Latency percentiles (usec)
>     50.0000th: 45                   50.0000th: 34
>     75.0000th: 62                   75.0000th: 46
>     90.0000th: 72                   90.0000th: 53
>     95.0000th: 78                   95.0000th: 56
>     *99.0000th: 93                  *99.0000th: 61
>     99.5000th: 108                  99.5000th: 64
>     99.9000th: 6792                 99.9000th: 85
>     min=0, max=17681                min=0, max=121
> Latency percentiles (usec)      Latency percentiles (usec)
>     50.0000th: 46                   50.0000th: 33
>     75.0000th: 62                   75.0000th: 44
>     90.0000th: 73                   90.0000th: 51
>     95.0000th: 79                   95.0000th: 54
>     *99.0000th: 113                 *99.0000th: 61
>     99.5000th: 2724                 99.5000th: 64
>     99.9000th: 6184                 99.9000th: 82
>     min=0, max=9887                 min=0, max=121
>
> Performance counter stats for 'system wide' (5 runs):
>
>                  v5.4                  v5.4 + patch
> context-switches 43,373 ( +- 0.40% )   44,597 ( +- 0.55% )
> cpu-migrations    1,211 ( +- 5.04% )      220 ( +- 6.23% )
> page-faults      15,983 ( +- 5.21% )   15,360 ( +- 3.38% )
>
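To put the counter table above in relative terms, a small helper can compute the percentage change from the v5.4 baseline to the patched kernel. This is only arithmetic on the reported means; the function names `pct_change` and `print_counter_changes` are illustrative, and the run-to-run variance shown in the table is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Relative change from a baseline value, in percent (negative = reduction). */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Print the change for each counter, using the means from the perf stat summary. */
void print_counter_changes(void)
{
	static const struct { const char *name; double v54, patched; } rows[] = {
		{ "context-switches", 43373, 44597 },
		{ "cpu-migrations",    1211,   220 },
		{ "page-faults",      15983, 15360 },
	};

	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-17s %+6.1f%%\n", rows[i].name,
		       pct_change(rows[i].v54, rows[i].patched));
}
```

For the means above this works out to roughly +2.8% context switches, -81.8% CPU migrations and -3.9% page faults.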
> Waiman Long suggested using static keys.
>
> Reported-by: Parth Shah <parth at linux.ibm.com>
> Reported-by: Ihor Pasichnyk <Ihor.Pasichnyk at ibm.com>
> Cc: Parth Shah <parth at linux.ibm.com>
> Cc: Ihor Pasichnyk <Ihor.Pasichnyk at ibm.com>
> Cc: Juri Lelli <juri.lelli at redhat.com>
> Cc: Phil Auld <pauld at redhat.com>
> Cc: Waiman Long <longman at redhat.com>
> Cc: Gautham R. Shenoy <ego at linux.vnet.ibm.com>
> Tested-by: Juri Lelli <juri.lelli at redhat.com>
> Acked-by: Waiman Long <longman at redhat.com>
> Reviewed-by: Gautham R. Shenoy <ego at linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar at linux.vnet.ibm.com>
> ---
> Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) -> v3:
> Code is now under CONFIG_PPC_SPLPAR, as it depends on CONFIG_PPC_PSERIES.
> This was suggested by Waiman Long.
>
> arch/powerpc/include/asm/spinlock.h | 5 +++--
> arch/powerpc/mm/numa.c | 4 ++++
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index e9a960e28f3c..de817c25deff 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -35,11 +35,12 @@
> #define LOCK_TOKEN 1
> #endif
>
> -#ifdef CONFIG_PPC_PSERIES
> +#ifdef CONFIG_PPC_SPLPAR
> +DECLARE_STATIC_KEY_FALSE(shared_processor);
> #define vcpu_is_preempted vcpu_is_preempted
> static inline bool vcpu_is_preempted(int cpu)
> {
> - if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> + if (!static_branch_unlikely(&shared_processor))
> return false;
> return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
> }
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 50d68d21ddcc..ffb971f3a63c 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1568,9 +1568,13 @@ int prrn_is_enabled(void)
> return prrn_enabled;
> }
>
> +DEFINE_STATIC_KEY_FALSE(shared_processor);
> +EXPORT_SYMBOL_GPL(shared_processor);
> +
> void __init shared_proc_topology_init(void)
> {
> if (lppaca_shared_proc(get_lppaca())) {
> + static_branch_enable(&shared_processor);
> bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> nr_cpumask_bits);
> numa_update_cpu_topology(false);
>
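The patch replaces a per-call firmware feature test with a static key that is enabled once at boot when the LPAR runs on shared processors. The sketch below models only that control flow in userspace, with a plain `bool` standing in for `DEFINE_STATIC_KEY_FALSE(shared_processor)`; the `sim_` names are hypothetical, and a real static key patches the branch in the instruction stream rather than loading a variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for DEFINE_STATIC_KEY_FALSE(shared_processor): disabled by default. */
static bool sim_shared_processor;

/* Models shared_proc_topology_init(): enable the key only on a shared-processor LPAR. */
void sim_shared_proc_topology_init(bool lppaca_shared_proc)
{
	if (lppaca_shared_proc)
		sim_shared_processor = true; /* static_branch_enable() in the patch */
}

/* Models the patched vcpu_is_preempted(): dedicated LPARs never report preemption. */
bool sim_vcpu_is_preempted(uint32_t yield_count)
{
	if (!sim_shared_processor) /* static_branch_unlikely() in the patch */
		return false;
	return (yield_count & 1) != 0;
}
```

On a dedicated LPAR the key stays disabled, so a vCPU that has ceded is still offered to the scheduler on wakeup; only shared-processor LPARs keep consulting yield_count.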
I have tested this patch on a KVM guest with the following configuration:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 88
On-line CPU(s) list: 0-87
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 88
NUMA node(s): 1
Model: 2.2 (pvr 004e 1202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: KVM
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 0-87
- Baseline kernel: v5.4
Setup:
=======
- 2 KVM guests, both sharing the same set of CPUs.
- First guest is idle
- Second guest executes 'schbench -r 30'
- Hypervisor details:
- Architecture: POWER9
- CPU(s): 88
- Socket(s): 1
- kernel: v5.4
Results:
===========
- schbench -r 30 (latencies in usec)
+----------------+--------------+-----------------+
| Latency %ile | v5.4 | v5.4 + patch |
+================+==============+=================+
| 50 | 28 (+- 1) | 22.625 (+- 0.7) |
+----------------+--------------+-----------------+
| 75 | 330 (+- 107) | 67 (+- 36) |
+----------------+--------------+-----------------+
| 90 | 463 (+- 14) | 447 (+- 5) |
+----------------+--------------+-----------------+
| 95 | 486 (+- 27) | 472 (+- 7.3) |
+----------------+--------------+-----------------+
| 99 | 851 (+- 83) | 709 (+- 77) |
+----------------+--------------+-----------------+
| 99.5 | 884 (+- 51) | 865 (+- 57) |
+----------------+--------------+-----------------+
| 99.99 | 1038 (+- 72) | 961 (+- 36) |
+----------------+--------------+-----------------+
Tested-by: Parth Shah <parth at linux.ibm.com>
Best,
Parth
More information about the Linuxppc-dev
mailing list