CPU Hotplug optimization: offcputime analysis
Aboorva Devarajan
aboorvad at linux.vnet.ibm.com
Mon Mar 20 17:19:51 AEDT 2023
A CPU hotplug smt=off operation on a maximum-configuration ppc64le system
with 1920 logical CPUs takes more than 59 minutes to complete.
Several attempts to reduce the time taken by the CPU hotplug operation
are discussed in the thread below:
https://lore.kernel.org/all/Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/
Applying the solution discussed in that thread brings the time taken for
the CPU hotplug smt=off operation down from 59m to 32m, an improvement
of around 45%.
Although this is a significant improvement, 32m is still a long time for
a CPU hotplug (smt=off) operation. To bring it down further, we analysed
the blocking-time overhead in CPU hotplug using the offcputime bcc script.
The script prints the stack traces of the tasks that were blocked, along
with the total time for which each was blocked, which helps identify the
areas of improvement.
offcputime bcc script:
https://github.com/iovisor/bcc/blob/master/tools/offcputime.py
Below is one of the call stacks that accounted for most of the blocking
time reported by the offcputime bcc script for the CPU offline
operation:
finish_task_switch
__schedule
schedule
schedule_timeout
wait_for_completion
__wait_rcu_gp
synchronize_rcu
cpuidle_uninstall_idle_handler
powernv_cpuidle_cpu_dead
cpuhp_invoke_callback
__cpuhp_invoke_callback_range
_cpu_down
cpu_device_down
cpu_subsys_offline
device_offline
online_store
dev_attr_store
sysfs_kf_write
kernfs_fop_write_iter
vfs_write
ksys_write
system_call_exception
system_call_common
- bash (29705)
5771569 ------------------------> Duration (us)
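To see which function dominates across many such stacks, the blocked time can be summed per frame. The following is only a rough sketch, assuming the script was run with its folded-output option (`-f`), where each line has the form `comm;frame;frame;... <usecs>`. The helper name `blocked_time_by_frame` is hypothetical, the first sample line is an abbreviated form of the stack above, and the second sample line and its 1000 us duration are invented for illustration.

```python
from collections import defaultdict


def blocked_time_by_frame(folded_lines):
    """Sum blocked time (us) per frame over folded offcputime stacks.

    Assumes each line looks like 'comm;frame;frame;... <usecs>'.
    """
    totals = defaultdict(int)
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The duration is the last space-separated field on the line.
        stack, _, usecs = line.rpartition(" ")
        frames = stack.split(";")[1:]  # drop the leading task name
        for frame in set(frames):      # count each frame once per stack
            totals[frame] += int(usecs)
    return dict(totals)


# First line: abbreviated form of the stack shown above.
# Second line: made-up stack and duration, for illustration only.
sample = [
    "bash;system_call_common;ksys_write;_cpu_down;powernv_cpuidle_cpu_dead;"
    "cpuidle_uninstall_idle_handler;synchronize_rcu;__wait_rcu_gp;"
    "wait_for_completion;schedule;finish_task_switch 5771569",
    "bash;system_call_common;ksys_write;_cpu_down;schedule;"
    "finish_task_switch 1000",
]

totals = blocked_time_by_frame(sample)
for frame, usecs in sorted(totals.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{usecs:>10} us  {frame}")
```

Frames that appear in every stack (such as the syscall entry path) will always rank at the top, so the interesting entries are the deepest frames that still carry most of the time.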
From the above call stack, it is observed that synchronize_rcu in
cpuidle_uninstall_idle_handler accounts for a major chunk of the
overhead seen in CPU online and offline operations. This stack trace is
observed on pseries and powernv systems, but not on ACPI-based systems,
where cpuidle_disable_device is not invoked during the CPU hotplug
offline operation.
To check how much overhead it accounts for, we reverted the patch that
introduced synchronize_rcu in cpuidle_uninstall_idle_handler:
442bf3aaf55a ("sched: Let the scheduler see CPU idle states")
The measurements below were taken on a machine with 128 logical CPUs and
the following configuration:
root@ltc:~# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 8
Model: 2.3 (pvr 004e 1203)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
NUMA node250 CPU(s):
NUMA node251 CPU(s):
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
The tables below list the total time taken for the CPU hotplug offline
and online operations in four different scenarios:
|-------------------------------------------------------------------------|
|              Time taken to offline 127 CPUs (niters : 10)               |
|--------------------------------------------------|---------|------------|
| kernel version                                   | avg (s) | % decrease |
|--------------------------------------------------|---------|------------|
| (1) v6.2.0-rc5                                   | 17.945  | baseline   |
| (2) revert 442bf3aaf55a (remove synchronize_rcu) | 10.259  | 42.831     |
| (3) replace synchronize_rcu with                 |         |            |
|     synchronize_rcu_expedited                    | 10.129  | 43.554     |
|     in cpuidle_uninstall_idle_handler            |         |            |
| (4) enable system-wide rcu_expedited             | 0.842   | 95.304     |
|--------------------------------------------------|---------|------------|
|-------------------------------------------------------------------------|
|               Time taken to online 127 CPUs (niters : 10)               |
|--------------------------------------------------|---------|------------|
| kernel version                                   | avg (s) | % decrease |
|--------------------------------------------------|---------|------------|
| (1) v6.2.0-rc5                                   | 16.474  | baseline   |
| (2) revert 442bf3aaf55a (remove synchronize_rcu) | 12.503  | 24.104     |
| (3) replace synchronize_rcu with                 |         |            |
|     synchronize_rcu_expedited                    | 12.817  | 22.197     |
|     in cpuidle_uninstall_idle_handler            |         |            |
| (4) enable system-wide rcu_expedited             | 0.4983  | 96.975     |
|--------------------------------------------------|---------|------------|
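The "% decrease" column appears to be the reduction relative to the v6.2.0-rc5 baseline, i.e. (baseline - avg) / baseline * 100. A small sketch of that calculation follows. The helper name and the dictionary keys are hypothetical labels, and because the inputs are the rounded averages from the tables, the results may differ from the tabulated values in the last digits.

```python
def percent_decrease(baseline_s, measured_s):
    """Reduction of measured_s relative to baseline_s, in percent."""
    return (baseline_s - measured_s) / baseline_s * 100.0


# Average times in seconds, copied from the tables above.
offline_avg = {
    "baseline": 17.945,
    "revert": 10.259,
    "expedited_in_handler": 10.129,
    "expedited_system_wide": 0.842,
}
online_avg = {
    "baseline": 16.474,
    "revert": 12.503,
    "expedited_in_handler": 12.817,
    "expedited_system_wide": 0.4983,
}

for name, table in (("offline", offline_avg), ("online", online_avg)):
    base = table["baseline"]
    for scenario, avg in table.items():
        if scenario == "baseline":
            continue
        print(f"{name:8s} {scenario:22s} {percent_decrease(base, avg):7.3f} %")
```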
Note: A performance improvement of around 16% for CPU offline operation is
observed on large configuration systems with nCPUs = 1600 as well by
avoiding `synchronize_rcu` in `cpuidle_uninstall_idle_handler`.
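For anyone wanting to reproduce numbers of this kind, a timing loop along the following lines could be used. This is only a sketch and not necessarily how the figures above were collected: the function names are hypothetical, and it assumes CPUs are toggled by writing 0 or 1 to the per-CPU `online` file in sysfs (the same `online_store` path that appears in the call stack above), that CPU 0 stays online, and that the script runs as root.

```python
import time
from pathlib import Path

CPU_SYSFS = "/sys/devices/system/cpu"


def set_cpu_online(cpu, online, sysfs_root=CPU_SYSFS):
    """Write 1 or 0 to cpu<N>/online to bring the CPU up or down."""
    Path(sysfs_root, f"cpu{cpu}", "online").write_text("1" if online else "0")


def time_hotplug(cpus, online, sysfs_root=CPU_SYSFS):
    """Return the wall-clock seconds taken to toggle all given CPUs."""
    start = time.monotonic()
    for cpu in cpus:
        set_cpu_online(cpu, online, sysfs_root)
    return time.monotonic() - start


def average_hotplug_times(cpus, niters=10, sysfs_root=CPU_SYSFS):
    """Offline then online the CPUs niters times; return the two averages."""
    cpus = list(cpus)
    offline, online = [], []
    for _ in range(niters):
        offline.append(time_hotplug(cpus, False, sysfs_root))
        online.append(time_hotplug(cpus, True, sysfs_root))
    return sum(offline) / niters, sum(online) / niters
```

On the 128-CPU machine above, `average_hotplug_times(range(1, 128))` would cover the 127 CPUs other than CPU 0. The post does not say how scenario (4) was enabled; the `rcupdate.rcu_expedited` boot parameter or `/sys/kernel/rcu_expedited` are the usual ways to do it.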
The tables above show that the synchronize_rcu introduced in
442bf3aaf55a ("sched: Let the scheduler see CPU idle states") accounts
for around 43% and 24% of the total time taken by the CPU hotplug offline
and online operations respectively. Any guidance from the community on
how to optimize this would be really helpful.
Thanks,
Aboorva