[PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
David Hildenbrand
david at redhat.com
Wed Sep 16 17:39:53 AEST 2020
On 15.09.20 21:46, Scott Cheloha wrote:
> During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
> to determine which node id (nid) to use when later calling __add_memory().
>
> This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
> appropriate nid for a given address by looking up the LMB containing the
> address and then passing that LMB to of_drconf_to_nid_single() to get the
> nid. In dlpar_add_lmb() we get this address from the LMB itself.
>
> In short, we have a pointer to an LMB and then we are searching for
> that LMB *again* in order to find its nid.
>
> If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
> can skip the redundant lookup. The only error handling we need to
> duplicate from memory_add_physaddr_to_nid() is the fallback to the
> default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
> an invalid nid.
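>
> Roughly, the change in dlpar_add_lmb() boils down to the following
> (a sketch rather than the literal diff; the fallback check mirrors
> hot_add_scn_to_nid() and assumes first_online_node as the default):
>
>     /* Before: searches the drconf LMB array for the LMB we already hold. */
>     nid = memory_add_physaddr_to_nid(lmb->base_addr);
>
>     /* After: ask for this LMB's nid directly and apply the same
>      * default-node fallback that memory_add_physaddr_to_nid() uses.
>      */
>     nid = of_drconf_to_nid_single(lmb);
>     if (nid < 0 || !node_possible(nid))
>         nid = first_online_node;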
>
> Skipping the extra lookup makes hot-add operations faster, especially
> on machines with many LMBs.
>
> Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
> LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
> completed the same operation in ~2 hours:
>
> Unpatched (12450 seconds):
> Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
> Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
> [...]
> Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
>
> Patched (7065 seconds):
> Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
> Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
> [...]
> Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
>
> It should be noted that the speedup grows more substantial when
> hot-adding LMBs at the end of the drconf range. This is because we
> are skipping a linear LMB search.
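>
> For reference, that lookup is shaped roughly like the scan below, which
> every memory_add_physaddr_to_nid() call pays for (a simplified sketch of
> hot_add_drconf_scn_to_nid(); the reserved/unassigned flag checks are
> omitted and the helper name here is only illustrative):
>
>     static int lookup_nid_by_addr(unsigned long scn_addr)
>     {
>         struct drmem_lmb *lmb;
>         int nid = NUMA_NO_NODE;
>
>         /* Walk all LMBs until we find the one containing scn_addr. */
>         for_each_drmem_lmb(lmb) {
>             if (scn_addr >= lmb->base_addr &&
>                 scn_addr < lmb->base_addr + drmem_lmb_size()) {
>                 nid = of_drconf_to_nid_single(lmb);
>                 break;
>             }
>         }
>         return nid;
>     }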
>
> To see the distinction, consider a smaller hot-add test on the same
> LPAR. A perf-stat run with 10 iterations showed that hot-adding 4096
> LMBs completed less than 1 second faster on a patched kernel:
>
> Unpatched:
> Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
>
> 104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% )
> 4,708 context-switches # 0.045 K/sec ( +- 0.69% )
> 2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% )
> 394 page-faults # 0.004 K/sec ( +- 0.22% )
> 445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%)
> 8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%)
> 300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%)
> 258,091,488,691 instructions # 0.58 insn per cycle
> # 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%)
> 70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%)
> 3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%)
>
> 105.583 +- 0.589 seconds time elapsed ( +- 0.56% )
>
> Patched:
> Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
>
> 104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% )
> 4,606 context-switches # 0.044 K/sec ( +- 0.20% )
> 2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% )
> 394 page-faults # 0.004 K/sec ( +- 0.25% )
> 442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%)
> 8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%)
> 299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%)
> 252,731,168,193 instructions # 0.57 insn per cycle
> # 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%)
> 68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%)
> 3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%)
>
> 104.829 +- 0.325 seconds time elapsed ( +- 0.31% )
>
> This is consistent with the linear search explanation above. An
> add-by-count hot-add operation adds LMBs
> greedily, so LMBs near the start of the drconf range are considered
> first. On an otherwise idle LPAR with so many LMBs we would expect to
> find the LMBs we need near the start of the drconf range, hence the
> smaller speedup.
>
> Signed-off-by: Scott Cheloha <cheloha at linux.ibm.com>
Hi Scott,
IIRC, ppc DLPAR does a single add_memory() for each LMB (16 MB). With
tons of LMBs, this will also make /proc/iomem explode in size (it uses a
list-based tree), making traversal significantly slower, e.g., on
insertions and System RAM walks.
I was wondering if you would get another performance boost under ppc
when using MEMHP_MERGE_RESOURCE [1]. AFAICS, the resource boundaries are
not of interest. No guarantees, might be worth a try.
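If that series lands, the pseries hot-add path could opt in with
something like the following (just a sketch: the MEMHP_MERGE_RESOURCE
flag and the extra mhp_flags parameter to __add_memory() come from [1]
and are not upstream yet; nid, lmb and block_sz as in dlpar_add_lmb()):

    /* Hypothetical: allow adjacent LMB resources to be merged so the
     * resource tree does not grow by one entry per 16 MB LMB.
     */
    rc = __add_memory(nid, lmb->base_addr, block_sz,
                      MEMHP_MERGE_RESOURCE);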
Did you investigate what else makes memory hotplug that slow? (126000
LMBs correspond to roughly 2TB, that shouldn't take 2 hours ...) Memory
block devices might still be a slowdown (although we have an xarray in
place now that takes care of most pain).
[1]
https://lore.kernel.org/linux-mm/20200911103459.10306-1-david@redhat.com/
--
Thanks,
David / dhildenb