[PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
Scott Cheloha
cheloha at linux.ibm.com
Thu Sep 17 00:39:13 AEST 2020
On Wed, Sep 16, 2020 at 09:39:53AM +0200, David Hildenbrand wrote:
> On 15.09.20 21:46, Scott Cheloha wrote:
> > During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
> > to determine which node id (nid) to use when later calling __add_memory().
> >
> > This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
> > appropriate nid for a given address by looking up the LMB containing the
> > address and then passing that LMB to of_drconf_to_nid_single() to get the
> > nid. In dlpar_add_lmb() we get this address from the LMB itself.
> >
> > In short, we have a pointer to an LMB and then we are searching for
> > that LMB *again* in order to find its nid.
> >
> > If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
> > can skip the redundant lookup. The only error handling we need to
> > duplicate from memory_add_physaddr_to_nid() is the fallback to the
> > default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
> > an invalid nid.
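For illustration, the reworked lookup described above might look
roughly like the following sketch (dlpar_lmb_nid() is a hypothetical
helper name; the actual patch may structure the fallback differently):

    /* Hypothetical helper: resolve an LMB's nid without re-searching. */
    static int dlpar_lmb_nid(struct drmem_lmb *lmb)
    {
            /*
             * We already hold the LMB, so ask the drconf NUMA code for
             * its nid directly instead of re-finding this same LMB by
             * address via memory_add_physaddr_to_nid().
             */
            int nid = of_drconf_to_nid_single(lmb);

            /* Same fallback that memory_add_physaddr_to_nid() applies. */
            if (nid < 0 || !node_possible(nid))
                    nid = first_online_node;

            return nid;
    }

dlpar_add_lmb() would then pass the result straight to __add_memory()
along with lmb->base_addr and the memory block size.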
> >
> > Skipping the extra lookup makes hot-add operations faster, especially
> > on machines with many LMBs.
> >
> > Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
> > LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
> > completed the same operation in ~2 hours:
> >
> > Unpatched (12450 seconds):
> > Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
> > Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
> > [...]
> > Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
> >
> > Patched (7065 seconds):
> > Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
> > Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
> > [...]
> > Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
> >
> > The speedup is more pronounced when hot-adding LMBs near the end of
> > the drconf range, because that is where the linear LMB search we are
> > now skipping is most expensive.
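For context, the linear search in question is roughly the walk the
pseries memory_add_physaddr_to_nid() path performs today to find the
LMB covering an address (simplified sketch, not a verbatim excerpt of
arch/powerpc/mm/numa.c):

    /* Simplified sketch of the address -> LMB -> nid search. */
    struct drmem_lmb *lmb;
    int nid = NUMA_NO_NODE;

    for_each_drmem_lmb(lmb) {               /* linear walk over every LMB */
            if (scn_addr >= lmb->base_addr &&
                scn_addr < lmb->base_addr + drmem_lmb_size()) {
                    nid = of_drconf_to_nid_single(lmb);
                    break;
            }
    }

The later the target LMB sits in the drconf array, the longer this
walk takes, which is why the savings grow toward the end of the range.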
> >
> > To see the contrast, consider a smaller hot-add test on the same
> > LPAR. A perf stat run with 10 iterations showed that hot-adding 4096
> > LMBs completed less than 1 second faster on a patched kernel:
> >
> > Unpatched:
> > Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
> >
> > 104,753.42 msec task-clock # 0.992 CPUs utilized ( +- 0.55% )
> > 4,708 context-switches # 0.045 K/sec ( +- 0.69% )
> > 2,444 cpu-migrations # 0.023 K/sec ( +- 1.25% )
> > 394 page-faults # 0.004 K/sec ( +- 0.22% )
> > 445,902,503,057 cycles # 4.257 GHz ( +- 0.55% ) (66.67%)
> > 8,558,376,740 stalled-cycles-frontend # 1.92% frontend cycles idle ( +- 0.88% ) (49.99%)
> > 300,346,181,651 stalled-cycles-backend # 67.36% backend cycles idle ( +- 0.76% ) (50.01%)
> > 258,091,488,691 instructions # 0.58 insn per cycle
> > # 1.16 stalled cycles per insn ( +- 0.22% ) (66.67%)
> > 70,568,169,256 branches # 673.660 M/sec ( +- 0.17% ) (50.01%)
> > 3,100,725,426 branch-misses # 4.39% of all branches ( +- 0.20% ) (49.99%)
> >
> > 105.583 +- 0.589 seconds time elapsed ( +- 0.56% )
> >
> > Patched:
> > Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
> >
> > 104,055.69 msec task-clock # 0.993 CPUs utilized ( +- 0.32% )
> > 4,606 context-switches # 0.044 K/sec ( +- 0.20% )
> > 2,463 cpu-migrations # 0.024 K/sec ( +- 0.93% )
> > 394 page-faults # 0.004 K/sec ( +- 0.25% )
> > 442,951,129,921 cycles # 4.257 GHz ( +- 0.32% ) (66.66%)
> > 8,710,413,329 stalled-cycles-frontend # 1.97% frontend cycles idle ( +- 0.47% ) (50.06%)
> > 299,656,905,836 stalled-cycles-backend # 67.65% backend cycles idle ( +- 0.39% ) (50.02%)
> > 252,731,168,193 instructions # 0.57 insn per cycle
> > # 1.19 stalled cycles per insn ( +- 0.20% ) (66.66%)
> > 68,902,851,121 branches # 662.173 M/sec ( +- 0.13% ) (49.94%)
> > 3,100,242,882 branch-misses # 4.50% of all branches ( +- 0.15% ) (49.98%)
> >
> > 104.829 +- 0.325 seconds time elapsed ( +- 0.31% )
> >
> > This result is consistent with expectations. An add-by-count hot-add
> > operation claims LMBs greedily, so LMBs near the start of the drconf
> > range are considered first. On an otherwise idle LPAR with this many
> > LMBs we would expect to satisfy the request from LMBs near the start
> > of the drconf range, hence the smaller speedup.
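(The greedy behavior described above comes from the LMB selection loop
in dlpar_memory_add_by_count(); a heavily simplified sketch of that
logic, not a verbatim excerpt, is:)

    /* Simplified sketch: claim free LMBs from the start of the array. */
    for_each_drmem_lmb(lmb) {
            if (lmb->flags & DRCONF_MEM_ASSIGNED)
                    continue;               /* already owned by the OS */

            if (dlpar_add_lmb(lmb))
                    continue;               /* add failed, try the next one */

            if (++lmbs_added == lmbs_to_add)
                    break;                  /* request satisfied */
    }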
> >
> > Signed-off-by: Scott Cheloha <cheloha at linux.ibm.com>
>
>
> Hi Scott,
>
> IIRC, ppc DLPAR does a single add_memory() [...]
Yes.
> [...] for each LMB (16 MB).
The block size is set by the hypervisor. The default is 256MB, which
is the block size I had in this test.
On multi-terabyte machines I would effectively always expect a block
size of 256MB. 16MB blocks are still supported, but they are not the
default, so they are increasingly rare.
> With tons of LMBs, this will also make /proc/iomem explode in size (using a
> list-based tree), making traversal significantly slower e.g., on
> insertions and system ram walks.
>
> I was wondering if you would get another performance boost under ppc
> when using MEMHP_MERGE_RESOURCE [1]. AFAIKs, the resource boundaries are
> not of interest. No guarantees, might be worth a try.
I'll give it a shot.
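If that series lands, my guess is the pseries side of the experiment
is mostly just passing the new flag through the add_memory call in
dlpar_add_lmb(), e.g. (hypothetical, and dependent on the final shape
of the API in [1]):

    /* Hypothetical: let adjacent LMBs merge into one iomem resource. */
    rc = __add_memory(nid, lmb->base_addr, block_sz, MEMHP_MERGE_RESOURCE);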
> Did you investigate what else makes memory hotplug that slow? (126000
> LMBs correspond to roughly 2TB, that shouldn't take 2 hours ...)
It was ~31TB in 256MB blocks. It's a worst-case test (add all of the
memory), but I'm pretty happy with a 1.5 hour improvement :)
> Memory block devices might still be a slowdown (although we have an
> xarray in place now that takes care of most pain).
Memory block devices are no longer a hotspot.
Some of the slowdown is printk overhead: we print a log message for
every LMB, which is very silly. I intend to move those messages to
debug priority, which should be an easy speedup.
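As a concrete example of the change I have in mind, the per-LMB
success message would go from info to debug priority, something like
(illustrative only):

    /* Before: one info-level line logged for every hot-added LMB. */
    pr_info("Memory at %llx (drc index %x) was hot-added\n",
            lmb->base_addr, lmb->drc_index);

    /* After: only emitted when debugging is enabled for this file. */
    pr_debug("Memory at %llx (drc index %x) was hot-added\n",
             lmb->base_addr, lmb->drc_index);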
Otherwise I need to do more profiling.