Slub: Increased mem consumption on cpu,mem-less node powerpc guest

Tue Mar 17 20:26:24 AEDT 2020

Hi,

We are seeing increased slab memory consumption on a PowerPC guest
LPAR (on PowerVM) with an uncommon topology where one NUMA node has
neither CPUs nor memory and the other node has all the CPUs and memory.
Though QEMU prevents such topologies for KVM guests, I hacked QEMU to
allow such a topology to get some slab numbers. Here is the comparison
of such a KVM guest with a single-node KVM guest that has the same number
of CPUs and the same amount of memory.

Case 1: 2 node NUMA, node0 empty
================================
# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16294 MB
node 1 free: 15453 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

Case 2: Single node
===================
# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16294 MB
node 0 free: 15675 MB
node distances:
node   0 
  0:  10 

Here is how total slab memory consumption compares right after boot:
# grep -i slab /proc/meminfo

Case 1: 442560 kB
Case 2: 195904 kB
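
For scale, the two totals above differ by 246656 kB (about 241 MB),
i.e. Case 1 uses roughly 2.25 times the slab memory of Case 2. A minimal
C sketch of that arithmetic follows; the function names are illustrative
only and do not come from any existing tool.

```c
#include <assert.h>

/* Extra slab memory of guest A over guest B, in kB. */
unsigned long slab_excess_kb(unsigned long a_kb, unsigned long b_kb)
{
	assert(a_kb >= b_kb);
	return a_kb - b_kb;
}

/* Ratio A/B scaled by 100, using truncating integer division. */
unsigned long slab_ratio_x100(unsigned long a_kb, unsigned long b_kb)
{
	assert(b_kb != 0);
	return a_kb * 100 / b_kb;
}
```

Called with the values from /proc/meminfo above (442560 and 195904),
slab_excess_kb() gives 246656 and slab_ratio_x100() gives 225.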

A closer look at the individual slabs suggests that most of the increased
slab consumption in Case 1 can be attributed to kmalloc-N slabs. In
particular, the following two caches account for most of the increase.

Case 1:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512 
kmalloc-1k                2869    1024          101.5M    1549/1540/0   32 0  99   2 U
kmalloc-512               3302     512          100.2M    1530/1522/0   64 0  99   1 U

Case 2:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512 
kmalloc-1k                2811    1024            6.1M        94/29/0   32 0  30  46 U
kmalloc-512               3207     512            3.5M        54/13/0   64 0  24  46 U

Here is the list of slub stats that differ significantly between the two cases:

Case 1:
------
alloc_from_partial 6333 C0=1506 C1=525 C2=774 C3=478 C4=413 C5=1036 C6=698 C7=903
alloc_slab 3350 C0=757 C1=336 C2=120 C3=72 C4=120 C5=912 C6=600 C7=433
alloc_slowpath 9792 C0=2329 C1=861 C2=916 C3=571 C4=533 C5=1948 C6=1298 C7=1336
cmpxchg_double_fail 31 C1=3 C2=2 C3=7 C4=3 C5=4 C6=2 C7=10
deactivate_full 38 C0=14 C1=2 C2=13 C5=3 C6=2 C7=4
deactivate_remote_frees 1 C7=1
deactivate_to_head 10092 C0=2654 C1=859 C2=903 C3=571 C4=533 C5=1945 C6=1296 C7=1331
deactivate_to_tail 1 C7=1
free_add_partial 29 C0=7 C2=1 C3=5 C4=3 C5=6 C6=2 C7=5
free_frozen 32 C0=4 C1=3 C2=4 C3=3 C4=7 C5=3 C6=7 C7=1
free_remove_partial 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slab 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slowpath 7415 C0=2014 C1=486 C2=433 C3=525 C4=814 C5=1707 C6=586 C7=850
objects 2875 N1=2875
objects_partial 2587 N1=2587
partial 1542 N1=1542
slabs 1551 N1=1551
total_objects 49632 N1=49632

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=147813/152837/153714 pid=1-1074 cpus=0-2,5-7 nodes=1

# cat free_calls (truncated) 
   2671 <not-available> age=4295094831 pid=0 cpus=0 nodes=1
      2 free_fair_sched_group+0xa0/0x120 age=156576/156850/157125 pid=0 cpus=0,5 nodes=1

Case 2:
------
alloc_from_partial 9231 C0=435 C1=2349 C2=2386 C3=1807 C4=882 C5=367 C6=559 C7=446
alloc_slab 114 C0=12 C1=41 C2=28 C3=15 C4=9 C5=1 C6=1 C7=7
alloc_slowpath 9415 C0=448 C1=2390 C2=2414 C3=1891 C4=891 C5=368 C6=560 C7=453
cmpxchg_double_fail 22 C0=1 C1=1 C3=3 C4=8 C5=1 C6=5 C7=3
deactivate_full 512 C0=13 C1=143 C2=147 C3=147 C4=22 C5=10 C6=6 C7=24
deactivate_remote_frees 1 C4=1
deactivate_to_head 9099 C0=437 C1=2247 C2=2267 C3=1937 C4=870 C5=358 C6=554 C7=429
deactivate_to_tail 1 C4=1
free_add_partial 447 C0=21 C1=140 C2=164 C3=60 C4=22 C5=16 C6=14 C7=10
free_frozen 22 C0=3 C2=3 C3=2 C4=1 C5=6 C6=6 C7=1
free_remove_partial 20 C1=5 C2=5 C4=3 C6=7
free_slab 20 C1=5 C2=5 C4=3 C6=7
free_slowpath 6953 C0=194 C1=2123 C2=1729 C3=850 C4=466 C5=725 C6=520 C7=346
objects 2812 N0=2812
objects_partial 733 N0=733
partial 29 N0=29
slabs 94 N0=94
total_objects 3008 N0=3008

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=43957/46225/46802 pid=1-1059 cpus=1-5,7

# cat free_calls (truncated) 
   1516 <not-available> age=4294987281 pid=0 cpus=0
    647 free_fair_sched_group+0xa0/0x120 age=48798/49142/49628 pid=0-954 cpus=1-2
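
One way to read the counters above is as slab utilization: objects
divided by total_objects. This assumes that "objects" counts in-use
objects and "total_objects" counts all object slots in the cache. The
following is a small sketch of that calculation, with a hypothetical
struct and helper names, using the N1 (two-node guest) and N0
(single-node guest) values listed above.

```c
#include <assert.h>

/* Counter values copied from the per-node stats listed above. */
struct slab_counters {
	unsigned long objects;        /* assumed: objects currently in use */
	unsigned long total_objects;  /* assumed: object slots in all slabs */
	unsigned long slabs;          /* number of slabs */
	unsigned long partial;        /* number of partial slabs */
};

/* Two-node guest (N1 values) and single-node guest (N0 values). */
const struct slab_counters two_node    = { 2875, 49632, 1551, 1542 };
const struct slab_counters single_node = { 2812,  3008,   94,   29 };

/* In-use objects per 1000 slots (truncating integer division). */
unsigned long utilization_permille(const struct slab_counters *c)
{
	assert(c->total_objects != 0);
	return c->objects * 1000 / c->total_objects;
}

/* Object slots per slab, derived from the counters. */
unsigned long slots_per_slab(const struct slab_counters *c)
{
	assert(c->slabs != 0);
	return c->total_objects / c->slabs;
}
```

With these inputs, both guests have 32 slots per slab, but the two-node
guest uses about 57 of every 1000 slots while the single-node guest uses
about 934 of every 1000. Nearly the same number of objects is spread
over 1551 slabs instead of 94.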

We see a significant difference in the number of partial slabs and
the resulting total_objects between the two cases. I was trying to
see if this has anything to do with the way the node value is
arrived at in different slub routines. I haven't yet understood the slub
code well enough to say anything conclusively, but the following hack in
the slub code eliminates the increased slab consumption for Case 1 and
makes it very similar to Case 2:

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..888e4d245444 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	void *object;
 	int searchnode = node;
 
-	if (node == NUMA_NO_NODE)
+	if (node == NUMA_NO_NODE || !node_present_pages(node))
 		searchnode = numa_mem_id();
-	else if (!node_present_pages(node))
-		searchnode = node_to_mem_node(node);
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
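
The effect of the hunk above on the choice of searchnode can be modelled
in user space. This is only a sketch: the struct and the MODEL_* names
are hypothetical stand-ins for node_present_pages(), node_to_mem_node()
and numa_mem_id(), and what node_to_mem_node() actually returns for the
empty node on this guest is not established here, so it is left as an
input table.

```c
#include <assert.h>

#define MODEL_NO_NODE   (-1)  /* stand-in for NUMA_NO_NODE */
#define MODEL_MAX_NODES 4

struct model_topology {
	unsigned long present_pages[MODEL_MAX_NODES]; /* models node_present_pages() */
	int mem_node_of[MODEL_MAX_NODES];             /* models node_to_mem_node() */
	int local_mem_node;                           /* models numa_mem_id() */
};

/* Node selection as in the lines removed by the hunk. */
int searchnode_before(const struct model_topology *t, int node)
{
	int searchnode = node;

	assert(node == MODEL_NO_NODE || (node >= 0 && node < MODEL_MAX_NODES));
	if (node == MODEL_NO_NODE)
		searchnode = t->local_mem_node;
	else if (!t->present_pages[node])
		searchnode = t->mem_node_of[node];
	return searchnode;
}

/* Node selection as in the lines added by the hunk. */
int searchnode_after(const struct model_topology *t, int node)
{
	int searchnode = node;

	assert(node == MODEL_NO_NODE || (node >= 0 && node < MODEL_MAX_NODES));
	if (node == MODEL_NO_NODE || !t->present_pages[node])
		searchnode = t->local_mem_node;
	return searchnode;
}
```

For a topology like Case 1 (node 0 empty, all CPUs and memory on node 1),
searchnode_after() returns 1 for a request on node 0, while
searchnode_before() returns whatever the mem_node_of table holds for
node 0. The two agree only if that table maps node 0 to node 1.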

Regards,
Bharata.