[RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map

Nishanth Aravamudan nacc at linux.vnet.ibm.com
Fri Mar 6 05:05:49 AEDT 2015


Raghu noticed excessive memory allocation on power with a simple cgroup
test, specifically in mem_cgroup_css_alloc() -> for_each_node ->
alloc_mem_cgroup_per_zone_info(), which ends up blowing up the
kmalloc-2048 slab (on the order of 200MB for 400 cgroup directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
iteration of for_each_node.
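
As a rough sanity check on the numbers above, here is a back-of-the-envelope
sketch. It assumes each cgroup ends up with exactly one kmalloc-2048 object
per node that for_each_node visits; that model and the helper name are only
illustrative, not kernel code.

```c
#include <stddef.h>

/* Values from the report; the one-object-per-node model is an assumption. */
enum {
	NODES_SHIFT_POWER = 8,                  /* NODES_SHIFT on power */
	MAX_POSSIBLE_NODES = 1 << NODES_SHIFT_POWER, /* 256 possible nodes */
	SLAB_OBJECT_SIZE = 2048,                /* kmalloc-2048 object size */
};

/*
 * Hypothetical helper: bytes of kmalloc-2048 slab used when each of
 * nr_cgroups allocates one object for each of nr_nodes iterated nodes.
 */
static size_t cgroup_slab_bytes(size_t nr_cgroups, size_t nr_nodes)
{
	return nr_cgroups * nr_nodes * SLAB_OBJECT_SIZE;
}
```

With 400 cgroups and 256 possible nodes this comes to 209715200 bytes,
i.e. 200MB, which lines up with what Raghu saw. With, say, 4 nodes in the
possible map the same model gives about 3MB.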

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.

One could alternatively use nodes_and() to AND node_possible_map with
node_online_map, but I think the cost of ANDing the two masks will
always be higher than zeroing the map and setting a few bits in practice.
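
To illustrate that the two approaches end up with the same map, here is a
small userspace model. It is only a sketch: the struct and the model_*
and restrict_* names are made up for this example and are not the
kernel's nodemask API, and it assumes the online nodes are a subset of
the possible nodes.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_MAX_NODES 256
#define MODEL_BITS_PER_WORD (8 * sizeof(unsigned long))
#define MODEL_WORDS (MODEL_MAX_NODES / MODEL_BITS_PER_WORD)

/* Hypothetical stand-in for a node mask: one bit per node ID. */
struct model_nodemask {
	unsigned long bits[MODEL_WORDS];
};

static void model_set(struct model_nodemask *m, unsigned int nid)
{
	m->bits[nid / MODEL_BITS_PER_WORD] |= 1UL << (nid % MODEL_BITS_PER_WORD);
}

static bool model_test(const struct model_nodemask *m, unsigned int nid)
{
	return (m->bits[nid / MODEL_BITS_PER_WORD] >>
		(nid % MODEL_BITS_PER_WORD)) & 1UL;
}

/* Mark every node ID as set, like a possible map sized by NODES_SHIFT. */
static void model_fill(struct model_nodemask *m)
{
	memset(m->bits, 0xff, sizeof(m->bits));
}

/* Approach taken here: zero the possible map, then set each online node. */
static void restrict_by_clear_and_set(struct model_nodemask *possible,
				      const struct model_nodemask *online)
{
	unsigned int nid;

	memset(possible->bits, 0, sizeof(possible->bits));
	for (nid = 0; nid < MODEL_MAX_NODES; nid++)
		if (model_test(online, nid))
			model_set(possible, nid);
}

/* Alternative: AND the possible map with the online map, word by word. */
static void restrict_by_and(struct model_nodemask *possible,
			    const struct model_nodemask *online)
{
	size_t i;

	for (i = 0; i < MODEL_WORDS; i++)
		possible->bits[i] &= online->bits[i];
}

static bool model_equal(const struct model_nodemask *a,
			const struct model_nodemask *b)
{
	return memcmp(a->bits, b->bits, sizeof(a->bits)) == 0;
}
```

Starting from a full possible map and an online map with a few nodes set,
both helpers leave exactly the online nodes set. In the real patch the
set happens inside the existing for_each_online_node loop, so the only
extra work is the clear.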

Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com>

---
While looking at this, I noticed that nr_node_ids seems to be a
misnomer. It is not the number of nodes but the highest possible node
ID plus one: with sparse NUMA nodes, you might have only two possible
NUMA nodes, but to make certain loops work, nr_node_ids will be, e.g.,
17. Should it be renamed?
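
To make the distinction concrete, here is a small userspace sketch. The
function name and the array-of-IDs representation are made up for
illustration and are not how the kernel computes nr_node_ids; it just
shows the difference between the count of nodes and the highest ID plus
one.

```c
#include <stddef.h>

/*
 * Hypothetical model: given the list of possible node IDs, return the
 * highest ID plus one, i.e. the bound a loop like
 * "for (nid = 0; nid < bound; nid++)" needs in order to reach every
 * node. Returns 0 for an empty list (a choice made for this sketch).
 */
static unsigned int model_node_id_bound(const unsigned int *ids, size_t count)
{
	unsigned int highest = 0;
	size_t i;

	if (count == 0)
		return 0;

	for (i = 0; i < count; i++)
		if (ids[i] > highest)
			highest = ids[i];

	return highest + 1;
}
```

For two possible nodes with IDs 0 and 16, the count is 2 but the bound is
17, which is the kind of value nr_node_ids ends up with.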

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..24de29b3651b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,9 +958,17 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
+	/*
+	 * zero out the possible nodes after we parse the device-tree,
+	 * so that we lower the maximum NUMA node ID to what is actually
+	 * present.
+	 */
+	nodes_clear(node_possible_map);
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 
+		node_set(nid, node_possible_map);
 		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 		setup_node_data(nid, start_pfn, end_pfn);
 		sparse_memory_present_with_active_regions(nid);


