Slub: Increased mem consumption on cpu,mem-less node powerpc guest
Vlastimil Babka
vbabka at suse.cz
Wed Mar 18 21:18:11 AEDT 2020
On 3/18/20 4:20 AM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka at suse.cz> [2020-03-17 17:45:15]:
>>
>> Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
>> patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
>> that there has to be something else that calls kmalloc_node(node) where node is
>> one that doesn't have present pages.
>>
>> He mentions alloc_fair_sched_group() which has:
>>
>>         for_each_possible_cpu(i) {
>>                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>>                                       GFP_KERNEL, cpu_to_node(i));
>>                 ...
>>                 se = kzalloc_node(sizeof(struct sched_entity),
>>                                   GFP_KERNEL, cpu_to_node(i));
>>
>
>
> Sachin's experiment.
> Upstream-next/ memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.

So you're saying the crash was actually for an allocation on e.g. node 2, not
node 0? But I believe it was on node 0, because init_kmem_cache_nodes() will
only allocate kmem_cache_node on nodes with N_NORMAL_MEMORY (which doesn't
include node 0), and slab_mem_going_online_callback() was probably not called
for node 0 (it was not dynamically onlined).

Also, if node 0 was fine, node_to_mem_node(2-31) (not initialized explicitly)
would have returned 0, and thus those allocations would not have crashed either.

> Bharata's experiment.
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node/ for_each_possible_cpu
> i.e kmalloc is called for N_ONLINE nodes.
> So wouldn't crash
>
> Even if his possible nodes were 0-256. I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence its not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.

Are you sure that for every CPU in for_each_possible_cpu(), cpu_to_node() will
be 1? Are all of them properly initialized, or is there a similar issue as with
node_to_mem_node(), where some were not initialized and thus cpu_to_node() will
return 0?

Because AFAICS, if kzalloc_node() was always called with node 1, then
node_present_pages(1) is true, and the "hack" that Bharata reports to work in
his original mail would make no functional difference.

>
>> I assume one of these structs is 1k and other 512 bytes (rounded) and that for
>> some possible cpu's cpu_to_node(i) will be 0, which has no present pages. And as
>> Bharata pasted, node_to_mem_node(0) = 0
>> So this looks like the same scenario, but it doesn't crash? Is the node 0
>> actually online here, and/or does it have N_NORMAL_MEMORY state?
>
> I still dont have any clue on the leak though.

Let's assume that kzalloc_node() was called with node 0 for some of the possible
CPUs. I still wonder why it doesn't crash, but let's assume kmem_cache_node does
exist for node 0 here.

So the execution AFAICS goes like this:

slab_alloc_node(0)
  c = raw_cpu_ptr(s->cpu_slab);
  object = c->freelist;
  page = c->page;
  if (unlikely(!object || !node_match(page, node))) {
    // whatever we have in the per-cpu cache must be from node 1
    // because node 0 has no memory, so there's no node_match and thus
    __slab_alloc(node == 0)
      ___slab_alloc(node == 0)
        page = c->page;
      redo:
        if (unlikely(!node_match(page, node))) { // still no match
          int searchnode = node;

          if (node != NUMA_NO_NODE && !node_present_pages(node))
            // true && true for node 0
            searchnode = node_to_mem_node(node);
            // searchnode is 0, not 1

          if (unlikely(!node_match(page, searchnode))) {
            // page still from node 1, searchnode is 0, no match
            stat(s, ALLOC_NODE_MISMATCH);
            deactivate_slab(s, page, c->freelist, c);
            // we removed the slab from cpu's cache
            goto new_slab;
          }

      new_slab:
        if (slub_percpu_partial(c)) {
          page = c->page = slub_percpu_partial(c);
          slub_set_percpu_partial(c, page);
          stat(s, CPU_PARTIAL_ALLOC);
          goto redo;
          // huh, so with CONFIG_SLUB_CPU_PARTIAL
          // this can become an infinite loop actually?
        }
        // Bharata's slub stats don't include cpu_partial_alloc so I assume
        // CONFIG_SLUB_CPU_PARTIAL is not enabled and we don't loop
        freelist = new_slab_objects(s, gfpflags, node, &c);

          freelist = new_slab_objects(s, gfpflags, node, &c);
            if (node == NUMA_NO_NODE) // false, it's 0
            else if (!node_present_pages(node)) // true for 0
              searchnode = node_to_mem_node(node); // still 0
            object = get_partial_node(s, get_node(s, searchnode),...);
            // object is NULL as node 0 has nothing
            // but we have node == 0 so we return the NULL
            if (object || node != NUMA_NO_NODE)
              return object;
            // and we don't fallback to get_any_partial which would
            // have found e.g. the slab we deactivated earlier
            return get_any_partial(s, flags, c);
            page = new_slab(s, flags, node);
            // we attempt to allocate new slab on node 0, but it will come
            // from node 1

So that explains the leak, I think. We keep throwing away slabs from node 1,
only to allocate new ones on node 1 again. Effectively, for every possible CPU
where cpu_to_node() is 0, each cfs_rq object and each sched_entity object will
get a new (high-order?) page.
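
To make the effect of the trace above concrete, here is a minimal userspace
sketch of it. It is a toy model, not kernel code: struct slub_model,
model_init(), model_alloc(), model_run(), OBJS_PER_SLAB and MAX_PARTIAL are all
hypothetical names invented for illustration. It assumes a single CPU, a single
cache, CONFIG_SLUB_CPU_PARTIAL disabled, and that all memory lives on node 1.
The second configuration (node_to_mem_node(0) == 1) is only a hypothetical
comparison point, not something reported in this thread.

```c
#include <assert.h>

#define NUMA_NO_NODE   (-1)
#define NR_NODES       2
#define OBJS_PER_SLAB  8     /* arbitrary value, for illustration only */
#define MAX_PARTIAL    4096  /* capacity of the modelled partial list */

/* Toy state: one CPU, one cache; node 0 is memoryless, node 1 has memory. */
struct slub_model {
	int present_pages[NR_NODES];   /* stands in for node_present_pages() */
	int mem_node[NR_NODES];        /* stands in for node_to_mem_node() */
	int cpu_slab_node;             /* node of c->page, NUMA_NO_NODE if none */
	int cpu_slab_free;             /* free objects left in c->page */
	int partial_free[MAX_PARTIAL]; /* free counts of node 1 partial slabs */
	int nr_partial;
	unsigned long new_slabs;       /* slabs taken from the page allocator */
};

static void model_init(struct slub_model *m, int mem_node_of_0)
{
	m->present_pages[0] = 0;
	m->present_pages[1] = 1;
	m->mem_node[0] = mem_node_of_0;
	m->mem_node[1] = 1;
	m->cpu_slab_node = NUMA_NO_NODE;
	m->cpu_slab_free = 0;
	m->nr_partial = 0;
	m->new_slabs = 0;
}

static int search_node(const struct slub_model *m, int node)
{
	if (node != NUMA_NO_NODE && !m->present_pages[node])
		return m->mem_node[node];
	return node;
}

/* Allocate one object for @node; fast and slow path are folded together. */
static void model_alloc(struct slub_model *m, int node)
{
	int searchnode = search_node(m, node);

	if (m->cpu_slab_node != NUMA_NO_NODE) {
		int match = (searchnode == NUMA_NO_NODE ||
			     searchnode == m->cpu_slab_node);

		if (match && m->cpu_slab_free > 0) {
			m->cpu_slab_free--;
			return;
		}
		/* deactivate_slab(): a slab with free objects goes to the
		 * partial list of the node it physically lives on (node 1) */
		if (m->cpu_slab_free > 0 && m->nr_partial < MAX_PARTIAL)
			m->partial_free[m->nr_partial++] = m->cpu_slab_free;
		m->cpu_slab_node = NUMA_NO_NODE;
		m->cpu_slab_free = 0;
	}

	/* get_partial(): only searchnode's list is tried, and for an explicit
	 * node there is no fallback to get_any_partial(). All partial slabs
	 * in this model are on node 1. */
	if (m->nr_partial > 0 &&
	    (searchnode == NUMA_NO_NODE || searchnode == 1)) {
		m->cpu_slab_free = m->partial_free[--m->nr_partial] - 1;
		m->cpu_slab_node = 1;
		return;
	}

	/* new_slab(): requested on the caller's node, but the page comes
	 * from node 1 because that is where the memory is */
	m->new_slabs++;
	m->cpu_slab_node = 1;
	m->cpu_slab_free = OBJS_PER_SLAB - 1;
}

/* Do @count allocations for node 0; return how many new slabs were needed. */
unsigned long model_run(int mem_node_of_0, int count)
{
	struct slub_model m;
	int i;

	model_init(&m, mem_node_of_0);
	for (i = 0; i < count; i++)
		model_alloc(&m, 0);
	return m.new_slabs;
}
```

In this model, model_run(0, 100) needs 100 new slabs (one per object, with the
previous slab deactivated each time), while model_run(1, 100) needs only 13,
since 8 objects share each slab. That matches the "new page per object"
behaviour described above, under the stated assumptions.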