Bug in reclaim logic with exhausted nodes?

Nishanth Aravamudan nacc at linux.vnet.ibm.com
Wed Mar 12 08:06:14 EST 2014


We have seen the following situation on a test system:

2-node system, each node has 32GB of memory.

2 gigantic (16GB) pages reserved at boot-time, both of which are
allocated from node 1.

SLUB notices this:

[    0.000000] SLUB: Unable to allocate memory from node 1
[    0.000000] SLUB: Allocating a useless per node structure in order to be able to continue

After boot, the user then did:

echo 24 > /proc/sys/vm/nr_hugepages

Tasks are now stuck:

[<c0000000010980b8>] kexec_stack+0xb8/0x8000
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f9334b0>] 0xc00000004f9334b0
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

[<c00000004f91f440>] 0xc00000004f91f440
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb54c>] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
[<c0000000003662cc>] .kobj_attr_store+0x2c/0x50
[<c0000000002b2c2c>] .sysfs_write_file+0xec/0x1c0
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c

kswapd1 is also pegged at 100% CPU at this point.

If we instead go in and manually do:

echo 24 >
/sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages

rather than relying on the interleaving allocator used by the sysctl, the
allocation succeeds (and the echo returns immediately).

I think we are hitting the following:

mm/hugetlb.c::alloc_fresh_huge_page_node():

        page = alloc_pages_exact_node(nid,
                htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
                                                __GFP_REPEAT|__GFP_NOWARN,
                huge_page_order(h));

include/linux/gfp.h:

#define GFP_THISNODE    (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

and mm/page_alloc.c::__alloc_pages_slowpath():

        /*
         * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
         * __GFP_NOWARN set) should not cause reclaim since the subsystem
         * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
         * using a larger set of nodes after it has established that the
         * allowed per node queues are empty and that nodes are
         * over allocated.
         */
        if (IS_ENABLED(CONFIG_NUMA) &&
                        (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
                goto nopage;

so we *do* reclaim in this call path: the hugetlb mask sets
__GFP_THISNODE and __GFP_NOWARN but not __GFP_NORETRY, so it does not
match GFP_THISNODE and the early bail-out above is skipped. Under my
reading, since node 1 is exhausted, no matter how much work kswapd1
does, it will never be able to reclaim memory from node 1 to satisfy a
16M huge page allocation request (or any other, for that matter).
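
To make the flag arithmetic concrete, here is a minimal userspace model
of that test. The bit values below are made-up placeholders rather than
the real kernel values; only the flag names and the GFP_THISNODE
definition come from include/linux/gfp.h:

#include <stdio.h>

/* Illustrative stand-in bit values; the real ones are in gfp.h. */
#define __GFP_NOWARN    0x001u
#define __GFP_REPEAT    0x002u
#define __GFP_NORETRY   0x004u
#define __GFP_COMP      0x008u
#define __GFP_THISNODE  0x010u

#define GFP_THISNODE    (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

int main(void)
{
        /* The mask passed by alloc_fresh_huge_page_node(), ignoring
         * whatever htlb_alloc_mask(h) adds: __GFP_NORETRY is absent. */
        unsigned int gfp_mask = __GFP_COMP | __GFP_THISNODE |
                                __GFP_REPEAT | __GFP_NOWARN;

        /* Because __GFP_NORETRY is missing, this comparison is false,
         * so __alloc_pages_slowpath() does not take the early "nopage"
         * exit and proceeds to direct reclaim instead. */
        printf("early nopage exit taken: %s\n",
               (gfp_mask & GFP_THISNODE) == GFP_THISNODE ? "yes" : "no");
        return 0;
}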

I see the following possible changes/fixes, but am unsure
a) whether my analysis is right, and
b) which fix would be best.

1) Since we did notice early in boot that (in this case) node 1 was
exhausted, perhaps we should mark it as such there somehow, and if a
__GFP_THISNODE allocation request later comes in for such a node, we
immediately fall through to nopage?

2) There is the following check in mm/page_alloc.c (in should_alloc_retry()):
        /*
         * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
         * specified, then we retry until we no longer reclaim any pages
         * (above), or we've reclaimed an order of pages at least as
         * large as the allocation's order. In both cases, if the
         * allocation still fails, we stop retrying.
         */
        if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
                return 1;

I wonder if we should also add a check there to be sure that, when
__GFP_THISNODE is set, the pages we have reclaimed actually came from
the requested node? Something like:

       if (gfp_mask & __GFP_THISNODE &&
       		<the progress we have made is on the requested node>)

3) did_some_progress could be updated to track where the progress is
occurring, and if we are handling a __GFP_THISNODE allocation request
and made no progress on the requested node, we fail the allocation? A
rough userspace sketch of this per-node idea (covering both 2) and 3))
follows below.
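
To illustrate what I mean in 2) and 3), here is a minimal userspace
sketch. Everything named below (should_retry(), the progress_on_nid
parameter) is invented for illustration and does not exist in the
kernel; it only models how the retry decision might additionally
require reclaim progress on the requested node when __GFP_THISNODE is
set:

#include <stdbool.h>
#include <stdio.h>

#define __GFP_REPEAT    0x002u  /* illustrative bit values, not the real ones */
#define __GFP_THISNODE  0x010u
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical variant of the should_alloc_retry() logic: in addition
 * to the existing "pages_reclaimed < (1 << order)" test, a
 * __GFP_THISNODE caller only keeps retrying if some of that reclaim
 * progress happened on the node it asked for (progress_on_nid is an
 * invented per-node count).
 */
static bool should_retry(unsigned int gfp_mask, unsigned int order,
                         unsigned long pages_reclaimed,
                         unsigned long progress_on_nid)
{
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                return true;

        /* Reclaim on other nodes cannot help a __GFP_THISNODE request. */
        if ((gfp_mask & __GFP_THISNODE) && progress_on_nid == 0)
                return false;

        return (gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1UL << order);
}

int main(void)
{
        /* A 16M huge page is order 12 with 4K base pages. */
        unsigned int order = 12;
        unsigned int mask = __GFP_THISNODE | __GFP_REPEAT;

        /* Lots of global reclaim progress, none of it on the requested
         * node: the existing check keeps retrying; this sketch gives up. */
        printf("retry: %d\n", should_retry(mask, order, 1000, 0));
        return 0;
}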

I think this situation could be reproduced (and I am working on it) by
exhausting a NUMA node with 16M hugepages and then using the generic
round-robin allocator to ask for more. Other node-exhaustion cases
probably exist, but since we can't swap out hugepages, this seems like
the most straightforward way to try to reproduce it.

Any thoughts on this? Am I way off base?

Thanks,
Nish


