Bug in reclaim logic with exhausted nodes?

Tue Apr 1 12:33:46 EST 2014

On 29.03.2014 [00:40:41 -0500], Christoph Lameter wrote:
> On Thu, 27 Mar 2014, Nishanth Aravamudan wrote:
> 
> > > That looks to be the correct way to handle things. Maybe mark the node as
> > > offline or somehow not present so that the kernel ignores it.
> >
> > This is a SLUB condition:
> >
> > mm/slub.c::early_kmem_cache_node_alloc():
> > ...
> >         page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
> > ...
> 
> So the page allocation from the node failed. We have a strange boot
> condition where the OS is aware of anode but allocations on that node
> fail.

Yep. The node exists, it's just fully exhausted at boot (due to the
presence of 16GB pages reserved at boot-time).

>  >         if (page_to_nid(page) != node) {
> >                 printk(KERN_ERR "SLUB: Unable to allocate memory from "
> >                                 "node %d\n", node);
> >                 printk(KERN_ERR "SLUB: Allocating a useless per node structure "
> >                                 "in order to be able to continue\n");
> >         }
> > ...
> >
> > Since this is quite early, and we have not set up the nodemasks yet,
> > does it make sense to perhaps have a temporary init-time nodemask that
> > we set bits in here, and "fix-up" those nodes when we setup the
> > nodemasks?
> 
> Please take care of this earlier than this. The page allocator in
> general should allow allocations from all nodes with memory during
> boot,

I'd appreciate a bit more guidance? I'm suggesting that in this case the
node functionally has no memory. So the page allocator should not allow
allocations from it -- except (I need to investigate this still)
userspace accessing the 16GB pages on that node, but that, I believe,
doesn't go through the page allocator at all, it's all from hugetlb
interfaces. It seems to me there is a bug in SLUB that we are noting
that we have a useless per-node structure for a given nid, but not
actually preventing requests to that node or reclaim because of those
allocations.

The page allocator is actually fine here, afaict. We've pulled out
memory from this node, even though it's present, so none is free. All of
that is working as expected, based upon the issue we've seen. The
problems start when we "force" (by way of a round-robin page allocation
request from /proc/sys/vm/nr_hugepages) a THISNODE allocation to come
from the exhausted node, which has no memory free, causing reclaim,
which progresses on other nodes, and thus never alleviates the
allocation failure (and can't).

I think there is a logical bug (even if it only occurs in this
particular corner case) where if reclaim progresses for a THISNODE
allocation, we don't check *where* the reclaim is progressing, and thus
may falsely be indicating that we have done some progress when in fact
the allocation that is causing reclaim will not possibly make any more
progress.

Thanks,
Nish