Bug in reclaim logic with exhausted nodes?
Nishanth Aravamudan
nacc at linux.vnet.ibm.com
Fri Mar 28 07:33:54 EST 2014
Hi Christoph,
On 25.03.2014 [13:25:30 -0500], Christoph Lameter wrote:
> On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:
>
> > On power, very early, we find the 16G pages (gpages in the powerpc arch
> > code) in the device-tree:
> >
> > early_setup ->
> >   early_init_mmu ->
> >     htab_initialize ->
> >       htab_init_page_sizes ->
> >         htab_dt_scan_hugepage_blocks ->
> >           memblock_reserve
> >             which marks the memory as reserved
> >           add_gpage
> >             which saves the address off so future calls to
> >             alloc_bootmem_huge_page() can use it
> >
> > hugetlb_init ->
> >   hugetlb_init_hstates ->
> >     hugetlb_hstate_alloc_pages ->
> >       alloc_bootmem_huge_page
> >
> > > Not sure if I understand that correctly.
> >
> > Basically, this is present memory that is "reserved" for 16GB-page usage
> > per the LPAR configuration; Linux honors that configuration based on the
> > contents of the device-tree. In the configuration from my original
> > e-mail, the consequence is that a NUMA node has memory (topologically),
> > but none of that memory is free, nor will it ever be.
>
> Well don't do that
>
> > Perhaps, in this case, we could just remove that node from the N_MEMORY
> > mask? Memory allocations will never succeed from the node, and we can
> > never free these 16GB pages. It is really not any different than a
> > memoryless node *except* when you are using the 16GB pages.
>
> That looks to be the correct way to handle things. Maybe mark the node as
> offline or somehow not present so that the kernel ignores it.
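For the N_MEMORY idea, a minimal sketch of the fix-up might look like the
following (node_exhausted_by_gpages() is a made-up placeholder; the real
test would have to come from the powerpc gpage-reservation code):

	/*
	 * Hypothetical: if all of a node's memory was reserved at boot
	 * for 16GB gpages, drop the node from N_MEMORY so allocators
	 * and reclaim ignore it.
	 */
	if (node_exhausted_by_gpages(nid))	/* made-up predicate */
		node_clear_state(nid, N_MEMORY);

node_clear_state() and N_MEMORY already exist; the question is where in
the boot sequence it would be safe to clear the bit.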
For reference, this is the SLUB condition I hit:
mm/slub.c::early_kmem_cache_node_alloc():
	...
	page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
	...
	if (page_to_nid(page) != node) {
		printk(KERN_ERR "SLUB: Unable to allocate memory from "
				"node %d\n", node);
		printk(KERN_ERR "SLUB: Allocating a useless per node structure "
				"in order to be able to continue\n");
	}
	...
Since this runs quite early, before the nodemasks have been set up, does
it make sense to have a temporary init-time nodemask that we set bits in
here, and then "fix up" those nodes when we set up the nodemasks?
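Something along these lines is what I have in mind. This is only an
untested sketch; the slub_exhausted_nodes name and the fixup function are
made up for illustration:

	/* init-time only: nodes whose memory was exhausted before SLUB
	 * could allocate a per-node structure from them */
	static nodemask_t slub_exhausted_nodes __initdata = NODE_MASK_NONE;

	mm/slub.c::early_kmem_cache_node_alloc():
		...
		if (page_to_nid(page) != node) {
			/* remember the node for the later fix-up */
			node_set(node, slub_exhausted_nodes);
			...
		}
		...

	/* later, once the node state masks are otherwise set up */
	static void __init fixup_exhausted_nodes(void)
	{
		int node;

		for_each_node_mask(node, slub_exhausted_nodes)
			node_clear_state(node, N_MEMORY);
	}

Whether that fix-up can be ordered correctly against the rest of the
nodemask setup is exactly the part I'm unsure about.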
Thanks,
Nish