Nodes with no memory

Sat Nov 22 19:58:51 EST 2008

On Sat Nov 22 at 12:17:22 EST in 2008 Dave Hansen wrote:
> On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
>> Dave Hansen wrote:
>>> I was handed off a bug report about a blade not booting with a, um
>>> "newer" kernel.
>>
>> If you're unable to provide basic information such as the kernel
>> version then perhaps this isn't the best forum for discussing this.  :)
>
> Let's just say a derivative of 2.6.27.5.  I will, of course be trying 
> to reproduce on mainline.  I'm just going with the kernel closest to 
> the bug report as I can get for now.

This reminds me.  I was asked to look at a system that had all CPUs and
memory on node 1.  I recently switched to 2.6.27.0, and had a similar
failure when I tried my latest development kernel.  However, I realized
that the user wanted to run my previously supported 2.6.24 kernel, and
that did not have this issue, so I never got back to debugging this
problem.  (Both kernels had similar patches applied, but very little
touching mm or NUMA selection.)  I was able to fix the problem they were
having and returned the machine to them without debugging the issue, but
I suspect the problem was introduced to mainline between 2.6.24 and
2.6.27.

>>> I'm thinking that we need to at least fix careful_allocation() to
>>> oops and not return NULL, or check to make sure all its callers
>>> check its return code.
>>
>> Well, careful_allocation() in current mainline tries pretty hard to
>> panic if it can't satisfy the request.  Why isn't that happening?
>
> I added some random debugging to careful_alloc() to find out.
>
> careful_allocation(1, 7680, 80, 0)
> careful_allocation() ret1: 00000001dffe4100
> careful_allocation() ret2: 00000001dffe4100
> careful_allocation() ret3: 00000001dffe4100
> careful_allocation() ret4: c000000000000000
> careful_allocation() ret5: 0000000000000000
>
> It looks to me like it is hitting 'the memory came from a previously
> allocated node' check.  So, the __lmb_alloc_base() appears to get
> something worthwhile, but that gets overwritten later.
>
> I'm still not quite sure what this comment means.  Are we just trying
> to get node locality from the allocation?

My memory (and a quick look) is that careful_allocation() is used while
we are in the process of creating the memory maps for the node.  We want
them to be allocated from memory on the node, but will accept memory
from any node to handle the case that memory is not available in the
desired node.  Linux requires that the maps exist for every online node.

Because we are in the process of transferring the memory between
allocators, the check for new_nid < nid is meant to say "if the memory
did not come from the preferred node, but instead came from one we
already transferred, then we need to obtain that memory from the new
allocator".  If it came from the preferred node or a later node, the
allocation we did is valid, and will be marked in-use when we transfer
that node's memory.
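
To make that rule concrete, here is a minimal standalone sketch of the
decision.  It is not the kernel code: the helper name is made up for
illustration, and the only assumption is the one stated above, that
nodes are transferred to the new allocator in increasing nid order.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of numa.c.
 *
 * nid     - the node whose maps we are currently allocating
 * new_nid - the node the early allocation actually landed on
 *
 * Assumption: nodes are handed to the new allocator in increasing nid
 * order, so every node below nid has already been transferred.
 */
static bool must_retry_with_new_allocator(int nid, int new_nid)
{
	/*
	 * Memory from an already transferred node is owned by the new
	 * allocator, so the early allocation cannot be kept.  Memory
	 * from nid itself or from a later node is still fine: it gets
	 * marked in-use when that node is transferred.
	 */
	return new_nid < nid;
}
```

With nid = 1, an allocation that landed on node 0 has to be redone,
while one that landed on node 1 or node 2 is kept as is.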

> I also need to go look at how __alloc_bootmem_node() ends up returning
> c000000000000000.  It should be returning NULL, and panic'ing, in
> careful_alloc().  This probably has to do with the fact that
> NODE_DATA() isn't set up, yet, but I'll double check.

We set up NODE_DATA with the result of this alloc in nid order.  If
early_pfn_to_nid returns the wrong value then we would obviously be in
trouble here.
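
As a rough illustration of why the ordering matters, here is a toy model
(all names are invented, none of this is kernel code).  It fills a
per-node table in nid order, the way I described NODE_DATA being set up,
and shows that a lookup is only meaningful for a node that has already
been processed.

```c
#include <stddef.h>

#define TOY_MAX_NODES 4

/* Toy stand-in for the per-node data; purely illustrative. */
struct toy_node_data {
	int nid;
};

static struct toy_node_data toy_storage[TOY_MAX_NODES];
static struct toy_node_data *toy_table[TOY_MAX_NODES]; /* starts all NULL */

/* Set up every node below nid, in increasing order. */
static void toy_setup_nodes_below(int nid)
{
	for (int i = 0; i < nid && i < TOY_MAX_NODES; i++) {
		toy_storage[i].nid = i;
		toy_table[i] = &toy_storage[i];
	}
}

/* Toy lookup: NULL means the node has not been set up yet. */
static struct toy_node_data *toy_lookup(int nid)
{
	if (nid < 0 || nid >= TOY_MAX_NODES)
		return NULL;
	return toy_table[nid];
}
```

While working on nid 1 only node 0 has an entry, so a lookup for any
other node id gives nothing usable.  The real code has no such NULL
check on this path, which is why a wrong node id there is a problem.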

>         /*
>          * If the memory came from a previously allocated node, we must
>          * retry with the bootmem allocator.
>          */
>         new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
>         if (new_nid < nid) {
>                 ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
>                                 size, align, 0);
>                 dbg("careful_allocation() ret4: %016lx\n", ret);
>
>                 if (!ret)
>                         panic("numa.c: cannot allocate %lu bytes on node %d",
>                               size, new_nid);
>
>                 ret = __pa(ret);
>                 dbg("careful_allocation() ret5: %016lx\n", ret);
>
>                 dbg("alloc_bootmem %lx %lx\n", ret, size);
>         }

Perhaps someone can recreate this with the fake NUMA support that was
added since 2.6.24?  Or edit a device tree to fake the NUMA assignments
for memory, then kexec using the modified tree.

milton