Nodes with no memory
Milton Miller
miltonm at bga.com
Sat Nov 22 19:58:51 EST 2008
On Sat Nov 22 12:17:22 EST 2008, Dave Hansen wrote:
> On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
>> Dave Hansen wrote:
>>> I was handed off a bug report about a blade not booting with a, um
>>> "newer" kernel.
>>
>> If you're unable to provide basic information such as the kernel
>> version then perhaps this isn't the best forum for discussing this.
>> :)
>
> Let's just say a derivative of 2.6.27.5. I will, of course be trying
> to reproduce on mainline. I'm just going with the kernel closest to
> the bug report as I can get for now.
This reminds me. I was asked to look at a system that had all of its
cpus and memory on node 1. I had recently switched to 2.6.27.0, and hit
a similar failure when I tried my latest development kernel. However, I
realized that the user wanted to run my previously supported 2.6.24
kernel, which did not have this issue, so I never got back to debugging
it. (Both kernels had similar patches applied, but very little touching
mm or numa selection.) I fixed the problem they were actually having and
returned the machine to them, but I suspect the boot failure was
introduced to mainline between 2.6.24 and 2.6.27.
>>> I'm thinking that we need to at least fix careful_allocation() to oops
>>> and not return NULL, or check to make sure all its callers check its
>>> return code.
>>
>> Well, careful_allocation() in current mainline tries pretty hard to
>> panic if it can't satisfy the request. Why isn't that happening?
>
> I added some random debugging to careful_alloc() to find out.
>
> careful_allocation(1, 7680, 80, 0)
> careful_allocation() ret1: 00000001dffe4100
> careful_allocation() ret2: 00000001dffe4100
> careful_allocation() ret3: 00000001dffe4100
> careful_allocation() ret4: c000000000000000
> careful_allocation() ret5: 0000000000000000
>
> It looks to me like it is hitting 'the memory came from a previously
> allocated node' check. So, the __lmb_alloc_base() appears to get
> something worthwhile, but that gets overwritten later.
>
> I'm still not quite sure what this comment means. Are we just trying to
> get node locality from the allocation?
My memory (and a quick look) is that careful_allocation() is used while
we are in the process of creating the memory maps for the node. We want
them to be allocated from memory on that node, but will accept memory
from any node to handle the case where memory is not available on the
desired node. Linux requires that the maps exist for every online node.
Because we are in the process of transferring the memory between
allocators, the check for new_nid < nid is meant to say "if the memory
did not come from the preferred node, but instead came from one we
already transferred, then we need to obtain that memory from the new
allocator". If it came from the preferred node or a later node, the
allocation we did is valid, and will be marked in-use when we transfer
that node's memory.
> I also need to go look at how __alloc_bootmem_node() ends up returning
> c000000000000000. It should be returning NULL, and panic'ing, in
> careful_alloc(). This probably has to do with the fact that NODE_DATA()
> isn't set up, yet, but I'll double check.
We set up NODE_DATA with the result of this alloc in nid order. If
early_pfn_to_nid returns the wrong value then we would obviously be in
trouble here.
>     /*
>      * If the memory came from a previously allocated node, we must
>      * retry with the bootmem allocator.
>      */
>     new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
>     if (new_nid < nid) {
>         ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
>                 size, align, 0);
>         dbg("careful_allocation() ret4: %016lx\n", ret);
>
>         if (!ret)
>             panic("numa.c: cannot allocate %lu bytes on node %d",
>                   size, new_nid);
>
>         ret = __pa(ret);
>         dbg("careful_allocation() ret5: %016lx\n", ret);
>
>         dbg("alloc_bootmem %lx %lx\n", ret, size);
>     }
Perhaps someone can recreate this with the fake numa stuff that was
added since 2.6.24? Or edit a device tree to fake the numa
assignments for memory and kexec using the modified tree.
milton