libnuma interleaving oddness

Nishanth Aravamudan nacc at
Thu Aug 31 16:00:36 EST 2006

On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > I took out the mlock() call, and I get the same results, FWIW.
> What zones are available on your box? Any with HIGHMEM?

How do I tell the available zones from userspace? This is ppc64 with
about 64GB of memory total, it looks like. So, none of the nodes
(according to /sys/devices/system/node/*/meminfo) have highmem.

> Also what kernel version are we talking about? Before 2.6.18?

The SuSE default, -- I thought I mentioned that in one of my
replies, sorry.

Tim and I spent most of this afternoon debugging the huge_zonelist()
callpath with kprobes and jprobes. We found the following via a jprobe
to offset_li_node():

jprobe: vma=0xc000000006dc2d78, pol->policy=0x3, pol->v.nodes=0xff, off=0x0
jprobe: vma=0xc00000000f247e30, pol->policy=0x3, pol->v.nodes=0xff, off=0x1000
jprobe: vma=0xc000000006dbf648, pol->policy=0x3, pol->v.nodes=0xff, off=0x2000
jprobe: vma=0xc00000000f298870, pol->policy=0x3, pol->v.nodes=0xff, off=0x17000
jprobe: vma=0xc00000000f298368, pol->policy=0x3, pol->v.nodes=0xff, off=0x18000

So, it's quite clear that the nodemask is set appropriately and so is
the policy. The problem, in fact, is the "offset" being passed into

The problem, I think, is from interleave_nid():

	off = vma->vm_pgoff;
	off += (addr - vma->vm_start) >> shift;
	return offset_il_node(pol, vma, off);

For hugetlbfs vma's, since vm_pgoff is in units of small pages, the lower
(HPAGE_SHIFT - PAGE_SHIFT) bits of vma->vm_pgoff and off will always be zero
(12 in this case).  Thus, when we get into offset_li_node():

        unsigned nnodes = nodes_weight(pol->v.nodes);
        unsigned target = (unsigned)off % nnodes;
        int c;
        int nid = -1;

        c = 0;
        do {
                nid = next_node(nid, pol->v.nodes);
        } while (c <= target);
        return nid;

nnodes is 8 (the number of nodes). Our offset (some multiple of 4096) is
always going to be evenly divided by 8. So, our target node is always
node 0! Note, that when we took out a bit in our nodemask, nnodes
changed accordingly and 7 did not evenly divide the offset, and we got
interleaving as expected.

To test my hypothesis (my analysis may be a bit hand-wavy, sorry), I
changed interleave_nid() to shift off right by (HPAGE_SHIFT -
PAGE_SHIFT) only #if CONFIG_HUGETLB_PAGE. This fixes the behavior for
the page-by-page case. But I'm not sure this is an acceptable mainline
change, but I've included my signed-off-but-not-for-inclusion patch.

Note, that when I try this with my testcase that makes each allocation
be 4 hugepages large, I get 4 hugepages on node 0, then 4 on node 4,
then 4 on node 0, and so on. I believe this is because the offset ends
up being the same for all of the 4 hugepages in each set, so they go to
the same node

Many thanks to Tim for his help debugging.


Once again, not for inclusion!

Signed-off-by: Nishanth Aravamudan <nacc at>

diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c	2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c	2006-08-30 22:56:43.000000000 -0700
@@ -1169,6 +1169,7 @@ static unsigned offset_il_node(struct me
 	return nid;
 /* Determine a node number for interleave */
 static inline unsigned interleave_nid(struct mempolicy *pol,
 		 struct vm_area_struct *vma, unsigned long addr, int shift)
@@ -1182,8 +1183,22 @@ static inline unsigned interleave_nid(st
 	} else
 		return interleave_nodes(pol);
+/* Determine a node number for interleave */
+static inline unsigned interleave_nid(struct mempolicy *pol,
+		 struct vm_area_struct *vma, unsigned long addr, int shift)
+	if (vma) {
+		unsigned long off;
+		off = vma->vm_pgoff;
+		off += (addr - vma->vm_start) >> shift;
+		off >>= (HPAGE_SHIFT - PAGE_SHIFT);
+		return offset_il_node(pol, vma, off);
+	} else
+		return interleave_nodes(pol);
 /* Return a zonelist suitable for a huge page allocation. */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)

Nishanth Aravamudan <nacc at>
IBM Linux Technology Center

More information about the Linuxppc-dev mailing list