[PATCH] Fix fake numa on ppc

David Rientjes rientjes at google.com
Wed Sep 2 15:58:41 EST 2009


On Wed, 2 Sep 2009, Ankita Garg wrote:

> > > With the patch,
> > > 
> > > # cat /proc/cmdline
> > > root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > > # cat /sys/devices/system/node/node0/cpulist
> > > 0-3
> > > # cat /sys/devices/system/node/node1/cpulist
> > > 
> > 
> > Oh! interesting.. cpuless nodes :) I think we need to fix this in the
> > longer run and distribute cpus between fake numa nodes of a real node
> > using some acceptable heuristic.
> >
> 
> True. Presently this is broken on both x86 and ppc systems. It would be
> interesting to find a way to map, for example, 4 cpus to more than 4
> fake nodes created from a single real NUMA node!
>  

We've done it for years on x86_64.  It's quite trivial to map all fake 
nodes within a physical node to the cpus to which they have affinity, via 
both node_to_cpumask_map() and cpu_to_node_map().  There should be no 
kernel-space dependencies on a cpu appearing in only a single node's 
cpumask, and if you map each fake node to its physical node's pxm, you can 
index into the SLIT and generate local NUMA distances amongst fake nodes.

So if you map the apicids and pxms appropriately depending on the 
physical topology of the machine, that is the only emulation necessary on 
x86_64 for the page allocator zonelist ordering, task migration, etc.  (If 
you use CONFIG_SLAB, you'll need to avoid the exponential growth of alien 
caches, but that's an implementation detail and isn't really within the 
scope of what numa=fake is meant to modify.)


More information about the Linuxppc-dev mailing list