[RFC/PATCH] numa: distinguish associativity domain from node id

Nathan Lynch ntl at pobox.com
Thu Apr 7 11:37:05 EST 2005


On Thu, Apr 07, 2005 at 10:15:19AM +1000, Anton Blanchard wrote:
> 
> > The ppc64 numa code makes some possibly invalid assumptions about the
> > numbering of "associativity domains" (which may be considered NUMA
> > nodes).  As far as I've been able to determine from the architecture
> > docs, there is no guarantee about the numbering of associativity
> > domains, i.e. the values that are contained in ibm,associativity
> > device node properties.  Yet we seem to assume that the numbering of
> > the domains begins at zero and that the range is contiguous, and we
> > use the domain number for a given resource as its logical node id.
> > This strikes me as a problem waiting to happen, and in fact I've been
> > seeing some problems in the lab with larger machines violating or at
> > least straining these assumptions.
> 
> I'm reluctant to have a mapping between the Linux concept of a node and
> the firmware concept if possible. It's nice to be able to jump on a
> machine and determine if it is set up correctly by looking at sysfs and
> /proc/device-tree.

Ok...  just throwing out an idea here: what if we added an attribute to
the node sysdevs that exposes the firmware domain number?
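
Something like this, perhaps (completely untested sketch --
numa_node_to_domain() is made up, and it assumes it sits next to the
node_devices[] array in arch/ppc64/kernel/sysfs.c):

static ssize_t node_read_fw_domain(struct sys_device *dev, char *buf)
{
	struct node *node = container_of(dev, struct node, sysdev);
	int nid = node - node_devices;	/* index into node_devices[] */

	/* numa_node_to_domain() is hypothetical: logical node -> domain */
	return sprintf(buf, "%d\n", numa_node_to_domain(nid));
}
static SYSDEV_ATTR(fw_domain, S_IRUGO, node_read_fw_domain, NULL);

plus a sysdev_create_file(&node_devices[nid].sysdev, &attr_fw_domain)
when the nodes are registered.  Then you could still eyeball the
firmware numbering from sysfs even if the logical node ids diverge
from it.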

> > Consider one such case: the associativity domain for all memory in a
> > partition is 0x1, but the processors are in shared mode (so no
> > associativity info for them) -- all the memory is placed in node 1
> > while all cpus are mapped to node 0.  But in this case, we should
> > really have only one logical node, with all memory and cpus mapped to
> > it.
> 
> Even in shared processor mode it makes sense to have separate memory
> nodes so we can still do striping across memory controllers. For the
> shared processor case where all our memory is in one node that isn't
> zero, perhaps we could just stuff all the cpus in that node at boot.
> When we support memory hotplug, we then add new nodes as normal. New
> cpus go into the node we chose at boot.

OK.
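
Roughly what I picture for that (again just a sketch, not tested --
map_cpu_to_node() is the existing helper in arch/ppc64/mm/numa.c, and
"first online node with memory" is only one way to pick the boot-time
default):

static int default_nid;

static void __init map_shared_cpus(void)
{
	int cpu, nid;

	/* default to node 0, but prefer the first online node with memory */
	default_nid = 0;
	for_each_online_node(nid) {
		if (NODE_DATA(nid)->node_present_pages) {
			default_nid = nid;
			break;
		}
	}

	/* shared cpus have no associativity info; park them all here */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_possible(cpu))
			map_cpu_to_node(cpu, default_nid);
}

Hot-added cpus would then get mapped to default_nid as well, per your
suggestion.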

> 
> > Another case I've seen is that of a partition with all processors and
> > memory having an associativity domain of 0x1.  We end up with
> > everything in node 1 and an empty (yet online) node 0.
> 
> I saw some core changes go in recently that may allow us to have
> discontiguous node numbers. I agree onlining all nodes from 0 up to the
> max node is pretty ugly, but perhaps that's fixable. Also, with hot
> memory unplug we are going to end up with holes.

The nodemap stuff, I assume.  I'll look into whether we can get away
with discontiguous online node numbers.
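
For the record, what I'd hope to end up with is something along these
lines (sketch only; node_has_resources() and init_node_data_for() are
made-up placeholders for the existing device tree scan and per-node
setup):

static void __init online_found_nodes(void)
{
	int nid;

	/* online only the domains we actually saw in the device tree */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (node_has_resources(nid))	/* hypothetical check */
			node_set_online(nid);

	/* everything that walks nodes afterwards skips the holes */
	for_each_online_node(nid)
		init_node_data_for(nid);	/* hypothetical per-node init */
}

i.e. only the domains we actually found get onlined, and later walks
use for_each_online_node() so holes in the numbering shouldn't matter.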

> The main problem with not doing a mapping is if firmware decides to
> exceed the maximum node number (we have it set to 16 at the moment).

We may need to bump that up.  If I'm interpreting the
ibm,max-associativity-domains property correctly, it should be 32.
This is from a box which (I think) can have only two domains' worth of
cpus and memory.  I guess all the extra is there to account for I/O
that could be added to the system?  Unlike ibm,lrdr-capacity, this
property doesn't seem to be affected by the partition profile settings.

# od -x /proc/device-tree/rtas/ibm,max-associativity-domains 
0000000 0000 0005 0000 0001 0000 0001 0000 0020
0000020 0000 0020 0000 0040                ^^^^
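
Just to make my interpretation concrete, here's roughly how I'd expect
we could read it at boot if we wanted to size the node map from
firmware instead of hard-coding 16 (untested sketch; it uses the
existing get_property()/of_find_node_by_path() helpers and assumes the
first cell is the number of levels followed by one maximum per level,
which is how I'm reading the dump above):

static int __init max_domains_from_firmware(int depth)
{
	struct device_node *rtas;
	unsigned int *prop;
	int len, max = 0;

	rtas = of_find_node_by_path("/rtas");
	if (!rtas)
		return 0;

	prop = (unsigned int *)get_property(rtas,
			"ibm,max-associativity-domains", &len);

	/* prop[0] is the number of levels, prop[1..n] the max per level */
	if (prop && depth >= 1 && depth <= prop[0] &&
	    len >= (depth + 1) * (int)sizeof(unsigned int))
		max = prop[depth];

	of_node_put(rtas);
	return max;
}

where "depth" would be whichever associativity level the numa code
already uses for its lookups.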


Thanks,
Nathan


