[PATCH v3] topology: add support for node_to_mem_node() to determine the fallback node

Nishanth Aravamudan nacc at linux.vnet.ibm.com
Wed Sep 10 10:47:23 EST 2014


On 09.09.2014 [17:11:15 -0700], Andrew Morton wrote:
> On Tue, 9 Sep 2014 12:03:27 -0700 Nishanth Aravamudan <nacc at linux.vnet.ibm.com> wrote:
> 
> > From: Joonsoo Kim <iamjoonsoo.kim at lge.com>
> > 
> > We need to determine the fallback node in the slub allocator when the
> > allocation target node is a memoryless node. Without it, SLUB wrongly
> > selects a node which has no memory and then can't use a partial slab,
> > because of the node mismatch. The introduced function,
> > node_to_mem_node(X), returns the node Y with memory that is nearest
> > to X: if X is a memoryless node, it returns the nearest node with
> > memory; if X is a normal node, it returns X itself.
> > 
> > We will use this function in a following patch to determine the
> > fallback node.
> > 
> > ...
> >
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -119,11 +119,20 @@ static inline int numa_node_id(void)
> >   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> 
> This comment could be updated.

Will do. Do you prefer a follow-on patch or one that replaces this one?

> >   */
> >  DECLARE_PER_CPU(int, _numa_mem_);
> > +extern int _node_numa_mem_[MAX_NUMNODES];
> >  
> >  #ifndef set_numa_mem
> >  static inline void set_numa_mem(int node)
> >  {
> >  	this_cpu_write(_numa_mem_, node);
> > +	_node_numa_mem_[numa_node_id()] = node;
> > +}
> > +#endif
> > +
> > +#ifndef node_to_mem_node
> > +static inline int node_to_mem_node(int node)
> > +{
> > +	return _node_numa_mem_[node];
> >  }
> 
> A wee bit of documentation wouldn't hurt.
> 
> How does node_to_mem_node(numa_node_id()) differ from numa_mem_id()? 
> If I'm reading things correctly, they should both always return the
> same thing.  If so, do we need both?

That seems correct to me. The nearest memory node of this CPU's NUMA
node (node_to_mem_node(numa_node_id())) is always equal to the nearest
memory node of this CPU (numa_mem_id()).
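
To illustrate (a sketch only, not from the patch; the variable names
are mine):

	/*
	 * Sketch: with set_numa_mem() now updating both the per-cpu
	 * and the per-node table, these two lookups should agree on
	 * the executing CPU.
	 */
	int via_cpu  = numa_mem_id();                    /* per-cpu _numa_mem_ */
	int via_node = node_to_mem_node(numa_node_id()); /* _node_numa_mem_[] */
	WARN_ON(via_cpu != via_node);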

> Will node_to_mem_node() ever actually be called with a node !=
> numa_node_id()?

Well, it's a layering problem. The eventual callers of
node_to_mem_node() only have the requested NUMA node (if any)
available. Because get_partial() and __slab_alloc() allow allocations
for any node, and that's where we see the slab deactivation issues, I
think we need to support this in the API.

In practice, the node parameter is probably often numa_node_id(), but
afaict we can't be sure of that in these call-paths.
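
As a rough sketch of what the follow-on slub change has in mind (this
is my illustration, not the actual mm/slub.c hunk), a caller handed an
arbitrary node would redirect memoryless nodes before touching any
per-node state:

	/*
	 * Sketch, not the real hunk: redirect a memoryless requested
	 * node to its nearest node with memory before consulting the
	 * per-node partial lists, so we don't deactivate slabs on a
	 * pointless node mismatch.
	 */
	static void *get_partial_sketch(struct kmem_cache *s, gfp_t flags,
					int node)
	{
		int searchnode = node;

		if (node == NUMA_NO_NODE)
			searchnode = numa_mem_id();
		else if (!node_present_pages(node))
			searchnode = node_to_mem_node(node);

		/* ... search get_node(s, searchnode)->partial here ... */
		return NULL;
	}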
 
> >  #endif
> >  
> > @@ -146,6 +155,7 @@ static inline int cpu_to_mem(int cpu)
> >  static inline void set_cpu_numa_mem(int cpu, int node)
> >  {
> >  	per_cpu(_numa_mem_, cpu) = node;
> > +	_node_numa_mem_[cpu_to_node(cpu)] = node;
> >  }
> >  #endif
> >  
> > @@ -159,6 +169,13 @@ static inline int numa_mem_id(void)
> >  }
> >  #endif
> >  
> > +#ifndef node_to_mem_node
> > +static inline int node_to_mem_node(int node)
> > +{
> > +	return node;
> > +}
> > +#endif
> > +
> >  #ifndef cpu_to_mem
> >  static inline int cpu_to_mem(int cpu)
> >  {
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 18cee0d4c8a2..0883c42936d4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -85,6 +85,7 @@ EXPORT_PER_CPU_SYMBOL(numa_node);
> >   */
> >  DEFINE_PER_CPU(int, _numa_mem_);		/* Kernel "local memory" node */
> >  EXPORT_PER_CPU_SYMBOL(_numa_mem_);
> > +int _node_numa_mem_[MAX_NUMNODES];
> 
> How does this get updated as CPUs, memory and nodes are hot-added and
> removed?

As CPUs are added, the architecture code in the CPU bringup will update
the NUMA topology. Memory and node hotplug are still open issues; I
mentioned the former in the cover letter, and should have mentioned it
in this commit message as well.

I do notice that Lee's commit message from 7aac78988551 ("numa:
introduce numa_mem_id()- effective local memory node id") says:

"Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
This will initialize the boot cpu at boot time, and all cpus on change
of numa_zonelist_order, or when node or memory hot-plug requires
zonelist rebuild.  Archs that support memoryless nodes will need to
initialize 'numa_mem' for secondary cpus as they're brought on-line."

And since we update the _node_numa_mem_ value in set_cpu_numa_mem(),
which was already needed for numa_mem_id(), we might be covered.
Testing these (hotplug) cases is next in my plans.
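
For the CPU-online path, I'd expect the arch hook to look roughly like
this (a sketch of the idea, not the powerpc hunk from this series):

	/*
	 * Sketch: on secondary-CPU bringup, record the nearest node
	 * with memory for this CPU.  With this patch, set_numa_mem()
	 * also refreshes _node_numa_mem_[] for the CPU's own node, so
	 * node_to_mem_node() stays coherent across CPU hotplug.
	 */
	static void sketch_record_local_mem_node(void)
	{
		set_numa_mem(local_memory_node(numa_node_id()));
	}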

Thanks,
Nish


