[RFC 1/2] workqueue: use the nearest NUMA node, not the local one

Nish Aravamudan nish.aravamudan at gmail.com
Sat Jul 19 03:33:41 EST 2014


[ Apologies for replying from a different address; we have a service outage
at work. ]

On Fri, Jul 18, 2014 at 1:11 AM, Lai Jiangshan <laijs at cn.fujitsu.com> wrote:
>
> Hi,

Thank you for your response!

> I'm curious about what will happen when alloc_pages_node(memoryless_node)
> is called.

alloc_pages_node() is only involved in one of the possible paths (perhaps on
x86, where THREAD_SIZE > PAGE_SIZE?). On powerpc, though, that is not the
path taken.

Details:

1. pool->node is used in the invocation of kthread_create_on_node() in
create_worker().
2. kthread_create_on_node sets up a struct kthread_create_info with
create->node = node and wakes up kthreadd.
3. kthreadd calls create_kthread, which sets current->pref_node_fork =
create->node.
4. dup_task_struct() calls node = tsk_fork_get_node() before invoking
alloc_task_struct_node(node) and alloc_thread_info_node(node).
5. tsk_fork_get_node() returns current->pref_node_fork for kthreadd.
6. alloc_task_struct_node() calls kmem_cache_alloc_node(,GFP_KERNEL, node)
7. alloc_thread_info_node() calls either kmem_cache_alloc_node(,GFP_KERNEL,
node) or alloc_kmem_pages_node(node,GFP_KERNEL,), depending on how
THREAD_SIZE compares to PAGE_SIZE.
8a. alloc_kmem_pages_node() -> alloc_pages_node -> __alloc_pages with a
zonelist built from node_zonelist. This should lead to proper fallback.
8b. kmem_cache_alloc_node() calls slab_alloc_node()
9. For a memoryless node, we will trigger the following:

        if (unlikely(!object || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
                stat(s, ALLOC_SLOWPATH);
        }

10. __slab_alloc() in turn will:

        if (unlikely(!node_match(page, node))) {
                stat(s, ALLOC_NODE_MISMATCH);
                deactivate_slab(s, page, c->freelist);
                c->page = NULL;
                c->freelist = NULL;
                goto new_slab;
        }

deactivating the per-cpu slab. Thus, every kthread created with a node
specification ends up as a single object on its own slab, and we see an
explosion in slab consumption, all of it unreclaimable.
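
For reference, the reason the requested node can never match is visible in
node_match() itself, which in mm/slub.c of that era looks roughly like this
(paraphrased from memory, so double-check against your tree):

        static inline int node_match(struct page *page, int node)
        {
        #ifdef CONFIG_NUMA
                /* A page's nid can never equal a node that has no pages. */
                if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
                        return 0;
        #endif
                return 1;
        }

Since a memoryless node owns no pages, page_to_nid() can never return it, so
every allocation that names such a node takes the mismatch path and
deactivates the per-cpu slab.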

Anton originally proposed not deactivating slabs when we *know* the
allocation will be remote (i.e., when the requested node is memoryless).
Joonsoo and Christoph disagreed and proposed alternative solutions, but no
approach was agreed upon at the time.
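
To illustrate the shape of that idea (this is only a sketch of the approach,
not Anton's actual patch; node_state()/N_MEMORY are simply the checks I would
reach for), the __slab_alloc() hunk above could avoid the deactivation when
the requested node is known to be memoryless:

        if (unlikely(!node_match(page, node))) {
                if (node != NUMA_NO_NODE && !node_state(node, N_MEMORY)) {
                        /*
                         * The requested node has no memory, so any page is
                         * equally "remote"; keep the current per-cpu slab
                         * instead of deactivating it.
                         */
                        node = NUMA_NO_NODE;
                } else {
                        stat(s, ALLOC_NODE_MISMATCH);
                        deactivate_slab(s, page, c->freelist);
                        c->page = NULL;
                        c->freelist = NULL;
                        goto new_slab;
                }
        }

The patch in this series sidesteps that debate by fixing the hint at the
caller instead.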

> If the memory is allocated from the most preferable node for the @memoryless_node,
> why do we need to bother with cpu_to_mem() at the call site?

The node passed in is a hint to the MM subsystem about which node we want
memory to come from. Actually, let me refine that: semantically, there are
two ways to interpret the node parameter:

1) The NUMA node we want memory from
2) The NUMA node we expect memory from

The path through the MM above conflates the two: the caller makes an
impossible request for memory from a node that has none, and while the MM
subsystem technically knows the node is memoryless, it does not use that
knowledge at this point.

We could, and should, change the core MM to handle memoryless nodes better,
but this change is far less invasive and does the right thing. Semantically,
I think the workqueue's pool->node is meant to be the node from which we want
memory allocated, which is the node with memory closest to the CPU.
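
For completeness, the difference between the two helpers is roughly the
following (as I recall include/linux/topology.h; the exact form may differ in
your tree):

        #ifdef CONFIG_HAVE_MEMORYLESS_NODES
        /* Nearest node with memory, cached per-cpu by the arch/MM code. */
        static inline int cpu_to_mem(int cpu)
        {
                return per_cpu(_numa_mem_, cpu);
        }
        #else
        /* Without memoryless nodes this is just cpu_to_node(). */
        static inline int cpu_to_mem(int cpu)
        {
                return cpu_to_node(cpu);
        }
        #endif

So on configurations without memoryless nodes the patch is a no-op, and on
those with them it turns the hint into a node the allocator can actually
satisfy.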

Thanks,
Nish

> If not, why does the memory allocation subsystem refuse to find a preferable
> node for @memoryless_node in this case? Is that intentional, or can it not
> find one in some cases?
>
> Thanks,
> Lai
>
> Added CC to Tejun (workqueue maintainer).
>
> On 07/18/2014 07:09 AM, Nishanth Aravamudan wrote:
> > In the presence of memoryless nodes, the workqueue code incorrectly uses
> > cpu_to_node() to determine what node to prefer memory allocations come
> > from. cpu_to_mem() should be used instead, which will use the nearest
> > NUMA node with memory.
> >
> > Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com>
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 35974ac..0bba022 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -3547,7 +3547,12 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
> >               for_each_node(node) {
> >                       if (cpumask_subset(pool->attrs->cpumask,
> >                                          wq_numa_possible_cpumask[node])) {
> > -                             pool->node = node;
> > +                             /*
> > +                              * We could use local_memory_node(node) here,
> > +                              * but it is expensive and the following caches
> > +                              * the same value.
> > +                              */
> > +                             pool->node = cpu_to_mem(cpumask_first(pool->attrs->cpumask));
> >                               break;
> >                       }
> >               }
> > @@ -4921,7 +4926,7 @@ static int __init init_workqueues(void)
> >                       pool->cpu = cpu;
> >                       cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
> >                       pool->attrs->nice = std_nice[i++];
> > -                     pool->node = cpu_to_node(cpu);
> > +                     pool->node = cpu_to_mem(cpu);
> >
> >                       /* alloc pool ID */
> >                       mutex_lock(&wq_pool_mutex);
> >