[RFC 0/2] Memoryless nodes and kworker

Sat Jul 19 03:42:29 EST 2014

Hi Tejun,

On Fri, Jul 18, 2014 at 4:20 AM, Tejun Heo <tj at kernel.org> wrote:
>
> On Thu, Jul 17, 2014 at 04:09:23PM -0700, Nishanth Aravamudan wrote:
> > [Apologies for the large Cc list, but I believe we have the following
> > interested parties:
> >
> > x86 (recently posted memoryless node support)
> > ia64 (existing memoryless node support)
> > ppc (existing memoryless node support)
> > previous discussion of how to solve Anton's issue with slab usage
> > workqueue contributors/maintainers]
>
> Well, you forgot to cc me.

Ah I'm very sorry! That's what I get for editing e-mails... Thank you for
your reply!

> ...
> > It turns out we see this large slab usage due to using the wrong NUMA
> > information when creating kthreads.
> >
> > Two changes are required, one of which is in the workqueue code and one
> > of which is in the powerpc initialization. Note that ia64 may want to
> > consider something similar.
>
> Wasn't there a thread on this exact subject a few weeks ago?  Was that
> someone else?  Memory-less node detail leaking out of allocator proper
> isn't a good idea.  Please allow allocator users to specify the nodes
> they're on and let the allocator layer deal with mapping that to
> whatever is appropriate.  Please don't push that to everybody.

I haven't sent anything for the workqueue logic recently. Jiang sent out a
patchset for x86 memoryless node support, which may have touched
kernel/workqueue.c.

So, to be clear, this is not *necessarily* about memoryless nodes. It's
about the intended semantics. The workqueue code currently calls
cpu_to_node() in a few places, and passes that node into the core MM as a
hint about where the memory should come from. However, when memoryless
nodes are present, that hint is guaranteed to be wrong for any CPU on such
a node, as it's the nearest NUMA node to the CPU (which happens to be the
one it's on), not the nearest NUMA node with memory. The hint is correctly
specified as cpu_to_mem(), which does the right thing in the presence or
absence of memoryless nodes. I also think it encapsulates the hint's
semantics correctly -- please give me memory from where I expect it, which
is the closest NUMA node.
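
To make the distinction concrete, here is a minimal userspace sketch. It is
not kernel code: the two-node topology, the tables and the toy_* helpers are
all made up for illustration, and only the idea (the node a CPU sits on
versus the nearest node that has memory) comes from the paragraph above.

```c
/*
 * Toy model only. Assumed layout: node 0 is memoryless, node 1 has
 * memory; CPUs 0-1 sit on node 0, CPUs 2-3 sit on node 1.
 */
#include <stdbool.h>

#define TOY_NR_CPUS  4
#define TOY_NR_NODES 2

static const bool toy_node_has_memory[TOY_NR_NODES] = { false, true };
static const int toy_cpu_node[TOY_NR_CPUS] = { 0, 0, 1, 1 };

/* Nearest node with memory for each node (hypothetical, precomputed). */
static const int toy_nearest_mem_node[TOY_NR_NODES] = { 1, 1 };

/* Analogue of cpu_to_node(): the node the CPU is on, memory or not. */
int toy_cpu_to_node(int cpu)
{
	return toy_cpu_node[cpu];
}

/* Analogue of cpu_to_mem(): the nearest node that actually has memory. */
int toy_cpu_to_mem(int cpu)
{
	int node = toy_cpu_node[cpu];

	return toy_node_has_memory[node] ? node : toy_nearest_mem_node[node];
}
```

In this model the two helpers agree for CPUs 2 and 3, whose node has memory,
and differ for CPUs 0 and 1, where toy_cpu_to_node() names a node that
cannot satisfy the allocation.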

I guess we could also change tsk_fork_get_node() to return
local_memory_node(tsk->pref_node_fork), but that can be a bit expensive, as
it generates a new zonelist each time to determine the first fallback node.
We get the exact same semantics (because cpu_to_mem() caches the result of
local_memory_node()) by using cpu_to_mem() directly.
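
The cost argument can be sketched the same way. Again this is a userspace
toy under assumed names (the demo_* helpers, the distance table and the
lookup counter are hypothetical): the slow lookup stands in for
local_memory_node(), and the per-CPU table stands in for the cached value
that cpu_to_mem() returns.

```c
/* Toy model only: node 0 is memoryless, node 1 has memory. */
#define DEMO_NR_CPUS  4
#define DEMO_NR_NODES 2

static const int demo_cpu_node[DEMO_NR_CPUS] = { 0, 0, 1, 1 };
static const int demo_node_has_memory[DEMO_NR_NODES] = { 0, 1 };
static const int demo_distance[DEMO_NR_NODES][DEMO_NR_NODES] = {
	{ 10, 20 },
	{ 20, 10 },
};

/* Counts how often the expensive search runs. */
int demo_slow_lookups;

/* Stand-in for local_memory_node(): search for the nearest node with memory. */
int demo_local_memory_node(int node)
{
	int best = -1;
	int n;

	demo_slow_lookups++;
	for (n = 0; n < DEMO_NR_NODES; n++) {
		if (!demo_node_has_memory[n])
			continue;
		if (best < 0 || demo_distance[node][n] < demo_distance[node][best])
			best = n;
	}
	return best;
}

/* Per-CPU cache, filled once at init time. */
static int demo_cpu_mem_cache[DEMO_NR_CPUS];

void demo_init_cpu_to_mem(void)
{
	int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		demo_cpu_mem_cache[cpu] = demo_local_memory_node(demo_cpu_node[cpu]);
}

/* Stand-in for cpu_to_mem(): a plain table read, no search. */
int demo_cpu_to_mem(int cpu)
{
	return demo_cpu_mem_cache[cpu];
}
```

After demo_init_cpu_to_mem() has run, demo_cpu_to_mem() returns the same
node the slow lookup would, without repeating the search on every call.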

Again, apologies for not Cc'ing you originally.

-Nish