[RFC 0/2] Memoryless nodes and kworker

Sat Jul 19 04:47:08 EST 2014

On Fri, Jul 18, 2014 at 11:19 AM, Tejun Heo <tj at kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:
> > why aren't these callers using kthread_create_on_cpu()? That API was
>
> It is using that.  There just are other data structures too.

Sorry, I might not have been clear.

Why are any callers of the format kthread_create_on_node(...,
cpu_to_node(cpu), ...) not using kthread_create_on_cpu(..., cpu, ...)?

In total in Linus' tree, there are only two APIs that use
kthread_create_on_cpu() -- smpboot_create_threads() and
smpboot_register_percpu_thread(). Neither of those seem to be used by the
workqueue code that I can see as of yet.

> > already change to use cpu_to_mem() [so one change, rather than of all
over
> > the kernel source]. We could change it back to cpu_to_node and push down
> > the knowledge about the fallback.
>
> And once it's properly solved, please convert back kthread to use
> cpu_to_node() too.  We really shouldn't be sprinkling the new subtly
> different variant across the kernel.  It's wrong and confusing.

I understand what you mean, but it's equally wrong for the kernel to be
wasting GBs of slab. Different kinds of wrongness :)

> > Yes, this is a good point. But honestly, we're not really even to the
point
> > of talking about fallback here, at least in my testing, going off-node
at
> > all causes SLUB-configured slabs to deactivate, which then leads to an
> > explosion in the unreclaimable slab.
>
> I don't think moving the logic inside allocator proper is a huge
> amount of work and this isn't the first spillage of this subtlety out
> of allocator proper.  Fortunately, it hasn't spread too much yet.
> Let's please stop it here.  I'm not saying you shouldn't or can't fix
> the off-node allocation.

It seems like an additional reasonable approach would be to provide a
suitable _cpu() API for the allocators. I'm not sure why saying that
callers should know about NUMA (in order to call cpu_to_node() in every
caller) is any better than saying that callers should know about memoryless
nodes (in order to call cpu_to_mem() in every caller instead) -- when at
least in several cases that I've seen the relevant data is what CPU we're
expecting to run or are running on. Seems like the _cpu API would specify
-- please allocate memory local to this CPU, wherever it is?

Thanks,
Nish
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20140718/502f2858/attachment.html>