<div dir="ltr"><div><div>On Fri, Jul 18, 2014 at 11:19 AM, Tejun Heo <<a href="mailto:tj@kernel.org">tj@kernel.org</a>> wrote:<br>><br>> Hello,<br>><br>> On Fri, Jul 18, 2014 at 11:12:01AM -0700, Nish Aravamudan wrote:<br>

> > why aren't these callers using kthread_create_on_cpu()? That API was<br>><br>> It is using that.  There just are other data structures too.<br><br></div>Sorry, I might not have been clear.<br><br>Why aren't callers of the form kthread_create_on_node(..., cpu_to_node(cpu), ...) using kthread_create_on_cpu(..., cpu, ...) instead?<br>

<br></div>In total, in Linus' tree, there are only two APIs that use kthread_create_on_cpu(): smpboot_create_threads() and smpboot_register_percpu_thread(). Neither of those seems to be used by the workqueue code, as far as I can see.<br>

<div><div><div><br>> > already change to use cpu_to_mem() [so one change, rather than of all over<br>> > the kernel source]. We could change it back to cpu_to_node and push down<br>> > the knowledge about the fallback.<br>

><br>> And once it's properly solved, please convert back kthread to use<br>> cpu_to_node() too.  We really shouldn't be sprinkling the new subtly<br>> different variant across the kernel.  It's wrong and confusing.<br>

<br></div><div>I understand what you mean, but it's equally wrong for the kernel to be wasting GBs of slab. Different kinds of wrongness :)<br></div><div><br>> > Yes, this is a good point. But honestly, we're not really even to the point<br>

> > of talking about fallback here, at least in my testing, going off-node at<br>> > all causes SLUB-configured slabs to deactivate, which then leads to an<br>> > explosion in the unreclaimable slab.<br>

><br>> I don't think moving the logic inside allocator proper is a huge<br>> amount of work and this isn't the first spillage of this subtlety out<br>> of allocator proper.  Fortunately, it hasn't spread too much yet.<br>

> Let's please stop it here.  I'm not saying you shouldn't or can't fix<br>> the off-node allocation.<br><br></div><div>Another reasonable approach would be to provide a suitable _cpu() API for the allocators. I'm not sure why requiring callers to know about NUMA (so that every caller calls cpu_to_node()) is any better than requiring them to know about memoryless nodes (so that every caller calls cpu_to_mem() instead). In several cases that I've seen, the relevant piece of data is the CPU we're running on or expect to run on. It seems like a _cpu() API would simply say: please allocate memory local to this CPU, wherever that is?<br>
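<br>To make the _cpu() idea a bit more concrete, here is a rough userspace sketch, not kernel code. Everything in it is an assumption for illustration: the mock_ topology tables, mock_alloc_node() standing in for a node-aware allocator, and the hypothetical mock_alloc_cpu() wrapper. It only shows where the memoryless-node fallback would live if callers passed a CPU rather than a node.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MOCK_NR_CPUS 4

/* Assumed topology: CPUs 2 and 3 sit on node 1, which has no memory. */
static const int mock_node_of_cpu[MOCK_NR_CPUS] = { 0, 0, 1, 1 };
/* Nearest node with memory for each CPU (node 0 for everyone here). */
static const int mock_mem_of_cpu[MOCK_NR_CPUS]  = { 0, 0, 0, 0 };

/* Records which node the last allocation was directed at. */
int last_alloc_node = -1;

/* Mock of the "which node is this CPU on" lookup. */
int mock_cpu_to_node(int cpu)
{
    assert(cpu >= 0 && cpu < MOCK_NR_CPUS);
    return mock_node_of_cpu[cpu];
}

/* Mock of the "nearest node with memory for this CPU" lookup. */
int mock_cpu_to_mem(int cpu)
{
    assert(cpu >= 0 && cpu < MOCK_NR_CPUS);
    return mock_mem_of_cpu[cpu];
}

/* Stand-in for a node-aware allocator; it just remembers the node. */
void *mock_alloc_node(size_t size, int node)
{
    last_alloc_node = node;
    return malloc(size);
}

/*
 * Hypothetical _cpu() wrapper: the caller names only the CPU, and the
 * choice between node and nearest-memory node is made in one place.
 */
void *mock_alloc_cpu(size_t size, int cpu)
{
    return mock_alloc_node(size, mock_cpu_to_mem(cpu));
}
```

With something of this shape, a caller that cares about CPU 2 would call mock_alloc_cpu(size, 2) and never see whether node 1 has memory; whether the wrapper or the allocator proper does the fallback could then change without touching callers.<br>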

<br></div><div>Thanks,<br></div><div>Nish<br></div></div></div></div>