<div dir="ltr"><div>Hi Tejun,<br><br></div>[I found the other thread where you made these points, thank you for expressing them so clearly again!]<br><div><div><div><br>On Fri, Jul 18, 2014 at 11:00 AM, Tejun Heo <<a href="mailto:tj@kernel.org">tj@kernel.org</a>> wrote:<br>
><br>> Hello,<br>><br>> On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote:<br>> > So, to be clear, this is not *necessarily* about memoryless nodes. It's<br>> > about the semantics intended. The workqueue code currently calls<br>
> > cpu_to_node() in a few places, and passes that node into the core MM as a<br>> > hint about where the memory should come from. However, when memoryless<br>> > nodes are present, that hint is guaranteed to be wrong, as it's the nearest<br>
> > NUMA node to the CPU (which happens to be the one it's on), not the nearest<br>> > NUMA node with memory. The hint is correctly specified as cpu_to_mem(),<br>><br>> It's telling the allocator the node the CPU is on. Choosing and<br>
> falling back the actual allocation is the allocator's job.<br><br></div><div>Ok, I agree with you then, if that's the intended semantics.<br></div><div><br>But looking at the comment for kthread_create_on_node:<br>
<br> * If thread is going to be bound on a particular cpu, give its node<br> * in @node, to get NUMA affinity for kthread stack, or else give -1.<br><br></div><div>so the API documents @node as a hint for memory affinity itself, *not* just the node the kthread runs on. Piddly, yes. But I have another thought altogether, and in reviewing Jiang's patches this seems like the right approach:<br>
<br></div><div>why aren't these callers using kthread_create_on_cpu()? That API was already changed to use cpu_to_mem() [so one change, rather than changes all over the kernel source]. We could change it back to cpu_to_node() and push down the knowledge about the fallback.<br>
</div><div><br>> > which does the right thing in the presence or absence of memoryless nodes.<br>> > And I think encapsulates the hint's semantics correctly -- please give me<br>> > memory from where I expect it, which is the closest NUMA node.<br>
><br>> I don't think it does. It loses information at too high a layer.<br>> Workqueue here doesn't care how memory subsystem is structured, it's<br>> just telling the allocator where it's at and expecting it to do the<br>
> right thing. Please consider the following scenario.<br>><br>> A - B - C - D - E<br>><br>> Let's say C is a memory-less node. If we map from C to either B or D<br>> from individual users and that node can't serve that memory request,<br>
> the allocator would fall back to A or E respectively when the right<br>> thing to do would be falling back to D or B respectively, right?<br><br></div><div>Yes, this is a good point. But honestly, we're not really even to the point of talking about fallback here: at least in my testing, going off-node at all causes SLUB-configured slabs to deactivate, which then leads to an explosion in unreclaimable slab usage.<br>
</div><div><br>> This isn't a huge issue but it shows that this is the wrong layer to<br>> deal with this issue. Let the allocators express where they are.<br>> Choosing and falling back belong to the memory allocator. That's the<br>
> only place which has all the information that's necessary and those<br>> details must be contained there. Please don't leak it to memory<br>> allocator users.<br><br></div><div>Ok, I will continue to work at that level of abstraction.<br>
<br>Thanks,<br>Nish<br></div></div></div></div>