Topology updates and NUMA-level sched domains

Nishanth Aravamudan nacc at linux.vnet.ibm.com
Wed Apr 8 03:14:10 AEST 2015


On 07.04.2015 [12:21:47 +0200], Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 02:45:58PM -0700, Nishanth Aravamudan wrote:
> > Hi Peter,
> > 
> > As you are very aware, I think, power has some odd NUMA topologies (and
> > changes to the those topologies) at run-time. In particular, we can see
> > a topology at boot:
> > 
> > Node 0: all CPUs
> > Node 7: no CPUs
> > 
> > Then we get a notification from the hypervisor that a core (or two) has
> > moved from node 0 to node 7. This results in the:
> 
> > or a re-init API (which won't try to reallocate various bits), because
> > the topology could be completely different now (e.g.,
> > sched_domains_numa_distance will also be inaccurate now).  Really, a
> > topology update on power (not sure on s390x, but those are the only two
> > archs that return a positive value from arch_update_cpu_topology() right
> > now, afaics) is a lot like a hotplug event and we need to re-initialize
> > any dependent structures.
> > 
> > I'm just sending out feelers, as it seems we can limp by with the above
> > warning, but that is less than ideal. Any help or insight you could
> > provide would be greatly appreciated!
> 
> So I think (and ISTR having stated this before) that dynamic cpu<->node
> maps are absolutely insane.

Sorry, I wasn't involved in that discussion at the time. I agree that
it's a bit of a mess!

> There is a ton of stuff that assumes the cpu<->node relation is a boot
> time fixed one. Userspace being one of them. Per-cpu memory another.

Well, userspace already deals with CPU hotplug, right? And these
topology updates are, in many ways, just like hot-removing a CPU from
one node and hot-adding it to another.
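
To be concrete about the userspace side: something like the following
(just a sketch of mine, assuming the standard sysfs node layout; the
node count is an arbitrary bound) is all a tool needs to re-read the
cpu<->node map after an update, exactly as it would after a hotplug
event:

#include <stdio.h>

int main(void)
{
	char path[64], line[256];
	int node;

	for (node = 0; node < 8; node++) {	/* arbitrary bound */
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/cpulist", node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node not present */
		if (fgets(line, sizeof(line), f))
			printf("node %d: cpus %s", node, line);
		fclose(f);
	}
	return 0;
}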

I'll look into the per-cpu memory case.
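
For reference, the kind of per-cpu dependency I understand you to mean,
as a throwaway (untested) module sketch (demo_counter is made up):

#include <linux/module.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

static DEFINE_PER_CPU(unsigned long, demo_counter);

static int __init demo_init(void)
{
	int cpu;

	/*
	 * Each cpu's per-cpu area was placed using the cpu<->node map
	 * at setup time; AFAICS nothing relocates it if the cpu is
	 * later reported on a different node.
	 */
	for_each_online_cpu(cpu)
		pr_info("cpu %d: node %d, counter at %p\n",
			cpu, cpu_to_node(cpu),
			per_cpu_ptr(&demo_counter, cpu));
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");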

For what it's worth, our test teams are stressing the kernel with these
topology updates and hopefully we'll be able to resolve any issues that
result.

> You simply cannot do this without causing massive borkage.
> 
> So please come up with a coherent plan to deal with the entire problem
> of dynamic cpu to memory relation and I might consider the scheduler
> impact. But we're not going to hack around and maybe make it not crash
> in a few corner cases while the entire thing is shite.

Well, it doesn't crash now. In fact, it stays up reasonably well and
seems to do the right thing (from the kernel's perspective) other than
the sched domain messages.

I will look into per-cpu memory, and also another case I have been
thinking about: if a process is bound to a CPU/node combination via
numactl and the topology then changes, what exactly happens? In theory,
via these topology updates, a node could go from memoryless to having
memory, and vice versa, which seems like it might not be well supported
(but again, that should not be much different from hotplugging all of
the memory out of a node).
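
The binding case I have in mind, as an untested sketch (roughly what
numactl --cpunodebind=0 --membind=0 sets up, with cpu 0 standing in for
node 0's cpus; link with -lnuma for set_mempolicy()):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <numaif.h>

int main(void)
{
	cpu_set_t cpus;
	unsigned long nodes = 1UL << 0;		/* node 0 only */

	CPU_ZERO(&cpus);
	CPU_SET(0, &cpus);			/* cpu 0 only */

	if (sched_setaffinity(0, sizeof(cpus), &cpus))
		perror("sched_setaffinity");
	if (set_mempolicy(MPOL_BIND, &nodes, sizeof(nodes) * 8))
		perror("set_mempolicy");

	/*
	 * ... long-running work ...
	 *
	 * What happens to these masks when cpu 0 is later reported on
	 * node 7, or node 0 goes memoryless, is exactly the question
	 * above.
	 */
	return 0;
}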

And, in fact, topologically speaking, I think I should be able to
reproduce the same sched domain warnings if I start with a 2-node
system with all CPUs on one node and then hotplug a CPU onto the second
node, right? That has nothing to do with power, as far as I can tell.
I'll see if I can demonstrate it with a KVM guest.

Thanks for your quick response!

-Nish


