[PATCH] powerpc/numa: Restrict possible nodes based on platform

Nathan Lynch nathanl at linux.ibm.com
Sat Jul 11 03:41:22 AEST 2020


Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:

> * Michael Ellerman <mpe at ellerman.id.au> [2020-07-07 15:02:17]:
>
>> Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:
>> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
>> > /proc/device-tree/rtas/ibm,current-associativity-domains
>> > 		 00000005 00000001 00000002 00000002 00000002 00000010
>> > /proc/device-tree/rtas/ibm,max-associativity-domains
>> > 		 00000005 00000001 00000008 00000020 00000020 00000100
>> >
>> > $ cat /sys/devices/system/node/possible ##Before patch
>> > 0-31
>> >
>> > $ cat /sys/devices/system/node/possible ##After patch
>> > 0-1
>> >
>> > Note the maximum nodes this platform can support is only 2 but the
>> > possible nodes is set to 32.
>> 
>> But what about LPM to a system with more nodes?
>> 
>
> I have very little info on LPM (Live Partition Migration), so I checked
> with Nathan Lynch before posting; per Nathan, in the current design of
> LPM, Linux wouldn't use the new node numbers.

(I see a v2 has been posted already but I needed a little time to check
with our hypervisor people on this point.)

It's less of a design and more of a least-bad option in the absence of a
more flexible NUMA architecture in Linux.

For now, the "rule" with LPM and NUMA has to be that Linux uses the NUMA
information from the device tree that it was booted with, and it must
disregard ibm,associativity and similar information after LPM or certain
other platform events. Historically there has been code that tried to
honor changes in NUMA information but it caused much worse problems than
degraded performance. That code has been disabled by default since last
year and is now subject to removal:

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=182897

Most NUMA-aware code happens to follow that rule because the device tree
associativity information tends to get cached on first access in Linux's
own data structures. It all feels a little fragile to me, though,
especially since we can DLPAR-add processors and memory after LPM with
"new" associativity properties that don't relate to the logical
topology Linux has already built. However, on currently available
hardware, as long as we're using ibm,max-associativity-domains to limit
the set of possible nodes, I believe such resources will always receive
valid (but possibly suboptimal) NUMA assignments. That's because, as of
this writing, ibm,max-associativity-domains has the same contents on
all currently available PowerVM systems.

Now if we change to using ibm,current-associativity-domains, which we
*can* expect to differ between differently configured systems, post-LPM
DLPAR additions can yield resources with node assignments that fall
outside the possible range, especially when we've migrated from a
smaller system to a larger one.

Is the current code robust against that possibility? I don't think so:
it looks like of_node_to_nid_single(), of_drconf_to_nid_single() and
possibly more code need to guard against this in order to prevent
NODE_DATA() null dereferences and the like. Probably those functions
should be made to clamp the nid assignment at num_possible_nodes()
instead of MAX_NUMNODES, which strikes me as more correct regardless of
your patch.
