[PATCH v2 1/3] powerpc/numa: Introduce logical numa id

Aneesh Kumar K.V aneesh.kumar at linux.ibm.com
Tue Aug 18 18:21:16 AEST 2020


Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:

> * Aneesh Kumar K.V <aneesh.kumar at linux.ibm.com> [2020-08-17 17:04:24]:
>
>> On 8/17/20 4:29 PM, Srikar Dronamraju wrote:
>> > * Aneesh Kumar K.V <aneesh.kumar at linux.ibm.com> [2020-08-17 16:02:36]:
>> > 
>> > > We use ibm,associativity and ibm,associativity-lookup-arrays to derive the NUMA
>> > > node numbers. These device tree properties describe firmware-indicated groupings
>> > > of resources based on their hierarchy in the platform. These numbers (group ids)
>> > > are not sequential, and the hypervisor/firmware can follow different numbering
>> > > schemes. For example, on PowerNV platforms, we group them in the below order.
>> > > 
>> > >   *     - CCM node ID
>> > >   *     - HW card ID
>> > >   *     - HW module ID
>> > >   *     - Chip ID
>> > >   *     - Core ID
>> > > 
>> > > Based on ibm,associativity-reference-points we use one of the above group ids as
>> > > Linux NUMA node id. (On the PowerNV platform, the Chip ID is used.) This results
>> > > in Linux reporting non-linear NUMA node ids, which can also result in Linux
>> > > reporting an empty NUMA node 0.
>> > > 
>> > > This can be resolved by mapping the firmware-provided group id to a logical Linux
>> > > NUMA id. In this patch, we do this only for pseries platforms, since the
>> > > firmware group id is a virtualized entity and users would not have drawn any
>> > > conclusions based on the Linux NUMA node id.
>> > > 
>> > > On the PowerNV platform, since we have historically mapped the Chip ID to the
>> > > Linux NUMA node id, we keep the existing Linux NUMA node id numbering.
>> > 
>> > I still don't understand how you are going to handle NUMA distances.
>> > With your patch, have you tried DLPAR add/remove on a sparsely-noded machine?
>> > 
>> 
>> We follow the same steps when fetching distance information. Instead of
>> using the affinity domain id, we now use the mapped node id. The relevant
>> hunk in the patch is
>> 
>> +	nid = affinity_domain_to_nid(&domain);
>> 
>>  	if (nid > 0 &&
>> -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
>> +	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
>>  		/*
>>  		 * Skip the length field and send start of associativity array
>>  		 */
>> 
>> I haven't tried dlpar add/remove. I don't have a setup to try that. Do you
>> see a problem there?
>> 
>
> Yes, I think there can be 2 problems.
>
> 1. The distance table may be filled with incorrect data.
> 2. The distance table shown by numactl -H is symmetric; that symmetric
> nature may be lost.
>

After discussing with Srikar to understand these concerns better, below
are the conclusions.

1) There is no corruption of node distances. We do handle node distances
correctly. But in the numactl -H output, a NUMA node with a higher number
is not necessarily further away from node 0, i.e. we can see output like
the below.

node   0   1   2   3
  0:  10  40  40  20
  1:  40  10  40  40
  2:  40  40  10  40
  3:  20  40  40  10

Here node 3 is closer to node 0 than nodes 1 and 2 are. I am not sure
whether this is going to break any userspace.
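
To make 1) concrete, below is a minimal, illustrative sketch of how the
domain id to logical node id mapping can behave. This is not the code from
the patch (the real affinity_domain_to_nid() takes a struct affinity_domain;
the array and variable names here are made up): logical ids are simply
handed out in the order the firmware domain ids are first seen, so the
logical numbering carries no information about distance.

/*
 * Illustrative sketch only, not the patch code: logical node ids are
 * assigned in the order firmware affinity domain ids are first seen.
 */
#define MAX_AFFINITY_DOMAINS	256

static int domain_to_nid_map[MAX_AFFINITY_DOMAINS];
static int last_assigned_nid = -1;

static int affinity_domain_to_nid(int domain_id)
{
	int nid;

	/* Return the existing logical id if this domain was seen before. */
	for (nid = 0; nid <= last_assigned_nid; nid++)
		if (domain_to_nid_map[nid] == domain_id)
			return nid;

	/* First time this domain is seen: hand out the next logical id. */
	if (last_assigned_nid + 1 < MAX_AFFINITY_DOMAINS) {
		domain_to_nid_map[++last_assigned_nid] = domain_id;
		return last_assigned_nid;
	}

	return -1;	/* no logical ids left */
}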

2) Node numbers can change if we do a DLPAR add of memory/CPU and then
reboot. For example, we boot with resource domain ids 4 and 6 and later
add resources from domain 5. In that case, nodes 0, 1 and 2 map to
domain ids 4, 6 and 5 respectively. On reboot, we can end up mapping
them such that

node 0 -> 4
node 1 -> 5
node 2 -> 6

I guess this is still OK because we are running in a virtualized
environment, and the node number to domain id mapping is never
guaranteed to be the same across reboots.
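
Walking this scenario through the illustrative helper sketched under 1)
(again, not the actual patch code):

	/* First boot: only domains 4 and 6 exist, domain 5 is added later. */
	affinity_domain_to_nid(4);	/* -> node 0 */
	affinity_domain_to_nid(6);	/* -> node 1 */
	affinity_domain_to_nid(5);	/* -> node 2 (DLPAR add) */

	/*
	 * After a reboot all three domains are present from the start and
	 * are seen in order, so the mapping shifts:
	 */
	affinity_domain_to_nid(4);	/* -> node 0 */
	affinity_domain_to_nid(5);	/* -> node 1 */
	affinity_domain_to_nid(6);	/* -> node 2 */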

-aneesh

