[PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

Daniel Henrique Barboza danielhb413 at gmail.com
Wed Jun 23 02:04:14 AEST 2021



On 6/22/21 9:07 AM, Aneesh Kumar K.V wrote:
> Daniel Henrique Barboza <danielhb413 at gmail.com> writes:
> 
>> On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:
>>> PAPR interface currently supports two different ways of communicating resource
>>> grouping details to the OS. These are referred to as Form 0 and Form 1
>>> associativity grouping. Form 0 is the older format and is now considered
>>> deprecated. This patch adds another resource grouping named FORM2.
>>>
>>> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar at linux.ibm.com>
>>> ---
>>>    Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>>    arch/powerpc/include/asm/firmware.h       |   3 +-
>>>    arch/powerpc/include/asm/prom.h           |   1 +
>>>    arch/powerpc/kernel/prom_init.c           |   3 +-
>>>    arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>>    arch/powerpc/platforms/pseries/firmware.c |   1 +
>>>    6 files changed, 286 insertions(+), 6 deletions(-)
>>>    create mode 100644 Documentation/powerpc/associativity.rst
>>>
>>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>>> new file mode 100644
>>> index 000000000000..93be604ac54d
>>> --- /dev/null
>>> +++ b/Documentation/powerpc/associativity.rst
>>> @@ -0,0 +1,135 @@
>>> +===========================
>>> +NUMA resource associativity
>>> +===========================
>>> +
>>> +Associativity represents the groupings of the various platform resources into
>>> +domains of substantially similar mean performance relative to resources outside
>>> +of that domain. Resource subsets of a given domain that exhibit better
>>> +performance relative to each other than relative to other resource subsets
>>> +are represented as being members of a sub-grouping domain. This performance
>>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>>> +From the platform view, these groups are also referred to as domains.
>>> +
>>> +The PAPR interface currently supports different ways of communicating these resource
>>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
>>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>>> +
>>> +The hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
>>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>>> +
>>> +Form 0
>>> +------
>>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>>> +
>>> +Form 1
>>> +------
>>> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>>> +
>>> +The "ibm,associativity" property contains one or more lists of numbers (domainID)
>>> +representing the resource's platform grouping domains.
>>> +
>>> +The "ibm,associativity-reference-points" property contains one or more lists of numbers
>>> +(domainID index) that represent the 1-based ordinal in the associativity lists.
>>> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
>>> +
>>> +ex:
>>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>>> +
>>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>>> +if they belong to the same higher-level domains. For mismatch at every higher
>>> +level of the resource group, the kernel doubles the NUMA distance between the
>>> +comparing domains.
>>> +
>>> +Form 2
>>> +-------
>>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>>> +domain numbering. With numa distance computation now detached from the index value of
>>> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>>> +same domainID index representing resource groups of different performance/latency characteristics.
>>> +
>>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>>> +"ibm,architecture-vec-5" property.
>>> +
>>> +"ibm,numa-lookup-index-table" property contains a list of one or more numbers representing
>>> +the domainIDs present in the system. The offset of the domainID in this property is considered
>>> +the domainID index.
>>> +
>>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>>> +N domainID encoded as with encode-int
>>> +
>>> +For ex:
>>> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
>>> +
>>> +"ibm,numa-distance-table" property contains a list of one or more numbers representing the NUMA
>>> +distance between resource groups/domains present in the system.
>>> +
>>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>>> +
>>> +For ex:
>>> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>>> +ibm,numa-distance-table     =  {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>>> +
>>> +  | 0    8   40
>>> +--|------------
>>> +  |
>>> +0 | 10   20  80
>>> +  |
>>> +8 | 20   10  160
>>> +  |
>>> +40| 80   160  10
>>> +
>>> +
>>> +"ibm,associativity" property for resources in node 0, 8 and 40
>>> +
>>> +{ 3, 6, 7, 0 }
>>> +{ 3, 6, 9, 8 }
>>> +{ 3, 6, 7, 40}
>>> +
>>> +With "ibm,associativity-reference-points"  { 0x3 }
>>
>> With this configuration, would the following ibm,associativity arrays
>> also be valid?
>>
>>
>> { 3, 0, 0, 0 }
>> { 3, 0, 0, 8 }
>> { 3, 0, 0, 40}
>>
> 
> Yes
> 
>> If yes, then we need a way to tell that the assignment of associativity domains
>> is optional, and FORM2 relies solely on finding out the domainID of the
>> resource (0, 8 and 40) to retrieve the domainID index, and with this
>> index all performance metrics can be retrieved from the numa-* properties
>> (numa-distance-table, numa-bandwidth-table ...).
>>
> 
> Where do you suggest we clarify that? I agree that it is not explicitly
> mentioned. But we describe the details of how we find the numa distance
> with example in the document.


Perhaps something like this, right in the middle of the example:


----------------

(...)

+  | 0    8   40
+--|------------
+  |
+0 | 10   20  80
+  |
+8 | 20   10  160
+  |
+40| 80   160  10
+
+

With "ibm,associativity-reference-points" equal to { 0x3 }, the domainID of
each resource is located at index 3 of each ibm,associativity property:

+{ 3, 6, 7, 0 }
+{ 3, 6, 9, 8 }
+{ 3, 6, 7, 40 }


FORM2 requires the ibm,associativity array to contain the domainID of the
resource, which is defined by the ibm,associativity-reference-points.
Calculating the associativity domains of the remaining ibm,associativity
elements is not obligatory. In this example, the following ibm,associativity
arrays are also valid:

{ 3, 0, 0, 0 }
{ 3, 0, 0, 8 }
{ 3, 0, 0, 40 }

(...)

-------------
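
To make the lookup concrete, here is a minimal Python sketch of how an OS
could resolve a FORM2 distance from the properties in the example above. The
function names are made up for illustration and are not taken from the patch.
It also assumes that the distance values are laid out row by row, in the
order given by ibm,numa-lookup-index-table; the example matrix is symmetric,
so it cannot confirm that ordering.

```python
def primary_domain_id(associativity, ref_points):
    """Return the domainID of a resource (hypothetical helper).

    associativity: the "ibm,associativity" cells, length cell included,
                   e.g. [3, 6, 7, 0]
    ref_points:    the "ibm,associativity-reference-points" cells, e.g. [3]
    """
    # The first reference point is the 1-based ordinal of the primary
    # domainID. Since cell 0 holds the length, it can index the list as is.
    return associativity[ref_points[0]]


def form2_distance(lookup_index_table, distance_table, domain_a, domain_b):
    """Return the NUMA distance between two domainIDs (hypothetical helper)."""
    n = lookup_index_table[0]
    domain_ids = lookup_index_table[1:1 + n]

    count = distance_table[0]
    distances = distance_table[1:1 + count]
    if count != n * n:
        raise ValueError("distance table does not describe an N x N matrix")

    # The domainID index is the offset of the domainID in the lookup table.
    row = domain_ids.index(domain_a)
    col = domain_ids.index(domain_b)

    # Assumption: row-major layout, one row per domainID index.
    return distances[row * n + col]
```

Only the element selected by the reference point is read, so
{ 3, 6, 9, 8 } and { 3, 0, 0, 8 } both resolve to domainID 8, and the
distance between domainID 8 and domainID 40 is read from the table as 160.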


> 
>> Retrieving the resource domainID is done by using ibm,associativity-reference-points.
>>
>> This will allow the platform to implement FORM2 such as:
>>
>> { 1, 0 }
>> { 1, 8 }
>> { 1, 40 }
>>    
>> - ref-points: { 0x1 }
>>
>> If the platform chooses to do so.
>>
> 
> That is correct.
> 
>>
>>> +
>>> +Each resource (drcIndex) now also supports additional optional device tree properties.
>>> +These properties are marked optional because the platform can choose not to export
>>> +them and provide the system topology details using the earlier defined device tree
>>> +properties alone. The optional device tree properties are used when adding new resources
>>> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>>> +contains the newly added resource during boot.
>>> +
>>> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>>> +when building the NUMA distance of the numa node to which this resource belongs. This can
>>> +be looked at as the index at which this new domainID would have appeared in
>>> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
>>> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>>> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>>> +The value is 1 based array index value.
>>> +
>>> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>>> +
>>> +"ibm,numa-distance" property contains a list of one or more numbers representing the NUMA distance
>>> +from this resource domain to other resources.
>>> +
>>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>>> +
>>> +For ex:
>>> +ibm,associativity     = { 4, 5, 10, 50}
>>> +ibm,numa-lookup-index = { 4 }
>>> +ibm,numa-distance   =  {8, 160, 255, 80, 10, 160, 255, 80, 10}
>>> +
>>> +resulting in a new topology as below.
>>> +  | 0    8   40   50
>>> +--|------------------
>>> +  |
>>> +0 | 10   20  80   160
>>> +  |
>>> +8 | 20   10  160  255
>>> +  |
>>> +40| 80   160  10  80
>>> +  |
>>> +50| 160  255  80  10
>>> +
>>
>> I see there is no mention of the special PAPR SCM handling. I saw in
>> one of your replies to v1:
>>
>> "Another option is to make sure that numa-distance-value is populated
>> such that PMEMB distance indicates it is closer to node0 when compared
>> to node1. ie, node_distance[40][0] < node_distance[40][1]. One could
>> possibly infer the grouping based on the distance value and not depend
>> on ibm,associativity for that purpose."
>>
>>
>> Is that what we're supposed to do with PAPR SCM? I'm not sure how that
>> affects NVDIMM support in QEMU with FORM2.
>>
>>
> 
> yes that is what we are doing with this version of the patchset (v4).
> We can drop the nvdimm specific changes from Qemu.


I see. I'll drop the NVDIMM changes in the QEMU POC of FORM2 then.



Thanks,


Daniel

> 
> -aneesh
> 

