[RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
Aneesh Kumar K.V
aneesh.kumar at linux.ibm.com
Thu Aug 6 20:44:21 AEST 2020
Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:
> * Aneesh Kumar K.V <aneesh.kumar at linux.ibm.com> [2020-08-02 19:51:41]:
>> Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:
>> > * Aneesh Kumar K.V <aneesh.kumar at linux.ibm.com> [2020-07-31 16:49:14]:
>> >
>> >
>> > If it's just to eliminate node 0, then we have 2 other probably better
>> > solutions.
>> > 1. Don't mark node 0 as special (currently still in mm-tree and as a
>> > result in linux-next)
>> > 2. powerpc specific: explicitly clear node 0 during numa bringup.
>> >
>>
>>
>> I am not sure I consider them better. But yes, those patches are good
>> and also resolve the node 0 initialization when the firmware didn't
>> indicate the presence of such a node.
>>
>> In addition, this patch makes sure that we get the same topology report
>> across reboots on a virtualized partition as long as the cpu/memory
>> ratio per PowerVM domain remains the same. This should also help to
>> avoid confusion after an LPM migration once we start applying topology
>> updates.
>>
>
> What do we mean by cpu/memory ratio? The topology across reboots would have
> changed only if PowerVM had allocated resources differently by
> scrambling/unscrambling. We no longer process topology updates at
> runtime. As far as I know, after LPM, the source topology is maintained.
An LPAR running with one numa node and 10GB of memory on PowerVM domain
10 will report node 10 and 10GB memory in the current scheme. After LPM
migration or a CEC shutdown/reboot, if the domain from which the resources
are allocated becomes 11, then the LPAR will report node 11 and 10GB memory.
Having a logical node number means that in both the above cases we report
node 0 and 10GB memory.
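To make that concrete, here is a minimal standalone sketch of the idea
(userspace C, not the kernel patch itself; gid_to_nid and MAX_DOMAINS are
made up for illustration): the first firmware domain id seen is reported
as logical node 0, whichever domain that happens to be.

#include <stdio.h>

#define MAX_DOMAINS	256

static int nid_map[MAX_DOMAINS];	/* 0 means "not mapped yet" */
static int last_nid;

static int gid_to_nid(int gid)
{
	if (!nid_map[gid])
		nid_map[gid] = ++last_nid;	/* store nid + 1 */
	return nid_map[gid] - 1;
}

int main(void)
{
	/* boot with resources from PowerVM domain 10 */
	printf("domain 10 -> node %d\n", gid_to_nid(10));	/* node 0 */
	/* a boot that instead started from domain 11 would likewise
	   have reported its first domain as node 0 */
	printf("domain 11 -> node %d\n", gid_to_nid(11));	/* node 1 */
	return 0;
}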
>
>> >> This can be resolved by mapping the firmware provided group id to a logical Linux
>> >> NUMA id. In this patch, we do this only for pseries platforms considering the
>> >
>> > On PowerVM, as you would know, the nid is already a logical or a flattened
>> > chip-id and not the actual hardware chip-id.
>>
>> Yes. But then they are derived from PowerVM resources, AKA domains.
>> Now, based on the available resources on a system, we could end up with
>> different node numbers with the same topology across reboots. Making it
>> logical at the OS level prevents that.
>
> The above statement kind of gives the impression that the topology changes
> across every reboot. We only end up with different node numbers if and only
> if the underlying topology has changed, and that case is very rare. Or am I
> missing something?
IIUC it also depends on the availability of resources within the
domain at the time of LPAR start.
>
>>
>> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> >> index e437a9ac4956..6c659aada55b 100644
>> >> --- a/arch/powerpc/mm/numa.c
>> >> +++ b/arch/powerpc/mm/numa.c
>> >> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>> >> }
>> >> }
>> >>
>> >> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
>> >> +
>> >> +int firmware_group_id_to_nid(int firmware_gid)
>> >> +{
>> >> + static int last_nid = 0;
>> >> +
>> >> + /*
>> >> + * For PowerNV we don't change the node id. This helps to avoid
>> >> + * confusion w.r.t the expected node ids. On pseries, node numbers
>> >> + * are virtualized. Hence do logical node id for pseries.
>> >> + */
>> >> + if (!firmware_has_feature(FW_FEATURE_LPAR))
>> >> + return firmware_gid;
>> >> +
>> >> + if (firmware_gid == -1)
>> >> + return NUMA_NO_NODE;
>> >> +
>> >> + if (nid_map[firmware_gid] == NUMA_NO_NODE)
>> >> + nid_map[firmware_gid] = last_nid++;
>> >
>> > How do we ensure 2 simultaneous firmware_group_id_to_nid() calls don't end up
>> > at this place in parallel?
>>
>> Do we have a code path where we do that? All the node id init should
>> happen early and there should not be two cpus doing node init at the
>> same time. I might be mistaken. Can you point to the code path where you
>> expect this to be called in parallel?
>>
>
> associativity_to_nid gets called the first time a cpu is being made present
> from offline, so it need not be in the boot path. We may want to verify
> whether cpu hotplug and dlpar operations are synchronized. For example, are
> a memory hot-add and a cpu hotplug synchronized? I am not sure if they are
> synchronized at this time.
But you don't online cpu or memory to a non-existent node post boot,
right? If the node exists, we have already initialized the nid_map.
However, I am not sure whether we do a parallel initialization of devices,
i.e. of_device_add getting called in parallel. If that can happen, then we
need the below:
@@ -226,6 +226,7 @@ static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
 int firmware_group_id_to_nid(int firmware_gid)
 {
 	static int last_nid = 0;
+	static DEFINE_SPINLOCK(node_id_lock);
 
 	/*
 	 * For PowerNV we don't change the node id. This helps to avoid
@@ -238,8 +239,13 @@ int firmware_group_id_to_nid(int firmware_gid)
 	if (firmware_gid == -1)
 		return NUMA_NO_NODE;
 
-	if (nid_map[firmware_gid] == NUMA_NO_NODE)
-		nid_map[firmware_gid] = last_nid++;
+	if (nid_map[firmware_gid] == NUMA_NO_NODE) {
+		spin_lock(&node_id_lock);
+		/* recheck with lock held */
+		if (nid_map[firmware_gid] == NUMA_NO_NODE)
+			nid_map[firmware_gid] = last_nid++;
+		spin_unlock(&node_id_lock);
+	}
 
 	return nid_map[firmware_gid];
 }
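The recheck with the lock held is what prevents two callers that both saw
NUMA_NO_NODE before taking the lock from each consuming a last_nid value
for the same firmware_gid.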
>
>> >
>> >> +
>> >> + return nid_map[firmware_gid];
>> >> +}
>> >> +
>> >> /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> >> * info is found.
>> >> */
>> >> static int associativity_to_nid(const __be32 *associativity)
>> >> {
>> >> int nid = NUMA_NO_NODE;
>> >> + int firmware_gid = -1;
>> >>
>> >> if (!numa_enabled)
>> >> goto out;
>> >>
>> >> if (of_read_number(associativity, 1) >= min_common_depth)
>> >> - nid = of_read_number(&associativity[min_common_depth], 1);
>> >> + firmware_gid = of_read_number(&associativity[min_common_depth], 1);
>> >>
>> >> /* POWER4 LPAR uses 0xffff as invalid node */
>> >> - if (nid == 0xffff || nid >= MAX_NUMNODES)
>> >> - nid = NUMA_NO_NODE;
>> >> + if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
>> >> + firmware_gid = -1;
>> >
>> > Let's assume two or more invocations of associativity_to_nid for the same
>> > associativity end up with -1. In each case aren't we giving different
>> > nids?
>>
>>
>> I didn't quite get the comment here. But I assume you are referring to the
>> same issue you mentioned above?
>>
>
> No, it's not related to the above comment.
> We are incrementing the nid_map table for every unique firmware_gid or for
> every -1 (aka invalid associativity). If there is a sufficiently large
> number of associativities that end up being invalid,
> then don't we quickly overflow the nid_map table? It's not only about the
> overflow: an 8 node machine may soon look like an 80 node machine.
Not sure I follow. What does a large number of associativities imply? Are
you looking at ibm,associativity-lookup-arrays that have entries which
are invalid? Even there we are not parsing the full array; we look up
only a specific firmware_gid (in the case of lookup-arrays we use the
aa_index value from drmem_lmb).
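Roughly, the lookup-array path resolves just one entry; a simplified
sketch (array_size/arrays stand in for the parsed
ibm,associativity-lookup-arrays property, so this is not the exact numa.c
code):

	/* each associativity array in the property is array_size cells;
	 * pick only the entry selected by this LMB's aa_index
	 */
	index = lmb->aa_index * array_size + min_common_depth - 1;
	firmware_gid = of_read_number(&arrays[index], 1);
	nid = firmware_group_id_to_nid(firmware_gid);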
I will also add a last_nid > MAX_NUMNODES check in
firmware_group_id_to_nid() to handle the case where we find more numa
nodes than MAX_NUMNODES in the device tree.
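For reference, that check could look roughly like the below (a sketch on
top of the snippet above; the exact form may differ in the next version):

	if (nid_map[firmware_gid] == NUMA_NO_NODE) {
		/* device tree describes more domains than we can map */
		if (last_nid >= MAX_NUMNODES)
			return NUMA_NO_NODE;
		nid_map[firmware_gid] = last_nid++;
	}

	return nid_map[firmware_gid];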
-aneesh