[PATCH v4] pseries: prevent free CPU ids from being reused on another node

Laurent Dufour ldufour at linux.ibm.com
Wed Apr 21 02:34:34 AEST 2021


On 07/04/2021 at 17:38, Laurent Dufour wrote:
> When a CPU is hot added, the CPU ids are taken from the available mask,
> starting from the lowest possible values. If that range of ids was
> previously used for CPUs attached to a different node, it looks to
> applications as if these CPUs had migrated from one node to another,
> which is not expected in real life.
> 
> To prevent this, the CPU ids used on each node need to be recorded and
> not reused on another node. However, to prevent CPU hot plug from
> failing when a node runs out of CPU ids, the ability to reuse other
> nodes' free CPU ids is kept. A warning is displayed in that case to
> notify the user.
> 
> A new CPU bit mask (node_recorded_ids_map) is introduced for each
> possible node. It is populated with the CPUs onlined at boot time, and
> then each time a CPU is hot plugged to a node. The bits in that mask
> remain set when the CPU is hot unplugged, to record that these CPU ids
> have been used by this node.
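For illustration only, here is a minimal sketch of the recording and
selection logic described above. This is not the patch itself: the helper
pick_cpu_id_for_node() is invented for this sketch, it picks a single id
while the real code has to assign a full core worth of threads, and the
per-node masks are assumed to be allocated and populated at boot.

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/nodemask.h>
#include <linux/printk.h>

/* One mask per possible node; bits stay set after a hot unplug.
 * Assumed allocated and filled with the boot CPUs (not shown here). */
static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];

/*
 * Return the lowest free CPU id that has not been recorded for another
 * node, falling back (with a warning) to any free id when the node is
 * starved.  The caller is expected to hold cpu_add_remove_lock.
 */
static int pick_cpu_id_for_node(int assigned_node)
{
	cpumask_var_t mask;
	int nid, cpu;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	/* Free ids are the possible but not present ones. */
	cpumask_andnot(mask, cpu_possible_mask, cpu_present_mask);

	/* Drop the ids already recorded for the other nodes. */
	for_each_node(nid)
		if (nid != assigned_node)
			cpumask_andnot(mask, mask, node_recorded_ids_map[nid]);

	cpu = cpumask_first(mask);
	if (cpu >= nr_cpu_ids) {
		/* Starved node: reuse another node's free ids, but warn. */
		pr_warn("Node %d is out of free CPU ids, reusing ids from another node\n",
			assigned_node);
		cpumask_andnot(mask, cpu_possible_mask, cpu_present_mask);
		cpu = cpumask_first(mask);
	}

	if (cpu < nr_cpu_ids)
		cpumask_set_cpu(cpu, node_recorded_ids_map[assigned_node]);

	free_cpumask_var(mask);
	return cpu < nr_cpu_ids ? cpu : -ENOSPC;
}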
> 
> The effect of this patch can be seen by removing and adding CPUs using
> the Qemu monitor. In the following case, the first CPU of node 2 is
> removed, then the first one of node 1 is removed too. Later, the first
> CPU of node 2 is added back. Without this patch, the kernel numbers
> these CPUs using the first available CPU ids, which are the ones freed
> by the second removal (the first CPU of node 1). This leads CPU ids
> 16-23 to move from node 1 to node 2. With the patch applied, CPU ids
> 32-39 are used, since they are the lowest free ids that have not been
> used on another node.
> 
> At boot time:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> 
> Vanilla kernel, after the CPU hot unplug/plug operations:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 24 25 26 27 28 29 30 31
> node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
> 
> Patched kernel, after the CPU hot unplug/plug operations:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 24 25 26 27 28 29 30 31
> node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> 
> Changes since V3, addressing Nathan's comment:
>   - Rename the local variable named 'nid' into 'assigned_node'
> Changes since V2, addressing Nathan's comments:
>   - Remove the retry feature
>   - Reduce the number of local variables (removing 'i')
>   - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
> Changes since V1 (no functional changes):
>   - update the test's output in the commit's description
>   - node_recorded_ids_map should be static
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>

I did further LPM tests with this patch applied, and not allowing the
fallback of reusing free ids from another node is too strong a
restriction.

That limitation is easy to hit when an LPAR is running with the maximum
number of CPUs it is configured for, and an LPAR migration leads to the
activation of a new node.

For instance, consider a dedicated LPAR configured with a maximum of 32
CPUs (4 cores, SMT 8). At boot time, cpu_possible_mask is filled with CPU
ids 0-31 in smp_setup_cpu_maps() by reading the DT property
"/rtas/ibm,lrdr-capacity", so the highest CPU id for this LPAR is 31.

Departure box:
	node 0 : CPU 0-31
Arrival box:
	node 0 : CPU 0-15
	node 1 : CPU 16-31 << need to reuse ids from node 0

Virtualizing the CPU ids would have a big impact, as they are used in
several places in the kernel to index linear tables.

But when the LPAR is migratable (the DT property
"ibm,migratable-partition" is present), we may set the highest possible
CPU id to NR_CPUS (usually 2048), to limit the cases where a CPU id has
to be reused on a different node. Doing this will have an impact on some
data allocations done in the kernel whose size is based on
num_possible_cpus().
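
As a rough sketch of that idea (not a tested change; whether
"ibm,migratable-partition" sits under /rtas and the SMT scaling of the
lrdr-capacity value are simplified assumptions here):

#include <asm/byteorder.h>
#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/of.h>

/*
 * Sketch: size cpu_possible_mask from "ibm,lrdr-capacity", but do not
 * cap it when the partition is migratable, so that free CPU ids remain
 * available for nodes that only appear after an LPM.
 */
static void __init setup_possible_cpus_sketch(void)
{
	struct device_node *rtas = of_find_node_by_path("/rtas");
	unsigned int maxcpus = NR_CPUS;
	const __be32 *lrdr;
	unsigned int cpu;

	if (!rtas)
		return;

	lrdr = of_get_property(rtas, "ibm,lrdr-capacity", NULL);
	if (lrdr)
		/* Skip the address/size cells; SMT scaling omitted here. */
		maxcpus = be32_to_cpup(lrdr + of_n_addr_cells(rtas) +
				       of_n_size_cells(rtas));

	/* Migratable LPAR: keep room for CPU ids on not yet seen nodes. */
	if (of_property_read_bool(rtas, "ibm,migratable-partition"))
		maxcpus = NR_CPUS;

	for (cpu = 0; cpu < min_t(unsigned int, maxcpus, NR_CPUS); cpu++)
		set_cpu_possible(cpu, true);

	of_node_put(rtas);
}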

Any better idea?

Thanks,
Laurent.
