Advice needed on SMP regression after cpu_core_mask change
Cédric Le Goater
clg at kaod.org
Thu Mar 18 02:30:02 AEDT 2021
On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:
> Hello,
>
> Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
> a regression in both upstream and RHEL downstream kernels [1]. The assumption made
> in the commit:
>
> "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
> equal on Power"
>
> Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA
> node SMP topologies such as:
>
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
>
> lscpu will give the following output in this case:
>
> # lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 8
> On-line CPU(s) list: 0-7
> Thread(s) per core: 2
> Core(s) per socket: 4
> Socket(s): 1
> NUMA node(s): 1
> Model: 2.2 (pvr 004e 1202)
> Model name: POWER9 (architected), altivec supported
> Hypervisor vendor: KVM
> Virtualization type: para
> L1d cache: 32K
> L1i cache: 32K
> NUMA node0 CPU(s): 0-7
>
>
> This is happening because the macro cpu_cpu_mask(cpu) expands to
> cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
> node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to
> correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
>
> If I add a second NUMA node then I can get the intended smp topology:
>
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
>
> # lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 8
> On-line CPU(s) list: 0-7
> Thread(s) per core: 2
> Core(s) per socket: 2
> Socket(s): 2
> NUMA node(s): 2
> Model: 2.2 (pvr 004e 1202)
> Model name: POWER9 (architected), altivec supported
> Hypervisor vendor: KVM
> Virtualization type: para
> L1d cache: 32K
> L1i cache: 32K
> NUMA node0 CPU(s): 0-3
> NUMA node1 CPU(s): 4-7
>
>
> However, if I try a single socket with multiple NUMA nodes topology, which is the case
> of Power10, e.g.:
>
>
> -smp 8,maxcpus=8,cores=4,threads=2,sockets=1
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
>
>
> This is the result:
>
> # lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 8
> On-line CPU(s) list: 0-7
> Thread(s) per core: 2
> Core(s) per socket: 2
> Socket(s): 2
> NUMA node(s): 2
> Model: 2.2 (pvr 004e 1202)
> Model name: POWER9 (architected), altivec supported
> Hypervisor vendor: KVM
> Virtualization type: para
> L1d cache: 32K
> L1i cache: 32K
> NUMA node0 CPU(s): 0-3
> NUMA node1 CPU(s): 4-7
>
>
> This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.

Yes. I don't think we can do better on PAPR and the above examples
seem to confirm that the "sockets" definition is simply ignored.

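To make the three lscpu outputs above easier to compare, here is a small Python model. It is only a sketch, not kernel or lscpu code. It assumes lscpu-style accounting (a core is a distinct set of thread siblings, a socket is a distinct core-siblings mask) and it assumes that, after 4bce545903fa, the core-siblings mask of a CPU is simply the CPU set of its NUMA node. The function name and its arguments are made up for this illustration.

```python
def lscpu_view(ncpus, threads, node_cpus):
    """Model the topology a guest would report when socket == NUMA node.

    ncpus     -- number of online CPUs
    threads   -- threads per core (the -smp threads= value)
    node_cpus -- one iterable of CPU ids per NUMA node (the -numa cpus= values)
    """
    # Assumption: thread siblings are consecutive CPU ids.
    thread_siblings = {
        frozenset(range(c - c % threads, c - c % threads + threads))
        for c in range(ncpus)
    }
    # Assumption: core_siblings(cpu) == cpumask_of_node(cpu_to_node(cpu)).
    node_of = {cpu: frozenset(cpus) for cpus in node_cpus for cpu in cpus}
    core_siblings = {node_of[c] for c in range(ncpus)}

    cores = len(thread_siblings)
    sockets = len(core_siblings)
    return {
        "threads_per_core": ncpus // cores,
        "cores_per_socket": cores // sockets,
        "sockets": sockets,
    }


# One NUMA node: the requested sockets=2 collapses into a single socket.
print(lscpu_view(8, 2, [range(8)]))
# Two NUMA nodes: two sockets are reported, whatever sockets= said.
print(lscpu_view(8, 2, [range(0, 4), range(4, 8)]))
```

Note that the "sockets" value from -smp is not an input of the model at all, which matches what the examples show.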
> Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit
> after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating
> cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite
> its shortcomings that caused its removal, was giving a precise SMP topology. And it was
> using physical_package_id/'ibm,chip-id' for that.

ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR, which
hides a lot of the underlying HW and topology. Maybe we are trying
to reconcile two orthogonal views of machine virtualization ...

> Checking in QEMU I can say that the ibm,chip-id calculation is the only place in the code
> that cares about cores per socket information. The kernel is now ignoring that, starting
> on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
>
> If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does
> not declare it, we need another way of letting the guest know how many cores per socket
> we want.

The RTAS call "ibm,get-system-parameter" with token "Processor Module
Information" returns that kind of information:

  2 byte binary number (N) of module types followed by N module specifiers of the form:
    2 byte binary number (M) of sockets of this module type
    2 byte binary number (L) of chips per this module type
    2 byte binary number (K) of cores per chip in this module type.
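
The layout above could be decoded along these lines. This is a minimal sketch: it assumes the fields are big-endian and that the buffer passed in starts directly at N (any length prefix the RTAS call may add is assumed to be stripped already). The function name is made up for the example.

```python
import struct


def parse_processor_module_info(payload):
    """Decode N module specifiers of (sockets, chips, cores per chip)."""
    # Assumption: 16-bit big-endian fields, payload starts at N.
    (ntypes,) = struct.unpack_from(">H", payload, 0)
    modules = []
    for i in range(ntypes):
        # Each specifier is three 2-byte numbers: M, L, K.
        sockets, chips, cores_per_chip = struct.unpack_from(">3H", payload, 2 + 6 * i)
        modules.append(
            {"sockets": sockets, "chips": chips, "cores_per_chip": cores_per_chip}
        )
    return modules


# Made-up sample: one module type, 2 sockets, 1 chip, 12 cores per chip.
sample = bytes([0, 1, 0, 2, 0, 1, 0, 12])
print(parse_processor_module_info(sample))
```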

See the values in these sysfs files:

  cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip}
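
The same three files can be read from a script. A small sketch, assuming each file holds a single decimal number; the helper name and the "base" parameter are made up here, and a missing or unreadable file is reported as None.

```python
from pathlib import Path

HV_24X7 = Path("/sys/devices/hv_24x7/interface")


def read_hv_24x7_topology(base=HV_24X7):
    """Return the sockets/chipspersocket/coresperchip values as integers."""
    values = {}
    for name in ("sockets", "chipspersocket", "coresperchip"):
        try:
            # Assumption: the file contains one decimal integer.
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # file missing, unreadable, or not a number
    return values


print(read_hv_24x7_topology())
```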

But I am afraid this is host-level information, not guest/LPAR-level.
I didn't find any LPAR-level properties or hcalls in the PAPR document.
They need to be specified.

or

We can add extra properties like ibm,chip-id, making sure they are only
used under the KVM hypervisor. My understanding is that this is something
we are trying to avoid.

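For illustration only, this is how a per-vCPU property in the spirit of ibm,chip-id could carry the socket layout requested with -smp. The formula (vCPU index divided by cores * threads) is an assumption made for this example, not a statement of what QEMU or PAPR defines, and both function names are hypothetical.

```python
def chip_id(vcpu_index, cores, threads):
    """Hypothetical socket id: vCPUs are numbered socket by socket."""
    return vcpu_index // (cores * threads)


def sockets_from_chip_ids(ncpus, cores, threads):
    """Group vCPU ids by the hypothetical chip id, as a guest could do."""
    groups = {}
    for cpu in range(ncpus):
        groups.setdefault(chip_id(cpu, cores, threads), []).append(cpu)
    return groups


# -smp 8,cores=2,threads=2,sockets=2: two groups of four vCPUs.
print(sockets_from_chip_ids(8, 2, 2))
# -smp 8,cores=4,threads=2,sockets=1: a single group of eight vCPUs.
print(sockets_from_chip_ids(8, 4, 2))
```

With such a property the grouping does not depend on the NUMA configuration.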
C.