Advice needed on SMP regression after cpu_core_mask change

Thu Mar 18 02:30:02 AEDT 2021

On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:
> Hello,
> 
> Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
> a regression in both upstream and RHEL downstream kernels [1]. The assumption made
> in the commit:
> 
> "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
> equal on Power"
> 
> Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA
> node SMP topologies such as:
> 
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> 
> lscpu will give the following output in this case:
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  4
> Socket(s):           1
> NUMA node(s):        1
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-7
> 
> 
> This is happening because the macro cpu_cpu_mask(cpu) expands to
> cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
> node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to
> correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
> 
> If I add a second NUMA node then I can get the intended smp topology:
> 
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-3
> NUMA node1 CPU(s):   4-7
> 
> 
> However, if I try a single socket with multiple NUMA nodes topology, which is the case
> of Power10, e.g.:
> 
> 
> -smp 8,maxcpus=8,cores=4,threads=2,sockets=1
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> 
> 
> This is the result:
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-3
> NUMA node1 CPU(s):   4-7
> 
> 
> This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.

Yes. I don't think we can do better on PAPR and the above examples 
seem to confirm that the "sockets" definition is simply ignored.

> Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit
> after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating
> cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite
> its shortcomings that caused its removal, was giving a precise SMP topology. And it was
> using physical_package_id/'ibm,chip-id' for that.

ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR which
is hiding a lot of the underlying HW and topology. May be we are trying 
to reconcile two orthogonal views of machine virtualization ...

> Checking in QEMU I can say that the ibm,chip-id calculation is the only place in the code
> that cares about cores per socket information. The kernel is now ignoring that, starting
> on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
> 
> If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does
> not declare it, we need another way of letting the guest know how much cores per socket
> we want.
The RTAS call "ibm,get-system-parameter" with token "Processor Module 
Information" returns that kind of information :

  2 byte binary number (N) of module types followed by N module specifiers of the form:
  2 byte binary number (M) of sockets of this module type
  2 byte binary number (L) of chips per this module type
  2 byte binary number (K) of cores per chip in this module type.

See the values in these sysfs files :

  cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip} 

But I am afraid these are host level information and not guest/LPAR.

I didn't find any LPAR level properties or hcalls in the PAPR document. 
They need to be specified.

or

We can add extra properties like ibm,chip-id but making sure it's only 
used under the KVM hypervisor. My understanding is that's something we 
are trying to avoid. 

C.