[PATCH v5 3/4] arm64: topology: Support SMT control on ACPI based system

Yicong Yang yangyicong at huawei.com
Tue Sep 3 22:44:20 AEST 2024


On 2024/9/2 15:43, Pierre Gondois wrote:
> Hello Yicong
> 
> On 8/30/24 11:35, Yicong Yang wrote:
>> On 2024/8/29 20:46, Pierre Gondois wrote:
>>> Hello Yicong,
>>>
>>> On 8/29/24 09:40, Yicong Yang wrote:
>>>> Hi Pierre,
>>>>
>>>> On 2024/8/27 23:40, Pierre Gondois wrote:
>>>>> Hello Yicong,
>>>>> IIUC we are looking for the maximum number of threads a CPU can have
>>>>> in the platform. But is it actually possible to have a platform with
>>>>> CPUs not having the same number of threads ?
>>>>>
>>>>
>>>> I was thinking of heterogeneous platforms, for example ones where part
>>>> of the cores have SMT and others don't (I don't have any though, but
>>>> there should be some such platforms for arm64).
>>>>
>>>>> For instance kernel/cpu.c::cpu_smt_num_threads_valid() will check
>>>>> that the number of threads is either 1 or cpu_smt_max_threads (as
>>>>> CONFIG_SMT_NUM_THREADS_DYNAMIC is not enabled for arm64). However
>>>>> a (hypothetical) platform with:
>>>>> - CPU0: 2 threads
>>>>> - CPU1: 4 threads
>>>>> should not be able to set the number of threads to 2, as
>>>>> 1 < 2 < cpu_smt_max_threads (cf. cpu_smt_num_threads_valid()).
>>>>>
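For reference, the check described above can be modeled as follows (a simplified userspace sketch of kernel/cpu.c::cpu_smt_num_threads_valid(), not the exact kernel source; the boolean parameter stands in for CONFIG_SMT_NUM_THREADS_DYNAMIC):

```c
#include <stdbool.h>

/*
 * Userspace model of the smt/control validity check: without
 * SMT_NUM_THREADS_DYNAMIC, only "one thread" or "all threads"
 * is a valid request.
 */
bool smt_num_threads_valid(bool threads_dynamic,
			   unsigned int threads,
			   unsigned int cpu_smt_max_threads)
{
	if (threads_dynamic)
		/* powerpc-style: any value in [1, max] is accepted */
		return threads >= 1 && threads <= cpu_smt_max_threads;
	/* otherwise only 1 or cpu_smt_max_threads is accepted */
	return threads == 1 || threads == cpu_smt_max_threads;
}
```

So with cpu_smt_max_threads == 4 and the Kconfig disabled, writing 2 to smt/control is rejected, which is exactly the hypothetical 2-thread/4-thread problem above.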
>>>>
>>>> It's a bit more complex. If we enable/disable SMT using the on/off
>>>> control then we won't have this problem. We'll online all the CPUs on
>>>> enabling and offline the CPUs which are not primary threads on disabling,
>>>> without caring about the thread number of each core.
>>>>
>>>> Control by thread number is introduced by CONFIG_SMT_NUM_THREADS_DYNAMIC
>>>> and is only enabled on powerpc. I think this interface is not enough to
>>>> handle the hypothetical platform we assumed, since it assumes a
>>>> homogeneous platform where all the CPUs have the same thread number. But
>>>> this should be fine for platforms with SMT cores (all with the same
>>>> thread number) and non-SMT cores, since it does indicate the real thread
>>>> number of the SMT cores.
>>>>
>>>>> So if there is an assumption that all the CPUs have the same number of
>>>>> threads, it should be sufficient to count the number of threads for one
>>>>> CPU right ?
>>>>>
>>>>
>>>> Naturally and conveniently I'd think of using the thread number of CPU0
>>>> (the boot CPU) in such a solution. But this will be wrong if CPU0 is
>>>> non-SMT on a heterogeneous platform :(
>>>>
>>>>> In the other (unlikely) case (i.e. CPUs can have various numbers of threads),
>>>>> I think we should either:
>>>>> - use your method and check that all the CPUs have the same number of threads
>>>>> - get the number of threads for one CPU and check that all the CPUs are
>>>>>      identical using the ACPI_PPTT_ACPI_IDENTICAL tag
>>>>
>>>> I think this won't be simpler since we still need to iterate over all
>>>> the CPUs to see if they have the same hetero_id (I checked how we're
>>>> using this ACPI tag in arm_acpi_register_pmu_device()). We would already
>>>> know if they're identical by comparing the thread numbers, and can do
>>>> the update if necessary.
>>>>
>>>>> - have a per-cpu cpu_smt_max_threads value ?
>>>>
>>>> This should be unnecessary at the current stage if there are no
>>>> platforms where several kinds of cores have different thread numbers,
>>>> as in your assumption; we can enable CONFIG_SMT_NUM_THREADS_DYNAMIC on
>>>> such platforms if they appear. Otherwise, using a global
>>>> cpu_smt_max_threads to enable the SMT control should be enough.
>>>> Does it make sense?
>>>
>>> Yes, I agree with all the things you said:
>>> - the current smt/control interface cannot handle asymmetric SMT platforms
>>> - I am not aware if such platform exist so far
>>>
>>> I think it would still be good to check that all the CPUs have the same
>>> number of threads. I tried to enable the SMT_NUM_THREADS_DYNAMIC Kconfig
>>> with the attached patch (and to check that CPUs have the same number of
>>> threads). Feel free to take what you like / ignore what is inside the
>>> attached patch, or let me know if I should submit part of it in a
>>> separate patchset,
>>>
>>
>> I checked the changes; we can make this series the basic support, with a
>> separate series of yours as further support of SMT control on arm64:
>> - support thread_id on ACPI based arm64 platforms
>> - support partial SMT control by enabling CONFIG_SMT_NUM_THREADS_DYNAMIC
> 
> I'm not sure I fully understand what you mean. I can send patches to
> enable SMT_NUM_THREADS_DYNAMIC on top of a v6 of yours IIUC. I'll let you
> pick the changes that you think make more sense in your series (if that
> makes sense) ? Please let me know if that works for you (or not).
> 

Yes, it works for me :)

>>
>> some minor comments below.
>>
>>> ----------------------------
>>>
>>>      [RFC] arm64: topology: Enable CONFIG_SMT_NUM_THREADS_DYNAMIC
>>>          - On arm64 ACPI systems, change the thread_id assignment to have
>>>        increasing values starting from 0. This is already the case for DT
>>>        based systems. Doing so allows uniformly identifying the n-th
>>>        thread of a given CPU.
>>>      - Check that all CPUs have the same number of threads (for DT/ACPI)
>>>      - Enable CONFIG_SMT_NUM_THREADS_DYNAMIC
>>>          On a Tx2, with 256 CPUs, thread siblings being 0,32,64,96
>>>      for socket0 and 128 + (0,32,64,96) for socket1:
>>>      $ cd /sys/devices/system/cpu/smt/
>>>      $ cat ../online
>>>      0-255
>>>      $ echo 2 > control
>>>      $ cat ../online
>>>      0-63,128-191
>>>      $ echo 3 > control
>>>      $ cat ../online
>>>      0-95,128-223
>>>      $ echo on > control
>>>      $ cat ../online
>>>      0-255
>>>
>>
>> So it's a real SMT4 system, it does make sense to have this partial SMT control
>> support.
> 
> Yes right
> 
>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index bd3bc2f5e0ec..1d8521483065 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -239,6 +239,7 @@ config ARM64
>>>          select HAVE_GENERIC_VDSO
>>>          select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
>>>          select HOTPLUG_SMT if (SMP && HOTPLUG_CPU)
>>> +       select SMT_NUM_THREADS_DYNAMIC if HOTPLUG_SMT
>>>          select IRQ_DOMAIN
>>>          select IRQ_FORCED_THREADING
>>>          select KASAN_VMALLOC if KASAN
>>> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
>>> index 0f6ef432fb84..7dd211f81687 100644
>>> --- a/arch/arm64/include/asm/topology.h
>>> +++ b/arch/arm64/include/asm/topology.h
>>> @@ -39,6 +39,14 @@ void update_freq_counters_refs(void);
>>>   #define arch_scale_hw_pressure topology_get_hw_pressure
>>>   #define arch_update_hw_pressure        topology_update_hw_pressure
>>>
>>> +#ifdef CONFIG_SMT_NUM_THREADS_DYNAMIC
>>> +#include <linux/cpu_smt.h>
>>> +static inline bool topology_smt_thread_allowed(unsigned int cpu)
>>> +{
>>> +       return topology_thread_id(cpu) < cpu_smt_num_threads;
>>> +}
>>> +#endif
>>> +
>>>   #include <asm-generic/topology.h>
>>>
>>>   #endif /* _ASM_ARM_TOPOLOGY_H */
>>> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>>> index f72e1e55b05e..a83babe19972 100644
>>> --- a/arch/arm64/kernel/topology.c
>>> +++ b/arch/arm64/kernel/topology.c
>>> @@ -47,7 +47,9 @@ int __init parse_acpi_topology(void)
>>>   {
>>>          int thread_num, max_smt_thread_num = 1;
>>>          struct xarray core_threads;
>>> +       bool have_thread = false;
>>>          int cpu, topology_id;
>>> +       unsigned long i;
>>>          void *entry;
>>>
>>>          if (acpi_disabled)
>>> @@ -61,6 +63,8 @@ int __init parse_acpi_topology(void)
>>>                          return topology_id;
>>>
>>>                  if (acpi_cpu_is_threaded(cpu)) {
>>> +                       have_thread = true;
>>> +
>>>                          cpu_topology[cpu].thread_id = topology_id;
>>>                          topology_id = find_acpi_cpu_topology(cpu, 1);
>>>                          cpu_topology[cpu].core_id   = topology_id;
>>> @@ -69,9 +73,10 @@ int __init parse_acpi_topology(void)
>>>                          if (!entry) {
>>>                                  xa_store(&core_threads, topology_id,
>>>                                           xa_mk_value(1), GFP_KERNEL);
>>> +                               cpu_topology[cpu].thread_id = 0;
>>>                          } else {
>>>                                  thread_num = xa_to_value(entry);
>>> -                               thread_num++;
>>> +                               cpu_topology[cpu].thread_id = thread_num++;
>>>                                  xa_store(&core_threads, topology_id,
>>>                                           xa_mk_value(thread_num), GFP_KERNEL);
>>>
>>> @@ -86,8 +91,17 @@ int __init parse_acpi_topology(void)
>>>                  cpu_topology[cpu].cluster_id = topology_id;
>>>                  topology_id = find_acpi_cpu_topology_package(cpu);
>>>                  cpu_topology[cpu].package_id = topology_id;
>>> +
>>> +               pr_debug("CPU%u: package=0x%x cluster=0x%x core=0x%x thread=0x%x\n",
>>> +                       cpu, cpu_topology[cpu].package_id, cpu_topology[cpu].cluster_id,
>>> +                       cpu_topology[cpu].core_id, cpu_topology[cpu].thread_id);
>>>          }
>>>
>>> +       if (have_thread)
>>
>> we could know this from max_smt_thread_num so have_thread maybe unnecessary.
> 
> Yes right, I will change that.
> 
>>
>>> +               xa_for_each(&core_threads, i, entry)
>>> +                       if (xa_to_value(entry) != max_smt_thread_num)
>>> +                               pr_warn("Heterogeneous SMT topology not handled\n");
>>
>> Wondering if we can avoid this 2nd loop. Greg expressed worries about
>> looping twice on large-scale systems in v1. Maybe we could use the
>> hetero_id and get the necessary information in one loop; I need to think
>> further.
> 
> I found these comments (not sure this is what you are referring to):
> - https://lore.kernel.org/linux-arm-kernel/20231011103303.00002d8f@Huawei.com/
> - https://lore.kernel.org/all/20230921150333.c2zqigs3xxwcg4ln@bogus/T/#m406c4c16871ca7ae431beb20feccfb5e14498452
> 
> I don't see another way to do it right now. Also, I think the complexity is
> O(2n), which should be better than the original O(n**2),
> 

Yes, it's less complex. I'm wondering whether we can build up the xarray in
another way so that we can avoid the long loops. What about the below:

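One possible shape of such a single-pass approach, sketched as a userspace C model (illustrative names and a plain array standing in for the xarray; this is a sketch of the idea, not the actual kernel snippet): while assigning thread_ids in the one pass over CPUs, also track the number of distinct cores, the total number of threaded CPUs, and the running maximum per-core count. If every core has the same thread count t, then total == ncores * t with t == max, so an O(1) arithmetic check at the end can replace the second loop.

```c
#include <stdbool.h>

#define MAX_CORES 64	/* illustrative bound for the model */

struct topo_scan {
	int threads_per_core[MAX_CORES]; /* stands in for the xarray   */
	int ncores;                      /* distinct core ids seen     */
	int total_threads;               /* threaded CPUs seen so far  */
	int max_smt_thread_num;          /* running per-core maximum   */
};

/* Account one threaded CPU belonging to @core_id; returns its thread_id. */
int scan_one_cpu(struct topo_scan *s, int core_id)
{
	int thread_id = s->threads_per_core[core_id]++;

	if (thread_id == 0)
		s->ncores++;		/* first thread of a new core */
	if (thread_id + 1 > s->max_smt_thread_num)
		s->max_smt_thread_num = thread_id + 1;
	s->total_threads++;

	return thread_id;
}

/* O(1): true iff every core seen so far has the same thread count. */
bool topology_is_homogeneous(const struct topo_scan *s)
{
	return s->total_threads == s->ncores * s->max_smt_thread_num;
}
```

With this, the heterogeneity warning could be emitted from a constant-time check after the CPU loop instead of a second walk over the per-core entries.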


More information about the Linuxppc-dev mailing list