[PATCH] powerpc/numa: Handle partially initialized numa nodes

Michael Ellerman mpe at ellerman.id.au
Sun Apr 10 21:28:38 AEST 2022


Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:
> * Oscar Salvador <osalvador at suse.de> [2022-04-06 18:19:00]:
>
>> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
>> >  arch/powerpc/mm/numa.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> > index b9b7fefbb64b..13022d734951 100644
>> > --- a/arch/powerpc/mm/numa.c
>> > +++ b/arch/powerpc/mm/numa.c
>> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>> >  	if (new_nid < 0 || !node_possible(new_nid))
>> >  		new_nid = first_online_node;
>> >  
>> > -	if (NODE_DATA(new_nid) == NULL) {
>> > +	if (!node_online(new_nid)) {
>> >  #ifdef CONFIG_MEMORY_HOTPLUG
>> >  		/*
>> >  		 * Need to ensure that NODE_DATA is initialized for a node from
>> 
>> Because of this fix, I wanted to check whether we might have any more fallouts due
>> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
>> as to why powerpc is the only platform that special cases try_online_node(),
>> while all others rely on cpu_up()->try_online_node() to do the right thing.
>> 
>> So, I had a look.
>> Let us rewind a bit:
>> 
>> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
>> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
>> 
>> In there, it says that we have the following picture:
>> 
>> partition_sched_domains
>>  arch_update_cpu_topology
>>   numa_update_cpu_topology
>>    find_and_online_cpu_nid
>> 
>> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
>> too late as we might have referenced some NODE_DATA() fields.
>> Note that this happens at a much later stage in cpuhp.
>> 
>> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
>> which should allocate-and-online the node and prevent accessing garbage.
>> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
>> (see arch/powerpc/mm/numa.c:mem_topology_setup),
>> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
>> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
>> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
>> 
>> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
>>  tries to online node0, or whatever old mapping numa<->cpu is there).
>> 
>> So, we have something like:
>> 
>> dlpar_online_cpu
>>  device_online
>>   dev->bus->online
>>    cpu_subsys_online
>>     cpu_device_up
>>      cpu_up
>>       try_online_node (old mapping nid <-> cpu )
>>       ...
>>       ...
>>       cphp_callbacks
>>        sched_cpu_activate
>>         cpuset_update_active_cpus
>>          schedule_work(&cpuset_hotplug_work)
>>           cpuset_hotplug_work
>>            partition_sched_domains
>>             arch_update_cpu_topology
>>              numa_update_cpu_topology
>>               find_and_online_cpu_nid (online new_nid)
>> 
>> 
>> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
>> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
>> fixed the issue, but we should not need this special casing at all.
>> 
>> We do already know the numa<->cpu associativity in
>> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
>> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
>> 
>> With this in mind, I came up with the following patch, where I carried out a battery
>> of tests for CPU hotplug stuff and it worked as expected.
>> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
>> I would like to hear from more familiar people.
>> 
>> The patch is:
>> 
>> From: Oscar Salvador <osalvador at suse.de>
>> Date: Wed, 6 Apr 2022 14:39:15 +0200
>> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
>> 
>> powerpc is the only platform that do not rely on
>> cpu_up()->try_online_node() to bring up a numa node,
>> and special cases it, instead, deep in its own machinery:
>> 
>> dlpar_online_cpu
>>  find_and_online_cpu_nid
>>   try_online_node
>> 
>> This should not be needed, but the thing is that the try_online_node()
>> from cpu_up() will not apply on the right node, because cpu_to_node()
>> will return the old mapping numa<->cpu that gets set on boot stage
>> for all possible cpus.
>> 
>> That can be seen easily if we try to print out the numa node passed
>> to try_online_node() in cpu_up().
>> 
>> The thing is that the numa<->cpu mapping does not get updated till a much
>> later stage in start_secondary:
>> 
>> start_secondary:
>>  set_numa_node(numa_cpu_lookup_table[cpu])
>> 
>> But we do not really care, as we already now the
>> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
>> so let us make use of that and set the proper numa<->cpu mapping,
>> so cpu_to_node() in cpu_up() returns the right node and
>> try_online_node() can do its work.
>> 
>> Signed-off-by: Oscar Salvador <osalvador at suse.de>
>
> Looks good to me.
>
> Reviewed-by: Srikar Dronamraju <srikar at linux.vnet.ibm.com>

Yeah agreed, thanks for getting to the root of the problem.

Can you resend as a standalone patch. Because you sent it as a reply it
won't be recognised by patchwork[1] which means it risks getting lost.

cheers

1: http://patchwork.ozlabs.org/project/linuxppc-dev/list/


More information about the Linuxppc-dev mailing list