[PATCH] powerpc/numa: Handle partially initialized numa nodes
Michael Ellerman
mpe at ellerman.id.au
Sun Apr 10 21:28:38 AEST 2022
Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:
> * Oscar Salvador <osalvador at suse.de> [2022-04-06 18:19:00]:
>
>> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
>> > arch/powerpc/mm/numa.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> > index b9b7fefbb64b..13022d734951 100644
>> > --- a/arch/powerpc/mm/numa.c
>> > +++ b/arch/powerpc/mm/numa.c
>> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>> > if (new_nid < 0 || !node_possible(new_nid))
>> > new_nid = first_online_node;
>> >
>> > - if (NODE_DATA(new_nid) == NULL) {
>> > + if (!node_online(new_nid)) {
>> > #ifdef CONFIG_MEMORY_HOTPLUG
>> > /*
>> > * Need to ensure that NODE_DATA is initialized for a node from
>>
>> Because of this fix, I wanted to check whether we might have any more fallouts due
>> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
>> as to why powerpc is the only platform that special cases try_online_node(),
>> while all others rely on cpu_up()->try_online_node() to do the right thing.
>>
>> So, I had a look.
>> Let us rewind a bit:
>>
>> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
>> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
>>
>> In there, it says that we have the following picture:
>>
>> partition_sched_domains
>> arch_update_cpu_topology
>> numa_update_cpu_topology
>> find_and_online_cpu_nid
>>
>> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
>> too late as we might have referenced some NODE_DATA() fields.
>> Note that this happens at a much later stage in cpuhp.
>>
>> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
>> which should allocate-and-online the node and prevent accessing garbage.
>> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
>> (see arch/powerpc/mm/numa.c:mem_topology_setup),
>> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
>> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
>> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
>>
>> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
>> tries to online node0, or whatever old mapping numa<->cpu is there).
>>
>> So, we have something like:
>>
>> dlpar_online_cpu
>> device_online
>> dev->bus->online
>> cpu_subsys_online
>> cpu_device_up
>> cpu_up
>> try_online_node (old mapping nid <-> cpu )
>> ...
>> ...
>> cphp_callbacks
>> sched_cpu_activate
>> cpuset_update_active_cpus
>> schedule_work(&cpuset_hotplug_work)
>> cpuset_hotplug_work
>> partition_sched_domains
>> arch_update_cpu_topology
>> numa_update_cpu_topology
>> find_and_online_cpu_nid (online new_nid)
>>
>>
>> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
>> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
>> fixed the issue, but we should not need this special casing at all.
>>
>> We do already know the numa<->cpu associativity in
>> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
>> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
>>
>> With this in mind, I came up with the following patch, where I carried out a battery
>> of tests for CPU hotplug stuff and it worked as expected.
>> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
>> I would like to hear from more familiar people.
>>
>> The patch is:
>>
>> From: Oscar Salvador <osalvador at suse.de>
>> Date: Wed, 6 Apr 2022 14:39:15 +0200
>> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
>>
>> powerpc is the only platform that do not rely on
>> cpu_up()->try_online_node() to bring up a numa node,
>> and special cases it, instead, deep in its own machinery:
>>
>> dlpar_online_cpu
>> find_and_online_cpu_nid
>> try_online_node
>>
>> This should not be needed, but the thing is that the try_online_node()
>> from cpu_up() will not apply on the right node, because cpu_to_node()
>> will return the old mapping numa<->cpu that gets set on boot stage
>> for all possible cpus.
>>
>> That can be seen easily if we try to print out the numa node passed
>> to try_online_node() in cpu_up().
>>
>> The thing is that the numa<->cpu mapping does not get updated till a much
>> later stage in start_secondary:
>>
>> start_secondary:
>> set_numa_node(numa_cpu_lookup_table[cpu])
>>
>> But we do not really care, as we already now the
>> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
>> so let us make use of that and set the proper numa<->cpu mapping,
>> so cpu_to_node() in cpu_up() returns the right node and
>> try_online_node() can do its work.
>>
>> Signed-off-by: Oscar Salvador <osalvador at suse.de>
>
> Looks good to me.
>
> Reviewed-by: Srikar Dronamraju <srikar at linux.vnet.ibm.com>
Yeah agreed, thanks for getting to the root of the problem.
Can you resend as a standalone patch. Because you sent it as a reply it
won't be recognised by patchwork[1] which means it risks getting lost.
cheers
1: http://patchwork.ozlabs.org/project/linuxppc-dev/list/
More information about the Linuxppc-dev
mailing list