[PATCH] powerpc/numa: Handle partially initialized numa nodes
Srikar Dronamraju
srikar at linux.vnet.ibm.com
Fri Apr 8 22:25:37 AEST 2022
* Oscar Salvador <osalvador at suse.de> [2022-04-06 18:19:00]:
> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
> > arch/powerpc/mm/numa.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index b9b7fefbb64b..13022d734951 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
> > if (new_nid < 0 || !node_possible(new_nid))
> > new_nid = first_online_node;
> >
> > - if (NODE_DATA(new_nid) == NULL) {
> > + if (!node_online(new_nid)) {
> > #ifdef CONFIG_MEMORY_HOTPLUG
> > /*
> > * Need to ensure that NODE_DATA is initialized for a node from
>
> Because of this fix, I wanted to check whether we might have any more fallouts due
> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
> as to why powerpc is the only platform that special cases try_online_node(),
> while all others rely on cpu_up()->try_online_node() to do the right thing.
>
> So, I had a look.
> Let us rewind a bit:
>
> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
>
> In there, it says that we have the following picture:
>
> partition_sched_domains
> arch_update_cpu_topology
> numa_update_cpu_topology
> find_and_online_cpu_nid
>
> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
> too late as we might have referenced some NODE_DATA() fields.
> Note that this happens at a much later stage in cpuhp.
>
> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
> which should allocate-and-online the node and prevent accessing garbage.
> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
> (see arch/powerpc/mm/numa.c:mem_topology_setup),
> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
>
> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
> tries to online node0, or whatever old mapping numa<->cpu is there).
>
> So, we have something like:
>
> dlpar_online_cpu
> device_online
> dev->bus->online
> cpu_subsys_online
> cpu_device_up
> cpu_up
> try_online_node (old mapping nid <-> cpu )
> ...
> ...
> cphp_callbacks
> sched_cpu_activate
> cpuset_update_active_cpus
> schedule_work(&cpuset_hotplug_work)
> cpuset_hotplug_work
> partition_sched_domains
> arch_update_cpu_topology
> numa_update_cpu_topology
> find_and_online_cpu_nid (online new_nid)
>
>
> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
> fixed the issue, but we should not need this special casing at all.
>
> We do already know the numa<->cpu associativity in
> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
>
> With this in mind, I came up with the following patch, where I carried out a battery
> of tests for CPU hotplug stuff and it worked as expected.
> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
> I would like to hear from more familiar people.
>
> The patch is:
>
> From: Oscar Salvador <osalvador at suse.de>
> Date: Wed, 6 Apr 2022 14:39:15 +0200
> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
>
> powerpc is the only platform that do not rely on
> cpu_up()->try_online_node() to bring up a numa node,
> and special cases it, instead, deep in its own machinery:
>
> dlpar_online_cpu
> find_and_online_cpu_nid
> try_online_node
>
> This should not be needed, but the thing is that the try_online_node()
> from cpu_up() will not apply on the right node, because cpu_to_node()
> will return the old mapping numa<->cpu that gets set on boot stage
> for all possible cpus.
>
> That can be seen easily if we try to print out the numa node passed
> to try_online_node() in cpu_up().
>
> The thing is that the numa<->cpu mapping does not get updated till a much
> later stage in start_secondary:
>
> start_secondary:
> set_numa_node(numa_cpu_lookup_table[cpu])
>
> But we do not really care, as we already now the
> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
> so let us make use of that and set the proper numa<->cpu mapping,
> so cpu_to_node() in cpu_up() returns the right node and
> try_online_node() can do its work.
>
> Signed-off-by: Oscar Salvador <osalvador at suse.de>
Looks good to me.
Reviewed-by: Srikar Dronamraju <srikar at linux.vnet.ibm.com>
> ---
> arch/powerpc/include/asm/topology.h | 8 ++-----
> arch/powerpc/mm/numa.c | 31 +++++++---------------------
> arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
> 3 files changed, 11 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 36fcafb1fd6d..6ae1b2dce83e 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {}
> #endif /* CONFIG_NUMA */
>
> #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
> -extern int find_and_online_cpu_nid(int cpu);
> +extern void find_and_update_cpu_nid(int cpu);
> extern int cpu_to_coregroup_id(int cpu);
> #else
> -static inline int find_and_online_cpu_nid(int cpu)
> -{
> - return 0;
> -}
> -
> +static inline void find_and_update_cpu_nid(int cpu) {}
> static inline int cpu_to_coregroup_id(int cpu)
> {
> #ifdef CONFIG_SMP
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..b5bc8b1a833d 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu,
> return rc;
> }
>
> -int find_and_online_cpu_nid(int cpu)
> +void find_and_update_cpu_nid(int cpu)
> {
> __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> int new_nid;
>
> /* Use associativity from first thread for all siblings */
> if (vphn_get_associativity(cpu, associativity))
> - return cpu_to_node(cpu);
> + return;
>
> + /* Do not have previous associativity, so find it now. */
> new_nid = associativity_to_nid(associativity);
> +
> if (new_nid < 0 || !node_possible(new_nid))
> new_nid = first_online_node;
> -
> - if (NODE_DATA(new_nid) == NULL) {
> -#ifdef CONFIG_MEMORY_HOTPLUG
> - /*
> - * Need to ensure that NODE_DATA is initialized for a node from
> - * available memory (see memblock_alloc_try_nid). If unable to
> - * init the node, then default to nearest node that has memory
> - * installed. Skip onlining a node if the subsystems are not
> - * yet initialized.
> - */
> - if (!topology_inited || try_online_node(new_nid))
> - new_nid = first_online_node;
> -#else
> - /*
> - * Default to using the nearest node that has memory installed.
> - * Otherwise, it would be necessary to patch the kernel MM code
> - * to deal with more memoryless-node error conditions.
> + else
> + /* Associate node <-> cpu, so cpu_up() calls
> + * try_online_node() on the right node.
> */
> - new_nid = first_online_node;
> -#endif
> - }
> + set_cpu_numa_node(cpu, new_nid);
>
> pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__,
> cpu, new_nid);
> - return new_nid;
> }
>
> int cpu_to_coregroup_id(int cpu)
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index b81fc846d99c..0f8cd8b06432 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn)
> if (get_hard_smp_processor_id(cpu) != thread)
> continue;
> cpu_maps_update_done();
> - find_and_online_cpu_nid(cpu);
> + find_and_update_cpu_nid(cpu);
> rc = device_online(get_cpu_device(cpu));
> if (rc) {
> dlpar_offline_cpu(dn);
> --
> 2.16.4
>
> --
> Oscar Salvador
> SUSE Labs
More information about the Linuxppc-dev
mailing list