[PATCH v2 3/4] powerpc/numa: Early request for home node associativity
Nathan Lynch
nathanl at linux.ibm.com
Fri Sep 6 06:04:00 AEST 2019
Hi Srikar,
Srikar Dronamraju <srikar at linux.vnet.ibm.com> writes:
> Currently the kernel detects if it's running on a shared LPAR platform
> and requests home node associativity before the scheduler sched_domains
> are set up. However, between the time NUMA setup is initialized and the
> request for home node associativity, the workqueue initializes its
> per-node cpumask. The per-node workqueue possible cpumask may turn
> invalid after the home node associativity update, resulting in weird
> situations like the workqueue possible cpumask being a subset of the
> workqueue online cpumask.
>
> This can be fixed by requesting home node associativity earlier, just
> before NUMA setup. However, at NUMA setup time the kernel may not be in
> a position to detect whether it's running on a shared LPAR platform, so
> request home node associativity and, if the request fails, fall back
> on the device tree property.
>
> While here, fix a problem where of_node_put could be called even when
> of_get_cpu_node was not successful.
of_node_put() handles NULL arguments, so this should not be necessary.
> +static int vphn_get_nid(unsigned long cpu, bool get_hwid)
[...]
> +static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
[...]
> @@ -528,7 +561,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
> {
> int nid;
>
> - nid = numa_setup_cpu(cpu);
> + nid = numa_setup_cpu(cpu, true);
> verify_cpu_node_mapping(cpu, nid);
> return 0;
> }
> @@ -875,7 +908,7 @@ void __init mem_topology_setup(void)
> reset_numa_cpu_lookup_table();
>
> for_each_present_cpu(cpu)
> - numa_setup_cpu(cpu);
> + numa_setup_cpu(cpu, false);
> }
I'm open to other points of view here, but I would prefer two separate
functions, something like vphn_get_nid() for runtime and
vphn_get_nid_early() (which could be __init) for boot-time
initialization. Propagating a somewhat unexpressive boolean flag through
two levels of function calls in this code is unappealing...
Regardless, I have an annoying question :-) Isn't it possible that,
while Linux is calling vphn_get_nid() for each logical cpu in sequence,
the platform could change a virtual processor's node assignment,
potentially causing sibling threads to get different node assignments
and producing an incoherent topology (which then leads to sched domain
assertions etc)?
If so, I think more care is needed: the algorithm should make the vphn
call only once per cpu node, so that sibling threads always end up with
the same answer.