[PATCH] powerpc/numa: Fix percpu allocations to be NUMA aware
Michael Ellerman
mpe at ellerman.id.au
Fri Jun 2 19:54:32 AEST 2017
Nicholas Piggin <npiggin at gmail.com> writes:
> On Fri, 2 Jun 2017 15:14:47 +1000
> Michael Ellerman <mpe at ellerman.id.au> wrote:
>
>> In commit 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), we
>> switched to the generic implementation of cpu_to_node(), which uses a percpu
>> variable to hold the NUMA node for each CPU.
>>
>> Unfortunately we neglected to notice that we use cpu_to_node() in the allocation
>> of our percpu areas, leading to a chicken-and-egg problem. In practice, when we
>> are setting up the percpu areas cpu_to_node() reports that all CPUs are on node
>> 0, so we allocate all percpu areas on node 0.
>>
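>> For reference, with USE_PERCPU_NUMA_NODE_ID the generic cpu_to_node() is
>> roughly just a percpu read (see include/linux/topology.h), which is why it
>> can't give the right answer before the percpu areas themselves exist:
>>
>>   DECLARE_PER_CPU(int, numa_node);
>>
>>   static inline int cpu_to_node(int cpu)
>>   {
>>           return per_cpu(numa_node, cpu);
>>   }
>>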
>> This is visible in the dmesg output, with all pcpu allocs landing in group 0:
>>
>> pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>> pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>> pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>> pcpu-alloc: [0] 24 25 26 27 [0] 28 29 30 31
>> pcpu-alloc: [0] 32 33 34 35 [0] 36 37 38 39
>> pcpu-alloc: [0] 40 41 42 43 [0] 44 45 46 47
>>
>> To fix it we need an early_cpu_to_node() which can run prior to percpu being
>> set up. We already have the numa_cpu_lookup_table we can use, so just plumb it
>> in; a sketch of the helper follows the dmesg output below. With the patch,
>> dmesg shows two groups, 0 and 1:
>>
>> pcpu-alloc: [0] 00 01 02 03 [0] 04 05 06 07
>> pcpu-alloc: [0] 08 09 10 11 [0] 12 13 14 15
>> pcpu-alloc: [0] 16 17 18 19 [0] 20 21 22 23
>> pcpu-alloc: [1] 24 25 26 27 [1] 28 29 30 31
>> pcpu-alloc: [1] 32 33 34 35 [1] 36 37 38 39
>> pcpu-alloc: [1] 40 41 42 43 [1] 44 45 46 47
>>
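>> A minimal sketch of the helper, assuming numa_cpu_lookup_table is already
>> populated by the time the percpu areas are set up (unset entries are -1,
>> which the caller below handles):
>>
>>   static inline int early_cpu_to_node(int cpu)
>>   {
>>           return numa_cpu_lookup_table[cpu];
>>   }
>>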
>> We can also check the data_offset in the paca of various CPUs. With the fix
>> we see:
>>
>> CPU 0: data_offset = 0x0ffe8b0000
>> CPU 24: data_offset = 0x1ffe5b0000
>>
>> And we can see from dmesg that CPU 24 has an allocation on node 1:
>>
>> node 0: [mem 0x0000000000000000-0x0000000fffffffff]
>> node 1: [mem 0x0000001000000000-0x0000001fffffffff]
>
> Nice bug :)
Yeah what a shocker.
> I wonder what happens if you put a WARN if cpu_to_node is used
> before it is set up?
Yeah we definitely need some way to debug these things.
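Something like this might do it, as a rough sketch: initialise each percpu
numa_node slot to NUMA_NO_NODE, then warn on the first read before setup
(untested, just to illustrate the idea):

  static inline int cpu_to_node(int cpu)
  {
          int nid = per_cpu(numa_node, cpu);

          /* Illustrative only: catch use before set_cpu_numa_node() runs */
          WARN_ON_ONCE(nid == NUMA_NO_NODE);

          return nid;
  }
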
>> @@ -672,10 +672,19 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
>>
>> static int pcpu_cpu_distance(unsigned int from, unsigned int to)
>> {
>> - if (cpu_to_node(from) == cpu_to_node(to))
>> - return LOCAL_DISTANCE;
>> - else
>> - return REMOTE_DISTANCE;
>> +#ifndef CONFIG_NUMA
>> + return LOCAL_DISTANCE;
>> +#else
>> + int from_nid, to_nid;
>> +
>> + from_nid = early_cpu_to_node(from);
>> + to_nid = early_cpu_to_node(to);
>> +
>> + if (from_nid == -1 || to_nid == -1)
>> + return LOCAL_DISTANCE; /* Or assume remote? */
>> +
>> + return node_distance(from_nid, to_nid);
>
> If you made node_distance() return LOCAL_DISTANCE for !NUMA, this
> should fall out and not require the ifdef?
Maybe yeah. This is designed to be minimal for backporting though.
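
For the record, the ifdef-free version would look something like this
(a sketch, assuming node_distance() handled the !NUMA case itself):

  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
  {
          int from_nid = early_cpu_to_node(from);
          int to_nid = early_cpu_to_node(to);

          /* Unset entries are -1; treat those as local for now */
          if (from_nid == -1 || to_nid == -1)
                  return LOCAL_DISTANCE;

          return node_distance(from_nid, to_nid);
  }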
cheers