[PATCH] ppc64: per_cpu data optimisations

Thu Dec 29 22:54:34 EST 2005

> 5 loads for something that is supposed to be fast, pretty awful. One
> reason for the large number of loads is that we have to synthesize 2
> 64bit constants (per_cpu__variable_name and __per_cpu_offset).

It will probably not help you very much, because most code seems
to use int cpu = get_cpu(); per_cpu(..., cpu); put_cpu();
instead of the faster get_cpu(); __get_cpu_var(...); put_cpu();

With an explicit cpu argument there is no fast way to take the local
CPU shortcut :/

It would probably be a good idea to go through the fast
paths and change them over to the second pattern.
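
To make the two patterns concrete, here is a rough user-space mock. It is
not kernel code: hits_area, the mock_* names and the count_hit_*() helpers
are made up for illustration, and the macros only imitate the shape of
per_cpu() and __get_cpu_var(). The point is just that the generic form
needs a table lookup keyed by the cpu argument, while the local form can
use an offset the running CPU already has at hand.

```c
#include <stddef.h>

#define NR_CPUS 4

/* One slot per CPU; the kernel has a whole per-CPU data area instead. */
static long hits_area[NR_CPUS];

/* Mock of the per-CPU offset table (assumption: offsets in elements). */
static size_t mock_per_cpu_offset[NR_CPUS] = { 0, 1, 2, 3 };

/* Mock of state the running CPU already holds (think paca->data_offset). */
static int    mock_this_cpu = 2;
static size_t mock_local_offset = 2;
static int    mock_preempt_count;

/* Mock get_cpu()/put_cpu(): pin to the current CPU, then unpin. */
int  get_cpu(void) { mock_preempt_count++; return mock_this_cpu; }
void put_cpu(void) { mock_preempt_count--; }

/* Generic form: extra load from the offset table, indexed by cpu. */
#define per_cpu(var, cpu)  (*(&var##_area[0] + mock_per_cpu_offset[(cpu)]))

/* Local form: no cpu argument, so the local offset is used directly. */
#define __get_cpu_var(var) (*(&var##_area[0] + mock_local_offset))

/* Pattern 1: what most code seems to do today. */
void count_hit_slow(void)
{
    int cpu = get_cpu();
    per_cpu(hits, cpu)++;
    put_cpu();
}

/* Pattern 2: same effect, but can take the local CPU shortcut. */
void count_hit_fast(void)
{
    get_cpu();
    __get_cpu_var(hits)++;
    put_cpu();
}

/* Helper to read back one CPU's copy. */
long hits_on(int cpu) { return per_cpu(hits, cpu); }
```

Both helpers update the same slot (the current CPU's), so converting a
fast path from the first form to the second should not change behaviour,
only the cost of finding the address.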

> Longer term we should be able to do even better than 3 loads.
> If per_cpu__variable_name wasn't a 64bit constant and paca->data_offset
> was in a register we could cut it down to one load. A suggestion from
> Rusty is to use gcc's __thread extension here. In order to do this we
> would need to free up r13 (the __thread register and where the paca
> currently is). So far I've had a few unsuccessful attempts at doing that :)

I tried it at some point on x86-64, but gave up because the ELF
relocations for this are hopelessly user-space-specific hacks, and it
was just impossible to use them for anything else.

Also you become very glibc/binutils specific, and I think it would
be a bad thing to reach the glibc state in the kernel, where you always
need the latest toolchain to build it.

> 
> At this stage it might be worth making the NUMA and possible cpu
> optimisations generic, but per cpu init is done so early we have to be
> careful that all architectures have their possible map setup correctly.

It's quite complicated to do anyway - I'm just going through
it with Kiran.

One problem is that sched_init() accesses per cpu variables really early,
so you have ugly ordering problems. That is why Kiran's patch has
to bootstrap it with a "boot time" per cpu area and then relocate it
later. Quite ugly.
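
Roughly, the bootstrap-then-relocate idea looks like the user-space
sketch below. This is not Kiran's patch: struct percpu_area, boot_area,
cpu_area[], early_sched_init() and relocate_percpu_areas() are all
hypothetical names, malloc()/calloc() stand in for whatever allocator is
available later in boot, and I assume CPU 0 is the boot CPU.

```c
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 4

/* Hypothetical contents of one per-CPU area. */
struct percpu_area {
    long nr_running;
};

/* Static area that is usable before any allocator is up. */
static struct percpu_area boot_area;

/* Every CPU initially points at the shared boot-time area. */
static struct percpu_area *cpu_area[NR_CPUS] = {
    &boot_area, &boot_area, &boot_area, &boot_area
};

/* Stand-in for early code that touches per-CPU data (cf. sched_init()). */
void early_sched_init(void)
{
    cpu_area[0]->nr_running = 1;
}

/*
 * Later in boot: give each CPU its own area. The boot CPU keeps what was
 * written early; the others start out zeroed (a simplification).
 * Returns 0 on success, -1 if an allocation fails.
 */
int relocate_percpu_areas(void)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        struct percpu_area *area = calloc(1, sizeof(*area));

        if (!area)
            return -1;
        if (cpu == 0)
            memcpy(area, &boot_area, sizeof(*area));
        cpu_area[cpu] = area;
    }
    return 0;
}
```

The ugliness shows up even in the mock: any pointer into boot_area that
was taken before relocate_percpu_areas() ran still points at the old
copy afterwards, so nothing may hold on to such an address across the
switch.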

-Andi