Appropriate liburcu cache line size for Power

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Wed Mar 27 01:37:57 AEDT 2024


On 2024-03-26 03:19, Michael Ellerman wrote:
> Mathieu Desnoyers <mathieu.desnoyers at efficios.com> writes:
>> Hi,
> 
> Hi Mathieu,
> 
>> In the powerpc architecture support within the liburcu project [1]
>> we have a cache line size defined as 256 bytes with the following
>> comment:
>>
>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>> #define CAA_CACHE_LINE_SIZE     256
>>
>> I recently received a pull request on github [2] asking to
>> change this to 128 bytes. All the material provided supports
>> that the cache line sizes on powerpc are 128 bytes or less (even
>> L3 on POWER7, POWER8, and POWER9) [3].
>>
>> I wonder where the 256 bytes L3 cache line size for POWER5+
>> we have in liburcu comes from, and I wonder if it's the right choice
>> for a cache line size on all powerpc, considering that the Linux
>> kernel appears to use a 128-byte cache line size on recent Power
>> architectures. I recall some benchmark experiments Paul and I did
>> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
>> cache line size, and I suppose this is why we came up with this
>> value, but I don't have the detailed specs of that machine.
>>
>> Any feedback on this matter would be appreciated.
> 
> The ISA doesn't specify the cache line size, other than it is smaller
> than a page.
> 
> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
> bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and
> Freescale e6500.
> 
> It is possible to discover at runtime via AUXV headers. But that's no
> use if you want a compile-time constant.

Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
so changing this would require a soname bump, which I don't want to do
without really good reasons.
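
For completeness, runtime discovery would look something like the
sketch below (assuming Linux with glibc; AT_DCACHEBSIZE is the
auxiliary vector entry the powerpc kernel fills in with the data cache
block size). It only helps for runtime decisions though, not for a
compile-time constant that is baked into structure layouts:

/*
 * Sketch only: query the data cache block size at runtime.
 * AT_DCACHEBSIZE is populated on powerpc; fall back to sysconf()
 * elsewhere (which may return 0 or -1 if unknown).
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/auxv.h>

static long runtime_cache_line_size(void)
{
        unsigned long sz = getauxval(AT_DCACHEBSIZE);

        if (sz)
                return (long) sz;
        return sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
}

int main(void)
{
        printf("runtime cache line size: %ld bytes\n",
                runtime_cache_line_size());
        return 0;
}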

> 
> I'm happy to run some benchmarks if you can point me at what to run. I
> had a poke around the repository and found short_bench, but it seemed to
> run for a very long time.

I've created a dedicated test program for this, see:

https://github.com/compudj/userspace-rcu-dev/tree/false-sharing

example use:

On an AMD Ryzen 7 PRO 6850U with Radeon Graphics:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 21320
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 22657
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 47599
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 531364
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 523634
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 519402
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 520651
1..1

This points to false-sharing for strides smaller than 64 bytes. I get
similar results (false-sharing under 64 bytes) with an
AMD EPYC 9654 96-Core Processor.

The test program runs 4 threads by default, which can be overridden
with "-t N". This may be needed if you want it to use all the cores of
a larger machine. See "-h" for options.
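
For reference, the core of the test is conceptually something like the
simplified sketch below (not the actual tests/unit/test_false_sharing
source; thread count and duration are hard-coded here for brevity).
Each thread increments its own counter, with counters placed "stride"
bytes apart, so a stride smaller than the hardware cache line makes the
threads ping-pong the same line between CPUs:

/*
 * Simplified false-sharing benchmark sketch (POSIX threads).
 * Each thread increments its own counter for DURATION_MS; counters
 * are "stride" bytes apart, so strides smaller than the cache line
 * size force the threads to contend on the same line.
 */
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS      4
#define DURATION_MS     1000

static size_t stride = 64;              /* bytes between per-thread counters */
static unsigned char *counters;         /* NR_THREADS * stride bytes */
static volatile int stop;

static void *thread_fn(void *arg)
{
        volatile uint64_t *counter =
                (volatile uint64_t *) (counters + (uintptr_t) arg * stride);

        while (!stop)
                (*counter)++;
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid[NR_THREADS];
        uint64_t total = 0;
        void *buf;
        long i;

        if (argc > 1)
                stride = (size_t) atol(argv[1]);
        if (posix_memalign(&buf, 4096, NR_THREADS * stride))
                abort();
        counters = buf;
        memset(counters, 0, NR_THREADS * stride);
        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, thread_fn, (void *) i);
        poll(NULL, 0, DURATION_MS);     /* measurement window */
        stop = 1;
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tid[i], NULL);
        for (i = 0; i < NR_THREADS; i++)
                total += *(uint64_t *) (counters + i * stride);
        printf("stride %zu bytes: %llu increments/ms/thread\n", stride,
                (unsigned long long) (total / DURATION_MS / NR_THREADS));
        return 0;
}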

On a POWER9 (architected), altivec supported:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 12264
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 12276
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 25638
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 39934
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 53971
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 53599
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 53962
1..1

This points at false-sharing below a 128-byte stride.

On an e6500, altivec supported, Model 2.0 (pvr 8040 0120):

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9049
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 9054
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 18643
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 37417
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 37906
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 37870
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 37899
1..1

This points at false-sharing below a 64-byte stride.

I prefer to be cautious about this cache line size value and aim for
a value which takes into account the largest known cache line size
for an architecture, rather than use a value that is too small, given
the large overhead caused by false-sharing.
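
To make the trade-off concrete: the constant ends up in structure
layouts such as the (purely illustrative, hypothetical) one below, so a
value larger than the hardware line only costs some padding, whereas a
smaller value silently reintroduces the false-sharing measured above,
and changing it after the fact breaks the ABI:

/*
 * Illustrative only, not an actual liburcu data structure:
 * per-thread state aligned to CAA_CACHE_LINE_SIZE so that concurrent
 * updates from different threads never hit the same cache line.
 */
#include <urcu/arch.h>  /* CAA_CACHE_LINE_SIZE */

struct per_thread_count {
        unsigned long count;
} __attribute__((aligned(CAA_CACHE_LINE_SIZE)));

static struct per_thread_count counts[64];      /* one slot per thread */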

Feedback is welcome.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


