[PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries
Madhavan Srinivasan
maddy at linux.vnet.ibm.com
Tue Apr 29 20:35:25 EST 2014
On Tuesday 29 April 2014 12:36 PM, Ingo Molnar wrote:
>
> * Madhavan Srinivasan <maddy at linux.vnet.ibm.com> wrote:
>
>> Performance data for different FAULT_AROUND_ORDER values from a 4-socket
>> Power7 system (128 threads and 128GB memory). perf stat with a repeat of 5
>> is used to get the stddev values. Tests were run on a v3.14 kernel (baseline)
>> and on v3.15-rc1 for the different fault-around order values.
>>
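[ For completeness: per the description above, the data was gathered
  with perf stat's repeat option, i.e. something along the lines of

      perf stat -r 5 -e minor-faults -- make -j64

  -- the exact event list and options are not in the thread, so treat
  the command line as approximate. ]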
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>>
>> Linux build (make -j64)
>> minor-faults 47,437,359 35,279,286 25,425,347 23,461,275 22,002,189 21,435,836
>> times in seconds 347.302528420 344.061588460 340.974022391 348.193508116 348.673900158 350.986543618
>> stddev for time ( +- 1.50% ) ( +- 0.73% ) ( +- 1.13% ) ( +- 1.01% ) ( +- 1.89% ) ( +- 1.55% )
>> %chg time to baseline -0.9% -1.8% 0.2% 0.39% 1.06%
>
> Probably too noisy.
Ok. I should have included the formula used for %change to clarify the
data presented. My bad.
Just to clarify, %change here is calculated with this formula:

    ((new value - baseline) / baseline)

A negative %change therefore means a drop in run time compared to
baseline (a speedup), and a positive value means an increase in run
time (a slowdown).
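Taking the Linux build column as a concrete example, order 1 works out
to:

    (344.061588460 - 347.302528420) / 347.302528420 = -0.93%

which is the -0.9% shown in the table, i.e. roughly a 1% reduction in
build time relative to baseline.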
With regards
Maddy
>
>> Linux rebuild (make -j64)
>> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
>> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
>> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
>> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%
>
> Here it looks like a speedup. Optimal value: 5+.
>
>> Binutils build (make all -j64 )
>> minor-faults 474,821 371,380 269,463 247,715 235,255 228,337
>> times in seconds 53.882492432 53.584289348 53.882773216 53.755816431 53.607824348 53.423759642
>> stddev for time ( +- 0.08% ) ( +- 0.56% ) ( +- 0.17% ) ( +- 0.11% ) ( +- 0.60% ) ( +- 0.69% )
>> %chg time to baseline -0.55% 0.0% -0.23% -0.51% -0.85%
>
> Probably too noisy, but looks like a potential slowdown?
>
>> Two synthetic tests: access every word in a file in sequential/random order.
>>
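[ The test harness itself isn't in this thread; a minimal sketch of the
  sequential variant, under the assumption that it simply mmaps the
  file and sums every word, would be:

      /* sketch only -- the real test harness may differ */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stddef.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
              struct stat st;
              volatile long sum = 0;
              int fd;

              if (argc < 2)
                      return 1;
              fd = open(argv[1], O_RDONLY);
              if (fd < 0 || fstat(fd, &st) < 0) {
                      perror(argv[1]);
                      return 1;
              }
              long *p = mmap(NULL, st.st_size, PROT_READ,
                             MAP_PRIVATE, fd, 0);
              if (p == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }
              /* the first touch of each page takes a minor fault;
               * fault-around maps neighbouring pages as well */
              for (size_t i = 0; i < st.st_size / sizeof(long); i++)
                      sum += p[i];
              munmap(p, st.st_size);
              close(fd);
              return 0;
      }

  The random variant would index p[] with a pseudo-random permutation
  instead of walking it linearly. ]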
>> Sequential access 16GiB file
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>> 1 thread
>> minor-faults 263,148 131,166 32,908 16,514 8,260 1,093
>> times in seconds 53.091138345 53.113191672 53.188776177 53.233017218 53.206841347 53.429979442
>> stddev for time ( +- 0.06% ) ( +- 0.07% ) ( +- 0.08% ) ( +- 0.09% ) ( +- 0.03% ) ( +- 0.03% )
>> %chg time to baseline 0.04% 0.18% 0.26% 0.21% 0.63%
>
> Speedup, optimal value: 8+.
>
>> 8 threads
>> minor-faults 2,097,267 1,048,753 262,237 131,397 65,621 8,274
>> times in seconds 55.173790028 54.591880790 54.824623287 54.802162211 54.969680503 54.790387715
>> stddev for time ( +- 0.78% ) ( +- 0.09% ) ( +- 0.08% ) ( +- 0.07% ) ( +- 0.28% ) ( +- 0.05% )
>> %chg time to baseline -1.05% -0.63% -0.67% -0.36% -0.69%
>
> Looks like a regression?
>
>> 32 threads
>> minor-faults 8,388,751 4,195,621 1,049,664 525,461 262,535 32,924
>> times in seconds 60.431573046 60.669110744 60.485336388 60.697789706 60.077959564 60.588855032
>> stddev for time ( +- 0.44% ) ( +- 0.27% ) ( +- 0.46% ) ( +- 0.67% ) ( +- 0.31% ) ( +- 0.49% )
>> %chg time to baseline 0.39% 0.08% 0.44% -0.58% 0.25%
>
> Probably too noisy.
>
>> 64 threads
>> minor-faults 16,777,409 8,607,527 2,289,766 1,202,264 598,405 67,587
>> times in seconds 96.932617720 100.675418760 102.109880836 103.881733383 102.580199555 105.751194041
>> stddev for time ( +- 1.39% ) ( +- 1.06% ) ( +- 0.99% ) ( +- 0.76% ) ( +- 1.65% ) ( +- 1.60% )
>> %chg time to baseline 3.86% 5.34% 7.16% 5.82% 9.09%
>
> Speedup, optimal value: 4+
>
>> 128 threads
>> minor-faults 33,554,705 17,375,375 4,682,462 2,337,245 1,179,007 134,819
>> times in seconds 128.766704495 115.659225437 120.353046307 115.291871270 115.450886036 113.991902150
>> stddev for time ( +- 2.93% ) ( +- 0.30% ) ( +- 2.93% ) ( +- 1.24% ) ( +- 1.03% ) ( +- 0.70% )
>> %chg time to baseline -10.17% -6.53% -10.46% -10.34% -11.47%
>
> Rather significant regression at order 1 already.
>
>> Random access 1GiB file
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>> 1 thread
>> minor-faults 17,155 8,678 2,126 1,097 581 134
>> times in seconds 51.904430523 51.658017987 51.919270792 51.560531738 52.354431597 51.976469502
>> stddev for time ( +- 3.19% ) ( +- 1.35% ) ( +- 1.56% ) ( +- 0.91% ) ( +- 1.70% ) ( +- 2.02% )
>> %chg time to baseline -0.47% 0.02% -0.66% 0.86% 0.13%
>
> Probably too noisy.
>
>> 8 threads
>> minor-faults 131,844 70,705 17,457 8,505 4,251 598
>> times in seconds 58.162813956 54.991706305 54.952675791 55.323057492 54.755587379 53.376722828
>> stddev for time ( +- 1.44% ) ( +- 0.69% ) ( +- 1.23% ) ( +- 2.78% ) ( +- 1.90% ) ( +- 2.91% )
>> %chg time to baseline -5.45% -5.52% -4.88% -5.86% -8.22%
>
> Regression.
>
>> 32 threads
>> minor-faults 524,437 270,760 67,069 33,414 16,641 2,204
>> times in seconds 69.981777072 76.539570015 79.753578505 76.245943618 77.254258344 79.072596831
>> stddev for time ( +- 2.81% ) ( +- 1.95% ) ( +- 2.66% ) ( +- 0.99% ) ( +- 2.35% ) ( +- 3.22% )
>> %chg time to baseline 9.37% 13.96% 8.95% 10.39% 12.98%
>
> Speedup, optimal value hard to tell due to noise - 3+ or 8+.
>
>> 64 threads
>> minor-faults 1,049,117 527,451 134,016 66,638 33,391 4,559
>> times in seconds 108.024517536 117.575067996 115.322659914 111.943998437 115.049450815 119.218450840
>> stddev for time ( +- 2.40% ) ( +- 1.77% ) ( +- 1.19% ) ( +- 3.29% ) ( +- 2.32% ) ( +- 1.42% )
>> %chg time to baseline 8.84% 6.75% 3.62% 6.5% 10.3%
>
> Speedup, optimal value again hard to tell due to noise.
>
>> 128 threads
>> minor-faults 2,097,440 1,054,360 267,042 133,328 66,532 8,652
>> times in seconds 155.055861167 153.059625968 152.449492156 151.024005282 150.844647770 155.954366718
>> stddev for time ( +- 1.32% ) ( +- 1.14% ) ( +- 1.32% ) ( +- 0.81% ) ( +- 0.75% ) ( +- 0.72% )
>> %chg time to baseline -1.28% -1.68% -2.59% -2.71% 0.57%
>
> Slowdown for most orders.
>
>> In case of kernel compilation, a fault-around order (fao) of 1 or 3
>> gives faster compilation times than a value of 4. On closer look,
>> fao of 3 has the higher gains. In the sequential-access synthetic
>> tests, fao of 1 has the higher gains, and in the random-access test,
>> fao of 3 has marginal gains. Going by compilation time, an fao value
>> of 3 is suggested in this patch for the pseries platform.
>
> So I'm really at a loss to understand where you get the optimal value
> of '3' from. The data does not seem to match your claim that fao '1 or
> 3 gives faster compilation times than a value of 4':
>
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>>
>> Linux rebuild (make -j64)
>> minor-faults 941,552 718,319 486,625 440,124 410,510 397,416
>> times in seconds 30.569834718 31.219637539 31.319370649 31.434285472 31.972367174 31.443043580
>> stddev for time ( +- 1.07% ) ( +- 0.13% ) ( +- 0.43% ) ( +- 0.18% ) ( +- 0.95% ) ( +- 0.58% )
>> %chg time to baseline 2.1% 2.4% 2.8% 4.58% 2.85%
>
> 5 and 8, and probably 6, 7 are better than 4.
>
> 3 is probably _slower_ than the current default - but it's hard to
> tell due to inherent noise.
>
> But the other two build tests were too noisy, and if then they showed
> a slowdown.
>
>> Worst-case scenario: we touch one page every 16MiB to demonstrate the overhead.
>>
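[ Relative to the sequential sketch earlier in the thread, the worst
  case just strides the loop by 16MiB, so nearly every page the kernel
  faults around is wasted work:

      /* touch one word every 16MiB of the mapping */
      for (size_t off = 0; off + sizeof(long) <= (size_t)st.st_size;
           off += 16UL << 20)
              sum += p[off / sizeof(long)];
  ]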
>> Touch only one page per page table in a 16GiB file
>> FAULT_AROUND_ORDER Baseline 1 3 4 5 8
>> 1 thread
>> minor-faults 1,104 1,090 1,071 1,068 1,065 1,063
>> times in seconds 0.006583298 0.008531502 0.019733795 0.036033763 0.062300553 0.406857086
>> stddev for time ( +- 2.79% ) ( +- 2.42% ) ( +- 3.47% ) ( +- 2.81% ) ( +- 2.01% ) ( +- 1.33% )
>> 8 threads
>> minor-faults 8,279 8,264 8,245 8,243 8,239 8,240
>> times in seconds 0.044572398 0.057211811 0.107606306 0.205626815 0.381679120 2.647979955
>> stddev for time ( +- 1.95% ) ( +- 2.98% ) ( +- 1.74% ) ( +- 2.80% ) ( +- 2.01% ) ( +- 1.86% )
>> 32 threads
>> minor-faults 32,879 32,864 32,849 32,845 32,839 32,843
>> times in seconds 0.197659343 0.218486087 0.445116407 0.694235883 1.296894038 9.127517045
>> stddev for time ( +- 3.05% ) ( +- 3.05% ) ( +- 4.33% ) ( +- 3.08% ) ( +- 3.75% ) ( +- 0.56% )
>> 64 threads
>> minor-faults 65,680 65,664 65,646 65,645 65,640 65,647
>> times in seconds 0.455537304 0.489688780 0.866490093 1.427393118 2.379628982 17.059295051
>> stddev for time ( +- 4.01% ) ( +- 4.13% ) ( +- 2.92% ) ( +- 1.68% ) ( +- 1.79% ) ( +- 0.48% )
>> 128 threads
>> minor-faults 131,279 131,265 131,250 131,245 131,241 131,254
>> times in seconds 1.026880651 1.095327536 1.721728274 2.808233068 4.662729948 31.732848290
>> stddev for time ( +- 6.85% ) ( +- 4.09% ) ( +- 1.71% ) ( +- 3.45% ) ( +- 2.40% ) ( +- 0.68% )
>
> There are no '%change' values shown, but the slowdown looks
> significant, and this is the worst case: for example with 1 thread,
> order 3 looks about 3x slower than order 0.
>
> All in all, looking at your latest data I don't think the conclusion
> from the first version of this optimization patch from a month ago
> holds anymore:
>
>> + /* Measured on a 4 socket Power7 system (128 Threads and 128GB memory) */
>> + fault_around_order = 3;
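[ For scale: in v3.15-rc1 mm/memory.c the fault-around window is
  1 << fault_around_order pages (as I read it), so an order translates
  to bytes with a trivial calculation -- 64K pages are assumed below,
  being the common pseries configuration:

      #include <stdio.h>

      int main(void)
      {
              unsigned long page_size = 65536;    /* 64K pages assumed */
              int order;

              for (order = 0; order <= 8; order++)
                      printf("order %d -> %3lu pages, %5lu KiB\n",
                             order, 1UL << order,
                             ((1UL << order) * page_size) >> 10);
              return 0;
      }

  Order 3 thus covers 8 pages (512KiB with 64K pages, 32KiB with 4K
  pages), against 16 pages for the current generic default of 4. ]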
>
> The data is rather conflicting and inconclusive, and if it shows a
> sweet spot, it's not at order 3. New data should in general trigger a
> reanalysis of your first optimization value.
>
> I'm starting to suspect that maybe workloads ought to be given a
> choice in this matter, via madvise() or such.
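[ For reference, the existing per-mapping hint interface such a knob
  would presumably build on; note that MADV_RANDOM/MADV_SEQUENTIAL
  today only steer readahead, so controlling fault-around this way
  would need a new advice value:

      #include <sys/mman.h>

      /* hint that a mapping will be accessed randomly */
      static int hint_random(void *addr, size_t len)
      {
              return madvise(addr, len, MADV_RANDOM);
      }
  ]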
>
> Thanks,
>
> Ingo
>