[PATCH v11 00/26] Speculative page faults

Tue Jul 17 19:36:45 AEST 2018

On 13/07/2018 05:56, Song, HaiyanX wrote:
> Hi Laurent,

Hi Haiyan,

Thanks a lot for sharing these perf reports.

I looked at them closely, and I have to admit that I was not able to find a
major difference between the base and the head reports, except that
handle_pte_fault() is no longer inlined in the head one.

As expected, __handle_speculative_fault() is never traced, since these tests
deal with file mappings, which are not handled speculatively.

When running these tests, did you see a major difference in the results
between base and head?

From the number of cycles counted, the biggest difference is page_fault3 when
run with THP enabled:
                                BASE            HEAD            Delta
page_fault2_base_thp_never      1142252426747   1065866197589   -6.69%
page_fault2_base_THP-Always     1124844374523   1076312228927   -4.31%
page_fault3_base_thp_never      1099387298152   1134118402345   +3.16%
page_fault3_base_THP-Always     1059370178101   853985561949    -19.39%
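
For reference, the Delta column can be recomputed from the raw cycle counts.
Here is a minimal Python sketch, assuming the delta is simply the relative
change of HEAD against BASE (the helper name delta_pct is only illustrative):

```python
# Cycle counts copied from the table above: (base, head) per test.
CYCLES = {
    "page_fault2_base_thp_never": (1142252426747, 1065866197589),
    "page_fault2_base_THP-Always": (1124844374523, 1076312228927),
    "page_fault3_base_thp_never": (1099387298152, 1134118402345),
    "page_fault3_base_THP-Always": (1059370178101, 853985561949),
}


def delta_pct(base: int, head: int) -> float:
    """Relative change of head against base, in percent."""
    return (head - base) / base * 100.0


for name, (base, head) in CYCLES.items():
    # A negative value means fewer cycles were counted on head than on base.
    print(f"{name:30s} {delta_pct(base, head):+7.2f}%")
```

Only page_fault3 with thp never shows more cycles on head than on base.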

The very odd thing is that the cycle delta differs between THP never and THP
always: the speculative path is aborted when checking the vma->vm_ops field,
which is the same in both cases, and THP is never checked. So there is no
difference in the code covered on the speculative path between these two
cases. This leads me to think that other interactions are interfering with
the measurement.

Looking at perf-profile_page_fault3_*_THP-Always, the major difference at the
top of the perf report is the 92% testcase entry, which is oddly not reported
on the head side:
    92.02%    22.33%  page_fault3_processes  [.] testcase
92.02% testcase

Then the base reports 37.67% for __do_page_fault() where the head reports
48.41%, but the only difference in this function between base and head is the
call to handle_speculative_fault(). This is a macro which checks the fault
flags and mm->mm_users, and then calls __handle_speculative_fault() if needed.
So this can't explain the difference, unless __handle_speculative_fault()
is inlined in __do_page_fault().
Is this the case on your build?
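
To compare the two reports symbol by symbol, something like the following
minimal Python sketch could be used. It assumes report lines shaped like the
excerpt above (children %, self %, command, symbol); parse_report() and
compare() are illustrative names, not part of perf, and the only figures used
are the ones quoted in this mail.

```python
import re

# Matches lines shaped like the excerpt above, e.g.
#   "    92.02%    22.33%  page_fault3_processes  [.] testcase"
LINE_RE = re.compile(
    r"^\s*(?P<children>\d+(?:\.\d+)?)%\s+(?P<self>\d+(?:\.\d+)?)%\s+"
    r"(?P<comm>\S+)\s+\[(?P<kind>.)\]\s+(?P<symbol>\S+)\s*$"
)


def parse_report(text):
    """Map symbol -> 'children' percentage for every matching line."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            result[match.group("symbol")] = float(match.group("children"))
    return result


def compare(base, head):
    """Yield (symbol, base_pct, head_pct); None marks a missing symbol."""
    for symbol in sorted(set(base) | set(head)):
        yield symbol, base.get(symbol), head.get(symbol)


base = parse_report(
    "    92.02%    22.33%  page_fault3_processes  [.] testcase\n"
)
base["__do_page_fault"] = 37.67    # base figure quoted above
head = {"__do_page_fault": 48.41}  # head figure quoted above, no testcase

for symbol, base_pct, head_pct in compare(base, head):
    print(symbol, base_pct, head_pct)
```

A symbol present on one side only, like testcase here, shows up with None on
the other side, which makes this kind of asymmetry easy to spot.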

Haiyan, do you still have the test output, so that we can check those numbers
too?

Cheers,
Laurent

> I attached the perf-profile.gz files for the page_fault2 and page_fault3 cases. These files were captured while running the related test cases.
> Please check whether these data can help you to find the cause of the higher change. Thanks.
> 
> The file name perf-profile_page_fault2_head_THP-Always.gz means the perf-profile result obtained from page_fault2,
>     tested on the head commit (a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with the THP_always configuration.
> 
> Best regards,
> Haiyan Song
> 
> ________________________________________
> From: owner-linux-mm at kvack.org [owner-linux-mm at kvack.org] on behalf of Laurent Dufour [ldufour at linux.vnet.ibm.com]
> Sent: Thursday, July 12, 2018 1:05 AM
> To: Song, HaiyanX
> Cc: akpm at linux-foundation.org; mhocko at kernel.org; peterz at infradead.org; kirill at shutemov.name; ak at linux.intel.com; dave at stgolabs.net; jack at suse.cz; Matthew Wilcox; khandual at linux.vnet.ibm.com; aneesh.kumar at linux.vnet.ibm.com; benh at kernel.crashing.org; mpe at ellerman.id.au; paulus at samba.org; Thomas Gleixner; Ingo Molnar; hpa at zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work at gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel at vger.kernel.org; linux-mm at kvack.org; haren at linux.vnet.ibm.com; npiggin at gmail.com; bsingharora at gmail.com; paulmck at linux.vnet.ibm.com; Tim Chen; linuxppc-dev at lists.ozlabs.org; x86 at kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
> 
> Hi Haiyan,
> 
> Did you get a chance to capture some performance cycles on your system?
> I still can't get these numbers on my hardware.
> 
> Thanks,
> Laurent.
> 
> On 04/07/2018 09:51, Laurent Dufour wrote:
>> On 04/07/2018 05:23, Song, HaiyanX wrote:
>>> Hi Laurent,
>>>
>>>
>>> For the test results on the Intel 4s Skylake platform (192 CPUs, 768G memory), the test cases below were all run 3 times.
>>> I checked the test results: only page_fault3_thread/enable THP has a 6% stddev for the head commit; the other tests have a lower stddev.
>>
>> Repeating the test only 3 times seems a bit too low to me.
>>
>> I'll focus on the higher change for the moment, but I don't have access to
>> such hardware.
>>
>> Would it be possible to provide a diff between base and SPF of the performance
>> cycles measured when running page_fault3 and page_fault2, when the 20% change
>> is detected?
>>
>> Please stay focused on the test case process to see exactly where the series
>> is having an impact.
>>
>> Thanks,
>> Laurent.
>>
>>>
>>> And I did not find any other high variation in the test case results.
>>>
>>> a). Enable THP
>>> testcase                          base    stddev        change      head    stddev   metric
>>> page_fault3/enable THP           10519      ± 3%        -20.5%      8368      ± 6%   will-it-scale.per_thread_ops
>>> page_fault2/enable THP            8281      ± 2%        -18.8%      6728             will-it-scale.per_thread_ops
>>> brk1/enable THP                 998475                   -2.2%    976893             will-it-scale.per_process_ops
>>> context_switch1/enable THP      223910                   -1.3%    220930             will-it-scale.per_process_ops
>>> context_switch1/enable THP      233722                   -1.0%    231288             will-it-scale.per_thread_ops
>>>
>>> b). Disable THP
>>> page_fault3/disable THP          10856                  -23.1%      8344             will-it-scale.per_thread_ops
>>> page_fault2/disable THP           8147                  -18.8%      6613             will-it-scale.per_thread_ops
>>> brk1/disable THP                   957                   -7.9%       881             will-it-scale.per_thread_ops
>>> context_switch1/disable THP     237006                   -2.2%    231907             will-it-scale.per_thread_ops
>>> brk1/disable THP                997317                   -2.0%    977778             will-it-scale.per_process_ops
>>> page_fault3/disable THP         467454                   -1.8%    459251             will-it-scale.per_process_ops
>>> context_switch1/disable THP     224431                   -1.3%    221567             will-it-scale.per_process_ops
>>>
>>>
>>> Best regards,
>>> Haiyan Song
>>> ________________________________________
>>> From: Laurent Dufour [ldufour at linux.vnet.ibm.com]
>>> Sent: Monday, July 02, 2018 4:59 PM
>>> To: Song, HaiyanX
>>> Cc: akpm at linux-foundation.org; mhocko at kernel.org; peterz at infradead.org; kirill at shutemov.name; ak at linux.intel.com; dave at stgolabs.net; jack at suse.cz; Matthew Wilcox; khandual at linux.vnet.ibm.com; aneesh.kumar at linux.vnet.ibm.com; benh at kernel.crashing.org; mpe at ellerman.id.au; paulus at samba.org; Thomas Gleixner; Ingo Molnar; hpa at zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work at gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel at vger.kernel.org; linux-mm at kvack.org; haren at linux.vnet.ibm.com; npiggin at gmail.com; bsingharora at gmail.com; paulmck at linux.vnet.ibm.com; Tim Chen; linuxppc-dev at lists.ozlabs.org; x86 at kernel.org
>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>
>>> On 11/06/2018 09:49, Song, HaiyanX wrote:
>>>> Hi Laurent,
>>>>
>>>> The regression tests for the v11 patch series have been run, and some regressions were found by LKP-tools (linux kernel performance),
>>>> tested on the Intel 4s Skylake platform. This time, only the cases which were run and showed regressions on the
>>>> V9 patch series were tested.
>>>>
>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
>>>> commit id:
>>>>   head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>>>>   base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
>>>> Benchmark: will-it-scale
>>>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
>>>>
>>>> Metrics:
>>>>   will-it-scale.per_process_ops=processes/nr_cpu
>>>>   will-it-scale.per_thread_ops=threads/nr_cpu
>>>>   test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>> THP: enable / disable
>>>> nr_task:100%
>>>>
>>>> 1. Regressions:
>>>>
>>>> a). Enable THP
>>>> testcase                          base        change      head   metric
>>>> page_fault3/enable THP           10519        -20.5%      8368   will-it-scale.per_thread_ops
>>>> page_fault2/enable THP            8281        -18.8%      6728   will-it-scale.per_thread_ops
>>>> brk1/enable THP                 998475         -2.2%    976893   will-it-scale.per_process_ops
>>>> context_switch1/enable THP      223910         -1.3%    220930   will-it-scale.per_process_ops
>>>> context_switch1/enable THP      233722         -1.0%    231288   will-it-scale.per_thread_ops
>>>>
>>>> b). Disable THP
>>>> page_fault3/disable THP          10856        -23.1%      8344   will-it-scale.per_thread_ops
>>>> page_fault2/disable THP           8147        -18.8%      6613   will-it-scale.per_thread_ops
>>>> brk1/disable THP                   957         -7.9%       881   will-it-scale.per_thread_ops
>>>> context_switch1/disable THP     237006         -2.2%    231907   will-it-scale.per_thread_ops
>>>> brk1/disable THP                997317         -2.0%    977778   will-it-scale.per_process_ops
>>>> page_fault3/disable THP         467454         -1.8%    459251   will-it-scale.per_process_ops
>>>> context_switch1/disable THP     224431         -1.3%    221567   will-it-scale.per_process_ops
>>>>
>>>> Notes: for the above test result values, higher is better.
>>>
>>> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't
>>> get reproducible results. The results have huge variation, even on the vanilla
>>> kernel, and I can't draw any conclusion about changes because of that.
>>>
>>> I tried on a smaller node (80 CPUs, 32G), and the tests ran better, but I didn't
>>> measure any change between the vanilla and the SPF patched kernels:
>>>
>>> test THP enabled                4.17.0-rc4-mm1  spf             delta
>>> page_fault3_threads             2697.7          2683.5          -0.53%
>>> page_fault2_threads             170660.6        169574.1        -0.64%
>>> context_switch1_threads         6915269.2       6877507.3       -0.55%
>>> context_switch1_processes       6478076.2       6529493.5       0.79%
>>> brk1                            243391.2        238527.5        -2.00%
>>>
>>> Tests were run 10 times, no high variation detected.
>>>
>>> Did you see high variation on your side? How many times were the tests run to
>>> compute the average values?
>>>
>>> Thanks,
>>> Laurent.
>>>
>>>
>>>>
>>>> 2. Improvements: no improvement was found with the selected test cases.
>>>>
>>>>
>>>> Best regards
>>>> Haiyan Song
>>>> ________________________________________
>>>> From: owner-linux-mm at kvack.org [owner-linux-mm at kvack.org] on behalf of Laurent Dufour [ldufour at linux.vnet.ibm.com]
>>>> Sent: Monday, May 28, 2018 4:54 PM
>>>> To: Song, HaiyanX
>>>> Cc: akpm at linux-foundation.org; mhocko at kernel.org; peterz at infradead.org; kirill at shutemov.name; ak at linux.intel.com; dave at stgolabs.net; jack at suse.cz; Matthew Wilcox; khandual at linux.vnet.ibm.com; aneesh.kumar at linux.vnet.ibm.com; benh at kernel.crashing.org; mpe at ellerman.id.au; paulus at samba.org; Thomas Gleixner; Ingo Molnar; hpa at zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work at gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel at vger.kernel.org; linux-mm at kvack.org; haren at linux.vnet.ibm.com; npiggin at gmail.com; bsingharora at gmail.com; paulmck at linux.vnet.ibm.com; Tim Chen; linuxppc-dev at lists.ozlabs.org; x86 at kernel.org
>>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>>
>>>> On 28/05/2018 10:22, Haiyan Song wrote:
>>>>> Hi Laurent,
>>>>>
>>>>> Yes, these tests are done on V9 patch.
>>>>
>>>> Do you plan to give this V11 a run ?
>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Haiyan Song
>>>>>
>>>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>>>>>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>>>>>
>>>>>>> Some regressions and improvements were found by LKP-tools (linux kernel performance) on the V9 patch series,
>>>>>>> tested on the Intel 4s Skylake platform.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for reporting these benchmark results, but you mentioned the "V9 patch
>>>>>> series" while responding to the v11 cover letter...
>>>>>> Were these tests done on v9 or v11?
>>>>>>
>>>>>> Cheers,
>>>>>> Laurent.
>>>>>>
>>>>>>>
>>>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>>>>>> Commit id:
>>>>>>>     base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>>>>>     head commit: 0355322b3577eeab7669066df42c550a56801110
>>>>>>> Benchmark suite: will-it-scale
>>>>>>> Download link:
>>>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>>>>>> Metrics:
>>>>>>>     will-it-scale.per_process_ops=processes/nr_cpu
>>>>>>>     will-it-scale.per_thread_ops=threads/nr_cpu
>>>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>>>>> THP: enable / disable
>>>>>>> nr_task: 100%
>>>>>>>
>>>>>>> 1. Regressions:
>>>>>>> a) THP enabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> page_fault3/ enable THP         10092           -17.5%          8323       will-it-scale.per_thread_ops
>>>>>>> page_fault2/ enable THP          8300           -17.2%          6869       will-it-scale.per_thread_ops
>>>>>>> brk1/ enable THP                  957.67         -7.6%           885       will-it-scale.per_thread_ops
>>>>>>> page_fault3/ enable THP        172821            -5.3%        163692       will-it-scale.per_process_ops
>>>>>>> signal1/ enable THP              9125            -3.2%          8834       will-it-scale.per_process_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> page_fault3/ disable THP        10107           -19.1%          8180       will-it-scale.per_thread_ops
>>>>>>> page_fault2/ disable THP         8432           -17.8%          6931       will-it-scale.per_thread_ops
>>>>>>> context_switch1/ disable THP   215389            -6.8%        200776       will-it-scale.per_thread_ops
>>>>>>> brk1/ disable THP                 939.67         -6.6%           877.33    will-it-scale.per_thread_ops
>>>>>>> page_fault3/ disable THP       173145            -4.7%        165064       will-it-scale.per_process_ops
>>>>>>> signal1/ disable THP             9162            -3.9%          8802       will-it-scale.per_process_ops
>>>>>>>
>>>>>>> 2. Improvements:
>>>>>>> a) THP enabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> malloc1/ enable THP               66.33        +469.8%           383.67    will-it-scale.per_thread_ops
>>>>>>> writeseek3/ enable THP          2531             +4.5%          2646       will-it-scale.per_thread_ops
>>>>>>> signal1/ enable THP              989.33          +2.8%          1016       will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> malloc1/ disable THP              90.33        +417.3%           467.33    will-it-scale.per_thread_ops
>>>>>>> read2/ disable THP             58934            +39.2%         82060       will-it-scale.per_thread_ops
>>>>>>> page_fault1/ disable THP        8607            +36.4%         11736       will-it-scale.per_thread_ops
>>>>>>> read1/ disable THP            314063            +12.7%        353934       will-it-scale.per_thread_ops
>>>>>>> writeseek3/ disable THP         2452            +12.5%          2759       will-it-scale.per_thread_ops
>>>>>>> signal1/ disable THP             971.33          +5.5%          1024       will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> Notes: for above values in column "change", the higher value means that the related testcase result
>>>>>>> on head commit is better than that on base commit for this benchmark.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Haiyan Song
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: owner-linux-mm at kvack.org [owner-linux-mm at kvack.org] on behalf of Laurent Dufour [ldufour at linux.vnet.ibm.com]
>>>>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>>>>> To: akpm at linux-foundation.org; mhocko at kernel.org; peterz at infradead.org; kirill at shutemov.name; ak at linux.intel.com; dave at stgolabs.net; jack at suse.cz; Matthew Wilcox; khandual at linux.vnet.ibm.com; aneesh.kumar at linux.vnet.ibm.com; benh at kernel.crashing.org; mpe at ellerman.id.au; paulus at samba.org; Thomas Gleixner; Ingo Molnar; hpa at zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work at gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>>>>> Cc: linux-kernel at vger.kernel.org; linux-mm at kvack.org; haren at linux.vnet.ibm.com; npiggin at gmail.com; bsingharora at gmail.com; paulmck at linux.vnet.ibm.com; Tim Chen; linuxppc-dev at lists.ozlabs.org; x86 at kernel.org
>>>>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>>>>
>>>>>>> This is a port to kernel 4.17 of the work done by Peter Zijlstra to handle
>>>>>>> page faults without holding the mm semaphore [1].
>>>>>>>
>>>>>>> The idea is to try to handle user space page faults without holding the
>>>>>>> mmap_sem. This should allow better concurrency for massively threaded
>>>>>>> processes, since the page fault handler will not wait for other threads'
>>>>>>> memory layout changes to complete, assuming that those changes are done in
>>>>>>> another part of the process's memory space. This type of page fault is named
>>>>>>> speculative page fault. If the speculative page fault fails, because a
>>>>>>> concurrent change is detected or because the underlying PMD or PTE tables
>>>>>>> are not yet allocated, its processing is aborted and a classic page fault is
>>>>>>> then tried.
>>>>>>>
>>>>>>> The speculative page fault (SPF) has to look for the VMA matching the fault
>>>>>>> address without holding the mmap_sem. This is done by introducing a rwlock
>>>>>>> which protects the access to the mm_rb tree. Previously this was done using
>>>>>>> SRCU, but it introduced a lot of scheduling to process the VMA freeing
>>>>>>> operations, which hit the performance by 20% as reported by
>>>>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree
>>>>>>> limits the locking contention to these operations, which are expected to
>>>>>>> be of O(log n) order. In addition, to ensure that the VMA is not freed
>>>>>>> behind our back, a reference count is added and 2 services (get_vma() and
>>>>>>> put_vma()) are introduced to handle the reference count. Once a VMA is
>>>>>>> fetched from the RB tree using get_vma(), it must later be released using
>>>>>>> put_vma(). I can no longer see the overhead I previously got with the
>>>>>>> will-it-scale benchmark.
>>>>>>>
>>>>>>> The VMA's attributes checked during the speculative page fault processing
>>>>>>> have to be protected against parallel changes. This is done by using a per
>>>>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>>>>> handler to quickly check for parallel changes in progress and to abort the
>>>>>>> speculative page fault in that case.
>>>>>>>
>>>>>>> Once the VMA has been found, the speculative page fault handler checks
>>>>>>> the VMA's attributes to verify whether the page fault can be handled
>>>>>>> this way or not. The VMA is protected through a sequence lock which
>>>>>>> allows fast detection of concurrent VMA changes. If such a change is
>>>>>>> detected, the speculative page fault is aborted and a *classic* page fault
>>>>>>> is tried. VMA sequence locking is added wherever VMA attributes which are
>>>>>>> checked during the page fault are modified.
>>>>>>>
>>>>>>> When the PTE is fetched, the VMA is checked to see if it has been changed.
>>>>>>> So once the page table is locked, the VMA is valid: any other change
>>>>>>> touching this PTE will need to lock the page table, so no parallel change
>>>>>>> is possible at this time.
>>>>>>>
>>>>>>> The locking of the PTE is done with interrupts disabled, which allows
>>>>>>> checking the PMD to ensure that there is no ongoing collapsing
>>>>>>> operation. Since khugepaged first sets the PMD to pmd_none and then
>>>>>>> waits for the other CPUs to have handled the IPI, if the pmd is
>>>>>>> valid at the time the PTE is locked, we have the guarantee that the
>>>>>>> collapsing operation will have to wait on the PTE lock to move forward.
>>>>>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>>>>>> different from the one recorded at the beginning of the SPF operation, the
>>>>>>> classic page fault handler will be called to handle the operation while
>>>>>>> holding the mmap_sem. As the PTE lock is taken with interrupts disabled,
>>>>>>> it is taken using spin_trylock() to avoid a deadlock when handling a
>>>>>>> page fault while a TLB invalidate is requested by another CPU holding the
>>>>>>> PTE lock.
>>>>>>>
>>>>>>> In pseudo code, this could be seen as:
>>>>>>>     speculative_page_fault()
>>>>>>>     {
>>>>>>>             vma = get_vma()
>>>>>>>             check vma sequence count
>>>>>>>             check vma's support
>>>>>>>             disable interrupt
>>>>>>>                   check pgd,p4d,...,pte
>>>>>>>                   save pmd and pte in vmf
>>>>>>>                   save vma sequence counter in vmf
>>>>>>>             enable interrupt
>>>>>>>             check vma sequence count
>>>>>>>             handle_pte_fault(vma)
>>>>>>>                     ..
>>>>>>>                     page = alloc_page()
>>>>>>>                     pte_map_lock()
>>>>>>>                             disable interrupt
>>>>>>>                                     abort if sequence counter has changed
>>>>>>>                                     abort if pmd or pte has changed
>>>>>>>                                     pte map and lock
>>>>>>>                             enable interrupt
>>>>>>>                     if abort
>>>>>>>                        free page
>>>>>>>                        abort
>>>>>>>                     ...
>>>>>>>     }
>>>>>>>
>>>>>>>     arch_fault_handler()
>>>>>>>     {
>>>>>>>             if (speculative_page_fault(&vma))
>>>>>>>                goto done
>>>>>>>     again:
>>>>>>>             lock(mmap_sem)
>>>>>>>             vma = find_vma();
>>>>>>>             handle_pte_fault(vma);
>>>>>>>             if retry
>>>>>>>                unlock(mmap_sem)
>>>>>>>                goto again;
>>>>>>>     done:
>>>>>>>             handle fault error
>>>>>>>     }
>>>>>>>
>>>>>>> THP is not supported because, when checking the PMD, we can be
>>>>>>> confused by an in-progress collapsing operation done by khugepaged. The
>>>>>>> issue is that pmd_none() could be true either if the PMD is not yet
>>>>>>> populated or if the underlying PTEs are about to be collapsed. So we
>>>>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>>>>
>>>>>>> This series adds a new software performance event named 'speculative-faults'
>>>>>>> or 'spf'. It counts the number of page fault events successfully handled
>>>>>>> speculatively. When recording 'faults,spf' events, 'faults' counts
>>>>>>> the total number of page fault events while 'spf' only counts
>>>>>>> the part of the faults processed speculatively.
>>>>>>>
>>>>>>> There are some trace events introduced by this series. They allow
>>>>>>> identifying why page faults were not processed speculatively. This
>>>>>>> doesn't take into account the faults generated by a monothreaded process,
>>>>>>> which are directly processed while holding the mmap_sem. These trace events
>>>>>>> are grouped in a system named 'pagefault'. They are:
>>>>>>>  - pagefault:spf_vma_changed : the VMA has been changed behind our back
>>>>>>>  - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set
>>>>>>>  - pagefault:spf_vma_notsup : the VMA's type is not supported
>>>>>>>  - pagefault:spf_vma_access : the VMA's access rights are not respected
>>>>>>>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind
>>>>>>>    our back
>>>>>>>
>>>>>>> To record all the related events, the easiest way is to run perf with the
>>>>>>> following arguments:
>>>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>>>>
>>>>>>> There is also a dedicated vmstat counter showing the number of page faults
>>>>>>> successfully handled speculatively. It can be seen this way:
>>>>>>> $ grep speculative_pgfault /proc/vmstat
>>>>>>>
>>>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>>>>> on x86, PowerPC and arm64.
>>>>>>>
>>>>>>> ---------------------
>>>>>>> Real Workload results
>>>>>>>
>>>>>>> As mentioned in a previous email, we did unofficial runs using a "popular
>>>>>>> in memory multithreaded database product" on a 176 cores SMT8 Power system,
>>>>>>> which showed a 30% improvement in the number of transactions processed per
>>>>>>> second. This run was done on the v6 series, but the changes introduced in
>>>>>>> this new version should not impact the performance boost seen.
>>>>>>>
>>>>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>>>>> series:
>>>>>>>                 vanilla         spf
>>>>>>> faults          89.418          101.364         +13%
>>>>>>> spf                n/a           97.989
>>>>>>>
>>>>>>> With the SPF kernel, most of the page faults were processed in a speculative
>>>>>>> way.
>>>>>>>
>>>>>>> Ganesh Mahendran backported the series on top of a 4.9 kernel and gave
>>>>>>> it a try on an Android device. He reported that the application launch time
>>>>>>> was improved on average by 6%, and for large applications (~100 threads) by
>>>>>>> 20%.
>>>>>>>
>>>>>>> Here are the launch times Ganesh measured on Android 8.0 on top of a Qcom
>>>>>>> MSM845 (8 cores) with 6GB (lower is better):
>>>>>>>
>>>>>>> Application                             4.9     4.9+spf delta
>>>>>>> com.tencent.mm                          416     389     -7%
>>>>>>> com.eg.android.AlipayGphone             1135    986     -13%
>>>>>>> com.tencent.mtt                         455     454     0%
>>>>>>> com.qqgame.hlddz                        1497    1409    -6%
>>>>>>> com.autonavi.minimap                    711     701     -1%
>>>>>>> com.tencent.tmgp.sgame                  788     748     -5%
>>>>>>> com.immomo.momo                         501     487     -3%
>>>>>>> com.tencent.peng                        2145    2112    -2%
>>>>>>> com.smile.gifmaker                      491     461     -6%
>>>>>>> com.baidu.BaiduMap                      479     366     -23%
>>>>>>> com.taobao.taobao                       1341    1198    -11%
>>>>>>> com.baidu.searchbox                     333     314     -6%
>>>>>>> com.tencent.mobileqq                    394     384     -3%
>>>>>>> com.sina.weibo                          907     906     0%
>>>>>>> com.youku.phone                         816     731     -11%
>>>>>>> com.happyelements.AndroidAnimal.qq      763     717     -6%
>>>>>>> com.UCMobile                            415     411     -1%
>>>>>>> com.tencent.tmgp.ak                     1464    1431    -2%
>>>>>>> com.tencent.qqmusic                     336     329     -2%
>>>>>>> com.sankuai.meituan                     1661    1302    -22%
>>>>>>> com.netease.cloudmusic                  1193    1200    1%
>>>>>>> air.tv.douyu.android                    4257    4152    -2%
>>>>>>>
>>>>>>> ------------------
>>>>>>> Benchmarks results
>>>>>>>
>>>>>>> Base kernel is v4.17.0-rc4-mm1
>>>>>>> SPF is BASE + this series
>>>>>>>
>>>>>>> Kernbench:
>>>>>>> ----------
>>>>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15
>>>>>>> kernel (the kernel is built 5 times):
>>>>>>>
>>>>>>> Average Half load -j 8
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     1448.65 (5.72312)      1455.84 (4.84951)       0.50%
>>>>>>> User    Time     10135.4 (30.3699)      10148.8 (31.1252)       0.13%
>>>>>>> System  Time     900.47  (2.81131)      923.28  (7.52779)       2.53%
>>>>>>> Percent CPU      761.4   (1.14018)      760.2   (0.447214)      -0.16%
>>>>>>> Context Switches 85380   (3419.52)      84748   (1904.44)       -0.74%
>>>>>>> Sleeps           105064  (1240.96)      105074  (337.612)       0.01%
>>>>>>>
>>>>>>> Average Optimal load -j 16
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     920.528 (10.1212)      927.404 (8.91789)       0.75%
>>>>>>> User    Time     11064.8 (981.142)      11085   (990.897)       0.18%
>>>>>>> System  Time     979.904 (84.0615)      1001.14 (82.5523)       2.17%
>>>>>>> Percent CPU      1089.5  (345.894)      1086.1  (343.545)       -0.31%
>>>>>>> Context Switches 159488  (78156.4)      158223  (77472.1)       -0.79%
>>>>>>> Sleeps           110566  (5877.49)      110388  (5617.75)       -0.16%
>>>>>>>
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>>  Performance counter stats for '../kernbench -M':
>>>>>>>          526743764      faults
>>>>>>>                210      spf
>>>>>>>                  3      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>               2278      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Very few speculative page faults were recorded, as most of the processes
>>>>>>> involved are monothreaded (it seems that on this architecture some threads
>>>>>>> were created during the kernel build processing).
>>>>>>>
>>>>>>> Here are the kernbench results on an 80 CPUs Power8 system:
>>>>>>>
>>>>>>> Average Half load -j 40
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     117.152 (0.774642)     117.166 (0.476057)      0.01%
>>>>>>> User    Time     4478.52 (24.7688)      4479.76 (9.08555)       0.03%
>>>>>>> System  Time     131.104 (0.720056)     134.04  (0.708414)      2.24%
>>>>>>> Percent CPU      3934    (19.7104)      3937.2  (19.0184)       0.08%
>>>>>>> Context Switches 92125.4 (576.787)      92581.6 (198.622)       0.50%
>>>>>>> Sleeps           317923  (652.499)      318469  (1255.59)       0.17%
>>>>>>>
>>>>>>> Average Optimal load -j 80
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     107.73  (0.632416)     107.31  (0.584936)      -0.39%
>>>>>>> User    Time     5869.86 (1466.72)      5871.71 (1467.27)       0.03%
>>>>>>> System  Time     153.728 (23.8573)      157.153 (24.3704)       2.23%
>>>>>>> Percent CPU      5418.6  (1565.17)      5436.7  (1580.91)       0.33%
>>>>>>> Context Switches 223861  (138865)       225032  (139632)        0.52%
>>>>>>> Sleeps           330529  (13495.1)      332001  (14746.2)       0.45%
>>>>>>>
>>>>>>> During a run on the SPF kernel, the following perf events were captured:
>>>>>>>  Performance counter stats for '../kernbench -M':
>>>>>>>          116730856      faults
>>>>>>>                  0      spf
>>>>>>>                  3      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                476      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Most of the processes involved are single-threaded, so SPF is not activated,
>>>>>>> but there is no impact on performance.
>>>>>>>
>>>>>>> Ebizzy:
>>>>>>> -------
>>>>>>> The test counts the number of records per second it can process; higher is
>>>>>>> better. I ran it as 'ebizzy -mTt <nrcpus>'. To get consistent results, I
>>>>>>> repeated the test 100 times and took the average.
>>>>>>>
>>>>>>>                 BASE            SPF             delta
>>>>>>> 16 CPUs x86 VM  742.57          1490.24         100.69%
>>>>>>> 80 CPUs P8 node 13105.4         24174.23        84.46%
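
For reference, the delta column is the relative change of the SPF value
against BASE. A minimal Python sketch (the helper name delta_pct is only
for illustration, it is not part of the series) that recomputes both rows
from the figures quoted above:

```python
def delta_pct(base, spf):
    """Relative change of spf against base, in percent."""
    return (spf - base) / base * 100.0

# Values copied from the ebizzy table above.
rows = {
    "16 CPUs x86 VM": (742.57, 1490.24),
    "80 CPUs P8 node": (13105.4, 24174.23),
}

deltas = {name: round(delta_pct(base, spf), 2)
          for name, (base, spf) in rows.items()}

for name, delta in deltas.items():
    print(f"{name:16s} {delta:.2f}%")
```

Both values match the table once rounded to two decimals.
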
>>>>>>>
>>>>>>> Here are the performance counters read during a run on a 16-CPU x86 VM:
>>>>>>>  Performance counter stats for './ebizzy -mTt 16':
>>>>>>>            1706379      faults
>>>>>>>            1674599      spf
>>>>>>>              30588      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                363      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> And the ones captured during a run on an 80-CPU Power node:
>>>>>>>  Performance counter stats for './ebizzy -mTt 80':
>>>>>>>            1874773      faults
>>>>>>>            1461153      spf
>>>>>>>             413293      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                200      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> In ebizzy's case most of the page faults were handled speculatively, which
>>>>>>> explains the ebizzy performance boost.
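
To put a number on "most", the share of faults taking the speculative path
can be estimated by dividing the spf count by the faults count. A small
Python sketch, assuming the spf event counts faults that were handled
speculatively (the function name spf_share is hypothetical):

```python
def spf_share(faults, spf):
    """Percentage of page faults handled speculatively."""
    return spf / faults * 100.0

# Counters copied from the two ebizzy perf runs quoted above.
x86_vm = spf_share(faults=1706379, spf=1674599)
power = spf_share(faults=1874773, spf=1461153)

print(f"16 CPUs x86 VM : {x86_vm:.2f}%")
print(f"80 CPUs Power  : {power:.2f}%")
```

That gives roughly 98% on the x86 VM and 78% on the Power node. The lower
share on Power goes with its much higher spf_vma_changed count.
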
>>>>>>>
>>>>>>> ------------------
>>>>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>>>>  - Accounted for all review feedback from Punit Agrawal, Ganesh Mahendran
>>>>>>>    and Minchan Kim, hopefully.
>>>>>>>  - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>>>>    __do_page_fault().
>>>>>>>  - Loop in pte_spinlock() and pte_map_lock() when the pte trylock fails
>>>>>>>    instead of aborting the speculative page fault handling. Drop the now
>>>>>>>    useless trace event pagefault:spf_pte_lock.
>>>>>>>  - No longer try to reuse the fetched VMA during the speculative page fault
>>>>>>>    handling when a retry is needed. This added a lot of complexity, and
>>>>>>>    additional tests didn't show a significant performance improvement.
>>>>>>>  - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
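
The pte lock change in the list above (loop on a failed trylock instead of
aborting) can be pictured with a userspace analogy. This is only a sketch
in Python, not the kernel code: lock_or_abort and still_valid are
hypothetical names, and the still_valid callback stands in for whatever
re-validation the handler does between attempts, which the changelog does
not detail.

```python
import threading

def lock_or_abort(lock, still_valid):
    """Spin on a non-blocking acquire.

    Returns True with the lock held, or False (lock not held) as soon as
    still_valid() reports that the speculation must be abandoned.
    """
    while True:
        if not still_valid():
            return False
        if lock.acquire(blocking=False):
            return True
        # Lock is busy: loop and try again instead of aborting.

ptl = threading.Lock()                      # stands in for the page table lock
ptl.acquire()                               # simulate a concurrent holder
threading.Timer(0.05, ptl.release).start()  # the holder releases a bit later

got_lock = lock_or_abort(ptl, lambda: True)   # spins until the lock is free
print("lock taken after contention:", got_lock)
ptl.release()

aborted = lock_or_abort(ptl, lambda: False)   # validity check fails first
print("gave up without taking the lock:", not aborted)
```

The point of the sketch is that contention on the lock alone no longer ends
the speculative attempt; only a failed validity check does.
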
>>>>>>>
>>>>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>>>>
>>>>>>>
>>>>>>> Laurent Dufour (20):
>>>>>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>>>>   mm: make pte_unmap_same compatible with SPF
>>>>>>>   mm: introduce INIT_VMA()
>>>>>>>   mm: protect VMA modifications using VMA sequence count
>>>>>>>   mm: protect mremap() against SPF handler
>>>>>>>   mm: protect SPF handler against anon_vma changes
>>>>>>>   mm: cache some VMA fields in the vm_fault structure
>>>>>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>>>>   mm: introduce __lru_cache_add_active_or_unevictable
>>>>>>>   mm: introduce __vm_normal_page()
>>>>>>>   mm: introduce __page_add_new_anon_rmap()
>>>>>>>   mm: protect mm_rb tree with a rwlock
>>>>>>>   mm: adding speculative page fault failure trace events
>>>>>>>   perf: add a speculative page fault sw event
>>>>>>>   perf tools: add support for the SPF perf event
>>>>>>>   mm: add speculative page fault vmstats
>>>>>>>   powerpc/mm: add speculative page fault
>>>>>>>
>>>>>>> Mahendran Ganesh (2):
>>>>>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   arm64/mm: add speculative page fault
>>>>>>>
>>>>>>> Peter Zijlstra (4):
>>>>>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>>>>   mm: VMA sequence count
>>>>>>>   mm: provide speculative fault infrastructure
>>>>>>>   x86/mm: add speculative pagefault handling
>>>>>>>
>>>>>>>  arch/arm64/Kconfig                    |   1 +
>>>>>>>  arch/arm64/mm/fault.c                 |  12 +
>>>>>>>  arch/powerpc/Kconfig                  |   1 +
>>>>>>>  arch/powerpc/mm/fault.c               |  16 +
>>>>>>>  arch/x86/Kconfig                      |   1 +
>>>>>>>  arch/x86/mm/fault.c                   |  27 +-
>>>>>>>  fs/exec.c                             |   2 +-
>>>>>>>  fs/proc/task_mmu.c                    |   5 +-
>>>>>>>  fs/userfaultfd.c                      |  17 +-
>>>>>>>  include/linux/hugetlb_inline.h        |   2 +-
>>>>>>>  include/linux/migrate.h               |   4 +-
>>>>>>>  include/linux/mm.h                    | 136 +++++++-
>>>>>>>  include/linux/mm_types.h              |   7 +
>>>>>>>  include/linux/pagemap.h               |   4 +-
>>>>>>>  include/linux/rmap.h                  |  12 +-
>>>>>>>  include/linux/swap.h                  |  10 +-
>>>>>>>  include/linux/vm_event_item.h         |   3 +
>>>>>>>  include/trace/events/pagefault.h      |  80 +++++
>>>>>>>  include/uapi/linux/perf_event.h       |   1 +
>>>>>>>  kernel/fork.c                         |   5 +-
>>>>>>>  mm/Kconfig                            |  22 ++
>>>>>>>  mm/huge_memory.c                      |   6 +-
>>>>>>>  mm/hugetlb.c                          |   2 +
>>>>>>>  mm/init-mm.c                          |   3 +
>>>>>>>  mm/internal.h                         |  20 ++
>>>>>>>  mm/khugepaged.c                       |   5 +
>>>>>>>  mm/madvise.c                          |   6 +-
>>>>>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>>>>>>>  mm/mempolicy.c                        |  51 ++-
>>>>>>>  mm/migrate.c                          |   6 +-
>>>>>>>  mm/mlock.c                            |  13 +-
>>>>>>>  mm/mmap.c                             | 229 ++++++++++---
>>>>>>>  mm/mprotect.c                         |   4 +-
>>>>>>>  mm/mremap.c                           |  13 +
>>>>>>>  mm/nommu.c                            |   2 +-
>>>>>>>  mm/rmap.c                             |   5 +-
>>>>>>>  mm/swap.c                             |   6 +-
>>>>>>>  mm/swap_state.c                       |   8 +-
>>>>>>>  mm/vmstat.c                           |   5 +-
>>>>>>>  tools/include/uapi/linux/perf_event.h |   1 +
>>>>>>>  tools/perf/util/evsel.c               |   1 +
>>>>>>>  tools/perf/util/parse-events.c        |   4 +
>>>>>>>  tools/perf/util/parse-events.l        |   1 +
>>>>>>>  tools/perf/util/python.c              |   1 +
>>>>>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>>>>  create mode 100644 include/trace/events/pagefault.h
>>>>>>>
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>>