[PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()

Laurent Dufour ldufour at linux.ibm.com
Wed Apr 24 20:33:11 AEST 2019


On 22/04/2019 at 22:32, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
>> Vinayak Menon and Ganesh Mahendran reported that the following scenario
>> may lead to a thread being blocked due to data corruption:
>>
>>      CPU 1                   CPU 2                    CPU 3
>>      Process 1,              Process 1,               Process 1,
>>      Thread A                Thread B                 Thread C
>>
>>      while (1) {             while (1) {              while(1) {
>>      pthread_mutex_lock(l)   pthread_mutex_lock(l)    fork
>>      pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
>>      }                       }
>>
>> In detail, this happens because:
>>
>>      CPU 1                CPU 2                       CPU 3
>>      fork()
>>      copy_pte_range()
>>        set PTE rdonly
>>      go to next VMA...
>>       .                   PTE is seen rdonly          PTE still writable
>>       .                   thread is writing to page
>>       .                   -> page fault
>>       .                     copy the page             Thread writes to page
>>       .                      .                        -> no page fault
>>       .                     update the PTE
>>       .                     flush TLB for that PTE
>>     flush TLB                                        PTEs are now rdonly
> 
> Should the fork be on CPU 3 to be consistent with the scenario above (just
> to make it easier to read and to go from one to the other, as threads can
> move from one CPU to another)?

Sure, it is quite confusing this way ;)
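Something like this, with the fork() moved to CPU 3 to match the scenario
above, should read better:

     CPU 1                    CPU 2                      CPU 3
                                                         fork()
                                                         copy_pte_range()
                                                           set PTE rdonly
                                                         go to next VMA...
     PTE still writable       PTE is seen rdonly          .
                              thread is writing to page   .
                              -> page fault               .
     Thread writes to page      copy the page             .
     -> no page fault            .                        .
                                update the PTE            .
                                flush TLB for that PTE    .
     PTEs are now rdonly                                 flush TLB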

>>
>> So the write done by CPU 3 interferes with the page copy operation done
>> by CPU 2, leading to data corruption.
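(For reference, a minimal userspace reproducer along the lines of the
scenario above could look like the sketch below. Names and structure are
illustrative; this is not necessarily the exact test that was run.)

	#include <pthread.h>
	#include <sys/wait.h>
	#include <unistd.h>

	static pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;

	/* Threads A and B: hammer a mutex whose futex word lives in a
	 * COW-shared anonymous page. */
	static void *locker(void *arg)
	{
		for (;;) {
			pthread_mutex_lock(&l);
			pthread_mutex_unlock(&l);
		}
		return NULL;
	}

	/* Thread C: fork in a loop so copy_page_range() keeps making the
	 * parent's PTEs read-only. */
	static void *forker(void *arg)
	{
		for (;;) {
			pid_t pid = fork();

			if (pid == 0)
				_exit(0);
			if (pid > 0)
				waitpid(pid, NULL, 0);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b, c;

		pthread_create(&a, NULL, locker, NULL);
		pthread_create(&b, NULL, locker, NULL);
		pthread_create(&c, NULL, forker, NULL);
		pthread_join(a, NULL);	/* never returns */
		return 0;
	}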
>>
>> To avoid this we mark all the VMAs involved in the COW mechanism as
>> changing by calling vm_write_begin(). This ensures that the speculative
>> page fault handler will not try to handle a fault on these pages.
>> The marker is kept set until the TLB is flushed, ensuring that all the
>> CPUs will see the PTEs as not writable.
>> Once the TLB is flushed, the marker is removed by calling vm_write_end().
>>
>> The variable 'last' is used to keep track of the latest VMA marked, so
>> that the error path can handle the case where only part of the VMAs have
>> been marked.
>>
>> Since multiple VMAs from the same mm may have their sequence count
>> increased during this process, vm_raw_write_begin/end() must be used to
>> avoid false lockdep warnings.
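(For readers not following the whole series: the "marker" is the per-VMA
sequence count introduced earlier in the series. A simplified sketch of
the raw variants used here, assuming the vm_sequence seqcount field this
series adds to struct vm_area_struct:

	static inline void vm_raw_write_begin(struct vm_area_struct *vma)
	{
		/* Raw variant: no lockdep bookkeeping, so several VMAs of
		 * the same mm can be marked at the same time. */
		raw_write_seqcount_begin(&vma->vm_sequence);
	}

	static inline void vm_raw_write_end(struct vm_area_struct *vma)
	{
		raw_write_seqcount_end(&vma->vm_sequence);
	}

The speculative page fault handler samples vm_sequence before walking the
page tables and falls back to the regular, mmap_sem-protected path if the
count is odd or has changed, which is what makes the marking effective.)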
>>
>> Reported-by: Ganesh Mahendran <opensource.ganesh at gmail.com>
>> Reported-by: Vinayak Menon <vinmenon at codeaurora.org>
>> Signed-off-by: Laurent Dufour <ldufour at linux.ibm.com>
> 
> A minor comment (see below)
> 
> Reviewed-by: Jérôme Glisse <jglisse at redhat.com>

Thanks for the review Jérôme.

>> ---
>>   kernel/fork.c | 30 ++++++++++++++++++++++++++++--
>>   1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f8dae021c2e5..2992d2c95256 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>>   static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   					struct mm_struct *oldmm)
>>   {
>> -	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>> +	struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>>   	struct rb_node **rb_link, *rb_parent;
>>   	int retval;
>>   	unsigned long charge;
>> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   		rb_parent = &tmp->vm_rb;
>>   
>>   		mm->map_count++;
>> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> +		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
>> +			if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
>> +				/*
>> +				 * Mark this VMA as changing to prevent the
>> +				 * speculative page fault handler from processing
>> +				 * it until the TLB is flushed below.
>> +				 */
>> +				last = mpnt;
>> +				vm_raw_write_begin(mpnt);
>> +			}
>>   			retval = copy_page_range(mm, oldmm, mpnt);
>> +		}
>>   
>>   		if (tmp->vm_ops && tmp->vm_ops->open)
>>   			tmp->vm_ops->open(tmp);
>> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   out:
>>   	up_write(&mm->mmap_sem);
>>   	flush_tlb_mm(oldmm);
>> +
>> +	if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
> 
> You do not need to check CONFIG_SPECULATIVE_PAGE_FAULT, as 'last' will
> always be NULL when it is not enabled; but maybe the compiler will miss
> the optimization opportunity if you only have the for() loop below.

I didn't check the generated code; perhaps the compiler will optimize
that correctly.
That being said, I think the if block is better for code readability,
highlighting that this block is only needed in the SPF case.
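A reduced illustration of the point (flag checks elided, purely to show
the dead-code-elimination argument): when CONFIG_SPECULATIVE_PAGE_FAULT
is off, IS_ENABLED() folds to a constant 0 and the whole block is
compiled out, whereas without the check the compiler has to prove that
'last' stayed NULL in order to drop the loop.

	if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
		/* Statically dead when SPF is disabled. */
		for (; last; last = last->vm_prev)
			vm_raw_write_end(last);
	}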

>> +		/*
>> +		 * Since the TLB has been flushed, we can safely unmark the
>> +		 * copied VMAs and allow the speculative page fault handler
>> +		 * to process them again.
>> +		 * Walk back the VMA list from the last marked VMA.
>> +		 */
>> +		for (; last; last = last->vm_prev) {
>> +			if (last->vm_flags & VM_DONTCOPY)
>> +				continue;
>> +			if (!(last->vm_flags & VM_WIPEONFORK))
>> +				vm_raw_write_end(last);
>> +		}
>> +	}
>> +
>>   	up_write(&oldmm->mmap_sem);
>>   	dup_userfaultfd_complete(&uf);
>>   fail_uprobe_end:
>> -- 
>> 2.21.0
>>
> 


