[PATCH/RFC] mm: add and use batched version of __tlb_remove_table()

Nikita Yushchenko nikita.yushchenko at virtuozzo.com
Sun Dec 19 01:31:43 AEDT 2021


>> This allows archs to optimize it, by
>> freeing multiple tables in a single release_pages() call. This is
>> faster than individual put_page() calls, especially with memcg
>> accounting enabled.
> 
> Could we quantify "faster"?  There's a non-trivial amount of code being
> added here and it would be nice to back it up with some cold-hard numbers.

I don't currently have numbers for this patch taken alone. It originates from work done some 
years ago to reduce the cost of memory accounting, and an x86-only version of it has been in the 
virtuozzo/openvz kernel since then. Other patches from that work have been upstreamed, but this one was 
missed.

Still, it's obvious that release_pages() should be faster than a loop calling put_page() - isn't that 
exactly the reason why release_pages() exists and is different from a loop calling put_page()?
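
To make the argument concrete, here is a toy userspace model, not kernel code. All names are 
hypothetical, and "expensive op" is only a stand-in for whatever per-call cost (locking, accounting) 
gets amortized by batching. It shows the shape of the saving, not its size:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page: only a reference count. */
struct toy_page {
	int refcount;
};

/* Counts how often the (hypothetical) expensive shared step runs. */
static unsigned long toy_expensive_ops;

/* Per-page release: pays the expensive step for every page it frees. */
static void toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		toy_expensive_ops++;
}

/* Batched release: drops all references first, then pays the expensive
 * step once for the whole batch if at least one page was freed. */
static void toy_release_pages(struct toy_page **pages, size_t nr)
{
	size_t i, freed = 0;

	for (i = 0; i < nr; i++) {
		assert(pages[i]->refcount > 0);
		if (--pages[i]->refcount == 0)
			freed++;
	}
	if (freed)
		toy_expensive_ops++;
}
```

In this model, freeing N pages one by one costs N expensive steps, while one batched call costs one. 
How much of that carries over to the real release_pages() still has to be measured.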

>>   static void __tlb_remove_table_free(struct mmu_table_batch *batch)
>>   {
>> -	int i;
>> -
>> -	for (i = 0; i < batch->nr; i++)
>> -		__tlb_remove_table(batch->tables[i]);
>> -
>> +	__tlb_remove_tables(batch->tables, batch->nr);
>>   	free_page((unsigned long)batch);
>>   }
> 
> This leaves a single call-site for __tlb_remove_table():
> 
>> static void tlb_remove_table_one(void *table)
>> {
>>          tlb_remove_table_sync_one();
>>          __tlb_remove_table(table);
>> }
> 
> Is that worth it, or could it just be:
> 
> 	__tlb_remove_tables(&table, 1);

I considered that while preparing the patch, but it resulted in an even larger change to the 
archs, because the non-batched call had to be removed, so I decided not to go that way.

Also, Peter's suggestion to move the free_page_and_swap_cache()-based implementation of 
__tlb_remove_table() into mm/mmu_gather.c under an ifdef, and then do the optimization locally in 
mm/mmu_gather.c, looks better.
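
A rough userspace sketch of that structure, under stated assumptions: TOY_GENERIC_TABLE_FREE is a 
made-up switch, every toy_* name is hypothetical, and the counters only stand in for the real freeing 
work. The point is that the single-table path can reuse the batched helper with nr == 1, so archs that 
select the generic variant need no per-arch batched function:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch: arch opts in to the generic table-freeing code. */
#define TOY_GENERIC_TABLE_FREE 1

static unsigned long toy_batch_calls;   /* batched-helper invocations */
static unsigned long toy_single_calls;  /* per-table arch-hook invocations */
static unsigned long toy_tables_freed;  /* tables released in total */

#ifdef TOY_GENERIC_TABLE_FREE
/* Generic variant: the whole batch is released in one call. */
static void toy_free_tables(void **tables, size_t nr)
{
	assert(tables != NULL || nr == 0);
	toy_batch_calls++;
	toy_tables_freed += nr;
}
#else
/* Arch-specific variant: fall back to one call per table. */
static void toy_arch_free_table(void *table)
{
	(void)table;
	toy_single_calls++;
	toy_tables_freed++;
}

static void toy_free_tables(void **tables, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		toy_arch_free_table(tables[i]);
}
#endif

/* The single-table path is just a batch of one. */
static void toy_remove_table_one(void *table)
{
	toy_free_tables(&table, 1);
}
```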

>> +void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
>> +{
>> +	__free_pages_and_swap_cache(pages, nr, false);
>>   }
> 
> This went unmentioned in the changelog.  But, it seems like there's a
> specific optimization here.  In the existing code,
> free_pages_and_swap_cache() is wasteful if no page in pages[] is on the
> LRU.  It doesn't need the lru_add_drain().

This is a somewhat different topic.

Within the scope of this patch, the _nolru version was added because there was no lru draining in the 
looped call to __tlb_remove_table(). Adding draining to the batched version would not break anything, 
but it would add overhead that was not there before, which is in direct conflict with the original goal.

If the version that drains the lru is indeed not needed, it can be cleaned up in a separate 
patchset.
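
The split between the two entry points can be modelled in userspace as below. This is only a sketch: 
all fake_* names are hypothetical, and the drain counter stands in for lru_add_drain(). It mirrors the 
shape of the patch (one common helper, a bool selecting whether to drain), not its actual body:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page: only a reference count. */
struct fake_page {
	int refcount;
};

/* Counts calls to the stand-in for lru_add_drain(). */
static unsigned long fake_lru_drains;

static void fake_lru_add_drain(void)
{
	fake_lru_drains++;
}

/* Common helper: drains only when the caller asks for it, then drops
 * one reference per page. Returns how many pages hit refcount zero. */
static size_t fake_free_pages_common(struct fake_page **pages, size_t nr,
				     bool do_lru)
{
	size_t i, freed = 0;

	if (do_lru)
		fake_lru_add_drain();

	for (i = 0; i < nr; i++) {
		assert(pages[i]->refcount > 0);
		if (--pages[i]->refcount == 0)
			freed++;
	}
	return freed;
}

/* Existing behaviour: always drain first. */
static size_t fake_free_pages(struct fake_page **pages, size_t nr)
{
	return fake_free_pages_common(pages, nr, true);
}

/* Variant for callers whose pages are never on the LRU: no drain. */
static size_t fake_free_pages_nolru(struct fake_page **pages, size_t nr)
{
	return fake_free_pages_common(pages, nr, false);
}
```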

> 		if (!do_lru)
> 			VM_WARN_ON_ONCE_PAGE(PageLRU(pagep[i]),
> 					     pagep[i]);
> 		free_swap_cache(...);

This looks like a good safety measure, will add it.
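
In toy userspace form the check would look roughly like this. The real code would use 
VM_WARN_ON_ONCE_PAGE() and PageLRU() as quoted above; every mock_* name here is hypothetical, and the 
bool fields only imitate page state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for a page: whether it is on the LRU, and whether it
 * has been handed to the free path. */
struct mock_page {
	bool on_lru;
	bool freed;
};

/* Number of warnings actually emitted. */
static unsigned long mock_warnings;

/* Warn the first time cond is true, stay silent afterwards. */
static bool mock_warn_on_once(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		mock_warnings++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Free path: on the no-drain variant, flag any page that is
 * unexpectedly on the LRU, then free it anyway. */
static void mock_free_pages(struct mock_page **pages, size_t nr, bool do_lru)
{
	size_t i;

	assert(pages != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (!do_lru)
			mock_warn_on_once(pages[i]->on_lru,
					  "LRU page passed to nolru free path");
		pages[i]->freed = true;
	}
}
```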

> But, even more than that, do all the architectures even need the
> free_swap_cache()?

I was under the impression that process page tables are a valid target for swapping out, although I 
may be wrong here.

Nikita


More information about the Linuxppc-dev mailing list