[PATCH] erofs: add a global page pool for lz4 decompression

Gao Xiang hsiangkao at linux.alibaba.com
Sun Dec 31 12:09:15 AEDT 2023



On 2023/12/29 12:48, Chunhai Guo wrote:
>> Hi Chunhai,
>>
>> On 2023/12/28 21:00, Chunhai Guo wrote:
>>> Using a global page pool for LZ4 decompression significantly reduces
>>> the time spent on page allocation in low memory scenarios.
>>>
>>> The table below shows the reduction in time spent on page allocation
>>> for LZ4 decompression when using a global page pool.
>>> The results were obtained from multi-app launch benchmarks on ARM64
>>> Android devices running the 5.15 kernel.
>>> +--------------+---------------+--------------+---------+
>>> |              | w/o page pool | w/ page pool |  diff   |
>>> +--------------+---------------+--------------+---------+
>>> | Average (ms) |     3434      |      21      | -99.38% |
>>> +--------------+---------------+--------------+---------+
>>>
>>> Based on the benchmark logs, it appears that 256 pages are sufficient
>>> for most cases, but this can be adjusted as needed. Additionally,
>>> turning on CONFIG_EROFS_FS_DEBUG will simplify the tuning process.
>>
>> Thanks for the patch. I have some questions:
>>    - what pcluster sizes are you using? 4k or more?
> We currently use a 4k pcluster size.
> 
>>    - what detailed configuration are you using for the multi-app
>>      launch workload? Such as CPU / Memory / the number of apps.
> 
> We ran the benchmark on Android devices with the following configuration.
> In the benchmark, we launched 16 frequently-used apps, and the camera app
> was the last one in each round. The results in the table above were
> obtained from the launching process of the camera app.
> 	CPU: 8 cores
> 	Memory: 8GB

Is it the accumulated time of the camera app across all rounds, or the average
time of the camera app for each round?

> 
>>>
>>> This patch currently only supports the LZ4 decompressor, other
>>> decompressors will be supported in the next step.
>>>
>>> Signed-off-by: Chunhai Guo <guochunhai at vivo.com>
>>> ---
>>>    fs/erofs/compress.h     |   1 +
>>>    fs/erofs/decompressor.c |  42 ++++++++++++--
>>>    fs/erofs/internal.h     |   5 ++
>>>    fs/erofs/super.c        |   1 +
>>>    fs/erofs/utils.c        | 121 ++++++++++++++++++++++++++++++++++++++++
>>>    5 files changed, 165 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/erofs/compress.h b/fs/erofs/compress.h
>>> index 279933e007d2..67202b97d47b 100644
>>> --- a/fs/erofs/compress.h
>>> +++ b/fs/erofs/compress.h
>>> @@ -31,6 +31,7 @@ struct z_erofs_decompressor {
>>>    /* some special page->private (unsigned long, see below) */
>>>    #define Z_EROFS_SHORTLIVED_PAGE             (-1UL << 2)
>>>    #define Z_EROFS_PREALLOCATED_PAGE   (-2UL << 2)
>>> +#define Z_EROFS_POOL_PAGE            (-3UL << 2)
>>>
>>>    /*
>>>     * For all pages in a pcluster, page->private should be one of
>>> diff --git a/fs/erofs/decompressor.c b/fs/erofs/decompressor.c
>>> index d08a6ee23ac5..41b34f01416f 100644
>>> --- a/fs/erofs/decompressor.c
>>> +++ b/fs/erofs/decompressor.c
>>> @@ -54,6 +54,7 @@ static int z_erofs_load_lz4_config(struct super_block *sb,
>>>        sbi->lz4.max_distance_pages = distance ?
>>>                                        DIV_ROUND_UP(distance, PAGE_SIZE) + 1 :
>>>                                        LZ4_MAX_DISTANCE_PAGES;
>>> +     erofs_global_page_pool_init();
>>>        return erofs_pcpubuf_growsize(sbi->lz4.max_pclusterblks);
>>>    }
>>>
>>> @@ -111,15 +112,42 @@ static int z_erofs_lz4_prepare_dstpages(struct z_erofs_lz4_decompress_ctx *ctx,
>>>                        victim = availables[--top];
>>>                        get_page(victim);
>>>                } else {
>>> -                     victim = erofs_allocpage(pagepool,
>>> +                     victim = erofs_allocpage_for_decmpr(pagepool,
>>>                                                  GFP_KERNEL | __GFP_NOFAIL);
>>
>> For each read request, the extreme case here will be 15 pages for a 64k LZ4 sliding window (60k = 64k - 4k). You could reduce
>> the LZ4 sliding window to save more pages at a slight compression ratio loss.
> 
> OK, we will test this. However, based on the data we have, 97% of
> the compressed pages that have been read can be decompressed into fewer than 4
> pages. Therefore, we may not put too much hope in this.

Yes, but I'm not sure whether just 3% of the compressed data accounts for the
majority of the latencies.  It'd be better to try it out anyway.
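
To make the worst case above concrete, the back-of-the-envelope arithmetic
looks roughly like this (illustration only, not code from the patch; 4KiB
pages and a 4KiB pcluster assumed, helper name is made up):

/* illustration only: worst-case extra destination pages one read
 * request may need to cover the LZ4 sliding window, assuming 4KiB
 * pages and a 4KiB pcluster
 */
#define EX_PAGE_SIZE		4096U
#define EX_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

static unsigned int worst_case_extra_pages(unsigned int window)
{
	/* the pcluster page itself is excluded; the rest cover the window */
	return EX_DIV_ROUND_UP(window - EX_PAGE_SIZE, EX_PAGE_SIZE);
}

/* worst_case_extra_pages(64 << 10) == 15   (60k = 64k - 4k)
 * worst_case_extra_pages(16 << 10) ==  3   -- a smaller window shrinks
 * the worst case at some compression ratio cost
 */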

> 
>>
>> Or, __GFP_NOFAIL here is actually unnecessary, since we could bail out if allocation fails for readahead requests
>> and only keep the hard guarantee for __read requests__.  I have a plan to do
>> this, but it's too close to the next merge window, so I intend to work this out for Linux 6.9.
> 
> This sounds great. It looks like another optimization relevant to this
> case.
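
Just to illustrate the rough idea against the erofs_allocpage() call in the
hunk above (untested sketch, not the actual plan; "readahead" is a
hypothetical flag for the request type, and the exact error handling is
omitted):

	gfp_t gfp = GFP_KERNEL;

	/* only synchronous reads keep the __GFP_NOFAIL guarantee */
	if (!readahead)
		gfp |= __GFP_NOFAIL;
	victim = erofs_allocpage(pagepool, gfp);
	if (!victim) {
		/* readahead path: give up this request and let a later
		 * synchronous read retry with __GFP_NOFAIL
		 */
		return -ENOMEM;
	}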
> 
>>
>> Anyway, I'm not saying a mempool is not a good idea, but I tend to reserve as little memory as possible if there is some other way to mitigate the same workload, since reserving memory is not free (which means 1 MiB of memory will only be used for this).  Even if we do a mempool, I wonder if we could unify pcpubuf and the mempool together to make a better pool.
> 
> I totally agree with your opinion. We use 256 pages for the worst-case
> scenario, and 1 MiB is acceptable on 8GB devices. However, for 95% of
> scenarios, 64 pages are sufficient and more acceptable for other devices.
> And you are right, I will create a patch to unify pcpubuf and the mempool
> in the next step.

Anyway, if a global mempool is really needed, I'd like to add
a new sysfs interface to change this value (0 by default).
Also, you could reuse some of the shortlived page interfaces for the
global pool rather than introducing another type of page.
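
Something roughly like the following for the knob (untested sketch;
reserved_pages, z_erofs_pool_size() and z_erofs_pool_resize() are
placeholder names only, not existing interfaces):

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* untested sketch: a sysfs knob for the reserved pool size, 0 by default */
static ssize_t reserved_pages_show(struct kobject *kobj,
				   struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%lu\n", z_erofs_pool_size());
}

static ssize_t reserved_pages_store(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    const char *buf, size_t count)
{
	unsigned long nr;
	int err = kstrtoul(buf, 0, &nr);

	if (err)
		return err;
	/* grow or shrink the global pool to @nr pages (placeholder helper) */
	err = z_erofs_pool_resize(nr);
	return err ? err : count;
}

static struct kobj_attribute reserved_pages_attr =
	__ATTR(reserved_pages, 0644, reserved_pages_show, reserved_pages_store);

Then something like "echo 256 > /sys/fs/erofs/reserved_pages" (the path is
just an example) could raise the limit only on devices that benefit from it.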

Thanks,
Gao Xiang

