[PATCH] erofs: add a global page pool for lz4 decompression
Chunhai Guo
guochunhai at vivo.com
Tue Jan 2 17:08:56 AEDT 2024
On 2023/12/31 9:09, Gao Xiang wrote:
>
> On 2023/12/29 12:48, Chunhai Guo wrote:
>>> Hi Chunhai,
>>>
>>> On 2023/12/28 21:00, Chunhai Guo wrote:
>>>> Using a global page pool for LZ4 decompression significantly reduces
>>>> the time spent on page allocation in low memory scenarios.
>>>>
>>>> The table below shows the reduction in time spent on page allocation
>>>> for
>>>> LZ4 decompression when using a global page pool.
>>>> The results were obtained from multi-app launch benchmarks on ARM64
>>>> Android devices running the 5.15 kernel.
>>>> +--------------+---------------+--------------+---------+
>>>> | | w/o page pool | w/ page pool | diff |
>>>> +--------------+---------------+--------------+---------+
>>>> | Average (ms) | 3434 | 21 | -99.38% |
>>>> +--------------+---------------+--------------+---------+
>>>>
>>>> Based on the benchmark logs, it appears that 256 pages are sufficient
>>>> for most cases, but this can be adjusted as needed. Additionally,
>>>> turning on CONFIG_EROFS_FS_DEBUG will simplify the tuning process.
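Since the utils.c hunk that implements the pool is not quoted in this excerpt, below is a minimal sketch of what the global pool side might look like. The list/lock layout and helper bodies are assumptions for illustration; only erofs_global_page_pool_init() and the 256-page figure come from the patch itself.

/*
 * Illustrative sketch only, not the actual utils.c code: a global pool of
 * pre-allocated pages protected by a spinlock. The struct layout and the
 * EROFS_GLOBAL_POOL_PAGES name are assumptions; only
 * erofs_global_page_pool_init() is named in the quoted diff.
 */
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

#define EROFS_GLOBAL_POOL_PAGES	256	/* default suggested by the benchmark logs */

static struct {
	struct list_head pages;		/* cached free pages, linked via page->lru */
	unsigned int nrpages;
	spinlock_t lock;
} erofs_page_pool = {
	.pages = LIST_HEAD_INIT(erofs_page_pool.pages),
	.lock = __SPIN_LOCK_UNLOCKED(erofs_page_pool.lock),
};

void erofs_global_page_pool_init(void)
{
	/* fill the pool up front; a partially filled pool still works */
	while (erofs_page_pool.nrpages < EROFS_GLOBAL_POOL_PAGES) {
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			break;
		spin_lock(&erofs_page_pool.lock);
		list_add(&page->lru, &erofs_page_pool.pages);
		++erofs_page_pool.nrpages;
		spin_unlock(&erofs_page_pool.lock);
	}
}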
>>> Thanks for the patch. I have some questions:
>>> - what pcluster sizes are you using? 4k or more?
>> We currently use a 4k pcluster size.
>>
>>> - what detailed configuration are you using for the multi-app
>>> launch workload? Such as CPU / Memory / the number of apps.
>> We ran the benchmark on Android devices with the following configuration.
>> In the benchmark, we launched 16 frequently-used apps, and the camera app
>> was the last one in each round. The results in the table above were
>> obtained from the launching process of the camera app.
>> CPU: 8 cores
>> Memory: 8GB
> Is it the accumulated time of the camera app across all rounds or the average
> time of the camera app for each round?
It's the average time of the camera app for each round.
>>>> This patch currently only supports the LZ4 decompressor; other
>>>> decompressors will be supported in the next step.
>>>>
>>>> Signed-off-by: Chunhai Guo <guochunhai at vivo.com>
>>>> ---
>>>> fs/erofs/compress.h | 1 +
>>>> fs/erofs/decompressor.c | 42 ++++++++++++--
>>>> fs/erofs/internal.h | 5 ++
>>>> fs/erofs/super.c | 1 +
>>>> fs/erofs/utils.c | 121 ++++++++++++++++++++++++++++++++++++++++
>>>> 5 files changed, 165 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/erofs/compress.h b/fs/erofs/compress.h
>>>> index 279933e007d2..67202b97d47b 100644
>>>> --- a/fs/erofs/compress.h
>>>> +++ b/fs/erofs/compress.h
>>>> @@ -31,6 +31,7 @@ struct z_erofs_decompressor {
>>>> /* some special page->private (unsigned long, see below) */
>>>> #define Z_EROFS_SHORTLIVED_PAGE (-1UL << 2)
>>>> #define Z_EROFS_PREALLOCATED_PAGE (-2UL << 2)
>>>> +#define Z_EROFS_POOL_PAGE (-3UL << 2)
>>>>
>>>> /*
>>>>     * For all pages in a pcluster, page->private should be one of
>>>>
>>>> diff --git a/fs/erofs/decompressor.c b/fs/erofs/decompressor.c
>>>> index d08a6ee23ac5..41b34f01416f 100644
>>>> --- a/fs/erofs/decompressor.c
>>>> +++ b/fs/erofs/decompressor.c
>>>> @@ -54,6 +54,7 @@ static int z_erofs_load_lz4_config(struct super_block *sb,
>>>> sbi->lz4.max_distance_pages = distance ?
>>>> DIV_ROUND_UP(distance, PAGE_SIZE) + 1 :
>>>> LZ4_MAX_DISTANCE_PAGES;
>>>> + erofs_global_page_pool_init();
>>>> return erofs_pcpubuf_growsize(sbi->lz4.max_pclusterblks);
>>>> }
>>>>
>>>> @@ -111,15 +112,42 @@ static int z_erofs_lz4_prepare_dstpages(struct z_erofs_lz4_decompress_ctx *ctx,
>>>> victim = availables[--top];
>>>> get_page(victim);
>>>> } else {
>>>> -                    victim = erofs_allocpage(pagepool,
>>>> +                    victim = erofs_allocpage_for_decmpr(pagepool,
>>>>                                       GFP_KERNEL | __GFP_NOFAIL);
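The body of erofs_allocpage_for_decmpr() lives in the utils.c hunk, which is not quoted above. A plausible sketch of its pool-first behaviour, reusing the illustrative erofs_page_pool layout from the earlier sketch (the function body below is an assumption; only the function name, Z_EROFS_POOL_PAGE, and the fallback to erofs_allocpage() appear in the patch):

/*
 * Sketch only: grab a page from the global pool when one is cached,
 * otherwise fall back to the ordinary allocation path. Tagging the page
 * with Z_EROFS_POOL_PAGE lets the release path hand it back to the pool
 * instead of freeing it.
 */
struct page *erofs_allocpage_for_decmpr(struct page **pagepool, gfp_t gfp)
{
	struct page *page = NULL;

	spin_lock(&erofs_page_pool.lock);
	if (!list_empty(&erofs_page_pool.pages)) {
		page = list_first_entry(&erofs_page_pool.pages,
					struct page, lru);
		list_del(&page->lru);
		--erofs_page_pool.nrpages;
	}
	spin_unlock(&erofs_page_pool.lock);

	if (page) {
		set_page_private(page, Z_EROFS_POOL_PAGE);
		return page;
	}
	/* pool exhausted: behave exactly like plain erofs_allocpage() */
	return erofs_allocpage(pagepool, gfp);
}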
>>> For each read request, the extreme case here will be 15 pages for a 64k LZ4 sliding window (60k = 64k - 4k). You could reduce
>>> the LZ4 sliding window to save more pages, with a slight compression ratio loss.
>> OK, we will do the test on this. However, based on the data we have, 97% of
>> the compressed pages that have been read can be decompressed into fewer than
>> 4 pages, so we do not put too much hope in this.
> Yes, but I'm not sure whether just 3% of the compressed data accounts for the
> majority of the latency. It'd be better to try it out anyway.
OK, we will do the test on this.
>
>>> Or, __GFP_NOFAIL here is actually unnecessary, since we could bail out if allocation fails for readahead requests
>>> and only keep it for __read requests__. I have some plan to do
>>> this, but it's too close to the next merge window, so I plan to work this out for Linux 6.9.
>> This sounds great. It looks like another optimization related to this
>> case.
>>
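The bail-out idea above would roughly mean dropping __GFP_NOFAIL for readahead I/O and letting the failure propagate. A tiny illustration follows; the helper and the readahead flag are assumptions, not part of the patch:

/*
 * Illustration only: readahead requests may simply fail under memory
 * pressure, while synchronous reads keep the nofail behaviour. The
 * 'readahead' flag is assumed to be plumbed down from the I/O path.
 */
static struct page *alloc_dst_page(struct page **pagepool, bool readahead)
{
	gfp_t gfp = readahead ? GFP_KERNEL : GFP_KERNEL | __GFP_NOFAIL;

	/* a NULL return (readahead only) lets the caller bail out early */
	return erofs_allocpage(pagepool, gfp);
}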
>>> Anyway, I'm not saying a mempool is not a good idea, but I tend to reserve as little memory as possible if there is some other way to mitigate the same workload, since reserving memory is not free (it means 1 MiB of memory will only ever be used for this). Even if we do a mempool, I wonder if we could unify pcpubuf and the mempool together to make a better pool.
>> I totally agree with your opinion. We use 256 pages for the worst-case
>> scenario, and 1 MiB is acceptable on 8GB devices. However, for 95% of
>> scenarios, 64 pages are sufficient and more acceptable for other devices.
>> And you are right, I will create a patch to unify the pcpubuf and mempool
>> in the next step.
> Anyway, if a global mempool is really needed, I'd like to add
> a new sysfs interface to change this value (0 by default).
> Also, you could reuse some of the shortlived-page interfaces for the global
> pool rather than introduce another type of pages.
OK. I will make another patch for this.
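For reference, the kind of sysfs knob suggested above could look roughly like the sketch below. The attribute name, the reserved_pages variable, and the wiring into the existing erofs sysfs code are all assumptions for illustration; the only point taken from the discussion is that the value defaults to 0.

/*
 * Illustrative sketch of a sysfs knob for the reserved pool size.
 * Everything here (names, placement) is hypothetical; a real patch
 * would integrate with fs/erofs/sysfs.c.
 */
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/kstrtox.h>

static unsigned long erofs_reserved_pages;	/* default: 0, nothing reserved */

static ssize_t reserved_pages_show(struct kobject *kobj,
				   struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%lu\n", erofs_reserved_pages);
}

static ssize_t reserved_pages_store(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    const char *buf, size_t count)
{
	unsigned long nr;
	int err = kstrtoul(buf, 0, &nr);

	if (err)
		return err;
	/* grow or shrink the global pool to the requested size here */
	erofs_reserved_pages = nr;
	return count;
}

static struct kobj_attribute reserved_pages_attr =
	__ATTR(reserved_pages, 0644, reserved_pages_show, reserved_pages_store);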
Thanks,
>
> Thanks,
> Gao Xiang