[PATCH RFC 3/4] erofs: introduce page cache share feature

Thu Jul 25 18:52:25 AEST 2024

On 2024/7/25 16:42, Gao Xiang wrote:
> Hi all,
> 
> On 2024/7/25 15:43, Hongbo Li via Linux-erofs wrote:
>>
>>
>> On 2024/7/22 14:53, Hongzhen Luo wrote:
>>> Currently, reading files with different paths (or names) but the same
>>> content will consume multiple copies of the page cache, even if the
>>> content of these page caches is the same. For example, reading identical
>>> files (e.g., *.so files) from two different minor versions of container
>>> images will cost multiple copies of the same page cache, since different
>>> containers have different mount points. Therefore, sharing the page 
>>> cache
>>> for files with the same content can save memory.
>>>
>>> This introduces the page cache share feature in erofs. During the mkfs
>>> phase, file content is hashed and the hash value is stored in the
>>> `user.fingerprint` extended attribute. Inodes of files with the same
> 
>> Does this mean the hash is calculated at the file granularity?
> 
> Yes, it should share in the file granularity due to various MM limitations,
> but for strictly immutable fs like EROFS, that is pretty easy to go like
> this.  Also it doesn't matter since such fingerprint is just an extended
> attribute so even Linux future versions support finer granularity, it
> should backward compatiable.
> 
> But to Hongzhen, I think you need to rename it as
> "trusted.erofs.fingerprint" instead since user xattrs are unacceptable
> for this.
> 
> Also I don't have enough time to review this version (working on other
> stuffs now) this month, there are many details needs to be reworked.
> 
> So Hongbo, if you have interest and time to review and improve this work,
> I greatly appreciate it.
> 
ok, I will follow this.

>>> `user.fingerprint` are mapped to an anonymous inode, whose page cache
>>> stores the actual contents. In this way, a single copy of the anonymous
>>> inode's page cache can serve read requests from several files mapped 
>>> to it.
>>>
>>> Below is the memory usage for reading all files in two different minor
>>> versions of container images:
>>>
>>> +-------------------+------------------+-------------+---------------+
>>> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
>>> |                   |                  |             | Reduction (%) |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |     241     |       -       |
>>> |       redis       +------------------+-------------+---------------+
>>> |   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |     872     |       -       |
>>> |      postgres     +------------------+-------------+---------------+
>>> |    16.1 & 16.2    |        Yes       |     630     |      28%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |     2771    |       -       |
>>> |     tensorflow    +------------------+-------------+---------------+
>>> |  1.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |     926     |       -       |
>>> |       mysql       +------------------+-------------+---------------+
>>> |  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |     390     |       -       |
>>> |       nginx       +------------------+-------------+---------------+
>>> |   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |       tomcat      |        No        |     924     |       -       |
>>> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
>>> |                   |        Yes       |     474     |      49%      |
>>> +-------------------+------------------+-------------+---------------+
>>>
>>> Additionally, the table below shows the runtime memory usage of the
>>> container:
>>>
>>> +-------------------+------------------+-------------+---------------+
>>> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
>>> |                   |                  |             | Reduction (%) |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |     34.9    |       -       |
>>> |       redis       +------------------+-------------+---------------+
>>> |   7.2.4 & 7.2.5   |        Yes       |     33.6    |       4%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |    149.1    |       -       |
>>> |      postgres     +------------------+-------------+---------------+
>>> |    16.1 & 16.2    |        Yes       |      95     |      37%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |    1027.9   |       -       |
>>> |     tensorflow    +------------------+-------------+---------------+
>>> |  1.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |    155.0    |       -       |
>>> |       mysql       +------------------+-------------+---------------+
>>> |  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |                   |        No        |     25.4    |       -       |
>>> |       nginx       +------------------+-------------+---------------+
>>> |   7.2.4 & 7.2.5   |        Yes       |     18.8    |      26%      |
>>> +-------------------+------------------+-------------+---------------+
>>> |       tomcat      |        No        |     186     |       -       |
>>> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
>>> |                   |        Yes       |      99     |      47%      |
>>> +-------------------+------------------+-------------+---------------+
>>>
>>> Signed-off-by: Hongzhen Luo <hongzhen at linux.alibaba.com>
>>> ---
>>>   fs/erofs/Kconfig           |  10 ++
>>>   fs/erofs/Makefile          |   1 +
>>>   fs/erofs/internal.h        |   8 +
>>>   fs/erofs/pagecache_share.c | 313 +++++++++++++++++++++++++++++++++++++
>>>   fs/erofs/pagecache_share.h |  20 +++
>>>   5 files changed, 352 insertions(+)
>>>   create mode 100644 fs/erofs/pagecache_share.c
>>>   create mode 100644 fs/erofs/pagecache_share.h
>>>
>>> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
>>> index 7dcdce660cac..756a74de623c 100644
>>> --- a/fs/erofs/Kconfig
>>> +++ b/fs/erofs/Kconfig
>>> @@ -158,3 +158,13 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
>>>         at higher priority.
>>>         If unsure, say N.
>>> +
>>> +config EROFS_FS_PAGE_CACHE_SHARE
>>> +       bool "EROFS page cache share support"
>>> +       depends on EROFS_FS
>>> +       default n
>>> +    help
>>> +      This permits EROFS to share page cache for files with same
>>> +      fingerprints.
>>> +
>>> +      If unsure, say N.
>>> diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
>>> index 097d672e6b14..f14a2ac0e561 100644
>>> --- a/fs/erofs/Makefile
>>> +++ b/fs/erofs/Makefile
>>> @@ -8,3 +8,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_LZMA) += decompressor_lzma.o
>>>   erofs-$(CONFIG_EROFS_FS_ZIP_DEFLATE) += decompressor_deflate.o
>>>   erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
>>>   erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
>>> +erofs-$(CONFIG_EROFS_FS_PAGE_CACHE_SHARE) += pagecache_share.o
>>> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
>>> index 2c695ec6ee71..136d6bed417b 100644
>>> --- a/fs/erofs/internal.h
>>> +++ b/fs/erofs/internal.h
>>> @@ -288,6 +288,12 @@ struct erofs_inode {
>>>           };
>>>   #endif    /* CONFIG_EROFS_FS_ZIP */
>>>       };
>>> +#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
>>> +    struct list_head pcs_list;
>>> +    char *fprt;
>>> +    int fprt_len;
>>> +    unsigned long fprt_hash;
>>> +#endif
>>>       /* the corresponding vfs inode */
>>>       struct inode vfs_inode;
>>>   };
>>> @@ -376,6 +382,7 @@ extern const struct super_operations erofs_sops;
>>>   extern const struct address_space_operations erofs_raw_access_aops;
>>>   extern const struct address_space_operations z_erofs_aops;
>>>   extern const struct address_space_operations 
>>> erofs_fscache_access_aops;
>>> +extern const struct address_space_operations erofs_pcs_aops;
>>>   extern const struct inode_operations erofs_generic_iops;
>>>   extern const struct inode_operations erofs_symlink_iops;
>>> @@ -384,6 +391,7 @@ extern const struct inode_operations erofs_dir_iops;
>>>   extern const struct file_operations erofs_file_fops;
>>>   extern const struct file_operations erofs_dir_fops;
>>> +extern const struct file_operations erofs_pcs_file_fops;
>>>   extern const struct iomap_ops z_erofs_iomap_report_ops;
>>> diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
>>> new file mode 100644
>>> index 000000000000..e8c8fe16cda5
>>> --- /dev/null
>>> +++ b/fs/erofs/pagecache_share.c
>>> @@ -0,0 +1,313 @@
>>> +// SPDX-License-Identifier: GPL-2.0-or-later
>>> +/*
>>> + * Copyright (C) 2024, Alibaba Cloud
>>> + */
>>> +#include <linux/xxhash.h>
>>> +#include "pagecache_share.h"
>>> +#include "internal.h"
>>> +#include "xattr.h"
>>> +
>>> +#define PCS_FPRT_IDX    1
>>> +#define PCS_FPRT_NAME    "fingerprint"
>>> +#define PCS_FPRT_MAXLEN 1024
>>> +
>>> +struct erofs_pcs {
>>> +    struct erofs_inode *cur;
>> Why we should cur to pointer the inode? I think we only need to record 
>> the fprt of the content. When we need inode, we can check the 
>> list_empty, and to get the any inode to use.
>>> +    struct rw_semaphore rw_sem;
>>> +    struct mutex list_mutex;
>>> +    struct list_head list;
>>> +};
>>> +
>>> +static DEFINE_MUTEX(pseudo_mnt_lock);
>>> +static unsigned long pseudo_mnt_count;
>>> +static struct vfsmount *erofs_pcs_mnt;
>>> +
>>> +int erofs_pcs_init_mnt(void)
>>> +{
>>> +    mutex_lock(&pseudo_mnt_lock);
>>> +    if (!erofs_pcs_mnt) {
>>> +        erofs_pcs_mnt = kern_mount(&erofs_anon_fs_type);
>>> +        if (IS_ERR(erofs_pcs_mnt))
>> Not unlock here.
> 
> Also I don't think it needs another kern_mount just
> for this.
> 
> You need to refactor the existing code in advance.
> 
>>> +            return PTR_ERR(erofs_pcs_mnt);
>>> +        pseudo_mnt_count = 1;
>>> +    } else
>>> +        pseudo_mnt_count += 1;
>>> +    mutex_unlock(&pseudo_mnt_lock);
>>> +    return 0;
>>> +}
>>> +
>>> +int erofs_pcs_free_mnt(void)
>>> +{
>>> +    mutex_lock(&pseudo_mnt_lock);
>>> +    pseudo_mnt_count -= 1;
>>> +    if (pseudo_mnt_count < 0)
>>> +        return -EINVAL;
>>> +    if (pseudo_mnt_count == 0) {
>>> +        kern_unmount(erofs_pcs_mnt);
>>> +        erofs_pcs_mnt = NULL;
>>> +    }
>>> +    mutex_unlock(&pseudo_mnt_lock);
>>> +    return 0;
>>> +}
>>> +
>>> +void erofs_pcs_fill_inode(struct inode *inode)
>>> +{
>>> +    struct erofs_inode *vi = EROFS_I(inode);
>>> +    char fprt[PCS_FPRT_MAXLEN];
>>> +
>>> +    vi->fprt_len = erofs_getxattr(inode, PCS_FPRT_IDX, 
>>> PCS_FPRT_NAME, fprt,
>>> +                      PCS_FPRT_MAXLEN);
>>> +    if (vi->fprt_len > 0 && vi->fprt_len <= PCS_FPRT_MAXLEN) {
>>> +        vi->fprt = kmalloc(vi->fprt_len, GFP_KERNEL);
>>> +        if (IS_ERR(vi->fprt)) {
>>> +            vi->fprt_len = -1;
>>> +            return;
>>> +        }
>>> +        memcpy(vi->fprt, fprt, vi->fprt_len);
>>> +        vi->fprt_hash = xxh32(vi->fprt, vi->fprt_len, 0);
>>> +    }
>>> +}
>>> +
>>> +static int erofs_pcs_eq(struct inode *ano_inode, void *data)
>>> +{
>>> +    struct erofs_pcs *pcs = ano_inode->i_private;
>>> +    struct erofs_inode *vi = (struct erofs_inode *)data;
>>> +
>>> +    return pcs && pcs->cur && pcs->cur->fprt_len == vi->fprt_len &&
>>> +           memcmp(pcs->cur->fprt, vi->fprt, vi->fprt_len) == 0 ? 1 : 0;
>> May be we just need compare vi->fprt_hash, no need memcmp?
>>> +}
>>> +
>>> +static int erofs_pcs_set_aops(struct inode *ano_inode, void *data)
>>> +{
>>> +    ano_inode->i_mapping->a_ops = &erofs_pcs_aops;
>>> +    ano_inode->i_size = ((struct erofs_inode *)data)->vfs_inode.i_size;
>>> +    return 0;
>>> +}
>>> +
>>> +int erofs_pcs_add(struct inode *inode)
>>> +{
>>> +    struct erofs_inode *vi = EROFS_I(inode);
>>> +    struct inode *ano_inode;
>>> +    struct erofs_pcs *pcs;
>>> +
>>> +    ano_inode = iget5_locked(erofs_pcs_mnt->mnt_sb, vi->fprt_hash,
>>> +                 erofs_pcs_eq, erofs_pcs_set_aops, vi);
>>> +    spin_lock(&ano_inode->i_lock);
>>> +    if (!ano_inode->i_private) {
>>> +        pcs = kzalloc(sizeof(struct erofs_pcs), GFP_KERNEL);
>>> +        if (IS_ERR(pcs))
>> should call spin_unlock when failed
>>> +            return -ENOMEM;
>>> +        INIT_LIST_HEAD(&pcs->list);
>>> +        mutex_init(&pcs->list_mutex);
>>> +        init_rwsem(&pcs->rw_sem);
>>> +        ano_inode->i_private = pcs;
>>> +    }
>>> +    spin_unlock(&ano_inode->i_lock);
>>> +    pcs = ano_inode->i_private;
>> pcs = ano_inode->i_private;
>> spin_lock(&ano_inode->i_lock);
>> if (!pcs) {
>> xxx
>> }
>> spin_unlock(&ano_inode->i_lock);
>>> +    mutex_lock(&pcs->list_mutex);
>>> +    list_add_tail(&vi->pcs_list, &pcs->list);
>>> +    if (!pcs->cur) {
>>> +        pcs->cur = list_first_entry(&pcs->list,
>>> +                        struct erofs_inode, pcs_list);
>>> +    }
>>> +    mutex_unlock(&pcs->list_mutex);
>>> +    if (ano_inode->i_state & I_NEW)
>>> +        unlock_new_inode(ano_inode);
>>> +    return 0;
>>> +}
>>> +
>>> +int erofs_pcs_remove(struct inode *inode)
>>> +{
>>> +    struct erofs_inode *vi = EROFS_I(inode), *p, *tmp;
>>> +    struct inode *ano_inode;
>>> +    struct erofs_pcs *pcs;
>>> +
>>> +    ano_inode = iget5_locked(erofs_pcs_mnt->mnt_sb, vi->fprt_hash,
>>> +                 erofs_pcs_eq, erofs_pcs_set_aops, vi);
>>> +    iput(ano_inode);
>>> +    pcs = ano_inode->i_private;
>>> +    if (!pcs)
>>> +        return -EINVAL;
>>> +    mutex_lock(&pcs->list_mutex);
>>> +    if (vi == pcs->cur) {
>>> +        /* synchronize with reads using pcs->cur */
>>> +        down_write(&pcs->rw_sem);
>>> +        list_del(&pcs->cur->pcs_list);
>>> +        if (list_empty(&pcs->list)) {
>>> +            pcs->cur = NULL;
>>> +            up_write(&pcs->rw_sem);
>>> +            goto free_ano_inode;
>>> +        }
>>> +        pcs->cur = list_first_entry(&pcs->list, struct erofs_inode,
>>> +                        pcs_list);
>>> +        up_write(&pcs->rw_sem);
>>> +    } else {
>>> +        list_for_each_entry_safe(p, tmp, &pcs->list, pcs_list) {
>>> +            if (p == vi) {
>>> +                list_del(&p->pcs_list);
>>> +                break;
>>> +            }
>>> +        }
>>> +    }
>>> +    iput(ano_inode);
>>> +    mutex_unlock(&pcs->list_mutex);
>>> +    return 0;
>>> +
>>> +free_ano_inode:
>>> +    mutex_unlock(&pcs->list_mutex);
>>> +    spin_lock(&ano_inode->i_lock);
>>> +    if (atomic_read(&ano_inode->i_count) == 1) {
>>> +        ano_inode->i_private = NULL;
>>> +        kfree(pcs);
>>> +    }
>>> +    spin_unlock(&ano_inode->i_lock);
>>> +    iput(ano_inode);
>>> +    return 0;
>>> +}
>>> +
>>> +static int erofs_pcs_iomap_begin(struct inode *ano_inode, loff_t 
>>> offset,
>>> +                loff_t length, unsigned int flags,
>>> +                struct iomap *iomap, struct iomap *srcmap)
>>> +{
>>> +    struct erofs_pcs *pcs;
>>> +
>>> +    pcs = ano_inode->i_private;
>>> +    if (!pcs || !pcs->cur)
>>> +        return -EINVAL;
>>> +
>>> +    return erofs_iomap_begin(&pcs->cur->vfs_inode, offset, length, 
>>> flags,
>>> +                 iomap, srcmap);
>>> +}
>>> +
>>> +static int erofs_pcs_iomap_end(struct inode *ano_inode, loff_t pos,
>>> +                   loff_t length, ssize_t written,
>>> +                   unsigned int flags, struct iomap *iomap)
>>> +{
>>> +    struct erofs_pcs *pcs;
>>> +
>>> +    pcs = ano_inode->i_private;
>>> +    if (!pcs)
>>> +        return -EINVAL;
>>> +
>>> +    return erofs_iomap_end(&pcs->cur->vfs_inode, pos, length, written,
>>> +                   flags, iomap);
>>> +}
>>> +
>>> +const struct iomap_ops erofs_pcs_iomap_ops = {
>>> +    .iomap_begin = erofs_pcs_iomap_begin,
>>> +    .iomap_end = erofs_pcs_iomap_end,
>> Does this work on on-demand mode over fscache? As I know, this io path 
>> is only for erofs, not for erofs over fscache.
> 
> That is why I think this implementation is far from upstreaming:
>    - over fscache
>    - compression
> are all needed to be implemented.
> 
> I don't think it needs to add any extra helpers just for this, you
> need to handle page cache sharing in the original paths to find
> the share inodes instead.
> 
I remember that Jingbo have done the similar thing like this. At: 
https://patchew.org/linux/20230111083158.23462-1-jefflexu@linux.alibaba.com/

But I don't know what happened afterward.

Thanks,
Hongbo

> Thanks,
> Gao Xiang