[PATCH RFC 3/4] erofs: introduce page cache share feature

Thu Jul 25 19:00:38 AEST 2024

On 2024/7/25 16:52, Hongbo Li wrote:
> 
> 
> On 2024/7/25 16:42, Gao Xiang wrote:
>> Hi all,
>>
>> On 2024/7/25 15:43, Hongbo Li via Linux-erofs wrote:
>>>
>>>
>>> On 2024/7/22 14:53, Hongzhen Luo wrote:
>>>> Currently, reading files with different paths (or names) but the same
>>>> content will consume multiple copies of the page cache, even if the
>>>> content of these page caches is the same. For example, reading identical
>>>> files (e.g., *.so files) from two different minor versions of container
>>>> images will cost multiple copies of the same page cache, since different
>>>> containers have different mount points. Therefore, sharing the page cache
>>>> for files with the same content can save memory.
>>>>
>>>> This introduces the page cache share feature in erofs. During the mkfs
>>>> phase, file content is hashed and the hash value is stored in the
>>>> `user.fingerprint` extended attribute. Inodes of files with the same
>>
>>> Does this mean the hash is calculated at the file granularity?
>>
>> Yes, it should share in the file granularity due to various MM limitations,
>> but for strictly immutable fs like EROFS, that is pretty easy to go like
>> this.  Also it doesn't matter since such fingerprint is just an extended
>> attribute so even Linux future versions support finer granularity, it
>> should backward compatiable.
>>
>> But to Hongzhen, I think you need to rename it as
>> "trusted.erofs.fingerprint" instead since user xattrs are unacceptable
>> for this.
>>
>> Also I don't have enough time to review this version (working on other
>> stuffs now) this month, there are many details needs to be reworked.
>>
>> So Hongbo, if you have interest and time to review and improve this work,
>> I greatly appreciate it.
>>
> ok, I will follow this.

Thanks!

> 
>>>> `user.fingerprint` are mapped to an anonymous inode, whose page cache
>>>> stores the actual contents. In this way, a single copy of the anonymous
>>>> inode's page cache can serve read requests from several files mapped to it.
>>>>
>>>> Below is the memory usage for reading all files in two different minor
>>>> versions of container images:
>>>>
>>>> +-------------------+------------------+-------------+---------------+
>>>> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
>>>> |                   |                  |             | Reduction (%) |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |     241     |       -       |
>>>> |       redis       +------------------+-------------+---------------+
>>>> |   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |     872     |       -       |
>>>> |      postgres     +------------------+-------------+---------------+
>>>> |    16.1 & 16.2    |        Yes       |     630     |      28%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |     2771    |       -       |
>>>> |     tensorflow    +------------------+-------------+---------------+
>>>> |  1.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |     926     |       -       |
>>>> |       mysql       +------------------+-------------+---------------+
>>>> |  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |     390     |       -       |
>>>> |       nginx       +------------------+-------------+---------------+
>>>> |   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |       tomcat      |        No        |     924     |       -       |
>>>> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
>>>> |                   |        Yes       |     474     |      49%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>>
>>>> Additionally, the table below shows the runtime memory usage of the
>>>> container:
>>>>
>>>> +-------------------+------------------+-------------+---------------+
>>>> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
>>>> |                   |                  |             | Reduction (%) |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |     34.9    |       -       |
>>>> |       redis       +------------------+-------------+---------------+
>>>> |   7.2.4 & 7.2.5   |        Yes       |     33.6    |       4%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |    149.1    |       -       |
>>>> |      postgres     +------------------+-------------+---------------+
>>>> |    16.1 & 16.2    |        Yes       |      95     |      37%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |    1027.9   |       -       |
>>>> |     tensorflow    +------------------+-------------+---------------+
>>>> |  1.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |    155.0    |       -       |
>>>> |       mysql       +------------------+-------------+---------------+
>>>> |  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |                   |        No        |     25.4    |       -       |
>>>> |       nginx       +------------------+-------------+---------------+
>>>> |   7.2.4 & 7.2.5   |        Yes       |     18.8    |      26%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>> |       tomcat      |        No        |     186     |       -       |
>>>> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
>>>> |                   |        Yes       |      99     |      47%      |
>>>> +-------------------+------------------+-------------+---------------+
>>>>
>>>> Signed-off-by: Hongzhen Luo <hongzhen at linux.alibaba.com>
>>>> ---
>>>>   fs/erofs/Kconfig           |  10 ++
>>>>   fs/erofs/Makefile          |   1 +
>>>>   fs/erofs/internal.h        |   8 +
>>>>   fs/erofs/pagecache_share.c | 313 +++++++++++++++++++++++++++++++++++++
>>>>   fs/erofs/pagecache_share.h |  20 +++
>>>>   5 files changed, 352 insertions(+)
>>>>   create mode 100644 fs/erofs/pagecache_share.c
>>>>   create mode 100644 fs/erofs/pagecache_share.h
>>>>
>>>> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
>>>> index 7dcdce660cac..756a74de623c 100644
>>>> --- a/fs/erofs/Kconfig
>>>> +++ b/fs/erofs/Kconfig
>>>> @@ -158,3 +158,13 @@ config EROFS_FS_PCPU_KTHREAD_HIPRI
>>>>         at higher priority.
>>>>         If unsure, say N.
>>>> +
>>>> +config EROFS_FS_PAGE_CACHE_SHARE
>>>> +       bool "EROFS page cache share support"
>>>> +       depends on EROFS_FS
>>>> +       default n
>>>> +    help
>>>> +      This permits EROFS to share page cache for files with same
>>>> +      fingerprints.
>>>> +
>>>> +      If unsure, say N.
>>>> diff --git a/fs/erofs/Makefile b/fs/erofs/Makefile
>>>> index 097d672e6b14..f14a2ac0e561 100644
>>>> --- a/fs/erofs/Makefile
>>>> +++ b/fs/erofs/Makefile
>>>> @@ -8,3 +8,4 @@ erofs-$(CONFIG_EROFS_FS_ZIP_LZMA) += decompressor_lzma.o
>>>>   erofs-$(CONFIG_EROFS_FS_ZIP_DEFLATE) += decompressor_deflate.o
>>>>   erofs-$(CONFIG_EROFS_FS_ZIP_ZSTD) += decompressor_zstd.o
>>>>   erofs-$(CONFIG_EROFS_FS_ONDEMAND) += fscache.o
>>>> +erofs-$(CONFIG_EROFS_FS_PAGE_CACHE_SHARE) += pagecache_share.o
>>>> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
>>>> index 2c695ec6ee71..136d6bed417b 100644
>>>> --- a/fs/erofs/internal.h
>>>> +++ b/fs/erofs/internal.h
>>>> @@ -288,6 +288,12 @@ struct erofs_inode {
>>>>           };
>>>>   #endif    /* CONFIG_EROFS_FS_ZIP */
>>>>       };
>>>> +#ifdef CONFIG_EROFS_FS_PAGE_CACHE_SHARE
>>>> +    struct list_head pcs_list;
>>>> +    char *fprt;
>>>> +    int fprt_len;
>>>> +    unsigned long fprt_hash;
>>>> +#endif
>>>>       /* the corresponding vfs inode */
>>>>       struct inode vfs_inode;
>>>>   };
>>>> @@ -376,6 +382,7 @@ extern const struct super_operations erofs_sops;
>>>>   extern const struct address_space_operations erofs_raw_access_aops;
>>>>   extern const struct address_space_operations z_erofs_aops;
>>>>   extern const struct address_space_operations erofs_fscache_access_aops;
>>>> +extern const struct address_space_operations erofs_pcs_aops;
>>>>   extern const struct inode_operations erofs_generic_iops;
>>>>   extern const struct inode_operations erofs_symlink_iops;
>>>> @@ -384,6 +391,7 @@ extern const struct inode_operations erofs_dir_iops;
>>>>   extern const struct file_operations erofs_file_fops;
>>>>   extern const struct file_operations erofs_dir_fops;
>>>> +extern const struct file_operations erofs_pcs_file_fops;
>>>>   extern const struct iomap_ops z_erofs_iomap_report_ops;
>>>> diff --git a/fs/erofs/pagecache_share.c b/fs/erofs/pagecache_share.c
>>>> new file mode 100644
>>>> index 000000000000..e8c8fe16cda5
>>>> --- /dev/null
>>>> +++ b/fs/erofs/pagecache_share.c
>>>> @@ -0,0 +1,313 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-or-later
>>>> +/*
>>>> + * Copyright (C) 2024, Alibaba Cloud
>>>> + */
>>>> +#include <linux/xxhash.h>
>>>> +#include "pagecache_share.h"
>>>> +#include "internal.h"
>>>> +#include "xattr.h"
>>>> +
>>>> +#define PCS_FPRT_IDX    1
>>>> +#define PCS_FPRT_NAME    "fingerprint"
>>>> +#define PCS_FPRT_MAXLEN 1024
>>>> +
>>>> +struct erofs_pcs {
>>>> +    struct erofs_inode *cur;
>>> Why we should cur to pointer the inode? I think we only need to record the fprt of the content. When we need inode, we can check the list_empty, and to get the any inode to use.
>>>> +    struct rw_semaphore rw_sem;
>>>> +    struct mutex list_mutex;
>>>> +    struct list_head list;
>>>> +};
>>>> +
>>>> +static DEFINE_MUTEX(pseudo_mnt_lock);
>>>> +static unsigned long pseudo_mnt_count;
>>>> +static struct vfsmount *erofs_pcs_mnt;
>>>> +
>>>> +int erofs_pcs_init_mnt(void)
>>>> +{
>>>> +    mutex_lock(&pseudo_mnt_lock);
>>>> +    if (!erofs_pcs_mnt) {
>>>> +        erofs_pcs_mnt = kern_mount(&erofs_anon_fs_type);
>>>> +        if (IS_ERR(erofs_pcs_mnt))
>>> Not unlock here.
>>
>> Also I don't think it needs another kern_mount just
>> for this.
>>
>> You need to refactor the existing code in advance.
>>
>>>> +            return PTR_ERR(erofs_pcs_mnt);
>>>> +        pseudo_mnt_count = 1;
>>>> +    } else
>>>> +        pseudo_mnt_count += 1;
>>>> +    mutex_unlock(&pseudo_mnt_lock);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int erofs_pcs_free_mnt(void)
>>>> +{
>>>> +    mutex_lock(&pseudo_mnt_lock);
>>>> +    pseudo_mnt_count -= 1;
>>>> +    if (pseudo_mnt_count < 0)
>>>> +        return -EINVAL;
>>>> +    if (pseudo_mnt_count == 0) {
>>>> +        kern_unmount(erofs_pcs_mnt);
>>>> +        erofs_pcs_mnt = NULL;
>>>> +    }
>>>> +    mutex_unlock(&pseudo_mnt_lock);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void erofs_pcs_fill_inode(struct inode *inode)
>>>> +{
>>>> +    struct erofs_inode *vi = EROFS_I(inode);
>>>> +    char fprt[PCS_FPRT_MAXLEN];
>>>> +
>>>> +    vi->fprt_len = erofs_getxattr(inode, PCS_FPRT_IDX, PCS_FPRT_NAME, fprt,
>>>> +                      PCS_FPRT_MAXLEN);
>>>> +    if (vi->fprt_len > 0 && vi->fprt_len <= PCS_FPRT_MAXLEN) {
>>>> +        vi->fprt = kmalloc(vi->fprt_len, GFP_KERNEL);
>>>> +        if (IS_ERR(vi->fprt)) {
>>>> +            vi->fprt_len = -1;
>>>> +            return;
>>>> +        }
>>>> +        memcpy(vi->fprt, fprt, vi->fprt_len);
>>>> +        vi->fprt_hash = xxh32(vi->fprt, vi->fprt_len, 0);
>>>> +    }
>>>> +}
>>>> +
>>>> +static int erofs_pcs_eq(struct inode *ano_inode, void *data)
>>>> +{
>>>> +    struct erofs_pcs *pcs = ano_inode->i_private;
>>>> +    struct erofs_inode *vi = (struct erofs_inode *)data;
>>>> +
>>>> +    return pcs && pcs->cur && pcs->cur->fprt_len == vi->fprt_len &&
>>>> +           memcmp(pcs->cur->fprt, vi->fprt, vi->fprt_len) == 0 ? 1 : 0;
>>> May be we just need compare vi->fprt_hash, no need memcmp?
>>>> +}
>>>> +
>>>> +static int erofs_pcs_set_aops(struct inode *ano_inode, void *data)
>>>> +{
>>>> +    ano_inode->i_mapping->a_ops = &erofs_pcs_aops;
>>>> +    ano_inode->i_size = ((struct erofs_inode *)data)->vfs_inode.i_size;
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int erofs_pcs_add(struct inode *inode)
>>>> +{
>>>> +    struct erofs_inode *vi = EROFS_I(inode);
>>>> +    struct inode *ano_inode;
>>>> +    struct erofs_pcs *pcs;
>>>> +
>>>> +    ano_inode = iget5_locked(erofs_pcs_mnt->mnt_sb, vi->fprt_hash,
>>>> +                 erofs_pcs_eq, erofs_pcs_set_aops, vi);
>>>> +    spin_lock(&ano_inode->i_lock);
>>>> +    if (!ano_inode->i_private) {
>>>> +        pcs = kzalloc(sizeof(struct erofs_pcs), GFP_KERNEL);
>>>> +        if (IS_ERR(pcs))
>>> should call spin_unlock when failed
>>>> +            return -ENOMEM;
>>>> +        INIT_LIST_HEAD(&pcs->list);
>>>> +        mutex_init(&pcs->list_mutex);
>>>> +        init_rwsem(&pcs->rw_sem);
>>>> +        ano_inode->i_private = pcs;
>>>> +    }
>>>> +    spin_unlock(&ano_inode->i_lock);
>>>> +    pcs = ano_inode->i_private;
>>> pcs = ano_inode->i_private;
>>> spin_lock(&ano_inode->i_lock);
>>> if (!pcs) {
>>> xxx
>>> }
>>> spin_unlock(&ano_inode->i_lock);
>>>> +    mutex_lock(&pcs->list_mutex);
>>>> +    list_add_tail(&vi->pcs_list, &pcs->list);
>>>> +    if (!pcs->cur) {
>>>> +        pcs->cur = list_first_entry(&pcs->list,
>>>> +                        struct erofs_inode, pcs_list);
>>>> +    }
>>>> +    mutex_unlock(&pcs->list_mutex);
>>>> +    if (ano_inode->i_state & I_NEW)
>>>> +        unlock_new_inode(ano_inode);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int erofs_pcs_remove(struct inode *inode)
>>>> +{
>>>> +    struct erofs_inode *vi = EROFS_I(inode), *p, *tmp;
>>>> +    struct inode *ano_inode;
>>>> +    struct erofs_pcs *pcs;
>>>> +
>>>> +    ano_inode = iget5_locked(erofs_pcs_mnt->mnt_sb, vi->fprt_hash,
>>>> +                 erofs_pcs_eq, erofs_pcs_set_aops, vi);
>>>> +    iput(ano_inode);
>>>> +    pcs = ano_inode->i_private;
>>>> +    if (!pcs)
>>>> +        return -EINVAL;
>>>> +    mutex_lock(&pcs->list_mutex);
>>>> +    if (vi == pcs->cur) {
>>>> +        /* synchronize with reads using pcs->cur */
>>>> +        down_write(&pcs->rw_sem);
>>>> +        list_del(&pcs->cur->pcs_list);
>>>> +        if (list_empty(&pcs->list)) {
>>>> +            pcs->cur = NULL;
>>>> +            up_write(&pcs->rw_sem);
>>>> +            goto free_ano_inode;
>>>> +        }
>>>> +        pcs->cur = list_first_entry(&pcs->list, struct erofs_inode,
>>>> +                        pcs_list);
>>>> +        up_write(&pcs->rw_sem);
>>>> +    } else {
>>>> +        list_for_each_entry_safe(p, tmp, &pcs->list, pcs_list) {
>>>> +            if (p == vi) {
>>>> +                list_del(&p->pcs_list);
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    iput(ano_inode);
>>>> +    mutex_unlock(&pcs->list_mutex);
>>>> +    return 0;
>>>> +
>>>> +free_ano_inode:
>>>> +    mutex_unlock(&pcs->list_mutex);
>>>> +    spin_lock(&ano_inode->i_lock);
>>>> +    if (atomic_read(&ano_inode->i_count) == 1) {
>>>> +        ano_inode->i_private = NULL;
>>>> +        kfree(pcs);
>>>> +    }
>>>> +    spin_unlock(&ano_inode->i_lock);
>>>> +    iput(ano_inode);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int erofs_pcs_iomap_begin(struct inode *ano_inode, loff_t offset,
>>>> +                loff_t length, unsigned int flags,
>>>> +                struct iomap *iomap, struct iomap *srcmap)
>>>> +{
>>>> +    struct erofs_pcs *pcs;
>>>> +
>>>> +    pcs = ano_inode->i_private;
>>>> +    if (!pcs || !pcs->cur)
>>>> +        return -EINVAL;
>>>> +
>>>> +    return erofs_iomap_begin(&pcs->cur->vfs_inode, offset, length, flags,
>>>> +                 iomap, srcmap);
>>>> +}
>>>> +
>>>> +static int erofs_pcs_iomap_end(struct inode *ano_inode, loff_t pos,
>>>> +                   loff_t length, ssize_t written,
>>>> +                   unsigned int flags, struct iomap *iomap)
>>>> +{
>>>> +    struct erofs_pcs *pcs;
>>>> +
>>>> +    pcs = ano_inode->i_private;
>>>> +    if (!pcs)
>>>> +        return -EINVAL;
>>>> +
>>>> +    return erofs_iomap_end(&pcs->cur->vfs_inode, pos, length, written,
>>>> +                   flags, iomap);
>>>> +}
>>>> +
>>>> +const struct iomap_ops erofs_pcs_iomap_ops = {
>>>> +    .iomap_begin = erofs_pcs_iomap_begin,
>>>> +    .iomap_end = erofs_pcs_iomap_end,
>>> Does this work on on-demand mode over fscache? As I know, this io path is only for erofs, not for erofs over fscache.
>>
>> That is why I think this implementation is far from upstreaming:
>>    - over fscache
>>    - compression
>> are all needed to be implemented.
>>
>> I don't think it needs to add any extra helpers just for this, you
>> need to handle page cache sharing in the original paths to find
>> the share inodes instead.
>>
> I remember that Jingbo have done the similar thing like this. At: https://patchew.org/linux/20230111083158.23462-1-jefflexu@linux.alibaba.com/
> 
> But I don't know what happened afterward.

Yes, those two resolve the same requirement but in
different ways:

Jingbo's approach use bdev/fscache meta inodes for
page cache sharing, but in that way:
   - `vma->pgoff` needs to be modified (but which is
     unsafe due to some corner cases), and it needs
     the shared data physical consective.

   - it doesn't apply to encoded data like
     compressed data.

The current way (maintaining inodes across different
filesystems) will resolve the issues above. It's
much like a very simplified overlay filesystem anyway.

Thanks,
Gao Xiang