[PATCH RFC v7 0/7] erofs: inode page cache share feature

Hongbo Li lihongbo22 at huawei.com
Wed Oct 29 23:58:37 AEDT 2025


Hi Xiang,

On 2025/10/21 21:04, Gao Xiang wrote:
> Hi Hongbo,
> 
> On 2025/10/21 18:48, Hongbo Li wrote:
>> Enabling page cache sharing in container scenarios has become
>> increasingly crucial, as it can significantly reduce memory usage. In
>> previous efforts, Hongzhen did substantial work to push this feature
>> into the EROFS mainline. Due to other commitments, he hasn't been able
>> to continue recently, and I'm very pleased to build upon his work and
>> keep refining this implementation.
>>
>> This is a forward-port of Hongzhen's original erofs shared page cache
>> series, posted half a year ago (latest revision):
>> https://lore.kernel.org/all/20250301145002.2420830-1-hongzhen@linux.alibaba.com/T/#u
>>
>> In addition to the forward-port, I have also fixed a couple of bugs
>> and done some minor cleanup during the migration.
>>
>> Notes: Currently, only compilation tests and basic functionality have
>> been verified. Validation of the shared page cache feature is pending
>> until the erofs-utils tooling is complete.
>>
>> (A recap of Hongzhen's original cover letter is below, edited slightly
>> for this series:)
> 
> I'm still behind on this (currently heavily working on erofs-utils
> and containerd), but could we have a workable erofs-utils implementation
> first?
> 

Understood, I will implement a simple debug version first and will send 
the revised code later to address the noted issues.

Thanks,
Hongbo

> Also, Amir's previous suggestion needs to be resolved too:
> https://lore.kernel.org/r/CAOQ4uxjFcw7+w4jfjRKZRDitaXmgK1WhFbidPUFjXFt_6Kew5A@mail.gmail.com
> 
> Finally, thanks for retaining Hongzhen's email (he has already left,
> but thanks for preserving his credits).
> 
> Thanks,
> Gao Xiang
> 
> 
>>
>> Background
>> ==============
>> Currently, reading files with different paths (or names) but identical
>> content consumes multiple copies of the page cache, even though those
>> copies hold the same data. For example, reading identical files (e.g.,
>> *.so files) from two different minor versions of a container image
>> results in duplicate page cache copies, since different containers have
>> different mount points. Therefore, sharing the page cache among files
>> with the same content can save memory.
>>
>> Proposal
>> ==============
>>
>> 1. determining file identity
>> ----------------------------
>> First, we need a way to check whether the contents of two files are
>> the same. Here, the xattr values that carry the file fingerprints are
>> compared for equality. When creating the EROFS image, users can specify
>> the name of the xattr holding the file fingerprint, and that name is
>> stored in the packfile. The on-disk `ishare_key_start` field records
>> the offset of the xattr's name within the packfile:
>>
>> ```
>> struct erofs_super_block {
>>     __le32 build_time;      /* seconds added to epoch for mkfs time */
>>     __le64 rootnid_8b;      /* (48BIT on) nid of root directory */
>> -    __le64 reserved2;
>> +    __le32 ishare_key_start;        /* start of ishare key */
>> +    __le32 reserved2;
>>     __le64 metabox_nid;     /* (METABOX on) nid of the metabox inode */
>>     __le64 reserved3;       /* [align to extslot 1] */
>> };
>> ```
>>
>> For example, users can specify the xattr name used for the file
>> fingerprint as follows:
>>
>> ```
>> mkfs.erofs --ishare_key=trusted.erofs.fingerprint erofs.img ./dir
>> ```
>>
>> In this way, `trusted.erofs.fingerprint` serves as the name of the xattr
>> for the file fingerprint. The relevant patches for erofs-utils will be
>> released later.
>>
>> At the same time, for security reasons, this patch series only shares
>> files within the same domain, which is achieved by adding
>> "-o domain_id=xxxx" during the mounting process:
>>
>> ```
>> mount -t erofs -o domain_id=xxx erofs.img /mnt
>> ```
>>
>> If no domain ID is specified, the mount falls back to the non-sharing
>> mode.
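>>
>> To make the matching rule concrete, below is a minimal userspace-style
>> C model of the identity check. The `ishare_identity` structure and the
>> helper name are hypothetical, introduced only for illustration; in the
>> actual series the fingerprint comes from the configured xattr and the
>> domain from the mount option:
>>
>> ```
>> #include <stdbool.h>
>> #include <stddef.h>
>> #include <string.h>
>>
>> /* Hypothetical identity: two files may share page cache only if both
>>  * the mount domain and the fingerprint xattr value match exactly. */
>> struct ishare_identity {
>> 	const char *domain_id;		/* from "-o domain_id=..." */
>> 	const unsigned char *fp;	/* fingerprint xattr value */
>> 	size_t fp_len;
>> };
>>
>> static bool ishare_identity_match(const struct ishare_identity *a,
>> 				  const struct ishare_identity *b)
>> {
>> 	/* No domain ID means sharing is disabled entirely. */
>> 	if (!a->domain_id || !b->domain_id)
>> 		return false;
>> 	if (strcmp(a->domain_id, b->domain_id) != 0)
>> 		return false;
>> 	return a->fp_len == b->fp_len &&
>> 	       memcmp(a->fp, b->fp, a->fp_len) == 0;
>> }
>> ```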
>>
>> 2. whose page cache is shared?
>> ------------------------------
>>
>> 2.1. share the page cache of inode_A or inode_B
>> -----------------------------------------------
>> For example, we can share the page cache of inode_A, referred to as
>> PGCache_A. When reading file B, we read the contents from PGCache_A to
>> achieve memory savings. Furthermore, if we need to read another file C
>> with the same content, we will still read from PGCache_A. In this way,
>> we fulfill multiple read requests with just a single page cache.
>>
>> 2.2. share the de-duplicated inode's page cache
>> -----------------------------------------------
>> Unlike in 2.1, we allocate an internal deduplicated inode and use its
>> page cache as shared. Reads for files with identical content will
>> ultimately be routed to the page cache of the deduplicated inode. In
>> this way, a single page cache satisfies multiple read requests for
>> different files with the same contents.
>>
>> 2.3. discussion of the two solutions
>> -----------------------------------------------
>> Although the solution in 2.1 allows for page cache sharing, it has an
>> inherent drawback. Inodes are created and destroyed over the file
>> system's lifetime, so when inode_A is destroyed, PGCache_A is released
>> along with it. Consequently, any later read of the same content must
>> fetch the data from the disk again, which conflicts with the design
>> philosophy of the page cache (caching disk contents across accesses).
>>
>> Therefore, I chose to implement the solution in 2.2, which is to
>> allocate an internal deduplicated inode and use its page cache as
>> shared.
>>
>> 3. Implementation
>> ==================
>>
>> 3.1. file open & close
>> ----------------------
>> When the file is opened, the ->private_data field of file A or file B is
>> set to point to an internal deduplicated file. When the actual read
>> occurs, the page cache of this deduplicated file will be accessed.
>>
>> When the file is opened, if the corresponding erofs inode is newly
>> created, the following actions are performed:
>> 1. add the erofs inode to the backing list of the deduplicated inode;
>> 2. increase the reference count of the deduplicated inode.
>>
>> The purpose of step 1 above is to ensure that when a real I/O operation
>> occurs, the deduplicated inode can locate one of the disk devices
>> (as the deduplicated inode itself is not bound to a specific device).
>> Step 2 is for managing the lifecycle of the deduplicated inode.
>>
>> When the erofs inode is destroyed, the reverse of the above actions is
>> taken.
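>>
>> A minimal sketch of the open side, under the assumption that the
>> deduplicated inode carries a lock-protected `backing_list` (all
>> structure, field, and helper names here are illustrative, not
>> necessarily those used by the patches):
>>
>> ```
>> /* Illustrative only: the dedup inode tracks the erofs inodes that
>>  * back it, so real I/O can later find a device to read from. */
>> struct ishare_dedup_info {
>> 	spinlock_t lock;
>> 	struct list_head backing_list;	/* erofs inodes, same content */
>> };
>>
>> static int erofs_ishare_open(struct inode *inode, struct file *file)
>> {
>> 	/* hypothetical lookup of the shared deduplicated file */
>> 	struct file *dfile = erofs_ishare_get_dedup_file(inode);
>> 	struct ishare_dedup_info *di;
>>
>> 	if (IS_ERR_OR_NULL(dfile))
>> 		return dfile ? PTR_ERR(dfile) : 0;
>>
>> 	di = ishare_dedup_info(file_inode(dfile));
>> 	spin_lock(&di->lock);
>> 	/* 1. make this inode available for real I/O (see 3.2.2) */
>> 	list_add(&EROFS_I(inode)->ishare_backing, &di->backing_list);
>> 	spin_unlock(&di->lock);
>> 	ihold(file_inode(dfile));	/* 2. pin the dedup inode */
>>
>> 	file->private_data = dfile;	/* reads are redirected here */
>> 	return 0;
>> }
>> ```
>>
>> Inode destruction would mirror this: remove the inode from the backing
>> list and put the reference taken above.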
>>
>> 3.2. file reading
>> -----------------
>> Assuming the deduplication inode's page cache is PGCache_dedup, there
>> are two possible scenarios when reading a file:
>> 1) the content being read is already present in PGCache_dedup;
>> 2) the content being read is not present in PGCache_dedup.
>>
>> The second scenario involves an iomap operation to read the data from
>> the disk.
>>
>> 3.2.1. reading existing data in PGCache_dedup
>> -------------------------------------------
>> In this case, the overall read flowchart is as follows (take ksys_read()
>> for example):
>>
>>           ksys_read
>>               │
>>               │
>>               ▼
>>              ...
>>               │
>>               │
>>               ▼
>> erofs_ishare_file_read_iter (switch to backing deduplicated file)
>>               │
>>               │
>>               ▼
>>   read PGCache_dedup & return
>>
>> At this point, the content in PGCache_dedup will be read directly and
>> returned.
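>>
>> A sketch of the redirection step, in the same spirit as overlayfs's
>> stacked reads (names are illustrative; the actual patches may use the
>> backing-file helpers instead of swapping ki_filp by hand):
>>
>> ```
>> /* Redirect the read to the deduplicated file that was attached to
>>  * ->private_data at open time, so every file with the same content
>>  * hits the same page cache (PGCache_dedup). */
>> static ssize_t erofs_ishare_file_read_iter(struct kiocb *iocb,
>> 					   struct iov_iter *to)
>> {
>> 	struct file *dfile = iocb->ki_filp->private_data;
>>
>> 	if (!dfile)	/* sharing disabled: plain read path */
>> 		return generic_file_read_iter(iocb, to);
>>
>> 	/* Switch to the backing deduplicated file; its address_space
>> 	 * then serves (or populates) the request. */
>> 	iocb->ki_filp = dfile;
>> 	return dfile->f_op->read_iter(iocb, to);
>> }
>> ```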
>>
>> 3.2.2 reading non-existent content in PGCache_dedup
>> ---------------------------------------------------
>> In this case, disk I/O operations will be involved. Taking the reading
>> of an uncompressed file as an example, here is the reading process:
>>
>>           ksys_read
>>               │
>>               │
>>               ▼
>>              ...
>>               │
>>               │
>>               ▼
>> erofs_ishare_file_read_iter (switch to backing deduplicated file)
>>               │
>>               │
>>               ▼
>>              ... (allocate pages)
>>               │
>>               │
>>               ▼
>> erofs_read_folio/erofs_readahead
>>               │
>>               │
>>               ▼
>>              ... (iomap)
>>               │
>>               │
>>               ▼
>>          erofs_iomap_begin
>>               │
>>               │
>>               ▼
>>              ...
>>
>> The iomap layer and the layers below it perform the disk I/O. As
>> described in 3.1, the deduplicated inode itself is not bound to a
>> specific device, so it selects an erofs inode from the backing list
>> (by default, the first one) to complete the corresponding iomap
>> operation, as in the sketch below.
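>>
>> For illustration, the default selection might look like this (field
>> names follow the earlier sketches and are equally hypothetical):
>>
>> ```
>> /* The dedup inode has no device of its own: pick a backing erofs
>>  * inode (by default, the first one) to run the iomap against. */
>> static struct inode *
>> erofs_ishare_pick_backing(struct ishare_dedup_info *di)
>> {
>> 	struct erofs_inode *vi;
>>
>> 	spin_lock(&di->lock);
>> 	vi = list_first_entry_or_null(&di->backing_list,
>> 				      struct erofs_inode, ishare_backing);
>> 	spin_unlock(&di->lock);
>> 	return vi ? &vi->vfs_inode : NULL;
>> }
>> ```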
>>
>> 3.2.3 optimized inode selection
>> -------------------------------
>> The inode selection method described in 3.2.2 may pick an "inactive"
>> inode, i.e., one whose device may not have served any read operations
>> for a long time and is therefore likely to be unmounted soon. In that
>> case, unmounting the device can be delayed because shared read requests
>> keep being routed to it. Therefore, we prefer to select "active" inodes
>> for the iomap operation.
>>
>> To achieve this, an additional `processing` list has been added. At the
>> beginning of erofs_{read_folio,readahead}(), the corresponding erofs
>> inode is added to the `processing` list (because it is demonstrably
>> active), and it is removed at the end of erofs_{read_folio,readahead}().
>> In erofs_iomap_begin(), the selected erofs inode's reference count is
>> incremented, and in erofs_iomap_end(), it is decremented.
>>
>> In this way, even after the erofs inode is removed from the `processing`
>> list, the elevated reference count keeps the in-flight read safe. This
>> is somewhat similar to RCU (not exactly the same, but similar), as the
>> sketch below illustrates.
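>>
>> A sketch of that scheme, again with illustrative names (the
>> `processing_list` is assumed to live next to the `backing_list` from
>> 3.1, and `ishare_refs` is an assumed per-inode atomic counter):
>>
>> ```
>> /* Called at the start of erofs_{read_folio,readahead}(): inodes on
>>  * the processing list are known to be actively reading. */
>> static void ishare_processing_add(struct ishare_dedup_info *di,
>> 				  struct erofs_inode *vi)
>> {
>> 	spin_lock(&di->lock);
>> 	list_add(&vi->ishare_processing, &di->processing_list);
>> 	spin_unlock(&di->lock);
>> }
>>
>> /* Called from erofs_iomap_begin(): prefer an active inode and pin it
>>  * so it stays usable until erofs_iomap_end() drops the count, even
>>  * if it leaves the processing list in the meantime (RCU-like). */
>> static struct erofs_inode *
>> ishare_pick_active(struct ishare_dedup_info *di)
>> {
>> 	struct erofs_inode *vi;
>>
>> 	spin_lock(&di->lock);
>> 	vi = list_first_entry_or_null(&di->processing_list,
>> 				      struct erofs_inode,
>> 				      ishare_processing);
>> 	if (vi)
>> 		atomic_inc(&vi->ishare_refs);
>> 	spin_unlock(&di->lock);
>> 	return vi;
>> }
>> ```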
>>
>> 3.3. release page cache
>> -----------------------
>> Similar to overlayfs, when dropping the page cache via .fadvise, erofs
>> locates the deduplicated file and applies vfs_fadvise to that specific
>> file.
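>>
>> Roughly, and only as a sketch (generic_fadvise() and vfs_fadvise()
>> are real kernel helpers; the wrapper name is not):
>>
>> ```
>> /* Forward advice (e.g. POSIX_FADV_DONTNEED) to the deduplicated file
>>  * that actually owns the shared page cache, as overlayfs does for
>>  * its underlying real file. */
>> static int erofs_ishare_fadvise(struct file *file, loff_t offset,
>> 				loff_t len, int advice)
>> {
>> 	struct file *dfile = file->private_data;
>>
>> 	if (!dfile)
>> 		return generic_fadvise(file, offset, len, advice);
>> 	return vfs_fadvise(dfile, offset, len, advice);
>> }
>> ```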
>>
>> Effect
>> ==================
>> I conducted experiments on two aspects across two different minor
>> versions of container images:
>>
>> 1. reading all files in the two minor versions of each container image
>>
>> 2. running workloads or the default entrypoint within the containers [1]
>>
>> Below is the memory usage for reading all files in two different minor
>> versions of container images:
>>
>> +-------------------+------------------+-------------+---------------+
>> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
>> |                   |                  |             | Reduction (%) |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |     241     |       -       |
>> |       redis       +------------------+-------------+---------------+
>> |   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |     872     |       -       |
>> |      postgres     +------------------+-------------+---------------+
>> |    16.1 & 16.2    |        Yes       |     630     |      28%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |     2771    |       -       |
>> |     tensorflow    +------------------+-------------+---------------+
>> |  2.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |     926     |       -       |
>> |       mysql       +------------------+-------------+---------------+
>> |  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |     390     |       -       |
>> |       nginx       +------------------+-------------+---------------+
>> |   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
>> +-------------------+------------------+-------------+---------------+
>> |       tomcat      |        No        |     924     |       -       |
>> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
>> |                   |        Yes       |     474     |      49%      |
>> +-------------------+------------------+-------------+---------------+
>>
>> Additionally, the table below shows the runtime memory usage of the
>> container:
>>
>> +-------------------+------------------+-------------+---------------+
>> |       Image       | Page Cache Share | Memory (MB) |    Memory     |
>> |                   |                  |             | Reduction (%) |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |     34.9    |       -       |
>> |       redis       +------------------+-------------+---------------+
>> |   7.2.4 & 7.2.5   |        Yes       |     33.6    |       4%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |    149.1    |       -       |
>> |      postgres     +------------------+-------------+---------------+
>> |    16.1 & 16.2    |        Yes       |      95     |      37%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |    1027.9   |       -       |
>> |     tensorflow    +------------------+-------------+---------------+
>> |  2.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |    155.0    |       -       |
>> |       mysql       +------------------+-------------+---------------+
>> |  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
>> +-------------------+------------------+-------------+---------------+
>> |                   |        No        |     25.4    |       -       |
>> |       nginx       +------------------+-------------+---------------+
>> |   7.2.4 & 7.2.5   |        Yes       |     18.8    |      26%      |
>> +-------------------+------------------+-------------+---------------+
>> |       tomcat      |        No        |     186     |       -       |
>> | 10.1.25 & 10.1.26 +------------------+-------------+---------------+
>> |                   |        Yes       |      99     |      47%      |
>> +-------------------+------------------+-------------+---------------+
>>
>> It can be observed that when reading all the files in the image, the
>> reduced memory usage varies from 16% to 49%, depending on the specific
>> image. Additionally, the container's runtime memory usage reduction
>> ranges from 4% to 47%.
>>
>> [1] Below are the workloads for these images:
>>        - redis: redis-benchmark
>>        - postgres: sysbench
>>        - tensorflow: app.py of tensorflow.python.platform
>>        - mysql: sysbench
>>        - nginx: wrk
>>        - tomcat: default entrypoint
>>
>> This version makes the following changes compared to the previous
>> version (patch v5):
>>
>> - support a user-defined fingerprint name;
>> - support domain-specific page cache sharing;
>> - adjust the code style;
>> - other adjustments to the code implementation.
>>
>> v5: https://lore.kernel.org/all/20250105151208.3797385-1-hongzhen@linux.alibaba.com/
>> v4: https://lore.kernel.org/all/20240902110620.2202586-1-hongzhen@linux.alibaba.com/
>> v3: https://lore.kernel.org/all/20240828111959.3677011-1-hongzhen@linux.alibaba.com/
>> v2: https://lore.kernel.org/all/20240731080704.678259-1-hongzhen@linux.alibaba.com/
>> v1: https://lore.kernel.org/all/20240722065355.1396365-1-hongzhen@linux.alibaba.com/
>>
> 

