[PATCH RFC 0/4] erofs: introduce page cache share feature

Hongzhen Luo hongzhen at linux.alibaba.com
Mon Jul 22 16:53:50 AEST 2024


[Background]
================
Currently, reading files with different paths (or names) but identical
content consumes multiple copies of the page cache, even though the
cached data is the same. For example, reading identical files (e.g.,
*.so files) from two different minor versions of a container image
populates the page cache once per container, since different containers
have different mount points. Sharing the page cache among files with the
same content can therefore save memory.

[Implementation]
================
During the mkfs phase, the content of each file is hashed and the hash
value is stored in the `user.fingerprint` extended attribute. Inodes of
files with the same `user.fingerprint` are mapped to an anonymous inode,
whose page cache stores the actual contents. In this way, a single copy of
the anonymous inode's page cache can serve read requests from all the files
mapped to it. The following diagram shows the relationship between the
anonymous inode and the inodes with the same content:

                        page cache                            
                   ┌────┬────┬────┬──────┐                    
             ┌────►│    │    │ ...│      │                    
             │     └────┴────┴────┴──────┘                    
             │                                                
             │                                                
             │      i_private                                 
          ┌──┴────────┬───┐                                   
       ┌─►│ ano_inode │   │                                   
       │  └───────────┴─┬─┘                                   
       │                │                                     
       │       ┌────────┘                                     
mapped │       ▼                                              
  to   │  ┌──────────┬───┬─────┐                              
       │  │erofs_pcs │cur│ list│                              
       │  └──────────┴─┬─┴───┬─┘                              
       │               │     │                                
       │     ┌─────────┘     │                                
       │     │               │                                
       │     │    ┌──────────┘                                
       │     │    │                                           
       │     ▼    ▼                                           
       │  ┌────────┐       ┌────────┐               ┌────────┐
       └──┤        │ ────► │        │ ───►      ──► │        │
          │        │       │        │      ...      │        │
          └────────┘ ◄──── └────────┘ ◄───      ◄── └────────┘
                                                              
            inode_1          inode_2                  inode_n 

In the above diagram, the `i_private` (protected by `i_lock`) field of the
anonymous inode points to the `struct erofs_pcs` structure:

struct erofs_pcs {
	struct erofs_inode *cur;
	struct rw_semaphore rw_sem;
	struct mutex list_mutex;
	struct list_head list;
};

where the `list` field points to a list of inodes that are mapped to the
anonymous inode and have the same `user.fingerprint` value. The `cur` field
points to the first inode in the inode list, which is used for I/O
mapping (iomap) related operations.
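
To make the fingerprint-based mapping concrete, here is a rough userspace
model of it. It is only a sketch and not code from this series: the names
(`model_anon`, `model_inode`, `model_anon_lookup`, `model_inode_map`), the
32-byte fingerprint length and the fixed-size lookup table are all
assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Userspace model; every name here is hypothetical, not from the patch. */
#define FP_LEN   32   /* assumed fingerprint length in bytes */
#define MAX_ANON 8    /* toy capacity of the lookup table */

struct model_anon {
	unsigned char fp[FP_LEN]; /* fingerprint shared by all mapped inodes */
	int nr_mapped;            /* number of inodes currently mapped here */
	int in_use;
};

struct model_inode {
	unsigned char fp[FP_LEN]; /* stands in for the user.fingerprint xattr */
	struct model_anon *anon;  /* the "mapped to" arrow in the diagram */
};

static struct model_anon anon_table[MAX_ANON];

/* Return the anon object for fp, creating it on first use; NULL if full. */
static struct model_anon *model_anon_lookup(const unsigned char *fp)
{
	struct model_anon *free_slot = NULL;

	for (size_t i = 0; i < MAX_ANON; i++) {
		struct model_anon *a = &anon_table[i];

		if (a->in_use && memcmp(a->fp, fp, FP_LEN) == 0)
			return a;
		if (!a->in_use && !free_slot)
			free_slot = a;
	}
	if (free_slot) {
		memcpy(free_slot->fp, fp, FP_LEN);
		free_slot->nr_mapped = 0;
		free_slot->in_use = 1;
	}
	return free_slot;
}

/* Map an inode to the anon object matching its fingerprint. */
static int model_inode_map(struct model_inode *inode)
{
	struct model_anon *a = model_anon_lookup(inode->fp);

	if (!a)
		return -1;
	inode->anon = a;
	a->nr_mapped++;
	return 0;
}
```

Two inodes whose fingerprints compare equal end up with the same `anon`
pointer, so in this model only one cached copy would back both of them.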

When an inode is created, it is added to the inode list pointed to by the
`erofs_pcs` structure corresponding to the anonymous inode; similarly, when
the inode is destroyed, it is removed from the inode list. Note that if the
inode is the one pointed to by `cur`, then it is necessary to acquire the
read-write semaphore `rw_sem` to maintain synchronization, in case the inode
is being used for iomap operations elsewhere. 

[Effect]
================
I ran two kinds of experiments, each on two different minor versions of a
container image:

1. reading all files in both versions of the image

2. running workloads or using the default entrypoint within the
   containers [1]

Below is the memory usage for reading all files in two different minor
versions of container images:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     241     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     872     |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     630     |      28%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     2771    |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  1.11.0 & 2.11.1  |        Yes       |     2340    |      16%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     926     |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     390     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     924     |       -       |
|       tomcat      +------------------+-------------+---------------+
| 10.1.25 & 10.1.26 |        Yes       |     474     |      49%      |
+-------------------+------------------+-------------+---------------+

Additionally, the table below shows the runtime memory usage of the
container:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     34.9    |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     33.6    |       4%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    149.1    |       -       |
|      postgres     +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |      95     |      37%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    1027.9   |       -       |
|     tensorflow    +------------------+-------------+---------------+
|  1.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    155.0    |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     25.4    |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     18.8    |      26%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     186     |       -       |
|       tomcat      +------------------+-------------+---------------+
| 10.1.25 & 10.1.26 |        Yes       |      99     |      47%      |
+-------------------+------------------+-------------+---------------+

When reading all the files in the images, memory usage drops by 16% to 49%,
depending on the specific image. For the runtime workloads, the reduction
in the container's memory usage ranges from 4% to 47%.
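
For readers who want to re-derive the reduction column: the percentages in
both tables are consistent with (before - after) / before, rounded up to
the next whole percent. The rounding rule is inferred from the numbers and
is not stated anywhere in this letter, and the helper name below is
hypothetical.

```c
/*
 * Memory values are passed in tenths of a MB so that table entries such
 * as 34.9 stay exact in integer arithmetic. Assumes after <= before and
 * before > 0.
 */
static unsigned int reduction_pct_ceil(unsigned int before_tenths,
				       unsigned int after_tenths)
{
	unsigned int saved = before_tenths - after_tenths;

	/* Integer ceiling of 100 * saved / before. */
	return (100u * saved + before_tenths - 1u) / before_tenths;
}
```

For example, `reduction_pct_ceil(2410, 1630)` evaluates to 33, matching the
redis row of the first table (241 MB versus 163 MB).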

[1] The workloads used for these images are:
      - redis: redis-benchmark
      - postgres: sysbench
      - tensorflow: app.py of tensorflow.python.platform
      - mysql: sysbench
      - nginx: wrk
      - tomcat: default entrypoint

Hongzhen Luo (4):
  erofs: move `struct erofs_anon_fs_type` to super.c
  erofs: expose erofs_iomap_{begin, end}
  erofs: introduce page cache share feature
  erofs: apply the page cache share feature

 fs/erofs/Kconfig           |  10 ++
 fs/erofs/Makefile          |   1 +
 fs/erofs/data.c            |   9 +-
 fs/erofs/fscache.c         |  13 +-
 fs/erofs/inode.c           |  17 ++
 fs/erofs/internal.h        |   8 +
 fs/erofs/pagecache_share.c | 318 +++++++++++++++++++++++++++++++++++++
 fs/erofs/pagecache_share.h |  23 +++
 fs/erofs/super.c           |  40 +++++
 9 files changed, 425 insertions(+), 14 deletions(-)
 create mode 100644 fs/erofs/pagecache_share.c
 create mode 100644 fs/erofs/pagecache_share.h

-- 
2.43.5


