[PATCH 4/4] erofs-utils: mkfs: introduce global compressed data deduplication
Gao Xiang
xiang at kernel.org
Thu Sep 8 10:56:32 AEST 2022
On Wed, Sep 07, 2022 at 01:55:50PM +0800, Gao Xiang wrote:
> On Tue, Sep 06, 2022 at 07:40:57PM +0800, ZiyangZhang wrote:
> > From: Ziyang Zhang <ZiyangZhang at linux.alibaba.com>
> >
> > This patch introduces global compressed data deduplication to
> > reuse potential prefixes for each pcluster.
> >
> > It also uses rolling hashing and shortens the previous compressed
> > extent in order to explore more possibilities for deduplication.
> >
> > Co-developed-by: Gao Xiang <hsiangkao at linux.alibaba.com>
> > Signed-off-by: Ziyang Zhang <ZiyangZhang at linux.alibaba.com>
>
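
For readers unfamiliar with the technique: a rolling hash lets the matcher
slide a fixed-size window one byte at a time and update the hash in O(1)
instead of rehashing the whole window. A minimal sketch, assuming a plain
polynomial (Rabin-Karp style) hash; the multiplier, the 32-bit width and the
rh_* names are illustrative only and are not taken from lib/dedupe.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RH_BASE 257u	/* hypothetical multiplier, not the one used by the patch */

/* Hash of buf[0..n): sum of buf[i] * RH_BASE^(n-1-i), modulo 2^32. */
static uint32_t rh_init(const uint8_t *buf, size_t n)
{
	uint32_t h = 0;

	for (size_t i = 0; i < n; i++)
		h = h * RH_BASE + buf[i];
	return h;
}

/* RH_BASE^(n-1) modulo 2^32: the weight of the oldest byte in the window. */
static uint32_t rh_shift(size_t n)
{
	uint32_t p = 1;

	for (size_t i = 1; i < n; i++)
		p *= RH_BASE;
	return p;
}

/* Slide the window by one byte: drop `out`, append `in`. */
static uint32_t rh_advance(uint32_t h, uint32_t shift, uint8_t out, uint8_t in)
{
	return (h - out * shift) * RH_BASE + in;
}
```

Advancing from the window at offset i-1 to the one at offset i yields the same
value as calling rh_init() on the new window, which is what makes a lookup at
every byte offset affordable.
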
> Some preliminary numbers:
> Image                       Fs     Type                                            Size        Change
> system.raven.87e115a1       erofs  uncompressed                                    910082048
> system.raven.87e115a1       erofs  4k pcluster + lz4hc,9 + ztailpacking            584970240   -35.7%
> system.raven.87e115a1       erofs  4k pcluster + lz4hc,9 + ztailpacking + dedupe   569376768   -37.4%
>
> linux-5.10 + linux-5.10.87  erofs  uncompressed                                    1943691264
> linux-5.10 + linux-5.10.87  erofs  4k pcluster + lz4hc,9 + ztailpacking            661987328   -65.9%
> linux-5.10 + linux-5.10.87  erofs  4k pcluster + lz4hc,9 + ztailpacking + dedupe   490295296   -74.8%
>
> linux-5.10.87               erofs  4k pcluster + lz4hc,9                           331292672
>
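
The percentages quoted above are the size reduction relative to the
uncompressed image of the same input. A small sketch of that arithmetic, using
only the sizes listed; the helper name saving_percent() is made up for
illustration:

```c
#include <stdint.h>

/* Size reduction of `packed` relative to `orig`, in percent. */
static double saving_percent(uint64_t orig, uint64_t packed)
{
	return 100.0 * (1.0 - (double)packed / (double)orig);
}
```

For example, saving_percent(1943691264, 490295296) comes out at roughly 74.8,
matching the dedupe row for the two kernel trees.
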
> One observation: a tailpacking pcluster has no blkaddr, so data belonging
> to tailpacking pclusters cannot be deduplicated.
>
> On the other hand, dedupe can later be combined with the `fragment' feature
> to minimize image sizes.
>
>
> Attached is a fix for uncompressed pclusters:
>
> diff --git a/lib/dedupe.c b/lib/dedupe.c
> index c53a64edfc8d..c382303e2ceb 100644
> --- a/lib/dedupe.c
> +++ b/lib/dedupe.c
> @@ -21,7 +21,7 @@ struct z_erofs_dedupe_item {
>  	unsigned int compressed_blks;
>
>  	int original_length;
> -	bool partial;
> +	bool partial, raw;
>  	u8 extra_data[];
>  };
>
> @@ -86,6 +86,7 @@ int z_erofs_dedupe_match(struct z_erofs_dedupe_ctx *ctx)
>  		ctx->e.length = window_size + extra;
>  		ctx->e.partial = e->partial ||
>  			(window_size + extra < e->original_length);
> +		ctx->e.raw = e->raw;
>  		ctx->e.blkaddr = e->compressed_blkaddr;
>  		ctx->e.compressedblks = e->compressed_blks;
>  		return 0;
> @@ -114,6 +115,7 @@ int z_erofs_dedupe_insert(struct z_erofs_inmem_extent *e,
>  	di->compressed_blkaddr = e->blkaddr;
>  	di->compressed_blks = e->compressedblks;
>  	di->partial = e->partial;
> +	di->raw = e->raw;
>
>  	/* with the same rolling hash */
>  	if (!rb_tree_insert(dedupe_subtree, di))
> --
> 2.30.2
>
Here is another fix, for an issue found with an Android system image:

diff --git a/lib/compress.c b/lib/compress.c
index bdb6e78d32ca..3247835b75b6 100644
--- a/lib/compress.c
+++ b/lib/compress.c
@@ -158,9 +158,14 @@ static int z_erofs_compress_dedupe(struct erofs_inode *inode,
 	do {
 		struct z_erofs_dedupe_ctx dctx = {
-			.start = ctx->queue + ctx->head -
-				(ctx->e.length < EROFS_BLKSIZ ? 0 :
-				 ctx->e.length - EROFS_BLKSIZ),
+			.start = ctx->queue + ctx->head - ({ int rc;
+				if (ctx->e.length <= EROFS_BLKSIZ)
+					rc = 0;
+				else if (ctx->e.length - EROFS_BLKSIZ >= ctx->head)
+					rc = ctx->head;
+				else
+					rc = ctx->e.length - EROFS_BLKSIZ;
+				rc; }),
 			.end = ctx->queue + ctx->head + *len,
 			.cur = ctx->queue + ctx->head,
 		};
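
Judging from the diff alone, the old expression could subtract more than
ctx->head from ctx->queue + ctx->head, which would place .start before the
beginning of the queue; the new code clamps the look-back distance to
ctx->head. The same logic pulled out into a standalone helper, as a sketch:
dedupe_lookback() and BLKSIZ are names invented here, and the 4 KiB block size
is an assumption standing in for EROFS_BLKSIZ:

```c
#include <assert.h>

#define BLKSIZ 4096u	/* assumed value, stands in for EROFS_BLKSIZ */

/*
 * Number of bytes before `head` at which the dedupe search may start,
 * given the length of the previous extent. Never exceeds `head`, so
 * queue + head - dedupe_lookback(...) stays inside the queue.
 */
static unsigned int dedupe_lookback(unsigned int length, unsigned int head)
{
	if (length <= BLKSIZ)
		return 0;
	if (length - BLKSIZ >= head)
		return head;
	return length - BLKSIZ;
}
```

With length = 10000 and head = 100, the unclamped value 10000 - 4096 = 5904
would exceed head, so the helper returns 100 instead.
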