[RFC PATCH] erofs-utils: mkfs: introduce multi-thread compression

Gao Xiang hsiangkao at linux.alibaba.com
Sun Aug 20 04:41:18 AEST 2023


Hi Yifan,

On 2023/8/20 02:01, Yifan Zhao wrote:
> This patch introduces multi-threaded compression to accelerate image
> packaging.
> ---
> Hi all:
> 
> This is a very imperfect patch not ready for merging, and any suggestions would be appreciated!
> If it's on track, I'd like to follow up on that.
> 
> The inefficiency of EROFS compressed image creation is a much criticized problem,
> and this patch attempts to address it by creating multiple threads
> to run the compression algorithm in parallel.

Many thanks if you have time to follow up on that.

Yet due to the release process timing, erofs-utils 1.7 will be released
in about a month, so I think multithreaded support will land as part of
erofs-utils v1.8.

> 
> Specifically, each input file over 16MB is split into segments,
> and each thread compresses a segment as if it were a separate file.
> Finally, the main thread merges all the compressed segments into one file.
> This process does not involve any data contention.
> 
> Current issues:
> 1.	For each large file, we create and destroy a batch of worker threads, causing unnecessary overhead.
> 	Moreover, each worker thread's context is a global variable, making the binary bigger.
> 	In the future, we can pre-create worker threads when the program starts running.
> 	Worker threads serve as consumers and the main thread that makes the compression request is the producer.

Could we use (or enhance?)
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?id=e5b83309b199966cc757cb095d1ff1ebd0923b3e

as a start?
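
To make the producer/consumer idea in issue 1 concrete, here is a rough
sketch of a pool of workers created once, with the main thread queueing
per-segment jobs and merging the results in file order. It is only an
illustration under stated assumptions, not the code from the commit above
and not the patch itself: seg_job, worker_pool, pool_init(), pool_destroy()
and pack_file() are hypothetical names, the segment size is a free
parameter, and the "compression" step is a placeholder that halves the
length.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct seg_job {		/* hypothetical: one segment of a large file */
	uint64_t off, len;	/* input byte range */
	uint64_t out_len;	/* "compressed" size, set by a worker */
	struct seg_job *next;
};

struct worker_pool {
	pthread_mutex_t lock;
	pthread_cond_t cond;	/* signals new work, completion, shutdown */
	struct seg_job *head;	/* LIFO is fine: results are stored per job */
	unsigned int pending, nr_workers, shutdown;
	pthread_t *threads;
};

static void *worker_main(void *arg)
{
	struct worker_pool *p = arg;
	struct seg_job *job;

	pthread_mutex_lock(&p->lock);
	for (;;) {
		while (!p->head && !p->shutdown)
			pthread_cond_wait(&p->cond, &p->lock);
		if (!p->head)		/* shutting down and queue drained */
			break;
		job = p->head;
		p->head = job->next;
		pthread_mutex_unlock(&p->lock);
		job->out_len = job->len / 2;	/* placeholder for a real compressor */
		pthread_mutex_lock(&p->lock);
		if (--p->pending == 0)
			pthread_cond_broadcast(&p->cond);
	}
	pthread_mutex_unlock(&p->lock);
	return NULL;
}

static void pool_destroy(struct worker_pool *p)
{
	pthread_mutex_lock(&p->lock);
	p->shutdown = 1;
	pthread_cond_broadcast(&p->cond);
	pthread_mutex_unlock(&p->lock);
	while (p->nr_workers)
		pthread_join(p->threads[--p->nr_workers], NULL);
	free(p->threads);
	pthread_cond_destroy(&p->cond);
	pthread_mutex_destroy(&p->lock);
}

static int pool_init(struct worker_pool *p, unsigned int n)
{
	*p = (struct worker_pool){ 0 };
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->cond, NULL);
	p->threads = calloc(n, sizeof(*p->threads));
	while (p->threads && p->nr_workers < n &&
	       !pthread_create(&p->threads[p->nr_workers], NULL, worker_main, p))
		p->nr_workers++;
	if (n && p->nr_workers == n)
		return 0;
	pool_destroy(p);	/* joins only the threads that did start */
	return -1;
}

/* Producer: split the file into segments, queue them, wait, then merge. */
static uint64_t pack_file(struct worker_pool *p, uint64_t size, uint64_t seg)
{
	uint64_t n, i, total = 0;
	struct seg_job *jobs;

	assert(size && seg);
	n = (size + seg - 1) / seg;
	jobs = calloc(n, sizeof(*jobs));
	if (!jobs)
		return 0;
	pthread_mutex_lock(&p->lock);
	for (i = 0; i < n; i++) {
		jobs[i].off = i * seg;
		jobs[i].len = i + 1 < n ? seg : size - jobs[i].off;
		jobs[i].next = p->head;
		p->head = &jobs[i];
		p->pending++;
	}
	pthread_cond_broadcast(&p->cond);
	while (p->pending)
		pthread_cond_wait(&p->cond, &p->lock);
	pthread_mutex_unlock(&p->lock);
	for (i = 0; i < n; i++)		/* merge results in file order */
		total += jobs[i].out_len;
	free(jobs);
	return total;
}
```

In this sketch pool_init() would be called once at startup, pack_file()
once per large file, and pool_destroy() at exit, so no threads are created
or destroyed per file and the pool context need not be a global.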

> 2.	Fragment/Dedupe together with other advanced features are not fully tested
> 	due to my poor knowledge of the compression process. Not sure if they work well with multithreading.

I have a preliminary design for Fragment/Dedupe; we could discuss the
details later if you'd like to spend more time on this, thanks! ;)

> 3.	There is a lot of code redundancy between the
> 	erofs_write_compressed_file() and erofs_write_compressed_file_single() functions.
> 	I don't want to break the original single-threaded execution logic,
> 	but erofs_write_compressed_file() has a high complexity and
> 	my failed attempt to merge the two functions makes the matter worse.
> 	I'm not sure if we should merge them together or keep two different function entries for single and multi-threaded compression.

I think we need to merge these eventually.

> 
> Performance:
> 	Despite the naive patch, we still see performance gain due to the poor baseline performance especially for lz4hc.
> 	1. Packing time of an Arch Linux container image [1] provided by @wszqkzqk [2].
> 		lz4  : 8s(multi-thread) v.s. 10s(single-thread)
> 		lz4hc: 48s(multi-thread) v.s. 54s(single-thread)
> 	2. Packing time of Linux v6.4 git repository (with several ~GB git object files).
> 		lz4  : 14s(multi-thread) v.s. 23s(single-thread)
> 		lz4hc: 49s(multi-thread) v.s. 212s(single-thread)

That is reasonable anyway, but in order to make multi-threaded support
better, some code needs to be refactored first.

Actually, I have some cleanup patches on hand to prepare for multithreaded
support, but again, I will apply these after 1.7 is released.

> 
> BTW, is there any format file (e.g., .clang-format) available for me to format erofs-utils project?

Not yet. erofs-utils follows the Linux kernel coding style; would you mind
submitting a patch for this?

Thanks,
Gao Xiang

