[PATCH 2/4] erofs-utils: add a built-in DEFLATE compressor

Gao Xiang hsiangkao at linux.alibaba.com
Mon Jul 10 14:51:03 AEST 2023



On 2023/7/10 12:16, Eric Biggers wrote:
> On Mon, Jul 10, 2023 at 11:01:16AM +0800, Gao Xiang wrote:
>> Hi Eric,
>>
>> On 2023/7/10 10:33, Eric Biggers wrote:
>>> Hi Gao,
>>>
>>> On Mon, Jul 10, 2023 at 02:25:09AM +0800, Gao Xiang wrote:
>>>>
>>>> Since there is _no fixed-sized output DEFLATE compression approach_
>>>> available in public (fitblk is somewhat ineffective) and the original
>>>> zlib is developing quite slowly, let's work out one for our use cases.
>>>> Fortunately, it's less than 1.5kLOC with lazy matching to match basic
>>>> zlib ability.  Besides, near-optimal block splitting (based on a price
>>>> function) isn't supported for now since it's not urgent for us.
>>>>
>>>> In the future, more built-in optimizers may land to fulfill our needs
>>>> even further.  In addition, I'd be quite happy to see more popular
>>>> encoders support fixed-sized output compression too.
>>>>
>>>> [1] https://developer.apple.com/documentation/compression/compression_algorithm
>>>> [2] https://datatracker.ietf.org/doc/html/rfc1951
>>>> Signed-off-by: Gao Xiang <hsiangkao at linux.alibaba.com>
>>>> ---
>>>>    lib/Makefile.am    |    2 +
>>>>    lib/kite_deflate.c | 1267 ++++++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 1269 insertions(+)
>>>>    create mode 100644 lib/kite_deflate.c
>>>
>>> Have you considered simply running any standard compressor multiple times to
>>> find the maximum input size that compresses to less than or equal to the desired
>>> output size?  It may sound too slow, but it can be made to do the search in a
>>
>> Thanks for your interest and reply!
>>
>> Actually I've tried several ways before (just like fitblk.c or binary search),
>> but the compression ratio is silently impacted
> 
> That's strange, since the method I suggested should improve the compression
> ratio.

It depends.  The current kite_deflate approach is already better than zlib
(whether via fitblk.c or binary search).  Compared to libdeflate, it doesn't
support optimal DEFLATE block splitting or any advanced matchfinder beyond
hash chains, so I'm afraid it doesn't win there.

This encoder is only meant to replace zlib, and I think we could integrate
`libdeflate` in the way you suggested first.
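
Just to illustrate what I mean by plain hash chains, here is a rough sketch
of a zlib-style hash-chain matchfinder (illustrative only; this is not code
from kite_deflate.c or libdeflate, and all names and constants are made up):

#include <stdint.h>
#include <string.h>

#define MF_WINDOW_SIZE	32768U		/* DEFLATE window size */
#define MF_HASH_BITS	15
#define MF_HASH_SIZE	(1U << MF_HASH_BITS)
#define MF_MIN_MATCH	3U
#define MF_MAX_MATCH	258U
#define MF_NIL		UINT32_MAX	/* end-of-chain marker */

struct hc_matchfinder {
	uint32_t head[MF_HASH_SIZE];	/* hash bucket -> most recent position */
	uint32_t prev[MF_WINDOW_SIZE];	/* position -> previous position in its chain */
};

static void hc_init(struct hc_matchfinder *mf)
{
	memset(mf->head, 0xff, sizeof(mf->head));	/* all chains empty */
}

static uint32_t hc_hash3(const uint8_t *p)
{
	/* hash the next MF_MIN_MATCH bytes */
	return (uint32_t)((p[0] << 16 | p[1] << 8 | p[2]) * 0x9E3779B1U) >>
	       (32 - MF_HASH_BITS);
}

/*
 * Search the chain of position 'pos' for the longest match within the window,
 * then link 'pos' into its chain.  Assumes the whole input is in 'buf' and
 * pos + MF_MIN_MATCH <= nbytes.
 */
static unsigned int hc_longest_match(struct hc_matchfinder *mf,
				     const uint8_t *buf, uint32_t pos,
				     uint32_t nbytes, unsigned int max_depth,
				     uint32_t *offset_ret)
{
	uint32_t h = hc_hash3(&buf[pos]);
	uint32_t cur = mf->head[h];
	uint32_t limit = nbytes - pos < MF_MAX_MATCH ? nbytes - pos : MF_MAX_MATCH;
	unsigned int best_len = 0;

	while (cur != MF_NIL && pos - cur <= MF_WINDOW_SIZE && max_depth--) {
		unsigned int len = 0;

		while (len < limit && buf[cur + len] == buf[pos + len])
			len++;
		if (len > best_len) {
			best_len = len;
			*offset_ret = pos - cur;
			if (len >= limit)
				break;		/* cannot do better */
		}
		cur = mf->prev[cur % MF_WINDOW_SIZE];
	}
	mf->prev[pos % MF_WINDOW_SIZE] = mf->head[h];
	mf->head[h] = pos;
	return best_len >= MF_MIN_MATCH ? best_len : 0;
}

Roughly speaking, with this kind of matchfinder the compression level mostly
controls how deep the chains are walked; my understanding is that libdeflate's
higher levels instead use more advanced matchfinders plus near-optimal parsing,
which is where the extra ratio comes from.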

> 
>> and compression speed is really
>> impacted (maybe we need to develop segment-split multi-threaded compression
>> as well; a college student is working on it now, but I'm not sure of the
>> progress.)
> 
> I'm surprised that multi-threaded compression isn't something you have already!

Because my own time is limited and there are no more volunteers for this.
Rolling-hash compressed data deduplication is more useful and natural for
EROFS since EROFS has byte-granularity uncompressed offset/length indexes
(kernel improvements always have higher priority than userspace work, since
userspace can easily be bumped with newly released versions.)
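
To clarify what I mean by rolling-hash deduplication: compute a rolling hash
over the uncompressed data and cut chunks wherever the hash matches a mask, so
identical content is found even when it is shifted; duplicate chunks can then
share extents thanks to the byte-granularity indexes.  A rough gear-hash
sketch (illustrative only; not actual erofs-utils code, names and parameters
are made up):

#include <stddef.h>
#include <stdint.h>

#define CDC_MIN_CHUNK	2048U
#define CDC_MAX_CHUNK	65536U
#define CDC_MASK	((1U << 13) - 1)	/* ~8KiB average chunk size */

static uint32_t cdc_gear[256];

static void cdc_init(void)
{
	uint32_t x = 0x9E3779B9U;
	int i;

	/* fill the gear table with deterministic pseudo-random values */
	for (i = 0; i < 256; i++) {
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		cdc_gear[i] = x;
	}
}

/* Return the length of the next content-defined chunk starting at 'p'. */
static size_t cdc_next_chunk(const uint8_t *p, size_t len)
{
	uint32_t h = 0;
	size_t i;

	if (len <= CDC_MIN_CHUNK)
		return len;
	if (len > CDC_MAX_CHUNK)
		len = CDC_MAX_CHUNK;
	for (i = 0; i < len; i++) {
		/* gear rolling hash: old bytes age out as bits shift left */
		h = (h << 1) + cdc_gear[p[i]];
		if (i >= CDC_MIN_CHUNK && !(h & CDC_MASK))
			return i + 1;	/* cut point: hash matches the mask */
	}
	return len;
}

Duplicate chunks would then be detected with a fingerprint (hash plus memcmp)
and mapped to already-written data, and the byte-granularity offset/length
indexes mean a shared extent can start at any uncompressed offset.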

> 
>>
>> I think we could use this way first for the new Zstd support (if they don't have
>> bandwidth to add an official fixed-sized output approach), but since the DEFLATE
>> algorithm is quite easy for me to understand, at least I could have a simple
>> built-in one to avoid an external dependency (mainly considering zlib, since it's
>> quite outdated).  Since EROFS compression is actually done offline, the compressed
>> data correctness can always be checked in advance before image release.
> 
> Writing an optimized compressor, especially one that implements high compression
> levels, is very hard.  Are you sure it's something you want to implement and
> maintain in erofs-utils?
> 
>> Also I have more ideas to optimize the last symbol from matchfinders (even lazy
>> matching) for fixed-sized output approaches to get even better compression
>> ratios (especially for smaller filesystem clusters like 4KiB.)
> 
> Yes, there are some tricks that compressors that natively support
> fixed-size-output mode could use in order to avoid redundant work.
> 
>>> smart way (binary search + heuristics).  Also, it would easily work with every
>>> compressor, and it would always produce the optimal input size.
>>>
>>> I wrote a proof-of-concept patch that implements this idea in the benchmark
>>> program in libdeflate.  It finds the optimal input size in about 8 attempts on
>>> average for a target output size of 4096, or about 13 for a target output size
>>> of 65536.  Note, if an optimal solution is not needed, it could be sped up
>>> slightly by stopping when the output size is at, or just under (if desired), the
>>> target output size.  Also keep in mind that any compression level can be used.
>>>
>>> The full code I tested is currently at
>>> https://github.com/ebiggers/libdeflate/tree/fixed-output-size, but below is the
>>> actual algorithm:
>>
>> If libdeflate officially supports this mode, I'm very happy to integrate
>> libdeflate in this way as an alternative encoder (if you could merge this), and
>> if the compression speed is improved, I could switch it to the default encoder
>> then.  In other words, I really hope there could be an official DEFLATE
>> encoder that supports this mode as well.  Also I think the in-kernel zlib
>> decompression is somewhat slow just due to some non-technical reasons; I might
>> need to improve this eventually, but it's no rush for us compared to supporting
>> the DEFLATE format first.
>>
>>>
>>> /*
>>>    * Compresses the largest prefix of 'in', with maximum size 'in_nbytes', whose
>>>    * compressed output is at most 'target_output_size' bytes.  The compressed size
>>>    * is returned as the return value, and the uncompressed size is returned in
>>>    * *actual_in_nbytes_ret.  The output buffer 'out' has size 'out_nbytes_avail',
>>>    * which should be at least 'target_output_size + 9'.  'tmpbuf' must be a buffer
>>>    * the same size as 'out'.
>>>    */
>>> static size_t
>>> do_compress_withfixedoutputsize(struct compressor *c,
>>> 				const u8 *in, size_t in_nbytes,
>>> 				u8 *out, size_t out_nbytes_avail,
>>> 				size_t target_output_size,
>>> 				size_t last_uncompressed_size,
>>> 				u8 *tmpbuf,
>>> 				size_t *actual_in_nbytes_ret)
>>> {
>>> 	size_t l = 0; /* largest input that fits so far */
>>> 	size_t l_csize = 0;
>>> 	size_t r = in_nbytes + 1; /* smallest input that doesn't fit so far */
>>> 	size_t m;
>>>
>>> 	if (last_uncompressed_size)
>>> 		m = last_uncompressed_size * 15 / 16;
>>> 	else
>>> 		m = target_output_size * 4;
>>> 	for (;;) {
>>> 		size_t csize;
>>>
>>> 		m = MAX(m, l + 1);
>>> 		m = MIN(m, r - 1);
>>>
>>> 		csize = do_compress(c, in, m, tmpbuf, out_nbytes_avail);
>>> 		if (csize > 0 && csize <= target_output_size) {
>>> 			/* Fits */
>>> 			memcpy(out, tmpbuf, csize);
>>> 			l = m;
>>> 			l_csize = csize;
>>> 			if (r <= l + 1)
>>> 				break;
>>> 			/*
>>> 			 * Estimate needed input prefix size based on current
>>> 			 * compression ratio.
>>> 			 */
>>> 			m = (target_output_size * m) / csize;
>>> 		} else {
>>> 			/* Doesn't fit */
>>> 			r = m;
>>> 			if (r <= l + 1)
>>> 				break;
>>> 			m = (l + r) / 2;
>>> 		}
>>> 	}
>>> 	*actual_in_nbytes_ret = l;
>>> 	return l_csize;
>>> }
>>>
>>> What do you think?
>>
>> Thank you for your time and effort.  I'd much appreciate it if this could be
>> supported so I could add an alternative encoder for it.
>>
> 
> Native fixed-size-output support in libdeflate is something I'll think
> about.  However, the point I'm making is that it's unclear how valuable native
> fixed-size-output support in compressors is, given that the search
> algorithm I described above actually seems to work fairly well, and it can
> produce the optimal solution.
> 
> Here's a quick proof of concept patch for mkfs.erofs:
> 
> diff --git a/lib/Makefile.am b/lib/Makefile.am
> index 553b387..f58a684 100644
> --- a/lib/Makefile.am
> +++ b/lib/Makefile.am
> @@ -45,3 +45,4 @@ liberofs_la_SOURCES += compressor_liblzma.c
>   endif
>   
>   liberofs_la_SOURCES += kite_deflate.c compressor_deflate.c
> +liberofs_la_LDFLAGS = -ldeflate
> diff --git a/lib/compressor_deflate.c b/lib/compressor_deflate.c
> index ad76b7c..d7a5e90 100644
> --- a/lib/compressor_deflate.c
> +++ b/lib/compressor_deflate.c
> @@ -6,6 +6,7 @@
>   #include "erofs/internal.h"
>   #include "erofs/print.h"
>   #include "erofs/config.h"
> +#include "libdeflate.h"
>   #include "compressor.h"
>   
>   void *kite_deflate_init(int level, unsigned int dict_size);
> @@ -13,16 +14,64 @@ void kite_deflate_end(void *s);
>   int kite_deflate_destsize(void *s, const u8 *in, u8 *out,
>   			  unsigned int *srcsize, unsigned int target_dstsize);
>   
> +#define USE_LIBDEFLATE
> +
>   static int deflate_compress_destsize(const struct erofs_compress *c,
>   				     const void *src, unsigned int *srcsize,
>   				     void *dst, unsigned int dstsize)
>   {
> +#ifdef USE_LIBDEFLATE
> +	static size_t last_uncompressed_size = 0;
> +	size_t l = 0; /* largest input that fits so far */
> +	size_t l_csize = 0;
> +	size_t r = *srcsize + 1; /* smallest input that doesn't fit so far */
> +	size_t m;
> +	u8 tmpbuf[dstsize + 9];
> +
> +	if (last_uncompressed_size)
> +		m = last_uncompressed_size * 15 / 16;
> +	else
> +		m = dstsize * 4;
> +	for (;;) {
> +		size_t csize;
> +
> +		m = max(m, l + 1);
> +		m = min(m, r - 1);
> +
> +		csize = libdeflate_deflate_compress(c->private_data, src, m, tmpbuf, dstsize + 9);
> +		/*printf("Tried %zu => %zu\n", m, csize);*/
> +		if (csize > 0 && csize <= dstsize) {
> +			/* Fits */
> +			memcpy(dst, tmpbuf, csize);
> +			l = m;
> +			l_csize = csize;
> +			if (r <= l + 1 || csize + (22 - 2*(int)c->compression_level) >= dstsize)
> +				break;
> +			/*
> +			 * Estimate needed input prefix size based on current
> +			 * compression ratio.
> +			 */
> +			m = (dstsize * m) / csize;
> +		} else {
> +			/* Doesn't fit */
> +			r = m;
> +			if (r <= l + 1)
> +				break;
> +			m = (l + r) / 2;
> +		}
> +	}
> +	/*printf("Choosing %zu => %zu\n", l, l_csize);*/
> +	*srcsize = l;
> +	last_uncompressed_size = l;
> +	return l_csize;
> +#else
>   	int rc = kite_deflate_destsize(c->private_data, src, dst,
>   				       srcsize, dstsize);
>   
>   	if (rc <= 0)
>   		return -EFAULT;
>   	return rc;
> +#endif
>   }
>   
>   static int compressor_deflate_exit(struct erofs_compress *c)
> @@ -30,7 +79,11 @@ static int compressor_deflate_exit(struct erofs_compress *c)
>   	if (!c->private_data)
>   		return -EINVAL;
>   
> +#ifdef USE_LIBDEFLATE
> +	libdeflate_free_compressor(c->private_data);
> +#else
>   	kite_deflate_end(c->private_data);
> +#endif
>   	return 0;
>   }
>   
> @@ -47,8 +100,13 @@ static int compressor_deflate_init(struct erofs_compress *c)
>   static int erofs_compressor_deflate_setlevel(struct erofs_compress *c,
>   					     int compression_level)
>   {
> -	void *s;
> +	if (compression_level < 0)
> +		compression_level = erofs_compressor_deflate.default_level;
>   
> +#ifdef USE_LIBDEFLATE
> +	libdeflate_free_compressor(c->private_data);
> +	c->private_data = libdeflate_alloc_compressor(compression_level);
> +#else
>   	if (c->private_data) {
>   		kite_deflate_end(c->private_data);
>   		c->private_data = NULL;
> @@ -57,11 +115,13 @@ static int erofs_compressor_deflate_setlevel(struct erofs_compress *c,
>   	if (compression_level < 0)
>   		compression_level = erofs_compressor_deflate.default_level;
>   
> -	s = kite_deflate_init(compression_level, cfg.c_dict_size);
> +	void *s = kite_deflate_init(compression_level, cfg.c_dict_size);
>   	if (IS_ERR(s))
>   		return PTR_ERR(s);
>   
>   	c->private_data = s;
> +#endif
> +
>   	c->compression_level = compression_level;
>   	return 0;
>   }
> @@ -69,7 +129,11 @@ static int erofs_compressor_deflate_setlevel(struct erofs_compress *c,
>   const struct erofs_compressor erofs_compressor_deflate = {
>   	.name = "deflate",
>   	.default_level = 1,
> +#ifdef USE_LIBDEFLATE
> +	.best_level = 12,
> +#else
>   	.best_level = 9,
> +#endif
>   	.init = compressor_deflate_init,
>   	.exit = compressor_deflate_exit,
>   	.setlevel = erofs_compressor_deflate_setlevel,
> 
> Here are some quick benchmarks I did of mkfs.erofs showing the size of the
> resulting filesystem and the time it took for mkfs.erofs to run:
> 
> -zdeflate,1
>      kite_deflate: 10772480 bytes in 0.474s
>      libdeflate: 10366976 bytes in 0.468s
> -zdeflate,3
>      kite_deflate: 10203136 bytes in 0.511s
>      libdeflate: 10092544 bytes in 0.702s
> -zdeflate,5
>      kite_deflate: 10080256 bytes in 0.596s
>      libdeflate: 9973760 bytes in 0.896s
> -zdeflate,6
>      kite_deflate: 9977856 bytes in 0.650s
>      libdeflate: 9936896 bytes in 0.981s
> -zdeflate,7
>      kite_deflate: 9969664 bytes in 0.690s
>      libdeflate: 9920512 bytes in 1.187s
> -zdeflate,9
>      kite_deflate: 9945088 bytes in 0.986s
>      libdeflate: 9871360 bytes in 2.041s
> -zdeflate,12
>      kite_deflate: N/A
>      libdeflate: 9740288 bytes in 28.080s
> 
> So it seems my "naive" search algorithm, when combined with libdeflate, actually
> does about as well as kite_deflate...  Plus it automatically brings in support
> for higher compression levels.

TBH, kite_deflate still has some room for improvement since I just made it
work as a replacement for the original zlib.

Yeah, I could integrate libdeflate like this first,
thanks Eric!

Thanks,
Gao Xiang


