[question] Status of dictionary preload compression?
Gao Xiang
hsiangkao at linux.alibaba.com
Thu Apr 24 22:01:53 AEST 2025
On 2025/4/24 13:14, Simon Hosie wrote:
> 23 Apr 2025, 17:27 by hsiangkao at linux.alibaba.com:
>
>> Hi Simon,
>>
>> On 2025/4/24 03:24, Simon Hosie wrote:
>>
>>> I've struggled to determine if this is already a feature or in development or not (possibly because of overloading of the term "dictionary"), so I apologise in advance if the following brief is redundant:
>>>
>>> Compressors like LZ4, zstd, and even gzip talk about "dictionary compression" meaning to pre-load the history window of the compressor and decompressor, before the file is processed, with pre-arranged patterns; so that back references can be made for text the first time it appears in the file, rather than having to build up that window from an empty set at the beginning of the file by encoding everything as literals.
>>>
>>> This can lead to an improvement in compression ratio.
>>>
>>> It's generally only useful for small files because in a larger file the back-reference window is established early and remains full of reference material for the rest of the file; but this should also benefit block-based compression, which faces a loss of history at every entry point.
>>>
>>> So that's what I'm talking about; and my question, simply, is: is this a feature (or a planned feature) of erofs? Something involving storing a set of uncompressed dictionary preload chunks within the filesystem which are then used as the starting dictionary when compressing and decompressing the small chunks of each file?
>>>
>>> In my imagination such a filesystem might provide a palette of uncompressed, and page-aligned, dictionaries and each file (or each cluster?) would give an index to the entry which it will use. Typically that choice might be implied by the file type, but sometimes files can have different dispositions as you seek through them, or a .txt file may contain English or Chinese or ASCII art, each demanding different dictionaries. Making the right choice is an external optimisation problem.
>>>
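To make the mechanism described above concrete, here is a minimal sketch using zstd's one-shot dictionary API (any dictionary-capable compressor would work; the dictionary buffer, compression level and error handling are placeholders):

#include <zstd.h>

/*
 * Compress one small block against a preset dictionary.  The very same
 * dict/dict_size pair must later be passed to ZSTD_decompress_usingDict()
 * for the block to be reconstructible.
 */
int compress_block_with_dict(const void *src, size_t src_size,
			     void *dst, size_t dst_capacity,
			     const void *dict, size_t dict_size)
{
	ZSTD_CCtx *cctx = ZSTD_createCCtx();
	size_t csize;

	if (!cctx)
		return -1;
	csize = ZSTD_compress_usingDict(cctx, dst, dst_capacity,
					src, src_size,
					dict, dict_size, 3 /* level */);
	ZSTD_freeCCtx(cctx);
	return ZSTD_isError(csize) ? -1 : (int)csize;
}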
>>
>> Thanks for your interest.
>>
>> I know about dictionary compression (and its benefit for small
>> compression units such as the 4KiB compression you mentioned),
>> and it's been on our TODO list for years.
>>
>> Actually I made an erofs-utils dictionary compression demo 4
>> years ago (but EROFS didn't implement compressed data
>> deduplication at that time):
>> https://github.com/erofs/erofs-utils/tree/experimental-dictdemo
>>
>> The open question for this topic is the dictionary granularity:
>> 1) per-filesystem? I think it's almost useless, but it
>> has the least extra dictionary I/O.
>> 2) per-inode?
>> 3) per-(sub)inode?
>>
>> Since EROFS also supports compressed data deduplication (which
>> means a pcluster can be shared by different parts of an inode or
>> by different inodes), it makes the design of dictionary generation
>> (since some uncompressed data can itself be deduplicated) and
>> selection harder.
>>
>> If you have more ideas about the dictionary granularity and
>> the whole process, I'm very interested in hearing more.
>>
>> Thanks,
>> Gao Xiang
>>
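For the dictionary generation side, a rough sketch of what the trainer step could look like, using zstd's bundled trainer from <zdict.h> (this is only an illustration and does not reflect what the experimental-dictdemo branch actually does):

#include <sys/types.h>
#include <zdict.h>

/*
 * `samples` is a concatenation of representative uncompressed blocks and
 * `sample_sizes[i]` is the length of the i-th block within it.  Which blocks
 * get sampled per granularity (per-filesystem / per-inode / per-(sub)inode)
 * is exactly the open question above.
 */
ssize_t train_dictionary(void *dict_buf, size_t dict_capacity,
			 const void *samples, const size_t *sample_sizes,
			 unsigned int nr_samples)
{
	size_t ret = ZDICT_trainFromBuffer(dict_buf, dict_capacity,
					   samples, sample_sizes, nr_samples);

	return ZDICT_isError(ret) ? -1 : (ssize_t)ret;
}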
> Ok, well the model I had in mind was to allocate maybe a few hundred kB of the filesystem image to various dictionaries optimised for different file types: some plain-text dictionaries, a generic JSON dictionary, a generic XML dictionary, a shared-object dictionary, etc., and you enumerate those so that each file can choose the right dictionary using just a one-byte index into the table.
>
> Hopefully an app will only work intensively with a couple of file types at a time, so only a couple of dictionaries need to be paged in at once.
> My intuition tells me that diminishing returns will have set in well before 256 different dictionaries, and that having more than that would be mostly harmful; but I guess that's a thing which needs testing.
> I understand that right now every file can specify a compression method for itself, so in that same data structure it makes sense to also specify the index of the dictionary to use, if the chosen compression method uses one.
>
> By default mkfs would probably just look at the input files and a directory full of pre-cooked dictionaries, and based on the extension of the file and the availability of a matching dictionary it would put that dictionary in the next slot and mark all matching files as being compressed against that dictionary number.
>
> Then the user can look at which sets of file types missed out on a dictionary and how much space they're using, and they can decide if they want to make another dictionary for those files as well, or just make them share a dictionary with another file type. Or maybe they want to split a group of files because they'll benefit from having different versions of a dictionary. Or maybe they'll write their own optimiser to decide the optimal file groups by grinding on the problem with hundreds of GPUs for weeks on end.
>
> If one file is particularly large it might warrant a dedicated dictionary, but I wouldn't plan for that as the general case.
>
> Once those decisions have been finalised the dictionaries can be re-optimised against the actual files to get the best compression.
>
> There's also the question of whether a file would want to select two or three small dictionaries to concatenate. Like, for an XML file containing English tags and Chinese text, or something like that. Or it might want to switch dictionaries halfway through the file. That's probably over-engineering, though.
>
> I think that's more or less all the things I've thought about the problem so far.
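To make the shape of that proposal easier to discuss, here is one purely hypothetical way the "palette" could be laid out; none of these structure or field names exist in the EROFS on-disk format or in erofs-utils today:

#include <stdint.h>
#include <string.h>

#define DICT_SLOT_NONE	0xff	/* compress against an empty window */

/* a small table of page-aligned, uncompressed dictionaries in the image */
struct dict_table_entry {
	uint32_t blkaddr;	/* block address of the raw dictionary data */
	uint32_t size;		/* dictionary size in bytes */
};

struct dict_table {		/* located via a (hypothetical) superblock field */
	uint8_t nr_dicts;	/* up to 255 usable slots; 0xff means "none" */
	struct dict_table_entry slots[];
};

/*
 * mkfs side, the default policy described above: map a file extension to a
 * slot filled from a directory of pre-cooked dictionaries, or use none.
 */
static uint8_t pick_dict_slot(const char *ext)
{
	if (!ext)
		return DICT_SLOT_NONE;
	if (!strcmp(ext, "json"))
		return 0;	/* generic JSON dictionary */
	if (!strcmp(ext, "xml"))
		return 1;	/* generic XML dictionary */
	if (!strcmp(ext, "so"))
		return 2;	/* shared-object dictionary */
	return DICT_SLOT_NONE;
}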
Ok, I see your idea on this. My overall thought is to get the
numbers and a demo first, and then think more about how to land this.
If you're interested in developing this, that's awesome!
BTW, one part I'm not sure if you noticed is that small files
(or even whole files) can be moved to the virtual fragment
inode and compressed together by using the `-Efragments` (or even
`-Eall-fragments`) option.
So I think we have to develop sub-file dictionary support,
otherwise it will conflict with the fragment compression feature.
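(For reference, that packing is what the existing options enable at
image build time, e.g. something like:

  $ mkfs.erofs -zlz4hc -Efragments foo.erofs.img foo/
  $ mkfs.erofs -zlz4hc -Eall-fragments foo.erofs.img foo/

Data merged into the packed inode that way is compressed together with
data from other files, so a single per-file dictionary index can't
describe it, hence the need for sub-file dictionary selection.)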
Thanks,
Gao Xiang