Readahead for compressed data

Fri Oct 22 20:39:29 AEDT 2021

Hi Qu,

On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/10/22 17:11, Gao Xiang wrote:
> > On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > > 
> > > > Matthew Wilcox <willy at infradead.org> writes:
> > > > 
> > > > > As far as I can tell, the following filesystems support compressed data:
> > > > > 
> > > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > > > 
> > > > > I'd like to make it easier and more efficient for filesystems to
> > > > > implement compressed data.  There are a lot of approaches in use today,
> > > > > but none of them seem quite right to me.  I'm going to lay out a few
> > > > > design considerations next and then propose a solution.  Feel free to
> > > > > tell me I've got the constraints wrong, or suggest alternative solutions.
> > > > > 
> > > > > When we call ->readahead from the VFS, the VFS has decided which pages
> > > > > are going to be the most useful to bring in, but it doesn't know how
> > > > > pages are bundled together into blocks.  As I've learned from talking to
> > > > > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > > > > something we can teach the VFS.
> > > > > 
> > > > > We (David) added readahead_expand() recently to let the filesystem
> > > > > opportunistically add pages to the page cache "around" the area requested
> > > > > by the VFS.  That reduces the number of times the filesystem has to
> > > > > decompress the same block.  But it can fail (due to memory allocation
> > > > > failures or pages already being present in the cache).  So filesystems
> > > > > still have to implement some kind of fallback.
> > > > 
> > > > Wouldn't it be better to keep the *compressed* data in the cache and
> > > > decompress it multiple times if needed rather than decompress it once
> > > > and cache the decompressed data?  You would use more CPU time
> > > > decompressing multiple times, but be able to cache more data and avoid
> > > > more disk IO, which is generally far slower than the CPU can decompress
> > > > the data.
> > > 
> > > Well, one of the problems with keeping compressed data is that for mmap(2)
> > > you have to have pages decompressed so that CPU can access them. So keeping
> > > compressed data in the page cache would add a bunch of complexity. That
> > > being said keeping compressed data cached somewhere else than in the page
> > > cache may certainly me worth it and then just filling page cache on demand
> > > from this data...
> > 
> > It can be cached with a special internal inode, so no need to take
> > care of the memory reclaim or migration by yourself.
> 
> There is another problem for btrfs (and maybe other fses).
> 
> Btrfs can refer to part of the uncompressed data extent.
> 
> Thus we could have the following cases:
> 
> 	0	4K	8K	12K
> 	|	|	|	|
> 		    |	    \- 4K of an 128K compressed extent,
> 		    |		compressed size is 16K
> 		    \- 4K of an 64K compressed extent,
> 			compressed size is 8K

Thanks for this, but the diagram is broken on my side.
Could you help fix this?

> 
> In above case, we really only need 8K for page cache, but if we're
> caching the compressed extent, it will take extra 24K.

Apart from that, with my wild guess, could we cache whatever the
real I/O is rather than the whole compressed extent unconditionally?
If the whole compressed extent is needed then, we could check if
it's all available in cache, or read the rest instead?

Also, I think no need to cache uncompressed COW data...

Thanks,
Gao Xiang

> 
> It's definitely not really worthy for this particular corner case.
> 
> But it would be pretty common for btrfs, as CoW on compressed data can
> easily lead to above cases.
> 
> Thanks,
> Qu
> 
> > 
> > Otherwise, these all need to be take care of. For fixed-sized input
> > compression, since they are reclaimed in page unit, so it won't be
> > quite friendly since such data is all coupling. But for fixed-sized
> > output compression, it's quite natural.
> > 
> > Thanks,
> > Gao Xiang
> > 
> > > 
> > > 								Honza
> > > --
> > > Jan Kara <jack at suse.com>
> > > SUSE Labs, CR