Readahead for compressed data

Fri Oct 22 20:54:03 AEDT 2021

On Fri, Oct 22, 2021 at 05:39:29PM +0800, Gao Xiang wrote:
> Hi Qu,
> 
> On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
> > 
> > 
> > On 2021/10/22 17:11, Gao Xiang wrote:
> > > On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > > > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > > > 
> > > > > Matthew Wilcox <willy at infradead.org> writes:
> > > > > 
> > > > > > As far as I can tell, the following filesystems support compressed data:
> > > > > > 
> > > > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > > > > 
> > > > > > I'd like to make it easier and more efficient for filesystems to
> > > > > > implement compressed data.  There are a lot of approaches in use today,
> > > > > > but none of them seem quite right to me.  I'm going to lay out a few
> > > > > > design considerations next and then propose a solution.  Feel free to
> > > > > > tell me I've got the constraints wrong, or suggest alternative solutions.
> > > > > > 
> > > > > > When we call ->readahead from the VFS, the VFS has decided which pages
> > > > > > are going to be the most useful to bring in, but it doesn't know how
> > > > > > pages are bundled together into blocks.  As I've learned from talking to
> > > > > > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > > > > > something we can teach the VFS.
> > > > > > 
> > > > > > We (David) added readahead_expand() recently to let the filesystem
> > > > > > opportunistically add pages to the page cache "around" the area requested
> > > > > > by the VFS.  That reduces the number of times the filesystem has to
> > > > > > decompress the same block.  But it can fail (due to memory allocation
> > > > > > failures or pages already being present in the cache).  So filesystems
> > > > > > still have to implement some kind of fallback.
> > > > > 
> > > > > Wouldn't it be better to keep the *compressed* data in the cache and
> > > > > decompress it multiple times if needed rather than decompress it once
> > > > > and cache the decompressed data?  You would use more CPU time
> > > > > decompressing multiple times, but be able to cache more data and avoid
> > > > > more disk IO, which is generally far slower than the CPU can decompress
> > > > > the data.
> > > > 
> > > > Well, one of the problems with keeping compressed data is that for mmap(2)
> > > > you have to have pages decompressed so that CPU can access them. So keeping
> > > > compressed data in the page cache would add a bunch of complexity. That
> > > > being said keeping compressed data cached somewhere else than in the page
> > > > cache may certainly be worth it and then just filling page cache on demand
> > > > from this data...
> > > 
> > > It can be cached with a special internal inode, so there is no need
> > > to handle memory reclaim or migration yourself.
> > 
> > There is another problem for btrfs (and maybe other fses).
> > 
> > Btrfs can refer to part of the uncompressed data extent.
> > 
> > Thus we could have the following cases:
> > 
> >     0       4K      8K      12K
> >     |       |       |       |
> >                 |       \- 4K of a 128K compressed extent,
> >                 |          compressed size is 16K
> >                 \- 4K of a 64K compressed extent,
> >                    compressed size is 8K
> 
> Thanks for this, but the diagram is broken on my side.
> Could you help fix this?

Ok, I understand it now. I think this is really a strategy problem
arising from CoW. Since only 2*4K is needed, you could
 1) cache the whole compressed extents and hope the rest is accessed
    later, so that no further I/O is needed at all;
 2) don't cache such incomplete compressed extents;
 3) add some trace records and apply a finer-grained strategy.
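
To put numbers on strategies 1) and 2), here is a minimal sketch of the
arithmetic for Qu's example. The ExtentRef model and the function names are
hypothetical, made up for illustration only; the sizes are the ones from
the quoted diagram.

```python
from dataclasses import dataclass

KiB = 1024


@dataclass(frozen=True)
class ExtentRef:
    """Hypothetical model: a file range referring to part of a compressed extent."""
    referenced: int    # uncompressed bytes the file actually references
    uncompressed: int  # full uncompressed size of the extent
    compressed: int    # on-disk (compressed) size of the extent


def page_cache_bytes(refs):
    """Memory needed for the decompressed pages the file really uses."""
    return sum(r.referenced for r in refs)


def compressed_cache_bytes(refs, cache_partial=True):
    """Extra memory used by caching compressed extents.

    cache_partial=True models strategy 1 (cache every compressed extent);
    cache_partial=False models strategy 2 (skip extents that are only
    partly referenced).
    """
    return sum(r.compressed for r in refs
               if cache_partial or r.referenced == r.uncompressed)


# The two 4K ranges from the diagram: 4K of a 64K extent (8K compressed)
# and 4K of a 128K extent (16K compressed).
refs = [ExtentRef(4 * KiB, 64 * KiB, 8 * KiB),
        ExtentRef(4 * KiB, 128 * KiB, 16 * KiB)]

print(page_cache_bytes(refs) // KiB)                             # 8
print(compressed_cache_bytes(refs) // KiB)                       # 24
print(compressed_cache_bytes(refs, cache_partial=False) // KiB)  # 0
```

Strategy 3) would replace the simple cache_partial flag with a decision
based on recorded access history, which this sketch does not attempt.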

> 
> > 
> > In the above case, we really only need 8K of page cache, but if we're
> > caching the compressed extents, it will take an extra 24K.
> 
> Apart from that, as a wild guess, could we cache whatever the
> real I/O reads rather than the whole compressed extent unconditionally?
> If the whole compressed extent is needed later, we could check whether
> it's all available in the cache, and read the rest otherwise.
> 
> Also, I think there is no need to cache uncompressed COW data...
> 
> Thanks,
> Gao Xiang
> 
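
The idea quoted above (cache only what a real read returned, and fetch the
remainder when the whole compressed extent is needed) could look roughly
like the sketch below. This is a userspace model, not kernel code: the
CompressedCache class, the 4K block size and the read_block callback are
all assumptions made for illustration.

```python
BLOCK = 4096  # assumed on-disk block size


class CompressedCache:
    """Hypothetical cache of compressed on-disk blocks, keyed by block number."""

    def __init__(self):
        self.blocks = {}

    def add_io(self, first_block, data):
        """Cache exactly the blocks that a real read returned."""
        for off in range(0, len(data), BLOCK):
            self.blocks[first_block + off // BLOCK] = data[off:off + BLOCK]

    def missing(self, first_block, nr_blocks):
        """Blocks of an extent that still have to be read from disk."""
        return [b for b in range(first_block, first_block + nr_blocks)
                if b not in self.blocks]

    def get_extent(self, first_block, nr_blocks, read_block):
        """Return the whole compressed extent, reading only the missing blocks."""
        for b in self.missing(first_block, nr_blocks):
            self.blocks[b] = read_block(b)
        return b"".join(self.blocks[b]
                        for b in range(first_block, first_block + nr_blocks))


# Fake disk holding one 16K compressed extent in blocks 100..103.
disk = {b: bytes([b]) * BLOCK for b in range(100, 104)}
reads = []


def read_block(b):
    reads.append(b)  # record which blocks needed real I/O
    return disk[b]


cache = CompressedCache()
cache.add_io(100, disk[100] + disk[101])  # an earlier read touched only 8K
extent = cache.get_extent(100, 4, read_block)
print(reads)  # [102, 103]
```

Only the two blocks that were never read before cause I/O in this model; a
real implementation would also have to deal with reclaim of the cached
blocks in between.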
> > 
> > It's definitely not worth it for this particular corner case.
> > 
> > But it would be pretty common for btrfs, as CoW on compressed data can
> > easily lead to the above cases.
> > 
> > Thanks,
> > Qu
> > 
> > > 
> > > Otherwise, these all need to be taken care of. For fixed-sized input
> > > compression, reclaim in page units isn't very friendly, since the
> > > pages of such data are all coupled. But for fixed-sized output
> > > compression, it's quite natural.
> > > 
> > > Thanks,
> > > Gao Xiang
> > > 
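
A toy model of the coupling point quoted above, under the assumption that
a fixed-sized output scheme produces self-contained 4K compressed units,
while a fixed-sized input scheme produces one compressed extent spanning
several pages that are all needed for decompression. The function name and
the page layouts are hypothetical.

```python
def decompressible_units(units, cached_pages):
    """Return the units whose compressed pages are all still cached."""
    return [u for u in units if all(p in cached_pages for p in u)]


# Fixed-sized output: every compressed page is its own unit (assumed 4K units).
fixed_output = [(0,), (1,), (2,), (3,)]
# Fixed-sized input: one 128K extent compressed into 16K, i.e. four coupled pages.
fixed_input = [(0, 1, 2, 3)]

cached = {0, 1, 2, 3}
cached.discard(2)  # page-unit reclaim drops a single page

print(len(decompressible_units(fixed_output, cached)))  # 3
print(len(decompressible_units(fixed_input, cached)))   # 0
```

In this model, reclaiming one page costs the fixed-sized output layout one
unit, while the three pages left over from the fixed-sized input extent can
no longer be decompressed without further I/O.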
> > > > 
> > > > 								Honza
> > > > --
> > > > Jan Kara <jack at suse.com>
> > > > SUSE Labs, CR