[PATCH v4 2/2] erofs-utils: optimize buffer allocation logic

Gao Xiang hsiangkao at redhat.com
Tue Jan 19 01:06:54 AEDT 2021


Hi Weiwen,

On Mon, Jan 18, 2021 at 09:02:16PM +0800, 胡玮文 wrote:
> 
> > 在 2021年1月16日,14:41,Gao Xiang <hsiangkao at aol.com> 写道:
> > 
> > From: Hu Weiwen <sehuww at mail.scut.edu.cn>
> > 
> > When using EROFS to pack our dataset which consists of millions of
> > files, mkfs.erofs is very slow compared with mksquashfs.
> > 
> > The bottleneck is `erofs_balloc` and `erofs_mapbh` function, which
> > iterate over all previously allocated buffer blocks, making the
> > complexity of the algrithm O(N^2) where N is the number of files.
> > 
> > With this patch:
> > 
> > * global `last_mapped_block` is mantained to avoid full scan in
> >  `erofs_mapbh` function.
> > 
> > * global `non_full_buffer_blocks` mantains a list of buffer block for
> >  each type and each possible remaining bytes in the block. Then it is
> >  used to identify the most suitable blocks in future `erofs_balloc`,
> >  avoiding full scan.
> > 
> > Some new data structure is allocated in this patch, more RAM usage is
> > expected, but not much. When I test it with ImageNet dataset (1.33M
> > files), 7GiB RAM is consumed, and it takes about 4 hours. Most time is
> > spent on IO.
> > 
> > Signed-off-by: Hu Weiwen <sehuww at mail.scut.edu.cn>
> > Signed-off-by: Gao Xiang <hsiangkao at aol.com>
> > ---
> > 
> > I've simplified the most-fit finding logic of v3... Since buffers.off
> > has to be aligned to alignsize, so I think it's better to use
> > buffers.off as the index of mapped_buckets compared to using remaining
> > size as it looks more straight-forward.
> > 
> > Also, I found the exist logic handling expended blocks might be
> > potential ineffective as well... we have to skip used < used0 only
> > after oob (extra blocks is allocated, so not expend such blocks but
> > allocate a new bb...) It might be more effective to reuse such
> > non-mapped buffer blocks...
> 
> I don’t get this. If not oob, then “used” should be larger than “used_before”, then we will not skip such block anyway, right?

Sorry, I think I was completely misleaded by the comment above the
code at that time. It's too far to get the original intention :(
I think you're right (anyway, I think the main purpose of this
condition was that I didn't want to introduce too many new bbs
with a lot of large unused space. But anyway such condition is
approximate since it's actually a 0-1 Knapsack problem). Please
ignore my previous words about this...

Thanks,
Gao Xiang



More information about the Linux-erofs mailing list