[PATCH] erofs: fix unexpected EIO under memory pressure

Fri Dec 19 20:47:01 AEDT 2025

Hi Xiang,

>
> Hi Junbeom,
>
> On 2025/12/19 15:10, Junbeom Yeom wrote:
>> erofs readahead could fail with ENOMEM under memory pressure
>> because it tries to alloc_page with GFP_NOWAIT | GFP_NORETRY, whereas
>> a regular read uses GFP_KERNEL. And if readahead fails (with
>> non-uptodate folios), the original request will then fall back to
>> synchronous read, and `.read_folio()` should return appropriate errnos.
>>
>> However, in scenarios where readahead and read operations compete,
>> the read operation could return an unintended EIO because of incorrect
>> error propagation.
>>
>> To resolve this, this patch modifies the behavior so that, when the
>> PCL is for read (which means pcl.besteffort is true), it attempts
>> actual decompression instead of propagating the previous error, except
>> for an initial EIO.
>>
>> - Page size: 4K
>> - The original size of FileA: 16K
>> - Compress-ratio per PCL: 50% (Uncompressed 8K -> Compressed 4K)
>>    [page0, page1] [page2, page3]
>>    [PCL0]---------[PCL1]
>>
>> - function declarations:
>>    . pread(fd, buf, count, offset)
>>    . readahead(fd, offset, count)
>> - Thread A tries to read the last 4K
>> - Thread B tries to do readahead 8K from 4K
>> - RA, besteffort == false
>> - R, besteffort == true
>>
>>          <process A>                   <process B>
>>
>> pread(FileA, buf, 4K, 12K)
>>    do readahead(page3) // failed with ENOMEM
>>    wait_lock(page3)
>>      if (!uptodate(page3))
>>        goto do_read
>>                                 readahead(FileA, 4K, 8K)
>>                                 // Here create PCL-chain like below:
>>                                 // [null, page1] [page2, null]
>>                                 //   [PCL0:RA]-----[PCL1:RA]
>> ...
>>    do read(page3)        // found [PCL1:RA] and add page3 into it,
>>                          // and then, change PCL1 from RA to R ...
>>                                 // Now, PCL-chain is as below:
>>                                 // [null, page1] [page2, page3]
>>                                 //   [PCL0:RA]-----[PCL1:R]
>>
>>                                   // try to decompress PCL-chain...
>>                                   z_erofs_decompress_queue
>>                                     err = 0;
>>
>>                                     // failed with ENOMEM, so page 1
>>                                     // only for RA will not be uptodated.
>>                                     // it's okay.
>>                                     err = decompress([PCL0:RA], err)
>>
>>                                     // However, ENOMEM propagated to next
>>                                     // PCL, even though PCL is not only
>>                                     // for RA but also for R. As a result,
>>                                     // it just failed with ENOMEM without
>>                                     // trying any decompression, so page2
>>                                     // and page3 will not be uptodated.
>>                  ** BUG HERE ** --> err = decompress([PCL1:R], err)
>>
>>                                     return err as ENOMEM ...
>>      wait_lock(page3)
>>        if (!uptodate(page3))
>>          return EIO      <-- Return an unexpected EIO!
>> ...
>
> Many thanks for the report!
> It's indeed a new issue to me.
>
>>
>> Fixes: 2349d2fa02db ("erofs: sunset unneeded NOFAILs")
>> Cc: stable at vger.kernel.org
>> Reviewed-by: Jaewook Kim <jw5454.kim at samsung.com>
>> Reviewed-by: Sungjong Seo <sj1557.seo at samsung.com>
>> Signed-off-by: Junbeom Yeom <junbeom.yeom at samsung.com>
>> ---
>>   fs/erofs/zdata.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
>> index 27b1f44d10ce..86bf6e087d34 100644
>> --- a/fs/erofs/zdata.c
>> +++ b/fs/erofs/zdata.c
>> @@ -1414,11 +1414,15 @@ static int z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io,
>>   	};
>>   	struct z_erofs_pcluster *next;
>>   	int err = io->eio ? -EIO : 0;
>> +	int io_err = err;
>>
>>   	for (; be.pcl != Z_EROFS_PCLUSTER_TAIL; be.pcl = next) {
>> +		int propagate_err;
>> +
>>   		DBG_BUGON(!be.pcl);
>>   		next = READ_ONCE(be.pcl->next);
>> -		err = z_erofs_decompress_pcluster(&be, err) ?: err;
>> +		propagate_err = READ_ONCE(be.pcl->besteffort) ? io_err : err;
>> +		err = z_erofs_decompress_pcluster(&be, propagate_err) ?: err;
>
> I wonder if it's just possible to decompress each pcluster according to io
> status only (but don't bother with previous pcluster status), like:
>
> 		err = z_erofs_decompress_pcluster(&be, io->eio) ?: err;
>
> and change the second argument of
> z_erofs_decompress_pcluster() to bool.
>
> So that we could leverage the successful i/o as much as possible.

Oh, I thought you were intending to address error propagation.
If that's not the case, I also believe the approach you're suggesting is better.
I'll send the next version.
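
To compare the three propagation policies discussed above, here is a rough
userspace model. It is only a sketch and not the kernel code: struct
model_pcl, model_decompress(), model_decompress_queue() and the
MODEL_* policy names are all hypothetical, and alloc_err simply stands in
for whatever error the real decompression would hit on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pcluster; NOT the kernel structure. */
struct model_pcl {
	bool besteffort;	/* true once a synchronous read joined this PCL */
	int alloc_err;		/* error decompression would hit by itself, or 0 */
	bool uptodate;		/* set only if decompression actually ran and succeeded */
};

/* Which error is handed to each pcluster before decompressing it. */
enum model_policy {
	MODEL_CHAIN_ERR,		/* current code: previous pcluster's error */
	MODEL_BESTEFFORT_IO_ERR,	/* this patch: I/O error only for besteffort PCLs */
	MODEL_IO_ERR_ONLY,		/* suggested: I/O error only, for every PCL */
};

/* Assumed contract: a nonzero incoming error skips decompression entirely. */
static int model_decompress(struct model_pcl *pcl, int err)
{
	assert(pcl);
	if (!err)
		err = pcl->alloc_err;
	pcl->uptodate = !err;
	return err;
}

static int model_decompress_queue(struct model_pcl *pcls, size_t n, bool eio,
				  enum model_policy policy)
{
	int io_err = eio ? -EIO : 0;
	int err = io_err;

	for (size_t i = 0; i < n; i++) {
		int in, ret;

		switch (policy) {
		case MODEL_CHAIN_ERR:
			in = err;
			break;
		case MODEL_BESTEFFORT_IO_ERR:
			in = pcls[i].besteffort ? io_err : err;
			break;
		default:
			in = io_err;
			break;
		}
		ret = model_decompress(&pcls[i], in);
		if (ret)	/* same effect as "err = ret ?: err" */
			err = ret;
	}
	return err;
}
```

With a chain of [PCL0:RA failing with -ENOMEM] followed by [PCL1:R], only
MODEL_CHAIN_ERR leaves PCL1 non-uptodate in this model. The other two
policies still return -ENOMEM for the queue but let PCL1 decompress, and
MODEL_IO_ERR_ONLY does so without needing to look at besteffort at all.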

Thanks,
Junbeom Yeom

>
> Thanks,
> Gao Xiang
>
>>   	}
>>   	return err;
>>   }
>