[PATCH] erofs: fix unexpected EIO under memory pressure
Gao Xiang
hsiangkao at linux.alibaba.com
Fri Dec 19 20:54:22 AEDT 2025
On 2025/12/19 17:47, Junbeom Yeom wrote:
> Hi Xiang,
>
>>
>> Hi Junbeom,
>>
>> On 2025/12/19 15:10, Junbeom Yeom wrote:
>>> erofs readahead could fail with ENOMEM under memory pressure
>>> because it tries to alloc_page with GFP_NOWAIT | __GFP_NORETRY,
>>> while a regular read uses GFP_KERNEL. If readahead fails (leaving
>>> non-uptodate folios), the original request then falls back to a
>>> synchronous read, and `.read_folio()` should return an appropriate errno.
>>>
>>> However, in scenarios where readahead and read operations compete,
>>> the read operation could return an unintended EIO because of
>>> incorrect error propagation.
>>>
>>> To resolve this, this patch modifies the behavior so that, when the
>>> PCL is for read (which means pcl->besteffort is true), it attempts
>>> actual decompression instead of propagating the previous error,
>>> except for an initial EIO.
>>>
>>> - Page size: 4K
>>> - The original size of FileA: 16K
>>> - Compress-ratio per PCL: 50% (Uncompressed 8K -> Compressed 4K)
>>> [page0, page1] [page2, page3] [PCL0]---------[PCL1]
>>>
>>> - functions declaration:
>>> . pread(fd, buf, count, offset)
>>> . readahead(fd, offset, count)
>>> - Thread A tries to read the last 4K
>>> - Thread B tries to do readahead 8K from 4K
>>> - RA, besteffort == false
>>> - R, besteffort == true
>>>
>>> <process A> <process B>
>>>
>>> pread(FileA, buf, 4K, 12K)
>>> do readahead(page3) // failed with ENOMEM
>>> wait_lock(page3)
>>> if (!uptodate(page3))
>>> goto do_read
>>> readahead(FileA, 4K, 8K)
>>> // Here create PCL-chain like below:
>>> // [null, page1] [page2, null]
>>> // [PCL0:RA]-----[PCL1:RA]
>>> ...
>>> do read(page3) // found [PCL1:RA] and add page3 into it,
>>> // and then, change PCL1 from RA to R ...
>>> // Now, PCL-chain is as below:
>>> // [null, page1] [page2, page3]
>>> // [PCL0:RA]-----[PCL1:R]
>>>
>>> // try to decompress PCL-chain...
>>> z_erofs_decompress_queue
>>> err = 0;
>>>
>>> // failed with ENOMEM, so page 1
>>> // only for RA will not be uptodated.
>>> // it's okay.
>>> err = decompress([PCL0:RA], err)
>>>
>>> // However, ENOMEM propagated to next
>>> // PCL, even though PCL is not only
>>> // for RA but also for R. As a result,
>>> // it just failed with ENOMEM without
>>> // trying any decompression, so page2
>>> // and page3 will not be uptodated.
>>> ** BUG HERE ** --> err = decompress([PCL1:R], err)
>>>
>>> return err as ENOMEM ...
>>> wait_lock(page3)
>>> if (!uptodate(page3))
>>> return EIO <-- Return an unexpected EIO!
>>> ...
>>
>> Many thanks for the report!
>> It's indeed a new issue to me.
>>
>>>
>>> Fixes: 2349d2fa02db ("erofs: sunset unneeded NOFAILs")
>>> Cc: stable at vger.kernel.org
>>> Reviewed-by: Jaewook Kim <jw5454.kim at samsung.com>
>>> Reviewed-by: Sungjong Seo <sj1557.seo at samsung.com>
>>> Signed-off-by: Junbeom Yeom <junbeom.yeom at samsung.com>
>>> ---
>>> fs/erofs/zdata.c | 6 +++++-
>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
>>> index 27b1f44d10ce..86bf6e087d34 100644
>>> --- a/fs/erofs/zdata.c
>>> +++ b/fs/erofs/zdata.c
>>> @@ -1414,11 +1414,15 @@ static int z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io,
>>> };
>>> struct z_erofs_pcluster *next;
>>> int err = io->eio ? -EIO : 0;
>>> + int io_err = err;
>>>
>>> for (; be.pcl != Z_EROFS_PCLUSTER_TAIL; be.pcl = next) {
>>> + int propagate_err;
>>> +
>>> DBG_BUGON(!be.pcl);
>>> next = READ_ONCE(be.pcl->next);
>>> - err = z_erofs_decompress_pcluster(&be, err) ?: err;
>>> + propagate_err = READ_ONCE(be.pcl->besteffort) ? io_err : err;
>>> + err = z_erofs_decompress_pcluster(&be, propagate_err) ?: err;
>>
>> I wonder if it's just possible to decompress each pcluster according to io
>> status only (but don't bother with previous pcluster status), like:
>>
>> err = z_erofs_decompress_pcluster(&be, io->eio) ?: err;
>>
>> and change the second argument of
>> z_erofs_decompress_pcluster() to bool.
>>
>> So that we could leverage the successful i/o as much as possible.
>
> Oh, I thought you were intending to address error propagation.
We could still propagate errors (-ENOMEM) to the callers, but for
the case you mentioned, I still think it's useful to handle the
following pclusters if the disk I/Os were successful.
It still addresses the issue you reported, and I think it's also
cleaner.
> If that's not the case, I also believe the approach you're suggesting is better.
> I'll send the next version.
Thank you for the effort!
Thanks,
Gao Xiang
>
> Thanks,
> Junbeom Yeom
>
>>
>> Thanks,
>> Gao Xiang
>>
>>> }
>>> return err;
>>> }
>>
>