Weird EROFS data corruption

Juhyung Park qkrwngud825 at gmail.com
Mon Dec 4 04:01:27 AEDT 2023


Hi Gao,

On Mon, Dec 4, 2023 at 1:52 AM Gao Xiang <hsiangkao at linux.alibaba.com> wrote:
>
> Hi Juhyung,
>
> On 2023/12/4 00:22, Juhyung Park wrote:
> > (Cc'ing f2fs and crypto as I've noticed something similar with f2fs a
> > while ago, which may mean that this is not specific to EROFS:
> > https://lore.kernel.org/all/CAD14+f2nBZtLfLC6CwNjgCOuRRRjwzttp3D3iK4Of+1EEjK+cw@mail.gmail.com/
> > )
> >
> > Hi.
> >
> > I'm encountering a very weird EROFS data corruption.
> >
> > I noticed when I build an EROFS image for AOSP development, the device
> > would randomly not boot from a certain build.
> > After inspecting the log, I noticed that a file got corrupted.
>
> Is it observed on your laptop (i7-1185G7)? Or on some other arm64
> device?

Yes, only on my laptop. The arm64 device seems fine.
The reason it would not boot was that the host machine (my
laptop) was repacking the EROFS image incorrectly.

The workflow is something like this:
Server-built EROFS AOSP image -> Image copied to laptop -> Laptop
mounts the EROFS image -> Copies the entire content to a scratch
directory (CORRUPT!) -> Changes some files -> mkfs.erofs

So the device is not responsible for the corruption; the laptop is.
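
For illustration, a hash check around the copy step could look roughly
like the minimal Python sketch below. This is not the script from the
actual build flow; the function names hash_tree and diff_manifests are
hypothetical. The idea is to build a manifest of SHA-256 digests from a
tree that is known to read correctly (for example on the buildfarm
machine, or through erofsfuse) and compare it with a manifest of the
scratch directory on the laptop.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each regular file under root (relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents matter here.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            manifest[os.path.relpath(path, root)] = sha256_of(path)
    return manifest


def diff_manifests(reference, candidate):
    """Return sorted relative paths that are missing or differ."""
    paths = set(reference) | set(candidate)
    return sorted(p for p in paths if reference.get(p) != candidate.get(p))
```

Comparing against a reference manifest, rather than re-reading the same
mount twice, avoids the case where a second read just returns the same
bad data from the page cache.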

>
> >
> > After adding a hash check during the build flow, I noticed that EROFS
> > would randomly read data incorrectly.
> >
> > I now have a reliable method of reproducing the issue, but here's the
> > funny/weird part: it's only happening on my laptop (i7-1185G7). This
> > is not happening with my 128 cores buildfarm machine (Threadripper
> > 3990X).
> >
> > I first suspected a hardware issue, but:
> > a. The laptop had its motherboard replaced recently (due to a failing
> > physical Type-C port).
> > b. The laptop passes memory test (memtest86).
> > c. This happens on all kernel versions from v5.4 to the latest v6.6
> > including my personal custom builds and Canonical's official Ubuntu
> > kernels.
> > d. This happens on different host SSDs and file-system combinations.
> > e. This only happens on LZ4. LZ4HC doesn't trigger the issue.
> > f. This only happens when mounting the image natively by the kernel.
> > Using fuse with erofsfuse is fine.
>
> I think it's a weird issue with inplace decompression because you said
> it depends on the hardware.  In addition, with your dataset sadly I
> cannot reproduce on my local server (Xeon(R) CPU E5-2682 v4).

As I feared. Bummer :(

>
> What is the difference between these two machines? Just a different CPU, or
> do they have other differences, such as different compilers?

I fully and exclusively control both devices, and the setup is almost the same.
Same Ubuntu version, kernel/compiler version.

But as I said, on my laptop, the issue happens on kernels that someone
else (Canonical) built, so I don't think it matters.

>
> Thanks,
> Gao Xiang

