Weird EROFS data corruption

Juhyung Park qkrwngud825 at gmail.com
Wed Dec 6 01:43:09 AEDT 2023


On Tue, Dec 5, 2023 at 11:34 PM Gao Xiang <hsiangkao at linux.alibaba.com> wrote:
>
>
>
> On 2023/12/5 22:23, Juhyung Park wrote:
> > Hi Gao,
> >
> > On Tue, Dec 5, 2023 at 4:32 PM Gao Xiang <hsiangkao at linux.alibaba.com> wrote:
> >>
> >> Hi Juhyung,
> >>
> >> On 2023/12/4 11:41, Juhyung Park wrote:
> >>
> >> ...
> >>>
> >>>>
> >>>> - Could you share the full message about the output of `lscpu`?
> >>>
> >>> Sure:
> >>>
> >>> Architecture:            x86_64
> >>>     CPU op-mode(s):        32-bit, 64-bit
> >>>     Address sizes:         39 bits physical, 48 bits virtual
> >>>     Byte Order:            Little Endian
> >>> CPU(s):                  8
> >>>     On-line CPU(s) list:   0-7
> >>> Vendor ID:               GenuineIntel
> >>>     BIOS Vendor ID:        Intel(R) Corporation
> >>>     Model name:            11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
> >>>       BIOS Model name:     11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz None CPU
> >>>                             @ 3.0GHz
> >>>       BIOS CPU family:     198
> >>>       CPU family:          6
> >>>       Model:               140
> >>>       Thread(s) per core:  2
> >>>       Core(s) per socket:  4
> >>>       Socket(s):           1
> >>>       Stepping:            1
> >>>       CPU(s) scaling MHz:  60%
> >>>       CPU max MHz:         4800.0000
> >>>       CPU min MHz:         400.0000
> >>>       BogoMIPS:            5990.40
> >>>       Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
> >>>                            a cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
> >>>                            ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art
> >>>                             arch_perfmon pebs bts rep_good nopl xtopology nonstop_
> >>>                            tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes6
> >>>                            4 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xt
> >>>                            pr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_dead
> >>>                            line_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowp
> >>>                            refetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb st
> >>>                            ibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_
> >>>                            ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
> >>>                             rdt_a avx512f avx512dq rdseed adx smap avx512ifma clfl
> >>>                            ushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl
> >>>                            xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm
> >>>                             ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
> >>>                             hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi
> >>>                            2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme av
> >>>                            x512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2i
> >>
> >> Sigh, I've been thinking.  FSRM is the most significant difference between
> >> our environments.  Could you try only the following diff (without the
> >> previous disable patch) to see whether it makes any difference?
> >>
> >> diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
> >> index 1b60ae81ecd8..1b52a913233c 100644
> >> --- a/arch/x86/lib/memmove_64.S
> >> +++ b/arch/x86/lib/memmove_64.S
> >> @@ -41,9 +41,7 @@ SYM_FUNC_START(__memmove)
> >>    #define CHECK_LEN     cmp $0x20, %rdx; jb 1f
> >>    #define MEMMOVE_BYTES movq %rdx, %rcx; rep movsb; RET
> >>    .Lmemmove_begin_forward:
> >> -       ALTERNATIVE_2 __stringify(CHECK_LEN), \
> >> -                     __stringify(CHECK_LEN; MEMMOVE_BYTES), X86_FEATURE_ERMS, \
> >> -                     __stringify(MEMMOVE_BYTES), X86_FEATURE_FSRM
> >> +       CHECK_LEN
> >>
> >>          /*
> >>           * movsq instruction have many startup latency
> >
> > Yup, that also seems to fix it.
> > Are we looking at a potential memmove issue?
>
> I'm still analyzing this behavior as well as the root cause and
> I will also try to get a recent cloud server with FSRM myself
> to find more clues.

Down the rabbit hole we go...

Let me know if you have trouble getting an instance with FSRM. I'll
see what I can do.
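
For anyone checking whether a machine is a candidate for reproducing
this, here is a minimal sketch (not from the patch above) that reads
/proc/cpuinfo and reports whether the "erms" and "fsrm" flags are
advertised. The helper names cpu_flags() and report() are hypothetical,
and it assumes an x86 Linux system where /proc/cpuinfo has a "flags" line:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text.

    Only the first line whose key is exactly "flags" is used; on x86 all
    logical CPUs normally report the same list. Returns an empty set if
    no such line exists (assumption: non-x86 or unusual layout).
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text, wanted=("erms", "fsrm")):
    """Map each wanted flag name to True/False depending on its presence."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    # Reads the live system; on a machine like the one in the lscpu
    # output quoted above, both flags should be reported as present.
    with open("/proc/cpuinfo") as f:
        print(report(f.read()))
```

This only shows what the CPU advertises. It does not tell you which
memmove variant the running kernel actually patched in.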

>
> Thanks,
> Gao Xiang

