<div dir="ltr">That’s a great point — I hadn’t fully thought through the error aggregation side yet.<br>I’m leaning towards running to completion with a shared error accumulator so fsck can surface all corruption in one run, but I’ll think through the exit semantics carefully.<br><br>Thanks again for the insights, really helpful.<br><br>Regards, <br>Deepak Pathik</div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Tue, Mar 31, 2026 at 12:05 AM Utkal Singh <<a href="mailto:singhutkal015@gmail.com">singhutkal015@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Mon, Mar 30, 2026 at 2:00 PM, Deepak Pathik wrote:<br>
> the two-phase model (serial traversal + parallel data<br>
> verification/extraction) makes a lot more sense now<br>
<br>
Good, that is the right framing.<br>
<br>
One more thing worth deciding early: error aggregation policy. In the<br>
current serial path, erofs_check_inode() returns -errno and the caller<br>
stops on first error. With parallel workers you need a shared error<br>
accumulator and a policy for whether to drain the queue on first<br>
error or run to completion — the choice affects both exit status<br>
correctness and how much corruption a single run surfaces on large<br>
images.<br>
<br>
Good luck with the proposal.<br>
<br>
Regards,<br>
Utkal Singh<br>
<br>
On Mon, 30 Mar 2026 at 14:00, Deepak Pathik <<a href="mailto:deepakpathik2005@gmail.com" target="_blank">deepakpathik2005@gmail.com</a>> wrote:<br>
><br>
> Hi Utkal,<br>
><br>
> Thanks again for the detailed explanation and for pointing me to the RFC — it really helped clarify the bigger picture.<br>
><br>
> I spent some time going through the relevant parts of the code and your comments made a lot more sense in that context. I see now that while pcluster-level parallelism is valid, the main challenge is making the surrounding infrastructure safe before introducing concurrency.<br>
><br>
> In particular, I hadn’t fully accounted for:<br>
><br>
> the lseek() + read() pattern in erofs_read_one_data() and why switching to pread() is necessary for correctness,<br>
><br>
> the lack of synchronization in erofs_iget()/erofs_iput(), which could lead to refcount races,<br>
><br>
> and the implications of using an unbounded workqueue on large images.<br>
><br>
> Your point about backpressure was especially helpful — I’m now considering a bounded queue or a semaphore-based approach to ensure the producer doesn’t get too far ahead of the workers.<br>
><br>
> I also revisited the design with this in mind, and the two-phase model (serial traversal + parallel data verification/extraction) makes a lot more sense now, especially for isolating shared state like fsckcfg and path handling.<br>
><br>
> I’ll continue refining the proposal with these constraints in mind and go deeper into io.c, inode.c, and workqueue.c to make sure the design is correct before thinking about actual parallel execution.<br>
><br>
> Thanks again for taking the time to explain this — it was very helpful.<br>
><br>
> Regards,<br>
> Deepak Pathik<br>
><br>
><br>
> On Mon, Mar 30, 2026 at 1:50 AM Utkal Singh <<a href="mailto:singhutkal015@gmail.com" target="_blank">singhutkal015@gmail.com</a>> wrote:<br>
>><br>
>> On Sun, Mar 29, 2026 at 6:47 PM, Deepak Pathik wrote:<br>
>> > for LZMA-compressed images, are pclusters in fsck.erofs always<br>
>> > fixed-size and independently decompressible at the userspace level,<br>
>> > or are there cases where a pcluster depends on the state left by a<br>
>> > previous one?<br>
>><br>
>> Hi Deepak,<br>
>><br>
>> To answer your LZMA question: yes, each pcluster is independently<br>
>> decompressible by design. You can verify this directly in<br>
>> lib/decompress.c — z_erofs_decompress_lzma() calls lzma_stream_decoder()<br>
>> and lzma_end() within a single invocation, with no persistent lzma_stream<br>
>> across calls. The same holds for ZSTD and deflate. The on-disk format<br>
>> enforces this: no pcluster depends on decompressor state from a<br>
>> previous one.<br>
>><br>
>> The parallelism boundary you identified is correct. The deeper issue<br>
>> is one level up: erofs_check_inode() is called sequentially in the<br>
>> dispatch loop in fsck/main.c, and each call may decompress many<br>
>> pclusters per inode. Inode-level dispatch is simpler than<br>
>> pcluster-level because it avoids output ordering constraints.<br>
>><br>
>> One thing worth thinking through before wiring erofs_workqueue into<br>
>> the fsck path: the existing queue in lib/workqueue.c is an unbounded<br>
>> producer queue built for mkfs compression workloads. On a 34,000+<br>
>> inode image, it will accumulate all inode descriptors in memory before<br>
>> workers can drain it. Backpressure — either a bounded queue or a<br>
>> semaphore on the existing one — matters here.<br>
>><br>
>> Two paths in the surrounding infrastructure also need fixing before<br>
>> concurrent dispatch is correct:<br>
>><br>
>> - erofs_read_one_data() in lib/io.c: lseek()+read() on a shared fd<br>
>> races under concurrent calls, since another thread can move the file<br>
>> offset between the seek and the read. pread(2) fixes it cleanly.<br>
>><br>
>> - erofs_iget()/erofs_iput() in lib/inode.c: ref-count mutations<br>
>> without synchronisation. Concurrent iput() can double-free.<br>
>><br>
>> I sent an RFC on March 22 covering this design if it is useful context:<br>
>><br>
>> <a href="https://lore.kernel.org/linux-erofs/CAGSu4WNBdB30K61xoUCi3FB9QR081fNh-1hoX1z2TZMk0nGpHQ@mail.gmail.com/" rel="noreferrer" target="_blank">https://lore.kernel.org/linux-erofs/CAGSu4WNBdB30K61xoUCi3FB9QR081fNh-1hoX1z2TZMk0nGpHQ@mail.gmail.com/</a><br>
>><br>
>> Happy to discuss further on the list.<br>
>><br>
>> Regards,<br>
>> Utkal Singh<br>
>><br>
>><br>
>> On Sun, 29 Mar 2026 at 18:47, Deepak Pathik <<a href="mailto:deepakpathik2005@gmail.com" target="_blank">deepakpathik2005@gmail.com</a>> wrote:<br>
>> ><br>
>> > Hi,<br>
>> ><br>
>> > I'm Deepak Pathik, a second-year B.Tech student applying for the GSoC 2026 project on multi-threaded decompression support in fsck.erofs.<br>
>> ><br>
>> > While reading through the source, I traced the decompression path in erofs_verify_inode_data() and noticed that z_erofs_decompress() operates on a locally scoped struct z_erofs_decompress_req with its own input and output buffers — no shared mutable state between calls. My plan is to wire the existing erofs_workqueue (already used in lib/compress.c for mkfs.erofs) into the fsck extraction path at the pcluster level, with pwrite() for position-based output writes to avoid ordering locks.<br>
>> ><br>
>> > One thing I wanted to confirm before finalizing my proposal: for LZMA-compressed images, are pclusters in fsck.erofs always fixed-size and independently decompressible at the userspace level, or are there cases where a pcluster depends on the state left by a previous one? I want to make sure I'm not understating the LZMA case in my design.<br>
>> ><br>
>> > I've drafted a proposal and would be happy to share it for early feedback if that's useful.<br>
>> ><br>
>> > Thanks,<br>
>> > Deepak Pathik<br>
>> > <a href="https://github.com/deepakpathik" rel="noreferrer" target="_blank">https://github.com/deepakpathik</a><br>
>> > <a href="mailto:deepakpathik2005@gmail.com" target="_blank">deepakpathik2005@gmail.com</a><br>
</blockquote></div>