<div dir="ltr">That’s a great point — I hadn’t fully thought through the error aggregation side yet.<br>I’m leaning towards running to completion with a shared error accumulator so fsck can surface all corruption in one run, but I’ll think through the exit semantics carefully.<br><br>Thanks again for the insights, really helpful.<br><br>Regards, <br>Deepak Pathik</div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Tue, Mar 31, 2026 at 12:05 AM Utkal Singh <<a href="mailto:singhutkal015@gmail.com">singhutkal015@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Mon, Mar 30, 2026 at 2:00 PM, Deepak Pathik wrote:<br>
> the two-phase model (serial traversal + parallel data<br>
> verification/extraction) makes a lot more sense now<br>
<br>
Good, that is the right framing.<br>
<br>
One more thing worth deciding early: error aggregation policy. In the<br>
current serial path, erofs_check_inode() returns -errno and the caller<br>
stops on first error. With parallel workers you need a shared error<br>
accumulator and a policy for whether to drain the queue on first<br>
error or run to completion — the choice affects both exit status<br>
correctness and how much corruption a single run surfaces on large<br>
images.<br>
<br>
Good luck with the proposal.<br>
<br>
Regards,<br>
Utkal Singh<br>
<br>
On Mon, 30 Mar 2026 at 14:00, Deepak Pathik <<a href="mailto:deepakpathik2005@gmail.com" target="_blank">deepakpathik2005@gmail.com</a>> wrote:<br>
><br>
> Hi Utkal,<br>
><br>
> Thanks again for the detailed explanation and for pointing me to the RFC — it really helped clarify the bigger picture.<br>
><br>
> I spent some time going through the relevant parts of the code and your comments made a lot more sense in that context. I see now that while pcluster-level parallelism is valid, the main challenge is making the surrounding infrastructure safe before introducing concurrency.<br>
><br>
> In particular, I hadn’t fully accounted for:<br>
><br>
> the lseek() + read() pattern in erofs_read_one_data() and why switching to pread() is necessary for correctness,<br>
><br>
> the lack of synchronization in erofs_iget()/erofs_iput(), which could lead to refcount races,<br>
><br>
> and the implications of using an unbounded workqueue on large images.<br>
><br>
> Your point about backpressure was especially helpful — I’m now considering a bounded queue or a semaphore-based approach to ensure the producer doesn’t get too far ahead of the workers.<br>
><br>
> I also revisited the design with this in mind, and the two-phase model (serial traversal + parallel data verification/extraction) makes a lot more sense now, especially for isolating shared state like fsckcfg and path handling.<br>
><br>
> I’ll continue refining the proposal with these constraints in mind and go deeper into io.c, inode.c, and workqueue.c to make sure the design is correct before thinking about actual parallel execution.<br>
><br>
> Thanks again for taking the time to explain this — it was very helpful.<br>
><br>
> Regards,<br>
> Deepak Pathik<br>
><br>
><br>
> On Mon, Mar 30, 2026 at 1:50 AM Utkal Singh <<a href="mailto:singhutkal015@gmail.com" target="_blank">singhutkal015@gmail.com</a>> wrote:<br>
>><br>
>> On Sun, Mar 29, 2026 at 6:47 PM, Deepak Pathik wrote:<br>
>> > for LZMA-compressed images, are pclusters in fsck.erofs always<br>
>> > fixed-size and independently decompressible at the userspace level,<br>
>> > or are there cases where a pcluster depends on the state left by a<br>
>> > previous one?<br>
>><br>
>> Hi Deepak,<br>
>><br>
>> To answer your LZMA question: yes, each pcluster is independently<br>
>> decompressible by design. You can verify this directly in<br>
>> lib/decompress.c — z_erofs_decompress_lzma() calls lzma_stream_decoder()<br>
>> and lzma_end() within a single invocation, with no persistent lzma_stream<br>
>> across calls. The same holds for ZSTD and deflate. The on-disk format<br>
>> enforces this: no pcluster depends on decompressor state from a<br>
>> previous one.<br>
>><br>
>> The parallelism boundary you identified is correct. The deeper issue<br>
>> is one level up: erofs_check_inode() is called sequentially in the<br>
>> dispatch loop in fsck/main.c, and each call may decompress many<br>
>> pclusters per inode. Inode-level dispatch is simpler than<br>
>> pcluster-level because it avoids output ordering constraints.<br>
>><br>
>> One thing worth thinking through before wiring erofs_workqueue into<br>
>> the fsck path: the existing queue in lib/workqueue.c is an unbounded<br>
>> producer queue built for mkfs compression workloads. On a 34,000+<br>
>> inode image, it will accumulate all inode descriptors in memory before<br>
>> workers can drain it. Backpressure — either a bounded queue or a<br>
>> semaphore on the existing one — matters here.<br>
>><br>
>> Two paths in the surrounding infrastructure also need fixing before<br>
>> concurrent dispatch is correct:<br>
>><br>
>> - erofs_read_one_data() in lib/io.c: lseek()+read() on a shared fd<br>
>> races under concurrent calls, since another thread can move the file<br>
>> offset between the seek and the read. pread(2) fixes it cleanly.<br>
>><br>
>> - erofs_iget()/erofs_iput() in lib/inode.c: ref-count mutations<br>
>> without synchronisation. Concurrent iput() can double-free.<br>
>><br>
>> I sent an RFC on March 22 covering this design if it is useful context:<br>
>><br>
>> <a href="https://lore.kernel.org/linux-erofs/CAGSu4WNBdB30K61xoUCi3FB9QR081fNh-1hoX1z2TZMk0nGpHQ@mail.gmail.com/" rel="noreferrer" target="_blank">https://lore.kernel.org/linux-erofs/CAGSu4WNBdB30K61xoUCi3FB9QR081fNh-1hoX1z2TZMk0nGpHQ@mail.gmail.com/</a><br>
>><br>
>> Happy to discuss further on the list.<br>
>><br>
>> Regards,<br>
>> Utkal Singh<br>
>><br>
>><br>
>> On Sun, 29 Mar 2026 at 18:47, Deepak Pathik <<a href="mailto:deepakpathik2005@gmail.com" target="_blank">deepakpathik2005@gmail.com</a>> wrote:<br>
>> ><br>
>> > Hi,<br>
>> ><br>
>> > I'm Deepak Pathik, a second-year B.Tech student applying for the GSoC 2026 project on multi-threaded decompression support in fsck.erofs.<br>
>> ><br>
>> > While reading through the source, I traced the decompression path in erofs_verify_inode_data() and noticed that z_erofs_decompress() operates on a locally scoped struct z_erofs_decompress_req with its own input and output buffers — no shared mutable state between calls. My plan is to wire the existing erofs_workqueue (already used in lib/compress.c for mkfs.erofs) into the fsck extraction path at the pcluster level, with pwrite() for position-based output writes to avoid ordering locks.<br>
>> ><br>
>> > One thing I wanted to confirm before finalizing my proposal: for LZMA-compressed images, are pclusters in fsck.erofs always fixed-size and independently decompressible at the userspace level, or are there cases where a pcluster depends on the state left by a previous one? I want to make sure I'm not understating the LZMA case in my design.<br>
>> ><br>
>> > I've drafted a proposal and would be happy to share it for early feedback if that's useful.<br>
>> ><br>
>> > Thanks,<br>
>> > Deepak Pathik<br>
>> > <a href="https://github.com/deepakpathik" rel="noreferrer" target="_blank">https://github.com/deepakpathik</a><br>
>> > <a href="mailto:deepakpathik2005@gmail.com" target="_blank">deepakpathik2005@gmail.com</a><br>
</blockquote></div>