Re: [GSoC 2026] Multi-threaded decompression for fsck.erofs — design question on z_erofs_decompress() parallelism

Deepak Pathik deepakpathik2005 at gmail.com
Tue Mar 31 06:02:13 AEDT 2026


That’s a great point — I hadn’t fully thought through the error aggregation
side yet.
I’m leaning towards running to completion with a shared error accumulator
so fsck can surface all corruption in one run, but I’ll think through the
exit semantics carefully.

Thanks again for the insights, really helpful.

Regards,
Deepak Pathik

On Tue, Mar 31, 2026 at 12:05 AM Utkal Singh <singhutkal015 at gmail.com>
wrote:

> On Mon, Mar 30, 2026 at 2:00 PM, Deepak Pathik wrote:
> > the two-phase model (serial traversal + parallel data
> > verification/extraction) makes a lot more sense now
>
> Good, that is the right framing.
>
> One more thing worth deciding early: error aggregation policy. In the
> current serial path, erofs_check_inode() returns -errno and the caller
> stops on first error. With parallel workers you need a shared error
> accumulator and a policy for whether to drain the queue on first
> error or run to completion — the choice affects both exit status
> correctness and how much corruption a single run surfaces on large
> images.
>
> Good luck with the proposal.
>
> Regards,
> Utkal Singh
>
> On Mon, 30 Mar 2026 at 14:00, Deepak Pathik <deepakpathik2005 at gmail.com>
> wrote:
> >
> > Hi Utkal,
> >
> > Thanks again for the detailed explanation and for pointing me to the RFC
> — it really helped clarify the bigger picture.
> >
> > I spent some time going through the relevant parts of the code and your
> comments made a lot more sense in that context. I see now that while
> pcluster-level parallelism is valid, the main challenge is making the
> surrounding infrastructure safe before introducing concurrency.
> >
> > In particular, I hadn’t fully accounted for:
> >
> >   - the lseek() + read() pattern in erofs_read_one_data() and why
> >     switching to pread() is necessary for correctness,
> >
> >   - the lack of synchronization in erofs_iget()/erofs_iput(), which
> >     could lead to refcount races,
> >
> >   - the implications of using an unbounded workqueue on large images.
> >
> > Your point about backpressure was especially helpful — I’m now
> considering a bounded queue or a semaphore-based approach to ensure the
> producer doesn’t get too far ahead of the workers.
> >
> > I also revisited the design with this in mind, and the two-phase model
> (serial traversal + parallel data verification/extraction) makes a lot more
> sense now, especially for isolating shared state like fsckcfg and path
> handling.
> >
> > I’ll continue refining the proposal with these constraints in mind and
> go deeper into io.c, inode.c, and workqueue.c to make sure the design is
> correct before thinking about actual parallel execution.
> >
> > Thanks again for taking the time to explain this — it was very helpful.
> >
> > Regards,
> > Deepak Pathik
> >
> >
> > On Mon, Mar 30, 2026 at 1:50 AM Utkal Singh <singhutkal015 at gmail.com>
> wrote:
> >>
> >> On Sun, Mar 29, 2026 at 6:47 PM, Deepak Pathik wrote:
> >> > for LZMA-compressed images, are pclusters in fsck.erofs always
> >> > fixed-size and independently decompressible at the userspace level,
> >> > or are there cases where a pcluster depends on the state left by a
> >> > previous one?
> >>
> >> Hi Deepak,
> >>
> >> To answer your LZMA question: yes, each pcluster is independently
> >> decompressible by design. You can verify this directly in
> >> lib/decompress.c — z_erofs_decompress_lzma() calls lzma_stream_decoder()
> >> and lzma_end() within a single invocation, with no persistent
> lzma_stream
> >> across calls. The same holds for ZSTD and deflate. The on-disk format
> >> enforces this: no pcluster depends on decompressor state from a
> >> previous one.
> >>
> >> The parallelism boundary you identified is correct. The deeper issue
> >> is one level up: erofs_check_inode() is called sequentially in the
> >> dispatch loop in fsck/main.c, and each call may decompress many
> >> pclusters per inode. Inode-level dispatch is simpler than
> >> pcluster-level because it avoids output ordering constraints.
> >>
> >> One thing worth thinking through before wiring erofs_workqueue into
> >> the fsck path: the existing queue in lib/workqueue.c is an unbounded
> >> producer queue built for mkfs compression workloads. On a 34,000+
> >> inode image, it will accumulate all inode descriptors in memory before
> >> workers can drain it. Backpressure — either a bounded queue or a
> >> semaphore on the existing one — matters here.
> >>
> >> Two paths in the surrounding infrastructure also need fixing before
> >> concurrent dispatch is correct:
> >>
> >>   - erofs_read_one_data() in lib/io.c: lseek()+read() on a shared fd
> >>     races on the shared file offset under concurrent calls. pread(2),
> >>     which takes the offset explicitly, fixes it cleanly.
> >>
> >>   - erofs_iget()/erofs_iput() in lib/inode.c: ref-count mutations
> >>     without synchronisation. Concurrent iput() can double-free.
> >>
> >> I sent an RFC on March 22 covering this design if it is useful context:
> >>
> >>
> https://lore.kernel.org/linux-erofs/CAGSu4WNBdB30K61xoUCi3FB9QR081fNh-1hoX1z2TZMk0nGpHQ@mail.gmail.com/
> >>
> >> Happy to discuss further on the list.
> >>
> >> Regards,
> >> Utkal Singh
> >>
> >>
> >> On Sun, 29 Mar 2026 at 18:47, Deepak Pathik <deepakpathik2005 at gmail.com>
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm Deepak Pathik, a second-year B.Tech student applying for the GSoC
> 2026 project on multi-threaded decompression support in fsck.erofs.
> >> >
> >> > While reading through the source, I traced the decompression path in
> erofs_verify_inode_data() and noticed that z_erofs_decompress() operates on
> a locally scoped struct z_erofs_decompress_req with its own input and
> output buffers — no shared mutable state between calls. My plan is to wire
> the existing erofs_workqueue (already used in lib/compress.c for
> mkfs.erofs) into the fsck extraction path at the pcluster level, with
> pwrite() for position-based output writes to avoid ordering locks.
> >> >
> >> > One thing I wanted to confirm before finalizing my proposal: for
> LZMA-compressed images, are pclusters in fsck.erofs always fixed-size and
> independently decompressible at the userspace level, or are there cases
> where a pcluster depends on the state left by a previous one? I want to
> make sure I'm not understating the LZMA case in my design.
> >> >
> >> > I've drafted a proposal and would be happy to share it for early
> feedback if that's useful.
> >> >
> >> > Thanks,
> >> > Deepak Pathik
> >> > https://github.com/deepakpathik
> >> > deepakpathik2005 at gmail.com
>
