Re: [GSoC 2026] Multi-threaded decompression for fsck.erofs — design question on z_erofs_decompress() parallelism
Utkal Singh
singhutkal015 at gmail.com
Tue Mar 31 05:35:26 AEDT 2026
On Mon, Mar 30, 2026 at 2:00 PM, Deepak Pathik wrote:
> the two-phase model (serial traversal + parallel data
> verification/extraction) makes a lot more sense now
Good, that is the right framing.
One more thing worth deciding early: error aggregation policy. In the
current serial path, erofs_check_inode() returns -errno and the caller
stops on the first error. With parallel workers you need a shared error
accumulator and an explicit policy: stop dispatching and drain in-flight
work on the first error, or run to completion and report everything.
The choice affects both exit-status correctness and how much corruption
a single run surfaces on a large image.
Good luck with the proposal.
Regards,
Utkal Singh
On Mon, 30 Mar 2026 at 14:00, Deepak Pathik <deepakpathik2005 at gmail.com> wrote:
>
> Hi Utkal,
>
> Thanks again for the detailed explanation and for pointing me to the RFC — it really helped clarify the bigger picture.
>
> I spent some time going through the relevant parts of the code and your comments made a lot more sense in that context. I see now that while pcluster-level parallelism is valid, the main challenge is making the surrounding infrastructure safe before introducing concurrency.
>
> In particular, I hadn’t fully accounted for:
>
> the lseek() + read() pattern in erofs_read_one_data() and why switching to pread() is necessary for correctness,
>
> the lack of synchronization in erofs_iget()/erofs_iput(), which could lead to refcount races,
>
> and the implications of using an unbounded workqueue on large images.
>
> Your point about backpressure was especially helpful — I’m now considering a bounded queue or a semaphore-based approach to ensure the producer doesn’t get too far ahead of the workers.
>
> I also revisited the design with this in mind, and the two-phase model (serial traversal + parallel data verification/extraction) makes a lot more sense now, especially for isolating shared state like fsckcfg and path handling.
>
> I’ll continue refining the proposal with these constraints in mind and go deeper into io.c, inode.c, and workqueue.c to make sure the design is correct before thinking about actual parallel execution.
>
> Thanks again for taking the time to explain this — it was very helpful.
>
> Regards,
> Deepak Pathik
>
>
> On Mon, Mar 30, 2026 at 1:50 AM Utkal Singh <singhutkal015 at gmail.com> wrote:
>>
>> On Sun, Mar 29, 2026 at 6:47 PM, Deepak Pathik wrote:
>> > for LZMA-compressed images, are pclusters in fsck.erofs always
>> > fixed-size and independently decompressible at the userspace level,
>> > or are there cases where a pcluster depends on the state left by a
>> > previous one?
>>
>> Hi Deepak,
>>
>> To answer your LZMA question: yes, each pcluster is independently
>> decompressible by design. You can verify this directly in
>> lib/decompress.c — z_erofs_decompress_lzma() calls lzma_stream_decoder()
>> and lzma_end() within a single invocation, with no persistent lzma_stream
>> across calls. The same holds for ZSTD and deflate. The on-disk format
>> enforces this: no pcluster depends on decompressor state from a
>> previous one.
>>
>> The parallelism boundary you identified is correct. The deeper issue
>> is one level up: erofs_check_inode() is called sequentially in the
>> dispatch loop in fsck/main.c, and each call may decompress many
>> pclusters per inode. Inode-level dispatch is simpler than
>> pcluster-level because it avoids output ordering constraints.
>>
>> One thing worth thinking through before wiring erofs_workqueue into
>> the fsck path: the existing queue in lib/workqueue.c is an unbounded
>> producer queue built for mkfs compression workloads. On a 34,000+
>> inode image, it will accumulate all inode descriptors in memory before
>> workers can drain it. Backpressure — either a bounded queue or a
>> semaphore on the existing one — matters here.
>>
>> Two paths in the surrounding infrastructure also need fixing before
>> concurrent dispatch is correct:
>>
>> - erofs_read_one_data() in lib/io.c: lseek()+read() on a shared fd
>> is a TOCTOU race under concurrent calls. pread(2) fixes it cleanly.
>>
>> - erofs_iget()/erofs_iput() in lib/inode.c: ref-count mutations
>> without synchronisation. Concurrent iput() can double-free.
>>
>> I sent an RFC on March 22 covering this design if it is useful context:
>>
>> https://lore.kernel.org/linux-erofs/CAGSu4WNBdB30K61xoUCi3FB9QR081fNh-1hoX1z2TZMk0nGpHQ@mail.gmail.com/
>>
>> Happy to discuss further on the list.
>>
>> Regards,
>> Utkal Singh
>>
>>
>> On Sun, 29 Mar 2026 at 18:47, Deepak Pathik <deepakpathik2005 at gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm Deepak Pathik, a second-year B.Tech student applying for the GSoC 2026 project on multi-threaded decompression support in fsck.erofs.
>> >
>> > While reading through the source, I traced the decompression path in erofs_verify_inode_data() and noticed that z_erofs_decompress() operates on a locally scoped struct z_erofs_decompress_req with its own input and output buffers — no shared mutable state between calls. My plan is to wire the existing erofs_workqueue (already used in lib/compress.c for mkfs.erofs) into the fsck extraction path at the pcluster level, with pwrite() for position-based output writes to avoid ordering locks.
>> >
>> > One thing I wanted to confirm before finalizing my proposal: for LZMA-compressed images, are pclusters in fsck.erofs always fixed-size and independently decompressible at the userspace level, or are there cases where a pcluster depends on the state left by a previous one? I want to make sure I'm not understating the LZMA case in my design.
>> >
>> > I've drafted a proposal and would be happy to share it for early feedback if that's useful.
>> >
>> > Thanks,
>> > Deepak Pathik
>> > https://github.com/deepakpathik
>> > deepakpathik2005 at gmail.com