[PATCH v2] erofs-utils: lib: convert division to shift in z_erofs_load_compact_lcluster

Thu Feb 26 04:30:30 AEDT 2026

perf on fsck.erofs built with gcc reports that
z_erofs_load_compact_lcluster spends 20% of its time on the div
instruction, while the function itself accounts for ~40% of user
runtime. Judging from the source code, the division by vcnt is not
optimized into a shift even though the two possible values of vcnt
are powers of 2.

Replacing the division with a right shift by ilog2(vcnt) makes the
power-of-2 assumption explicit, so the compiler emits a shift instead
of a div.
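
As a rough illustration of why the two forms agree, the standalone
sketch below compares them outside of erofs-utils. ilog2_sketch() is a
hypothetical stand-in for the library's ilog2(), the literal 4 stands
in for sizeof(__le32), and the (vcnt, amortizedshift) pairs (2, 2) and
(16, 1) are assumed from reading lib/zmap.c rather than stated in this
patch. The equivalence only holds when vcnt is a power of 2.

```c
#include <assert.h>

/* Hypothetical stand-in for ilog2(): floor(log2(v)) for v > 0. */
static unsigned int ilog2_sketch(unsigned int v)
{
	unsigned int r = 0;

	assert(v != 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Original form: unsigned division by vcnt. */
static unsigned int encodebits_div(unsigned int vcnt,
				   unsigned int amortizedshift)
{
	return ((vcnt << amortizedshift) - 4) * 8 / vcnt;
}

/* Patched form: right shift by log2(vcnt); vcnt must be a power of 2. */
static unsigned int encodebits_shift(unsigned int vcnt,
				     unsigned int amortizedshift)
{
	return (((vcnt << amortizedshift) - 4) * 8) >> ilog2_sketch(vcnt);
}
```

With the assumed inputs, both helpers return 16 for (2, 2) and 14 for
(16, 1). For a vcnt that is not a power of 2 the shift form would
truncate differently, so it is not a general replacement for division.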

A benchmark on LZMA-compressed FreeBSD code on x86 shows a ~4%
performance improvement with gcc (mean 370.7 ms down to 354.8 ms),
while clang shows virtually no change. The tradeoff is slightly
obfuscated source code.

The following command was run locally on x86:

$ hyperfine -w 5 -m 30 "./fsck.erofs ./bsd.erofs.lzma"

patch on gcc 15.2.1 
Time (mean ± σ):     354.8 ms ±   6.0 ms    \
  [User: 227.8 ms, System: 126.1 ms]
Range (min … max):   345.8 ms … 366.2 ms    30 runs

dev on gcc 15.2.1 
Time (mean ± σ):     370.7 ms ±   6.7 ms    \
  [User: 246.5 ms, System: 123.4 ms]
Range (min … max):   362.7 ms … 390.7 ms    30 runs

patch on clang 21.1.8 
Time (mean ± σ):     371.9 ms ±   2.4 ms    \
  [User: 247.2 ms, System: 123.9 ms]
Range (min … max):   369.1 ms … 380.0 ms    30 runs

dev on clang 21.1.8
Time (mean ± σ):     371.0 ms ±   1.9 ms    \
  [User: 245.5 ms, System: 124.5 ms]
Range (min … max):   368.4 ms … 377.7 ms    30 runs

Signed-off-by: Ashley Lee <yester1324 at gmail.com>
---
v2: changed vdiv to ilog2 call

 lib/zmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/zmap.c b/lib/zmap.c
index baec278..3ac7fe9 100644
--- a/lib/zmap.c
+++ b/lib/zmap.c
@@ -160,7 +160,7 @@ static int z_erofs_load_compact_lcluster(struct z_erofs_maprecorder *m,
 	m->nextpackoff = round_down(pos, vcnt << amortizedshift) +
 			 (vcnt << amortizedshift);
 	lobits = max(lclusterbits, ilog2(Z_EROFS_LI_D0_CBLKCNT) + 1U);
-	encodebits = ((vcnt << amortizedshift) - sizeof(__le32)) * 8 / vcnt;
+	encodebits = (((vcnt << amortizedshift) - sizeof(__le32)) * 8) >> ilog2(vcnt);
 	bytes = pos & ((vcnt << amortizedshift) - 1);
 	in -= bytes;
 	i = bytes >> amortizedshift;
-- 
2.53.0