Boot flakiness with QEMU 3.1.0 and Clang built kernels

Nathan Chancellor natechancellor at gmail.com
Sat Apr 11 06:59:32 AEST 2020


Hi all,

Recently, our CI started hitting hangs in the spinlock torture tests
while booting powernv_defconfig and pseries_defconfig kernels built with
Clang under QEMU 3.1.0.

I initially bisected Linux and landed on commit 3282a3da25bd
("powerpc/64: Implement soft interrupt replay in C") [1], which seems to
make sense. However, I could not reproduce this in my local environment
no matter how hard I tried, only in our Docker image. I then realized my
environment's QEMU version was 4.2.0; once I compiled 3.1.0, I was able
to reproduce it locally.

I bisected QEMU down to two commits: powernv_defconfig was fixed by [2]
and pseries_defconfig was fixed by [3].
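
Because the hang is intermittent and this bisect looks for the commit
that fixed it rather than the one that broke it, the usual good/bad
sense is inverted. Below is a rough sketch of how a `git bisect run`
helper could pick its exit status, assuming the bisect was started with
`git bisect start --term-old=broken --term-new=fixed`. The function
name, its parameters, and the threshold are hypothetical and are not
necessarily how the bisect above was done.

```python
SKIP = 125  # git bisect run: "this revision cannot be tested"


def bisect_exit_code(failures, runs, max_failures=0, build_ok=True):
    """Map repeated boot results to an exit status for `git bisect run`.

    Assumes custom terms old=broken and new=fixed, so:
      0   -> revision is still "old" (broken: the hang still shows up)
      1   -> revision is "new" (fixed: hangs are at or below the threshold)
      125 -> skip this revision (build failed or nothing was run)
    """
    if not build_ok or runs == 0:
        return SKIP
    still_broken = failures > max_failures
    return 0 if still_broken else 1


if __name__ == "__main__":
    # Hypothetical numbers: 8 hangs in 10 boots means "still broken".
    print(bisect_exit_code(failures=8, runs=10))
```

Since even a fixed revision can fail occasionally, `max_failures` would
need to sit between the two observed failure rates, and each revision
needs enough boots for that gap to be visible.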

I ran 100 boots with our boot-qemu.sh script [4] and QEMU 3.1.0 failed
approximately 80% of the time, but 4.2.0 and 5.0.0-rc1 only failed 1% of
the time [5]. GCC 9.3.0 built kernels failed approximately 3% of the
time [6].
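
For anyone wanting to take a similar measurement, here is a minimal
Python sketch of a repeated-boot loop. It is not what boot-qemu.sh does
internally; the command, run count, and timeout are placeholders
(assumptions), and a run counts as a failure if it exits non-zero or
exceeds the timeout.

```python
import subprocess
import sys


def run_once(cmd, timeout_s):
    """Return True if cmd exits 0 within timeout_s seconds, else False."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung boot never exits on its own, so a timeout is a failure.
        return False
    return result.returncode == 0


def failure_rate(cmd, runs, timeout_s):
    """Run cmd `runs` times and return the fraction of runs that failed."""
    failures = sum(1 for _ in range(runs) if not run_once(cmd, timeout_s))
    return failures / runs


if __name__ == "__main__":
    # Placeholder command: substitute the real boot invocation here.
    boot_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
    rate = failure_rate(boot_cmd, runs=5, timeout_s=60)
    print(f"failed {rate:.0%} of boots")
```

The timeout matters more than the exit code here, since a hang would
otherwise stall the whole loop.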

Without access to real hardware, I cannot really say if there is a
problem here. We are going to upgrade to QEMU 4.2.0 to fix it. This is
more of an FYI so that there is some record of it outside of our issue
tracker and so people can be aware of it in case it comes up somewhere
else.

[1]: https://git.kernel.org/linus/3282a3da25bd63fdb7240bc35dbdefa4b1947005
[2]: https://git.qemu.org/?p=qemu.git;a=commit;h=f30c843ced5055fde71d28d10beb15af97fdfe39
[3]: https://git.qemu.org/?p=qemu.git;a=commit;h=34a6b015a98733a4b32881777dafd70156c5a322
[4]: https://github.com/ClangBuiltLinux/boot-utils/blob/5f49a87e272fbe967a8d26cf405cec15b024702c/boot-qemu.sh
[5]: https://user-images.githubusercontent.com/11478138/78957618-b1842080-7a9a-11ea-8856-279c3dcc6c19.png
[6]: https://user-images.githubusercontent.com/11478138/78955535-62d38800-7a94-11ea-9e61-9e3d8c068ace.png

Cheers,
Nathan

