Boot flakiness with QEMU 3.1.0 and Clang built kernels

Nathan Chancellor natechancellor at gmail.com
Sat Apr 11 10:53:54 AEST 2020


Hi Nicholas,

On Sat, Apr 11, 2020 at 10:29:45AM +1000, Nicholas Piggin wrote:
> Nathan Chancellor's on April 11, 2020 6:59 am:
> > Hi all,
> > 
> > Recently, our CI started running into several hangs when running the
> > spinlock torture tests during a boot with QEMU 3.1.0 on
> > powernv_defconfig and pseries_defconfig when compiled with Clang.
> > 
> > I initially bisected Linux and came down to commit 3282a3da25bd
> > ("powerpc/64: Implement soft interrupt replay in C") [1], which seems to
> > make sense. However, I realized I could not reproduce this in my local
> > environment no matter how hard I tried, only in our Docker image. I then
> > realized my environment's QEMU version was 4.2.0; I compiled 3.1.0 and
> > was able to reproduce it then.
> > 
> > I bisected QEMU down to two commits: powernv_defconfig was fixed by [2]
> > and pseries_defconfig was fixed by [3].
> 
> Looks like it might have previously been testing power8, now power9?
> -cpu power8 might get it reproducing again.

Yes, that is what it looks like. I can reproduce the hang with both
pseries-3.1 and powernv8 on QEMU 4.2.0.

> > I ran 100 boots with our boot-qemu.sh script [4] and QEMU 3.1.0 failed
> > approximately 80% of the time but 4.2.0 and 5.0.0-rc1 only failed 1% of
> > the time [5]. GCC 9.3.0 built kernels failed approximately 3% of the time
> > [6].
> 
> Do they fail in the same way? Was the fail rate at 0% before upgrading
> kernels?

Yes, it just hangs after I see the printout that the torture tests are
running.

[    2.277125] spin_lock-torture: Creating torture_shuffle task
[    2.279058] spin_lock-torture: Creating torture_stutter task
[    2.280285] spin_lock-torture: torture_shuffle task started
[    2.281326] spin_lock-torture: Creating lock_torture_writer task
[    2.282509] spin_lock-torture: torture_stutter task started
[    2.283511] spin_lock-torture: Creating lock_torture_writer task
[    2.285155] spin_lock-torture: lock_torture_writer task started
[    2.286586] spin_lock-torture: Creating lock_torture_stats task
[    2.287772] spin_lock-torture: lock_torture_writer task started
[    2.290578] spin_lock-torture: lock_torture_stats task started

Yes, we never had any failures in our CI before that upgrade happened. I
will try to run a set of boot tests with a kernel built at the commit
right before 3282a3da25bd and at 3282a3da25bd to make triple sure I
landed on the right commit.
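
For anyone who wants to measure a failure rate the same way, here is a
minimal sketch of a repeated-boot loop. It is not the boot-qemu.sh script
referenced in this thread; the function name, the default timeout, and the
way the boot command is passed in are all assumptions made for
illustration. A run counts as a failure if the command exits non-zero or
does not finish before the timeout, which is how a hang would show up.

```python
import subprocess
import sys


def boot_failure_rate(cmd, runs=100, timeout=180):
    """Run `cmd` `runs` times and return the fraction of failed runs.

    A run fails if it exits non-zero or exceeds `timeout` seconds.
    `cmd` is whatever boots the kernel (hypothetical; supplied by the caller).
    """
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout; count it as a hang.
            failures += 1
    return failures / runs


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the boot command.
    rate = boot_failure_rate(sys.argv[1:])
    print(f"failure rate: {rate:.0%}")
```

Running the loop once per kernel (the commit before 3282a3da25bd and
3282a3da25bd itself) with the same QEMU binary would give two rates that
can be compared directly.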

> > Without access to real hardware, I cannot really say if there is a
> > problem here. We are going to upgrade to QEMU 4.2.0 to fix it. This is
> > more of an FYI so that there is some record of it outside of our issue
> > tracker and so people can be aware of it in case it comes up somewhere
> > else.
> 
> Thanks for this, I'll try to reproduce. You're not running an SMP guest?

No, not as far as I am aware at least. You can see our QEMU line in our
CI and the boot-qemu.sh script I have listed below:

https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/318260635

> Anything particular to run the lock torture test? This is just 
> powernv_defconfig + CONFIG_LOCK_TORTURE_TEST=y ?

We do enable some other configs, you can see those here:

https://github.com/ClangBuiltLinux/continuous-integration/blob/c02d2f008a64d44e62518bc03beb1126db7619ce/configs/common.config
https://github.com/ClangBuiltLinux/continuous-integration/blob/c02d2f008a64d44e62518bc03beb1126db7619ce/configs/tt.config

The tt.config values are needed to reproduce but I did not verify that
ONLY tt.config was needed. Other than that, no, we are just building
either pseries_defconfig or powernv_defconfig with those configs and
letting it boot up with a simple initramfs, which prints the version
string then shuts the machine down.

Let me know if you need any more information, cheers!
Nathan

> Thanks,
> Nick
> 
> > 
> > [1]: https://git.kernel.org/linus/3282a3da25bd63fdb7240bc35dbdefa4b1947005
> > [2]: https://git.qemu.org/?p=qemu.git;a=commit;h=f30c843ced5055fde71d28d10beb15af97fdfe39
> > [3]: https://git.qemu.org/?p=qemu.git;a=commit;h=34a6b015a98733a4b32881777dafd70156c5a322
> > [4]: https://github.com/ClangBuiltLinux/boot-utils/blob/5f49a87e272fbe967a8d26cf405cec15b024702c/boot-qemu.sh
> > [5]: https://user-images.githubusercontent.com/11478138/78957618-b1842080-7a9a-11ea-8856-279c3dcc6c19.png
> > [6]: https://user-images.githubusercontent.com/11478138/78955535-62d38800-7a94-11ea-9e61-9e3d8c068ace.png
> > 
> > Cheers,
> > Nathan
> > 

