Disabling ppc64le hosts for Qemu tests

Thu Nov 15 08:45:14 AEDT 2018

On Tue, Nov 13, 2018 at 5:44 PM Joel Stanley <joel at jms.id.au> wrote:
>
> On Wed, 14 Nov 2018 at 07:00, Andrew Geissler <geissonator at gmail.com> wrote:
> >
> > On Mon, Nov 12, 2018 at 11:08 PM Joel Stanley <joel at jms.id.au> wrote:
> > >
> > > Hello Andrew,
> > >
> > > I've been trying to get to the bottom of the slow-booting qemu issue
> > > for a few weeks now. We committed a fix that resolves the issue on x86
> > > hosts, but when running the same Qemu build on ppc64le it still has
> > > the "go slow" behaviour.
> > >
> > >  https://github.com/openbmc/qemu/issues/14
> > >
> > > I propose we run the Romulus Qemu boot test on x86-only build slaves
> > > while we work this out. That will allow us to unblock the kernel
> > > security bumps, which have been pending for a few weeks.
> >
> > Unfortunately there's no way that I can find in the Jenkins multi-configuration
> > matrix plugin to specify which part of an axis goes to which slave node.
> >
> > The flow right now is that whichever node gets assigned to build the
> > image is then used to run the QEMU job (so no transfers of
> > images have to occur).
>
> This sounds like the easiest thing to fix. We would still maintain all
> four CI boxes, but ensure the Qemu job is run on an x86 box.
>
> We could even run the Qemu job on openpower.xyz (aka 'slave') itself,
> as the images are already being copied here.

I think to do this, we'd need something like
https://stackoverflow.com/questions/7133027/retrieve-build-number-or-artifacts-of-downstream-build-in-jenkins
Currently the images are not copied to openpower.xyz until the very end
of the job, so to get them to the QEMU job we'd need to copy them in
between jobs.

Honestly, if it's just a couple of weeks, I think our best solution is
to just remove the ppc64le system from the bitbake builder queue.
With the US holiday, it should be a quieter few weeks than what we've
had recently.
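
As a rough sketch of that trade-off (assuming four equally sized builders
and uniform random assignment, as described in this thread; the node names
other than builder4 are hypothetical placeholders):

```python
from fractions import Fraction

# Hypothetical node names; only builder4 (the ppc64le box) is named in this thread.
BUILDERS = ["builder1", "builder2", "builder3", "builder4"]
PPC64LE = {"builder4"}


def ppc64le_share(queue):
    """Fraction of jobs landing on a ppc64le node, assuming uniform random assignment."""
    return Fraction(sum(1 for node in queue if node in PPC64LE), len(queue))


def remaining_capacity(queue, full=BUILDERS):
    """Fraction of the original build capacity left, assuming equally sized nodes."""
    return Fraction(len(queue), len(full))


# Queue with the ppc64le system removed.
x86_only = [node for node in BUILDERS if node not in PPC64LE]

print(ppc64le_share(BUILDERS))       # 1/4
print(ppc64le_share(x86_only))       # 0
print(remaining_capacity(x86_only))  # 3/4
```

So dropping the ppc64le system means Romulus never lands on it, at the cost
of a quarter of the bitbake build capacity.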

For other optimization reasons, I have always wanted a way to
dedicate a QEMU queue, but I don't really have time to get
that going right now.

>
> > The assignment of which node builds
> > which config is random so builder4 (our ppc64le) node gets Romulus
> > about 25% of the time.
> >
> > Removing builder4 from the build queue would work, but would
> > also remove 25% of our build capacity.
>
> Is there a reason why we don't use the "openbmc-ci" build slave more?
>
>  https://openpower.xyz/computer/openbmc-ci1/builds

openbmc-ci1 is our ppc64le system that runs the Gerrit server. I send the
per-repository CI jobs there (i.e. the make/make check google test jobs).
I don't send the big bitbake jobs (i.e. builder) because of the CPU impact
they would have on Gerrit.

> > > I'll continue working on fixing Qemu in the future, but I don't have
> > > time for the next two weeks due to some higher priority work.
> >
> > This feels like one of those things that's always tough to come back
> > to once you let a workaround in.
>
> I don't understand? There's no workaround going into Qemu, this is
> simply for the Jenkins issue. Once Qemu is fixed we can re-enable the
> Jenkins slave.
>
> > Is there a specific kernel commit
> > you've bisected down to that we could just pull from the openbmc/linux
> > tree? Is there a workaround we could put in our openbmc/qemu for
> > now? I'd prefer both of these over just disabling ppc64le.
>
> The fix corrects a nasty bug in the Linux timer driver. Reverting this
> fix would result in inaccurate running of timers in Linux, which is
> what the system currently does. I would consider it a high priority
> fix to have in the product.

Yeah, it just feels like we're stuck between a rock and a hard place.
I definitely understand the desire to keep the kernel up to date for the
community.

With the tag happening Friday, I assume we don't want to get this
in until next week anyway? Or was this something you felt we
really needed before the tag?

>
> The Qemu workaround is already in place, and works for x86. It doesn't
> work for ppc64le.