[PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Fri Aug 27 04:42:12 AEST 2021


----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc at google.com wrote:

> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>> 
>> ----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoyers at efficios.com wrote:
>> 
>> > ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc at google.com wrote:
>> > 
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >> 
>> > 
>> > [...]
>> > 
>> > +#define RSEQ_SIG 0xdeadbeef
>> > 
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> > 
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
> 
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out
> that the test doesn't actually do any rseq critical sections.

It might be good to add a comment near this define stating this, so nobody
attempts to copy it as an example.
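For instance, the comment could look something like this (the wording is just a
suggestion, not an actual patch):

```c
/*
 * Use an arbitrary, bogus signature instead of the per-architecture
 * RSEQ_SIG from tools/testing/selftests/rseq/rseq.h.  This test never
 * defines an rseq critical section, so the signature is never checked
 * by the kernel; do not copy this pattern into code that does use
 * critical sections.
 */
#define RSEQ_SIG 0xdeadbeef
```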

> 
>> > [...]
>> > 
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> +	cpu_set_t allowed_mask;
>> >> +	int r, i, nr_cpus, cpu;
>> >> +
>> >> +	CPU_ZERO(&allowed_mask);
>> >> +
>> >> +	nr_cpus = CPU_COUNT(&possible_mask);
>> >> +
>> >> +	for (i = 0; i < 20000; i++) {
>> >> +		cpu = i % nr_cpus;
>> >> +		if (!CPU_ISSET(cpu, &possible_mask))
>> >> +			continue;
>> >> +
>> >> +		CPU_SET(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Bump the sequence count twice to allow the reader to detect
>> >> +		 * that a migration may have occurred in between rseq and sched
>> >> +		 * CPU ID reads.  An odd sequence count indicates a migration
>> >> +		 * is in-progress, while a completely different count indicates
>> >> +		 * a migration occurred since the count was last read.
>> >> +		 */
>> >> +		atomic_inc(&seq_cnt);
>> > 
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
> 
> Yeah, I got quite lost trying to figure out what atomics the test would actually
> end up with.

At the very least, until things are clarified in the selftests atomic header, I would
recommend adding a comment alongside each use of atomic_inc here stating which memory
barriers are required. I would even prefer that we add extra (currently unneeded)
write barriers to make extra sure this stays documented. Performance of the write side
does not matter much here.
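For example (a sketch only, not the selftests API: seq_cnt_inc() is a hypothetical
helper name, and it uses C11 atomics rather than the selftests futex/include/atomic.h
header):

```c
#include <stdatomic.h>

static atomic_int seq_cnt;

/*
 * Hypothetical wrapper: bump the sequence count with an explicitly
 * documented full barrier, rather than relying on the selftests
 * atomic_inc() happening to compile down to __sync_add_and_fetch().
 */
static void seq_cnt_inc(void)
{
	/*
	 * seq_cst ordering keeps the count update ordered against the
	 * sched_setaffinity() call between the two increments.
	 */
	atomic_fetch_add_explicit(&seq_cnt, 1, memory_order_seq_cst);
}
```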

> 
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> > 
>> > 
>> >> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> +			    errno, strerror(errno));
>> >> +		atomic_inc(&seq_cnt);
>> >> +
>> >> +		CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Let the read-side get back into KVM_RUN to improve the odds
>> >> +		 * of task migration coinciding with KVM's run loop.
>> > 
>> > This comment should be about increasing the odds of letting the seqlock
>> > read-side complete. Otherwise, the delay between the two back-to-back
>> > atomic_inc is so small that the seqlock read-side may never have time to
>> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> > retry forever.
> 
> Hmm, but that's not why there's a delay.  I'm not arguing that a livelock isn't
> possible (though that syscall would have to be screaming fast),

I don't think we have the same understanding of the livelock scenario. AFAIU, the livelock
would be caused by a too-small delay between the two consecutive atomic_inc() calls from one
loop iteration to the next, compared to the time it takes to perform a read-side critical
section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I
doubt that the sched_getcpu implementation has good odds of being fast enough to complete
in that narrow window, leading to many read-side seqlock retries.

> but the primary
> motivation is very much to allow the read-side enough time to get back into KVM
> proper.

I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we
have:

migration thread                             KVM_RUN/read-side thread
-----------------------------------------------------------------------------------
                                             - ioctl(KVM_RUN)
- atomic_inc_seq_cst(&seqcnt)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)
                                             - a = atomic_load(&seqcnt) & ~1
                                             - smp_rmb()
                                             - b = LOAD_ONCE(__rseq_abi->cpu_id);
                                             - sched_getcpu()
                                             - smp_rmb()
                                             - re-load seqcnt/compare (succeeds)
                                               - Can only succeed if entire read-side happens while the seqcnt
                                                 is in an even numbered state.
                                             - if (a != b) abort()
  /* no delay. Even counter state is very
     short. */
- atomic_inc_seq_cst(&seqcnt)
  /* Let's suppose the lack of delay causes the
     setaffinity to complete too early compared
     with the KVM_RUN ioctl */
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

  /* no delay. Even counter state is very
     short. */
- atomic_inc_seq_cst(&seqcnt)
  /* Then a setaffinity from a following
     migration thread loop will run
     concurrently with KVM_RUN */
                                             - ioctl(KVM_RUN)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

As pointed out here, if the first setaffinity runs too early compared with KVM_RUN,
a following setaffinity will run concurrently with it. However, the fact that
the even counter state is so short may very well prevent the seqlock read side from
making progress.
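In code, the read side sketched in the diagram is roughly the following (a sketch
with C11 atomics; rseq_cpu_id here is a stand-in for the LOAD_ONCE of
__rseq_abi->cpu_id, not the actual test's variable):

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>

static atomic_int seq_cnt;       /* bumped twice per migration-thread loop */
static _Atomic int rseq_cpu_id;  /* stand-in for __rseq_abi->cpu_id */

/*
 * Seqlock read side: both CPU ID reads must land entirely within one
 * even-numbered seq_cnt window.  If the migration thread flips the
 * count back-to-back with no delay, the even window is so narrow that
 * this loop can retry for a very long time.
 */
static void read_cpu_ids(int *rseq_cpu, int *sched_cpu)
{
	int snapshot;

	do {
		/* Round down to even: ignore an in-progress migration. */
		snapshot = atomic_load(&seq_cnt) & ~1;
		*rseq_cpu = atomic_load(&rseq_cpu_id);
		*sched_cpu = sched_getcpu();
	} while (atomic_load(&seq_cnt) != snapshot);
}
```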

> 
> To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run
> loop, i.e. sched_setaffinity() must induce task migration after the read-side has
> invoked ioctl(KVM_RUN).

No argument here.

> 
>> > I'm wondering if 1 microsecond is sufficient on other architectures as
>> > well.
> 
> I'm definitely wondering that as well :-)
> 
>> > One alternative way to make this depend less on the architecture's
>> > implementation of sched_getcpu (whether it's a vDSO, or goes through a
>> > syscall) would be to read the rseq cpu id and call sched_getcpu a few times
>> > (e.g. 3 times) in the migration thread rather than use usleep, and throw
>> > away the value read. This would ensure the delay is appropriate on all
>> > architectures.
> 
> As above, I think an arbitrary delay is required regardless of how fast
> sched_getcpu() can execute.  One thought would be to do sched_getcpu() _and_
> usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part,
> but I don't know that that adds meaningful value.
> 
> The real test is if someone could see if the bug repros on non-x86 hardware...

As I explain in the scenario above, I don't think we agree on the reason why the
delay is required. One way to confirm this would be to run the code without the delay
and count how many seqcnt read-side attempts succeed vs. fail, then compare that with
a run with a 1us delay between the migration thread loops. This data should help us
come to a common understanding of the delay's role.
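As a sketch of that instrumentation (hypothetical counter and function names; the
real change would go in the test's vcpu thread read loop):

```c
#include <stdatomic.h>

static atomic_int seq_cnt;                   /* migration thread's counter */
static atomic_long reads_ok, reads_retried;  /* hypothetical tallies */

/*
 * One read-side attempt with success/retry accounting, so that runs
 * with and without the 1us delay in the migration thread loop can be
 * compared.
 */
static void one_read_attempt(void)
{
	int snapshot = atomic_load(&seq_cnt) & ~1;

	/* ... rseq cpu_id load and sched_getcpu() would go here ... */

	if (atomic_load(&seq_cnt) == snapshot)
		atomic_fetch_add(&reads_ok, 1);
	else
		atomic_fetch_add(&reads_retried, 1);
}
```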

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

