[PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

Tue Aug 24 01:20:17 AEST 2021

[ re-send to Darren Hart ]

----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers mathieu.desnoyers at efficios.com wrote:

> ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc at google.com wrote:
> 
>> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> without updating rseq, leading to a stale CPU ID and other badness.
>> 
> 
> [...]
> 
>> +#define RSEQ_SIG 0xdeadbeef
> 
> Is there any reason for defining a custom signature rather than including
> tools/testing/selftests/rseq/rseq.h? That header takes care of including
> the proper architecture header, which defines the appropriate signature.
> 
> Arguably you don't define rseq critical sections in this test per se, but
> I'm wondering why the custom signature here.
> 
> [...]
> 
>> +
>> +static void *migration_worker(void *ign)
>> +{
>> +	cpu_set_t allowed_mask;
>> +	int r, i, nr_cpus, cpu;
>> +
>> +	CPU_ZERO(&allowed_mask);
>> +
>> +	nr_cpus = CPU_COUNT(&possible_mask);
>> +
>> +	for (i = 0; i < 20000; i++) {
>> +		cpu = i % nr_cpus;
>> +		if (!CPU_ISSET(cpu, &possible_mask))
>> +			continue;
>> +
>> +		CPU_SET(cpu, &allowed_mask);
>> +
>> +		/*
>> +		 * Bump the sequence count twice to allow the reader to detect
>> +		 * that a migration may have occurred in between rseq and sched
>> +		 * CPU ID reads.  An odd sequence count indicates a migration
>> +		 * is in-progress, while a completely different count indicates
>> +		 * a migration occurred since the count was last read.
>> +		 */
>> +		atomic_inc(&seq_cnt);
> 
> So technically this atomic_inc contains the required barriers because the
> selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's
> rather odd that the semantics differ from the kernel implementation in terms
> of memory barriers: the kernel implementation of atomic_inc guarantees no
> memory barriers, but this one happens to provide full barriers pretty much by
> accident (selftests futex/include/atomic.h documents no such guarantee).
> 
> If this full barrier guarantee is indeed provided by the selftests atomic.h
> header, I would really like a comment stating that in the atomic.h header so
> the carpet is not pulled from under our feet by a future optimization.
> 
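
For illustration only, here is a minimal sketch of what a documented
full-barrier guarantee could look like. The names sketch_atomic_t,
sketch_atomic_inc() and sketch_bracketed_update() are hypothetical and are
not the contents of the selftests atomic.h; the sketch assumes a GCC or Clang
toolchain, where the __sync builtins are documented as full barriers.

```c
#include <assert.h>

/* Hypothetical stand-in for the selftests atomic type; not the real header. */
typedef struct {
	volatile int val;
} sketch_atomic_t;

/*
 * sketch_atomic_inc() - atomically increment and return the new value.
 *
 * Ordering guarantee: implies a full memory barrier, because it is built on
 * __sync_add_and_fetch(), which GCC documents as a full barrier.  This is
 * stronger than the kernel's atomic_inc(), which guarantees no ordering.
 * Callers may rely on the barrier; do not relax it without auditing them.
 */
static inline int sketch_atomic_inc(sketch_atomic_t *addr)
{
	return __sync_add_and_fetch(&addr->val, 1);
}

/*
 * Hypothetical usage: bracket an update with two increments so that a
 * reader observes an odd count while the update is in progress.
 */
static inline void sketch_bracketed_update(sketch_atomic_t *seq,
					   volatile int *data, int value)
{
	int v = sketch_atomic_inc(seq);

	assert(v & 1);		/* caller must start from an even count */
	*data = value;
	sketch_atomic_inc(seq);	/* back to even: update complete */
}
```

With the guarantee spelled out next to the implementation, a later switch to
a relaxed primitive would at least have to delete the comment first.
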
> 
>> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> +			    errno, strerror(errno));
>> +		atomic_inc(&seq_cnt);
>> +
>> +		CPU_CLR(cpu, &allowed_mask);
>> +
>> +		/*
>> +		 * Let the read-side get back into KVM_RUN to improve the odds
>> +		 * of task migration coinciding with KVM's run loop.
> 
> This comment should be about increasing the odds of letting the seqlock
> read-side complete. Otherwise, the delay between the two back-to-back
> atomic_inc is so small that the seqlock read-side may never have time to
> complete reading the rseq cpu id and the sched_getcpu() call, and can retry
> forever.
> 
> I'm wondering if 1 microsecond is sufficient on other architectures as well.
> One alternative way to make this depend less on the architecture's
> implementation of sched_getcpu (whether it's a vDSO, or goes through a
> syscall) would be to read the rseq cpu id and call sched_getcpu a few times
> (e.g. 3 times) in the migration thread rather than use usleep, and throw away
> the value read. This would ensure the delay is appropriate on all
> architectures.
> 
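
A rough sketch of that alternative, under stated assumptions: the helper name
delay_with_cpu_id_reads() and the NR_DELAY_READS constant are hypothetical,
and the pointer argument stands in for the cpu_id field of an rseq area the
caller has already registered. It performs the same two reads as the
read-side and discards the results, so the delay scales with whatever
sched_getcpu() costs on the architecture.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* "A few times (e.g. 3 times)"; the exact count is a tunable assumption. */
#define NR_DELAY_READS 3

/*
 * Hypothetical replacement for usleep(1) in the migration thread.  Returns
 * the number of read pairs performed.
 */
static inline int delay_with_cpu_id_reads(const volatile uint32_t *rseq_cpu_id)
{
	int i;

	assert(rseq_cpu_id);
	for (i = 0; i < NR_DELAY_READS; i++) {
		(void)*rseq_cpu_id;	/* volatile load, result discarded */
		(void)sched_getcpu();	/* vDSO or syscall, result discarded */
	}
	return i;
}
```

In the migration loop this would be called where usleep(1) is today, with a
pointer to the rseq cpu_id field.
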
> Thanks!
> 
> Mathieu
> 
>> +		 */
>> +		usleep(1);
>> +	}
>> +	done = true;
>> +	return NULL;
>> +}
>> +
>> +int main(int argc, char *argv[])
>> +{
>> +	struct kvm_vm *vm;
>> +	u32 cpu, rseq_cpu;
>> +	int r, snapshot;
>> +
>> +	/* Tell stdout not to buffer its content */
>> +	setbuf(stdout, NULL);
>> +
>> +	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
>> +	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
>> +		    strerror(errno));
>> +
>> +	if (CPU_COUNT(&possible_mask) < 2) {
>> +		print_skip("Only one CPU, task migration not possible\n");
>> +		exit(KSFT_SKIP);
>> +	}
>> +
>> +	sys_rseq(0);
>> +
>> +	/*
>> +	 * Create and run a dummy VM that immediately exits to userspace via
>> +	 * GUEST_SYNC, while concurrently migrating the process by setting its
>> +	 * CPU affinity.
>> +	 */
>> +	vm = vm_create_default(VCPU_ID, 0, guest_code);
>> +
>> +	pthread_create(&migration_thread, NULL, migration_worker, 0);
>> +
>> +	while (!done) {
>> +		vcpu_run(vm, VCPU_ID);
>> +		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
>> +			    "Guest failed?");
>> +
>> +		/*
>> +		 * Verify rseq's CPU matches sched's CPU.  Ensure migration
>> +		 * doesn't occur between sched_getcpu() and reading the rseq
>> +		 * cpu_id by rereading both if the sequence count changes, or
>> +		 * if the count is odd (migration in-progress).
>> +		 */
>> +		do {
>> +			/*
>> +			 * Drop bit 0 to force a mismatch if the count is odd,
>> +			 * i.e. if a migration is in-progress.
>> +			 */
>> +			snapshot = atomic_read(&seq_cnt) & ~1;
>> +			smp_rmb();
>> +			cpu = sched_getcpu();
>> +			rseq_cpu = READ_ONCE(__rseq.cpu_id);
>> +			smp_rmb();
>> +		} while (snapshot != atomic_read(&seq_cnt));
>> +
>> +		TEST_ASSERT(rseq_cpu == cpu,
>> +			    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
>> +	}
>> +
>> +	pthread_join(migration_thread, NULL);
>> +
>> +	kvm_vm_free(vm);
>> +
>> +	sys_rseq(RSEQ_FLAG_UNREGISTER);
>> +
>> +	return 0;
>> +}
>> --
>> 2.33.0.rc2.250.ged5fa647cd-goog
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
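
For readers who want to experiment with the sequence-count scheme outside of
KVM, here is a self-contained sketch of the same even/odd protocol. It is not
the test's code: all names are hypothetical, two integers stand in for the
rseq and sched CPU IDs, and C11 seq_cst atomics replace the smp_rmb() calls to
keep the ordering argument simple.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_UPDATES 100000

static atomic_int seq;		/* even: stable, odd: update in progress */
static atomic_int val_a, val_b;	/* equal whenever seq is even */
static atomic_bool writer_done;

/* Write side: bump the count before and after every update. */
static void *writer(void *unused)
{
	(void)unused;
	for (int i = 1; i <= NR_UPDATES; i++) {
		atomic_fetch_add(&seq, 1);	/* count becomes odd */
		atomic_store(&val_a, i);
		atomic_store(&val_b, i);
		atomic_fetch_add(&seq, 1);	/* count becomes even */
	}
	atomic_store(&writer_done, true);
	return NULL;
}

/* Read side: retry until both values were read under one even count. */
static void read_pair(int *a, int *b)
{
	int snapshot;

	do {
		/* Drop bit 0 so an odd count can never match below. */
		snapshot = atomic_load(&seq) & ~1;
		*a = atomic_load(&val_a);
		*b = atomic_load(&val_b);
	} while (snapshot != atomic_load(&seq));
}

/* Returns the number of torn reads observed (-1 if no thread was created). */
static int seqcount_demo(void)
{
	pthread_t tid;
	int a, b, torn = 0;

	if (pthread_create(&tid, NULL, writer, NULL))
		return -1;
	while (!atomic_load(&writer_done)) {
		read_pair(&a, &b);
		torn += (a != b);
	}
	pthread_join(tid, NULL);
	assert(!(atomic_load(&seq) & 1));	/* writer finished on even */
	return torn;
}
```

The writer here updates back-to-back with no delay, so the reader may spin
for a long time before a read completes; it only terminates reliably because
the writer eventually stops.
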

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com