[PATCH v4 28/46] KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

Nicholas Piggin npiggin at gmail.com
Tue Mar 23 21:36:44 AEDT 2021


Excerpts from Alexey Kardashevskiy's message of March 23, 2021 8:13 pm:
> 
> 
> On 23/03/2021 12:02, Nicholas Piggin wrote:
>> irq_work's use of the DEC SPR is racy with guest<->host switch and guest
>> entry which flips the DEC interrupt to guest, which could lose a host
>> work interrupt.
>> 
>> This patch closes one race, and attempts to comment another class of
>> races.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 1f38a0abc611..989a1ff5ad11 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -3745,6 +3745,18 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit,
>>   	if (!(vcpu->arch.ctrl & 1))
>>   		mtspr(SPRN_CTRLT, mfspr(SPRN_CTRLF) & ~1);
>>   
>> +	/*
>> +	 * When setting DEC, we must always deal with irq_work_raise via NMI vs
>> +	 * setting DEC. The problem occurs right as we switch into guest mode
>> +	 * if a NMI hits and sets pending work and sets DEC, then that will
>> +	 * apply to the guest and not bring us back to the host.
>> +	 *
>> +	 * irq_work_raise could check a flag (or possibly LPCR[HDICE] for
>> +	 * example) and set HDEC to 1? That wouldn't solve the nested hv
>> +	 * case which needs to abort the hcall or zero the time limit.
>> +	 *
>> +	 * XXX: Another day's problem.
>> +	 */
>>   	mtspr(SPRN_DEC, vcpu->arch.dec_expires - tb);
>>   
>>   	if (kvmhv_on_pseries()) {
>> @@ -3879,7 +3891,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit,
>>   	vc->entry_exit_map = 0x101;
>>   	vc->in_guest = 0;
>>   
>> -	mtspr(SPRN_DEC, local_paca->kvm_hstate.dec_expires - tb);
>> +	set_dec_or_work(local_paca->kvm_hstate.dec_expires - tb);
> 
> 
> set_dec_or_work() will write local_paca->kvm_hstate.dec_expires - tb - 1 
> to SPRN_DEC which is not exactly the same, is this still alright?
> 
> I asked in v3 but it is probably lost :)

Oh I did see that then forgot.

It will write dec_expires - tb, then it will write 1 if it found irq_work
was pending.

The change is intentional to fixes one of the lost irq_work races.

Thanks,
Nick


More information about the Linuxppc-dev mailing list