[PATCH v4 28/46] KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

Nicholas Piggin npiggin at gmail.com
Tue Mar 23 22:15:36 AEDT 2021


Excerpts from Nicholas Piggin's message of March 23, 2021 8:36 pm:
> Excerpts from Alexey Kardashevskiy's message of March 23, 2021 8:13 pm:
>> 
>> 
>> On 23/03/2021 12:02, Nicholas Piggin wrote:
>>> irq_work's use of the DEC SPR is racy with guest<->host switch and guest
>>> entry which flips the DEC interrupt to guest, which could lose a host
>>> work interrupt.
>>> 
>>> This patch closes one race, and attempts to comment another class of
>>> races.
>>> 
>>> Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
>>> ---
>>>   arch/powerpc/kvm/book3s_hv.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 1f38a0abc611..989a1ff5ad11 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -3745,6 +3745,18 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit,
>>>   	if (!(vcpu->arch.ctrl & 1))
>>>   		mtspr(SPRN_CTRLT, mfspr(SPRN_CTRLF) & ~1);
>>>   
>>> +	/*
>>> +	 * When setting DEC, we must always deal with irq_work_raise via NMI vs
>>> +	 * setting DEC. The problem occurs right as we switch into guest mode
>>> +	 * if a NMI hits and sets pending work and sets DEC, then that will
>>> +	 * apply to the guest and not bring us back to the host.
>>> +	 *
>>> +	 * irq_work_raise could check a flag (or possibly LPCR[HDICE] for
>>> +	 * example) and set HDEC to 1? That wouldn't solve the nested hv
>>> +	 * case which needs to abort the hcall or zero the time limit.
>>> +	 *
>>> +	 * XXX: Another day's problem.
>>> +	 */
>>>   	mtspr(SPRN_DEC, vcpu->arch.dec_expires - tb);
>>>   
>>>   	if (kvmhv_on_pseries()) {
>>> @@ -3879,7 +3891,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit,
>>>   	vc->entry_exit_map = 0x101;
>>>   	vc->in_guest = 0;
>>>   
>>> -	mtspr(SPRN_DEC, local_paca->kvm_hstate.dec_expires - tb);
>>> +	set_dec_or_work(local_paca->kvm_hstate.dec_expires - tb);
>> 
>> 
>> set_dec_or_work() will write local_paca->kvm_hstate.dec_expires - tb - 1 
>> to SPRN_DEC which is not exactly the same, is this still alright?
>> 
>> I asked in v3 but it is probably lost :)
> 
> Oh I did see that then forgot.
> 
> It will write dec_expires - tb, then it will write 1 if it found irq_work
> was pending.

Ah you were actually asking about set_dec writing val - 1. I totally 
missed that.

Yes that was an unintentional change. This is the way timer.c code works 
with respect to the decrementers_next_tb value, so it's probably better 
to make them so it seems like it should be okay (and better to bring the 
KVM code up to match timer code rather than be different or the other
way around). The difference should be noted in the changelog though.

Thanks,
Nick


More information about the Linuxppc-dev mailing list