[PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling

Fri Oct 15 13:23:04 AEDT 2021

Excerpts from Laurent Vivier's message of October 13, 2021 7:30 pm:
> On 13/10/2021 01:18, Michael Ellerman wrote:
>> Laurent Vivier <lvivier at redhat.com> writes:
>>> Commit 112665286d08 moved guest_exit() in the interrupt protected
>>> area to avoid wrong context warning (or worse), but the tick counter
>>> cannot be updated and the guest time is accounted to the system time.
>>>
>>> To fix the problem port to POWER the x86 fix
>>> 160457140187 ("Defer vtime accounting 'til after IRQ handling"):
>>>
>>> "Defer the call to account guest time until after servicing any IRQ(s)
>>>   that happened in the guest or immediately after VM-Exit.  Tick-based
>>>   accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
>>>   handler runs, and IRQs are blocked throughout the main sequence of
>>>   vcpu_enter_guest(), including the call into vendor code to actually
>>>   enter and exit the guest."
>>>
>>> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest context before enabling irqs")
>>> Cc: npiggin at gmail.com
>>> Cc: <stable at vger.kernel.org> # 5.12
>>> Signed-off-by: Laurent Vivier <lvivier at redhat.com>
>>> ---
>>>
>>> Notes:
>>>      v2: remove reference to commit 61bd0f66ff92
>>>          cc stable 5.12
>>>          add the same comment in the code as for x86
>>>
>>>   arch/powerpc/kvm/book3s_hv.c | 24 ++++++++++++++++++++----
>>>   1 file changed, 20 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 2acb1c96cfaf..a694d1a8f6ce 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> ...
>>> @@ -4506,13 +4514,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>>>   
>>>   	srcu_read_unlock(&kvm->srcu, srcu_idx);
>>>   
>>> +	context_tracking_guest_exit();
>>> +
>>>   	set_irq_happened(trap);
>>>   
>>>   	kvmppc_set_host_core(pcpu);
>>>   
>>> -	guest_exit_irqoff();
>>> -
>>>   	local_irq_enable();
>>> +	/*
>>> +	 * Wait until after servicing IRQs to account guest time so that any
>>> +	 * ticks that occurred while running the guest are properly accounted
>>> +	 * to the guest.  Waiting until IRQs are enabled degrades the accuracy
>>> +	 * of accounting via context tracking, but the loss of accuracy is
>>> +	 * acceptable for all known use cases.
>>> +	 */
>>> +	vtime_account_guest_exit();
>> 
>> This pops a warning for me, running guest(s) on Power8:
>>   
>>    [  270.745303][T16661] ------------[ cut here ]------------
>>    [  270.745374][T16661] WARNING: CPU: 72 PID: 16661 at arch/powerpc/kernel/time.c:311 vtime_account_kernel+0xe0/0xf0
> 
> Thank you, I missed that...
> 
> My patch is wrong, I have to add vtime_account_guest_exit() before the local_irq_enable().

I thought so because if we take an interrupt after exiting the guest that 
should be accounted to kernel not guest.

> 
> arch/powerpc/kernel/time.c
> 
>   305 static unsigned long vtime_delta(struct cpu_accounting_data *acct,
>   306                                  unsigned long *stime_scaled,
>   307                                  unsigned long *steal_time)
>   308 {
>   309         unsigned long now, stime;
>   310
>   311         WARN_ON_ONCE(!irqs_disabled());
> ...
> 
> But I don't understand how ticks can be accounted now if irqs are still disabled.
> 
> Not sure it is as simple as expected...

I don't know all the timer stuff too well. The 
!CONFIG_VIRT_CPU_ACCOUNTING case is relying on PF_VCPU to be set when 
the host timer interrupt runs irqtime_account_process_tick runs so it
can accumulate that tick to the guest?

That probably makes sense then, but it seems like we need that in a
different place. Timer interrupts are not guaranteed to be the first one
to occur when interrupts are enabled.

Maybe a new tick_account_guest_exit() and move PF_VCPU clearing to that
for tick based accounting. Call it after local_irq_enable and call the
vtime accounting before it. Would that work?

Thanks,
Nick