[PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
Purcareata Bogdan
b43198@freescale.com
Mon Apr 27 16:45:09 AEST 2015
On 24.04.2015 00:26, Scott Wood wrote:
> On Thu, 2015-04-23 at 15:31 +0300, Purcareata Bogdan wrote:
>> On 23.04.2015 03:30, Scott Wood wrote:
>>> On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:
>>>> On 21.04.2015 03:52, Scott Wood wrote:
>>>>> On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
>>>>>> There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
>>>>>> function is kvmppc_set_epr, which is a static inline. Removing the static inline
>>>>>> yields a compiler crash (Segmentation fault (core dumped) -
>>>>>> scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
>>>>>> but that's a different story, so I just let it be for now. Point is the time may
>>>>>> include other work after the lock has been released, but before the function
>>>>>> actually returned. I noticed this was the case for .kvm_set_msi, which could
>>>>>> work up to 90 ms, not actually under the lock. This made me change what I'm
>>>>>> looking at.
>>>>>
>>>>> kvm_set_msi does pretty much nothing outside the lock -- I suspect
>>>>> you're measuring an interrupt that happened as soon as the lock was
>>>>> released.
>>>>
>>>> That's exactly right. I've seen things like a timer interrupt occurring right
>>>> after the spinlock_irqrestore, but before kvm_set_msi actually returned.
>>>>
>>>> [...]
>>>>
>>>>>> Or perhaps a different stress scenario involving a lot of VCPUs
>>>>>> and external interrupts?
>>>>>
>>>>> You could instrument the MPIC code to find out how many loop iterations
>>>>> you maxed out on, and compare that to the theoretical maximum.
>>>>
>>>> Numbers are pretty low, and I'll try to explain based on my observations.
>>>>
>>>> The problematic section in openpic_update_irq is this [1], since it loops
>>>> through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops
>>>> through all pending interrupts for a VCPU [2].
>>>>
>>>> The guest interfaces are virtio-vhostnet, which are based on MSI
>>>> (/proc/interrupts in guest shows they are MSI). For external interrupts to the
>>>> guest, the irq_source destmask is currently 0, and last_cpu is 0 (uninitialized),
>>>> so [1] will go on and deliver the interrupt directly and unicast (no VCPUs loop).
>>>>
>>>> I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts
>>>> are actually pending for the destination VCPU. At most, there were 3 interrupts
>>>> - n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that
>>>> guest virtio interrupts are cascaded over one or a few shared MSI interrupts.
>>>>
>>>> So worst case, in this scenario, was checking the priorities for 3 pending
>>>> interrupts for 1 VCPU. Something like this (some of my prints included):
>>>>
>>>> [61010.582033] openpic_update_irq: destmask 1 last_cpu 0
>>>> [61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
>>>> [61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
>>>> [61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
>>>> [61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
>>>> [61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1
>>>>
>>>> It would be really helpful to get your comments regarding whether these are
>>>> realistic numbers for everyday use, or whether they are relevant only to this
>>>> particular scenario.
>>>
>>> RT isn't about "realistic numbers for everyday use". It's about worst
>>> cases.
>>>
>>>> - Can these interrupts be used in directed delivery, so that the destination
>>>> mask can include multiple VCPUs?
>>>
>>> The Freescale MPIC does not support multiple destinations for most
>>> interrupts, but the (non-FSL-specific) emulation code appears to allow
>>> it.
>>>
>>>> The MPIC manual states that timer and IPI
>>>> interrupts are supported for directed delivery, although I'm not sure how much of
>>>> this is used in the emulation. I know that kvmppc uses the decrementer outside
>>>> of the MPIC.
>>>>
>>>> - How are virtio interrupts cascaded over the shared MSI interrupts?
>>>> /proc/device-tree/soc@e0000000/msi@41600/interrupts in the guest shows 8 values
>>>> - 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is
>>>> that correct?
>>>
>>> It looks like that's currently the case, but actual hardware supports
>>> more than that, so it's possible (albeit unlikely any time soon) that
>>> the emulation eventually does as well.
>>>
>>> But it's possible to have interrupts other than MSIs...
>>
>> Right.
>>
>> So given that the raw spinlock conversion is not suitable for all the scenarios
>> supported by the OpenPIC emulation, is it ok that my next step would be to send
>> a patch containing both the raw spinlock conversion and a mandatory disable of
>> the in-kernel MPIC? This is actually the last conclusion we came up with some
>> time ago, but I guess it was good to get some more insight on how things
>> actually work (at least for me).
>
> Fine with me. Have you given any thought to ways to restructure the
> code to eliminate the problem?
My first thought would be to create a separate lock for each VCPU's pending
interrupts queue, so that the whole openpic_update_irq path becomes more granular.
However, this is just a very preliminary thought. Before I can come up with
anything worthy of consideration, I must read the OpenPIC specification and the
current KVM emulated OpenPIC implementation thoroughly. I currently have other
things on my plate, and will come back to this once I have some time.
Meanwhile, I've sent a v2 on the PPC and RT mailing lists for this raw_spinlock
conversion, alongside disabling the in-kernel MPIC emulation for PREEMPT_RT. I
would be grateful to hear your feedback on that, so that it can get applied.
Thank you,
Bogdan P.