[PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

Tue Jan 29 05:26:00 AEDT 2019

On 1/28/19 7:13 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 12:07:19PM +0100, Cédric Le Goater wrote:
>> On 1/23/19 11:30 AM, Paul Mackerras wrote:
>>> On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
>>>> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
>>>>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
>>>>>> Clear the ESB pages from the VMA of the IRQ being pass through to the
>>>>>> guest and let the fault handler repopulate the VMA when the ESB pages
>>>>>> are accessed for an EOI or for a trigger.
>>>>>
>>>>> Why do we want to do this?
>>>>>
>>>>> I don't see any possible advantage to removing the PTEs from the
>>>>> userspace mapping.  You'll need to explain further.
>>>>
>>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>>> instead of the "IPI" that was there at VM init time.
>>
>> yes exactly. You need to clean up the pages each time.
>>  
>>> So that makes it sound like there is a whole lot going on that hasn't
>>> even been hinted at in the patch descriptions...  It sounds like we
>>> need a good description of how all this works and fits together
>>> somewhere under Documentation/.
>>
>> OK. I have started doing so for the models merged in QEMU but not yet 
>> for KVM. I will work on it.
>>
>>> In any case we need much more informative patch descriptions.  I
>>> realize that it's all currently in Cedric's head, but I bet that in
>>> two or three years' time when we come to try to debug something, it
>>> won't be in anyone's head...
>>
>> I agree. 
>>
>>
>> So, storing the ESB VMA under the KVM device is not shocking anyone ?  
> 
> Actually, now that I think of it, why can't userspace (QEMU) manage
> this using mmap()?  Based on what Ben has said, I assume there would
> be a pair of pages for each interrupt that a PCI pass-through device
> has. 

Yes. there is a pair of ESB pages per IRQ number.

> Would we end up with too many VMAs if we just used mmap() to
> change the mappings from the software-generated pages to the
> hardware-generated interrupt pages?  
The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
dedicated to CPU IPIs and the remaining 4K are for devices. We can 
extend the last range if needed as these are for MSIs. Dynamic 
extensions under KVM should work also.

This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
lot of mmap(), too much. Also the KVM model needs to be compatible
with the QEMU emulated one and it was simpler to have one overall
memory region for the IPI ESBs, one for the END ESBs (if we support
that one day) and one for the TIMA.

> Are the necessary pages for a PCI
> passthrough device contiguous in both host real space 

They should as they are the PHB4 ESBs.

> and guest real space ? 

also. That's how we organized the mapping. 

> If so we'd only need one mmap() for all the device's interrupt
> pages.

Ah. So we would want to make a special case for the passthrough 
device and have a mmap() and a memory region for its ESBs. Hmm.

Wouldn't that require to hot plug a memory region under the guest ? 
which means that we need to provision an address space/container 
region for theses regions. What are the benefits ? 

Is clearing the PTEs and repopulating the VMA unsafe ? 

Thanks,     

C.