[PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

Cédric Le Goater clg at kaod.org
Thu Jan 31 02:54:23 AEDT 2019


On 1/30/19 7:20 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 3:45 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>>>> Would we end up with too many VMAs if we just used mmap() to
>>>>> change the mappings from the software-generated pages to the
>>>>> hardware-generated interrupt pages?  
>>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
>>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
>>>
>>> Confused.  You say the number space has 32768 entries but then imply
>>> there are only 8K entries.  Do you mean that the API allows for 15-bit
>>> IRQ numbers but we are only making use of 8192 of them?
>>
>> Ouch. My bad. Let's do it again. 
>>
>> The sPAPR IRQ number space is 0x2000 wide:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
>>
>> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
>> devices (which can be extended if needed).
>>
>> So that's 8192 x 2 ESB pages.
>>
>>>> extend the last range if needed as these are for MSIs. Dynamic 
>>>> extensions under KVM should also work.
>>>>
>>>> This is to say that we have 8K x 2 (trigger+EOI) pages. That is a
>>>> lot of mmap() calls, far too many. Also the KVM model needs to be compatible
>>>
>>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
>>> space would be covered by a single mmap, overlaid by subsequent mmaps
>>> where we need to map real device interrupts.
>>
>> ok. The same fault handler could be used to populate the VMA with the 
>> ESB pages. 
>>
>> But it would mean extra work on the QEMU side, which is not needed 
>> with this patch. 
> 
> Maybe, but just storing a single vma pointer in our private data is
> not a feasible approach.  First, you have no control on the lifetime
> of the vma and thus this is a use-after-free waiting to happen, and
> secondly, there could be multiple vmas that you need to worry about.

I fully agree. That's why I was uncomfortable with the solution. There 
are a few other drivers (GPUs, if I recall) doing that, but it feels wrong.

> Userspace could do multiple mmaps, or could do mprotect or similar on
> part of an existing mmap, which would split the vma for the mmap into
> multiple vmas.  You don't get notified about munmap either as far as I
> can tell, so the vma is liable to go away.  

yes ...

> And it doesn't matter if
> QEMU would never do such things; if userspace can provoke a
> use-after-free in the kernel using KVM then someone will write a
> specially crafted user program to do that.
> 
> So we either solve it in userspace, or we have to write and maintain
> complex kernel code with deep links into the MM subsystem.  
>
> I'd much rather solve it in userspace.

OK, then. I have been reluctant to do so, but it seems there is no
other in-kernel solution. 

>>>> with the QEMU emulated one and it was simpler to have one overall
>>>> memory region for the IPI ESBs, one for the END ESBs (if we support
>>>> that one day) and one for the TIMA.
>>>>
>>>>> Are the necessary pages for a PCI
>>>>> passthrough device contiguous in both host real space 
>>>>
>>>> They should be, as they are the PHB4 ESBs.
>>>>
>>>>> and guest real space ? 
>>>>
>>>> Also, yes. That's how we organized the mapping. 
>>>
>>> "How we organized the mapping" is a significant design decision that I
>>> haven't seen documented anywhere, and is really needed for
>>> understanding what's going on.
>>
>> OK. I will add comments on that. See below for some description.
>>
>> There is nothing fancy, it's simply indexed with the interrupt number,
>> like for HW, or for the QEMU XIVE emulated model.
>>
>>>>> If so we'd only need one mmap() for all the device's interrupt
>>>>> pages.
>>>>
>>>> Ah. So we would want to make a special case for the passthrough 
>>>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>>>
>>>> Wouldn't that require hot plugging a memory region under the guest? 
>>>
>>> No; the way that a memory region works is that userspace can do
>>> whatever disparate mappings it likes within the region on the user
>>> process side, and the corresponding region of guest real address space
>>> follows that automatically.
>>
>> yes. I suppose this should work also for 'ram device' memory mappings.
>>
>> So when the passthrough device is added to the guest, we would add a 
>> new 'ram device' memory region for the device interrupt ESB pages 
>> that would overlap with the initial guest ESB pages.  
> 
> Not knowing the QEMU internals all that well, I don't at all
> understand why a new ram device is necessary. 

'ram device' memory regions are a special region type used to map the 
result of a mmap() in QEMU directly into the guest. 

This is how we propagate the XIVE ESB pages from HW (and the TIMA) 
to the guest and the Linux kernel. They have other uses with VFIO.
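
To make the idea concrete, here is a minimal sketch of how such a region
could be built from a device fd exposing the ESB pages. The function and
variable names (xive_map_esb_pages, esb_root) are illustrative and not
taken from the actual patch:

#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "exec/memory.h"
#include <sys/mman.h>

/* Illustrative only: wrap an mmap() of ESB pages in a 'ram device'
 * memory region and plug it into the guest ESB address space. */
static void xive_map_esb_pages(Object *owner, MemoryRegion *esb_root,
                               int fd, hwaddr offset, uint64_t size)
{
    MemoryRegion *mr = g_new0(MemoryRegion, 1);
    void *addr;

    /* mmap() the ESB pages exposed by the device fd */
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        error_report("XIVE: unable to mmap ESB pages");
        return;
    }

    /* The 'ram device' region maps the host pointer straight into
     * the guest real address space */
    memory_region_init_ram_device_ptr(mr, owner, "xive-esb", size, addr);
    memory_region_add_subregion(esb_root, offset, mr);
}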

> I would see it as a
> single virtual area mapping the ESB pages of guest hwirqs that are in
> use, and we manage those mappings with mmap and munmap.

Yes, I think I understand the idea. I will give it a try. I need to find 
the right place to do so in QEMU. See the other email thread.
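
If I understand the suggestion correctly, on the QEMU side it would boil
down to something like the sketch below, assuming the PHB4 ESB pages of
the passed-through interrupt can be mmap()ed from a VFIO region. The fds,
offsets and function names are purely illustrative:

#include <sys/mman.h>
#include <unistd.h>

/* Illustrative only: one big mapping covering the guest IRQ number
 * space, backed by the generic XIVE IPI ESB pages */
void *map_guest_esb_space(int kvm_xive_fd, size_t nr_irqs)
{
    size_t page = sysconf(_SC_PAGESIZE);

    return mmap(NULL, nr_irqs * 2 * page, PROT_READ | PROT_WRITE,
                MAP_SHARED, kvm_xive_fd, 0);
}

/* Overlay the HW ESB pages of a passed-through interrupt on top of
 * the base mapping, at the guest IRQ's offset */
int overlay_device_esb(void *base, unsigned long guest_irq,
                       int vfio_fd, off_t vfio_esb_offset)
{
    size_t page = sysconf(_SC_PAGESIZE);
    void *addr = (char *)base + guest_irq * 2 * page;

    addr = mmap(addr, 2 * page, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, vfio_fd, vfio_esb_offset);
    return addr == MAP_FAILED ? -1 : 0;
}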

>> This is really a different approach.
>>
>>>> which means that we need to provision an address space/container 
>>>> region for these regions. What are the benefits? 
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe? 
>>>
>>> Explicitly unmapping parts of the VMA seems like the wrong way to do
>>> it.  If you have a device mmap where the device wants to change the
>>> physical page underlying parts of the mapping, there should be a way
>>> for it to do that explicitly (but I don't know off the top of my head
>>> what the interface to do that is).
>>>
>>> However, I still haven't seen a clear and concise explanation of what
>>> is being changed, when, and why we need to do that.
>>
>> Yes. I agree on that. The problem is not very different from what we 
>> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
>>
>>
>> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
>> implement an IRQ space for the guest using the generic IPI interrupts 
>> of the XIVE IC controller. These interrupts are allocated at the OPAL
>> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
>> Interrupt management is performed in the XIVE way: using loads and 
>> stores on the addresses of the XIVE IPI interrupt ESB pages.
>>
>> Both KVM devices share the same internal structure caching information 
>> on the interrupts, among which the xive_irq_data struct containing the 
>> addresses of the IPI ESB pages and an extra one in case of passthrough. 
>> The latter contains the addresses of the ESB pages of the underlying HW 
>> controller interrupts, PHB4 in all cases for now.
>>
>> A guest running in the XICS legacy interrupt mode lets the KVM 
>> XICS-over-XIVE device "handle" interrupt management, that is, perform 
>> the loads and stores on the addresses of the ESB pages of the guest 
>> interrupts. 
>>
>> However, when running in XIVE native exploitation mode, the KVM XIVE 
>> native device exposes the interrupt ESB pages to the guest and lets 
>> the guest perform the loads and stores directly. 
>>
>> The VMA exposing the ESB pages makes use of a custom VM fault handler
>> whose role is to populate the VMA with the appropriate pages. When a
>> fault occurs, the guest IRQ number is deduced from the offset, and the
>> ESB pages of the associated XIVE IPI interrupt are inserted in the VMA
>> (using the internal structure caching information on the interrupts).
>>
>> Supporting device passthrough in a guest running in XIVE native 
>> exploitation mode adds some extra refinements because the ESB pages 
>> of a different HW controller (PHB4) need to be exposed to the guest 
>> along with the initial IPI ESB pages of the XIVE IC controller. But
>> the overall mechanism is the same. 
>>
>> When the device HW irqs are mapped into or unmapped from the guest
>> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
>> and kvmppc_xive_clr_mapped(), are called to record or clear the 
>> passthrough interrupt information and to perform the switch.
>>
>> The approach taken by this patch is to clear the ESB pages of the 
>> guest IRQ number being mapped and let the VM fault handler repopulate 
>> them. The handler will insert the ESB page corresponding to the HW 
>> interrupt of the device being passed through, or the initial IPI ESB 
>> page if the device is being removed.
> 
> That's a much better write-up.  Thanks.

OK. I will reuse it when the time comes.
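
For the record, a minimal sketch of the fault handler described above
could look like the code below. The helper names (kvmppc_xive_find_source(),
kvmppc_xive_select_irq()) and the layout of two ESB pages per guest IRQ
are modelled on the existing XICS-over-XIVE code and are used here for
illustration only:

#include <linux/kvm_host.h>
#include <linux/mm.h>

/* Illustrative sketch only: populate the ESB VMA on fault.  The guest
 * IRQ number is deduced from the page offset (two ESB pages, trigger
 * and EOI, per guest IRQ) and the matching hardware ESB page is
 * inserted in the VMA. */
static vm_fault_t xive_native_esb_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct kvm_device *dev = vma->vm_file->private_data;
	struct kvmppc_xive *xive = dev->private;
	struct kvmppc_xive_src_block *sb;
	struct kvmppc_xive_irq_state *state;
	struct xive_irq_data *xd;
	unsigned long irq;
	u32 hw_num;
	u16 src;
	u64 page;

	irq = vmf->pgoff / 2;		/* 2 ESB pages per guest IRQ */

	sb = kvmppc_xive_find_source(xive, irq, &src);
	if (!sb)
		return VM_FAULT_SIGBUS;

	state = &sb->irq_state[src];

	/* Picks the passthrough HW irq data if the source is mapped,
	 * the generic IPI data otherwise */
	kvmppc_xive_select_irq(state, &hw_num, &xd);

	page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
	if (WARN_ON(!page))
		return VM_FAULT_SIGBUS;

	vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
	return VM_FAULT_NOPAGE;
}

When kvmppc_xive_set_mapped() or kvmppc_xive_clr_mapped() switches a
guest IRQ between the IPI and the HW device interrupt, clearing the pages
already installed in the VMA (e.g. with unmap_mapping_range() on the
device mapping) should be enough to let the next guest access fault and
pick up the new ESB page.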

Thanks,

C. 


