[PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
Cédric Le Goater
clg at kaod.org
Thu Jan 31 02:54:23 AEDT 2019
On 1/30/19 7:20 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 3:45 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>>>> Would we end up with too many VMAs if we just used mmap() to
>>>>> change the mappings from the software-generated pages to the
>>>>> hardware-generated interrupt pages?
>>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are
>>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can
>>>
>>> Confused. You say the number space has 32768 entries but then imply
>>> there are only 8K entries. Do you mean that the API allows for 15-bit
>>> IRQ numbers but we are only making use of 8192 of them?
>>
>> Ouch. My bad. Let's do it again.
>>
>> The sPAPR IRQ number space is 0x2000 wide:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
>>
>> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for
>> devices (which can be extended if needed).
>>
>> So that's 8192 x 2 ESB pages.
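
To expand on the layout: each interrupt comes with two ESB pages, a
trigger page and a management (EOI) page, indexed by the guest IRQ
number. Roughly, with illustrative constants (not the actual QEMU/KVM
symbol names):

    /* Guest IRQ number space (illustrative names) */
    #define SPAPR_IRQ_NR_IRQS      0x2000  /* 8192 guest IRQ numbers     */
    #define SPAPR_IRQ_IPI_BASE     0x0000  /* first 4K: one IPI per vCPU */
    #define SPAPR_IRQ_DEVICE_BASE  0x1000  /* last 4K: device/MSI IRQs   */

    /* Two ESB pages (management + trigger) per interrupt, so the whole
     * ESB window to expose to the guest is: */
    #define ESB_WINDOW_SIZE  (SPAPR_IRQ_NR_IRQS * 2 * PAGE_SIZE)
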
>>
>>>> extend the last range if needed as these are for MSIs. Dynamic
>>>> extension under KVM should also work.
>>>>
>>>> This is to say that we would have 8K x 2 (trigger+EOI) pages. That is a
>>>> lot of mmap() calls, too many. Also the KVM model needs to be compatible
>>>
>>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
>>> space would be covered by a single mmap, overlaid by subsequent mmaps
>>> where we need to map real device interrupts.
>>
>> ok. The same fault handler could be used to populate the VMA with the
>> ESB pages.
>>
>> But it would mean extra work on the QEMU side, which is not needed
>> with this patch.
>
> Maybe, but just storing a single vma pointer in our private data is
> not a feasible approach. First, you have no control on the lifetime
> of the vma and thus this is a use-after-free waiting to happen, and
> secondly, there could be multiple vmas that you need to worry about.
I fully agree. That's why I was uncomfortable with the solution. There
are a few other drivers (GPU drivers, if I recall correctly) doing that, but it feels wrong.
> Userspace could do multiple mmaps, or could do mprotect or similar on
> part of an existing mmap, which would split the vma for the mmap into
> multiple vmas. You don't get notified about munmap either as far as I
> can tell, so the vma is liable to go away.
yes ...
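
And it takes very little to trigger from userspace. Something as simple
as this (sketch, reusing the illustrative ESB_WINDOW_SIZE from above,
'xive_fd' being the fd backing the ESB mmap and 'page_size' the host
page size) splits or drops VMAs without the kernel side being told:

    #include <sys/mman.h>

    void *p = mmap(NULL, ESB_WINDOW_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, xive_fd, 0);
    /* splits the single VMA in three; a cached vma pointer in the
     * kernel now covers only part of the mapping ... */
    mprotect((char *)p + page_size, page_size, PROT_READ);
    /* ... or points to a freed vma after this */
    munmap((char *)p + 4 * page_size, page_size);
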
> And it doesn't matter if
> QEMU would never do such things; if userspace can provoke a
> use-after-free in the kernel using KVM then someone will write a
> specially crafted user program to do that.
>
> So we either solve it in userspace, or we have to write and maintain
> complex kernel code with deep links into the MM subsystem.
>
> I'd much rather solve it in userspace.
OK, then. I have been reluctant to do so, but it seems there is no
other in-kernel solution.
>>>> with the QEMU emulated one and it was simpler to have one overall
>>>> memory region for the IPI ESBs, one for the END ESBs (if we support
>>>> that one day) and one for the TIMA.
>>>>
>>>>> Are the necessary pages for a PCI
>>>>> passthrough device contiguous in both host real space
>>>>
>>>> They should be, as they are the PHB4 ESBs.
>>>>
>>>>> and guest real space ?
>>>>
>>>> also. That's how we organized the mapping.
>>>
>>> "How we organized the mapping" is a significant design decision that I
>>> haven't seen documented anywhere, and is really needed for
>>> understanding what's going on.
>>
>> OK. I will add comments on that. See below for some description.
>>
>> There is nothing fancy, it's simply indexed by the interrupt number,
>> like for HW, or for the QEMU emulated XIVE model.
>>
>>>>> If so we'd only need one mmap() for all the device's interrupt
>>>>> pages.
>>>>
>>>> Ah. So we would want to make a special case for the passthrough
>>>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>>>
>>>> Wouldn't that require hot plugging a memory region under the guest?
>>>
>>> No; the way that a memory region works is that userspace can do
>>> whatever disparate mappings it likes within the region on the user
>>> process side, and the corresponding region of guest real address space
>>> follows that automatically.
>>
>> yes. I suppose this should work also for 'ram device' memory mappings.
>>
>> So when the passthrough device is added to the guest, we would add a
>> new 'ram device' memory region for the device interrupt ESB pages
>> that would overlap with the initial guest ESB pages.
>
> Not knowing the QEMU internals all that well, I don't at all
> understand why a new ram device is necessary.
'ram device' memory regions are a special type of region used to map
the result of a mmap() done in QEMU directly into the guest.
This is how we propagate the XIVE ESB pages from HW (and the TIMA)
to the guest and the Linux kernel. It has other uses with VFIO.
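
To be more concrete, this is roughly what the QEMU side does with the
result of the mmap() of the ESB pages (simplified sketch; the mmap
offset constant is made up):

    /* Build a 'ram device' MemoryRegion on top of the kernel mmap() of
     * the ESB pages, so that guest accesses go straight to the pages. */
    void *esb = mmap(NULL, esb_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     xive->fd, KVM_XIVE_ESB_PAGE_OFFSET /* made up */);
    memory_region_init_ram_device_ptr(&xive->esb_mmio, OBJECT(xive),
                                      "xive.esb", esb_len, esb);
    /* the region is then added to the guest address space like any
     * other MemoryRegion */
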
> I would see it as a
> single virtual area mapping the ESB pages of guest hwirqs that are in
> use, and we manage those mappings with mmap and munmap.
Yes, I think I understand the idea. I will give it a try. I need to find the
right place to do so in QEMU. See the other email thread.
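
If I read your suggestion correctly, the QEMU flow for a passthrough
interrupt would then look something like the below (sketch only; fds,
offsets and the VFIO region are placeholders):

    /* one big mapping of the IPI ESB pages backing the whole window */
    void *esb = mmap(NULL, ESB_WINDOW_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, xive_fd, 0);

    /* when guest irq 'girq' gets backed by a real device interrupt,
     * overlay its two ESB pages with the device's, e.g. coming from a
     * (hypothetical) VFIO region fd; reverting would be a MAP_FIXED
     * re-mmap of the original xive_fd range */
    mmap((char *)esb + girq * 2 * page_size, 2 * page_size,
         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
         vfio_esb_fd, device_esb_offset);
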
>> This is really a different approach.
>>
>>>> which means that we need to provision an address space/container
>>>> region for these regions. What are the benefits?
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe?
>>>
>>> Explicitly unmapping parts of the VMA seems like the wrong way to do
>>> it. If you have a device mmap where the device wants to change the
>>> physical page underlying parts of the mapping, there should be a way
>>> for it to do that explicitly (but I don't know off the top of my head
>>> what the interface to do that is).
>>>
>>> However, I still haven't seen a clear and concise explanation of what
>>> is being changed, when, and why we need to do that.
>>
>> Yes. I agree on that. The problem is not very different from what we
>> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
>>
>>
>> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
>> implement an IRQ space for the guest using the generic IPI interrupts
>> of the XIVE IC controller. These interrupts are allocated at the OPAL
>> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
>> Interrupt management is performed in the XIVE way: using loads and
>> stores on the addresses of the XIVE IPI interrupt ESB pages.
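
To expand on what "loads and stores" means here: the ESB management
page implements the PQ bit protocol through magic load/store offsets,
along the lines of (see arch/powerpc/include/asm/xive-regs.h for the
authoritative definitions):

    #define XIVE_ESB_STORE_EOI   0x400  /* store: EOI the interrupt         */
    #define XIVE_ESB_LOAD_EOI    0x000  /* load : EOI and return PQ         */
    #define XIVE_ESB_GET         0x800  /* load : return PQ, no side effect */
    #define XIVE_ESB_SET_PQ_00   0xc00  /* load : set PQ to 00 (enabled)    */
    #define XIVE_ESB_SET_PQ_01   0xd00  /* load : set PQ to 01              */
    #define XIVE_ESB_SET_PQ_10   0xe00  /* load : set PQ to 10 (masked)     */
    #define XIVE_ESB_SET_PQ_11   0xf00  /* load : set PQ to 11              */
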
>>
>> Both KVM devices share the same internal structure caching information
>> on the interrupts, including the xive_irq_data struct which contains the
>> addresses of the IPI ESB pages, plus an extra one in case of passthrough.
>> The latter contains the addresses of the ESB pages of the underlying HW
>> controller interrupts, a PHB4 in all cases for now.
>>
>> A guest running in the legacy XICS interrupt mode lets the KVM
>> XICS-over-XIVE device "handle" interrupt management, that is, perform
>> the loads and stores on the addresses of the ESB pages of the guest
>> interrupts.
>>
>> However, when running in XIVE native exploitation mode, the KVM XIVE
>> native device exposes the interrupt ESB pages to the guest and lets
>> the guest perform the loads and stores directly.
>>
>> The VMA exposing the ESB pages makes use of a custom VM fault handler
>> whose role is to populate the VMA with the appropriate pages. When a fault
>> occurs, the guest IRQ number is deduced from the page offset, and the ESB
>> pages of the associated XIVE IPI interrupt are inserted in the VMA (using
>> the internal structure caching information on the interrupts).
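
In simplified kernel C, the handler looks something like this (locking
and error handling omitted; the lookup helper and some names are
approximations, not the exact code of the patch):

    static vm_fault_t xive_native_esb_fault(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            struct kvm_device *dev = vma->vm_file->private_data;
            struct kvmppc_xive *xive = dev->private;
            struct xive_irq_data *xd;
            u64 page_pa;

            /* two ESB pages (management + trigger) per guest IRQ */
            u64 irq = vmf->pgoff / 2;

            /* hypothetical helper returning the cached xive_irq_data,
             * either the IPI one or the passthrough (PHB4) one */
            xd = xive_native_get_irq_data(xive, irq);
            if (!xd)
                    return VM_FAULT_SIGBUS;

            /* odd page -> trigger page, even page -> management (EOI) page */
            page_pa = (vmf->pgoff & 1) ? xd->trig_page : xd->eoi_page;
            if (!page_pa)
                    return VM_FAULT_SIGBUS;

            vmf_insert_pfn(vma, vmf->address, page_pa >> PAGE_SHIFT);
            return VM_FAULT_NOPAGE;
    }
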
>>
>> Supporting device passthrough in a guest running in XIVE native
>> exploitation mode adds some extra refinements because the ESB pages
>> of a different HW controller (PHB4) need to be exposed to the guest
>> along with the initial IPI ESB pages of the XIVE IC controller. But
>> the overall mechanics are the same.
>>
>> When the device HW irqs are mapped into or unmapped from the guest
>> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
>> and kvmppc_xive_clr_mapped(), are called to record or clear the
>> passthrough interrupt information and to perform the switch.
>>
>> The approach taken by this patch is to clear the ESB pages of the
>> guest IRQ number being mapped and to let the VM fault handler repopulate
>> them. The handler will insert the ESB pages corresponding to the HW
>> interrupt of the device being passed through, or the initial IPI ESB
>> pages if the device is being removed.
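
For completeness: if the clearing stays in the kernel, it could be done
without caching a vma pointer, by recording the mapping's address_space
at mmap time and using unmap_mapping_range(). Sketch only; the
'xive->mapping' field is an assumption, not something in the current
patch:

    int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
                               struct irq_desc *host_desc)
    {
            struct kvmppc_xive *xive = kvm->arch.xive;

            /* ... record the passthrough xive_irq_data for guest_irq ... */

            /* invalidate the two ESB pages of this guest IRQ so that the
             * next guest access faults and gets the PHB4 pages instead of
             * the initial IPI pages */
            if (xive->mapping)
                    unmap_mapping_range(xive->mapping,
                                        guest_irq * 2 * PAGE_SIZE,
                                        2 * PAGE_SIZE, 1);
            return 0;
    }
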
>
> That's a much better write-up. Thanks.
OK. I will reuse it when the time comes.
Thanks,
C.