[PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device

Mon Feb 11 13:38:42 AEDT 2019

On Sat, Feb 09, 2019 at 10:41:38AM +0100, Cédric Le Goater wrote:
> On 2/8/19 10:53 PM, Paul Mackerras wrote:
> > On Fri, Feb 08, 2019 at 08:58:14AM +0100, Cédric Le Goater wrote:
> >> On 2/8/19 6:15 AM, David Gibson wrote:
> >>> On Thu, Feb 07, 2019 at 10:03:15AM +0100, Cédric Le Goater wrote:
> >>>> That's the plan I have in mind as suggested by Paul if I understood it well.
> >>>> The mechanics are more complex than the patch zapping the PTEs from the VMA
> >>>> but it's also safer.
> >>>
> >>> Well, yes, where "safer" means "has the possibility to be correct".
> >>
> >> Well, the only problem with the kernel approach is keeping a pointer on 
> >> the VMA. If we could call find_vma(), it would be perfectly safe and much
> >> simpler.
> > 
> > You seem to be assuming that the kernel can easily work out a single
> > virtual address which will be the only place where a given set of
> > interrupt pages are mapped.  But that is really not possible in the
> > general case, because userspace could have mapped the fd at many
> > different offsets in many different places.
> > 
> > QEMU doesn't do that; in QEMU, the mmaps are sufficiently limited that
> > it can work out a single virtual address that needs to be changed.
> > The way that QEMU should tell the kernel what that address is and what
> > the mapping should be changed to, is via the existing munmap()/mmap()
> > interface.
> 
> Yes. We agreed on that. QEMU should handle these mappings somewhere in 
> VFIO. It's me grumbling, that's all.
> 
> The discussion has moved to the mmap() interface of the KVM device. The
> current proposal adds controls on the device that create fds to mmap() the
> TIMA pages and the ESB pages. David is proposing to use the fd of the KVM
> device directly to mmap() these pages, with a different offset for each
> set.
> 
> I think that should work pretty well, for passthrough also. The fault 
> handler should take care of populating the VMA(s) with the appropriate 
> pages. 
> 
> We might support END notification one day, so we should have room for 
> these pages. And nested might require IRQ space extensions at L1.
> Something to keep in mind.

I had some more thoughts on this topic.  I think there's been some
confusion because there are more ways of tackling this than I
previously realized:

1) All in kernel

The offset always maps directly to guest irq number and the kernel
somehow binds it either to an IPI or a host irq as necessary.
Cédric's original code attempts this, but the mechanism of keeping a
pointer to the VMA can't work.

But remapping the irqs should be sufficiently infrequent that it
might be ok to consider simply stepping through all the hosting
process's VMAs to do this.
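
To make the "step through all the VMAs" idea concrete, here is a toy
userspace model of it, not kernel code.  struct toy_vma, TOY_PAGE_SHIFT
and toy_find_mappings() are names made up for this sketch, and the 64K
page size is an assumption.  The point it illustrates is that one page
of the fd can show up at several virtual addresses, so a single cached
VMA pointer cannot cover them all.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a VMA. This is NOT the kernel's struct vm_area_struct. */
struct toy_vma {
    unsigned long start;  /* first virtual address covered */
    unsigned long end;    /* one past the last address covered */
    int file_id;          /* hypothetical id of the backing file/fd */
    unsigned long pgoff;  /* file offset of 'start', in pages */
};

#define TOY_PAGE_SHIFT 16 /* assumption: 64K pages */

/*
 * Walk every VMA and record the virtual address of each mapping of page
 * 'pgoff' of file 'file_id'.  At most 'max' addresses are stored in 'out';
 * the return value is the total number of mappings found.
 */
size_t toy_find_mappings(const struct toy_vma *vmas, size_t n,
                         int file_id, unsigned long pgoff,
                         unsigned long *out, size_t max)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        const struct toy_vma *v = &vmas[i];
        unsigned long npages;

        assert(v->end >= v->start);
        npages = (v->end - v->start) >> TOY_PAGE_SHIFT;

        if (v->file_id != file_id)
            continue;               /* some other file */
        if (pgoff < v->pgoff || pgoff >= v->pgoff + npages)
            continue;               /* page not inside this VMA */

        if (hits < max)
            out[hits] = v->start + ((pgoff - v->pgoff) << TOY_PAGE_SHIFT);
        hits++;
    }
    return hits;
}
```

In the real kernel the walk would also need the appropriate mm locking
and would zap or re-fault the PTEs rather than just collect addresses;
none of that is modelled here.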

2) Remapped in qemu (using memory regions)

I _think_ (in hindsight) this is what Cédric's been discussing as the
alternative in more recent posts.

Qemu maps the IPI pages at one place and the passthrough IRQ pages
somewhere else.  The IPIs are mapped into the guest as one memory
region, then any passthrough IRQ pages are mapped over that using
overlapping memory regions.

I don't think this approach will work well, because it could require a
bunch of separate KVM memory slots, which are fairly scarce.
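
A rough way to see the slot cost, as a toy model only: assume every
maximal run of identically backed irq pages ends up as its own flat
range, and that each flat range needs its own memory slot.  Both are
simplifying assumptions for this sketch, and toy_count_ranges() is a
made-up helper, not a qemu or KVM function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * passthrough[i] is true when guest irq page i is backed by a passthrough
 * (host irq) page, false when it is backed by an IPI page.  Returns the
 * number of maximal runs of identically backed pages, which this toy
 * model treats as the number of memory slots required.
 */
size_t toy_count_ranges(const bool *passthrough, size_t npages)
{
    size_t ranges = 0;

    assert(passthrough != NULL || npages == 0);

    for (size_t i = 0; i < npages; i++) {
        /* A new range starts at page 0 and wherever the backing changes. */
        if (i == 0 || passthrough[i] != passthrough[i - 1])
            ranges++;
    }
    return ranges;
}
```

Under these assumptions, two isolated passthrough irqs in the middle of
an otherwise IPI-backed region already turn one range into five.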

3) Remapped in qemu (using mmap())

This is the approach I (and I think Paul) have been suggesting in
contrast to (1).

Qemu maps the IPI pages and maps those into the guest.  When we need
to set up a passthrough IRQ, qemu mmap()s its pages directly over the
IPI pages, and it remains mapped into the guest with the same memory
region / memslot as the IPIs are already using.  If the passthrough
device is removed we have to remap the IPI pages back into place.
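
A minimal userspace sketch of that overlay-and-restore sequence, using
mmap() with MAP_FIXED.  Two temporary files stand in for the IPI pages
and the passthrough pages; the real thing would mmap() the KVM device
or VFIO fds, whose offsets and layout are not modelled here.
make_backing() and demo_overlay() are names invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an unlinked temp file of 'len' bytes filled with 'fill'. */
static int make_backing(size_t len, char fill)
{
    FILE *f = tmpfile();
    int fd;
    void *p;

    if (!f)
        return -1;
    fd = dup(fileno(f));        /* keep the file alive after fclose() */
    fclose(f);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0)
        goto fail;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto fail;
    memset(p, fill, len);
    munmap(p, len);
    return fd;
fail:
    close(fd);
    return -1;
}

/*
 * Map 'npages' stand-in IPI pages, overlay page 'irq' with the stand-in
 * passthrough page, then put the IPI page back.  seen[0..2] receive the
 * first byte of that page before, during and after the overlay.
 * Returns 0 on success, -1 on failure.
 */
int demo_overlay(size_t npages, size_t irq, char seen[3])
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t pagesz, len;
    char *base = MAP_FAILED, *slot;
    int ipi_fd = -1, pt_fd = -1, ret = -1;
    off_t off;

    if (ps <= 0 || irq >= npages)
        return -1;
    pagesz = (size_t)ps;
    len = npages * pagesz;
    off = (off_t)(irq * pagesz);

    ipi_fd = make_backing(len, 'I');
    pt_fd = make_backing(len, 'P');
    if (ipi_fd < 0 || pt_fd < 0)
        goto out;

    base = mmap(NULL, len, PROT_READ, MAP_SHARED, ipi_fd, 0);
    if (base == MAP_FAILED)
        goto out;
    slot = base + irq * pagesz;
    seen[0] = slot[0];

    /* Passthrough set up: replace just this page, same virtual address. */
    if (mmap(slot, pagesz, PROT_READ, MAP_SHARED | MAP_FIXED,
             pt_fd, off) != (void *)slot)
        goto out;
    seen[1] = slot[0];
    /* The neighbouring page must still be the IPI page. */
    assert(irq == 0 || base[(irq - 1) * pagesz] == 'I');

    /* Passthrough device removed: map the IPI page back into place. */
    if (mmap(slot, pagesz, PROT_READ, MAP_SHARED | MAP_FIXED,
             ipi_fd, off) != (void *)slot)
        goto out;
    seen[2] = slot[0];
    ret = 0;
out:
    if (base != MAP_FAILED)
        munmap(base, len);
    if (ipi_fd >= 0)
        close(ipi_fd);
    if (pt_fd >= 0)
        close(pt_fd);
    return ret;
}
```

Because the virtual address range never changes, whatever is registered
on top of it (here, by analogy, the memory region / memslot) does not
need to be touched when the backing page is swapped.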

4) Dedicated irq numbers

We never re-use regular guest irq numbers for passthrough irqs,
instead we put them somewhere else and keep those mapped to the
passthrough irq pages.

I was favouring this approach, but it does mean there will be a
guest-visible difference between kernel_irqchip=on and off, which
isn't great.

(1) is the most elegant _interface_, but as we've seen it's
problematic to implement.  Looking at the for_all_vmas() approach
could be interesting, but otherwise option (3) might be the most
practical.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson