Using Restricted DMA for virtio-pci

David Woodhouse dwmw2 at infradead.org
Mon Mar 31 20:42:16 AEDT 2025


On Sun, 2025-03-30 at 17:48 -0400, Michael S. Tsirkin wrote:
> On Sun, Mar 30, 2025 at 10:27:58PM +0100, David Woodhouse wrote:
> > On 30 March 2025 18:06:47 BST, "Michael S. Tsirkin" <mst at redhat.com> wrote:
> > > > It's basically just allowing us to expose through PCI, what I believe
> > > > we can already do for virtio in DT.
> > > 
> > > I am not saying I am against this extension.
> > > The idea to restrict DMA has a lot of merit outside pkvm.
> > > For example, with a physical device, limiting its DMA
> > > to a fixed range can be good for security at a cost of
> > > an extra data copy.
> > > 
> > > So I am not saying we have to block this specific hack.
> > > 
> > > what worries me fundamentally is I am not sure it works well
> > > e.g. for physical virtio cards.
> > 
> > Not sure why it doesn't work for physical cards. They don't need to
> > be bus-mastering; they just take data from a buffer in their own
> > RAM.
> 
> I mean, it kind of does work; it is just that the CPU pulling data over
> the PCI bus stalls it, so it is very expensive. It is not by chance that
> people switched to DMA almost exclusively.

Yes. For a physical implementation it would not be the highest-
performance option... unless DMA is somehow blocked, as it is in the
pKVM+virt case.

In the case of a virtual implementation, however, the performance is
not an issue because it'll be backed by host memory anyway. (It's just
that because it's presented to the guest and the trusted part of the
hypervisor as PCI BAR space instead of main memory, it's a whole lot
more practical to deal with the fact that it's *shared* with the VMM.)

> > > Attempts to pass data between devices will now also require
> > > extra data copies.
> > 
> > Yes. I think that's acceptable, but if we really cared we could
> > perhaps extend the capability to refer to a range inside a given
> > BAR on a specific *device*? Or maybe just *function*, and allow
> > sharing of SWIOTLB buffer within a multi-function device?
> 
> Fundamentally, this is what dmabuf does.

In software, yes. Extending it to hardware is a little harder.

In principle, it might be quite nice to offer a single SWIOTLB buffer
region (in a BAR of one device) and have multiple virtio devices share
it. Not just because of passing data between devices, as you mentioned,
but also because it'll be a more efficient use of memory than each
device having its own buffer and allocation pool.

So how would a device indicate that it can use a SWIOTLB buffer which
is in a BAR of a *different* device?

Not by physical address, because BARs get moved around.
Not even by PCI bus/dev/fn/BAR# because *buses* get renumbered.

You could limit it to sharing within one PCI "bus", and use just
dev/fn/BAR#? Or even within one PCI device and just fn/BAR#? The latter
could theoretically be usable by multi-function physical devices.

The standard struct virtio_pci_cap (which I used for
VIRTIO_PCI_CAP_SWIOTLB) just contains BAR and offset/length. We could
extend it with device + function, using -1 for 'self', to allow for
such sharing?
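
Concretely, perhaps something along these lines. The first structure is
just the existing one from the spec; the extension below it is purely
illustrative (the structure name, field layout and 0xffff-means-'self'
encoding are a sketch of the idea, not what I've actually posted):

    #include <stdint.h>

    typedef uint8_t  u8;
    typedef uint16_t le16;	/* little-endian on the wire */
    typedef uint32_t le32;	/* little-endian on the wire */

    /* Existing generic capability, as in the current spec. */
    struct virtio_pci_cap {
            u8 cap_vndr;	/* Generic PCI field: PCI_CAP_ID_VNDR */
            u8 cap_next;	/* Generic PCI field: next ptr. */
            u8 cap_len;		/* Generic PCI field: capability length */
            u8 cfg_type;	/* Identifies the structure. */
            u8 bar;		/* Where to find it. */
            u8 id;		/* Multiple capabilities of the same type */
            u8 padding[2];	/* Pad to full dword. */
            le32 offset;	/* Offset within bar. */
            le32 length;	/* Length of the structure, in bytes. */
    };

    /* Hypothetical extension for VIRTIO_PCI_CAP_SWIOTLB: identify which
     * function's BAR actually holds the buffer. Illustrative only. */
    struct virtio_pci_swiotlb_cap {
            struct virtio_pci_cap cap;	/* bar/offset/length, as today */
            le16 devfn;			/* dev/fn owning the BAR; 0xffff == self */
            u8 reserved[2];		/* Pad to full dword. */
    };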

Still not convinced it isn't overkill, but it's certainly easy enough
to add on the *spec* side. I haven't yet looked at how that sharing
would work in Linux on the guest side; thus far what I'm proposing is
intended to be almost identical to the per-device thing that should
already work with a `restricted-dma-pool` node in device-tree.
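
For the simple per-device case, the guest-side shape would be roughly
the sketch below. The two vp_* helpers are placeholders rather than
existing kernel API; the real thing would reuse virtio-pci's existing
capability walker and the restricted-DMA/swiotlb plumbing we already
have for the device-tree case:

    /* Sketch only: vp_find_swiotlb_cap() and vp_register_restricted_pool()
     * are placeholder names, not existing kernel functions. */
    static int vp_setup_swiotlb(struct virtio_pci_device *vp_dev)
    {
            struct pci_dev *pci_dev = vp_dev->pci_dev;
            u8 bar;
            u32 offset, length;

            /* Look for VIRTIO_PCI_CAP_SWIOTLB among the vendor capabilities. */
            if (!vp_find_swiotlb_cap(pci_dev, &bar, &offset, &length))
                    return 0;	/* No on-device buffer; behave as today. */

            /*
             * Register that window of the BAR as this device's restricted DMA
             * (bounce-buffer) pool, just as a `restricted-dma-pool` node in
             * device-tree would have done, so that DMA mapping for this device
             * bounces through device memory rather than arbitrary guest RAM.
             */
            return vp_register_restricted_pool(&pci_dev->dev,
                                               pci_resource_start(pci_dev, bar) + offset,
                                               length);
    }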

> > I think it's overkill though.
> > 
> > > Did you think about adding an swiotlb mode to virtio-iommu at all?
> > > Much easier than parsing page tables.
> > 
> > Often the guests which need this will have a real IOMMU for the true
> > pass-through devices.
> 
> Not sure I understand. You mean with things like stage 2 passthrough?

Yes. AMD's latest IOMMU spec documents it, for example: exposing a
'vIOMMU' to the guest which handles just stage 1 (IOVA→GPA) while the
hypervisor controls the normal GPA→HPA translation in stage 2.

Then the guest gets an accelerated path *directly* to the hardware for
its IOTLB flushes... which means the hypervisor doesn't get to *see*
those IOTLB flushes so it's a PITA to do device emulation as if it's
covered by that same IOMMU.

(Actually I haven't checked the AMD one in detail for that flaw; most
*other* 2-stage IOMMUs I've seen do have it, and I *bet* AMD does too).

> > Adding a virtio-iommu into the mix (or any other
> > system-wide way of doing something different for certain devices) is
> > problematic.
> 
> OK... but the issue isn't specific to no-DMA devices, is it?

Hm? Allowing virtio devices to operate as "no-DMA devices" is a
*workaround* for the issue.

The issue is that the VMM may not have full access to the guest's
memory for emulating devices. These days, virtio covers a large
proportion of emulated devices.

So I do think the issue is fairly specific to virtio devices, and I
suspect that's what you meant to type above?

We pondered teaching the trusted part of the hypervisor (e.g. pKVM) to
snoop on virtqueues enough to 'know' which memory the VMM was genuinely
being *invited* to read/write... and we ran away screaming. (In order
to have sufficient trust, you end up not just snooping but implementing
quite a lot of the emulation on the trusted side. And then complex
enlightenments in the VMM and the untrusted Linux/KVM which hosts it,
to interact with that.)

Then we realised that for existing DT guests it's trivial just to add
the `restricted-dma-pool` node, and we wanted to do the same for the
guests who are afflicted with UEFI/ACPI too. So here we are, trying to
add the same capability to virtio-pci.

> > The on-device buffer keeps it nice and simple,
> 
> I am not saying it is not.
> It's just a little boutique.

Fair. Although with the advent of confidential computing and
restrictions on guest memory access, perhaps it's becoming less
boutique over time?

And it should also be fairly low-friction; it's a whole lot cleaner in
the spec than the awful VIRTIO_F_ACCESS_PLATFORM legacy, and even in
the Linux guest driver it should work fairly simply given the existing
restricted-dma support (although of course that shouldn't entirely be
our guiding motivation).

> > and even allows us to
> > do device support for operating systems like Windows where it's a lot
> > harder to do anything generic in the core OS.
> 
> Well, we do need virtio-iommu Windows support sooner or later, anyway.

Heh, good luck with that :)

And actually, doesn't that only support *DMA* remapping? So you still
wouldn't be able to boot a Windows guest with >255 vCPUs without some
further enlightenment (like Windows guests finally supporting the 15-
bit MSI extension that even Hyper-V supports on the host side...)

 