[PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB

Thu Apr 16 12:53:41 AEST 2020

On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall at gmail.com> wrote:
>
> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
> >
> > Anyone? Is it totally useless or wrong approach? Thanks,
>
> I wouldn't say it's either, but I still hate it.
>
> The 4GB mode being per-PHB makes it difficult to use unless we force
> that mode on 100% of the time which I'd prefer not to do. Ideally
> devices that actually support 64bit addressing (which is most of them)
> should be able to use no-translate mode when possible since a) It's
> faster, and b) It frees up room in the TCE cache devices that actually
> need them. I know you've done some testing with 100G NICs and found
> the overhead was fine, but IMO that's a bad test since it's pretty
> much the best-case scenario since all the devices on the PHB are in
> the same PE. The PHB's TCE cache only hits when the TCE matches the
> DMA bus address and the PE number for the device so in a multi-PE
> environment there's a lot of potential for TCE cache trashing. If
> there was one or two PEs under that PHB it's probably not going to
> matter, but if you have an NVMe rack with 20 drives it starts to look
> a bit ugly.
>
> That all said, it might be worth doing this anyway since we probably
> want the software infrastructure in place to take advantage of it.
> Maybe expand the command line parameters to allow it to be enabled on
> a per-PHB basis rather than globally.

Since we're on the topic

I've been thinking the real issue we have is that we're trying to pick
an "optimal" IOMMU config at a point where we don't have enough
information to work out what's actually optimal. The IOMMU config is
done on a per-PE basis, but since PEs may contain devices with
different DMA masks (looking at you wierd AMD audio function) we're
always going to have to pick something conservative as the default
config for TVE#0 (64k, no bypass mapping) since the driver will tell
us what the device actually supports long after the IOMMU configuation
is done. What we really want is to be able to have separate IOMMU
contexts for each device, or at the very least a separate context for
the crippled devices.

We could allow a per-device IOMMU context by extending the Master /
Slave PE thing to cover DMA in addition to MMIO. Right now we only use
slave PEs when a device's MMIO BARs extend over multiple m64 segments.
When that happens an MMIO error causes the PHB to freezes the PE
corresponding to one of those segments, but not any of the others. To
present a single "PE" to the EEH core we check the freeze status of
each of the slave PEs when the EEH core does a PE status check and if
any of them are frozen, we freeze the rest of them too. When a driver
sets a limited DMA mask we could move that device to a seperate slave
PE so that it has it's own IOMMU context taylored to its DMA
addressing limits.

Thoughts?

Oliver