[PATCH kernel v2 0/7] powerpc/powernv/ioda: Allow huge DMA window at 4GB

Alexey Kardashevskiy aik at ozlabs.ru
Fri Apr 17 15:47:27 AEST 2020



On 17/04/2020 11:26, Russell Currey wrote:
> On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
>> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall at gmail.com>
>> wrote:
>>> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <
>>> aik at ozlabs.ru> wrote:
>>>> Anyone? Is it totally useless or a wrong approach? Thanks,
>>>
>>> I wouldn't say it's either, but I still hate it.
>>>
>>> The 4GB mode being per-PHB makes it difficult to use unless we
>>> force that mode on 100% of the time, which I'd prefer not to do.
>>> Ideally devices that actually support 64-bit addressing (which is
>>> most of them) should be able to use no-translate mode when
>>> possible, since a) it's faster, and b) it frees up room in the TCE
>>> cache for devices that actually need it. I know you've done some
>>> testing with 100G NICs and found the overhead was fine, but IMO
>>> that's a bad test since it's pretty much the best-case scenario:
>>> all the devices on the PHB are in the same PE. The PHB's TCE cache
>>> only hits when the TCE matches both the DMA bus address and the PE
>>> number for the device, so in a multi-PE environment there's a lot
>>> of potential for TCE cache thrashing. If there were only one or
>>> two PEs under that PHB it probably wouldn't matter, but if you
>>> have an NVMe rack with 20 drives it starts to look a bit ugly.
>>>
>>> That all said, it might be worth doing this anyway since we
>>> probably want the software infrastructure in place to take
>>> advantage of it. Maybe expand the command-line parameters to allow
>>> it to be enabled on a per-PHB basis rather than globally.
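
(For the record, per-PHB enablement could look like the sketch below.
Everything here - the iommu_4gb_phbs parameter name, the mask variable -
is invented for illustration and is not what the patchset implements:

    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/string.h>

    /* Hypothetical: "iommu_4gb_phbs=0,2" latches the listed PHB ids
     * so only those PHBs get the 4GB mode forced on. */
    static unsigned long iommu_4gb_phb_mask;

    static int __init parse_iommu_4gb_phbs(char *str)
    {
            char *tok;

            while ((tok = strsep(&str, ",")) != NULL) {
                    unsigned long id;

                    if (!kstrtoul(tok, 0, &id) && id < BITS_PER_LONG)
                            iommu_4gb_phb_mask |= 1UL << id;
            }
            return 1;
    }
    __setup("iommu_4gb_phbs=", parse_iommu_4gb_phbs);

The PHB setup path would then consult iommu_4gb_phb_mask instead of a
single global switch.)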
>>
>> Since we're on the topic
>>
>> I've been thinking the real issue we have is that we're trying to
>> pick an "optimal" IOMMU config at a point where we don't have
>> enough information to work out what's actually optimal. The IOMMU
>> config is done on a per-PE basis, but since PEs may contain devices
>> with different DMA masks (looking at you, weird AMD audio function)
>> we're always going to have to pick something conservative as the
>> default config for TVE#0 (64k, no bypass mapping) since the driver
>> will tell us what the device actually supports long after the IOMMU
>> configuration is done. What we really want is to be able to have
>> separate IOMMU contexts for each device, or at the very least a
>> separate context for the crippled devices.
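
(To make the ordering concrete: the conservative TVE#0 setup happens
at PE assignment, while the mask only shows up when the driver probes.
A minimal sketch of that driver side - foo_probe is a made-up name,
the dma_* calls are the generic API:

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>
    #include <linux/pci.h>

    static int foo_probe(struct pci_dev *pdev,
                         const struct pci_device_id *id)
    {
            /* A crippled function may only handle 32-bit addresses;
             * by the time this runs, the PE's IOMMU config was
             * committed long ago. */
            if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
                    return -EIO;
            /* ... rest of probe ... */
            return 0;
    }

Nothing in the PE setup path can see that DMA_BIT_MASK(32) until the
driver is already bound.)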
>>
>> We could allow a per-device IOMMU context by extending the Master /
>> Slave PE thing to cover DMA in addition to MMIO. Right now we only
>> use slave PEs when a device's MMIO BARs extend over multiple M64
>> segments. When that happens, an MMIO error causes the PHB to freeze
>> the PE corresponding to one of those segments, but not any of the
>> others. To present a single "PE" to the EEH core we check the
>> freeze status of each of the slave PEs when the EEH core does a PE
>> status check, and if any of them are frozen we freeze the rest of
>> them too. When a driver sets a limited DMA mask we could move that
>> device to a separate slave PE so that it has its own IOMMU context
>> tailored to its DMA addressing limits.
>>
>> Thoughts?
> 
> For what it's worth, this sounds like a good idea to me; it just
> sounds tricky to implement.  You're adding another layer of
> complexity on top of EEH (well, making things look simple to the EEH
> core and doing your own freezing on top of it) in addition to the
> DMA handling.
> 
> If it works then great, but it has a high potential to become a new
> bug haven.


IMHO putting every PCI function into a separate PE is the right thing
to do here, but I've been told it is not that simple, and I believe
that. Reusing slave PEs seems unreliable - the configuration will
depend on whether a device occupied enough M64 segments to yield a
unique PE for each PCI function, and my little brain explodes.

So this is not happening soon.

For the time being, this patchset is good for:
1. weird hardware with a limited DMA mask (this is why the patchset
was written in the first place);
2. debugging DMA by routing it through the IOMMU (even when the 4GB
hack is not enabled).
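
(On item 2, a hypothetical sanity check - nothing here is in the
patchset, and the 4GB comparison only reflects where this series puts
the huge window: with bypass off, dma_map_single() should hand back a
bus address inside a TCE-translated window rather than a 1:1 one:

    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    static void check_translated(struct device *dev, void *buf,
                                 size_t len)
    {
            dma_addr_t bus = dma_map_single(dev, buf, len,
                                            DMA_TO_DEVICE);

            if (dma_mapping_error(dev, bus))
                    return;
            dev_info(dev, "mapped at %pad (%s)\n", &bus,
                     bus >= SZ_4G ? "huge window at 4GB"
                                  : "below 4GB");
            dma_unmap_single(dev, bus, len, DMA_TO_DEVICE);
    }

Seeing every mapping land in the translated range is a quick way to
confirm DMA is actually going through the IOMMU.)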



-- 
Alexey

