[PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB

Tue Apr 21 15:11:09 AEST 2020

On 21/04/2020 00:04, Oliver O'Halloran wrote:
> On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
>>
>> On 17/04/2020 11:26, Russell Currey wrote:
>>>
>>> For what it's worth this sounds like a good idea to me, it just sounds
>>> tricky to implement.  You're adding another layer of complexity on top
>>> of EEH (well, making things look simple to the EEH core and doing your
>>> own freezing on top of it) in addition to the DMA handling.
>>>
>>> If it works then great, just has a high potential to become a new bug
>>> haven.
>>
>> imho putting every PCI function to a separate PE is the right thing to
>> do here but I've been told it is not that simple, and I believe that.
>> Reusing slave PEs seems unreliable - the configuration will depend on
>> whether a PE occupied enough segments to give an unique PE to a PCI
>> function and my little brain explodes.
> 
> You're overthinking it.
> 
> If a bus has no m64 MMIO space we're free to assign whatever PE number
> we want since the PE for the bus isn't fixed by the m64 segment its
> BARs were placed in. For those buses we assign a PE number starting
> from the max and counting down (0xff, 0xfe, etc). For example, with
> this PHB:
> 
> # lspci -s 1:: -v | egrep '0001:|Memory at'
> 0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
> 0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
>         Memory at 600c081000000 (32-bit, non-prefetchable) [size=256K]
> 0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device
> f117 (rev 06) (prog-if 02 [NVM Express])
>         Memory at 600c080000000 (64-bit, non-prefetchable) [size=16K]
>         Memory at 6004000000000 (64-bit, prefetchable) [size=1M]
> 0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller
> X710/X557-AT 10GBASE-T (rev 02)
>         Memory at 6004048000000 (64-bit, prefetchable) [size=8M]
>         Memory at 600404a000000 (64-bit, prefetchable) [size=32K]
> (redundant functions removed)
> 
> We get these PE assignments:
> 
> 0001:00:00.0 - 0xfe # Root port
> 0001:01:00.0 - 0xfc # upstream port
> 0001:02:01.0 - 0xfd # downstream port bus
> 0001:02:08.0 - 0xfd
> 0001:02:09.0 - 0xfd
> 0001:03:00.0 - 0x0  # NVMe
> 0001:09:00.0 - 0x1  # Ethernet
> 
> All the switch ports either have 32bit BARs or no BARs so they get
> assigned PEs starting from the top. The Ethernet and the NVMe have some
> prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
> respectively since that's where their m64 BARs are located. For our
> DMA-only slave PEs any MMIO space would remain in their master PE so we
> can allocate a PE number for the DMA-PE (our iommu context).

One example of a problem device is AMD GPU with 64bit video PCI function
and 32bit audio, no?

What PEs will they get assigned to now? Where will audio's MMIO go? It
cannot be the same 64bit MMIO segment, right? If so, it is a separate PE
already. If not, then I do not understand "we're free to assign whatever
PE number we want.

> I think the key thing to realise is that we'd only be using the DMA-PE
> when a crippled DMA mask is set by the driver. In all other cases we
> can just use the "native PE" and when the driver unbinds we can de-
> allocate our DMA-PE and return the device to the PE containing it's
> MMIO BARs. I think we can keep things relatively sane that way and the
> real issue is detecting EEH events on the DMA-PE.

Oooor we could just have 1 DMA window (or, more precisely, a single
"TVE" as it is either window or bypass) per a PE and give every function
its own PE and create a window or a table when a device sets a DMA mask.
I feel I am missing something here though.

> 
> On P9 we don't have PHB error interrupts enabled in firmware so we're
> completely reliant on seeing a 0xFF response to an MMIO and manually
> checking the PE status to see if it's due to a PE freeze. For our DMA-
> PE it could be frozen (due to a bad DMA) and we'd never notice unless
> we explicitly check the status of the DMA-PE since there's no
> corresponding MMIO space to freeze.
> 
> On P8 we had PHB Error interrupts so you would notice that *something*
> happened, then go check for frozen PEs, at which point the master-slave 
> grouping logic would see that one PE in the group was frozen and freeze
> the rest of them. We can re-use that on that, but we still need
> something to actually notice a freeze occured. A background poller
> checking for freezes on each PE might do the trick.
> 
>> So this is not happening soon.
> 
> Oh ye of little faith.
> 
>> For the time being, this patchset is good for:
>> 1. weird hardware which has limited DMA mask (this is why the patchset
>> was written in the first place)
>> 2. debug DMA by routing it via IOMMU (even when 4GB hack is not enabled).
> 
> Sure, but it's still dependent on having firmware which supports the
> 4GB hack and I don't think that's in any offical firmware releases yet.

It's been a while :-/

-- 
Alexey