<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hello Ritesh/Dan,</p>
<p><br>
</p>
<p>Here is the motivation for my patch and thoughts on the issue. </p>
<p><br>
</p>
<p>Before my patch, there were 2 scenarios to consider where, even
when the memory<br>
was pre-mapped for DMA, coherent allocations were getting mapped
from 2GB<br>
default DMA Window. In case of pre-mapped memory, the allocations
should not be<br>
directed towards 2GB default DMA window.<br>
<br>
1. An AMD GPU whose device DMA mask is greater than 32 bits but less
than 64 bits. In this<br>
case the PHB is put into Limited Addressability mode.<br>
<br>
This scenario doesn't have vPMEM.<br>
<br>
2. A device that supports a 64-bit DMA mask. The LPAR has vPMEM
assigned.<br>
<br>
<br>
In both of the above scenarios, the IOMMU has pre-mapped RAM from the
DDW (64-bit PPC DMA<br>
window).<br>
<br>
<br>
Let's consider the code path for each case, before my patch:<br>
<br>
1. AMD GPU<br>
<br>
dev->dma_ops_bypass = true<br>
<br>
dev->bus_dma_limit = 0<br>
<br>
- Here the AMD controller exposes 3 functions on the PHB.<br>
<br>
- When the first function is probed, it sees that the memory is
pre-mapped<br>
and doesn't direct DMA allocations towards the 2GB default window,<br>
so dma_go_direct() worked as expected.<br>
<br>
- The AMD GPU driver adds device memory to system pages. The stack is
as below:<br>
<br>
add_pages+0x118/0x130 (unreliable)<br>
pagemap_range+0x404/0x5e0<br>
memremap_pages+0x15c/0x3d0<br>
devm_memremap_pages+0x38/0xa0<br>
kgd2kfd_init_zone_device+0x110/0x210 [amdgpu]<br>
amdgpu_device_ip_init+0x648/0x6d8 [amdgpu]<br>
amdgpu_device_init+0xb10/0x10c0 [amdgpu]<br>
amdgpu_driver_load_kms+0x2c/0xb0 [amdgpu]<br>
amdgpu_pci_probe+0x2e4/0x790 [amdgpu]<br>
<br>
- This raised max_pfn to a high value beyond the top of RAM.<br>
<br>
- Subsequently, for each of the other functions on the PHB, the call to<br>
dma_go_direct() returns false, which then directs DMA
allocations towards<br>
the 2GB default DMA window even though the memory is pre-mapped.<br>
<br>
Although dev->dma_ops_bypass is true, dma_direct_get_required_mask()
returns a large<br>
value for the mask (due to the changed max_pfn) which is beyond the AMD
GPU's device DMA mask, so the bypass check fails.<br>
<br>
<br>
2. A device that supports a 64-bit DMA mask. The LPAR has vPMEM assigned<br>
<br>
dev->dma_ops_bypass = false<br>
dev->bus_dma_limit is set to some value depending on the size of RAM
(eg. 0x0800001000000000)<br>
<br>
- Here the call to dma_go_direct() returns false since
dev->dma_ops_bypass = false, so allocations again go through the
2GB default DMA window even though RAM is pre-mapped.<br>
<br>
<br>
<br>
I crafted the solution to cover both cases. I tested it today on
an LPAR<br>
with 7.0-rc4 and it works with amdgpu.<br>
<br>
With my patch, allocations go towards the direct path only when
dev->dma_ops_bypass = true,<br>
which is the case for "pre-mapped" RAM.<br>
<br>
Ritesh mentioned that this is PowerNV. I need to revisit this
patch and see why it<br>
is failing on PowerNV. From the logs, I do see an issue: the log
indicates<br>
dev->bus_dma_limit is set to 0, which is incorrect. For
pre-mapped RAM, with my<br>
patch, bus_dma_limit should always be set to a non-zero value.<br>
<br>
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu:
64-bit OK but direct DMA is limited by <b>0</b><br>
</p>
<p>Thanks,</p>
<p>Gaurav</p>
<div class="moz-cite-prefix">On 3/15/26 4:50 AM, Dan Horák wrote:<br>
</div>
<blockquote type="cite"
cite="mid:20260315105021.667e52d4a99b154ef1e6aa34@danny.cz">
<pre wrap="" class="moz-quote-pre">Hi Ritesh,
On Sun, 15 Mar 2026 09:55:11 +0530
Ritesh Harjani (IBM) <a class="moz-txt-link-rfc2396E" href="mailto:ritesh.list@gmail.com"><ritesh.list@gmail.com></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">Dan Horák <a class="moz-txt-link-rfc2396E" href="mailto:dan@danny.cz"><dan@danny.cz></a> writes:
+cc Gaurav,
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">Hi,
starting with 7.0-rc1 (meaning 6.19 is OK) the amdgpu driver fails to
initialize on my Linux/ppc64le Power9 based system (with Radeon Pro WX4100)
with the following in the log
...
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
^^^^
So looks like this is a PowerNV (Power9) machine.
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
correct :-)
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Detected VRAM RAM=4096M, BAR=4096M
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] RAM width 128bits GDDR5
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by 0
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 4096M of VRAM memory ready
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 32570M of GTT memory ready.
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Debug VRAM access will use slowpath MM access
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] GART: num cpu pages 4096, num gpu pages 65536
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] PCIE GART of 256M enabled (table at 0x000000F4FFF80000).
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) create WB bo failed
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu_device_wb_init failed -12
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: Fatal error during GPU init
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: finishing device.
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: probe with driver amdgpu failed with error -12
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: ttm finalized
...
After some hints from Alex and bisecting and other investigation I have
found that <a class="moz-txt-link-freetext" href="https://github.com/torvalds/linux/commit/1471c517cf7dae1a6342fb821d8ed501af956dd0">https://github.com/torvalds/linux/commit/1471c517cf7dae1a6342fb821d8ed501af956dd0</a>
is the culprit and reverting it makes amdgpu load (and work) again.
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
Thanks for confirming this. Yes, this was recently added [1]
[1]: <a class="moz-txt-link-freetext" href="https://lore.kernel.org/linuxppc-dev/20251107161105.85999-1-gbatra@linux.ibm.com/">https://lore.kernel.org/linuxppc-dev/20251107161105.85999-1-gbatra@linux.ibm.com/</a>
@Gaurav,
I am not too familiar with the area, however looking at the logs shared
by Dan, it looks like we might be always going for dma direct allocation
path and maybe the device doesn't support this address limit.
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by 0
bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
a complete kernel log is at
<a class="moz-txt-link-freetext" href="https://gitlab.freedesktop.org/-/project/4522/uploads/c4935bca6f37bbd06bb4045c07d00b5b/kernel.log">https://gitlab.freedesktop.org/-/project/4522/uploads/c4935bca6f37bbd06bb4045c07d00b5b/kernel.log</a>
Please let me know if you need more info.
Dan
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">Looking at the code..
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index fe7472f13b10..d5743b3c3ab3 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -654,7 +654,7 @@ void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
/* let the implementation decide on the zone to allocate from: */
flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);
- if (dma_alloc_direct(dev, ops)) {
+ if (dma_alloc_direct(dev, ops) || arch_dma_alloc_direct(dev)) {
cpu_addr = dma_direct_alloc(dev, size, dma_handle, flag, attrs);
} else if (use_dma_iommu(dev)) {
cpu_addr = iommu_dma_alloc(dev, size, dma_handle, flag, attrs);
Now, do we need arch_dma_alloc_direct() here? It always returns true if
dev->dma_ops_bypass is set to true, w/o checking for checks that
dma_go_direct() has.
whereas...
/*
* Check if the devices uses a direct mapping for streaming DMA operations.
* This allows IOMMU drivers to set a bypass mode if the DMA mask is large
* enough.
*/
static inline bool
dma_alloc_direct(struct device *dev, const struct dma_map_ops *ops)
..dma_go_direct(dev, dev->coherent_dma_mask, ops);
.... ...
#ifdef CONFIG_DMA_OPS_BYPASS
if (dev->dma_ops_bypass)
return min_not_zero(mask, dev->bus_dma_limit) >=
dma_direct_get_required_mask(dev);
#endif
dma_alloc_direct() already checks for dma_ops_bypass and also if
dev->coherent_dma_mask >= dma_direct_get_required_mask(). So...
.... Do we really need the machinary of arch_dma_{alloc|free}_direct()?
Isn't dma_alloc_direct() checks sufficient?
Thoughts?
-ritesh
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">
for the record, I have originally opened <a class="moz-txt-link-freetext" href="https://gitlab.freedesktop.org/drm/amd/-/issues/5039">https://gitlab.freedesktop.org/drm/amd/-/issues/5039</a>
With regards,
Dan
</pre>
</blockquote>
</blockquote>
</blockquote>
</body>
</html>