[PATCH v2] powerpc/pseries/iommu: Wait until all TCEs are unmapped before deleting DDW
Jesper Dangaard Brouer
hawk at kernel.org
Wed Feb 19 01:40:05 AEDT 2025
Cc. netdev and Yunsheng Lin
On 13/02/2025 18.10, Gaurav Batra wrote:
> Some of the network drivers, like Mellanox, use core Linux page_pool APIs
> to manage DMA buffers. These page_pool APIs cache DMA buffers with
> infrequent map/unmap calls for DMA mappings, thus increasing performance.
>
> When a device is initialized, the driver makes a call to the page_pool API
> to create a DMA buffer pool. Henceforth, DMA buffers are allocated and
> freed from this pool by the driver. The DMA map/unmap is done by the core
> page_pool infrastructure.
>
> These DMA buffers could be allocated for RX/TX buffer rings for the device
> or could be in-process by the network stack.
>
> When a network device is closed, the driver releases all DMA mapped
> buffers. All the DMA buffers allocated to the RX/TX rings are released back
> to the page_pool by the driver. Some of the DMA mapped buffers could still
> be allocated and in-process by the network stack.
>
> DMA buffers that are released by the network driver are synchronously
> unmapped by the page_pool APIs. But, DMA buffers that are passed to the
> network stack and still in-process are unmapped later asynchronously by the
> page_pool infrastructure.
>
> This asynchronous unmapping of the DMA buffers, by the page_pool, can lead
> to issues when a network device is dynamically removed on the PowerPC
> architecture. When a network device is DLPAR removed, the driver releases
> all the mapped DMA buffers and stops using the device. The driver returns
> successfully. But, at this stage there still could be mapped DMA buffers
> which are in-process by the network stack.
>
> The DLPAR code proceeds to remove the device from the device tree and
> delete the Dynamic DMA Window (DDW) and associated IOMMU tables. DLPAR of
> the device succeeds.
>
> Later, when the network stack releases some of the DMA buffers, page_pool
> proceeds to unmap them. The page_pool release path calls into PowerPC TCE
> management to release the TCE. This is where the LPAR OOPses, since the DDW
> and associated resources for the device are already freed.
>
> This issue was exposed during Live Partition Migration (LPM) from a Power9
> to Power10 machine with HNV configuration. The bonding device is Virtual
> Ethernet with SR-IOV. During LPM, I/O is switched from SR-IOV to passive
> Virtual Ethernet and DLPAR remove of SR-IOV is initiated. This led to the
> above-mentioned scenario.
>
> It is possible to hit this issue just by dynamically removing an SR-IOV
> device that is under heavy I/O load, a scenario where some of the mapped DMA
> buffers are in-process somewhere in the network stack and not mapped to the
> RX/TX ring of the device.
>
> The issue is only encountered when TCEs are dynamically managed. In this
> scenario, map/unmap of TCEs goes into the PowerPC TCE management path as and
> when DMA buffers are mapped/unmapped, and accesses DDW resources. When RAM
> is directly mapped during device initialization, this dynamic TCE
> management is bypassed and the LPAR does not OOPS.
>
> Solution:
>
> During DLPAR remove of the device, before deleting the DDW and associated
> resources, check to see if there are any outstanding TCEs. If there are
> outstanding TCEs, sleep for 50ms and check again, until all the TCEs are
> unmapped.
>
> Once all the TCEs are unmapped, DDW is removed and DLPAR succeeds. This
> ensures there will be no reference to the DDW after it is deleted.
>
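
The wait described in the Solution above can be modelled in plain userspace
C. This is only a sketch under stated assumptions, not the patch itself:
`struct fake_table`, `table_in_use()`, `fake_msleep()` and
`wait_for_unmap()` are hypothetical names, time is a simulated millisecond
counter rather than jiffies, and the "asynchronous" release is faked by
freeing a fixed number of entries per sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an IOMMU table and its asynchronous users. */
struct fake_table {
	unsigned int mapped;          /* outstanding mappings (TCEs in the patch) */
	unsigned int freed_per_sleep; /* entries others release during one sleep */
	unsigned long now_ms;         /* simulated clock, in milliseconds */
};

/* Plays the role of iommu_table_in_use() in this model. */
static bool table_in_use(const struct fake_table *t)
{
	return t->mapped != 0;
}

/* Simulated sleep: advance the clock and let other owners release entries. */
static void fake_msleep(struct fake_table *t, unsigned int ms)
{
	unsigned int n = t->freed_per_sleep < t->mapped ?
			 t->freed_per_sleep : t->mapped;

	t->mapped -= n;
	t->now_ms += ms;
}

/*
 * Poll every poll_ms until the table is drained or timeout_ms has elapsed.
 * Returns true if the table ended up empty, false if the wait timed out.
 */
static bool wait_for_unmap(struct fake_table *t, unsigned int poll_ms,
			   unsigned long timeout_ms)
{
	unsigned long start = t->now_ms;

	assert(poll_ms > 0); /* a zero interval would never advance the clock */

	while (table_in_use(t)) {
		fake_msleep(t, poll_ms);
		if (t->now_ms - start >= timeout_ms)
			break; /* give up; the caller decides what to do next */
	}
	return !table_in_use(t);
}
```

With three outstanding entries and one released per 50ms sleep, the model
returns after 150ms of simulated time. With entries that are never released
it gives up once the timeout is reached, which is the case where the caller
would go on to free resources that are still referenced.
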
> Here is the stack for reference
>
> [ 3610.403820] tce_freemulti_pSeriesLP: 48 callbacks suppressed
> [ 3610.403833] tce_freemulti_pSeriesLP: plpar_tce_stuff failed
> [ 3610.403869] rc = -4
> [ 3610.403872] index = 0x70000016
> [ 3610.403876] limit = 0x1
> [ 3610.403879] tce = 0x80000061ee00000
> [ 3610.403882] pgshift = 0x10
> [ 3610.403884] npages = 0x1
> [ 3610.403887] tbl = 000000003a6a2145
> [ 3610.403912] CPU: 86 PID: 97129 Comm: kworker/86:2 Kdump: loaded Tainted: G E 6.4.0-623164-default #1 SLE15-SP6 763d454e096eda7d91355fd5b171013052d83ed3
> [ 3610.403928] Hardware name: IBM,9080-M9S POWER9 (raw) 0x4e2101 0xf000005 of:IBM,FW950.80 (VH950_131) hv:phyp pSeries
> [ 3610.403937] Workqueue: events page_pool_release_retry
> [ 3610.404003] Call Trace:
> [ 3610.404006] [c000055034e6bb30] [c000000000f63108] dump_stack_lvl+0x6c/0x9c (unreliable)
> [ 3610.404039] [c000055034e6bb60] [c000000000101258] tce_freemulti_pSeriesLP+0x1e8/0x1f0
> [ 3610.404070] [c000055034e6bbf0] [c00000000005d248] __iommu_free+0x118/0x220
> [ 3610.404086] [c000055034e6bc80] [c00000000005d4e8] iommu_free+0x28/0x70
> [ 3610.404106] [c000055034e6bcb0] [c00000000005c4b4] dma_iommu_unmap_page+0x24/0x40
> [ 3610.404113] [c000055034e6bcd0] [c00000000024b56c] dma_unmap_page_attrs+0x1ac/0x1e0
> [ 3610.404139] [c000055034e6bd30] [c000000000cfa178] page_pool_return_page+0x58/0x1b0
> [ 3610.404146] [c000055034e6bd60] [c000000000cfb7bc] page_pool_release+0x10c/0x270^
> [ 3610.404152] [c000055034e6be00] [c000000000cfbb2c] page_pool_release_retry+0x2c/0x110
> [ 3610.404159] [c000055034e6be70] [c00000000018e294] process_one_work+0x314/0x620
> [ 3610.404173] [c000055034e6bf10] [c00000000018ee88] worker_thread+0x78/0x620
> [ 3610.404179] [c000055034e6bf90] [c00000000019b958] kthread+0x148/0x150
> [ 3610.404188] [c000055034e6bfe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
>
> Signed-off-by: Gaurav Batra <gbatra at linux.ibm.com>
> ---
> arch/powerpc/kernel/iommu.c | 22 ++++++++++++++++++++--
> arch/powerpc/platforms/pseries/iommu.c | 8 ++++----
> 2 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 76381e14e800..af7511a8f480 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -14,6 +14,7 @@
> #include <linux/types.h>
> #include <linux/slab.h>
> #include <linux/mm.h>
> +#include <linux/delay.h>
> #include <linux/spinlock.h>
> #include <linux/string.h>
> #include <linux/dma-mapping.h>
> @@ -803,6 +804,7 @@ bool iommu_table_in_use(struct iommu_table *tbl)
> static void iommu_table_free(struct kref *kref)
> {
> struct iommu_table *tbl;
> + unsigned long start_time;
>
> tbl = container_of(kref, struct iommu_table, it_kref);
>
> @@ -817,8 +819,24 @@ static void iommu_table_free(struct kref *kref)
> iommu_debugfs_del(tbl);
>
> /* verify that table contains no entries */
> - if (iommu_table_in_use(tbl))
> - pr_warn("%s: Unexpected TCEs\n", __func__);
> + start_time = jiffies;
> + while (iommu_table_in_use(tbl)) {
> + int sec;
> +
> + pr_info("%s: Unexpected TCEs, wait for 50ms\n", __func__);
> + msleep(50);
> +
> + /* Come out of the loop if we have already waited for 120 seconds
> + * for the TCEs to be free'ed. TCE are being free'ed
> + * asynchronously by some DMA buffer management API - like
> + * page_pool.
> + */
> + sec = (s32)((u32)jiffies - (u32)start_time) / HZ;
> + if (sec >= 120) {
> + pr_warn("%s: TCEs still mapped even after 120 seconds\n", __func__);
> + break;
> + }
> + }
>
> /* free bitmap */
> vfree(tbl->it_map);
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 534cd159e9ab..925494b6fafb 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -2390,6 +2390,10 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
>
> switch (action) {
> case OF_RECONFIG_DETACH_NODE:
> + if (pci && pci->table_group)
> + iommu_pseries_free_group(pci->table_group,
> + np->full_name);
> +
> /*
> * Removing the property will invoke the reconfig
> * notifier again, which causes dead-lock on the
> @@ -2400,10 +2404,6 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
> if (remove_dma_window_named(np, false, DIRECT64_PROPNAME, true))
> remove_dma_window_named(np, false, DMA64_PROPNAME, true);
>
> - if (pci && pci->table_group)
> - iommu_pseries_free_group(pci->table_group,
> - np->full_name);
> -
> spin_lock(&dma_win_list_lock);
> list_for_each_entry(window, &dma_win_list, list) {
> if (window->device == np) {
>
> base-commit: 6e4436539ae182dc86d57d13849862bcafaa4709
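
The timeout check in the iommu_table_free() hunk derives elapsed seconds
from an unsigned 32-bit subtraction of two jiffies samples. A small
userspace sketch of that arithmetic shows why it stays correct when the
counter wraps between the two samples. `elapsed_seconds()` is a hypothetical
helper, and the tick rate is passed as a parameter here purely for
illustration (in the kernel, HZ is a build-time constant).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Elapsed whole seconds between two 32-bit tick samples.
 * The subtraction is done modulo 2^32, so the result is still the true
 * distance when 'now' has wrapped past zero, provided fewer than 2^31
 * ticks have actually elapsed.
 */
static int32_t elapsed_seconds(uint32_t now, uint32_t start, uint32_t hz)
{
	assert(hz > 0); /* guard the division below */

	return (int32_t)(now - start) / (int32_t)hz;
}
```

For example, with an assumed tick rate of 100, a start sample of 0xFFFFFF00
and a later sample of 0x00002DE0 are 12000 ticks apart, so the helper
reports 120 seconds even though the later sample is numerically smaller.
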