[PATCH] powerpc/pseries/iommu: Wait until all TCEs are unmapped before deleting DDW
Gaurav Batra
gbatra at linux.ibm.com
Tue Sep 17 04:35:58 AEST 2024
Some network drivers, such as Mellanox, use the core Linux page_pool APIs
to manage DMA buffers. These page_pool APIs cache DMA buffers with
infrequent map/unmap calls for DMA mappings, thus increasing performance.
When a device is initialized, the driver makes a call to the page_pool API
to create a DMA buffer pool. Henceforth, DMA buffers are allocated and
freed from this pool by the driver. The DMA map/unmap is done by the core
page_pool infrastructure.
These DMA buffers could be allocated for RX/TX buffer rings for the device
or could be in-process by the network stack.
When a network device is closed, the driver releases all DMA mapped
buffers. All the DMA buffers allocated to the RX/TX rings are released back
to the page_pool by the driver. Some of the DMA mapped buffers could still
be allocated and in-process by the network stack.
DMA buffers that are released by the network driver are synchronously
unmapped by the page_pool APIs. But DMA buffers that are passed to the
network stack and still in-process are unmapped later, asynchronously, by
the page_pool infrastructure.
This asynchronous unmapping of the DMA buffers by the page_pool can lead
to issues when a network device is dynamically removed on the PowerPC
architecture. When a network device is DLPAR removed, the driver releases
all the mapped DMA buffers and stops using the device. The driver returns
successfully. But at this stage there could still be mapped DMA buffers
which are in-process by the network stack.
The DLPAR code proceeds to remove the device from the device tree and
deletes the Dynamic DMA Window (DDW) and associated IOMMU tables. DLPAR of
the device succeeds.
Later, when the network stack releases some of the DMA buffers, page_pool
proceeds to unmap them. The page_pool release path calls into PowerPC TCE
management to release the TCE. This is where the LPAR OOPSes, since the DDW
and associated resources for the device are already freed.
This issue was exposed during LPM from a Power9 to a Power10 machine with
an HNV configuration. The bonding device is Virtual Ethernet with SR-IOV.
During LPM, I/O is switched from SR-IOV to passive Virtual Ethernet and
DLPAR remove of SR-IOV is initiated. This led to the above-mentioned
scenario.
It is possible to hit this issue by just dynamically removing an SR-IOV
device that is under heavy I/O load, a scenario where some of the mapped
DMA buffers are in-process somewhere in the network stack and not mapped to
the RX/TX ring of the device.
The issue is only encountered when TCEs are dynamically managed. In this
scenario, map/unmap of TCEs goes into the PowerPC TCE management path as
and when DMA buffers are mapped/unmapped, and that path accesses DDW
resources. When RAM is directly mapped during device initialization, this
dynamic TCE management is bypassed and the LPAR doesn't OOPS.
Solution:
During DLPAR remove of the device, before deleting the DDW and associated
resources, check to see if there are any outstanding TCEs. If there are
outstanding TCEs, sleep for 50ms and check again, until all the TCEs are
unmapped or 120 seconds have elapsed.
Once all the TCEs are unmapped, the DDW is removed and DLPAR succeeds. This
ensures there will be no reference to the DDW after it is deleted.
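For illustration only, the wait described above can be sketched as a
bounded polling loop in userspace C. This is not the kernel code from the
patch: wait_until_unused(), in_use_fn, sleep_ms(), elapsed_ms() and
countdown_in_use() are hypothetical names, nanosleep() stands in for
msleep(), and CLOCK_MONOTONIC stands in for jiffies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for iommu_table_in_use(): true while mappings remain. */
typedef bool (*in_use_fn)(void *ctx);

/* Userspace stand-in for msleep(): block for roughly ms milliseconds. */
static void sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

/* Milliseconds elapsed on the monotonic clock since *start. */
static long long elapsed_ms(const struct timespec *start)
{
	struct timespec now;
	long long ns;

	clock_gettime(CLOCK_MONOTONIC, &now);
	ns = (long long)(now.tv_sec - start->tv_sec) * 1000000000LL +
	     (now.tv_nsec - start->tv_nsec);
	return ns / 1000000LL;
}

/*
 * Poll in_use() every poll_ms milliseconds until it reports false or
 * timeout_ms has elapsed. Returns 0 if the resource drained, -1 on timeout.
 */
static int wait_until_unused(in_use_fn in_use, void *ctx,
			     long poll_ms, long timeout_ms)
{
	struct timespec start;

	assert(poll_ms > 0 && timeout_ms >= 0);
	clock_gettime(CLOCK_MONOTONIC, &start);

	while (in_use(ctx)) {
		sleep_ms(poll_ms);
		if (elapsed_ms(&start) >= timeout_ms)
			return -1;	/* give up; the caller decides what to do */
	}
	return 0;
}

/* Demo predicate: reports "in use" until the counter in ctx reaches zero. */
static bool countdown_in_use(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return true;
	}
	return false;
}
```

With the values used in this patch, the equivalent call would be
wait_until_unused(pred, tbl, 50, 120 * 1000). The timeout bounds how long
the remove path can be stalled by a buffer that is never returned.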
Signed-off-by: Gaurav Batra <gbatra at linux.ibm.com>
---
arch/powerpc/kernel/iommu.c | 22 ++++++++++++++++++++--
arch/powerpc/platforms/pseries/iommu.c | 8 ++++----
2 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 76381e14e800..af7511a8f480 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/mm.h>
+#include <linux/delay.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/dma-mapping.h>
@@ -803,6 +804,7 @@ bool iommu_table_in_use(struct iommu_table *tbl)
static void iommu_table_free(struct kref *kref)
{
struct iommu_table *tbl;
+ unsigned long start_time;
tbl = container_of(kref, struct iommu_table, it_kref);
@@ -817,8 +819,24 @@ static void iommu_table_free(struct kref *kref)
iommu_debugfs_del(tbl);
/* verify that table contains no entries */
- if (iommu_table_in_use(tbl))
- pr_warn("%s: Unexpected TCEs\n", __func__);
+ start_time = jiffies;
+ while (iommu_table_in_use(tbl)) {
+ int sec;
+
+ pr_info("%s: Unexpected TCEs, wait for 50ms\n", __func__);
+ msleep(50);
+
+ /* Come out of the loop if we have already waited for 120 seconds
+ * for the TCEs to be free'ed. TCE are being free'ed
+ * asynchronously by some DMA buffer management API - like
+ * page_pool.
+ */
+ sec = (s32)((u32)jiffies - (u32)start_time) / HZ;
+ if (sec >= 120) {
+ pr_warn("%s: TCEs still mapped even after 120 seconds\n", __func__);
+ break;
+ }
+ }
/* free bitmap */
vfree(tbl->it_map);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 534cd159e9ab..925494b6fafb 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -2390,6 +2390,10 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
switch (action) {
case OF_RECONFIG_DETACH_NODE:
+ if (pci && pci->table_group)
+ iommu_pseries_free_group(pci->table_group,
+ np->full_name);
+
/*
* Removing the property will invoke the reconfig
* notifier again, which causes dead-lock on the
@@ -2400,10 +2404,6 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
if (remove_dma_window_named(np, false, DIRECT64_PROPNAME, true))
remove_dma_window_named(np, false, DMA64_PROPNAME, true);
- if (pci && pci->table_group)
- iommu_pseries_free_group(pci->table_group,
- np->full_name);
-
spin_lock(&dma_win_list_lock);
list_for_each_entry(window, &dma_win_list, list) {
if (window->device == np) {
base-commit: 6e4436539ae182dc86d57d13849862bcafaa4709
--
2.39.3 (Apple Git-146)