[PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory

Michal Suchánek msuchanek at suse.de
Thu Mar 27 01:53:14 AEDT 2025


Hello,

On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote:
> Hello Michal,
> 
> In the patch to fix the pmemory bug, I made some changes to the code that
> determines Max memory an LPAR can have (excluding pmemory). This information
> is needed while creating Dynamic DMA Window (DDW). These changes are in the
> main line code path of DDW creation. This might have irritated QEMU somehow,
> no idea yet on how.

Yes, it's definitely something with the DDW code. Using the
disable_ddw=1 kernel parameter avoids the qemu crash.

The kernels in
https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/

have the patch applied.

Booting the kernel inside a qemu VM with a PCI device (such as the USB
hub) and then rebooting the VM crashes qemu.

Thanks

Michal

> 
> Thanks,
> 
> Gaurav
> 
> On 3/19/25 12:29 PM, Michal Suchánek wrote:
> > Hello,
> > 
> > looks like this upsets some assumption qemu has about these windows.
> > 
> > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
> > 
> > When Linux kernel that has this patch applied is running inside a qemu
> > VM with a PCI device and the VM is rebooted qemu crashes shortly after
> > the next Linux kernel starts.
> > 
> > This is quite curious since qemu does AFAIK not support pmemory at all.
> > 
> > Any idea what went wrong there?
> > 
> > Thanks
> > 
> > Michal
> > 
> > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
> > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> > > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> > > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> > > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> > > dynamically mapped when the device driver instructs to do so.
> > > 
> > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> > > after LPAR boot. The command to do so is -
> > > 
> > > daxctl reconfigure-device --mode=system-ram dax0.0 --force
> > > 
> > > This will dynamically add pmemory range to LPAR RAM eventually invoking
> > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> > > RAM that the LPAR can have. Which means, this range is beyond the DDW
> > > created for the device, at device initialization time.
> > > 
> > > As a result when TCEs are pre-mapped for the pmemory range, by
> > > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
> > > command, daxctl, to add pmemory as RAM.
> > > 
> > > The solution is to not pre-map TCEs for pmemory.
> > > 
> > > Signed-off-by: Gaurav Batra <gbatra at linux.ibm.com>
> > > ---
> > >   arch/powerpc/include/asm/mmzone.h      |  1 +
> > >   arch/powerpc/mm/numa.c                 |  2 +-
> > >   arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> > >   3 files changed, 18 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> > > index d99863cd6cde..049152f8d597 100644
> > > --- a/arch/powerpc/include/asm/mmzone.h
> > > +++ b/arch/powerpc/include/asm/mmzone.h
> > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> > >   #ifdef CONFIG_MEMORY_HOTPLUG
> > >   extern unsigned long max_pfn;
> > >   u64 memory_hotplug_max(void);
> > > +u64 hot_add_drconf_memory_max(void);
> > >   #else
> > >   #define memory_hotplug_max() memblock_end_of_DRAM()
> > >   #endif
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 3c1da08304d0..603a0f652ba6 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> > >   	return nid;
> > >   }
> > > -static u64 hot_add_drconf_memory_max(void)
> > > +u64 hot_add_drconf_memory_max(void)
> > >   {
> > >   	struct device_node *memory = NULL;
> > >   	struct device_node *dn = NULL;
> > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > index 29f1a0cc59cd..abd9529a8f41 100644
> > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
> > >   static phys_addr_t ddw_memory_hotplug_max(void)
> > >   {
> > > -	resource_size_t max_addr = memory_hotplug_max();
> > > -	struct device_node *memory;
> > > +	resource_size_t max_addr;
> > > -	for_each_node_by_type(memory, "memory") {
> > > -		struct resource res;
> > > -
> > > -		if (of_address_to_resource(memory, 0, &res))
> > > -			continue;
> > > -
> > > -		max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> > > -	}
> > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > +	max_addr = hot_add_drconf_memory_max();
> > > +#else
> > > +	max_addr = memblock_end_of_DRAM();
> > > +#endif
> > >   	return max_addr;
> > >   }
> > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > >   	if (direct_mapping) {
> > >   		/* DDW maps the whole partition, so enable direct DMA mapping */
> > > -		ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > +		ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> > >   					    win64->value, tce_setrange_multi_pSeriesLP_walk);
> > >   		if (ret) {
> > >   			dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > >   	struct memory_notify *arg = data;
> > >   	int ret = 0;
> > > +	/* This notifier can get called when onlining persistent memory as well.
> > > +	 * TCEs are not pre-mapped for persistent memory. Persistent memory will
> > > +	 * always be above ddw_memory_hotplug_max()
> > > +	 */
> > > +
> > >   	switch (action) {
> > >   	case MEM_GOING_ONLINE:
> > >   		spin_lock(&dma_win_list_lock);
> > >   		list_for_each_entry(window, &dma_win_list, list) {
> > > -			if (window->direct) {
> > > +			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > +				ddw_memory_hotplug_max()) {
> > >   				ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> > >   						arg->nr_pages, window->prop);
> > >   			}
> > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > >   	case MEM_OFFLINE:
> > >   		spin_lock(&dma_win_list_lock);
> > >   		list_for_each_entry(window, &dma_win_list, list) {
> > > -			if (window->direct) {
> > > +			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > +				ddw_memory_hotplug_max()) {
> > >   				ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> > >   						arg->nr_pages, window->prop);
> > >   			}
> > > 
> > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
> > > -- 
> > > 2.39.3 (Apple Git-146)
> > > 
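
For readers following the thread, the range check that the quoted patch adds
to iommu_mem_notifier() can be modelled outside the kernel. The following is
only a minimal userspace sketch, not the kernel code: sketch_should_premap(),
SKETCH_PAGE_SHIFT and the ddw_max parameter are hypothetical names, and the
64K page size is an assumption. In the patch itself the limit comes from
ddw_memory_hotplug_max().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: 64K pages. */
#define SKETCH_PAGE_SHIFT 16

/*
 * Hypothetical model of the condition added by the quoted patch:
 * TCEs are pre-mapped (or cleared) for a hot-added range only when the
 * window is a direct-mapped one and the range starts below the maximum
 * RAM address the DDW was sized for. ddw_max stands in for the value
 * returned by ddw_memory_hotplug_max() in the kernel.
 */
bool sketch_should_premap(bool window_direct, uint64_t start_pfn,
			  uint64_t ddw_max)
{
	uint64_t start_addr = start_pfn << SKETCH_PAGE_SHIFT;

	return window_direct && start_addr < ddw_max;
}
```

With a ddw_max of 1 TiB, a range starting at PFN 0x10 would be pre-mapped,
while a range starting at or above 1 TiB (as a pmemory range onlined through
daxctl would, per the commit message) is skipped. As in the quoted patch, only
the start address of the range is compared against the limit.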
> > > 

