[PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory

Amit Machhiwal amachhiw at linux.ibm.com
Wed May 7 19:06:57 AEST 2025


Hi Michal,

I can reproduce this issue with the sles16 distro kernel, but I don't see it
with upstream Linux 6.15-rc5 on the **same** sles16 guest.

Note: commit 6aa989ab2bd0 ("powerpc/pseries/iommu: memory notifier
incorrectly adds TCEs for pmemory") has been included since Linux 6.15-rc1.

I think there's something more to this crash.

Thanks,
Amit

On 2025/03/26 03:53 PM, Michal Suchánek wrote:
> Hello,
> 
> On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote:
> > Hello Michal,
> > 
> > In the patch to fix the pmemory bug, I made some changes to the code that
> > determines Max memory an LPAR can have (excluding pmemory). This information
> > is needed while creating Dynamic DMA Window (DDW). These changes are in the
> > main line code path of DDW creation. This might have irritated QEMU somehow,
> > no idea yet on how.
> 
> Yes, it's definitely something with the DDW code. Using the
> disable_ddw=1 kernel parameter avoids the qemu crash.
> 
> The kernels in
> https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/
> 
> have the patch applied.
> 
> Booting the kernel inside qemu VM with a PCI device (such as the USB
> hub) and then rebooting the VM crashes qemu.
> 
> Thanks
> 
> Michal
> 
> > 
> > Thanks,
> > 
> > Gaurav
> > 
> > On 3/19/25 12:29 PM, Michal Suchánek wrote:
> > > Hello,
> > > 
> > > looks like this upsets some assumption qemu has about these windows.
> > > 
> > > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
> > > 
> > > When a Linux kernel that has this patch applied is running inside a qemu
> > > VM with a PCI device and the VM is rebooted, qemu crashes shortly after
> > > the next Linux kernel starts.
> > > 
> > > This is quite curious since qemu does AFAIK not support pmemory at all.
> > > 
> > > Any idea what went wrong there?
> > > 
> > > Thanks
> > > 
> > > Michal
> > > 
> > > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
> > > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> > > > notifier call is responsible for adding/removing TCEs from the Dynamic DMA
> > > > Window (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and
> > > > not for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> > > > dynamically mapped when the device driver instructs to do so.
> > > > 
> > > > The issue is that the 'daxctl' command can add pmemory as "System RAM"
> > > > after LPAR boot. The command to do so is:
> > > > 
> > > > daxctl reconfigure-device --mode=system-ram dax0.0 --force
> > > > 
> > > > This dynamically adds the pmemory range to LPAR RAM, eventually invoking
> > > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> > > > RAM that the LPAR can have, which means this range is beyond the DDW
> > > > created for the device at device initialization time.
> > > > 
> > > > As a result, when iommu_mem_notifier() pre-maps TCEs for the pmemory
> > > > range, the PHYP HCALL returns H_PARAMETER. This causes the daxctl command
> > > > to fail to add pmemory as RAM.
> > > > 
> > > > The solution is to not pre-map TCEs for pmemory.
> > > > 
> > > > Signed-off-by: Gaurav Batra <gbatra at linux.ibm.com>
> > > > ---
> > > >   arch/powerpc/include/asm/mmzone.h      |  1 +
> > > >   arch/powerpc/mm/numa.c                 |  2 +-
> > > >   arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> > > >   3 files changed, 18 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> > > > index d99863cd6cde..049152f8d597 100644
> > > > --- a/arch/powerpc/include/asm/mmzone.h
> > > > +++ b/arch/powerpc/include/asm/mmzone.h
> > > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> > > >   #ifdef CONFIG_MEMORY_HOTPLUG
> > > >   extern unsigned long max_pfn;
> > > >   u64 memory_hotplug_max(void);
> > > > +u64 hot_add_drconf_memory_max(void);
> > > >   #else
> > > >   #define memory_hotplug_max() memblock_end_of_DRAM()
> > > >   #endif
> > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > > index 3c1da08304d0..603a0f652ba6 100644
> > > > --- a/arch/powerpc/mm/numa.c
> > > > +++ b/arch/powerpc/mm/numa.c
> > > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> > > >   	return nid;
> > > >   }
> > > > -static u64 hot_add_drconf_memory_max(void)
> > > > +u64 hot_add_drconf_memory_max(void)
> > > >   {
> > > >   	struct device_node *memory = NULL;
> > > >   	struct device_node *dn = NULL;
> > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > index 29f1a0cc59cd..abd9529a8f41 100644
> > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
> > > >   static phys_addr_t ddw_memory_hotplug_max(void)
> > > >   {
> > > > -	resource_size_t max_addr = memory_hotplug_max();
> > > > -	struct device_node *memory;
> > > > +	resource_size_t max_addr;
> > > > -	for_each_node_by_type(memory, "memory") {
> > > > -		struct resource res;
> > > > -
> > > > -		if (of_address_to_resource(memory, 0, &res))
> > > > -			continue;
> > > > -
> > > > -		max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> > > > -	}
> > > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > > +	max_addr = hot_add_drconf_memory_max();
> > > > +#else
> > > > +	max_addr = memblock_end_of_DRAM();
> > > > +#endif
> > > >   	return max_addr;
> > > >   }
> > > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > >   	if (direct_mapping) {
> > > >   		/* DDW maps the whole partition, so enable direct DMA mapping */
> > > > -		ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > > +		ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> > > >   					    win64->value, tce_setrange_multi_pSeriesLP_walk);
> > > >   		if (ret) {
> > > >   			dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> > > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > >   	struct memory_notify *arg = data;
> > > >   	int ret = 0;
> > > > +	/* This notifier can get called when onlining persistent memory as well.
> > > > +	 * TCEs are not pre-mapped for persistent memory. Persistent memory will
> > > > +	 * always be above ddw_memory_hotplug_max()
> > > > +	 */
> > > > +
> > > >   	switch (action) {
> > > >   	case MEM_GOING_ONLINE:
> > > >   		spin_lock(&dma_win_list_lock);
> > > >   		list_for_each_entry(window, &dma_win_list, list) {
> > > > -			if (window->direct) {
> > > > +			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > > +				ddw_memory_hotplug_max()) {
> > > >   				ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> > > >   						arg->nr_pages, window->prop);
> > > >   			}
> > > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > >   	case MEM_OFFLINE:
> > > >   		spin_lock(&dma_win_list_lock);
> > > >   		list_for_each_entry(window, &dma_win_list, list) {
> > > > -			if (window->direct) {
> > > > +			if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > > +				ddw_memory_hotplug_max()) {
> > > >   				ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> > > >   						arg->nr_pages, window->prop);
> > > >   			}
> > > > 
> > > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
> > > > -- 
> > > > 2.39.3 (Apple Git-146)
> > > > 
> > > > 
> 
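
For readers following the thread, here is a minimal sketch of the address check
the quoted patch adds to iommu_mem_notifier(), written as standalone userspace C
rather than kernel code. The names sketch_should_premap and SKETCH_PAGE_SHIFT,
the 64 KiB page size, and passing the limit as a parameter are assumptions made
for illustration; in the patch the limit comes from ddw_memory_hotplug_max().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 64 KiB pages. The kernel uses PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 16u

/*
 * Hypothetical helper mirroring the condition added by the patch:
 * pre-map (or clear) TCEs only when the window is a direct mapping and
 * the hot-added range starts below the maximum RAM address. A range
 * starting at or above that limit is treated as persistent memory and
 * is skipped.
 */
static bool sketch_should_premap(bool window_direct, uint64_t start_pfn,
                                 uint64_t max_ram_bytes)
{
    /* Guard against overflow when converting the PFN to a byte address. */
    assert(start_pfn <= (UINT64_MAX >> SKETCH_PAGE_SHIFT));

    return window_direct &&
           (start_pfn << SKETCH_PAGE_SHIFT) < max_ram_bytes;
}
```

With a 1 TiB limit, a range starting at PFN 0x1000 would be pre-mapped, while a
range starting at or above the limit would be left for dynamic mapping. Note
that, like the patch, the sketch tests only the start of the range.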

