[PATCH] powerpc/pseries/iommu: memory notifier incorrectly adds TCEs for pmemory
Michal Suchánek
msuchanek at suse.de
Thu Mar 27 01:53:14 AEDT 2025
Hello,
On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote:
> Hello Michal,
>
> In the patch to fix the pmemory bug, I made some changes to the code that
> determines Max memory an LPAR can have (excluding pmemory). This information
> is needed while creating Dynamic DMA Window (DDW). These changes are in the
> main line code path of DDW creation. This might have irritated QEMU somehow,
> no idea yet on how.
Yes, it's definitely something in the DDW code. Booting with the
disable_ddw=1 kernel parameter avoids the qemu crash.
The kernels in
https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/
have the patch applied.
Booting one of these kernels inside a qemu VM with a PCI device (such as
the USB hub) and then rebooting the VM crashes qemu.
Thanks
Michal
>
> Thanks,
>
> Gaurav
>
> On 3/19/25 12:29 PM, Michal Suchánek wrote:
> > Hello,
> >
> > looks like this upsets some assumption qemu has about these windows.
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
> >
> > When Linux kernel that has this patch applied is running inside a qemu
> > VM with a PCI device and the VM is rebooted qemu crashes shortly after
> > the next Linux kernel starts.
> >
> > This is quite curious since qemu does AFAIK not support pmemory at all.
> >
> > Any idea what went wrong there?
> >
> > Thanks
> >
> > Michal
> >
> > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
> > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> > > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> > > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> > > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> > > dynamically mapped when the device driver instructs to do so.
> > >
> > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> > > after LPAR boot. The command to do so is -
> > >
> > > daxctl reconfigure-device --mode=system-ram dax0.0 --force
> > >
> > > This will dynamically add pmemory range to LPAR RAM eventually invoking
> > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> > > RAM that the LPAR can have. Which means, this range is beyond the DDW
> > > created for the device, at device initialization time.
> > >
> > > As a result when TCEs are pre-mapped for the pmemory range, by
> > > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
> > > command, daxctl, to add pmemory as RAM.
> > >
> > > The solution is to not pre-map TCEs for pmemory.
> > >
> > > Signed-off-by: Gaurav Batra <gbatra at linux.ibm.com>
> > > ---
> > > arch/powerpc/include/asm/mmzone.h | 1 +
> > > arch/powerpc/mm/numa.c | 2 +-
> > > arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> > > 3 files changed, 18 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> > > index d99863cd6cde..049152f8d597 100644
> > > --- a/arch/powerpc/include/asm/mmzone.h
> > > +++ b/arch/powerpc/include/asm/mmzone.h
> > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> > > #ifdef CONFIG_MEMORY_HOTPLUG
> > > extern unsigned long max_pfn;
> > > u64 memory_hotplug_max(void);
> > > +u64 hot_add_drconf_memory_max(void);
> > > #else
> > > #define memory_hotplug_max() memblock_end_of_DRAM()
> > > #endif
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 3c1da08304d0..603a0f652ba6 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> > > return nid;
> > > }
> > > -static u64 hot_add_drconf_memory_max(void)
> > > +u64 hot_add_drconf_memory_max(void)
> > > {
> > > struct device_node *memory = NULL;
> > > struct device_node *dn = NULL;
> > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > index 29f1a0cc59cd..abd9529a8f41 100644
> > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
> > > static phys_addr_t ddw_memory_hotplug_max(void)
> > > {
> > > - resource_size_t max_addr = memory_hotplug_max();
> > > - struct device_node *memory;
> > > + resource_size_t max_addr;
> > > - for_each_node_by_type(memory, "memory") {
> > > - struct resource res;
> > > -
> > > - if (of_address_to_resource(memory, 0, &res))
> > > - continue;
> > > -
> > > - max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> > > - }
> > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > + max_addr = hot_add_drconf_memory_max();
> > > +#else
> > > + max_addr = memblock_end_of_DRAM();
> > > +#endif
> > > return max_addr;
> > > }
> > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > if (direct_mapping) {
> > > /* DDW maps the whole partition, so enable direct DMA mapping */
> > > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > + ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> > > win64->value, tce_setrange_multi_pSeriesLP_walk);
> > > if (ret) {
> > > dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > struct memory_notify *arg = data;
> > > int ret = 0;
> > > + /* This notifier can get called when onlining persistent memory as well.
> > > + * TCEs are not pre-mapped for persistent memory. Persistent memory will
> > > + * always be above ddw_memory_hotplug_max()
> > > + */
> > > +
> > > switch (action) {
> > > case MEM_GOING_ONLINE:
> > > spin_lock(&dma_win_list_lock);
> > > list_for_each_entry(window, &dma_win_list, list) {
> > > - if (window->direct) {
> > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > + ddw_memory_hotplug_max()) {
> > > ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> > > arg->nr_pages, window->prop);
> > > }
> > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > > case MEM_OFFLINE:
> > > spin_lock(&dma_win_list_lock);
> > > list_for_each_entry(window, &dma_win_list, list) {
> > > - if (window->direct) {
> > > + if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > + ddw_memory_hotplug_max()) {
> > > ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> > > arg->nr_pages, window->prop);
> > > }
> > >
> > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
> > > --
> > > 2.39.3 (Apple Git-146)
> > >
> > >
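
For reference, the check the patch adds to iommu_mem_notifier() boils down
to a single address comparison. Below is a minimal userspace sketch of that
comparison, not the kernel code itself: sketch_should_premap() and
SKETCH_PAGE_SHIFT are hypothetical names, the 64K page size is an
assumption, and ddw_max_addr stands in for whatever
ddw_memory_hotplug_max() returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PAGE_SHIFT; 64K pages are assumed here. */
#define SKETCH_PAGE_SHIFT 16

/*
 * Model of the guard added in the MEM_GOING_ONLINE and MEM_OFFLINE cases:
 * TCEs are pre-mapped (or cleared) only for direct-mapped windows and only
 * when the hot-added range starts below the maximum RAM address the window
 * was sized for. A range starting at or above that limit, as the patch
 * description says pmemory does, is skipped.
 */
bool sketch_should_premap(bool window_direct, uint64_t start_pfn,
                          uint64_t ddw_max_addr)
{
    /* Reject PFNs whose byte address would overflow 64 bits. */
    assert(start_pfn <= (UINT64_MAX >> SKETCH_PAGE_SHIFT));

    return window_direct &&
           ((start_pfn << SKETCH_PAGE_SHIFT) < ddw_max_addr);
}
```

With a 1 TiB limit, a range starting at 16 MiB is pre-mapped, while a range
starting exactly at or above 1 TiB is not. Note that only the start of the
range is compared, as in the patch.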