[PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

Gaurav Batra gbatra at linux.vnet.ibm.com
Mon Jun 26 23:12:29 AEST 2023


Thanks a lot


On 6/25/23 11:54 PM, Michael Ellerman wrote:
> Gaurav Batra <gbatra at linux.vnet.ibm.com> writes:
>> Hello Michael,
>>
>> Did you get a chance to look into this patch? I don't mean to rush you.
>> Just wondering if there is anything I can do to help make the patch to
>> Upstream.
> I skimmed it and decided it wasn't a critical bug fix, and hoped someone
> else would review it - silly me :D
>
> But the patch looks simple enough, and the explanation is very good so I
> think it looks good to merge.
>
> I'll apply it for v6.5.
>
> cheers
>
>> On 6/13/23 12:17 PM, Gaurav Batra wrote:
>>> Hello Michael,
>>>
>>> I found this bug while going though the code. This bug is exposed when
>>> DDW is smaller than the max memory of the LPAR. This will result in
>>> creating DDW which will have Dynamically mapped TCEs (no direct mapping).
>>>
>>> I would like to stress that this  bug is exposed only in Upstream
>>> kernel. Current kernel level in Distros are not exposed to this since
>>> they don't have the  concept of "dynamically mapped" DDW.
>>>
>>> I didn't have access to any of the P10 boxes with large amount of
>>> memory to  re-create the scenario. On P10 we have 2MB TCEs, which
>>> results in DDW large enough to be able to cover  max memory I could
>>> have for the LPAR. As a result,  IO Bus Addresses generated were
>>> always within DDW limits and no H_PARAMETER was returned by HCALL.
>>>
>>> So, I hacked the kernel to force the use of 64K TCEs. This resulted in
>>> DDW smaller than max memory.
>>>
>>> When I tried to DLPAR ADD memory, it failed with error code of -4
>>> (H_PARAMETER) from HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when
>>> iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().
>>>
>>> I didn't test the DLPAR REMOVE path, to verify if incorrect TCEs are
>>> removed by tce_clearrange_multi_pSeriesLP(), since I would need to
>>> hack kernel to force dynamically added TCEs to the high IO Bus
>>> Addresses. But, the concept is  same.
>>>
>>> Thanks,
>>>
>>> Gaurav
>>>
>>> On 6/13/23 12:16 PM, Gaurav Batra wrote:
>>>> When memory is dynamically added/removed, iommu_mem_notifier() is
>>>> invoked. This
>>>> routine traverses through all the DMA windows (DDW only, not default
>>>> windows)
>>>> to add/remove "direct" TCE mappings. The routines for this purpose are
>>>> tce_clearrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP().
>>>>
>>>> Both these routines are designed for Direct mapped DMA windows only.
>>>>
>>>> The issue is that there could be some DMA windows in the list which
>>>> are not
>>>> "direct" mapped. Calling these routines will either,
>>>>
>>>> 1) remove some dynamically mapped TCEs, Or
>>>> 2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER
>>>>
>>>> Here are the side affects when these routines are incorrectly invoked
>>>> for
>>>> "dynamically" mapped DMA windows.
>>>>
>>>> tce_setrange_multi_pSeriesLP()
>>>>
>>>> This adds direct mapped TCEs. Now, this could invoke HCALL to add
>>>> TCEs with
>>>> out-of-bound range. In this scenario, HCALL will return H_PARAMETER
>>>> and DLAR
>>>> ADD of memory will fail.
>>>>
>>>> tce_clearrange_multi_pSeriesLP()
>>>>
>>>> This will remove range of TCEs. The TCE range that is calculated,
>>>> depending on
>>>> the memory range being added, could infact be mapping some other memory
>>>> address (for dynamic DMA window scenario). This will wipe out those
>>>> TCEs.
>>>>
>>>> The solution is for iommu_mem_notifier() to only invoke these
>>>> routines for
>>>> "direct" mapped DMA windows.
>>>>
>>>> Signed-off-by: Gaurav Batra <gbatra at linux.vnet.ibm.com>
>>>> Reviewed-by: Brian King <brking at linux.vnet.ibm.com>
>>>> ---
>>>>    arch/powerpc/platforms/pseries/iommu.c | 17 +++++++++++++----
>>>>    1 file changed, 13 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c
>>>> b/arch/powerpc/platforms/pseries/iommu.c
>>>> index 918f511837db..24dd61636400 100644
>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>> @@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
>>>>    struct dma_win {
>>>>        struct device_node *device;
>>>>        const struct dynamic_dma_window_prop *prop;
>>>> +    bool    direct;
>>>>        struct list_head list;
>>>>    };
>>>>
>>>> @@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev,
>>>> struct device_node *pdn)
>>>>            goto out_del_prop;
>>>>
>>>>        if (direct_mapping) {
>>>> +        window->direct = true;
>>>> +
>>>>            /* DDW maps the whole partition, so enable direct DMA
>>>> mapping */
>>>>            ret = walk_system_ram_range(0, memblock_end_of_DRAM() >>
>>>> PAGE_SHIFT,
>>>>                            win64->value,
>>>> tce_setrange_multi_pSeriesLP_walk);
>>>> @@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev,
>>>> struct device_node *pdn)
>>>>            int i;
>>>>            unsigned long start = 0, end = 0;
>>>>
>>>> +        window->direct = false;
>>>> +
>>>>            for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
>>>>                const unsigned long mask = IORESOURCE_MEM_64 |
>>>> IORESOURCE_MEM;
>>>>
>>>> @@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct
>>>> notifier_block *nb, unsigned long action,
>>>>        case MEM_GOING_ONLINE:
>>>>            spin_lock(&dma_win_list_lock);
>>>>            list_for_each_entry(window, &dma_win_list, list) {
>>>> -            ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>>> -                    arg->nr_pages, window->prop);
>>>> +            if (window->direct) {
>>>> +                ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>>> +                        arg->nr_pages, window->prop);
>>>> +            }
>>>>                /* XXX log error */
>>>>            }
>>>>            spin_unlock(&dma_win_list_lock);
>>>> @@ -1597,8 +1604,10 @@ static int iommu_mem_notifier(struct
>>>> notifier_block *nb, unsigned long action,
>>>>        case MEM_OFFLINE:
>>>>            spin_lock(&dma_win_list_lock);
>>>>            list_for_each_entry(window, &dma_win_list, list) {
>>>> -            ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>>> -                    arg->nr_pages, window->prop);
>>>> +            if (window->direct) {
>>>> +                ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>>> +                        arg->nr_pages, window->prop);
>>>> +            }
>>>>                /* XXX log error */
>>>>            }
>>>>            spin_unlock(&dma_win_list_lock);


More information about the Linuxppc-dev mailing list