[PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

Michael Ellerman mpe at ellerman.id.au
Mon Jun 26 14:54:23 AEST 2023


Gaurav Batra <gbatra at linux.vnet.ibm.com> writes:
> Hello Michael,
>
> Did you get a chance to look into this patch? I don't mean to rush you. 
> Just wondering if there is anything I can do to help make the patch to 
> Upstream.

I skimmed it and decided it wasn't a critical bug fix, and hoped someone
else would review it - silly me :D

But the patch looks simple enough, and the explanation is very good so I
think it looks good to merge.

I'll apply it for v6.5.

cheers

> On 6/13/23 12:17 PM, Gaurav Batra wrote:
>> Hello Michael,
>>
>> I found this bug while going though the code. This bug is exposed when 
>> DDW is smaller than the max memory of the LPAR. This will result in 
>> creating DDW which will have Dynamically mapped TCEs (no direct mapping).
>>
>> I would like to stress that this  bug is exposed only in Upstream 
>> kernel. Current kernel level in Distros are not exposed to this since 
>> they don't have the  concept of "dynamically mapped" DDW.
>>
>> I didn't have access to any of the P10 boxes with large amount of 
>> memory to  re-create the scenario. On P10 we have 2MB TCEs, which 
>> results in DDW large enough to be able to cover  max memory I could 
>> have for the LPAR. As a result,  IO Bus Addresses generated were 
>> always within DDW limits and no H_PARAMETER was returned by HCALL.
>>
>> So, I hacked the kernel to force the use of 64K TCEs. This resulted in 
>> DDW smaller than max memory.
>>
>> When I tried to DLPAR ADD memory, it failed with error code of -4 
>> (H_PARAMETER) from HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when 
>> iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().
>>
>> I didn't test the DLPAR REMOVE path, to verify if incorrect TCEs are 
>> removed by tce_clearrange_multi_pSeriesLP(), since I would need to 
>> hack kernel to force dynamically added TCEs to the high IO Bus 
>> Addresses. But, the concept is  same.
>>
>> Thanks,
>>
>> Gaurav
>>
>> On 6/13/23 12:16 PM, Gaurav Batra wrote:
>>> When memory is dynamically added/removed, iommu_mem_notifier() is 
>>> invoked. This
>>> routine traverses through all the DMA windows (DDW only, not default 
>>> windows)
>>> to add/remove "direct" TCE mappings. The routines for this purpose are
>>> tce_clearrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP().
>>>
>>> Both these routines are designed for Direct mapped DMA windows only.
>>>
>>> The issue is that there could be some DMA windows in the list which 
>>> are not
>>> "direct" mapped. Calling these routines will either,
>>>
>>> 1) remove some dynamically mapped TCEs, Or
>>> 2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER
>>>
>>> Here are the side affects when these routines are incorrectly invoked 
>>> for
>>> "dynamically" mapped DMA windows.
>>>
>>> tce_setrange_multi_pSeriesLP()
>>>
>>> This adds direct mapped TCEs. Now, this could invoke HCALL to add 
>>> TCEs with
>>> out-of-bound range. In this scenario, HCALL will return H_PARAMETER 
>>> and DLAR
>>> ADD of memory will fail.
>>>
>>> tce_clearrange_multi_pSeriesLP()
>>>
>>> This will remove range of TCEs. The TCE range that is calculated, 
>>> depending on
>>> the memory range being added, could infact be mapping some other memory
>>> address (for dynamic DMA window scenario). This will wipe out those 
>>> TCEs.
>>>
>>> The solution is for iommu_mem_notifier() to only invoke these 
>>> routines for
>>> "direct" mapped DMA windows.
>>>
>>> Signed-off-by: Gaurav Batra <gbatra at linux.vnet.ibm.com>
>>> Reviewed-by: Brian King <brking at linux.vnet.ibm.com>
>>> ---
>>>   arch/powerpc/platforms/pseries/iommu.c | 17 +++++++++++++----
>>>   1 file changed, 13 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index 918f511837db..24dd61636400 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
>>>   struct dma_win {
>>>       struct device_node *device;
>>>       const struct dynamic_dma_window_prop *prop;
>>> +    bool    direct;
>>>       struct list_head list;
>>>   };
>>>
>>> @@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev, 
>>> struct device_node *pdn)
>>>           goto out_del_prop;
>>>
>>>       if (direct_mapping) {
>>> +        window->direct = true;
>>> +
>>>           /* DDW maps the whole partition, so enable direct DMA 
>>> mapping */
>>>           ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> 
>>> PAGE_SHIFT,
>>>                           win64->value, 
>>> tce_setrange_multi_pSeriesLP_walk);
>>> @@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev, 
>>> struct device_node *pdn)
>>>           int i;
>>>           unsigned long start = 0, end = 0;
>>>
>>> +        window->direct = false;
>>> +
>>>           for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
>>>               const unsigned long mask = IORESOURCE_MEM_64 | 
>>> IORESOURCE_MEM;
>>>
>>> @@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct 
>>> notifier_block *nb, unsigned long action,
>>>       case MEM_GOING_ONLINE:
>>>           spin_lock(&dma_win_list_lock);
>>>           list_for_each_entry(window, &dma_win_list, list) {
>>> -            ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> -                    arg->nr_pages, window->prop);
>>> +            if (window->direct) {
>>> +                ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> +                        arg->nr_pages, window->prop);
>>> +            }
>>>               /* XXX log error */
>>>           }
>>>           spin_unlock(&dma_win_list_lock);
>>> @@ -1597,8 +1604,10 @@ static int iommu_mem_notifier(struct 
>>> notifier_block *nb, unsigned long action,
>>>       case MEM_OFFLINE:
>>>           spin_lock(&dma_win_list_lock);
>>>           list_for_each_entry(window, &dma_win_list, list) {
>>> -            ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> -                    arg->nr_pages, window->prop);
>>> +            if (window->direct) {
>>> +                ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> +                        arg->nr_pages, window->prop);
>>> +            }
>>>               /* XXX log error */
>>>           }
>>>           spin_unlock(&dma_win_list_lock);


More information about the Linuxppc-dev mailing list