[PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

Gaurav Batra gbatra at linux.vnet.ibm.com
Sat Jun 24 00:55:46 AEST 2023


Hello Michael,

Did you get a chance to look into this patch? I don't mean to rush you. 
Just wondering if there is anything I can do to help get the patch 
upstream.

Thanks,

Gaurav

On 6/13/23 12:17 PM, Gaurav Batra wrote:
> Hello Michael,
>
> I found this bug while going through the code. It is exposed when the 
> DDW is smaller than the max memory of the LPAR, which results in a DDW 
> whose TCEs are dynamically mapped (no direct mapping).
>
> I would like to stress that this bug is exposed only in the upstream 
> kernel. Current kernel levels in distros are not affected, since they 
> don't have the concept of a "dynamically mapped" DDW.
>
> I didn't have access to any P10 boxes with a large amount of memory 
> to re-create the scenario. On P10 we have 2MB TCEs, which results in 
> a DDW large enough to cover the max memory I could have for the LPAR. 
> As a result, the IO bus addresses generated were always within the 
> DDW limits and no H_PARAMETER was returned by the HCALL.
>
> So, I hacked the kernel to force the use of 64K TCEs. This resulted 
> in a DDW smaller than max memory.
>
> When I tried to DLPAR ADD memory, it failed with error code -4 
> (H_PARAMETER) from the HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT) when 
> iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().
>
> I didn't test the DLPAR REMOVE path to verify that incorrect TCEs 
> are removed by tce_clearrange_multi_pSeriesLP(), since I would need 
> to hack the kernel to force dynamically added TCEs to the high IO 
> bus addresses. But the concept is the same.
>
> Thanks,
>
> Gaurav
>
> On 6/13/23 12:16 PM, Gaurav Batra wrote:
>> When memory is dynamically added/removed, iommu_mem_notifier() is
>> invoked. This routine traverses all the DMA windows (DDW only, not
>> default windows) to add/remove "direct" TCE mappings. The routines for
>> this purpose are tce_setrange_multi_pSeriesLP() and
>> tce_clearrange_multi_pSeriesLP().
>>
>> Both these routines are designed for Direct mapped DMA windows only.
>>
>> The issue is that some DMA windows in the list may not be "direct"
>> mapped. Calling these routines on them will either,
>>
>> 1) remove some dynamically mapped TCEs, or
>> 2) try to add TCEs which are out of bounds, causing the HCALL to
>>    return H_PARAMETER
>>
>> Here are the side effects when these routines are incorrectly invoked
>> for "dynamically" mapped DMA windows.
>>
>> tce_setrange_multi_pSeriesLP()
>>
>> This adds direct-mapped TCEs. It could invoke the HCALL to add TCEs
>> with an out-of-bounds range. In this scenario, the HCALL will return
>> H_PARAMETER and the DLPAR ADD of memory will fail.
>>
>> tce_clearrange_multi_pSeriesLP()
>>
>> This removes a range of TCEs. The TCE range that is calculated,
>> depending on the memory range being removed, could in fact be mapping
>> some other memory address (in the dynamic DMA window scenario). This
>> will wipe out those TCEs.
>>
>> The solution is for iommu_mem_notifier() to invoke these routines
>> only for "direct" mapped DMA windows.
>>
>> Signed-off-by: Gaurav Batra <gbatra at linux.vnet.ibm.com>
>> Reviewed-by: Brian King <brking at linux.vnet.ibm.com>
>> ---
>>   arch/powerpc/platforms/pseries/iommu.c | 17 +++++++++++++----
>>   1 file changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 918f511837db..24dd61636400 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
>>   struct dma_win {
>>       struct device_node *device;
>>       const struct dynamic_dma_window_prop *prop;
>> +    bool    direct;
>>       struct list_head list;
>>   };
>>
>> @@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>           goto out_del_prop;
>>
>>       if (direct_mapping) {
>> +        window->direct = true;
>> +
>>           /* DDW maps the whole partition, so enable direct DMA mapping */
>>           ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>                           win64->value, tce_setrange_multi_pSeriesLP_walk);
>> @@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>           int i;
>>           unsigned long start = 0, end = 0;
>>
>> +        window->direct = false;
>> +
>>           for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
>>               const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
>>
>> @@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>       case MEM_GOING_ONLINE:
>>           spin_lock(&dma_win_list_lock);
>>           list_for_each_entry(window, &dma_win_list, list) {
>> -            ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> -                    arg->nr_pages, window->prop);
>> +            if (window->direct) {
>> +                ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> +                        arg->nr_pages, window->prop);
>> +            }
>>               /* XXX log error */
>>           }
>>           spin_unlock(&dma_win_list_lock);
>> @@ -1597,8 +1604,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>       case MEM_OFFLINE:
>>           spin_lock(&dma_win_list_lock);
>>           list_for_each_entry(window, &dma_win_list, list) {
>> -            ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> -                    arg->nr_pages, window->prop);
>> +            if (window->direct) {
>> +                ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> +                        arg->nr_pages, window->prop);
>> +            }
>>               /* XXX log error */
>>           }
>>           spin_unlock(&dma_win_list_lock);

