[PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory
Gaurav Batra
gbatra at linux.vnet.ibm.com
Sat Jun 24 00:55:46 AEST 2023
Hello Michael,
Did you get a chance to look into this patch? I don't mean to rush you.
Just wondering if there is anything I can do to help get the patch
upstream.
Thanks,
Gaurav
On 6/13/23 12:17 PM, Gaurav Batra wrote:
> Hello Michael,
>
> I found this bug while going through the code. It is exposed when the
> DDW is smaller than the max memory of the LPAR, which results in a DDW
> with dynamically mapped TCEs (no direct mapping).
>
> I would like to stress that this bug is exposed only in the upstream
> kernel. Current distro kernels are not exposed to it, since they don't
> have the concept of a "dynamically mapped" DDW.
>
> I didn't have access to any P10 boxes with a large amount of memory to
> re-create the scenario. On P10 we have 2MB TCEs, which results in a
> DDW large enough to cover the max memory I could configure for the
> LPAR. As a result, the IO bus addresses generated were always within
> the DDW limits and no H_PARAMETER was returned by the HCALL.
>
> So, I hacked the kernel to force the use of 64K TCEs. This resulted in
> a DDW smaller than the max memory.
>
> When I tried to DLPAR ADD memory, it failed with an error code of -4
> (H_PARAMETER) from the HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT) when
> iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().
>
> I didn't test the DLPAR REMOVE path to verify whether incorrect TCEs
> are removed by tce_clearrange_multi_pSeriesLP(), since I would need to
> hack the kernel to force dynamically added TCEs to high IO bus
> addresses. But the concept is the same.
>
> Thanks,
>
> Gaurav
>
> On 6/13/23 12:16 PM, Gaurav Batra wrote:
>> When memory is dynamically added/removed, iommu_mem_notifier() is
>> invoked. This routine traverses all the DMA windows (DDW only, not
>> default windows) to add/remove "direct" TCE mappings. The routines for
>> this purpose are tce_setrange_multi_pSeriesLP() and
>> tce_clearrange_multi_pSeriesLP().
>>
>> Both these routines are designed for direct-mapped DMA windows only.
>>
>> The issue is that there could be some DMA windows in the list which
>> are not "direct" mapped. Calling these routines on such windows will
>> either:
>>
>> 1) remove some dynamically mapped TCEs, or
>> 2) try to add TCEs which are out of bounds, causing the HCALL to
>>    return H_PARAMETER
>>
>> Here are the side effects when these routines are incorrectly invoked
>> for "dynamically" mapped DMA windows.
>>
>> tce_setrange_multi_pSeriesLP()
>>
>> This adds direct-mapped TCEs. It could invoke the HCALL to add TCEs
>> with an out-of-bounds range. In this scenario, the HCALL will return
>> H_PARAMETER and DLPAR ADD of memory will fail.
>>
>> tce_clearrange_multi_pSeriesLP()
>>
>> This removes a range of TCEs. The TCE range that is calculated,
>> depending on the memory range being removed, could in fact be mapping
>> some other memory address (in the dynamic DMA window scenario). This
>> will wipe out those TCEs.
>>
>> The solution is for iommu_mem_notifier() to invoke these routines
>> only for "direct" mapped DMA windows.
>>
>> Signed-off-by: Gaurav Batra <gbatra at linux.vnet.ibm.com>
>> Reviewed-by: Brian King <brking at linux.vnet.ibm.com>
>> ---
>> arch/powerpc/platforms/pseries/iommu.c | 17 +++++++++++++----
>> 1 file changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 918f511837db..24dd61636400 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
>>  struct dma_win {
>>  	struct device_node *device;
>>  	const struct dynamic_dma_window_prop *prop;
>> +	bool direct;
>>  	struct list_head list;
>>  };
>>
>> @@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>  		goto out_del_prop;
>>
>>  	if (direct_mapping) {
>> +		window->direct = true;
>> +
>>  		/* DDW maps the whole partition, so enable direct DMA mapping */
>>  		ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>  				win64->value, tce_setrange_multi_pSeriesLP_walk);
>> @@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>  		int i;
>>  		unsigned long start = 0, end = 0;
>>
>> +		window->direct = false;
>> +
>>  		for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
>>  			const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
>>
>> @@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>  	case MEM_GOING_ONLINE:
>>  		spin_lock(&dma_win_list_lock);
>>  		list_for_each_entry(window, &dma_win_list, list) {
>> -			ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> -					arg->nr_pages, window->prop);
>> +			if (window->direct) {
>> +				ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>> +						arg->nr_pages, window->prop);
>> +			}
>>  			/* XXX log error */
>>  		}
>>  		spin_unlock(&dma_win_list_lock);
>> @@ -1597,8 +1604,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>  	case MEM_OFFLINE:
>>  		spin_lock(&dma_win_list_lock);
>>  		list_for_each_entry(window, &dma_win_list, list) {
>> -			ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> -					arg->nr_pages, window->prop);
>> +			if (window->direct) {
>> +				ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>> +						arg->nr_pages, window->prop);
>> +			}
>>  			/* XXX log error */
>>  		}
>>  		spin_unlock(&dma_win_list_lock);