[PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory
Michael Ellerman
mpe at ellerman.id.au
Mon Jun 26 14:54:23 AEST 2023
Gaurav Batra <gbatra at linux.vnet.ibm.com> writes:
> Hello Michael,
>
> Did you get a chance to look into this patch? I don't mean to rush you.
> Just wondering if there is anything I can do to help make the patch to
> Upstream.
I skimmed it and decided it wasn't a critical bug fix, and hoped someone
else would review it - silly me :D
But the patch looks simple enough, and the explanation is very good so I
think it looks good to merge.
I'll apply it for v6.5.
cheers
> On 6/13/23 12:17 PM, Gaurav Batra wrote:
>> Hello Michael,
>>
>> I found this bug while going though the code. This bug is exposed when
>> DDW is smaller than the max memory of the LPAR. This will result in
>> creating DDW which will have Dynamically mapped TCEs (no direct mapping).
>>
>> I would like to stress that this bug is exposed only in Upstream
>> kernel. Current kernel level in Distros are not exposed to this since
>> they don't have the concept of "dynamically mapped" DDW.
>>
>> I didn't have access to any of the P10 boxes with large amount of
>> memory to re-create the scenario. On P10 we have 2MB TCEs, which
>> results in DDW large enough to be able to cover max memory I could
>> have for the LPAR. As a result, IO Bus Addresses generated were
>> always within DDW limits and no H_PARAMETER was returned by HCALL.
>>
>> So, I hacked the kernel to force the use of 64K TCEs. This resulted in
>> DDW smaller than max memory.
>>
>> When I tried to DLPAR ADD memory, it failed with error code of -4
>> (H_PARAMETER) from HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when
>> iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().
>>
>> I didn't test the DLPAR REMOVE path, to verify if incorrect TCEs are
>> removed by tce_clearrange_multi_pSeriesLP(), since I would need to
>> hack kernel to force dynamically added TCEs to the high IO Bus
>> Addresses. But, the concept is same.
>>
>> Thanks,
>>
>> Gaurav
>>
>> On 6/13/23 12:16 PM, Gaurav Batra wrote:
>>> When memory is dynamically added/removed, iommu_mem_notifier() is
>>> invoked. This
>>> routine traverses through all the DMA windows (DDW only, not default
>>> windows)
>>> to add/remove "direct" TCE mappings. The routines for this purpose are
>>> tce_clearrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP().
>>>
>>> Both these routines are designed for Direct mapped DMA windows only.
>>>
>>> The issue is that there could be some DMA windows in the list which
>>> are not
>>> "direct" mapped. Calling these routines will either,
>>>
>>> 1) remove some dynamically mapped TCEs, Or
>>> 2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER
>>>
>>> Here are the side affects when these routines are incorrectly invoked
>>> for
>>> "dynamically" mapped DMA windows.
>>>
>>> tce_setrange_multi_pSeriesLP()
>>>
>>> This adds direct mapped TCEs. Now, this could invoke HCALL to add
>>> TCEs with
>>> out-of-bound range. In this scenario, HCALL will return H_PARAMETER
>>> and DLAR
>>> ADD of memory will fail.
>>>
>>> tce_clearrange_multi_pSeriesLP()
>>>
>>> This will remove range of TCEs. The TCE range that is calculated,
>>> depending on
>>> the memory range being added, could infact be mapping some other memory
>>> address (for dynamic DMA window scenario). This will wipe out those
>>> TCEs.
>>>
>>> The solution is for iommu_mem_notifier() to only invoke these
>>> routines for
>>> "direct" mapped DMA windows.
>>>
>>> Signed-off-by: Gaurav Batra <gbatra at linux.vnet.ibm.com>
>>> Reviewed-by: Brian King <brking at linux.vnet.ibm.com>
>>> ---
>>> arch/powerpc/platforms/pseries/iommu.c | 17 +++++++++++++----
>>> 1 file changed, 13 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index 918f511837db..24dd61636400 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
>>> struct dma_win {
>>> struct device_node *device;
>>> const struct dynamic_dma_window_prop *prop;
>>> + bool direct;
>>> struct list_head list;
>>> };
>>>
>>> @@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev,
>>> struct device_node *pdn)
>>> goto out_del_prop;
>>>
>>> if (direct_mapping) {
>>> + window->direct = true;
>>> +
>>> /* DDW maps the whole partition, so enable direct DMA
>>> mapping */
>>> ret = walk_system_ram_range(0, memblock_end_of_DRAM() >>
>>> PAGE_SHIFT,
>>> win64->value,
>>> tce_setrange_multi_pSeriesLP_walk);
>>> @@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev,
>>> struct device_node *pdn)
>>> int i;
>>> unsigned long start = 0, end = 0;
>>>
>>> + window->direct = false;
>>> +
>>> for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
>>> const unsigned long mask = IORESOURCE_MEM_64 |
>>> IORESOURCE_MEM;
>>>
>>> @@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct
>>> notifier_block *nb, unsigned long action,
>>> case MEM_GOING_ONLINE:
>>> spin_lock(&dma_win_list_lock);
>>> list_for_each_entry(window, &dma_win_list, list) {
>>> - ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> - arg->nr_pages, window->prop);
>>> + if (window->direct) {
>>> + ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> + arg->nr_pages, window->prop);
>>> + }
>>> /* XXX log error */
>>> }
>>> spin_unlock(&dma_win_list_lock);
>>> @@ -1597,8 +1604,10 @@ static int iommu_mem_notifier(struct
>>> notifier_block *nb, unsigned long action,
>>> case MEM_OFFLINE:
>>> spin_lock(&dma_win_list_lock);
>>> list_for_each_entry(window, &dma_win_list, list) {
>>> - ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> - arg->nr_pages, window->prop);
>>> + if (window->direct) {
>>> + ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> + arg->nr_pages, window->prop);
>>> + }
>>> /* XXX log error */
>>> }
>>> spin_unlock(&dma_win_list_lock);
More information about the Linuxppc-dev
mailing list