[PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

Alexey Kardashevskiy aik at ozlabs.ru
Thu Dec 12 09:47:30 AEDT 2019



On 12/12/2019 07:31, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
>>
>>
>> On 11/12/2019 02:35, Ram Pai wrote:
>>> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 10/12/2019 16:12, Ram Pai wrote:
>>>>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
>>>>>>
>>>>>>
>>>>>> On 07/12/2019 12:12, Ram Pai wrote:
>>>>>>> The H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries as one
>>>>>>> of its parameters. On secure VMs, the hypervisor cannot access the
>>>>>>> contents of this page since it gets encrypted. Hence share the page with
>>>>>>> the hypervisor, and unshare when done.
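
(For readers following along, the mechanism in question is roughly the
following; an untested sketch, the wrapper name is made up, and the real code
sits in tce_buildmulti_pSeriesLP() which also handles batching and locking:)

/* Share the per-cpu TCE page, issue the hcall, then unshare it again */
static long put_tces_shared(struct iommu_table *tbl, __be64 *tcep,
			    long tcenum, long npages)
{
	long rc;

	if (is_secure_guest())
		uv_share_page(PHYS_PFN(__pa(tcep)), 1);

	rc = plpar_tce_put_indirect((u64)tbl->it_index,
				    (u64)tcenum << 12,	/* 4K IOMMU pages */
				    (u64)__pa(tcep), npages);

	if (is_secure_guest())
		uv_unshare_page(PHYS_PFN(__pa(tcep)), 1);

	return rc;
}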
>>>>>>
>>>>>>
>>>>>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
>>>>>> pages. There is a small problem that when DDW is enabled,
>>>>>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
>>>>>> about the performance on Slack, but this is caused by the initial cleanup
>>>>>> of the default TCE window (which we do not use anyway), and to battle
>>>>>> this we can simply reduce its size by adding
>>>>>
>>>>> Something that takes hardly any time with H_PUT_TCE_INDIRECT takes
>>>>> 13 seconds per device with the H_PUT_TCE approach, during boot. This is
>>>>> with a 30GB guest. With a larger guest, the time will deteriorate further.
>>>>
>>>>
>>>> No it will not, I checked. The time is the same for 2GB and 32GB guests:
>>>> the delay is caused by clearing the small DMA window, which is small in
>>>> the space it maps (1GB) but quite huge in TCEs as it uses 4K pages; and
>>>> for the DDW window + emulated devices the IOMMU page size will be
>>>> 2M/16M/1G (depending on the system), so the number of TCEs is much smaller.
>>>
>>> I can't get your results. What changes did you make to get them?
>>
>>
>> Get what? I passed "-m 2G" and "-m 32G" and got the same time: 13s spent
>> clearing the default window, while the huge window took a fraction of a
>> second to create and map.
> 
> Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
> of H_PUT_TCE everywhere?


Yes. Well, in the DDW case FW_FEATURE_MULTITCE is ignored, but even with that
fixed (I have it in my local branch) it does not make a difference: the slow
part is clearing the default 1GB window, which uses 4K IOMMU pages and
therefore the same number of TCEs regardless of the guest RAM size.


> 
> In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
> iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
> instant),

PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does
exactly. But I am pretty sure the idea is that either both H_STUFF_TCE
and H_PUT_TCE_INDIRECT are present or neither.
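
The guest certainly cannot tell them apart: both hcalls are keyed off the
single "hcall-multi-tce" string in ibm,hypertas-functions, which sets the one
FW_FEATURE_MULTITCE bit. A simplified sketch from memory of how that works
(the real table and parser live in arch/powerpc/platforms/pseries/firmware.c):

struct hypertas_fw_feature {
	unsigned long val;
	char *name;
};

static struct hypertas_fw_feature hypertas_fw_features_table[] = {
	/* ... */
	{ FW_FEATURE_MULTITCE,	"hcall-multi-tce" },
	/* ... */
};

void fw_hypertas_feature_init(const char *hypertas, unsigned long len)
{
	const char *s;
	int i;

	/* hypertas is the NUL-separated string list from the device tree */
	for (s = hypertas; s < hypertas + len; s += strlen(s) + 1)
		for (i = 0; i < ARRAY_SIZE(hypertas_fw_features_table); i++)
			if (!strcmp(s, hypertas_fw_features_table[i].name))
				powerpc_firmware_features |=
					hypertas_fw_features_table[i].val;
}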


> and then force H_PUT_TCE for new mappings via something like:
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 6ba081dd61c9..85d092baf17d 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -194,6 +194,6 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>         unsigned long flags;
>  
> -       if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
> +       if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || is_secure_guest()) {


Nobody (including myself) seems to like the idea of having
is_secure_guest() all over the place.

And with KVM acceleration enabled, it is pretty fast anyway. Right now we do
not have H_PUT_TCE handled in KVM/UV for secure guests, but we will have to
fix that for secure PCI passthrough anyway.


>                 return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
>                                            direction, attrs);
>         }
> 
> That seems like it would avoid the extra 13s.

Or move iommu_table_clear() elsewhere, which IMHO is just the right thing to do.
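
Something like this (completely untested, just to illustrate the idea; the
clearing would then only happen when VFIO actually takes the table over):

--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ ... @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
-	iommu_table_clear(tbl);
+	/* No need to wipe the table at boot, the kernel owns it anyway */
 
@@ ... @@ int iommu_take_ownership(struct iommu_table *tbl)
+	/* VFIO SPAPR TCE v1 reuses the table, so clear stale entries now */
+	iommu_table_clear(tbl);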


> If we take the additional step of only mapping SWIOTLB range in
> enable_ddw() for is_secure_guest() that might further improve things
> (though the bigger motivation with that is the extra isolation it would
> grant us for stuff behind the IOMMU, since it apparently doesn't affect
> boot-time all that much)


Sure, we just need to confirm how many of these swiotlb banks we are going to
have (just one or many, and at what location).
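
Mapping just the swiotlb buffer there could look roughly like this (untested
sketch from memory of enable_ddw(); the secure-guest branch and the use of
io_tlb_start/io_tlb_end are my assumptions, not existing code):

	if (is_secure_guest()) {
		/* Only the bounce buffer is ever used for DMA, map just it */
		unsigned long start_pfn = PHYS_PFN(io_tlb_start);
		unsigned long num_pfn = PHYS_PFN(io_tlb_end - io_tlb_start);

		ret = tce_setrange_multi_pSeriesLP(start_pfn, num_pfn,
						   win64->value);
	} else {
		/* Current behaviour: map every RAM range into the window */
		ret = walk_system_ram_range(0,
				memblock_end_of_DRAM() >> PAGE_SHIFT,
				win64->value,
				tce_setrange_multi_pSeriesLP_walk);
	}

Thanks,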



> 
>>
>>
>>>>>>
>>>>>> -global
>>>>>> spapr-pci-host-bridge.dma_win_size=0x4000000
>>>>>
>>>>> This option speeds it up tremendously. But then should this option be
>>>>> enabled in QEMU by default? Only for secure VMs? For both?
>>>>
>>>>
>>>> As discussed on Slack, by default we do not need to clear the entire TCE
>>>> table; we only have to map the swiotlb buffer using the small window. It
>>>> is a guest kernel change only. Thanks,
>>>
>>> Can you tell me what code you are talking about here? Where is the TCE
>>> table getting cleared? What code needs to be changed to not clear it?
>>
>>
>> pci_dma_bus_setup_pSeriesLP()
>>         iommu_init_table()
>>                 iommu_table_clear()
>>                         for () tbl->it_ops->get()
>>
>> We do not really need to clear it there; we only need that for VFIO with
>> IOMMU SPAPR TCE v1, which reuses these tables, but there are
>> iommu_take_ownership/iommu_release_ownership to clear them. I'll
>> send a patch for this.
> 
> 
>>
>>
>>> Is the code in tce_buildmulti_pSeriesLP() the one that does the clearing
>>> as well?
>>
>>
>> This one does not need to clear TCEs as this creates a window of known
>> size and maps it all.
>>
>> Well, actually, it only maps actual guest RAM; if there are gaps in RAM,
>> then the TCEs for the gaps will have whatever the hypervisor had there
>> (which is zeroes, as QEMU/KVM clears it anyway).
>>
>>
>>> But before I close, you have not told me clearly what the problem is
>>> with: 'share the page, make the H_PUT_TCE_INDIRECT hcall, unshare the page'.
>>
>> Between share and unshare you have a (tiny) window of opportunity to
>> attack the guest. No, I do not know how exactly.
>>
>> For example, the hypervisor does a lot of PHB+PCI hotplug/unplug with
>> 64bit devices; each time this will create a huge window which will
>> share/unshare the same page. No, I do not know exactly how this can
>> be exploited either; we cannot rely on what you or I know today. My
>> point is that we should not be sharing pages at all unless we really,
>> really have to, and this does not seem to be the case.
>>
>> But since this seems to be an acceptable compromise anyway,
>>
>> Reviewed-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>
>>
>>
>>
>>
>>> Remember, this is the same page that is earmarked for doing
>>> H_PUT_TCE_INDIRECT, not by my patch, but already earmarked by the
>>> existing code. So it is not some random buffer that is picked. Second,
>>> this page is temporarily shared and unshared; it does not stay shared
>>> for life. It does not slow the boot. It does not need any
>>> special command line options on the QEMU side.
>>> Shared pages technology was put in place exactly for the purpose of
>>> sharing data with the hypervisor. We are using this technology exactly
>>> for that purpose. And finally, I agreed with your concern about having
>>> shared pages staying around, and I addressed that concern by
>>> unsharing the page. At this point, I fail to understand your concern.
>>
>>
>>
>>
>> -- 
>> Alexey

-- 
Alexey

