[PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list

Alexey Kardashevskiy aik at ozlabs.ru
Wed Jan 11 17:35:21 AEDT 2017


On 21/12/16 19:57, Alexey Kardashevskiy wrote:
> On 21/12/16 15:08, David Gibson wrote:
>> On Sun, Dec 18, 2016 at 12:28:54PM +1100, Alexey Kardashevskiy wrote:
>>> VFIO on sPAPR already implements guest memory pre-registration
>>> when the entire guest RAM gets pinned. This can be used to translate
>>> the physical address of a guest page containing the TCE list
>>> from H_PUT_TCE_INDIRECT.
>>>
>>> This makes use of the pre-registered memory API to access TCE list
>>> pages in order to avoid unnecessary locking on the KVM memory
>>> reverse map as we know that all of guest memory is pinned and
>>> we have a flat array mapping GPA to HPA which makes it simpler and
>>> quicker to index into that array (even with looking up the
>>> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
>>> lock the rmap entry, look up the user page tables, and unlock the rmap
>>> entry. Note that the rmap pointer is initialized to NULL
>>> where declared (not in this patch).
>>>
>>> If a requested chunk of memory has not been preregistered,
>>> this will fail with H_TOO_HARD so the virtual mode handler can
>>> handle the request.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>> ---
>>> Changes:
>>> v2:
>>> * updated the commit log with David's comment
>>> ---
>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 ++++++++++++++++++++++++++++---------
>>>  1 file changed, 49 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> index d461c440889a..a3be4bd6188f 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>>>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>>>  
>>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
>>> +{
>>> +	return mm_iommu_preregistered(vcpu->kvm->mm);
>>> +}
>>> +
>>> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
>>> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
>>> +{
>>> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
>>> +}
>>
>> I don't see that there's much point to these inlines.
>>
>>>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>  		unsigned long ioba, unsigned long tce)
>>>  {
>>> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>  	if (ret != H_SUCCESS)
>>>  		return ret;
>>>  
>>> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>>> -		return H_TOO_HARD;
>>> +	if (kvmppc_preregistered(vcpu)) {
>>> +		/*
>>> +		 * We get here if guest memory was pre-registered which
>>> +		 * is normally VFIO case and gpa->hpa translation does not
>>> +		 * depend on hpt.
>>> +		 */
>>> +		struct mm_iommu_table_group_mem_t *mem;
>>>  
>>> -	rmap = (void *) vmalloc_to_phys(rmap);
>>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
>>> +			return H_TOO_HARD;
>>>  
>>> -	/*
>>> -	 * Synchronize with the MMU notifier callbacks in
>>> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>>> -	 * While we have the rmap lock, code running on other CPUs
>>> -	 * cannot finish unmapping the host real page that backs
>>> -	 * this guest real page, so we are OK to access the host
>>> -	 * real page.
>>> -	 */
>>> -	lock_rmap(rmap);
>>> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>>> -		ret = H_TOO_HARD;
>>> -		goto unlock_exit;
>>> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
>>> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
>>> +			return H_TOO_HARD;
>>> +	} else {
>>> +		/*
>>> +		 * This is emulated devices case.
>>> +		 * We do not require memory to be preregistered in this case
>>> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
>>> +		 */
>>
>> Hmm.  So this isn't wrong as such, but the logic and comments are
>> both misleading.  The 'if' here isn't really about VFIO vs. emulated -
>> it's about whether the mm has *any* preregistered chunks, without any
>> regard to which particular device you're talking about.  For example
>> if your guest has two PHBs, one with VFIO devices and the other with
>> emulated devices, then the emulated devices will still go through the
>> "VFIO" case here.
> 
> kvmppc_preregistered() checks a single pointer, while kvmppc_rm_ua_to_hpa()
> goes through __find_linux_pte_or_hugepte(), which is an unnecessary
> complication here.
> 
> s/emulated devices case/case of a guest with emulated devices only/ ?
> 
> 
>> Really what you have here is a fast case when the tce_list is in
>> preregistered memory, and a fallback case when it isn't.  But that's
>> obscured by the fact that if for whatever reason you have some
>> preregistered memory but it doesn't cover the tce_list, then you don't
>> go to the fallback case here, but instead fall right back to the
>> virtual mode handler.
> 
> This is purely an acceleration: I am trying to make the obvious cases
> faster and the other cases safer. If some chunks are preregistered but
> others are not, and an H_PUT_TCE_INDIRECT arrives with a tce_list in
> non-preregistered memory, then I have no idea what that userspace is or
> what it is doing, so I just do not want to accelerate anything for it in
> real mode (I have poor imagination and since I cannot test it - I had
> better drop it).
> 
> 
>> So, I think you should either:
>>     1) Fallback to the code below whenever you can't access the
>>        tce_list via prereg memory, regardless of whether there's any
>>        _other_ prereg memory
> 
> Using prereg was the entire point here: if something goes wrong (i.e.
> some TCE fails to translate), I may stop in the middle of the TCE list and
> will have to do a complicated rollback to pass the request to the virtual
> mode handler to finish processing (note that there is nothing to revert
> when the guest has emulated devices only).
> 
> 
>> or
>>     2) Drop the code below entirely and always return H_TOO_HARD if
>>        you can't get the tce_list from prereg.
> 
> This path cannot fail for an emulated device and it really is the fast
> path, so why drop it?
> 
> 
> I am _really_ confused now. In the last few respins this was not a
> concern, but now it is - is the patch so bad that it needs a complete
> rework? I have been trying to push it for the last 3 years now :(
> 


Ping?


> 
>>
>>> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
>>> +			return H_TOO_HARD;
>>> +
>>> +		rmap = (void *) vmalloc_to_phys(rmap);
>>> +
>>> +		/*
>>> +		 * Synchronize with the MMU notifier callbacks in
>>> +		 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
>>> +		 * While we have the rmap lock, code running on other CPUs
>>> +		 * cannot finish unmapping the host real page that backs
>>> +		 * this guest real page, so we are OK to access the host
>>> +		 * real page.
>>> +		 */
>>> +		lock_rmap(rmap);
>>> +		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
>>> +			ret = H_TOO_HARD;
>>> +			goto unlock_exit;
>>> +		}
>>>  	}
>>>  
>>>  	for (i = 0; i < npages; ++i) {
>>> @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>  	}
>>>  
>>>  unlock_exit:
>>> -	unlock_rmap(rmap);
>>> +	if (rmap)
>>> +		unlock_rmap(rmap);
>>>  
>>>  	return ret;
>>>  }
>>
> 
> 


-- 
Alexey
