[PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

Fri Feb 10 18:58:30 AEDT 2017

On 10/02/17 15:50, David Gibson wrote:
> On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
>> On 10/02/17 14:07, David Gibson wrote:
>>> On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
>>>> On 09/02/17 14:51, David Gibson wrote:
>>>>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
>>>>>> For the emulated devices it does not matter much if we get a broken TCE
>>>>>> half way handling a TCE list but for VFIO it will matter as it has
>>>>>> more chances to fail so we try to do our best and check as much as we
>>>>>> can before proceeding.
>>>>>>
>>>>>> This separates a guest view table update from validation. No change in
>>>>>> behavior is expected.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>>>>> ---
>>>>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
>>>>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
>>>>>>  2 files changed, 14 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> index 15df8ae627d9..9a7b7fca5e84 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>>>  		if (ret != H_SUCCESS)
>>>>>>  			goto unlock_exit;
>>>>>> +	}
>>>>>> +
>>>>>> +	for (i = 0; i < npages; ++i) {
>>>>>> +		if (get_user(tce, tces + i)) {
>>>>>> +			ret = H_TOO_HARD;
>>>>>> +			goto unlock_exit;
>>>>>> +		}
>>>>>> +		tce = be64_to_cpu(tce);
>>>>>
>>>>> This doesn't look safe.  The contents of user memory could change
>>>>> between the two get_user()s, meaning that you're no longer guaranteed
>>>>> a TCE loaded into kernel has been validated at all.
>>>>>
>>>>> I think you need to either:
>>>>>
>>>>>     a) Make sure things safe against a bad TCE being loaded into a TCE
>>>>>     table and move all validation to where the TCE is used, rather
>>>>>     than loaded
>>>>>
>>>>> or
>>>>>     b) Copy the whole set of indirect entries to a temporary in-kernel
>>>>>        buffer, then validate, then load into the actual TCE table.
>>>>
>>>>
>>>> Correct :( The problem is I do not know how far I want to go in reverting
>>>> the state as it was when I started handling H_PUT_TCE_INDIRECT.
>>>>
>>>> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
>>>> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
>>>> address.
>>>>
>>>>
>>>> To do a) I'll need to remember old content of each hardware table entry as
>>>> when I reach TCE#100, I'll need to revert to the initial state which means
>>>> I need to write back old TCEs to all affected hardware tables and update
>>>> reference counters of all affected preregistered areas. Well, the actual
>>>> tables must not have different addresses (BUG_ON? is it worth testing while
>>>> writing to hardware tables that values I am replacing are the same in all
>>>> tables?) so I can have just a single array of old TCEs from hardware tables
>>>> in vcpu.
>>>
>>> I thought you said shared tables were disabled, so the two tables
>>> would have different addresses?
>>
>> That would be 2 physically separated tables but the content would be the
>> same as long as they belong to the same VFIO container.
> 
> Ok.  I thought you were talking about the address of the TCE tables
> being the same above.

No, the example uses 2 separate TCE tables.

> Did you mean the address of an individual page
> mapped in the TCE table?

I meant the tables themselves are separate in the host memory but their
content is the same.

>>> Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
>>> only if the guest/qemu does something wrong, or can it fail for other
>>> reasons? 
>>
>> This should always just work.
> 
> Ok, given that, just replacing HPAs we can't translate with a clear
> entry seems fine to me.

Ok.

>>> What about in real mode vs. virtual mode?
>>
>> Real mode is no different in this matter.
>>
>> Real mode is different from virtual mode in 3 aspects:
>>
>> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
>> inhibited writes to invalidate "TCE kill" cache;
>>
>> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of
>> lockdep does not work in real mode properly;
>>
>> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
>> addresses directly. Not expected to fail.
>>
>> This is a full list.
> 
> Ok.
> 
>>> I think the key to this approach will be to think carefully about what
>>> semantics you guarantee for mappings shadowed into the hardware
>>> tables.  For example, it might work to specify that the host mappings
>>> only match the GPA mappings if those GPA mapings are valid in the
>>> first place.  So, H_PUT_TCE etc. would succeed as long as they're able
>>> to update the view of the table in terms of GPA.  But when you shadow
>>> those into the HPA tables, any entries which can't be translated you
>>> just replace with a cleared entry.
>>
>> Literally with zero? Silently? WARN_ON_ONCE?
> 
> Well, with a no-permission TCE, which might as well be zero, yes.
> 
> WARN_ON_ONCE() is probably a good idea.
> 
>>> That should be enough to protect
>>> the host.  Obviously you can expect the device to fail when you
>>> actually attempt to DMA there, but that's the guest's (or qemu's) own
>>> fault for putting bad addresses in the TCE table.
>>>
>>> Obviously that might not be great for debugging, since mappings will
>>> appear to succeed, but then not work later on.
>>>
>>> This does have the nice property that it's reasonably obvious what to
>>> do if you have some GPA mappings for emulated devices, then hotplug a
>>> VFIO device and at that point hit a gpa->hpa translation error.
>>> There's no hcall in this case, so there's no obvious way to return an
>>> error to the guest.
>>
>> Right. So if I do this, you would probably even ack this? :)
> 
> Assuming I don't spot some other showstopper...
> 
> Oh.. one thing to make sure you think about though: what happens if a
> guest makes some mappings, then there's a memory hotplug event which
> changes the set of valid GPAs?  In particular what if you hot unplug
> some memory which is mapped in a guest TCE table?  You might have to
> regenerate the HPA tables from the GPA table on hot unplug (unless you
> have a way of locking out an unplug event while that piece of guest
> ram is TCE mapped).

The guest is expected to clear TCE table. Then QEMU will delete regions
which will invoke unregistration of previously registered memory and if the
guest failed to clear TCE table, these preregistered pages will remain
pinned, this is what is that mm_iommu_mapped_inc/dec is about.

>>>> To do b) I'll need:
>>>>
>>>> 1. to have a copy of TCEs from the guest in vcpu,
>>>
>>> I don't quite understand this.  You need a temporary copy, yes, but I
>>> don't see why it needs to be attached to the vcpu.
>>
>> It does not need, I just need a safe + static + lock-free place for it as I
>> do not want to do malloc() in the TCE handlers and (in theory) multiple
>> CPUs can do concurrent TCE requests and I want to avoid locking especially
>> in realmode.
> 
> Ah, right, it's the inability to malloc() that's the difficulty.  You
> could put it in the vcpu, or you could use a per-(host)-cpu area - you
> can't switch guests while in a realmode handler.

vcpu looks as a safe choice, it is just a bit annoying that each CPU will
use 4K for something which most likely won't be used though.

>>
>>
>>>> I populate it via
>>>> get_user() to make sure they won't change;
>>>> 2. an array of userspace addresses translated from given TCEs; and in order
>>>> to make sure these addresses won't go away, I'll need to reference each
>>>> preregistered memory area via mm_iommu_mapped_inc().
>>>>
>>>> When I reach TCE#100, I'll have to revert the change, i.e. call
>>>> mm_iommu_mapped_dec().
>>>
>>> Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
>>> the updated translations into a temp buffer.  Then you'd to make more
>>> temp buffers to store the UA and HPA translations (although maybe you
>>> could overwrite/reuse the original temp buffer if you're careful).
>>> Then only if all of those succeed do you copy them into the real
>>> hardware tables.
>>>
>>> Which sounds like it might be kinda messy, at least in real mode.
>>
>> So is it worth it?
> 
> Option (a) is certainly looking better to me based on current
> information.
> 
>>>> So I will end up having 2 arrays in a vcpu and simpler reverting code.
>>>>
>>>>
>>>> Or I can do simpler version of b) which would store guest TCEs in
>>>> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
>>>> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
>>>> request, some preregistered regions will stay referenced till the guest is
>>>> killed or rebooted (and this will prevent memory from unregistering) - but
>>>> this means no harm to the host;
>>>
>>> Hrm.. that's not really true.  It's not the worst thing that can
>>> happen, but allowing the guest to permanently lock extra chunks of
>>> memory is a form of harm to the host.
>>
>>
>> These are the same preregistered chunks which are already locked. And the
>> lock is there till QEMU process is dead. What will not be possible is
>> memory hotunplug.
> 
> Ah, ok, I see your point.  That's probably sufficient, but option (a)
> is still looking better.
> 
>>>> and with preregistered RAM, there is no
>>>> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
>>>>
>>>>
>>>>
>>>> Which approach to pick?
>>>>
>>>>
>>>> LoPAPR says:
>>>> ===
>>>> If the TCE parameter represents the logical page address of a page that is
>>>> not valid for the calling partition, return
>>>> H_Parameter.
>>>> ===
>>>>
>>>>
>>>>
>>>>>>  
>>>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>>>  	}
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> index 918af76ab2b6..f8a54b7c788e 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  {
>>>>>>  	struct kvmppc_spapr_tce_table *stt;
>>>>>>  	long i, ret = H_SUCCESS;
>>>>>> -	unsigned long tces, entry, ua = 0;
>>>>>> +	unsigned long tces, entry, tce, ua = 0;
>>>>>>  	unsigned long *rmap = NULL;
>>>>>>  
>>>>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>>>>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>  	}
>>>>>>  
>>>>>>  	for (i = 0; i < npages; ++i) {
>>>>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>>  
>>>>>>  		ret = kvmppc_tce_validate(stt, tce);
>>>>>>  		if (ret != H_SUCCESS)
>>>>>>  			goto unlock_exit;
>>>>>> +	}
>>>>>> +
>>>>>> +	for (i = 0; i < npages; ++i) {
>>>>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
>>>>>
>>>>> Same problem here.
>>>>>
>>>>>>  
>>>>>>  		kvmppc_tce_put(stt, entry + i, tce);
>>>>>>  	}
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
> 
> 
> 
> 

-- 
Alexey

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 839 bytes
Desc: OpenPGP digital signature
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20170210/ab6004de/attachment-0001.sig>