[PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

David Gibson david at gibson.dropbear.id.au
Fri Feb 10 14:07:23 AEDT 2017


On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> On 09/02/17 14:51, David Gibson wrote:
> > On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >> For emulated devices it does not matter much if we hit a broken TCE
> >> halfway through handling a TCE list, but for VFIO it will matter as
> >> it has more chances to fail, so we try to do our best and check as
> >> much as we can before proceeding.
> >>
> >> This separates a guest view table update from validation. No change in
> >> behavior is expected.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> >> ---
> >>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 15df8ae627d9..9a7b7fca5e84 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		if (get_user(tce, tces + i)) {
> >> +			ret = H_TOO_HARD;
> >> +			goto unlock_exit;
> >> +		}
> >> +		tce = be64_to_cpu(tce);
> > 
> > This doesn't look safe.  The contents of user memory could change
> > between the two get_user()s, meaning that you're no longer guaranteed
> > a TCE loaded into kernel has been validated at all.
> > 
> > I think you need to either:
> > 
> >     a) Make sure things safe against a bad TCE being loaded into a TCE
> >     table and move all validation to where the TCE is used, rather
> >     than loaded
> > 
> > or
> >     b) Copy the whole set of indirect entries to a temporary in-kernel
> >        buffer, then validate, then load into the actual TCE table.
> 
> 
> Correct :( The problem is I do not know how far I want to go in reverting
> the state back to what it was when I started handling H_PUT_TCE_INDIRECT.
> 
> For example: 1 container, 2 IOMMU groups with shared tables disabled, so
> 2 tables, a 512-TCE request, and TCE#100 does not translate to a host
> physical address.
> 
> 
> To do a) I'll need to remember the old content of each hardware table
> entry, as when I reach TCE#100 I'll need to revert to the initial state,
> which means writing the old TCEs back to all affected hardware tables and
> updating the reference counters of all affected preregistered areas. Well,
> the actual tables must not have different addresses (BUG_ON? is it worth
> testing, while writing to the hardware tables, that the values I am
> replacing are the same in all tables?) so I can have just a single array
> of old TCEs from the hardware tables in the vcpu.

I thought you said shared tables were disabled, so the two tables
would have different addresses?

Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
only if the guest/qemu does something wrong, or can it fail for other
reasons?  What about in real mode vs. virtual mode?

I think the key to this approach will be to think carefully about what
semantics you guarantee for mappings shadowed into the hardware
tables.  For example, it might work to specify that the host mappings
only match the GPA mappings if those GPA mappings are valid in the
first place.  So, H_PUT_TCE etc. would succeed as long as they're able
to update the view of the table in terms of GPA.  But when you shadow
those into the HPA tables, any entries which can't be translated you
just replace with a cleared entry. That should be enough to protect
the host.  Obviously you can expect the device to fail when you
actually attempt to DMA there, but that's the guest's (or qemu's) own
fault for putting bad addresses in the TCE table.
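
In other words, something like the sketch below for the shadowing step
- kvmppc_gpa_to_hpa() and kvmppc_hw_tce_put() are made-up names here,
standing in for whatever the real translation helper and hardware table
accessor end up being:

	/*
	 * Sketch of a): the GPA view of the table is always updated,
	 * but an entry whose GPA cannot be translated is shadowed into
	 * the hardware table as a cleared entry, so a bad TCE can never
	 * expose a host address it should not.
	 */
	static void kvmppc_shadow_tce(struct kvmppc_spapr_tce_table *stt,
				      unsigned long entry,
				      unsigned long gpa_tce)
	{
		unsigned long hpa;

		if (kvmppc_gpa_to_hpa(stt, gpa_tce, &hpa)) {
			/* Untranslatable: clear the hardware entry */
			kvmppc_hw_tce_put(stt, entry, 0);
			return;
		}

		kvmppc_hw_tce_put(stt, entry,
				hpa | (gpa_tce & (TCE_PCI_READ | TCE_PCI_WRITE)));
	}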

Obviously that might not be great for debugging, since mappings will
appear to succeed, but then not work later on.

This does have the nice property that it's reasonably obvious what to
do if you have some GPA mappings for emulated devices, then hotplug a
VFIO device and at that point hit a gpa->hpa translation error.
There's no hcall in this case, so there's no obvious way to return an
error to the guest.

> To do b) I'll need:
> 
> 1. to have a copy of the TCEs from the guest in the vcpu,

I don't quite understand this.  You need a temporary copy, yes, but I
don't see why it needs to be attached to the vcpu.

> I populate it via
> get_user() to make sure they won't change;
> 2. an array of userspace addresses translated from given TCEs; and in order
> to make sure these addresses won't go away, I'll need to reference each
> preregistered memory area via mm_iommu_mapped_inc().
> 
> When I reach TCE#100, I'll have to revert the change, i.e. call
> mm_iommu_mapped_dec().

Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
the updated translations into a temp buffer.  Then you'd have to make more
temp buffers to store the UA and HPA translations (although maybe you
could overwrite/reuse the original temp buffer if you're careful).
Then only if all of those succeed do you copy them into the real
hardware tables.

Which sounds like it might be kinda messy, at least in real mode.
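
For the virtual mode side at least, the shape I have in mind is roughly
the sketch below - reusing the names from the existing function, with
locking elided, and the on-stack buffer purely illustrative (where the
temporary actually lives is a separate question):

	/*
	 * Sketch of b): snapshot the guest's TCE list once, so that
	 * validation and the later table update operate on the same
	 * values and userspace cannot change them in between.
	 */
	u64 tmp[512];		/* illustrative only */
	long i, ret;

	if (npages > ARRAY_SIZE(tmp))
		return H_PARAMETER;

	for (i = 0; i < npages; ++i) {
		if (get_user(tmp[i], tces + i))
			return H_TOO_HARD;
		tmp[i] = be64_to_cpu(tmp[i]);

		ret = kvmppc_tce_validate(stt, tmp[i]);
		if (ret != H_SUCCESS)
			return ret;
	}

	/* Only now touch the guest view of the table */
	for (i = 0; i < npages; ++i)
		kvmppc_tce_put(stt, entry + i, tmp[i]);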

> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> 
> 
> Or I can do a simpler version of b) which would store the guest TCEs in
> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> does something bad and I return from H_PUT_TCE_INDIRECT in the middle of a
> request, some preregistered regions will stay referenced until the guest
> is killed or rebooted (and this will prevent the memory from being
> unregistered) - but this means no harm to the host;

Hrm.. that's not really true.  It's not the worst thing that can
happen, but allowing the guest to permanently lock extra chunks of
memory is a form of harm to the host.

> and with preregistered RAM, there is no
> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> 
> 
> 
> Which approach to pick?
> 
> 
> LoPAPR says:
> ===
> If the TCE parameter represents the logical page address of a page that is
> not valid for the calling partition, return
> H_Parameter.
> ===
> 
> 
> 
> >>  
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> index 918af76ab2b6..f8a54b7c788e 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  {
> >>  	struct kvmppc_spapr_tce_table *stt;
> >>  	long i, ret = H_SUCCESS;
> >> -	unsigned long tces, entry, ua = 0;
> >> +	unsigned long tces, entry, tce, ua = 0;
> >>  	unsigned long *rmap = NULL;
> >>  
> >>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>  	}
> >>  
> >>  	for (i = 0; i < npages; ++i) {
> >> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>  
> >>  		ret = kvmppc_tce_validate(stt, tce);
> >>  		if (ret != H_SUCCESS)
> >>  			goto unlock_exit;
> >> +	}
> >> +
> >> +	for (i = 0; i < npages; ++i) {
> >> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> > 
> > Same problem here.
> > 
> >>  
> >>  		kvmppc_tce_put(stt, entry + i, tce);
> >>  	}
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson