[PATCH v2] KVM: PPC: Book3S HV: Sanitise vcpu registers in nested path

Wed Apr 7 20:01:00 AEST 2021

Excerpts from Fabiano Rosas's message of April 7, 2021 7:46 am:
> As one of the arguments of the H_ENTER_NESTED hypercall, the nested
> hypervisor (L1) prepares a structure containing the values of various
> hypervisor-privileged registers with which it wants the nested guest
> (L2) to run. Since the nested HV runs in supervisor mode it needs the
> host to write to these registers.
> 
> To stop a nested HV manipulating this mechanism and using a nested
> guest as a proxy to access a facility that has been made unavailable
> to it, we have a routine that sanitises the values of the HV registers
> before copying them into the nested guest's vcpu struct.
> 
> However, when coming out of the guest the values are copied as they
> were back into L1 memory, which means that any sanitisation we did
> during guest entry will be exposed to L1 after H_ENTER_NESTED returns.
> 
> This patch alters this sanitisation to have effect on the vcpu->arch
> registers directly before entering and after exiting the guest,
> leaving the structure that is copied back into L1 unchanged (except
> when we really want L1 to access the value, e.g the Cause bits of
> HFSCR).
> 
> Signed-off-by: Fabiano Rosas <farosas at linux.ibm.com>

I like this direction better. Now "sanitise" may be the wrong word 
because you aren't cleaning the data in place any more but copying a 
cleaned version across.

Which is fine but I wouldn't call it sanitise. Actually I would
prefer if it is just done as part of the copy rather than
copy first and then apply this (explained below in the code).

> ---
> I'm taking another shot at fixing this locally without resorting to
> more complex things such as error handling and feature
> advertisement/negotiation.

That's okay, I think those are orthogonal problems. This won't help if
a nested HV tries to use some HFSCR feature it believes should be
available to a (say) POWER10 processor but got secretly masked away.
But also such negotiation doesn't help with trying to minimise L0 HV
data accessible to guests.

(As before a guest can easily find out many of these things if it is
determined to, but that does not mean I'm strongly against what you
are doing here)

> 
> Changes since v1:
> 
> - made the change more generic, not only applies to hfscr anymore;
> - sanitisation is now done directly on the vcpu struct, l2_hv is left unchanged;
> 
> v1:
> 
> https://lkml.kernel.org/r/20210305231055.2913892-1-farosas@linux.ibm.com
> ---
>  arch/powerpc/kvm/book3s_hv_nested.c | 33 +++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 0cd0e7aad588..a60fccb2c4f2 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -132,21 +132,37 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
>  	}
>  }
>  
> -static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
> +static void sanitise_vcpu_entry_state(struct kvm_vcpu *vcpu,
> +				      const struct hv_guest_state *l2_hv,
> +				      const struct hv_guest_state *l1_hv)
>  {
>  	/*
>  	 * Don't let L1 enable features for L2 which we've disabled for L1,
>  	 * but preserve the interrupt cause field.
>  	 */
> -	hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
> +	vcpu->arch.hfscr = l2_hv->hfscr & (HFSCR_INTR_CAUSE | l1_hv->hfscr);
>  
>  	/* Don't let data address watchpoint match in hypervisor state */
> -	hr->dawrx0 &= ~DAWRX_HYP;
> -	hr->dawrx1 &= ~DAWRX_HYP;
> +	vcpu->arch.dawrx0 = l2_hv->dawrx0 & ~DAWRX_HYP;
> +	vcpu->arch.dawrx1 = l2_hv->dawrx1 & ~DAWRX_HYP;
>  
>  	/* Don't let completed instruction address breakpt match in HV state */
> -	if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> -		hr->ciabr &= ~CIABR_PRIV;
> +	if ((l2_hv->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
> +		vcpu->arch.ciabr = l2_hv->ciabr & ~CIABR_PRIV;
> +}
> +
> +
> +/*
> + * During sanitise_vcpu_entry_state() we might have used bits from L1
> + * state to restrict what the L2 state is allowed to be. Since L1 is
> + * not allowed to read the HV registers, do not include these
> + * modifications in the return state.
> + */
> +static void sanitise_vcpu_return_state(struct kvm_vcpu *vcpu,
> +				       const struct hv_guest_state *l2_hv)
> +{
> +	vcpu->arch.hfscr = ((~HFSCR_INTR_CAUSE & l2_hv->hfscr) |
> +			(HFSCR_INTR_CAUSE & vcpu->arch.hfscr));
>  }
>  
>  static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
> @@ -324,9 +340,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>  	mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
>  		LPCR_LPES | LPCR_MER;
>  	lpcr = (vc->lpcr & ~mask) | (l2_hv.lpcr & mask);
> -	sanitise_hv_regs(vcpu, &l2_hv);
>  	restore_hv_regs(vcpu, &l2_hv);
>  
> +	sanitise_vcpu_entry_state(vcpu, &l2_hv, &saved_l1_hv);

So instead of doing this, can we just have one function that does
load_hv_regs_for_l2()?

> +
>  	vcpu->arch.ret = RESUME_GUEST;
>  	vcpu->arch.trap = 0;
>  	do {
> @@ -338,6 +355,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>  		r = kvmhv_run_single_vcpu(vcpu, hdec_exp, lpcr);
>  	} while (is_kvmppc_resume_guest(r));
>  
> +	sanitise_vcpu_return_state(vcpu, &l2_hv);

And this could be done in save_hv_return_state().

I think?

Question about HFSCR. Is it possible for some interrupt cause bit
reaching the nested hypervisor for a bit that we thought we had
enabled but was secretly masked off? I.e., do we have to filter
HFSCR causes according to the facilities we secretly disabled?

Thanks,
Nick