[PATCH v3 1/2] KVM: PPC: Book3S HV: Sanitise vcpu registers in nested path

Tue May 4 18:35:49 AEST 2021

Excerpts from Paul Mackerras's message of May 4, 2021 5:36 pm:
> On Tue, May 04, 2021 at 03:26:24PM +1000, Nicholas Piggin wrote:
>> Excerpts from Paul Mackerras's message of May 4, 2021 2:28 pm:
>> > On Sat, May 01, 2021 at 11:58:36AM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Fabiano Rosas's message of April 16, 2021 9:09 am:
>> >> > As one of the arguments of the H_ENTER_NESTED hypercall, the nested
>> >> > hypervisor (L1) prepares a structure containing the values of various
>> >> > hypervisor-privileged registers with which it wants the nested guest
>> >> > (L2) to run. Since the nested HV runs in supervisor mode it needs the
>> >> > host to write to these registers.
>> >> > 
>> >> > To stop a nested HV manipulating this mechanism and using a nested
>> >> > guest as a proxy to access a facility that has been made unavailable
>> >> > to it, we have a routine that sanitises the values of the HV registers
>> >> > before copying them into the nested guest's vcpu struct.
>> >> > 
>> >> > However, when coming out of the guest the values are copied as they
>> >> > were back into L1 memory, which means that any sanitisation we did
>> >> > during guest entry will be exposed to L1 after H_ENTER_NESTED returns.
>> >> > 
>> >> > This patch alters this sanitisation to have effect on the vcpu->arch
>> >> > registers directly before entering and after exiting the guest,
>> >> > leaving the structure that is copied back into L1 unchanged (except
>> >> > when we really want L1 to access the value, e.g the Cause bits of
>> >> > HFSCR).
>> >> > 
>> >> > Signed-off-by: Fabiano Rosas <farosas at linux.ibm.com>
>> >> > ---
>> >> >  arch/powerpc/kvm/book3s_hv_nested.c | 55 ++++++++++++++++++-----------
>> >> >  1 file changed, 34 insertions(+), 21 deletions(-)
>> >> > 
>> >> > diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
>> >> > index 0cd0e7aad588..270552dd42c5 100644
>> >> > --- a/arch/powerpc/kvm/book3s_hv_nested.c
>> >> > +++ b/arch/powerpc/kvm/book3s_hv_nested.c
>> >> > @@ -102,8 +102,17 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
>> >> >  {
>> >> >  	struct kvmppc_vcore *vc = vcpu->arch.vcore;
>> >> >  
>> >> > +	/*
>> >> > +	 * When loading the hypervisor-privileged registers to run L2,
>> >> > +	 * we might have used bits from L1 state to restrict what the
>> >> > +	 * L2 state is allowed to be. Since L1 is not allowed to read
>> >> > +	 * the HV registers, do not include these modifications in the
>> >> > +	 * return state.
>> >> > +	 */
>> >> > +	hr->hfscr = ((~HFSCR_INTR_CAUSE & hr->hfscr) |
>> >> > +		     (HFSCR_INTR_CAUSE & vcpu->arch.hfscr));
>> >> > +
>> >> >  	hr->dpdes = vc->dpdes;
>> >> > -	hr->hfscr = vcpu->arch.hfscr;
>> >> >  	hr->purr = vcpu->arch.purr;
>> >> >  	hr->spurr = vcpu->arch.spurr;
>> >> >  	hr->ic = vcpu->arch.ic;
>> >> 
>> >> Do we still have the problem here that hfac interrupts due to bits cleared
>> >> by the hfscr sanitisation would have the cause bits returned to the L1,
>> >> so in theory it could probe hfscr directly that way? I don't see a good
>> >> solution to this except either have the L0 intercept these faults and do
>> >> "something" transparent, or return error from H_ENTER_NESTED (which would
>> >> also allow trivial probing of the facilities).
>> > 
>> > It seems to me that there are various specific reasons why L0 would
>> > clear HFSCR bits, and if we think about the specific reasons, what we
>> > should do becomes clear.  (I say "L0" but in fact the same reasoning
>> > applies to any hypervisor that lets its guest do hypervisor-ish
>> > things.)
>> > 
>> > 1. Emulating a version of the architecture which doesn't have the
>> > feature in question - in that case the bit should appear to L1 as a
>> > reserved bit in HFSCR (i.e. always read 0), the associated facility
>> > code should never appear in the top 8 bits of any HFSCR value that L1
>> > sees, and any HFU interrupt received by L0 for the facility should be
>> > changed into an illegal instruction interrupt (or HEAI) forwarded to
>> > L1.  In this case the real HFSCR should always have the enable bit for
>> > the facility set to 0.
>> > 
>> > 2. Lazy save/restore of the state associated with a facility - in this
>> > case, while the system is in the "lazy" state (i.e. the state is not
>> > that of the currently running guest), the real HFSCR bit for the
>> > facility should be 0.  On an HFU interrupt for the facility, L0 looks
>> > at L1's HFSCR value: if it's 0, forward the HFU interrupt to L1; if
>> > it's 1, load up the facility state, set the facility's bit in HFSCR,
>> > and resume the guest.
>> > 
>> > 3. Emulating a facility in software - in this case, the real HFSCR
>> > bit for the facility would always be 0.  On an HFU interrupt, L0 reads
>> > the instruction and emulates it, then resumes the guest.
>> > 
>> > One thing this all makes clear is that the IC field of the "virtual"
>> > HFSCR value seen by L1 should only ever be changed when L0 forwards a
>> > HFU interrupt to L1.
>> > 
>> > In fact we currently never do (1) or (2), and we only do (3) for
>> > msgsndp etc., so this discussion is mostly theoretical.
>> 
>> Yeah it's somewhat theoretical, and I guess I mostly agree with you.
>> 
>> Missing is the case where the L0 does not implement a feature at all.
>> Let's say TM is broken so it disables it, or nobody uses TAR so it 
>> doesn't bother to switch it.
> 
> I think that's the same as my case (1), where L0 is presenting an
> architecture to L1 that doesn't implement the feature.  (Unless you
> mean that L0 is running on a machine that has features that didn't
> exist at the time that L0's code was written; to handle that, L0
> should clear all HFSCR bits that it doesn't know about, and map any
> unknown HFSCR[IC] code into illegal interrupt/HEAI.)

I don't think it's quite either of those. There is no ISA v3.0 where the
target address register facility does not exist, and the L0 is not going
to emulate it, for argument's sake.

>> In those cases what do you tell the L1 if it enables a bit that you
>> don't support at all, and it takes a fault?
> 
> Illegal instruction interrupt, or HEAI.

Okay, so that's just a roundabout way to tell the nested HV "this 
feature doesn't exist", asynchronously at some point when its guest
tries to use the facility. It's contrary to the architecture it
expected (maybe wrongly because it didn't query the right capability)
but it set HFSCR for the facility so it thought it should work.

>> I guess the right thing to do is advertise that to the guest by some
>> other means, and expect it does the right thing. And you could have
>> the proviso in the nested HV specification that the returned IC field
>> might trip for a feature you enabled in the L1 HFSCR.
> 
> I think that is confusing; better to return illegal instruction/HEAI.

I think both are more confusing than returning error from the hcall.

> 
>> > 
>> >> Returning an hfac interrupt to a hypervisor that thought it enabled the 
>> >> bit would be strange. But so does appearing to modify the register 
>> >> underneath it and then returning a fault.
>> > 
>> > I don't think we should ever do either of those things.  The closest
>> > would be (1) above, but in that case the fault has to be either an
>> > illegal instruction type program interrupt, or a HEAI.
>> > 
>> >> I think the sanest thing would actually be to return failure from the 
>> >> hcall.
>> > 
>> > I don't think we should do that either.
>> 
>> I still think it's preferable for case 4. No point waiting for the
>> guest to boot and some user program eventually hits a bad instruction,
>> even if it was due to some host vs guest configuration problem.
> 
> If you let it boot to userspace, the administrator has a better chance
> of recovering the situation.

Maybe. What about if it runs userspace until something executes
pthread_mutex_lock

An error message when you try to start the nested guest telling you
pass -machine cap-htm=off would be better... I guess that should
really all work with caps etc today though so TM's a bad example.
But assume we don't have a cap for the bit we disable? Maybe we
should have caps for all HFSCR bits, or I'm just worried about
something not very important.

Thanks,
Nick