[RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

Alexander Graf agraf at suse.de
Wed Jan 11 10:06:39 EST 2012


On 10.01.2012, at 23:03, Scott Wood wrote:

> On 01/09/2012 09:11 PM, Alexander Graf wrote:
>> On 10.01.2012, at 01:51, Scott Wood wrote:
>>> On 01/09/2012 11:46 AM, Alexander Graf wrote:
>>>> On 21.12.2011, at 02:34, Scott Wood wrote:
>>>>> +		/* For debugging, encode the failing instruction and
>>>>> +		 * report it to userspace. */
>>>>> +		run->hw.hardware_exit_reason = ~0ULL << 32;
>>>>> +		run->hw.hardware_exit_reason |= vcpu->arch.last_inst;
>>>> 
>>>> 
>>>> I'm fairly sure you want to fix this :)
>>> 
>>> Likewise, that's what booke.c already does.  What should it do instead?
>> 
>> This is what book3s does:
>> 
>>                case EMULATE_FAIL:
>>                        printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
>>                               __func__, kvmppc_get_pc(vcpu), kvmppc_get_last_inst(vcpu));
>>                        kvmppc_core_queue_program(vcpu, flags);
>>                        r = RESUME_GUEST;
>> 
>> which also doesn't throttle the printk, but I think injecting a
>> program fault into the guest is the most sensible thing to do if we
>> don't know what the instruction is supposed to do. Best case we get
>> an oops inside the guest telling us what broke :).
> 
> Ah, yes, it should send a program check.
> 
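For booke that could look much like the book3s snippet above -- a sketch only,
the exact ESR flags and resume value are whatever the real patch decides on:

	case EMULATE_FAIL:
		printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
		       __func__, vcpu->arch.pc, vcpu->arch.last_inst);
		/* Unknown instruction: hand the guest a program check
		 * instead of reporting it to userspace. */
		kvmppc_core_queue_program(vcpu, ESR_PIL);
		r = RESUME_GUEST;
		break;
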
>>>> Ah, so that's what you want to use regs for. So is a pt_regs struct
>>>> that only contains useful register values in half its fields really
>>>> useful here? Or could we keep control of the registers ourselves,
>>>> leaving us the option to optimize things further one day?
>>> 
>>> I think it contains enough to be useful for debugging code such as sysrq
>>> and tracers, and as noted in the comment we could copy the rest if we
>>> care enough.  MSR might be worth copying.
>>> 
>>> It will eventually be used for machine checks as well, which I'd like to
>>> hand reasonable register state to, at least for GPRs, LR, and PC.
>>> 
>>> If there's a good enough performance reason, we could just copy
>>> everything over for machine checks and pass NULL to do_IRQ (I think it
>>> can take this -- a dummy regs struct if not), but it seems premature at
>>> the moment unless the switch already causes measured performance loss
>>> (cache utilization?).
>> 
>> I'm definitely not concerned about performance, but about complexity and uniqueness.
>> 
>> With the pt_regs struct, we have a bunch of fields in the vcpu that are there, but unused. I find that situation pretty confusing.
> 
> I removed from the vcpu the registers that are to be used in regs instead.
> 
> There are a few fields in regs that are not valid, though that is
> pointed out explicitly in a comment.

Yes, and if there were a real technical reason to do it this way I'd agree. But there isn't.

> 
>> So yes, I would definitely prefer to copy registers during MC and keep the registers where they are today - unless there are SPRs for them of course.
>> 
>> Imagine we'd one day want to share GPRs with user space through the
>> kvm_run structure (see the s390 patches on the ML for this). I really
>> wouldn't want to make pt_regs part of our userspace ABI.
> 
> Neither would I.  If that's something that's reasonably likely to
> happen, I guess that's a good enough reason to avoid this.  We could
> always add a debug option later to copy regs even on normal interrupts,
> if needed.

Yup. Basically, I don't want us to walk in the wrong direction. The overhead of
copying a couple of fields to the stack on machine checks doesn't sound too bad
compared to the flexibility we maintain by keeping the fields under our control.

Another hypothetical case: back in the day I experimented with putting the GPRs
into the PACA. I don't remember the details anymore, but it was meant to speed
something up.

That wouldn't be possible if we mandated that everyone use pt_regs.
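
To illustrate, the machine check path could just build a throwaway pt_regs on
the stack -- a sketch with a made-up helper name, filling in only the fields we
care about:

	static void kvmppc_fill_regs_for_mc(struct kvm_vcpu *vcpu,
					    struct pt_regs *regs)
	{
		int i;

		/* Anything we don't copy explicitly stays zero. */
		memset(regs, 0, sizeof(*regs));
		for (i = 0; i < 32; i++)
			regs->gpr[i] = kvmppc_get_gpr(vcpu, i);
		regs->nip = vcpu->arch.pc;
		regs->link = vcpu->arch.lr;
		/* MSR, CTR, etc. could be copied too if the handler
		 * turns out to need them. */
	}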

> 
>>> We probably should defer the check until after we've disabled
>>> interrupts, similar to signals -- even if we didn't exit for an
>>> interrupt, we could have received one after enabling them.
>> 
>> Yup. I just don't think you can call resched() with interrupts disabled, so a bit of cleverness is probably required here.
> 
> I think it is actually allowed, but interrupts will be enabled on
> return.  We'll need to repeat prepare_to_enter if we do schedule.  Since
> we already need special handling for that, we might as well add a
> local_irq_enable() once we know we are going to schedule, just in case.

Yup :). And then check again.
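
Roughly like this, I guess (sketch only):

	local_irq_disable();
	while (need_resched()) {
		/* Enable interrupts once we know we are going to
		 * schedule... */
		local_irq_enable();
		cond_resched();
		/* ...then redo the check with interrupts off again
		 * before entering the guest. */
		local_irq_disable();
	}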

> 
>>>>> diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
>>>>> index 05d1d99..d53bcf2 100644
>>>>> --- a/arch/powerpc/kvm/booke.h
>>>>> +++ b/arch/powerpc/kvm/booke.h
>>>>> @@ -48,7 +48,20 @@
>>>>> #define BOOKE_IRQPRIO_PERFORMANCE_MONITOR 19
>>>>> /* Internal pseudo-irqprio for level triggered externals */
>>>>> #define BOOKE_IRQPRIO_EXTERNAL_LEVEL 20
>>>>> -#define BOOKE_IRQPRIO_MAX 20
>>>>> +#define BOOKE_IRQPRIO_DBELL 21
>>>>> +#define BOOKE_IRQPRIO_DBELL_CRIT 22
>>>>> +#define BOOKE_IRQPRIO_MAX 23
>>>> 
>>>> So was MAX wrong before or is it too big now?
>>> 
>>> MAX is just a marker for how many IRQPRIOs we have, not any sort of
>>> external limit.  This patch adds new IRQPRIOs, so MAX goes up.
>>> 
>>> The actual limit is the number of bits in a long.
>> 
> 
>> Yes, and before, the highest value was 20 with MAX being 20; now the
>> highest value is 22 with MAX being 23. MAX should be either the highest
>> number or the highest number + 1, but either way you're changing the
>> semantics of MAX here. Maybe it was wrong before, I don't know, hence
>> I'm asking :).
> 
> Oh, didn't notice that.
> 
> Actually, it looks like the two places that reference BOOKE_IRQPRIO_MAX
> don't agree on what they're expecting.  book3s uses "one greater than
> the highest irqprio", so I guess we should resolve it that way (even
> though I'd normally expect that to be phrased "num" rather than "max")
> -- as a separate patch, of course.

Yup. As long as it's consistent, it's fine. I just stumbled over this because the semantics of the define changed.
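
For reference, with the book3s convention the defines and their typical use as
a loop bound would look like this (sketch; the hard limit is still BITS_PER_LONG
because the priorities live in a pending bitmap):

	#define BOOKE_IRQPRIO_DBELL       21
	#define BOOKE_IRQPRIO_DBELL_CRIT  22
	/* One greater than the highest irqprio, usable as a loop bound */
	#define BOOKE_IRQPRIO_MAX         23

	for (priority = 0; priority < BOOKE_IRQPRIO_MAX; priority++)
		if (test_bit(priority, &vcpu->arch.pending_exceptions))
			kvmppc_booke_irqprio_deliver(vcpu, priority);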


Alex

