[PATCH v3 19/41] KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

Nicholas Piggin npiggin at gmail.com
Tue Mar 23 05:22:31 AEDT 2021


Excerpts from Cédric Le Goater's message of March 23, 2021 2:01 am:
> On 3/22/21 2:15 PM, Nicholas Piggin wrote:
>> Excerpts from Alexey Kardashevskiy's message of March 22, 2021 5:30 pm:
>>>
>>>
>>> On 06/03/2021 02:06, Nicholas Piggin wrote:
>>>> In the interest of minimising the amount of code that is run in>>> "real-mode", don't handle hcalls in real mode in the P9 path.
>>>>
>>>> POWER8 and earlier are much more expensive to exit from HV real mode
>>>> and switch to host mode, because on those processors HV interrupts get
>>>> to the hypervisor with the MMU off, and the other threads in the core
>>>> need to be pulled out of the guest, and SLBs all need to be saved,
>>>> ERATs invalidated, and host SLB reloaded before the MMU is re-enabled
>>>> in host mode. Hash guests also require a lot of hcalls to run. The
>>>> XICS interrupt controller requires hcalls to run.
>>>>
>>>> By contrast, POWER9 has independent thread switching, and in radix mode
>>>> the hypervisor is already in a host virtual memory mode when the HV
>>>> interrupt is taken. Radix + xive guests don't need hcalls to handle
>>>> interrupts or manage translations.
> 
> Do we need to handle the host-is-a-P9-without-xive case ?

I'm not sure really. Is there an intention for OPAL to be able to 
provide a fallback layer in the worst case? Maybe microwatt grows
HV capability before XIVE?

> 
>>>> So it's much less important to handle hcalls in real mode in P9.
>>>
>>> So acde25726bc6034b (which added if(kvm_is_radix(vcpu->kvm))return 
>>> H_TOO_HARD) can be reverted, pretty much?
>> 
>> Yes. Although that calls attention to the fact I missed doing
>> a P9 h_random handler in this patch. I'll fix that, then I think
>> acde2572 could be reverted entirely.
>> 
>> [...]
>> 
>>>>   	} else {
>>>>   		kvmppc_xive_push_vcpu(vcpu);
>>>>   		trap = kvmhv_load_hv_regs_and_go(vcpu, time_limit, lpcr);
>>>> -		kvmppc_xive_pull_vcpu(vcpu);
>>>> +		/* H_CEDE has to be handled now, not later */
>>>> +		/* XICS hcalls must be handled before xive is pulled */
>>>> +		if (trap == BOOK3S_INTERRUPT_SYSCALL &&
>>>> +		    !(vcpu->arch.shregs.msr & MSR_PR)) {
>>>> +			unsigned long req = kvmppc_get_gpr(vcpu, 3);
>>>>   
>>>> +			if (req == H_CEDE) {
>>>> +				kvmppc_cede(vcpu);
>>>> +				kvmppc_xive_cede_vcpu(vcpu); /* may un-cede */
>>>> +				kvmppc_set_gpr(vcpu, 3, 0);
>>>> +				trap = 0;
>>>> +			}
>>>> +			if (req == H_EOI || req == H_CPPR ||
>>>
>>> else if (req == H_EOI ... ?
>> 
>> Hummm, sure.
> 
> you could integrate the H_CEDE in the switch statement below.

Below is in a different file just for the emulation calls.

>> 
>> [...]
>> 
>>>> +void kvmppc_xive_cede_vcpu(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	void __iomem *esc_vaddr = (void __iomem *)vcpu->arch.xive_esc_vaddr;
>>>> +
>>>> +	if (!esc_vaddr)
>>>> +		return;
>>>> +
>>>> +	/* we are using XIVE with single escalation */
>>>> +
>>>> +	if (vcpu->arch.xive_esc_on) {
>>>> +		/*
>>>> +		 * If we still have a pending escalation, abort the cede,
>>>> +		 * and we must set PQ to 10 rather than 00 so that we don't
>>>> +		 * potentially end up with two entries for the escalation
>>>> +		 * interrupt in the XIVE interrupt queue.  In that case
>>>> +		 * we also don't want to set xive_esc_on to 1 here in
>>>> +		 * case we race with xive_esc_irq().
>>>> +		 */
>>>> +		vcpu->arch.ceded = 0;
>>>> +		/*
>>>> +		 * The escalation interrupts are special as we don't EOI them.
>>>> +		 * There is no need to use the load-after-store ordering offset
>>>> +		 * to set PQ to 10 as we won't use StoreEOI.
>>>> +		 */
>>>> +		__raw_readq(esc_vaddr + XIVE_ESB_SET_PQ_10);
>>>> +	} else {
>>>> +		vcpu->arch.xive_esc_on = true;
>>>> +		mb();
>>>> +		__raw_readq(esc_vaddr + XIVE_ESB_SET_PQ_00);
>>>> +	}
>>>> +	mb();
>>>
>>>
>>> Uff. Thanks for cut-n-pasting the comments, helped a lot to match this c 
>>> to that asm!
>> 
>> Glad it helped.
>>>> +}
> 
> I had to do the PowerNV models in QEMU to start understanding that stuff ... 
> 
>>>> +EXPORT_SYMBOL_GPL(kvmppc_xive_cede_vcpu);
>>>> +
>>>>   /*
>>>>    * This is a simple trigger for a generic XIVE IRQ. This must
>>>>    * only be called for interrupts that support a trigger page
>>>> @@ -2106,6 +2140,32 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +int kvmppc_xive_xics_hcall(struct kvm_vcpu *vcpu, u32 req)
>>>> +{
>>>> +	struct kvmppc_vcore *vc = vcpu->arch.vcore;
>>>
>>>
>>> Can a XIVE enabled guest issue these hcalls? Don't we want if 
>>> (!kvmppc_xics_enabled(vcpu)) and
>>>   if (xics_on_xive()) here, as kvmppc_rm_h_xirr() have? Some of these 
>>> hcalls do write to XIVE registers but some seem to change 
>>> kvmppc_xive_vcpu. Thanks,
>> 
>> Yes I think you're right, good catch. I'm not completely sure about all 
>> the xive and xics modes but a guest certainly can make any kind of hcall 
>> it likes and we have to sanity check it.
> 
> Yes. 
> 
>> We want to take the hcall here (in replacement of the real mode hcalls)
>> with the same condition. So it would be:
>> 
>>         if (!kvmppc_xics_enabled(vcpu))
>>                 return H_TOO_HARD;
> 
> Yes.
> 
> This test covers the case in which a vCPU does XICS hcalls without QEMU 
> having connected the vCPU to a XICS ICP. The ICP is the KVM XICS device 
> on P8 or XICS-on-XIVE on P9. It catches QEMU errors when the interrupt 
> mode is negotiated, we don't want the OS to do XICS hcalls after having 
> negotiated the XIVE interrupt mode. 

Okay.

> It's different for the XIVE hcalls (when running under XICS) because they 
> are all handled in QEMU. 

XIVE guest hcalls running on XICS host?

>>         if (!xics_on_xive())
>> 		return H_TOO_HARD;
> 
> I understand that this code is only called on P9 and with translation on.

Yes.

> On P9, we could have xics_on_xive() == 0 if XIVE is disabled at compile 
> time or with "xive=off" at boot time. But guests should be supported. 
> I don't see a reason to restrict the support even if these scenarios 
> are rather unusual if not very rare.
> 
> on P10, it's the same but since we don't have the XICS emulation layer 
> in OPAL, the host will be pretty useless. We don't care.
> 
> Since we are trying to handle hcalls, this is L0 and it can not be called 
> for nested guests, which would be another case of xics_on_xive() == 0. 
> We don't care either.

Okay so no xics_on_xive() test. I'll change that.

Thanks,
Nick


More information about the Linuxppc-dev mailing list