[RFC PATCH v1] powerpc/prom_init: disable XIVE in Secure VM.

Mon Mar 2 18:34:51 AEDT 2020

On 2/29/20 11:51 PM, Ram Pai wrote:
> On Sat, Feb 29, 2020 at 09:27:54AM +0100, Cédric Le Goater wrote:
>> On 2/29/20 8:54 AM, Ram Pai wrote:
>>> XIVE is not correctly enabled for Secure VM in the KVM Hypervisor yet.
>>>
>>> Hence a Secure VM must always default to the XICS interrupt controller.
>>
>> Have you tried XIVE emulation with 'kernel-irqchip=off'?
> 
> Yes, and it hangs. I think that option continues to enable some variant
> of XIVE in the VM.

HW is not involved, KVM is not involved anymore, and everything is emulated
at the QEMU level in user space. What is the issue?

> There are some known deficiencies between KVM
> and the ultravisor negotiation, resulting in a hang in the SVM.

That is something else to investigate. Feature/capability negotiation
is the core of the hypervisor stack:

    OPAL <-> PowerNV <-> KVM <-> QEMU <-> guest OS

>>> If XIVE is requested through kernel command line option "xive=on",
>>> override and turn it off.
>>
>> This is incorrect. It is negotiated through CAS depending on the FW
>> capabilities and the KVM capabilities.
> 
> Yes, I understand. qemu/KVM have predetermined a set of capabilities that
> they can offer to the VM.  The kernel within the VM has a list of
> capabilities it needs to operate correctly.  So both negotiate and
> determine something mutually agreeable.
> 
> Here I am talking about the list of capabilities that the kernel
> determines it needs to operate correctly.  "xive=on" is one of those
> capabilities the kernel is told by the VM administrator to enable.

XIVE is not a kernel capability. It's platform support and the default
for P9 is the native exploitation mode which makes full use of the P9
interrupt controller. For non XIVE aware kernels, the hypervisor emulates
the legacy interface on top of XIVE. 

"xive=off" was introduced for distro testing. It skips the negotiation 
process of the XIVE native exploitation mode on the guest. But it's not
a negotiation setting. It's a chicken switch.
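
To make the distinction concrete, here is a minimal sketch in C. It is not
the prom_init code: guest_wants_xive() is a hypothetical helper, and the
only assumption is that the switch shows up as a plain substring of the
kernel command line. The switch only decides whether the guest asks for
the native exploitation mode at all. What the platform can offer is not
consulted here, which is why it is not a negotiation setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: decide whether the guest
 * should ask for the XIVE native exploitation mode during CAS.
 */
bool guest_wants_xive(const char *cmdline)
{
	assert(cmdline != NULL);	/* callers always pass a string */

	if (strstr(cmdline, "xive=off"))
		return false;	/* chicken switch: do not even ask */

	return true;		/* default: ask, and let CAS decide */
}
```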

> Unfortunately, if the VM administrator blindly requests to enable it, the
> kernel must override it if it knows that it will be switching the VM into
> a SVM soon. There is no point negotiating a capability with Qemu through
> CAS if the kernel knows it cannot handle that capability.

I don't understand. Are you talking about SVM or XIVE ? 

>>> If XIVE is the only supported platform interrupt controller; specified
>>> through qemu option "ic-mode=xive", simply abort. Otherwise default to
>>> XICS.
>>
>>
>> I don't think it is a good approach to downgrade the guest kernel 
>> capabilities this way. 
>>
>> PAPR has specified the CAS negotiation process for this purpose. It 
>> comes in two parts under KVM. First the KVM hypervisor advertises or 
>> not a capability to QEMU. The second is the CAS negotiation process 
>> between QEMU and the guest OS.
> 
> Unfortunately, this is not viable.  At the time the hypervisor
> advertises its capabilities to qemu, the hypervisor has no idea whether
> that VM will switch into a SVM or not. 

OK, but the hypervisor knows if it can handle 'SVM' guests or not and,
if not, there is no point in advertising a 'SVM' capability to the guest. 

> The decision to switch into a SVM is taken by the kernel running in the
> VM. This happens much later, after the hypervisor has already conveyed
> its capabilities to qemu, and qemu has then instantiated the VM.

So you don't have negotiation with the hypervisor? How does the guest
know the hypervisor platform can handle SVMs? Try and see if it fails?
If so, it seems quite broken to me.

> As a result, CAS in prom_init is the only place where this negotiation
> can take place.

Euh. I don't follow. This is indeed where CAS is performed and so it's 
*the* place to check that the hypervisor has 'SVM' support ? 

>> The SVM specifications might not be complete yet and if some features 
>> are incompatible, I think we should modify the capabilities advertised 
>> by the hypervisor : no XIVE in case of SVM. QEMU will automatically 
>> use the fallback path and emulate the XIVE device, same as setting 
>> 'kernel-irqchip=off'. 
> 
> As mentioned above, this would be an excellent approach, if the
> Hypervisor was aware of the VM's intent to switch into a SVM. Neither
> the hypervisor knows, nor the qemu.  Only the kernel running within the
> VM knows about it.

The hypervisor (KVM/QEMU) never knows what the guest OS capabilities
or intents are. That is why there is a negotiation process.

I would do :

 * OPAL FW advertises 'SVM' support to the Linux PowerNV (through DT) 
 * KVM advertises 'SVM' support to QEMU (extend KVM ioctls)
 * QEMU advertises 'SVM' support to guest OS (through CAS or DT) 
 * Guest OS should not try to use SVM if it is not supported.

If the passthrough of HW pages is not supported by the Ultravisor, KVM
should not advertise XIVE to QEMU, which would then use fallback mode.

If emulated XIVE or XICS is not supported by SVM guests, then we have
a problem and we need to understand why ! :) 

And if XIVE is still a problem, then the guest could change the CAS 
request and remove XIVE when SVM is being set. I suppose that we have 
all this information before CAS. Do we ? 
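
If the information is available, the guest side could look roughly like
the sketch below. It is an illustration under stated assumptions: the
CAS_REQ_* bits and build_cas_irq_request() are made up for this example,
and the real CAS option vectors are encoded differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, not the real option vector encoding. */
#define CAS_REQ_XICS	(1u << 0)
#define CAS_REQ_XIVE	(1u << 1)

/*
 * Build the interrupt controller part of the CAS request at runtime.
 * XIVE is dropped when the chicken switch is set or when the guest
 * already knows it is going to become a SVM.
 */
uint32_t build_cas_irq_request(bool xive_disabled, bool will_be_svm)
{
	uint32_t req = CAS_REQ_XICS;	/* legacy interface stays on offer */

	if (!xive_disabled && !will_be_svm)
		req |= CAS_REQ_XIVE;

	assert(req & CAS_REQ_XICS);
	return req;
}
```

A guest that is not going to become a SVM keeps requesting XIVE as
before, so the other configurations are not impacted.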

It should be a runtime choice taking into account the full software
stack rather than a compile-time choice at the bottom which would impact
all other options. This is not acceptable IMHO.

Cheers,

C.