[PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error
Daniel Henrique Barboza
danielhb413 at gmail.com
Sat Apr 17 07:02:14 AEST 2021
At this moment, PAPR [1] does not have a way to report errors during a device
removal operation. This puts a strain in the hypervisor, which needs extra
mechanisms to try to fallback and recover from an error that might have
happened during the removal. The QEMU community has dealt with it during these
years by either trying to preempt the error before sending the HP event or, in
case of a guest side failure, reboot the guest to complete the removal process.
This started to change with QEMU commit fe1831eff8a4 ("spapr_drc.c: use DRC
reconfiguration to cleanup DIMM unplug state"), where a way to fallback from a
memory removal error was introduced. In this case, when QEMU detects that the
kernel is reconfiguring LMBs DRCs that were marked as pending removal, the
entire process is reverted from the QEMU side as well. Around the same time,
other discussions in the QEMU mailing discussed an alternative for other device
as well.
In [2] the idea of using RTAS set-indicator for this role was first introduced.
The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already
UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else for both QEMU
and phyp. This gives us an opportunity to use this behavior to signal the
hypervisor layer when a device removal happens, allowing it to do a proper
error handling knowing for sure that the removal failed in the kernel. Using
set-indicator to report HP errors isn't strange to PAPR, as per R1-13.5.3.4-4.
of table 13.7 of [1]:
"For all DR options: If this is a DR operation that involves the user insert-
ing a DR entity, then if the firmware can determine that the inserted entity
would cause a system disturbance, then the set-indicator RTAS call must not
unisolate the entity and must return an error status which is unique to the
particular error."
PAPR does not make any restrictions or considerations about setting an already
Unisolated/Configured DRC to 'unisolate', meaning we have a chance to use it
for this purpose - signal an OS side error when attempting to remove a DR
entity. To validate the design, this is being implemented only for CPUs.
QEMU will use this mechanism to rollback the device removal (hotunplug) state,
allowing for a better error handling mechanism. A implementation of how QEMU
can do it is in [3]. When using a kernel with this series applied, together
with this QEMU build, this is what happens in a common CPU removal/hotunplug
error scenario (trying to remove the last online CPU):
( QEMU command line: qemu-system-ppc64 -machine pseries,accel=kvm,usb=off
-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 ... )
[root at localhost ~]# QEMU 5.2.92 monitor - type 'help' for more information
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
(qemu)
[root at localhost ~]# echo 0 > /sys/devices/system/cpu/cpu0/online
[ 77.548442][ T13] IRQ 19: no longer affine to CPU0
[ 77.548452][ T13] IRQ 20: no longer affine to CPU0
[ 77.548458][ T13] IRQ 256: no longer affine to CPU0
[ 77.548465][ T13] IRQ 258: no longer affine to CPU0
[ 77.548472][ T13] IRQ 259: no longer affine to CPU0
[ 77.548479][ T13] IRQ 260: no longer affine to CPU0
[ 77.548485][ T13] IRQ 261: no longer affine to CPU0
[ 77.548590][ T0] cpu 0 (hwid 0) Ready to die...
[root at localhost ~]# (qemu)
(qemu) device_del core1
(qemu) [ 83.214073][ T100] pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
qemu-system-ppc64: Device hotunplug rejected by the guest for device core1
(qemu)
As soon as the CPU removal fails in dlpar_cpu(), QEMU becames aware of
it and is able to do error recovery.
If this solution is well received, I'll push for an architecture change
request internally at IBM to make this mechanism PAPR official.
[1] https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
[2] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
[3] https://github.com/danielhb/qemu/tree/unisolate_drc_callback_v1
Daniel Henrique Barboza (2):
dlpar.c: introduce dlpar_unisolate_drc()
hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
arch/powerpc/platforms/pseries/dlpar.c | 14 ++++++++++++++
arch/powerpc/platforms/pseries/hotplug-cpu.c | 9 ++++++++-
arch/powerpc/platforms/pseries/pseries.h | 1 +
3 files changed, 23 insertions(+), 1 deletion(-)
--
2.30.2
More information about the Linuxppc-dev
mailing list