[PATCH kernel v2] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2
Alexey Kardashevskiy
aik at ozlabs.ru
Fri Oct 19 12:20:52 AEDT 2018
On 18/10/2018 12:05, Alistair Popple wrote:
> Hi Alexey,
>
>>> wouldn't you also need to do that somewhere? Unless the driver
>>> does it at startup?
>>
>> VFIO performs GPU reset so I'd expect the GPUs to flush their caches
>> without any software interaction. Am I hoping for too much here?
>
> Sadly you are. It's not the GPU caches that need flushing, it's the CPU caches.
> This needs to happen as part of the reset sequence, so I guess you would need
> to add it to the VFIO driver.
Well, OK. The caches need flushing and I will look into this, but this
fencing is still needed, isn't it?
>
> - Alistair
>>>>>>>>> On Monday, 15 October 2018 6:17:51 PM AEDT Alexey Kardashevskiy wrote:
>>>>>>>>>> Ping?
>>>>>>>>>>
>>>>>>>>>> On 02/10/2018 13:20, Alexey Kardashevskiy wrote:
>>>>>>>>>>> The skiboot firmware has a hot reset handler which fences the NVIDIA V100
>>>>>>>>>>> GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
>>>>>>>>>>> https://github.com/open-power/skiboot/commit/fca2b2b839a67
>>>>>>>>>>>
>>>>>>>>>>> Now we are going to pass V100 via VFIO which most certainly involves
>>>>>>>>>>> KVM guests which are often terminated without getting a chance to offline
>>>>>>>>>>> GPU RAM so we end up with a running machine with misconfigured memory.
>>>>>>>>>>> Accessing this memory produces hardware management interrupts (HMI)
>>>>>>>>>>> which bring the host down.
>>>>>>>>>>>
>>>>>>>>>>> To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable()
>>>>>>>>>>> via pci_disable_device() which switches NPU2 to a safe mode and prevents
>>>>>>>>>>> HMIs.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>> Changes:
>>>>>>>>>>> v2:
>>>>>>>>>>> * updated the commit log
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++++++++++
>>>>>>>>>>> 1 file changed, 10 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>>>>>>> index cde7102..e37b9cc 100644
>>>>>>>>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>>>>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>>>>>>> @@ -3688,6 +3688,15 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
>>>>>>>>>>>  	pnv_ioda_release_pe(pe);
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +static void pnv_npu_disable_device(struct pci_dev *pdev)
>>>>>>>>>>> +{
>>>>>>>>>>> +	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
>>>>>>>>>>> +	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
>>>>>>>>>>> +
>>>>>>>>>>> +	if (eehpe && eeh_ops && eeh_ops->reset)
>>>>>>>>>>> +		eeh_ops->reset(eehpe, EEH_RESET_HOT);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
>>>>>>>>>>>  {
>>>>>>>>>>>  	struct pnv_phb *phb = hose->private_data;
>>>>>>>>>>> @@ -3732,6 +3741,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
>>>>>>>>>>>  	.reset_secondary_bus = pnv_pci_reset_secondary_bus,
>>>>>>>>>>>  	.dma_set_mask = pnv_npu_dma_set_mask,
>>>>>>>>>>>  	.shutdown = pnv_pci_ioda_shutdown,
>>>>>>>>>>> +	.disable_device = pnv_npu_disable_device,
>>>>>>>>>>>  };
>>>>>>>>>>>
>>>>>>>>>>>  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
>
>
--
Alexey
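
For readers following the quoted patch, the guarded dispatch in
pnv_npu_disable_device() can be illustrated outside the kernel. What follows
is only a userspace sketch: every type and name in it (struct pe, struct dev,
struct reset_ops, ops, fencing_ops, disable_device) is a hypothetical
stand-in invented for illustration, not the kernel's eeh_dev, eeh_pe or
eeh_ops, and the "fencing" is reduced to setting a flag. It shows only the
shape of the check: request a hot reset when the device has a PE, a backend
is registered, and that backend provides a reset hook; otherwise do nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; NOT the real definitions. */
enum reset_kind { RESET_DEACTIVATE = 0, RESET_HOT = 1 };

struct pe {
	int fenced;		/* set once a hot reset has fenced the PE */
};

struct dev {
	struct pe *pe;		/* may be NULL if no PE is attached */
};

struct reset_ops {
	int (*reset)(struct pe *pe, enum reset_kind kind);
};

/* Global ops table; stays NULL until a backend registers itself. */
struct reset_ops *ops;

/* Example backend hook: a hot reset marks the PE as fenced. */
int fence_on_hot_reset(struct pe *pe, enum reset_kind kind)
{
	assert(pe != NULL);
	if (kind == RESET_HOT)
		pe->fenced = 1;
	return 0;
}

struct reset_ops fencing_ops = { .reset = fence_on_hot_reset };

/*
 * Mirrors the shape of the check in the patch: only call the hook when
 * the PE, the ops table and the reset callback are all present.
 * Returns 1 if a hot reset was requested, 0 if the call was skipped.
 */
int disable_device(struct dev *d)
{
	struct pe *pe = d ? d->pe : NULL;

	if (pe && ops && ops->reset) {
		ops->reset(pe, RESET_HOT);
		return 1;
	}
	return 0;
}
```

With ops left NULL, disable_device() is a no-op. After setting
ops = &fencing_ops, a device with a PE gets its fenced flag set, and a device
without a PE is still skipped.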