[Skiboot] [PATCH skiboot v3] npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default
aik at ozlabs.ru
Fri May 3 12:36:44 AEST 2019
On 02/05/2019 18:16, Stewart Smith wrote:
> Alexey Kardashevskiy <aik at ozlabs.ru> writes:
>> V100 GPUs are known to violate NVLink2 protocol in some cases (one is when
>> memory was accessed by the CPU and then by the GPU using so-called block
>> linear mapping) and issue double probes to NPU which can cope with this
>> problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable Probe.I.MO
>> snarfing a cp_m") is not set in the CQ_SM Misc Config register #0.
>> If the bit is set (which is the case today), the NPU raises a machine
>> check stop.
>> The snarfing feature is designed to detect 2 probes in flight and combine
>> them into one.
>> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
>> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
>> stop from happening.
>> This disables snarfing by default as otherwise a broken GPU driver can
>> crash the entire box even when a GPU is passed through to a guest.
>> This provides a dial to allow regression tests (might be useful on
>> bare metal). To enable snarfing, the user needs to run:
>> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable
>> and reboot the host system.
>> While at it, define macros for register names as well to avoid touching
>> the same lines over and over again.
>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> Merged to master as of 0f492a92590850af6360bdcc93e2047b285d41c7.
> I'm gathering this also needs to go to stable so that it makes its way
> through to various releases?
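
For readers following the thread, the dial described in the quoted commit
message can be sketched roughly as below. This is only an illustration of the
policy (snarfing off unless the nvram variable is set to "enable"), not the
code that was merged: the function name apply_snarf_policy() and the bit
position SNARF_CPM_PLACEHOLDER_BIT are assumptions made up for this sketch, and
the real bit position in the CQ_SM Misc Config register #0 is not stated in
this message.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Placeholder only: stands in for CONFIG_ENABLE_SNARF_CPM. The real bit
 * position in CQ_SM Misc Config #0 is not given in this thread.
 */
#define SNARF_CPM_PLACEHOLDER_BIT (1ULL << 0)

/*
 * Hypothetical helper: given the current register value and the value of
 * the "opal-npu2-snarf-cpm" nvram variable (NULL when unset), return the
 * register value to write back. Snarfing stays disabled unless the
 * variable is exactly "enable".
 */
static uint64_t apply_snarf_policy(uint64_t reg, const char *nvram_val)
{
	bool enable = nvram_val != NULL && strcmp(nvram_val, "enable") == 0;

	if (enable)
		return reg | SNARF_CPM_PLACEHOLDER_BIT;

	/* Default: clear the bit so double probes do not checkstop the NPU. */
	return reg & ~SNARF_CPM_PLACEHOLDER_BIT;
}
```

In this sketch the caller would read the register for each NVLink, pass the
value through apply_snarf_policy(), and write the result back; any value other
than "enable" (including an unset variable) leaves snarfing off.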