[Skiboot] [PATCH skiboot v3] npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default

Stewart Smith stewart at linux.ibm.com
Thu May 2 18:16:41 AEST 2019

Alexey Kardashevskiy <aik at ozlabs.ru> writes:
> V100 GPUs are known to violate the NVLink2 protocol in some cases (one is
> when memory was accessed by the CPU and then by the GPU using a so-called
> block linear mapping) and issue double probes to the NPU, which can cope
> with this problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable
> Probe.I.MO snarfing a cp_m") is not set in the CQ_SM Misc Config
> register #0. If the bit is set (which is the case today), the NPU issues
> a machine check stop.
> The snarfing feature is designed to detect 2 probes in flight and combine
> them into one.
> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
> stop from happening.
> This disables snarfing by default as otherwise a broken GPU driver can
> crash the entire box even when a GPU is passed through to a guest.
> This provides a dial to allow regression tests (might be useful on
> bare metal). To enable snarfing, the user needs to run:
> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable
> and reboot the host system.
> While at it, define macros for register names as well to avoid touching
> the same lines over and over again.
> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
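
For illustration only, the default-off behaviour described in the commit
message could be sketched in C roughly as below. This is not the merged
code: SNARF_CPM_BIT, snarf_cpm_requested() and apply_snarf_cpm() are
hypothetical names, the bit position is a placeholder, and the nvram value
is passed in as a plain string rather than read through skiboot's nvram
helpers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit position: the real CONFIG_ENABLE_SNARF_CPM mask
 * lives in the skiboot NPU2 register definitions, not here. */
#define SNARF_CPM_BIT (UINT64_C(1) << 0)

/* Snarfing is requested only when the nvram value is exactly "enable";
 * a missing variable (NULL) or any other value leaves it disabled. */
static bool snarf_cpm_requested(const char *nvram_value)
{
	return nvram_value != NULL && strcmp(nvram_value, "enable") == 0;
}

/* Return the Misc Config value with the snarf bit set or cleared,
 * leaving every other bit untouched. */
static uint64_t apply_snarf_cpm(uint64_t misc_cfg0, const char *nvram_value)
{
	if (snarf_cpm_requested(nvram_value))
		return misc_cfg0 | SNARF_CPM_BIT;
	return misc_cfg0 & ~SNARF_CPM_BIT;
}
```

In this sketch the bit is cleared unless the variable is explicitly set to
"enable", which matches the stated intent that a host with no nvram
setting at all boots with snarfing disabled.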

Merged to master as of 0f492a92590850af6360bdcc93e2047b285d41c7.

I'm gathering this also needs to go to stable so that it makes its way
through to various releases?

Stewart Smith
OPAL Architect, IBM.
