[Skiboot] [PATCH skiboot v3] npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default

Stewart Smith stewart at linux.ibm.com
Thu May 2 18:16:41 AEST 2019


Alexey Kardashevskiy <aik at ozlabs.ru> writes:
> V100 GPUs are known to violate the NVLink2 protocol in some cases (one is
> when memory was accessed by the CPU and then by the GPU using so-called
> block linear mapping) and issue double probes to the NPU, which can cope
> with this problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable
> Probe.I.MO snarfing a cp_m") is not set in the CQ_SM Misc Config
> register #0. If the bit is set (which is the case today), the NPU issues
> a machine check stop.
>
> The snarfing feature is designed to detect 2 probes in flight and combine
> them into one.
>
> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
> stop from happening.
>
> This disables snarfing by default as otherwise a broken GPU driver can
> crash the entire box even when a GPU is passed through to a guest.
> This provides a dial to allow regression tests (might be useful on
> bare metal). To enable snarfing, the user needs to run:
>
> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable
>
> and reboot the host system.
>
> While at it, define macros for register names as well to avoid touching
> the same lines over and over again.
>
> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>

Merged to master as of 0f492a92590850af6360bdcc93e2047b285d41c7.

I'm gathering this also needs to go to stable so that it makes its way
through to various releases?
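
For anyone backporting, the default described in the commit message can be
sketched roughly as below. This is only an illustration, not the code from
the patch: the bit position EXAMPLE_SNARF_CPM_BIT and both helper names are
hypothetical, and treating every value other than "enable" as "disabled" is
an assumption about how the nvram variable is interpreted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical bit position, for illustration only. The real mask for
 * CONFIG_ENABLE_SNARF_CPM comes from the NPU2 register definitions.
 */
#define EXAMPLE_SNARF_CPM_BIT (1ULL << 0)

/*
 * Decide whether snarfing was requested. An unset variable (NULL) or any
 * value other than "enable" is assumed to mean "leave snarfing disabled",
 * matching the new default described in the commit message.
 */
static bool snarf_cpm_requested(const char *nvram_value)
{
	return nvram_value != NULL && strcmp(nvram_value, "enable") == 0;
}

/*
 * Return the Misc Config register #0 value with the snarf bit set or
 * cleared according to the nvram setting; all other bits are preserved.
 */
static uint64_t apply_snarf_cpm(uint64_t misc_cfg0, const char *nvram_value)
{
	if (snarf_cpm_requested(nvram_value))
		return misc_cfg0 | EXAMPLE_SNARF_CPM_BIT;
	return misc_cfg0 & ~EXAMPLE_SNARF_CPM_BIT;
}
```

The point of the sketch is that the bit is cleared unless the user has
explicitly opted in, so a misbehaving GPU driver cannot trigger the machine
check stop on a default configuration.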

-- 
Stewart Smith
OPAL Architect, IBM.


