[Skiboot] [PATCH skiboot v3] npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default
aik at ozlabs.ru
Mon May 6 14:48:20 AEST 2019
On 03/05/2019 22:38, Brian J King wrote:
> My thinking here is that we keep this in OP930 and later releases only.
> I'd really like to avoid this getting pulled into a 910 or 920 service
> pack. Those releases don't support KVM and there is a driver fix for the
> bare metal case. So I think we can avoid pushing this to all the stable
Not having this fix means that a bad driver on bare metal can still
trigger an MCE (immediate reboot) instead of an HMI/EEH (which gives the
kernel a chance to react) and, let's say, corrupt filesystems. I'd pull.
> Brian King
> STSM, Linux on Power I/O Chief Architect
> Linux Technology Center
> ----- Original message -----
> From: Alexey Kardashevskiy <aik at ozlabs.ru>
> To: Stewart Smith <stewart at linux.ibm.com>, skiboot at lists.ozlabs.org
> Cc: "Leonardo Augusto Guimarães Garcia" <lagarcia at br.ibm.com>,
> Michael Neuling <mikey at neuling.org>, Reza Arbab
> <arbab at linux.ibm.com>, Brian J King <bjking1 at us.ibm.com>
> Subject: Re: [Skiboot] [PATCH skiboot v3] npu2: Disable
> Probe-to-Invalid-Return-Modified-or-Owned snarfing by default
> Date: Thu, May 2, 2019 9:36 PM
> On 02/05/2019 18:16, Stewart Smith wrote:
> > Alexey Kardashevskiy <aik at ozlabs.ru> writes:
> >> V100 GPUs are known to violate the NVLink2 protocol in some cases (one is
> >> when memory was accessed by the CPU and then by the GPU using so-called
> >> block linear mapping) and issue double probes to the NPU, which can cope
> >> with this problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable
> >> Probe.I.MO snarfing a cp_m") is not set in the CQ_SM Misc Config
> >> register #0. If the bit is set (which is the case today), the NPU issues
> >> a machine check stop.
> >> The snarfing feature is designed to detect 2 probes in flight and
> >> combine them into one.
> >> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
> >> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
> >> stop from happening.
> >> This disables snarfing by default as otherwise a broken GPU driver can
> >> crash the entire box even when a GPU is passed through to a guest.
> >> This provides a dial to allow regression tests (might be useful on
> >> bare metal). To enable snarfing, the user needs to run:
> >> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable
> >> and reboot the host system.
> >> While at it, define macros for register names as well to avoid
> >> touching the same lines over and over again.
> >> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> > Merged to master as of 0f492a92590850af6360bdcc93e2047b285d41c7.
> > I'm gathering this also needs to go to stable so that it makes its way
> > through to various releases?
> Yes, thanks!
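
For readers following the thread, the behaviour described in the commit message (snarfing off unless "opal-npu2-snarf-cpm" is set to "enable") can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the patch: the bit position EXAMPLE_ENABLE_SNARF_CPM and the helper names snarf_cpm_requested() and apply_snarf_cpm() are hypothetical, and how skiboot actually reads the nvram variable and writes the CQ_SM Misc Config register is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit position, for illustration only. The real
 * CONFIG_ENABLE_SNARF_CPM mask lives in skiboot's NPU2 register
 * definitions and is not reproduced here. */
#define EXAMPLE_ENABLE_SNARF_CPM (1ULL << 0)

/* Snarfing is requested only when the variable is present and equals
 * "enable"; a missing (NULL) or different value keeps the default,
 * which is disabled. */
static bool snarf_cpm_requested(const char *nvram_value)
{
	return nvram_value != NULL && strcmp(nvram_value, "enable") == 0;
}

/* Return the register value with the snarf bit set or cleared,
 * leaving every other bit untouched. */
static uint64_t apply_snarf_cpm(uint64_t misc_cfg, const char *nvram_value)
{
	if (snarf_cpm_requested(nvram_value))
		return misc_cfg | EXAMPLE_ENABLE_SNARF_CPM;
	return misc_cfg & ~EXAMPLE_ENABLE_SNARF_CPM;
}
```

With this sketch, apply_snarf_cpm(reg, NULL) clears the bit (the new default), while apply_snarf_cpm(reg, "enable") sets it, matching the opt-in described above.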