[Skiboot] [PATCH skiboot v3] npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by default

Alexey Kardashevskiy aik at ozlabs.ru
Mon May 6 14:48:20 AEST 2019



On 03/05/2019 22:38, Brian J King wrote:
> My thinking here is that we keep this in OP930 and later releases only.
> I'd really like to avoid this getting pulled into a 910 or 920 service
> pack. Those releases don't support KVM and there is a driver fix for the
> bare metal case. So I think we can avoid pushing this to all the stable
> releases.


Without this fix, a bad driver on bare metal can still trigger an MCE
(immediate reboot) instead of an HMI/EEH (which gives the kernel a chance
to react) and, let's say, corrupt filesystems. I'd pull it.



>  
> Thanks,
>  
> Brian
> 
> Brian King
> STSM, Linux on Power I/O Chief Architect
> Linux Technology Center
>  
>  
> 
>     ----- Original message -----
>     From: Alexey Kardashevskiy <aik at ozlabs.ru>
>     To: Stewart Smith <stewart at linux.ibm.com>, skiboot at lists.ozlabs.org
>     Cc: "Leonardo Augusto Guimarães Garcia" <lagarcia at br.ibm.com>,
>     Michael Neuling <mikey at neuling.org>, Reza Arbab
>     <arbab at linux.ibm.com>, Brian J King <bjking1 at us.ibm.com>
>     Subject: Re: [Skiboot] [PATCH skiboot v3] npu2: Disable
>     Probe-to-Invalid-Return-Modified-or-Owned snarfing by default
>     Date: Thu, May 2, 2019 9:36 PM
>      
> 
>     On 02/05/2019 18:16, Stewart Smith wrote:
>     > Alexey Kardashevskiy <aik at ozlabs.ru> writes:
>     >> V100 GPUs are known to violate the NVLink2 protocol in some cases (one
>     >> is when memory was accessed by the CPU and then by the GPU using the
>     >> so-called block linear mapping) and issue double probes to the NPU,
>     >> which can cope with this problem only if CONFIG_ENABLE_SNARF_CPM
>     >> ("disable/enable Probe.I.MO snarfing a cp_m") is not set in the CQ_SM
>     >> Misc Config register #0. If the bit is set (which is the case today),
>     >> the NPU issues a machine check stop.
>     >>
>     >> The snarfing feature is designed to detect 2 probes in flight and
>     >> combine them into one.
>     >>
>     >> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
>     >> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
>     >> stop from happening.
>     >>
>     >> This disables snarfing by default, as otherwise a broken GPU driver
>     >> can crash the entire box even when a GPU is passed through to a guest.
>     >> This provides a dial to allow regression tests (might be useful on
>     >> bare metal). To enable snarfing, the user needs to run:
>     >>
>     >> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable
>     >>
>     >> and reboot the host system.
>     >>
>     >> While at it, define macros for register names as well to avoid
>     >> touching the same lines over and over again.
>     >>
>     >> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>     >
>     > Merged to master as of 0f492a92590850af6360bdcc93e2047b285d41c7.
>     >
>     > I'm gathering this also needs to go to stable so that it makes its way
>     > through to various releases?
> 
> 
>     Yes, thanks!
> 
> 
>     --
>     Alexey
>      
> 
>  
> 
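
For readers following the thread, the effect of the new nvram variable
described in the quoted commit message can be sketched roughly as below.
This is a toy Python model, not the skiboot C code: the bit position
SNARF_CPM_BIT and both function names are made up for illustration. The
only behaviour taken from the patch description is that snarfing is off by
default and is turned on with opal-npu2-snarf-cpm=enable; treating every
other value as "off" is an assumption.

```python
# Hypothetical bit position for CONFIG_ENABLE_SNARF_CPM; the real value
# lives in skiboot's NPU2 register definitions and is not given in this thread.
SNARF_CPM_BIT = 1 << 3


def snarf_requested(nvram_value):
    """Return True only when the nvram variable is exactly "enable".

    A missing variable (None) or any other string keeps the new default,
    which is snarfing disabled. Handling of other values is an assumption.
    """
    return nvram_value == "enable"


def apply_snarf_setting(misc_cfg0, nvram_value):
    """Read-modify-write of a modelled CQ_SM Misc Config #0 value.

    Only the snarf bit is changed; all other bits are preserved.
    """
    if snarf_requested(nvram_value):
        return misc_cfg0 | SNARF_CPM_BIT
    return misc_cfg0 & ~SNARF_CPM_BIT
```

In this model the same function would be applied to each NVLink's register
value, since the variable controls the bit for all links at once.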

-- 
Alexey

