[Skiboot] [PATCH skiboot] npu2: Allow disabling Probe.I.MO snarfing

Alexey Kardashevskiy aik at ozlabs.ru
Fri Apr 26 18:18:25 AEST 2019



On 26/04/2019 17:31, Michael Neuling wrote:
>> Re: [Skiboot] [PATCH skiboot] npu2: Allow disabling Probe.I.MO snarfing
> 
> "I.MO" ???
> 
> 
> On Wed, 2019-04-24 at 17:21 +1000, Alexey Kardashevskiy wrote:
>> V100 GPUs are known to violate the NVLink2 protocol in some cases (one is when
>> memory was accessed by the CPU and then by the GPU using so-called block
>> linear mapping) and issue double probes to the NPU, which can still handle this
>> but only if CONFIG_ENABLE_SNARF_CPM is not set in the CQ_SM Misc Config
>> register #0. If the bit is set (which is the case today), the NPU raises
>> a machine check stop.
>>
>> The snarfing feature is designed to detect 2 probes in flight and combine
>> them into one.
>>
>> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
>> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
>> stop from happening. By default snarfing is not disabled. 
> 
> change "not disabled" to "enabled".
> 
>> In order to
>> disable it, the user has to run:
>>
>> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=disable
> 
> I'm against adding this as an nvram option.

I'd be happy to know what the other option is besides disabling it
unconditionally.

> We need to make a decision as to which way to go and stick with that. Users
> shouldn't be controlling this sort of thing unless in a very very very special
> case. In that case, they can recompile skiboot.

Very often reflashing is the problem.

> If it's a platform decision, then make it per platform for the user.
> 
> If it's causing a checkstop, we need to disable it. 


It causes a checkstop only with a broken GPU driver, so snarfing does not
have to be disabled on bare metal, and it might actually have some
performance benefits.


> If we start adding nvram options for all these types of things we are going to
> end up with a billion different options that users will never know what to set
> to. 
>> and reboot the host system.
>>
>> While at it, define macros for register names as well to avoid touching
>> the same lines over and over again.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>> ---
>>  include/npu2-regs.h | 14 ++++++++++++++
>>  hw/npu2.c           | 45 ++++++++++++++++++++++++++++++++-------------
>>  2 files changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/npu2-regs.h b/include/npu2-regs.h
>> index ba10b8eaf88d..61e8ea8615f8 100644
>> --- a/include/npu2-regs.h
>> +++ b/include/npu2-regs.h
>> @@ -791,4 +791,18 @@ void npu2_scom_write(uint64_t gcid, uint64_t scom_base,
>>  #define L3_PRD_PURGE_TTYPE_MASK 		PPC_BIT(1) | PPC_BIT(2) | PPC_BIT(3) | PPC_BIT(4)
>>  #define L3_FULL_PURGE				0x0
>>  
>> +/* Config registers for NPU2 */
>> +#define NPU_STCK0_CS_SM0_MISC_CONFIG0		0x5011000
>> +#define NPU_STCK0_CS_SM1_MISC_CONFIG0		0x5011030
>> +#define NPU_STCK0_CS_SM2_MISC_CONFIG0		0x5011060
>> +#define NPU_STCK0_CS_SM3_MISC_CONFIG0		0x5011090
>> +#define NPU_STCK1_CS_SM0_MISC_CONFIG0		0x5011200
>> +#define NPU_STCK1_CS_SM1_MISC_CONFIG0		0x5011230
>> +#define NPU_STCK1_CS_SM2_MISC_CONFIG0		0x5011260
>> +#define NPU_STCK1_CS_SM3_MISC_CONFIG0		0x5011290
>> +#define NPU_STCK2_CS_SM0_MISC_CONFIG0		0x5011400
>> +#define NPU_STCK2_CS_SM1_MISC_CONFIG0		0x5011430
>> +#define NPU_STCK2_CS_SM2_MISC_CONFIG0		0x5011460
>> +#define NPU_STCK2_CS_SM3_MISC_CONFIG0		0x5011490
>> +
>>  #endif /* __NPU2_REGS_H */
>> diff --git a/hw/npu2.c b/hw/npu2.c
>> index d532c4da3532..c7b7b071f3e0 100644
>> --- a/hw/npu2.c
>> +++ b/hw/npu2.c
>> @@ -1452,7 +1452,7 @@ static void assign_mmio_bars(uint64_t gcid, uint32_t scom, uint64_t reg[2], uint
>>  int npu2_nvlink_init_npu(struct npu2 *npu)
>>  {
>>  	struct dt_node *np;
>> -	uint64_t reg[2], mm_win[2], val;
>> +	uint64_t reg[2], mm_win[2], val, mask;
>>  
>>  	/* TODO: Clean this up with register names, etc. when we get
>>  	 * time. This just turns NVLink mode on in each brick and should
>> @@ -1461,18 +1461,37 @@ int npu2_nvlink_init_npu(struct npu2 *npu)
>>  	 *
>>  	 * Obviously if the year is now 2020 that didn't happen and you
>>  	 * should fix this :-) */
>> -	xscom_write_mask(npu->chip_id, 0x5011000, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011030, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011060, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011090, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011200, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011230, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011260, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011290, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011400, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011430, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011460, PPC_BIT(58), PPC_BIT(58));
>> -	xscom_write_mask(npu->chip_id, 0x5011490, PPC_BIT(58), PPC_BIT(58));
>> +
>> +	val = PPC_BIT(58);
>> +	mask = PPC_BIT(58); /* CONFIG_NVLINK_MODE */
>> +
>> +	if (nvram_query_eq("opal-npu2-snarf-cpm", "disable"))
>> +		mask |= PPC_BIT(40); /* CONFIG_ENABLE_SNARF_CPM */
> 
> Need a big print here so we can debug this in the field.
> 
> 
> 
>> +
>> +	xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM0_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM1_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM2_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM3_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM0_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM1_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM2_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM3_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM0_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM1_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM2_MISC_CONFIG0,
>> +			 val, mask);
>> +	xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM3_MISC_CONFIG0,
>> +			 val, mask);
>>  
>>  	xscom_write_mask(npu->chip_id, 0x50110c0, PPC_BIT(53), PPC_BIT(53));
>>  	xscom_write_mask(npu->chip_id, 0x50112c0, PPC_BIT(53), PPC_BIT(53));
> 
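
For reference, here is a minimal standalone sketch of what the val/mask pair
in the patch does to each Misc Config 0 register. It assumes that
xscom_write_mask() performs a read-modify-write that only touches the bits
selected by the mask. The helper names (misc_config0_addr, apply_write_mask,
misc_config0_update) are made up for illustration and are not part of skiboot,
and the address arithmetic is only inferred from the twelve constants added to
npu2-regs.h.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit word. */
#define PPC_BIT(bit)			(0x8000000000000000ULL >> (bit))

#define CONFIG_ENABLE_SNARF_CPM		PPC_BIT(40)
#define CONFIG_NVLINK_MODE		PPC_BIT(58)

/*
 * Hypothetical helper: SCOM address of "CS SM<sm> Misc Config 0" in a stack.
 * The 0x200 stack stride and 0x30 SM stride are inferred from the constants
 * in the patch (0x5011000 .. 0x5011490), not taken from a hardware manual.
 */
static uint64_t misc_config0_addr(unsigned int stack, unsigned int sm)
{
	assert(stack < 3 && sm < 4);
	return 0x5011000ULL + stack * 0x200ULL + sm * 0x30ULL;
}

/*
 * Assumed model of a masked write: bits outside the mask keep their old
 * value, bits inside the mask are taken from val.
 */
static uint64_t apply_write_mask(uint64_t old, uint64_t val, uint64_t mask)
{
	return (old & ~mask) | (val & mask);
}

/*
 * Mirrors the val/mask setup from the patch. val only ever contains
 * CONFIG_NVLINK_MODE, so adding CONFIG_ENABLE_SNARF_CPM to the mask
 * clears that bit, while leaving it out of the mask preserves it.
 */
static uint64_t misc_config0_update(uint64_t old, int snarf_disabled)
{
	uint64_t val = CONFIG_NVLINK_MODE;
	uint64_t mask = CONFIG_NVLINK_MODE;

	if (snarf_disabled)
		mask |= CONFIG_ENABLE_SNARF_CPM;

	return apply_write_mask(old, val, mask);
}
```

With snarf_disabled set, a register that had CONFIG_ENABLE_SNARF_CPM set ends
up with that bit cleared and CONFIG_NVLINK_MODE set; without it, the snarf bit
keeps whatever value it had before.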

-- 
Alexey

