[Skiboot] [PATCH skiboot] npu2: Allow disabling Probe.I.MO snarfing
Alexey Kardashevskiy
aik at ozlabs.ru
Fri Apr 26 18:18:25 AEST 2019
On 26/04/2019 17:31, Michael Neuling wrote:
>> Re: [Skiboot] [PATCH skiboot] npu2: Allow disabling Probe.I.MO snarfing
>
> "I.MO" ???
>
>
> On Wed, 2019-04-24 at 17:21 +1000, Alexey Kardashevskiy wrote:
>> V100 GPUs are known to violate the NVLink2 protocol in some cases (one is when
>> memory was accessed by the CPU and then by the GPU using so-called block
>> linear mapping) and issue double probes to the NPU, which can still handle this
>> but only if CONFIG_ENABLE_SNARF_CPM is not set in the CQ_SM Misc Config
>> register #0. If the bit is set (which is the case today), the NPU raises
>> a machine check stop.
>>
>> The snarfing feature is designed to detect 2 probes in flight and combine
>> them into one.
>>
>> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
>> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
>> stop from happening. By default snarfing is not disabled.
>
> change "not disabled" to "enabled".
>
>> In order to
>> disable it, the user has to run:
>>
>> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=disable
>
> I'm against adding this as an nvram option.
I'd be happy to know what the other option is besides disabling it
unconditionally.
> We need to make a decision as to which way to go and stick with that. Users
> shouldn't be controlling this sort of thing unless in a very very very special
> case. In that case, they can recompile skiboot.
Very often reflashing is the problem.
> If it's a platform decision, then make it per platform for the user.
>
> If it's causing a checkstop, we need to disable it.
It causes a checkstop only with a broken GPU driver, so it does not
have to be disabled on bare metal, where snarfing might actually have
some performance benefits.
> If we start adding nvram options for all these types of things we are going to
> end up with a billion different options that users will never know what to set
> to.
>> and reboot the host system.
>>
>> While at it, define macros for the register names as well to avoid touching
>> the same lines over and over again.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>> ---
>> include/npu2-regs.h | 14 ++++++++++++++
>> hw/npu2.c | 45 ++++++++++++++++++++++++++++++++-------------
>> 2 files changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/npu2-regs.h b/include/npu2-regs.h
>> index ba10b8eaf88d..61e8ea8615f8 100644
>> --- a/include/npu2-regs.h
>> +++ b/include/npu2-regs.h
>> @@ -791,4 +791,18 @@ void npu2_scom_write(uint64_t gcid, uint64_t scom_base,
>> #define L3_PRD_PURGE_TTYPE_MASK PPC_BIT(1) | PPC_BIT(2) | PPC_BIT(3) | PPC_BIT(4)
>> #define L3_FULL_PURGE 0x0
>>
>> +/* Config registers for NPU2 */
>> +#define NPU_STCK0_CS_SM0_MISC_CONFIG0 0x5011000
>> +#define NPU_STCK0_CS_SM1_MISC_CONFIG0 0x5011030
>> +#define NPU_STCK0_CS_SM2_MISC_CONFIG0 0x5011060
>> +#define NPU_STCK0_CS_SM3_MISC_CONFIG0 0x5011090
>> +#define NPU_STCK1_CS_SM0_MISC_CONFIG0 0x5011200
>> +#define NPU_STCK1_CS_SM1_MISC_CONFIG0 0x5011230
>> +#define NPU_STCK1_CS_SM2_MISC_CONFIG0 0x5011260
>> +#define NPU_STCK1_CS_SM3_MISC_CONFIG0 0x5011290
>> +#define NPU_STCK2_CS_SM0_MISC_CONFIG0 0x5011400
>> +#define NPU_STCK2_CS_SM1_MISC_CONFIG0 0x5011430
>> +#define NPU_STCK2_CS_SM2_MISC_CONFIG0 0x5011460
>> +#define NPU_STCK2_CS_SM3_MISC_CONFIG0 0x5011490
>> +
>> #endif /* __NPU2_REGS_H */
>> diff --git a/hw/npu2.c b/hw/npu2.c
>> index d532c4da3532..c7b7b071f3e0 100644
>> --- a/hw/npu2.c
>> +++ b/hw/npu2.c
>> @@ -1452,7 +1452,7 @@ static void assign_mmio_bars(uint64_t gcid, uint32_t
>> scom, uint64_t reg[2], uint
>> int npu2_nvlink_init_npu(struct npu2 *npu)
>> {
>> struct dt_node *np;
>> - uint64_t reg[2], mm_win[2], val;
>> + uint64_t reg[2], mm_win[2], val, mask;
>>
>> /* TODO: Clean this up with register names, etc. when we get
>> * time. This just turns NVLink mode on in each brick and should
>> @@ -1461,18 +1461,37 @@ int npu2_nvlink_init_npu(struct npu2 *npu)
>> *
>> * Obviously if the year is now 2020 that didn't happen and you
>> * should fix this :-) */
>> - xscom_write_mask(npu->chip_id, 0x5011000, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011030, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011060, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011090, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011200, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011230, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011260, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011290, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011400, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011430, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011460, PPC_BIT(58), PPC_BIT(58));
>> - xscom_write_mask(npu->chip_id, 0x5011490, PPC_BIT(58), PPC_BIT(58));
>> +
>> + val = PPC_BIT(58);
>> + mask = PPC_BIT(58); /* CONFIG_NVLINK_MODE */
>> +
>> + if (nvram_query_eq("opal-npu2-snarf-cpm", "disable"))
>> + mask |= PPC_BIT(40); /* CONFIG_ENABLE_SNARF_CPM */
>
> Need a big print here so we can debug this in the field.
>
>
>
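To make the val/mask logic and the requested print concrete, here is a
minimal standalone sketch, not the skiboot code itself. It assumes that a
masked write replaces only the bits selected by the mask, which is how I
understand xscom_write_mask() to behave. masked_update() and
misc_config0_mask() are hypothetical names used for illustration only, and
the message text is just a placeholder for whatever skiboot's logging
would print.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit word. */
#define PPC_BIT(bit)		(0x8000000000000000ULL >> (bit))

#define CONFIG_ENABLE_SNARF_CPM	PPC_BIT(40)
#define CONFIG_NVLINK_MODE	PPC_BIT(58)

/*
 * Hypothetical model of a masked register update (assumption): bits
 * outside the mask keep their old value, bits inside the mask are
 * taken from val.
 */
static uint64_t masked_update(uint64_t old, uint64_t val, uint64_t mask)
{
	uint64_t new_val = (old & ~mask) | (val & mask);

	/* Anything not covered by the mask must be left untouched. */
	assert((new_val & ~mask) == (old & ~mask));
	return new_val;
}

/*
 * Hypothetical helper: build the mask for MISC_CONFIG0 and complain
 * loudly when snarfing is being switched off, so it shows up in logs.
 */
static uint64_t misc_config0_mask(int snarf_disabled)
{
	uint64_t mask = CONFIG_NVLINK_MODE;

	if (snarf_disabled) {
		fprintf(stderr,
			"NPU2: Probe.I.MO snarfing DISABLED by nvram option\n");
		/* val has bit 40 clear, so including it in the mask clears it. */
		mask |= CONFIG_ENABLE_SNARF_CPM;
	}
	return mask;
}
```

With val = PPC_BIT(58), a register that has bit 40 set keeps it when the
helper is called with 0 and loses it when called with 1. Bit 58 ends up set
in both cases.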
>> +
>> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM0_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM1_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM2_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM3_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM0_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM1_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM2_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM3_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM0_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM1_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM2_MISC_CONFIG0,
>> + val, mask);
>> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM3_MISC_CONFIG0,
>> + val, mask);
>>
>> xscom_write_mask(npu->chip_id, 0x50110c0, PPC_BIT(53), PPC_BIT(53));
>> xscom_write_mask(npu->chip_id, 0x50112c0, PPC_BIT(53), PPC_BIT(53));
>
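The twelve MISC_CONFIG0 addresses in the patch follow a regular layout:
three stacks 0x200 apart, each with four SMs 0x30 apart. The sketch below
shows how the repeated writes could be expressed as a loop over that
layout. It is only an illustration under that assumption. The stride
macros and both function names are hypothetical and do not exist in
skiboot, and the strides are inferred from the listed values rather than
taken from a hardware specification.

```c
#include <assert.h>
#include <stdint.h>

/* Values inferred from the NPU_STCKn_CS_SMm_MISC_CONFIG0 macros (assumption). */
#define MISC_CONFIG0_BASE	0x5011000ULL
#define MISC_CONFIG0_STCK_STEP	0x200ULL	/* distance between stacks */
#define MISC_CONFIG0_SM_STEP	0x30ULL		/* distance between SMs */
#define NPU2_NR_STACKS		3
#define NPU2_NR_SMS		4

/* Hypothetical helper: SCOM address of MISC_CONFIG0 for a given stack/SM. */
static uint64_t misc_config0_addr(unsigned int stack, unsigned int sm)
{
	assert(stack < NPU2_NR_STACKS && sm < NPU2_NR_SMS);
	return MISC_CONFIG0_BASE + stack * MISC_CONFIG0_STCK_STEP +
		sm * MISC_CONFIG0_SM_STEP;
}

/*
 * Hypothetical helper: fill out[] with all twelve addresses in the same
 * order as the writes in the patch and return how many were stored.
 */
static unsigned int misc_config0_addrs(uint64_t out[NPU2_NR_STACKS * NPU2_NR_SMS])
{
	unsigned int stack, sm, n = 0;

	for (stack = 0; stack < NPU2_NR_STACKS; stack++)
		for (sm = 0; sm < NPU2_NR_SMS; sm++)
			out[n++] = misc_config0_addr(stack, sm);
	return n;
}
```

A caller would iterate over the returned array and issue one masked write
per address with the same val and mask. The explicit macros in the patch
are easier to grep for, so which form is preferable is a matter of taste.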
--
Alexey