[Skiboot] [PATCH skiboot] npu2: Allow disabling Probe.I.MO snarfing
Michael Neuling
mikey at neuling.org
Fri Apr 26 17:31:32 AEST 2019
> Re: [Skiboot] [PATCH skiboot] npu2: Allow disabling Probe.I.MO snarfing
"I.MO" ???
On Wed, 2019-04-24 at 17:21 +1000, Alexey Kardashevskiy wrote:
> V100 GPUs are known to violate the NVLink2 protocol in some cases (one is when
> memory was accessed by the CPU and then by the GPU using so-called block
> linear mapping) and issue double probes to the NPU, which can still handle this
> but only if CONFIG_ENABLE_SNARF_CPM is not set in the CQ_SM Misc Config
> register #0. If the bit is set (which is the case today), the NPU issues
> a machine check stop.
>
> The snarfing feature is designed to detect 2 probes in flight and combine
> them into one.
>
> This adds a new "opal-npu2-snarf-cpm" nvram variable which controls
> CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check
> stop from happening. By default snarfing is not disabled.
Change "not disabled" to "enabled".
> In order to
> disable it, the user has to run:
>
> sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=disable
I'm against adding this as an nvram option.
We need to make a decision as to which way to go and stick with that. Users
shouldn't be controlling this sort of thing unless in a very very very special
case. In that case, they can recompile skiboot.
If it's a platform decision, then make it per platform for the user.
If it's causing a checkstop, we need to disable it.
If we start adding nvram options for all these types of things, we are going to
end up with a billion different options that users will never know how to
set.
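
To make the per-platform suggestion concrete, here is a minimal standalone sketch. The struct and field names (platform_cfg, npu2_disable_snarf_cpm) and the helper misc_config0_mask() are hypothetical and are not existing skiboot definitions; only the two bit positions come from the patch. PPC_BIT is redefined locally using IBM bit numbering so the sketch compiles on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit word. */
#define PPC_BIT(bit)	(0x8000000000000000ULL >> (bit))

#define CONFIG_NVLINK_MODE	PPC_BIT(58)	/* bit position from the patch */
#define CONFIG_ENABLE_SNARF_CPM	PPC_BIT(40)	/* bit position from the patch */

/* Hypothetical per-platform knob; not an existing skiboot structure. */
struct platform_cfg {
	const char *name;
	bool npu2_disable_snarf_cpm;
};

/*
 * Mask for the Misc Config 0 write. NVLink mode is always covered; the
 * snarf bit is covered only when the platform wants it cleared.
 */
static uint64_t misc_config0_mask(const struct platform_cfg *p)
{
	uint64_t mask = CONFIG_NVLINK_MODE;

	if (p->npu2_disable_snarf_cpm)
		mask |= CONFIG_ENABLE_SNARF_CPM;
	return mask;
}
```

With something like this the decision lives next to the platform definition, and the user never has to know the option exists.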
> and reboot the host system.
>
> While at it, define macros for register names as well to avoid touching
> the same lines over and over again.
>
> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> ---
> include/npu2-regs.h | 14 ++++++++++++++
> hw/npu2.c | 45 ++++++++++++++++++++++++++++++++-------------
> 2 files changed, 46 insertions(+), 13 deletions(-)
>
> diff --git a/include/npu2-regs.h b/include/npu2-regs.h
> index ba10b8eaf88d..61e8ea8615f8 100644
> --- a/include/npu2-regs.h
> +++ b/include/npu2-regs.h
> @@ -791,4 +791,18 @@ void npu2_scom_write(uint64_t gcid, uint64_t scom_base,
> #define L3_PRD_PURGE_TTYPE_MASK PPC_BIT(1) | PPC_BIT(2) | PPC_BIT(3) | PPC_BIT(4)
> #define L3_FULL_PURGE 0x0
>
> +/* Config registers for NPU2 */
> +#define NPU_STCK0_CS_SM0_MISC_CONFIG0 0x5011000
> +#define NPU_STCK0_CS_SM1_MISC_CONFIG0 0x5011030
> +#define NPU_STCK0_CS_SM2_MISC_CONFIG0 0x5011060
> +#define NPU_STCK0_CS_SM3_MISC_CONFIG0 0x5011090
> +#define NPU_STCK1_CS_SM0_MISC_CONFIG0 0x5011200
> +#define NPU_STCK1_CS_SM1_MISC_CONFIG0 0x5011230
> +#define NPU_STCK1_CS_SM2_MISC_CONFIG0 0x5011260
> +#define NPU_STCK1_CS_SM3_MISC_CONFIG0 0x5011290
> +#define NPU_STCK2_CS_SM0_MISC_CONFIG0 0x5011400
> +#define NPU_STCK2_CS_SM1_MISC_CONFIG0 0x5011430
> +#define NPU_STCK2_CS_SM2_MISC_CONFIG0 0x5011460
> +#define NPU_STCK2_CS_SM3_MISC_CONFIG0 0x5011490
> +
> #endif /* __NPU2_REGS_H */
> diff --git a/hw/npu2.c b/hw/npu2.c
> index d532c4da3532..c7b7b071f3e0 100644
> --- a/hw/npu2.c
> +++ b/hw/npu2.c
> @@ -1452,7 +1452,7 @@ static void assign_mmio_bars(uint64_t gcid, uint32_t scom, uint64_t reg[2], uint
> int npu2_nvlink_init_npu(struct npu2 *npu)
> {
> struct dt_node *np;
> - uint64_t reg[2], mm_win[2], val;
> + uint64_t reg[2], mm_win[2], val, mask;
>
> /* TODO: Clean this up with register names, etc. when we get
> * time. This just turns NVLink mode on in each brick and should
> @@ -1461,18 +1461,37 @@ int npu2_nvlink_init_npu(struct npu2 *npu)
> *
> * Obviously if the year is now 2020 that didn't happen and you
> * should fix this :-) */
> - xscom_write_mask(npu->chip_id, 0x5011000, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011030, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011060, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011090, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011200, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011230, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011260, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011290, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011400, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011430, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011460, PPC_BIT(58), PPC_BIT(58));
> - xscom_write_mask(npu->chip_id, 0x5011490, PPC_BIT(58), PPC_BIT(58));
> +
> + val = PPC_BIT(58);
> + mask = PPC_BIT(58); /* CONFIG_NVLINK_MODE */
> +
> + if (nvram_query_eq("opal-npu2-snarf-cpm", "disable"))
> + mask |= PPC_BIT(40); /* CONFIG_ENABLE_SNARF_CPM */
Need a big print here so we can debug this in the field.
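
Roughly what that could look like, as a standalone sketch rather than a drop-in change. Assumptions: snarf_disabled stands in for the result of the nvram_query_eq() check, fprintf() stands in for skiboot's logging (in tree this would presumably be a prlog() call at a warning level), and apply_write_mask() models the read-modify-write that xscom_write_mask() is assumed to perform. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* IBM bit numbering: bit 0 is the most significant bit. */
#define PPC_BIT(bit)	(0x8000000000000000ULL >> (bit))

/*
 * Model of a masked register write: only the bits selected by mask take
 * their value from val, all other bits keep their old value.
 */
static uint64_t apply_write_mask(uint64_t old, uint64_t val, uint64_t mask)
{
	return (old & ~mask) | (val & mask);
}

/* Build the write mask and log loudly when the snarf bit will be cleared. */
static uint64_t build_mask(bool snarf_disabled, unsigned int chip_id)
{
	uint64_t mask = PPC_BIT(58);	/* CONFIG_NVLINK_MODE */

	if (snarf_disabled) {
		mask |= PPC_BIT(40);	/* CONFIG_ENABLE_SNARF_CPM */
		fprintf(stderr,
			"NPU2: chip %u: CONFIG_ENABLE_SNARF_CPM cleared "
			"(opal-npu2-snarf-cpm=disable)\n", chip_id);
	}
	return mask;
}
```

Because val only has bit 58 set, adding bit 40 to the mask is what clears CONFIG_ENABLE_SNARF_CPM, which is easy to miss when reading the code and is another reason to log it.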
> +
> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM0_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM1_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM2_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK0_CS_SM3_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM0_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM1_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM2_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK1_CS_SM3_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM0_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM1_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM2_MISC_CONFIG0,
> + val, mask);
> + xscom_write_mask(npu->chip_id, NPU_STCK2_CS_SM3_MISC_CONFIG0,
> + val, mask);
>
> xscom_write_mask(npu->chip_id, 0x50110c0, PPC_BIT(53), PPC_BIT(53));
> xscom_write_mask(npu->chip_id, 0x50112c0, PPC_BIT(53), PPC_BIT(53));
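
Since the patch already names the twelve Misc Config 0 registers, the repeated writes could be driven from a table. The following is a sketch only: the register addresses are the ones from the patch, while npu2_misc_config0_regs, npu2_write_misc_config0(), the write_mask_fn callback type and print_write() are hypothetical names used so the example builds outside skiboot. In tree, the callback would simply be xscom_write_mask().

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Register addresses as defined in the patch. */
#define NPU_STCK0_CS_SM0_MISC_CONFIG0	0x5011000
#define NPU_STCK0_CS_SM1_MISC_CONFIG0	0x5011030
#define NPU_STCK0_CS_SM2_MISC_CONFIG0	0x5011060
#define NPU_STCK0_CS_SM3_MISC_CONFIG0	0x5011090
#define NPU_STCK1_CS_SM0_MISC_CONFIG0	0x5011200
#define NPU_STCK1_CS_SM1_MISC_CONFIG0	0x5011230
#define NPU_STCK1_CS_SM2_MISC_CONFIG0	0x5011260
#define NPU_STCK1_CS_SM3_MISC_CONFIG0	0x5011290
#define NPU_STCK2_CS_SM0_MISC_CONFIG0	0x5011400
#define NPU_STCK2_CS_SM1_MISC_CONFIG0	0x5011430
#define NPU_STCK2_CS_SM2_MISC_CONFIG0	0x5011460
#define NPU_STCK2_CS_SM3_MISC_CONFIG0	0x5011490

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

static const uint64_t npu2_misc_config0_regs[] = {
	NPU_STCK0_CS_SM0_MISC_CONFIG0, NPU_STCK0_CS_SM1_MISC_CONFIG0,
	NPU_STCK0_CS_SM2_MISC_CONFIG0, NPU_STCK0_CS_SM3_MISC_CONFIG0,
	NPU_STCK1_CS_SM0_MISC_CONFIG0, NPU_STCK1_CS_SM1_MISC_CONFIG0,
	NPU_STCK1_CS_SM2_MISC_CONFIG0, NPU_STCK1_CS_SM3_MISC_CONFIG0,
	NPU_STCK2_CS_SM0_MISC_CONFIG0, NPU_STCK2_CS_SM1_MISC_CONFIG0,
	NPU_STCK2_CS_SM2_MISC_CONFIG0, NPU_STCK2_CS_SM3_MISC_CONFIG0,
};

/* Hypothetical callback type standing in for xscom_write_mask(). */
typedef int (*write_mask_fn)(uint32_t chip_id, uint64_t addr,
			     uint64_t val, uint64_t mask);

/* Apply the same masked write to every register; stop at the first error. */
static int npu2_write_misc_config0(uint32_t chip_id, uint64_t val,
				   uint64_t mask, write_mask_fn wr)
{
	size_t i;
	int rc;

	for (i = 0; i < ARRAY_SIZE(npu2_misc_config0_regs); i++) {
		rc = wr(chip_id, npu2_misc_config0_regs[i], val, mask);
		if (rc)
			return rc;
	}
	return 0;
}

/* Example callback for trying the loop on a host: prints each write. */
static int print_write(uint32_t chip_id, uint64_t addr,
		       uint64_t val, uint64_t mask)
{
	printf("chip %" PRIu32 ": addr 0x%" PRIx64 " val 0x%" PRIx64
	       " mask 0x%" PRIx64 "\n", chip_id, addr, val, mask);
	return 0;
}
```

Note one behavioural difference from the patch: the sketch stops on the first failed write, whereas the patch ignores the return values. Whether that is desirable here is a separate question.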