[Skiboot] [PATCH 1/7] hw/npu2: Fix OpenCAPI PE assignment

Wed Mar 27 02:58:19 AEDT 2019


Le 26/03/2019 à 00:20, Oliver a écrit :
> On Thu, Mar 21, 2019 at 5:36 AM Frederic Barrat <fbarrat at linux.ibm.com> wrote:
>>
>> From: Andrew Donnellan <andrew.donnellan at au1.ibm.com>
>>
>> When we support mixing NVLink and OpenCAPI devices on the same NPU, we're
>> going to have to share the same range of 16 PE numbers between NVLink and
>> OpenCAPI PHBs.
>>
>> For OpenCAPI devices, PE assignment is only significant for determining
>> which System Interrupt Log register is used for a particular brick - unlike
>> NVLink, it doesn't play any role in determining how links are fenced.
>>
>> Split the PE range into a lower half which is used for NVLink, and an upper
>> half that is used for OpenCAPI, with a fixed PE number assigned per brick.
>>
>> As the PE assignment for OpenCAPI devices is fixed, set the PE once
>> during device init and then ignore calls to the set_pe() operation.
> 
> I don't understand why this is necessary or a good idea. The kernel
> determining PE assignments is one of those things that is pretty
> fundamentally baked into OPAL so why go against the grain here? If the
> PE assignment only changes where errors are logged then I don't see
> why deleting 9 lines of code justifies adding weird behaviour. Is
> there some other goal here?
> 
> It's also worth pointing out that patch 5 in this series re-introduces
> tracking of the kernel's PE assignments which fixes my main gripe, but
> still, why?


For linux, one NPU can generate multiple PHBs, one for all the nv links 
+ 1 per opencapi link. Realistically, we can only mix nvlink and 
opencapi on witherpoon and we only support 1 opencapi card on 
witherspoon, so it's max 2 PHBs per NPU. Each PHB defines a 0-based 
range of PE numbers, so since the 2 PHBs are completely independent for 
linux, at the minimum, we would need to partition the space of NPU PEs 
(there are 16) so that linux doesn't use the same numbers for 2 
different devices.

On top of that, for opencapi, the full concept of PE from PCI doesn't 
really mean anything. That's one of the area where you can see that 
opencapi forced its way in the existing NPU design. For opencapi, the PE 
is just an index in an array of registers, where some link error 
information can be saved. So by design, we must have one PE per link 
(and if one day we support dual link opencapi cards, we will have the 
same PE number for 2 links). The NPU register to configure is called 
"BDF 2 PE map", but the bdf part is ignored for opencapi. We're really 
mapping a PE number to a brick, it's not something the OS can choose.
So I think we have 2 choices:
- let linux drive the PE number through the set_pe() callback, and have 
skiboot enforce the above. If set_pe() is called multiple times for an 
opencapi PHB, they would need to all pass the same pe_num. And we'd 
still need to worry about the NPU PE space partitioning to avoid 
conflict with nvlink.
- or, since the NPU concept of PE for opencapi is not the same as linux, 
we can stop pretending and let skiboot handling the NPU PE space, and 
keep linux on a different space, that skiboot can almost ignore (except, 
as you noticed for that later patch, where we still need to remember the 
pe number linux used so that we can give it back). This patch tries to 
do this. I think it keeps things simpler.

   Fred


>> Suggested-by: Frederic Barrat <fbarrat at linux.ibm.com>
>> Signed-off-by: Andrew Donnellan <andrew.donnellan at au1.ibm.com>
>> ---
>>   hw/npu2-opencapi.c | 74 +++++++++++++++++++++-------------------------
>>   include/npu2.h     | 21 +++++++++++--
>>   2 files changed, 52 insertions(+), 43 deletions(-)
>>
>> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
>> index 090223e2..905314cc 100644
>> --- a/hw/npu2-opencapi.c
>> +++ b/hw/npu2-opencapi.c
>> @@ -584,6 +584,20 @@ static void brick_config(uint32_t gcid, uint32_t scom_base, int index)
>>          enable_pb_snooping(gcid, scom_base, index);
>>   }
>>
>> +/* Procedure 13.1.3.4 - Brick to PE Mapping */
>> +static void pe_config(struct npu2_dev *dev)
>> +{
>> +       /* We currently use a fixed PE assignment per brick */
>> +       uint64_t val, reg;
>> +       val = NPU2_MISC_BRICK_BDF2PE_MAP_ENABLE;
>> +       val = SETFIELD(NPU2_MISC_BRICK_BDF2PE_MAP_PE, val, NPU2_OCAPI_PE(dev));
>> +       val = SETFIELD(NPU2_MISC_BRICK_BDF2PE_MAP_BDF, val, 0);
>> +       reg = NPU2_REG_OFFSET(NPU2_STACK_MISC, NPU2_BLOCK_MISC,
>> +                             NPU2_MISC_BRICK0_BDF2PE_MAP0 +
>> +                             (dev->brick_index * 0x18));
>> +       npu2_write(dev->npu, reg, val);
>> +}
>> +
>>   /* Procedure 13.1.3.5 - TL Configuration */
>>   static void tl_config(uint32_t gcid, uint32_t scom_base, uint64_t index)
>>   {
>> @@ -1422,47 +1436,18 @@ static int64_t npu2_opencapi_ioda_reset(struct phb __unused *phb,
>>          return OPAL_SUCCESS;
>>   }
>>
>> -static int64_t npu2_opencapi_set_pe(struct phb *phb,
>> -                                   uint64_t pe_num,
>> -                                   uint64_t bdfn,
>> -                                   uint8_t bcompare,
>> -                                   uint8_t dcompare,
>> -                                   uint8_t fcompare,
>> -                                   uint8_t action)
>> +static int64_t npu2_opencapi_set_pe(struct phb __unused *phb,
>> +                                   uint64_t __unused pe_num,
>> +                                   uint64_t __unused bdfn,
>> +                                   uint8_t __unused bcompare,
>> +                                   uint8_t __unused dcompare,
>> +                                   uint8_t __unused fcompare,
>> +                                   uint8_t __unused action)
>>   {
>> -       struct npu2 *p;
>> -       struct npu2_dev *dev;
>> -       uint64_t reg, val, pe_bdfn;
>> -
>> -       /* Sanity check */
>> -       if (action != OPAL_MAP_PE && action != OPAL_UNMAP_PE)
>> -               return OPAL_PARAMETER;
>> -       if (pe_num >= NPU2_MAX_PE_NUM)
>> -               return OPAL_PARAMETER;
>> -       if (bdfn >> 8)
>> -               return OPAL_PARAMETER;
>> -       if (bcompare != OpalPciBusAll ||
>> -           dcompare != OPAL_COMPARE_RID_DEVICE_NUMBER ||
>> -           fcompare != OPAL_COMPARE_RID_FUNCTION_NUMBER)
>> -               return OPAL_UNSUPPORTED;
>> -
>> -       /* Get the NPU2 device */
>> -       dev = phb_to_npu2_dev_ocapi(phb);
>> -       if (!dev)
>> -               return OPAL_PARAMETER;
>> -
>> -       p = dev->npu;
>> -
>> -       pe_bdfn = dev->bdfn;
>> -
>> -       val = NPU2_MISC_BRICK_BDF2PE_MAP_ENABLE;
>> -       val = SETFIELD(NPU2_MISC_BRICK_BDF2PE_MAP_PE, val, pe_num);
>> -       val = SETFIELD(NPU2_MISC_BRICK_BDF2PE_MAP_BDF, val, pe_bdfn);
>> -       reg = NPU2_REG_OFFSET(NPU2_STACK_MISC, NPU2_BLOCK_MISC,
>> -                             NPU2_MISC_BRICK0_BDF2PE_MAP0 +
>> -                             (dev->brick_index * 0x18));
>> -       npu2_write(p, reg, val);
>> -
>> +       /*
>> +        * Ignored on OpenCAPI - we use fixed PE assignments. May need
>> +        * addressing when we support dual-link devices.
>> +        */
>>          return OPAL_SUCCESS;
>>   }
>>
>> @@ -1653,7 +1638,13 @@ static void setup_device(struct npu2_dev *dev)
>>          dt_add_property_cells(dn_phb, "ibm,links", 1);
>>          dt_add_property(dn_phb, "ibm,mmio-window", mm_win, sizeof(mm_win));
>>          dt_add_property_cells(dn_phb, "ibm,phb-diag-data-size", 0);
>> +
>> +       /*
>> +        * We ignore whatever PE numbers Linux tries to set, so we just
>> +        * advertise enough that Linux won't complain
>> +        */
>>          dt_add_property_cells(dn_phb, "ibm,opal-num-pes", NPU2_MAX_PE_NUM);
>> +       dt_add_property_cells(dn_phb, "ibm,opal-reserved-pe", NPU2_RESERVED_PE_NUM);
>>
>>          dt_add_property_cells(dn_phb, "ranges", 0x02000000,
>>                                hi32(mm_win[0]), lo32(mm_win[0]),
>> @@ -1740,6 +1731,9 @@ int npu2_opencapi_init_npu(struct npu2 *npu)
>>                  /* Procedure 13.1.3.1 - Select OCAPI vs NVLink */
>>                  brick_config(npu->chip_id, npu->xscom_base, dev->brick_index);
>>
>> +               /* Procedure 13.1.3.4 - Brick to PE Mapping */
>> +               pe_config(dev);
>> +
>>                  /* Procedure 13.1.3.5 - Transaction Layer Configuration */
>>                  tl_config(npu->chip_id, npu->xscom_base, dev->brick_index);
>>
>> diff --git a/include/npu2.h b/include/npu2.h
>> index a2bcc450..86a449eb 100644
>> --- a/include/npu2.h
>> +++ b/include/npu2.h
>> @@ -46,9 +46,24 @@
>>                                            dev->npu->chip_id, dev->brick_index, ## a)
>>
>>
>> -/* Number of PEs supported */
>> -#define NPU2_MAX_PE_NUM                16
>> -#define NPU2_RESERVED_PE_NUM   15
>> +/*
>> + * Number of PEs supported
>> + *
>> + * The NPU supports PE numbers from 0-15. At present, we only assign a maximum
>> + * of 1 PE per brick.
>> + *
>> + * NVLink devices are currently exposed to Linux underneath a single virtual
>> + * PHB. Therefore, we give NVLink half the available PEs, which is enough for
>> + * 6 bricks plus 1 reserved PE.
>> + *
>> + * For OpenCAPI, the BDF-to-PE registers are used exclusively for mapping
>> + * bricks to System Interrupt Log registers (the BDF component of those
>> + * registers is ignored). Currently, we allocate a fixed PE based on the brick
>> + * index in the upper half of the PE namespace.
>> + */
>> +#define NPU2_MAX_PE_NUM                8
>> +#define NPU2_RESERVED_PE_NUM   7
>> +#define NPU2_OCAPI_PE(ndev) ((ndev)->brick_index + NPU2_MAX_PE_NUM)
>>
>>   #define NPU2_LINKS_PER_CHIP 6
>>
>> --
>> 2.19.1
>>
>> _______________________________________________
>> Skiboot mailing list
>> Skiboot at lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/skiboot
>