[Skiboot] [PATCH v2 09/15] platforms/astbmc/witherspoon: Rework NPU presence detection
Andrew Donnellan
andrew.donnellan at au1.ibm.com
Fri Feb 8 18:42:58 AEDT 2019
On 4/2/19 3:29 pm, Alexey Kardashevskiy wrote:
>
>
> On 01/02/2019 17:55, Andrew Donnellan wrote:
>> On 21/1/19 4:30 pm, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 11/01/2019 12:09, Andrew Donnellan wrote:
>>>> Rework NPU presence detection in preparation for supporting both
>>>> NVLink and
>>>> OpenCAPI devices operating simultaneously on the same NPU.
>>>>
>>>> If an OpenCAPI card is connected to GPU#0, and an NVLink GPU is
>>>> connected
>>>> to GPU#1,
>>>
>>> What is "NVLink GPU"? A peer NVLink of GPU#0 connected to GPU#1?
>>
>> It's a GPU that is connected to the system via NVLink. The simplest
>> possible meaning. Not a "peer" or anything.
>
>
> Then what are GPU#0 and GPU#1? Are these connectors which can receive
> opencapi or "nvlink gpu" cards? We do not call them "slots" whyyyy? :)
yeah, they're the SXM2 connectors on the motherboard, which can be
connected either to an NVLink GPU, or to an OpenCAPI device (via a
SXM2<->SlimSAS connector known as Acorn).
>
>
>
>>> What is the supported hardware configuration? In other words, is there a
>>> spec describing it such as a nice drawing from
>>> Witherspoon_Design_Workbook_v1.7_19June2018.pdf, page 40 with the wiring
>>> diagram?
>>
>> Wait, you're asking for sensible documentation? That's not how any of
>> this works :) The diagrams in the Witherspoon workbook are a good start.
>>
>> On a Redbud system, any combination of GPUs and OpenCAPI cards in the 4
>> connectors provided across 2 CPUs should work, however as documented
>> below there are some combinations where the GPU will only receive 2
>> links rather than 3 links. To understand this there are some diagrams
>> you have to decipher in the P9 Fabric workbook as well.
>>
>>>
>>> The patch could be much simpler:
>>>
>>> diff --git a/platforms/astbmc/witherspoon.c
>>> b/platforms/astbmc/witherspoon.c
>>> index fe13899..b592a67 100644
>>> --- a/platforms/astbmc/witherspoon.c
>>> +++ b/platforms/astbmc/witherspoon.c
>>> @@ -233,6 +233,7 @@ static void witherspoon_npu2_device_detect(struct
>>> npu2 *npu)
>>> int rc;
>>>
>>> bool gpu0_present, gpu1_present;
>>> + enum npu2_dev_type gpu0_type = NPU2_DEV_TYPE_UNKNOWN;
>>>
>>> if (witherspoon_type != WITHERSPOON_TYPE_REDBUD) {
>>> prlog(PR_DEBUG, "PLAT: Setting all NPU links to NVLink,
>>> OpenCAPI only supported on Redbud\n");
>>> @@ -305,12 +306,14 @@ static void witherspoon_npu2_device_detect(struct
>>> npu2 *npu)
>>> set_link_details(npu, 1, 3,
>>> NPU2_DEV_TYPE_OPENCAPI);
>>> /* We current don't support using the second
>>> link */
>>> set_link_details(npu, 0, 2,
>>> NPU2_DEV_TYPE_UNKNOWN);
>>> + gpu0_type = npu2_dev_type_opencapi;
>>> } else {
>>> prlog(PR_DEBUG, "PLAT: Chip %d GPU#0 is
>>> NVLink\n",
>>> chip->id);
>>> set_link_details(npu, 0, 0,
>>> NPU2_DEV_TYPE_NVLINK);
>>> set_link_details(npu, 1, 1,
>>> NPU2_DEV_TYPE_NVLINK);
>>> set_link_details(npu, 2, 2,
>>> NPU2_DEV_TYPE_NVLINK);
>>> + gpu0_type = NPU2_DEV_TYPE_NVLINK;
>>> }
>>> }
>>>
>>> @@ -324,7 +327,10 @@ static void witherspoon_npu2_device_detect(struct
>>> npu2 *npu)
>>> } else {
>>> prlog(PR_DEBUG, "PLAT: Chip %d GPU#1 is
>>> NVLink\n",
>>> chip->id);
>>> - set_link_details(npu, 3, 3,
>>> NPU2_DEV_TYPE_NVLINK);
>>> + if (gpu0_type == NPU2_DEV_TYPE_OPENCAPI)
>>> + prlog(PR_WARNING, "PLAT: Chip %d GPU#1
>>> will operate at reduced performance due to presence of OpenCAPI device.
>>> For optimal performance, swap device locations\n", chip->id);
>>> + else
>>> + set_link_details(npu, 3, 3,
>>> NPU2_DEV_TYPE_NVLINK);
>>> set_link_details(npu, 4, 4,
>>> NPU2_DEV_TYPE_NVLINK);
>>> set_link_details(npu, 5, 5,
>>> NPU2_DEV_TYPE_NVLINK);
>>> }
>>>
>>>
>>> Otherwise it is harder to spot the exact change the patch is doing which
>>> is skipping one call to set_link_details().
>>>
>>
>> I see your point, though the change is very clearly mentioned in the
>> commit and has a giant comment block around it.
>
>
> And? Do you really expect people to read the comment, the commit log and
> believe that there is no other change whatsoever so we can skip on
> reading the code movements? Hm.
>
> And I fail to see how new code layout is any better than old one but ok,
> may be it is my poor taste.
>
hmm, will think about it...
>>> btw after this rework you could add this, right?
>>>
>>> diff --git a/platforms/astbmc/witherspoon.c
>>> b/platforms/astbmc/witherspoon.c
>>> index fe13899..7ce525c 100644
>>> --- a/platforms/astbmc/witherspoon.c
>>> +++ b/platforms/astbmc/witherspoon.c
>>> @@ -219,6 +219,7 @@ static void set_link_details(struct npu2 *npu,
>>> uint32_t link_index,
>>> link_index);
>>> return;
>>> }
>>> + assert(dev->brick_index != NPU2_DEV_TYPE_UNKNOWN);
>>
>> I don't think you meant this?
>>
>> Assuming you meant assert(dev->type != NPU2_DEV_TYPE_UNKNOWN) instead -
>
> Correct, my bad.
>
>
>> as you can see above I do explicitly set some links to type unknown when
>> we're not using them, which strictly speaking I don't have to do, but it
>> keeps it very explicit.
>
> I think what I really meant was:
> assert(dev->type == NPU2_DEV_TYPE_UNKNOWN)
> :)
>
> I must have been thinking of "bug_on" hence "!=" instead of "==" and I
> cut-n-pasted wrong "dev->brick_index". Sorry about that. The idea was to
> alert about rewriting the type to ensure it is set in one place only.
That makes sense, I might add that.
>
>
>
>>
>>> dev->brick_index = brick_index;
>>> dev->type = type;
>>> }
>>>
>>>
>>>
>>>
>>>
>>>> }
>>>> return;
>>>>
>>>
>>
>
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan at au1.ibm.com IBM Australia Limited
More information about the Skiboot
mailing list