[Skiboot] [PATCH 1/4] platforms/astbmc/witherspoon: Rework NPU presence detection

Andrew Donnellan andrew.donnellan at au1.ibm.com
Fri Oct 19 15:49:29 AEDT 2018


On 19/10/18 3:02 am, Frederic Barrat wrote:
>> -    if (gpu0_present) {
>> +    if (occ_get_gpu_presence(chip, 0)) {
>> +        prlog(PR_DEBUG, "PLAT: Chip %d GPU#0 slot present\n", chip->id);
>>           if (state & (1 << 0)) {
>>               prlog(PR_DEBUG, "PLAT: Chip %d GPU#0 is OpenCAPI\n",
>>                     chip->id);
>> -            /*
>> -             * On witherspoon, bricks 2 and 3 are connected to
>> -             * the lanes matching links 1 and 0 in OpenCAPI mode.
>> -             */
>> -            set_link_details(npu, 0, 3, NPU2_DEV_TYPE_OPENCAPI);
>> -            /* We current don't support using the second link */
>> -            set_link_details(npu, 1, 2, NPU2_DEV_TYPE_UNKNOWN);
>> +            gpu0_type = NPU2_DEV_TYPE_OPENCAPI;
>>           } else {
>>               prlog(PR_DEBUG, "PLAT: Chip %d GPU#0 is NVLink\n",
>>                     chip->id);
>> -            set_link_details(npu, 0, 0, NPU2_DEV_TYPE_NVLINK);
>> -            set_link_details(npu, 1, 1, NPU2_DEV_TYPE_NVLINK);
>> -            set_link_details(npu, 2, 2, NPU2_DEV_TYPE_NVLINK);
>> +            gpu0_type = NPU2_DEV_TYPE_NVLINK;
>>           }
>>       }
> 
> I would add a comment to explain that the value of the "state" variable 
> is floating if there's no device. We suffered enough to get it working ;-)
> 

Will do.

> 
>> -    if (gpu1_present) {
>> +    if (occ_get_gpu_presence(chip, 1)) {
>> +        prlog(PR_DEBUG, "PLAT: Chip %d GPU#1 slot present\n", chip->id);
>>           if (state & (1 << 1)) {
>>               prlog(PR_DEBUG, "PLAT: Chip %d GPU#1 is OpenCAPI\n",
>>                     chip->id);
>> -            set_link_details(npu, 4, 4, NPU2_DEV_TYPE_OPENCAPI);
>> -            /* We current don't support using the second link */
>> -            set_link_details(npu, 5, 5, NPU2_DEV_TYPE_UNKNOWN);
>> +            gpu1_type = NPU2_DEV_TYPE_OPENCAPI;
>>           } else {
>>               prlog(PR_DEBUG, "PLAT: Chip %d GPU#1 is NVLink\n",
>>                     chip->id);
>> +            gpu1_type = NPU2_DEV_TYPE_NVLINK;
>> +        }
>> +    }
>> +
>> +    if (gpu0_type == NPU2_DEV_TYPE_OPENCAPI) {
>> +        /*
>> +         * On witherspoon, bricks 2 and 3 are connected to
>> +         * the lanes matching links 1 and 0 in OpenCAPI mode.
>> +         */
>> +        set_link_details(npu, 0, 3, NPU2_DEV_TYPE_OPENCAPI);
>> +        /* We currently don't support using the second link */
>> +        set_link_details(npu, 1, 2, NPU2_DEV_TYPE_UNKNOWN);
>> +    }
> 
> I think we have a problem here. Not because of this patch, but the 
> previous one which introduced the link to brick mapping.
> We're targeting the "middle" group of lanes (4-6 11-12 17-19), that's 
> the lanes connected to the connector 5 of the GPU slot. So that would be 
> link 1 and not link 0 (for brick 3).
> I think what we want is:
>      set_link_details(npu, 0, 2, NPU2_DEV_TYPE_UNKNOWN);
>      set_link_details(npu, 1, 3, NPU2_DEV_TYPE_OPENCAPI);
> 
> The surprising thing is that it trains in both cases! It should only 
> affect the PHY reset sequence and calibration, and not the rest of the 
> training sequence, so we may have been lucky that the default inits were 
> close enough.
> 
> Link 4 below is good.

I'm not entirely sure how I got to that, I remember having to put a fair 
bit of thought into it, but I can't see how I got there any more. I 
think you're right, and I'll change this in v2.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan at au1.ibm.com  IBM Australia Limited



More information about the Skiboot mailing list