[Skiboot] [PATCH 1/4] platforms/astbmc/witherspoon: Rework NPU presence detection
Andrew Donnellan
andrew.donnellan at au1.ibm.com
Fri Oct 19 15:49:29 AEDT 2018
On 19/10/18 3:02 am, Frederic Barrat wrote:
>> - if (gpu0_present) {
>> + if (occ_get_gpu_presence(chip, 0)) {
>> + prlog(PR_DEBUG, "PLAT: Chip %d GPU#0 slot present\n", chip->id);
>> if (state & (1 << 0)) {
>> prlog(PR_DEBUG, "PLAT: Chip %d GPU#0 is OpenCAPI\n",
>> chip->id);
>> - /*
>> - * On witherspoon, bricks 2 and 3 are connected to
>> - * the lanes matching links 1 and 0 in OpenCAPI mode.
>> - */
>> - set_link_details(npu, 0, 3, NPU2_DEV_TYPE_OPENCAPI);
>> - /* We current don't support using the second link */
>> - set_link_details(npu, 1, 2, NPU2_DEV_TYPE_UNKNOWN);
>> + gpu0_type = NPU2_DEV_TYPE_OPENCAPI;
>> } else {
>> prlog(PR_DEBUG, "PLAT: Chip %d GPU#0 is NVLink\n",
>> chip->id);
>> - set_link_details(npu, 0, 0, NPU2_DEV_TYPE_NVLINK);
>> - set_link_details(npu, 1, 1, NPU2_DEV_TYPE_NVLINK);
>> - set_link_details(npu, 2, 2, NPU2_DEV_TYPE_NVLINK);
>> + gpu0_type = NPU2_DEV_TYPE_NVLINK;
>> }
>> }
>
> I would add a comment to explain that the value of the "state" variable
> is floating if there's no device. We suffered enough to get it working ;-)
>
Will do.
>
>> - if (gpu1_present) {
>> + if (occ_get_gpu_presence(chip, 1)) {
>> + prlog(PR_DEBUG, "PLAT: Chip %d GPU#1 slot present\n", chip->id);
>> if (state & (1 << 1)) {
>> prlog(PR_DEBUG, "PLAT: Chip %d GPU#1 is OpenCAPI\n",
>> chip->id);
>> - set_link_details(npu, 4, 4, NPU2_DEV_TYPE_OPENCAPI);
>> - /* We current don't support using the second link */
>> - set_link_details(npu, 5, 5, NPU2_DEV_TYPE_UNKNOWN);
>> + gpu1_type = NPU2_DEV_TYPE_OPENCAPI;
>> } else {
>> prlog(PR_DEBUG, "PLAT: Chip %d GPU#1 is NVLink\n",
>> chip->id);
>> + gpu1_type = NPU2_DEV_TYPE_NVLINK;
>> + }
>> + }
>> +
>> + if (gpu0_type == NPU2_DEV_TYPE_OPENCAPI) {
>> + /*
>> + * On witherspoon, bricks 2 and 3 are connected to
>> + * the lanes matching links 1 and 0 in OpenCAPI mode.
>> + */
>> + set_link_details(npu, 0, 3, NPU2_DEV_TYPE_OPENCAPI);
>> + /* We currently don't support using the second link */
>> + set_link_details(npu, 1, 2, NPU2_DEV_TYPE_UNKNOWN);
>> + }
>
> I think we have a problem here. Not because of this patch, but the
> previous one which introduced the link to brick mapping.
> We're targeting the "middle" group of lanes (4-6 11-12 17-19), that's
> the lanes connected to the connector 5 of the GPU slot. So that would be
> link 1 and not link 0 (for brick 3).
> I think what we want is:
> set_link_details(npu, 0, 2, NPU2_DEV_TYPE_UNKNOWN);
> set_link_details(npu, 1, 3, NPU2_DEV_TYPE_OPENCAPI);
>
> The surprising thing is that it trains in both cases! It should only
> affect the PHY reset sequence and calibration, and not the rest of the
> training sequence, so we may have been lucky that the default inits were
> close enough.
>
> Link 4 below is good.
I'm not entirely sure how I got to that, I remember having to put a fair
bit of thought into it, but I can't see how I got there any more. I
think you're right, and I'll change this in v2.
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan at au1.ibm.com IBM Australia Limited
More information about the Skiboot
mailing list