[Skiboot] [RFC 04/12] npu2-opencapi: Rework link training timeout

Andrew Donnellan ajd at linux.ibm.com
Thu Jun 20 16:04:55 AEST 2019


On 19/6/19 10:45 pm, Frederic Barrat wrote:
> Opencapi link state should be polled for up to 3 seconds. Current code
> assumes a tight retry loop during fundamental reset at boot, which is
> not going to be true on link retraining. So update the timeout
> detection code to use a timebase instead of a simple retry count which
> could be way too long.
> 
> Signed-off-by: Frederic Barrat <fbarrat at linux.ibm.com>

This looks good.
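
For anyone following along, here is a minimal standalone sketch of the
deadline approach the commit message describes: record a start time,
compute an absolute deadline once, and compare the current tick count
against it on each poll. All names here (sketch_*) and the tick rate are
hypothetical and chosen for illustration only. This is not the skiboot
code; the patch itself uses mftb(), msecs_to_tb(), tb_to_msecs() and
tb_compare().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tick rate, for illustration only; not taken from the patch. */
#define SKETCH_TICKS_PER_MS 512000ULL

/* Hypothetical stand-in for the two per-device fields the patch adds. */
struct sketch_train_timer {
	uint64_t start;    /* tick count when training began */
	uint64_t deadline; /* tick count after which training has timed out */
};

static uint64_t sketch_msecs_to_ticks(uint64_t ms)
{
	return ms * SKETCH_TICKS_PER_MS;
}

static uint64_t sketch_ticks_to_msecs(uint64_t ticks)
{
	return ticks / SKETCH_TICKS_PER_MS;
}

/* Record the start time and compute the absolute deadline once. */
static void sketch_timer_arm(struct sketch_train_timer *t, uint64_t now,
			     uint64_t timeout_ms)
{
	t->start = now;
	t->deadline = now + sketch_msecs_to_ticks(timeout_ms);
}

/*
 * True once 'now' is strictly past the deadline. Comparing the signed
 * difference keeps the result correct if the counter wraps around.
 */
static bool sketch_timer_expired(const struct sketch_train_timer *t,
				 uint64_t now)
{
	return (int64_t)(now - t->deadline) > 0;
}

/* Elapsed time since the timer was armed, in milliseconds. */
static uint64_t sketch_timer_elapsed_ms(const struct sketch_train_timer *t,
					uint64_t now)
{
	return sketch_ticks_to_msecs(now - t->start);
}
```

Because the deadline is absolute, the result does not depend on how often
the poll function runs, which is the property a retry counter lacks when
the polling interval is not fixed.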

Reviewed-by: Andrew Donnellan <ajd at linux.ibm.com>

> ---
>   hw/npu2-opencapi.c | 9 +++++----
>   include/npu2.h     | 2 ++
>   2 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/npu2-opencapi.c b/hw/npu2-opencapi.c
> index ada41ddb..5a94c949 100644
> --- a/hw/npu2-opencapi.c
> +++ b/hw/npu2-opencapi.c
> @@ -1140,13 +1140,13 @@ static int64_t npu2_opencapi_poll_link(struct pci_slot *slot)
>   		reg = get_odl_status(chip_id, dev->brick_index);
>   		if (GETFIELD(OB_ODL_STATUS_TRAINING_STATE_MACHINE, reg) ==
>   			OCAPI_LINK_STATE_TRAINED) {
> -			OCAPIINF(dev, "link trained in %lld ms\n",
> -				OCAPI_LINK_TRAINING_TIMEOUT - slot->retries);
> +			OCAPIINF(dev, "link trained in %ld ms\n",
> +				tb_to_msecs(mftb() - dev->train_start));
>   			check_trained_link(dev, reg);
>   			pci_slot_set_state(slot, OCAPI_SLOT_LINK_TRAINED);
>   			return pci_slot_set_sm_timeout(slot, msecs_to_tb(1));
>   		}
> -		if (slot->retries-- == 0)
> +		if (tb_compare(mftb(), dev->train_timeout) == TB_AAFTERB)
>   			return npu2_opencapi_retry_state(slot, reg);
>   
>   		return pci_slot_set_sm_timeout(slot, msecs_to_tb(1));
> @@ -1252,7 +1252,8 @@ static int64_t npu2_opencapi_freset(struct pci_slot *slot)
>   		/* Bump lanes - this improves training reliability */
>   		npu2_opencapi_bump_ui_lane(dev);
>   		start_training(chip_id, dev);
> -		slot->retries = OCAPI_LINK_TRAINING_TIMEOUT;
> +		dev->train_start = mftb();
> +		dev->train_timeout = dev->train_start + msecs_to_tb(OCAPI_LINK_TRAINING_TIMEOUT);
>   		pci_slot_set_state(slot, OCAPI_SLOT_LINK_START);
>   		return slot->ops.poll_link(slot);
>   
> diff --git a/include/npu2.h b/include/npu2.h
> index 5b2a436b..57a9cc96 100644
> --- a/include/npu2.h
> +++ b/include/npu2.h
> @@ -160,6 +160,8 @@ struct npu2_dev {
>   	uint64_t		linux_pe;
>   	bool			train_need_fence;
>   	bool			train_fenced;
> +	unsigned long		train_start;
> +	unsigned long		train_timeout;
>   };
>   
>   struct npu2 {
> 

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd at linux.ibm.com             IBM Australia Limited


