[Skiboot] [PATCH] npu2/hw-procedures: fence bricks on GPU reset

Alistair Popple alistair at popple.id.au
Mon Apr 23 12:15:15 AEST 2018


Thanks Balbir, this looks pretty good to me.

Reviewed-By: Alistair Popple <alistair at popple.id.au>

On Monday, 23 April 2018 11:43:12 AM AEST Balbir Singh wrote:
> The NPU workbook defines a way of fencing a brick and
> getting the brick out of fence state. We do have an implementation
> of bringing the brick out of fenced/quiesced state. We do
> the latter in our procedures, but to support run time reset
> we need to do the former.
> 
> The fencing ensures that access to memory behind the links
> will not lead to HMI's, but instead SUE's will be populated
> in cache (in the case of speculation). The expectation is then
> that prior to and after reset, the operating system components
> will flush the cache for the region of memory behind the GPU.
> 
> This patch does the following:
> 
> 1. Implements a npu2_dev_fence_brick() function to set/clear
> fence state
> 2. Clear FIR bits prior to clearing the fence status
> 3. Clear's the fence status
> 4. We take the powerbus out of CQ fence much later now,
> in credits_check() which is the last hardware procedure
> called after link training.
> 
> Signed-off-by: Balbir Singh <bsingharora at gmail.com>
> ---
> 
> Notes for reviewer
>  - Clearing FIR bits, will clear full NPU FIR, but I don't think it's a
>  problem, any major link or powerbus issues will retrigger back. We
>  don't do a whole lot of mitigation in our HMI handling, just reporting,
>  so we can't we papering over a problem from what I can see.
>  - I've tested this on a 4 GPU box with several reset cycles over a couple
>  of days
> 
>  hw/npu2-hw-procedures.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
> index 9e4a4316..e25b85c5 100644
> --- a/hw/npu2-hw-procedures.c
> +++ b/hw/npu2-hw-procedures.c
> @@ -232,6 +232,26 @@ static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
>  	return false;
>  }
>  
> +static int64_t npu2_dev_fence_brick(struct npu2_dev *ndev, bool set)
> +{
> +	/*
> +	 * Add support for queisce/fence the brick at
> +	 * procedure reset time.
> +	 */
> +	uint32_t brick;
> +	uint64_t val;
> +
> +	brick = ndev->index;
> +	if (set)
> +		brick += 6;
> +
> +	val = PPC_BIT(brick);
> +	NPU2DEVINF(ndev, "%s fence brick %d, val %llx\n", set ? "set" : "clear",
> +			ndev->index, val);
> +	npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
> +	return 0;
> +}
> +
>  /* Procedure 1.2.1 - Reset NPU/NDL */
>  uint32_t reset_ntl(struct npu2_dev *ndev)
>  {
> @@ -288,19 +308,28 @@ static uint32_t reset_ndl(struct npu2_dev *ndev)
>  static uint32_t reset_ntl_release(struct npu2_dev *ndev)
>  {
>  	uint64_t val;
> +	uint64_t npu2_fir;
> +	uint64_t npu2_fir_addr;
> +	int i;
>  
> -	val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
> -	val &= 0xFFBFFFFFFFFFFFFF;
> -	npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
> +	/* Clear FIR bits */
> +	npu2_fir_addr = NPU2_FIR_REGISTER_0;
> +	npu2_fir = 0;
>  
> -	if (!poll_fence_status(ndev, 0x8000000000000000))
> -		return PROCEDURE_COMPLETE | PROCEDURE_FAILED;
> +	for (i = 0; i < NPU2_TOTAL_FIR_REGISTERS; i++) {
> +		npu2_write(ndev->npu, npu2_fir_addr, npu2_fir);
> +		npu2_fir_addr += NPU2_FIR_OFFSET;
> +
> +	}
> +
> +	/* Release the fence */
> +	npu2_dev_fence_brick(ndev, false);
>  
>  	val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
> -	val &= 0xFF3FFFFFFFFFFFFF;
> +	val &= 0xFFBFFFFFFFFFFFFF;
>  	npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
>  
> -	if (!poll_fence_status(ndev, 0x0))
> +	if (!poll_fence_status(ndev, 0x8000000000000000))
>  		return PROCEDURE_COMPLETE | PROCEDURE_FAILED;
>  
>  	return PROCEDURE_NEXT;
> @@ -718,6 +747,7 @@ static uint32_t check_credit(struct npu2_dev *ndev, uint64_t reg,
>  static uint32_t check_credits(struct npu2_dev *ndev)
>  {
>  	int fail = 0;
> +	uint64_t val;
>  
>  	fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
>  	fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
> @@ -728,6 +758,13 @@ static uint32_t check_credits(struct npu2_dev *ndev)
>  
>  	assert(!fail);
>  
> +	val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
> +	val &= 0xFF3FFFFFFFFFFFFF;
> +	npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
> +
> +	if (!poll_fence_status(ndev, 0x0))
> +		return PROCEDURE_COMPLETE | PROCEDURE_FAILED;
> +
>  	return PROCEDURE_COMPLETE;
>  }
>  DEFINE_PROCEDURE(check_credits);
> @@ -885,5 +922,6 @@ int64_t npu2_dev_procedure(void *dev, struct pci_cfg_reg_filter *pcrf,
>  
>  void npu2_dev_procedure_reset(struct npu2_dev *dev)
>  {
> +	npu2_dev_fence_brick(dev, true);
>  	npu2_clear_link_flag(dev, NPU2_DEV_DL_RESET);
>  }
> 




More information about the Skiboot mailing list