[Skiboot] [PATCH] npu2/hw-procedures: fence bricks on GPU reset
Alistair Popple
alistair at popple.id.au
Mon Apr 23 12:15:15 AEST 2018
Thanks Balbir, this looks pretty good to me.
Reviewed-By: Alistair Popple <alistair at popple.id.au>
On Monday, 23 April 2018 11:43:12 AM AEST Balbir Singh wrote:
> The NPU workbook defines a way of fencing a brick and
> getting the brick out of fence state. We do have an implementation
> of bringing the brick out of fenced/quiesced state. We do
> the latter in our procedures, but to support run time reset
> we need to do the former.
>
> The fencing ensures that access to memory behind the links
> will not lead to HMI's, but instead SUE's will be populated
> in cache (in the case of speculation). The expectation is then
> that prior to and after reset, the operating system components
> will flush the cache for the region of memory behind the GPU.
>
> This patch does the following:
>
> 1. Implements a npu2_dev_fence_brick() function to set/clear
> fence state
> 2. Clear FIR bits prior to clearing the fence status
> 3. Clear's the fence status
> 4. We take the powerbus out of CQ fence much later now,
> in credits_check() which is the last hardware procedure
> called after link training.
>
> Signed-off-by: Balbir Singh <bsingharora at gmail.com>
> ---
>
> Notes for reviewer
> - Clearing FIR bits, will clear full NPU FIR, but I don't think it's a
> problem, any major link or powerbus issues will retrigger back. We
> don't do a whole lot of mitigation in our HMI handling, just reporting,
> so we can't we papering over a problem from what I can see.
> - I've tested this on a 4 GPU box with several reset cycles over a couple
> of days
>
> hw/npu2-hw-procedures.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
> index 9e4a4316..e25b85c5 100644
> --- a/hw/npu2-hw-procedures.c
> +++ b/hw/npu2-hw-procedures.c
> @@ -232,6 +232,26 @@ static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
> return false;
> }
>
> +static int64_t npu2_dev_fence_brick(struct npu2_dev *ndev, bool set)
> +{
> + /*
> + * Add support for queisce/fence the brick at
> + * procedure reset time.
> + */
> + uint32_t brick;
> + uint64_t val;
> +
> + brick = ndev->index;
> + if (set)
> + brick += 6;
> +
> + val = PPC_BIT(brick);
> + NPU2DEVINF(ndev, "%s fence brick %d, val %llx\n", set ? "set" : "clear",
> + ndev->index, val);
> + npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
> + return 0;
> +}
> +
> /* Procedure 1.2.1 - Reset NPU/NDL */
> uint32_t reset_ntl(struct npu2_dev *ndev)
> {
> @@ -288,19 +308,28 @@ static uint32_t reset_ndl(struct npu2_dev *ndev)
> static uint32_t reset_ntl_release(struct npu2_dev *ndev)
> {
> uint64_t val;
> + uint64_t npu2_fir;
> + uint64_t npu2_fir_addr;
> + int i;
>
> - val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
> - val &= 0xFFBFFFFFFFFFFFFF;
> - npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
> + /* Clear FIR bits */
> + npu2_fir_addr = NPU2_FIR_REGISTER_0;
> + npu2_fir = 0;
>
> - if (!poll_fence_status(ndev, 0x8000000000000000))
> - return PROCEDURE_COMPLETE | PROCEDURE_FAILED;
> + for (i = 0; i < NPU2_TOTAL_FIR_REGISTERS; i++) {
> + npu2_write(ndev->npu, npu2_fir_addr, npu2_fir);
> + npu2_fir_addr += NPU2_FIR_OFFSET;
> +
> + }
> +
> + /* Release the fence */
> + npu2_dev_fence_brick(ndev, false);
>
> val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
> - val &= 0xFF3FFFFFFFFFFFFF;
> + val &= 0xFFBFFFFFFFFFFFFF;
> npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
>
> - if (!poll_fence_status(ndev, 0x0))
> + if (!poll_fence_status(ndev, 0x8000000000000000))
> return PROCEDURE_COMPLETE | PROCEDURE_FAILED;
>
> return PROCEDURE_NEXT;
> @@ -718,6 +747,7 @@ static uint32_t check_credit(struct npu2_dev *ndev, uint64_t reg,
> static uint32_t check_credits(struct npu2_dev *ndev)
> {
> int fail = 0;
> + uint64_t val;
>
> fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
> fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
> @@ -728,6 +758,13 @@ static uint32_t check_credits(struct npu2_dev *ndev)
>
> assert(!fail);
>
> + val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
> + val &= 0xFF3FFFFFFFFFFFFF;
> + npu2_write(ndev->npu, NPU2_NTL_MISC_CFG1(ndev), val);
> +
> + if (!poll_fence_status(ndev, 0x0))
> + return PROCEDURE_COMPLETE | PROCEDURE_FAILED;
> +
> return PROCEDURE_COMPLETE;
> }
> DEFINE_PROCEDURE(check_credits);
> @@ -885,5 +922,6 @@ int64_t npu2_dev_procedure(void *dev, struct pci_cfg_reg_filter *pcrf,
>
> void npu2_dev_procedure_reset(struct npu2_dev *dev)
> {
> + npu2_dev_fence_brick(dev, true);
> npu2_clear_link_flag(dev, NPU2_DEV_DL_RESET);
> }
>
More information about the Skiboot
mailing list