[Skiboot] [PATCH skiboot] npu2: Clear fence on all bricks

Alexey Kardashevskiy aik at ozlabs.ru
Fri Nov 22 11:04:22 AEDT 2019


A bug in the NVIDIA driver can cause a UR HMI which fences bricks
(links). At the moment we clear the fence status only for the bricks of a
specific device; however, this does not appear to be enough and we need to
clear fences for all bricks. This is OK as we do not allow using GPUs
individually anyway.

Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
---

Reza/Ryan, could you please add more details about what exactly causes
these UR HMIs? Thanks!
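
For review purposes, the difference between the old and new clearing
logic can be sketched as a small standalone model. This is not skiboot
code: model_fence_state, model_read(), model_write() and MODEL_PPC_BIT()
are hypothetical stand-ins, and the write-1-to-clear behaviour of the
register is an assumption inferred from how the existing code writes
PPC_BIT(ndev->brick_index) back to NPU2_MISC_FENCE_STATE.

```c
#include <stdint.h>

/* Stand-in for skiboot's PPC_BIT(): IBM numbering, bit 0 is the MSB. */
#define MODEL_PPC_BIT(bit) (0x8000000000000000ULL >> (bit))

/* Hypothetical in-memory model of the fence state register. */
static uint64_t model_fence_state;

static uint64_t model_read(void)
{
	return model_fence_state;
}

/* Assumption: writing a 1 to a bit clears that brick's fence. */
static void model_write(uint64_t val)
{
	model_fence_state &= ~val;
}

/* Old behaviour: clear only the fence bit of this device's brick. */
static uint64_t clear_own_brick(int brick_index)
{
	uint64_t val = model_read();

	if (val & MODEL_PPC_BIT(brick_index))
		model_write(MODEL_PPC_BIT(brick_index));

	/* Return what is still fenced afterwards. */
	return model_read();
}

/* New behaviour: write back every set bit, clearing all bricks. */
static uint64_t clear_all_bricks(void)
{
	uint64_t val = model_read();

	if (val)
		model_write(val);

	/* Return what is still fenced afterwards. */
	return model_read();
}
```

In this model, with bricks 0 and 3 fenced, clear_own_brick(0) leaves the
bit for brick 3 set, whereas clear_all_bricks() returns 0. The real patch
additionally polls NPU2_NTL_CQ_FENCE_STATUS for up to 4096 reads, which
the model leaves out.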
---
 hw/npu2-hw-procedures.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/hw/npu2-hw-procedures.c b/hw/npu2-hw-procedures.c
index 8379cbbeedcf..8d7082d27dc6 100644
--- a/hw/npu2-hw-procedures.c
+++ b/hw/npu2-hw-procedures.c
@@ -264,8 +264,8 @@ static bool poll_fence_status(struct npu2_dev *ndev, uint64_t val)
 /* Procedure 1.2.1 - Reset NPU/NDL */
 uint32_t reset_ntl(struct npu2_dev *ndev)
 {
-	uint64_t val;
-	int lane;
+	uint64_t val, check;
+	int lane, i;
 
 	set_iovalid(ndev, true);
 
@@ -283,10 +283,17 @@ uint32_t reset_ntl(struct npu2_dev *ndev)
 
 	/* Clear fence state for the brick */
 	val = npu2_read(ndev->npu, NPU2_MISC_FENCE_STATE);
-	if (val & PPC_BIT(ndev->brick_index)) {
-		NPU2DEVINF(ndev, "Clearing brick fence\n");
-		val = PPC_BIT(ndev->brick_index);
+	if (val) {
+		NPU2DEVINF(ndev, "Clearing all bricks fence\n");
 		npu2_write(ndev->npu, NPU2_MISC_FENCE_STATE, val);
+		for (i = 0, check = 0; i < 4096; i++) {
+			check = npu2_read(ndev->npu, NPU2_NTL_CQ_FENCE_STATUS(ndev));
+			if (!check)
+				break;
+		}
+		if (check)
+			NPU2DEVERR(ndev, "Clearing NPU2_MISC_FENCE_STATE=0x%llx timeout, current=0x%llx\n",
+					val, check);
 	}
 
 	/* Write PRI */
-- 
2.17.1


