<div class="socmaildefaultfont" style="font-family:Arial, Helvetica, sans-serif;font-size:10pt" dir="ltr" ><div dir="ltr" >Hi Reza,</div>
<div dir="ltr" > </div>
<div dir="ltr" ><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >Each GPU has a 640kb protected region which will result in a</span><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >"unsupported request" (UR) response. 
The root bug is that the driver</span><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >maps and accidentally accesses that area.</span></div>
<div dir="ltr" > </div>
<div dir="ltr" >-> That more or less matches my understanding. There's a small region near the uppermost offset of each GPU that architecturally can never be read or written from the CPU (a requirement that was not understood or well communicated when the P9 hardware was designed). Any such access causes a GPU MMU fault, which in turn produces the UR response. Based on a few other failures, there may also be another nearby region that is read-only: the CPU can gain ownership there without a UR response, but if it tries to push modified data back, the GPU's MMU takes a permission fault in the same way and returns a UR response to the CPU.</div>
<div dir="ltr" > </div>
<div dir="ltr" >My best understanding is that the driver never intended to map or read/write these regions, but it occasionally receives interrupts directing it to read these offsets (due to a bug in some PMU software on the GPU, or possibly a misunderstanding of how to adapt something related to that software to the Power ISA), and it does so without noticing that they fall in these forbidden regions.</div>
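<div dir="ltr" > </div>
<div dir="ltr" >To illustrate the kind of range check that appears to be missing, here is a minimal sketch. It assumes the protected region is exactly the top 640 KiB of each GPU; the name gpu_access_is_forbidden, the macro GPU_PROTECTED_SIZE, and the parameters are hypothetical and are not taken from the driver or from skiboot.</div>

```c
/* Assumption: the protected region is the top 640 KiB of each GPU. */
#define GPU_PROTECTED_SIZE (640ULL * 1024)

/*
 * Hypothetical helper: returns 1 if the access [offset, offset + len)
 * must not be issued from the CPU, 0 if it is safe.
 * gpu_size is assumed to be at least GPU_PROTECTED_SIZE.
 */
static int gpu_access_is_forbidden(unsigned long long gpu_size,
                                   unsigned long long offset,
                                   unsigned long long len)
{
    unsigned long long prot_start = gpu_size - GPU_PROTECTED_SIZE;

    if (len == 0)
        return 0;
    /* Reject anything that runs past the end of the GPU. */
    if (offset >= gpu_size || len > gpu_size - offset)
        return 1;
    /* Forbidden if the last byte accessed lands in the protected region. */
    return offset + len > prot_start;
}
```

<div dir="ltr" >A handler could call something like this on the offset it was told to read and skip the access when it returns 1, rather than touching the region and triggering the UR.</div>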
<div dir="ltr" > </div>
<div dir="ltr" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >This firmware patch helps for recovery. From our perspective it may seem</span><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >redundant to clear the fence on all bricks instead of just the one we're</span><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; 
font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >resetting, but at a hardware level the above UR sends a fence signal to</span><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >all the hardware units so they all need to be cleared.</span></div>
<div dir="ltr" > </div>
<div dir="ltr" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Arial,Helvetica,sans-serif; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >-> This I'm not as sure about, but it probably doesn't hurt to clear the fence on bricks that have not errored, at least. This error would be contained to the brick it occurred on and mirrored to the other bricks in the same error brick grouping (via the NPU registers bearing that name). Most errors fence only a particular brick plus the bricks it is grouped with, but if, say, the XTS macro has an error, all bricks fence together, because all links on a chip share the translation macro.</span></div>
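<div dir="ltr" > </div>
<div dir="ltr" >The grouping behaviour described above can be sketched roughly as follows. This is only an illustration: NPU_BRICKS, the group_of table, and fenced_bricks are hypothetical names, and the brick count of 6 and the group layout used in the example are assumptions rather than values read from the error brick grouping registers.</div>

```c
/* Assumption: number of bricks sharing one translation (XTS) macro. */
#define NPU_BRICKS 6

/*
 * Hypothetical model: group_of[i] is the error brick group of brick i.
 * Returns a bitmask of bricks expected to be fenced after an error.
 */
static unsigned int fenced_bricks(const unsigned char group_of[NPU_BRICKS],
                                  unsigned int failed_brick,
                                  int xts_error)
{
    unsigned int mask = 0;
    unsigned int i;

    /* An XTS error fences every brick, since they share the macro. */
    if (xts_error)
        return (1u << NPU_BRICKS) - 1;

    /* Otherwise only the failing brick and its group are fenced. */
    for (i = 0; i < NPU_BRICKS; i++) {
        if (group_of[i] == group_of[failed_brick])
            mask |= 1u << i;
    }
    return mask;
}
```

<div dir="ltr" >With bricks grouped in pairs, an error on brick 2 would fence only bricks 2 and 3 in this model, while an XTS error would fence all six, in which case clearing the fence on every brick is the only full recovery.</div>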
<div dir="ltr" > </div>
<div dir="ltr" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Arial,Helvetica,sans-serif; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >-Ryan</span></div>
<blockquote style="border-left:solid #aaaaaa 2px; margin-left:5px; padding-left:5px; direction:ltr; margin-right:0px" dir="ltr" data-history-content-modified="1" data-history-expanded="1" >----- Original message -----<br>From: "Reza Arbab" <arbab@linux.ibm.com><br>To: "Alexey Kardashevskiy" <aik@ozlabs.ru><br>Cc: skiboot@lists.ozlabs.org, "Alistair Popple" <alistair@popple.id.au>, Brian J King/Rochester/IBM@IBMUS, Ryan Black/Rochester/IBM@IBMUS<br>Subject: Re: [Skiboot] [PATCH skiboot] npu2: Clear fence on all bricks<br>Date: Fri, Nov 29, 2019 10:47 AM<br>
<div><font face="Default Monospace,Courier New,Courier,monospace" size="2" >On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:<br>>Reza/Ryan, could you please add more details about what exactly causes<br>>these UR HMIs? Thanks!<br><br>Hopefully I've pieced together the bug history correctly. As I<br>understand it...<br><br>Each GPU has a 640kb protected region which will result in a<br>"unsupported request" (UR) response. The root bug is that the driver<br>maps and accidentally accesses that area.<br><br>This firmware patch helps for recovery. From our perspective it may seem<br>redundant to clear the fence on all bricks instead of just the one we're<br>resetting, but at a hardware level the above UR sends a fence signal to<br>all the hardware units so they all need to be cleared.<br><br>Acked-by: Reza Arbab <arbab@linux.ibm.com><br><br>--<br>Reza Arbab</font></div></blockquote>
<div dir="ltr" > </div></div><BR>