<div class="socmaildefaultfont" style="font-family:Arial, Helvetica, sans-serif;font-size:10pt" dir="ltr" ><div dir="ltr" >Hi Alexey,</div>
<div dir="ltr" ><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >Oh. Is this address range described anywhere? We could disable mapping</span><br style="color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >these as a precaution measure.</span></div>
<div dir="ltr" > </div>
<div dir="ltr" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >-> The most significant address bits are defined by the NPU GPU BAR registers. I don't know what they're set to on production systems these days, but they are readable via SCOM (refer to the NPU workbook; I'd recommend reading them on a system with recent FW).</span></div>
<div dir="ltr" > </div>
<div dir="ltr" ><span style="display: inline !important; float: none; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0); font-family: Default Monospace,Courier New,Courier,monospace; font-size: 13.33px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >The least significant bits are offsets defined by the GPU driver. I assume their BARs are programmable like ours and could change with driver releases, but the last offsets I had from NVIDIA for the keep-out region were:</span></div>
<div dir="ltr" > </div>
<div dir="ltr" ><span style="display: inline !important; float: none; background-color: rgba(248, 248, 248, 1); color: rgba(var(--sk_primary_foreground,29,28,29),1); font-family: Slack-Lato,appleLogo,sans-serif; font-size: 15px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; line-height: 1.4666; orphans: 2; overflow-wrap: break-word; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >ADDR_LO = 0x7FDE80000</span><br style="box-sizing: inherit; color: rgba(29, 28, 29, 1); font-family: Slack-Lato,appleLogo,sans-serif; font-size: 15px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" ><span style="display: inline !important; float: none; background-color: rgba(248, 248, 248, 1); color: rgba(var(--sk_primary_foreground,29,28,29),1); font-family: Slack-Lato,appleLogo,sans-serif; font-size: 15px; font-style: normal; font-variant: normal; font-weight: 400; letter-spacing: normal; line-height: 1.4666; orphans: 2; overflow-wrap: break-word; text-align: left; text-decoration: none; text-indent: 0px; text-transform: none; -webkit-text-stroke-width: 0px; white-space: normal; word-spacing: 0px;" >ADDR_HI = 0x7FDF20000</span></div>
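<div dir="ltr" > </div>
<div dir="ltr" >As a sanity check on those two offsets, here is a rough Python sketch of the arithmetic. It assumes ADDR_HI is an exclusive upper bound (my assumption; with that reading the span comes out to exactly 640 KiB, which matches the protected region size mentioned earlier in the thread). The helper names and the BAR base used in the usage note are made up for illustration and are not taken from any driver or firmware.</div>

```python
# Keep-out offsets as quoted in this message; they may change between
# driver releases.
ADDR_LO = 0x7FDE80000
ADDR_HI = 0x7FDF20000  # ASSUMPTION: treated as an exclusive upper bound

# Size of the keep-out span under the exclusive-bound assumption.
KEEP_OUT_SIZE = ADDR_HI - ADDR_LO


def overlaps_keep_out(offset, length):
    """Return True if [offset, offset + length) touches the keep-out span.

    offset is relative to the start of the GPU region, like ADDR_LO/ADDR_HI.
    """
    if not length > 0:
        return False
    return ADDR_HI > offset and offset + length > ADDR_LO


def keep_out_window(bar_base):
    """Return the absolute (start, end) pair for a given BAR base.

    bar_base is HYPOTHETICAL here: the real value would have to be read
    from the NPU GPU BAR registers on the system in question.
    """
    return (bar_base + ADDR_LO, bar_base + ADDR_HI)


if __name__ == "__main__":
    print("keep-out size: %d KiB" % (KEEP_OUT_SIZE // 1024))
    print("4 KiB access at ADDR_LO overlaps:", overlaps_keep_out(ADDR_LO, 4096))
```

<div dir="ltr" >Calling keep_out_window() with a BAR base read back via SCOM would give the absolute range to avoid mapping on that system, assuming the base and the offsets combine by simple addition.</div>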
<div dir="ltr" > </div>
<div dir="ltr" >I think there is at least one more region, though, that allows reads but not writes. I'm not sure of the offsets for it, so we'd have to ask NVIDIA. Alternatively, we could use getmemprocs/putmemprocs over an entire GPU region to deduce which regions allow what, but that would probably be tedious and time-consuming.</div>
<div dir="ltr" > </div>
<div dir="ltr" >Can we really stop the driver outright from accessing these regions on its own device? That would be great if we could, but as I understand it, part of the problem is that the driver basically has kernel-level privileges on its own device, so it might be making these accesses outside the normal path an application would take through populated page table entries. As far as I know, the uppermost 256MiB of each GPU is never actually onlined in Linux, but the driver accesses that region regularly anyway (normally in a safe manner, but in this case obviously not).</div>
<div dir="ltr" > </div>
<div dir="ltr" >-Ryan</div>
<blockquote style="border-left:solid #aaaaaa 2px; margin-left:5px; padding-left:5px; direction:ltr; margin-right:0px" dir="ltr" data-history-content-modified="1" data-history-expanded="1" >----- Original message -----<br>From: Alexey Kardashevskiy <aik@ozlabs.ru><br>To: Reza Arbab <arbab@linux.ibm.com><br>Cc: skiboot@lists.ozlabs.org, Alistair Popple <alistair@popple.id.au>, Brian J King <bjking1@us.ibm.com>, Ryan Black <rblack@us.ibm.com><br>Subject: [EXTERNAL] Re: [Skiboot] [PATCH skiboot] npu2: Clear fence on all bricks<br>Date: Sun, Dec 1, 2019 8:06 PM<br>
<div><br><font face="Default Monospace,Courier New,Courier,monospace" size="2" >On 30/11/2019 03:47, Reza Arbab wrote:<br>> On Fri, Nov 22, 2019 at 11:04:22AM +1100, Alexey Kardashevskiy wrote:<br>>> Reza/Ryan, could you please add more details about what exactly causes<br>>> these UR HMIs? Thanks!<br>><br>> Hopefully I've pieced together the bug history correctly. As I<br>> understand it...<br>><br>> Each GPU has a 640kb protected region which will result in a<br>> "unsupported request" (UR) response. The root bug is that the driver<br>> maps and accidentally accesses that area.<br><br>Oh. Is this address range described anywhere? We could disable mapping<br>these as a precaution measure.<br><br><br>> This firmware patch helps for recovery. From our perspective it may seem<br>> redundant to clear the fence on all bricks instead of just the one we're<br>> resettting, but at a hardware level the above UR sends a fence signal to<br>> all the hardware units so they all need to be cleared.<br>><br>> Acked-by: Reza Arbab <arbab@linux.ibm.com><br><br><br>Thanks!<br><br><br>--<br>Alexey</font><br> </div></blockquote>
<div dir="ltr" > </div></div><BR>