[Skiboot] [PATCH] npu2/hw-procedures: fence bricks on GPU reset
stewart at linux.vnet.ibm.com
Tue Apr 24 13:33:46 AEST 2018
Balbir Singh <bsingharora at gmail.com> writes:
> The NPU workbook defines a way of fencing a brick and
> getting the brick out of fence state. We do have an implementation
> of bringing the brick out of fenced/quiesced state. We do
> the latter in our procedures, but to support run time reset
> we need to do the former.
> The fencing ensures that access to memory behind the links
> will not lead to HMIs, but instead SUEs will be populated
> in cache (in the case of speculation). The expectation is then
> that prior to and after reset, the operating system components
> will flush the cache for the region of memory behind the GPU.
> This patch does the following:
> 1. Implements an npu2_dev_fence_brick() function to set/clear
> fence state
> 2. Clears FIR bits prior to clearing the fence status
> 3. Clears the fence status
> 4. Takes the powerbus out of CQ fence much later now,
> in credits_check(), which is the last hardware procedure
> called after link training.
> Signed-off-by: Balbir Singh <bsingharora at gmail.com>
> Notes for reviewer
> - Clearing FIR bits will clear the full NPU FIR, but I don't think it's a
> problem; any major link or powerbus issues will retrigger. We
> don't do a whole lot of mitigation in our HMI handling, just reporting,
> so we can't be papering over a problem from what I can see.
> - I've tested this on a 4 GPU box with several reset cycles over a couple
> of days
> hw/npu2-hw-procedures.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 45 insertions(+), 7 deletions(-)
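For readers following along without the diff, the ordering described in the four steps above can be sketched roughly as below. This is only an illustrative model, not the code from the patch: struct sim_brick, sim_fence_brick() and sim_credits_check() are hypothetical names, and the "registers" are plain struct fields rather than real NPU2 SCOM registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simulated state for one brick; not the real NPU2 layout. */
struct sim_brick {
	uint64_t fir;          /* stand-in for the NPU FIR bits */
	bool     brick_fenced; /* stand-in for the brick fence state */
	bool     cq_fenced;    /* stand-in for the powerbus CQ fence */
};

/*
 * Set or clear the (simulated) brick fence. When clearing, the FIR
 * is cleared first and the fence state is dropped afterwards,
 * mirroring steps 2 and 3 of the patch description.
 */
static void sim_fence_brick(struct sim_brick *b, bool set)
{
	assert(b != NULL);

	if (set) {
		b->brick_fenced = true;
		return;
	}

	b->fir = 0;              /* step 2: clear FIR bits */
	b->brick_fenced = false; /* step 3: clear fence status */
}

/*
 * Stand-in for the last procedure run after link training. Only here
 * is the (simulated) CQ fence lifted, mirroring step 4.
 */
static void sim_credits_check(struct sim_brick *b)
{
	assert(b != NULL);
	b->cq_fenced = false;
}
```

In this model a reset cycle would call sim_fence_brick(b, true) before the reset, sim_fence_brick(b, false) once the link procedures start, and sim_credits_check(b) at the very end, so the CQ fence stays up until link training has finished.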
cheers. Merged to master as of bdd925aabbbbf0d35a44d85c9d51809c668be1ba
and to 5.10.x as of 4adf8e31b1108a374da7eab3fcdbf1ec7e760ef2.
I guess it's time to do a 5.10.5
OPAL Architect, IBM.