[Skiboot] [PATCH] npu2/hw-procedures: fence bricks on GPU reset

Stewart Smith stewart at linux.vnet.ibm.com
Tue Apr 24 13:33:46 AEST 2018


Balbir Singh <bsingharora at gmail.com> writes:
> The NPU workbook defines a way of fencing a brick and
> getting the brick out of fence state. We do have an implementation
> of bringing the brick out of fenced/quiesced state. We do
> the latter in our procedures, but to support run time reset
> we need to do the former.
>
> The fencing ensures that access to memory behind the links
> will not lead to HMIs, but instead SUEs will be populated
> in cache (in the case of speculation). The expectation is then
> that prior to and after reset, the operating system components
> will flush the cache for the region of memory behind the GPU.
>
> This patch does the following:
>
> 1. Implements an npu2_dev_fence_brick() function to set/clear
> fence state
> 2. Clears FIR bits prior to clearing the fence status
> 3. Clears the fence status
> 4. Takes the powerbus out of CQ fence much later now,
> in credits_check(), which is the last hardware procedure
> called after link training.
>
> Signed-off-by: Balbir Singh <bsingharora at gmail.com>
> ---
>
> Notes for reviewer
>  - Clearing FIR bits will clear the full NPU FIR, but I don't think it's a
>  problem; any major link or powerbus issues will retrigger. We
>  don't do a whole lot of mitigation in our HMI handling, just reporting,
>  so we can't be papering over a problem from what I can see.
>  - I've tested this on a 4 GPU box with several reset cycles over a couple
>  of days
>
>  hw/npu2-hw-procedures.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 45 insertions(+), 7 deletions(-)
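
For readers skimming the archive, the ordering described in the numbered list above can be sketched as a small software model. This is only an illustration under stated assumptions, not the code from the patch: the struct, its fields and every model_* function are hypothetical names, no real NPU2 register is touched, and the rule that either fence turns an access into an SUE is a simplification made for the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one NPU brick. These fields and functions are
 * illustrative only; they are not skiboot's data structures or API. */
struct brick_model {
	bool brick_fenced;	/* brick fence state (step 1 sets/clears it) */
	bool cq_fenced;		/* powerbus (CQ) fence */
	bool link_up;		/* link trained and usable */
	uint64_t fir;		/* pending fault isolation bits */
};

enum access_result { ACCESS_OK, ACCESS_SUE, ACCESS_HMI };

/* Step 1: a single helper that sets or clears the brick fence. */
static void model_fence_brick(struct brick_model *b, bool set)
{
	b->brick_fenced = set;
}

/* Model assumption: while any fence is up, an access to memory behind
 * the link yields an SUE instead of an HMI. */
static enum access_result model_access(const struct brick_model *b)
{
	if (b->brick_fenced || b->cq_fenced)
		return ACCESS_SUE;
	return b->link_up ? ACCESS_OK : ACCESS_HMI;
}

/* Start of a run time reset: fence first, then the link goes away.
 * Setting the CQ fence and latching a FIR bit here are assumptions. */
static void model_reset_begin(struct brick_model *b)
{
	model_fence_brick(b, true);
	b->cq_fenced = true;
	b->link_up = false;
	b->fir |= 1;
}

/* Steps 2 and 3: clear the FIR bits, then clear the brick fence. */
static void model_release_brick_fence(struct brick_model *b)
{
	b->fir = 0;
	model_fence_brick(b, false);
}

/* Step 4: the last procedure after link training drops the CQ fence. */
static void model_credits_check(struct brick_model *b)
{
	assert(b->link_up && !b->brick_fenced);
	b->cq_fenced = false;
}
```

In this model, a caller would run model_reset_begin(), then model_release_brick_fence(), mark the link as trained, and finish with model_credits_check(); model_access() reports an SUE at every point in between and never an HMI.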

cheers. Merged to master as of bdd925aabbbbf0d35a44d85c9d51809c668be1ba
and to 5.10.x as of 4adf8e31b1108a374da7eab3fcdbf1ec7e760ef2.

I guess it's time to do a 5.10.5

-- 
Stewart Smith
OPAL Architect, IBM.


