[Skiboot] [PATCH skiboot] npu2: Increase timeout for L2/L3 cache purging

Thu Jun 27 18:31:08 AEST 2019

Tested-by: Srikanth Aithal <sraithal at linux.vnet.ibm.com>
Reported-by: Srikanth Aithal <sraithal at linux.vnet.ibm.com>

On 6/27/19 12:39 PM, Alistair Popple wrote:
> On Thursday, 27 June 2019 3:47:54 PM AEST Alexey Kardashevskiy wrote:
>> On 27/06/2019 15:15, Stewart Smith wrote:
>>> Alexey Kardashevskiy <aik at ozlabs.ru> writes:
>>>> On NVLink2 bridge reset, we purge all L2/L3 caches in the system.
>>>> This is an asynchronous operation with a 2ms timeout. There are
>>>> reports that this is not enough and "PURGE L3 on core xxx timed out"
>>>> messages appear (for reference: on the test setup this takes
>>>> 280us..780us).
>>>>
>>>> This defines the timeout as a macro and changes this from 2ms to 20ms.
>>>>
>>>> This adds a tracepoint to tell how long it took to purge all the caches.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>>> ---
>>>>
>>>> It would be interesting to know how long it can possibly take and if it
>>>> depends on the actual GPU load and usage pattern.
>>>>
>>>> To enable or disable traces, run "nvram" and then reboot the
>>>> host:
>>>>
>>>> - enable traces:
>>>> sudo nvram  -p ibm,skiboot --update-config log-level-memory=trace
>>>> sudo nvram  -p ibm,skiboot --update-config log-level-driver=trace
>>>>
>>>> - disable traces:
>>>> sudo nvram  -p ibm,skiboot --update-config log-level-memory=
>>>> sudo nvram  -p ibm,skiboot --update-config log-level-driver=
>>>> ---
>>>>
>>>>   include/npu2-regs.h |  2 ++
>>>>   hw/npu2.c           | 20 +++++++++++++-------
>>>>   2 files changed, 15 insertions(+), 7 deletions(-)
>>> Merged to master as of d2005818bea35e74b8991a615ac5bee389263126.
>>>
>>> Should this also go to stable?
>> Only the very first flush takes more than 2ms (2.6ms is the biggest
>> observed).
>>
>> We call the flushing sequence on all cores and it always completes
>> within the timeout (200-300us). So by the time we have finished with the
>> last core (many cores * 200us each), the first one should have finished
>> long ago.
>>
>> So apart from the nasty message, this does not seem to have any other
>> effect => not much need for stable?
> On the other hand, is there any reason not to put it in stable? In theory you
> could get unlucky and have the link torn down during a write back to the GPU,
> which would result in an HMI/checkstop, so my vote would be for putting the
> timeout increase in stable.
>
> - Alistair
>
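
For anyone following the thread, here is a minimal sketch of the pattern
being discussed: poll for purge completion against a timeout defined as a
macro, and report how long the purge took. It is not the code from hw/npu2.c.
Every name in it (PURGE_TIMEOUT_US, PURGE_POLL_STEP_US, struct fake_core,
purge_done, wait_for_purge) is hypothetical, the 10us polling step is an
assumption, and the hardware is replaced by a simulated core with a fake
clock so the logic can be checked on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; not the identifiers used in hw/npu2.c. */
#define PURGE_TIMEOUT_US   20000 /* 20ms, up from the 2ms discussed above */
#define PURGE_POLL_STEP_US 10    /* assumed polling interval */

/* Simulated core: its purge is done once done_at_us of fake time has passed. */
struct fake_core {
	uint64_t done_at_us;
};

static bool purge_done(const struct fake_core *c, uint64_t now_us)
{
	return now_us >= c->done_at_us;
}

/*
 * Poll until the purge completes or the timeout expires.
 * Returns the elapsed time in microseconds, or -1 on timeout.
 */
static int64_t wait_for_purge(const struct fake_core *c, uint64_t timeout_us)
{
	uint64_t elapsed = 0;

	assert(c);
	while (!purge_done(c, elapsed)) {
		if (elapsed >= timeout_us)
			return -1; /* would log "PURGE ... timed out" here */
		elapsed += PURGE_POLL_STEP_US;
	}
	return (int64_t)elapsed; /* value a tracepoint could report */
}
```

With a simulated purge that needs 2600us (the largest figure mentioned
above), wait_for_purge() returns -1 for a 2000us timeout and 2600 for
PURGE_TIMEOUT_US, which is the behaviour the timeout increase is meant to
achieve.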