[Skiboot] [PATCH skiboot] npu2: Increase timeout for L2/L3 cache purging

Thu Jun 27 15:47:54 AEST 2019

On 27/06/2019 15:15, Stewart Smith wrote:
> Alexey Kardashevskiy <aik at ozlabs.ru> writes:
>> On NVLink2 bridge reset, we purge all L2/L3 caches in the system.
>> This is an asynchronous operation, we have a 2ms timeout here. There are
>> reports that this is not enough and "PURGE L3 on core xxx timed out"
>> messages appear (for the reference: on the test setup this takes
>> 280us..780us).
>>
>> This defines the timeout as a macro and changes this from 2ms to 20ms.
>>
>> This adds a tracepoint to tell how long it took to purge all the caches.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>> ---
>>
>> It would be interesting to know how long it can possibly take and if it
>> depends on the actual GPU load and usage pattern.
>>
>> To enable or disable traces, "nvram" needs to run and then the host needs
>> reboot:
>>
>> - enable traces:
>> sudo nvram  -p ibm,skiboot --update-config log-level-memory=trace
>> sudo nvram  -p ibm,skiboot --update-config log-level-driver=trace
>>
>> - disable traces:
>> sudo nvram  -p ibm,skiboot --update-config log-level-memory=
>> sudo nvram  -p ibm,skiboot --update-config log-level-driver=
>> ---
>>  include/npu2-regs.h |  2 ++
>>  hw/npu2.c           | 20 +++++++++++++-------
>>  2 files changed, 15 insertions(+), 7 deletions(-)
> 
> Merged to master as of d2005818bea35e74b8991a615ac5bee389263126.
> 
> Should this also go to stable?

Only the very first flush takes more than 2ms (2.6ms is the biggest
observed).

We call the flushing sequence on all cores and, apart from that first
flush, it always completes within the timeout (200-300us). So by the time
we have finished with the last core (many * 200us later), the first one
should have finished long ago.

So apart from the nasty message, this does not seem to have any other
effect => not much need for stable?
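
To make the timing argument concrete, here is a toy model. It is not the
skiboot code and it touches no hardware: simulate_purge() and struct
purge_sim are hypothetical names, time is simulated in microseconds, and
it assumes each core's purge is started and then polled before moving on
to the next core (the real ordering in hw/npu2.c may differ).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only; simulated time in microseconds, no real hardware access. */
struct purge_sim {
	size_t timeouts;     /* waits that gave up at the deadline */
	uint64_t elapsed_us; /* simulated time after the last core */
	bool all_done;       /* every purge had finished by elapsed_us */
};

/*
 * Assumption: each core's purge is started and then polled before moving
 * on to the next core. durations_us[i] is how long core i takes to purge.
 */
static struct purge_sim simulate_purge(const uint64_t *durations_us, size_t n,
				       uint64_t timeout_us)
{
	struct purge_sim r = { 0, 0, true };
	uint64_t latest_done = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Absolute time at which this core's purge really finishes. */
		uint64_t done_at = r.elapsed_us + durations_us[i];

		if (durations_us[i] > timeout_us) {
			/* Still busy at the deadline: count it and move on. */
			r.timeouts++;
			r.elapsed_us += timeout_us;
		} else {
			r.elapsed_us = done_at;
		}
		if (done_at > latest_done)
			latest_done = done_at;
	}
	/* Had even the slowest purge finished when the loop ended? */
	r.all_done = latest_done <= r.elapsed_us;
	return r;
}
```

With durations of 2600, 250, 250 and 250us, a 2000us timeout gives one
timed-out wait but all_done is still true at the end; a 20000us timeout
gives no timed-out waits at all.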

-- 
Alexey