[Skiboot] [PATCH skiboot] npu2: Increase timeout for L2/L3 cache purging

Alistair Popple alistair at popple.id.au
Thu Jun 27 17:09:08 AEST 2019


On Thursday, 27 June 2019 3:47:54 PM AEST Alexey Kardashevskiy wrote:
> On 27/06/2019 15:15, Stewart Smith wrote:
> > Alexey Kardashevskiy <aik at ozlabs.ru> writes:
> >> On NVLink2 bridge reset, we purge all L2/L3 caches in the system.
> >> This is an asynchronous operation, and we have a 2ms timeout here. There
> >> are reports that this is not enough and "PURGE L3 on core xxx timed out"
> >> messages appear (for reference: on the test setup this takes
> >> 280us..780us).
> >> 
> >> This defines the timeout as a macro and changes this from 2ms to 20ms.
> >> 
> >> This adds a tracepoint to tell how long it took to purge all the caches.
> >> 
> >> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> >> ---
> >> 
> >> It would be interesting to know how long it can possibly take and if it
> >> depends on the actual GPU load and usage pattern.
> >> 
> >> To enable or disable traces, "nvram" needs to be run and then the host
> >> needs a reboot:
> >> 
> >> - enable traces:
> >> sudo nvram  -p ibm,skiboot --update-config log-level-memory=trace
> >> sudo nvram  -p ibm,skiboot --update-config log-level-driver=trace
> >> 
> >> - disable traces:
> >> sudo nvram  -p ibm,skiboot --update-config log-level-memory=
> >> sudo nvram  -p ibm,skiboot --update-config log-level-driver=
> >> ---
> >> 
> >>  include/npu2-regs.h |  2 ++
> >>  hw/npu2.c           | 20 +++++++++++++-------
> >>  2 files changed, 15 insertions(+), 7 deletions(-)
> > 
> > Merged to master as of d2005818bea35e74b8991a615ac5bee389263126.
> > 
> > Should this also go to stable?
> 
> Only the very first flush takes more than 2ms (2.6ms is the biggest
> observed).
> 
> We call the flushing sequence on all cores and it always completes
> within the timeout (200-300us). So by the time we finished with the last
> core (and we have many * 200us), the first one should have finished long
> ago.
> 
> So apart from the nasty message, this does not seem to have any other effect
> => not much need for stable?

On the other hand, is there any reason not to put it in stable? In theory you
could get unlucky and have the link torn down during a write-back to the GPU,
which would result in an HMI/checkstop, so my vote would be for putting the
timeout increase in stable.
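
To make the timing argument concrete, here is a rough, self-contained
simulation of the "start the purge on every core, then wait on each core with
a timeout" pattern discussed above. It is only a sketch and not the code in
hw/npu2.c: struct sim_core, purge_all(), POLL_STEP_US and the simulated clock
are all hypothetical names, and it assumes that purges are started on all
cores before any of them is waited on. Only the 2ms and 20ms timeout values
and the 2.6ms worst case come from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PURGE_TIMEOUT_US 20000 /* 20ms, the increased timeout from the patch */
#define POLL_STEP_US     10    /* hypothetical polling interval */

/* Hypothetical stand-in for one core's purge engine. */
struct sim_core {
	uint64_t done_at_us; /* simulated time at which the purge completes */
};

static bool purge_done(const struct sim_core *c, uint64_t now_us)
{
	return now_us >= c->done_at_us;
}

/*
 * Start a purge on every core, then wait for each one in turn.
 * duration_us[i] is how long core i's purge takes, issue_cost_us is the
 * simulated cost of kicking off one purge. Returns the number of cores
 * whose wait exceeded timeout_us.
 */
int purge_all(struct sim_core *cores, int ncores, const uint64_t *duration_us,
	      uint64_t issue_cost_us, uint64_t timeout_us)
{
	uint64_t now_us = 0; /* simulated clock, starts at zero on each call */
	int timed_out = 0;
	int i;

	assert(cores && duration_us);

	/* Phase 1: kick off the asynchronous purge on all cores. */
	for (i = 0; i < ncores; i++) {
		cores[i].done_at_us = now_us + duration_us[i];
		now_us += issue_cost_us;
	}

	/* Phase 2: poll each core until it is done or the timeout expires. */
	for (i = 0; i < ncores; i++) {
		uint64_t waited_us = 0;

		while (!purge_done(&cores[i], now_us) && waited_us < timeout_us) {
			now_us += POLL_STEP_US;
			waited_us += POLL_STEP_US;
		}
		if (!purge_done(&cores[i], now_us))
			timed_out++; /* this is where the "timed out" message would be logged */
	}
	return timed_out;
}
```

In this model a 2.6ms first purge trips a 2ms timeout only when the wait
starts immediately; once kicking off the other cores takes longer than that,
or with the 20ms timeout, no core times out.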

- Alistair


