[Skiboot] [PATCH] npu2/hw-procedures: Remove assertion from check_credits()
oohall at gmail.com
Wed Nov 20 11:25:49 AEDT 2019
On Fri, Nov 15, 2019 at 3:13 AM Reza Arbab <arbab at linux.ibm.com> wrote:
> The RX clock mux in the NVLink PHY can glitch, which will manifest in
> hard to diagnose behavior--at best, a checkstop during the first link
> traffic. The only reliable way we found to detect this was by checking
> for a discrepancy in the credits we expect to receive during link
> training.
>
> Since the time the check was added, we've found that
> * Commit ac6f1599ff33 ("npu2: hw-procedures: Add phy_rx_clock_sel()")
> does work around the original glitch.
> * Asserting is too harsh. Before root cause was established, it was
> thought this could have been a manufacturing defect and we wanted to
> loudly fail hardware acceptance boot cycle tests.
> * It seems there is a valid situation in which credits are off from
> the expected value. During GPU hot reset, a CPU prefetch across the link
> can affect the credit count before we check.
> Given all of the above, remove the assert().
> Cc: stable # 6.0.x
> Signed-off-by: Reza Arbab <arbab at linux.ibm.com>
Merged as 24664b48642845d620e225111bf6184f3c102f60
Vasant, can you pick this up? I don't think you're monitoring
stable at arbab-laptop.localdomain, but maybe you are ;)
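
For anyone reading the archive without the diff to hand, the reasoning in the quoted commit message can be sketched roughly as follows. This is not the skiboot code. The struct, the function name and the register names are hypothetical, made up only to illustrate "report a credit mismatch instead of asserting on it":

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for one credit register: a label, the value
 * expected after link training, and the value actually read back.
 * None of these names come from skiboot.
 */
struct credit_reg {
	const char *name;
	uint64_t expected;
	uint64_t actual;
};

/*
 * Count the registers whose credits differ from the expected value.
 * Each mismatch is logged to stderr, but nothing here asserts, so a
 * transient discrepancy (for example one caused by a prefetch during
 * GPU hot reset) does not bring the whole boot down.
 */
static unsigned int count_credit_mismatches(const struct credit_reg *regs,
					    size_t n)
{
	unsigned int bad = 0;

	for (size_t i = 0; i < n; i++) {
		if (regs[i].actual == regs[i].expected)
			continue;

		fprintf(stderr, "%s: expected 0x%" PRIx64 ", read 0x%" PRIx64 "\n",
			regs[i].name, regs[i].expected, regs[i].actual);
		bad++;
	}

	return bad;
}
```

With the assert() gone, a caller of something like this could treat a nonzero return as a warning and continue, rather than halting.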