[Skiboot] [PATCH] npu2/hw-procedures: Remove assertion from check_credits()

Oliver O'Halloran oohall at gmail.com
Wed Nov 20 11:25:49 AEDT 2019


On Fri, Nov 15, 2019 at 3:13 AM Reza Arbab <arbab at linux.ibm.com> wrote:
>
> The RX clock mux in the NVLink PHY can glitch, which will manifest as
> hard-to-diagnose behavior: at best, a checkstop during the first link
> traffic. The only reliable way we found to detect this was by checking
> for a discrepancy in the credits we expect to receive during link
> training.
>
> Since the time the check was added, we've found that
>
> * Commit ac6f1599ff33 ("npu2: hw-procedures: Add phy_rx_clock_sel()")
>   does work around the original glitch.
>
> * Asserting is too harsh. Before the root cause was established, it was
>   thought this could have been a manufacturing defect and we wanted to
>   loudly fail hardware acceptance boot cycle tests.
>
> * It seems there is a valid situation in which credits are off from
>   the expected value. During GPU hot reset, a CPU prefetch across the
>   link can affect the credit count before we check.
>
> Given all of the above, remove the assert().
>
> Cc: stable # 6.0.x
> Signed-off-by: Reza Arbab <arbab at linux.ibm.com>

Merged as 24664b48642845d620e225111bf6184f3c102f60

Vasant, can you pick this up? I don't think you're monitoring
stable at arbab-laptop.localdomain, but maybe you are ;)
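
For anyone reading this in the archive, the shape of the change described
in the commit message can be sketched roughly as below. This is only an
illustration, not the skiboot code: EXPECTED_CREDITS, credits_ok_strict()
and credits_ok_lenient() are hypothetical names, and the expected value is
a placeholder rather than the real hardware value. The strict variant
aborts on a mismatch, the lenient one reports it and lets the caller
carry on.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder only; the real expected credit value is hardware specific. */
#define EXPECTED_CREDITS UINT64_C(0x10)

/* Sketch of the old behaviour: any mismatch aborts via assert(). */
bool credits_ok_strict(uint64_t observed)
{
	assert(observed == EXPECTED_CREDITS);
	return true;
}

/*
 * Sketch of the new behaviour: a mismatch is logged and reported to the
 * caller, since it can legitimately happen (e.g. after a GPU hot reset).
 */
bool credits_ok_lenient(uint64_t observed)
{
	if (observed == EXPECTED_CREDITS)
		return true;

	fprintf(stderr,
		"credits: expected 0x%016" PRIx64 ", got 0x%016" PRIx64 "\n",
		EXPECTED_CREDITS, observed);
	return false;
}
```

With the lenient form, a caller can decide whether a mismatch matters in
its context instead of taking the whole boot down.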

