<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <pre>Hi Bjorn,</pre>

    <pre>Thanks for the acknowledgement.

</pre>

    <div class="moz-cite-prefix">On 1/4/2023 12:44 AM, Bjorn Helgaas

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:20230103191418.GA1011392@bhelgaas">

      <pre class="moz-quote-pre" wrap="">[+cc Paul, Sasha, Leon, Frederick]

(Please cc folks who have commented on previous versions of your

patch.)

On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">There are many instances where correctable errors tend to inundate

the message buffer. We observe such instances during thunderbolt PCIe

tunneling.

It's true that they are mitigated by the hardware and are non-fatal

but we shouldn't be spamming the logs with such correctable errors as it

confuses other kernel developers less familiar with PCI errors, support

staff, and users who happen to look at the logs, hence rate limit them.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

I want a better understanding of why we have so many errors before

rate-limiting everybody.</pre>

    </blockquote>

    <pre>--> We are debugging this inside Intel together with the Thunderbolt/PCIe team, and it

will take some time to reach a conclusion. Since I also see these errors with other Thunderbolt

devices, I am currently categorizing all the TBT devices so that we have proper data to debug.

</pre>

    <blockquote type="cite" cite="mid:20230103191418.GA1011392@bhelgaas">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">A typical example log inside an HP TBT4 dock:

[54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0

[54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)

[54912.661203] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001100/00002000

[54912.661211] igc 0000:2b:00.0:    [ 8] Rollover

[54912.661219] igc 0000:2b:00.0:    [12] Timeout

[54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0

[54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)

[54982.838808] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001000/00002000

[54982.838817] igc 0000:2b:00.0:    [12] Timeout

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Please remove the timestamps; they don't contribute to understanding

the problem.</pre>

    </blockquote>

    <pre>--> Sure. 

</pre>

    <blockquote type="cite" cite="mid:20230103191418.GA1011392@bhelgaas">

      <pre class="moz-quote-pre" wrap="">

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">This gets repeated continuously, thus inundating the buffer.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Did you verify that we actually clear the Correctable Error Status

register?</pre>

    </blockquote>

    <pre>--> This patch only rate-limits the correctable errors, since they are

non-fatal and they flood the kernel message buffer, particularly during Thunderbolt

connections. It has no impact anywhere else.

As per your suggestion on the igc patch, rate limiting looks like a workable option

for now. I have dropped all masking of the error bits.

</pre>
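    <pre>To illustrate the interval/burst behavior that rate limiting gives, here is a small Python model.
It is only a sketch, not the code in the patch: the class name is made up, and the 5-second
interval and burst of 10 are assumed values chosen to mirror the kernel's default ratelimit settings.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds; count the rest."""

    def __init__(self, interval=5.0, burst=10):
        self.interval = interval   # window length in seconds (assumed value)
        self.burst = burst         # messages allowed per window (assumed value)
        self.begin = None          # start time of the current window
        self.printed = 0           # messages let through in this window
        self.missed = 0            # messages suppressed in this window

    def allow(self, now):
        """Return True if a message arriving at time `now` may be logged."""
        if self.begin is None or now - self.begin >= self.interval:
            # A new window starts: reset both counters.
            self.begin = now
            self.printed = 0
            self.missed = 0
        if self.burst > self.printed:
            self.printed += 1
            return True
        self.missed += 1
        return False


if __name__ == "__main__":
    rl = RateLimit(interval=5.0, burst=10)
    # Simulate one corrected error every 0.1 s for 10 s (100 events).
    logged = sum(rl.allow(i / 10) for i in range(100))
    print("logged", logged, "of 100 events")
```

Every error is still counted; only the printing is throttled, so the suppressed count
can be reported when the next window opens.
</pre>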

    <blockquote type="cite" cite="mid:20230103191418.GA1011392@bhelgaas">

      <pre class="moz-quote-pre" wrap="">

<a class="moz-txt-link-freetext" href="https://bugzilla.kernel.org/show_bug.cgi?id=216863">https://bugzilla.kernel.org/show_bug.cgi?id=216863</a> looks like a

similar issue.  The issue Frederick is seeing happens when resuming

from sleep.  Is there some event that triggers the correctable errors

you see?</pre>

    </blockquote>

    <pre>--> The signatures look similar, but no single event triggers these errors.

I see them in many situations (hot plug, cold boot, warm boot, s0ix, etc.).

Further, I think the replay-related correctable errors arise in Thunderbolt PCIe devices because

the timeout values are not adjusted properly for Thunderbolt daisy chains.

I am not sure, but since these PCIe devices work fine directly on the motherboard and only give issues

when they are inside Thunderbolt devices, I suspect the additional PCIe bridges in the daisy chain

are not accounted for in the timeout values.

</pre>
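    <pre>To make the replay point concrete, here is a small Python sketch that decodes the status
words from the log quoted earlier in this mail. The bit positions follow the AER Correctable
Error Status register layout and the short names are the ones the kernel prints; the helper
name is hypothetical and exists only for this illustration.

```python
# Bit positions in the AER Correctable Error Status register, paired with
# the short names that appear in the kernel log lines.
COR_STATUS_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",      # REPLAY_NUM rollover
    12: "Timeout",      # replay timer timeout
    13: "NonFatalErr",  # advisory non-fatal error
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_cor_status(status_hex):
    """Return the (bit, name) pairs set in a hex status string (hypothetical helper)."""
    status = int(status_hex, 16)
    return [(bit, name) for bit, name in sorted(COR_STATUS_BITS.items())
            if (status >> bit) % 2 == 1]


if __name__ == "__main__":
    for word in ("00001100", "00001000", "00002000"):
        print(word, decode_cor_status(word))
```

Status 00001100 decodes to bits 8 and 12, matching the Rollover and Timeout lines in the log.
The mask 00002000 covers only bit 13 (advisory non-fatal), so neither replay error is masked.
</pre>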

    <blockquote type="cite" cite="mid:20230103191418.GA1011392@bhelgaas">

      <pre class="moz-quote-pre" wrap="">

Bjorn

</pre>

    </blockquote>

  </body>

</html>