[PATCH v6 00/16] Rate limit AER logs
Krzysztof Wilczyński
kw at linux.com
Tue May 20 19:05:11 AEST 2025
Hello,
> This work is mostly due to Jon Pan-Doh and Karolina Stolarek. I rebased
> this to v6.15-rc1, factored out some of the trace and statistics updates,
> and added some minor cleanups.
>
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits for
> more robust error logging. Allow userspace to configure ratelimits via
> sysfs knobs.
>
> Motivation
> ==========
>
> Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
> of devices), affects repairability flows for fleet operators.
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services will allow for more
> predictable repair flows and decrease machine downtime.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4], [5]).
> The most recent attempt[5] has many similarities with the proposed
> approach.
I have been testing this series locally with and without faults triggered
using the AER error injection facility. No issues thus far.

And, as such...

Tested-by: Krzysztof Wilczyński <kwilczynski at kernel.org>

Thank you!

Krzysztof