[Skiboot] [RFC PATCH 2/6] opal/hmi: Introduce thresholding of HMI errors.
stewart at linux.ibm.com
Tue Jun 4 14:14:05 AEST 2019
Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> Define two threshold levels similar to what pHyp uses.
> Level 1) 100 errors in 100 msec.
> If we get same hmi error 100 times in 100 msec then we definitely
> have a BAD chip/core.
> Level 2) 32 errors on 24 hour time window.
> If we get same hmi error 32 times in 24 hour time window then also
> we can consider that we have a BAD chip/core.
> In either of above cases when threshold is reached log an eSEL pointing out
> a BAD chip. Possibly also send an event to Linux host to hotplug out
> respective core and mark a gard record for same.
> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
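
For anyone skimming the thread, the two levels quoted above can be sketched as a pair of counters over fixed time windows. This is only an illustration and not code from the patch: every name below (hmi_window, hmi_threshold, hmi_threshold_record and so on) is made up, timestamps are assumed to be in milliseconds, and a simple window that restarts once it expires stands in for whatever windowing the patch actually implements.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical; none are taken from the patch or skiboot. */
struct hmi_window {
	uint64_t span_ms;   /* window length in milliseconds */
	uint32_t limit;     /* errors within span_ms that trip the threshold */
	uint64_t start_ms;  /* timestamp of the first error in the current window */
	uint32_t count;     /* errors seen in the current window */
};

struct hmi_threshold {
	struct hmi_window burst;  /* level 1: 100 errors in 100 msec */
	struct hmi_window daily;  /* level 2: 32 errors in 24 hours */
};

static void hmi_threshold_init(struct hmi_threshold *t)
{
	t->burst = (struct hmi_window){ .span_ms = 100, .limit = 100 };
	t->daily = (struct hmi_window){ .span_ms = 24ULL * 60 * 60 * 1000,
					.limit = 32 };
}

/* Count one error against a single window; true once the limit is reached. */
static bool hmi_window_hit(struct hmi_window *w, uint64_t now_ms)
{
	/* Start a fresh window on the first error or once the old one expires. */
	if (w->count == 0 || now_ms - w->start_ms >= w->span_ms) {
		w->start_ms = now_ms;
		w->count = 0;
	}
	w->count++;
	return w->count >= w->limit;
}

/* Record one occurrence of an error; true if either level is reached. */
static bool hmi_threshold_record(struct hmi_threshold *t, uint64_t now_ms)
{
	/* Evaluate both windows so neither counter misses the event. */
	bool burst = hmi_window_hit(&t->burst, now_ms);
	bool daily = hmi_window_hit(&t->daily, now_ms);

	return burst || daily;
}
```

A caller would keep one hmi_threshold per error type per chip/core, call hmi_threshold_record() each time that error is seen, and log the eSEL when it returns true. The count fields are the sort of numbers that could be exported.
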
A thought I had while reading through the patch: I feel we should expose
these counters / thresholds either through sensors or the IMC interface,
as even if the threshold isn't being hit, they could be useful numbers to
gather over a large cluster and might be useful for some kind of
OPAL Architect, IBM.