[Skiboot] [RFC PATCH 2/6] opal/hmi: Introduce thresholding of HMI errors.

Tue Jun 4 14:14:05 AEST 2019

Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> Define two threshold levels similar to what pHyp uses.
> Level 1) 100 errors in 100 msec.
>      If we get same hmi error 100 times in 100 msec then we definitely
>      have a BAD chip/core.
> Level 2) 32 errors on 24 hour time window.
>      If we get same hmi error 32 times in 24 hour time window then also
>      we can consider that we have a BAD chip/core.
>
> In either of above cases when threshold is reached log an eSEL pointing out
> a BAD chip. Possibly also send an event to Linux host to hotplug out
> respective core and mark a gard record for same.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>

A thought I had while reading through the patch: I feel we should expose
these counters / thresholds either through sensors or the IMC interface.
Even if the threshold isn't being hit, they could be useful numbers to
gather over a large cluster and might be useful for some kind of
predictive maintenance.
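
To make the two levels concrete, here is a rough sketch of how I read
them. This is not the code from the patch: every name below
(hmi_threshold, hmi_error_tracker, and so on) is hypothetical, timestamps
are assumed to be monotonic milliseconds, and treating the window bound
as inclusive is my assumption. Each level keeps the timestamps of its
last `limit` errors and trips when the oldest of those is still inside
the window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HMI_MAX_EVENTS 100	/* largest per-level error count tracked */

/* One threshold level: `limit` errors within `window_ms` milliseconds. */
struct hmi_threshold {
	uint32_t limit;
	uint64_t window_ms;
	uint64_t stamps[HMI_MAX_EVENTS];	/* ring of the last `limit` timestamps */
	uint32_t next;				/* slot to be written next */
	uint32_t seen;				/* errors recorded, saturating at `limit` */
};

static void hmi_threshold_init(struct hmi_threshold *t, uint32_t limit,
			       uint64_t window_ms)
{
	assert(limit >= 1 && limit <= HMI_MAX_EVENTS);
	/* Compound literal zeroes the ring, `next` and `seen`. */
	*t = (struct hmi_threshold){ .limit = limit, .window_ms = window_ms };
}

/* Record one error at `now_ms`; returns true if this level has tripped. */
static bool hmi_threshold_record(struct hmi_threshold *t, uint64_t now_ms)
{
	t->stamps[t->next] = now_ms;
	t->next = (t->next + 1) % t->limit;
	if (t->seen < t->limit)
		t->seen++;
	if (t->seen < t->limit)
		return false;	/* not enough errors yet */

	/* Ring is full, so slot `next` holds the oldest of the last `limit`. */
	return now_ms - t->stamps[t->next] <= t->window_ms;
}

/* Both levels from the patch description, for one error type on one core. */
struct hmi_error_tracker {
	struct hmi_threshold burst;	/* level 1: 100 errors in 100 msec */
	struct hmi_threshold daily;	/* level 2: 32 errors in 24 hours */
};

static void hmi_tracker_init(struct hmi_error_tracker *tr)
{
	hmi_threshold_init(&tr->burst, 100, 100);
	hmi_threshold_init(&tr->daily, 32, 24ULL * 60 * 60 * 1000);
}

/* Returns true if either level says the chip/core should be treated as bad. */
static bool hmi_tracker_record(struct hmi_error_tracker *tr, uint64_t now_ms)
{
	/* Record in both levels before combining, so neither misses an error. */
	bool burst = hmi_threshold_record(&tr->burst, now_ms);
	bool daily = hmi_threshold_record(&tr->daily, now_ms);

	return burst || daily;
}
```

The `seen` field and the stored timestamps are the sort of numbers that
could be exported even when neither level has tripped.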

-- 
Stewart Smith
OPAL Architect, IBM.