[Skiboot] [RFC PATCH 0/6] opal/hmi: Threshold-ing for timebase facility errors
mahesh at linux.vnet.ibm.com
Wed Apr 24 04:12:38 AEST 2019
This is an initial set of patches that adds threshold-ing infrastructure for
timebase facility errors reported through HMI.
OPAL is capable of recovering from timebase facility errors. But as of
today OPAL does not do any kind of threshold-ing for these recovered errors.
Certain parity errors may be reported repeatedly. If many of them occur
within a given time limit, that is a strong indication of a faulty
chip/core. In such cases it is better to raise an alarm about the faulty
chip/core so that it can be replaced.
This patch series introduces threshold-ing limits and generates an errorlog
(eSEL) to notify about the faulty chip/core.
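
To illustrate the idea, here is a minimal sketch of a per-core error counter
with a fixed time window. It is not taken from the patches: the names
err_counter, err_counter_record, ERR_THRESHOLD and ERR_WINDOW_SECS, as well
as the limit values, are hypothetical, and the series may count and expire
errors differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits, chosen only for illustration. */
#define ERR_THRESHOLD   3   /* recovered errors tolerated per window */
#define ERR_WINDOW_SECS 60  /* window length in seconds */

/* Hypothetical per-core (or per-chip) counter state. */
struct err_counter {
	uint64_t window_start;  /* time at which the current window opened */
	uint32_t count;         /* recovered errors seen in this window */
};

/*
 * Record one recovered error at time 'now' (in seconds).
 * Returns true once the number of errors in the current window
 * exceeds ERR_THRESHOLD, i.e. when an alarm should be raised.
 */
bool err_counter_record(struct err_counter *c, uint64_t now)
{
	assert(c);

	/* Open a new window on the first error or once the old one expired. */
	if (c->count == 0 || now - c->window_start >= ERR_WINDOW_SECS) {
		c->window_start = now;
		c->count = 0;
	}

	c->count++;
	return c->count > ERR_THRESHOLD;
}
```

With these example limits, a caller would see false for the first three
errors recorded inside one 60 second window and true for the fourth. An
error arriving after the window has expired starts a fresh count.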
- Add counters for chip level errors.
- Send gard-able event to HBRT to create a gard record for faulty core/chip.
- Need to figure out a way to inform Linux host to offline faulty core.
Mahesh Salgaonkar (6):
opal/hmi: Introduce core and thread level error counters
opal/hmi: Introduce thresholding of HMI errors.
opal/errorlog: Allow generation of Serviceable attention events
opal/errorlog: Add support to include callout section in error log.
opal: Get chip part-number and serial-number.
opal/hmi: Send an error callout on threshold.
core/chip.c | 13 ++++
core/errorlog.c | 37 ++++++++++++
core/hmi.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++++
core/pel.c | 80 ++++++++++++++++++++++++++
hw/chiptod.c | 28 +++++++--
include/chip.h | 28 +++++++++
include/cpu.h | 2 +
include/errorlog.h | 12 ++++
include/hmi.h | 83 +++++++++++++++++++++++++++
include/pel.h | 69 ++++++++++++++++++++++
10 files changed, 506 insertions(+), 6 deletions(-)
create mode 100644 include/hmi.h