anyone interested in chip register error diagnostics?

zshelle zshelle at linux.vnet.ibm.com
Tue Mar 5 09:44:36 AEDT 2019


On 2019-03-04 14:56, Supreeth Venkatesh wrote:
> Thanks Brad.
> 
> Hi Zane/Brad,
> 
> On Arm Platforms, We use Common Platform Error Record (CPER) to report
> these kinds of hardware errors.
> The format of the errors is defined in Appendix N of the UEFI
> specification:
> http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf
> 
> I have not read the proposal in its entirety, but this seems similar
> to Reliability, Availability, Serviceability (RAS) feature using
> System Management Mode/Management Mode, but on the BMC side.
> 
> I will take a look at the reviews posted and provide more feedback.
> 
> If this is something similar to RAS feature, I have in fact proposed
> in DMTF PMCI WG to include CPER formats to be added to one of
> the PLDM specifications.
> 
> Arm would be interested in the design of this component, if it can
> accommodate the above error formats and component can be designed in
> an architecture agnostic way.
> 
> Thanks,
> Supreeth
> 
> -----Original Message-----
> From: Brad Bishop <bradleyb at fuzziesquirrel.com>
> Sent: Monday, March 4, 2019 2:38 PM
> To: zshelle <zshelle at linux.vnet.ibm.com>; Supreeth Venkatesh
> <Supreeth.Venkatesh at arm.com>; ed.tanous at intel.com
> Subject: Re: anyone interested in chip register error diagnostics?
> 
> On Mon, Mar 04, 2019 at 02:22:45PM -0600, zshelle wrote:
>> On POWER, I work on a component that listens for hardware errors
>> reported by registers in the system chips (processors, memory buffers,
>> I/O chips, etc.) and performs service actions based on those errors. I
>> have been working on porting some of this code to the BMC for system
>> fatal error analysis (see my work-in-progress proposals:
>> https://gerrit.openbmc-project.xyz/#/c/openbmc/docs/+/18591/ and
>> https://gerrit.openbmc-project.xyz/#/c/openbmc/docs/+/18831/). As part
>> of the new design, we are building a generic, data-driven register
>> error isolator, which will be used by several applications within
>> POWER. However, it has the potential to be useful on other
>> architectures as well. I am curious if anyone in the community is
>> interested in this.
> 
> Thanks Zane - I'll tag Ed(x86) and Supreeth(arm) on this one.  Ed,
> Supreeth - do you understand the function being proposed here?  How
> does this work on x86 and arm servers?
> 
> thx - brad

It looks like CPER is similar to IBM's Platform Error Log (PEL). I am not 
really focused on the log format at the moment, but it is on my list of 
things to investigate. I heard a rumor that a standard "OpenBMC" logging 
mechanism may be in the works. If so, entries from that mechanism could be 
converted into the CPER or PEL format as needed.

My proposals are more focused on reading the registers from hardware and 
determining what caused the error. On POWER, we have hundreds of what we 
call fault isolation registers (FIRs), where each bit within those 
registers can signify a different hardware error event. Several bits may 
also be active at the same time. So my component will sort through all 
of those registers, find the active bits, and determine which error is the 
root cause of the failure and which are side effects. Once the root cause 
is determined, we then perform any service actions defined by our RAS team 
(and commit a log). Is there anything like this on Arm?
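
To make the idea a bit more concrete, here is a rough Python sketch of a 
data-driven isolator. It is only an illustration, not the actual POWER 
code: the register names, addresses, bit meanings, priorities, and the 
read_register callback are all made up, and "lowest priority number wins" 
is just one simple way to separate root cause from side effects.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BitRule:
    bit: int          # bit position, counted from the least significant bit (assumption)
    description: str  # what the bit means
    priority: int     # hypothetical ranking: lower number = more likely root cause
    action: str       # service action to take if this bit is the root cause

@dataclass(frozen=True)
class FirRule:
    name: str
    address: int
    bits: tuple

# Hypothetical rule table. In a data-driven design this would be loaded
# from a per-chip data file rather than written in code.
RULES = (
    FirRule("CORE_FIR", 0x1000, (
        BitRule(0, "core checkstop", 1, "replace processor"),
        BitRule(3, "recoverable parity error", 5, "log only"),
    )),
    FirRule("MEM_FIR", 0x2000, (
        BitRule(1, "uncorrectable memory error", 2, "replace DIMM"),
        BitRule(4, "channel timeout", 4, "log only"),
    )),
)

def active_errors(read_register, rules=RULES):
    """Read every register in the table and collect the rules whose bit is set."""
    found = []
    for fir in rules:
        value = read_register(fir.address)
        for rule in fir.bits:
            if value & (1 << rule.bit):
                found.append((fir.name, rule))
    return found

def isolate(read_register, rules=RULES):
    """Return (root_cause, side_effects), ordered by priority."""
    errors = sorted(active_errors(read_register, rules),
                    key=lambda entry: entry[1].priority)
    if not errors:
        return None, []
    return errors[0], errors[1:]

# Simulated hardware: one bit on in CORE_FIR, two bits on in MEM_FIR.
fake_hw = {0x1000: 0b01000, 0x2000: 0b10010}
root, side_effects = isolate(lambda addr: fake_hw.get(addr, 0))
print("root cause:", root[0], "-", root[1].description, "->", root[1].action)
for name, rule in side_effects:
    print("side effect:", name, "-", rule.description)
```

Since the rules are plain data, supporting another chip or architecture 
would mostly mean supplying a different table and a different 
read_register function.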


