Callout definitions in phosphor-logging : RFC
Deepak Kodihalli
dkodihal at linux.vnet.ibm.com
Thu Jan 12 20:43:28 AEDT 2017
Hello,
This sprint I've been working on defining how an openbmc error having a
"callout" should be defined. Traditionally IBM servers have this concept
of a callout, which is basically an indication of a faulty hardware
component in the server. Such a component is typically a "field
replaceable unit" or a FRU. Hence the callout is usually a FRU callout.
The callout has additional information, which helps determine how the
FRU could be serviced or replaced. The callout is included in an error log.
While what necessitates a callout is a failed hardware access operation,
for example a failed IIC write, what gets called out depends a lot on
the system topology and various policies. The callout design needs to be
generic, keeping system specific logic out of the application code.
Keeping this in mind, we need to have a set of basic design principles
on how a callout should be implemented in openbmc. I've outlined these
below. Please provide your feedback. If you don't like the term
"callout", please suggest alternatives.
1) An application running into an error and wanting to do a callout will
express the callout in terms most familiar to the application, rather
than in system terms. As an example, say an application programs the
system reference clock via an IIC interface; in case of a failed IIC
write, the application will express the callout as an IIC callout,
adding information such as the IIC bus and device address. It will not
concern itself with details such as whether the FRU in question is the
system's reference clock, or whether the clock is not a FRU on this
system and the entire system planar needs to be called out for
replacement instead. Such decisions will have to be made by the error
logging component, using system specific policy files.
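To make the split of responsibilities concrete, here is a rough Python
sketch. The policy table, the metadata key names and the FRU path are all
made up for illustration; they are not part of phosphor-logging or of any
real policy file format.

```python
# Hypothetical system policy: maps an (IIC bus, device address) pair to the
# FRU that should be called out on this particular system.
SYSTEM_POLICY = {
    # On this imaginary system the clock is not a FRU, so the planar is
    # called out instead.
    (3, 0x68): "/system/chassis/motherboard",
}


def make_iic_callout(bus, address):
    """Application side: describe the failure in IIC terms only."""
    return {"CALLOUT_IIC_BUS": bus, "CALLOUT_IIC_ADDR": address}


def resolve_fru(callout, policy):
    """Error logging side: apply system policy to pick the FRU.

    Returns None when the policy has no entry for this bus/address pair.
    """
    key = (callout["CALLOUT_IIC_BUS"], callout["CALLOUT_IIC_ADDR"])
    return policy.get(key)


callout = make_iic_callout(bus=3, address=0x68)
fru = resolve_fru(callout, SYSTEM_POLICY)
```

The application never mentions the planar or the clock; only the policy
table would change from one system to another.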
2) Based on the above, we've defined some "callout interfaces" which
applications may want to use. These interfaces are based on
phosphor-logging's openbmc error definition, so they are expressed in
YAML. A callout interface would have metadata specific to that callout.
For example, an IIC callout, such as xyz.openbmc_project.Error.Callout.IIC
(maybe the term "callout" isn't needed here, since it's just another
error?), would need the device address and the bus information. We have
callouts for IIC, GPIO and IPMI sensor to start with. This is up for
review in Gerrit [1]. The reasoning behind having a callout defined as an
openbmc error is that it lets you validate the metadata that an
application needs to add to the error log for that callout. This is in
line with the existing phosphor-logging design.
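As a rough illustration of what "validating the metadata" could mean, here
is a minimal Python sketch that checks metadata at run time. Only the IIC
interface name comes from this proposal; the GPIO and IPMI sensor names and
every field name are assumptions, and the real definitions are the YAML
files under review in [1].

```python
# Hypothetical metadata requirements per callout interface. Only the IIC
# interface name is taken from the proposal; the rest is illustrative.
CALLOUT_METADATA = {
    "xyz.openbmc_project.Error.Callout.IIC": {"BUS": int, "ADDRESS": int},
    "xyz.openbmc_project.Error.Callout.GPIO": {"GPIO_NUM": int},
    "xyz.openbmc_project.Error.Callout.IPMISensor": {"SENSOR_NUM": int},
}


def validate_metadata(callout, metadata):
    """Return a list of problems; an empty list means the metadata is valid."""
    required = CALLOUT_METADATA[callout]
    problems = []
    # Every required field must be present and have the expected type.
    for field, expected_type in required.items():
        if field not in metadata:
            problems.append("missing " + field)
        elif not isinstance(metadata[field], expected_type):
            problems.append("wrong type for " + field)
    # Fields the callout does not define are reported as well.
    for field in metadata:
        if field not in required:
            problems.append("unexpected " + field)
    return problems


IIC = "xyz.openbmc_project.Error.Callout.IIC"
ok = validate_metadata(IIC, {"BUS": 3, "ADDRESS": 0x68})
bad = validate_metadata(IIC, {"BUS": "3"})
```

Here `ok` is an empty list, while `bad` reports the wrongly typed bus and
the missing address.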
3) To actually add a callout to an error, that error needs to be defined
such that it inherits the callout event. There is an example for this in
[1]. With this, a callout will essentially be like any other openbmc
error, but there will be metadata, specific to that callout, required to
log that error. When an application logs such an error, the metadata
pertaining to the callout should be supplied by the application, and
this gets to the systemd journal just like a regular error's metadata.
The callout metadata will let the phosphor error log server (work in
progress) identify the callout, and then apply system policies to
convert it to a FRU callout.
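The inheritance idea can be sketched as follows in Python. The error name
example.Error.ClockWriteFailure, the "inherits"/"meta" keys and the metadata
field names are hypothetical; the actual example lives in [1].

```python
# Hypothetical error definitions. An error that wants a callout lists the
# callout under "inherits" and thereby picks up the callout's metadata.
ERROR_DEFS = {
    "xyz.openbmc_project.Error.Callout.IIC": {
        "inherits": [],
        "meta": ["CALLOUT_IIC_BUS", "CALLOUT_IIC_ADDR"],
    },
    "example.Error.ClockWriteFailure": {
        "inherits": ["xyz.openbmc_project.Error.Callout.IIC"],
        "meta": ["ERRNO"],
    },
}


def required_metadata(name, defs):
    """Collect inherited metadata fields first, then the error's own."""
    fields = []
    for parent in defs[name]["inherits"]:
        for field in required_metadata(parent, defs):
            if field not in fields:
                fields.append(field)
    for field in defs[name]["meta"]:
        if field not in fields:
            fields.append(field)
    return fields


needed = required_metadata("example.Error.ClockWriteFailure", ERROR_DEFS)
```

An application logging the hypothetical clock error would then have to
supply all three fields, the two callout ones included.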
4) The error log server, upon identifying a callout, will use the
callout metadata, the system's MRW and various system policies to
identify one or more FRUs that need to be called out. It will fetch
appropriate FRU inventory objects from the phosphor-inventory-manager,
and create associations between the error object and these inventory
objects, in order to represent the callouts.
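A minimal Python sketch of that last step, assuming the FRU paths have
already been worked out from the metadata, the MRW and the policies. The
inventory contents, the object paths and the shape of an association are
all invented here; the error log server is still work in progress.

```python
# Hypothetical inventory, keyed by object path.
INVENTORY = {
    "/inventory/system/chassis/motherboard": {"PrettyName": "System planar"},
    "/inventory/system/chassis/motherboard/refclock": {
        "PrettyName": "Reference clock"
    },
}


def associate_callouts(error_path, fru_paths, inventory):
    """Link an error object to each called-out FRU found in the inventory.

    FRU paths that have no inventory object are skipped.
    """
    associations = []
    for path in fru_paths:
        if path in inventory:
            associations.append({"error": error_path, "callout": path})
    return associations


links = associate_callouts(
    "/logging/entry/1",
    ["/inventory/system/chassis/motherboard", "/inventory/unknown"],
    INVENTORY,
)
```

In this sketch `links` ends up with a single association to the planar,
since the second path does not exist in the inventory.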
[1] https://gerrit.openbmc-project.xyz/#/c/1752
Thanks,
Deepak