Callout definitions in phosphor-logging : RFC

Thu Jan 12 20:43:28 AEDT 2017

Hello,

This sprint I've been working on defining how an openbmc error having a 
"callout" should be defined. Traditionally IBM servers have this concept 
of a callout, which is basically an indication of a faulty hardware 
component in the server. Such a component is typically a "field
replaceable unit", or FRU, hence the callout is usually a FRU callout.
The callout has additional information, which helps determine how the 
FRU could be serviced or replaced. The callout is included in an error log.

A callout is usually triggered by a failed hardware access operation,
for example a failed IIC write, but what gets called out depends a lot
on the system topology and various policies. The callout design needs to
be generic, keeping system-specific logic out of the application code.
Keeping this in mind, we need to have a set of basic design principles 
on how a callout should be implemented in openbmc. I've outlined these 
below. Please provide your feedback. If you don't like the term 
"callout", please suggest alternatives.

1) An application that runs into an error and wants to do a callout will
express the callout in terms most familiar to the application, rather
than in system terms. As an example, say an application programs the
system reference clock via an IIC interface; in case of a failed IIC
write, the application will express the callout as an IIC callout,
adding information such as the IIC bus and device address. It will not
concern itself with details such as whether the FRU in question is the
system's reference clock, or whether the clock is not a FRU on this
system at all and the entire system planar needs to be called out for
replacement. Such decisions will have to be made by the error logging
component, via system-specific policy files.

2) Based on the above, we've defined some "callout interfaces" which
applications may want to use. These interfaces are based on
phosphor-logging's openbmc error definition, so they are expressed in
YAML. A callout interface has metadata specific to that callout. For
example, an IIC callout, such as xyz.openbmc_project.Error.Callout.IIC
(maybe the term "callout" isn't needed here, since it's just another
error?), would need the device address and the bus information. We have
callouts for IIC, GPIO and IPMI sensor to start with. This is up for
review in Gerrit [1]. The reasoning behind defining a callout as an
openbmc error is that it lets you validate the metadata that an
application needs to add to the error log for that callout. This is in
line with the existing phosphor-logging design.
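
To illustrate the validation idea in (2), here is a rough sketch in
Python. It is not the phosphor-logging implementation (the real
definitions are YAML). The metadata key names CALLOUT_IIC_BUS and
CALLOUT_IIC_ADDR and the validate_metadata() helper are made up for
this example and are not taken from the change in [1].

```python
# Hypothetical in-memory form of a callout definition. Only the error
# name comes from the proposal; the metadata keys are assumptions.
CALLOUT_DEFS = {
    "xyz.openbmc_project.Error.Callout.IIC": {
        "CALLOUT_IIC_BUS": int,
        "CALLOUT_IIC_ADDR": int,
    },
}


def validate_metadata(callout_name, metadata):
    """Return a list of problems; an empty list means the metadata is valid."""
    required = CALLOUT_DEFS[callout_name]
    problems = []
    # Every key declared by the callout must be present with the right type.
    for key, expected_type in required.items():
        if key not in metadata:
            problems.append(f"missing {key}")
        elif not isinstance(metadata[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    # Keys the callout does not declare are reported as well.
    for key in metadata:
        if key not in required:
            problems.append(f"unexpected {key}")
    return problems
```

An application supplying both the bus and the address would pass the
check, while one that leaves out the address would be told which key is
missing.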

3) To actually add a callout to an error, that error needs to be defined 
such that it inherits the callout event. There is an example for this in 
[1]. With this, a callout will essentially be like any other openbmc 
error, but there will be metadata, specific to that callout, required to 
log that error. When an application logs such an error, the metadata 
pertaining to the callout should be supplied by the application, and 
this gets to the systemd journal just like a regular error's metadata. 
The callout metadata will enable the phosphor error log server (work in
progress) to identify the callout, and then apply system policies to
convert it to a FRU callout.
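
The inheritance described in (3) could be pictured roughly as below.
This is only a sketch in Python under my own assumptions: the
"inherits" and "meta" field names, the metadata keys, and the
example.Clock.Error.ProgramFailure error are all hypothetical and are
not the layout used in [1].

```python
# Hypothetical error definitions. An error that inherits a callout also
# takes on the metadata the callout requires.
ERROR_DEFS = {
    "xyz.openbmc_project.Error.Callout.IIC": {
        "meta": ["CALLOUT_IIC_BUS", "CALLOUT_IIC_ADDR"],
        "inherits": [],
    },
    "example.Clock.Error.ProgramFailure": {
        "meta": ["CLOCK_ID"],
        "inherits": ["xyz.openbmc_project.Error.Callout.IIC"],
    },
}


def required_metadata(name, defs):
    """Collect metadata keys for an error, inherited keys first."""
    entry = defs[name]
    keys = []
    for parent in entry["inherits"]:
        keys.extend(required_metadata(parent, defs))
    keys.extend(entry["meta"])
    return keys
```

With these definitions, logging the clock error would require the two
IIC callout keys in addition to the error's own CLOCK_ID.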

4) The error log server, upon identifying a callout, will use the
callout metadata, the system's MRW and various system policies to
identify one or more FRUs that need to be called out. It will fetch
appropriate FRU inventory objects from the phosphor-inventory-manager, 
and create associations between the error object and these inventory 
objects, in order to represent the callouts.
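
As a rough picture of (4), the server-side step might look something
like the Python sketch below. The policy table, the inventory and error
object paths, and the "callout" association name are all assumptions
for illustration; the real server is still work in progress and would
derive this from the MRW and policy files.

```python
# Hypothetical system policy: (callout type, bus, address) -> FRU
# inventory paths. Here the clock is not a FRU, so the planar is used.
POLICY = {
    ("IIC", 3, 0x68): ["/system/chassis/motherboard"],
}


def resolve_frus(callout_type, bus, address, policy):
    """Map application-level callout metadata to FRU inventory paths."""
    return policy.get((callout_type, bus, address), [])


def make_associations(error_path, fru_paths):
    """Represent each callout as an (error, relation, FRU) association."""
    return [(error_path, "callout", fru) for fru in fru_paths]
```

The application only ever supplies the bus and address; the mapping to
the planar lives entirely in the policy table, which is the separation
the design is after.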

[1] https://gerrit.openbmc-project.xyz/#/c/1752

Thanks,
Deepak