Logging utilities.

Patrick Williams patrick at stwcx.xyz
Sat Oct 29 00:28:59 AEDT 2016


On Mon, Oct 24, 2016 at 05:43:22PM -0700, Brendan Higgins wrote:
> Trying to revive this thread: we never concluded discussion on this. For the
> convenience of the reader I have quoted the previous email below:
> 
> > Thank you for the background.  That explains a lot about the path Adriana
> > has gone down.  I completely understand wanting to separate policy from
> > reporting of device and process errors.  Where we differ is that I don't
> > see that as the purpose of logging.  You've described an event-based
> > telemetry system meant to be primarily consumed by other software to drive
> > alerts and automated responses.  I view logging as being primarily for the
> > benefit of a programmer or operator trying to understand what section of
> > _code_ went wrong.  That distinction is part of why glog isn't designed to
> > be super high-performance.  Google uses entirely separate systems for
> > telemetry.  In my and Google's experience, scraping logs to extract
> > information (even when the messages are well structured) winds up being
> > painful to maintain and unreliable over time even when using journald.
> 
> We have a patch set out that adds glog support at:
> https://gerrit.openbmc-project.xyz/#/c/788/
> 
> In addition, we have a patch for glog that adds support for logging to journald,
> but I would like to get some feedback before I do anything else.

The IBM team and Google team had a in-person discussion on this email
thread yesterday (10/27).  I'm going to try to recap some of the points
there.

There ends up being three different aspects of "logging" that we were
discussing and in some cases using the same words for the different
aspects.  Those aspects are:

   1.  Program level information to assist in debugging.  This is
   sometimes called "logging", "tracing" or "flight recording".

   2.  Creating a record of something wrong with the system that may
   require a service action.  IBM has traditionally used the word "error
   logging" to refer to the action and "serviceable event" to refer to
   the thing generated by the action.

   3.  Collection of metrics associated with the system for external
   consumption.  Examples of metrics might be: raw sensor values, BMC
   memory and filesystem consumption, instances of recovered /
   predictive failure errors that have not reached a serviceable
   threshold.

Program level "logging" is intended to go into the journal.  The commit
from Adriana referenced in the discussion implements this.  This logging
interface allows unstructured entries, but we prefer to keep metadata tags
on data items as described earlier in the chain.
  (ex. {MSG="Failed to open file.", ERRNO=%d, FILE_PATH=%s} rather than
       {"Failed to open file(%d): %s", errno, file_path})
The interface also allows syslog-style log levels and we could decide on
a compile flag to disable (compile out) certain log levels.  I believe
the IBM team discussed earlier in the week adding a compile flag to
disable DEBUG level in full image builds, but allowing developers to do
something like "-DPHOSPHOR_LOGGING_DEBUG" to enable full logging on a
particular application.

We understand that there are some nice features of glog, with respect to
backtraces for instance, but strongly feel that the API of it will cause
a fairly significant flash size increase due to iostream-style shift
operators.  There is also concern, though unmeasured, about the
performance impact from the increase in the code path to the journald
interfaces.  The phosphor-logging interfaces compile down to exactly a
single journald function call per log call, which alleviates both of
these concerns.

The creation of service action records proposal was what I described
previously in this email chain.  The commits for that are available in
Gerrit.  We have found that often the information a developer would want
to log for application debug is often related to the information that is
needed to create a serviceable event.  Rather than require the program
(and developer) to log the same information through two different APIs,
we have attempted to combine it by leveraging the meta-data fields from
the program level logging API.  The serviceable event structure is
defined via a YAML file with required meta-data fields and then the API
has compile-time assurance that the required meta-data fields have been
provided.

We recognize this means we are effectively using the journal as an IPC
mechanism between the program identifying a serviceable event and the
program recording and reacting to / executing policy for the serviceable
event.  If this becomes cumbersome or problematic due to journald's
behavior we believe we can change this implementation choice with
little or no API modification.

Metrics are typically going to be generated and distributed through the
system via dbus signals.  For instance, a dbus object is created per
sensor and the org.freedesktop.Object.PropertiesChanged signal is
generated when the value of the sensor changes.  We expect to enhance
the REST server to allow users to subscribe and stream out those sorts
of signals.  If there isn't an interest in the REST API, an implementer
is free to create another daemon that subscribes to those signals and
processes them in another way.

-- 
Patrick Williams
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: Digital signature
URL: <http://lists.ozlabs.org/pipermail/openbmc/attachments/20161028/dfc679ad/attachment-0001.sig>


More information about the openbmc mailing list