Logging utilities.

Mon Sep 19 13:18:17 AEST 2016

Thank you for the background.  That explains a lot about the path Adriana
has gone down.  I completely understand wanting to separate policy from
reporting of device and process errors.  Where we differ is that I don't
see that as the purpose of logging.  You've described an event-based
telemetry system meant to be primarily consumed by other software to drive
alerts and automated responses.  I view logging as being primarily for the
benefit of a programmer or operator trying to understand what section of
_code_ went wrong.  That distinction is part of why glog isn't designed to
be super high-performance.  Google uses entirely separate systems for
telemetry.  In my and Google's experience, scraping logs to extract
information (even when the messages are well structured) winds up being
painful to maintain and unreliable over time even when using journald.

On Sat, Sep 17, 2016 at 5:49 AM, Patrick Williams <patrick at stwcx.xyz> wrote:

> There is a commit on Gerrit where some discussion was requested on the
> libraries we plan to use for logging:
>     https://gerrit.openbmc-project.xyz/#/c/53/
> The suggestion was made that we consider google/glog as a logging
> library.
>
> Our intention with the program logging is to heavily leverage systemd's
> journal methods both for original program logging but also for later
> programmatic extraction on errors.  Adriana is working on the second
> part and will have some proof of concepts to share soon.
>
> Left to their own devices, two different programmers might create two
> logging events such as:
>     ERROR "Failed to open device %s, %s", device, strerror(errno)
>     ERROR "fopen on %s failed: %d", device, errno
>
> Notice in this case, not only is the English different but the content
> itself is different and impossible to parse out if we want to do
> something like automatically transpose "device" into a potentially
> failing FRU.  The journald approach is that you do not put content in
> English but you use meta-data tags to represent it.
>
>     LEVEL=ERROR, MESSAGE="Failed to open", FILE="%s", device,
>     ERRNO="%d", errno
>
> Now we have a message that humans can read ("Failed to open
> FILE=device") and one that computers can still interpret.  To take this
> a step further, consider adding another meta-data tag:
>     ERROR_EVENT="xyz.openbmc-project.Error.Device-Not-Found"
>
> We can now codify behavior, in a data format customizable per-system if
> needed, such as:
>     on xyz.openbmc-project.Error.Device-Not-Found:
>         [Device, AccessType] = parse-device-string(metadata.DEVICE)
>         Parts = devices-on-bus(BMC, Device, AccessType)
>         add-callout(Parts, High)
>         add-callout("Error.Device-Configuration", Low)
>
> Effectively, we separate the error identification from the error policy.
> There are a few major advantages of this:
>     1. Historically we have found that well over 50% of the code in this
>        environment ends up being dealing with errors, which makes the
>        code very difficult to understand and maintain.  By separating
>        the policy, we end up with something that is much more readable.
>     2. Different systems end up with policy quirks that tend to litter
>        the code.  "On this system i2c bus 3 goes through this extra
>        riser card and the connector ends up oxidizing faster than we
>        expected, so put that riser as the highest priority for
>        replacement."  We can now put this policy into a system-specific
>        enhancement file rather than --enable-firestone and #if
>        CONFIG_FIRESTONE littering all the packages.
>
> Now, back to the logging API, what Adriana delivered aligns very well
> with these metadata and error concepts because it was designed with
> these thoughts in mind (which, I know, we could have done a better job of
> communicating).  We enforce, at compile-time, that "messages" are static
> English statements and dynamic data is all tagged through the "entry"
> mechanism.  Down the line we will be able to further enforce, again at
> compile time, that log(error(xyz.openbmc-project.Device-Not-Found)...)
> must contain {entry(DEVICE), entry(ERRNO)} so that the program fails to
> compile if the required data, per the error definition, were not given.
>
> I don't see an easy way to leverage google/glog to include these
> concepts, other than simply having the logging utilities Adriana
> delivered use google/glog instead of systemd_journal.  At that point we
> are losing the mind-share(?) associated with using glog in the first
> place.
>
> Glog seems to be primarily oriented towards a logging infrastructure
> for a stand-alone application.  The default behavior of it is to create
> a log file in /tmp per process.  Due to the nature of our project, being
> a collection of interacting products, debug is often going to require
> aggregation of logging across the system and this is where journald
> would be needed as the back-end.  As Rick mentioned in the above review,
> glog doesn't currently support systemd as a back-end but that is likely
> trivial to add (or maybe available as an unreleased patch already).  It
> does seem like a significant code path to introduce simply as a way to
> get from a log statement to journald, when what Adriana delivered
> compiles directly to a single journald call.
>
> Glog also uses C++ iostream-style << operators.  I made a simple
> hello-world application using it and there ends up being a local object
> constructed to contain the buffer, a << operator call to insert a string
> into the buffer, and a destructor call to emit the log entry.  There is
> 4 function calls involved at a minimum for each log entry, plus an
> additional function call per-element.  So, in addition to the
> performance implications of whatever glog is doing internally, I have
> concerns about the code size impacts to using Glog due to the way the
> API is structured and implemented.
>
> In summary, I do not see an advantage to using glog over direct
> journald calls, and would prefer we incorporate something derived from
> what Adriana has proposed, for the following reasons:
>     1. The proposed API has syntactic handling for concepts we want in
>        our error-event design that glog does not.
>         a. glog could be inserted under phosphor::logging at any point
>            in the future without any code change to applications that
>            use phosphor::logging.
>     2. Glog has performance and size implications, over direct journald
>        calls, without much in the way of added value.
>
> There are a number of macros that Glog provides to simplify log
> interaction:
>         LOG_IF(condition) << entry
>     rather than
>         if (condition) { LOG << entry; }
>     (and many more)
>
> It does seem like a concept we could enhance the phosphor::logging
> interfaces with if we find that this makes the code more concise.  I
> would be interested to hear some of the more widely used macros and see
> those enhancements proposed in Gerrit.
>
> --
> Patrick Williams
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ozlabs.org/pipermail/openbmc/attachments/20160918/0815bde6/attachment.html>