Metrics vs Logging, Continued

Sat Dec 16 06:06:48 AEDT 2017

Things like Prometheus and Grafana already exist and have fairly standardized metrics-gathering interfaces. Why reinvent the wheel here? I would advocate using Prometheus for metrics gathering.

Quick primer on Prometheus: each process exposes an HTTP endpoint that reports metrics in a standardized text format. The process can expose counters, gauges, histograms, or summaries. The server side is responsible for polling the various clients at whatever interval is desired, and it can present the results graphically and aggregate them across a bunch of clients. Each process can instrument itself; there are C++ client libraries to do this already. Then we can aggregate them with a route from the main HTTP server.
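
As a rough illustration of what such an endpoint returns, here is a minimal sketch that renders two counters in the Prometheus text exposition layout (HELP and TYPE comment lines followed by samples). The metric names, labels, and values are made up for the example, and label-value escaping is left out.

```python
# Sketch only: metric names, labels, and values are hypothetical.
def render_metric(name, mtype, help_text, samples):
    """Render one metric family; samples is a list of (labels, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside braces.
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


page = render_metric(
    "bmc_boots_total", "counter", "Number of BMC boots.", [({}, 12)]
)
page += render_metric(
    "i2c_ioctl_failures_total",
    "counter",
    "Failed i2c ioctl calls.",
    [({"bus": "3", "device": "0x50"}, 4)],
)
print(page, end="")
```

A real daemon would more likely use one of the existing client libraries than format these lines by hand.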

All of the things in your metrics list fall neatly into types of data that the Prometheus format handles well.
--
Michael

-----Original Message-----
From: openbmc [mailto:openbmc-bounces+michael.e.brown=dell.com at lists.ozlabs.org] On Behalf Of Patrick Venture
Sent: Tuesday, December 5, 2017 10:51 AM
To: OpenBMC Maillist <openbmc at lists.ozlabs.org>; Brad Bishop <bradleyb at fuzziesquirrel.com>
Subject: Metrics vs Logging, Continued

With logging being something separate from metrics, I've been toying around with different approaches to allowing userspace metrics collection and distribution.  There are likely better ways, and I think I saw a message on chat about a metrics library that could be used -- but I've mostly been following email.

I was thinking this morning of a couple methods, some y'all might like (one where the daemon owns it, one where the metric owner owns it):

1) Each daemon can be responsible for exporting onto dbus some objects, under a well-defined path, that are of a metric type that has a value; the daemon that owns an object is therefore responsible for maintaining it.  To collect the metrics, one must grab the subtree from the starting point, trace out all the different metrics, get the values from their owners, and report that up somehow -- the somehow could be several IPMI packets, or several IPMI packets containing a protobuf (similar to the flash access approach proposed by Brendan).
The upside to the free-form text and paths is that you could parse them to figure out what each thing is.

2) Each daemon that wants to track a metric creates a metric object in another daemon (via dbus calls) and then periodically updates that value.  The information can then be reported in the way described above, except that the owner of the dbus objects would be the one daemon on one bus, etc.  This implementation requires a lot more dbus traffic to maintain the values.  However, in situations where a daemon doesn't want to manage its own dbus object for this, it can just make one dbus call to update its value, on whatever timing mechanism it uses, and it can store the metrics internally however it pleases.  Another upside to this is that it'd be straightforward to add to the current set of daemons without needing to restructure anything.  Also, depending on the metric itself, it may not be something updated all that frequently.  For many, I foresee updating on non-critical failures, or interesting failures -- for instance, how often the ipmi daemon's reply is rejected by the btbridge daemon.
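
A minimal in-process sketch of approach #2 with the subtree-style collection from approach #1 might look like the following. Nothing here talks to dbus: MetricsDaemon stands in for the one daemon that owns the objects, Counter is a hypothetical client-side helper, and the object paths are invented for the example.

```python
class MetricsDaemon:
    """Stand-in for the one daemon that owns every metric object."""

    def __init__(self):
        self.objects = {}  # object path -> current value

    def create(self, path, initial=0):
        # Keep the existing value if the object is already there.
        return self.objects.setdefault(path, initial)

    def update(self, path, value):
        self.objects[path] = value

    def subtree(self, root):
        """Return {path: value} for every object below root."""
        prefix = root.rstrip("/") + "/"
        return {p: v for p, v in self.objects.items() if p.startswith(prefix)}


class Counter:
    """Hypothetical client helper; create/update stand for bus calls."""

    def __init__(self, daemon, path):
        self._daemon = daemon
        self._path = path
        self._value = daemon.create(path)

    def increment(self, by=1):
        self._value += by
        self._daemon.update(self._path, self._value)


daemon = MetricsDaemon()
rejected = Counter(daemon, "/metrics/ipmi/reply_rejected")
failsafe = Counter(daemon, "/metrics/fan/failsafe_entered")
rejected.increment()
rejected.increment()
failsafe.increment()

# The collector grabs the whole subtree in one place.
report = daemon.subtree("/metrics")
```

In this sketch the client pushes every increment; a real daemon could just as well batch updates on a timer to cut down on bus traffic.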

Approach #2 could also very easily be rolled into a couple of library calls, such that callers don't even need to know the internals of the tracking...
I like and don't like the free-form text naming of the metrics, because obviously they can be human-readable.  Another approach might be to assign them human-readable names and IDs, similar to sensors, so that you can read back the name for a metric once and then cache it, making subsequent requests smaller.
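
The name-and-ID idea could be sketched like this on the collecting side. The class, the ID numbers, and the metric names are all hypothetical; the lookup callable stands for whatever request would return a metric's name.

```python
class MetricNameCache:
    """Resolve metric IDs to names, asking for each name only once."""

    def __init__(self, lookup_name):
        self._lookup = lookup_name  # callable: metric ID -> name
        self._names = {}
        self.lookups = 0  # how many name requests were actually made

    def name_for(self, metric_id):
        if metric_id not in self._names:
            self.lookups += 1
            self._names[metric_id] = self._lookup(metric_id)
        return self._names[metric_id]

    def decode(self, report):
        """Turn a compact [(id, value), ...] report into {name: value}."""
        return {self.name_for(i): v for i, v in report}


# Hypothetical ID table held by the metric owner.
table = {1: "bmc_boot_count", 2: "fan_failsafe_count"}
cache = MetricNameCache(table.__getitem__)
first = cache.decode([(1, 5), (2, 0)])
second = cache.decode([(1, 6)])  # no new name lookups needed
```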

An obvious downside to both implementations (although #2 has an easy mitigation): if the daemon with the internal state crashes, the metrics are lost, and when it comes back up all the metrics are 0.  If the metrics are owned by another daemon, then the library calls to set up the metrics tracking could check if the metric already exists, and use that value to start with -- then you only have to care about that one daemon crashing.  It could periodically write the values down and then read them on start-up to persist these values.  However, you might want the values to not persist... I imagine I wouldn't; however, something like boot count would...
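
One way the write-down-and-read-back idea might look, as a sketch: only metrics marked persistent are written to a file, and a restarted store reads the file on start-up. The JSON file format, the class name, and the metric names are assumptions for the example.

```python
import json
import os
import tempfile


class MetricStore:
    """Keeps metric values in memory and persists only selected names."""

    def __init__(self, path, persistent=()):
        self.path = path
        self.persistent = set(persistent)
        self.values = {}
        # On start-up, pick up whatever was written down before.
        if os.path.exists(path):
            with open(path) as f:
                self.values = json.load(f)

    def add(self, name, by=1):
        self.values[name] = self.values.get(name, 0) + by

    def flush(self):
        keep = {k: v for k, v in self.values.items() if k in self.persistent}
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(keep, f)
        os.replace(tmp, self.path)  # swap the new file into place


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "metrics.json")
        first = MetricStore(path, persistent={"boot_count"})
        first.add("boot_count")
        first.add("ipmi_reply_rejected", 3)
        first.flush()
        # Simulated crash and restart: a new store reads the file back.
        second = MetricStore(path, persistent={"boot_count"})
        second.add("boot_count")
        return second.values


values_after_restart = demo()
```

Here boot_count survives the simulated restart while ipmi_reply_rejected starts over, which matches the case where only some values should persist.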

There are specific things that the host wants to know, that really fall into metrics over logging:
1) BMC's boot count
2) i2c ioctl failure count (which bus/device/reg: count)
3) Specific sensor requests (reading, writing)
4) Fan control failsafe mode count, how often it's falling into failsafe mode
5) How often the ipmi daemon's reply to the btbridge daemon fails.

Given some feedback on this, I'll write up a design and the use-cases it's trying to address.

Thanks,
Patrick