Metrics vs Logging, Continued

Wed Dec 6 03:50:31 AEDT 2017

Logging is something separate from metrics, so I've been toying
around with different approaches to userspace metrics collection and
distribution.  There are likely better ways, and I think I saw a
message on chat about a metrics library that could be used, but I've
mostly been following email.

This morning I was thinking of a couple of methods some of y'all
might like (one where each daemon owns its own metrics, one where a
single daemon owns them all):

1) Each daemon is responsible for exporting onto dbus some objects of
a metric type, each with a value, at a well-defined path.  The daemon
that owns an object is responsible for maintaining it.  To collect
the metrics, one grabs the subtree from the starting point, traces
out all the different metrics, gets the values from their owners, and
reports that up somehow.  The somehow could be several IPMI packets,
or several IPMI packets containing a protobuf (similar to the flash
access approach proposed by Brendan).  The upside of the free-form
text and paths is that you could parse them to figure out what each
thing is.
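
To make the collection step concrete, here is a rough sketch of
approach 1.  It is a toy model only: plain dicts stand in for the bus,
there are no real dbus calls, and every path and daemon name in it is
made up for illustration.

```python
# Toy model of approach 1.  Dicts stand in for dbus; all paths and
# daemon names below are hypothetical.

class Daemon:
    """A daemon that owns and maintains its own metric objects."""

    def __init__(self, name):
        self.name = name
        self.metrics = {}  # object path -> current value

    def export(self, path, value=0):
        self.metrics[path] = value

    def get(self, path):
        return self.metrics[path]


def collect(daemons, root):
    """Walk every path under root and read each value from its owner."""
    prefix = root.rstrip("/") + "/"
    report = {}
    for daemon in daemons:
        for path in daemon.metrics:
            if path.startswith(prefix):
                report[path] = daemon.get(path)
    return report


ipmid = Daemon("ipmid")
ipmid.export("/metrics/ipmid/bt_reply_rejected", 3)

fand = Daemon("fand")
fand.export("/metrics/fand/failsafe_entered", 1)
fand.export("/other/fand/state", 7)  # outside the subtree, skipped

report = collect([ipmid, fand], "/metrics")
```

The collector has to touch every owner to build one report, which is
the cost of letting each daemon keep its own objects.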

2) Each daemon that wants to track a metric creates a metric object in
another daemon (via dbus calls) and then periodically updates that
value.  The information can then be reported in the same way as
described above, except that all the dbus objects would be owned by
the one daemon on one bus, etc.  This implementation requires a lot
more dbus traffic to maintain the values.  However, a daemon that
doesn't want to manage its own dbus object for this can just make one
dbus call to update its value, on whatever timing mechanism it uses,
and can store the metrics internally however it pleases.  Another
upside is that it'd be straightforward to add to the current set of
daemons without needing to restructure anything.  Also, depending on
the metric itself, it may not be something updated all that
frequently.  For many, I foresee updating on non-critical or
interesting failures, for instance, how often the ipmi daemon's reply
is rejected by the btbridge daemon.
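
A minimal sketch of approach 2, under the same caveats: this is an
in-memory stand-in, not dbus, and the class, method, and metric names
are all hypothetical.  The call counter is only a rough proxy for bus
traffic.

```python
# Toy model of approach 2.  One object stands in for the single daemon
# that owns every metric; names are hypothetical, no real dbus here.

class MetricsDaemon:
    def __init__(self):
        self.values = {}
        self.calls = 0  # rough proxy for dbus traffic

    def create(self, name, initial=0):
        self.calls += 1
        self.values[name] = initial

    def update(self, name, value):
        self.calls += 1
        self.values[name] = value


class IpmiDaemon:
    """A client that keeps its own count and pushes it when it likes."""

    METRIC = "ipmid.bt_reply_rejected"

    def __init__(self, metrics):
        self.metrics = metrics
        self.rejected = 0  # internal storage, however the daemon pleases
        metrics.create(self.METRIC)

    def on_reply_rejected(self):
        self.rejected += 1

    def flush(self):
        # One call per update; when to call is up to the daemon.
        self.metrics.update(self.METRIC, self.rejected)


metricsd = MetricsDaemon()
ipmid = IpmiDaemon(metricsd)
for _ in range(3):
    ipmid.on_reply_rejected()
ipmid.flush()
```

Three failures cost one update call here because the client batches
them, which is why rarely changing metrics keep the extra traffic
small.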

Approach #2 could also very easily be rolled into a couple of library
calls, such that callers don't even need to know the internals of the
tracking.  I like and don't like the free-form text naming of the
metrics.  Obviously the names can be human-readable, but they make
every request larger.  Another approach might be to assign the
metrics human-readable names and IDs, similar to sensors, so that you
can read back the name for a metric once and then cache it, making
subsequent requests smaller.
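
The name-plus-ID idea could look something like the sketch below.
Everything in it is an assumption for illustration (the registry, the
reader, and the metric name); it only shows the reader fetching a name
once and then using the ID alone.

```python
# Sketch of names plus numeric IDs with a reader-side name cache.
# All class and metric names are hypothetical.

class MetricRegistry:
    def __init__(self):
        self._names = []   # list index is the metric ID
        self._values = []

    def register(self, name):
        self._names.append(name)
        self._values.append(0)
        return len(self._names) - 1

    def set(self, metric_id, value):
        self._values[metric_id] = value

    def name_of(self, metric_id):
        return self._names[metric_id]

    def read(self, metric_id):
        return self._values[metric_id]


class Reader:
    """Host-side reader that asks for each name only once."""

    def __init__(self, registry):
        self.registry = registry
        self.name_cache = {}
        self.name_lookups = 0

    def read(self, metric_id):
        if metric_id not in self.name_cache:
            self.name_cache[metric_id] = self.registry.name_of(metric_id)
            self.name_lookups += 1
        return self.name_cache[metric_id], self.registry.read(metric_id)


registry = MetricRegistry()
failsafe_id = registry.register("fan_failsafe_count")
registry.set(failsafe_id, 2)

reader = Reader(registry)
first = reader.read(failsafe_id)
registry.set(failsafe_id, 4)
second = reader.read(failsafe_id)
```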

The obvious downside to both implementations (although #2 has an easy
mitigation) is that if the daemon holding the internal state crashes,
the metrics are lost, and when it comes back up all the metrics are 0.
If the metrics are owned by another daemon, then the library calls
that set up the metrics tracking could check whether the metric
already exists and use that value to start with.  Then you only have
to care about that one daemon crashing.  It could periodically write
the values down and read them back on start-up to persist them.
However, you might not want every value to persist.  I imagine I
wouldn't for most, but something like boot count would need to.
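
Here is one way the mitigation might be sketched, assuming a JSON file
for storage and a per-metric persist flag.  Neither of those is
settled; the file format, the flag, and all names are hypothetical.

```python
# Sketch of resume-on-attach plus optional persistence.  The JSON
# file, the persist flag, and all names are assumptions.
import json
import os
import tempfile


class MetricsStore:
    def __init__(self, path):
        self.path = path
        self.values = {}
        self.persistent = set()
        if os.path.exists(path):
            with open(path) as f:
                self.values = json.load(f)
            self.persistent = set(self.values)

    def attach(self, name, persist=False):
        """Create the metric, or pick up its current value if it exists."""
        self.values.setdefault(name, 0)
        if persist:
            self.persistent.add(name)
        return self.values[name]

    def update(self, name, value):
        self.values[name] = value

    def save(self):
        # Only metrics flagged as persistent are written down.
        kept = {k: v for k, v in self.values.items() if k in self.persistent}
        with open(self.path, "w") as f:
            json.dump(kept, f)


path = os.path.join(tempfile.mkdtemp(), "metrics.json")
store = MetricsStore(path)

boots = store.attach("bmc.boot_count", persist=True)
store.update("bmc.boot_count", boots + 1)

store.attach("ipmid.bt_reply_rejected")
store.update("ipmid.bt_reply_rejected", 5)

# A client daemon crashes and restarts: it re-attaches and resumes.
resumed = store.attach("ipmid.bt_reply_rejected")

# The metrics daemon itself restarts: only persistent values survive.
store.save()
restarted = MetricsStore(path)
```

After the simulated restart, boot count is still there and the
rejection count starts over from 0, which matches the split described
above.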

There are specific things that the host wants to know that really
fall into metrics over logging:
1) BMC boot count
2) i2c ioctl failure count (which bus/device/reg: count)
3) Specific sensor requests (reading, writing)
4) Fan control failsafe mode count (how often it falls into failsafe mode)
5) How often the ipmi daemon's reply to the btbridge daemon fails
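
For a keyed metric like the i2c one, a counter indexed by
(bus, device, reg) may be enough.  This is only a sketch; the helper
names and the report format are made up.

```python
# Sketch of a keyed failure counter.  Helper names and the output
# format are hypothetical.
from collections import Counter

i2c_failures = Counter()


def record_i2c_failure(bus, device, reg):
    i2c_failures[(bus, device, reg)] += 1


def format_report(counter):
    """Render each entry as 'bus/device/reg: count', sorted by key."""
    return [
        f"{bus}/0x{dev:02x}/0x{reg:02x}: {count}"
        for (bus, dev, reg), count in sorted(counter.items())
    ]


record_i2c_failure(1, 0x48, 0x00)
record_i2c_failure(2, 0x50, 0x10)
record_i2c_failure(1, 0x48, 0x00)

lines = format_report(i2c_failures)
```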

Given some feedback on this, I'll write up a design and the use-cases
it's trying to address.

Thanks,
Patrick