Metrics vs Logging, Continued

Michael E Brown Michael.E.Brown at dell.com
Fri Jan 5 02:39:55 AEDT 2018


On Wed, Jan 03, 2018 at 02:22:39AM +0000, Christopher Covington wrote:
> Hi Michael, Patrick,
> 
> I probably should have hopped on this list months ago. Thanks for your patience as I come up to
> speed on your code, configure my mail client to suit this list, and so on.
> 
> > Prometheus metrics is fundamentally a pull model, not a push model. If you have a pull model,
> > it greatly simplifies the dependencies:
> 
> >	- Pull metrics internally or externally (daemons listen on 127.0.0.1, optionally reverse proxy
> >	  that through your web service).
> 
> An option for on-demand metrics (as opposed to periodic, always-on monitoring) is nice. I would
> use it to more highly scrutinize upgrades in progress for example.

This is a nice point. With a pull model you can easily turn metrics collection
on or off for your systems simply by starting or stopping the server doing the
pulling. It is harder with push metrics: you have to *configure* each endpoint
with where to push to (which may or may not change from time to time), and you
have to turn pushing on or off on each endpoint.

> 
> >	- Optionally run the metrics server or not depending on configuration.
> 
> I agree it should fail gracefully when there is no server present, and think this generalizes to
> other network services, even NTP and DHCP.
> 
> >	- Pull model naturally self-limits in performance-limited cases... you don’t have a thundering
> >	  herd of daemons trying to push metrics. In case metrics server gets loaded it will naturally
> >	  slow down polls to backend daemons.
> 
> At large scale you'll either need multiple pollers or load-balancing for the receiving server. I'm
> not sure what the best solution is. Is load-balancing perhaps more commonplace?

Load balancing is "more commonplace" for things like generic web servers.
Setting it up is distinctly non-trivial, though, because the specifics of how
you are using it matter a great deal.

Overall this is a matter that reasonable people can disagree on. I favor the
pull approach. It degrades much more predictably (the server pulls more slowly
but still hits all the systems), and you can easily scale up. Load balancing
for this type of thing seems far more difficult to set up and get working well
(how you persistently map clients to servers, for instance). However, I think
a push model is also pretty easy to argue for successfully (though I personally
won't).

Also, I think the conversation here started more as a discussion of the best
way to collect metrics from individual daemons, rather than specifically of how
to get metrics for OpenBMC overall. Something like a protocol spec for how
individual metrics collection is done would be pretty useful; it could then be
implemented differently across daemons while sharing the same protocol. From
there we can talk about how to extend that off the box.
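
To make that concrete, here is a minimal sketch in Go of what one daemon's end
of such a protocol could look like: serve the Prometheus text exposition format
on a loopback HTTP endpoint, and let configuration decide whether the listener
runs at all. Everything specific in it (metric names, port, flag) is invented
for illustration, and the same format could just as easily be produced by a C
or Python daemon:

// Illustrative sketch only: the metric names, port, and flag below are
// invented for this example, not a proposal for OpenBMC.
package main

import (
    "flag"
    "fmt"
    "log"
    "net/http"
    "sync/atomic"
)

var requestsHandled uint64 // example counter the daemon would maintain

func metricsHandler(w http.ResponseWriter, r *http.Request) {
    // Prometheus text exposition format, written by hand so the "protocol"
    // is visible; a client library could generate this instead.
    w.Header().Set("Content-Type", "text/plain; version=0.0.4")
    fmt.Fprintln(w, "# HELP exampled_requests_total Requests handled by this daemon.")
    fmt.Fprintln(w, "# TYPE exampled_requests_total counter")
    fmt.Fprintf(w, "exampled_requests_total %d\n", atomic.LoadUint64(&requestsHandled))
    fmt.Fprintln(w, "# HELP exampled_temperature_celsius Example gauge.")
    fmt.Fprintln(w, "# TYPE exampled_temperature_celsius gauge")
    fmt.Fprintln(w, "exampled_temperature_celsius 42.5")
}

func main() {
    serveMetrics := flag.Bool("metrics", true, "serve /metrics on 127.0.0.1:9100")
    flag.Parse()
    if !*serveMetrics {
        select {} // stand-in for the daemon's real work loop, metrics disabled
    }
    http.HandleFunc("/metrics", metricsHandler)
    log.Fatal(http.ListenAndServe("127.0.0.1:9100", nil)) // loopback only; reverse proxy if needed
}

A scraper (Prometheus itself, a pull-to-push agent, or plain curl) then only
needs to know "GET /metrics and parse this text format", regardless of which
daemon or language sits behind it.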

Another part of the conversation we probably need to have is "API stability":
are we going to require specific metrics to be stable over time, or can we live
with ad-hoc metrics that may be added or dropped over time?

> 
> > But what I think would be pretty nice is if you could point graphana/Prometheus towards every
> > BMC on your network to get nice graphs of temp, fan speeds, etc.
> 
> For metrics/counters, I've been centrally pulling/polling from a fleet running the following RESTful
> API:
> 
> https://github.com/facebook/openbmc/tree/helium/common/recipes-rest/rest-api/files
> 
> But polling the whole fleet doesn't seem ideal, so I'm wondering about a push model.
> 
> Prometheus looks interesting, thanks for the pointer. It does seem to support a push model
> https://prometheus.io/docs/instrumenting/pushing/

Prometheus is just a specification for an HTTP endpoint, so it would (in
theory) be relatively easy to write a pull-to-push gateway in any language. The
push model mentioned here is just a local agent that periodically polls the
local pull endpoint and pushes the results somewhere.
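
As a rough sketch of that kind of agent (the URLs, the interval, and the
assumption of a Pushgateway-style target that accepts the text format are all
invented for illustration):

// Illustrative pull-to-push agent: the URLs and interval are invented.
// It periodically GETs a local pull endpoint and re-POSTs the body to an
// assumed Pushgateway-style server that accepts the Prometheus text format.
package main

import (
    "bytes"
    "io"
    "log"
    "net/http"
    "time"
)

func relay(pullURL, pushURL string) error {
    resp, err := http.Get(pullURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return err
    }
    // Push the scraped text format as-is.
    pushResp, err := http.Post(pushURL, "text/plain; version=0.0.4", bytes.NewReader(body))
    if err != nil {
        return err
    }
    return pushResp.Body.Close()
}

func main() {
    const (
        pullURL = "http://127.0.0.1:9100/metrics"                       // local pull endpoint
        pushURL = "http://pushgateway.example.com:9091/metrics/job/bmc" // assumed push target
    )
    for range time.Tick(30 * time.Second) {
        if err := relay(pullURL, pushURL); err != nil {
            log.Printf("push failed: %v", err) // keep trying; don't crash the BMC service
        }
    }
}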

> 
> Do Go language applications run reasonably well on ASpeed 2400 SoCs?

I've done several prototypes of golang servers on the Nuvoton ARM chip
(slightly faster than the ASpeed) and the results for me were more than
acceptable. I was very pleased with the ease of development, memory usage, and
most other metrics. Go has the development speed of Python combined with
runtime speed comparable to Java (and sometimes approaching C).
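
For what it's worth, the ASpeed 2400's ARM9 core is ARMv5 without hardware
floating point, so a Go binary for it would be cross-compiled with the
soft-float ARM target, something along the lines of:

    GOOS=linux GOARCH=arm GOARM=5 go build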

> I've heard that OpenWRT uses collectd: https://wiki.openwrt.org/doc/howto/statistic.collectd

From a quick look at this project, it does two things: a) it defines a plugin
format for collecting various stats, along with some pre-written plugins for
"popular" things, and b) it writes its output in RRD format for consumption by
other tools. This is conceptually very similar to Prometheus (same concepts:
gauges, histograms, counters, etc.). However, collectd appears to specify a
file format for output, but not an access format. The Prometheus part I'm
focusing on is how we can standardize both the access format and the data
format.

--
Michael Brown

