RFC for Telemetry data collection

Wed Mar 14 01:50:33 AEDT 2018

On 13/03/18 7:53 pm, Kurt Taylor wrote:

>     Hi,
> 
>     I'd like to bump this topic and add some more details. I'd like to
>     discuss design proposals/directions for a couple of things:
> 
>     1) A short/mid term proposal for telemetry requirements specific to
>     IBM labs (which need to be implemented in a relatively short span of
>     time, so there may not be the bandwidth to write an entirely new
>     application not based on D-Bus or the OpenBMC REST API).
>     2) Industry standard methods for storing and retrieving telemetry
>     data - thoughts on how to get here.
> 
> 
>     1) Telemetry requirements specific to IBM labs
>     Here are the requirements and a design proposal.
> 
>     a) Instantaneous readings, such as temperatures, currents, errors,
>     events etc. Let's call this Layer 0.
> 
>     Proposal:
>     - The D-Bus model is the source for instantaneous readings. This
>     means there would be D-Bus objects representing this data, and hence
>     an OpenBMC REST API around it.
>     - These D-Bus objects would not necessarily implement the same D-Bus
>     interfaces.
>     - Interested clients can read these D-Bus objects via the OpenBMC
>     REST API.
>     - If clients are interested in being notified about "changes" to the
>     readings, that's possible via the existing event notification over
>     WebSockets mechanism.
> 
> 
> This would also map well into an OBMC MIB extension for example.
> 
> 
> 
>     b) Instantaneous aggregations - this would mostly apply to, but may
>     not be limited to, readings such as temperatures and currents. Let's
>     call this Layer 1. This basically is to solve, for eg, "what is the
>     min/max/average over the last X seconds?". We have a requirement to
>     do such aggregations on the BMC.
> 
> 
> I would be interested in why aggregations (and historical - Layer 2) are
> a requirement and not just handled by the monitoring/event management
> app, as is done in network management. If this work is to be done in the
> BMC, it needs to be user definable and able to be turned off for
> resource-critical situations.

Right, it should be possible to turn off the Layer 1 and Layer 2
aggregation apps, or to leave them out of the BMC image altogether.

As for why the aggregations are required to be done on the BMC: I think
that's the expectation of some of the IBM monitoring tools. I'm sure Todd
Rosedahl would have a better answer here.

> 
>     Proposal:
>     - Aggregations are represented as D-Bus objects, created by a
>     telemetry app. For eg if we need to know the min/max/avg ambient
>     temp for the last 5 minutes, and say the ambient temp is usually
>     at temps/ambient, the aggregation could be at
>     temps/aggregations/ambient.
>     - Implement D-Bus interfaces to denote aggregations, for eg the
>     temps/aggregations/ambient object could implement a D-Bus interface
>     describing min/max/avg properties.
>     - Aggregation objects will have the values as described in the D-Bus
>     interface (such as min/max/avg), and a timestamp, as properties.
>     - Enable a config (eg JSON) to let the telemetry app know things
>     like : What (supported) aggregations should be performed
>     (min/max/avg)? What D-Bus objects should be aggregated? How
>     frequently should they be aggregated? What should be the paths of
>     the aggregations? Potentially add a REST API to allow changing the
>     (JSON) config at run-time.
>     - It will be possible to read all aggregation objects, or
>     aggregation objects of a specific type via one REST call.
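To make the Layer 1 proposal a bit more concrete, here is a minimal
sketch of the aggregation logic and the kind of config that could drive
it. It is only an illustration under assumptions: the config key names
(source, path, operations, window_seconds, sample_interval_seconds) and
the WindowAggregator class are hypothetical, not an agreed schema or an
existing OpenBMC API, and the D-Bus plumbing is left out entirely.

```python
import json
from collections import deque

# Hypothetical config; the key names are illustrative, not an agreed schema.
CONFIG = json.loads("""
{
  "aggregations": [
    {
      "source": "temps/ambient",
      "path": "temps/aggregations/ambient",
      "operations": ["min", "max", "avg"],
      "window_seconds": 300,
      "sample_interval_seconds": 10
    }
  ]
}
""")


class WindowAggregator:
    """Keeps readings from the last `window_seconds` and reports min/max/avg."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop readings that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def properties(self):
        """Return what the aggregation object would expose as properties."""
        values = [value for _, value in self.samples]
        if not values:
            return None
        return {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "timestamp": self.samples[-1][0],
        }
```

The telemetry app would create one such aggregator per config entry,
feed it a reading every sample interval, and publish the result of
properties() on the object at the configured path.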
> 
> 
>     c) Historical aggregations or snapshot. Let's call this Layer 2.
>     This is to solve, for eg, "Need a reading corresponding to every X
>     minutes in a period of Y hours". This can be a snapshot of Layer 1
>     or Layer 0 D-Bus objects. We have a requirement to store this
>     snapshot on the BMC.
> 
>     Proposal:
>     - The snapshot will be represented as a set of D-Bus objects. For eg
>     if one needs an hourly reading for a period of 24 hours, the objects
>     could be at temps/aggregations/ambient/per-hour/{1..24}.
>     - Enable a config to let a telemetry app know things like: What
>     D-Bus objects should I keep a history of? What is the duration of
>     the snapshot? At what frequency should entries be added into the
>     snapshot? Once the snapshot is full, should the entries roll, or
>     should we restart? Potentially add a REST API to allow changing the
>     (JSON) config at run-time.
>     - The historical aggregations can be read via one REST call. For
>     the REST server this would most likely be one D-Bus call as well,
>     if there's an object manager at temps/aggregations/ambient/per-hour
>     for eg.
>     - These objects in the snapshot will implement the same interfaces
>     as Layer 1 objects, so they will have the same properties (eg
>     min/max/avg, timestamp).
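The "roll or restart" question for Layer 2 can be sketched as a
fixed-capacity history. Again, this is an assumption-laden illustration:
the Snapshot class, the on_full option and its "roll"/"restart" values
are hypothetical names, and the capacity would be derived from the
config (for eg a 24 hour duration at one entry per hour gives 24).

```python
from collections import deque


class Snapshot:
    """Fixed-capacity history of aggregation entries.

    on_full="roll": the oldest entry is dropped to make room.
    on_full="restart": the snapshot is emptied and filling starts over.
    """

    def __init__(self, capacity, on_full="roll"):
        if on_full not in ("roll", "restart"):
            raise ValueError("on_full must be 'roll' or 'restart'")
        self.capacity = capacity
        self.on_full = on_full
        self.entries = deque()  # oldest entry first

    def add(self, entry):
        if len(self.entries) == self.capacity:
            if self.on_full == "roll":
                self.entries.popleft()
            else:
                self.entries.clear()
        self.entries.append(entry)

    def paths(self, base):
        """Map entries to object paths base/1 .. base/N, oldest first."""
        return {f"{base}/{i}": e for i, e in enumerate(self.entries, start=1)}
```

With "roll", object N always holds the most recent entry once the
snapshot is full; with "restart", the set of objects shrinks back to one
and grows again.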
> 
> 
>     d) Some notes
>     - With the proposal above, the API to retrieve the telemetry data is
>     the current OpenBMC REST API, so it may not readily work with
>     telemetry tools relying on industry-standard APIs (see point 2
>     below), but it seems to be the feasible option for implementing
>     IBM's requirements in the expected timelines.
>     - Layer 1 and Layer 2 telemetry apps would be different processes,
>     and can function independently of each other.
> 
> 
> 
>     2) Industry standard methods for storing and retrieving telemetry data
> 
>     - With the proposal above, the instantaneous readings are D-Bus
>     objects, the instantaneous and historical aggregations are D-Bus
>     objects as well. The API is the OpenBMC REST API.
>     - Typically, aggregations may not have to happen on the BMC, in
>     which case one can turn off layers 1 and 2.
>     - This is regarding how the telemetry data is presented, and how
>     we'd eventually not use the current OpenBMC REST API in production.
>     I've heard (mostly from people on the To: list) of the following
>     industry-standard ways to represent/retrieve telemetry data. This
>     would mean transforming layer 0 D-Bus objects into these :
>              - Via Redfish (events) API
>              - Via IPMI events/PEF
> 
> 
> Meh. I'd stick with Redfish/OBMC REST API over this one.
> 
>              - Via SNMP traps
> 
> 
> If there is interest here, I have experience designing MIB extensions 
> and sub-agents to support them.
> 
>              - Via an sqlite db, and have something like Logstash parse it
> 
> 
> Seems very heavy for BMC.

I tend to agree.

> Kurt Taylor (krtaylor)
> 

Regards,
Deepak