RFC for Telemetry data collection

Fri Sep 8 06:04:37 AEST 2017

We do need to provide live reads of a subset of the data, which I think is 
what Rick is describing below: for instance, fan speeds, 30-second power 
averages, component temperatures, etc., much like the IPMI/DCMI 
implementations out there today.  These reads also need to raise alerts 
based on the trip levels that are set.  Higher layers of software can then 
act on this data or log it away as they see fit.  I like the idea of rich 
metadata around these values, but I would think we would use Redfish as 
the method of exporting this data.
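
A minimal sketch of the trip-level alerting described above, for 
illustration only.  The TripLevels class, the check_trip function, and the 
limits used are all hypothetical and do not come from any existing OpenBMC, 
IPMI, or Redfish API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TripLevels:
    """Hypothetical upper trip levels for one sensor; None means 'not set'."""
    warning_high: Optional[float] = None
    critical_high: Optional[float] = None


def check_trip(value: float, levels: TripLevels) -> Optional[str]:
    """Return the most severe trip level reached by value, or None."""
    if levels.critical_high is not None and value >= levels.critical_high:
        return "critical"
    if levels.warning_high is not None and value >= levels.warning_high:
        return "warning"
    return None


# Illustrative limits only, not taken from any real platform.
cpu_temp_levels = TripLevels(warning_high=85.0, critical_high=95.0)
for reading in (70.0, 88.5, 96.0):
    print(reading, check_trip(reading, cpu_temp_levels))
```

A higher layer could call something like check_trip on each live read and 
decide for itself whether to act on the result or just log it.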

We also need deep traces where the data is gathered, processed, and logged 
locally by the BMC.  Then the BMC should alert every X hours and the log 
should be collected by the higher layer entity.  This would be for things 
like VRM currents on every output (hourly min, max, average).  It is not 
required that any other company use these deep telemetry logs, but they 
are required on our systems.
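
As a rough sketch of the hourly min/max/average processing mentioned above, 
the BMC could keep running statistics per sensor and emit one summary per 
window.  The WindowStats class and the sample values are assumptions made 
for illustration, not an existing implementation.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Running min/max/average over one logging window (e.g. one hour)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, sample: float) -> None:
        """Fold one new sample into the running statistics."""
        self.count += 1
        self.total += sample
        self.minimum = min(self.minimum, sample)
        self.maximum = max(self.maximum, sample)

    def summary(self) -> dict:
        """Return the min, max and average of the samples seen so far."""
        if self.count == 0:
            raise ValueError("no samples in window")
        return {
            "min": self.minimum,
            "max": self.maximum,
            "avg": self.total / self.count,
        }


# Made-up VRM output current samples, in amps.
stats = WindowStats()
for amps in (10.0, 12.0, 14.0):
    stats.add(amps)
print(stats.summary())
```

Keeping only count, total, min and max means the BMC does not have to store 
every sample for the window, only the summary that ends up in the log.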

As far as #3 below, this should not be a new requirement.  Just export the 
OCC telemetry log in the same way that you export all OCC/HOST logs.

Todd Rosedahl
IBM Power and Thermal Management
(507) 250-3275
rosedahl at us.ibm.com

From:   Rick Altherr <raltherr at google.com>
To:     tomjose <tomjose at linux.vnet.ibm.com>
Cc:     OpenBMC Maillist <openbmc at lists.ozlabs.org>, thalerj at us.ibm.com, 
jkeusema at us.ibm.com, rosedahl at us.ibm.com
Date:   09/07/2017 01:41 PM
Subject:        Re: RFC for Telemetry data collection

I have many opinions on telemetry data formats and APIs.  What I'm seeing 
in your proposal looks pretty good with some subtlety in the details.  For 
example, I expect to collect most data at least once-per-second, not log 
anything locally, and not alert.  I'll do all aggregation and thresholding 
at a higher level in the software stack.  I also, ideally, want very 
descriptive information about where in the system the sensor is.  I've 
attached a screenshot of what our existing host-based reporting software 
makes available to higher-level software.  This is a view via the 
human-readable web interface; the data is normally served via protobufs.

Rick

On Thu, Sep 7, 2017 at 8:20 AM, tomjose <tomjose at linux.vnet.ibm.com> 
wrote:
Hello,

I am working on the issue (https://github.com/openbmc/openbmc/issues/1957) 
to design a telemetry application for OpenBMC. I will explain a rough idea 
of how we plan to go about it. Please share your thoughts and feedback on 
this proposal. This issue depends on the design evolving out of the 
following issues, since this app would utilize the capabilities they 
provide (https://github.com/openbmc/openbmc/issues/1856, 
https://github.com/openbmc/openbmc/issues/2102).

Summary of the requirements we came across that are relevant to this 
discussion:

1) BMC telemetry data (for example VRM rail voltages), where the data is 
collected at different rates depending on the data and aggregated by the 
BMC app (minimum, maximum and average). The metrics are logged at the 
requested collection frequency, so that the user can fetch them for 
analytics.

2)  Users should be able to set thresholds for the temperature limits and 
receive alerts. This would allow the user to plan the cooling needs.

3)  The BMC would act as a route for the OCC metrics to be sent to the 
user. The OCC would send telemetry data down to the BMC, and the BMC should 
figure out a way to alert the user to consume this data.

We would keep the focus of this discussion on requirement no. 1.
This proposal presupposes that all the resources (for example VRM rail 
voltages, ambient temperature) that the telemetry app is interested in 
are populated as D-Bus objects, which can be queried to read the 
instantaneous values. The phosphor-hwmon application exposes many of the 
resources of interest.

The idea is to have a YAML-based approach, in which the policy of the 
telemetry app is expressed. The application would consume the YAML and 
initiate the telemetry data collection. The YAML would express the 
following:

a) D-Bus info (object, interface, property) associated with the resource.
b) Units associated with the value (e.g. Celsius) and the associated 
scaling factor.
c) Granularity - the time between two measurements.
d) Aggregation methods - min, max, avg, etc.
e) Logging policy - frequency for creating an event and alerting the user.
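
To make items a) to e) concrete, here is a hedged sketch of what one policy 
entry might look like once the YAML has been loaded.  The TelemetryPolicy 
class, its field names, the power-of-ten reading of the scaling factor, and 
the example D-Bus path, interface and values are all assumptions made for 
illustration; the actual schema is still to be designed.

```python
from dataclasses import dataclass
from typing import Tuple

ALLOWED_AGGREGATIONS = ("min", "max", "avg")


@dataclass(frozen=True)
class TelemetryPolicy:
    """One hypothetical policy entry, mirroring items a) to e)."""
    # a) D-Bus info associated with the resource
    dbus_object: str
    dbus_interface: str
    dbus_property: str
    # b) unit and scaling factor (assumed here to be a power-of-ten exponent)
    unit: str
    scale: int
    # c) granularity: seconds between two measurements
    granularity_s: int
    # d) aggregation methods to apply
    aggregations: Tuple[str, ...]
    # e) logging policy: seconds between log events / user alerts
    log_interval_s: int

    def __post_init__(self) -> None:
        # Reject entries that could never produce a meaningful log record.
        if self.granularity_s <= 0:
            raise ValueError("granularity must be positive")
        if self.log_interval_s < self.granularity_s:
            raise ValueError("log interval shorter than granularity")
        for name in self.aggregations:
            if name not in ALLOWED_AGGREGATIONS:
                raise ValueError(f"unknown aggregation: {name}")

    def samples_per_log(self) -> int:
        """Number of measurements aggregated into one log entry."""
        return self.log_interval_s // self.granularity_s

    def scaled(self, raw: int) -> float:
        """Convert a raw reading into the declared unit."""
        return raw * 10 ** self.scale


# Example entry; path, interface and numbers are illustrative assumptions.
ambient = TelemetryPolicy(
    dbus_object="/xyz/openbmc_project/sensors/temperature/ambient",
    dbus_interface="xyz.openbmc_project.Sensor.Value",
    dbus_property="Value",
    unit="celsius",
    scale=-3,
    granularity_s=5,
    aggregations=("min", "max", "avg"),
    log_interval_s=3600,
)
print(ambient.samples_per_log(), ambient.scaled(25000))
```

Validating each entry when it is loaded means a bad policy file fails at 
startup rather than part-way through a collection window.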

The application would operate based on the policy and log the telemetry 
data. The details of logging would evolve as we progress on the related 
issue.

Regards,
Tom

[attachment "zaius-fan-telemetry.png" deleted by Todd 
Rosedahl/Rochester/IBM] 
