<div dir="ltr"><br><div class="gmail_extra"><div class="gmail_quote">On Fri, Mar 9, 2018 at 7:43 AM, Deepak Kodihalli <span dir="ltr"><<a href="mailto:dkodihal@linux.vnet.ibm.com" target="_blank">dkodihal@linux.vnet.ibm.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 07/09/17 8:50 pm, tomjose wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hello,<br>
<br>
I am working on the issue (<a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_openbmc_openbmc_issues_1957&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=LzkOghL3x_3V_EkHiUQJUJ0xrq_s3_wwfssFT35AQXw&m=Gk0KnGVKy2iC82jVgpSqjzR2K_EYRlsFBqs34EjyKys&s=nQguk1LTD_q7dPNEn6dV2p4vsKaMu1vYZs5_Vh0BiNc&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint<wbr>.com/v2/url?u=https-3A__<wbr>github.com_openbmc_openbmc_<wbr>issues_1957&d=DwICaQ&c=jf_<wbr>iaSHvJObTbx-siA1ZOg&r=LzkOghL3<wbr>x_3V_EkHiUQJUJ0xrq_s3_wwfssFT3<wbr>5AQXw&m=Gk0KnGVKy2iC82jVgpSqjz<wbr>R2K_EYRlsFBqs34EjyKys&s=<wbr>nQguk1LTD_q7dPNEn6dV2p4vsKaMu1<wbr>vYZs5_Vh0BiNc&e=</a> ) to design a telemetry application for the OpenBMC. I would be explaining a rough idea of how we plan to go about. Please share your thoughts and feedback on this proposal. This issue would depend on the design evolving out of following issues, since this app would utilize the capabilities provided. (<a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_openbmc_openbmc_issues_1856&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=LzkOghL3x_3V_EkHiUQJUJ0xrq_s3_wwfssFT35AQXw&m=Gk0KnGVKy2iC82jVgpSqjzR2K_EYRlsFBqs34EjyKys&s=2B_nLYU03S0QgMnCrMr8YawangOxRXmXGBqPF9593DY&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint<wbr>.com/v2/url?u=https-3A__<wbr>github.com_openbmc_openbmc_<wbr>issues_1856&d=DwICaQ&c=jf_<wbr>iaSHvJObTbx-siA1ZOg&r=LzkOghL3<wbr>x_3V_EkHiUQJUJ0xrq_s3_wwfssFT3<wbr>5AQXw&m=Gk0KnGVKy2iC82jVgpSqjz<wbr>R2K_EYRlsFBqs34EjyKys&s=2B_nLY<wbr>U03S0QgMnCrMr8YawangOxRXmXGBqP<wbr>F9593DY&e=</a> , <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_openbmc_openbmc_issues_2102&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=LzkOghL3x_3V_EkHiUQJUJ0xrq_s3_wwfssFT35AQXw&m=Gk0KnGVKy2iC82jVgpSqjzR2K_EYRlsFBqs34EjyKys&s=U6M9vpoDDmNbTJJH5I6M6lPBGFzS1nuqYEEGwXjAviY&e=" rel="noreferrer" 
target="_blank">https://urldefense.proofpoint.<wbr>com/v2/url?u=https-3A__github.<wbr>com_openbmc_openbmc_issues_210<wbr>2&d=DwICaQ&c=jf_iaSHvJObTbx-<wbr>siA1ZOg&r=LzkOghL3x_3V_<wbr>EkHiUQJUJ0xrq_s3_wwfssFT35AQXw<wbr>&m=Gk0KnGVKy2iC82jVgpSqjzR2K_E<wbr>YRlsFBqs34EjyKys&s=U6M9vpoDDmN<wbr>bTJJH5I6M6lPBGFzS1nuqYEEGwXjAv<wbr>iY&e=</a> ).<br>
<br>
Summary of the requirements that we came across relevant to this discussion.<br>
<br>
1) BMC telemetry data (for example, VRM rail voltages), collected at rates that depend on the data and aggregated by the BMC app (minimum, maximum<br>
and average). The metrics are logged at the requested collection frequency, so that the user can fetch them for analytics.<br>
<br>
2) Users should be able to set thresholds for the temperature limits, and receive alerts. This would allow the user to plan the cooling needs.<br>
<br>
3) The BMC would act as a route for the OCC metrics to be sent to the user. The OCC would send telemetry data down to the BMC, and the BMC should figure out a way to<br>
alert the user to consume this data.<br>
<br>
<br>
We would keep the focus of the discussion on requirement 1.<br>
This proposal presupposes that all the resources (for example, VRM rail voltages, ambient temperature) that the telemetry app is interested in are populated as D-Bus objects, which can<br>
be queried to read the instantaneous values. The phosphor-hwmon application exposes many of the resources of interest.<br>
<br>
The idea is a YAML-based approach in which the policy of the telemetry app is expressed. The application would consume the YAML and start the telemetry<br>
data collection. The YAML would express the following:<br>
<br>
a) D-Bus info (object, interface, property) associated with the resource.<br>
b) Units associated with the value (e.g. Celsius) and the associated scaling factor.<br>
c) Granularity - the time between two measures.<br>
d) Aggregation methods - min, max, avg, etc.<br>
e) Logging policy - frequency for creating an event and alerting the user.<br>
<br>
The application would operate based on the policy and log the telemetry data. The details of logging would evolve as we progress on the related issue.<br>
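<br>
To make items (a) to (e) concrete, here is a minimal sketch of what one policy entry might look like once the YAML has been loaded into a plain dict. All field names are hypothetical, and the D-Bus object, interface and property shown are assumptions for illustration, not part of the proposal.<br>
<pre>
```python
from dataclasses import dataclass

SUPPORTED_AGGREGATIONS = ("min", "max", "avg")


@dataclass(frozen=True)
class TelemetryPolicy:
    """One policy entry; field names are illustrative assumptions."""
    dbus_object: str       # (a) D-Bus object path
    dbus_interface: str    # (a) D-Bus interface
    dbus_property: str     # (a) D-Bus property holding the reading
    unit: str              # (b) unit of the value
    scale: int             # (b) power-of-ten scaling factor
    granularity_s: float   # (c) seconds between two measures
    aggregations: tuple    # (d) subset of SUPPORTED_AGGREGATIONS
    log_every_s: float     # (e) seconds between logged aggregates


def parse_policy(raw):
    """Validate one entry given as a plain dict (e.g. loaded from YAML)."""
    unknown = [a for a in raw["aggregations"]
               if a not in SUPPORTED_AGGREGATIONS]
    if unknown:
        raise ValueError(f"unsupported aggregations: {unknown}")
    if not raw["granularity_s"] > 0:
        raise ValueError("granularity_s must be positive")
    if not raw["log_every_s"] >= raw["granularity_s"]:
        raise ValueError("log_every_s must be at least granularity_s")
    return TelemetryPolicy(
        dbus_object=raw["dbus"]["object"],
        dbus_interface=raw["dbus"]["interface"],
        dbus_property=raw["dbus"]["property"],
        unit=raw["unit"],
        scale=raw["scale"],
        granularity_s=raw["granularity_s"],
        aggregations=tuple(raw["aggregations"]),
        log_every_s=raw["log_every_s"],
    )


# Example entry; the D-Bus names below are assumed for illustration.
EXAMPLE_ENTRY = {
    "dbus": {
        "object": "/xyz/openbmc_project/sensors/temperature/ambient",
        "interface": "xyz.openbmc_project.Sensor.Value",
        "property": "Value",
    },
    "unit": "Celsius",
    "scale": -3,
    "granularity_s": 5,
    "aggregations": ["min", "max", "avg"],
    "log_every_s": 300,
}

policy = parse_policy(EXAMPLE_ENTRY)
# Number of measures that go into each logged aggregate.
samples_per_log = round(policy.log_every_s / policy.granularity_s)
```
</pre>
Validating the entry up front means a typo in the policy is reported when the app starts rather than when the first aggregate is due.<br>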
<br>
Regards,<br>
Tom<br>
</blockquote>
<br>
Hi,<br>
<br>
I'd like to bump this topic and add some more details. I'd like to discuss design proposals/directions for a couple of things:<br>
<br>
1) A short/mid term proposal for telemetry requirements specific to IBM labs (which need to be implemented in a relatively short span of time, so there may not be the bandwidth to write an entirely new application not based on D-Bus or the OpenBMC REST API).<br>
2) Industry standard methods for storing and retrieving telemetry data - thoughts on how to get here.<br>
<br>
<br>
1) Telemetry requirements specific to IBM labs<br>
Here are the requirements and a design proposal.<br>
<br>
a) Instantaneous readings, such as temperatures, currents, errors, events etc. Let's call this Layer 0.<br>
<br>
Proposal:<br>
- The D-Bus model is the source for instantaneous readings. This means there would be D-Bus objects representing this data, and hence an OpenBMC REST API around it.<br>
- These D-Bus objects would not necessarily implement the same D-Bus interfaces.<br>
- Interested clients can read these D-Bus objects via the OpenBMC REST API.<br>
- If clients are interested in being notified about "changes" to the readings, that's possible via the existing event notification over WebSockets mechanism.<br></blockquote><div><br></div><div>This would also map well into an OBMC MIB extension for example.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<br>
b) Instantaneous aggregations - this would mostly apply to, but may not be limited to, readings such as temperatures and currents. Let's call this Layer 1. This is basically to answer questions such as "what is the min/max/average over the last X seconds?". We have a requirement to do such aggregations on the BMC.<br></blockquote><div><br></div><div>I would be interested in why aggregations (and historical snapshots, Layer 2) are a requirement and not just handled by the monitoring/event management app, as is done in network management. If this work is to be done in the BMC, it needs to be user-definable and able to be turned off in resource-critical situations.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Proposal:<br>
- Aggregations are represented as D-Bus objects, created by a telemetry app. For example, if we need to know the min/max/avg ambient temp for the last 5 minutes, and the ambient temp is usually at temps/ambient, the aggregation could be at temps/aggregations/ambient.<br>
- Implement D-Bus interfaces to denote aggregations. For example, the temps/aggregations/ambient object could implement a D-Bus interface describing min/max/avg properties.<br>
- Aggregation objects will have the values as described in the D-Bus interface (such as min/max/avg), and a timestamp, as properties.<br>
- Enable a config (e.g. JSON) to tell the telemetry app things like: What (supported) aggregations should be performed (min/max/avg)? What D-Bus objects should be aggregated? How frequently should they be aggregated? What should be the paths of the aggregations? Potentially add a REST API to allow changing the (JSON) config at run-time.<br>
- It will be possible to read all aggregation objects, or aggregation objects of a specific type via one REST call.<br>
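<br>
As a rough illustration of the Layer 1 idea, the following sketch keeps the samples seen in the last X seconds and derives min/max/avg plus a timestamp from them, which is what an aggregation object such as temps/aggregations/ambient would expose as properties. The class and method names are hypothetical, and how readings arrive from D-Bus is left out.<br>
<pre>
```python
from collections import deque


class WindowAggregator:
    """Min/max/avg over the samples seen in the last window_s seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def _expire(self, now):
        # Drop samples that are older than the window.
        while self._samples and now - self._samples[0][0] > self.window_s:
            self._samples.popleft()

    def add(self, timestamp, value):
        """Record one instantaneous (Layer 0) reading."""
        self._samples.append((timestamp, value))
        self._expire(timestamp)

    def snapshot(self, now):
        """Return the aggregation properties, or None if no samples remain."""
        self._expire(now)
        values = [value for _, value in self._samples]
        if not values:
            return None
        return {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "timestamp": now,
        }


# Ambient temperature over a 5 minute (300 s) window.
ambient = WindowAggregator(window_s=300)
for t, reading in [(0, 24.0), (100, 26.0), (200, 25.0), (400, 27.0)]:
    ambient.add(t, reading)
result = ambient.snapshot(400)  # the reading taken at t=0 has expired
```
</pre>
Keeping the raw samples costs memory in proportion to window length divided by sampling period, which is worth weighing against the BMC's resources when choosing how frequently to aggregate.<br>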
<br>
<br>
c) Historical aggregations or snapshot. Let's call this Layer 2. This is to address needs such as "a reading corresponding to every X minutes in a period of Y hours". This can be a snapshot of Layer 1 or Layer 0 D-Bus objects. We have a requirement to store this snapshot on the BMC.<br>
<br>
Proposal:<br>
- The snapshot will be represented as a set of D-Bus objects. For example, if one needs an hourly reading for a period of 24 hours, the objects could be at temps/aggregations/ambient/per-hour/{1..24}.<br>
- Enable a config to tell a telemetry app things like: What D-Bus objects should I keep a history of? What is the duration of the snapshot? At what frequency should entries be added into the snapshot? Once the snapshot is full, should the entries roll, or should we restart? Potentially add a REST API to allow changing the (JSON) config at run-time.<br>
- The historical aggregations can be read via one REST call. For the REST server it would most likely be one D-Bus call as well, if there's an object manager at temps/aggregations/ambient/per-hour, for example.<br>
- These objects in the snapshot will implement the same interfaces as Layer 1 objects, so they will have the same properties (eg min/max/avg, timestamp).<br>
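<br>
The "roll or restart" question above can be sketched as a fixed-size history. This is only an illustration under assumed names (SnapshotHistory and its methods are hypothetical); the entries stand in for Layer 1 style objects with min/max/avg and timestamp properties.<br>
<pre>
```python
class SnapshotHistory:
    """Fixed-size history of readings for a Layer 2 style snapshot."""

    def __init__(self, capacity, roll=True):
        self.capacity = capacity  # e.g. 24 entries for an hourly, 24 hour snapshot
        self.roll = roll          # True: roll entries; False: restart when full
        self.entries = []         # oldest entry first

    def append(self, entry):
        if len(self.entries) == self.capacity:
            if self.roll:
                self.entries.pop(0)   # roll: discard only the oldest entry
            else:
                self.entries.clear()  # restart: begin a fresh snapshot
        self.entries.append(entry)

    def paths(self, base):
        """Map numbered object paths (base/1, base/2, ...) to entries."""
        return {f"{base}/{i}": entry
                for i, entry in enumerate(self.entries, start=1)}


rolling = SnapshotHistory(capacity=3, roll=True)
restarting = SnapshotHistory(capacity=3, roll=False)
for reading in [1, 2, 3, 4]:
    rolling.append(reading)
    restarting.append(reading)

# rolling keeps the three newest readings; restarting holds only the
# reading that arrived after the snapshot filled up.
rolling_paths = rolling.paths("temps/aggregations/ambient/per-hour")
```
</pre>
With rolling, the object at a given index changes meaning on every update once the snapshot is full, so clients would need the timestamp property to order entries reliably.<br>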
<br>
<br>
d) Some notes<br>
- With the proposal above, the API to retrieve the telemetry data is the current OpenBMC REST API, so it may not readily work with telemetry tools relying on industry-standard APIs (see point 2 below), but it seems to be the feasible option for implementing IBM's requirements in the expected timelines.<br>
- Layer 1 and Layer 2 telemetry apps would be different processes, and can function independently of each other.<br>
<br>
<br>
<br>
2) Industry standard methods for storing and retrieving telemetry data<br>
<br>
- With the proposal above, the instantaneous readings are D-Bus objects, the instantaneous and historical aggregations are D-Bus objects as well. The API is the OpenBMC REST API.<br>
- Typically, aggregations may not have to happen on the BMC, in which case one can turn off layers 1 and 2.<br>
- This is regarding how the telemetry data is presented, and how we'd eventually not use the current OpenBMC REST API in production. I've heard (mostly from people on the To: list) of the following industry-standard ways to represent/retrieve telemetry data. This would mean transforming layer 0 D-Bus objects into these :<br>
- Via Redfish (events) API<br>
- Via IPMI events/PEF<br></blockquote><div><br></div><div>Meh. I'd stick with Redfish/OBMC REST API over this one.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
- Via SNMP traps <br></blockquote><div><br></div><div>If there is interest here, I have experience designing MIB extensions and sub-agents to support them.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
- Via an sqlite db, and have something like Logstash parse it<br></blockquote><div><br></div><div>Seems very heavy for BMC.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
- Others?<br>
Discussions are already happening regarding Redfish, so telemetry could be one aspect to consider as well.<br>
- Aggregations could be done on the BMC with collectd. I need to look at this in detail. Aggregations could be stored in an RRD format. I need to look at this in detail as well. These are as opposed to a D-Bus model of aggregations. Thoughts on this? For example, would this be much less work both for the BMC and the telemetry data users than the proposed D-Bus model, while still addressing the requirements I've mentioned? Do we know what the commonly used client tools for processing telemetry data are, and how they expect the data to be presented?<br>
<br>
<br>
<br>
Thanks,<br>
Deepak<br>
<br>
</blockquote></div><br></div><div class="gmail_extra">Kurt Taylor (krtaylor)<br><br></div></div>