Implementing BMC Health Monitoring

Thu May 21 20:47:56 AEST 2020

Hi, 

The proposal seems interesting. From what I've read from your e-mail, you are looking the best way to implement BMC health metrics. My proposal would be to expose these metrics as D-Bus sensors with an option to store data to the filesystem. Such solution will ease the integration with Redfish and support these metrics by the Monitoring Service (https://github.com/openbmc/docs/blob/master/designs/telemetry.md) . This way, you have support for collecting metrics into metric report, you have support of simple operations, like min/max/average/sum. Also, using metric reports, you can store historical readings and stream the metric reports as events. 
Piotr Matuszczak
---------------------------------------------------------------------
Intel Technology Poland sp. z o.o. 
ul. Slowackiego 173, 80-298 Gdansk
KRS 101882
NIP 957-07-52-316

From: openbmc <openbmc-bounces+piotr.matuszczak=intel.com at lists.ozlabs.org> On Behalf Of Sui Chen
Sent: Thursday, May 21, 2020 3:37 AM
To: openbmc at lists.ozlabs.org
Subject: Implementing BMC Health Monitoring

Hello OpenBMC Mailing List,

It is great to see the proposal on BMC health monitoring! We have similar efforts in health monitoring in progress, started doing some implementation, and would like to share some thoughts with the Mailing List to help get BMC health monitoring started:

(1) What metrics have we considered now?

We have considered the following metrics on the BMC:
  - Memory usage
  - Number of open file descriptors
  - Free storage space in the read-write file system
  - List of processes
  - CPU time for a few top processes

  Some of these are inspired by various profilers, and some others are expected to be relevant to the typical workloads running on the BMC.

(2) Overall, it appears the design space for health monitoring has the following dimensions:

a) A method to do the collection, which might be:
  - Running a program like "df" to get free disk space
  - Traversing some folder to compute some statistics
  - Monitor some bus for some time and generate some result
  - or something else

  The collection process might vary from metric to metric, and can take some time to complete on the BMC, and therefore, the results need to be staged somewhere and made accessible when it's completed, so the requestor won't have to busy-wait.

b) A way to stage monitoring data on the BMC, which might be:
  - Files or databases in DRAM or some persistent store.
  - DBus objects, as described in Vijay's document; this is similar to how sensors work.
  - IPMI Blobs (this is what we have implemented right now)
  - or something else

c) A way to transfer monitoring data out of the BMC, which might be:
  - scp
  - RedFish
  - IPMI (this is what we're using right now)
  - or something else

d) Format of staged data:
  - Raw bytes
  - Protocol buffers
  - JSON objects
  - or something else
  - The data may be compressed to save transfer time

e) A way to consume the health monitoring data:
  - The BMC might do some pre-processing, like windowed average.
  - The BMC may perform certain corrective measures when metrics appear abnormal.
  - The host may perform certain corrective measures when metrics appear abnormal.
  - BMC health data might be plugged into some already existing monitoring framework overseeing a large number of machines, collecting historical data, and projecting future trends, etc.

f) A way to configure the health monitoring system:
  - Configuration for which metrics are collected
  - Configuration for the choice of staging in b), way of transfer in c), and frequency of collection in e)
  - Some configurations may be build-time and some may be run-time
     - I guess we can draw some inspirations from phosphor-ipmi-blobs

(3) The requirements and performance/storage impacts on the BMC:

a) The collection should not be too taxing on the processing/storage resources on the BMC

b) The data transfer process should not be too taxing on the link between the host and BMC
  - For the metrics we have and the IPMI connection we're using so far, it took around 10 ~ 100ms for the host to collect a metric. The time is dominated by IPMI transfer time. The time is considered acceptable if a metric is collected at a reasonably long interval, say, every 30 minutes.

We hope the above contents help complement the existing design proposal, and would like to help actually start implementing (and deploying) health monitoring for the BMC.
The question is: we're working on our implementation and we're wondering what would be a good time for us to send it for review? Do we need to support both what we have now and what is being proposed?

Thanks!
Sui