Implementing BMC Health Monitoring

Thu May 21 11:36:47 AEST 2020

Hello OpenBMC Mailing List,

It is great to see the proposal on BMC health monitoring! We have similar
efforts in health monitoring in progress, started doing some
implementation, and would like to share some thoughts with the Mailing List
to help get BMC health monitoring started:

(1) What metrics have we considered now?

We have considered the following metrics on the BMC:
  - Memory usage
  - Number of open file descriptors
  - Free storage space in the read-write file system
  - List of processes
  - CPU time for a few top processes

  Some of these are inspired by various profilers, and some others are
expected to be relevant to the typical workloads running on the BMC.

(2) Overall, it appears the design space for health monitoring has the
following dimensions:

a) A method to do the collection, which might be:
  - Running a program like "df" to get free disk space
  - Traversing some folder to compute some statistics
  - Monitor some bus for some time and generate some result
  - or something else

  The collection process might vary from metric to metric, and can take
some time to complete on the BMC, and therefore, the results need to be
staged somewhere and made accessible when it's completed, so the requestor
won't have to busy-wait.

b) A way to stage monitoring data on the BMC, which might be:
  - Files or databases in DRAM or some persistent store.
  - DBus objects, as described in Vijay's document; this is similar to how
sensors work.
  - IPMI Blobs (this is what we have implemented right now)
  - or something else

c) A way to transfer monitoring data out of the BMC, which might be:
  - scp
  - RedFish
  - IPMI (this is what we're using right now)
  - or something else

d) Format of staged data:
  - Raw bytes
  - Protocol buffers
  - JSON objects
  - or something else
  - The data may be compressed to save transfer time

e) A way to consume the health monitoring data:
  - The BMC might do some pre-processing, like windowed average.
  - The BMC may perform certain corrective measures when metrics appear
abnormal.
  - The host may perform certain corrective measures when metrics appear
abnormal.
  - BMC health data might be plugged into some already existing monitoring
framework overseeing a large number of machines, collecting historical
data, and projecting future trends, etc.

f) A way to configure the health monitoring system:
  - Configuration for which metrics are collected
  - Configuration for the choice of staging in b), way of transfer in c),
and frequency of collection in e)
  - Some configurations may be build-time and some may be run-time
     - I guess we can draw some inspirations from phosphor-ipmi-blobs

(3) The requirements and performance/storage impacts on the BMC:

a) The collection should not be too taxing on the processing/storage
resources on the BMC

b) The data transfer process should not be too taxing on the link between
the host and BMC
  - For the metrics we have and the IPMI connection we're using so far, it
took around 10 ~ 100ms for the host to collect a metric. The time is
dominated by IPMI transfer time. The time is considered acceptable if a
metric is collected at a reasonably long interval, say, every 30 minutes.

We hope the above contents help complement the existing design proposal,
and would like to help actually start implementing (and deploying) health
monitoring for the BMC.
The question is: we're working on our implementation and we're wondering
what would be a good time for us to send it for review? Do we need to
support both what we have now and what is being proposed?

Thanks!
Sui
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ozlabs.org/pipermail/openbmc/attachments/20200520/0034729d/attachment.htm>