<div dir="ltr">Hello OpenBMC Mailing List,<br><br>It is great to see the proposal on BMC health monitoring! We have a similar health monitoring effort in progress and have started implementing it, so we would like to share some thoughts with the Mailing List to help get BMC health monitoring started:<br><br>(1) What metrics have we considered so far?<br><br>We have considered the following metrics on the BMC:<br>  - Memory usage<br>  - Number of open file descriptors<br>  - Free storage space in the read-write file system<br>  - List of processes<br>  - CPU time for a few top processes<br>  <br>  Some of these are inspired by various profilers, and others are expected to be relevant to the typical workloads running on the BMC.<br><br>(2) Overall, the design space for health monitoring appears to have the following dimensions:<br><br>a) A method to do the collection, which might be:<br>  - Running a program like "df" to get free disk space<br>  - Traversing some folder to compute some statistics<br>  - Monitoring some bus for some time and generating some result<div>  - or something else<br><div>  <br>  The collection process might vary from metric to metric and can take some time to complete on the BMC. The results therefore need to be staged somewhere and made accessible once collection completes, so the requestor won't have to busy-wait.<br><br>b) A way to stage monitoring data on the BMC, which might be:<br>  - Files or databases in DRAM or some persistent store.<br>  - DBus objects, as described in Vijay's document; this is similar to how sensors work.<br>  - IPMI Blobs (this is what we have implemented right now)<div>  - or something else<br>  <br>c) A way to transfer monitoring data out of the BMC, which might be:<br>  - scp<br>  - Redfish<br>  - IPMI (this is what we're using right now)<div>  - or something else<br>  <br>d) Format of staged data:<br>  - Raw bytes<br>  - Protocol buffers<br>  - JSON objects<br>  - or something else<br>  - The data may be 
compressed to save transfer time<br> <br>e) A way to consume the health monitoring data:<br>  - The BMC might do some pre-processing, like a windowed average.<br>  - The BMC may take corrective measures when metrics appear abnormal.<br>  - The host may take corrective measures when metrics appear abnormal.<br>  - BMC health data might be plugged into an existing monitoring framework that oversees a large number of machines, collects historical data, projects future trends, etc.<br><br>f) A way to configure the health monitoring system:<br>  - Configuration for which metrics are collected<br>  - Configuration for the choice of staging in b), way of transfer in c), and frequency of collection in e)<br>  - Some configurations may be build-time and some may be run-time<br>     - We can probably draw some inspiration from phosphor-ipmi-blobs<br><br>(3) The requirements and performance/storage impacts on the BMC:<br><br>a) The collection should not be too taxing on the processing/storage resources on the BMC<br><br>b) The data transfer process should not be too taxing on the link between the host and BMC<br>  - For the metrics we have and the IPMI connection we're using so far, it took roughly 10 to 100 ms for the host to collect a metric. The time is dominated by IPMI transfer time. We consider this acceptable if a metric is collected at a reasonably long interval, say, every 30 minutes.<br>  <br><br>We hope the above helps complement the existing design proposal, and we would like to help start implementing (and deploying) health monitoring for the BMC.<br>Our question is: we are still working on our implementation, so when would be a good time to send it for review? Do we need to support both what we have now and what is being proposed?<br><br>Thanks!<br>Sui<br></div></div></div></div></div>
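<div dir="ltr">P.S. To make the metrics in (1) a bit more concrete, here is a minimal Python sketch of how a few of them could be read on a Linux-based BMC. It is only an illustration and not the code we have implemented: the function names are made up for this example, it assumes procfs is mounted at /proc, it counts file descriptors for a single process only, and the "/" default passed to collect() is a placeholder for wherever the read-write file system is mounted.<br></div>
<pre>
```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}.

    Values are returned as reported by the kernel (kB for most fields,
    plain counts for a few such as HugePages_Total).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def count_open_fds(pid="self"):
    """Count entries under /proc/PID/fd (assumes Linux procfs)."""
    return len(os.listdir(os.path.join("/proc", str(pid), "fd")))


def free_bytes(path):
    """Free space, in bytes, on the file system that holds `path`."""
    return shutil.disk_usage(path).free


def collect(rw_path="/"):
    """Gather one sample; rw_path is a placeholder for the read-write mount."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    return {
        "mem_available_kb": mem.get("MemAvailable"),
        "open_fds": count_open_fds(),
        "rw_free_bytes": free_bytes(rw_path),
    }
```
</pre>
<div dir="ltr">Calling collect() on the BMC would return one sample as a dict, which could then be staged using any of the options in (2) b).<br></div>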