Implementing BMC Health Monitoring
Adrian Ambrożewicz
adrian.ambrozewicz at linux.intel.com
Mon May 25 22:32:04 AEST 2020
@Brad, @Vijay
It seems Sui is proposing something closely related to the already discussed
https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/31957 . As a matter
of fact, the availability of such system metrics is also highly
desirable on our (Intel) side. It seems we need to merge all the
requirements to satisfy the common needs.
Regards,
Adrian
On 5/22/2020 at 10:43, Adrian Ambrożewicz wrote:
> I suppose I could back up Piotr here.
>
> I believe that in general EntityManager could be leveraged for
> configuration (enabling/disabling metrics and configuring them).
> The dbus-sensors infrastructure would be beneficial in terms of:
> - familiarity (already used for monitoring physical sensors, new
> synthesized sensors to come)
> - flexibility (EntityManager could provide runtime configuration of the
> metrics in the system)
> - availability (both configuration and metrics would be exposed through
> D-Bus interfaces in an easy-to-consume and 'standardized' way)
>
> If dbus-sensors were used, then the feature mentioned by Piotr
> (TelemetryService) could almost 'out of the box' support storing and
> exposing metrics snapshots, sending them to external databases
> (EventService) etc.
>
> Of course dbus-sensors (xyz.openbmc_project.Sensor.Value) could be only
> one of the interfaces for the data, so it does not limit any of the
> other use cases you've mentioned.
>
> Regards,
> Adrian
>
> On 5/21/2020 at 12:47, Matuszczak, Piotr wrote:
>> Hi,
>>
>> The proposal seems interesting. From what I've read in your e-mail,
>> you are looking for the best way to implement BMC health metrics. My
>> proposal would be to expose these metrics as D-Bus sensors with an
>> option to store data to the filesystem. Such a solution would ease
>> integration with Redfish and allow the Monitoring Service
>> (https://github.com/openbmc/docs/blob/master/designs/telemetry.md)
>> to support these metrics.
>> This way, you get support for collecting metrics into a metric report
>> and for simple operations like min/max/average/sum. Also, using
>> metric reports, you can store historical readings and stream the
>> metric reports as events.
>> Piotr Matuszczak
>> ---------------------------------------------------------------------
>> Intel Technology Poland sp. z o.o.
>> ul. Slowackiego 173, 80-298 Gdansk
>> KRS 101882
>> NIP 957-07-52-316
>>
>> From: openbmc
>> <openbmc-bounces+piotr.matuszczak=intel.com at lists.ozlabs.org> On
>> Behalf Of Sui Chen
>> Sent: Thursday, May 21, 2020 3:37 AM
>> To: openbmc at lists.ozlabs.org
>> Subject: Implementing BMC Health Monitoring
>>
>> Hello OpenBMC Mailing List,
>>
>> It is great to see the proposal on BMC health monitoring! We have
>> similar health monitoring efforts in progress, have started some
>> implementation, and would like to share some thoughts with the Mailing
>> List to help get BMC health monitoring started:
>>
>> (1) What metrics have we considered now?
>>
>> We have considered the following metrics on the BMC:
>> - Memory usage
>> - Number of open file descriptors
>> - Free storage space in the read-write file system
>> - List of processes
>> - CPU time for a few top processes
>> Some of these are inspired by various profilers, and some others
>> are expected to be relevant to the typical workloads running on the BMC.
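
Three of the metrics listed above (memory usage, open file descriptors, free storage space) can be read on a Linux-based BMC from /proc and the filesystem. The following is only a minimal sketch, not the implementation discussed in this thread: the function names are hypothetical, and "used memory" is assumed here to mean MemTotal minus MemAvailable, which is one of several possible definitions.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_used_kb(meminfo):
    """Assumed definition of used memory: MemTotal - MemAvailable."""
    return meminfo["MemTotal"] - meminfo["MemAvailable"]


def free_storage_bytes(path):
    """Free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def open_fd_count(pid="self"):
    """Count entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

On a real system the text passed to `parse_meminfo` would come from reading /proc/meminfo, and `path` would be the mount point of the read-write file system.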
>>
>> (2) Overall, it appears the design space for health monitoring has the
>> following dimensions:
>>
>> a) A method to do the collection, which might be:
>> - Running a program like "df" to get free disk space
>> - Traversing some folder to compute some statistics
>> - Monitoring some bus for some time and generating some result
>> - or something else
>> The collection process might vary from metric to metric, and can
>> take some time to complete on the BMC. Therefore, the results
>> need to be staged somewhere and made accessible once collection has
>> completed, so the requester won't have to busy-wait.
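
The idea of staging results so that the requester never waits on a slow collection can be sketched as follows. This is an illustration under assumptions, not the IPMI-blob mechanism mentioned in this thread: the class name `StagedMetric` and its methods are hypothetical, and the staging area is just an in-memory value guarded by a lock.

```python
import threading
import time


class StagedMetric:
    """Runs a slow collector in a background thread and stages its latest result."""

    def __init__(self, collect):
        self._collect = collect      # callable that performs the (possibly slow) collection
        self._lock = threading.Lock()
        self._value = None
        self._timestamp = None

    def refresh(self):
        """Start one collection in the background and return the worker thread."""
        thread = threading.Thread(target=self._run, daemon=True)
        thread.start()
        return thread

    def _run(self):
        value = self._collect()      # may take a long time; the lock is not held here
        with self._lock:
            self._value = value
            self._timestamp = time.time()

    def latest(self):
        """Return (value, timestamp) immediately; (None, None) if nothing is staged yet."""
        with self._lock:
            return self._value, self._timestamp
```

A requester calls `latest()` and gets whatever was staged last, while `refresh()` is triggered separately, for example from a periodic timer.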
>>
>> b) A way to stage monitoring data on the BMC, which might be:
>> - Files or databases in DRAM or some persistent store.
>> - DBus objects, as described in Vijay's document; this is similar
>> to how sensors work.
>> - IPMI Blobs (this is what we have implemented right now)
>> - or something else
>> c) A way to transfer monitoring data out of the BMC, which might be:
>> - scp
>> - Redfish
>> - IPMI (this is what we're using right now)
>> - or something else
>> d) Format of staged data:
>> - Raw bytes
>> - Protocol buffers
>> - JSON objects
>> - or something else
>> - The data may be compressed to save transfer time
>> e) A way to consume the health monitoring data:
>> - The BMC might do some pre-processing, like windowed average.
>> - The BMC may perform certain corrective measures when metrics
>> appear abnormal.
>> - The host may perform certain corrective measures when metrics
>> appear abnormal.
>> - BMC health data might be plugged into some already existing
>> monitoring framework overseeing a large number of machines, collecting
>> historical data, and projecting future trends, etc.
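
The windowed average mentioned as a possible pre-processing step on the BMC can be sketched as below. The class name and the choice of a fixed-size sample window are assumptions made for illustration; a time-based window would be an equally valid reading.

```python
from collections import deque


class WindowedAverage:
    """Average over the most recent `size` samples."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        # deque with maxlen drops the oldest sample automatically
        self._samples = deque(maxlen=size)

    def add(self, value):
        """Record a sample and return the average of the current window."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)
```

Until the window fills up, the average is taken over the samples seen so far.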
>>
>> f) A way to configure the health monitoring system:
>> - Configuration for which metrics are collected
>> - Configuration for the choice of staging in b), way of transfer in
>> c), and frequency of collection in e)
>> - Some configurations may be build-time and some may be run-time
>> - I guess we can draw some inspiration from phosphor-ipmi-blobs
>>
>> (3) The requirements and performance/storage impacts on the BMC:
>>
>> a) The collection should not be too taxing on the processing/storage
>> resources on the BMC
>>
>> b) The data transfer process should not be too taxing on the link
>> between the host and BMC
>> - For the metrics we have and the IPMI connection we're using so
>> far, it took around 10 ~ 100ms for the host to collect a metric. The
>> time is dominated by IPMI transfer time. The time is considered
>> acceptable if a metric is collected at a reasonably long interval,
>> say, every 30 minutes.
>>
>> We hope the above contents help complement the existing design
>> proposal, and would like to help actually start implementing (and
>> deploying) health monitoring for the BMC.
>> The question is: we're working on our implementation and we're
>> wondering what would be a good time for us to send it for review? Do
>> we need to support both what we have now and what is being proposed?
>>
>> Thanks!
>> Sui
>>