Implementing BMC Health Monitoring

Vijay Khemka vijaykhemka at fb.com
Wed May 27 04:50:23 AEST 2020


Adrian,
I agree, and I will discuss with Sui how to merge these into one proposal.

Regards
-Vijay

On 5/25/20, 5:32 AM, "Adrian Ambrożewicz" <adrian.ambrozewicz at linux.intel.com> wrote:

    @Brad, @Vijay
    
    It seems Sui is proposing something closely related to the already discussed 
    https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/31957. As a matter 
    of fact, making such system metrics available is also highly desirable 
    on our (Intel) side. It seems we need to merge all the requirements to 
    satisfy the common needs.
    
    Regards,
    Adrian
    
    On 5/22/2020 at 10:43, Adrian Ambrożewicz wrote:
    > I suppose I could back up Piotr here.
    > 
    > I believe that, in general, EntityManager could be leveraged for 
    > configuration (enabling, disabling, and configuring metrics). The 
    > dbus-sensors infrastructure would be beneficial in terms of:
    > - familiarity (it is already used for monitoring physical sensors, 
    > with new synthesized sensors to come)
    > - flexibility (EntityManager could provide runtime configuration of the 
    > metrics in the system)
    > - availability (both configuration and metrics would be exposed over 
    > D-Bus interfaces, an easy-to-consume and standardized way)
    > 
    > If dbus-sensors were used, the feature mentioned by Piotr 
    > (TelemetryService) could support storing and exposing metric 
    > snapshots, sending them to external databases (EventService), etc. 
    > almost out of the box.
    > 
    > Of course, dbus-sensors (xyz.openbmc_project.Sensor.Value) could be only 
    > one of the interfaces for the data, so it does not limit any of the 
    > other use cases you've mentioned.
    > 
    > Regards,
    > Adrian
    > 
    > On 5/21/2020 at 12:47, Matuszczak, Piotr wrote:
    >> Hi,
    >>
    >> The proposal seems interesting. From what I've read in your e-mail, 
    >> you are looking for the best way to implement BMC health metrics. My 
    >> proposal would be to expose these metrics as D-Bus sensors, with an 
    >> option to store data to the filesystem. Such a solution would ease 
    >> integration with Redfish and let the Monitoring Service 
    >> (https://github.com/openbmc/docs/blob/master/designs/telemetry.md) 
    >> support these metrics. This way, you get support for collecting 
    >> metrics into metric reports and for simple operations like 
    >> min/max/average/sum. Also, using metric reports, you can store 
    >> historical readings and stream the metric reports as events.
    >> Piotr Matuszczak
    >> ---------------------------------------------------------------------
    >> Intel Technology Poland sp. z o.o.
    >> ul. Slowackiego 173, 80-298 Gdansk
    >> KRS 101882
    >> NIP 957-07-52-316
    >>
    >> From: openbmc 
    >> <openbmc-bounces+piotr.matuszczak=intel.com at lists.ozlabs.org> On 
    >> Behalf Of Sui Chen
    >> Sent: Thursday, May 21, 2020 3:37 AM
    >> To: openbmc at lists.ozlabs.org
    >> Subject: Implementing BMC Health Monitoring
    >>
    >> Hello OpenBMC Mailing List,
    >>
    >> It is great to see the proposal on BMC health monitoring! We have 
    >> similar health monitoring efforts in progress and have started some 
    >> implementation. We would like to share some thoughts with the mailing 
    >> list to help get BMC health monitoring started:
    >>
    >> (1) What metrics have we considered now?
    >>
    >> We have considered the following metrics on the BMC:
    >>    - Memory usage
    >>    - Number of open file descriptors
    >>    - Free storage space in the read-write file system
    >>    - List of processes
    >>    - CPU time for a few top processes
    >>    Some of these are inspired by various profilers, and some others 
    >> are expected to be relevant to the typical workloads running on the BMC.
    >>
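As a rough illustration of the metrics listed above, here is a minimal Python sketch, assuming a Linux BMC with a standard /proc layout. The function names and the sample text are hypothetical, and "used memory" is taken here as MemTotal minus MemAvailable, which is only one possible definition.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_kib(meminfo):
    """Used memory, assumed here to be MemTotal - MemAvailable."""
    return meminfo["MemTotal"] - meminfo["MemAvailable"]


def open_fd_count(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def free_bytes(path):
    """Free space on the file system that holds path."""
    return shutil.disk_usage(path).free


# Made-up sample in the /proc/meminfo format, for illustration only.
SAMPLE = """MemTotal:         498252 kB
MemFree:          301244 kB
MemAvailable:     402100 kB
"""

if __name__ == "__main__":
    print(memory_used_kib(parse_meminfo(SAMPLE)))
```

On a real system, parse_meminfo would be fed the contents of /proc/meminfo, and free_bytes would be pointed at the mount point of the read-write file system.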
    >> (2) Overall, it appears the design space for health monitoring has the 
    >> following dimensions:
    >>
    >> a) A method to do the collection, which might be:
    >>    - Running a program like "df" to get free disk space
    >>    - Traversing some folder to compute some statistics
    >>    - Monitoring some bus for some time and generating some result
    >>    - or something else
    >>    The collection process might vary from metric to metric and can 
    >> take some time to complete on the BMC. Therefore, the results need 
    >> to be staged somewhere and made accessible once collection 
    >> completes, so the requestor won't have to busy-wait.
    >>
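The idea of staging results so that the requestor does not wait on a slow collection could be sketched as follows. This is only an in-process illustration using a background thread; StagedMetric and its methods are hypothetical names, and a real implementation would stage the data in one of the places discussed in b).

```python
import threading
import time


class StagedMetric:
    """Runs a slow collector in the background and stages its latest result."""

    def __init__(self, collect, interval_s):
        self._collect = collect
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._latest = None  # (timestamp, value), or None before the first run
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        """Stop the collector; call only after start()."""
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            value = self._collect()  # may take a long time
            with self._lock:
                self._latest = (time.time(), value)
            # Sleep until the next round, waking early if stop() is called.
            self._stop.wait(self._interval_s)

    def latest(self):
        """Return the staged (timestamp, value) at once, without collecting."""
        with self._lock:
            return self._latest
```

A requestor calls latest() and gets whatever was staged most recently, together with its timestamp, so it can judge how stale the value is.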
    >> b) A way to stage monitoring data on the BMC, which might be:
    >>    - Files or databases in DRAM or some persistent store.
    >>    - DBus objects, as described in Vijay's document; this is similar 
    >> to how sensors work.
    >>    - IPMI Blobs (this is what we have implemented right now)
    >>    - or something else
    >> c) A way to transfer monitoring data out of the BMC, which might be:
    >>    - scp
    >>    - Redfish
    >>    - IPMI (this is what we're using right now)
    >>    - or something else
    >> d) Format of staged data:
    >>    - Raw bytes
    >>    - Protocol buffers
    >>    - JSON objects
    >>    - or something else
    >>    - The data may be compressed to save transfer time
    >> e) A way to consume the health monitoring data:
    >>    - The BMC might do some pre-processing, like windowed average.
    >>    - The BMC may perform certain corrective measures when metrics 
    >> appear abnormal.
    >>    - The host may perform certain corrective measures when metrics 
    >> appear abnormal.
    >>    - BMC health data might be plugged into some already existing 
    >> monitoring framework overseeing a large number of machines, collecting 
    >> historical data, and projecting future trends, etc.
    >>
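The windowed-average pre-processing mentioned in e) might look like the sketch below. WindowedAverage and is_abnormal are hypothetical names, and the simple "average above a limit" rule is an assumption chosen for illustration, not something the thread specifies.

```python
from collections import deque


class WindowedAverage:
    """Average over the most recent `size` samples."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        # A bounded deque drops the oldest sample automatically.
        self._samples = deque(maxlen=size)

    def add(self, value):
        """Record a sample and return the current windowed average."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


def is_abnormal(average, limit):
    """Assumed rule: the metric is abnormal when its average exceeds limit."""
    return average > limit
```

Averaging over a window keeps a single spike from triggering a corrective measure, at the cost of reacting more slowly to a sustained change.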
    >> f) A way to configure the health monitoring system:
    >>    - Configuration for which metrics are collected
    >>    - Configuration for the choice of staging in b), way of transfer in 
    >> c), and frequency of collection in a)
    >>    - Some configurations may be build-time and some may be run-time
    >>       - I guess we can draw some inspiration from phosphor-ipmi-blobs
    >>
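To make one of the format options in d) concrete, here is a minimal sketch of staging a snapshot as JSON with optional compression. The function names and the metric keys in the usage line are hypothetical; zlib is just one possible compressor.

```python
import json
import zlib


def encode_snapshot(metrics, compress=True):
    """Serialize a metrics dict to compact JSON bytes, optionally compressed."""
    raw = json.dumps(metrics, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw) if compress else raw


def decode_snapshot(blob, compressed=True):
    """Reverse encode_snapshot; `compressed` must match what the sender used."""
    raw = zlib.decompress(blob) if compressed else blob
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    blob = encode_snapshot({"mem_used_kib": 96152, "open_fds": 37})
    print(decode_snapshot(blob))
```

For very small snapshots the compressed form can be larger than the raw JSON, so whether compression saves transfer time depends on the payload size.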
    >> (3) The requirements and performance/storage impacts on the BMC:
    >>
    >> a) The collection should not be too taxing on the processing/storage 
    >> resources on the BMC
    >>
    >> b) The data transfer process should not be too taxing on the link 
    >> between the host and BMC
    >>    - For the metrics we have and the IPMI connection we're using so 
    >> far, it took around 10 to 100 ms for the host to collect a metric, 
    >> and that time is dominated by IPMI transfer time. This is considered 
    >> acceptable if a metric is collected at a reasonably long interval, 
    >> say, every 30 minutes.
    >>
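The figures quoted above can be turned into a back-of-the-envelope estimate of how busy the host-BMC link would be. The sketch assumes one transfer per metric per interval and uses the five metrics listed in (1) with the worst-case 100 ms figure; link_busy_fraction is a hypothetical helper.

```python
def link_busy_fraction(per_metric_s, metric_count, interval_s):
    """Fraction of each interval that the link spends transferring metrics."""
    return (per_metric_s * metric_count) / interval_s


if __name__ == "__main__":
    # 5 metrics at 100 ms each, collected every 30 minutes.
    fraction = link_busy_fraction(0.1, 5, 30 * 60)
    print(f"{fraction:.4%}")
```

Under these assumptions the link is occupied for 0.5 s out of every 1800 s, which is roughly 0.03% of the time.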
    >> We hope the above contents help complement the existing design 
    >> proposal, and would like to help actually start implementing (and 
    >> deploying) health monitoring for the BMC.
    >> The question is: we're working on our implementation and we're 
    >> wondering what would be a good time for us to send it for review? Do 
    >> we need to support both what we have now and what is being proposed?
    >>
    >> Thanks!
    >> Sui
    >>
    


