BMC health metrics (again!)

Andrew Geissler geissonator at gmail.com
Fri Apr 12 23:02:34 AEST 2019


On Tue, Apr 9, 2019 at 11:26 AM Kun Yi <kunyi at google.com> wrote:
>
> Hello there,
>
> This topic has been brought up several times on the mailing list and offline, but it seems we as a community haven't reached a consensus on what would be the most valuable things to monitor, or how to monitor them. While a general-purpose monitoring infrastructure for OpenBMC seems like a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.

I like it: start simple and we can build from there.

>
> 1. Monitoring host IPMI link reliability (host side)
>
> The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time; more metrics, like response time, would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to be able to tell from the IPMI command stats whether it is a hardware issue or an IPMI issue. Moreover, this would be a very useful regression-test metric for rolling out new BMC software.

Are you thinking this is mostly for out-of-band IPMI, or in-band as
well? I can't say I've looked into this much, but are there known
issues in this area? The only IPMI issues I've run into have usually
been when communicating with the host firmware; we've hit a variety
of race conditions and timeouts in that path.
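
To make the "sent vs. succeeded" ask concrete, here's a rough sketch
of the counters I'd picture, as Python pseudocode (the class and
method names are all made up, not an existing interface):

from collections import defaultdict

class IpmiLinkStats:
    """Per-interface counters for IPMI command reliability (hypothetical)."""

    def __init__(self):
        self.sent = defaultdict(int)        # commands issued
        self.succeeded = defaultdict(int)   # commands with a good response
        self.latency_total = defaultdict(float)

    def record(self, interface, ok, latency_s):
        # Called from wherever a command completes or times out.
        self.sent[interface] += 1
        if ok:
            self.succeeded[interface] += 1
            self.latency_total[interface] += latency_s

    def success_rate(self, interface):
        sent = self.sent[interface]
        return self.succeeded[interface] / sent if sent else None

Sampling success_rate() over time would give the regression signal
you mention.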

>
> Looking at the host IPMI side, there are some metrics exposed through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't dug into whether they contain information mapping to the interrupts. Time to read the source code, I guess.
>
> Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to the fragmentation of IPMI libraries.
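
On the si_stats point: pulling whatever counters are there should be
cheap. A minimal sketch, assuming the file uses simple "name: value"
lines (I haven't verified the actual format either):

def read_si_stats(path="/proc/ipmi/0/si_stats"):
    # Parse "name: value" counter lines into a dict (format assumed).
    stats = {}
    with open(path) as f:
        for line in f:
            name, _, value = line.partition(":")
            if value.strip().isdigit():
                stats[name.strip()] = int(value.strip())
    return stats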
>
> 2. Read and expose core BMC performance metrics from procfs
>
> This is straightforward: have a smallish daemon (or bmc-state-manager) read, parse, and process procfs and put the values on D-Bus. Core metrics I'm interested in getting this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.

Yes, I would definitely be interested in being able to look at these
things. I assume the sampling rate would be configurable? We could
build this into our CI images and collect the information for each
run. I'm not sure we'd use this in production due to the additional
resources it will consume, but I could see it being very useful in
lab/debug/CI areas.
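
For the sampling side, the parsing itself is tiny. A minimal sketch
in Python, with the D-Bus export left out (the property names here
are invented, not an existing interface):

def read_loadavg(path="/proc/loadavg"):
    # First three fields are the 1/5/15-minute load averages.
    one, five, fifteen = open(path).read().split()[:3]
    return {"LoadAverage1Min": float(one),
            "LoadAverage5Min": float(five),
            "LoadAverage15Min": float(fifteen)}

def read_meminfo(path="/proc/meminfo"):
    # Lines look like "MemTotal:     123456 kB"; values are in kB.
    info = {}
    for line in open(path):
        key, _, rest = line.partition(":")
        info[key.strip()] = int(rest.split()[0])
    return info

A daemon would sample these on a timer and push the values to D-Bus,
with the sampling period as the configurable knob.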

> A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements, and procfs output has no standard format, I'm thinking the user would just provide a configuration file containing a list of (procfs path, property regex, D-Bus property name) entries, and compile-time generated code would provide an object for each property.

Sounds flexible and reasonable to me.
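
A toy, runtime version of what you describe might look like the
following (the real thing would be compile-time generated per your
note; the entries below are illustrative, not a proposed schema):

import re

# (procfs path, regex with one capture group, D-Bus property name)
CONFIG = [
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"^MemFree:\s+(\d+)", "MemFreeKiB"),
]

def collect(config=CONFIG):
    props = {}
    for path, pattern, prop in config:
        with open(path) as f:
            match = re.search(pattern, f.read(), re.MULTILINE)
        if match:
            props[prop] = match.group(1)
    return props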

> All of this is just thoughts, nothing concrete yet. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)
>
> If this seems valuable, after gathering more feedback on feature requirements, I'm going to turn these ideas into design docs and upload them for review.

Perfect

> --
> Regards,
> Kun

