BMC health metrics (again!)

Kun Yi kunyi at google.com
Sat Apr 20 11:08:15 AEST 2019


On Fri, Apr 12, 2019 at 6:02 AM Andrew Geissler <geissonator at gmail.com> wrote:
>
> On Tue, Apr 9, 2019 at 11:26 AM Kun Yi <kunyi at google.com> wrote:
> >
> > Hello there,
> >
> > This topic has been brought up several times on the mailing list and offline, but in general it seems we as a community haven't reached a consensus on which things would be the most valuable to monitor, and how to monitor them. While a general-purpose monitoring infrastructure for OpenBMC seems like a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.
>
> I like it, start simple and we can build from there.
>
> >
> > 1. Monitoring host IPMI link reliability (host side)
> >
> > The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, it would be really helpful to use the IPMI command stats to determine whether it is a hardware issue or an IPMI issue. Moreover, it would be a very useful regression metric when rolling out new BMC software.
>
> Are you thinking this is mostly for out-of-band IPMI? Or in-band as
> well? I can't say I've looked into this much but are there known
> issues in this area? The only time I've run into IPMI issues are
> usually when communicating with the host firmware. We've hit a variety
> of race conditions and timeouts in that path.

Good question. Mostly for in-band IPMI, because we don't have much
experience with OOB IPMI. :)

Compared to the other proposal, this one needs more fleshing out before
it becomes a design doc. I'm currently low on bandwidth but will pick it
up after the next week or two. In the meantime, a rough sketch of the
counters I have in mind follows below.
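
To make the "commands sent"/"commands succeeded" idea a bit more
concrete, here is a minimal Python sketch of the counters such a
collector might keep. The class and method names (IpmiLinkStats,
record_request, record_response) are made up for illustration and don't
correspond to any existing daemon:

import time


class IpmiLinkStats:
    """Accumulates IPMI request/response counts over a time window."""

    def __init__(self):
        self.sent = 0
        self.succeeded = 0
        self.started = time.monotonic()

    def record_request(self):
        # Called whenever a command is sent over the host interface.
        self.sent += 1

    def record_response(self, ok):
        # Called when a response (or a timeout) is observed.
        if ok:
            self.succeeded += 1

    def snapshot(self):
        # Returns the raw counters plus a derived success ratio.
        elapsed = time.monotonic() - self.started
        ratio = self.succeeded / self.sent if self.sent else None
        return {"sent": self.sent,
                "succeeded": self.succeeded,
                "success_ratio": ratio,
                "window_seconds": elapsed}

Where exactly record_request/record_response would be hooked in (the
host IPMI daemon, the kernel driver, or a shim library) is precisely the
part that needs the design discussion.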

>
> >
> > Looking at the host IPMI side, there are some metrics exposed through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't dug into whether they contain information that maps to the interrupts. Time to read the source code, I guess.
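
In the meantime, here is roughly what reading that file could look
like, assuming it is plain "name: value" lines; the exact field names
depend on the kernel version and how ipmi_si is configured:

def read_si_stats(path="/proc/ipmi/0/si_stats"):
    # Parse "name: value" lines into a dict; non-numeric values are
    # kept as strings so unexpected fields don't break the parse.
    stats = {}
    with open(path) as f:
        for line in f:
            name, sep, value = line.partition(":")
            if not sep:
                continue
            value = value.strip()
            stats[name.strip()] = int(value) if value.isdigit() else value
    return stats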
> >
> > Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to the fragmentation of IPMI libraries.
> >
> > 2. Read and expose core BMC performance metrics from procfs
> >
> > This is straightforward: have a smallish daemon (or bmc-state-manager) read, parse, and process procfs and put the values on D-Bus. Core metrics I'm interested in getting this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.
>
> Yes, I would definitely be interested in being able to look at these things.
> I assume the sampling rate would be configurable? We could build this
> into our CI images and collect the information for each run. I'm not
> sure if we'd use this in production due to the additional resources it
> will consume but I could see it being very useful in lab/debug/CI
> areas.
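
To sketch just the collection half of this (the D-Bus export and the
configurable sampling rate are left out here), reading the well-known
procfs files is straightforward; in Python it would be roughly:

def read_loadavg(path="/proc/loadavg"):
    # /proc/loadavg starts with the 1/5/15-minute load averages.
    with open(path) as f:
        one, five, fifteen = f.read().split()[:3]
    return {"load1": float(one), "load5": float(five),
            "load15": float(fifteen)}


def read_meminfo(path="/proc/meminfo"):
    # Lines look like "MemTotal:  123456 kB"; keep the first numeric
    # token (most fields are in kB, a few are plain counts).
    info = {}
    with open(path) as f:
        for line in f:
            name, _, rest = line.partition(":")
            info[name.strip()] = int(rest.split()[0])
    return info

A daemon would presumably sample these on a configurable timer and
publish the results as D-Bus properties.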
>
> > A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and the procfs output format is not standardized, I'm thinking the user would just provide a configuration file containing a list of (procfs path, property regex, D-Bus property name) tuples, and compile-time generated code would provide an object for each property.
>
> Sounds flexible and reasonable to me.
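
For illustration, each configuration entry could be a (procfs path,
regex whose first group captures the value, D-Bus property name) tuple;
the format and the property names below are entirely hypothetical:

import re

CONFIG = [
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"^MemFree:\s+(\d+)", "MemFreeKiB"),
]


def collect(config=CONFIG):
    # Evaluate each regex against its procfs file and map the captured
    # value to the named property.
    properties = {}
    for path, pattern, prop in config:
        with open(path) as f:
            match = re.search(pattern, f.read(), re.MULTILINE)
        if match:
            properties[prop] = match.group(1)
    return properties

The compile-time code generation would then turn each entry into a
typed D-Bus property object rather than a dict entry, but the mapping
logic would look about the same.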
>
> > All of this is just thoughts, nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all been implemented already :)
> >
> > If this seems valuable, then after gathering more feedback on feature requirements I'm going to turn these ideas into design docs and upload them for review.
>
> Perfect
>
> > --
> > Regards,
> > Kun



-- 
Regards,
Kun

