BMC health metrics (again!)
Kun Yi
kunyi at google.com
Sat Apr 20 11:04:24 AEST 2019
Thanks Sivas. Response inline.
On Thu, Apr 11, 2019 at 5:57 AM Sivas Srr <sivas.srr at in.ibm.com> wrote:
>
> Thank you Kun Yi for your proposal.
> My input starts with word "Response:".
>
>
>
> With regards,
> Sivas
>
>
> ----- Original message -----
> From: Kun Yi <kunyi at google.com>
> Sent by: "openbmc" <openbmc-bounces+sivas.srr=in.ibm.com at lists.ozlabs.org>
> To: OpenBMC Maillist <openbmc at lists.ozlabs.org>
> Cc:
> Subject: BMC health metrics (again!)
> Date: Tue, Apr 9, 2019 9:57 PM
>
> Hello there,
>
> This topic has been brought up several times on the mailing list and offline, but in general it seems we as a community haven't reached a consensus on what things would be the most valuable to monitor, and how to monitor them. While a general-purpose monitoring infrastructure for OpenBMC seems to be a hard problem, I have some simple ideas that I hope can provide immediate and direct benefits.
>
> 1. Monitoring host IPMI link reliability (host side)
>
> The essentials I want are "IPMI commands sent" and "IPMI commands succeeded" counts over time. More metrics like response time would be helpful as well. The issue to address here: when some IPMI sensor readings are flaky, IPMI command stats would really help determine whether it is a hardware issue or an IPMI issue. Moreover, it would be a very useful regression test metric for rolling out new BMC software.
>
> Looking at the host IPMI side, there are some metrics exposed through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I haven't dug into whether it contains information mapping to the interrupts. Time to read the source code, I guess.
>
> Another idea would be to instrument caller libraries like the interfaces in ipmitool, though I feel that approach is harder due to fragmentation of IPMI libraries.
>
> Response: Can we have it as part of a debug tarball image to get response time, so that it is used only at that time?
> And moreover, is the IPMI interface not fading away? I will let others provide input.
The debug tarball tool is an interesting idea, though from my
preliminary probing it seems that getting command response data from
the kernel stats alone is not feasible without modifying the driver.
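
For reference, here is a minimal sketch of the kind of polling I have
in mind (Python, assuming the ipmi_si driver exposes its counters as
simple "name: value" lines under /proc/ipmi/0/si_stats; the exact
field names vary by kernel version, so the parsing below is generic):

#!/usr/bin/env python3
# Sketch: periodically sample ipmi_si driver counters and print deltas.
# Assumes /proc/ipmi/0/si_stats exists and contains "name: value" lines;
# field names differ across kernel versions, so everything is parsed
# generically rather than by a fixed schema.
import time

STATS_PATH = "/proc/ipmi/0/si_stats"  # path assumed from the ipmi_si driver

def read_stats(path=STATS_PATH):
    stats = {}
    with open(path) as f:
        for line in f:
            name, _, value = line.partition(":")
            if value.strip().isdigit():
                stats[name.strip()] = int(value.strip())
    return stats

prev = read_stats()
while True:
    time.sleep(60)
    cur = read_stats()
    deltas = {k: cur[k] - prev.get(k, 0) for k in cur}
    print(deltas)  # a real daemon would export these as D-Bus properties/sensors
    prev = cur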
> 2. Read and expose core BMC performance metrics from procfs
>
> This is straightforward: have a smallish daemon (or bmc-state-manager) read, parse, and process procfs and put values on D-Bus. Core metrics I'm interested in getting this way: load average, memory, disk used/available, net stats... The values can then simply be exported as IPMI sensors or Redfish resource properties.
>
> A nice byproduct of this effort would be a procfs parsing library. Since different platforms would probably have different monitoring requirements and the procfs output format has no standard, I'm thinking the user would just provide a configuration file containing a list of (procfs path, property regex, D-Bus property name) tuples, and compile-time generated code would provide an object for each property.
>
> All of this is merely a set of thoughts, nothing concrete. With that said, it would be really great if you could provide some feedback such as "I want this, but I really need that feature", or let me know it's all implemented already :)
>
> If this seems valuable, after gathering more feedback on feature requirements, I'm going to turn them into design docs and upload them for review.
>
> Response: As the BMC is a small embedded system, do we really need to put this in? We may need to decide based on memory / flash footprint.
Yes, obviously it depends on whether the daemon itself is lightweight.
I don't envision it to be larger than any standard phosphor daemon.
Again, it could be configured and included on a platform-by-platform
basis.
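
To make the config-driven idea more concrete, here is a rough sketch
(plain Python rather than a real phosphor daemon; the example tuples
and property names are made up for illustration) of the (procfs path,
property regex, D-Bus property name) approach:

#!/usr/bin/env python3
# Sketch of a config-driven procfs reader. The tuples below stand in for
# the proposed (procfs path, property regex, D-Bus property name)
# configuration; a real daemon would export the results on D-Bus instead
# of printing them.
import re

CONFIG = [
    # (procfs path, regex with one capture group, property name) - illustrative only
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"^MemAvailable:\s+(\d+)\s+kB", "MemAvailableKiB"),
    ("/proc/uptime", r"^(\S+)", "UptimeSeconds"),
]

def collect(config=CONFIG):
    values = {}
    for path, pattern, prop in config:
        try:
            with open(path) as f:
                text = f.read()
        except OSError:
            continue  # skip entries this platform does not provide
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            values[prop] = match.group(1)
    return values

if __name__ == "__main__":
    print(collect())

A compile-time generated version would have the same shape, with each
configured tuple turning into a property on the daemon's D-Bus object.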
>
> Feature to get an event when BMC usage goes > 90%:
>
> From the end user perspective, if BMC CPU utilization / memory / file system usage consistently reaches > 90%, then we should have a way to get an event accordingly. This will help the end user. I feel this is higher priority.
>
> Maybe, based on the event, the involved application should try to correct itself.
Agreed on generating event logs for degraded BMC performance. There
is a standard software watchdog [1] that can reset/recover the system
based on its configuration, and we are using it on our platforms. We
should look into whether it can be hooked up to generate an event.
[1] https://linux.die.net/man/8/watchdog
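
As a rough illustration of the thresholding idea (the 90% threshold
and the plain print() are placeholders; an actual daemon would create
an event log rather than print):

#!/usr/bin/env python3
# Sketch of a >90% utilization check for CPU load, memory, and the root
# filesystem. The 0.9 threshold and the print() stand-in are illustrative.
import os

def check_thresholds(threshold=0.9):
    events = []

    # Load average (1 min) relative to CPU count as a rough CPU-utilization proxy
    load1 = os.getloadavg()[0]
    if load1 / os.cpu_count() > threshold:
        events.append(f"High CPU load: {load1:.2f}")

    # Memory: compare MemAvailable against MemTotal from /proc/meminfo
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, _, value = line.partition(":")
            meminfo[name] = int(value.split()[0])  # values are in kB
    if 1 - meminfo["MemAvailable"] / meminfo["MemTotal"] > threshold:
        events.append("High memory usage")

    # Root filesystem usage
    st = os.statvfs("/")
    if 1 - st.f_bavail / st.f_blocks > threshold:
        events.append("High filesystem usage on /")

    return events

for event in check_thresholds():
    print(event)  # placeholder: a real daemon would create an event log here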
>
> If, after this, the BMC still has a good footprint, then there is nothing wrong in having a small daemon that parses procfs and uses D-Bus to get performance metrics.
As I have mentioned, I think there is still value from a QA
perspective in profiling the performance even if the BMC itself is
running fine.
>
> With regards,
> Sivas
> --
>
> Regards,
> Kun
>
>
>
--
Regards,
Kun