BMC health metrics (again!)

vishwa vishwa at linux.vnet.ibm.com
Wed May 8 18:11:14 AEST 2019


Hello Kun,

Thanks for initiating this. I liked the /proc parsing. On the IPMI 
item, is it targeted only at IPMI -or- at BMC-Host communication links 
in general?

Some of the things in my wish-list are:

1/. Flash wear and tear detection, with the threshold as a config 
option (a rough sketch follows this list)
2/. Any SoC-specific health checks (if those are exposed)
3/. A mechanism to detect spurious interrupts on any HW link
4/. Some kind of check to detect whether an I2C bus is locked up 
against a given end device
5/. Ability to detect errors on HW links
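
For 1/, a rough sketch of what that threshold check could look like, 
assuming an eMMC-backed BMC where the kernel exposes the device 
life_time attribute (the path and threshold are only examples; SPI NOR 
parts typically expose nothing comparable):

    import sys

    # eMMC 5.0+ devices report two wear estimates via sysfs, each
    # 0x01..0x0B meaning roughly 10%..110% of rated device life used.
    LIFE_TIME_ATTR = "/sys/block/mmcblk0/device/life_time"
    THRESHOLD_PCT = 80  # would come from the config option

    def flash_wear_ok():
        try:
            with open(LIFE_TIME_ATTR) as f:
                values = [int(v, 16) for v in f.read().split()]
        except OSError:
            return True  # attribute not present on this platform
        return max(values) * 10 < THRESHOLD_PCT

    if __name__ == "__main__":
        sys.exit(0 if flash_wear_ok() else 1)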

On the watchdog(8) area, I was just thinking these:

How about having some kind of BMC_health D-Bus properties -or- a 
compile-time feed, whose values can be fed into a configuration file, 
rather than watchdog always using the default /etc/watchdog.conf? If 
the properties come from D-Bus, then we could either append them to 
/etc/watchdog.conf -or- treat those values alone as the config file 
that is given to watchdog.
The systemd service files would need to be set up accordingly.
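
To make that concrete, here is a minimal sketch (the property names 
and output path are placeholders, not an existing interface) of 
turning such values into a config file that watchdog is then pointed 
at via its --config-file option instead of /etc/watchdog.conf:

    # Sketch only: 'props' stands in for values read from a
    # hypothetical BMC_health D-Bus interface (or a compile-time feed).
    props = {
        "max-load-1": 24,        # real watchdog.conf directives
        "min-memory": 1,
        "watchdog-timeout": 30,
    }

    def write_watchdog_conf(path="/run/watchdog.conf"):
        # Either append these to /etc/watchdog.conf, or write a
        # standalone file and pass it with: watchdog --config-file ...
        with open(path, "w") as f:
            for key, value in props.items():
                f.write("%s = %s\n" % (key, value))

    write_watchdog_conf()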


We have seen instances where we get an error indicating that no 
resources are available. Those could be file descriptors / socket 
descriptors etc. Could we plug this into watchdog as part of a test 
binary that checks for this? We could hook a repair binary to take the 
action.
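
As a strawman, the test binary could be as small as the sketch below 
(the check against /proc/sys/fs/file-nr is real; the 90% threshold is 
just an example). watchdog treats a non-zero exit from a test binary 
as a failure and can then run the configured repair binary:

    import sys

    def file_handles_ok(threshold=0.9):
        # /proc/sys/fs/file-nr: allocated, unused, system-wide maximum
        with open("/proc/sys/fs/file-nr") as f:
            allocated, _unused, maximum = \
                (int(x) for x in f.read().split())
        return allocated < threshold * maximum

    if __name__ == "__main__":
        sys.exit(0 if file_handles_ok() else 1)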


Another thing that I was looking at hooking into watchdog is a test of 
file system usage as defined by a policy.
The policy could list the file system mounts and also the thresholds.

For example, /tmp, /root etc. We could again hook a repair binary to 
do some cleanup if needed.
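
A sketch of such a policy-driven check, assuming a simple "mount point 
-> maximum %used" policy (the mounts and numbers below are only 
examples):

    import os
    import sys

    # Hypothetical policy: mount point -> maximum acceptable %used
    POLICY = {"/tmp": 80, "/var": 90}

    def over_threshold(mount, limit_pct):
        st = os.statvfs(mount)
        used_pct = 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks
        return used_pct > limit_pct

    if __name__ == "__main__":
        bad = [m for m, limit in POLICY.items()
               if over_threshold(m, limit)]
        # A non-zero exit tells watchdog something is wrong; the
        # repair binary could then clean up the offending mounts.
        sys.exit(1 if bad else 0)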

If we see the list growing with these custom requirements, then it 
probably does not make sense to pollute watchdog(8); instead, have 
these consumed by the app itself?

!! Vishwa !!

On 4/9/19 9:55 PM, Kun Yi wrote:
> Hello there,
>
> This topic has been brought up several times on the mailing list and 
> offline, but in general it seems we as a community haven't reached a 
> consensus on what things would be the most valuable to monitor, and 
> how to monitor them. While a general-purpose monitoring 
> infrastructure for OpenBMC seems like a hard problem, I have some 
> simple ideas that I hope can provide immediate and direct benefits.
>
> 1. Monitoring host IPMI link reliability (host side)
>
> The essentials I want are "IPMI commands sent" and "IPMI commands 
> succeeded" counts over time. More metrics like response time would 
> be helpful as well. The issue to address here: when some IPMI sensor 
> readings are flaky, it would be really helpful to tell from the IPMI 
> command stats whether it is a hardware issue or an IPMI issue. 
> Moreover, it would be a very useful regression test metric for 
> rolling out new BMC software.
>
> Looking at the host IPMI side, there are some metrics exposed 
> through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but I 
> haven't dug into whether it contains information mapping to the 
> interrupts. Time to read the source code, I guess.
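> 
> A quick way to grab whatever counters the driver exposes, without 
> hard-coding field names (which vary by kernel version), might be 
> something like:
> 
>     import re
> 
>     def read_si_stats(path="/proc/ipmi/0/si_stats"):
>         # Lines look like "<counter name>: <value>"; keep it generic.
>         stats = {}
>         with open(path) as f:
>             for line in f:
>                 match = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
>                 if match:
>                     stats[match.group(1)] = int(match.group(2))
>         return stats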
>
> Another idea would be to instrument caller libraries like the 
> interfaces in ipmitool, though I feel that approach is harder due to 
> fragmentation of IPMI libraries.
>
> 2. Read and expose core BMC performance metrics from procfs
>
> This is straightforward: have a smallish daemon (or 
> bmc-state-manager) read, parse, and process procfs and put the 
> values on D-Bus. Core metrics I'm interested in getting this way: 
> load average, memory, disk used/available, net stats... The values 
> can then simply be exported as IPMI sensors or Redfish resource 
> properties.
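> 
> As a very rough sketch of the parsing side (the D-Bus publishing is 
> left out), the well-known files are easy to pick apart:
> 
>     def load_average():
>         # /proc/loadavg looks like "0.16 0.09 0.06 1/78 1234"
>         with open("/proc/loadavg") as f:
>             one, five, fifteen = f.read().split()[:3]
>         return float(one), float(five), float(fifteen)
> 
>     def mem_info():
>         # /proc/meminfo has "MemTotal:    123456 kB" style lines
>         info = {}
>         with open("/proc/meminfo") as f:
>             for line in f:
>                 key, value = line.split(":", 1)
>                 info[key] = int(value.split()[0])  # kB for most keys
>         return info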
>
> A nice byproduct of this effort would be a procfs parsing library. 
> Since different platforms will probably have different monitoring 
> requirements and the procfs output format has no standard, I'm 
> thinking the user would just provide a configuration file containing 
> a list of (procfs path, property regex, D-Bus property name) tuples, 
> and compile-time generated code would provide an object for each 
> property.
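> 
> For illustration, a couple of entries in such a configuration (the 
> property names here are made up) might look like:
> 
>     # (procfs path, property regex, D-Bus property name)
>     METRICS = [
>         ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
>         ("/proc/meminfo", r"^MemAvailable:\s+(\d+)", "MemAvailableKiB"),
>     ]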
>
> All of this is merely thoughts and nothing concrete. With that said, 
> it would be really great if you could provide some feedback such as "I 
> want this, but I really need that feature", or let me know it's all 
> implemented already :)
>
> If this seems valuable, after gathering more feedback of feature 
> requirements, I'm going to turn them into design docs and upload for 
> review.
>
> -- 
> Regards,
> Kun