BMC health metrics (again!)

Sat May 18 04:25:36 AEST 2019

This is great !!

Neeraj / Kun, Were you guys planning on putting an initial proposal ?

!! Vishwa !!

On 5/17/19 9:20 PM, Kun Yi wrote:
> I'd also like to be in the metric workgroup. Neeraj, I can see the 
> first and second point you listed aligns with my goals in the original 
> proposal very well.
>
> On Fri, May 17, 2019 at 12:28 AM vishwa <vishwa at linux.vnet.ibm.com 
> <mailto:vishwa at linux.vnet.ibm.com>> wrote:
>
>     IMO, we could start fresh here. The initial thought was an year+ ago.
>
>     !! Vishwa !!
>
>     On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
>>     Sure thing. Is there an design document that exist for this
>>     feature ?
>>
>>     I can volunteer to drive this work group if we have quorum.
>>
>>     Neeraj
>>
>>     Get Outlook for Android <https://aka.ms/ghei36>
>>
>>     ------------------------------------------------------------------------
>>     *From:* vishwa <vishwa at linux.vnet.ibm.com>
>>     <mailto:vishwa at linux.vnet.ibm.com>
>>     *Sent:* Friday, May 17, 2019 12:17:51 AM
>>     *To:* Neeraj Ladkani; Kun Yi; OpenBMC Maillist
>>     *Subject:* Re: BMC health metrics (again!)
>>
>>     Neeraj,
>>
>>     Thanks for the inputs. It's nice to see us having a similar thought.
>>
>>     AFAIK, we don't have any work-group that is driving “Platform
>>     telemetry and health monitoring”. Also, do we want to see this as
>>     2 different entities ?. In the past, there were thoughts about
>>     using websockets to channel some of the thermal parameters as
>>     telemetry data. But then it was not implemented.
>>
>>     We can discuss here I think.
>>
>>     !! Vishwa !!
>>
>>     On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>>>
>>>     At cloud scale, telemetry and health monitoring is very
>>>     critical. We should define a framework that allows platform
>>>     owners to add their own telemetry hooks. Telemetry service
>>>     should be designed to make this data accessible and store in
>>>     resilient way (like blackbox during plane crash).
>>>
>>>     Is there any workgroup that drives this feature “Platform
>>>     telemetry and health monitoring” ?
>>>
>>>     Wishlist
>>>
>>>     BMC telemetry :
>>>
>>>      1. Linux subsystem
>>>          1. Uptime
>>>          2. CPU Load average
>>>          3. Memory info
>>>          4. Storage usage ( RW )
>>>          5. Dmesg
>>>          6. Syslog
>>>          7. FDs of critical processes
>>>          8. Alignment traps
>>>          9. WDT excursions
>>>      2. IPMI subsystem
>>>          1. Request and Response logging par interface with
>>>             timestamps ( KCS, LAN, USB)
>>>          2. Request and Response of IPMB
>>>
>>>     i.Request , Response, No of Retries
>>>
>>>      3. Misc
>>>
>>>      1. Critical Temperature Excursions
>>>
>>>     i.Minimum Reading of Sensor
>>>
>>>     ii.Max Reading of a sensor
>>>
>>>     iii.Count of state transition
>>>
>>>     iv.Retry Count
>>>
>>>      2. Count of assertions/deassertions of GPIO and ability to
>>>         capture the state
>>>      3. timestamp of last assertion/deassertion of GPIO
>>>
>>>     Thanks
>>>
>>>     ~Neeraj
>>>
>>>     *From:*openbmc
>>>     <openbmc-bounces+neladk=microsoft.com at lists.ozlabs.org>
>>>     <mailto:openbmc-bounces+neladk=microsoft.com at lists.ozlabs.org>
>>>     *On Behalf Of *vishwa
>>>     *Sent:* Wednesday, May 8, 2019 1:11 AM
>>>     *To:* Kun Yi <kunyi at google.com> <mailto:kunyi at google.com>;
>>>     OpenBMC Maillist <openbmc at lists.ozlabs.org>
>>>     <mailto:openbmc at lists.ozlabs.org>
>>>     *Subject:* Re: BMC health metrics (again!)
>>>
>>>     Hello Kun,
>>>
>>>     Thanks for initiating it. I liked the /proc parsing. On the IPMI
>>>     thing, is it only targeted to IPMI -or- a generic BMC-Host
>>>     communication kink ?
>>>
>>>     Some of the things in my wish-list are:
>>>
>>>     1/. Flash wear and tear detection and the threshold to be a
>>>     config option
>>>     2/. Any SoC specific health checks ( If that is exposed )
>>>     3/. Mechanism to detect spurious interrupts on any HW link
>>>     4/. Some kind of check to see if there will be any I2C lock to a
>>>     given end device
>>>     5/. Ability to detect errors on HW links
>>>
>>>     On the watchdog(8) area, I was just thinking these:
>>>
>>>     How about having some kind of BMC_health D-Bus properties -or- a
>>>     compile time feed, whose values can be fed into a configuration
>>>     file than watchdog using the default /etc/watchdog.conf always.
>>>     If the properties are coming from a D-Bus, then we could either
>>>     append to /etc/watchdog.conf -or- treat those values only as the
>>>     config file that can be given to watchdog.
>>>     The systemd service files to be setup accordingly.
>>>
>>>
>>>     We have seen instances where we get an error that is indicating
>>>     no resources available. Those could be file descriptors / socket
>>>     descriptors etc. A way to plug this into watchdog as part of
>>>     test binary that checks for this ? We could hook a repair-binary
>>>     to take the action.
>>>
>>>
>>>     Another thing that I was looking at hooking into watchdog is the
>>>     test to see the file system usage as defined by the policy.
>>>     Policy could mention the file system mounts and also the threshold.
>>>
>>>     For example, /tmp , /root etc.. We could again hook a repair
>>>     binary to do some cleanup if needed
>>>
>>>     If we see the list is growing with these custom requirements,
>>>     then probably does not make sense to pollute the watchdog(2) but
>>>     have these consumed into the app instead ?
>>>
>>>     !! Vishwa !!
>>>
>>>     On 4/9/19 9:55 PM, Kun Yi wrote:
>>>
>>>         Hello there,
>>>
>>>         This topic has been brought up several times on the mailing
>>>         list and offline, but in general seems we as a community
>>>         didn't reach a consensus on what things would be the most
>>>         valuable to monitor, and how to monitor them. While it seems
>>>         a general purposed monitoring infrastructure for OpenBMC is
>>>         a hard problem, I have some simple ideas that I hope can
>>>         provide immediate and direct benefits.
>>>
>>>         1. Monitoring host IPMI link reliability (host side)
>>>
>>>         The essentials I want are "IPMI commands sent" and "IPMI
>>>         commands succeeded" counts over time. More metrics like
>>>         response time would be helpful as well. The issue to address
>>>         here: when some IPMI sensor readings are flaky, it would be
>>>         really helpful to tell from IPMI command stats to determine
>>>         whether it is a hardware issue, or IPMI issue. Moreover, it
>>>         would be a very useful regression test metric for rolling
>>>         out new BMC software.
>>>
>>>         Looking at the host IPMI side, there is some metrics exposed
>>>         through /proc/ipmi/0/si_stats if ipmi_si driver is used, but
>>>         I haven't dug into whether it contains information mapping
>>>         to the interrupts. Time to read the source code I guess.
>>>
>>>         Another idea would be to instrument caller libraries like
>>>         the interfaces in ipmitool, though I feel that approach is
>>>         harder due to fragmentation of IPMI libraries.
>>>
>>>         2. Read and expose core BMC performance metrics from procfs
>>>
>>>         This is straightforward: have a smallish daemon (or
>>>         bmc-state-manager) read,parse, and process procfs and put
>>>         values on D-Bus. Core metrics I'm interested in getting
>>>         through this way: load average, memory, disk used/available,
>>>         net stats... The values can then simply be exported as IPMI
>>>         sensors or Redfish resource properties.
>>>
>>>         A nice byproduct of this effort would be a procfs parsing
>>>         library. Since different platforms would probably have
>>>         different monitoring requirements and procfs output format
>>>         has no standard, I'm thinking the user would just provide a
>>>         configuration file containing list of (procfs path, property
>>>         regex, D-Bus property name), and the compile-time generated
>>>         code to provide an object for each property.
>>>
>>>         All of this is merely thoughts and nothing concrete. With
>>>         that said, it would be really great if you could provide
>>>         some feedback such as "I want this, but I really need that
>>>         feature", or let me know it's all implemented already :)
>>>
>>>         If this seems valuable, after gathering more feedback of
>>>         feature requirements, I'm going to turn them into design
>>>         docs and upload for review.
>>>
>>>         -- 
>>>
>>>         Regards,
>>>
>>>         Kun
>>>
>
>
> -- 
> Regards,
> Kun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ozlabs.org/pipermail/openbmc/attachments/20190517/f1755b59/attachment-0001.htm>