BMC health metrics (again!)

vishwa vishwa at linux.vnet.ibm.com
Fri May 17 17:27:28 AEST 2019


IMO, we could start fresh here. The initial thought was from more than a year ago.

!! Vishwa !!

On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
> Sure thing. Is there a design document that exists for this feature?
>
> I can volunteer to drive this work group if we have quorum.
>
> Neeraj
>
> ------------------------------------------------------------------------
> *From:* vishwa <vishwa at linux.vnet.ibm.com>
> *Sent:* Friday, May 17, 2019 12:17:51 AM
> *To:* Neeraj Ladkani; Kun Yi; OpenBMC Maillist
> *Subject:* Re: BMC health metrics (again!)
>
> Neeraj,
>
> Thanks for the inputs. It's nice to see we are thinking along similar lines.
>
> AFAIK, we don't have any work-group driving "Platform telemetry and
> health monitoring". Also, do we want to see these as two different
> entities? In the past, there were thoughts about using websockets to
> channel some of the thermal parameters as telemetry data, but that was
> never implemented.
>
> We can discuss it here, I think.
>
> !! Vishwa !!
>
> On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>>
>> At cloud scale, telemetry and health monitoring are critical. We
>> should define a framework that allows platform owners to add their
>> own telemetry hooks. The telemetry service should be designed to make
>> this data accessible and to store it in a resilient way (like the
>> black box in a plane crash).
>>
>> Is there any workgroup that drives the "Platform telemetry and health
>> monitoring" feature?
>>
>> Wishlist
>>
>> BMC telemetry:
>>
>>  1. Linux subsystem
>>      1. Uptime
>>      2. CPU load average
>>      3. Memory info
>>      4. Storage usage (R/W)
>>      5. Dmesg
>>      6. Syslog
>>      7. FDs of critical processes
>>      8. Alignment traps
>>      9. WDT excursions
>>  2. IPMI subsystem
>>      1. Request and response logging per interface, with timestamps
>>         (KCS, LAN, USB)
>>      2. Request and response of IPMB
>>          i.   Request, response, number of retries
>>  3. Misc
>>      1. Critical temperature excursions
>>          i.   Minimum reading of a sensor
>>          ii.  Maximum reading of a sensor
>>          iii. Count of state transitions
>>          iv.  Retry count
>>      2. Count of assertions/deassertions of GPIO and ability to capture
>>         the state
>>      3. Timestamp of last assertion/deassertion of GPIO
>>
>> Thanks
>>
>> ~Neeraj
>>
>> *From:*openbmc 
>> <openbmc-bounces+neladk=microsoft.com at lists.ozlabs.org> *On Behalf Of 
>> *vishwa
>> *Sent:* Wednesday, May 8, 2019 1:11 AM
>> *To:* Kun Yi <kunyi at google.com>; OpenBMC Maillist 
>> <openbmc at lists.ozlabs.org>
>> *Subject:* Re: BMC health metrics (again!)
>>
>> Hello Kun,
>>
>> Thanks for initiating it. I liked the /proc parsing. On the IPMI
>> thing, is it targeted only at IPMI -or- at a generic BMC-host
>> communication link?
>>
>> Some of the things on my wish-list are:
>>
>> 1/. Flash wear-and-tear detection, with the threshold as a config
>> option (see the sketch after this list)
>> 2/. Any SoC-specific health checks (if they are exposed)
>> 3/. Mechanism to detect spurious interrupts on any HW link
>> 4/. Some kind of check to see if there will be any I2C lock-up on the
>> bus to a given end device
>> 5/. Ability to detect errors on HW links
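
For 1/ above, a rough sketch of what the flash-wear check could look
like on an eMMC-backed BMC, assuming the kernel exposes the JEDEC
life-time estimate in sysfs; raw NOR/NAND parts would need a different
source such as MTD/UBI statistics. The path and threshold below are
placeholders for what would be config options, not an existing OpenBMC
interface:

#!/usr/bin/env python3
# Sketch for wish-list item 1/: flash wear detection with a
# configurable threshold. Path and threshold are placeholders.
from pathlib import Path

LIFE_TIME_PATH = Path("/sys/block/mmcblk0/device/life_time")
WEAR_THRESHOLD_PERCENT = 70   # would come from a config option

def emmc_wear_percent() -> int:
    # life_time holds two hex fields (type A / type B), each an
    # estimate of device life used, in 10% steps, e.g. "0x02 0x03".
    fields = LIFE_TIME_PATH.read_text().split()
    return max(int(f, 16) for f in fields) * 10

if __name__ == "__main__":
    used = emmc_wear_percent()
    state = "above threshold" if used >= WEAR_THRESHOLD_PERCENT else "OK"
    print(f"flash wear ~{used}% ({state})")
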
>>
>> On the watchdog(8) side, I was just thinking of these:
>>
>> How about having some kind of BMC_health D-Bus properties -or- a
>> compile-time feed, whose values can be fed into a configuration file,
>> rather than watchdog always using the default /etc/watchdog.conf? If
>> the properties come from D-Bus, then we could either append them to
>> /etc/watchdog.conf -or- treat those values alone as the config file
>> that is given to watchdog.
>> The systemd service files would be set up accordingly.
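
To make that concrete, here is a minimal sketch that renders a
watchdog(8) config fragment from a set of health settings. The dict
below just stands in for the proposed BMC_health D-Bus properties (or a
compile-time feed); the keys and the output path are placeholders, not
an existing interface:

#!/usr/bin/env python3
# Minimal sketch: render a watchdog(8) config fragment from BMC health
# settings. The `health` dict stands in for the proposed BMC_health
# D-Bus properties (or a compile-time feed); keys and the output path
# are placeholders.

health = {
    "max-load-1": 24,         # standard watchdog.conf options
    "min-memory": 1024,
    "watchdog-timeout": 30,
}

def render_watchdog_conf(settings: dict) -> str:
    # watchdog.conf is simple "key = value" lines.
    return "".join(f"{key} = {value}\n" for key, value in settings.items())

if __name__ == "__main__":
    # Either append this to /etc/watchdog.conf or hand the fragment to
    # the daemon with `watchdog -c <file>`, as discussed above.
    with open("/run/watchdog-bmc-health.conf", "w") as conf:
        conf.write(render_watchdog_conf(health))
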
>>
>>
>> We have seen instances where we get an error indicating that no
>> resources are available. Those could be file descriptors, socket
>> descriptors, etc. Could this be plugged into watchdog as part of a
>> test binary that checks for it? We could hook a repair binary to take
>> the corrective action.
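
A sketch of such a test binary, wired in via the test-binary option in
watchdog.conf, that walks /proc and flags critical processes holding
too many descriptors. The process names and the limit are only
examples:

#!/usr/bin/env python3
# Sketch of a watchdog(8) test binary for descriptor exhaustion.
# A non-zero exit tells watchdog the check failed, so a repair binary
# (or a reboot) can take over. Process names and the limit are examples.
import os
import sys

CRITICAL_PROCESSES = {"ipmid", "bmcweb"}   # example names to watch
FD_LIMIT = 512                             # example per-process cap

def process_name(pid: str) -> str:
    with open(f"/proc/{pid}/comm") as f:
        return f.read().strip()

def open_fd_count(pid: str) -> int:
    return len(os.listdir(f"/proc/{pid}/fd"))

def main() -> int:
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            if process_name(pid) in CRITICAL_PROCESSES and \
                    open_fd_count(pid) > FD_LIMIT:
                return 1   # resource exhaustion: let watchdog react
        except OSError:
            continue       # process exited or is not accessible
    return 0

if __name__ == "__main__":
    sys.exit(main())
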
>>
>>
>> Another thing I was looking at hooking into watchdog is a test of
>> file system usage as defined by a policy.
>> The policy could name the file system mounts and also the thresholds.
>>
>> For example, /tmp, /root, etc. We could again hook a repair binary
>> to do some cleanup if needed.
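
In the same spirit, a sketch of the policy-driven usage check; the
mounts and thresholds would come from the policy file, and the entries
here are only examples:

#!/usr/bin/env python3
# Sketch of the policy-driven file system usage check. A non-zero exit
# lets watchdog(8) run a repair binary that prunes logs, core files,
# etc. The policy entries are examples; a real check would load them
# from the policy file.
import shutil
import sys

POLICY = {          # mount point -> maximum percent used
    "/tmp": 80,
    "/root": 90,
}

def percent_used(mount: str) -> float:
    usage = shutil.disk_usage(mount)
    return 100.0 * usage.used / usage.total

def main() -> int:
    failed = 0
    for mount, limit in POLICY.items():
        used = percent_used(mount)
        if used > limit:
            print(f"{mount} is {used:.0f}% full (limit {limit}%)",
                  file=sys.stderr)
            failed = 1
    return failed

if __name__ == "__main__":
    sys.exit(main())
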
>>
>> If we see this list growing with custom requirements, then it
>> probably does not make sense to pollute watchdog(8); should we have
>> these checks consumed by the app instead?
>>
>> !! Vishwa !!
>>
>> On 4/9/19 9:55 PM, Kun Yi wrote:
>>
>>     Hello there,
>>
>>     This topic has been brought up several times on the mailing list
>>     and offline, but in general it seems we as a community haven't
>>     reached a consensus on what things would be the most valuable to
>>     monitor, and how to monitor them. While a general-purpose
>>     monitoring infrastructure for OpenBMC seems to be a hard problem,
>>     I have some simple ideas that I hope can provide immediate and
>>     direct benefits.
>>
>>     1. Monitoring host IPMI link reliability (host side)
>>
>>     The essentials I want are "IPMI commands sent" and "IPMI commands
>>     succeeded" counts over time. More metrics, like response time,
>>     would be helpful as well. The issue to address here: when some
>>     IPMI sensor readings are flaky, it would be really helpful to use
>>     the IPMI command stats to determine whether it is a hardware
>>     issue or an IPMI issue. Moreover, it would be a very useful
>>     regression-test metric for rolling out new BMC software.
>>
>>     Looking at the host IPMI side, there are some metrics exposed
>>     through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but
>>     I haven't dug into whether it contains information mapping to the
>>     interrupts. Time to read the source code, I guess.
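
For reference, a small sketch that snapshots those driver stats,
assuming the usual one "counter: value" pair per line layout of
ipmi_si; the exact counter names depend on the kernel, so treat this as
a parser skeleton rather than a definitive list:

#!/usr/bin/env python3
# Sketch: snapshot host-side ipmi_si statistics. Assumes the usual
# "counter: value" per-line layout of /proc/ipmi/0/si_stats; verify the
# counter names against your kernel before relying on them.
from pathlib import Path

SI_STATS = Path("/proc/ipmi/0/si_stats")

def read_si_stats() -> dict:
    stats = {}
    for line in SI_STATS.read_text().splitlines():
        name, _, value = line.partition(":")
        if value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats

if __name__ == "__main__":
    # Two snapshots taken an interval apart give "commands sent /
    # succeeded per interval" style metrics for trending reliability.
    for name, value in sorted(read_si_stats().items()):
        print(f"{name}: {value}")
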
>>
>>     Another idea would be to instrument caller libraries like the
>>     interfaces in ipmitool, though I feel that approach is harder due
>>     to fragmentation of IPMI libraries.
>>
>>     2. Read and expose core BMC performance metrics from procfs
>>
>>     This is straightforward: have a smallish daemon (or
>>     bmc-state-manager) read, parse, and process procfs and put the
>>     values on D-Bus. Core metrics I'm interested in getting this way:
>>     load average, memory, disk used/available, net stats... The
>>     values can then simply be exported as IPMI sensors or Redfish
>>     resource properties.
>>
>>     A nice byproduct of this effort would be a procfs parsing
>>     library. Since different platforms would probably have different
>>     monitoring requirements and the procfs output format has no
>>     standard, I'm thinking the user would just provide a
>>     configuration file containing a list of (procfs path, property
>>     regex, D-Bus property name) entries, and compile-time generated
>>     code would provide an object for each property.
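
Something like the sketch below is how I read that idea: a table of
(procfs path, regex, property name) entries driving the parser. The
entries are examples and the values are only printed here; the real
daemon would publish them on D-Bus (and onward as IPMI sensors or
Redfish properties) instead:

#!/usr/bin/env python3
# Sketch of the (procfs path, property regex, D-Bus property name)
# idea. The table entries are examples; values are printed here, while
# a real daemon would publish them as D-Bus properties.
import re

METRICS = [
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"^MemAvailable:\s+(\d+) kB", "MemAvailableKiB"),
    ("/proc/uptime",  r"^(\S+)", "UptimeSeconds"),
]

def collect() -> dict:
    values = {}
    for path, pattern, prop in METRICS:
        with open(path) as f:
            match = re.search(pattern, f.read(), re.MULTILINE)
        if match:
            values[prop] = match.group(1)
    return values

if __name__ == "__main__":
    for prop, value in collect().items():
        print(f"{prop} = {value}")
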
>>
>>     All of these are merely thoughts, nothing concrete yet. With that
>>     said, it would be really great if you could provide some feedback
>>     such as "I want this, but I really need that feature", or let me
>>     know it's all implemented already :)
>>
>>     If this seems valuable, then after gathering more feedback on
>>     feature requirements I'm going to turn these into design docs and
>>     upload them for review.
>>
>>     -- 
>>
>>     Regards,
>>
>>     Kun
>>