BMC health metrics (again!)
vishwa
vishwa at linux.vnet.ibm.com
Sat May 18 04:25:36 AEST 2019
This is great !!
Neeraj / Kun, were you guys planning on putting up an initial proposal?
!! Vishwa !!
On 5/17/19 9:20 PM, Kun Yi wrote:
> I'd also like to be in the metric workgroup. Neeraj, I can see that the
> first and second points you listed align with my goals in the original
> proposal very well.
>
> On Fri, May 17, 2019 at 12:28 AM vishwa <vishwa at linux.vnet.ibm.com
> <mailto:vishwa at linux.vnet.ibm.com>> wrote:
>
> IMO, we could start fresh here. The initial thought was a year+ ago.
>
> !! Vishwa !!
>
> On 5/17/19 12:53 PM, Neeraj Ladkani wrote:
>> Sure thing. Is there a design document that exists for this
>> feature?
>>
>> I can volunteer to drive this work group if we have quorum.
>>
>> Neeraj
>>
>>
>> ------------------------------------------------------------------------
>> *From:* vishwa <vishwa at linux.vnet.ibm.com>
>> <mailto:vishwa at linux.vnet.ibm.com>
>> *Sent:* Friday, May 17, 2019 12:17:51 AM
>> *To:* Neeraj Ladkani; Kun Yi; OpenBMC Maillist
>> *Subject:* Re: BMC health metrics (again!)
>>
>> Neeraj,
>>
>> Thanks for the inputs. It's nice to see us having a similar thought.
>>
>> AFAIK, we don't have any work-group that is driving “Platform
>> telemetry and health monitoring”. Also, do we want to see these as
>> 2 different entities? In the past, there were thoughts about
>> using websockets to channel some of the thermal parameters as
>> telemetry data, but that was never implemented.
>>
>> We can discuss here I think.
>>
>> !! Vishwa !!
>>
>> On 5/17/19 12:00 PM, Neeraj Ladkani wrote:
>>>
>>> At cloud scale, telemetry and health monitoring are very
>>> critical. We should define a framework that allows platform
>>> owners to add their own telemetry hooks. The telemetry service
>>> should be designed to make this data accessible and store it in
>>> a resilient way (like a black box during a plane crash).
>>>
>>> Is there any workgroup that drives this feature “Platform
>>> telemetry and health monitoring” ?
>>>
>>> Wishlist
>>>
>>> BMC telemetry:
>>>
>>> 1. Linux subsystem
>>>    1. Uptime
>>>    2. CPU load average
>>>    3. Memory info
>>>    4. Storage usage (R/W)
>>>    5. dmesg
>>>    6. syslog
>>>    7. FDs of critical processes
>>>    8. Alignment traps
>>>    9. WDT excursions
>>> 2. IPMI subsystem
>>>    1. Request and response logging per interface, with
>>>       timestamps (KCS, LAN, USB)
>>>    2. Request and response of IPMB
>>>       i. Request, response, number of retries
>>> 3. Misc
>>>    1. Critical temperature excursions
>>>       i. Minimum reading of a sensor
>>>       ii. Maximum reading of a sensor
>>>       iii. Count of state transitions
>>>       iv. Retry count
>>>    2. Count of assertions/deassertions of a GPIO, and the ability
>>>       to capture the state
>>>    3. Timestamp of last assertion/deassertion of a GPIO
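The Linux-subsystem items in the wishlist (uptime, load average) map directly onto small procfs parsers. A minimal Python sketch, assuming the standard /proc/uptime and /proc/loadavg formats; the function names are illustrative, not any existing OpenBMC API:

```python
def parse_uptime(text):
    """Parse /proc/uptime: '<uptime> <idle>' seconds."""
    up, idle = (float(field) for field in text.split())
    return {"uptime_sec": up, "idle_sec": idle}

def parse_loadavg(text):
    """Parse /proc/loadavg: the 1/5/15-minute load averages come first."""
    fields = text.split()
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
    }
```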
>>>
>>> Thanks
>>>
>>> ~Neeraj
>>>
>>> *From:*openbmc
>>> <openbmc-bounces+neladk=microsoft.com at lists.ozlabs.org>
>>> <mailto:openbmc-bounces+neladk=microsoft.com at lists.ozlabs.org>
>>> *On Behalf Of *vishwa
>>> *Sent:* Wednesday, May 8, 2019 1:11 AM
>>> *To:* Kun Yi <kunyi at google.com> <mailto:kunyi at google.com>;
>>> OpenBMC Maillist <openbmc at lists.ozlabs.org>
>>> <mailto:openbmc at lists.ozlabs.org>
>>> *Subject:* Re: BMC health metrics (again!)
>>>
>>> Hello Kun,
>>>
>>> Thanks for initiating it. I liked the /proc parsing. On the IPMI
>>> thing, is it targeted only at IPMI -or- at a generic BMC-Host
>>> communication link?
>>>
>>> Some of the things in my wish-list are:
>>>
>>> 1/. Flash wear-and-tear detection, with the threshold as a
>>> config option
>>> 2/. Any SoC-specific health checks (if that is exposed)
>>> 3/. A mechanism to detect spurious interrupts on any HW link
>>> 4/. Some kind of check to see whether there is an I2C bus lock-up
>>> to a given end device
>>> 5/. Ability to detect errors on HW links
>>>
>>> On the watchdog(8) front, I was thinking along these lines:
>>>
>>> How about having some kind of BMC_health D-Bus properties -or- a
>>> compile-time feed, whose values can be fed into a configuration
>>> file, rather than watchdog always using the default
>>> /etc/watchdog.conf? If the properties come from D-Bus, then we
>>> could either append them to /etc/watchdog.conf -or- treat those
>>> values alone as the config file given to watchdog.
>>> The systemd service files would be set up accordingly.
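Rendering the config from such properties could be as simple as the sketch below. The directive names (max-load-1, watchdog-device) are real watchdog.conf keys, but where the property map comes from, D-Bus or a compile-time feed, is deliberately left open:

```python
def render_watchdog_conf(props):
    """Render watchdog(8) configuration lines from a property map,
    e.g. one populated from hypothetical BMC_health D-Bus properties."""
    lines = [f"{key} = {value}" for key, value in sorted(props.items())]
    return "\n".join(lines) + "\n"
```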
>>>
>>>
>>> We have seen instances where we get an error indicating that no
>>> resources are available; those could be file descriptors, socket
>>> descriptors, etc. Could we plug this into watchdog as part of a
>>> test binary that checks for it? We could then hook up a
>>> repair binary to take the corrective action.
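watchdog(8) runs a configurable test binary and treats a non-zero exit status as a failure that can trigger the repair binary. A sketch of such a descriptor-exhaustion check; the 90% threshold and the per-process scope are assumptions, not an agreed policy:

```python
import os

def fds_in_use(pid):
    """Count the open file descriptors of a process via /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def check_fd_usage(used, soft_limit, fraction=0.9):
    """Exit-status convention for a watchdog(8) test binary:
    0 = healthy, non-zero = trip the repair path."""
    return 1 if used >= fraction * soft_limit else 0
```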
>>>
>>>
>>> Another thing I was looking at hooking into watchdog is a test
>>> of file-system usage as defined by a policy. The policy could
>>> name the file-system mounts and also the thresholds.
>>>
>>> For example, /tmp, /root, etc. We could again hook up a repair
>>> binary to do some cleanup if needed.
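The policy check itself reduces to comparing per-mount usage against a threshold. A minimal sketch, assuming the usage comes from statvfs()-style block counts and the thresholds live in a policy file:

```python
def fs_usage_percent(total_blocks, free_blocks):
    """Percentage of blocks in use, from statvfs()-style counts."""
    return 100.0 * (total_blocks - free_blocks) / total_blocks

def needs_cleanup(percent_used, threshold_percent):
    """Policy decision: 0 = OK, 1 = invoke the repair binary."""
    return 1 if percent_used >= threshold_percent else 0
```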
>>>
>>> If the list keeps growing with these custom requirements, then
>>> it probably does not make sense to pollute watchdog(8); should we
>>> have these checks consumed into the app instead?
>>>
>>> !! Vishwa !!
>>>
>>> On 4/9/19 9:55 PM, Kun Yi wrote:
>>>
>>> Hello there,
>>>
>>> This topic has been brought up several times on the mailing
>>> list and offline, but in general it seems we as a community
>>> haven't reached a consensus on which things would be the most
>>> valuable to monitor, and on how to monitor them. While a
>>> general-purpose monitoring infrastructure for OpenBMC seems
>>> a hard problem, I have some simple ideas that I hope can
>>> provide immediate and direct benefits.
>>>
>>> 1. Monitoring host IPMI link reliability (host side)
>>>
>>> The essentials I want are "IPMI commands sent" and "IPMI
>>> commands succeeded" counts over time. More metrics, like
>>> response time, would be helpful as well. The issue to address
>>> here: when some IPMI sensor readings are flaky, it would be
>>> really helpful to be able to tell from the IPMI command stats
>>> whether it is a hardware issue or an IPMI issue. Moreover, it
>>> would be a very useful regression-test metric for rolling
>>> out new BMC software.
>>>
>>> Looking at the host IPMI side, there are some metrics exposed
>>> through /proc/ipmi/0/si_stats if the ipmi_si driver is used, but
>>> I haven't dug into whether it contains information mapping
>>> to the interrupts. Time to read the source code, I guess.
>>>
>>> Another idea would be to instrument caller libraries like
>>> the interfaces in ipmitool, though I feel that approach is
>>> harder due to fragmentation of IPMI libraries.
>>>
>>> 2. Read and expose core BMC performance metrics from procfs
>>>
>>> This is straightforward: have a smallish daemon (or
>>> bmc-state-manager) read, parse, and process procfs and put the
>>> values on D-Bus. Core metrics I'm interested in getting
>>> this way: load average, memory, disk used/available,
>>> net stats... The values can then simply be exported as IPMI
>>> sensors or Redfish resource properties.
>>>
>>> A nice byproduct of this effort would be a procfs parsing
>>> library. Since different platforms will probably have
>>> different monitoring requirements, and the procfs output format
>>> has no standard, I'm thinking the user would just provide a
>>> configuration file containing a list of (procfs path, property
>>> regex, D-Bus property name) tuples, with compile-time generated
>>> code providing an object for each property.
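That configuration-file idea could work roughly as follows; this is a run-time Python sketch rather than the compile-time code generation suggested, and the paths, regexes, and property names are made-up examples:

```python
import re

# Hypothetical platform config: (procfs path, capture regex, D-Bus property).
CONFIG = [
    ("/proc/meminfo", r"MemFree:\s+(\d+) kB", "MemFreeKiB"),
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
]

def extract_properties(config, read_file):
    """Apply each (path, regex, name) rule; read_file is injected so the
    parser can be tested without a live /proc."""
    props = {}
    for path, pattern, name in config:
        match = re.search(pattern, read_file(path), re.MULTILINE)
        if match:
            props[name] = match.group(1)
    return props
```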
>>>
>>> All of these are merely thoughts, nothing concrete yet. With
>>> that said, it would be really great if you could provide
>>> some feedback such as "I want this, but I really need that
>>> feature", or let me know if it's all implemented already :)
>>>
>>> If this seems valuable, after gathering more feedback of
>>> feature requirements, I'm going to turn them into design
>>> docs and upload for review.
>>>
>>> --
>>>
>>> Regards,
>>>
>>> Kun
>>>
>
>
> --
> Regards,
> Kun