<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hello Kun,</p>
<p>Thanks for initiating this. I liked the /proc parsing idea. On the
IPMI piece, is it targeted only at IPMI -or- at a generic BMC-Host
communication link?<br>
</p>
<p>Some of the things in my wish-list are:<br>
</p>
<p>1/. Flash wear-and-tear detection, with the threshold as a config
option (a rough sketch of such a check follows this list)<br>
2/. Any SoC-specific health checks (if those are exposed)<br>
3/. A mechanism to detect spurious interrupts on any HW link<br>
4/. Some kind of check to detect whether the I2C bus to a given end
device is locked up<br>
5/. The ability to detect errors on HW links</p>
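<p>For 1/, something along these lines is what I have in mind. It is
only a sketch: it assumes a kernel new enough to expose the eMMC
EXT_CSD life-time estimate through sysfs, and the path and threshold
below are assumptions, not an existing OpenBMC interface.</p>
<pre>
#!/usr/bin/env python3
# Hypothetical flash wear check: read the eMMC life-time estimate from
# sysfs and compare it against a configurable threshold.
import sys

LIFE_TIME = "/sys/block/mmcblk0/device/life_time"  # two hex values; 0x01 = 0-10% used
THRESHOLD = 0x08                                    # assumed policy: alarm at ~70-80% used

def main():
    try:
        with open(LIFE_TIME) as f:
            est_a, est_b = (int(v, 16) for v in f.read().split())
    except OSError as e:
        print("cannot read eMMC life time: %s" % e, file=sys.stderr)
        return 1
    if max(est_a, est_b) >= THRESHOLD:
        print("flash wear above threshold: %#x / %#x" % (est_a, est_b))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
</pre>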
<p>On the watchdog(8) front, I was thinking along these lines:<br>
</p>
<p>How about having some kind of BMC_health D-Bus properties -or- a
compile-time feed, whose values are turned into a configuration file,
rather than watchdog always using the default /etc/watchdog.conf? If
the properties come from D-Bus, then we could either append them to
/etc/watchdog.conf -or- treat the generated values as the only config
file given to watchdog.<br>
The systemd service files would need to be set up accordingly.<br>
</p>
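<p>To make the idea concrete, the generated file could look roughly
like the below and be handed to watchdog via its -c option from the
service file. The directives are standard watchdog.conf options; the
file path, values, and test/repair binary paths are just placeholders,
not an existing interface.</p>
<pre>
# Hypothetical /run/bmc-health-watchdog.conf, generated at service
# start from the BMC_health D-Bus properties (values are placeholders)
watchdog-device = /dev/watchdog
interval        = 10
max-load-1      = 24
# min-memory is in pages of free memory
min-memory      = 1024
test-binary     = /usr/libexec/bmc-health-test
test-timeout    = 60
repair-binary   = /usr/libexec/bmc-health-repair
repair-timeout  = 60
</pre>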
<p>We have seen instances where we get an error indicating that no
resources are available; those could be file descriptors, socket
descriptors, etc. Could this be plugged into watchdog as part of a
test binary that checks for it? We could hook up a repair binary to
take the corrective action.<br>
</p>
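<p>As a sketch of such a test binary (the 90% threshold is just an
assumption; exit status 0 means healthy, non-zero lets watchdog invoke
the repair binary):</p>
<pre>
#!/usr/bin/env python3
# Hypothetical watchdog test binary: catch "no resources available"
# situations early by checking system-wide file descriptor usage.
import sys

MAX_FD_USAGE = 0.9   # assumed policy: alarm at 90% of fs.file-max

def main():
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = (int(v) for v in f.read().split())
    if maximum and allocated / maximum > MAX_FD_USAGE:
        print("fd usage %d/%d above threshold" % (allocated, maximum))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
</pre>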
<p>Another thing I was looking at hooking into watchdog is a test of
file system usage as defined by a policy. The policy could list the
file system mounts and also the thresholds. For example, /tmp, /root,
etc. We could again hook up a repair binary to do some cleanup if
needed.</p>
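<p>A rough sketch of that test, with the policy hard-coded here just
for illustration (in practice the mounts and thresholds would come
from the policy file):</p>
<pre>
#!/usr/bin/env python3
# Hypothetical test binary for the file system usage policy.
import os
import sys

POLICY = {"/tmp": 0.80, "/var": 0.90}   # mount point -> max used fraction (assumed)

def used_fraction(mount):
    st = os.statvfs(mount)
    return 1.0 - (st.f_bavail / st.f_blocks) if st.f_blocks else 0.0

def main():
    rc = 0
    for mount, limit in POLICY.items():
        used = used_fraction(mount)
        if used > limit:
            print("%s is %.0f%% full (limit %.0f%%)" % (mount, used * 100, limit * 100))
            rc = 1   # non-zero exit asks watchdog to run the repair binary
    return rc

if __name__ == "__main__":
    sys.exit(main())
</pre>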
<p>If we see this list growing with custom requirements, then it
probably does not make sense to pollute watchdog(8) itself; should
these checks be consumed by the app instead?</p>
<p>!! Vishwa !!<br>
</p>
<div class="moz-cite-prefix">On 4/9/19 9:55 PM, Kun Yi wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGMNF6VHifnF8qC61HN2bboY8duArOuQ1FvK3mP1gA6Xbazcow@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">
<div>
<div>Hello there,</div>
<div><br>
</div>
<div>This topic has been brought up several times on the
mailing list and offline, but in general it seems we as a
community haven't reached a consensus on what would be the
most valuable things to monitor, and how to monitor them.
While a general-purpose monitoring infrastructure for
OpenBMC seems to be a hard problem, I have some simple
ideas that I hope can provide immediate and direct
benefits.</div>
<div><br>
</div>
<div>1. Monitoring host IPMI link reliability (host side)</div>
<div><br>
</div>
<div>The essentials I want are "IPMI commands sent" and
"IPMI commands succeeded" counts over time. More metrics
like response time would be helpful as well. The issue to
address here: when some IPMI sensor readings are flaky, it
would be really helpful to be able to tell from the IPMI
command stats whether it is a hardware issue or an IPMI
issue. Moreover, it would be a very useful regression
metric when rolling out new BMC software.</div>
<div><br>
</div>
<div>Looking at the host IPMI side, there are some metrics
exposed through /proc/ipmi/0/si_stats if the ipmi_si driver
is used, but I haven't dug into whether it contains
information that maps to the interrupts. Time to read the
source code, I guess.</div>
<div><br>
</div>
<div>Another idea would be to instrument caller libraries
like the interfaces in ipmitool, though I feel that
approach is harder due to fragmentation of IPMI libraries.</div>
<div><br>
</div>
<div>2. Read and expose core BMC performance metrics from
procfs</div>
<div><br>
</div>
<div>This is straightforward: have a smallish daemon (or
bmc-state-manager) read, parse, and process procfs and put
the values on D-Bus. Core metrics I'm interested in getting
this way: load average, memory, disk used/available, net
stats... The values can then simply be exported as IPMI
sensors or Redfish resource properties.</div>
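<p>To make that concrete, here is a rough sketch of the kind of
parsing I mean (nothing implemented; the metric names are made up,
and the D-Bus publishing part is left out):</p>
<pre>
#!/usr/bin/env python3
# Sketch of the "smallish daemon" idea: read a couple of procfs files
# and turn them into plain key/value metrics for later D-Bus export.
def read_metrics():
    metrics = {}
    with open("/proc/loadavg") as f:
        metrics["LoadAverage1Min"] = float(f.read().split()[0])
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("MemTotal", "MemAvailable"):
                metrics[key + "KiB"] = int(value.split()[0])
    return metrics

if __name__ == "__main__":
    for name, value in read_metrics().items():
        print(name, "=", value)
</pre>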
<div><br>
</div>
<div>A nice byproduct of this effort would be a procfs
parsing library. Since different platforms would probably
have different monitoring requirements and the procfs
output format has no standard, I'm thinking the user would
just provide a configuration file containing a list of
(procfs path, property regex, D-Bus property name) entries,
and compile-time generated code would provide an object for
each property. </div>
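<p>As a purely hypothetical example of one such entry and how it
might be evaluated (shown in Python for brevity rather than as the
compile-time generated code; the property name below is made up, not
an existing OpenBMC interface):</p>
<pre>
import re

# (procfs path, property regex, D-Bus property name) -- illustrative only
ENTRY = ("/proc/meminfo",
         r"^MemAvailable:\s+(\d+)\s+kB",
         "xyz.openbmc_project.Metrics.Memory.AvailableKiB")

path, regex, dbus_property = ENTRY
with open(path) as f:
    match = re.search(regex, f.read(), re.MULTILINE)
if match:
    print(dbus_property, "=", int(match.group(1)))
</pre>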
<div><br>
</div>
<div>All of this is merely thoughts and nothing concrete.
With that said, it would be really great if you could
provide some feedback such as "I want this, but I really
need that feature", or let me know it's all implemented
already :)</div>
<div><br>
</div>
<div>If this seems valuable, then after gathering more
feedback on feature requirements, I'm going to turn these
ideas into design docs and upload them for review.</div>
<div><br>
</div>
-- <br>
<div dir="ltr"
class="m_894391115062551385m_1187092449188926744gmail_signature">
<div dir="ltr">Regards,
<div>Kun</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</body>
</html>