<div dir="ltr">I'd also like to be in the metrics workgroup. Neeraj, I can see that the first and second points you listed align very well with the goals in my original proposal.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 17, 2019 at 12:28 AM vishwa &lt;<a href="mailto:vishwa@linux.vnet.ibm.com">vishwa@linux.vnet.ibm.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>IMO, we could start fresh here. The initial thought was more than a year
ago.</p>
<p>!! Vishwa !!<br>
</p>
<div class="gmail-m_1275227641964777196moz-cite-prefix">On 5/17/19 12:53 PM, Neeraj Ladkani
wrote:<br>
</div>
<blockquote type="cite">
<div dir="auto" style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
Sure thing. Is there a design document that exists for this
feature? <br>
<br>
</div>
<div dir="auto" style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
I can volunteer to drive this work group if we have quorum.<br>
<br>
</div>
<div dir="auto" style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
Neeraj <br>
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_1275227641964777196divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> vishwa
<a class="gmail-m_1275227641964777196moz-txt-link-rfc2396E" href="mailto:vishwa@linux.vnet.ibm.com" target="_blank"><vishwa@linux.vnet.ibm.com></a><br>
<b>Sent:</b> Friday, May 17, 2019 12:17:51 AM<br>
<b>To:</b> Neeraj Ladkani; Kun Yi; OpenBMC Maillist<br>
<b>Subject:</b> Re: BMC health metrics (again!)</font>
<div> </div>
</div>
<div>
<p>Neeraj,</p>
<p>Thanks for the inputs. It's nice to see us having a similar
thought.</p>
<p>AFAIK, we don't have any work-group that is driving <span style="color:windowtext">
“Platform telemetry and health monitoring”. Also, do we want
to see this as two different entities? In the past, there
were thoughts about using websockets to channel some of the
thermal parameters as telemetry data. But then it was not
implemented.</span></p>
<p><span style="color:windowtext">We can discuss here I think.</span></p>
<p><span style="color:windowtext">!! Vishwa !!<br>
</span></p>
<div class="gmail-m_1275227641964777196moz-cite-prefix">On 5/17/19 12:00 PM, Neeraj Ladkani
wrote:<br>
</div>
<blockquote type="cite">
<div class="gmail-m_1275227641964777196WordSection1">
<p class="MsoNormal"><span style="color:windowtext">At cloud
scale, telemetry and health monitoring are critical.
We should define a framework that allows platform owners
to add their own telemetry hooks. The telemetry service
should be designed to make this data accessible and
store it in a resilient way (like a black box in a plane
crash). <u></u><u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext">Is there
any workgroup that drives this feature “Platform
telemetry and health monitoring” ?
<u></u><u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext">Wishlist<u></u><u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext">BMC
telemetry : <u></u><u></u></span></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Linux subsystem<u></u><u></u></li>
<ol style="margin-top:0in" start="1" type="a">
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Uptime<u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
CPU Load average<u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Memory info<u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Storage usage ( RW ) <u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Dmesg<u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Syslog <u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
FDs of critical processes <u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Alignment traps <u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
WDT excursions <u></u><u></u></li>
</ol>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
IPMI subsystem<u></u><u></u></li>
<ol style="margin-top:0in" start="1" type="a">
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Request and Response logging per interface with
timestamps (KCS, LAN, USB)<u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Request and Response of IPMB<u></u><u></u></li>
</ol>
</ol>
<p class="gmail-m_1275227641964777196MsoListParagraph" style="margin-left:1.5in">
<span style="color:windowtext"><span><span style="font:7pt "Times New Roman"">
</span>i.<span> </span>
</span></span><span style="color:windowtext">Request , Response, No of
Retries<u></u><u></u></span></p>
<ol style="margin-top:0in" start="3" type="1">
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext;margin-left:0in">
Misc<u></u><u></u></li>
</ol>
<ol style="margin-top:0in" start="1" type="a">
<li class="gmail-m_1275227641964777196MsoListParagraph" style="color:windowtext">Critical
Temperature Excursions
<u></u><u></u></li>
</ol>
<p class="gmail-m_1275227641964777196MsoListParagraph" style="margin-left:1.5in">
<span style="color:windowtext"><span><span style="font:7pt "Times New Roman"">
</span>i.<span> </span>
</span></span><span style="color:windowtext">Minimum Reading of Sensor<u></u><u></u></span></p>
<p class="gmail-m_1275227641964777196MsoListParagraph" style="margin-left:1.5in">
<span style="color:windowtext"><span><span style="font:7pt "Times New Roman"">
</span>ii.<span> </span>
</span></span><span style="color:windowtext">Max Reading of a sensor<u></u><u></u></span></p>
<p class="gmail-m_1275227641964777196MsoListParagraph" style="margin-left:1.5in">
<span style="color:windowtext"><span><span style="font:7pt "Times New Roman"">
</span>iii.<span> </span>
</span></span><span style="color:windowtext">Count of state transition<u></u><u></u></span></p>
<p class="gmail-m_1275227641964777196MsoListParagraph" style="margin-left:1.5in">
<span style="color:windowtext"><span><span style="font:7pt "Times New Roman"">
</span>iv.<span> </span>
</span></span><span style="color:windowtext">Retry Count<u></u><u></u></span></p>
<ol style="margin-top:0in" start="2" type="a">
<li class="gmail-m_1275227641964777196MsoListParagraph">Count of assertions/deassertions of GPIO and
ability to capture the state<u></u><u></u></li>
<li class="gmail-m_1275227641964777196MsoListParagraph">timestamp of last assertion/deassertion of GPIO<u></u><u></u></li>
</ol>
<p class="MsoNormal"><span style="color:windowtext"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext">Thanks<u></u><u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext">~Neeraj<u></u><u></u></span></p>
<p class="MsoNormal"><span style="color:windowtext"><u></u> <u></u></span></p>
<div>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(225,225,225);padding:3pt 0in 0in">
<p class="MsoNormal"><b><span style="color:windowtext">From:</span></b><span style="color:windowtext"> openbmc
<a class="gmail-m_1275227641964777196moz-txt-link-rfc2396E" href="mailto:openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org" target="_blank">
<openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org></a> <b>On
Behalf Of </b>vishwa<br>
<b>Sent:</b> Wednesday, May 8, 2019 1:11 AM<br>
<b>To:</b> Kun Yi <a class="gmail-m_1275227641964777196moz-txt-link-rfc2396E" href="mailto:kunyi@google.com" target="_blank">
<kunyi@google.com></a>; OpenBMC Maillist <a class="gmail-m_1275227641964777196moz-txt-link-rfc2396E" href="mailto:openbmc@lists.ozlabs.org" target="_blank">
<openbmc@lists.ozlabs.org></a><br>
<b>Subject:</b> Re: BMC health metrics (again!)<u></u><u></u></span></p>
</div>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<p>Hello Kun,<u></u><u></u></p>
<p>Thanks for initiating it. I liked the /proc parsing. On
the IPMI thing, is it targeted only at IPMI -or- at a generic
BMC-Host communication link?<u></u><u></u></p>
<p>Some of the things in my wish-list are:<u></u><u></u></p>
<p>1/. Flash wear and tear detection and the threshold to be
a config option<br>
2/. Any SoC specific health checks ( If that is exposed )<br>
3/. Mechanism to detect spurious interrupts on any HW link<br>
4/. Some kind of check to see if there will be any I2C
lock to a given end device<br>
5/. Ability to detect errors on HW links<u></u><u></u></p>
<p>On the watchdog(8) area, I was just thinking these:<u></u><u></u></p>
<p>How about having some kind of BMC_health D-Bus properties
-or- a compile-time feed, whose values can be fed into a
configuration file, rather than watchdog always using the
default /etc/watchdog.conf? If the properties are coming
from D-Bus, then we could either append to
/etc/watchdog.conf -or- treat those values alone as the
config file that is given to watchdog.<br>
The systemd service files would be set up accordingly.<u></u><u></u></p>
<p><br>
We have seen instances where we get an error
indicating that no resources are available. Those could be file
descriptors / socket descriptors etc. Is there a way to plug this
into watchdog as part of a test binary that checks for this?
We could hook a repair-binary to take the action.<u></u><u></u></p>
<p><br>
Another thing that I was looking at hooking into watchdog
is a test that checks the file system usage as defined by a
policy.<br>
The policy could list the file system mounts and also the
threshold.<br>
<br>
For example, /tmp, /root, etc. We could again hook a
repair binary to do some cleanup if needed.<br>
<br>
If we see the list growing with these custom
requirements, then it probably does not make sense to pollute
watchdog(8); should we<br>
have these consumed into the app instead?<u></u><u></u></p>
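<p>As a rough illustration of the file system usage policy described above, the following Python sketch maps mount points to a maximum used fraction and reports the mounts that exceed it. The function name <code>over_threshold</code>, the policy layout, and the threshold values are assumptions made for this example; they are not part of any existing OpenBMC or watchdog interface.</p>

```python
import shutil
from typing import Dict, List


def over_threshold(policy: Dict[str, float], usage=shutil.disk_usage) -> List[str]:
    """Return the mount points whose used fraction exceeds the policy threshold.

    `policy` maps a mount point to the maximum allowed used fraction (0.0 to 1.0).
    `usage` is injectable so the check can be exercised without real mounts.
    """
    offenders = []
    for mount, max_used_fraction in policy.items():
        stats = usage(mount)  # exposes .total and .used, both in bytes
        # Skip empty pseudo file systems that report a total of zero.
        if stats.total and stats.used / stats.total > max_used_fraction:
            offenders.append(mount)
    return offenders
```

<p>A test binary built on this idea could exit with a non-zero status when the returned list is non-empty, for instance with a policy such as {"/tmp": 0.9, "/root": 0.8}. How that would be wired into watchdog or into a separate app is left open here.</p>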
<p>!! Vishwa !!<u></u><u></u></p>
<div>
<p class="MsoNormal">On 4/9/19 9:55 PM, Kun Yi wrote:<u></u><u></u></p>
</div>
<blockquote style="margin-top:5pt;margin-bottom:5pt">
<div>
<div>
<div>
<div>
<p class="MsoNormal">Hello there,<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">This topic has been brought
up several times on the mailing list and
offline, but in general it seems that we as a community
haven't reached a consensus on what things would be
the most valuable to monitor, and how to monitor
them. While a general-purpose
monitoring infrastructure for OpenBMC seems to be a hard
problem, I have some simple ideas that I hope
can provide immediate and direct benefits.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">1. Monitoring host IPMI link
reliability (host side)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">The essentials I want are
"IPMI commands sent" and "IPMI commands
succeeded" counts over time. More metrics like
response time would be helpful as well. The
issue to address here: when some IPMI sensor
readings are flaky, it would be really helpful
to be able to tell from IPMI command stats
whether it is a hardware issue or an IPMI issue.
Moreover, it would be a very useful regression
test metric for rolling out new BMC software.<u></u><u></u></p>
</div>
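<div>
<p class="MsoNormal">The "sent" and "succeeded" counts described above could be kept in a structure like the following Python sketch. The class name <code>IpmiLinkStats</code> and its methods are hypothetical names chosen for illustration; where the counters would be fed from (driver statistics, an instrumented library, or something else) is not settled in this thread.</p>
</div>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IpmiLinkStats:
    """Hypothetical host-side counters for IPMI commands (illustrative only)."""
    sent: int = 0
    succeeded: int = 0
    response_times_s: List[float] = field(default_factory=list)

    def record(self, ok: bool, response_time_s: float) -> None:
        # Count every command, and count the successful ones separately.
        self.sent += 1
        if ok:
            self.succeeded += 1
        self.response_times_s.append(response_time_s)

    def success_rate(self) -> float:
        # Report 1.0 when nothing has been sent, so an idle link is not flagged.
        return self.succeeded / self.sent if self.sent else 1.0

    def mean_response_time_s(self) -> float:
        times = self.response_times_s
        return sum(times) / len(times) if times else 0.0
```

<div>
<p class="MsoNormal">Comparing the success rate between two BMC software builds under the same workload is one way such counters could serve as a regression metric.</p>
</div>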
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Looking at the host IPMI
side, there are some metrics exposed
through /proc/ipmi/0/si_stats if the ipmi_si driver
is used, but I haven't dug into whether it
contains information mapping to the interrupts.
Time to read the source code I guess.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Another idea would be to
instrument caller libraries like the interfaces
in ipmitool, though I feel that approach is
harder due to fragmentation of IPMI libraries.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">2. Read and expose core BMC
performance metrics from procfs<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">This is straightforward: have
a smallish daemon (or bmc-state-manager)
read, parse, and process procfs and put values on
D-Bus. Core metrics I'm interested in getting
this way: load average, memory, disk
used/available, net stats... The values can then
simply be exported as IPMI sensors or Redfish
resource properties.<u></u><u></u></p>
</div>
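<div>
<p class="MsoNormal">The parsing half of such a daemon might look like the Python sketch below, which handles the content of /proc/loadavg and /proc/meminfo. The function names are assumptions made for this example, and publishing the values on D-Bus is deliberately left out.</p>
</div>

```python
from typing import Dict, Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text: str) -> Dict[str, int]:
    """Return /proc/meminfo fields as {name: value}; most values are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Ignore lines that do not look like "Name:   value [unit]".
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def read_text(path: str) -> str:
    """Read a procfs file in one go; procfs files are small."""
    with open(path) as f:
        return f.read()
```

<div>
<p class="MsoNormal">On a Linux system, <code>parse_loadavg(read_text("/proc/loadavg"))</code> would return the three load averages as floats. Keeping the parsers separate from the file access makes them easy to test with captured text.</p>
</div>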
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">A nice byproduct of this
effort would be a procfs parsing library. Since
different platforms would probably have
different monitoring requirements and procfs
output format has no standard, I'm thinking the
user would just provide a configuration file
containing a list of (procfs path, property regex,
D-Bus property name) entries, and
compile-time generated code would provide an object
for each property. <u></u><u></u></p>
</div>
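<div>
<p class="MsoNormal">To make the (procfs path, property regex, D-Bus property name) idea concrete, here is a small Python sketch. Note that it interprets the configuration at run time instead of generating code at compile time, which is a simplification of the proposal. The entries in <code>CONFIG</code>, the property names, and the <code>collect</code> function are all assumptions made for illustration.</p>
</div>

```python
import re
from typing import Dict, List, Optional, Tuple

# (procfs path, property regex with one capture group, D-Bus property name).
# These entries and property names are illustrative, not an agreed schema.
CONFIG: List[Tuple[str, str, str]] = [
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"^MemAvailable:\s+(\d+) kB", "MemAvailableKiB"),
    ("/proc/uptime", r"^(\S+)", "UptimeSeconds"),
]


def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()


def collect(config, reader=read_file) -> Dict[str, Optional[str]]:
    """Apply each regex to its file; map property name to captured text or None."""
    values = {}
    for path, pattern, prop in config:
        match = re.search(pattern, reader(path), re.MULTILINE)
        values[prop] = match.group(1) if match else None
    return values
```

<div>
<p class="MsoNormal">Calling <code>collect(CONFIG)</code> on a Linux system would return one captured string per configured property, which a daemon could then convert and publish. A property whose regex does not match is reported as <code>None</code> so that a format difference on a given platform is visible instead of silently dropped.</p>
</div>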
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">All of this is merely
thoughts and nothing concrete. With that said,
it would be really great if you could provide
some feedback such as "I want this, but I really
need that feature", or let me know it's all
implemented already :)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">If this seems valuable, after
gathering more feedback on feature requirements,
I'm going to turn them into design docs and
upload them for review.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<p class="MsoNormal">-- <u></u><u></u></p>
<div>
<div>
<p class="MsoNormal">Regards, <u></u><u></u></p>
<div>
<p class="MsoNormal">Kun<u></u><u></u></p>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Regards,<div>Kun</div></div></div>