<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>This is great!!</p>
<p>Neeraj / Kun, were you planning on putting together an initial
proposal?</p>
<p>!! Vishwa !!<br>
</p>
<div class="moz-cite-prefix">On 5/17/19 9:20 PM, Kun Yi wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGMNF6XyH-VGRh18acGUbJniJ_YLW-3dz6sFJTvKbO7ZraJcZA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">I'd also like to be in the metrics workgroup.
Neeraj, I can see that the first and second points you listed
align very well with my goals in the original proposal.</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, May 17, 2019 at 12:28
AM vishwa &lt;<a href="mailto:vishwa@linux.vnet.ibm.com"
moz-do-not-send="true">vishwa@linux.vnet.ibm.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>IMO, we could start fresh here. The initial thoughts are
from more than a year ago.</p>
<p>!! Vishwa !!<br>
</p>
<div class="gmail-m_1275227641964777196moz-cite-prefix">On
5/17/19 12:53 PM, Neeraj Ladkani wrote:<br>
</div>
<blockquote type="cite">
<div dir="auto"
style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
Sure thing. Is there an existing design document for
this feature? <br>
<br>
</div>
<div dir="auto"
style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
I can volunteer to drive this work group if we have
quorum.<br>
<br>
</div>
<div dir="auto"
style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
Neeraj <br>
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_1275227641964777196divRplyFwdMsg"
dir="ltr"><font style="font-size:11pt" face="Calibri,
sans-serif" color="#000000"><b>From:</b> vishwa <a
class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
href="mailto:vishwa@linux.vnet.ibm.com"
target="_blank" moz-do-not-send="true">&lt;vishwa@linux.vnet.ibm.com&gt;</a><br>
<b>Sent:</b> Friday, May 17, 2019 12:17:51 AM<br>
<b>To:</b> Neeraj Ladkani; Kun Yi; OpenBMC Maillist<br>
<b>Subject:</b> Re: BMC health metrics (again!)</font>
<div> </div>
</div>
<div>
<p>Neeraj,</p>
<p>Thanks for the input. It's nice to see that we are thinking
along similar lines.</p>
<p>AFAIK, we don't have any workgroup that is driving <span
style="color:windowtext"> “Platform telemetry and
health monitoring”. Also, do we want to treat these as
two different entities? In the past, there were
thoughts about using websockets to channel some of
the thermal parameters as telemetry data, but that
was never implemented.</span></p>
<p><span style="color:windowtext">I think we can discuss it
here.</span></p>
<p><span style="color:windowtext">!! Vishwa !!<br>
</span></p>
<div class="gmail-m_1275227641964777196moz-cite-prefix">On
5/17/19 12:00 PM, Neeraj Ladkani wrote:<br>
</div>
<blockquote type="cite">
<div class="gmail-m_1275227641964777196WordSection1">
<p class="MsoNormal"><span style="color:windowtext">At
cloud scale, telemetry and health monitoring are
critical. We should define a framework that
allows platform owners to add their own
telemetry hooks. The telemetry service should be
designed to make this data accessible and to store
it in a resilient way (like a black box in a plane
crash). </span></p>
<p class="MsoNormal"><span style="color:windowtext"> </span></p>
<p class="MsoNormal"><span style="color:windowtext">Is
there any workgroup that drives this feature,
“Platform telemetry and health monitoring”? </span></p>
<p class="MsoNormal"><span style="color:windowtext"> </span></p>
<p class="MsoNormal"><span style="color:windowtext">Wishlist</span></p>
<p class="MsoNormal"><span style="color:windowtext"> </span></p>
<p class="MsoNormal"><span style="color:windowtext">BMC
telemetry : </span></p>
<ol style="margin-top:0in" start="1" type="1">
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in"> Linux
subsystem</li>
<ol style="margin-top:0in" start="1" type="a">
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Uptime</li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in"> CPU
Load average</li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Memory info</li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Storage usage ( RW ) </li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Dmesg</li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Syslog </li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in"> FDs
of critical processes </li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Alignment traps </li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in"> WDT
excursions </li>
</ol>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in"> IPMI
subsystem</li>
<ol style="margin-top:0in" start="1" type="a">
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Request and response logging per interface
with timestamps (KCS, LAN, USB)</li>
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in">
Request and Response of IPMB</li>
</ol>
</ol>
<p
class="gmail-m_1275227641964777196MsoListParagraph"
style="margin-left:1.5in"> <span
style="color:windowtext"><span><span
style="font:7pt 'Times New Roman'">
</span>i.<span> </span> </span></span><span
style="color:windowtext">Request, response, number
of retries</span></p>
<ol style="margin-top:0in" start="3" type="1">
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext;margin-left:0in"> Misc</li>
</ol>
<ol style="margin-top:0in" start="1" type="a">
<li
class="gmail-m_1275227641964777196MsoListParagraph"
style="color:windowtext">Critical Temperature
Excursions </li>
</ol>
<p
class="gmail-m_1275227641964777196MsoListParagraph"
style="margin-left:1.5in"> <span
style="color:windowtext"><span><span
style="font:7pt 'Times New Roman'">
</span>i.<span> </span> </span></span><span
style="color:windowtext">Minimum reading of a
sensor</span></p>
<p
class="gmail-m_1275227641964777196MsoListParagraph"
style="margin-left:1.5in"> <span
style="color:windowtext"><span><span
style="font:7pt 'Times New Roman'">
</span>ii.<span> </span> </span></span><span
style="color:windowtext">Maximum reading of a sensor</span></p>
<p
class="gmail-m_1275227641964777196MsoListParagraph"
style="margin-left:1.5in"> <span
style="color:windowtext"><span><span
style="font:7pt 'Times New Roman'">
</span>iii.<span> </span> </span></span><span
style="color:windowtext">Count of state
transitions</span></p>
<p
class="gmail-m_1275227641964777196MsoListParagraph"
style="margin-left:1.5in"> <span
style="color:windowtext"><span><span
style="font:7pt 'Times New Roman'">
</span>iv.<span> </span> </span></span><span
style="color:windowtext">Retry Count</span></p>
<ol style="margin-top:0in" start="2" type="a">
<li
class="gmail-m_1275227641964777196MsoListParagraph">Count
of assertions/deassertions of GPIO and ability
to capture the state</li>
<li
class="gmail-m_1275227641964777196MsoListParagraph">timestamp
of last assertion/deassertion of GPIO</li>
</ol>
<p class="MsoNormal"><span style="color:windowtext"> </span></p>
<p class="MsoNormal"><span style="color:windowtext">Thanks</span></p>
<p class="MsoNormal"><span style="color:windowtext">~Neeraj</span></p>
<p class="MsoNormal"><span style="color:windowtext"> </span></p>
<div>
<div
style="border-right:none;border-bottom:none;border-left:none;border-top:1pt
solid rgb(225,225,225);padding:3pt 0in 0in">
<p class="MsoNormal"><b><span
style="color:windowtext">From:</span></b><span
style="color:windowtext"> openbmc <a
class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
href="mailto:openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org"
target="_blank" moz-do-not-send="true">
&lt;openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org&gt;</a> <b>On
Behalf Of </b>vishwa<br>
<b>Sent:</b> Wednesday, May 8, 2019 1:11 AM<br>
<b>To:</b> Kun Yi <a
class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
href="mailto:kunyi@google.com"
target="_blank" moz-do-not-send="true">
&lt;kunyi@google.com&gt;</a>; OpenBMC
Maillist <a
class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
href="mailto:openbmc@lists.ozlabs.org"
target="_blank" moz-do-not-send="true">
&lt;openbmc@lists.ozlabs.org&gt;</a><br>
<b>Subject:</b> Re: BMC health metrics
(again!)</span></p>
</div>
</div>
<p class="MsoNormal"> </p>
<p>Hello Kun,</p>
<p>Thanks for initiating it. I liked the /proc
parsing. On the IPMI part, is it targeted only at
IPMI, or at a generic BMC-Host communication link?</p>
<p>Some of the things on my wish list are:</p>
<p>1/. Flash wear-and-tear detection, with the
threshold as a config option<br>
2/. Any SoC-specific health checks (if they are
exposed)<br>
3/. A mechanism to detect spurious interrupts on any
HW link<br>
4/. Some kind of check to see whether there is any
I2C lock to a given end device<br>
5/. The ability to detect errors on HW links</p>
<p>On the watchdog(8) side, I was thinking along
these lines:</p>
<p>How about having some kind of BMC_health D-Bus
properties, or a compile-time feed, whose values
can be fed into a configuration file, rather than
watchdog always using the default /etc/watchdog.conf?
If the properties come from D-Bus, then we
could either append them to /etc/watchdog.conf or
treat those values as the only config file
given to watchdog.<br>
The systemd service files would be set up accordingly.</p>
<p><br>
We have seen instances where we get an error
indicating that no resources are available. Those could
be file descriptors, socket descriptors, etc. Is there a
way to plug this into watchdog as part of a test
binary that checks for this? We could hook in a
repair-binary to take action.</p>
<p><br>
Another thing that I was looking at hooking into
watchdog is a test that checks the file system usage
against a policy.<br>
The policy could list the file system mounts and
also the threshold.<br>
<br>
For example, /tmp, /root, etc. We could again
hook in a repair binary to do some cleanup if needed.<br>
<br>
If we see the list growing with these custom
requirements, then it probably does not make sense to
pollute watchdog(8), and we should
handle these in the app instead?</p>
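<p>As a rough illustration of the file system usage policy, here is a
small Python sketch. The policy layout (mount point mapped to a maximum
usage percentage) and the function names are assumptions made for the
example; only shutil.disk_usage is a real standard-library call.</p>
<pre>
```python
import shutil


def usage_percent(total, used):
    """Return used space as a percentage of total (0.0 when total is 0)."""
    return 100.0 * used / total if total else 0.0


def over_threshold(policy, usage_fn=shutil.disk_usage):
    """Return the mounts whose usage exceeds their policy threshold.

    policy maps a mount point to a maximum usage percentage, for
    example {"/tmp": 80, "/root": 90} (hypothetical values).
    """
    offenders = []
    for mount, limit in policy.items():
        usage = usage_fn(mount)
        if usage_percent(usage.total, usage.used) > limit:
            offenders.append(mount)
    return offenders
```
</pre>
<p>A test binary could exit non-zero when the returned list is not
empty, leaving the cleanup itself to the repair binary.</p>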
<p>!! Vishwa !!</p>
<div>
<p class="MsoNormal">On 4/9/19 9:55 PM, Kun Yi
wrote:</p>
</div>
<blockquote style="margin-top:5pt;margin-bottom:5pt">
<div>
<div>
<div>
<div>
<p class="MsoNormal">Hello there,</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">This topic has been
brought up several times on the mailing
list and offline, but in general it seems
we as a community haven't reached a
consensus on what things would be the
most valuable to monitor, and how to
monitor them. While a general-purpose
monitoring infrastructure for
OpenBMC seems to be a hard problem, I have some
simple ideas that I hope can provide
immediate and direct benefits.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">1. Monitoring host
IPMI link reliability (host side)</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">The essentials I want
are "IPMI commands sent" and "IPMI
commands succeeded" counts over time.
More metrics, like response time, would
be helpful as well. The issue to address
here: when some IPMI sensor readings are
flaky, it would be really helpful to be able
to tell from the IPMI command stats
whether it is a hardware
issue or an IPMI issue. Moreover, it would
be a very useful regression test metric
for rolling out new BMC software.</p>
</div>
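<div>
<p class="MsoNormal">To make the counters concrete, here is a minimal
Python sketch of the bookkeeping, assuming some caller reports each IPMI
command as it completes. The class and method names are hypothetical and
do not correspond to any existing IPMI library.</p>
<pre>
```python
class IpmiCommandStats:
    """Running counts of IPMI commands sent and succeeded."""

    def __init__(self):
        self.sent = 0
        self.succeeded = 0
        self.total_response_s = 0.0

    def record(self, ok, response_s):
        """Record one completed command and its response time in seconds."""
        self.sent += 1
        self.total_response_s += response_s
        if ok:
            self.succeeded += 1

    def success_rate(self):
        """Return succeeded / sent, or None before any command is recorded."""
        return self.succeeded / self.sent if self.sent else None

    def mean_response_s(self):
        """Return the mean response time, or None before any command."""
        return self.total_response_s / self.sent if self.sent else None
```
</pre>
<p class="MsoNormal">Comparing the success rate across BMC software
versions is one way this could serve as a regression metric.</p>
</div>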
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Looking at the host
IPMI side, there are some metrics exposed
through /proc/ipmi/0/si_stats if the ipmi_si
driver is used, but I haven't dug into
whether it contains information mapping
to the interrupts. Time to read the
source code, I guess.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Another idea would be
to instrument caller libraries like the
interfaces in ipmitool, though I feel
that approach is harder due to
fragmentation of IPMI libraries.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">2. Read and expose
core BMC performance metrics from procfs</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">This is
straightforward: have a smallish daemon
(or bmc-state-manager) read, parse, and
process procfs and put values on D-Bus.
Core metrics I'm interested in getting
this way: load average, memory,
disk used/available, net stats... The
values can then simply be exported as
IPMI sensors or Redfish resource
properties.</p>
</div>
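<div>
<p class="MsoNormal">A minimal sketch of the parsing step, in Python,
for two of the metrics named above. It assumes the usual Linux layouts
of /proc/loadavg and /proc/meminfo; the function names and the returned
key names are made up for illustration, and publishing the values on
D-Bus is left out.</p>
<pre>
```python
def parse_loadavg(text):
    """Parse /proc/loadavg content into the three load averages."""
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of raw integer values.

    Most fields are reported in kB; the unit suffix is dropped here.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def read_file(path):
    """Return the full text of a procfs file."""
    with open(path) as handle:
        return handle.read()
```
</pre>
<p class="MsoNormal">On a Linux system,
parse_loadavg(read_file("/proc/loadavg")) would give the current load
averages; keeping the parsers separate from the file reads makes them
easy to test with canned text.</p>
</div>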
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">A nice byproduct of
this effort would be a procfs parsing
library. Since different platforms would
probably have different monitoring
requirements and the procfs output format
has no standard, I'm thinking the user
would just provide a configuration file
containing a list of (procfs path,
property regex, D-Bus property name) entries,
and compile-time generated code would
provide an object for each property. </p>
</div>
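<div>
<p class="MsoNormal">The (procfs path, property regex, D-Bus property
name) idea can be sketched as follows. This is only an illustration: it
interprets the list at run time in Python instead of generating code at
compile time, and the property names and regexes in CONFIG are
hypothetical examples, not a proposed schema.</p>
<pre>
```python
import re

# Hypothetical entries: (procfs path, property regex, D-Bus property name).
CONFIG = [
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"^MemFree:\s+(\d+)", "MemFreeKiB"),
]


def extract(config, read_fn):
    """Apply each regex to its file and return {property name: raw string}.

    read_fn takes a path and returns the file content as text; entries
    whose regex does not match are skipped.
    """
    properties = {}
    for path, pattern, name in config:
        match = re.search(pattern, read_fn(path), re.MULTILINE)
        if match:
            properties[name] = match.group(1)
    return properties
```
</pre>
<p class="MsoNormal">Each regex is expected to carry one capture group
holding the value; converting the captured string to a number and
exposing it as a D-Bus property would be the next step.</p>
</div>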
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">All of these are merely
thoughts, and nothing is concrete yet. With that
said, it would be really great if you
could provide some feedback such as "I
want this, but I really need that
feature", or let me know if it's all
implemented already :)</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">If this seems
valuable, after gathering more feedback
on feature requirements, I'm going to
turn them into design docs and upload
them for review.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<p class="MsoNormal">-- </p>
<div>
<div>
<p class="MsoNormal">Regards, </p>
<div>
<p class="MsoNormal">Kun</p>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr" class="gmail_signature">
<div dir="ltr">Regards,
<div>Kun</div>
</div>
</div>
</blockquote>
</body>
</html>