<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>This is great!</p>
    <p>Neeraj / Kun, were you planning to put together an initial
      proposal?</p>
    <p>!! Vishwa !!<br>
    </p>
    <div class="moz-cite-prefix">On 5/17/19 9:20 PM, Kun Yi wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAGMNF6XyH-VGRh18acGUbJniJ_YLW-3dz6sFJTvKbO7ZraJcZA@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">I'd also like to be in the metric workgroup.
        Neeraj, I can see that the first and second points you listed
        align very well with the goals in my original proposal.</div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Fri, May 17, 2019 at 12:28
          AM vishwa &lt;<a href="mailto:vishwa@linux.vnet.ibm.com"
            moz-do-not-send="true">vishwa@linux.vnet.ibm.com</a>&gt;
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div bgcolor="#FFFFFF">
            <p>IMO, we could start fresh here. The initial idea is more
              than a year old.</p>
            <p>!! Vishwa !!<br>
            </p>
            <div class="gmail-m_1275227641964777196moz-cite-prefix">On
              5/17/19 12:53 PM, Neeraj Ladkani wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="auto"
style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
                Sure thing. Is there a design document for
                this feature? <br>
                <br>
              </div>
              <div dir="auto"
style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
                I can volunteer to drive this work group if we have
                quorum.<br>
                <br>
              </div>
              <div dir="auto"
style="direction:ltr;margin:0px;padding:0px;font-family:sans-serif;font-size:11pt;color:black">
                Neeraj <br>
                <br>
              </div>
              <hr style="display:inline-block;width:98%">
              <div id="gmail-m_1275227641964777196divRplyFwdMsg"
                dir="ltr"><font style="font-size:11pt" face="Calibri,
                  sans-serif" color="#000000"><b>From:</b> vishwa <a
                    class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
                    href="mailto:vishwa@linux.vnet.ibm.com"
                    target="_blank" moz-do-not-send="true">&lt;vishwa@linux.vnet.ibm.com&gt;</a><br>
                  <b>Sent:</b> Friday, May 17, 2019 12:17:51 AM<br>
                  <b>To:</b> Neeraj Ladkani; Kun Yi; OpenBMC Maillist<br>
                  <b>Subject:</b> Re: BMC health metrics (again!)</font>
                <div> </div>
              </div>
              <div>
                <p>Neeraj,</p>
                <p>Thanks for the input. It's nice to see that we are
                  thinking along similar lines.</p>
                <p>AFAIK, we don't have any workgroup that is driving <span
                    style="color:windowtext"> “Platform telemetry and
                    health monitoring”. Also, do we want to treat these
                    as two separate entities? In the past, there were
                    thoughts about using websockets to channel some of
                    the thermal parameters as telemetry data, but that
                    was never implemented.</span></p>
                <p><span style="color:windowtext">I think we can discuss
                    it here.</span></p>
                <p><span style="color:windowtext">!! Vishwa !!<br>
                  </span></p>
                <div class="gmail-m_1275227641964777196moz-cite-prefix">On
                  5/17/19 12:00 PM, Neeraj Ladkani wrote:<br>
                </div>
                <blockquote type="cite">
                  <div class="gmail-m_1275227641964777196WordSection1">
                    <p class="MsoNormal"><span style="color:windowtext">At
                        cloud scale, telemetry and health monitoring are
                        critical. We should define a framework that
                        allows platform owners to add their own
                        telemetry hooks. The telemetry service should
                        make this data accessible and store it in a
                        resilient way (like the black box that survives
                        a plane crash).  </span></p>
                    <p class="MsoNormal"><span style="color:windowtext"> </span></p>
                    <p class="MsoNormal"><span style="color:windowtext">Is
                        there a workgroup that drives this
                        “Platform telemetry and health monitoring” feature? </span></p>
                    <p class="MsoNormal"><span style="color:windowtext"> </span></p>
                    <p class="MsoNormal"><span style="color:windowtext">Wishlist</span></p>
                    <p class="MsoNormal"><span style="color:windowtext"> </span></p>
                    <p class="MsoNormal"><span style="color:windowtext">BMC
                        telemetry : </span></p>
                    <ol style="margin-top:0in" start="1" type="1">
                      <li
                        class="gmail-m_1275227641964777196MsoListParagraph"
                        style="color:windowtext;margin-left:0in"> Linux
                        subsystem</li>
                      <ol style="margin-top:0in" start="1" type="a">
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Uptime</li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in"> CPU
                          Load average</li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Memory info</li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Storage usage ( RW )  </li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Dmesg</li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Syslog </li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in"> FDs
                          of critical processes </li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Alignment traps </li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in"> WDT
                          excursions </li>
                      </ol>
                      <li
                        class="gmail-m_1275227641964777196MsoListParagraph"
                        style="color:windowtext;margin-left:0in"> IPMI
                        subsystem</li>
                      <ol style="margin-top:0in" start="1" type="a">
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Request and response logging per interface,
                          with timestamps (KCS, LAN, USB)</li>
                        <li
                          class="gmail-m_1275227641964777196MsoListParagraph"
                          style="color:windowtext;margin-left:0in">
                          Request and Response of IPMB</li>
                      </ol>
                    </ol>
                    <p
                      class="gmail-m_1275227641964777196MsoListParagraph"
                      style="margin-left:1.5in"><span
                        style="color:windowtext">i. Request, response,
                        number of retries</span></p>
                    <ol style="margin-top:0in" start="3" type="1">
                      <li
                        class="gmail-m_1275227641964777196MsoListParagraph"
                        style="color:windowtext;margin-left:0in"> Misc</li>
                    </ol>
                    <ol style="margin-top:0in" start="1" type="a">
                      <li
                        class="gmail-m_1275227641964777196MsoListParagraph"
                        style="color:windowtext">Critical Temperature
                        Excursions </li>
                    </ol>
                    <p
                      class="gmail-m_1275227641964777196MsoListParagraph"
                      style="margin-left:1.5in"><span
                        style="color:windowtext">i. Minimum reading of a
                        sensor</span></p>
                    <p
                      class="gmail-m_1275227641964777196MsoListParagraph"
                      style="margin-left:1.5in"><span
                        style="color:windowtext">ii. Maximum reading of a
                        sensor</span></p>
                    <p
                      class="gmail-m_1275227641964777196MsoListParagraph"
                      style="margin-left:1.5in"><span
                        style="color:windowtext">iii. Count of state
                        transitions</span></p>
                    <p
                      class="gmail-m_1275227641964777196MsoListParagraph"
                      style="margin-left:1.5in"><span
                        style="color:windowtext">iv. Retry count</span></p>
                    <ol style="margin-top:0in" start="2" type="a">
                      <li
                        class="gmail-m_1275227641964777196MsoListParagraph">Count
                        of assertions/deassertions of GPIO and ability
                        to capture the state</li>
                      <li
                        class="gmail-m_1275227641964777196MsoListParagraph">timestamp
                        of last assertion/deassertion of GPIO</li>
                    </ol>
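                    <p class="MsoNormal">As a rough illustration of the
                      critical temperature excursion items above (minimum
                      reading, maximum reading, count of state
                      transitions), here is a minimal Python sketch. The
                      class name, its fields, and the 95.0 threshold are
                      hypothetical and not part of any existing OpenBMC
                      interface.</p>
<pre>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExcursionTracker:
    """Hypothetical per-sensor record of critical-temperature excursions."""
    critical: float
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    transitions: int = 0
    in_excursion: bool = False

    def update(self, reading: float) -> None:
        # Track the extreme readings seen so far.
        self.minimum = reading if self.minimum is None else min(self.minimum, reading)
        self.maximum = reading if self.maximum is None else max(self.maximum, reading)
        # Count each change between the normal and the critical state.
        now_critical = reading >= self.critical
        if now_critical != self.in_excursion:
            self.transitions += 1
            self.in_excursion = now_critical


tracker = ExcursionTracker(critical=95.0)
for reading in [70.0, 96.5, 97.0, 80.0, 99.0]:
    tracker.update(reading)
print(tracker.minimum, tracker.maximum, tracker.transitions)
```
</pre>
                    <p class="MsoNormal">The same pattern could count GPIO
                      assertions and deassertions by feeding it 0/1 values
                      with a threshold of 1.</p>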
                    <p class="MsoNormal"><span style="color:windowtext"> </span></p>
                    <p class="MsoNormal"><span style="color:windowtext">Thanks</span></p>
                    <p class="MsoNormal"><span style="color:windowtext">~Neeraj</span></p>
                    <p class="MsoNormal"><span style="color:windowtext"> </span></p>
                    <div>
                      <div
style="border-right:none;border-bottom:none;border-left:none;border-top:1pt
                        solid rgb(225,225,225);padding:3pt 0in 0in">
                        <p class="MsoNormal"><b><span
                              style="color:windowtext">From:</span></b><span
                            style="color:windowtext"> openbmc <a
                              class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
href="mailto:openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org"
                              target="_blank" moz-do-not-send="true">
&lt;openbmc-bounces+neladk=microsoft.com@lists.ozlabs.org&gt;</a> <b>On
                              Behalf Of </b>vishwa<br>
                            <b>Sent:</b> Wednesday, May 8, 2019 1:11 AM<br>
                            <b>To:</b> Kun Yi <a
                              class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
                              href="mailto:kunyi@google.com"
                              target="_blank" moz-do-not-send="true">
                              &lt;kunyi@google.com&gt;</a>; OpenBMC
                            Maillist <a
                              class="gmail-m_1275227641964777196moz-txt-link-rfc2396E"
                              href="mailto:openbmc@lists.ozlabs.org"
                              target="_blank" moz-do-not-send="true">
                              &lt;openbmc@lists.ozlabs.org&gt;</a><br>
                            <b>Subject:</b> Re: BMC health metrics
                            (again!)</span></p>
                      </div>
                    </div>
                    <p class="MsoNormal"> </p>
                    <p>Hello Kun,</p>
                    <p>Thanks for initiating it. I liked the /proc
                      parsing. On the IPMI side, is it targeted only at
                      IPMI, or at a generic BMC-host communication link?</p>
                    <p>Some of the things in my wish-list are:</p>
                    <p>1/. Flash wear-and-tear detection, with the
                      threshold as a config option<br>
                      2/. Any SoC-specific health checks (if they are
                      exposed)<br>
                      3/. A mechanism to detect spurious interrupts on any
                      HW link<br>
                      4/. Some kind of check to see whether there is an
                      I2C lock on a given end device<br>
                      5/. Ability to detect errors on HW links</p>
                    <p>In the watchdog(8) area, I was thinking along
                      these lines:</p>
                    <p>How about having some kind of BMC_health D-Bus
                      properties -or- a compile-time feed, whose values
                      can be fed into a configuration file, rather than
                      watchdog always using the default
                      /etc/watchdog.conf? If the properties are coming
                      from D-Bus, then we could either append them to
                      /etc/watchdog.conf -or- treat those values alone as
                      the config file that is given to watchdog.<br>
                      The systemd service files would be set up accordingly.</p>
                    <p><br>
                      We have seen instances where we get an error
                      indicating that no resources are available. Those
                      could be file descriptors, socket descriptors, etc.
                      Is there a way to plug this into watchdog, as a
                      test binary that checks for this condition? We
                      could hook a repair binary to take the action.</p>
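                    <p>A minimal Python sketch of such a descriptor check,
                      assuming the test binary reports health through its
                      exit status (0 for healthy, non-zero otherwise). The
                      function names, the 0.9 ratio, and the exit code 1
                      are hypothetical; the limit is supplied by the
                      caller.</p>
<pre>
```python
import os


def count_open_fds(pid: int) -> int:
    # Each entry in /proc/PID/fd is one open descriptor (Linux only).
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_check_exit_code(open_fds: int, limit: int, max_ratio: float = 0.9) -> int:
    # 0 means healthy; 1 is a hypothetical "descriptors nearly exhausted" code.
    return 1 if open_fds >= max_ratio * limit else 0


# Example with fixed numbers: 1000 of 1024 descriptors in use.
print(fd_check_exit_code(1000, 1024))
```
</pre>
                    <p>A real check would pass
                      <code>count_open_fds(pid)</code> for each critical
                      process and exit with the returned code.</p>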
                    <p><br>
                      Another thing that I was looking at hooking into
                      watchdog is a test of file system usage as defined
                      by a policy.<br>
                      The policy could list the file system mounts and
                      also the threshold.<br>
                      <br>
                      For example, /tmp, /root, etc. We could again
                      hook a repair binary to do some cleanup if needed.<br>
                      <br>
                      If the list keeps growing with these custom
                      requirements, then it probably does not make sense
                      to pollute watchdog(8). Should we
                      have these consumed by the app instead?</p>
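                    <p>To make the policy idea concrete, here is a minimal
                      Python sketch, assuming the policy maps each mount
                      point to a maximum usage percentage. The policy
                      format, the thresholds, and the function names are
                      hypothetical.</p>
<pre>
```python
import shutil
from typing import Callable, Dict, List

# Hypothetical policy: mount point -> maximum allowed usage in percent.
POLICY: Dict[str, float] = {"/tmp": 80.0, "/root": 90.0}


def percent_used(path: str) -> float:
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(policy: Dict[str, float],
                   measure: Callable[[str], float] = percent_used) -> List[str]:
    # Return the mounts whose usage exceeds the policy threshold.
    return [mount for mount, limit in policy.items() if measure(mount) > limit]


# Example with canned measurements instead of real mounts.
fake_usage = {"/tmp": 93.5, "/root": 41.0}
offenders = over_threshold(POLICY, fake_usage.__getitem__)
print(offenders)
```
</pre>
                    <p>A repair binary could then be invoked for each
                      mount in the returned list.</p>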
                    <p>!! Vishwa !!</p>
                    <div>
                      <p class="MsoNormal">On 4/9/19 9:55 PM, Kun Yi
                        wrote:</p>
                    </div>
                    <blockquote style="margin-top:5pt;margin-bottom:5pt">
                      <div>
                        <div>
                          <div>
                            <div>
                              <p class="MsoNormal">Hello there,</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">This topic has been
                                brought up several times on the mailing
                                list and offline, but in general it seems
                                that we as a community have not reached a
                                consensus on what would be the
                                most valuable to monitor, and how to
                                monitor it. While a general-purpose
                                monitoring infrastructure for
                                OpenBMC seems to be a hard problem, I have
                                some simple ideas that I hope can provide
                                immediate and direct benefits.</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">1. Monitoring host
                                IPMI link reliability (host side)</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">The essentials I want
                                are "IPMI commands sent" and "IPMI
                                commands succeeded" counts over time.
                                More metrics like response time would
                                be helpful as well. The issue to address
                                here: when some IPMI sensor readings are
                                flaky, it would be really helpful to
                                determine from IPMI command stats
                                whether it is a hardware
                                issue or an IPMI issue. Moreover, it would
                                be a very useful regression test metric
                                for rolling out new BMC software.</p>
                            </div>
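                            <div>
                              <p class="MsoNormal">A minimal Python sketch
                                of the "sent" and "succeeded" counters
                                with response times, assuming the caller
                                reports each command's outcome. The class
                                and method names are hypothetical and do
                                not correspond to an existing IPMI
                                library.</p>
<pre>
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IpmiCommandStats:
    """Hypothetical counters for one host IPMI interface."""
    sent: int = 0
    succeeded: int = 0
    response_times_ms: List[float] = field(default_factory=list)

    def record(self, ok: bool, response_time_ms: float) -> None:
        # Count every command sent, and separately those that succeeded.
        self.sent += 1
        if ok:
            self.succeeded += 1
        self.response_times_ms.append(response_time_ms)

    def success_rate(self) -> float:
        # Report 1.0 when nothing has been sent yet, to avoid dividing by zero.
        return self.succeeded / self.sent if self.sent else 1.0

    def mean_response_ms(self) -> float:
        times = self.response_times_ms
        return sum(times) / len(times) if times else 0.0


stats = IpmiCommandStats()
for ok, ms in [(True, 12.0), (True, 15.0), (False, 250.0), (True, 13.0)]:
    stats.record(ok, ms)
print(stats.sent, stats.succeeded, stats.success_rate())
```
</pre>
                            </div>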
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">Looking at the host
                                IPMI side, there are some metrics exposed
                                through /proc/ipmi/0/si_stats if the ipmi_si
                                driver is used, but I haven't dug into
                                whether it contains information mapping
                                to the interrupts. Time to read the
                                source code, I guess.</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">Another idea would be
                                to instrument caller libraries like the
                                interfaces in ipmitool, though I feel
                                that approach is harder due to
                                fragmentation of IPMI libraries.</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">2. Read and expose
                                core BMC performance metrics from procfs</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">This is
                                straightforward: have a smallish daemon
                                (or bmc-state-manager) read, parse, and
                                process procfs and put values on D-Bus.
                                Core metrics I'm interested in getting
                                this way: load average, memory,
                                disk used/available, net stats... The
                                values can then simply be exported as
                                IPMI sensors or Redfish resource
                                properties.</p>
                            </div>
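                            <div>
                              <p class="MsoNormal">As a rough sketch of
                                the parsing step only (publishing on D-Bus
                                is left out), the following Python reads
                                load average and memory values from text
                                in the /proc/loadavg and /proc/meminfo
                                formats. The function names and sample
                                values are hypothetical.</p>
<pre>
```python
from typing import Dict, Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    # /proc/loadavg starts with the 1-, 5- and 15-minute load averages.
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text: str) -> Dict[str, int]:
    # /proc/meminfo lines look like "MemTotal:  123456 kB"; keep the number.
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


# Sample text in the procfs formats; a daemon would read the real files.
sample_loadavg = "0.20 0.18 0.12 1/80 11206\n"
sample_meminfo = "MemTotal:  492544 kB\nMemFree:  301212 kB\n"
load = parse_loadavg(sample_loadavg)
mem = parse_meminfo(sample_meminfo)
print(load, mem)
```
</pre>
                            </div>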
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">A nice byproduct of
                                this effort would be a procfs parsing
                                library. Since different platforms would
                                probably have different monitoring
                                requirements and procfs output format
                                has no standard, I'm thinking the user
                                would just provide a configuration file
                                containing a list of (procfs path,
                                property regex, D-Bus property name)
                                entries, and compile-time generated code
                                would provide an object for each property. </p>
                            </div>
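                            <div>
                              <p class="MsoNormal">A minimal Python sketch
                                of such a (procfs path, property regex,
                                D-Bus property name) list. Note that it
                                evaluates the entries at run time rather
                                than generating code at compile time, and
                                that the entries, property names, and the
                                <code>collect</code> helper are
                                hypothetical.</p>
<pre>
```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical configuration: (procfs path, property regex, D-Bus property name).
CONFIG: List[Tuple[str, str, str]] = [
    ("/proc/loadavg", r"^(\S+)", "LoadAverage1Min"),
    ("/proc/meminfo", r"MemFree:\s+(\d+)", "MemFreeKiB"),
]


def collect(config: List[Tuple[str, str, str]],
            read_file: Callable[[str], str]) -> Dict[str, str]:
    # Apply each regex to its file and keep the first capture group.
    props = {}
    for path, pattern, name in config:
        match = re.search(pattern, read_file(path), re.MULTILINE)
        if match:
            props[name] = match.group(1)
    return props


# Canned file contents stand in for the real procfs files.
fake_files = {
    "/proc/loadavg": "0.20 0.18 0.12 1/80 11206\n",
    "/proc/meminfo": "MemTotal:  492544 kB\nMemFree:  301212 kB\n",
}
props = collect(CONFIG, fake_files.__getitem__)
print(props)
```
</pre>
                            </div>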
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">All of this is merely
                                a set of thoughts and nothing concrete.
                                With that said, it would be really great
                                if you could provide some feedback such
                                as "I want this, but I really need that
                                feature", or let me know it's all
                                implemented already :)</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal">If this seems
                                valuable, after gathering more feedback
                                of feature requirements, I'm going to
                                turn them into design docs and upload
                                for review.</p>
                            </div>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <p class="MsoNormal">-- </p>
                            <div>
                              <div>
                                <p class="MsoNormal">Regards, </p>
                                <div>
                                  <p class="MsoNormal">Kun</p>
                                </div>
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
      <br clear="all">
      <div><br>
      </div>
      -- <br>
      <div dir="ltr" class="gmail_signature">
        <div dir="ltr">Regards,
          <div>Kun</div>
        </div>
      </div>
    </blockquote>
  </body>
</html>