<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hello Kun,</p>
<p>Thanks for initiating this. I liked the /proc parsing idea. On the
IPMI piece, is it targeted only at IPMI -or- at a generic BMC-Host
communication link?<br>
</p>
<p>Some of the things in my wish-list are:<br>
</p>
<p>1/. Flash wear-and-tear detection, with the threshold as a config
option (a rough sketch of such a check follows this list)<br>
2/. Any SoC-specific health checks (if those are exposed)<br>
3/. A mechanism to detect spurious interrupts on any HW link<br>
4/. Some kind of check to detect whether the I2C bus to a given end
device is locked up<br>
5/. The ability to detect errors on HW links</p>
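<p>For 1/, something along these lines is what I have in mind. It is
only a sketch: it assumes a kernel new enough to expose the eMMC
EXT_CSD life-time estimate through sysfs, and the path and threshold
below are assumptions, not an existing OpenBMC interface.</p>
<pre>
#!/usr/bin/env python3
# Hypothetical flash wear check: read the eMMC life-time estimate from
# sysfs and compare it against a configurable threshold.
import sys

LIFE_TIME = "/sys/block/mmcblk0/device/life_time"  # two hex values; 0x01 = 0-10% used
THRESHOLD = 0x08                                    # assumed policy: alarm at ~70-80% used

def main():
    try:
        with open(LIFE_TIME) as f:
            est_a, est_b = (int(v, 16) for v in f.read().split())
    except OSError as e:
        print("cannot read eMMC life time: %s" % e, file=sys.stderr)
        return 1
    if max(est_a, est_b) >= THRESHOLD:
        print("flash wear above threshold: %#x / %#x" % (est_a, est_b))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
</pre>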
<p>On the watchdog(8) front, I was thinking along these lines:<br>
</p>
<p>How about having some kind of BMC_health D-Bus properties -or- a
compile-time feed, whose values are turned into a configuration file,
rather than watchdog always using the default /etc/watchdog.conf? If
the properties come from D-Bus, then we could either append them to
/etc/watchdog.conf -or- treat the generated values as the only config
file given to watchdog.<br>
The systemd service files would need to be set up accordingly.<br>
</p>
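<p>To make the idea concrete, the generated file could look roughly
like the below and be handed to watchdog via its -c option from the
service file. The directives are standard watchdog.conf options; the
file path, values, and test/repair binary paths are just placeholders,
not an existing interface.</p>
<pre>
# Hypothetical /run/bmc-health-watchdog.conf, generated at service
# start from the BMC_health D-Bus properties (values are placeholders)
watchdog-device = /dev/watchdog
interval        = 10
max-load-1      = 24
# min-memory is in pages of free memory
min-memory      = 1024
test-binary     = /usr/libexec/bmc-health-test
test-timeout    = 60
repair-binary   = /usr/libexec/bmc-health-repair
repair-timeout  = 60
</pre>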
<p>We have seen instances where we get an error indicating that no
resources are available; those could be file descriptors, socket
descriptors, etc. Could this be plugged into watchdog as part of a
test binary that checks for it? We could hook up a repair binary to
take the corrective action.<br>
</p>
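<p>As a sketch of such a test binary (the 90% threshold is just an
assumption; exit status 0 means healthy, non-zero lets watchdog invoke
the repair binary):</p>
<pre>
#!/usr/bin/env python3
# Hypothetical watchdog test binary: catch "no resources available"
# situations early by checking system-wide file descriptor usage.
import sys

MAX_FD_USAGE = 0.9   # assumed policy: alarm at 90% of fs.file-max

def main():
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = (int(v) for v in f.read().split())
    if maximum and allocated / maximum > MAX_FD_USAGE:
        print("fd usage %d/%d above threshold" % (allocated, maximum))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
</pre>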
<p>Another thing I was looking at hooking into watchdog is a test of
file system usage as defined by a policy. The policy could list the
file system mounts and also the thresholds. For example, /tmp, /root,
etc. We could again hook up a repair binary to do some cleanup if
needed.</p>
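<p>A rough sketch of that test, with the policy hard-coded here just
for illustration (in practice the mounts and thresholds would come
from the policy file):</p>
<pre>
#!/usr/bin/env python3
# Hypothetical test binary for the file system usage policy.
import os
import sys

POLICY = {"/tmp": 0.80, "/var": 0.90}   # mount point -> max used fraction (assumed)

def used_fraction(mount):
    st = os.statvfs(mount)
    return 1.0 - (st.f_bavail / st.f_blocks) if st.f_blocks else 0.0

def main():
    rc = 0
    for mount, limit in POLICY.items():
        used = used_fraction(mount)
        if used > limit:
            print("%s is %.0f%% full (limit %.0f%%)" % (mount, used * 100, limit * 100))
            rc = 1   # non-zero exit asks watchdog to run the repair binary
    return rc

if __name__ == "__main__":
    sys.exit(main())
</pre>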
<p>If we see this list growing with custom requirements, then it
probably does not make sense to pollute watchdog(8) itself; should
these checks be consumed by the app instead?</p>
<p>!! Vishwa !!<br>
</p>
<div class="moz-cite-prefix">On 4/9/19 9:55 PM, Kun Yi wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAGMNF6VHifnF8qC61HN2bboY8duArOuQ1FvK3mP1gA6Xbazcow@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">
<div>
<div>Hello there,</div>
<div><br>
</div>
<div>This topic has been brought up several times on the
mailing list and offline, but in general it seems we as a
community haven't reached a consensus on what would be the
most valuable things to monitor, and how to monitor them.
While a general-purpose monitoring infrastructure for
OpenBMC seems to be a hard problem, I have some simple
ideas that I hope can provide immediate and direct
benefits.</div>
<div><br>
</div>
<div>1. Monitoring host IPMI link reliability (host side)</div>
<div><br>
</div>
<div>The essentials I want are "IPMI commands sent" and
"IPMI commands succeeded" counts over time. More metrics
like response time would be helpful as well. The issue to
address here: when some IPMI sensor readings are flaky, it
would be really helpful to be able to tell from the IPMI
command stats whether it is a hardware issue or an IPMI
issue. Moreover, it would be a very useful regression
metric when rolling out new BMC software.</div>
<div><br>
</div>
<div>Looking at the host IPMI side, there are some metrics
exposed through /proc/ipmi/0/si_stats if the ipmi_si driver
is used, but I haven't dug into whether it contains
information that maps to the interrupts. Time to read the
source code, I guess.</div>
<div><br>
</div>
<div>Another idea would be to instrument caller libraries
like the interfaces in ipmitool, though I feel that
approach is harder due to fragmentation of IPMI libraries.</div>
<div><br>
</div>
<div>2. Read and expose core BMC performance metrics from
procfs</div>
<div><br>
</div>
<div>This is straightforward: have a smallish daemon (or
bmc-state-manager) read, parse, and process procfs and put
the values on D-Bus. Core metrics I'm interested in getting
this way: load average, memory, disk used/available, net
stats... The values can then simply be exported as IPMI
sensors or Redfish resource properties.</div>
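<p>To make that concrete, here is a rough sketch of the kind of
parsing I mean (nothing implemented; the metric names are made up,
and the D-Bus publishing part is left out):</p>
<pre>
#!/usr/bin/env python3
# Sketch of the "smallish daemon" idea: read a couple of procfs files
# and turn them into plain key/value metrics for later D-Bus export.
def read_metrics():
    metrics = {}
    with open("/proc/loadavg") as f:
        metrics["LoadAverage1Min"] = float(f.read().split()[0])
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in ("MemTotal", "MemAvailable"):
                metrics[key + "KiB"] = int(value.split()[0])
    return metrics

if __name__ == "__main__":
    for name, value in read_metrics().items():
        print(name, "=", value)
</pre>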
<div><br>
</div>
<div>A nice byproduct of this effort would be a procfs
parsing library. Since different platforms would probably
have different monitoring requirements and the procfs
output format has no standard, I'm thinking the user would
just provide a configuration file containing a list of
(procfs path, property regex, D-Bus property name) entries,
and compile-time generated code would provide an object for
each property. </div>
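<p>As a purely hypothetical example of one such entry and how it
might be evaluated (shown in Python for brevity rather than as the
compile-time generated code; the property name below is made up, not
an existing OpenBMC interface):</p>
<pre>
import re

# (procfs path, property regex, D-Bus property name) -- illustrative only
ENTRY = ("/proc/meminfo",
         r"^MemAvailable:\s+(\d+)\s+kB",
         "xyz.openbmc_project.Metrics.Memory.AvailableKiB")

path, regex, dbus_property = ENTRY
with open(path) as f:
    match = re.search(regex, f.read(), re.MULTILINE)
if match:
    print(dbus_property, "=", int(match.group(1)))
</pre>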
<div><br>
</div>
<div>All of this is merely thoughts and nothing concrete.
With that said, it would be really great if you could
provide some feedback such as "I want this, but I really
need that feature", or let me know it's all implemented
already :)</div>
<div><br>
</div>
<div>If this seems valuable, then after gathering more
feedback on feature requirements, I'm going to turn these
ideas into design docs and upload them for review.</div>
<div><br>
</div>
-- <br>
<div dir="ltr"
class="m_894391115062551385m_1187092449188926744gmail_signature">
<div dir="ltr">Regards,
<div>Kun</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</body>
</html>