support NVMe drive health monitoring

Ed Tanous ed at tanous.net
Fri Nov 13 11:06:19 AEDT 2020


On Thu, Nov 12, 2020 at 2:59 PM Andrew Jeffery <andrew at aj.id.au> wrote:
>
> Hey Ed,
>
> On Fri, 13 Nov 2020, at 02:55, Ed Tanous wrote:
> > On Wed, Nov 11, 2020 at 9:56 PM Jet Li <Jet.Li at ibm.com> wrote:
> > >
> > > Hi Ed, Rashmi,
> > >
> > > IBM are working on supporting NVMe drive health monitoring for some new system designs and have been posting some patches to dbus-sensors (e.g. [1]).
> > >
> > > [1] https://gerrit.openbmc-project.xyz/c/openbmc/dbus-sensors/+/38058/
> > >
> > > As some background our high level requirements are that we need to:
> > >
> > > 1. Monitor NVMe drive health to supply data for fan management and error logging
> > > 2. Support arbitrary drive configurations in the platform(s), which are plugged into removable IO cards
> > >
> > > As you're no doubt aware, NVMe drive state is exposed via two interfaces:
> > >
> > > 3. A Basic Management Command over SMBus
> > > 4. NVMe-MI over MCTP, via the SMBus MCTP binding
> > >
> > > And in OpenBMC there are two corresponding NVMe monitoring implementations:
> > >
> > > 5. phosphor-nvme: Basic Management Command, static configuration
> > > 6. dbus-sensors: NVMe-MI over MCTP, dynamic configuration
> > >
> > > IBM are migrating our BMC designs towards entity-manager and dbus-sensors, and our newer system designs would benefit from dbus-sensor's dynamic configuration via entity-manager (we have reservations about the drive presence GPIO configuration in the phosphor-nvme with respect to our system designs).
> > > Zooming out briefly we're looking to, where possible, use upstream code and avoid the use of forks. However, as dbus-sensors uses NVMe-MI over MCTP via libmctp, we have some concerns:
> > >
> > > 7. Upstream libmctp (openbmc/libmctp) currently implements just two bindings
> > >     a. Serial (DSP0253)
> > >     b. ASTLPC (Vendor-specific binding)
> > > 8. Intel have forked libmctp (intel-bmc/libmctp) and implemented a further two bindings
> > >     a. SMBus (DSP0237)
> > >     b. PCIe VDM (DSP0238)
> > > 9. Both the SMBus and PCIe VDM binding implementations in intel-bmc/libmctp require kernel patches that only exist in Intel's OpenBMC tree
> >
> > Nit;  the SMBus side requires one patch which was written by aspeed,
> > and was part of the Aspeed original BSP.  That doesn't exist "only" in
> > the Intel tree.
>
> The broader point was that it doesn't exist in trees that we care about:
> openbmc/linux or Torvalds' tree. It probably could have been worded a bit
> better.
>
> >
> > It should also be noted, this patch was in fact submitted to the
> > kernel, and hopefully looks like it just needs some minor cleanups
> > that never got done.
> > https://lkml.org/lkml/2018/4/23/835
>
> There's also the issue of the Intel fork of libmctp, so there's a bit of a
> dependency chain of out-of-tree patches going on here. No-one has tried to
> upstream the SMBus binding implementation, probably because no-one has
> completed the job of upstreaming the required kernel interface.

Part of me wonders if the best path forward here would be to remove
the NVMeSensor for the time being, then add it back once the patches
have been mainlined.  That might be a little aggressive, but would
give us a clear slate where we can talk about adding nvme-mi basic.
I'll have to ask James why they were merged in the first place and
what the intent was.

>
> There was a short route to getting the information we want while avoiding
> forks, which was to go via the basic management command (which I acknowledge
> also has its issues).
>
> >
> > PCIe VDM, I agree with you, is only in the intel tree.
> >
> > >
> > > Finally, for the moment, the data provided by the NVMe Basic Management Command meets IBM's current requirements, and we're using drives that support the Basic Management Command.
> > >
> > > So the direction we chose is to use entity-manager and dbus-sensors for NVMe drive monitoring, and
> > > implement support for the Basic Management Command over SMBus in the NVMeSensor application. To get there, as far as I have determined, we should do the following:
> > >
> > > 10. Make optional the dependency of NVMeSensor on the forked libmctp
> > > 11. Add a compile-time flag to {en,dis}able the MCTP NVMe-MI backend
> > > 12. Add a compile-time flag to {en,dis}able the Basic Management Command backend
> > > 13. Patch intel-bmc/openbmc to configure NVMeSensor with MCTP NVMe-MI enabled
> > > 14. Change the default build configuration of NVMeSensor use the Basic Management Command
> > > 15. Enable out-of-tree builds of NVMeSensor by default
> > > 16. Add NVMeSensor unit tests
> > > 17. Enable CI for dbus-sensors where we can / is necessary
> > >
> > >
> >
> > You've listed quite a few things here, but I'm not following "why" we
> > need two NVMe implementations.  MCTP NVMe-MI is more feature rich, and
> > is much better supported than its MI basic brethren, and generally
> > provides more information.  Considering we have patches to do it, why
> > don't we just work on getting the required patch upstreamed?
>
> As above, it's not just the one required patch to the kernel, we also have to
> get the libmctp SMBus binding implementation upstream. Putting on my libmctp
> maintainer hat, I'd like to see some effort from Intel on both of those fronts.

Fair point.  Considering I wrote the original libmctp smbus patch, I'm
far from blameless in this situation, despite my change in email
address.

>
> >  On top
> > of that, the committee that writes the NVMe-MI spec keeps threatening
> > to deprecate it, so we might end up with code that's used for one
> > generation, then never again.  I really don't like the idea of having
> > a second NVMe subsystem just because one is _slightly_ easier to
> > write.
>
> I disagree with "_slightly_" if you account for the MCTP stack that goes with
> it, and the fact that the kernel and libmctp code is not yet upstream.

Let me rephrase, if we were to write this as a hwmon driver in
torvalds/linux, would this be more or less effort than trying to get
the mctp driver stack upstreamed along with the mslave device driver?
I had assumed more, but maybe not?

In re-reading my last email, it made me sound like I'm anti- MI-Basic.
To be clear, I really have no problem with adding MI-Basic support in
dbus-sensors.  I'd like to see it made an explicit and separate option
in the per-device configuration, as some devices don't support it, and
to clear the way for if we ever get the existing (or maybe a new)
nvme-sensor buildable by more than just my former teams fork.

>
> > If there's other good technical reasons why a user would
> > prefer MI-basic, and we can get those reasons written down, I'm happy
> > to hear them, but if the overall reason is "we don't want to upstream
> > code to the kernel" that doesn't seem like a good enough justification
> > to build out support for both;
>
> Implementing the SMBus MCTP binding around the i2c-slave-mqueue kernel
> interface is just a stop-gap solution in place of the (continually deferred)
> in-kernel socket-based MCTP implementation (no-one besides Jeremy has really
> put significant effort into collaborating with me on that concept). So if we're
> aiming for stop-gaps, why not implement support for the basic management
> command given the rest of the code is out-of-tree?

It should be noted, it's our intention to start working on an
upstreamable mctp socket kernel driver in the near (1-2 quarters)
future.  That doesn't really change this discussion, but figured it
was worth pointing out.  If we land MI-basic, and have a little bit of
luck, maybe it doesn't have to live for long.

>
> The trade-offs here seem to be the crux of the discussion.
>
> >  Especially considering the MI basic
> > functionality probably would best be done as a kernel driver.
>
> Right, that's an interesting idea.

This was just an idea;  A userspace-only nvme-mi setup would probably
be fine IMO as well, depending on what aligns with your schedule.

>
> >
> > It should also be noted if we end up doing this, I'd probably advocate
> > for it to be its own separate sensor, distinct from the MCTP one,
> > because from a BMC perspective, it has very little in common with the
> > existing NVMe sensor (different protocols, different IO loops, ect),
>
> The difference in IO loop does have an impact here - I hadn't completely
> appreciated how we'd make the synchronous call fit into the current code. The
> hack approach was to just not worry about it for the moment. However,
> implementing the basic command backend isn't actually that much of a change;
> it's a bit of a reorganisation of the NVMeContext class and a small encoder /
> decoder for the SMBus commands.

If we can fit it cleanly into the application that's there, great;  If
it makes more sense as another app, fine too.  I haven't looked at it
in depth.

>
> > and would probably be a good candidate for adding the hwmon.  Also, it
> > would need to be distinctly selectable from the entity-manager json,
> > as some drives don't possess support for NVMe-MI basic, and we need to
> > keep track of which protocol to use in which case.
>
> If both backends were built in, there's never a reason to use the basic
> management command. If only one of the backends is built in then you don't need
> to keep track unless the interface presented on DBus was different depending on
> the backend, but I was hoping we could avoid that.

See above.  I'd really prefer if this were a per-device setting, not a
per-build setting.  nvme-mi support is spotty at best, and buggy
nvme-mi mctp implementations are also common.  I'd rather we
explicitly select one that works well per-device type, rather making
it a build time option.

>
> Thanks for the response!
>
> Andrew


More information about the openbmc mailing list