support NVMe drive health monitoring

Andrew Jeffery andrew at aj.id.au
Mon Nov 16 15:02:42 AEDT 2020



On Fri, 13 Nov 2020, at 11:59, Andrew Jeffery wrote:
> > > > It should also be noted if we end up doing this, I'd probably advocate
> > > > for it to be its own separate sensor, distinct from the MCTP one,
> > > > because from a BMC perspective, it has very little in common with the
> > > > existing NVMe sensor (different protocols, different IO loops, ect),
> > >
> > > The difference in IO loop does have an impact here - I hadn't completely
> > > appreciated how we'd make the synchronous call fit into the current code. The
> > > hack approach was to just not worry about it for the moment. However,
> > > implementing the basic command backend isn't actually that much of a change;
> > > it's a bit of a reorganisation of the NVMeContext class and a small encoder /
> > > decoder for the SMBus commands.
> > 
> > If we can fit it cleanly into the application that's there, great;  If
> > it makes more sense as another app, fine too.  I haven't looked at it
> > in depth.
> > 
> > >
> > > > and would probably be a good candidate for adding the hwmon.  Also, it
> > > > would need to be distinctly selectable from the entity-manager json,
> > > > as some drives don't possess support for NVMe-MI basic, and we need to
> > > > keep track of which protocol to use in which case.
> > >
> > > If both backends were built in, there's never a reason to use the basic
> > > management command. If only one of the backends is built in then you don't need
> > > to keep track unless the interface presented on DBus was different depending on
> > > the backend, but I was hoping we could avoid that.
> > 
> > See above.  I'd really prefer if this were a per-device setting, not a
> > per-build setting.
> 
> Right, my sense in reading what you wrote was that you were pushing back on the 
> idea on the basis that we'd have to differentiate in the entity-manager config. 
> But it sounds like you're actually not fussed about that and even consider the 
> idea in a positive light given what you've written below:
> 
> >  nvme-mi support is spotty at best, and buggy
> > nvme-mi mctp implementations are also common.  I'd rather we
> > explicitly select one that works well per-device type, rather making
> > it a build time option.

So I'm considering how we can wrap up this discussion. I think the following 
points have been made:

1. We have entity-manager indicate in its NVMe-MI configuration which interface 
we'd like to use to fetch the MI data (MCTP vs basic). This is potentially 
useful to avoid bugs in the drive's MCTP support.

2. Perhaps we should have distinct applications to handle fetching MI data via 
MCTP vs basic if the differences in communication model are too difficult to 
reconcile.

3. That (maybe) we rip out the existing NVMeSensor implementation from 
dbus-sensors on the basis that the SMBus MCTP code stack is not in great shape.

We need to hash out what 1 would look like, but I don't think that discussion 
should immediately get in the way of addressing 2 and 3. If 3 is on the cards 
and we handwave over the appropriateness of the NVMeSensor app IO loop for the 
moment, then it's indistinguishable from having a build-time switch to disable 
the MCTP backend in the current implementation.

So I think we can progress along the lines of the patches that Jet's proposing 
(adding a flag to {en,dis}able MCTP support) until we consider the MCTP backend 
mature enough to enable by default?

Cheers,

Andrew


More information about the openbmc mailing list