Feedback on Current OpenBMC and Proposing Some Improvements ---- "Improvements" Ideas

Fri Jun 26 11:08:03 AEST 2020

Hi Ed,

I used Python because I wanted to create a demo fast, and get back to
the issues that I need to work on. The thing I'm doing here is kinda
out of my direct work area based on the organization.

Yes, there are some restrictions in my current demo, and I'm afraid
that I may not have the bandwidth to cover it further alone. My point
is that, sometimes hardwares is designed with some unexpected
complexity on topology (EEPROM behind MUX for example). Having the
ability to aid the topology discovery with code, and having the
topology info available to other functionalities can help a lot. JSON
config files are having a hard time bearing these logics, and any
extra logic implemented in JSON config files requires some kind of
script parser in daemons processing them. Based on your replies, the
concept for functionally extensions that I was asking for should be
implemented as daemons either standalone or plugged onto dbus?

On "reading sensors within the BMC console", I'm actually using a
script to directly read from hwmon right now, because we are having
sensor number limit on IPMI and performance issues with IPMI and dbus.
We are still actively investigating these performance issues now to
unblock the project, but based on the current findings, I think it's
better to have this tool before the protocol layers.

On issues like uint8_t, yes, we've noted them down, but they are still
tech debts on our backlog, and dealing with the performance issue
described above remains as our priority right now.

Thank you!

- Alex Qiu

On Thu, Jun 25, 2020 at 7:44 AM Ed Tanous <ed at tanous.net> wrote:
>
>
> On Tue, Jun 23, 2020 at 6:31 PM Alex Qiu <xqiu at google.com> wrote:
> >
> > Hi Ed,
> >
> > -Internal email list
> > +A couple of folks who might be interested in this topic
> >
> > I don't know if you saw the updated reply in the main thread, but I
> > foresaw some possible communication gap, so I created a simple demo to
> > illustrate my ideas: https://github.com/alex310110/bmc-proto-20q2
> > Please note that I'm not trying to code a BMC with Python, but it's
> > just for the ease to set up a demo fast. Other replies inlined.
>
> I did see it, and it shows a lot of my problems with that approach.  Out of curiosity, why did you start with python, instead of something we could try on a BMC?  Even if it doesn't compile, it might be a starting point for someone else?
>
> I see a few anti-patterns there that I'd like to see you address, you've hardcoded lots of data that's not specific to the card.  At first glance in board_example_a.py
>
> 1. Line 22-23.  You've initialized 2 Muxes.  Both of these buses are present on your (guessing a little here) baseboard, and not the card itself.  This means that every single card will need to duplicate the initialization of these muxes.  So first step, you need to break apart your baseboard into a separate entity, so the "board" does not own the card.  Also, you haven't provided any mapping of a PCIe mux lane to a physical user-facing name "Slot 1, Slot 2, ect".  Entity-manager configs do both of these things.
> 2. You've only expressed the slot topology here.  CardExampleG, and CardExampleV need to know what bus they're on, what muxes they need to go through to get to that bus, and the organization of those things, as in your example, none of the busses have been created in the kernel, and some of the mux busses are shared.
> 3. You've hardcoded to only search for 2 different cards (card_g, and card_v), at 1 address (0x52).  While it would be great if systems in practice had that kind of consistency in addressing, PCIe add in cards have many different eeprom addresses.  So you'd have to update your loops to search for all possible.  Also, that loop scales great if you only support 2 cards.  What happens when OpenBMC supports 100 cards?  1000?  You've hardcoded the list of supported cards in the entity above it, which means every baseboard needs to explicitly add support for every possible card.  This stops scaling really fast.
> 4. You're looping over the PCIe slots as part of the board control flow.  What if slots are based on a riser plugged into said slot?
> 5. You've abstracted an eeprom to a simple device.  In practice, you need to parse the FRU data, which might be in several formats.  Sure, you could have a library function, but you still need a global structure to keep that, in case some other control flow needs it downstream.
> 6. You've hardcoded a mux address, and a physical channel again later on.
> 7. Line 71-72.  Both of those are blocking calls.  For devices with a large number of sensors, those blocking calls will cause performance bottlenecks. also, see my previous comments about non-cyclic timing of some sensors.
> 8. You're missing a lot of features that entity manager does today.  Fan control configs being the most important, which have a relation to how the chassis looks.  Can you add an example of a chassis with some fans and thermal configs in it?
>
> If you made all the changes I'm suggesting your code starts to look a LOT like entity manager, FRUDevice, and dbus-sensors combined into a single app.  The biggest difference is you've replaced config files and exposes records for library functions.  There's nothing inherently wrong with combining them like that, but we wanted to isolate the topology scanning logic from the config logic, so people would feel free to swap them out with their own.  In the case of some systems, there's a complete database of the hardware inventory in a proprietary format.  In the case of infrastructure managed systems, we wanted developers to have the ability to swap out the topology scanning logic for some fixed "Here are the list of the hardware devices that should be present" type daemons that support the various formats, without necessarily having to care about the implementations.  Said another way, it separates "How do I determine if this device is present" from "Here's how to interact with this device".  We could combine those again, but we lose out on the static case.  If nobody cares about the full config case, we could certainly consider it.
>  One other big thing I wanted to be able to support in the future with this was adding previously unknown devices at runtime, with zero need to compile code.  Imagine being able to support a temp sensor on a new card by simply uploading an entity manager config file to the webserver, having it rerun the detect, and suddenly that card is "supported" by that image.  When you mix the code in with the metadata or config, you lose that ability, as we can't easily upload unsigned code.  It's a tradeoff for sure, but being able to hand tweak a config at runtime can be invaluable for quick turnaround during debugging and platform bringup.
>
>
> >
> > On Sun, Jun 21, 2020 at 3:16 PM Ed Tanous <ed at tanous.net> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 2:29 PM Alex Qiu <xqiu at google.com> wrote:
> >
> > I didn't start with multi-threading too much in mind, but It's not
> > necessarily a single-threaded model. As you can see in the demo, each
> > entity instance has its own function of update_sensors(), and the
> > entities are collected in the main class, which may be implemented as
> > a higher-level inter-process communication API for entities to adapt
> > to instead of merely a single main function or thread. So this model
> > can potentially be threaded on an entity basis, or even fork each
> > entity into individual processes. The model can be further threaded
> > within each entity into service threads: separating sensor polling
> > loop and event handling loop for example. But I do wonder about the
> > performance overhead of making every function call into IPC.
> I'm not really following.  Could you give an example of calling 4 commands, in parallel, to a MCTP/IPMB device and posting them as they are received?  This is something that the existing sensor daemons do today.  Yeah, IPC is expensive, but moving away from dbus, and onto something else is a much bigger discussion.
>
>
> >
> > The base class for entities may have a default implementation that
> > doesn't hurt, for example, throwing an exception or returning an error
> > code to say that it doesn't have an EEPROM, so that inherited class
> > doesn't need to necessarily implement functions around EEPROM. Devices
> > are abstracted into the hwmon interface as the kernel does today, and
> > we need to config the names of each input attribute to make them
> > meaningful anyway.
> >
> > I do see your concerns, and I do believe this requires further
> > research into if this model can handle all the concerns or
> > requirements we have today.
>
> Looking forward to seeing it.
>
> >
> > >
> > > >
> > > > The existing JSON files used in the dynamic software stack can only represent data, but not any control flow. This led to difficulties where sometimes some code is preferred to have for aiding the discovery of hardware topology, condensing redundant configurations, etc. With a good framework for hardware topology, combining the entity abstraction described above, developers can easily find the best places to aid the topology discovery, implement hardware initialization logics, and optimize BMC tasks according to Linux behaviors.
> > >
> > > If you need code or control flow to aid in the discovery of hardware
> > > topology, write an application that exposes an interface similar to
> > > FruDevice, or the soon to be submitted peci-pcie.  These can be used
> > > in entity manager configs.  I'm not quite following what "redundant
> > > configurations" means in this context.  In my experience, most
> > > redundant configurations tend to be for things like power supplies, or
> > > drives, where a single device can fit in many different slots.  WIth
> > > that said, we already have an abstraction for that, so I'm not quite
> > > following.
> >
> > Please see this for a complicated discovery logic:
> > https://github.com/alex310110/bmc-proto-20q2/blob/master/downstream/board_example_a.py
> >
> > Based on your reply, I have a concern that, if we have a hardware
> > topology complicated enough, does that mean we should probably opt out
> > of FruDevice and use downstream daemon to replace it?
>
> FruDevice is poorly named these days (sorry James).  It should really be called I2cFruEepromLocator.  In theory, it can handle any I2C topology we were able to throw at it, including one that I tested that was 4 levels deep.  If you're trying to manage an automatically detected i2c eeprom/mux topology, that is the tool I would expect to use.  With that said, you're welcome to write others, if you need to handle other things on I2C, or the static config case from above.
> If you're managing a different source of data (like a host driven map, MCTP, or out of band PCIe registers) I would expect you'd likely want to write another daemon that's capable of posting that topology data to dbus, but I would expect you can still use entity manager to consume it, and apply the correct settings to sensors/busses/kernel/Fans.
>
> >
> > > >
> > > > Better open source and proprietary part management
> > > >
> > > > Construct "Improvements" like a proprietary software supporting plugins. The philosophy is that the architecture of "Improvements" should be solid enough that the community won't have to modify the upstream code much. The community can look at and reference the code upstream to develop their own code and configs according to their hardware, while the plugin-able part may be proprietary and can be kept downstream without conflicts. "Improvements" should have a reasonable plugin API to support common BMC functionality in the high level, and provide common low-level APIs to support the plugins by abstracting things like hwmon sysfs interface. This can be implemented using a plugin system or a flexible build system, as we are working on an open source project indeed. Whenever we find a potential conflict between upstream and downstream, let us work it out to see if it is appropriate to make it pluginable or configurable via config files.
> > >
> > > I must be misreading this, as I feel like openbmc already has
> > > "plugins" in the form of Dbus applications.  Many applications have
> > > been written that required no modification to upstream code.  Tha API
> > > you're looking for is reasonably well defined in phosphor dbus
> > > interfaces, and is intended to be reasonably stable, even if it's not
> > > guaranteed over time.  I'm also a little confused at what you're
> > > calling low-level APIs.  hwmon sysfs is a low level API.  Are you
> > > wanting to wrap it in yet another API that's OpenBMC specific?
> > > "can be kept downstream without conflicts" - In my experience, you're
> > > going to be hard pressed to find support for supporting closed source
> > > development in an open source project.  That's not to say individuals
> > > aren't out there, but they tend to keep their heads down :)
> >
> > Apologies for my wording; the low-level API may be probably called
> > lower level libraries offered by OpenBMC. See I2CHwmonDevice in
> > https://github.com/alex310110/bmc-proto-20q2/blob/master/i2c.py
> I2CDevices, i2CMuxes, HWMonDevices, and i2ceeproms exist in the kernel already, behind a well defined interface.  Your file feels a little bit like it's reinventing some things.  I'm not sure whether or not I'd be against inventing libopenbmc, but that's likely where those types of interfaces would need to go.
> It should also be noted, all of those devices are addable with only EM configuration file changes today.
>
>
> >
> > Although we make a lot of efforts to upstream software to the open
> > source community as much as possible, BMC is heavily involved with
> > hardware, and we're also restricted to hardware's restrictions. We are
> > having difficulties to upstream drivers or code containing
> > confidential hardware code names, or containing part numbers under NDA
> > with vendors. Personally I was also involved with a lengthy and
> > exhausting internal legal review to publicize a part number which is
> > under NDA with our vendor, involving email exchanges between attorneys
> > in Google and our vendor's support engineer. I hope this explains my
> > point. For today, these part numbers are required to pass onto dbus
> > from entity-manager in order for dbus-sensors to determine the correct
> > sensor daemon for them.
>
> Understood, and I've felt your pain before.  I'm not going to claim this is easily solved, but the best way IMO, is to create a downstream application for each hardware device you need to manage, and patch your entity manager configs to add the configuration data for those components to your boards (or keep the board configs totally private).  Any changes to the detection logic, or entity manager itself can be easily upstreamed.  The application boundary also means that there's a well defined dbus interface, and any licencing conflicts between GPL and proprietary code are resolved.
>
>
> >
> > >
> > > >
> > > > Flexibility for alternatives
> > > >
> > > > Although hwmon sysfs interface is a good starting point for getting sensor reads from devices, they have their own limitations. The interface does not abstract every register perfectly, especially when device registers are not designed to follow some common specs like PMBus, and it does not provide controls to the devices.
> > > >
> > > >
> > > > I propose a Device Abstraction Layer to wrap around devices. The underlying can completely map to hwmon sysfs, or allow user-space driver implementation if necessary, or even hybrid. This will easily provide an additional interface to bypass the driver and control the devices, while still maintaining the benefit to use an off-the-shelf Linux device driver.
> > >
> > > In this context, what are you calling a "device"?  I think everything
> > > you're looking for exists, although it sounds like it's not in the
> > > form you're wanting to see.  Dbus sensors already does a hwmon to Dbus
> > > sensor abstraction conversion, that in some cases maps 1:1, or in some
> > > cases is a "hybrid" as you call it.  Are you looking for something in
> > > the middle, so instead of going hwmon -> Dbus  and libmctp -> Dbus you
> > > would want hwmon -> DAL -> Dbus  and libmctp -> DAL -> Dbus?  There
> > > could be some advantages here, but I have a worry that it'll be
> > > difficult to come up with a reasonable "device" api.  Devices take a
> > > lot of forms, in band, out of band, all with varying requirements
> > > around threading, permissions, and eventing.  While it's possible to
> > > cover everything that's needed, I'd be worried we'd be able to cover a
> > > majority of them.
> >
> > Yep, that requires some research or others' experience; I'm mostly
> > familiar with I2C devices in my area of work.
> >
> > >
> > > >
> > > > Quite some existing code is heavily bound to or influenced by the IPMI protocol layer that we are having right now: We use “uint8_t” type for I2C bus number in entity-manager for example, while Linux kernel can extend the logical I2C bus number to more than 512 without any issues.
> > > Can you come up with a better example?  We've tried to be very careful
> > > to not have IPMI-specific things in the interfaces, and to make them
> > > as generic as possible.  In that case, uint8_t is used to represent
> > > the 7 bit addressing (plus read write bit) on the I2C bus itself, not
> > > the uint8_t in the IPMI spec.  The API you listed neglected to handle
> > > the possibility of 11 bit I2C addressing, as it isn't very common in
> > > practice, but the argument could certainly be made that the interface
> > > should be changed to a uint16_t, and I would expect the IPMI layer to
> > > simply filter addresses above 127 that it's not able to support.
> >
> > Please see getFruInfo() calls in FruDevice.cpp:
> > https://github.com/openbmc/entity-manager/blob/master/src/FruDevice.cpp#L1108
> >
> > The uint8_t bus of getFruInfo() restricted the number of logical I2C
> > buses that we could implement in the sysfs interface, and it was
> > unfortunately static_cast'ed to uint_8 which created a bug hard to
> > debug. I don't have much experience to find you more examples,
> > however... I believe some of these can be fixed within the current
> > architecture, nevertheless I'm still trying to emphasize this concept.
> OH, you mean you hit a uint8_t limit on busses!  I don't know of anyone that has crossed the 256 bus limit, so you've clearly found a bug/missing feature.  Now it's your time to shine.  You've found an issue, you know what the fix is, exactly where the code needs to go and you have the ability to test it.  Write a patch to fix it, test that it does what you want, write up a commit message explaining exactly what you detailed above, how you tested it, and submit it to gerrit with the maintainer as a reviewer.  The maintainer is very responsive, and you'll have fixed something hard to debug for the next person that runs into this.
>
>
> >
> > >
> > > > The current dynamic software stack emphasizes individual sensors, but the BMC handles many more tasks than just only sensors. The practicality of OpenBMC for hardware engineers is also hindered by the IPMI as described above in Issue Examples.
> > > Sensors were the first thing tackled, as those are the things that
> > > tend to be the most different platform to platform, and have the most
> > > peculiar settings.  We do also handle topology to some extent, as well
> > > as a lot of other commands that are not IPMI specific.  I agree, IPMI
> > > has its flaws, but OpenBMC also has pretty good support for Redfish,
> > > direct dbus, and upcoming MCTP if that's what you'd rather use as an
> > > outbound interface.
> >
> > On that, I'm also looking forward to the ability to read sensors
> > within the BMC console in a human-friendly way for hardware engineers,
> > so that we don't have to rely on the host or network to read them
> > during bring-up, or simply because we don't have RedFish ready yet,
> > and hardware engineers just want to see tons of sensor readings for
> > bring-up.
>
> I'm not following this as anything actionable.  OpenBMC has IPMItool, dbus tools, i2c-tools, the Redfish GUI, the rest-dbus GUI and the Webui to pick from for "human friendly way for hardware engineers".  Heck, if you're feeling really enterprising, you can install the HWmon devices in the bash console, and CAT out the values in another.  In this comment, are you wanting something else?  Surely one of those meets your prototyping needs?
>
>
> >
> > Sorry for any confusion. I think I'm trying to repeat myself by
> > emphasizing on interleaving protocol layer in this paragraph. Today's
> > OpenBMC does build with this in mind, but there are still some flaws
> > left to improve, the uint8_t bus variable described above for example.
> >
> See above.  Let's get that uint8_t thing fixed on master so we're not all talking about it here again in 6 months when the next poor person hits the same issue and spends a week debugging it.
> --
> -Ed