Thoughts on performance profiling and tools for OpenBMC

Sui Chen suichen at google.com
Tue Mar 23 09:05:00 AEDT 2021


Hello OpenBMC Mailing List,

This email discusses some thoughts and work in progress regarding BMC
performance. We are aware that performance has been brought up a few
times in the past, so this document covers and keeps track of some
recent work. The following is written in the design-doc format, but it
may still have some way to go before becoming a concrete "set of
benchmarks for OpenBMC". As such, any feedback is appreciated. Thanks
for reading!

[ Problem Description ]

Writing benchmarks and studying profiling results are useful not only
for learning basic APIs and constructs, but also for debugging
complicated interactions between the many moving parts of the system.

Developers working on devices with specs similar to those of BMCs,
such as smartphones from a few years back, could count on developer
tools for performance profiling support.

BMCs pack many interesting aspects together: kernel drivers, hardware
interfaces, multi-threading, modern programming-language features, and
open-source development, all under very tight hardware and software
constraints, with a build workflow that compiles code from scratch.
During debugging, many steps may be needed to recreate the conditions
under which performance problems arise. Having benchmarks in this
scenario makes the process easier.

As the BMC becomes more versatile and runs more workloads, performance
issues may become more pressing.

[ Background and References ]

1. BMC performance problems are encountered and discussed, and
benchmarks and tools may help address them. Related posts:
   - “ObjectMapper - quantity limitations?” [1]
   - “dbus-broker caused the system OOM issue” [2]
   - “Issue about (polling every second) makes Entity Manager get stuck” [3]
   - “Performance implication of Sensor Value PropertiesChanged Events” [4]

2. People have started to find solutions to existing and potential
problems. Examples include:
   - io_uring vs epoll [5]
   - shmapper [6]

3. BMC workloads have their own characteristics, notably the extensive
use of D-Bus and the numerous I/O buses, among many others. Existing
Linux benchmarks may not capture some of these characteristics, which
might justify spending effort on a BMC-specific set of benchmarks (see
the D-Bus latency sketch after this list).

4. There have been proposals for adding performance testing to the CI
[9]. Both a baseline and a way to measure performance are needed; this
document tries to partially address the measurement question.
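
As a starting point for item 3 above, below is a minimal sketch of a
D-Bus method-call latency micro-benchmark. It is illustrative only: it
assumes sdbusplus and libsystemd are available, and it pings the bus
daemon itself so that it does not depend on any particular OpenBMC
service being present.

    // Minimal D-Bus round-trip latency sketch using sdbusplus.
    // Build (where sdbusplus and libsystemd are installed), e.g.:
    //   g++ -std=c++20 dbus_latency.cpp -lsdbusplus -lsystemd
    #include <sdbusplus/bus.hpp>

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        auto bus = sdbusplus::bus::new_default();

        constexpr int iterations = 1000;
        std::vector<double> usec;
        usec.reserve(iterations);

        for (int i = 0; i < iterations; ++i)
        {
            // Ping the bus daemon itself so the sketch does not depend
            // on any particular OpenBMC service being present.
            auto m = bus.new_method_call("org.freedesktop.DBus",
                                         "/org/freedesktop/DBus",
                                         "org.freedesktop.DBus.Peer",
                                         "Ping");
            auto t0 = std::chrono::steady_clock::now();
            bus.call(m); // blocking round trip
            auto t1 = std::chrono::steady_clock::now();
            usec.push_back(
                std::chrono::duration<double, std::micro>(t1 - t0).count());
        }

        double sum = 0;
        for (double u : usec)
            sum += u;
        std::printf("avg D-Bus round trip: %.1f us over %d calls\n",
                    sum / iterations, iterations);
        return 0;
    }

On an OpenBMC image, the same loop could target a real service such as
the ObjectMapper, which would also allow comparing, for example,
dbus-daemon against dbus-broker.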

[ Requirements ]

The benchmarks and tools should report basic metrics such as latency
and throughput, and the profiling overhead should not distort the
measured results.
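
To make these requirements concrete, the following is one possible
shape for a measurement harness. This is a sketch only (the Report
struct and measure function are made-up names, not existing OpenBMC
code); it takes two clock reads per iteration and defers all sorting
and statistics to after the timed loop, which keeps the measurement
overhead from distorting the results:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <vector>

    struct Report
    {
        double p50_us;      // median latency
        double p99_us;      // tail latency
        double ops_per_sec; // throughput
    };

    Report measure(const std::function<void()>& op, int iterations)
    {
        using clock = std::chrono::steady_clock;
        std::vector<double> lat_us(iterations);

        auto start = clock::now();
        for (int i = 0; i < iterations; ++i)
        {
            auto t0 = clock::now();
            op(); // the operation under test
            lat_us[i] = std::chrono::duration<double, std::micro>(
                            clock::now() - t0)
                            .count();
        }
        double total_s =
            std::chrono::duration<double>(clock::now() - start).count();

        // Statistics are computed after the timed loop, off the hot path.
        std::sort(lat_us.begin(), lat_us.end());
        return {lat_us[iterations / 2], lat_us[iterations * 99 / 100],
                iterations / total_s};
    }

A benchmark would pass the operation under test as the callable, e.g.
measure([]{ /* one sensor read or D-Bus call */ }, 1000);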

The contents of the benchmarks should evolve quickly to keep
up-to-date with the rest of the BMC ecosystem, which also evolves
quickly. This may be comparable to unit tests that are aimed at
getting code coverage for incremental additions to the code base, or
to hardware manufacturers updating their drivers with performance
tuning parameters for newly released software.

Benchmarks and results should be easy to learn and use, help newcomers
learn the basics, and aid seasoned developers where needed.


[ Proposed Design ]

1. Continue the previous effort [7] on a sensor-reading performance
benchmark for the BMC. This will naturally lead to investigation of
lower levels such as I2C and asynchronous processing (see the sketch
after this list).

2. Try the community’s ideas on performance optimization in the
benchmarks and measure the performance difference. If an optimization
yields a gain, attempt to land it in the OpenBMC code base.

3. Distill ideas and observations into performance tools. For example,
enhance or expand the existing DBus visualizer tool [8].

4. Repeat the process in other areas of BMC performance, such as web
request processing.
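
For item 1, a first measurement can be as simple as timing repeated
reads of one hwmon sensor value through sysfs, the same path many
OpenBMC sensor daemons use. A minimal sketch follows; the hwmon path
is board-specific and purely illustrative:

    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <string>

    int main()
    {
        // Board-specific; pick a real sensor node on the target BMC.
        const std::string path = "/sys/class/hwmon/hwmon0/temp1_input";
        constexpr int iterations = 100;

        auto t0 = std::chrono::steady_clock::now();
        long value = 0;
        for (int i = 0; i < iterations; ++i)
        {
            // Re-open on every iteration: sensor daemons re-read
            // periodically, so the open/read/close cost is part of
            // what the benchmark should observe.
            std::ifstream f(path);
            if (!(f >> value))
            {
                std::fprintf(stderr, "failed to read %s\n", path.c_str());
                return 1;
            }
        }
        auto t1 = std::chrono::steady_clock::now();

        double us =
            std::chrono::duration<double, std::micro>(t1 - t0).count();
        std::printf("last value: %ld, avg read latency: %.1f us\n",
                    value, us / iterations);
        return 0;
    }

From there, the benchmark can descend to raw i2c-dev transfers or add
asynchronous variants to study the lower layers mentioned above.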

[ Alternatives Considered ]

Rather than benchmarking real hardware, it might be possible to
measure directly on a cycle-accurate, full-system timing simulator
(such as gem5). This approach suffers from relatively slow simulation
speed compared to running on real hardware, and device support may
affect the feasibility of certain experiments. As such, writing
benchmarks and running them on real hardware might be more feasible in
the short term.

[ References ]

[1] https://lists.ozlabs.org/pipermail/openbmc/2021-February/024978.html
[2] https://lists.ozlabs.org/pipermail/openbmc/2021-February/024895.html
[3] https://lists.ozlabs.org/pipermail/openbmc/2021-February/024914.html
[4] https://lists.ozlabs.org/pipermail/openbmc/2021-February/024889.html
[5] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-IO-uring-Tests
[6] https://lists.ozlabs.org/pipermail/openbmc/2021-February/024908.html
[7] https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387
[8] https://github.com/openbmc/webui-vue/issues/41
[9] https://github.com/ibm-openbmc/dev/issues/73

