Thoughts on performance profiling and tools for OpenBMC

Andrew Jeffery andrew at aj.id.au
Mon Apr 12 13:12:01 AEST 2021



On Thu, 25 Mar 2021, at 10:58, Andrew Geissler wrote:
> 
> 
> > On Mar 22, 2021, at 5:05 PM, Sui Chen <suichen at google.com> wrote:
> > 
> <snip>
> > 
> > [ Proposed Design ]
> > 
> > 1. Continue the previous effort [7] on a sensor-reading performance
> > benchmark for the BMC. This will naturally lead to investigation into
> > the lower levels such as I2C and async processing.
> > 
> > 2. Try the community’s ideas on performance optimization in benchmarks
> > and measure the performance difference. If an optimization yields a
> > performance gain, attempt to land it in OpenBMC code.
> > 
> > 3. Distill ideas and observations into performance tools. For example,
> > enhance or expand the existing DBus visualizer tool [8].
> > 
> > 4. Repeat the process in other areas of BMC performance, such as web
> > request processing.
> 
> I had to work around a lot of performance issues in our first
> AST2500-based systems. A lot of the issues were early in the boot of
> the BMC, when systemd was starting all of the different services in
> parallel and things like mapper were introspecting every new D-Bus
> object showing up on the bus.
> 
> Moving from Python to C++ applications helped a lot. Changing
> application nice levels was not helpful: there are too many D-Bus
> calls between applications, so if one had a higher priority (like
> mapper) it would time out waiting on lower-priority applications.
> 
> AndrewJ and I tried to track some of the issues and tools out on
> this wiki:
> https://github.com/openbmc/openbmc/wiki/Performance-Profiling-in-OpenBMC

Some rambling thoughts:

The wiki page makes a start on this, but I suspect what could be helpful
is a list of tools for capturing and inspecting behaviour at different
levels of the stack. Cribbing from the wiki page a bit:

# Application- and kernel-level behaviour
* `strace`
* `perf probe` / `perf record -e ...` (tracepoints, kprobes, uprobes)
* `perf record`: hot-spot analysis
* Flamegraphs[1]: more hot-spot analysis (example invocations below)

[1] http://www.brendangregg.com/flamegraphs.html
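
A minimal sketch of the hot-spot workflow, assuming perf is built into
the image and the FlameGraph scripts [1] are available on the analysis
host (the 99Hz sample rate and 30s window are illustrative):

    # sample all CPUs with call-graphs, then inspect the profile
    perf record -F 99 -a -g -- sleep 30
    perf report

    # post-process the same data into a flamegraph off-BMC
    perf script > out.perf
    ./stackcollapse-perf.pl out.perf | ./flamegraph.pl > profile.svg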

# Scheduler behaviour
* `perf sched record`
* `perf timechart`
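
Something like the following is a reasonable starting point for
scheduler questions (the 10s windows are arbitrary):

    # record scheduler events, then summarise per-task wakeup/run latency
    perf sched record -- sleep 10
    perf sched latency

    # record and render a timechart of CPU and process activity
    perf timechart record -- sleep 10
    perf timechart    # writes output.svg by default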

# Service behaviour
* `systemd-analyze`
* `systemd-bootchart`
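
systemd can attribute boot time itself, assuming the analysis tools are
included in the image:

    systemd-analyze blame            # startup cost per service
    systemd-analyze critical-chain   # slowest dependency chain
    systemd-analyze plot > boot.svg  # timeline of unit activation
    systemd-bootchart                # samples the system, writes an SVG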

# D-Bus behaviour
* `busctl capture`
* `wireshark`
* `dbus-pcap`
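
A sketch of the capture flow, assuming dbus-pcap (from openbmc-tools)
is on the analysis host:

    # capture bus traffic as pcap on the BMC, then decode it off-BMC
    busctl capture > dbus.pcap
    dbus-pcap dbus.pcap    # or open dbus.pcap in wireshark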

`perf timechart` is a great place to start when you fail to meet timing
requirements in a complex system.

I'm not sure much of this could be integrated into e.g. the visualiser
tool, but I think making OpenBMC easy to instrument is a step in the
right direction.

Andrew

